Setting up document processing on FAST

One of the tasks in integrating a third-party search engine into a system is setting up the processing of source documents (roughly speaking, indexing). How complex this is depends on the functional requirements for the search system and on the capabilities of the search engine. Configuration may take no more than a couple of clicks in the search engine's admin panel, or it may require writing your own procedures, scripts, and so on. We are used to trusting the standard features of a system (especially when its code cannot be modified), but for our own scripts we would like to have tests, and the engine does not always provide a way to write them.

We needed to implement search on the MS FAST ESP 5.3 platform. This serious engine has impressive capabilities for customizing document processing, some of which we used in our project. Here we want to share our way of testing custom stages on this engine.

The documentation describes the process of creating stages quite well. We will not retell it and will confine ourselves to the information needed to understand what follows.
In FAST ESP terminology, the entire sequence of actions performed when indexing a single document is called a pipeline, and each individual action is called a stage. A stage runs in a specific context that it can interact with, and the document is one element of that context. For example, a stage can read and write attributes of the document being processed. Schematically, document processing looks as follows:

[Figure: diagram of the document processing pipeline]
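
To make the terminology concrete, here is a minimal sketch of the pipeline idea in plain Python. It is not the FAST ESP API: `run_pipeline`, the two stage functions, and the dict-based document are hypothetical names chosen only for illustration.

```python
# Illustrative only: a pipeline as an ordered list of stages.
# None of these names come from FAST ESP.


def lowercase_title(document):
    """Hypothetical stage: normalise the 'title' attribute."""
    document["title"] = document.get("title", "").lower()


def set_quality(document):
    """Hypothetical stage: derive 'quality' from 'hotornew'."""
    document["quality"] = 500 if document.get("hotornew") == "true" else 0


def run_pipeline(stages, document):
    """Apply every stage, in order, to the same document."""
    for stage in stages:
        stage(document)
    return document


doc = {"title": "New Arrivals", "hotornew": "true"}
result = run_pipeline([lowercase_title, set_quality], doc)
print(result)
```

Each stage sees the attributes left behind by the stages before it, which is why the order of stages in a pipeline matters.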
A stage consists of two files: an XML specification and an implementation (in FAST ESP 5.3, the implementation is written in Python 2.3).

Below is an example of a stage that writes 500 to the quality field if the hotornew document attribute is true, and 0 otherwise.

    <processors>
      <processor>
        <load module="processors.SetOnEqual" class="SetOnEqual"/>
        <desc>Set high rank to documents, which have HotOrNew = yes</desc>
        <config>
          <param name="Input" value="hotornew" type="str"/>
          <param name="Output" value="quality" type="str"/>
          <param name="InputFieldValue" value="true" type="str"/>
          <param name="OutputFieldValue" value="500" type="int"/>
        </config>
        <ops>
          <add/>
        </ops>
      </processor>
    </processors>

(The specification is not directly involved in the tests; it is given here for completeness.)
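
The `param` entries in the specification are the values the stage later reads with `GetParameter`. As a rough illustration of that mapping, the sketch below extracts them with the standard library. It does not describe how procserver itself loads the file; `read_params` and `CONVERTERS` are hypothetical helpers, and only the `str` and `int` types that appear in the example are handled.

```python
import xml.etree.ElementTree as ET

# The <load> and <config> parts of the specification shown above.
SPEC = """
<processors>
  <processor>
    <load module="processors.SetOnEqual" class="SetOnEqual"/>
    <config>
      <param name="Input" value="hotornew" type="str"/>
      <param name="Output" value="quality" type="str"/>
      <param name="InputFieldValue" value="true" type="str"/>
      <param name="OutputFieldValue" value="500" type="int"/>
    </config>
  </processor>
</processors>
"""

# Assumption: only the two types used in the example are supported.
CONVERTERS = {"str": str, "int": int}


def read_params(spec_xml):
    """Return {name: typed value} for every <param> in the spec."""
    root = ET.fromstring(spec_xml.strip())
    params = {}
    for param in root.iter("param"):
        convert = CONVERTERS[param.get("type")]
        params[param.get("name")] = convert(param.get("value"))
    return params


params = read_params(SPEC)
print(params)
```

A helper like this could let a test take its configuration from the same file that is deployed, so the two do not drift apart.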

    from docproc import Processor, DocumentException, ProcessorStatus

    class SetOnEqual(Processor.Processor):

        def ConfigurationChanged(self, attributes):
            self.input = self.GetParameter('Input')
            self.output = self.GetParameter('Output')
            self.inputfieldvalue = self.GetParameter('InputFieldValue')
            self.outputfieldvalue = self.GetParameter('OutputFieldValue')

        def Process(self, docid, document):
            testField = str(document.GetValue(self.input, None))
            if testField == str(self.inputfieldvalue):
                output = int(self.outputfieldvalue)
                document.Set(self.output, output)
            else:
                document.Set(self.output, 0)
            return ProcessorStatus.OK

To check how the new stage works in its native document processing context, you have to follow the sequence of actions given in the documentation:
1. Put the specification and implementation files in a specific directory;
2. Restart the document processing service, Document Processor (procserver; on startup it compiles the stage code);
3. Include the new stage in the pipeline;
4. Index a test document;
5. Look at the result of processing the document (you can write it to the log file, or you can “find” the new document through the standard frontend and view all of its attributes).

If we did not get the expected result (for example, we set hotornew = true in the document, but the value of the quality field did not change), we have to debug, which means:
- Looking for an error in the procserver log;
- Checking whether our stage did exactly what was asked of it by placing Spy stages before and after it (Spy dumps a document together with its attributes to a file on disk);
- Looking for an error in the stage code.

Once the error is fixed, you need to check again, i.e. repeat steps 1, 2, 4, and 5.
This is a chore. It is more convenient to debug the stage code “as is” by the usual means, for example in a unit test.

To do that, we “recreate” the context with primitive mock classes for the objects the stage works with:

    class Document(object):
        """ Mock  Document """

        def GetValue(self, name, default):
            return getattr(self, name, default)

        def Set(self, field, value):
            setattr(self, field, str(value))

    class Processor(object):
        """ Mock  Processor """

        def GetParameter(self, name):
            return getattr(self, name)

        def Set(self, field, value):
            setattr(self, field, str(value))

In reality, the document object provides more methods; we limited ourselves to the ones our stage uses.
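
The stage also imports `ProcessorStatus` and `DocumentException` from `docproc`, so stand-ins for those names have to be importable as well. The article does not show how the mocks are laid out on disk. One option is a local `docproc` package on the Python path; the sketch below achieves the same effect in memory through `sys.modules`. The stub `ProcessorStatus`, the stub `DocumentException`, and the value of `OK` are assumptions made for illustration, not the real FAST ESP definitions.

```python
import sys
import types

# Hypothetical stand-ins for the docproc names the stage imports.
# The value of OK is an arbitrary placeholder.


class ProcessorStatus(object):
    OK = "OK"


class DocumentException(Exception):
    pass


class Document(object):
    """Mock document: attributes are stored on the instance."""

    def GetValue(self, name, default):
        return getattr(self, name, default)

    def Set(self, field, value):
        setattr(self, field, str(value))


class Processor(object):
    """Mock base class for stages."""

    def GetParameter(self, name):
        return getattr(self, name)

    def Set(self, field, value):
        setattr(self, field, str(value))


# Build a fake 'docproc' package with a 'Processor' submodule.
processor_module = types.ModuleType("docproc.Processor")
processor_module.Processor = Processor
processor_module.Document = Document

docproc = types.ModuleType("docproc")
docproc.Processor = processor_module
docproc.ProcessorStatus = ProcessorStatus
docproc.DocumentException = DocumentException

sys.modules["docproc"] = docproc
sys.modules["docproc.Processor"] = processor_module

# Imports written the way the stage and the test write them
# now resolve against the stubs.
from docproc import Processor as proclib
from docproc import ProcessorStatus as imported_status

doc = proclib.Document()
doc.Set("hot", "true")
print(doc.GetValue("hot", None), imported_status.OK)
```

With stubs registered like this, the stage module can be imported unchanged, which keeps the code under test identical to the code that is deployed.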
Now you can write, debug, and test a stage without the search engine:

    import unittest
    import docproc.Processor as proclib
    from docproc import ProcessorStatus
    import SetOnEqual

    class testSetOnEqual(unittest.TestCase):

        def setUp(self):
            self.stage = SetOnEqual.SetOnEqual()
            self.stage.Set('Input', 'hot')
            self.stage.Set('Output', 'quality')
            self.assertEquals(self.stage.GetParameter('Input'), 'hot')
            self.assertEquals(self.stage.GetParameter('Output'), 'quality')

        def test_true(self):
            self.stage.Set('InputFieldValue', 'true')
            self.stage.Set('OutputFieldValue', '600')
            doc = proclib.Document()
            doc.Set('hot', 'true')
            self.stage.ConfigurationChanged('')
            status = self.stage.Process("docid", doc)
            self.assertEquals(status, ProcessorStatus.OK)
            self.assertEquals(doc.GetValue('quality', ""), '600')

        def test_false(self):
            self.stage.Set('InputFieldValue', 'true')
            self.stage.Set('OutputFieldValue', '600')
            doc = proclib.Document()
            doc.Set('hot', 'no')
            self.stage.ConfigurationChanged('')
            status = self.stage.Process("docid", doc)
            self.assertEquals(status, ProcessorStatus.OK)
            self.assertEquals(doc.GetValue('quality', ""), '0')

    def suite():
        suite = unittest.TestSuite()
        suite.addTest(unittest.makeSuite(testSetOnEqual))
        return suite

    if __name__ == "__main__":
        unittest.main()
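
One case the two tests above do not cover is a document that lacks the input attribute altogether. Because the stage compares `str(document.GetValue(self.input, None))`, a missing attribute is compared as the string "None". The sketch below pins that behaviour down; `set_on_equal` is a hypothetical stand-alone copy of the comparison in `Process`, written so that the example runs without the rest of the stage.

```python
import unittest


class Document(object):
    """Same mock document as in the article."""

    def GetValue(self, name, default):
        return getattr(self, name, default)

    def Set(self, field, value):
        setattr(self, field, str(value))


def set_on_equal(document, input_name, expected, output_name, output_value):
    """Stand-alone copy of the comparison done in SetOnEqual.Process."""
    test_field = str(document.GetValue(input_name, None))
    if test_field == str(expected):
        document.Set(output_name, int(output_value))
    else:
        document.Set(output_name, 0)


class MissingAttributeTest(unittest.TestCase):

    def test_missing_attribute_gets_zero(self):
        doc = Document()  # 'hot' is never set
        set_on_equal(doc, "hot", "true", "quality", "600")
        self.assertEqual(doc.GetValue("quality", ""), "0")

    def test_missing_attribute_matches_literal_none(self):
        # str(None) == "None", so a configured value of "None"
        # matches a document that lacks the attribute entirely.
        doc = Document()
        set_on_equal(doc, "hot", "None", "quality", "600")
        self.assertEqual(doc.GetValue("quality", ""), "600")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(MissingAttributeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Whether the second behaviour is desirable depends on the data; the point is that a unit test makes it visible without a round trip through procserver.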

The complete code is available in the archive.

What this approach gave us:
- It made our own lives easier (a great time saver that also lets us keep to the practice of unit testing);
- It made the tester's life easier.

Author:
Lia Shabakayeva
Lead Developer
Softline Development Department

Source: https://habr.com/ru/post/139044/
