The Lucene full-text search library lets you organize search over text documents. There are also tools, such as Open Babel, for finding "similar" chemical structures. Sometimes the two kinds of search need to be combined in a single system, for example one that can answer requests such as: find a substance whose text description contains the words "amino acid" and whose structure is similar to indole (we would expect to find the amino acid tryptophan). This article describes a solution to this problem built on the Lucene full-text engine.
As you know, full-text search is based on an inverted index. The text of each document is split into individual words, and for each word the index stores a list of the documents in which it occurs. Besides the document list, the index usually also stores the positions of the word within each document.
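To make this structure concrete, here is a minimal in-memory sketch of a positional inverted index in plain Java. It is not how Lucene stores its index; the class name InvertedIndexSketch and the rule of splitting on non-alphanumeric characters are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index: term -> (document id -> positions of the term). */
final class InvertedIndexSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Splits the text on non-alphanumeric characters and records each word's position. */
    void add(int docId, String text) {
        int position = 0;
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first element if the text starts with a delimiter
            }
            postings.computeIfAbsent(word, w -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** Returns document id -> positions for the term, or an empty map if the term is unknown. */
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyMap());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Tryptophan is an amino acid");
        index.add(2, "Acetic acid is a carboxylic acid");
        System.out.println(index.lookup("acid")); // prints {1=[4], 2=[1, 5]}
    }
}
```

The word "acid" is found in both documents, and the stored positions are what make phrase queries such as "amino acid" possible.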
Searching a database for similar molecules can be based on "molecular fingerprints". A molecular fingerprint is often a bit string in which each bit corresponds to the presence of a certain property or structural fragment. The Tanimoto coefficient (a special case of the Jaccard coefficient) serves as the similarity measure.
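For bit-string fingerprints the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. A minimal sketch using java.util.BitSet follows; the class name TanimotoSketch and the convention of returning 1.0 for two empty fingerprints are choices made here for illustration.

```java
import java.util.BitSet;

final class TanimotoSketch {
    /** Tanimoto coefficient: (bits set in both) / (bits set in either), a value in [0, 1]. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        int union = either.cardinality();
        // Convention chosen for this sketch: two empty fingerprints count as identical.
        return union == 0 ? 1.0 : (double) common.cardinality() / union;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(8);
        a.set(0);
        a.set(2);
        a.set(5);
        BitSet b = new BitSet(8);
        b.set(2);
        b.set(5);
        b.set(7);
        // 2 bits in common, 4 bits set in at least one fingerprint
        System.out.println(tanimoto(a, b)); // prints 0.5
    }
}
```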
As a simple example of a molecular fingerprint, consider the following. Suppose we want a fingerprint of length n. The structure is cut into linear chains of at most k atoms. A hash function maps each chain to a number from 0 to n-1, and that number is the index of the bit that is set to 1 in the fingerprint.
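The hashing step can be sketched in a few lines of plain Java. The chains are assumed to be already available as strings (the four used in main are written by hand), String.hashCode() stands in for whatever hash function a real fingerprinter uses, and the class name HashedFingerprintSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

final class HashedFingerprintSketch {
    /** Sets one bit per chain; the bit index is the chain's hash reduced to the range 0..n-1. */
    static BitSet fingerprint(List<String> chains, int n) {
        BitSet bits = new BitSet(n);
        for (String chain : chains) {
            // floorMod keeps the index non-negative even when hashCode() is negative
            bits.set(Math.floorMod(chain.hashCode(), n));
        }
        return bits;
    }

    public static void main(String[] args) {
        List<String> chains = Arrays.asList("C", "O", "CC", "CO");
        System.out.println(fingerprint(chains, 16)); // prints {0, 3, 12, 15}
    }
}
```

Different chains can land on the same bit, so a fingerprint of this kind can show that two structures differ but can only suggest that they are similar.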
For such manipulations with chemical structures you can use the Java library Chemistry Development Kit (hereinafter CDK). CDK is distributed under the LGPL license; it contains Java classes for representing structures and for calculating several types of molecular fingerprints, and it supports many chemical file formats. For example, cutting a molecule into fragments can be implemented like this:
    // The structure is represented by the CDK interface IAtomContainer.
    IAtomContainer structure;
    ...
    for (IAtom startAtom : structure.atoms()) {
        ...
    }
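The same cutting step can be sketched without CDK on a toy molecular graph. This is an illustration under stated assumptions, not the CDK algorithm: atoms are plain element symbols, bonds are an adjacency list, bond orders and hydrogens are ignored, and the class name ChainEnumerationSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ChainEnumerationSketch {
    /**
     * Returns every linear chain of at most maxLength atoms.
     * atoms[i] is the element symbol of atom i; neighbours.get(i) lists the atoms bonded to atom i.
     */
    static Set<String> chains(String[] atoms, List<List<Integer>> neighbours, int maxLength) {
        Set<String> result = new TreeSet<>();
        boolean[] visited = new boolean[atoms.length];
        for (int start = 0; start < atoms.length; start++) {
            extend(start, "", atoms, neighbours, visited, maxLength, result);
        }
        return result;
    }

    /** Depth-first walk that never revisits an atom already on the current chain. */
    private static void extend(int atom, String prefix, String[] atoms, List<List<Integer>> neighbours,
                               boolean[] visited, int remaining, Set<String> result) {
        String chain = prefix + atoms[atom];
        result.add(chain);
        if (remaining == 1) {
            return; // the chain has reached the maximum length
        }
        visited[atom] = true;
        for (int next : neighbours.get(atom)) {
            if (!visited[next]) {
                extend(next, chain, atoms, neighbours, visited, remaining - 1, result);
            }
        }
        visited[atom] = false;
    }

    public static void main(String[] args) {
        // Heavy-atom skeleton of acetic acid, CC(O)=O: atom 1 is bonded to atoms 0, 2 and 3.
        String[] atoms = {"C", "C", "O", "O"};
        List<List<Integer>> neighbours = Arrays.asList(
                Arrays.asList(1), Arrays.asList(0, 2, 3), Arrays.asList(1), Arrays.asList(1));
        System.out.println(chains(atoms, neighbours, 3)); // prints [C, CC, CCO, CO, O, OC, OCC, OCO]
    }
}
```

Each chain shows up in both reading directions (CCO and OCC); a production implementation would normally pick one canonical direction and also encode bond types.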
The molecular fingerprint described above is computed by the CDK class org.openscience.cdk.fingerprint.Fingerprinter.
To use Lucene for structure search, we recast fingerprint-based search as a kind of "word search". As in the fingerprint construction described above, a structure is cut into short linear chains, which play the role of words in a text search. If the search query is itself a structure, it is cut into chains in the same way, and the results are the structures that contain the same chains of atoms. Relevance ranking will tend to put first those structures that, on the one hand, share many chains with the query and, on the other hand, have those matches only slightly "diluted" by non-matching chains. It is therefore reasonable to expect the identical structure to come first (if it is among the indexed ones, of course), followed by similar structures that differ by one atom or group, then by increasingly different structures, and so on.
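The effect of "many matches, little dilution" can be imitated in a few lines. The score below (shared chains divided by the square root of the number of chains in the indexed structure) is a crude stand-in invented for this sketch, not Lucene's ranking formula; the class name FragmentSearchSketch is hypothetical, and the chain lists in main are written out by hand for chains of up to three atoms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FragmentSearchSketch {
    private final Map<String, Set<String>> chainsByName = new LinkedHashMap<>();

    /** Indexes a structure as the set of its atom chains. */
    void add(String name, String... chains) {
        chainsByName.put(name, new HashSet<>(Arrays.asList(chains)));
    }

    /** Shared chains reward matches; dividing by sqrt(size) penalizes "diluted" structures. */
    static double score(Set<String> query, Set<String> document) {
        if (document.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(query);
        shared.retainAll(document);
        return shared.size() / Math.sqrt(document.size());
    }

    /** Returns the names of all structures sharing at least one chain, best score first. */
    List<String> search(String... queryChains) {
        Set<String> query = new HashSet<>(Arrays.asList(queryChains));
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : chainsByName.entrySet()) {
            if (score(query, entry.getValue()) > 0) {
                hits.add(entry.getKey());
            }
        }
        hits.sort(Comparator.comparingDouble(
                (String name) -> score(query, chainsByName.get(name))).reversed());
        return hits;
    }

    public static void main(String[] args) {
        FragmentSearchSketch index = new FragmentSearchSketch();
        index.add("acetic acid", "C", "CC", "CCO", "CO", "O", "OC", "OCC", "OCO");
        index.add("methylamine", "C", "CN", "N", "NC");
        index.add("ethanol", "C", "CC", "CCO", "CO", "O", "OC", "OCC");
        // Query with the chains of ethanol itself.
        System.out.println(index.search("C", "CC", "CCO", "CO", "O", "OC", "OCC"));
        // prints [ethanol, acetic acid, methylamine]
    }
}
```

The exact match comes first, acetic acid (all query chains present, plus one extra chain) second, and methylamine, which shares only the single-carbon chain, last.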
It is worth noting that Lucene's default ranking formula (vector space model with TF-IDF) differs from the Tanimoto similarity measure. You could override Lucene's Similarity class to implement the Tanimoto formula, but I will not dwell on this, especially since the default formula should give a "qualitatively correct" result: structures that share a large proportion of fragments with the query come first.
This approach to indexing and searching molecules can be implemented by writing a special "chemical tokenizer" for Lucene. A tokenizer is the Lucene component that breaks text into separate tokens (most often, tokens are words). The main logic of this tokenizer was given above. The difference is that a Lucene tokenizer does not produce all the tokens at once in a loop. Instead, you implement the Tokenizer.incrementToken() method, which returns the textual representations of the atom chains one at a time.
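The pull-style contract can be imitated without Lucene on the classpath. The class below is a hypothetical stand-in, not a Lucene Tokenizer: in Lucene itself the current token is exposed through attributes such as CharTermAttribute rather than a getter, and incrementToken() may throw IOException. The chains are assumed to have been computed already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Lucene-free imitation of the contract: each call advances to the next token. */
final class ChainTokenStreamSketch {
    private final Iterator<String> chains;
    private String term; // plays the role that CharTermAttribute plays in Lucene

    ChainTokenStreamSketch(Iterable<String> chains) {
        this.chains = chains.iterator();
    }

    /** Returns true and exposes the next chain via term(), or false once the chains are exhausted. */
    boolean incrementToken() {
        if (!chains.hasNext()) {
            term = null;
            return false;
        }
        term = chains.next();
        return true;
    }

    /** The token produced by the most recent successful incrementToken() call. */
    String term() {
        return term;
    }

    public static void main(String[] args) {
        ChainTokenStreamSketch stream = new ChainTokenStreamSketch(Arrays.asList("C", "CC", "CO"));
        List<String> seen = new ArrayList<>();
        while (stream.incrementToken()) { // the consumer pulls tokens one by one
            seen.add(stream.term());
        }
        System.out.println(seen); // prints [C, CC, CO]
    }
}
```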
There are several formats for representing a molecule as a text string. For now we will support only one of them, the SMILES format. CDK provides a SMILES parser (the org.openscience.cdk.smiles.SmilesParser class). It is easy to use:
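    SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
    String smiles = "OCC(O)C(O)C(O)C(O)CO";
    // parseSmiles throws InvalidSmilesException if the string cannot be parsed
    IAtomContainer structure = sp.parseSmiles(smiles);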
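To show what such a parser has to produce, here is a toy reader for a very small SMILES subset: single-letter organic atoms, branches in parentheses, and bond symbols whose order is simply dropped. It is in no way a replacement for the CDK parser (no rings, aromatic atoms, bracket atoms or two-letter elements), and the class name ToySmilesSketch is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ToySmilesSketch {
    final List<String> atoms = new ArrayList<>();
    final List<int[]> bonds = new ArrayList<>(); // pairs of atom indices; bond order is dropped

    static ToySmilesSketch parse(String smiles) {
        ToySmilesSketch mol = new ToySmilesSketch();
        Deque<Integer> branchPoints = new ArrayDeque<>();
        int previous = -1; // atom that the next atom will be bonded to; -1 before the first atom
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (c == '(') {
                branchPoints.push(previous); // remember where the branch started
            } else if (c == ')') {
                previous = branchPoints.pop(); // return to the branch point
            } else if (c == '-' || c == '=' || c == '#') {
                // bond symbol: this sketch keeps connectivity only
            } else if ("BCNOPSFI".indexOf(c) >= 0) {
                mol.atoms.add(String.valueOf(c));
                int current = mol.atoms.size() - 1;
                if (previous >= 0) {
                    mol.bonds.add(new int[] {previous, current});
                }
                previous = current;
            } else {
                throw new IllegalArgumentException("Unsupported SMILES character '" + c + "' at index " + i);
            }
        }
        return mol;
    }

    public static void main(String[] args) {
        ToySmilesSketch sorbitol = parse("OCC(O)C(O)C(O)C(O)CO");
        System.out.println(sorbitol.atoms.size() + " atoms, " + sorbitol.bonds.size() + " bonds");
        // prints 12 atoms, 11 bonds
    }
}
```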
The logic that converts the SMILES representation of a molecule into a structure, which is then cut into chain tokens, lives in the SmilesTokenizer class.
I will omit a number of minor details and only say that the source code is uploaded to GitHub. Let us go straight to an example of using these classes to search for structures. Suppose the documents have three fields: name is the name of the chemical compound, description is a text description, and smiles holds the structure of the molecule.
Indexing a document might look something like this (using the SmilesAnalyzer helper class, which simply creates a SmilesTokenizer):
    // Use SmilesAnalyzer for the structure field and StandardAnalyzer for all other fields.
    // SMILES_FIELD is presumably the constant holding the "smiles" field name used below.
    Map<String,Analyzer> analyzerPerField = new HashMap<String,Analyzer>();
    analyzerPerField.put(SMILES_FIELD, new SmilesAnalyzer());
    PerFieldAnalyzerWrapper analyzerWrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_40), analyzerPerField);

    // One document with the three fields: name, description and smiles.
    Document doc = new Document();
    doc.add(new TextField("name", "Acetic acid", Field.Store.YES));
    doc.add(new TextField("description", "Acetic acid is one of the simplest carboxylic acids. Liquid.", Field.Store.YES));
    doc.add(new TextField("smiles", "CC(O)=O", Field.Store.YES));

    Directory directory = null;
    IndexWriter indexWriter = null;
    try {
        directory = new RAMDirectory(); // in-memory index
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzerWrapper);
        indexWriter = new IndexWriter(directory, config);
        indexWriter.addDocument(doc);
    } ...
The search will look something like this:
    reader = IndexReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    // Query the smiles field; FREE_TEXT_FIELD is the default field for terms without a prefix.
    String querystr = "smiles:CCC";
    Query q = new QueryParser(Version.LUCENE_40, FREE_TEXT_FIELD, getAnalyzer()).parse(querystr);
    // Collect the five best-scoring documents.
    TopScoreDocCollector collector = TopScoreDocCollector.create(5, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;