Exact search, as implemented in databases, works very well when it comes to exact phrases. But what do you do when the documents contain "map of Kiev" in one grammatical form, while the query uses another? This is where language filters come into play. First, already at the lexical level it becomes hard to work with a monolithic block of text if you want to account for all possible permutations of words and the distances between them. Second, the deeper you dig into a language, the clearer it becomes that the semantic web sets an incredibly high bar for automatic analyzers and generators of all sorts of representations and models, to say nothing of writing RDF by hand.

Morphology is the study of how forms change; the term is used in several sciences (botany, for example), but here it is about the forms of words. So there are two options: either consider every word form when searching, or cut each word down to its root and search by the root alone. The latter method is called stemming; it stands out for its speed and simplicity and needs no dictionaries. It is used by Bitrix, MS SharePoint, and Sphinx. Problems arise with words whose root itself changes (run/ran, grow/growth, lion/lioness). I will not dwell on stemming; see how Russian-morphology stemming is implemented in PHP. I am more interested in dictionaries. The National Corpus of the Russian Language gives an idea of what characteristics any given word can have. So we gradually arrive at the realization that we need a modern morphological database of words (RMU, AOT), a prototype of a semantic network.
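To make the contrast concrete, here is a minimal sketch in Python; the suffix list and the toy lemma dictionary are invented purely for illustration:

```python
# Minimal sketch: suffix-stripping stemming vs. dictionary-based normalization.
# The suffix list and the toy dictionary are made up for illustration.

SUFFIXES = ("ing", "ness", "es", "ed", "s")  # hypothetical suffix list

def naive_stem(word: str) -> str:
    """Cut a known suffix off the word and return the remaining 'root'."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

# Dictionary approach: every known word form points to its lemma.
LEMMAS = {
    "map": "map", "maps": "map",
    "run": "run", "ran": "run", "running": "run",
    "lion": "lion", "lioness": "lion",
}

def lemmatize(word: str) -> str | None:
    """Return the lemma if the form is in the dictionary, otherwise None."""
    return LEMMAS.get(word.lower())

print(naive_stem("running"), lemmatize("running"))  # runn run
print(naive_stem("ran"), lemmatize("ran"))          # ran run  <- stemming misses the changed root
```

The second line of output shows why dictionaries are more interesting here: a suffix stripper cannot reduce a changed root like "ran" to "run", while a dictionary lookup can.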
Indexing and searching
The idea is to use a database (PostgreSQL) with a table of morphs (all possible word forms) and a table of associated tokens (roots and affixes). Indexing a document then consists of the following steps (a rough sketch follows the list):
- Splitting the document into words
- Normalization: each word is mapped to a morph, if one exists
- If no morph exists, the word is added to the dictionary manually later on; to support this, the frequency with which unknown words occur is recorded
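A rough sketch of that pipeline in Python, with sqlite3 standing in for PostgreSQL; the schema and the table/column names (token, morph, posting, unknown_word) are my own guesses at what is described above, not something the post specifies:

```python
import re
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
con.executescript("""
CREATE TABLE token   (id INTEGER PRIMARY KEY, lemma TEXT UNIQUE);           -- roots/affixes
CREATE TABLE morph   (form TEXT PRIMARY KEY, token_id INTEGER REFERENCES token(id));
CREATE TABLE posting (token_id INTEGER, doc_id INTEGER);                    -- token -> document
CREATE TABLE unknown_word (form TEXT PRIMARY KEY, freq INTEGER DEFAULT 0);  -- manual-add candidates
""")

# Seed a couple of dictionary entries so the example does something:
# "kiev"/"kieva" map to token 1, "map"/"maps" map to token 2.
con.execute("INSERT INTO token(lemma) VALUES ('kiev'), ('map')")
con.executemany("INSERT INTO morph VALUES (?, ?)",
                [("kiev", 1), ("kieva", 1), ("map", 2), ("maps", 2)])

def index_document(doc_id: int, text: str) -> None:
    # 1. Break the document into words.
    for word in re.findall(r"\w+", text.lower()):
        # 2. Normalization: look the form up in the morph table.
        row = con.execute("SELECT token_id FROM morph WHERE form = ?", (word,)).fetchone()
        if row:
            con.execute("INSERT INTO posting VALUES (?, ?)", (row[0], doc_id))
        else:
            # 3. No morph yet: count the form so it can be added to the dictionary by hand later.
            con.execute("INSERT INTO unknown_word(form, freq) VALUES (?, 1) "
                        "ON CONFLICT(form) DO UPDATE SET freq = freq + 1", (word,))

index_document(1, "a map of Kiev")
```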
Search works in a similar way: each query word is normalized if it is found among the tokens, and the list of documents is obtained through the "query-token-morph-document" chain of relations. To speed up the dictionary, the whole table can be loaded into RAM at once (I heard from Eugene about a lightweight database for this, hsqldb).
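Continuing the hypothetical schema from the indexing sketch above (same `con` and seeded tables), the search side might look like this; loading the morph table into a plain dict mimics keeping the dictionary in RAM:

```python
# Continues the indexing sketch above: reuses `con`, `re`, and the seeded tables.

def load_dictionary(con) -> dict[str, int]:
    """Pull the whole morph table into memory, as suggested for speed."""
    return dict(con.execute("SELECT form, token_id FROM morph"))

def search(con, form_to_token: dict[str, int], query: str) -> set[int]:
    doc_sets = []
    for word in re.findall(r"\w+", query.lower()):
        token_id = form_to_token.get(word)
        if token_id is None:
            continue  # unknown form: skip it (or record it, as during indexing)
        rows = con.execute("SELECT doc_id FROM posting WHERE token_id = ?", (token_id,))
        doc_sets.append({doc_id for (doc_id,) in rows})
    # Require every recognized query word to be present; drop this for an OR-style search.
    return set.intersection(*doc_sets) if doc_sets else set()

dictionary = load_dictionary(con)
print(search(con, dictionary, "maps of Kieva"))  # {1}: both forms normalize to indexed tokens
```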
Higher levels of language
How do we deal with relevance? Taking word distance and word order into account is a matter of the syntactic level. Syntactic indexing means splitting the text into sentences and creating links between words that occur in the same sentence. On top of that, the part of speech of each lexeme can be taken into account. In the database this looks like an ordinary table of links between lexemes, and at search time you can, for example, check that several query words occur within one sentence. Ideally, the query "Bonaparte's children" would return documents like "Walewski's father was Napoleon." But the most important task of the higher levels is homonymy analysis, i.e. resolving the ambiguity both of identically written roots (homonyms like the Russian «ключ», 'key' or 'spring', and «лук», 'onion' or 'bow') and of stress (pairs spelled the same but stressed differently). At the moment both Google and Yandex take inflected word forms into account, but you cannot ask them for one specific meaning.
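A sketch of what sentence-level co-occurrence indexing might add on top of the word index, again with invented structures and a deliberately crude sentence splitter:

```python
import re
from collections import defaultdict
from itertools import combinations

def split_sentences(text: str) -> list[str]:
    """Very crude sentence splitter, good enough for a sketch."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def normalize(word: str) -> str:
    """Stand-in for the dictionary normalization described above."""
    return word.lower()

# (lemma_a, lemma_b) -> set of (doc_id, sentence_no) where they co-occur
cooccurrence: dict[tuple[str, str], set[tuple[int, int]]] = defaultdict(set)

def index_sentences(doc_id: int, text: str) -> None:
    for n, sentence in enumerate(split_sentences(text)):
        lemmas = {normalize(w) for w in re.findall(r"\w+", sentence)}
        for a, b in combinations(sorted(lemmas), 2):
            cooccurrence[(a, b)].add((doc_id, n))

def same_sentence(word_a: str, word_b: str) -> set[tuple[int, int]]:
    key = tuple(sorted((normalize(word_a), normalize(word_b))))
    return cooccurrence.get(key, set())

index_sentences(1, "Walewski's father was Napoleon. He lived in Paris.")
print(same_sentence("Napoleon", "father"))  # {(1, 0)}: both words share sentence 0 of document 1
```

With dictionary-based lemmatization plugged into `normalize`, the same check would also fire when the query and the document use different inflected forms of the two words.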
See also: a pair of talking cats demonstrating quite clearly that language arises where communication is born. P.S. Unfortunately, I have not found an analogue of Wordnet in the Runet (only mentions of "Ariadne", based on Zaliznyak's dictionary). Has nobody thought about this?