
Comparison of technological approaches to solving data extraction problems

The aim of this article is a comparative analysis of the main approaches to semantic text analysis: their differences and effectiveness at the level of concepts, without going into the nuances, combinations of options, and tricks that can improve the expected result.



Today there is a huge amount of material describing particular techniques for semantic text analysis: latent semantic analysis, SVM analysis, transfer-convolution, and much more. Writing yet another review and comparison of specific algorithms would be a waste of time.



Instead, over several articles I would like to discuss the basic ideas and problems underlying semantic analysis from the point of view of their practical application and, if one may put it that way, from a basic philosophical and ontological standpoint. To what extent can generative grammars be used for text analysis? Should one accumulate spelling variants and corpora of all kinds, or develop rule-based analysis algorithms?



Within this reasoning I will deliberately try to avoid established terms and fixed expressions, for, as W. Quine said, terms are merely names within ontologies and have no practical significance for solving problems of logic or for understanding anything in particular [1]. Therefore, with his permission, we will rely on Russell's definite descriptions, or, more simply, give full descriptions at the expense of existing well-established terms.





If we leave aside specific tasks such as sentiment (emotional coloring) analysis, phonetic analysis, etc., then from the standpoint of text analysis the following main stages can be identified:



1. Syntactic

Analysis of a linear sequence of words in order to build a dependency tree. The goal is to analyze the structure of the sentence and the relationships between its components. The analysis is based on grammars of various kinds (dependency grammars for Slavic languages and German, phrase-structure (constituency) grammars for Romance languages, generative grammars, etc.). A small sketch of this stage is given after the list of stages below.



2. Semantic

Analysis of how the meaning of a word or phrase depends on the overall context: resolving problems of polysemy, synonymy, etc. The basis is corpora of various kinds.



3. Semiotic

Analysis of the meaning of the text that takes into account allegory and "translation errors" caused by differing cultural associations, fixed expressions accepted in the narrator's environment, and concepts. What to base it on is still an open question; perhaps maps of associative fields, or maps resembling political ones, with temporal and territorial boundaries of cultures.
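As a minimal illustration of the syntactic stage from item 1, here is a sketch of building a dependency tree with the spaCy library; spaCy, its small English model, and the sample sentence are assumptions made purely for the example, not something the approaches discussed here depend on.

```python
# A minimal sketch of the syntactic stage: building a dependency tree.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The water utility is located at 10 Zemlyanoy Val street in Moscow.")

# Each token points to its syntactic head with a labeled dependency relation.
for token in doc:
    print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")
```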



If we consider the possible basic ideas, the technological approaches in general, then I see two diametrically opposed ones:



1. Technologies that accumulate experience from known examples (machine learning) and attempt to use it to analyze new situations. These are also called statistics-based algorithms; 90% of publications deal with precisely this technology. In other words, statistical methods.



2. Technologies that develop the analytical capabilities of the machine through algorithms that build logical connections without prior "training" on examples. In other words, algorithms based on rules or grammars.



To the first type one should attribute, in simplified form, the technology of "teaching" the system by accumulating spelling variants and superpositions of the analyzed entities. Variations on the same theme are the various frequency-based algorithms, such as latent semantic analysis, etc.
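As an illustration of the frequency-based branch, here is a rough sketch of latent semantic analysis using scikit-learn; the library and the toy documents are my assumptions, and the only point is that a TF-IDF term-document matrix reduced by truncated SVD places documents with shared vocabulary close together.

```python
# A rough sketch of latent semantic analysis (LSA), assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "moscow address zemlyanoy val street",
    "address moscow street house number",
    "emotional coloring of short social network posts",
]

# Term-document matrix weighted by TF-IDF ...
tfidf = TfidfVectorizer().fit_transform(docs)
# ... reduced to a low-dimensional "semantic" space by truncated SVD.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# The two address-like documents land closer to each other than to the third.
print(cosine_similarity(lsa))
```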



The second type includes such technologies as SVM analysis, transfer-convolution, and grammar construction.
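For the grammar-construction branch, a toy example can be built with NLTK's context-free grammar tools; the grammar and vocabulary below are invented purely for illustration and are far simpler than anything a real system would use.

```python
# A toy illustration of the grammar-based branch, assuming NLTK is installed.
import nltk

# A deliberately tiny context-free grammar; real systems use far richer rules.
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | N
VP -> V NP | V PP
PP -> P NP
Det -> 'the' | 'an'
N  -> 'utility' | 'address' | 'city'
V  -> 'has'
P  -> 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the utility has an address".split()):
    tree.pretty_print()
```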



Here, in essence, the ideas of Plato and Aristotle collide in all their glory. Before answering the question of which technology is better, one should ask what we want to get at the output and how we want to achieve it. Will we only recognize in the analyzed text information that corresponds to our previous experience, or do we allow for information lying outside that experience? And in the latter case, will we build hypotheses and refute them?



In addition, the problems being solved should be kept separate. Do we want to understand the "meaning" of what was written in general, or is it enough to find what we already know and mark up the text according to available experience, that is, to extract information?



As an illustration, consider the analysis of the phrase: "Mosvodokanal (the Moscow water utility) is located at the address: Moscow, ul. Zemlyanoy Val."



For text translation problems the value of semantic analysis is probably enormous, but it is not sufficient, because in addition one must resolve differences in associative series, fixed expressions, emotional nuances, etc. For example, most of the fundamental research devoted to semantic analysis does not take into account the possible "illiteracy" of the writer. This is quite natural, since most of that basic research was created no later than the 1960s; it was more speculative, concerned with thinking as such rather than with text recognition tasks. Setting aside "serious" scientific works, it is worth reading Umberto Eco's "To Say Almost the Same Thing: Experiments in Translation", where the influence of semiotic approaches on questions of translation is examined in popular form.



Is the semantic approach sufficient for solving the problem of information extraction, or is the problem wider? In essence, should we rely on semantic analysis alone, or should we abstract further and move to a more general, semiotic level?



Analysis of current trends is complicated by the fact that truly breakthrough technologies are often a commercial secret, and by the huge volume of material that essentially consists of reprints of one another; the Internet will tolerate anything. Dissertation research does not shine with diversity either: it is more about confirming the applicant's scientific degree than about really developing something new. There are, of course, quite interesting publications. For example, interesting as a review, although with debatable conclusions, is the work of I.V. Smirnov and A.O. Shelmanov, "Semantic-syntactic analysis of natural languages" [2].



Let us turn to the essence of the article and, to begin with, define the basic layer of goals and problems.



Analysis objectives:



  1. Text translation
  2. Text search
  3. User hints (suggestions)
  4. Data extraction.


Problems:



  1. Migration flows.

    A strong mixing of semantic and semiotic fields, accompanied by a large number of errors, i.e. violations of the syntax (grammar) and semantics of the text.



  2. Differences in the phonemic systems of different languages.

    Typos cannot be predicted, and therefore it is impossible to create a "complete" database of spellings.



  3. Gadgetization

    Today everyone has smartphones and tablets. Thanks to their elaborate systems of hints and autocorrection, a new class of errors arises: words that fall out of context.



  4. Polysemy of concepts.

    Within Russia this problem is voiced, for example, by the "State Services" (Gosuslugi) portal: different departments name essentially the same services differently, and the names are presented in a heavily "bureaucratized", formal form, or are simply very long. An ordinary person cannot understand them.


At the global level, there is the prevailing influence of the English language and the emergence of its simplified version, Mid-Atlantic English.



This is not a complete list, but it is sufficient for the purposes of this article.



Before giving a brief comparison of technological approaches, I would like to make a few fundamental comments.



First, the comparison is purely applied in nature and has a narrow focus unrelated to translation tasks; the analysis is performed for data extraction and search tasks. One often hears the hypothesis that technologies for recognizing visual images and texts can easily be combined and must essentially converge on a common mechanism. Perhaps so, but to me this idea resembles the search for a unified field theory in physics. Perhaps it will be found, but for now, within this study, we limit ourselves to tasks of working with textual data.



Second, the limited size of the article does not allow for deep analysis, so the material is presented in thesis form, without a detailed examination of situations.



Third, comparing specific techniques, that is, the advantages and disadvantages of neural networks, genetic algorithms, JSM methods, etc., is not relevant here: they are nothing more than means of achieving a result into which any logic can be "loaded". Instead, I would like to compare the principles themselves and the possibilities of the various technological approaches.



Fourth, all algorithms without exception are based on our previous experience and are the result of it. Alas, there is no knowledge given from above, including innate instincts, since these are the experience of previous generations. Therefore, to say that some algorithms rely on previous experience while others do not is an exaggeration. The question is how we use this experience and in what constructions we wrap it.



Thus, the purpose of the article is to attempt, as a first approximation, to analyze the possibilities and limitations of the basic logics themselves.



So, there are two main technologies: statistical and rule-based. We will not consider the combined option due to redundancy.



Statistical methods



The bulk of these algorithms rely on pre-tagged datasets (corpora) enriched with spelling variants: abbreviations, typical errors, etc. I have only just started collecting statistics, so their representativeness is not great. Nevertheless, let me highlight the following characteristic "generic features":



1. Most solutions use full-text search internally.



2. For acceleration, data hashing is widely used.



3. The number of spelling variants of the same entity ranges from 1 to 100. As an example, one can cite solutions in the field of address data cleansing, where one of the most frequently used services states that its "training set" consists of 50 million variants against a reference base of 1.2 million entries.



4. Analysis is performed by directly comparing substrings against the reference for an exact match (see the sketch after this list).



5. A separate procedure for verifying the results is required before the final decision.
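A minimal sketch of the variant-dictionary idea behind features 2-4, with made-up spellings: every known variant is stored and hashed for exact lookup, which makes the search fast but leaves unseen variants unresolved.

```python
# A minimal sketch of the variant-dictionary idea behind features 2-4.
# The variants below are invented for illustration.
CANONICAL = {
    "moskva": "Moscow",
    "moscow": "Moscow",
    "msk": "Moscow",       # common abbreviation
    "moscaw": "Moscow",    # typical typo stored explicitly
}

def resolve(token: str) -> str | None:
    """Exact-match lookup of a normalized spelling; hashing makes it O(1)."""
    return CANONICAL.get(token.strip().lower())

print(resolve("MSK"))     # -> Moscow
print(resolve("Moskou"))  # -> None: an unseen variant, the base must grow
```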



The advantages of the method are:



  1. Relative ease of implementation.
  2. High speed of searching among the variants.


The disadvantages include:



  1. Avalanche-like growth of the base size due to the need to store the spellings of individual entities.
  2. The difficulty of controlling consistency, which increases the likelihood of ambiguous (polysemous) variants.
  3. The impossibility, or severe limitation, of analyzing partial matches and morphology.
  4. The high cost of initially creating the algorithms, since a base of spelling variants has to be accumulated first. This shows up, for example, in the difficulty of adding new countries when parsing addresses: a separate database of spellings has to be created for each country.
  5. The impossibility of applying heuristic approaches to analyze situations beyond the known variants.


Rule-based algorithms



The bulk of these algorithms is based on the concepts of frames and syntaxemes and, with the help of artificial predicate languages, on various semantically annotated corpora.



The following can be considered generic features:



  1. The presence of corpora annotated in one way or another, or of reference dictionaries: for example, the "Lexicographer" [3] of VINITI, the Russian National Corpus [4], KLADR/FIAS, etc.
  2. The presence of rules combined into a grammar. The grammar can be implemented as linked patterns, artificial predicate languages, etc. (see the sketch after this list).
  3. Analysis is performed by sequential comparison of words; permutations and partial matches of words are allowed if the grammar provides for them.
  4. No separate verification procedure is required to accept the final result.
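As a sketch of "rules combined into a grammar" (feature 2), the fragment below chains a few hand-written patterns for address components; the patterns and field names are invented, and the point is only that permutations and partial matches are handled by the rules themselves rather than by stored spellings.

```python
# A sketch of rules combined into a tiny "grammar" for address components.
# The patterns and field names are invented for illustration only.
import re

RULES = {
    "city":   re.compile(r"\b(?:g\.\s*)?(moskva|moscow)\b", re.I),
    "street": re.compile(r"\bul\.?\s+([a-z\- ]+?)(?=,|\s+d\.|$)", re.I),
    "house":  re.compile(r"\bd\.?\s*(\d+\w?)\b", re.I),
}

def parse_address(text: str) -> dict:
    """Apply each rule independently, so component order does not matter."""
    out = {}
    for name, rule in RULES.items():
        match = rule.search(text)
        if match:
            out[name] = match.group(1).strip()
    return out

# Components may be permuted or partially missing; the rules still fire.
print(parse_address("ul. Zemlyanoy Val, d. 10, g. Moskva"))
print(parse_address("Moskva, ul. Tverskaya"))
```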


The advantages are:



  1. Higher accuracy
  2. Good portability when working with different corpora and knowledge domains.
  3. The ability to use heuristic approaches to analyze situations beyond the knowledge packed into the corpora.
  4. The ability to analyze and make decisions when the data is heavily "polluted" by errors of various kinds and redundant content.


The disadvantages include:



  1. The difficulty of implementing grammars due to the lack of ready-made tools.
  2. Lower speed.
  3. The difficulty of controlling the consistency of the rules.
  4. The difficulty of building pre-annotated and logically linked corpora and knowledge bases.


Conclusions



Despite the seemingly obvious advantages of the rule-based approach, both approaches have a right to exist. The question lies in their areas of application and in the economic feasibility of using them.



It seems obvious that the approach based on statistical methods can be recommended for tasks with a small array of analyzed entities and little data pollution [5]. Examples include organizing a search over the product items of a small store, searching for and analyzing hashtags in social networks, evaluating the emotional coloring of texts, and express analysis of documents to determine their type for further cataloging.



At the same time, in problems involving large arrays of reference data, and when working with Slavic languages, the advantage lies with the rule-based approach. An example is the problem of address parsing. Test results and analysis of existing solutions show that statistics-based solutions give a stable search accuracy in the range of 60-70% on text with contamination in the range of 10-15%, and accuracy rises to 80-85% when contamination falls below 10%.



The above figures are easy to verify by assembling a test stand: some kind of full-text index, for example Elasticsearch [6], loaded with KLADR/FIAS.
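A rough sketch of such a stand, assuming a local Elasticsearch node, the official Python client (8.x), and made-up index and field names with a couple of FIAS-like records:

```python
# A rough sketch of the test stand: a full-text index over address records.
# Assumes a local Elasticsearch node and the official Python client (8.x):
#   pip install elasticsearch
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A couple of FIAS-like records; the index and field names are made up.
records = [
    "g. Moskva, ul. Zemlyanoy Val, d. 10",
    "g. Moskva, ul. Tverskaya, d. 1",
]
for i, addr in enumerate(records):
    es.index(index="addresses", id=str(i), document={"full_address": addr})
es.indices.refresh(index="addresses")

# A "polluted" query: an abbreviation plus a typo in the street name.
resp = es.search(index="addresses",
                 query={"match": {"full_address": "Moskva Zemlanoy Val"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["full_address"])
```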



This article is essentially introductory. In the future, I will try to elaborate on each of the issues in more detail.



Notes

[1] Willard Van Orman Quine, "Philosophy of Logic".

[2] This work was supported by the Russian Foundation for Basic Research (project No. 12-07-33068) and the Ministry of Education and Science of Russia under state contract No. 07.514.11.4134 dated June 08, 2012

[3] The Lexicographer project was originally connected with the idea, put forward by S. Krylov in 1990, of creating a bibliographic database on lexical semantics: a draft of a Russian dictionary was proposed in which each word, or each sense of a word, would be matched with its bibliography. This idea interested a group of linguists and gradually transformed into the idea of creating a database on lexical semantics that could serve as a lexicographer's working tool.

At the initial stage, G. I. Kustova, E. V. Paducheva, E. V. Rakhilina, R. I. Rozina, S. Yu. Semenova, M. V. Filipenko, N. M. Yakubova, and T. Y. Yanko took part in the creation of the Lexicographer.

[4] The project involves experts from the V.V. Vinogradov Russian Language Institute of the Russian Academy of Sciences, the Institute of Linguistics of the RAS, the Institute for Information Transmission Problems of the RAS, the All-Russian Institute of Scientific and Technical Information of the RAS (VINITI), and the Institute for Linguistic Studies of the RAS in St. Petersburg (together with St. Petersburg State University), as well as Kazan (Volga Region) Federal University, Voronezh State University, and Saratov State University. Website: http://ruscorpora.ru

[5] By pollution we mean both redundant words and mistakes.

[6] https://www.elastic.co




Source: https://habr.com/ru/post/315994/


