
On information search, finding the best ways to view search results, and much more

Finding the best ways to view search results is the main topic of my candidate's dissertation. Today I want to share the intermediate results of this research, as well as the applications and SDKs that were used in the work.

I decided to write this article after watching a seminar from the “Information Search and Data Analysis” series on the topic “Semantic Analysis of Texts Using Wikipedia”. The speaker was Maxim Grinev, an associate professor, senior lecturer of the Department of System Programming, and head of the Institute of Scientific and Practical Information.

You can watch the talk, download it, or view the schedule of other talks.


Brief scientific conclusions from the seminar


Below I briefly describe the contents of the seminar and the main results obtained.

The talk covered new approaches and methods of semantic search, as well as principles for assessing semantic proximity based on data from the English-language Wikipedia.

The main principles used in the talk: a term (described by the corresponding article) may have several meanings, so the most relevant article must be singled out. A term (article) contains references to other terms (articles) both within the main text and in the “see also”, “links”, and similar blocks.

The semantic distance between articles A and B can be calculated by counting the number of common articles referenced by articles A and B:



Figure 1
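
Since the formula itself was shown in the image, here is a minimal sketch of the idea in code, assuming proximity is measured by the overlap of the sets of articles linked from A and B. The Dice-style normalization and the function name are my own illustration, not necessarily the exact formula from the talk:

```python
def semantic_proximity(links_a: set[str], links_b: set[str]) -> float:
    """Proximity of two articles, measured by the articles they both reference.

    links_a, links_b are the sets of article titles referenced from A and B.
    The Dice-style normalization of the common-link count is an illustrative
    choice; the talk may normalize the overlap differently.
    """
    if not links_a or not links_b:
        return 0.0
    common = links_a & links_b
    return 2 * len(common) / (len(links_a) + len(links_b))


# toy example: two articles that share one of their three references each
print(semantic_proximity({"Graph", "Cluster", "Rank"},
                         {"Graph", "Search", "Index"}))  # ~0.33
```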

Semantic proximity is a key element of the methods described below. I want to mention that the SimRank method [1], described last year, was declared unsuccessful.

From the author: in addition to semantic proximity, the shingle method or the Pearson chi-square test can be used to determine the distance between two web documents. I have also published an article on this topic, “Method of assessing the similarity of web pages” [2], which describes a generalized method for assessing the similarity of web pages based on several semantic features.
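
For readers unfamiliar with shingles, a minimal sketch of the idea: word 4-grams compared by Jaccard similarity. The shingle length is an arbitrary illustrative choice, not a value taken from [2]:

```python
def shingles(text: str, w: int = 4) -> set[tuple[str, ...]]:
    """Split text into overlapping word w-shingles."""
    words = text.lower().split()
    if not words:
        return set()
    return {tuple(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}


def shingle_similarity(doc_a: str, doc_b: str, w: int = 4) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    a, b = shingles(doc_a, w), shingles(doc_b, w)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```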

Next came the extraction of key terms for a given document and the construction of so-called communities, or semantic graphs [3]:



Figure 2

The essence of these graphs is that terms (articles) belonging to a single community fall under a certain general category; that is, the classical problem of text classification is solved. This category can be computed by finding a common “parent” category that includes the selected terms. Communities are found with a clustering method that does not require specifying the number or size of clusters; communities with a small rank are discarded.
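
As a rough illustration of this step, here is a sketch that detects communities in a weighted term graph and ranks them by total internal edge weight. The modularity-based detector and the rank definition are my own stand-ins, not the exact procedure from [3]:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def ranked_communities(term_graph: nx.Graph):
    """Detect communities in a weighted term graph and rank them.

    Edge weights are assumed to be pairwise semantic proximities.
    Modularity-based detection is used here only because it does not
    require the number of clusters in advance; [3] may use a different
    clustering procedure and a different rank definition.
    """
    communities = greedy_modularity_communities(term_graph, weight="weight")

    def rank(c):
        # rank = total weight of the edges inside the community
        sub = term_graph.subgraph(c)
        return sum(d.get("weight", 1.0) for _, _, d in sub.edges(data=True))

    # highest-ranked (densest) communities first; low-rank ones can be dropped
    return sorted(communities, key=rank, reverse=True)
```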

An example of a real semantic graph:



Figure 3

In the course of the research, it turned out that “good” communities have a much higher rank than the other, less relevant ones.



Figure 4

This approach makes it possible to filter out non-primary content (header, footer, “see also”), since the communities derived from the terms of these blocks have a small rank and are therefore eliminated during the calculations.

Comments and Remarks


While watching the talk I had a feeling of déjà vu, since I do many similar things in my own work, and some of my results are closely related to it.

First of all, I want to focus on some of the weaknesses of the described techniques and methods:


Below I will describe my own work in the context of the methods described above.

Advanced Clustering Method


The method is a refinement of the classic k-means algorithm, which is simple to implement but not very accurate. There are two reasons for this inaccuracy: the algorithm is sensitive to the choice of initial points and to the number of clusters. Thus, if unreliable parameters are fed to the input, the quality of the result will leave much to be desired.

In my paper “The clustering method based on clusters distributed according to the normal law” [4], I proposed checking the distribution law of the objects within each cluster. If a cluster is distributed according to the given law, we keep it; if not, we split it into two child clusters, and the verification continues until all clusters are distributed according to the law or the limit on the number of clusters is exceeded. This solves the problem of choosing the number of clusters. The problem of choosing the initial points is solved by taking the most separated points within the larger cluster as the initial centers. On test data the method showed 95% accuracy.
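
A minimal sketch of this refinement, assuming a D'Agostino normality test and a two-way k-means split. Both are my own illustrative choices; the paper [4] may use different statistical machinery:

```python
import numpy as np
from scipy.stats import normaltest
from sklearn.cluster import KMeans


def split_until_normal(X: np.ndarray, alpha: float = 0.05, max_clusters: int = 20):
    """Recursively split clusters that fail a normality test.

    A cluster is kept if its points look normally distributed; otherwise it
    is split into two child clusters whose initial centers are the two most
    separated points of the parent cluster.
    """
    clusters, result = [X], []
    while clusters:
        c = clusters.pop()
        too_small = len(c) < 8  # normaltest needs at least 8 samples
        over_budget = len(result) + len(clusters) + 1 >= max_clusters
        looks_normal = too_small or all(
            normaltest(c[:, j]).pvalue > alpha for j in range(c.shape[1])
        )
        if looks_normal or over_budget:
            result.append(c)
            continue
        # initial centers: the two most separated points of the parent cluster
        d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
        i, j = np.unravel_index(np.argmax(d), d.shape)
        km = KMeans(n_clusters=2, init=c[[i, j]], n_init=1).fit(c)
        clusters.extend(c[km.labels_ == k] for k in (0, 1))
    return result
```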

The importance of a site's information blocks


When we talk about a modern web page, we mean not only the main content we actually came for, but also a lot of additional information on the sides, at the bottom, and so on. These information blocks can serve different purposes: lists of links, statistics, related articles, advertising. Clearly, this content is much less important than the main content.

I developed a method for cleaning web pages of information noise [5], which I wrote about in general terms earlier [6]. Let me dwell on an important point: the procedure for determining the “important” blocks is based on a fuzzy clustering method; in simple terms, we look for the blocks with the highest numerical scores, those that differ sharply from the rest. In fact, the same approach was used to define the “good” communities Maxim Grinev spoke about (see Figure 4).
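
A simplified sketch of this selection step, assuming the block scores are already computed and the cut is made at the largest drop in the sorted score sequence; this is a stand-in for the actual selection rule used in [5]:

```python
def important_blocks(scores: dict[str, float]) -> list[str]:
    """Keep the blocks whose scores stand out sharply from the rest.

    `scores` maps a block id to its numerical estimate (in [5] these come
    from fuzzy clustering; here they are simply given). Everything above
    the largest gap in the sorted score sequence is kept.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [block for block, _ in ranked]
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1  # keep everything above the largest gap
    return [block for block, _ in ranked[:cut]]


# toy example: "content" and "article" stand out, boilerplate blocks do not
print(important_blocks({"content": 0.92, "article": 0.88,
                        "sidebar": 0.31, "footer": 0.12, "ads": 0.07}))
```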

The prototype is available for download on the CodePlex site:



Let's take another close look at Figure 3, a graph of relationships between terms. In fact, this is nothing but a graph of relationships between the web pages that correspond to specific terms.

We developed a similar system for assessing the relationships between sites, but in our case search engines are used as the data source (for us it makes no fundamental difference which search engine the results are taken from).

A real example of how the system works on the query “Microsoft” can be seen below:



Figure 5 (in this case, a shallow link analysis is selected)

If you look closely, you can also see some “communities”, which indicate membership in different categories. Accordingly, by extracting keywords (or other properties of each web page), you can cluster the search results at runtime. Note that this works for the entire web, not just for Wikipedia.
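
To make this concrete, here is a deliberately naive sketch of runtime grouping of results by shared keywords. The overlap threshold and the greedy grouping rule are illustrative assumptions, not our system's actual algorithm:

```python
def group_results(results: dict[str, set[str]], min_overlap: int = 2):
    """Greedy grouping of search results by shared keywords.

    `results` maps a URL to its keyword set. A result joins the first
    group that shares at least `min_overlap` keywords with it; otherwise
    it starts a new group.
    """
    groups: list[dict] = []
    for url, keywords in results.items():
        for g in groups:
            if len(g["keywords"] & keywords) >= min_overlap:
                g["urls"].append(url)
                g["keywords"] |= keywords
                break
        else:
            groups.append({"urls": [url], "keywords": set(keywords)})
    return groups
```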

Finding the best ways to view search results


Now I come to the most interesting part, since everything described above is not, by itself, of great value.

Once we have the relationship graph and information about each web page (Google PageRank, the distances between web pages, the “weight” of each page), we can find the best way to view the search results on this graph. In other words, we get not a linear list of sites ranked by some algorithm (weight), but a set of recommendations on the order in which to view the web search results [7].

To achieve this, we use a modified ant colony algorithm that simulates user behavior, namely random transitions while surfing. Each path is evaluated by a certain formula, and in the end we get the optimal path (the amount of information, the amount of duplicated information, and several other parameters are taken into account).
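
A heavily simplified sketch of such an ant-style search over the result graph. The scoring formula, the pheromone update rule, and all constants are illustrative assumptions rather than the actual parameters of our algorithm:

```python
import random


def ant_viewing_path(pages, info, overlap, n_ants=50, n_iters=30,
                     path_len=5, evaporation=0.3, seed=0):
    """Toy ant-colony search for a good order of viewing search results.

    pages   : list of page ids (the graph is assumed fully connected here)
    info    : dict page -> amount of information on the page
    overlap : dict (page_a, page_b) -> amount of duplicated information
    The path score rewards total information and penalizes duplication.
    """
    rnd = random.Random(seed)
    pheromone = {(a, b): 1.0 for a in pages for b in pages if a != b}

    def score(path):
        gain = sum(info[p] for p in path)
        dup = sum(overlap.get((path[i], path[i + 1]), 0.0)
                  for i in range(len(path) - 1))
        return gain - dup

    best, best_score = None, float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            # each ant builds a path by pheromone-weighted random transitions
            path = [rnd.choice(pages)]
            while len(path) < min(path_len, len(pages)):
                candidates = [p for p in pages if p not in path]
                weights = [pheromone[(path[-1], p)] for p in candidates]
                path.append(rnd.choices(candidates, weights=weights)[0])
            s = score(path)
            if s > best_score:
                best, best_score = path, s
            # deposit pheromone proportional to the path score
            for a, b in zip(path, path[1:]):
                pheromone[(a, b)] += max(s, 0.0)
        # evaporation step
        for k in pheromone:
            pheromone[k] *= (1.0 - evaporation)
    return best, best_score
```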

In addition, the user can select:


Conclusions


Thus, the methods and algorithms considered make it possible to extract knowledge not only from Wikipedia but from the entire web. The ideas and methods presented at the seminar and those obtained in our work largely coincide, and in some respects ours even go further. Among the shortcomings I can name the impossibility of testing the methods on the large volumes of data that Yandex has at its disposal for research, as well as the need to spend effort on formalities rather than on the essence of the problems.

I hope this article will help assess the state of affairs in the field of information retrieval.

P.S. I cannot fail to mention the Data Extracting SDK I developed, with which almost all of the applications used in this research were written.

P.P.S. If something is unclear, or you would like to get acquainted with some of the methods (ideas) in more detail, write in the comments and I will try to answer.


[1] Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008.
[2] Method of assessing the similarity of web pages.
[3] Maria Grineva, Maxim Grinev, Dmitry Lizorkin. Extracting Key Terms From Noisy and Multitheme Documents. WWW 2009: 18th International World Wide Web Conference.
[4] Krakovetsky O. Yu. The clustering method based on clusters distributed according to the normal law // International Science and Technology Journal "Information Technologies and Computer Engineering", №1 (11), 2008, pp. 56-60.
[5] Method of clearing web pages of information noise.
[6] V. M. Dubovoy, O. Yu. Krakovetsky, O. V. Glon. Factor analysis of the significance of information blocks of websites // News of Vinnytsia Polytechnic Institute, 2008, №6, pp. 103-107.
[7] Volodymyr Dubovoy, Oleksandr Krakovetsky, Olga Glon. The method of obtaining optimal paths for viewing web search results based on heuristic algorithms.

Source: https://habr.com/ru/post/76301/

