
Latent semantic analysis: implementation

As mentioned in the previous article, latent semantic analysis (LSA) makes it possible to identify latent relationships between the phenomena or objects under study, which is important when modeling the processes of understanding and thinking.

Now I will write a little about implementing LSA.

A bit of history

LSA was patented in 1988 by a group of American research engineers led by S. Deerwester [US Patent 4,839,853].
In the field of information retrieval, this approach is called latent semantic indexing (LSI).

LSA was first used for automatic indexing of texts, identifying the semantic structure of a text, and obtaining pseudo-documents. The method was then applied quite successfully to representing knowledge bases and building cognitive models. In the USA it was patented for testing the knowledge of schoolchildren and students, as well as for checking the quality of teaching methods.
How LSA works

As its input, LSA takes a term-document matrix (terms are words, phrases or n-grams; documents are texts classified by some criterion or split arbitrarily, depending on the problem being solved) describing the data set used to train the system. The elements of this matrix usually contain weights that take into account the frequency of each term in each document, or probability measures (as in PLSA, probabilistic latent semantic analysis) based on an independent multinomial distribution.
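As a minimal sketch (Python with NumPy; the toy corpus, the whitespace tokenizer and the TF-IDF weighting below are my own illustrative assumptions, not the scheme used in the original programs), a term-document matrix can be built like this:

```python
import numpy as np
from collections import Counter

# Toy corpus: each "document" is just a short string (assumed here for illustration).
docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]

# Build the vocabulary and the raw term-document count matrix (terms x documents).
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for w, c in Counter(doc).items():
        A[index[w], j] = c

# One possible weighting scheme: TF-IDF instead of raw counts.
df = (A > 0).sum(axis=1)            # document frequency of each term
idf = np.log(len(docs) / df)
A_weighted = A * idf[:, np.newaxis]
```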

The most common variant of LSA is based on the singular value decomposition (SVD) of a real-valued matrix. With it, any matrix can be decomposed into a set of orthogonal matrices whose combination gives a fairly accurate approximation of the original matrix.

According to the singular value decomposition theorem, in the simplest case a matrix can be decomposed into the product of three matrices:

A = U S Vᵀ

where the matrices U and V are orthogonal and S is a diagonal matrix whose diagonal entries are called the singular values of the matrix A. The superscript T denotes matrix transposition.

The key property of such a decomposition is that if only the k largest singular values are kept in the matrix S (together with the corresponding columns of U and V), then the product of the resulting matrices,

Ă = Uₖ Sₖ Vₖᵀ,

is the best rank-k approximation of the original matrix A.
The basic idea of latent semantic analysis is as follows: the resulting matrix Ă, which contains only the first k linearly independent components of the original matrix A, reflects the structure of dependencies (in this case, associative ones) that are latently present in the original matrix. The structure of dependencies is determined by the weight functions of the terms in each document.
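A minimal sketch of this rank-k truncation with NumPy, reusing the hypothetical `A_weighted` matrix from the sketch above; the value of k is arbitrary here:

```python
import numpy as np

def rank_k_approximation(A, k):
    """Return U_k, s_k, Vt_k and the rank-k approximation A_k of the matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_k = U_k @ np.diag(s_k) @ Vt_k   # best rank-k approximation of A
    return U_k, s_k, Vt_k, A_k

# Example: keep only the first 2 latent factors of the toy matrix built above.
# U_k, s_k, Vt_k, A_k = rank_k_approximation(A_weighted, k=2)
```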

The choice of k depends on the task and is made empirically. It also depends on the number of source documents: if there are not many documents, say a hundred, then k can be taken as 5-10% of the total number of diagonal values; if there are hundreds of thousands of documents, then 0.1-2% is used. Keep in mind that if the chosen value of k is too large, the method loses its power and approaches the characteristics of standard vector methods. Too small a value of k does not allow differences between similar terms or documents to be captured: only one principal component remains, which "pulls the blanket over itself", i.e. absorbs all the weak and even unrelated terms.
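The choice of k remains empirical; one possible heuristic (my assumption, not a rule from the article) is to check what fraction of the squared singular-value mass the first k components retain:

```python
import numpy as np

def retained_mass(s, k):
    """Fraction of the total squared singular-value mass kept by the first k values."""
    return (s[:k] ** 2).sum() / (s ** 2).sum()

# s = np.linalg.svd(A_weighted, compute_uv=False)
# print(retained_mass(s, k=2))
```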
Figure 1. SVD decomposition of a matrix A of dimension (T × D) into a term matrix U of dimension (T × k), a document matrix V of dimension (k × D) and a diagonal matrix S of dimension (k × k), where k is the number of singular values of the diagonal matrix S.

The corpus for building a model should be large - preferably about three to five million word usages. The method also works on smaller collections, although somewhat worse.

Arbitrary splitting of the text into documents usually produces from a thousand to several tens of thousands of parts of approximately equal size. The term-document matrix is therefore rectangular and can be highly sparse. For example, with a volume of 5 million word forms, one obtains a matrix of 30-50 thousand documents by 200-300 thousand, and sometimes more, terms. In fact, low-frequency terms can be omitted, since this significantly reduces the dimension of the matrix (say, if the 5% of the total volume made up of low-frequency terms is not used, the dimension is roughly halved), which reduces computational resources and time.
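A small sketch of dropping low-frequency terms before the SVD; the `min_count` threshold and variable names are illustrative assumptions, reusing the raw count matrix `A` and `vocab` from the first sketch:

```python
import numpy as np

def drop_rare_terms(A_counts, vocab, min_count=2):
    """Remove terms whose total frequency across the corpus falls below min_count."""
    totals = A_counts.sum(axis=1)
    keep = totals >= min_count
    kept_vocab = [w for w, flag in zip(vocab, keep) if flag]
    return A_counts[keep, :], kept_vocab

# A_small, vocab_small = drop_rare_terms(A, vocab, min_count=2)
```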

The choice of how many singular values of the diagonal matrix to keep (the dimension k) for the reverse multiplication of the matrices is, as I wrote, made empirically. With a matrix of the dimensions given above, several hundred (100-300) principal components are kept. In practice the dependence of accuracy on the number of components is nonlinear: for example, if you start increasing their number the accuracy will fall, but at some value, say 10,000, it will rise again towards the optimal case.

Below is an example of how the main factors emerge and change as the number of singular values of the diagonal matrix is reduced.
Accuracy can be assessed on labeled material, for example by checking the result in advance against known answers to the questions.
For more information about this method, see, for example, review articles [see bibliography].
Those who want to try out LSA can download binaries for building models and for testing and using them.
The SVD decomposition of the matrix used in model construction is based on open-source code (Copyright Sergey Bochkanov, ALGLIB project), so the programs are distributed without restrictions.

Application


This method is also sometimes used to find the "nearest neighbor" - the terms closest in meaning that are associatively related to the original one. This property is used to search for related terms.
It should be clarified that proximity in meaning is a context-dependent quantity, so not every close term corresponds to an association (it may be a synonym, an antonym, or simply a word or phrase that frequently occurs together with the term being searched for).
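A sketch of such a "nearest neighbor" search: terms are compared by the cosine of their vectors in the reduced space. Scaling the rows of U_k by the singular values is one common convention, not something prescribed in the article:

```python
import numpy as np

def nearest_terms(term, vocab, U_k, s_k, top_n=5):
    """Return terms whose latent-space vectors are closest (by cosine) to the given term."""
    T = U_k * s_k                                  # term coordinates in the latent space
    v = T[vocab.index(term)]
    norms = np.linalg.norm(T, axis=1) * np.linalg.norm(v) + 1e-12
    sims = (T @ v) / norms
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != term][:top_n]

# print(nearest_terms("computer", vocab, U_k, s_k))
```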

Advantages and disadvantages of LSA

The advantage of the method is its remarkable ability to identify dependencies between words in cases where conventional statistical methods are powerless. LSA can also be applied both with training (with a preliminary thematic classification of the documents) and without training (arbitrary splitting of an arbitrary text), depending on the problem being solved.

I wrote about the main drawbacks in the previous article. To these, one can add a significant decrease in the computation speed with an increase in the amount of input data (in particular, during the SVD transformation).
As shown in [Deerwester et al.], the computation time is of the order O(N²k³), where N is the sum of the number of documents and terms, and k is the dimension of the factor space.

A small demo example of computing the proximity matrix of full-text documents with different numbers of singular values of the diagonal matrix in the SVD transformation



The figures show the emergence and change of the main factors as the number of singular values of the diagonal matrix is reduced from 100% to ~12%. The three-dimensional plots show a symmetric matrix obtained by computing the scalar product of the vectors of each reference document with each tested one. The reference set of vectors comes from a text pre-labeled into 30 documents; the tested sets are obtained by SVD analysis with progressively fewer singular values of the diagonal matrix. The number of documents (reference and tested) is plotted on the X and Y axes, and the volume of the lexicon on the Z axis.
The figures clearly show that when the number of singular diagonal elements is reduced by 20-30%, the factors are not yet clearly revealed, but correlations between similar documents appear (small peaks off the diagonal), which at first grow slightly and then, as the number of singular values is reduced further (by 70-80%), disappear. For automatic clustering such peaks are noise, so it is desirable to minimize them. If the goal is to obtain associative links within documents, an optimal ratio should be found between preserving the main lexicon and mixing in the associative one.
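A rough sketch of the kind of experiment described above: the document-document matrix of scalar products is recomputed for several values of k (the list of k values and the absence of extra normalisation are my illustrative choices):

```python
import numpy as np

def doc_similarity_matrix(A, k):
    """Symmetric matrix of scalar products between document vectors in the rank-k space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    D = (np.diag(s[:k]) @ Vt[:k, :]).T   # one row of latent coordinates per document
    return D @ D.T

# for k in (2, 3, 4):
#     sim = doc_similarity_matrix(A_weighted, k)
#     print(k, np.round(sim, 2))
```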

Source: https://habr.com/ru/post/240209/

