Today we will talk about how to automate text analysis using the Window-Facts method. Unfortunately, very little information about this method is available, yet it remains one of the key methods for processing information flows. More details about text analysis can be found, for example, here. In general terms, the task of the Window-Facts method comes down to finding indisputable facts in a text. But let us clarify what exactly is meant by a fact.
In this article, a fact is understood as a statement (a sentence) that mentions some subject or named object. Once we can extract such facts from many texts, we get a text stripped of "water" and containing only facts.
Facts and subjects in the text
Of course, interpreting the term “fact” this way means that in some texts a lot of information is missed. However, this problem did not show up in every text. At the same time, the information the analyst actually works with (using the finished analyzer software) was analyzed with fairly high accuracy.
Let us state the task a little more clearly: in the available text, find the words that denote persons or other important objects (for example, a place name or anything else significant in the human sense of the word). Then find all the sentences in which that person occurs (as we have said, we call such sentences “Facts”).
How do we tell a person's name from an ordinary word? I think the answer is very simple: by the capital letter. Trite and crude. Of course, such a generalization brings many problems, and unless they are solved the method may not work. The problems that arise for anyone trying to implement something like this, and how we deal with them, are discussed in a little more detail below.
Subject Search Problems
We have settled the main point to begin with: every word that starts with a capital letter is treated as a Subject or Object mentioned in the text. However, each language has conventions that must be taken into account right away. For example, there are characters after which the next word usually starts with a capital letter. In Russian these are the period, the question mark, the exclamation mark, and so on. As a result, at least every word at the beginning of a sentence falls outside the scope of our approach, because we cannot tell why it is capitalized. At first glance such a restriction can hardly improve the result. However, as practice has shown, it does no harm either.
To sum up the intermediate results: we have learned to identify the Subjects in the text by a capital letter. Since not every capitalized word is a Subject, we need a list of rules for handling exceptions (cases where a word starts with a capital letter but is not considered a Subject).
Next we face the problem of sorting facts by Subject, since the same Subject may appear in different facts in a modified form (different declensions, cases, and so on). To determine whether two words refer to the same Subject, we compare them for “similarity” to each other. The “similarity” threshold at which two words are considered identical was established experimentally.
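The comparison described above can be sketched as follows. The article does not say which similarity measure or threshold is used, so everything here is an assumption: the sketch uses `difflib.SequenceMatcher` from the Python standard library, and the names `same_subject` and `group_variants` as well as the threshold value 0.75 are hypothetical.

```python
from difflib import SequenceMatcher

# Assumed threshold; the article only says it was found experimentally.
SIMILARITY_THRESHOLD = 0.75


def same_subject(a, b, threshold=SIMILARITY_THRESHOLD):
    """Return True if two words are similar enough to be one Subject."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


def group_variants(words, threshold=SIMILARITY_THRESHOLD):
    """Greedily group words: each word joins the first group whose
    first member it resembles, otherwise it starts a new group."""
    groups = []
    for word in words:
        for group in groups:
            if same_subject(group[0], word, threshold):
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


if __name__ == "__main__":
    print(group_variants(["Nicole", "Nicoles", "London"]))
```

A greedy grouping like this depends on word order and on the chosen threshold, so it is only a starting point for experiments.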
This way of searching for persons in a text makes it possible to perform the task automatically and equally effectively for almost any language, without the substantial cost of a linguistic text analyzer. That is, the algorithm shows equally good results in English, Ukrainian, and Russian.
Recall that we agreed to identify persons by a capital letter. We also agreed that there is a set A listing all the characters after which a capital letter is expected (so that we do not confuse a person with an ordinary word). This means that if a word begins with a capital letter and the last non-whitespace character before it is not in set A, the word is considered a person, and the sentence containing it is a fact about that person.
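The rule with set A can be illustrated with a minimal sketch. It is not the MadWin code, which is not published. The contents of `SET_A`, the word pattern, the naive sentence splitting, and the function names `find_subjects` and `extract_facts` are all assumptions made for illustration.

```python
import re

# Set A: characters after which a capital letter is expected anyway.
# The article names only these three; the full set is an assumption.
SET_A = {".", "?", "!"}

# A word is a run of letters, optionally joined by hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def find_subjects(text):
    """Return capitalized words whose preceding non-whitespace
    character exists and is not in SET_A."""
    subjects = []
    for match in WORD.finditer(text):
        word = match.group()
        if not word[0].isupper():
            continue
        i = match.start() - 1
        while i >= 0 and text[i].isspace():
            i -= 1
        # Start of text or a character from set A: capital is expected.
        if i < 0 or text[i] in SET_A:
            continue
        subjects.append(word)
    return subjects


def extract_facts(text):
    """Map each subject to the sentences (facts) that mention it."""
    facts = {}
    for sentence in re.split(r"(?<=[.?!])\s+", text.strip()):
        for subject in find_subjects(sentence):
            bucket = facts.setdefault(subject, [])
            if sentence not in bucket:
                bucket.append(sentence)
    return facts


if __name__ == "__main__":
    sample = ("At the same time, Nicole is looking for a home in London. "
              "The house is large.")
    print(extract_facts(sample))
```

In this sketch the second sentence of the sample is not a fact, because its only capitalized word follows a period.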
Process automation
Given all of the above, we can now automatically solve the following tasks:
- make a list of the persons mentioned in the text;
- group facts by person;
- pick out persons who appear together in the same facts and thereby find the facts that connect them.
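The three tasks listed above can be sketched in a few lines. This assumes the facts have already been tagged with the persons they mention. The input format and the function name `group_and_pair` are hypothetical and not taken from the program.

```python
from collections import defaultdict
from itertools import combinations


def group_and_pair(tagged_facts):
    """tagged_facts: iterable of (sentence, persons) pairs.

    Returns two dicts: person -> sentences mentioning that person,
    and (person_a, person_b) -> sentences mentioning both.
    Pair keys are sorted alphabetically."""
    by_person = defaultdict(list)
    by_pair = defaultdict(list)
    for sentence, persons in tagged_facts:
        unique = sorted(set(persons))
        for person in unique:
            by_person[person].append(sentence)
        for a, b in combinations(unique, 2):
            by_pair[(a, b)].append(sentence)
    return dict(by_person), dict(by_pair)


if __name__ == "__main__":
    tagged = [
        ("Nicole Kidman bought a mansion.", ["Nicole", "Kidman"]),
        ("Nicole is looking for a home in London.", ["Nicole", "London"]),
    ]
    by_person, by_pair = group_and_pair(tagged)
    print(sorted(by_person))   # list of persons
    print(by_pair)             # facts connecting pairs of persons
```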
But this is not a complete list of what can be done fully automatically. Given an array of facts and persons, you can build connections between persons through facts. The link graph can be built for a single text or accumulated over subsequent texts. You can, for example, look for facts about a particular person, find out with whom that person is connected, and through which facts.
The resulting chains of connections between persons through facts can be measured by length.
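One way to build such chains and measure their length is a breadth-first search over a graph whose edges are pairs of persons sharing a fact. The article does not describe how the program does this, so the sketch below is an assumption. The names `build_graph` and `shortest_chain` are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(pairs):
    """pairs: iterable of (person_a, person_b) that share a fact."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_chain(graph, start, goal):
    """Return the shortest list of persons linking start to goal,
    or None if no chain exists. Chain length is len(result) - 1."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = current
            if neighbour == goal:
                # Walk back from goal to start and reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


if __name__ == "__main__":
    graph = build_graph([("Kidman", "Nicole"), ("London", "Nicole"),
                         ("Herald", "Sydney")])
    print(shortest_chain(graph, "Kidman", "London"))
    print(shortest_chain(graph, "Kidman", "Sydney"))
```

Accumulating the graph over several texts would amount to calling `build_graph` on the combined list of pairs.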
MadWin

All these features were collected in a single software package called “MadWin”. Unfortunately, the source code of the program cannot be published, and the program itself had to be slightly “trimmed”. The program is built as deb and rpm packages for x86. The “trimmed” version available for download can do the following:
- find persons in the text;
- find facts in the text and bind them to the persons found;
- build connections (and grade those connections) between any of the persons found.
In the output file (report) the program shows:
- a list of persons in the text;
- a list of facts by person;
- a link table;
- a list of the connections built between persons.
Example of the program
Input text
The input file, which tells a story about Nicole Kidman (taken from the press): txt (text taken from here).
A small quote from the text:
In the courtyard, behind a high fence, there is a large swimming pool and a luxurious garden. The house stands in a well-guarded quarter, well shielded from outsiders, which played a decisive role for the couple in choosing a home: the parents want their daughter to grow up in the calmest possible atmosphere.
For all these comforts, the star couple paid about five million dollars. At the same time, Nicole is looking for a home in London, where she will soon have to go to participate in the production of the musical “Nine”.
Output report file
The program's report in html format.
Persons are grouped in the report, and the facts for each of them are listed. For example, here are the facts for the Subject Nicole.
Nicole
- While some star couples make a whole business out of having children, selling the rights to publish photos and videos of their babies in advance, Nicole Kidman is not “like that”
- Actress Nicole Kidman and her husband, singer Kate Urban, turned down the millions of dollars they were offered for publishing the first photos of their newborn baby Sunday Rose
- But Kit and Nicole see the huge interest in themselves and the baby, and they appreciate it.
- For now they have no time for deals with magazines: they are enjoying the first days of their daughter's life, and Nicole is most concerned with matters such as breastfeeding.
- Note that Tom Cruise and Katie Holmes sent Nicole a large bouquet of roses and several huge packages with toys, children's clothes and other things a baby needs.
- Recently it became known that Nicole Kidman, along with her husband, country singer Keith Urban, acquired a mansion in Beverly Hills.
- At the same time, Nicole is looking for a home in London, where she will soon have to go to participate in the production of the musical "Nine"
As you can see, the fact from the text quoted above made it into the list.
Next in the report comes a table of connections between persons. Each person is assigned a number, and at the intersection of two numbers there is either a “+”, meaning that a connection between these persons can be built through facts, or a “-”, meaning the opposite.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
1: Nine, London | - | - | + | + | + | + | - | - | + | + | + | + | + | + |
2: E-motion | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
3: Hills, Beverly, Keith | + | - | - | + | + | + | - | - | + | + | + | + | + | + |
4: Holmes, Katie | + | - | + | - | + | + | - | - | + | + | + | + | + | + |
5: Cruz | + | - | + | + | - | + | - | - | + | + | + | + | + | + |
6: Tennessee, Nashville | + | - | + | + | + | - | - | - | + | + | + | + | + | + |
7: McConaughey, Matthew, Aguilera, Christina, Anthony, Mark, Lopez, Jennifer, Pitt, Brad, Jolie | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
8: Herald, Morning, Sydney | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
9: StarLife | + | - | + | + | + | + | - | - | - | + | + | + | + | + |
10: Rose, Sunday | + | - | + | + | + | + | - | - | + | - | + | + | + | + |
11: Urban | + | - | + | + | + | + | - | - | + | + | - | + | + | + |
12: Kate | + | - | + | + | + | + | - | - | + | + | + | - | + | + |
13: Kidman | + | - | + | + | + | + | - | - | + | + | + | + | - | + |
14: Nicole | + | - | + | + | + | + | - | - | + | + | + | + | + | - |
The report ends with a table of "paths" connecting all the persons between whom a connection exists.
Links
- UPD: Beta version two
- madwin x867 deb
- madwin x86 rpm
- blog author b0noI