Putting the project into trial operation. The commission watches how the system recognizes information in messages arriving in real time. The first message arrives: “Quiet.”
Commission. What does “Quiet” mean? Are they drunk at the branch?
System. “Quiet” = wind strength is within normal limits.
Commission. Ah, so it's about the weather. The system is accepted into trial operation!
All events in this article are fictional. Any resemblance to reality is coincidental.
I had the chance to work on a project where semantic analysis helped solve one of the main problems of managing a large business: obtaining timely and relevant information about the state of affairs in the company's branches.
Company “Z” has an extensive network of branches across our vast Motherland and, let's say, supplies firewood to households and enterprises. Various incidents regularly occur at the company's facilities, and the higher management levels need to know about them.
Historically, information from the facilities was passed up the organizational structure as SMS and email messages.
And, of course, on its way from level to level the information was reliably lost, miraculously altered and regularly delayed. As a result, the top saw a distorted picture of what was happening below.
We were tasked with quickly setting up the delivery of up-to-date information to the governing body. The company is huge and the deadline was short. We decided to add the mobile number and email address of the main control center to all mailings from the facilities. Nobody would read these messages; instead, we would point a smart program at them and let it figure everything out.
By “figure everything out” we meant:
Most readers will ask: “Why reinvent the wheel when there are ready-made solutions?” Good question but, as in the joke, there is a nuance. More precisely, there were many nuances: the messages consisted mostly of short sentences, and the spelling and grammar of some of them resembled the work of parish-school graduates. The text is weakly structured and full of abbreviations and acronyms, and most of the words are industry terminology multiplied by the specifics of a particular region. That is, the same entity can be called differently in Kaliningrad and in Krasnoyarsk. As a result, after going through the software available at the time and consulting with domain experts, we had to build our own solution.
A brief overview of the theory behind the stages of extracting facts from text can be found here.
The architecture is based on a three-level classifier: types of situations, types of events, types of event attributes. And the principle:
The algorithm needs a filled-in classifier for the given subject area.
The heart of an attribute is a token (roughly, a word), and there are two ways to extract tokens: regular expressions or a language dictionary. Given the specifics of the project, we chose regular expressions.
A single token can have several regular expressions, which solves the problem of synonyms and typos. Entities such as company facilities, organizational structures and administrative-territorial divisions should be based on the relevant databases. To capture quantitative characteristics, we introduced special tokens called strict links, whose regular expressions contain specially labeled “groups”.
To manage the classifiers and entities, we built a configurator in which all the configuration was done. For example, each token was marked there as either a subject or a predicate.
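A token with several regular expressions and a strict link with labeled groups might look roughly like the sketch below. The `Token` class, the role names and every pattern are assumptions made for illustration; the article does not show the real configurator format.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    name: str
    role: str            # "subject", "predicate" or "other" (hypothetical values)
    patterns: list[str]  # several regexes cover synonyms and typos

    def search(self, text: str):
        """Return the first match of any pattern, or None."""
        for pattern in self.patterns:
            match = re.search(pattern, text, flags=re.IGNORECASE)
            if match:
                return match
        return None


# Ordinary token: a correct spelling, a typo and a synonym of one concept.
AFFECTED = Token("impact", "predicate",
                 [r"\baffected\b", r"\bafected\b", r"\bhit\b"])

# Strict links: named groups carry the quantitative values.
PEOPLE_COUNT = Token("number of people", "subject",
                     [r"\b(?P<count>\d+)\s*(?:people|ppl\.?)"])
STATION_RANGE = Token("listing", "other",
                      [r"\bfrom\s+(?P<first>\d+)\s+to\s+(?P<last>\d+)\b"])

found = STATION_RANGE.search("AZ stations from 15 to 17.")
print(found.group("first"), found.group("last"))
print(PEOPLE_COUNT.search("500 people").group("count"))
```

Named groups keep the extracted numbers addressable by meaning rather than by position, which is convenient when several patterns of one token place the number differently.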
The algorithm is divided into stages:
Consider an example of an incoming message:
“02/10/15” at approximately ten in the evening, on the section of the Southern Branch of the UK, AZ stations from 15 to 17. No soc. value. Affected n.p. Ivanovo and Petrovo. 500 people ..
We clean the text of “garbage” and reduce characters that serve the same purpose to a single form. There are at least four kinds of dash alone (hyphen, minus sign, en dash, em dash), and no fewer kinds of quotation marks.
At the output we get:
02/10/15 at approximately ten in the evening, on the section of the Southern Branch of the UK, AZ stations from 15 to 17. No soc. value. Affected n.p. Ivanovo and Petrovo. 500 people
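The character normalization described above can be sketched as a translation table. The exact character sets and the `clean` function are assumptions; the real rules for what counts as “garbage” are not given in the article, so this sketch only unifies dashes and quotes, removes invisible characters and collapses whitespace.

```python
import re

# Assumed character sets; a real project would extend them as new variants turn up.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"   # hyphens, dashes, minus sign
QUOTES = "\u00ab\u00bb\u201c\u201d\u201e\u201f`"        # guillemets, curly quotes, backtick
INVISIBLE = "\u200b\u200c\u200d\ufeff"                  # zero-width characters

TRANSLATION = {ord(ch): "-" for ch in DASHES}
TRANSLATION.update({ord(ch): '"' for ch in QUOTES})
TRANSLATION.update({ord(ch): None for ch in INVISIBLE})  # None deletes the character


def clean(text: str) -> str:
    """Reduce look-alike characters to one form and collapse whitespace."""
    text = text.translate(TRANSLATION)
    return re.sub(r"\s+", " ", text).strip()


print(clean("\u00abQuiet\u00bb \u2014  wind\u200b OK"))
```

A single `str.translate` pass keeps the cleaning step linear in the message length, which matters little for SMS-sized texts but keeps the rule set in one table.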
The shorter a text segment is, the easier it is to work with. We could simply treat the characters “! ? . …” as the end of a sentence and be done with it, were it not for the abbreviations in the text. The key to the solution is a dictionary of abbreviations and of tokens whose regular expressions contain a period.
The message now has the form of an array of sentences:
02/10/15 at approximately ten in the evening, on the section of the Southern Branch of the UK, AZ stations from 15 to 17.
No soc. value.
Affected n.p. Ivanovo and Petrovo.
500 people
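Splitting into sentences with an abbreviation dictionary can be sketched as follows. The `ABBREVIATIONS` set and the word-by-word approach are assumptions for illustration; the real system also consulted tokens whose regular expressions contain a period.

```python
import re

# Hypothetical sample of the abbreviation dictionary (lowercase).
ABBREVIATIONS = {"soc.", "n.p."}

SENTENCE_END = re.compile(r"[.!?\u2026]+$")  # word ends with . ! ? or an ellipsis


def split_sentences(text: str) -> list[str]:
    """Split on end-of-sentence punctuation unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if SENTENCE_END.search(word) and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


message = ("AZ stations from 15 to 17. No soc. value. "
           "Affected n.p. Ivanovo and Petrovo. 500 people")
for sentence in split_sentences(message):
    print(sentence)
```

This sketch cannot handle an abbreviation that also ends a sentence; resolving that case needs extra context, such as the capitalization of the next word.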
From each sentence we extract tokens, strict links, database-backed entities and punctuation marks.
Examples:
At the end of this stage, each sentence is an array of tokens and punctuation marks; all unknown words are dropped (a label in square brackets refers to the text before it):
02/10/15 [token “numeric date”] at [token “preposition”] approximately ten [token “verbal number”] in the evening [token “reference to time”] , [token “comma”] on [token “preposition”] the section of the Southern Branch of the UK, AZ stations from 15 to 17 [strict link “listing”] .
No [token “negative particle”] soc. value [token “socially significant object”] .
Affected [token “impact”] n.p. [token “settlement”] Ivanovo [entity “settlement”] and [token “conjunction”] Petrovo [entity “settlement”] .
500 people [strict link “number of people”] .
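The extraction step above can be sketched as a left-to-right scan that tries every token pattern at the current position and drops words nothing recognizes. All patterns, the `SETTLEMENTS` set (standing in for a database lookup) and the `tokenize` function are assumptions for illustration.

```python
import re

# Hypothetical token patterns; order matters, more specific patterns come first.
TOKEN_PATTERNS = [
    ("numeric date", r"\d{2}/\d{2}/\d{2}\b"),
    ("number of people", r"(?P<count>\d+)\s+people\b"),                   # strict link
    ("listing", r"from\s+(?P<first>\d+)\s+to\s+(?P<last>\d+)\b"),         # strict link
    ("settlement", r"n\.p\."),
    ("impact", r"affected\b"),
    ("conjunction", r"and\b"),
    ("comma", r","),
    ("period", r"\."),
]
COMPILED = [(name, re.compile(p, re.IGNORECASE)) for name, p in TOKEN_PATTERNS]
SETTLEMENTS = {"Ivanovo", "Petrovo"}   # stand-in for the settlements database
WORD = re.compile(r"\w+|\S")           # one word or one non-space character


def tokenize(sentence: str) -> list[tuple[str, str]]:
    """Return (label, text) pairs; unknown words are dropped."""
    tokens, pos = [], 0
    while pos < len(sentence):
        if sentence[pos].isspace():
            pos += 1
            continue
        for name, pattern in COMPILED:
            match = pattern.match(sentence, pos)
            if match:
                tokens.append((name, match.group(0)))
                pos = match.end()
                break
        else:
            word = WORD.match(sentence, pos)
            if word.group(0) in SETTLEMENTS:
                tokens.append(("entity: settlement", word.group(0)))
            pos = word.end()  # anything else is an unknown word and is skipped
    return tokens


print(tokenize("Affected n.p. Ivanovo and Petrovo."))
```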
For each event attribute, the configurator specifies an algorithm and its input data (tokens and strict links), as well as settings that adjust how the algorithm works. For example, the “token order matters” flag means that the algorithm must not only find the tokens in the text but also check that their order matches the order specified in the configurator.
The main algorithm, used for the vast majority of attributes, determines the links between subjects and predicates. It relies on syntax rules and “peeks” into adjacent sentences.
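The effect of the token-order flag can be illustrated with a small check. The function name `attribute_matches` and the `order_matters` parameter are hypothetical; the sketch only shows the difference between “all tokens present” and “all tokens present in the configured order”.

```python
def attribute_matches(required: list[str], found: list[str],
                      order_matters: bool = False) -> bool:
    """Check that every required token label occurs among the found labels.

    With order_matters=True the required labels must appear in the same
    relative order (other tokens may sit between them).
    """
    if not order_matters:
        return set(required) <= set(found)
    remaining = iter(found)
    # "label in remaining" consumes the iterator up to the first hit,
    # so each next label must be found further to the right.
    return all(label in remaining for label in required)


found = ["socially significant object", "negative particle", "period"]
required = ["negative particle", "socially significant object"]
print(attribute_matches(required, found))                      # order ignored
print(attribute_matches(required, found, order_matters=True))  # order enforced
```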
Example:
02/10/15 at ten in the evening [date and time], Southern Branch of the UK [organizational structure], AZ [predicate 1] stations [subject 1 for predicate 1] from 15 to 17 [clarification for subject 1].
No soc. value [negation with a subject (no socially significant consumers)].
Affected [predicate 2] n.p. [subject 2 for predicate 2 (settlements)] Ivanovo [subject for predicate 2, clarification for “n.p.”] and Petrovo [subject for predicate 2, clarification for “n.p.”].
500 people [subject 3 for predicate 2; the predicate is taken from the adjacent sentence].
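A heavily simplified sketch of the “peek into the adjacent sentence” idea: a subject attaches to the latest predicate in its own sentence, and if there is none, it borrows the predicate of the immediately preceding sentence. The real algorithm also relies on syntax rules that the article does not spell out, so the function and data layout here are assumptions.

```python
def link_subjects(sentences: list[list[tuple[str, str]]]) -> list[tuple[str, str]]:
    """Return (predicate, subject) pairs.

    Each sentence is a list of (role, text) tokens with role "predicate"
    or "subject". Only the directly preceding sentence is consulted.
    """
    links = []
    previous_predicate = None
    for sentence in sentences:
        predicate = None
        for role, text in sentence:
            if role == "predicate":
                predicate = text
            elif role == "subject":
                owner = predicate or previous_predicate
                if owner is not None:
                    links.append((owner, text))
        previous_predicate = predicate  # None if this sentence had no predicate
    return links


sentences = [
    [("predicate", "Affected"), ("subject", "n.p."),
     ("subject", "Ivanovo"), ("subject", "Petrovo")],
    [("subject", "500 people")],
]
print(link_subjects(sentences))
```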
At this stage, event attributes are assembled into events. For each event, the configurator lists its attributes and settings, the most important of which is the main-attribute mark. Without the main attribute, the event is not created. For example, the event “Attempted theft” has the following attributes: “Fact of attempted theft” (main attribute), “Time of occurrence”, “Organizational structure”.
The same attribute can take part in several events; for example, the organizational structure named at the beginning of a sentence can be used for all subsequent events in the message.
Event formation takes data from neighboring sentences into account. For example, if the previous sentence says that the accident has been eliminated and the following ones list affected consumers, then the list is a fact of recovery from damage, not a fact of damage.
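The main-attribute rule and attribute sharing can be sketched like this. `EventType`, `build_events` and the dictionary layout are hypothetical names chosen for the example; only the rule itself (no main attribute, no event) comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventType:
    name: str
    main_attribute: str
    other_attributes: tuple[str, ...] = ()


def build_events(event_types: list[EventType], attributes: dict) -> list[dict]:
    """Create one event per type whose main attribute was extracted.

    The same extracted attribute (for example the organizational
    structure) may be reused by several events.
    """
    events = []
    for event_type in event_types:
        if event_type.main_attribute not in attributes:
            continue  # without the main attribute the event is not created
        names = (event_type.main_attribute, *event_type.other_attributes)
        events.append({
            "type": event_type.name,
            "attributes": {n: attributes[n] for n in names if n in attributes},
        })
    return events


THEFT = EventType("Attempted theft", "Fact of attempted theft",
                  ("Time of occurrence", "Organizational structure"))
extracted = {"Fact of attempted theft": True,
             "Organizational structure": "Southern Branch"}
print(build_events([THEFT], extracted))
```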
Example:
02/10/15 4:35 PM South. branch of the UK. Accident completely eliminated. Affected consumers: 500 people.
The basic information has been extracted from the message; now it has to be added to an existing situation or registered as a new one.
The merging principle is “What? Where? When?”. That is, messages belong to the same situation if they answer all three questions in the same way.
So, for each message, we look for similar facilities (“Where”) with compatible events (“What”) within a time delta (“When”).
For example, an earthquake hit three lower-level facilities, each of which sent a message to the next level and a copy to the very top. The “next level”, in turn, compiled the information and also sent it to the very top. As a result, the top receives four messages, which the system merges into one situation. That the “Where” facilities are close to each other we learn from the directory of company facilities.
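The “What? Where? When?” merge can be sketched as below. The `PARENT` directory, the two-hour delta and the rule that events are compatible only when their types are equal are all assumptions made to keep the example small.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical facility directory: lower-level facility -> its branch.
PARENT = {"Site 1": "Branch A", "Site 2": "Branch A", "Site 3": "Branch A"}


@dataclass
class Message:
    what: str        # event type
    where: str       # facility name
    when: datetime


def area(facility: str) -> str:
    """Facilities under the same branch are treated as close to each other."""
    return PARENT.get(facility, facility)


def same_situation(a: Message, b: Message, delta: timedelta) -> bool:
    return (a.what == b.what
            and area(a.where) == area(b.where)
            and abs(a.when - b.when) <= delta)


def merge(messages: list[Message],
          delta: timedelta = timedelta(hours=2)) -> list[list[Message]]:
    """Group messages into situations, oldest first."""
    situations: list[list[Message]] = []
    for message in sorted(messages, key=lambda m: m.when):
        for situation in situations:
            if any(same_situation(message, other, delta) for other in situation):
                situation.append(message)
                break
        else:
            situations.append([message])
    return situations


t0 = datetime(2015, 10, 2, 22, 0)
inbox = [
    Message("earthquake", "Site 1", t0),
    Message("earthquake", "Site 2", t0 + timedelta(minutes=5)),
    Message("earthquake", "Site 3", t0 + timedelta(minutes=10)),
    Message("earthquake", "Branch A", t0 + timedelta(minutes=40)),
    Message("earthquake", "Site 1", t0 + timedelta(days=1)),
]
print([len(s) for s in merge(inbox)])
```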
Based on the composition of events in a situation, its type is determined. Types of situations are very important for decision making in operational management.
For example, a situation contains the following set of events: loss of telemetry, collapse of part of a building, interruption of consumer supply, an avalanche. The situation type will be determined as “collapse of a building or structure”.
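The article does not say how the type is derived from the set of events. One possible scheme, shown purely as an assumption, is an ordered list of rules where the first rule whose required events are all present wins; the rule list, the “supply interruption” type and the default value are invented for the example.

```python
# Hypothetical rules, most important first: (required events, situation type).
RULES = [
    ({"collapse of part of a building"}, "collapse of a building or structure"),
    ({"interruption of consumer supply"}, "supply interruption"),
]


def situation_type(events: list[str], rules=RULES,
                   default: str = "unclassified") -> str:
    """Return the type of the first rule fully covered by the events."""
    present = set(events)
    for required, name in rules:
        if required <= present:
            return name
    return default


events = ["loss of telemetry", "collapse of part of a building",
          "interruption of consumer supply", "avalanche"]
print(situation_type(events))
```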
In addition to statistics and reports, the extracted data was used in the following tools:
The chosen approach proved effective: it provided the company's top level with operational data from all of its branches at a relatively small cost in time and money.
Advantages of the approach:
Disadvantages of the approach:
Initially the system's users were expected to be the company's top-level employees, but that changed after several incidents during trial operation, which became the main indicator of the success of the project and of the approach in particular.
For example, we began to receive calls from our consultants asking us to describe the situation with related accidents between sites, since they already knew that the system existed and that this information could only be obtained from it.
Or take a videoconference meeting between the head office and the branches, when the head of the company, looking at our system's dashboard, asked a branch manager: “Why wasn't the fire at the site extinguished for an hour and a half?”, to which he received the answer: “What fire?” Information about the fire reached the branch manager through the old channels only half an hour after the meeting.
Source: https://habr.com/ru/post/351434/