
InterSystems iKnow. Part two. Creating a simple domain

This is a continuation of my story about InterSystems iKnow, the Natural Language Processing technology, which began in part one. This second part covers practical work with iKnow: we will create a domain, configure it, load texts into it, and then look at and analyze the results.


Start by creating a domain. A domain in iKnow can be compared to a namespace in Caché or to a mailbox in the entrance hall of your building: it is a container into which texts are loaded. Besides the texts, it stores the tools needed to analyze them, for example configurations, loaders, listers, and dictionaries.
There are two ways to create a domain. One is to use the %iKnow.Domain class. With this approach you write by hand the code that creates the domain and every object inside it. The process is fairly involved and takes time and experience with iKnow, but it lets you build complex iKnow applications with post-processing of the indexed data.
The alternative is based on the %iKnow.DomainDefinition class and is well suited to rapid prototyping. It lets you create a domain declaratively by describing its structure in XML; the domain object itself is created automatically when the class is compiled. This method is simpler and more compact, and lets you quickly create a new domain from scratch. In this article I describe this second approach and give code examples.
Note. I write and test the code in Caché 2015.2 Field Test. The main difference from previous versions is support for stemming and lemmatization. Because of this, a number of settings will differ, but more on that later. So let's get started.

Step zero. Formulation of the problem
Before writing any code, I will formulate the task to be solved. We will write the simplest possible news aggregator. To do this, we create an iKnow domain that loads articles from RSS feeds and then teach it to sort these articles by topic. I propose the following news categories: “politics”, “economy”, “sport” and, for example, “threat from space”.

Step one. Domain creation

Create a domain using DomainDefinition. To do this, it is enough to compile this class:

Class HabrDomain.News Extends %iKnow.DomainDefinition
{

XData Domain [ XMLNamespace = TEST ]
{
<domain name="NewsAggregator">
</domain>
}

}


Note that the domain itself is completely empty and is created as an object immediately after the HabrDomain.News class is compiled. To verify this, run the command
do $system.iKnow.ListDomains()
in the terminal. You will see that the NewsAggregator domain has been created with an ID of 1 (the ID may differ if you have already created other domains) and with no loaded texts (# of sources is 0).

Step two. Domain configuration

Setting up a domain can mean a very wide range of actions, but here we will talk about the domain configuration. The configuration is used only when documents are loaded into the domain and determines how iKnow processes the text. A configuration is created as a separate object, so you can create it once and reuse it for different domains in the same namespace. In theory the configuration is optional and all settings fall back to default values, but in that case you can forget about working with Russian texts.
To describe the configuration in DomainDefinition, add a line inside the Domain tags:

 <configuration name="Russian" detectLanguage="false" languages="ru" stemming="DEFAULT" /> 


This line creates a configuration named “Russian” that uses the semantic model of the Russian language for text analysis and disables automatic language detection for the documents. The “stemming” parameter with the value “DEFAULT” is a necessary (but not sufficient) condition for Russian lemmatization to be applied during text analysis.
To complete the stemming setup, add one more line after the configuration:

 <parameter name="$$$IKPSTEMMING" value="1" /> 


Step three. Creating metadata fields

When we load articles into our domain, we load more than just the texts. RSS feeds provide plenty of other useful information that we can use later. To store this data, we set up metadata fields. To do this, add the following lines to the XData block of the class:

<metadata>
    <field name="PubDate" dataType="DATE" />
    <field name="Title" dataType="STRING" />
    <field name="Link" dataType="STRING" />
    <field name="Agency" dataType="STRING" />
    <field name="Country" dataType="STRING" />
</metadata>


This describes five fields. PubDate stores the publication date of the article, Title its title, and Link the link to the full text. Agency holds the name of the resource the article was loaded from, and Country the country of the source.

Step four. Describing the sources to load articles from

When formulating the task, we agreed that the texts would be loaded from RSS feeds. As an example, I will take the rbc.ru feed http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/mainnews.rss, which publishes news from all sections. To tell iKnow to work with this resource, add the code:

<data>
    <rss serverName="static.feed.rbc.ru" url="/rbc/internal/rss.rbc.ru/rbc.ru/mainnews.rss" textElements="title,description">
        <converter converterClass="%iKnow.Source.Converter.Html" />
        <metadataValue field="Agency" value="RBC" />
        <metadataValue field="Country" value="Russia" />
    </rss>
</data>


Now let me describe the attributes of this entry in more detail. serverName is the server name, that is, the first part of the feed link up to and including the top-level domain (in our case .ru). The rest of the link goes into the url attribute; note that url always starts with “/”. From each publication we load two text fields, the title and the description (the text published in the feed itself; most often it is a brief introduction rather than the full article).
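To make the split between serverName and url concrete, here is a small Python sketch. It is only an illustration and not part of iKnow: the helper name split_feed_url is hypothetical, and it simply uses the standard URL parser to cut a full feed address into the two attribute values.

```python
from urllib.parse import urlsplit


def split_feed_url(feed_url):
    """Split a full feed URL into the (serverName, url) pair used in <rss>.

    Hypothetical helper for illustration only; iKnow itself expects the two
    values to be written by hand in the domain definition.
    """
    parts = urlsplit(feed_url)
    server_name = parts.netloc      # host part, e.g. "static.feed.rbc.ru"
    path = parts.path or "/"        # the url attribute always starts with "/"
    if parts.query:
        path += "?" + parts.query   # keep a query string if the feed has one
    return server_name, path


server_name, url = split_feed_url(
    "http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/mainnews.rss"
)
print(server_name)
print(url)
```

The two printed values are the ones that appear in the serverName and url attributes of the rss element above.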
Next comes the converter. In our case it is the standard %iKnow.Source.Converter.Html, whose purpose is to remove all HTML tags from the loaded text and leave plain text.
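What such a converter does can be shown with a few lines of Python. This is a rough sketch of the idea of tag stripping, assuming nothing about how %iKnow.Source.Converter.Html is actually implemented; the names TagStripper and strip_html are hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.chunks.append(data)


def strip_html(markup):
    """Return the plain text of an HTML fragment with whitespace normalized."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


print(strip_html("<p>Oil prices <b>fell</b> again.</p>"))
```

The feed description usually arrives with markup of this kind, which is why the converter is attached to the rss source.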
Finally, we describe how the metadata is loaded. Above we created five fields, three of which iKnow fills in automatically: the publication date, the title, and the link to the full text of the article. The two remaining fields are filled in here: “RBC” is written to the “Agency” field and “Russia” to the “Country” field.

Step five. Dictionaries

One of the advantages of iKnow is that basic text analysis relies not on dictionaries but on a compact and fast semantic language model. Still, there are tasks for which we do need dictionaries. One of them is matching, that is, assigning articles to topics (for example, articles about sport, politics, the economy, or the threat of an alien invasion). In other words, when describing a domain we can define terms whose mention in a text assigns the article to one category or another. Add the following code to the class:

<matching>
    <dictionary name="Sport">
        <item name="" uri=":sport:">
            <term string="" />
            <term string="" />
            <term string=" " />
            <term string=" " />
        </item>
        <item name="" uri=":sport:">
            <term string="" />
            <term string="" />
        </item>
        <item name="" uri=":sport:">
            <term string="" />
        </item>
        <item name="" uri=":sport:">
            <term string="" />
            <term string="" />
            <term string="-" />
            <term string="" />
            <term string="" />
        </item>
    </dictionary>
    <dictionary name="">
        <item name="" uri=":politics:">
            <term string="" />
            <term string="" />
            <term string="" />
            <term string="" />
            <term string="" />
        </item>
    </dictionary>
    <dictionary name="">
        <item name="" uri=":Economy:">
            <term string="" />
            <term string="" />
            <term string="" />
            <term string="" />
            <term string="" />
            <term string="" />
        </item>
    </dictionary>
    <dictionary name=" ">
        <item name=" " uri=":ThreatFromSpace:">
            <term string="" />
            <term string="" />
            <term string="" />
            <term string=" " />
            <term string="" />
        </item>
    </dictionary>
</matching>


The matching section contains a set of dictionaries. Each dictionary describes one category; inside it, items (subcategories) group the individual terms. The purpose of this article is simply to demonstrate the capabilities and mechanisms of iKnow; for a serious task the dictionaries would also have to be serious and very large.
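The dictionary, item, and term hierarchy can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the English terms are hypothetical stand-ins, and the naive whole-word search is much simpler than what iKnow does, since iKnow matches dictionary terms against the entities it has identified and can take stemming into account.

```python
import re

# Hypothetical dictionaries: dictionary name -> item name -> list of terms.
DICTIONARIES = {
    "Sport": {
        "football": ["football", "goalkeeper"],
        "hockey": ["hockey", "puck"],
    },
    "Economy": {
        "finance": ["exchange rate", "inflation"],
    },
}


def match_categories(text, dictionaries=DICTIONARIES):
    """Return {dictionary name: sorted list of terms found in the text}."""
    lowered = text.lower()
    matches = {}
    for dict_name, items in dictionaries.items():
        found = set()
        for terms in items.values():
            for term in terms:
                # Whole-word, case-insensitive search for the term.
                pattern = r"\b" + re.escape(term.lower()) + r"\b"
                if re.search(pattern, lowered):
                    found.add(term)
        if found:
            matches[dict_name] = sorted(found)
    return matches


print(match_categories("Inflation slows as the exchange rate stabilizes"))
```

An article is assigned to every category for which at least one term is found, which is the same idea as the uri values attached to the items above.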

Step six. Launch

Now our domain is fully described.
Full text of the class:
Class HabrDomain.News Extends %iKnow.DomainDefinition
{

XData Domain [ XMLNamespace = TEST ]
{
<domain name="NewsAggregator">
    <configuration name="Russian" detectLanguage="false" languages="ru" stemming="DEFAULT" />
    <parameter name="$$$IKPSTEMMING" value="1" />
    <metadata>
        <field name="PubDate" dataType="DATE" />
        <field name="Title" dataType="STRING" />
        <field name="Link" dataType="STRING" />
        <field name="Agency" dataType="STRING" />
        <field name="Country" dataType="STRING" />
    </metadata>
    <data>
        <rss serverName="static.feed.rbc.ru" url="/rbc/internal/rss.rbc.ru/rbc.ru/mainnews.rss" textElements="title,description">
            <converter converterClass="%iKnow.Source.Converter.Html" />
            <metadataValue field="Agency" value="RBC" />
            <metadataValue field="Country" value="Russia" />
        </rss>
    </data>
    <matching>
        <dictionary name="Sport">
            <item name="" uri=":sport:">
                <term string="" />
                <term string="" />
                <term string=" " />
                <term string=" " />
            </item>
            <item name="" uri=":sport:">
                <term string="" />
                <term string="" />
            </item>
            <item name="" uri=":sport:">
                <term string="" />
            </item>
            <item name="" uri=":sport:">
                <term string="" />
                <term string="" />
                <term string="-" />
                <term string="" />
                <term string="" />
            </item>
        </dictionary>
        <dictionary name="">
            <item name="" uri=":politics:">
                <term string="" />
                <term string="" />
                <term string="" />
                <term string="" />
                <term string="" />
            </item>
        </dictionary>
        <dictionary name="">
            <item name="" uri=":Economy:">
                <term string="" />
                <term string="" />
                <term string="" />
                <term string="" />
                <term string="" />
                <term string="" />
            </item>
        </dictionary>
        <dictionary name=" ">
            <item name=" " uri=":ThreatFromSpace:">
                <term string="" />
                <term string="" />
                <term string="" />
                <term string=" " />
                <term string="" />
            </item>
        </dictionary>
    </matching>
</domain>
}

ClassMethod DeleteDomain(DomainName As %String) As %Status
{
    // Drop the contents of the domain first, then delete the domain object itself
    set tSC = ##class(%iKnow.Domain).%OpenId(..%GetDomainId()).DropData(1, 1, 1, 1, 1)
    quit:$$$ISERR(tSC) tSC
    quit ##class(%iKnow.Domain).%DeleteId(..%GetDomainId())
}

}



A few words about the DeleteDomain method that I added to the code. The created domain exists as an object of the %iKnow.Domain class, but it can be deleted only through the methods of the HabrDomain.News class itself, since it is this class that manages the domain.
Finally, we can start the build:
do ##class(HabrDomain.News).%Build()
As a result, articles from the sources we specified are added to the NewsAggregator domain created at compile time. In addition, the data is checked for occurrences of the terms from the matching dictionaries.

Step seven. Viewing the results

To view the results, it is best to use one of the existing UIs: the Knowledge Portal, Indexing Results, and Matching Results.

Figure 1. Knowledge Portal.

The Knowledge Portal lets you carry out an initial analysis of iKnow's results. Here you can select any of the created domains, in our case NewsAggregator. The “Top concepts” table shows how often particular concepts are mentioned: frequency is the total number of mentions of a concept, and spread is the number of articles in which the concept occurs. If we select a concept in this table (in Figure 1 the concept “Russia” is selected), the contents of the “Similar Entities”, “Related Concepts”, “Paths”, and “Sources” tables are updated.
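The difference between frequency and spread is easy to see on a toy example. The Python sketch below is only an illustration with made-up data: it assumes each article has already been reduced to a list of concepts, which in reality is the job of the iKnow engine, and the function name frequency_and_spread is hypothetical.

```python
from collections import Counter


def frequency_and_spread(sources):
    """Compute per-concept frequency and spread.

    sources: a list of articles, each given as a list of concept strings.
    frequency counts every mention; spread counts each article at most once.
    """
    frequency = Counter()
    spread = Counter()
    for concepts in sources:
        frequency.update(concepts)
        spread.update(set(concepts))
    return frequency, spread


# Three made-up articles, already reduced to their concepts.
articles = [
    ["russia", "russia", "oil"],
    ["russia"],
    ["oil", "ruble"],
]
frequency, spread = frequency_and_spread(articles)
print(frequency["russia"], spread["russia"])
```

Here “russia” is mentioned three times but occurs in only two articles, so its frequency is higher than its spread.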
The “Similar Entities” table shows similar concepts. In our case these are concepts that contain the word “Russia” together with additional words (for example, “the ambassador of Serbia to Russia”). The “Related Concepts” table lists the concepts most often mentioned in connection with the selected one; in our case it turned out to be empty because of the small number of loaded articles. In the example below, such concepts are shown in italics.
Another very interesting table is “Sources”. From here you can open the text of an article as well as its indexing and categorization results. Viewing the text is straightforward: the only parameter you can set in the dialog is the number of sentences to display. For example, if we set it to 1, iKnow shows the single sentence of the article that it considers most important.

Figure 2. Indexing Results.

The Indexing Results window lets you analyze the results of indexing. Here concepts are highlighted in color, relations are shown in underlined italics, and non-relevant words in gray italics. As a rule, this window is used to check from the indexing results that the domain is configured correctly, but it is also very convenient for reading the texts of the articles (for example, when compiling dictionaries).

Figure 3. Matching Results.

Finally, the third available window is Matching Results. Here you can see how the articles were categorized by the dictionaries we added to the domain definition. A concept highlighted in solid red matches a dictionary term exactly. If only the frame around the concept is red, the concept is similar to a dictionary term but does not match it exactly.
It's time to summarize. We learned how to create the simplest news aggregator, using the analysis of news from an RSS feed as the example. To do this, we defined a domain with the %iKnow.DomainDefinition class, created a configuration in it that supports the Russian language and lemmatization, added an RSS feed as a source, and created simple dictionaries for categorizing articles. After that we built the domain and analyzed the results in the standard UIs.
The DomainDefinition class is great for quickly creating domains and prototyping with iKnow. In real-world applications, the dictionaries of terms for categorization and sentiment analysis contain hundreds or even thousands of words. For such projects the %iKnow.Domain class is used, which also lets you solve other interesting tasks. This will be discussed in my next article.

Source: https://habr.com/ru/post/244697/

