Every day, while performing official and other duties, a modern person has to analyze a large amount of information and search for the data he needs. Over time, user data accumulates in the form of documents, and these documents make up a user information space. With each new document, the problem of organizing this space becomes more acute: a couple of folders with hierarchically arranged files eventually turn into a huge pile of documents that is difficult to fit into a hierarchy with linear links. This raises the task of specifying, categorizing and visualizing the user's information space.
Let us define the terminology: in this article, the user information space is understood as a set of text documents (files; not tabular or graphical ones) distributed across the file system within a certain directory hierarchy. For clarity, let us also assume that all documents of the information space belong to a single subject area, for example, economics. The text files can be economic articles, scientific papers, educational literature and other forms of economic textual information.
At the initial stage of its formation, the user can easily navigate the information space because it is small and, as a result, its structure and the connections between its elements are fairly clear. As time passes and the user performs official, scientific, and everyday tasks, the size of the information space increases, the weight of individual links between nodes (files) decreases, and navigation becomes more difficult. As a result, the time spent searching for the necessary information grows, and the quality and productivity of the user's work within his information space fall.
As a rule, this is caused not only by the growing amount of textual information but also by the low speed at which the user perceives it. Finding the right fragment in the entire array is also difficult: the user must formulate a search query correctly to get adequate results, and this is sometimes problematic because of, for example, the user's limited knowledge of the subject area, the presence of synonyms, or facts that describe different things in similar wording. In addition, the full-text document search provided by the operating system offers neither personalization nor relevance ranking of the search results, which also adversely affects the speed of the user's work and the quality of the organization of his information space.
From the above disadvantages of the standard search and organization of the information space, it follows that, to optimize the user's information activity, it is necessary to:
- Split the subject area into categories or "zones"
- Highlight key domain nodes
- Visualize the subject area to speed up human perception
- Identify nodes within each domain element (ontology formation)
- Determine the properties of objects inside the domain nodes and their connections (completion of ontology formation)
- Determine the links and interactions between the nodes of the subject area (drawing up a semantic network of nodes - ontologies)
- Link the levels of visualization and functional description of the subject area (the imposition of the Tag Map, ontologies and semantic network on each other)
- Implement the function of personalization of the subject area and the relevance of its presentation on the basis of iterative learning in the process of user interaction.
To implement the above, it is advisable to use three technologies together: tag maps (tag clouds), ontologies and semantic networks. Taken individually, none of them eliminates all the existing shortcomings in the organization of the information space; each optimizes only its own part, and only their combination improves the user's information activity as a whole.
Tag maps (clouds) represent the top level of the description of the subject area; they are a kind of GUI (Graphical User Interface) for it. Here there is some deviation from the classic description of a tag map (cloud): a tag is more than a plain text label. In our case, a tag is the name of an object that unites one or several domain ontologies. The shift from the classic “cloud” towards a “map” is due to the division of the subject area into zones (as a country is divided into regions on a political map). This division was introduced to speed up the user's visual perception and to make the search for the necessary data among the ontologies (facts, documents) of the subject area more intuitive. The tag map is the top level; it is built from the data of the two levels below it: ontologies and semantic networks.
Tag maps make it possible to ensure that search results are relevant for a specific user. For this purpose, it is advisable to record the history of the user's navigation through the information space by tags, the so-called “route of knowledge”. This “route” is accumulated and corrected over time: with each new search, the user's movement across the tag map defines connections between its nodes. The next time the user accesses a node, the system returns not only the ontologies that the node represents but also the ontologies or nodes related to it. Organizing search results in this way personalizes the user's information space: the user is offered the options he needs, in accordance with the preference data collected by the system.
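One possible way to record such a “route of knowledge” is to count how often the user moves from one tag to another and to suggest the most frequent followers of the current tag. The sketch below is only an illustration under that assumption: the class name `KnowledgeRoute`, its methods and the sample tags are hypothetical and are not taken from the described system.

```python
from collections import defaultdict


class KnowledgeRoute:
    """Hypothetical recorder of tag-to-tag transitions on the tag map."""

    def __init__(self):
        # transitions[a][b] = how many times the user went from tag a to tag b
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.last_tag = None

    def visit(self, tag):
        """Register that the user opened `tag`."""
        if self.last_tag is not None and self.last_tag != tag:
            self.transitions[self.last_tag][tag] += 1
        self.last_tag = tag

    def related(self, tag, limit=3):
        """Return up to `limit` tags most often opened right after `tag`."""
        followers = self.transitions.get(tag, {})
        # Sort by descending count, then alphabetically for a stable order.
        ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
        return [name for name, _ in ranked[:limit]]


route = KnowledgeRoute()
for tag in ["inflation", "monetary policy", "inflation",
            "interest rates", "inflation", "monetary policy"]:
    route.visit(tag)

print(route.related("inflation"))
```

Counting only direct transitions keeps the example short; a real system could also weight recent visits more heavily or follow longer paths.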
Standard tag clouds have one major drawback: new and old data sources (by date of addition) are marked on the map in the same way, so the user cannot tell in advance how recent the node he is about to open is. It is proposed to add a “novelty” property to the tag: a color chosen according to a predetermined relative time scale (palette). For example, tags describing nodes added within the last day are shown in white, tags added within the last week in yellow, and so on. This property will also help to organize the user's information space better and to speed up access to the required information.
Ontologies represent the lowest level of the data description. Each user file is treated as a separate ontology. This convention was adopted to simplify the construction of the functional structure of the information space, since the source data are text documents stored as files in a directory structure on a file system. Within this system, an ontology is taken to be a formal explicit description of a domain node. This deviates from the classical definition, in which an ontology describes an entire subject area rather than a single node; the deviation was made to increase the degree of personalization of the user's information space. There are several ready-made ontologies of the subject area “economics”; they cover different scales and have different degrees of detail, but none of them is tailored to a specific user.
To keep the system from being “disposable”, a general dictionary or thesaurus of the subject area should be taken as a basis; it would describe all the concepts of the subject domain “economics” and their properties, as well as some fundamental connections and relations between them. Such a central repository of ontology classes, slots and facets makes it possible to use a client-server architecture, which has certain advantages over a standalone version, namely:
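The “novelty” property can be sketched as a lookup in an ordered palette of age thresholds. Only the first two rows (one day is white, one week is yellow) come from the text; the third row, the fallback color and the function name `novelty_colour` are assumptions added for illustration.

```python
from datetime import datetime, timedelta

# (maximum age, colour) pairs, checked in order.
NOVELTY_PALETTE = [
    (timedelta(days=1), "white"),     # from the text
    (timedelta(weeks=1), "yellow"),   # from the text
    (timedelta(days=30), "orange"),   # assumed extra step of the scale
]
DEFAULT_COLOUR = "grey"               # assumed colour for older nodes


def novelty_colour(added_at, now):
    """Pick the tag colour from the age of the node it describes."""
    age = now - added_at
    for max_age, colour in NOVELTY_PALETTE:
        if age <= max_age:
            return colour
    return DEFAULT_COLOUR


now = datetime(2024, 1, 31, 12, 0)
print(novelty_colour(datetime(2024, 1, 31, 9, 0), now))   # added 3 hours ago
print(novelty_colour(datetime(2024, 1, 27, 12, 0), now))  # added 4 days ago
print(novelty_colour(datetime(2023, 6, 1), now))          # added months ago
```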
- fault tolerance (ontology objects will be stored on a server with a data backup system)
- scalability (new users can be connected to the system fairly quickly)
- optimal use of computing power
- development by a group of users (the ontology will be extended and optimized not by one person but by several, which will significantly accelerate its development, make it possible to build a large ontology in a relatively short time, and help to avoid redundancy, since the ontology design and maintenance tools can search for duplicates and synonyms)
With a client-server architecture, the server will store the domain classes, their slots and facets, and the fundamental connections. The client side will store the class instances and the relationships between classes and instances that refine (personalize) the model. Thus, the system on the user's side will be able to use the ontologies already stored on the server, and their elements, to build its own personalized ontologies, and also to send the results of its work to the server, thereby ensuring that the server ontology is updated and evolves.
As mentioned earlier, each file will be treated as a separate ontology, and its relationship with the server ontology will be of the type “contains”. A question may arise: “Why use such an ontology architecture? Why not just create one large ontology and work within it?” The explanation is as follows: using a set of ontologies makes it possible to personalize the system and the information space provided by the client-server architecture; at the same time, thanks to the ontology matching mechanism, the number of ontologies does not hamper the analysis of the interconnections between them.
In addition, a new document added to the user's information space is much easier to process if a separate small ontology is created from it by a universal algorithm. This ontology is subsequently compared with the server ontology, while the original ontology is preserved.
Using the example of the subject domain “economics”, we can say, with some reservations, that there is a universal methodology for forming ontologies of domain nodes from user files. In general, any ontology is formed in several stages:
- Determining the scope and scale of the ontology
Before creating an ontology, it is necessary to determine its scope and scale. To do this, a few questions must be answered:
- What area will the ontology cover?
- What will the ontology be used for?
- What types of questions should the information in the ontology answer?
Answers to these questions may change during the ontology design process, but at any given time they help to limit the scale of the model.
- Consideration of options for reusing existing ontologies
It is necessary to check whether the source ontologies on the server can be used as they are, or refined and extended.
- Enumeration of important terms in the ontology
It is useful to make a list of the main terms of ontologies and their properties.
- Definition of classes and class hierarchy
There are several possible approaches for developing a class hierarchy:
- Top-down development begins with the definition of the most general concepts of the subject area, followed by their specialization.
- Bottom-up development begins with the definition of the most specific classes, the leaves of the hierarchy, followed by the grouping of these classes into more general concepts.
- Combined development is a combination of the top-down and bottom-up approaches: first the more salient concepts are identified, and then they are generalized and specialized as appropriate.
Whichever approach is chosen, it is necessary to begin with the definition of the ontology classes.
- Definition of class properties (slots)
Classes by themselves do not provide enough information about the subject area: once the classes are defined, it is necessary to describe the internal structure of the concepts.
In an ontology, several types of object properties can become slots:
- internal properties of the object
- external properties of the object
- parts if the object has a structure (can be both physical and abstract parts)
- relationships with other individual concepts
A slot must be attached to the most general class in the hierarchy that can have this property.
- Definition of slot facets
Slots can have different facets, which describe the value type, the allowed values, the number of values (cardinality) and other properties of the values that the slot can take.
Here are some common facets:
Slot cardinality determines how many values a slot can have. Some systems distinguish only between single cardinality (at most one value) and multiple cardinality (any number of values).
The value type facet describes which types of values can be entered in the slot. The most common value types are string, number, Boolean, enumerated, and instance (instance-type slots describe relationships between instances).
- Slot domain and its range of values
The classes to which a slot is attached, or the classes whose properties the slot describes, are called the domain of the slot. In systems where slots are assigned to classes, the slot domain usually consists of the classes to which the slot is attached. The basic rules for determining the domain and the range of a slot are similar:
When determining the domain or the range of a slot, find the most general class or classes that can serve as the domain or the range, respectively.
On the other hand, the domain and the range should not be defined too generally: every class in the slot domain should be described by the slot, and instances of every class in the slot range should be potential values of the slot.
The last step is to create individual instances of the classes in the hierarchy. To define an individual instance of a class, you need (1) to select a class, (2) to create an individual instance of this class, and (3) to enter the values of the slots.
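The stages above (classes, slots, facets, instances) can be illustrated with a small data model. This is a minimal sketch, not the system's actual storage format: the names `Slot`, `OntologyClass` and `create_instance`, and the sample classes `Document` and `Article`, are hypothetical, and only two facets (value type and single or multiple cardinality) are modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Slot:
    name: str
    value_type: type        # value type facet
    multiple: bool = False  # cardinality facet: single or multiple


@dataclass
class OntologyClass:
    name: str
    parent: Optional["OntologyClass"] = None
    slots: dict = field(default_factory=dict)

    def add_slot(self, slot):
        self.slots[slot.name] = slot

    def all_slots(self):
        """Own slots plus the slots inherited from more general classes."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}


def create_instance(cls, **values):
    """(1) take a class, (2) create an instance, (3) fill in checked slot values."""
    slots = cls.all_slots()
    for name, value in values.items():
        if name not in slots:
            raise KeyError(f"class {cls.name} has no slot {name!r}")
        slot = slots[name]
        if slot.multiple:
            if not isinstance(value, list):
                raise TypeError(f"slot {name!r} expects a list of values")
            items = value
        else:
            items = [value]
        for item in items:
            if not isinstance(item, slot.value_type):
                raise TypeError(f"slot {name!r} expects {slot.value_type.__name__}")
    return {"class": cls.name, "slots": dict(values)}


# The slot "title" is attached to the most general class that can have it.
document = OntologyClass("Document")
document.add_slot(Slot("title", str))
article = OntologyClass("Article", parent=document)
article.add_slot(Slot("keywords", str, multiple=True))

instance = create_instance(
    article, title="Inflation targeting", keywords=["inflation", "central bank"]
)
print(instance)
```

Attaching `title` to `Document` rather than to `Article` follows the rule of tying a slot to the most general class that can have the property; `Article` inherits it through `all_slots`.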
After the ontologies are formed, they must be compared with the server ontology in order to remove duplicates and to enrich and extend each other. Ontology matching also makes it possible to move away from full-text search in the user's information space.
In the system, ontologies will represent not only the subject area and user documents but also the user's search queries. Treating each search query as an ontology makes it possible, through ontology matching and an expanded wording of the query, to achieve a close correspondence between the actual search results and the information sought. This approach to search starts not from the “answer” (the data available in the system) but from the “question” (trying to understand what exactly the user needs to find). With this approach, the accuracy of the search output is directly related to the level of detail of the search query: the more fully the user describes what he needs, the more complete the query ontology will be, and the more accurately the ontology matching mechanism will work.
In short, the ontology comparison method can be described as follows:
- The intersection of the terms of the subject-domain ontology $O_s$ and the query ontology $O_q$ is found: $T(O) = T(O_s) \cap T(O_q)$.
- If this intersection is not empty, two sets $T_s$ and $T_q$ are constructed for each term from $T(O)$: the terms connected with it by any relation in the subject-domain ontology and in the query ontology, respectively.
- For each term from $T(O)$, the intersection $T_s \cap T_q$ is constructed.
- The types of relations between each term from $T(O)$ and the terms of $T_s \cap T_q$ are analyzed (all ontology relations are divided into three types: hierarchical, synonymous, and other).
- An ontology similarity coefficient is constructed, which is a quantitative measure of the semantic similarity of the two ontologies. The following factors are taken into account: the occurrence of the same term in both ontologies; whether two terms are connected by the same relation in both ontologies; whether two terms are connected in the two ontologies by relations of the same type or of different types (for example, a hierarchical relation in one and a synonymy relation in the other); whether there is any relationship at all (direct or indirect) between the same terms.
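The comparison steps above can be sketched as follows. The text names the factors of the similarity coefficient but gives no formula, so the weights and the additive, unnormalized score below are assumptions made purely for illustration. Each ontology is represented as a list of `(term, relation type, term)` triples, which is also an assumed representation, and only direct relations are considered.

```python
from collections import defaultdict

# Assumed weights: the source lists the factors but not how they are combined.
W_SHARED_TERM = 1.0
W_SAME_RELATION_TYPE = 1.0
W_DIFFERENT_RELATION_TYPE = 0.5


def neighbours(triples):
    """Map each term to {connected term: relation type}, ignoring direction."""
    index = defaultdict(dict)
    for left, relation_type, right in triples:
        index[left][right] = relation_type
        index[right][left] = relation_type
    return index


def similarity(subject_triples, query_triples):
    """Unnormalized similarity score of two ontologies (illustrative only)."""
    n_s = neighbours(subject_triples)
    n_q = neighbours(query_triples)
    shared_terms = set(n_s) & set(n_q)            # T(O)
    score = W_SHARED_TERM * len(shared_terms)
    for term in shared_terms:
        t_s, t_q = n_s[term], n_q[term]           # T_s and T_q for this term
        for other in set(t_s) & set(t_q):         # T_s intersected with T_q
            if term < other:                      # count each link once
                if t_s[other] == t_q[other]:
                    score += W_SAME_RELATION_TYPE
                else:
                    score += W_DIFFERENT_RELATION_TYPE
    return score


subject = [
    ("inflation", "hierarchical", "economics"),
    ("inflation", "synonym", "price growth"),
    ("gdp", "hierarchical", "economics"),
]
query = [
    ("inflation", "hierarchical", "economics"),
    ("inflation", "other", "price growth"),
]
print(similarity(subject, query))
```

In this example three terms are shared, one link has the same relation type in both ontologies and one link has different types, so the links contribute with different weights.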
The lowest (ontologies) and the highest (tag maps) levels of detail of the user's information space must be associated with each other. This can be done through the creation and continuous expansion of the semantic network of the domain. The semantic network will be understood as an information model having the form of a directed graph, the vertices of which correspond to the system objects (tags and ontologies), and the arcs define the relations between them.
The semantic network will connect the ontologies (files) with the tags that point to them. Its links resemble hyperlinks but have a more complex structure, and the network will serve as the “transport” of the system. It will be stored on the client side, since the server side does not need information about the location and structure of user documents. It is through this technology that the user, selecting a tag from any zone on the tag map, will receive information about the tag, the ontology data and the source document file associated with this ontology. And thanks to the ability to build a complex structure, the user will be able to receive, in addition to the document file, a configurable amount of peripheral (clarifying) information, such as the ontologies associated with a given document, tags and zones on the tag map, and documents relevant to the result.
A system operating according to the principles and algorithms described above should, thanks to the client-server architecture of its main components, have sufficiently large processing capacity, good scalability, security and the potential for expansion and evolution. It will make it possible to categorize, personalize, summarize and visualize the user information space, which should ultimately have a beneficial effect on the quality of the user's information activities in general, and to develop a detailed ontology of the user information space.
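A directed graph of this kind can be sketched as an adjacency list with labeled arcs. The relation names (`points_to`, `describes`, `related_to`), the vertex prefixes and the sample file path are hypothetical; the sketch only shows how selecting a tag could lead to the ontology, its source file and peripheral information.

```python
from collections import defaultdict


class SemanticNetwork:
    """Directed graph: vertices are tags, ontologies and files; arcs are relations."""

    def __init__(self):
        # arcs[source] = list of (relation, target)
        self.arcs = defaultdict(list)

    def add_arc(self, source, relation, target):
        self.arcs[source].append((relation, target))

    def targets(self, source, relation):
        """All vertices reachable from `source` by one arc of type `relation`."""
        return [t for r, t in self.arcs.get(source, []) if r == relation]

    def resolve_tag(self, tag):
        """Collect the ontologies, files and related ontologies behind a tag."""
        return [
            {
                "ontology": ontology,
                "files": self.targets(ontology, "describes"),
                "related": self.targets(ontology, "related_to"),
            }
            for ontology in self.targets(tag, "points_to")
        ]


net = SemanticNetwork()
net.add_arc("tag:Inflation", "points_to", "ontology:inflation_article")
net.add_arc("ontology:inflation_article", "describes", "file:articles/inflation.txt")
net.add_arc("ontology:inflation_article", "related_to", "ontology:monetary_policy_notes")

print(net.resolve_tag("tag:Inflation"))
```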