The structure of the modern information society is growing ever more complex, and with it the demands on the effectiveness of information-processing algorithms. The most popular fields in this regard are Data Mining (DM), Knowledge Discovery in Databases (KDD), and Machine Learning (ML); all of them provide a theoretical and methodological basis for studying, analyzing, and understanding huge amounts of data.
However, these methods are not enough when the structure of the data itself is as poorly suited for machine analysis as it has historically been on the Internet.
To solve this problem, a global initiative has been undertaken to reorganize the structure of Internet data, transforming it into the Semantic Web and enabling effective search and analysis of data by both humans and software agents.
This article discusses the main technologies that make the Semantic Web possible.
The most important shortcoming of the existing structure of the Internet is that it makes almost no use of machine-readable data presentation standards: all information is intended primarily for human perception. For example, to find a family doctor's working hours, it is enough to go to the clinic's site and look them up in the list of practicing doctors. While this is easy for a person, it is almost impossible for a software agent working automatically, unless the agent is written around the rigid structure of that specific site.
To solve such problems, ontologies are used: they describe a subject domain in terms a machine can understand and make it possible to use software agents effectively. With this approach, each page carries, in addition to the information a person sees, service metadata that software agents can process. Ontologies, in turn, are an integral part of the global vision for evolving the Internet to a new level, called the Semantic Web (SW).
A stack of Semantic Web concepts
The most important concepts of the Semantic Web
Achieving a goal as complex as the global reorganization of the World Wide Web requires a whole set of interrelated technologies. The figure above shows the general structure of Semantic Web concepts; below is a brief description of the key technologies.
Semantic web
The concept of the semantic web is central to the modern understanding of the Internet's evolution. It is expected that in the future, data on the network will be presented both in the usual form of pages and in the form of metadata, in roughly equal proportion, which will allow machines to draw logical inferences from the data and realize the full benefits of ML methods. Uniform resource identifiers (URIs) and ontologies will be used everywhere.
However, not everything is so rosy: there are doubts about whether the semantic web can be fully realized. The main arguments for such doubts are:
• The human factor: people can lie, be too lazy to add meta descriptions, or use incomplete or simply incorrect metadata. One solution to this problem is automated tools for creating and editing metadata.
• Excessive duplication of information, since each document must carry a complete description both for the person and for the machine. This is partly solved by the introduction of microformats.
Besides the metadata themselves, the most important part of the SW is semantic web services. They serve as data sources for semantic web agents: they are aimed at interacting with machines from the start and have means of advertising their capabilities.
URI (Uniform Resource Identifier)
A URI is a uniform identifier for any resource, virtual or physical, represented as a unique character string. The best-known kind of URI today is the URL, which identifies a resource on the Internet and additionally encodes the resource's location.
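As a minimal illustration (the URL below is hypothetical), Python's standard urllib.parse shows how a URL decomposes into the components that make it a URI plus location information:

```python
from urllib.parse import urlparse

# Hypothetical URL used only for illustration.
uri = "https://example.org/doctors/ivanov?dept=therapy#schedule"
parts = urlparse(uri)

print(parts.scheme)    # "https"
print(parts.netloc)    # "example.org" (the location part: a URL is a URI that also says WHERE)
print(parts.path)      # "/doctors/ivanov"
print(parts.query)     # "dept=therapy"
print(parts.fragment)  # "schedule"
```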
Basic URI format

Ontologies
As applied to the field of Machine Learning, an ontology is understood as a structure, a conceptual scheme, that describes (formalizes) the meanings of the elements of a certain subject domain. An ontology consists of a set of terms and of rules describing the relations between them.
Typically, ontologies are built from instances, concepts, attributes, and relations.
- Instances: elements of the lowest level. The main purpose of ontologies is the classification of instances; although their presence in an ontology is not strictly required, as a rule they are present. Example: words, breeds of animals, stars.
- Concepts: abstract sets, collections of objects. Example: the concept "stars" with the nested concept "sun". Whether "sun" is a nested concept or an instance (say, of the concept "luminary") depends on the ontology.
- Attributes: each object can have an optional set of attributes that store specific information. Example: the sun object has attributes such as
• Type: yellow dwarf;
• Mass: 1.989 · 10^30 kg;
• Radius: 695,990 km.
- Relations: allow you to define dependencies between ontology objects.
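The four building blocks above can be sketched with plain Python data structures. All names here (concepts, instances, relations) are illustrative assumptions, not a standard vocabulary:

```python
# A minimal sketch of an ontology's building blocks: concepts with a
# hierarchy, instances classified under concepts, per-instance attributes,
# and relations between objects. Example data is illustrative only.

concepts = {"star": None, "yellow dwarf": "star"}    # concept -> parent concept

instances = {"Sun": "yellow dwarf"}                  # instance -> its concept

attributes = {                                       # instance -> attribute set
    "Sun": {"type": "yellow dwarf",
            "mass_kg": 1.989e30,
            "radius_km": 695_990},
}

relations = [("Earth", "orbits", "Sun")]             # (subject, relation, object)

def is_a(instance, concept):
    """Check whether an instance belongs to a concept, walking up the hierarchy."""
    c = instances.get(instance)
    while c is not None:
        if c == concept:
            return True
        c = concepts.get(c)
    return False

print(is_a("Sun", "star"))  # True: Sun -> yellow dwarf -> star
```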
Since intersection points can be established between different ontologies, using ontologies lets you look at a single subject domain from different points of view and, depending on the task, use different levels of detail. The concept of ontology detail levels is one of the key ones: to indicate the color of a traffic-light signal it is sometimes enough to say simply "green", whereas for describing the color of a car's paint even a description as detailed as "dark green, close in tone to pine needles" may not be enough.
Consider the general structure of the use of ontologies.
Part of a possible address ontology

An example of a possible rule in an address ontology

With such an ontology, to send a letter to an American university it is enough to indicate its name: the software agent finds the address from the standard address information on the university's site. If the letter must go to a particular department, the agent retrieves the list of faculties from the site, picks the required one, and takes the address from that faculty's page; then, using the ontology above, the program determines the address format adopted in the USA.
A computer, of course, does not "understand" this information in the full sense of the word, but ontologies let it use the available data far more efficiently and meaningfully.
Of course, many questions remain, for example: how does the agent find the required university's site in the first place? Tools have already been developed for this as well, for example the Web Ontology Language for Services (OWL-S), which allows services to advertise their capabilities and offerings.
Taxonomy
Taxonomy is one way to implement an ontology. A taxonomy defines the classes into which the objects of a subject domain are divided, as well as the relations that exist between these classes. Unlike general ontologies, the task of a taxonomy is strictly limited to the hierarchical classification of objects.
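The purely hierarchical character of a taxonomy (every class has at most one parent) can be sketched as follows; the animal classes are illustrative assumptions:

```python
# A taxonomy as a parent map: each class points to its single parent class.
# Classification then reduces to walking the chain up to the root.
parent = {
    "cat": "mammal",
    "dog": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}

def lineage(cls):
    """Return the chain of classes from cls up to the root of the taxonomy."""
    chain = [cls]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

print(lineage("cat"))  # ['cat', 'mammal', 'animal']
```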
Modern ontology description languages
RDF (Resource Description Framework) is a language for describing resource metadata; its main purpose is to present assertions in a form equally well perceived by both humans and machines.
The atomic object in RDF is the triple: subject, predicate, object. It is assumed that any object can be described in terms of simple properties and the values of these properties.
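A tiny sketch of the triple model, with URIs abbreviated by a made-up "ex:" prefix for readability (this is an illustration of the idea, not a real RDF store):

```python
# Each fact is one (subject, predicate, object) triple.
triples = [
    ("ex:Sun", "ex:type", "ex:YellowDwarf"),
    ("ex:Sun", "ex:radius_km", "695990"),
    ("ex:Earth", "ex:orbits", "ex:Sun"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything asserted about the Sun:
print(match(s="ex:Sun"))
```

Pattern matching over triples like this is exactly what query languages such as SPARQL formalize at full scale.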
Sample table with highlighted parameters

Before the colon you must specify a uniform resource identifier (URI); however, to save traffic, you can specify only the namespace prefix.

To improve human readability, there is also a practice of presenting RDF schemes as graphs.
RDF diagram example in the form of a graph

OWL (Web Ontology Language) is a web ontology language created to represent the meaning of terms and the relations between those terms in vocabularies. Unlike RDF, this language uses a higher level of abstraction, which allows it, along with formal semantics, to use an additional terminological vocabulary.
An important advantage of OWL is that it is based on a clear mathematical model of descriptive logics.
OWL's place in the general structure of the Semantic Web from the point of view of the W3C consortium

- XML - provides the ability to create structured documents, but does not impose any semantic requirements on them;
- XML Schema - defines the structure of XML documents and additionally allows using specific data types;
- RDF provides the ability to describe the abstract data models of certain objects and the relationships between them. Uses simple semantics based on XML syntax;
- RDF Schema - allows you to describe the properties and classes of RDF resources, as well as the semantics of the relations between them;
- OWL - extends the descriptive capabilities of the previous technologies. It allows you to describe relations between classes (for example, disjointness), cardinality (for example, "exactly one"), symmetry, equality, and enumerated classes.
According to the degree of expressiveness, there are three OWL dialects.
- OWL Lite is a subset of the full specification that provides minimally sufficient means for describing ontologies. It is intended to ease initial implementations of OWL and to simplify the migration of thesauri and other taxonomies to OWL. Logical inference over metadata with OWL Lite expressiveness is guaranteed to run in polynomial time (the algorithm's complexity belongs to class P). The dialect is based on the description logic SHIF(D).
- OWL DL provides, on the one hand, maximum expressiveness while retaining computational completeness (all conclusions are guaranteed to be computable) and decidability (all computations finish in finite time). In return it imposes strict restrictions, for example on the interrelationships of classes, and some queries over such data may require exponential execution time. The dialect is based on the description logic SHOIN(D).
- OWL Full provides maximum expressive freedom but gives no guarantees of decidability: constructs are only required to be realizable by some algorithm. It does not correspond to any description logic and is in general undecidable; it is considered unlikely that any software will ever fully support every feature of OWL Full.
Currently, OWL is the main tool for describing ontologies.
Software (mobile, user) agents (SA)
In the subject domain under consideration, a software agent is a program acting on behalf of the user that independently collects information over some, possibly long, period of time. Equally important is its ability to interact with other agents and services to achieve its goal.
Unlike search-engine bots, which simply scan ranges of web pages, agents move from server to server: the agent is destroyed on the source server and re-created on the receiving server with the full set of previously collected information. This model lets the agent use data sources available to the server that are not reachable through the web interface.
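The migrate-by-serialization model can be sketched with Python's standard pickle module. The Agent class and its fields are assumptions for illustration; a real agent platform would also transfer the agent's code and run it inside a sandbox:

```python
import pickle

class Agent:
    """Toy mobile agent: carries its task and everything collected so far."""
    def __init__(self, task):
        self.task = task
        self.collected = []

    def collect(self, item):
        self.collected.append(item)

# On the source server: the agent works, then is serialized ("destroyed").
agent = Agent("find university addresses")
agent.collect("MIT: 77 Massachusetts Ave")
payload = pickle.dumps(agent)          # bytes sent over the network

# On the receiving server: the agent is re-created with all its data intact.
arrived = pickle.loads(payload)
print(arrived.collected)  # ['MIT: 77 Massachusetts Ave']
```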
Clearly, special server software must be installed to accept an agent and service its requests. Attention must also be paid to the security and integrity of agents; for this, the sandbox approach is used, in which an agent runs in a safe environment with limited rights and limited ability to influence the system.
By their implementation, agents are divided into ordinary and learning agents. The former are designed to perform well-defined tasks; the latter are built for flexibility and are usually based on neural networks. Neural networks allow an agent to continually adapt to user requirements and to interact with the Internet more effectively.
Microformats
Microformats are an attempt to create semantic markup for various entities on web pages that is equally well perceived by both humans and machines. Information in a microformat requires no additional technologies or namespaces beyond plain (X)HTML: a microformat specification is simply an agreement on standard class names for page design elements, each of which stores the corresponding piece of data.
For example, let's look at the hCalendar format.
This microformat is a subset of the iCalendar format (RFC 2445) and is intended to describe the dates of future or past events to provide opportunities for their automatic aggregation by search agents.
<div class="vevent">
  <a class="url" href="http://www.web2con.com/">http://www.web2con.com/</a>
  <span class="summary">Web 2.0 Conference</span>:
  <abbr class="dtstart" title="2007-10-05">October 5</abbr> -
  <abbr class="dtend" title="2007-10-20">19</abbr>,
  at the <span class="location">Argent Hotel, San Francisco, CA</span>
</div>
This example shows how a root container element (class="vevent") groups the event data, with the machine-readable dates given in standard ISO format in the title attributes.
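To show that such markup really is machine-readable, here is a minimal sketch that extracts the event fields using only Python's standard html.parser (a real aggregator would handle nesting and more classes):

```python
from html.parser import HTMLParser

class HCalendarParser(HTMLParser):
    """Minimal sketch: pull summary/dtstart/dtend out of hCalendar markup."""
    def __init__(self):
        super().__init__()
        self.event = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        if cls in ("dtstart", "dtend"):
            self.event[cls] = attrs.get("title")  # ISO date lives in title=
        elif cls == "summary":
            self._field = "summary"               # text content follows

    def handle_data(self, data):
        if self._field and data.strip():
            self.event[self._field] = data.strip()
            self._field = None

markup = ('<div class="vevent"><span class="summary">Web 2.0 Conference</span>'
          '<abbr class="dtstart" title="2007-10-05">October 5</abbr>'
          '<abbr class="dtend" title="2007-10-20">19</abbr></div>')

p = HCalendarParser()
p.feed(markup)
print(p.event)
# {'summary': 'Web 2.0 Conference', 'dtstart': '2007-10-05', 'dtend': '2007-10-20'}
```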
Currently, the most common microformats are:
- hAtom - a format for marking up episodic content such as news feeds;
- hCalendar - calendars and event descriptions;
- hCard - description of people, companies, places;
- hResume - a résumé/CV description format;
- hReview - embedding reviews;
- XFN - a way of indicating relationships between people.
There are many new developments in this area; for example, automatic classifiers are built using different levels of ontologies depending on the data under study.
This article is an attempt to combine data from various sources to give an idea of the general structure of Semantic Web development.