
Semantic technologies made simple and accessible, using family trees as an example

A program that can draw conclusions within the scope of a given task may seem like a technical miracle and the embodiment of Skynet. But, as shown below, such a program is not hard to write in Python today if you use semantic technologies. We will work with an illustrative kind of ontology, the family tree, and for any member of a family tree we will be able to derive his or her family relationships of arbitrary complexity (limited only by computational resources). For example, the Romanov family tree below highlights a first-cousin-twice-removed nephew of Russian Emperor Peter II.

image

So if you want to get acquainted with semantic web technologies in practice, read on: we will practice on family trees.

You can read about triples, RDF and ontologies on Wikipedia or in other posts. To describe family relationships, we use the OWL 2 ontology of the Family History Knowledge Base (FHKB). Note that although the FHKB authors consider their creation a good learning example, they do not recommend OWL 2 for real genealogical applications, because it is computationally too expensive for today's reasoning systems. Our application will remain educational: we will limit ourselves to small genealogies of up to one hundred family members.

Genealogical data is usually available in the GEDCOM text format (.ged). Some genealogical portals and genealogy programs let you export relationship graphs in this format. We will read GEDCOM using the Python library of the same name and generate triples for the individuals (the so-called ABox) of the FHKB ontology. The logic for deriving kinship (the TBox) is already in place, so all we need to do is supply the data to which this logic will be applied.
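To make the GEDCOM-to-ABox step more concrete, here is a minimal sketch. It does not use the gedcom library or reproduce gedcom2ttl.py: it parses a tiny hand-written GEDCOM fragment with plain string handling and reads only the HUSB, WIFE and CHIL pointers of family records. The sample data and the helper names (parse_families, to_name, family_triples) are assumptions made for illustration.

```python
# Illustrative sample only: a father (@I1@), a mother (@I2@) and a child (@I3@).
SAMPLE_GEDCOM = """\
0 @I1@ INDI
1 NAME Paul /Romanov/
1 SEX M
0 @I2@ INDI
1 NAME Maria /Romanova/
1 SEX F
0 @I3@ INDI
1 NAME Alexander /Romanov/
1 SEX M
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I3@
"""


def parse_families(text):
    """Collect the HUSB/WIFE/CHIL pointers of every FAM record."""
    families = {}
    current = None  # id of the FAM record being read, if any
    for line in text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":
            # Level-0 records look like "0 @F1@ FAM": the id comes before the tag.
            current = tag if value == "FAM" else None
            if current:
                families[current] = {"HUSB": None, "WIFE": None, "CHIL": []}
        elif current and level == "1":
            if tag in ("HUSB", "WIFE"):
                families[current][tag] = value
            elif tag == "CHIL":
                families[current]["CHIL"].append(value)
    return families


def to_name(pointer):
    """Turn a GEDCOM pointer such as '@I3@' into a prefixed name 'fhkb:i3'."""
    return "fhkb:" + pointer.strip("@").lower()


def family_triples(families):
    """Emit isFatherOf / isMotherOf triples for every child of every family."""
    triples = []
    for fam in families.values():
        for child in fam["CHIL"]:
            if fam["HUSB"]:
                triples.append((to_name(fam["HUSB"]), "fhkb:isFatherOf", to_name(child)))
            if fam["WIFE"]:
                triples.append((to_name(fam["WIFE"]), "fhkb:isMotherOf", to_name(child)))
    return triples


triples = family_triples(parse_families(SAMPLE_GEDCOM))
for s, p, o in triples:
    print(s, p, o, ".")
```

A real converter also has to handle marriages, siblings, labels and malformed records, which this sketch leaves out.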

Imagine that we have data on the following three individuals (shown schematically), taken from the above-mentioned family of Russian tsars:

 Alexander I *is brother of* Nicholas I. Nicholas I *is father of* Alexander II.


and FHKB logic:

 *is brother of* followed by *is father of* implies *is uncle of*.


Then the reasoning system is able to establish the following fact:

 Alexander I *is uncle of* Alexander II.


The same information in the Turtle RDF dialect is below. It is compact and quite easy to read:

    fhkb:i1 a owl:NamedIndividual ;
        fhkb:isBrotherOf fhkb:i2 ;
        rdfs:label "Alexander I" .

    fhkb:i2 a owl:NamedIndividual ;
        fhkb:isFatherOf fhkb:i3 ;
        rdfs:label "Nicholas I" .

    fhkb:i3 a owl:NamedIndividual ;
        rdfs:label "Alexander II" .

    fhkb:isFatherOf a owl:ObjectProperty ;
        rdfs:label "is father of" .

    fhkb:isBrotherOf a owl:ObjectProperty ;
        rdfs:label "is brother of" .

    fhkb:isUncleOf a owl:ObjectProperty ;
        owl:propertyChainAxiom ( fhkb:isBrotherOf fhkb:isFatherOf ) ;
        rdfs:label "is uncle of" .


(Note: some details are omitted here for clarity. In the original FHKB, the isFatherOf, isBrotherOf and isUncleOf properties are defined somewhat differently to optimize logical reasoning.)
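What the property chain means operationally can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how a real OWL reasoner is implemented; the function name apply_chain and the bare identifiers without prefixes are assumptions made for the example.

```python
# Asserted facts as (subject, property, object) triples.
facts = {
    ("i1", "isBrotherOf", "i2"),
    ("i2", "isFatherOf", "i3"),
}


def apply_chain(facts, first, second, derived):
    """Return the triples implied by the chain: first followed by second gives derived."""
    inferred = set()
    for s, p, middle in facts:
        if p != first:
            continue
        # Look for a second hop that starts where the first hop ended.
        for s2, p2, o in facts:
            if p2 == second and s2 == middle:
                inferred.add((s, derived, o))
    return inferred


new_facts = apply_chain(facts, "isBrotherOf", "isFatherOf", "isUncleOf")
print(new_facts)  # {('i1', 'isUncleOf', 'i3')}
```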

So, we declared the individuals i1, i2 and i3 and the properties isFatherOf and isBrotherOf, assigned these properties to the individuals, and introduced a new property, isUncleOf. Note the prefixes rdfs:, owl: and fhkb:, which show the vocabularies involved. The rdfs: prefix points to the standard RDF Schema vocabulary (in the example above, this is the label property). The owl: prefix denotes standard ontology terms (individual, property, property chain, and so on). And the fhkb: prefix is the FHKB genealogical ontology we use, which defines the logic of family relationships (isFatherOf, isBrotherOf, isUncleOf, as well as other terms such as isGrandfatherOf and isFirstCousinOf).

For each individual, we need to take from GEDCOM only the minimal information about fatherhood, motherhood, brothers, sisters and marriages (in fact, GEDCOM contains no other relationship data). All other family relationships, for which FHKB gives us the logic, will be derived by the reasoning system.
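A prefix is just shorthand for a namespace IRI, which the following sketch spells out. The rdfs: and owl: namespaces are the standard W3C ones; the fhkb: IRI below is a placeholder assumed for illustration, so take the real one from the @prefix line in header.ttl. The helper expand_name is likewise hypothetical.

```python
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    # Placeholder only: the actual FHKB namespace is declared in header.ttl.
    "fhkb": "http://example.org/fhkb#",
}


def expand_name(name):
    """Expand a prefixed name such as 'rdfs:label' into a full IRI."""
    prefix, _, local = name.partition(":")
    return PREFIXES[prefix] + local


for name in ("rdfs:label", "owl:ObjectProperty", "fhkb:isUncleOf"):
    print(name, "->", expand_name(name))
```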

image

So, the logic (TBox) is available in the Turtle file header.ttl in the repository for this article. The genealogy of the Romanov royal family in GEDCOM format is there too, but it is more interesting to use your own. And here is the script that generates individuals for the FHKB ontology from a GEDCOM file: gedcom2ttl.py. (After cloning the repository, install the Python dependencies with the pip install -r requirements.txt command.) Copy the FHKB logic from header.ttl to a new file and append the script's output to it:

    cp data/header.ttl romanov_family.ttl
    ./gedcom2ttl.py data/tsars.ged >> romanov_family.ttl


As a result, we have an ontology (TBox + ABox) in Turtle format, which can be opened in any external editor (for example, Protégé). If necessary, Turtle can be converted to the OWL XML format using the ttl2owl.py script. From here on, deriving kinship for this ontology is a purely technical matter.

I know of three modern open source reasoning systems for Python: RDFClosure, FuXi, and FaCT++ with the owlcpp wrapper. In fact, there are many more if you get Python to work with a Java virtual machine (historically, Java is the leader in semantic technologies and offers a much larger set of tools). The three are listed in order of increasing complexity and performance. The first is a naive "brute force" approach, in which all possible triples are generated exhaustively. The second (FuXi) is based on an infix Python notation for OWL and the Rete algorithm. The third (FaCT++) is a low-level, optimized implementation of the tableau algorithm; overall, it is one of the most efficient open source reasoning systems today.

For our purposes the first system (RDFClosure) is sufficient, especially since it is written in pure Python and is installed with a trivial pip install command. Reasoning over the Romanov genealogy tsars.ged (41 family members) with RDFClosure takes about ten seconds on a laptop with a 1.70 GHz Intel Core i7.

As already mentioned, the drawback of OWL 2 for family trees is computational complexity. I dropped some of the family relationships shown in the illustration above and reduced the Romanov family tree to the royal persons and their closest relatives, so that the demonstration would not load your computer too much. If you define all the relationships from the illustration above and expand the genealogy to at least several hundred family members, RDFClosure becomes unusable (FaCT++, however, keeps working).

Let us start the reasoning for the above ontology:

 ./infer.py romanov_family.ttl 


While the reasoning is running, I will explain the key points of the infer.py script. Its essence fits in six lines:

    import rdflib
    from RDFClosure import DeductiveClosure, OWLRL_Extension
    g = rdflib.Graph()
    g.parse("romanov_family.ttl", format="turtle")
    DeductiveClosure(OWLRL_Extension).expand(g)
    print(g.serialize(format="turtle"))


In the first two lines, we import the RDFLib library, which handles working with ontologies, and the RDFClosure reasoning system. In the third and fourth lines, we declare a graph and fill it with the contents of the ontology romanov_family.ttl. The fifth line starts the reasoning. Here the reasoning is nothing more than a repeated expansion of the input graph with new triples according to the rules of OWL 2, until nothing more can be added. The sixth line prints the resulting graph (in the same Turtle format).
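The expansion loop itself can be sketched in plain Python. This is a toy version of the naive approach under simplifying assumptions, not RDFClosure's actual code: it supports only inverse properties and two-step property chains, and the property names hasFather and isGreatUncleOf are used purely for illustration.

```python
def expand(graph, chains, inverses):
    """Naively add implied triples until a full pass adds nothing new."""
    graph = set(graph)
    while True:
        new = set()
        for s, p, o in graph:
            # Inverse property: (s p o) implies (o inverse-of-p s).
            if p in inverses:
                new.add((o, inverses[p], s))
            # Property chain: (s first o) and (o second o2) imply (s derived o2).
            for (first, second), derived in chains.items():
                if p == first:
                    for s2, p2, o2 in graph:
                        if p2 == second and s2 == o:
                            new.add((s, derived, o2))
        if new <= graph:  # fixpoint reached: nothing new was derived
            return graph
        graph |= new


asserted = {
    ("i1", "isBrotherOf", "i2"),
    ("i2", "isFatherOf", "i3"),
    ("i3", "isFatherOf", "i4"),
}
chains = {
    ("isBrotherOf", "isFatherOf"): "isUncleOf",
    ("isFatherOf", "isFatherOf"): "isGrandfatherOf",
    ("isUncleOf", "isFatherOf"): "isGreatUncleOf",  # needs a derived fact first
}
inverses = {"isFatherOf": "hasFather"}

closure = expand(asserted, chains, inverses)
for triple in sorted(closure - asserted):
    print(triple)
```

The isGreatUncleOf triple appears only on the second pass, because it depends on the isUncleOf triple derived on the first one. Every pass compares each triple against the whole graph, which hints at why this approach stops scaling beyond small family trees.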

So, we got the result, romanov_family.ttl.inferred (on disk it is several times larger than the input file). Let's prepare it for visualization. I wrote a simple HTML5 application (index.html) that shows the graph of derived relationships in the browser using the D3.js JavaScript library. It is available in a branch of the repository for this article. The edges of the graph correspond to information taken from GEDCOM (marriages, isFatherOf, isMotherOf), and the derived relationships are highlighted in different colors when you select a family member. You select a member by hovering the cursor over it or by tapping on a touch screen. The graph for this application is stored as JSON with a very simple structure: a list of edges, each naming its vertices (individuals) and the type of connection (relationship) between them. The ontology obtained in the previous step is translated into this JSON by the ttl2json.py script:

 ./ttl2json.py romanov_family.ttl.inferred > romanov_family.json 
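The edge-list structure can be sketched as follows. The field names source, target and type are assumptions chosen for illustration, not necessarily the ones ttl2json.py writes, and the input triples are hard-coded instead of being read from the inferred Turtle file.

```python
import json

# A few triples as they might appear after reasoning (identifiers simplified).
inferred = [
    ("i1", "isBrotherOf", "i2"),
    ("i2", "isFatherOf", "i3"),
    ("i1", "isUncleOf", "i3"),
    ("i1", "label", "Person A"),  # not a relationship, so it is filtered out
]

# Relationship types that the visualization knows how to draw.
RELATIONS = {"isBrotherOf", "isFatherOf", "isUncleOf"}


def to_edges(triples, relations):
    """Keep only relationship triples and turn each into an edge record."""
    return [
        {"source": s, "target": o, "type": p}
        for s, p, o in triples
        if p in relations
    ]


doc = json.dumps(to_edges(inferred, RELATIONS), indent=2)
print(doc)
```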


By default, the HTML5 application loads the JSON at data/tsars.json. The new JSON you generated can be loaded into the browser by pressing a button on the web page (this uses the File API, so no server is needed and the visualization works offline).

All the above commands are collected in the shell script gedcom2json.sh. With it, you can translate GEDCOM genealogies directly into JSON with derived relationships for visualization. Adding inference and visualization for other family relationships is relatively easy. To do this, first add the appropriate logic to the FHKB TBox; second, add the identifier of the new relationship to the Turtle-to-JSON converter ttl2json.py; and third, specify the color, name and identifier of the new relationship in the HTML5 visualization code. Of course, generating JSON from GEDCOM will then take somewhat longer.

In addition, there is an idea that mind maps could serve as the input for any ontology (not only genealogical ones). Of course, when drawing such a map you must follow clear rules so that it can be translated into an ontology ABox, for example with the Python XMind SDK. This is how I ran logical reasoning on my own family tree, which I had historically kept as a mind map.

To summarize: by specifying only the closest kinship ties between family members (brothers and sisters, marriages, fatherhood and motherhood) and defining the logic of the remaining ties, we were able to derive all the other ties using semantic technologies. In doing so, we have touched on the powerful tool that underlies products such as Wolfram Alpha and the Google Knowledge Graph. Ontologies and reasoning systems are mature and widely used technologies today but, unfortunately, the barrier to entry in this field is not low.

Link to the repository for this article: github.com/blokhin/genealogical-trees
HTML5 application: blokhin.imtqy.com/genealogical-trees/#ru
Public GEDCOM files can be exported from genealogical portals, for example, www.wikitree.com

Enjoy your dive into semantic technologies, and don't be afraid of Skynet!

Source: https://habr.com/ru/post/270857/

