
From England to the Mysterious Island, along with the heroes of the novels of Jules Verne

With the ever-growing volume of textual information and the maturity of web visualization tools, it is natural to want to visualize all of that text. Demonstrating that such a visualization is possible was the task assigned to a team of students as part of ABBYY Labs and the Industrial Programming course at the Faculty of Innovations and High Technologies (FIHT) at MIPT (if you have never read about our ABBYY student labs, it makes sense to go back to this post).

Fifteen third-year developers and four fourth-year managers, all FIHT students, were given three months to explore modern open-source solutions for visualizing structured data and then, having chosen a topic themselves, to visualize textual information in natural language. The transition from unstructured to structured information was to be implemented with the ABBYY Compreno semantic-syntactic parser.



And if not Jules Verne, then who?


One of the most heated discussions of the whole project was about which text to use as the basis for the visualization. There were many options: from old Soviet newspapers and scientific articles to the “A Song of Ice and Fire” novels and the Marvel Universe comics.

Since many of the texts we liked were protected by copyright, we decided to settle on classic literary works whose copyright has expired. This, too, took some discussion: Sherlock Holmes, Tom Sawyer, and many other novels were proposed. In the end, we agreed that Jules Verne's trilogy of novels “Children of Captain Grant”, “Twenty Thousand Leagues Under the Sea” and “Mysterious Island” is well suited to our purposes, and we all like it :). For the analysis, we took the English and Russian translations.

Those who would rather go straight to the site are welcome at julesvernetrilogy.com: you can read the article and click the buttons at the same time. So, choose a language (Russian or English) and off you go.

Moving on to structured information


A separate group extracted data from the texts of the novels. They needed to identify the locations and events in the novels, find the relationships between the characters, describe the characters' appearance and speech portraits, and produce smart markup of the book texts. To solve each of these problems, the students used a variety of information about the text obtained with the ABBYY Compreno parser. We have written about how the parser works in detail here, and now we will tell you how it helped us structure the information from Jules Verne's novels.



Meet the heroes with Compreno


As you of course know, the Jules Verne trilogy we chose to analyze is a series of adventure novels united by a common world and characters. Since the lion’s share of the information in the novels is built around the heroes, their properties, actions, and movements, the first step was to learn to identify the entities that correspond to the heroes of the novels.

This is not as easy a task as it may seem at first glance, because the same person can be referred to in completely different ways. As an example, consider a passage from the first chapter of “Children of Captain Grant”, in which Lord Glenarvan is mentioned as “Edward”, as “Lord Glenarvan”, and by the pronoun “his”:

"If you couldn’t really break off your neck, I wouldn’t be able to easily accept the papers," suggested John Mangles. "Try it, Edward , try it," said Lady Helena. It was no alternative, Lord Glenarvan ; the precious bottle must be broken. It was, however, that it was necessary to get a hammer of granite. There are many sharp strokes, however, many of them had to be framed. It was taken away from the gaze of his wife and friends.

Since Compreno can cope with such cases, the team managed to classify mentions of characters correctly, with pronominal anaphora resolved as well. After that, it was possible to cluster the data and explore statistics on the properties of the characters.
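The clustering step can be illustrated with a minimal sketch. It assumes that mentions have already been extracted as strings and that a hand-written alias table is available; the table, the function names, and the sample mentions are all hypothetical, and the sketch does not show how Compreno itself resolves mentions or pronouns.

```python
from collections import Counter

# Hypothetical alias table: each surface form maps to one canonical character.
ALIASES = {
    "lord glenarvan": "Lord Glenarvan",
    "edward": "Lord Glenarvan",
    "glenarvan": "Lord Glenarvan",
    "lady helena": "Lady Helena",
    "john mangles": "John Mangles",
}


def canonical(mention):
    """Return the canonical character for a mention, or None if unknown."""
    return ALIASES.get(mention.strip().lower())


def count_mentions(mentions):
    """Count mentions per canonical character, skipping unresolved ones."""
    resolved = (canonical(m) for m in mentions)
    return Counter(name for name in resolved if name is not None)


# Sample mentions, made up for illustration.
mentions = ["John Mangles", "Edward", "Lady Helena", "Lord Glenarvan", "the bottle"]
counts = count_mentions(mentions)
```

Once every mention is mapped to one canonical name, per-character statistics reduce to ordinary counting.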

Communication is everything


In each chapter, the characters interacted with each other or, in Compreno terms, occurred together as fillers of attributes of the same fact. Having drawn icons for all the main characters of the trilogy and extracted the character interaction graph for each chapter, we got a visual representation of the social activity of the heroes of the trilogy.
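As a rough sketch of how such a graph could be built, assume each extracted fact has been reduced to the list of characters that take part in it. That input shape and the function name are assumptions made for illustration, not the format Compreno produces.

```python
from collections import Counter
from itertools import combinations


def interaction_edges(facts):
    """Count, for each unordered pair of characters, the facts they share."""
    edges = Counter()
    for participants in facts:
        # Sort and deduplicate so each pair has a single canonical key.
        for a, b in combinations(sorted(set(participants)), 2):
            edges[(a, b)] += 1
    return edges


# Toy facts for one chapter, made up for illustration.
chapter_facts = [
    ["Lord Glenarvan", "Lady Helena"],
    ["Lady Helena", "Lord Glenarvan", "John Mangles"],
    ["John Mangles"],
]
edges = interaction_edges(chapter_facts)
```

The resulting counts can serve as edge weights in a per-chapter graph, with characters as nodes.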



Lord Glenarvan was perfectly grave


We were able to tell more about what many characters do in the various chapters by extracting their actions and descriptive characteristics. By clicking on a character icon in the graph, you can see that character's description in the context of the chapter.

Various heuristics were used to detect such characteristics. One of them searched for adjectives that depend on persons or share common ancestors with them in the parse tree.
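A minimal sketch of this kind of heuristic follows. It assumes a dependency tree stored as a flat token list with parent indices, and it narrows "common ancestor" to "same direct parent" to keep the example small. The `Token` fields and the toy tree are hypothetical and do not reflect the actual Compreno output.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str                 # coarse part of speech, e.g. "ADJ", "NOUN", "VERB"
    head: int                # index of the parent token, -1 for the root
    is_person: bool = False  # True if the token refers to a character


def person_adjectives(tokens):
    """Pair each person with adjectives attached to it or to its parent."""
    pairs = []
    for adj in tokens:
        if adj.pos != "ADJ":
            continue
        for i, person in enumerate(tokens):
            if not person.is_person:
                continue
            depends_on_person = adj.head == i
            shares_parent = adj.head != -1 and adj.head == person.head
            if depends_on_person or shares_parent:
                pairs.append((person.text, adj.text))
    return pairs


# Hand-built toy tree for "Glenarvan was grave"; not real parser output.
sentence = [
    Token("Glenarvan", "NOUN", head=1, is_person=True),
    Token("was", "VERB", head=-1),
    Token("grave", "ADJ", head=1),
]
pairs = person_adjectives(sentence)
```

Here the adjective does not depend on the person directly, but both hang off the same verb, which is the second case the heuristic covers.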

Consider how it works on the following sentence from the novel:

Lady Glenarvan and Mary Helena and Mary showed their love for them.

The Compreno analysis of the sentence looks like this:



As a result, we get the characteristic “perfectly grave”, which is displayed on the website:



Another heuristic helped to extract descriptions of the characters' clothes from the text: for this, we searched for words with the semantic class “CLOTHES”. Mentions of the word “clothes” itself were then filtered out, and dependencies were searched as in the previous heuristic.
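The clothes heuristic could look roughly like the sketch below. Tokens are plain dictionaries with assumed keys (`lemma`, `pos`, `head`, `sem_class`); only the class name “CLOTHES” comes from the text above, while the other class names and the sample phrase are invented for illustration.

```python
def clothing_descriptions(tokens):
    """Return (adjective, garment) pairs for tokens in the CLOTHES class."""
    found = []
    for i, tok in enumerate(tokens):
        if tok["sem_class"] != "CLOTHES" or tok["lemma"] == "clothes":
            continue  # skip non-garments and the generic word itself
        # Collect adjectives that depend directly on the garment.
        modifiers = [
            t["lemma"] for t in tokens if t["head"] == i and t["pos"] == "ADJ"
        ]
        found.extend((m, tok["lemma"]) for m in modifiers)
    return found


# Toy tokens for "wide hat, warm clothes"; class names other than CLOTHES are made up.
tokens = [
    {"lemma": "wide", "pos": "ADJ", "head": 1, "sem_class": "SIZE"},
    {"lemma": "hat", "pos": "NOUN", "head": -1, "sem_class": "CLOTHES"},
    {"lemma": "warm", "pos": "ADJ", "head": 3, "sem_class": "TEMPERATURE"},
    {"lemma": "clothes", "pos": "NOUN", "head": 1, "sem_class": "CLOTHES"},
]
descriptions = clothing_descriptions(tokens)
```

The generic word is dropped because "warm clothes" says little about what a character actually wears, whereas "wide hat" does.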



Educated by Jacques Paganel


Constructions of the form “verb + direct object” and participial phrases fell into the category of actions. In fact, there are two search algorithms here.

The first algorithm, which built participial (or adverbial participial) phrases, checked the dependencies of the head word and then, traversing the syntax tree of the sentence, collected the entire constituent and arranged the words in the order in which they appeared in the sentence. This way even quite long descriptive fragments were sometimes extracted. The record is 48 words in English, with some funny arithmetic:

It’sa bit

Unfortunately, such constructions had to be abandoned, as they did not satisfy the output format for the site.

The second algorithm looked for a verb that had the character of interest among its dependents. It then found the direct object of the same verb and recorded the pair, again preserving the word order of the sentence.
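Both algorithms can be sketched over a toy dependency tree. The representation (parallel lists of words, parent indices, and relation labels), the label names such as `dobj`, and the sample sentence are assumptions made for illustration; the sentence is invented and is not a quote from the novels.

```python
def collect_phrase(words, heads, root):
    """Return the words of the subtree rooted at `root` in sentence order."""
    children = {}
    for i, h in enumerate(heads):
        children.setdefault(h, []).append(i)
    stack, members = [root], []
    while stack:
        node = stack.pop()
        members.append(node)
        stack.extend(children.get(node, []))
    # Sorting the indices restores the original word order.
    return " ".join(words[i] for i in sorted(members))


def verb_object_pairs(words, heads, labels, character):
    """Find heads that govern `character` and return 'verb object' strings."""
    actions = []
    for verb in range(len(words)):
        deps = [i for i, h in enumerate(heads) if h == verb]
        if not any(words[i] == character for i in deps):
            continue
        for obj in deps:
            if labels[obj] == "dobj":
                actions.append(" ".join(words[i] for i in sorted((verb, obj))))
    return actions


# Invented sentence: "Paganel, having lost his spectacles, studied maps".
words = ["Paganel", "having", "lost", "his", "spectacles", "studied", "maps"]
heads = [5, 2, 5, 4, 2, -1, 5]
labels = ["subj", "aux", "participial", "det", "dobj", "root", "dobj"]

phrase = collect_phrase(words, heads, root=2)
actions = verb_object_pairs(words, heads, labels, "Paganel")
```

The first function gathers a whole constituent, which is why its output can grow long; the second keeps only the verb and its direct object, which fits a short card on the site more easily.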

If you examine the resulting data, you can find many interesting facts. For example, you can find out how poor Nab was useful to the colony:



Or you can learn about the education of the brave Jacques Paganel. Let us use Jacques as an example to see what data Compreno extracts from the text, and how. Jules Verne has this sentence:

“Oh, you, dear Paganel, you will remain,” said the major. “You know too well the thirty-seventh parallel, and the Guamini River, and in general all the pampas to leave us.”

The Compreno parser analyzes it as follows:



As output, we get the following card describing Jacques’s personality:



You can also get acquainted with a very active character from the novel "The Mysterious Island":



Finding talkative and emotional characters


In addition to acting, the heroes talked a lot, and to identify the most talkative characters we built another graph: the graph of speech activity. Further analysis of their speech made it possible to find the most inquisitive characters (those who ask a lot of questions) and the calmest ones (those who hardly use exclamatory sentences).

To do this, we found facts of the type “direct speech” in the RDF representation of the text produced by Compreno. These facts contain the spoken passage itself and a link to the speaker, so it turned out to be quite simple to select sentences with different types of speech and link them to the speakers. There were some minor problems in the markup, though: try to find the chapter of “Mysterious Island” in which Top becomes curious. We left that one in on purpose, as an Easter egg for the most attentive readers.
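The speech statistics can be approximated with a short sketch. It assumes the direct-speech facts have already been read out of the RDF into (speaker, utterance) pairs and that sentence type is judged by final punctuation alone; both are simplifications, and the sample utterances are invented.

```python
from collections import defaultdict


def speech_profile(direct_speech_facts):
    """Count utterances, questions and exclamations for each speaker."""
    stats = defaultdict(lambda: {"total": 0, "questions": 0, "exclamations": 0})
    for speaker, utterance in direct_speech_facts:
        entry = stats[speaker]
        entry["total"] += 1
        # Ignore trailing quotation marks and spaces before checking the ending.
        text = utterance.rstrip("\"'”’ ")
        if text.endswith("?"):
            entry["questions"] += 1
        elif text.endswith("!"):
            entry["exclamations"] += 1
    return dict(stats)


# Invented utterances, used only to exercise the function.
facts = [
    ("Lady Helena", "Try it, Edward, try it"),
    ("Paganel", "Where are we?"),
    ("Paganel", "What river is this?"),
    ("Paganel", "Magnificent!"),
]
profile = speech_profile(facts)
```

Ranking speakers by `total`, by the share of questions, or by the share of exclamations gives the talkative, inquisitive, and calm characters respectively.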



Interactive travel map


Since the heroes travel constantly in the books, it seemed interesting to extract information about the places they visited and to build an interactive map showing where the heroes are in different chapters. With Compreno we managed to do this quickly and efficiently, avoiding the errors that inattention would otherwise cause.

For example, this is a fragment of the map for the ninth chapter of the first part of the novel “Children of Captain Grant”:



Highlighting events


Once we had the main locations, we wanted to understand which events took place in them. We decided to treat as events those facts with a filled “Where” attribute in which characters of the visualized books took part. Such facts were extracted with Compreno and displayed on the interactive map in the corresponding chapter. For each event, the map shows the participants and a brief description.
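The selection rule can be written down as a small filter. The dictionary keys (`chapter`, `where`, `participants`, `text`) and the sample facts are hypothetical and chosen only to make the rule concrete; they are not the schema used in the project.

```python
def select_events(facts, characters):
    """Keep facts that have a location and at least one known character."""
    events = []
    for fact in facts:
        participants = [p for p in fact.get("participants", []) if p in characters]
        if fact.get("where") and participants:
            events.append({
                "chapter": fact["chapter"],
                "where": fact["where"],
                "participants": participants,
                "description": fact["text"],
            })
    return events


# Invented facts: only the first has both a location and a known character.
facts = [
    {"chapter": 1, "where": "harbor",
     "participants": ["Lord Glenarvan", "a sailor"], "text": "The yacht returns."},
    {"chapter": 1, "where": None,
     "participants": ["Lady Helena"], "text": "Lady Helena reads the document."},
    {"chapter": 2, "where": "castle",
     "participants": ["a messenger"], "text": "A letter arrives."},
]
characters = {"Lord Glenarvan", "Lady Helena"}
events = select_events(facts, characters)
```

Grouping the result by `chapter` would then give the set of markers to show on the map for each chapter.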



“Smart” reading


To let readers get additional information about heroes, events, and locations while reading, “smart” markup was added directly to the book texts that can be read online on the website. It was built from the data already extracted on events and locations, plus an additional search for persons in the text with pronominal anaphora resolved.
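One way to apply such markup is to wrap annotated character ranges in HTML elements. The sketch below assumes annotations arrive as (start, end, kind, id) tuples; the `span` element, the class names, and the `data-id` attribute are illustrative choices, not the markup actually used on the site.

```python
from html import escape


def span_of(text, phrase):
    """Return the (start, end) offsets of the first occurrence of `phrase`."""
    start = text.index(phrase)
    return start, start + len(phrase)


def mark_up(text, annotations):
    """Wrap annotated character ranges of `text` in <span> elements."""
    pieces, cursor = [], 0
    for start, end, kind, entity_id in sorted(annotations):
        if start < cursor:
            continue  # skip overlapping annotations in this simple sketch
        pieces.append(escape(text[cursor:start]))
        pieces.append(
            f'<span class="{kind}" data-id="{escape(entity_id)}">'
            f"{escape(text[start:end])}</span>"
        )
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


text = "Try it, Edward, try it, said Lady Helena."
annotations = [
    (*span_of(text, "Edward"), "person", "glenarvan"),
    (*span_of(text, "Lady Helena"), "person", "helena"),
]
marked = mark_up(text, annotations)
```

Because resolved mentions share one identifier, "Edward" and "Lord Glenarvan" can open the same character card when clicked.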

An example of a marked-up fragment from the first chapter of the first part of the novel “Children of Captain Grant”:



A marked-up excerpt from the third chapter of the first part of the book "Mysterious Island":



Book context from additional sources


Jules Verne is known and loved all over the world, so today it is easy to find plenty of information about the author and his works on the Internet. We did just that and filled the site with various additional information.



Predicted inventions


An interesting feature of the French writer's works is the predictions of technical discoveries made in their pages. Jules Verne’s contemporaries could not even imagine that mining the seabed would ever become widespread or that video telephony would be something completely natural. Jules Verne could, and he described the technologies of our time quite clearly in his novels. One of the sections of the site is devoted to these predictions. You can also reach the descriptions of the inventions through the markup while reading a book.



Read more about Jules Verne


Often, while reading an interesting book, you want to learn more about the author and his other works. A separate section built on Wikipedia materials serves exactly this purpose. In it you can find the writer's bibliography and biography.

Like the heroes of his books, Jules Verne traveled a lot. So the most vivid way to trace his life and career is on an interactive map, which is how we visualized his biography.



Additional descriptions of locations, events and heroes


In addition to information about the author, the markup lets you get additional information about places, characters, and events while reading. To do this, select “Database” (“Data Base”) in the menu or click on one of the many “Read more” labels. Wikipedia materials and excerpts from the novels of the trilogy were used for these descriptions.



Web development and team


You can look at the team that implemented everything described in the article on a special page of the site.

The success of such projects depends, first of all, on a successful “facade”, the result of the work of designers and web developers. For them the project became a real challenge: they had to learn web development from scratch while carrying a heavy workload. This strongly influenced our choice of tools: great emphasis was placed on how easy a technology was to learn.

The RactiveJS library, dated but well documented, was used together with Page.js to implement the SPA. d3 became the main data visualization library: the character graph and the speech statistics are implemented with it. The Leaflet library was used to build the interactive maps.

Where next?


Our work on the project does not end here, and neither does our course at MIPT: in the autumn we plan to get back to work in earnest. Our plans include creating a universal site builder for visualizing literary works. Indeed, both the character graph and the map with events suit almost any literary work. Another idea is to “liven up” the site with interactive tasks based on the texts of the books.

While we are thinking about the future, anyone can visit the site and immerse themselves in the world created by Jules Verne and loved since childhood, in English or Russian.

Maria Sandrikova,
Technology Development Department

Source: https://habr.com/ru/post/305970/

