
Game of Thrones: finding the speakers of dialogue in the books



Hi Habrahabr,

Based on the results of the vote in the article Graph Theory in Game of Thrones, I am translating Erik Germani's tutorial. He built the social graph of character interactions from the first five books of the A Song of Ice and Fire series, which became the basis of the article mentioned above. This material does not contain a detailed description of machine learning methods; rather, it shows how existing tools can be used in practice to find the speakers of dialogue in a text. Careful, it's a long read! Let's go.

This tutorial is aimed at newcomers to machine learning, which is what I was when I started this project a year ago. (And I still am one, although by now I'm merely green rather than bright green.) We'll build a model that can tell who speaks each line of dialogue in George R. R. Martin's "A Song of Ice and Fire". For this we will use conditional random fields (CRF) and Naoaki Okazaki's wonderful CRFsuite utility. For text processing we'll use Python 2.7 and NLTK (Natural Language Toolkit).
I will try to explain everything in as much detail as possible. I hope that by describing every step I take, I'll let you pick up new tools and methods that will be useful in your own projects. The code explanations are written by a beginner for beginners: someone who understands Python syntax and knows about list comprehensions, but not much more. If you feel my code commentary is draining your soul, skip it.

Important: if you came here for the theory behind conditional random fields, this material is not for you. To me, CRFsuite is a beautiful black box that I poke with my monkey paws. We will spend some time trying to improve the model's performance, and that attempt will turn out to be misguided. If this upsets you, keep that in mind.

Our game plan is simple. As with any other machine learning algorithm, we need to prepare data for training and testing. Then we choose the features the algorithm will use for classification. Once we have processed the text with those features, we feed the result to CRFsuite and congratulate ourselves on a job well done (or burden ourselves with the hard work of checking the machine's guesses).

Let's start.

Loading text


First of all, we need a copy of the source text. I will leave it up to you whether you pay the iron price for it or not.

If you are new to natural language processing, you may be underestimating how troublesome source text can be. Every .txt file has an encoding that determines how each character is represented. ASCII, the format in which I used to read Ocarina of Time walkthroughs, has been supplanted by UTF-8, which can handle all the special characters (ASCII can only represent 128 of them). My copy of ASOIAF (note: A Song of Ice and Fire) is in UTF-8, which is a minor inconvenience but actually turns out to be a bonus.

We will load this text into NLTK so it is easier to manipulate. NLTK can do a great many things (it is how I learned Python), and if that interests you, take a look at their excellent online book. For our purposes we use it to break the text into tokens, that is, to split sentences into words and punctuation marks, a routine step in natural language processing projects.
import nltk
nltk.word_tokenize("NLTK is ready to go.")
['NLTK', 'is', 'ready', 'to', 'go', '.']

NLTK ships with preloaded corpora, but we need to load our own.

Create a folder and put the ASOIAF text files there. Since the books are very large, the source text comes to almost 10 MB, which is not ideal for search-and-replace. I split the text into books, but real professionals who plan to do more analysis would rather split each book into chapters and number them sequentially.
But we will not complicate things now! Once the text is in the folder, we can run the following:
 corpus = nltk.corpus.PlaintextCorpusReader(r'corpus', 'George.*\.txt', encoding = 'utf-8') 

Here the r prefix tells Python to treat the string as raw. It does not matter in this case, because I refer to the "corpus" folder directly, but if your folder has a more complicated path, it is better not to forget it.

The second argument is a regular expression that tells NLTK to take all the files in the folder that have “George” in their names and with the extension “.txt”.

The encoding parameter is very important: if the text's encoding does not match the one specified, errors will come pouring out.
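If you are not sure what encoding your copy uses, a quick check like this (the file name is just an example) will tell you whether it decodes cleanly as UTF-8:

import codecs

# Try to decode the whole file; a UnicodeDecodeError names the offending byte.
with codecs.open("corpus/George RR Martin - 01 - A Game Of Thrones.txt", "r", encoding="utf-8") as f:
    try:
        f.read()
        print "Looks like valid UTF-8"
    except UnicodeDecodeError as e:
        print "Not UTF-8:", e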

Corpora in NLTK are very handy; with them you can pull information out of the text at different levels.
 corpus.words("George RR Martin - 01 - A Game Of Thrones.txt")[-5:] 
[u'the ', u'music', u'of ', u'dragons', u '.']
 corpus.words()[0] 
u'PROLOGUE '
 corpus.sents()[1][:6] 
  [u '\ u201c', u'We ', u'should', u'start ', u'back', u ', \ u201d'] 

Here we hear the doomed Gared from the Game of Thrones prologue, and we see how Python presents Unicode characters. Notice that Unicode strings begin with u and can contain special characters: \u201c is the left curly quote, \u201d is the right one. I mentioned that UTF-8 is more of a bonus, and here's why. Let's see what happens if we open the same file without specifying the encoding.
bad_corpus = nltk.corpus.PlaintextCorpusReader(r'corpus', '.*\.txt')
bad_corpus.sents()[1][:9]
['\xe2', '\x80\x9c', 'We', 'should', 'start', 'back', ',', '\xe2', '\x80\x9d']

Just as \u marks a Unicode string, \x marks hexadecimal bytes, so NLTK is handed three raw bytes (\xe2, \x80, \x9c) and tries to split them. You can see that it has no idea how to do that.
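To convince yourself that those three bytes really are one character, you can decode them by hand:

raw_bytes = '\xe2\x80\x9c'             # the three bytes NLTK tried to split above
print repr(raw_bytes.decode('utf-8'))  # u'\u201c', the left curly quote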

We will work with paragraphs, so let's take a look at one of them:
print corpus.paras()[1]
[[u'\u201c', u'We', u'should', u'start', u'back', u',\u201d', u'Gared', u'urged', u'as', u'the', u'woods', u'began', u'to', u'grow', u'dark', u'around', u'them', u'.'], [u'\u201c', u'The', u'wildlings', u'are', u'dead', u'.\u201d']]

Notice how NLTK structures the data: sentences are lists of tokens, and a paragraph is a list of sentences. Simple enough!

Tags


Next, we need to prepare training data, but to do that we have to decide on the labels we will use. When parsing text, the algorithm knows that each token belongs to a lexical category, and each category has its own tag: JJ is an adjective, NN is a noun, IN is a preposition. These tags play a key role in the credibility of our model. The Penn Treebank (note: a text annotation project) defines 36 such tags.

What will our labels be? The simplest option is character names. That will not work, for several reasons:

  1. ASOIAF has more than a thousand characters. That is far too many choices for our poor model. We need to weed out as many labels as possible so that it classifies by something better than blind luck.
  2. Characters are referred to in different ways. Joffrey can be "Joffrey" or "Joff", "the prince", or even just "he".
  3. If we use character names as labels, they all have to occur in the training data. Otherwise our model will not know they exist and therefore cannot predict them.
  4. The characters all sound pretty much the same. (I learned this from another machine learning experiment, where I tried to tell characters apart by their vocabulary.) A few have catchphrases, such as "grievous" for Varys and "Hodor" for Hodor, but that is rare. Besides, many characters simply do not get enough lines to distinguish them from the rest.


Although predicting character names directly sounds very tempting, let's drop that idea and think about what goes on in a reader's head when they solve the same problem.

Take the nearest book, open it to a random page, and try to work out who is speaking. How do you do it? You look for the proper names closest to the dialogue.

"Will saw them," answered Gared.

[...]

Ser Waymar Royce glanced at the sky without interest. "Night falls at about the same time every day. Does the dark really rob you of your courage, Gared?"

Not every line of dialogue is attributed, though. Read further and you run into:

"Have you noticed the position of the bodies?"

You look at the paragraphs above and below. Here are the two above:

"And the weapon?"

"Several swords and bows. One had an axe, a heavy one, double-bladed... cruel iron. It was lying on the ground beside that man, right by his hand."

Not a hint. Two paragraphs below:

Will shrugged. "One sat near the rock. The rest were on the ground, fallen, or something."

"Or slept," suggested Royce.

We know Will would not be asking himself, so we can rule him out as the speaker, and since many conversations stretch across several paragraphs, we'll assume Royce is the author of the first line as well.

This scheme is how we will label our data. We will teach the model to spot proper names next to the dialogue and, if there are none, to look in the neighbouring paragraphs. Our labels will then be:

PS ± 2, FN ± 2, NN ± 2, Other.

PS stands for "post-speech". If a paragraph's label is PS -2, the name of the speaker appears in the post-speech position two paragraphs above. FN 1 means the first name in the next paragraph is the speaker. NN 0 means at least two names precede the dialogue and we want the one closest to it.

I will also define ADR ±2 for characters addressed within the dialogue text.
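To make the scheme concrete, here is a toy illustration of my own (not model output), using the prologue exchange above and the convention that negative offsets point to earlier paragraphs:

"Or sleeping," suggested Royce.                 : PS 0 (the speaker is named right after his line)
Will shrugged. "One sat near the rock..."       : FN 0 (the first name before the dialogue is the speaker)
"Have you noticed the position of the bodies?"  : PS 2 (the attribution "suggested Royce" sits two paragraphs below)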

Tagging


Now let's prepare the training data. SublimeText will help here. I opened the text of A Game of Thrones, selected a left curly quote, chose Find -> Quick Find All, and pressed the Home key twice. Now there is a cursor at the beginning of every paragraph containing dialogue. Then I typed "{}". Since there are no curly braces in the text, we can use them to leave notes that we will use later.

We will use the regular expression (?<=\{)(?=\}) to jump between the curly braces. If you have not met this construction before, these are a positive lookbehind and a positive lookahead. The first parenthesised expression makes SublimeText match positions preceded by an opening brace (escaped with a backslash). The second says to stop where a closing brace follows. As you can see, both expressions are built on the ?= construction, only the first also contains <.

Now you can jump between the braces by pressing F3, the hotkey for "find next" in SublimeText on Windows. This kind of optimization matters, because you will be tagging on the order of a thousand dialogues; at least that is how many I did. It was not as hard or as time-consuming as I expected. (Although maybe I'm lying, since I only finished a year later.)
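If you want to keep an eye on how much tagging is left, a throwaway check along these lines works (assuming the {} convention above, with filled braces like {Jon} or {PS -2} for finished paragraphs; the file name is the one the training text uses later):

import codecs
import re

with codecs.open("corpus/train_asoiaf_tagged.txt", "r", encoding="utf-8") as f:
    text = f.read()

remaining = len(re.findall(r"\{\}", text))   # empty braces: still untagged
done = len(re.findall(r"\{[^}]+\}", text))   # filled braces: already labelled
print "tagged: {}, untagged: {}".format(done, remaining)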

Before you begin, one remark: think about whether you want to use the positional labels (PS, FN, NN) or character names after all. I know I already said we won't use names, but if you go with positional labels, you tie the training data to that particular model. If you tag Jon's dialogue with the label "Jon", you keep the option of converting it to a positional label later, or of using other labels if you find better ones.

I don't think there is a definite answer. Last year I tagged character names. Now I have to do preliminary conversions that introduce ambiguity: if Eddard's name appears two paragraphs above and one paragraph below, which one do I pick? That choice directly affects the model's behaviour, and doing it automatically makes the process even less precise. So I am not sure what to advise. From the point of view of manual tagging it is easier to type the character's name, but from the point of view of automation it is much more convenient to have positional labels.

Extracting features


So, you have tagged some text. I applaud your commitment to natural language processing. All we need to do now is write a handful of functions that take a paragraph as an argument and annotate it with the features we are interested in.

Which features, again? The workhorses responsible for the model's accuracy are these: whether PS, FN or NN names occur in the current paragraph or in the neighbouring ones.

Search for names


Our first function needs to find proper names. One way to do that is part-of-speech tagging.

sentence = corpus.paras()[33][0]
print " ".join(sentence)
print nltk.pos_tag(sentence)
“Such eloquence, Gared,” Ser Waymar observed.
[(u'\u201c', 'NN'), (u'Such', 'JJ'), (u'eloquence', 'NN'), (u',', ','), (u'Gared', 'NNP'), (u',\u201d', 'NNP'), (u'Ser', 'NNP'), (u'Waymar', 'NNP'), (u'observed', 'VBD'), (u'.', '.')]

The NNP next to Ser and Waymar means these are proper nouns. But there are downsides:
  1. It makes mistakes. Notice how the closing quote became a proper noun?
  2. Part-of-speech tagging takes time.

%timeit nltk.pos_tag(sentence)
100 loops, best of 3: 8.93 ms per loop

asoiaf_sentence_count = 143669
(asoiaf_sentence_count * 19.2) / 1000 / 60
45.974079999999994

There are a lot of paragraphs to process in ASOIAF, and more than 45 minutes of part-of-speech tagging would slow down testing and refactoring. Of course, we could tag everything once and keep working with the result, but then we would have to deal with yet another data structure, and the tagging would have to be redone every time the source text changes. (And it inevitably will.)

Fortunately, we don't need part-of-speech tagging to find character names. This is one of the advantages of picking ASOIAF for analysis: a ton of data has already been collected. Let's scrape some of it.

Existing Information


Thanks to the very useful A Wiki of Ice and Fire, I got an almost exhaustive list of character names by literally copy-pasting the page with the list of characters. The result can be found here. If that is enough for you, we'll meet again in the next chapter of the article. For those who are curious how to extract data from a page automatically, here are a couple of approaches I have used in other projects.

Wget


An excellent utility that is very simple to use if you need to fetch a list of known links. You don't have to think about how to crawl them; just put them in a file and pass it with the -i flag, like this:
 wget -i list_of_links.txt 

Requests


Python has the requests library, which works well for individual pages.
import requests
r = requests.get("http://awoiaf.westeros.org/index.php/List_of_characters")
html = r.text
print html[:100]
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" />
<title

Parsing


Once the HTML is downloaded, we need to peel the extra tags off the page to get at the links. BeautifulSoup is an HTML parser that lets you get links without any fuss. After installing it and parsing the page, you can find all the links simply by running:
 parsed_html.find_all("a") 

You can read more about it here.
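For completeness, here is a sketch of how the whole requests + BeautifulSoup route might look. Keep in mind the selector logic is an assumption on my part: it will also pick up navigation links, so the result needs the same cleanup as the XPath version below.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://awoiaf.westeros.org/index.php/List_of_characters")
parsed_html = BeautifulSoup(r.text, "html.parser")

# Character links carry the name in their title attribute; skip links without one.
names = [a["title"] for a in parsed_html.find_all("a") if a.has_attr("title")]
print names[:5]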

I would also like to mention one more method, using the lxml library. With it you can work with XPath. I'm new to XPath, but it is a powerful way to move around a tree structure.
import lxml.html
tree = lxml.html.fromstring(html)
character_names = tree.xpath("//ul/li/a[1]/@title")
print character_names[:5]
['Abelar Hightower', 'Addam', 'Addam Frey', 'Addam Marbrand', 'Addam Osgrey']

If you are squinting at the XPath expression above, here is what it does:
 tree.xpath("//ul        #                 /li        #               /a[1]      #     .            /@title    #   title          ") 

Now we need to pick actual names out of the result and drop anything that is not part of a name. Just skimming the page, I noticed entries like "Taena of Myr". We do not want our model associating dialogue with the particle "of".

NLTK helps here too. It ships with a corpus of stopwords: words that occur so often that they say nothing useful about a text.
particles = ' '.join(character_names).split(" ")
print len(set(particles))
stopwords = nltk.corpus.stopwords.words('english')
print stopwords[:5]
particles = set(particles) - set(stopwords)
print len(particles)
# Sanity check: "Aegon I" contributes the particle "I", and the stopword list
# only contains the lowercase "i", so the uppercase "I" should survive.
"I" in particles
2167
['i', 'me', 'my', 'myself', 'we']
2146
True

Finally, you may want to add a few nicknames the list misses, such as Dany, Blackfish or Joff. Once you are happy with the list of names, save it to a file for later use, as sketched below.
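A minimal sketch of that last step; the extra nicknames here are just the ones mentioned above, not an exhaustive list:

import codecs

# Aliases the wiki list misses; extend as you spot more of them.
nicknames = [u"Dany", u"Blackfish", u"Joff"]
particles = set(particles) | set(nicknames)

with codecs.open("asoiaf_name_particles.txt", "w", encoding="utf-8") as out:
    for particle in sorted(particles):
        out.write(particle + "\n")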

Search for names. Part 2


We abandoned the idea of finding names by part of speech and obtained a list of names instead. We will extract runs of tokens and check whether they appear in that list. At last it is time to write some code.
import itertools
from operator import itemgetter

particles = [particle.rstrip('\n') for particle in open('asoiaf_name_particles.txt')]

tokens = [u'\u201c', u'Such', u'eloquence', u',', u'Gared', u',\u201d',
          u'Ser', u'Waymar', u'observed', u'.']

def roll_call(tokens, particles):
    speakers = {}
    particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
    for k, g in itertools.groupby(enumerate(particle_indices), lambda (i, x): i - x):
        index_run = map(itemgetter(1), g)
        speaker_name = ' '.join(tokens[i] for i in index_run)
        speakers[min(index_run)] = speaker_name
    return speakers

This function uses a lambda expression, something I could not have managed a year ago when I first did this project. The script I used back then is so terrible and unreadable that I did not dare publish it. Besides, I think beginners can learn something new from this version, so let's go through it in a bit more detail.

itertools is a tool worth knowing. I often use it to flatten nesting or to build permutations. Here we need its groupby function. In the version rewritten for this material, I went with groupby instead of the dropwhile and takewhile I had previously used in a recursive manner.

While writing it, I decided the roll_call function should know the positions of the names it found, so I keep the index of every name token. You can see this on the 3rd line of the function.

 particle_indices = [i for (i, w) in enumerate(tokens) if w in particles] 

enumerate helped me a lot when I was getting to know Python. It takes a list and, for each element, returns a tuple of the element's index and the element itself.

The 4th line is the trickiest piece of code in this whole write-up, and I did not write it; it comes straight from the library documentation.
for k, g in itertools.groupby(enumerate(particle_indices), lambda (i, x): i - x):

groupby walks through a list and groups elements according to the result of the lambda function. Lambdas are anonymous functions: unlike roll_call, they do not have to be defined in advance. They are just a bit of code that takes arguments and returns a value. In our case it simply subtracts the element from its index.

Let's see how that plays out.
print tokens
particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
print particle_indices
for index, location in enumerate(particle_indices):
    lambda_function = index - location
    print "{} - {} = {}".format(index, location, lambda_function)
[u'\u201c', u'Such', u'eloquence', u',', u'Gared', u',\u201d', u'Ser', u'Waymar', u'observed', u'.']
[4, 6, 7]
0 - 4 = -4
1 - 6 = -5
2 - 7 = -5

This is the trick behind groupby: the indices are numbered consecutively, so if the items in the list also follow one another, the lambda gives the same result for all of them.

groupby sees -4 and puts element 4 into that group. The 6th and 7th elements both give -5 and get grouped together accordingly.

Now we know where the multi-token names are and can use that. What does groupby return? The key, which is the result of our lambda, and the group itself, a grouper object. We then use map to apply itemgetter(1), which pulls the element out of each (index, element) tuple, to every member of the group, giving us the list of that name's positions in the original token list.
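To make the key/group pair concrete, here is the same run printed out, using just the toy indices from above:

import itertools
from operator import itemgetter

particle_indices = [4, 6, 7]
for k, g in itertools.groupby(enumerate(particle_indices), lambda (i, x): i - x):
    group = map(itemgetter(1), g)
    print "key {}: token positions {}".format(k, group)
# key -4: token positions [4]
# key -5: token positions [6, 7]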

After groupby, we only have to join the tokens of each name and store it in the speakers dictionary.
roll_call(tokens, particles)
{4: u'Gared', 6: u'Ser Waymar'}

Optimization


Let's compare this function's speed with the part-of-speech approach.

%timeit roll_call(tokens, particles)
100 loops, best of 3: 3.85 ms per loop

Not bad, five or six times faster. But we can improve on that with set: sets check membership almost instantly.
set_of_particles = set(particle.rstrip('\n') for particle in open('asoiaf_name_particles.txt'))
%timeit roll_call(tokens, set_of_particles)
10000 loops, best of 3: 22.6 µs per loop

You know things are going well when the timings switch to Greek letters.

Finding names around the dialogue


Now we need to write the program that calls the function above in the right places, so that we can find character names before, inside, and after the dialogue text. We will wrap all of this in a class that can assemble the full list of character-name positions for us, which we will pass on to another routine to extract features and then on to CRFsuite.

But first, I would like to put our data in order.

XML parser


After that successful one-liner with XPath, I decided to write an XML converter for our text files. There is a lot of sense in choosing this format. ASOIAF is a set of books, which contain chapters, which in turn consist of paragraphs, some of which contain dialogue, and we need to tag it unobtrusively. If I had not converted the text to XML (and at first I did not), the tags would have littered the text itself.

I would rather not say much about the script below: it reminds me of my first steps in Python, with huge functions, crutches, and long variable names.
from lxml import etree
import codecs
import re

def ASOIAFtoXML(input):
    # input is a list of dicts with "title" and "contents" keys, one per book file.
    root = etree.Element("root")
    for item in input:
        title = item["title"]
        current_book = etree.Element("book", title=item["title"])
        root.append(current_book)
        with codecs.open(item["contents"], "r", encoding="utf-8") as book_file:
            # Fallback chapter, in case text turns up before the first chapter title.
            current_chapter = etree.Element("chapter", title="Debug")
            for paragraph in book_file:
                paragraph = paragraph.strip()
                if paragraph != "":
                    title_match = re.match("\A[A-Z\W ]+\Z", paragraph)
                    if title_match:
                        current_chapter = etree.Element("chapter", title=title_match.group())
                        current_book.append(current_chapter)
                    else:
                        current_graf = etree.SubElement(current_chapter, "paragraph")
                        # Reset per paragraph; avoids a stale name (or a NameError)
                        # when a dialogue paragraph has no {tag}.
                        speaker_name = None
                        while paragraph != "":
                            current_dialogue = current_graf.xpath('./dialogue[last()]')
                            speaker_match = re.search("(\{(.*?)\} )", paragraph)
                            if speaker_match:
                                speaker_tag = speaker_match.group(1)
                                speaker_name = speaker_match.group(2)
                                paragraph = paragraph.replace(speaker_tag, "")
                            open_quote = paragraph.find(u"\u201c")
                            if open_quote == -1:
                                if current_dialogue:
                                    current_dialogue[0].tail = paragraph
                                else:
                                    current_graf.text = paragraph
                                paragraph = ""
                            elif open_quote == 0:
                                current_dialogue = etree.SubElement(current_graf, "dialogue")
                                if speaker_name:
                                    current_dialogue.attrib["speaker"] = speaker_name
                                close_quote = paragraph.find(u"\u201d") + 1
                                if close_quote == 0:
                                    # find returns -1 when there is no closing quote, so +1 gives 0.
                                    # The dialogue then runs to the end of the paragraph,
                                    # so take everything that is left.
                                    close_quote = len(paragraph)
                                current_dialogue.text = paragraph[open_quote: close_quote]
                                paragraph = paragraph[close_quote:]
                            else:
                                if current_dialogue:
                                    current_dialogue[0].tail = paragraph[:open_quote]
                                else:
                                    current_graf.text = paragraph[:open_quote]
                                paragraph = paragraph[open_quote:]
    return root

tree = ASOIAFtoXML([{"title": "AGOT", "contents": "corpus/train_asoiaf_tagged.txt"}])

# To write the tree out to a file:
# et = etree.ElementTree(tree)
# et.write(codecs.open("asoiaf.xml", "w", encoding="utf-8"), pretty_print=True)

The gist of the code: we use lxml to build a tree, then walk the text line by line. If a line looks like a chapter title (capital letters, punctuation and spaces only), we append a new chapter to the current book. Once we are inside a chapter, we work through its paragraphs, using another regular expression to figure out who speaks each piece of dialogue, and attach it to the corresponding dialogue node. The dialogue must already be pre-tagged, of course.

An interesting note about XML. It is a hierarchical structure, so by its nature it wants strict nesting, with a single root at the top. Prose is not like that: dialogue sits inside the narration. lxml has a solution for this, text and tail. An XML node stores its text, but that text is cut off wherever a child node is inserted.
markup = '''<paragraph>Worse and worse, Catelyn thought in despair. My brother is a fool. Unbidden, unwanted, tears filled her eyes. <dialogue speaker="Catelyn Stark">“If this was an escape,”</dialogue> she said softly, <dialogue speaker="Catelyn Stark">“and not an exchange of hostages, why should the Lannisters give my daughters to Brienne?”</dialogue></paragraph>'''
graf = lxml.etree.fromstring(markup)
print graf.text
Worse and worse, Catelyn thought in despair. My brother is a fool. Unbidden, unwanted, tears filled her eyes.

 print graf[0].text 
 "If this was an escape," 

What happens to the remaining "she said softly,"? It is stored in the node's tail.
 print graf[0].tail 
 she said softly, 

And so on, with each dialogue node picking up the narration that follows it in its tail.

As a result, this makes it much easier to find the speakers of the dialogue when we need them. And we need them right now!
class feature_extractor_simple:
    """Analyze dialogue features of a paragraph. Paragraph should be an lxml node."""

    def __init__(self, paragraph_node, particles, tag_distance=0):
        self.paragraph = paragraph_node
        self.particles = set(particles)
        self.tag_distance = tag_distance
        self.raw = ''.join(t for t in self.paragraph.itertext())
        self.tokens = self.tokenize(self.raw)

    def tokenize(self, string):
        return nltk.wordpunct_tokenize(string)

    def find_speakers(self, tokens):
        speakers = {}
        particle_indices = [i for (i, w) in enumerate(tokens) if w in self.particles]
        for k, g in itertools.groupby(enumerate(particle_indices), lambda (i, x): i - x):
            index_run = map(itemgetter(1), g)
            speaker_name = ' '.join(tokens[i] for i in index_run)
            speakers[min(index_run)] = speaker_name
        return speakers

    def pre_speak(self, prior_tag="FN", near_tag="NN"):
        # Names that appear before the first line of dialogue.
        features = {}
        if self.paragraph.text is not None:
            speakers = self.find_speakers(self.tokenize(self.paragraph.text))
            if len(speakers) > 0:
                features.update({"{} {}".format(prior_tag, self.tag_distance): speakers.values()[0]})
            if len(speakers) > 1:
                features.update({"{} {}".format(near_tag, self.tag_distance): speakers[max(speakers.keys())]})
        return features

    def dur_speak(self, tag="ADR"):
        # Names addressed inside the dialogue itself.
        features = {}
        for dialogue in self.paragraph.itertext("dialogue", with_tail=False):
            tokens = self.tokenize(dialogue)
            named = self.find_speakers(tokens)
            addressed = {k: v for (k, v) in named.items()
                         if tokens[k - 1] == "," or tokens[k + 1 + v.count(" ")].startswith(",")}
            if len(addressed) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): addressed[max(addressed.keys())]})
        return features

    def post_speak(self, tag="PS"):
        features = {}
        # Names that follow a line of dialogue ("...," said Ned).
        tails = [line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None]
        for tail in tails:
            tokens = self.tokenize(tail)
            speakers = {k: v for (k, v) in self.find_speakers(tokens).items() if k <= 1}
            if len(speakers) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): speakers[min(speakers.keys())]})
                break
        return features

A few words about these methods.

If you're new to Python, don't be afraid of classes. You just write ordinary functions that take self as their first argument, which lets Python know which object the function is currently working on. A class is like a clone factory, and an object is a clone: all clones share the same DNA (the methods and variables), but their life experience, that is, the data passed to them, gives them different personalities.

Classes also have a special __init__ function that lets you initialize an object's variables.
Now you can relax, because your data is in the hands of a specialized class, and once its behaviour is abstracted away you can get at the processed information with a snap of your fingers.
 paragraph = tree.xpath(".//paragraph")[32] example_extractor = feature_extractor_simple(paragraph, particles) print example_extractor.raw print example_extractor.pre_speak() print example_extractor.dur_speak() print example_extractor.post_speak() 
"Such eloquence, Gared," Ser Waymar observed. "I never suspected you had it in you."
 {}
{'ADR 0': u'Gared '}
{'PS 0': 'Ser Waymar'} 

If you are puzzled by what some of the methods are doing, I'll explain them briefly below. If everything above looks fine to you, then you know what to do: see you in the next chapter.

There is some awkward dictionary juggling here, because dictionaries in Python are unordered. It gives me the same feeling as locking the door, walking away, and realising the keys might not be in my pocket. I constantly have to check whether we want the first or the last name mentioned and, depending on the case, look at the keys and take the minimum or the maximum.
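Concretely, with the dictionary roll_call produced above, picking the first or last name in the paragraph comes down to:

speakers = {4: u'Gared', 6: u'Ser Waymar'}
print speakers[min(speakers.keys())]   # first name in the paragraph: Gared
print speakers[max(speakers.keys())]   # last name in the paragraph: Ser Waymar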

pre_speak


As I said above, the text attribute holds everything up to the first line of dialogue. We just look for character names in it.

dur_speak


When the name appears inside the dialogue itself, which may consist of several quoted stretches, we need to run through all of them:
 for dialogue in self.paragraph.itertext("dialogue", with_tail=False) 

The itertext function in lxml yields the node's text. We pass with_tail=False so that only the dialogue text itself comes back, without the narration in the tails.

Once we find character names, we keep only those set off by a comma, which is what lets us detect a direct address (for example, "Ned, promise me." / "Promise me, Ned."). In my heart I feel that the last person addressed in a dialogue is the one most likely to answer in the next paragraph, so we overwrite the addressee with the last name mentioned.

post_speak


For this function we only need the first name after the dialogue, so we break out of the loop as soon as we find one.

The function looks at the first two tokens after the closing quotation mark. That way it catches dialogue like:
"Goodbye," said Jon.

A tip for novice programmers: you can call functions inside a list comprehension.
 tails = [line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None] 

That gets us the dialogue tails in a single line. (You just add a condition to drop the results that have no tail.)

CRFsuite


This is probably the part you are most curious about. It involves conditional random fields, whatever those may be, it is run from the command line, and you cannot see how it works on the inside.

In fact, CRFsuite is a very simple and pleasant part of all this. While writing this material I discovered it has a Python binding, but let's not complicate things and just use the executable from the command line.

(I plan to update the model when the next book, The Winds of Winter, sees the light of day. But I have a couple of years until that happens.)

All CRFsuite needs is a text file with tab-separated features, like these:

FN 0    Graf Sent Len=4    FN 1=True    FN -2=True    FN 0=True    NN 1=True

This is the training-data format. The first field is the correct answer; everything after it is a feature. Features can look however you like, but do not use a colon: that is reserved for feature weights and would be misinterpreted.
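If you are generating these lines from the feature lists above, the assembly is just a join; a small sketch (the exact feature names are whatever your extractor produces):

label = "FN 0"
features = ["Graf Sent Len=4", "FN 1=True", "FN -2=True", "FN 0=True", "NN 1=True"]

# CRFsuite wants the label first, then the features, all tab-separated.
line = "\t".join([label] + features)
print line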

Open a command prompt in the directory containing crfsuite.exe and type:
 crfsuite learn -m asoiaf.model train.txt 

This builds the model, the brain of the whole operation. You can name it whatever you like; I called mine asoiaf. To look at the model's accuracy, type:
 crfsuite tag -qt -m asoiaf.model test.txt 

To actually run the model and tag new data, type:
 crfsuite tag -m asoiaf.model untagged.txt 

untagged.txt should look the same as train.txt, but without the correct answer at the start of each line, i.e. roughly like this:

NN -1=True    FN 0=True    FN 2=True    FN -1=True    NN 0=True

You can learn more about it here.

Now let's play with different features that might improve the model's accuracy. We'll start with the simplest: boolean values recording where the positional labels occur in and around the paragraph.

Here is our feature-extraction class again, now with a few new methods at the top.
class feature_extractor:
    """Analyze dialogue features of a paragraph. Paragraph should be an lxml node."""

    def __init__(self, paragraph_node, particles, tag_distance=0):
        self.paragraph = paragraph_node
        self.particles = set(particles)
        self.tag_distance = tag_distance
        self.raw = ''.join(t for t in self.paragraph.itertext())
        self.tokens = self.tokenize(self.raw)
        self.speaker = self.xpath_find_speaker()

    def features(self):
        features = {}
        features.update(self.pre_speak())
        features.update(self.dur_speak())
        features.update(self.post_speak())
        return features

    def local_features(self):
        # Features that depend only on the current paragraph.
        features = []
        if self.tokens.count(u"\u201c") == 0:
            features.append("NoQuotes=True")
        prior = self.paragraph.getprevious()
        try:
            last_dialogue = list(prior.itertext("dialogue", with_tail=False))[-1].lower()
            hits = [w for w in ['who', 'you', 'name', '?'] if w in last_dialogue]
            if len(hits) > 2:
                features.append("Who Are You?=True:10.0")
        except (AttributeError, IndexError):
            pass
        try:
            dialogue = list(self.paragraph.itertext("dialogue", with_tail=False))[0].lower()
            for token in ['name', 'i am', u'i\u2019m']:
                if token in dialogue:
                    features.append("My Name=True:10.0")
                    break
        except (AttributeError, IndexError):
            pass
        if self.tokens[0] in self.particles:
            features.append("FirstSpeakerIndex0=True")
        if self.paragraph.text is not None:
            name_precount = len(self.find_speakers(self.tokenize(self.paragraph.text)))
            if name_precount > 2:
                features.append("Many Names Before=True")
            conjunctions = set([w.lower() for w in self.tokenize(self.paragraph.text)]).intersection(
                set(['and', 'but', 'while', 'then']))
            if len(conjunctions) > 0 and self.paragraph.find("dialogue") is not None:
                features.append("Conjunction in Head=True")
        short_threshold = 10
        if len(self.tokens) <= short_threshold:
            features.append("Short Graf=True")
        dialogue_length = sum(map(len, self.paragraph.xpath(".//dialogue/text()")))
        # Cast to float so Python 2 does not floor the ratio to 0 or 1.
        dialogue_ratio = float(dialogue_length) / len(self.raw)
        if dialogue_ratio == 1:
            features.append("All Talk=True")
        elif dialogue_ratio >= 0.7:
            features.append("Mostly Talk=True")
        elif dialogue_ratio < 0.3 and len(self.tokens) > short_threshold:
            features.append("Little Talk=True")
        return features

    def feature_booleans(self):
        bool_features = []
        for tag in ["PS", "FN", "NN", "ADR"]:
            label = "{} {}".format(tag, self.tag_distance)
            if label in self.features().keys():
                bool_features.append("{}=True".format(label))
            else:
                bool_features.append("{}=False".format(label))
        return bool_features

    def tokenize(self, string):
        return nltk.wordpunct_tokenize(string)

    def find_speakers(self, tokens):
        speakers = {}
        particle_indices = [i for (i, w) in enumerate(tokens) if w in self.particles]
        for k, g in itertools.groupby(enumerate(particle_indices), lambda (i, x): i - x):
            index_run = map(itemgetter(1), g)
            speaker_name = ' '.join(tokens[i] for i in index_run)
            speakers[min(index_run)] = speaker_name
        return speakers

    def xpath_find_speaker(self):
        speakers = self.paragraph.xpath(".//@speaker")
        if speakers == []:
            return "NULL"
        else:
            return speakers[0]

    def pre_speak(self, prior_tag="FN", near_tag="NN"):
        # Names that appear before the first line of dialogue.
        features = {}
        if self.paragraph.text is not None:
            speakers = self.find_speakers(self.tokenize(self.paragraph.text))
            if len(speakers) > 0:
                features.update({"{} {}".format(prior_tag, self.tag_distance): speakers.values()[0]})
            if len(speakers) > 1:
                features.update({"{} {}".format(near_tag, self.tag_distance): speakers[max(speakers.keys())]})
        return features

    def dur_speak(self, tag="ADR"):
        # Names addressed inside the dialogue itself.
        features = {}
        for dialogue in self.paragraph.itertext("dialogue", with_tail=False):
            tokens = self.tokenize(dialogue)
            named = self.find_speakers(tokens)
            addressed = {k: v for (k, v) in named.items()
                         if tokens[k - 1] == "," or tokens[k + 1 + v.count(" ")].startswith(",")}
            if len(addressed) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): addressed[max(addressed.keys())]})
        return features

    def post_speak(self, tag="PS"):
        features = {}
        # Names that follow a line of dialogue.
        tails = [line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None]
        for tail in tails:
            tokens = self.tokenize(tail)
            speakers = {k: v for (k, v) in self.find_speakers(tokens).items() if k <= 1}
            if len(speakers) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): speakers[min(speakers.keys())]})
                break
        return features

paragraph = tree.xpath(".//paragraph")[-1]
example_extractor = feature_extractor(paragraph, particles)
print example_extractor.raw
print example_extractor.features()
print example_extractor.local_features()
print example_extractor.feature_booleans()
And in their hands, the daggers.
{}
['NoQuotes=True', 'Short Graf=True', 'Little Talk=True']
['PS 0=False', 'FN 0=False', 'NN 0=False', 'ADR 0=False']

Last night, in a bout of undocumented machine-learning madness, I tried a number of features. Below are the sketches fit for publication.

Option 1: true positional booleans only
Label   Count   Recall
PS 0    207     0.9949
FN 0    185     0.95
NULL    118     0.3492
OTHER   56      0.3939
PS -2   44      0.5238
Item accuracy: 430/678 (0.6342)

We will see a lot of similar statistics below, so let's settle right away what they mean.

Imagine we are at dinner, watching people. I ask you to decide whether a passer-by is an Illuminati. You, as someone who believes wholeheartedly in conspiracy theories, finish your dumplings and start labelling passers-by.

Precision, a value we will not look at here, reflects false alarms: how often you mistakenly labelled someone as an Illuminati.

Recall measures how many of the labels in the test data the model identified correctly.

And F1 is a combination of the two metrics. You can see that if you label everyone as Illuminati you get perfect recall and miserable precision.
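For reference, here is how those numbers relate to the raw counts (my own summary, not part of the original material):

def precision_recall_f1(true_positives, false_positives, false_negatives):
    # Precision: of everyone you called an Illuminati, how many really were.
    precision = float(true_positives) / (true_positives + false_positives)
    # Recall: of all the actual Illuminati, how many you caught.
    recall = float(true_positives) / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print precision_recall_f1(90, 10, 30)  # (0.9, 0.75, 0.8181...)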

Since I tagged everything myself, I am not that interested in the model's precision. What I care about is recall and item accuracy.

In the first version of the features I only used the true booleans. That is, for the paragraph above, all the feature sets looked like "ADR 0=True" and "PS 0=True". Item accuracy was 63.4%.

Is 63.4% good? Given that NULL, PS 0 and FN 0 make up three quarters of the test data and are naturally easy to find, we can definitely do better. Now let's add the remaining positional booleans, the false ones.

Option 2: all positional booleans
Label   Count   Recall
NULL    254     0.9048
PS 0    204     0.9899
FN 0    149     0.975
OTHER   24      0.2273
PS -2   19      0.2857
Item accuracy: 515/678 (0.7596)

Now we nail the simple cases and get decent accuracy. 75% means you would only have to tag the first book, A Game of Thrones, plus a third of A Clash of Kings, and the model could handle the remaining three quarters of the series itself. That is many hours of work, but within reason.

Still, I see no reason why we should not be catching the NULL labels at a full 98%+, so let's add a feature aimed at exactly that.

Option 3: quotes?
Label   Count   Recall
PS 0    218     0.9907
NULL    180     0.9119
FN 0    167     0.9118
OTHER   63      0.3784
PS 2    25      0.5
Item accuracy: 550/710 (0.7746)

Here we count the number of opening quotes in a paragraph.

I must say I am surprised NULL did not get more accurate. Something to work on. Next I would like to improve FN 0.

Option 4: index of the first name?
Label   Count   Recall
PS 0    218     0.9907
NULL    183     0.9057
FN 0    157     0.8971
OTHER   68      0.4189
PS -2   23      0.5484
Item accuracy: 551/710 (0.7761)

This feature records the index of the first name.

Hmm... maybe that is too clever; let's go back to booleans again.

Option 5: name at index 0? + redundancy
Label   Count   Recall
PS 0    216     0.986
FN 0    166     0.9265
NULL    160     1
OTHER   85      0.5811
PS 2    32      0.7143
Item accuracy: 578/710 (0.8141)

There it is! I had been counting opening quotes incorrectly, which was spoiling the result.

With that fixed, NULL is caught perfectly... but now we have run out of easy ways to improve the model. From here on I really have to get creative to push the result further. Let's see whether it works...

Option 6: post-speech (PS) at +2 and -2
Here we add booleans for whether the speaker is named two paragraphs above or below the current one. In theory this should raise the PS -2 score.
Label   Count   Recall
PS 0    216     0.986
FN 0    166     0.9265
NULL    160     1
OTHER   84      0.5676
PS 2    32      0.7143
Item accuracy: 578/710 (0.8141)

No effect!

Option 7: sequences??
Label   Count   Recall
PS 0    217     0.986
FN 0    168     0.9265
NULL    160     1
OTHER   82      0.5541
PS 2    30      0.6429
Item accuracy: 576/710 (0.8113)
Instance accuracy: 56/142 (0.3944)

Wait! It turns out CRF can work with sequences, and in fact that is its whole point. I had been ignoring the instance accuracy value because it was always 0/1, which means the model was treating the entire text as one long conversation.

Sorry, I need to slap myself. Assuming it would improve accuracy, and that is an open question, how would we use this capability? I tried fixing each sequence at 5 paragraphs, but that does not feel right.

Perhaps two consecutive NULLs could end a sequence, on the assumption that the conversation is over.
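I never got this working well, but mechanically it is easy to try: CRFsuite treats a blank line as the end of a sequence, so one could cut the training file whenever two NULL paragraphs occur in a row. A sketch of the idea, not something I ended up using:

def split_on_double_null(lines):
    # lines are CRFsuite rows with the label first; a blank line starts a new sequence.
    output = []
    null_run = 0
    for line in lines:
        label = line.split("\t")[0]
        null_run = null_run + 1 if label == "NULL" else 0
        output.append(line)
        if null_run == 2:
            output.append("")   # blank line: the conversation is assumed to be over
            null_run = 0
    return output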

After playing with this, I still could not build a model that handles conversations. As I understand it, that would require a lot of specialised transition weights (note: transition weights) that depend on the position in the sequence, so that the model makes different decisions depending on whether we are at the beginning, middle or end of a conversation.

But nothing in the model's behaviour suggests this is happening. I will play a bit more with other features in the near future. Oh, and let's look at the script that generates our training and test data. It is not optimised, because it computes features for each paragraph five times. I will leave it as is for this write-up, but keep in mind it could be sped up by using one pass to cache each paragraph's boolean features and a second pass to append the neighbours' features.
 tree = ASOIAFtoXML([{"title": "ASOIAF", "contents": "corpus/train_asoiaf_pos_tagged.txt"}]) paragraphs = tree.xpath(".//paragraph") In [29]: def prep_test_data(paragraphs):   max_index = len(paragraphs)   results = []   for index, paragraph in enumerate(paragraphs):       extractor = feature_extractor(paragraph, set_of_particles)       all_features = extractor.local_features() + extractor.feature_booleans()       for n in [-2, -1, 1, 2]:           if 0 <= n+index < max_index:               neighbor_features = feature_extractor(paragraphs[index + n], set_of_particles, tag_distance = n).feature_booleans()               if neighbor_features:                   all_features += neighbor_features            all_features.insert(0, extractor.speaker)       results.append("\t".join(all_features))   return results results = prep_test_data(paragraphs) In [31]: max_index = len(results) with codecs.open(r"new_test.txt", "w", "utf-8") as output:   for line in results[:int(max_index/2)]:           output.write(line + '\n') with codecs.open(r"new_train.txt", "w", "utf-8") as output:   for line in results[int(max_index/2):]:           output.write(line + '\n') 

More features


I tried several other features:

I thought the last of those was a particularly clever move, but it did not help, and accuracy never rose above 81%.

I tried swapping the training and test data and got 84%. You should not spend too long tuning the feature set for particular data, because that leads to overfitting. In fact, shuffling the training and test data together is a good idea. I had not mixed them because I thought it would break the sequences, but we are no longer using sequences, so why not? Let's shuffle them.

With the data slightly shuffled:

I got 82%.

Okay! I think we have hit the limit of my skills here.

And there will be no continuation?


Let's summarize and talk about what can be done next.


Conclusion


Good! I hope this was useful to someone. Thank you for reading, and if you want to get in touch, I am on Twitter.

Also, I would like to note that all of the above was done for a large critical study of Game of Thrones. If you are a fan of the books and would like to read the analysis that this dialogue tagging made possible, I will publish it all soon.

Source: https://habr.com/ru/post/304230/

