(Part 1, Part 2)

Last time I mentioned tokenization only in passing; now we can discuss it properly, along with part-of-speech (POS) tagging.
Suppose we have already caught all the errors we planned to catch at the level of regular-expression text analysis. It is time to move on to the next level, where we work with the individual words of a sentence. Splitting text into words is the job of the tokenization module, and even this seemingly simple task has pitfalls. I am not even talking about languages like Chinese and Japanese, where merely isolating individual words is nontrivial (the characters are written without spaces); English and Russian also give you something to think about. For example, is the period in an abbreviation part of the token, or a separate token? (Is "etc." one token or two?) And what about a person's name: how many tokens are there in "J. S. Smith"? Of course, one can make an arbitrary decision on each point, but every such decision has downstream consequences, and this must be borne in mind.
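The decisions above can be sketched in code. This is a minimal illustration, not a production tokenizer; the abbreviation list and the regex are my own made-up examples of how such a decision might be encoded:

```python
import re

# Known abbreviations (a made-up illustrative list) keep their period
# as part of the token; any other trailing period becomes its own token.
ABBREVIATIONS = {"etc.", "e.g.", "i.e.", "Mr.", "Dr."}

# A word optionally followed by a period, or any single non-letter symbol.
TOKEN_RE = re.compile(r"[A-Za-z]+\.|[A-Za-z]+|[^\sA-Za-z]")

def tokenize(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok.endswith(".") and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])   # treat the period as a separate token
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Send it to Dr. Smith etc."))
# ['Send', 'it', 'to', 'Dr.', 'Smith', 'etc.']
```

Note how every choice (is "Dr." one token?) is an explicit policy decision baked into the code, which is exactly why such choices are hard to revisit later.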
That is roughly how I argued at the early stages of our project, but now I tend to think that in text-processing tasks you often have to follow decisions made by other people. This will become clear from the example of part-of-speech tagging.
Part-of-speech tagging
Once a sentence is split into words, you can already search the text for frequently encountered typos: for example, correcting "egg yoke" to "egg yolk" (this typo is apparently so common that Wikipedia even mentions it). But the real progress over regular expressions comes from part-of-speech tagging, that is, assigning each word of the text its part of speech:
“I love big dogs.” -> “I_PRP love_VBP big_JJ dogs_NNS ._.”
In this example the following tags are used: PRP, personal pronoun; VBP, verb in the present tense, non-third-person singular; JJ, adjective; NNS, plural noun. And the period is just a period.
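The "word_TAG" notation above is easy to work with programmatically. A small sketch that splits such a tagged sentence back into (word, tag) pairs (splitting on the last underscore, so that tags like "." survive):

```python
def parse_tagged(text):
    """Split a tagged sentence like "I_PRP love_VBP" into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("_")  # last "_" separates word from tag
        pairs.append((word, tag))
    return pairs

print(parse_tagged("I_PRP love_VBP big_JJ dogs_NNS ._."))
# [('I', 'PRP'), ('love', 'VBP'), ('big', 'JJ'), ('dogs', 'NNS'), ('.', '.')]
```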
Knowing the parts of speech of individual words, one can formulate more complex error patterns. For example, "DT from" -> "DT form". The tag DT stands for "determiner": an article or a demonstrative like this/that. If the text contains the combination "the from" or "this from", it is most likely a typo: the preposition "from" was typed instead of the noun "form". It can be even trickier: "MD know VB" -> "MD now VB". This pattern, "modal verb + know + verb", catches the typo "know" instead of "now"; it matches, say, the phrase "I can know say something more".
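Such patterns are straightforward to match over a tagged sentence. Below is a minimal sketch with a hypothetical rule format of my own (uppercase elements match tags, lowercase elements match literal words), not the actual rule language of any real checker:

```python
def matches(pattern, tagged, i):
    """Check whether pattern matches the (word, tag) list starting at position i."""
    if i + len(pattern) > len(tagged):
        return False
    for elem, (word, tag) in zip(pattern, tagged[i:]):
        if elem.isupper():              # tag element, e.g. "MD"
            if tag != elem:
                return False
        elif word.lower() != elem:      # literal word element, e.g. "know"
            return False
    return True

def find_errors(pattern, tagged):
    """Return every position where the error pattern fires."""
    return [i for i in range(len(tagged)) if matches(pattern, tagged, i)]

sentence = [("I", "PRP"), ("can", "MD"), ("know", "VB"),
            ("say", "VB"), ("something", "NN"), ("more", "JJR")]
print(find_errors(["MD", "know", "VB"], sentence))  # [1]
```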
Of course, it is not difficult to also implement simple operations such as "or" ("if this or that occurred") and negation ("this did not occur"). The already mentioned LanguageTool system works on exactly such expressions. Since it is distributed under the LGPL license, I decided to port all of its rules to our system. Why not? People have done a great job; it would be foolish not to use the results if it is permitted. We will discuss the limitations of this approach later, but for now let us return to part-of-speech tagging.
The most popular approach to POS tagging today reduces it to the same classification task, this time in its full version. We feed the learning algorithm a word and its context: usually the initial and final characters of the word, as well as data about the preceding words of the sentence, both the words themselves and their parts of speech. We also report the part of speech of the word in the current context, and the algorithm remembers this information. If we then feed it a new context, the algorithm can make a reasonable guess about the part of speech.
Here, too, the maximum-entropy model is often used, although you could experiment with other algorithms; for example, there is a tagger based on support vector machines (SVMTool).
Annotated corpora, great and terrible
Last time I did not dwell on this, but now it is time. For a POS tagger to work, it must be trained on a large collection of texts in which every word is labeled with a part-of-speech tag. A reasonable question then arises: where does one get such a collection?
Such collections ("annotated corpora") exist, although there are not many of them. POS annotation is the most common; deep annotation, that is, marking the syntactic-semantic links between the words of a sentence, is rarer. The largest deeply annotated corpus of English is called the Penn Treebank and contains almost three million words. Good corpora also exist for German and Russian; those are the ones I have studied personally.
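To make "training on an annotated corpus" concrete, here is a minimal baseline sketch: a tagger that simply assigns each word its most frequent tag in the training data (real taggers do far better, and the tiny "corpus" below is made up for illustration):

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count tag frequencies per word and keep the most common tag for each."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, default="NN"):
    """Tag each word with its most frequent training tag; unknown words get a default."""
    return [(w, model.get(w.lower(), default)) for w in words]

corpus = [[("I", "PRP"), ("love", "VBP"), ("dogs", "NNS")],
          [("Dogs", "NNS"), ("love", "VBP"), ("me", "PRP")]]
model = train(corpus)
print(tag(model, ["Dogs", "love", "cats"]))
# [('Dogs', 'NNS'), ('love', 'VBP'), ('cats', 'NN')]
```

Even this trivial model inherits every annotation decision of the corpus it was trained on, which is exactly the point made below.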
Now consider this: there are subtleties on which different linguists hold different opinions. For example, how many cases are there in Russian? The textbook answer is six, but I can name at least eight or nine. In English, what part of speech is the word "book" in the combination "book market"? I would say it is an adjective, but one can also defend the interpretation "a noun in the role of an adjective".
Thus, a text can be annotated in different ways, based on various linguistic or practical considerations. Unfortunately, our own considerations are unlikely to be embodied in the final system: by using a ready-made corpus, we are forced to accept the rules of the game set by its developers. If I train a POS tagger on the Penn Treebank, I have to accept that "book" there is treated as a noun rather than an adjective. Whoever does not like this can build their own corpus and annotate it as they see fit.
Similarly, in the Penn Treebank a punctuation mark is always a separate token, so "etc." is two tokens and "J. S. Smith" is five, even if this convention is inconvenient for me. No choice. This, by the way, bears on the question of having linguists on the project. With an unlimited budget and plenty of time, I could try to build my own system embodying our views on spelling. In real conditions, however, the existing NLP tools and text corpora steer your work along a fairly fixed route, leaving not much room for imagination.
One more remark. Naturally, ready-made collections contain correct texts, free of obvious grammatical errors. What does this mean for us? Take the same POS tagger: first we train it on correct texts (where it never sees combinations like "I has"), and then we use it to tag words in texts that do contain errors. Will it perform as well in this new environment? Who knows; but building a corpus of typical mistakes just to train a tagger would be too much of a luxury for us.
We continue in the next section.