
Suppose you want to create a web application. Today an enormous amount of software exists to make your life easier: you can pick up a comprehensive framework, or wire together a few libraries that solve typical tasks for you, such as templating, database access, and interactivity. These libraries give you a uniform interface for solving both common tasks and the exceptional ones you could not have handled on your own right away.
But amid this abundance of tools there is a conspicuous gap: a library for working with natural language.
"But there are plenty of those!" you may want to object. "NLTK, for example, or LingPipe." Sure, but do you actually use them? "Well, in my project I don't really need to process natural language."
In fact, you already do natural language processing without realizing it. Something as elementary as concatenating strings is just a special case of generating natural-language text, one of the fundamental parts of NLP[1]. But if you need to perform more complex operations, such as forming the plural of a noun, capitalizing a sentence, or changing the form of a verb, you cannot get by without linguistics[2]. And yet, when you suddenly need the plural form of a noun, you are far more likely to grab a couple of regexes off the Internet than to go looking for a suitable NLP library. The NLP field itself is partly to blame for this neglect.
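That "couple of regexes off the Internet" approach might look like the following minimal sketch; the function name and rule set are illustrative, not from any particular library, and the rules cover only a few common English patterns:

```python
import re

# A few irregular forms; a real system would need a much larger table.
IRREGULAR = {"child": "children", "person": "people", "mouse": "mice"}

def pluralize(noun: str) -> str:
    """Return an English plural form using a handful of heuristic rules."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"          # box -> boxes, match -> matches
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"    # city -> cities
    return noun + "s"               # message -> messages
```

This works well enough for English, which is exactly why developers rarely reach for a full NLP library; the approach collapses as soon as richer morphology is involved.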
It is worth considering which NLP tasks could be useful in your application: keyword extraction, reduction to canonical form, language identification, full-text search, autocompletion, classification by topic and clustering, automatic summarization, handwriting analysis, and perhaps more. No single application needs all of this abundance at once, but many could benefit from adding a couple of these features: a blog that automatically generates tags, performs full-text search, automatically summarizes entries for a news feed, and orders events chronologically. Yet few people implement such features, because the job is far from trivial. Modern solutions to these problems are usually built on models that require huge corpora of linguistic data to train. In most cases the game is not worth the candle, because it is easier to arm yourself with a couple dozen heuristics.
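As an illustration of the "couple dozen heuristics" route, here is a toy frequency-based keyword extractor of the kind a blog might bolt on instead of a trained model; the stopword list and thresholds are arbitrary assumptions for the sketch:

```python
import re
from collections import Counter

# A tiny stopword list; a real one would be far longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on"}

def keywords(text: str, n: int = 3) -> list[str]:
    """Return the n most frequent non-stopword terms as crude 'keywords'."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(n)]
```

A heuristic like this takes minutes to write and is often "good enough", which is precisely why trained linguistic models so rarely make it into ordinary applications.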
Both examples point to the same problem with applying NLP in practice: existing software assumes the user wants to build a system devoted entirely to NLP, when all they need is to bolt on a couple of features. I do not want to earn a doctorate in applied linguistics just to quickly get plural forms of nouns, or the result of some other well-studied NLP problem. It should be understood that users will most likely rely on a built-in linguistic model rather than train their own. Although this introduces some limitations, the approach still offers more than heuristics do. And, more importantly, a developer who uses the model and wants to improve it can simply train it on texts specific to their application[3].
My call: I want to see all the functionality that applied NLP, in its current infant form, is capable of, collected in one place where it can share common linguistic resources and offer a simple interface that attracts developers. I want the complex but practically applicable technologies of NLP to be available not only to linguists. I want, finally, all of this to be built on the fundamental principles of NLP, leaving open the possibility of improving the underlying models and algorithms. Engineers in the NLP field take great care not to give in to false expectations (in contrast to the high hopes of the 1980s). In some areas we are powerless, and that is in the order of things.
[1] As a field of study, NLP certainly does not treat string concatenation as a method. Instead it explores generating text from a functional description of the desired result: pronominalization, for example.
[2] This functionality (rules for deriving plural forms) is collected in the languages/ folder of MediaWiki, one of the most multilingual open-source projects and a striking source of information about linguistic oddities across languages.
[3] As an example, consider how text generation can help in localizing applications. Suppose you want to notify the user: "You have three new messages." The simplest solution would be printf("New messages: %d", numMessages). By taking this shortcut, we spare ourselves the need to generate the numeral and the matching form of the word "message".
If you still want to display the notification in a more natural form, the next step is to add a couple of functions: one that converts the number to a numeral and one that produces the right form of the word "message". The result will look something like printf("You have %s new %s", toNumeral(numMessages), pluralize("message", numMessages)). Since most applications are originally written in English, whose morphology is poor, simple hand-rolled solutions suffice, and developers usually run into these problems only during localization.
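The toNumeral and pluralize helpers above are hypothetical; a minimal Python sketch of them, with number words limited to 0 through 9 for brevity, might look like:

```python
NUMERALS = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]

def to_numeral(n: int) -> str:
    """Spell out small numbers; fall back to digits otherwise."""
    return NUMERALS[n] if 0 <= n < 10 else str(n)

def pluralize(word: str, count: int) -> str:
    """Naive English agreement: add -s unless the count is exactly one."""
    return word if count == 1 else word + "s"

def notification(num_messages: int) -> str:
    return "You have %s new %s" % (
        to_numeral(num_messages), pluralize("message", num_messages))
```

For English this is a few lines; for a language with richer number and case agreement, each helper would balloon into a real morphology module, which is the article's point.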
However, there is a language-invariant representation of this problem. Consider the grammatical dependencies we could extract from our sentence using NLP tools:
subj (message-4, you-1)
num (message-4, three-2)
amod (message-4, new-3)
root (ROOT-0, message-4)
Let us ask: "Given this data, can we automatically generate, in any language, a message that conveys the same information and is grammatically well-formed?" This is the fundamental question of natural language generation. (Moreover, it is not merely a machine translation question, because the generated message can vary with the functional description we supply, which is rather hard to obtain directly from text.) Admittedly, we will not get a magic black box that produces grammatically correct text on demand, but even the text generation tools available today could significantly ease the work of translators. In my opinion, this topic deserves close study.
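To make the representation concrete: each dependency above names a token together with its position in the sentence, so even a trivial "generator" can recover the surface order of the content words by sorting on those indices. This naive sketch is an illustration only, not a real NLG system; a real one would also inflect words and insert function words like "have":

```python
import re

DEPS = [
    "subj(message-4, you-1)",
    "num(message-4, three-2)",
    "amod(message-4, new-3)",
    "root(ROOT-0, message-4)",
]

def linearize(deps: list[str]) -> str:
    """Recover surface word order from word-index pairs in dependencies."""
    tokens = {}
    for dep in deps:
        for word, idx in re.findall(r"([A-Za-z]+)-(\d+)", dep):
            if word != "ROOT":          # ROOT-0 is an artificial node
                tokens[int(idx)] = word
    return " ".join(tokens[i] for i in sorted(tokens))
```

Running linearize(DEPS) yields "you three new message": the content and order survive, but agreement and function words are lost, which is exactly the gap a genuine text generation system has to fill.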
-
Link to the original article.