Vaccination against reality: rose-colored glasses for the browser

Why is there so much swearing around? It's one thing when a hammer lands on your foot, or when you urgently need to tell a colleague that he is not going to finish the site layout in time. But on the Internet, an author should always have enough time to find an elegant phrase and show himself to be a competent, intelligent person with a large vocabulary. Unfortunately, cases where obscene vocabulary is truly appropriate are rare: offhand, one in a hundred.

Some owners of forums, chats and blogs fight the abundance of mat (Russian obscene language) with organizational measures (setting rules) or technical ones (using parsers), but the biggest drawback of existing anti-mat systems is their numerous false positives, which generate amazing neologisms such as sketch, zasterpenis and skigey (for those who haven't guessed, the original word was "turpentine"). Also, scripts (and often the authors of the texts themselves) sometimes replace the letters in the middle of expletives with asterisks (***) or characters like "#$%^", which makes me suspect that these people have black squares instead of genitals.

We will take a different path and let readers decide for themselves what they want on the screen: colorful Russian language or the no less colorful literary Russian language. We will develop a browser extension, "We do not use foul language", that replaces profanity with synonymous literary expressions. The main, decisive requirement for the extension is that the text stays natural and readable after the replacement. We do not want to impoverish the language by simply removing the mat from it: we enrich it and offer something more in return.
In this article, I present a superficial linguistic study of Russian mat, followed by a brief course on regular expressions in JavaScript and a guide to creating extensions for the Chrome browser.

Part 1. Linguistics. Russian mat sorted out

In Russian, there are three main word-forming obscene roots and a few words that were not originally obscene but are taboo in the modern language. The famous three-letter word is the most productive in Russian in terms of word formation, and probably the richest in word forms and derivatives in the language in general. It is also interesting that the expressions derived from it cover almost all the semantic fields of the language. For me, as a developer, this means that in different contexts these expressions will have to be replaced with different analogues.

Let me remind you that a Russian word can include the following morphemes: prefix, root, interfix, suffix, ending. For those who do not know, an interfix is the part of a word that connects two or more roots (water, sing, goal of the gate). A word must have at least one root; everything else is optional.

A unique feature of Russian mat is that changing a single morpheme (for example, the prefix) can drastically change the meaning of a word. Moreover, even words with the same prefix (or with none) can carry different meanings depending on the suffixes after the root. The ending does not affect the semantics at all; it only marks the word's relation to the other parts of the sentence.

In total, the Russian language has 70 prefixes, 20 of which are borrowed from Greek and Latin (ex-, post-, hyper-, etc.), while 50 are native Russian. Borrowed prefixes are not used with swear words, or are used so rarely that they can be neglected. Note also that none of the roots in question combines with every existing Russian prefix, which significantly reduces the number of prefix-root combinations. Many suffixes are eliminated in the same way.

Because words and phrases formed from obscene roots with different prefixes and suffixes differ so much in meaning, each expression must be considered separately. To find and replace powerful language constructs, we will use a powerful tool: regular expressions.

Part 2. Regular expressions

The current version of the extension contains about 100 regular expressions, many of them similar to each other, so I see no point in listing them all. I will look at only some of the patterns in detail, but that will be quite enough to illustrate the use of regular expressions in JS in general and my line of reasoning in this particular case. You can see the full set of regular expressions in my GitHub repository.

Introduction to regular expressions

Regular expressions in JavaScript are handled by a special RegExp object. The short (literal) form of a regular expression has the following syntax:

/pattern/flags

In the simplest case, there is a fixed sequence of characters between the slashes. For example:

/Колбаса/

This construct matches the substring “Колбаса” (“Sausage”) in the text. Important: case matters, so the string “колбаса” will not be found by this expression. To ignore the case of characters, use the i flag:

/Колбаса/i

This construct matches all of the strings “Колбаса”, “колбаса”, “КОЛБАСА” and “КоЛбАсА”.
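To make the difference tangible, here is a minimal sketch that runs both patterns over four spellings. The word “Колбаса” (sausage) and the sample list are assumptions made for illustration; they are not taken from the extension's source.

```javascript
// Case-sensitive versus case-insensitive matching of a literal substring.
// The word and the sample spellings are illustrative assumptions.
const caseSensitive = /Колбаса/;
const caseInsensitive = /Колбаса/i;

const samples = ["Колбаса", "колбаса", "КОЛБАСА", "КоЛбАсА"];

// Without the i flag only the exact spelling matches.
const strictHits = samples.filter((s) => caseSensitive.test(s));
// With the i flag every spelling matches.
const looseHits = samples.filter((s) => caseInsensitive.test(s));

console.log(strictHits.length); // 1
console.log(looseHits.length);  // 4
```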

We, of course, want every sentence in the processed text to begin with a capital letter and the words inside a sentence to begin with a lowercase one. Therefore, we apply both regular expressions to the text in sequence, replacing the matches with one of several synonymous and equally emotionally colored expressions. For example, take the word "Fear."

 text = text.replace(//g, randomWord(["", ""]));
 text = text.replace(//i, randomWord(["", ""]));

The replace() method of a string takes two arguments:

a regular expression or a substring: the pattern to search for in the text;
a string (or a function that returns a string) that replaces the matches: only the first one by default, or all of them when the regular expression has the g flag.
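The two-argument form can be sketched as follows. This is not the extension's code: the English word "fear", its synonyms and the helper matchCase() are hypothetical stand-ins. The sketch passes a function as the second argument, so a fresh synonym is picked for every match (a plain string argument is evaluated once and reused for all matches), and it restores the capital letter in a single pass instead of two.

```javascript
// Hypothetical synonyms, chosen only for illustration.
const SYNONYMS = ["horror", "nightmare"];

// Returns a random element of the array.
function randomWord(words) {
  return words[Math.floor(Math.random() * words.length)];
}

// Capitalizes the replacement when the matched word began with a capital letter.
function matchCase(original, replacement) {
  const first = original[0];
  const isCapitalized = first !== first.toLowerCase();
  return isCapitalized
    ? replacement[0].toUpperCase() + replacement.slice(1)
    : replacement;
}

function replaceFear(text) {
  // "g" replaces every match, "i" ignores case.
  // The replacer function is called once per match.
  return text.replace(/fear/gi, (match) => matchCase(match, randomWord(SYNONYMS)));
}

console.log(replaceFear("Fear is here. Such fear!"));

// Without the g flag only the first match is replaced.
console.log("fear, fear".replace(/fear/, "horror")); // "horror, fear"
```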

Now consider a more complex case, closer to our subject: a word with several forms that describe the same state, phenomenon or object.
We write:

 text = text.replace(/(||)/g, "  ");
 text = text.replace(/(||)/i, "  ");

A construct of the form (A|B)C matches the strings AC and BC. With the lines of code above, we first find all occurrences of the substrings “Ofiget”, “Afiget” and “Prefiget”, and then do the same while ignoring case.
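A small sketch of the alternation, assuming the Latin transliterations quoted in the text stand in for the real words; the pattern below is an illustration, not the extension's actual expression.

```javascript
// (O|A|Pre)figet matches "Ofiget", "Afiget" and "Prefiget".
// The transliterated words are stand-ins used only for illustration.
const text = "Ofiget, Afiget and Prefiget, but not figet or ofiget.";

// Case-sensitive: the lowercase "ofiget" and the bare "figet" are skipped.
const exact = text.match(/(O|A|Pre)figet/g);
console.log(exact); // ["Ofiget", "Afiget", "Prefiget"]

// Case-insensitive: "ofiget" is found as well.
const anyCase = text.match(/(O|A|Pre)figet/gi);
console.log(anyCase.length); // 4
```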

The (A|B) construct can be used any number of times, at any nesting level and in any part of the expression. Let's look at a slightly more complex example with multiple suffixes: the common adverb “bad”. This word has a huge number of suffixed derivatives: “hrenovasto”, “hrenovenko” and even “hrenovastenko”. When composing a regular expression for a ruder analogue, we would have to take into account the ё/е alternation in the root and the fact that “ё” is often written as “е”. We don't need that here, so let's write a regular expression that covers all these forms:

 /(|)(|)/

and make the appropriate replacement:

 text = text.replace(/(|)(|)/g, randomWord(["", "", "", "", ""]));
 text = text.replace(/(|)(|)/i, randomWord(["", "", "", "", ""]));

Similarly, but taking the possible endings into account, we handle the adjective “bad”, a negative value judgment about the quality of the object of speech:

 text = text.replace(/(|)(|)((|||)|(|||)||)/g, " ");
 text = text.replace(/(|)(|)((|||)|(|||)||)/i, " ");

By simple addition and multiplication, we can work out that the regular expression from the example

 /(|)(|)((|||)|(|||)||)/

can find 40 possible forms of the word: 2 × 2 × (4 + 4 + 1 + 1) = 40. With an obscene root, and therefore the ё/е alternation, that becomes 80.
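The count can be checked mechanically. The sketch below rebuilds an expression with the same shape, two alternatives, then two more, then a group made of two four-way subgroups and two single alternatives, and enumerates every string it accepts. All morphemes here are hypothetical Latin placeholders; only the structure mirrors the expression above.

```javascript
// Placeholder morphemes (hypothetical); only the group structure matters.
const roots = ["ra", "rb"];
const suffixes = ["sa", "sb"];
const endingsA = ["a1", "a2", "a3", "a4"];
const endingsB = ["b1", "b2", "b3", "b4"];
const singles = ["c1", "d1"];

// Wraps a list of alternatives into "(x|y|...)".
const group = (alts) => "(" + alts.join("|") + ")";

// (ra|rb)(sa|sb)((a1|a2|a3|a4)|(b1|b2|b3|b4)|c1|d1)
const source =
  group(roots) +
  group(suffixes) +
  "(" + [group(endingsA), group(endingsB), ...singles].join("|") + ")";
const pattern = new RegExp("^" + source + "$");

// Enumerate every combination of root, suffix and ending.
const forms = [];
for (const r of roots) {
  for (const s of suffixes) {
    for (const e of [...endingsA, ...endingsB, ...singles]) {
      forms.push(r + s + e);
    }
  }
}

// Multiplication across groups, addition inside the last group.
const count =
  roots.length * suffixes.length *
  (endingsA.length + endingsB.length + singles.length);

console.log(count);                                 // 40
console.log(forms.length);                          // 40
console.log(forms.every((f) => pattern.test(f)));   // true
```

Doubling the root alternatives to cover the ё/е spelling adds one more factor of 2, which gives 80.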

If we want to find the word “Балет” (“Ballet”) and replace it with “Opera” or “Musical”, but leave untouched the words that merely contain it, such as “Кордебалет” (“Corps de ballet”) and “Арбалет” (“Crossbow”), we should do it like this:

 text = text.replace(/(\s|^)/g, randomWord(["", " "]));
 text = text.replace(/(\s|^)/i, randomWord([" ", ""]));

The special character ^ marks the beginning of the input. Thus, we find the word "Ballet" when it comes after whitespace or at the very beginning of the searched string.
Surprisingly, the \b construct, which denotes a word boundary, does not work with Cyrillic, although it detects the boundaries of words written in Latin letters perfectly well.
The special character $, which marks the end of the input, may also be useful.
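The \b behaviour is easy to reproduce, because JavaScript treats only [A-Za-z0-9_] as word characters. The sketch below also shows one possible alternative to the (\s|^) prefix: Unicode property escapes with lookbehind and lookahead. This is a different technique from the one used in the article; it needs an engine with ES2018 support, and the helper names escapeRegExp() and replaceWholeWord() are hypothetical.

```javascript
// \b works for Latin letters...
console.log(/\bballet\b/.test("a ballet tonight"));  // true
// ...but Cyrillic letters are not "word characters" for \b.
console.log(/\bбалет\b/.test("наш балет сегодня"));  // false

// Escapes regex syntax characters in a literal word.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replaces `word` only when it is not surrounded by letters, digits or "_".
// Assumption: lookbehind and \p{...} are available (ES2018 and later).
function replaceWholeWord(text, word, replacement) {
  const pattern = new RegExp(
    "(?<![\\p{L}\\p{N}_])" + escapeRegExp(word) + "(?![\\p{L}\\p{N}_])",
    "gu"
  );
  return text.replace(pattern, replacement);
}

console.log(replaceWholeWord("балет, кордебалет, арбалет", "балет", "опера"));
// "опера, кордебалет, арбалет"
```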

Part 3. Google Chrome Extension

The extension will consist of three main parts:

the mandatory manifest file, manifest.json, which describes the main parameters of the extension;
the JavaScript file that does all the actual work;
icons in the sizes 128x128, 48x48 and 16x16.

The manifest is extremely simple.

 {
   "manifest_version": 2,
   "name": "   ",
   "version": "1.0",
   "icons": {
     "16": "icon32.png",
     "48": "icon128.png",
     "128": "icon128.png"
   },
   "description": "      .",
   "content_scripts": [
     {
       "matches": ["*://*/*"],
       "js": ["content_script.js"],
       "run_at": "document_end"
     }
   ]
 }

Everything you need to know about the syntax is best learned from the primary source.

The script starts as soon as the page's DOM is ready (we stated this explicitly in the manifest with "run_at": "document_end"). Its body consists of three functions:

walk(node) is a function that recursively traverses the nodes of the HTML document. If the node it receives contains text, it passes the node to makeItCultural();
makeItCultural(textNode) is a function that replaces the substrings matched by the regular expressions. When there are several replacement options, they are passed as an array to randomWord();
randomWord(words) is a function that takes an array and returns a random element of it.
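The three functions described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the extension's actual source: the RULES table with the English word "darn" is hypothetical, and skipping SCRIPT and STYLE elements is an extra precaution that the article does not mention.

```javascript
// Hypothetical replacement table; the real extension uses about 100 Russian patterns.
const RULES = [
  { pattern: /(^|\s)darn(?=\s|$)/gi, options: ["goodness", "gracious"] },
];

const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

// Returns a random element of the array.
function randomWord(words) {
  return words[Math.floor(Math.random() * words.length)];
}

// Applies every rule to the text of a single text node.
function makeItCultural(textNode) {
  let text = textNode.nodeValue;
  for (const { pattern, options } of RULES) {
    // The replacer keeps the leading whitespace and picks a synonym per match.
    text = text.replace(pattern, (match, lead) => lead + randomWord(options));
  }
  textNode.nodeValue = text;
}

// Recursively visits the node and all of its descendants.
function walk(node) {
  if (node.nodeType === TEXT_NODE) {
    makeItCultural(node);
    return;
  }
  // Assumption: code and styles should not be rewritten.
  if (
    node.nodeType === ELEMENT_NODE &&
    (node.nodeName === "SCRIPT" || node.nodeName === "STYLE")
  ) {
    return;
  }
  for (const child of Array.from(node.childNodes || [])) {
    walk(child);
  }
}

// In a content script, start from the page body.
if (typeof document !== "undefined" && document.body) {
  walk(document.body);
}
```

Because walk() relies only on nodeType, nodeName, nodeValue and childNodes, it can be exercised outside the browser with plain objects that mimic DOM nodes.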

On the chrome://extensions/ page, tick the “Developer mode” checkbox, click the “Load unpacked extension...” button and select the folder with our extension. Then we test and fix it, remembering to click the Reload link next to the extension after each edit.

To publish extensions in the Chrome Web Store, you need to pay a one-time fee of $5. The whole process is described perfectly well in another good article on Habr, so I see no reason to repeat it.

Conclusion

The resulting extension makes Artemy Lebedev's LiveJournal and Sergei Shnurov's interviews sparkle with new colors, but the Bolshoi Petrovsky Zagib is too tough a nut for it (I won't give you a link; google it yourself). The subject of Russian mat is infinitely deep, and more than a dozen academic works have been written on it. One day physicists will prove (or disprove) string theory, engineers will build personal quantum computers, and everyone will carry a gadget powered by a cold fusion reactor in their pocket, and Russian mat will still not be fully studied.

I tried to write as accessibly as possible, so that the article would be clear and not boring for all readers, but it turned out as usual: for primary and middle school students on heroin.

The finished extension can be installed from the Chrome Web Store, and the source code can be found on GitHub.

List of sources

Javascript.ru - Regular expressions;
Chrome Web Store - Google Developers;
Alexey Plutser-Sarno. The Big Dictionary of Mat, Volume I and Volume II.

P.S. Please do not use foul language in the comments. I do not want to become the cause of mass repressions on Habr.

Source: https://habr.com/ru/post/180241/
