What can you do with a person who answers every reproach and every call for literacy with a single argument: “I don't give a ***”?
- Cry of the soul
If you don't like it, don't listen, but don't stop me from lying!
- Proverb
1. Introduction
Quite recently, Habr was awash with topics calling for universal literacy and the use of spell-checkers. This seems to be a seasonal phenomenon, and the current lull is temporary. Some Habr users will always be sure that meaning is what matters in a text, not spelling; others will always be annoyed by constant errors that get in the way of that very meaning. Either way, the exhortations of the latter change little, and the most popular answers to calls for literacy can be read in the epigraph.
In this article I propose a partial solution to the literacy problem on Habr that gives each side what it wants: some will be able to write as they please, while others will read Habr without “stumbling” over mistakes. Unlike other methods (“read textbooks”), the approach described here is directly related to IT and takes less than a minute to master (if you go straight to the “Conclusion”).
2. Motivation
The reason for writing this article is my “intuitive” literacy, thanks to which I, like many people, am irritated by texts full of errors. Moreover, I have recently begun to notice that my sense of language, acquired in childhood at the cost of several diopters, is starting to evaporate from lack of use, and without any corresponding thinning of my glasses.
The second, more prosaic and technical reason was my long-standing wish to get to grips with regular expressions. Studying something without an interesting task is simply boring, so the idea came to me to write HabraChist :)
3. Approach
3.1. Technical side
HabraChist is a JavaScript script with a set of rules for detecting and fixing the most common errors. The script requires Greasemonkey, an extension for Firefox. The process of installing and using Greasemonkey has already been described on Habr. If desired, the script can be adapted for Opera.
After the page loads, HabraChist applies each rule in turn to the titles and texts of topics, as well as to the comments. The rules are hard-coded in the script as regular expressions and form a set of pairs of the form
"(ж|ш)ы" -> "$1и"
The left side is the regular expression used to find the error. The right side describes what to replace the match with. For example, the rule above corresponds to the school rule "write жи and ши with the letter и". The contents of the parentheses in the match, that is, the letter "ж" or "ш", are substituted for "$1". Case does not matter here: whichever form, lower-case or capital, appears in the source text is the one inserted. When the first letter of the original word and of its replacement differ (for example, “schaz” → “now”), the rule has to be duplicated for words beginning with a capital letter (“Schaz” → “Now”).
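The rule pairs described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the names `rules` and `applyRules` are hypothetical, the first pattern assumes the Cyrillic form of the school rule, and the “schaz” pair simply reuses the example from the text.

```javascript
// Hypothetical rule table: each entry pairs a search pattern with a replacement.
// The "g" flag replaces every match; "i" makes the match case-insensitive.
const rules = [
  // School rule: "жи" and "ши" are written with "и". "$1" re-inserts the
  // captured consonant exactly as it appeared, so its case is preserved.
  [/(ж|ш)ы/gi, "$1и"],
  // The first letter changes here, so the capitalized form gets its own rule.
  [/schaz/g, "now"],
  [/Schaz/g, "Now"],
];

// Apply every rule in turn to a piece of text and return the result.
function applyRules(text) {
  return rules.reduce(
    (current, [pattern, replacement]) => current.replace(pattern, replacement),
    text
  );
}

console.log(applyRules("Жывой шыпит")); // Живой шипит
console.log(applyRules("Schaz or schaz")); // Now or now
```

In a userscript the same function would presumably be called on the text of each title, topic body, and comment, rather than on literal strings.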
3.2. Data collection
The main difficulty of the task was finding the most common mistakes. At first I worked my way up the Habr rating from the bottom, reading comments. It turned out, however, that most Habr users with an “alternative” standing are enviably literate. So I had to change tactics: taking some arbitrary misspelled word, such as a misspelling of “it seems”, I went to Google to look for comments by Habr users who prefer that particular spelling. Naturally, their texts often contained other words in non-traditional spelling as well. I also learned (not without surprise) of some interesting mistakes from the FAQ “Spelling in Russian”.
The rule base was replenished throughout the entire testing period.
3.3. Restrictions
Of course, the proposed approach can only catch simple errors that can be found by string search. More sophisticated checks, such as asking “what does it do?” or “what to do?” to choose between the verb endings -тся and -ться, are beyond the capabilities of regular expressions (although specific rules for individual words, of the form “go to → go”, are always applied). Many errors cannot be identified unambiguously, for example whether “not” is written together with the following word or separately (compare “ugly interface” with “the interface is not beautiful, but terrible”). Within this class, however, there are errors that, although ambiguous, occur far more often as misspellings than as correct usage. For example, the misspelling of “by the way” with a space (“By the way, I wanted to tell you ...”) is very common, although the space can occasionally be justified (“By the way, the prince does not find fault”). In such cases I assumed that a hundred errors eliminated are worth a couple introduced.
The prevalence of an error was estimated from Google results and, in some cases, from Yandex. This method is not very accurate, because a single mistake in an article's title is counted by Google once for every comment. Unfortunately, the popularity of some errors cannot be estimated with search engines at all, although experience suggests they are widespread. These include, in particular, the omission of the soft sign in second-person present-tense verbs (“kachaesh”, “slushaesh”), since Google does not allow searching by a mask such as “*esh”.
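For queries where the search engine returns mostly correct spellings, footnote 2 under the results table describes scaling the total hit count by the share of misspellings seen on the first results page. A minimal sketch of that arithmetic follows; the function name and the hit count of 50,000 are made up purely for illustration.

```javascript
// Estimate the number of misspellings behind a search query from a manual
// check of the first results page. All names here are illustrative.
function estimateErrors(totalHits, misspelledOnFirstPage, firstPageSize = 10) {
  // Share of misspelled results on the first page, applied to all hits.
  return Math.round((totalHits * misspelledOnFirstPage) / firstPageSize);
}

// 50000 is an invented hit count, chosen only to show the arithmetic:
// one misspelling among ten results gives a 10% error rate.
const estimate = estimateErrors(50000, 1);
console.log(estimate); // 5000
```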
Another limitation of the method stems from the approach to data collection: many errors were simply never found. With your help this can be fixed quickly: post in the comments the errors that bother you on Habr, ideally already in the form of regular expressions.
4. Result
HabraChist can be downloaded from userscripts.org. The basic set of 157 rules corrects more than 70 thousand errors in different categories: grammar, slang, Olbanian, profanity, etc.
Below are the top 10 grammatical errors on Habr.
Mistake | Correct | Count 1) |
---|---|---|
All the same, all the same | all the same | ~ 5000 2) |
Right now, schA | now | ~ 4000 |
And | and | ~ 3100 |
I don't know, I don't want to, I can't ... | I don't know, I don't want, I can't ... | ~ 3000 |
Something, something, somehow, like that ... | something, somehow ... | ~ 2900 2) |
Flash | flash 3) | ~ 2800 |
Hardly, vryatli vryat | hardly | ~ 2400 |
Not right, not right ... 4) | wrong, wrong ... | ~ 2300 |
* ka (well ka, give ka ...) | * -ka (come on, give me ...) | ~ 1800 2) |
Was not, was not, was not | was not, was not | ~ 1700 |
1) The figures are approximate, because they changed by hundreds within two weeks.
2) Since Google mostly returns the correct spelling for the query "all the same", the number of errors was estimated as follows. If one of the ten results on the first page is misspelled, we take the average error rate to be 10%, and multiplying the total number of results by 0.1 gives an estimate of the number of errors.
3) Anticipating objections, I refer to the source: the Gramota.ru Information Bureau.
4) The space is required in these cases, but on Habr it is most often used where it does not belong (see "Restrictions").
4.1. Testing
Testing the script and updating the rule set took two weeks and included reading both the main page and the most remote corners of Habr. During testing, the script left the original version in brackets after each correction, so that both the average number of errors and the correctness of the script could be assessed visually. All the problems noticed so far have been fixed; the rest will be dealt with as feedback arrives.
You are surely interested in how many dozens of regular expressions perform when processing hundreds of comments. The results are shown in the table.
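One simple way to get the test-mode behaviour described above, with the original left in brackets after each correction, is the `$&` token, which in a JavaScript replacement string stands for the whole matched substring. The sketch below is an assumption about how this could be done, not the script's actual code; `withOriginal` is a hypothetical helper and the rule reuses the “schaz” example from section 3.1.

```javascript
// Turn a normal replacement string into a "test mode" one that appends the
// original match in brackets. "$&" is expanded by String.prototype.replace.
function withOriginal(replacement) {
  return replacement + " ($&)";
}

// Illustrative rule, not taken from the script.
const pattern = /schaz/g;
const checked = "schaz, I will come".replace(pattern, withOriginal("now"));
console.log(checked); // now (schaz), I will come
```

Because only the replacement string changes, the same rule table can serve both the test mode and the normal mode.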
Place of use | Time 1) |
---|---|
Main page (Zakhabrennye) | 1 s |
Topic with 142 comments | 2 s |
Topic with 385 comments | 5 s |
1) Testing was carried out on a four-year-old laptop (Pentium M 1.6 GHz, 1 GB of RAM); the time was measured by hand and rounded up to the next whole second.
Considering that most topics have fewer than a hundred comments, and that the average Habr user reads about one or two lines per second, the script's performance can be considered acceptable.
4.2. Demonstration
To demonstrate the capabilities of the script, below is a short text packed with the errors most common on Habr. Compare the results with HabraChist enabled and disabled.
Attention! The text was composed solely for testing HabraChist and may not coincide with the author's opinion.
Caution! The following text can harm a healthy psyche (I almost went crazy while writing it :)
Huyase! What is happening, people! Fs terms are not written right.
To mine, what any topic, there will be some. For example, what kind of nebuladag will throw out a huge amount of bukaf without checking - it’s not clear Nicky, my mosque is too many wrongful bukof. I am silent about some hellish dalpapes writing chushas of the type “in my mouth my feet !!! 111dinadin” in kamentah - such people need to drive nakuy. It seems to me, if you write so much - so we can look for a minute to check, all the better it will be!
Hez, I can't understand it.
And I will not. Let them write as they want, and I will read ya as ya want :)
PS By the way, here's a try to write with aspishkami, it’s a complete pistet!
PPS Nitsche personal :) Sorry if cho netak.
5. Conclusion
This article presented HabraChist, a Greasemonkey script that fixes grammatical errors on Habr on the fly. The script uses a set of rules based on regular expressions. By approximate calculations, HabraChist corrects more than 70 thousand spelling errors. Further development of the project depends entirely on your comments.
Acknowledgments
License. The script is completely free for non-commercial use, modification, and improvement. When using rules from the script, please credit their author (that is, me, YasonBy).
Excuse. As mentioned at the very beginning of the article, I am no expert in regular expressions (nor in JavaScript, for that matter).
Constructive criticism and tips for improving HabraChist will be greatly appreciated.
UPD: Many Habr users criticize the script for restricting their freedom of expression. Yet most of them probably use banner blockers and ad filters without worrying at all about the advertiser's freedom to show them his goods ...
UPD: HabraChist also works with Safari + GreaseKit (as suggested by XuMiX) and with Opera (Tools > Settings > Advanced > Content > JavaScript Settings > JavaScript User Files: specify the folder containing the js file; thanks to tequibo for the how-to).