
Plagiarism search system

Foreword


At one time I was lucky enough to land all sorts of strange jobs. For example, I almost became the sysadmin of a synagogue. The only thing that stopped me was a premonition that they would make me work there on Saturdays like the last goy.

Another job was curious too. The firm wrote essays and term papers for American students who could not be bothered to write them themselves. Later I learned that this is a fairly common and profitable business, one that even has its own name, "paper mill", but at the time this way of earning a living struck me as completely surreal. Still, it has to be said that the work offered plenty of interesting problems, among them the most difficult and cunning one of my whole career, the kind you can proudly tell your children about.

Its statement was very simple. The paper writers were remote workers, very often Arabs and Africans for whom English was not a native language, and they were no less lazy than the students themselves. They often took the path of least resistance: instead of writing an original piece, they simply lifted it from the Internet, in whole or in part. Accordingly, the task was to find the source (or sources), compare them, somehow determine the percentage of plagiarism, and pass the collected evidence along so the negligent could be caught.
The job was made somewhat easier by the language of the papers: it was exclusively English, with no cases or complex inflectional forms. And it was made much harder by the fact that it was entirely unclear from which end to approach the whole business.

Perl was chosen as the implementation language, and that turned out very well. It would have been impossible to solve this problem in a statically compiled language, with their rigidity and slow edit-run cycle: you can rewrite a finished solution in one, but you cannot arrive at a solution through endless experiments. Plus, of course, the heap of excellent ready-made libraries.

Dead ends


Initially, poking at the task was entrusted to a part-time student. He did not overthink it: if you need to search the Internet, you need a search engine. We feed it the whole text, Google finds where it came from, then we fetch the found sources and compare them with pieces of the original text.

Of course, nothing came of it.

Firstly, if you send Google the entire text, it searches very poorly. After all, what it stores are indexes, in which the number of adjacent words is inevitably limited.

Secondly, it quickly became clear that Google does not at all like being queried by a program from a single address. Until then I had thought the phrase "Did Google ban you?" was just a joke. Nothing of the kind: after a certain number of requests Google really does ban you, serving up a rather complicated captcha.

And the very idea of scraping the HTML is not a great one: the program can break at any moment, as soon as Google decides to slightly improve the layout of its results page.

The student decided to cover his tracks and query the search engine through open proxies: find a list on the Internet and cycle through them. In practice half of those proxies did not work at all, and the other half were shamelessly slow, so the whole thing went nowhere.

And thirdly, searching for pieces of text by character-by-character comparison turned out to be unbearably slow and completely impractical. On top of that it was also useless, since the Kenyan writers were cunning enough not to copy texts verbatim but to slightly change the wording here and there.

I had to start by reading the specialized literature. Since the task turned out to be a marginal one, it was not covered in any textbook or any solid book. All I found was a pile of scientific articles on narrow questions and one survey thesis by a Czech author. Alas, it reached me too late: by then I already knew all the methods described in it.

Digressing for a moment, I cannot help noting that almost all scientific articles published in reputable journals are (a) hard to get hold of and (b) fairly useless. The sites where they are stored, the ones a search engine returns first, are always paywalled and rather steep, usually close to ten dollars per paper. However, if you look around a bit, you can as a rule find the same article freely available. If that fails too, you can try writing to the author, who is usually kind enough to send a copy (from which I conclude that the authors themselves get little from the established system, and the income goes to someone else).

There is usually little practical use in any particular article, though. With rare exceptions, they contain nothing you could use to sit down and immediately sketch out an algorithm. You get either abstract ideas with no hint of how to implement them; or a heap of mathematical formulas, from which you eventually realize the same thing could have been said in two lines of plain language; or the results of the authors' experiments, with the invariable commentary "not everything is clear yet, further research is needed". I do not know whether these articles are written just to tick a box, or rather for some internal scientific rituals, or whether the authors simply begrudge sharing real findings that could be put to good use in their own startup. In any case, the erosion of science is evident.

By the way, the largest and most famous plagiarism detection site is called Turnitin. It is practically a monopolist in this area. Its inner workings are classified no worse than a military base: I have not found a single article, not even a short note, describing even in the most general terms what algorithms it uses. Pure mystery.

But enough lyrical digressions; back to the dead ends, this time my own.

The idea of document fingerprinting did not pay off. In theory it looked good: for each document downloaded from the Internet you compute its fingerprint, some long number that somehow reflects the content. The plan was to build a database storing, instead of the documents themselves, just the URL and the fingerprint; comparing the source text against that database would then be enough to immediately find the suspects. It does not work: the shorter the fingerprints, the worse the comparison, and by the time they reach half the length of the source it becomes pointless to store them at all. Add to that the changes authors make to fool detection. And add the sheer size of the Internet: storing even the shortest fingerprints very quickly becomes burdensome because of the huge volume of data.
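
To make the idea concrete, here is a minimal sketch of one common kind of fingerprint (shingle hashes with a "keep the N smallest" selection). It illustrates the general approach only; it is not the code that was actually tried, and the window and print sizes are made up.

    # Hypothetical sketch of a shingle-based fingerprint: hash overlapping
    # word windows and keep only the N smallest hashes as the "print".
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    sub fingerprint {
        my ($text, $window, $keep) = @_;
        my @words  = split /\s+/, lc $text;
        my @hashes;
        for my $i (0 .. $#words - $window + 1) {
            my $shingle = join ' ', @words[$i .. $i + $window - 1];
            push @hashes, md5_hex($shingle);
        }
        # Keep the N lexicographically smallest hashes.
        my $n = $keep < @hashes ? $keep : scalar @hashes;
        my @kept = (sort @hashes)[0 .. $n - 1];
        return \@kept;
    }

    my $fp = fingerprint('some long document text goes here and here', 4, 8);
    print "$_\n" for @$fp;

The dilemma described above is visible even here: the fewer hashes you keep, the cheaper the storage and the blinder the comparison.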

Parsing and normalization


This stage seems banal and uninteresting at first: obviously the input will be text in MS Word format rather than a plain text file, and it has to be parsed and broken down into sentences and words. In fact, this is where a huge reserve of quality lies, one that outweighs any clever algorithms. It is like book OCR: if the original is scanned crookedly and smeared with ink, no later tricks will fix it.

By the way, parsing and normalization are needed not only for the source text but also for every page found on the Internet, so besides quality, speed matters too.

So, we receive a document in one of the common formats. Most are easy to parse: HTML, for example, reads perfectly well with HTML::Parser, and PDF and PS can be handled by calling an external program like pstotext. Parsing OpenOffice documents is pure pleasure; you can even bolt on XSLT if you enjoy that kind of perversion. The overall picture is spoiled only by the wretched Word: a more bastardly text format is hard to find, hellishly complex to parse and devoid of any internal structure. For details, see my previous article. If it were up to me I would never deal with it at all, but it is far more common than all the other formats combined. Whether this is Gresham's law in action or the machinations of world evil, I do not know. If God is all-good, why does everyone write in Word format?
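
For illustration, a minimal sketch of that kind of format dispatch, assuming pstotext is installed and using HTML::Parser's text handler; the real parser handled far more formats and corner cases (Word above all).

    # A sketch of format dispatch, assuming pstotext is on the PATH; the
    # real parser handled many more formats and corner cases.
    use strict;
    use warnings;
    use HTML::Parser;

    sub extract_text {
        my ($path) = @_;

        if ($path =~ /\.(?:pdf|ps)$/i) {
            # Shell out to an external converter.
            open my $fh, '-|', 'pstotext', $path or die "pstotext failed: $!";
            local $/;
            return scalar <$fh>;
        }

        if ($path =~ /\.html?$/i) {
            my $text = '';
            HTML::Parser->new(
                api_version => 3,
                text_h      => [ sub { $text .= shift() . ' ' }, 'dtext' ],
            )->parse_file($path);
            return $text;
        }

        # Fall back to treating the file as plain text.
        open my $fh, '<', $path or die "cannot open $path: $!";
        local $/;
        return scalar <$fh>;
    }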

In the process of parsing, if the format is a sane one, you can learn all sorts of useful things from the text: for example, find the document's table of contents and exclude it from comparison (there is nothing useful in it anyway). The same goes for tables (short lines in table cells produce a lot of false positives). You can detect chapter headings, throw out pictures, and mark Internet addresses. For web pages it makes sense to exclude sidebars and footers if they are marked up in the file (HTML5 allows this).

Oh, and there may also be archives, which have to be unpacked so that every file inside can be processed. The main thing is not to confuse an archive with some complex packed format like OOXML.

Having obtained plain text, we can keep working on it. It pays to discard the title page and the boilerplate that universities require ("Work by student so-and-so", "Checked by Professor Such-and-such"). The same goes for the list of references. It is not so easy to find, since it goes by at least a dozen names ("References", "List of references", "Works Cited", "Bibliography" and so on), and it may not be labeled at all. It is best simply to throw it out of the text, because the list badly disrupts recognition while creating a considerable load.

The resulting text should be normalized, that is, cleaned up and brought to a unified form. The first thing to do is find all the Cyrillic and Greek letters whose shapes resemble the corresponding English ones. Cunning authors insert them into the text on purpose to fool plagiarism checks. Tough luck: such a trick is one-hundred-percent proof and a reason to throw the author out on his ear.

Then all the common contracted forms like can't are expanded into their full equivalents (cannot).

Next, all the typographically fancy Unicode characters are turned into plain ones: guillemets, curly quotes, long and medium dashes, apostrophes, ellipses, and the ff, ffi, st ligatures and the like. Two apostrophes in a row are replaced with a normal quote (for some reason this occurs very often), and two dashes with one. Every run of whitespace characters (and there is a whole zoo of them) becomes a single ordinary space. After that, everything outside the ASCII range is thrown out of the text, and finally all control characters are removed except the usual line feed.
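
A rough sketch of what this normalization step might look like; the character tables here are deliberately tiny and illustrative, nowhere near the full production set.

    # A sketch of the normalization step; the substitution tables are
    # abridged for illustration.
    use strict;
    use warnings;
    use utf8;    # this source file contains literal Unicode characters

    sub normalize {
        my ($text) = @_;

        # Map a few Cyrillic/Greek lookalikes onto their Latin counterparts.
        my %lookalike = (
            'а' => 'a', 'е' => 'e', 'о' => 'o', 'р' => 'p', 'с' => 'c',
            'х' => 'x', 'ο' => 'o', 'α' => 'a',
        );
        my $class = join '', keys %lookalike;
        $text =~ s/([$class])/$lookalike{$1}/g;

        # Typographic Unicode characters -> plain ASCII.
        $text =~ s/[“”«»]/"/g;
        $text =~ s/[‘’]/'/g;
        $text =~ s/[–—]/-/g;
        $text =~ s/…/.../g;
        $text =~ s/ﬀ/ff/g;
        $text =~ s/ﬁ/fi/g;

        $text =~ s/''/"/g;       # two apostrophes -> a quote
        $text =~ s/--/-/g;       # double dash -> single
        $text =~ s/\s+/ /g;      # collapse runs of whitespace

        # Drop everything left outside printable ASCII.
        $text =~ tr/\x20-\x7e//cd;

        return $text;
    }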

Now the text is ready for comparison.

Next we break it into sentences. This is not as easy as it seems at first glance; in natural language processing everything only looks easy at the start. Sentences may end with a period, an ellipsis, an exclamation mark or a question mark, or they may not end with anything at all (at the end of a paragraph).

Plus, periods can follow all sorts of abbreviations that are not the end of a sentence at all. The full list takes up half a page: Dr., Mr., Mrs., Ms., Inc., vol., et al., pp. and so on and so forth. And then there are Internet links: it is fine when there is a protocol prefix at the start, but it is not always there. For example, an article may discuss various retail chains and constantly mention Amazon.com. So you also need to know all the domains, a dozen major ones plus a couple of hundred country domains.

And you lose accuracy anyway, since the whole process becomes probabilistic: any particular period may or may not be the end of a sentence.

The original version of the sentence splitter was written head-on: regular expressions found all the "wrong" periods and replaced them with other characters, the text was split into sentences on the remaining ones, and then the period characters were put back.

Then I felt ashamed that I was not using the advanced methods developed by modern science, so I started exploring other options. I found a fragment in Java and spent a couple of geological eras deciphering it (what a boring, monotonous and verbose language). I found Python's NLTK. But most of all I liked a paper by Dan Gillick ("Improved Sentence Boundary Detection"), in which he boasted that his method completely outperformed all the others. The method was based on Bayesian probabilities and required prior training. On the texts I trained it on, it was excellent; on the rest... well, not exactly bad, but not much better than that shameful version with the list of abbreviations. In the end I went back to it.
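
For the curious, here is a reconstruction of roughly what that "shameful" abbreviation-list splitter looks like. It is a sketch written for this article, not the original code, and the abbreviation and domain lists are heavily abridged.

    # A reconstruction of the "shameful" splitter: protect dots that are not
    # sentence ends, split, then put them back.
    use strict;
    use warnings;

    my @abbrevs = qw(Dr Mr Mrs Ms Inc vol pp et al);

    sub split_sentences {
        my ($text) = @_;
        my $MARK = "\x{1}";    # placeholder unlikely to occur in real text

        # Protect dots after known abbreviations.
        for my $abbr (@abbrevs) {
            $text =~ s/\b\Q$abbr\E\./$abbr$MARK/g;
        }
        # Protect dots inside bare domain names like Amazon.com.
        $text =~ s/\.(com|org|net|edu|gov)\b/$MARK$1/g;

        # Split on sentence-final punctuation followed by whitespace.
        my @sentences = split /(?<=[.!?])\s+/, $text;

        # Restore the protected dots.
        s/$MARK/./g for @sentences;
        return @sentences;
    }

    print "$_\n" for split_sentences(
        'Dr. Smith bought it on Amazon.com. Was it cheap? Very!'
    );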

Web search


So, now we have the text, and we need to get Google to work for us, looking for pieces of it scattered across the Internet. The ordinary search obviously cannot be used, so how? Through the Google API, of course. That solved everything: the terms there are much more liberal, the interface for programs is convenient and stable, and there is no HTML parsing. The number of requests per day was formally limited, but in practice Google did not enforce it, as long as you did not push your luck by sending requests in the millions.

Now another question: which pieces of the text to send. Google stores some information about the distance between words. Experiments showed that runs of 8 words give the best results, so the final algorithm boiled down to slicing the text into overlapping 8-word runs and sending them off as queries.
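
A minimal sketch of that slicing; the overlap size below is a guess rather than the production value, and the call to the search API itself is omitted.

    # Slice normalized text into overlapping runs of words.
    use strict;
    use warnings;

    sub make_queries {
        my ($text, $len, $overlap) = @_;
        my @words = grep { length } split /\s+/, $text;
        my $step  = $len - $overlap;
        $step = 1 if $step < 1;

        my @queries;
        for (my $i = 0; $i <= $#words; $i += $step) {
            my $end = $i + $len - 1;
            $end = $#words if $end > $#words;
            push @queries, join ' ', @words[$i .. $end];
            last if $end == $#words;
        }
        return @queries;
    }

    my @q = make_queries('one two three four five six seven eight nine ten', 8, 2);
    print "$_\n" for @q;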



The algorithm worked fine until Google caught on and shut down the freebie. The API remained, even improved, but the company began charging for it, and rather a lot: $4 per 1000 requests. I had to look at the alternatives, of which there were exactly two: Bing and Yahoo. Bing was free, but that is where its virtues ended. It searched noticeably worse than Google. Maybe Google is the new Evil Corporation, but their search engine is still the best in the world. Bing, moreover, searched worse even than itself: through the API it found one and a half times fewer links than through the user interface. It also had the ugly habit of failing a share of requests with an error, so they had to be retried; apparently that was how Microsoft throttled the flow of requests. On top of that, the number of words in the query had to be reduced to five, stop words had to be kept in, and the overlap had to be cut to a single word.

Yahoo was somewhere in the middle between Google and Bing, both in price and in search quality.

Along the way there was another little idea. The head of the department discovered a project that crawled the contents of the entire Internet every day and put it up somewhere on Amazon. We would only have to take the data from there, index it in our own full-text database, and then search it for what we needed. In effect, write our own Google, just without the spider. As you can guess, that was completely unrealistic.

Searching the local database


One of Turnitin's strengths is its popularity. Many works are submitted there: students send theirs, teachers send their students', and the search base keeps growing. As a result, they can find stolen goods not only on the Internet but also in last year's coursework.

We took the same path and built another, local database: completed orders, plus the materials customers attached to their requests ("Here is the article you need to write an essay about"). Writers, as it turned out, love recycling their own previous work.

All this stuff lived in a KinoSearch full-text index (the library has since been renamed Lucy). The indexer ran on a separate machine. KinoSearch proved itself well: even with hundreds of thousands of documents it searched quickly and accurately. Its only drawback was that adding fields to the index or upgrading the library meant reindexing everything from scratch, which took several weeks.
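
For a flavor of this part, here is a sketch of indexing and querying via Lucy::Simple, more or less as its documentation shows; the index path and field names are invented.

    # A sketch of a local full-text index with Lucy::Simple.
    use strict;
    use warnings;
    use Lucy::Simple;

    my $index = Lucy::Simple->new(
        path     => '/var/plagiarism/index',
        language => 'en',
    );

    # Index a finished order.
    $index->add_doc({
        order_id => '12345',
        content  => 'full normalized text of the paper goes here',
    });

    # Look for a suspicious sentence among the stored orders.
    my $hits = $index->search( query => 'suspicious sentence text', num_wanted => 10 );
    while ( my $hit = $index->next ) {
        print "$hit->{order_id}\n";
    }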

Comparison


Now for the juiciest part, the one without which everything else is pointless. We need two checks. First, compare two texts and determine whether one contains pieces of the other; if there are no such pieces, we can stop there and save computing power. If there are, a more complex and heavier algorithm kicks in, one that looks for similar sentences.

Initially, documents were compared using shingles: overlapping chunks of normalized text, each reduced to a checksum that is then used for comparison. The algorithm was implemented and even worked in the first version, but it turned out worse than the vector-space approach. The idea of shingles did, however, unexpectedly come in handy for the web search, but I have already written about that.

So, we compute a certain coefficient of similarity between documents. The algorithm is the same one search engines use. I will present it in a plain, homespun way; the scientific description can be found in the textbook (Manning C., Raghavan P., Schütze H. Introduction to Information Retrieval; Russian edition: Williams, 2011). I hope I do not mix anything up, which is entirely possible: this is the most complicated part of the system, and it kept changing.

So, we take all the words from both texts, reduce each word to its stem, throw out duplicates, and build a giant matrix. Its columns are the stems, and it has only two rows: the first text and the second. At each intersection we put a number: how many times that stem occurred in that text.

The model is quite simple; it is called "bag of words" because it ignores word order. But that suits us fine, because plagiarists often swap words around as they rephrase what they have copied.

Reducing a word to its stem is called stemming in linguistic jargon. I did it with the Snowball library: fast and trouble-free. Stemming improves plagiarism detection, because cunning authors do not simply copy someone else's text; they modify it cosmetically, often turning one part of speech into another.

So we have obtained a matrix of stems that describes a huge multidimensional vector space. Now we treat our texts as two vectors in this space and compute the cosine of the angle between them (via the dot product). That is our measure of similarity between the texts.

Simple, elegant, and correct in most cases. It only works poorly when one text is much larger than the other.

It was found experimentally that texts with a similarity coefficient below 0.4 can be ignored. Later, though, after the support service complained about a couple of undetected matching sentences, the threshold had to be lowered to 0.2, which made it rather useless (damn that Zipf distribution).

A few words about implementation. Since the same source text is compared over and over, it makes sense to prepare its list of stems and their counts in advance. That way a quarter of the matrix is ready up front.

To multiply the vectors I first used PDL (what else?), but then, chasing speed, I noticed that the vectors are very sparse and wrote my own implementation based on Perl hashes.
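
Roughly speaking, the hash-based version boils down to something like the following sketch (not the production code):

    # Sparse cosine similarity over hash-based "bag of words" vectors.
    use strict;
    use warnings;

    # Turn a list of stems into a sparse vector: stem => count.
    sub bag_of_words {
        my %vec;
        $vec{$_}++ for @_;
        return \%vec;
    }

    sub cosine {
        my ($u, $v) = @_;
        # Iterate over the smaller hash: only shared keys contribute.
        ($u, $v) = ($v, $u) if keys %$v < keys %$u;

        my $dot = 0;
        for my $k (keys %$u) {
            $dot += $u->{$k} * $v->{$k} if exists $v->{$k};
        }
        my ($nu, $nv) = (0, 0);
        $nu += $_ ** 2 for values %$u;
        $nv += $_ ** 2 for values %$v;

        return 0 unless $nu && $nv;
        return $dot / sqrt($nu * $nv);
    }

    my $v1 = bag_of_words(qw(student copy text from internet));
    my $v2 = bag_of_words(qw(student copy text from book));
    printf "similarity: %.2f\n", cosine($v1, $v2);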

Now we need a similarity coefficient between sentences. There are two options, and both are variations on the same vector-space theme.

The simple way: take the words of both sentences, build a vector space out of them, and compute the angle. The only thing is that there is no point even trying to account for the number of occurrences of each word; words are repeated within a single sentence far too rarely.

But it can be done more cleverly, by applying the classic tf-idf scheme from the book, except that the collection of documents is replaced by the collection of sentences from both texts, with the sentences playing the role of documents. We take the common vector space of both texts (already obtained when computing the similarity of the texts themselves), build the vectors, and replace the raw counts with tf-idf weights, where idf = ln(number of sentences / number of sentences containing the term). The results then become better: not radically, but noticeably.
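
A sketch of that sentence-level weighting, reusing the stem => count hashes from the previous sketch; the idf formula is the standard textbook one.

    # Sentence-level tf-idf: the "collection" is the set of sentences from
    # both texts, each sentence represented as a stem => count hash.
    use strict;
    use warnings;

    sub tfidf_weight {
        my (@sentences) = @_;
        my $n = scalar @sentences;

        # Document frequency: in how many sentences does each stem occur?
        my %df;
        for my $s (@sentences) {
            $df{$_}++ for keys %$s;
        }

        # Replace raw counts with tf * idf.
        for my $s (@sentences) {
            $s->{$_} *= log( $n / $df{$_} ) for keys %$s;
        }
        return @sentences;
    }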

If the similarity between two sentences exceeds a certain threshold, we write the pair to the database, so that the plagiarists can later have their noses rubbed in the matches.

Oh, and if a sentence contains only one word, we do not even bother comparing it with anything: it is useless, the algorithm does not work on such stubs.

If the similarity coefficient is above 0.6, no fortune-teller is needed: it is a reworded copy. If it is below 0.4, the similarity is coincidental or absent. In between lies a gray zone: it may be plagiarism, or it may be a coincidence where, to a human eye, the texts have nothing in common.

Then another algorithm comes into play, which I took from a good paper (Yuhua Li, Zuhair Bandar, David McLean and James O'Shea, "A Method for Measuring Sentence Similarity"). Here the heavy artillery rolls in: linguistic features. The algorithm needs to take into account irregular inflected forms, relationships between words such as synonymy and hypernymy, and word rarity. All of this has to be available in machine-readable form. Fortunately, good people at Princeton University have long maintained a lexical database for English called Wordnet. There is also a ready-made module on CPAN for reading it. The only thing I did was move the data from the text files in which Princeton ships it into MySQL tables and, of course, rewrite the module: reading a heap of text files is neither convenient nor fast, and storing links as byte offsets into a file can hardly be called elegant.
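
As a taste of what working with WordNet looks like from Perl, here is a heavily simplified sketch that merely checks whether two words share a synset via WordNet::QueryData. The actual algorithm from the paper combines several such signals (hypernymy, word rarity, word order) and is far more involved; this is only an illustration and assumes the WordNet data files are installed.

    # Crude "are these words related?" check via WordNet synsets.
    use strict;
    use warnings;
    use WordNet::QueryData;

    my $wn = WordNet::QueryData->new;    # assumes WNHOME points at WordNet

    sub synonyms {
        my ($word) = @_;
        my %syn;
        for my $pos ($wn->querySense($word)) {               # e.g. "car#n"
            for my $sense ($wn->querySense($pos)) {          # e.g. "car#n#1"
                for my $w ($wn->querySense($sense, 'syns')) {
                    $w =~ s/#.*$//;                          # strip "#n#1"
                    $syn{ lc $w } = 1;
                }
            }
        }
        return \%syn;
    }

    sub related {
        my ($w1, $w2) = @_;
        return 1 if lc $w1 eq lc $w2;
        return exists synonyms($w1)->{ lc $w2 } ? 1 : 0;
    }

    printf "car ~ automobile: %d\n", related('car', 'automobile');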

Second version


Hmm... second. Where is the first, then? Well, there is nothing to tell about the first. It took the text and performed all the steps of the algorithm in sequence: normalized, searched, compared, and produced the result. Accordingly, it could not do anything in parallel and was slow.

So all the work after the first version was aimed at one thing: faster, faster, faster.

Since most of the time is spent fetching links and data from the Internet, network access is the first candidate for optimization. Sequential fetching was replaced with parallel downloading (LWP gave way to asynchronous Curl). The speed, of course, grew fantastically. Not even glitches in the module could spoil the joy, such as when it took 100 requests, completed 99, and hung on the last one forever.

The overall architecture of the new system was modeled on an operating system. A control module starts child processes and allocates each a "quantum" of time (5 minutes). Within that time a process must read from the database where it stopped last time, perform the next action, write its continuation state back to the database, and exit. Five minutes is enough for any operation except downloading and comparing links, so that step was broken into parts of 100 or 200 links at a time. After five minutes the dispatcher interrupts execution no matter what. Did not finish? You will try again next time.
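
The quantum idea itself is trivial to sketch; the callbacks below (load_state, do_chunk, save_state) are hypothetical stand-ins for the real database reads and writes.

    # Do at most five minutes of work, checkpoint, and exit.
    use strict;
    use warnings;

    use constant QUANTUM => 5 * 60;    # seconds

    sub run_quantum {
        my ($load_state, $do_chunk, $save_state) = @_;

        my $state = $load_state->();    # where did we stop last time?
        my $done  = 0;

        eval {
            local $SIG{ALRM} = sub { die "quantum expired\n" };
            alarm(QUANTUM);
            until ($done) {
                # e.g. download and compare the next 100 links
                $done = $do_chunk->($state);
            }
            alarm(0);
        };
        die $@ if $@ && $@ ne "quantum expired\n";

        $save_state->($state, $done);   # checkpoint for the next run
        return $done;
    }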

However, the worker process itself also has to watch its own execution with a timer, because there is always the risk of hitting some site that stalls everything (for example, one such site contained a list of 100,000 English words and nothing else; clearly the algorithms described above would spend three days looking for similarity there and might even find some).

The number of worker processes could be changed, in theory even dynamically. In practice, three processes turned out to be optimal.

And of course there was also a MySQL database storing the texts being processed, the intermediate data and the final results, plus a web interface where users could see what was currently being processed and what stage it was at.

Tasks were assigned priorities so that the more important ones ran sooner. Priority was computed as a function of file size (the bigger, the slower it is to process) and deadline (the closer, the sooner the results are needed). The dispatcher picked the task with the highest priority, but with a random correction; otherwise low-priority tasks would never get their turn as long as higher-priority ones kept arriving.
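
Something along these lines; the weights are invented for illustration, only the shape of the formula matters.

    # Priority: big jobs sink, close deadlines rise, and a random nudge
    # keeps low-priority jobs from starving.
    sub priority {
        my ($size_pages, $hours_to_deadline) = @_;
        my $p = 1000;
        $p -= 2 * $size_pages;                  # big jobs are slow, push them down
        $p += 500 / ($hours_to_deadline + 1);   # near deadlines push the job up
        $p += rand(100);                        # random correction against starvation
        return $p;
    }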

Third version


The third version was a product of evolution in its processing algorithms and a revolution in its architecture. I remember standing around in the cold before a meeting that never happened, waiting for Godot, and recalling a recent story about Amazon's services. They store files, they run virtual machines, they even have all sorts of obscure three-letter services. And then it dawned on me. I remembered a giant shrimp I once saw in a Sevastopol aquarium. It stands among the rocks, waving its legs, filtering the water. Tasty bits drift past, it picks them out and spits the water back. Put a whole row of such shrimp together and they will filter everything between them in twenty minutes. And if the crustaceans are of different species, each catching its own kind of food, then what prospects open up.

Translating from figurative language into technical: Amazon has the SQS queue service, a kind of continuous conveyor along which data flows. We write several programs, each performing only one action: no context switches, no spawning of children, no other overhead. "From morning till night the tap fills the same buckets with water. The gas stove heats the same pots, kettles and pans."

The implementation turned out simple and beautiful. Each step of the algorithm described above is a separate program. Each has its own queue. The queues carry XML messages describing what needs to be done and how. There is also a management queue and a separate dispatcher program that keeps track of order, updates progress data, and notifies the user about any problems. Individual programs can send their response to the dispatcher or put it straight into the next queue, whichever is convenient. If an error occurs, a message about it goes to the dispatcher, which sorts things out.

Error recovery comes for free. If a program failed to cope with a task and, say, crashed, it simply gets restarted, while the unfinished task stays in the queue and resurfaces after a while. Nothing is lost.

The only subtlety with Amazon is that the queue service guarantees that every message will be delivered at least once. That is, it will definitely be delivered, but not necessarily exactly once. You have to be prepared for this and write the processes so that they handle duplicates properly: either skip them (which is not very convenient, because you have to keep some records) or process them idempotently.
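
A sketch of one such worker, assuming Amazon::SQS::Simple roughly as its documentation describes it; the queue names, the message handling and the duplicate cache are all made up here.

    # One pipeline worker: read a message, skip duplicates, do its single
    # step, pass the result on, delete the message.
    use strict;
    use warnings;
    use Amazon::SQS::Simple;

    my $sqs = Amazon::SQS::Simple->new($ENV{AWS_ACCESS_KEY}, $ENV{AWS_SECRET_KEY});
    my $in  = $sqs->CreateQueue('normalize-queue');
    my $out = $sqs->CreateQueue('search-queue');

    my %seen;    # crude guard against at-least-once re-delivery

    while (1) {
        my $msg = $in->ReceiveMessage();
        unless (defined $msg) { sleep 5; next; }

        my $id = $msg->MessageId();
        unless ($seen{$id}++) {
            my $body   = $msg->MessageBody();    # e.g. an XML task description
            my $result = do_single_step($body);  # this worker's one action
            $out->SendMessage($result);
        }
        # Older versions of the module take $msg->ReceiptHandle() here.
        $in->DeleteMessage($msg);
    }

    sub do_single_step { my ($body) = @_; return $body }   # placeholder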

Files downloaded from the Internet were, of course, not passed around in messages: that is inconvenient, and SQS has a size limit. Instead they were uploaded to S3, and only the link was sent in the message. After a task finished, the dispatcher cleaned up all these temporary storages.

Intermediate data (for example, how many links we need to fetch and how many are already done) was kept in Amazon SimpleDB, a simple but distributed database. SimpleDB also had limitations to keep in mind; for instance, it did not guarantee immediate consistency of updates.

Finally, I started writing the end results, the texts with plagiarism found, not to MySQL but to CouchDB. In the relational database they had been stored non-relationally anyway, as text fields in Data::Dumper format (the Perl equivalent of JSON). CouchDB was good in every way, like the Queen of Sheba, but it had one flaw, and a fatal one. You cannot hit its database with an arbitrary query: indexes must be built in advance for every query, which means every query must be foreseen in advance. If you have no crystal ball, you have to kick off indexing, which for a large database takes several hours (!), during which no other requests are served. Today I would use MongoDB, which builds indexes in the background.

The resulting scheme had a huge advantage over the old one: it scaled naturally. It has no local data, everything is distributed (except the results database), and all instances of the worker processes are completely identical. They can be grouped by weight: run all the lightweight, low-resource ones on one machine, and give a hog like the text-comparison process a separate virtual server. Not enough? Cannot keep up? Add another one. Some process still cannot cope? Move it to a separate machine. In principle this could even be done automatically: if too many unprocessed messages pile up in one of the queues, spin up another EC2 server.

However, stern auntie Life, as usual, made her own adjustments to this idyll. Technically the architecture was perfect, but economically it turned out that using SimpleDB (and S3) was simply unprofitable. It costs too much, the database especially.

I had to quickly move the intermediate data back to good old MySQL and store the downloaded documents on a hard disk shared over NFS. And, at the same time, forget about seamless scaling.

Unrealized plans


While studying natural language processing, in particular from Manning's comprehensive book, I could not shake the thought that all the methods described there are just ad hoc tricks, dodges for a specific task that do not generalize at all. Back in 2001, Lem dismissed computer science and the artificial intelligence that had not been invented in forty years despite all the noise on the subject, and gloomily predicted that the situation would not change in the foreseeable future. The machine did not understand meaning then, and it will not understand it. The philosopher was right.

The plagiarism search was exactly the same kind of trick. I was not, of course, hoping to build an AI and wait for it to comprehend text like a human. But all the natural-language parsers I found were extremely complex, probabilistic, produced results in an incomprehensible form, and demanded huge computational resources. In short, I think that at the current stage of science this is unrealistic.

Human factor


The system was written so that it could run fully automatically, with no human involvement at all. On top of that, I worked in tandem with a very good sysadmin, thanks to whom all the servers were configured perfectly and downtime of any kind was reduced to a minimum. But there were still users: the support service. And the bosses, of course.

Both had long been convinced that plagiarism is found not by a computer but by a little man (or even a whole crowd of them) sitting inside the computer. He is almost like a real one; in particular, he understands everything written in coursework on any topic, and he finds plagiarism because he holds the entire contents of the Internet in his head. Yet when these little men slacked off, the questions, contrary to all logic, were for some reason addressed not to them but to me. Philologists, in a word.

It cost me a lot of effort to explain that plagiarism is still searched for by a computer that does not understand what it is doing. After about a year this sank in with the management; with the rest, it seems, never entirely.

The support service had another fashion: type a few sentences into Google and happily report that Google had found plagiarism while my system had not. What could I say to that? Explain Zipf's distribution? Explain that for the sake of speed and memory you have to make compromises, and every such compromise means a loss of quality? Hopeless. Fortunately, in most of these cases it turned out that Google had found the material on some paid site the system simply had no access to.

There was another trick: reporting that Turnitin had found plagiarism that our system had missed. Here it was impossible to explain that Turnitin is most likely written by a whole team of qualified specialists with degrees in the relevant field, and that the site itself is on intimate terms with some big search engine. Again, fortunately, most of the undetected plagiarism came from paid sites or from other students' papers, which were simply inaccessible to us.

For several months I tried to satisfy the director's requirement of a fixed processing time: no job should take more than an hour to check. It did not work at all, I lost sleep over it, until one day it was pointed out to me that what they essentially wanted from me was a perpetual motion machine: one whose capacity grows along with the load. That does not happen in nature, and it does not happen in the world of programs either. When the requirement was reformulated as "any job up to a certain size (50 pages) must be checked within an hour, provided no huge dissertations are in the queue at the time", things fell into place. The conditions were tough, but at least achievable.

From time to time the support service provided entertainment of its own. I find it hard to explain their logic, but every now and then, when the checking queue was heavily loaded, they would... stuff extra copies of the same jobs into it. That is, if there are a hundred cars stuck in a traffic jam, you should drive another hundred onto the road, and then things will get moving. I never managed to explain the error to them, and such cases had to be banned by purely administrative means.

Parting words to commenters


My sad experience shows that Habr has a certain number of young people who, for reasons unknown, believe they have been superbly versed from birth in every branch of knowledge invented by mankind. Straight out of Chekhov: "she is an émancipée, everyone around her is a fool, she alone is clever." If you belong to these comrades and decide to write that I am an idiot, understand nothing and fail to grasp simple things, then please remember that the system I developed was worked hard, day and night, for two years, with almost no downtime, and saved the customer several sacks of money. So when writing comments of the kind described above, please state right away the comparable characteristics of a system you have developed yourself, so that your genius is evident immediately, without follow-up questions.

Source: https://habr.com/ru/post/199190/

