On the Internet, you can find services that generate meaningless text, also known as "Lorem Ipsum".
Usually such text is used as “fish”, that is, placeholder text for filling all sorts of design layouts.
Alternatively, it can serve as practice text for typing trainers.
All the generators I found compose text from random or not-so-random combinations of words, sometimes with endings made to agree, sometimes taking relations between words into account.
Below is a method of generating gibberish that contains few meaningful words but remains readable.
Something like this:
check with the evening crushing us of people’s travels, the hobby demanding my years was tearing up the bones of everything Mozhenskys imenitatsyy tossed between karmy and soviet father, he wasn’t revived, they’ll be his person who punched him I know I’m not married to wedding marriage archives whoever prevailed and revenge his hands on them for half an hour on the night before the rest the other feathers would see the thousand fairy for her fool for the tedious day of the day papazazatyanul emne from idela professet and mock gorkulto captive tera
To my taste, this is much more fun.
The idea is quite simple: generate the text letter by letter, taking into account how letters combine in the Russian language.
Theory
The text is modeled as an Nth-order Markov chain whose elements (event outcomes) are individual letters. The current state of the stream is determined by the N letters already typed, and the next letter is determined by the transition matrix.
For each chain of N letters, the transition matrix determines which letters may continue the text and with what probability.
In practice, it is a frequency dictionary of letter combinations:
dict[prefix][letter] = probability (frequency) of this letter after this prefix.
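As a rough illustration of that structure, here is a minimal Python sketch. The name `transitions`, the helper `allowed_continuations`, and all the numbers are invented for the example (and use Latin letters for readability); nothing here is measured from a real corpus.

```python
# Hypothetical fragment of a depth-2 dictionary: prefix -> {next letter: probability}.
# The probabilities are made up purely for illustration.
transitions = {
    "th": {"e": 0.7, "a": 0.2, "i": 0.1},
    "he": {" ": 0.6, "r": 0.4},
}


def allowed_continuations(prefix):
    """Return the letters that may follow `prefix`, most probable first."""
    row = transitions.get(prefix, {})
    return sorted(row, key=row.get, reverse=True)


print(allowed_continuations("th"))
```

A prefix that never occurred simply has no row, which is how the dictionary encodes "no letter is allowed here".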
Compiling a dictionary
To compile a dictionary, scan a sufficiently large amount of text, count the occurrences of every letter combination encountered, and then normalize the values under each prefix.
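The normalization step could look roughly like this in Python. The function name `normalize` and the sample counts are assumptions for the sketch: each prefix's raw counts are divided by their sum, so the values under one prefix add up to 1.

```python
def normalize(counts):
    """Turn raw counts (prefix -> {letter: count}) into probabilities per prefix."""
    probs = {}
    for prefix, row in counts.items():
        total = sum(row.values())
        probs[prefix] = {letter: count / total for letter, count in row.items()}
    return probs


# Invented counts: after "ab" the letter "a" was seen once and "c" three times.
print(normalize({"ab": {"a": 1, "c": 3}}))
```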
The dictionary can include prefixes of a fixed length only, or of every length from 0 to the maximum.
The only benefit of the latter for text generation is that a single dictionary can produce texts of varying degrees of readability. It also lets a word-based matrix hold words shorter than the depth of the dictionary.
Two options are possible: scanning in a continuous stream (running text) or by words (for example, from a dictionary of words).
With continuous scanning, word delimiters (spaces, and even punctuation marks) are included in the prefix, which makes it possible to take the compatibility of words into account to some extent (combinations like “same,” “same,” “what,” etc.).
To scan the stream, keep the N previous letters (the tail); for each new letter, record it in the dictionary as [tail][letter], then append it to the tail and truncate the tail back to length N.
"... ...":
[...][] [...][], [..][], [.][], [][_], [_][], [_][], [_][], [_][], [][], ...
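The tail-based scan described above can be sketched in Python as follows. The function name `scan_stream` and the sample string are my own for illustration, and the sketch assumes a fixed depth `n` of at least 1, so letters seen before the tail has filled up are not recorded.

```python
from collections import Counter, defaultdict


def scan_stream(text, n):
    """Scan `text` letter by letter, counting each letter under the n-letter tail before it."""
    if n < 1:
        raise ValueError("this sketch assumes a fixed depth n >= 1")
    counts = defaultdict(Counter)
    tail = ""
    for letter in text:
        if len(tail) == n:  # only full-length prefixes enter a fixed-depth dictionary
            counts[tail][letter] += 1
        tail = (tail + letter)[-n:]  # append the letter, then truncate the tail to n
    return counts


for prefix, row in scan_stream("to be or", 2).items():
    print(repr(prefix), dict(row))
```

Because spaces pass through the tail like any other letter, prefixes such as "o " and " b" end up in the dictionary, which is what lets neighbouring words influence each other.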
If the dictionary is not of fixed depth, then for each letter you can enter every prefix length from the maximum down to 0. (The transition matrix for the empty prefix gives the unconditional probability of a letter.)
[][], [][], [][], [][], [][]
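For the variable-depth case, a sketch under the same assumptions might record each letter under every suffix of the current tail, from the longest down to the empty prefix. The name `scan_all_depths` and the sample string are hypothetical.

```python
from collections import Counter, defaultdict


def scan_all_depths(text, max_n):
    """Count each letter under all prefixes of length max_n down to 0."""
    counts = defaultdict(Counter)
    tail = ""
    for letter in text:
        # tail[0:] is the longest available prefix, tail[len(tail):] is the empty one.
        for start in range(len(tail) + 1):
            counts[tail[start:]][letter] += 1
        tail = (tail + letter)[-max_n:] if max_n > 0 else ""
    return counts


counts = scan_all_depths("abab", 2)
print(dict(counts[""]))  # counts under the empty prefix: plain letter frequencies
```

The row under the empty prefix is just the overall letter frequency table, matching the remark about unconditional probability.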
When scanning by words, a word delimiter can appear in a prefix only as its first character.
To add a word, add separators to its beginning and end, then enter into the dictionary every substring made of an N-letter prefix plus the letter that follows it. With a fixed dictionary depth, words shorter than N will not make it into the dictionary.
"", "":
[_][], [][_],
[_][], [][], [][], [][], [][_]...
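A word-by-word scan could be sketched like this in Python. The separator symbol `SEP`, the function name `scan_words`, and the sample words are assumptions for the example; the sketch uses a fixed depth `n` and slides a window of `n` prefix letters plus one continuation letter over each padded word.

```python
from collections import Counter, defaultdict

SEP = "_"  # assumed stand-in for the word separator (a space)


def scan_words(words, n):
    """Count letter transitions inside each word, padded with SEP on both sides."""
    counts = defaultdict(Counter)
    for word in words:
        padded = SEP + word + SEP
        # The closing SEP is only ever a continuation, never part of a prefix.
        # In this sketch, words with fewer than n - 1 letters yield no entries.
        for i in range(len(padded) - n):
            counts[padded[i:i + n]][padded[i + n]] += 1
    return counts


for prefix, row in scan_words(["cat"], 2).items():
    print(prefix, dict(row))
```

Note that no prefix ever ends in the separator, which matters later when generating text from such a dictionary.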
With such a scan, the maximum prefix length can be limited to the length of the longest word.
Text generation
You can also generate text in a continuous stream or by words.
- An arbitrary prefix from the dictionary, beginning with a space, is selected as the starting string (so that the text does not begin with letters that never occur at the start of a word).
- Then, until you get bored:
  - take the tail of length N from the text typed so far
  - look up all of its possible continuations in the dictionary
  - select a continuation in a randomly weighted way and append it to the text
  - if the text is generated by words and the last letter added is a separator, select a new starting string.
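The loop above, for the continuous-stream case, might be sketched as follows. The function name `generate`, its parameters, and the toy dictionary are hypothetical; restarting from a fresh prefix when the tail has no continuation is my own assumption, not something the steps prescribe. The weighted selection is delegated to the standard `random.choices`.

```python
import random


def generate(probs, n, length, rng=random):
    """Generate at least `length` letters from a fixed-depth dictionary (n >= 1)."""
    # Prefer starting prefixes that begin with a space; fall back to any prefix.
    starts = [p for p in probs if p.startswith(" ")] or list(probs)
    text = rng.choice(starts)
    while len(text) < length:
        row = probs.get(text[-n:])  # continuations of the current tail
        if not row:
            # Assumption: on a dead end, continue from a new starting prefix.
            text += rng.choice(starts)
            continue
        letters = list(row)
        weights = [row[letter] for letter in letters]
        text += rng.choices(letters, weights=weights)[0]
    return text


# Toy dictionary in which every prefix has exactly one continuation.
toy = {" a": {"b": 1.0}, "ab": {" ": 1.0}, "b ": {"a": 1.0}}
print(repr(generate(toy, 2, 8)))
```

For a dictionary compiled by words, the same loop would need the extra step from the list: after a separator is appended, pick a new starting prefix instead of looking up the tail.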
With a dictionary compiled by words, text can likewise be generated only by words: once a separator has been appended to the text and the tail, no dictionary entry will be found, because prefixes never contain a separator except as their first character.
/* By “randomly weighted” I mean selecting an element from a non-uniform distribution. It is implemented by dividing the unit segment into sub-segments proportional to the probabilities of the elements, and picking a point on that segment with a uniformly distributed random value. */
The deeper the dictionary used, the more readable the generated text, but the more meaningful words appear in it.
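The “randomly weighted” selection from the note above can be sketched directly. The function name `weighted_choice` and the optional parameter `u` (which lets the random point be supplied explicitly) are assumptions of this example; it expects a non-empty row.

```python
import random


def weighted_choice(row, u=None):
    """Pick a key of `row` (letter -> weight) by splitting [0, 1) into proportional segments."""
    if u is None:
        u = random.random()  # uniformly random point on the unit segment
    total = sum(row.values())  # dividing by the total also accepts raw frequencies
    edge = 0.0
    for letter, weight in row.items():
        edge += weight / total  # right end of this letter's segment
        if u < edge:
            return letter
    return letter  # rounding guard for u extremely close to 1


row = {"a": 0.5, "b": 0.25, "c": 0.25}
print(weighted_choice(row, 0.6))
```

Here "a" owns the segment [0, 0.5), "b" owns [0.5, 0.75) and "c" owns [0.75, 1), so a point at 0.6 lands on "b".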
The average word length in Russian is 5.28 letters, so with a dictionary depth >= 5 most of the words are meaningful.
A good balance of readability and meaninglessness is obtained with combinations of 3-4 letters (roughly the length of a syllable).
Examples
As a fairly representative basis, I used the frequency dictionary of S.A. Sharov. It contains ~70000 words (word forms) that occur more often than once per million, and it is built on the "National Corpus of the Russian Language", rather carefully compiled by scientists.
Of course, it is only suitable for generating text by words.
The full dictionary of letter combinations, from 1 up to the longest word (24 letters), has 820105 entries.
The dictionary of combinations from 1 to 5 has 240784 entries.
N = 1
bycicles a la u lushi yla skla skhtru for skilebrut no and notusno boislchekayavskoy these not vilisenashen stalkmyruh szhavh tustoedlo chil pkost ptoch chalsr i amoblu ras eto koglu sneznotm kanazakogotrakopirovki obes oteznopl bsnaprobe of Chez ntolily in natalst ki th d Zuren vabatayavetutoda
N = 2
That is why they saved it for me, but I didn’t bake half the time for it, not for oh, orio on a special occasion, for which I burned the pedels because of the fact that it wasn’t bad for you, but obliquity and obscenity, I cursed a lot of the room, we used a pattern. double genocupalization of the plains by simonovaya I am not one of the otated years of this etninovori hoi onresting durii by the bed of stratepei not broken koles by the hall of them and caravnulan we think the same ska got into il otsuyu is worried in love
N = 3
and this sistil shook or hasn’t gotten all over the nickname of the faithful person he didn’t pick up the closest ones and he loves where you didn’t call a hoax so much that you’ve gotten the aggressor In obrazu barratsii at that time for proit u in closed the first refusal from the department echel ud would have reduced uzhdun pushi same folded like a rotor just obezhomu just for what ekspekt organgetto podprovy not and if bu difficult mouth called eslatsya
N = 4
The traveler's boxing coat Sostok suddenly noticed that the nature doesn’t so much as asking for it, tilting the evening by traveling the most populous in the changes in the posts as a wonderful lift up the door of the cartel go to his kitchen room the habit of everything else in the evening with a drink at night probably it stops me I thought and thought it was interesting only mine in the door are in the circle I am very lads and the manual here it is in that counting from sitting cured on paper you and paper and that you agreed the main thing most friends would fail in assuring avatel turned on sitting and breathing right hand fights
N = 5
It was perfect for her to run down before the war before order in a huge day head only from the goats of the publishers of time. Doubt where death is what kind of pen to clean up earlier, I couldn’t right or lexicon laughed from not knocking it over to another and let it be oaky again and he turned out under a star from a meeting before the local one and they sat on the shoulders of the escort and on a secret diet I didn’t open up in the sight of it. There wasn’t much communication that was brilliant. That time the jaws would be the last something with all the foliage he could then alacha the moral high
For complete happiness, one could also add punctuation and agreement of endings.