
Sound in numbers

Backstory
To be honest, I am a techie by education and was never fond of linguistics. Knowing languages is interesting, of course, but studying them is a chore, and the technical sciences always seemed clearer and more interesting to me than the humanities. That was until I needed to come up with a new domain name. After struggling with a lack of good ideas and rejecting many banal options, I figured that if inspiration would not come on its own, I had to look for it somewhere. So I decided to approach the problem technically and build a domain name generator.

Idea
The idea for the randomizer came quickly. There are already almost two million domains on the Runet, with good names and bad ones. Of course, whether a name is “good” or “bad” is a matter of individual taste, but the names in each group have something in common. I suspect more than one generation of linguists has puzzled over what that common something is (and maybe they worked it all out long ago), but I wanted a technical approach, so I assumed that what makes a domain name good or bad is its combination of letters. The method: split each domain name into syllables and save the syllables in a “syllable dictionary”. With such a dictionary we can combine syllables in random order and get good domain names (provided the source database the dictionary was built from had good names). The same approach can generate more than domain names: nicknames, drug names or personal names, for example.
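The dictionary idea can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: the post does not say how names are split into syllables, so the splitting rule here (a syllable ends after each run of vowels) is an assumption, and the function names and the sample source names are hypothetical.

```python
import random

VOWELS = set("aeiouy")  # assumption: Latin vowels, since domain names are ASCII


def split_syllables(name):
    """Naive split: a syllable ends after each run of vowels;
    trailing consonants are glued to the last syllable."""
    syllables, current = [], ""
    for i, ch in enumerate(name):
        current += ch
        next_is_vowel = i + 1 < len(name) and name[i + 1] in VOWELS
        if ch in VOWELS and not next_is_vowel:
            syllables.append(current)
            current = ""
    if current:  # leftover consonants at the end of the name
        if syllables:
            syllables[-1] += current
        else:
            syllables.append(current)
    return syllables


def build_dictionary(names):
    """Collect the distinct syllables of all source names, in sorted order."""
    return sorted({s for name in names for s in split_syllables(name.lower())})


def generate(dictionary, n_syllables, rng=random):
    """Join n_syllables randomly chosen syllables into a candidate name."""
    return "".join(rng.choice(dictionary) for _ in range(n_syllables))


# Hypothetical source names, for illustration only.
names = ["habrahabr", "yandex", "rambler"]
dictionary = build_dictionary(names)
print(dictionary)
print(generate(dictionary, 3))
```

Even this crude splitter shows why the quality of the output depends on the source list: every generated name is built only from pieces that occurred in the source names.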

Problems
The first experiments gave encouraging results but also showed that things are not so simple. Allowing for the complete randomness of the generated words, one could say that a generated nickname looked like a nickname and a generated medicine name looked like a medicine name. But the yield of good options was small. Moreover, while we can easily tell a male name from a female one by ear (exceptions aside), it was hard to tell a generated male name from a generated female one. In addition, words that are unnatural for the language (for example, ones starting with a soft or hard sign, or with unpronounceable clusters such as mts- or nts-) had to be filtered out or flagged somehow.

Solutions
After more thought, I decided that the main problem was in the endings. When the ending of an “artificial” word resembled the ending of a “natural” one, the whole word looked natural. When the ending drifted toward the front of the word or was missing altogether, the word could hardly be called good. So I moved the endings into a separate dictionary and built new words according to the rule
[word] = [arbitrary combination of syllables] + [arbitrary ending].
This rule began to give what I consider very good results. The problem of weeding out unnatural words remained, though. To solve it, I decided to write a function that scores a word numerically: a great word should get 100 points, and something that cannot be considered a word at all should get 0.
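The rule with a separate ending dictionary might look like the sketch below. The dictionaries are toy data invented for the example, `make_word` is a hypothetical name, and keeping one ending list per category (here, female and male names) is an assumption of this sketch rather than something the post states.

```python
import random


def make_word(syllables, endings, n_syllables, rng=random):
    """[word] = [arbitrary combination of syllables] + [arbitrary ending]."""
    stem = "".join(rng.choice(syllables) for _ in range(n_syllables))
    return stem + rng.choice(endings)


# Toy dictionaries invented for this example (not the author's data).
syllables = ["ma", "ri", "le", "vi", "to", "ka"]
female_endings = ["na", "ya", "la"]
male_endings = ["slav", "mir", "tor"]

rng = random.Random(42)  # fixed seed so repeated runs give the same words
print(make_word(syllables, female_endings, 2, rng))
print(make_word(syllables, male_endings, 2, rng))
```

Because the ending is always drawn last and from its own list, it can no longer end up in the middle of the word or be absent.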
Melodiousness
Browsing the Internet, I found a good word for the characteristic I wanted to score: “melodiousness”. But a Google search for “algorithms for evaluating melodiousness” turned up nothing useful. So I decided on the following: call a word “melodious” if its vowels and consonants alternate, and “unmelodious” if it consists of letters of a single type. The melodiousness score can then be defined as the ratio of the number of vowel-consonant pairs to the total number of adjacent letter pairs. For completeness, I added several extra conditions:
- if the word begins with a forbidden letter (ь, ъ or ы for Russian), it gets 0 points;
- if the word begins with a doubled letter, its score is reduced by 80%;
- if the word begins with two different consonants or two different vowels, its score is reduced by a quarter.
With these simple calculations we can rank the artificial words in some way, discarding the bad ones or highlighting the good ones.
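The scoring rules above could be written down roughly like this. It is a sketch under several assumptions, not the site's actual code: the forbidden initial letters are taken to be ь, ъ and ы, “paired letters” is read as the same letter repeated twice, every letter outside the vowel set (including ь and ъ) is counted as a consonant, and a word shorter than two letters gets 0 because it has no pairs to count. The name `melodiousness` is hypothetical.

```python
VOWELS = set("аеёиоуыэюя")   # Russian vowels
FORBIDDEN_START = set("ьъы")  # letters a word may not start with


def is_vowel(ch):
    # Simplification: anything that is not a vowel is treated as a consonant.
    return ch in VOWELS


def melodiousness(word):
    """Score a word from 0 to 100 by how well vowels and consonants alternate."""
    word = word.lower()
    if len(word) < 2 or word[0] in FORBIDDEN_START:
        return 0.0

    # Share of adjacent pairs made of one vowel and one consonant.
    pairs = list(zip(word, word[1:]))
    alternating = sum(1 for a, b in pairs if is_vowel(a) != is_vowel(b))
    score = 100.0 * alternating / len(pairs)

    first, second = word[0], word[1]
    if first == second:
        score *= 0.2    # doubled letter at the start: minus 80%
    elif is_vowel(first) == is_vowel(second):
        score *= 0.75   # two different letters of one type: minus a quarter
    return score


for w in ["молоко", "стол", "ввод", "ыва"]:
    print(w, round(melodiousness(w), 1))
```

With these assumptions, “молоко” alternates perfectly and scores 100, while “ыва” scores 0 because of its first letter. The two penalties are applied as alternatives, with the doubled-letter case taking precedence; that ordering is also a choice made for this sketch.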

What's next?
The result of my experiments can be seen at http://vidumschik.ru . I think a generator like this could be useful to many people. But I would very much like to know whether anyone has already worked on scoring how melodious words are. Or maybe someone can suggest a good algorithm?

All of this was done by a friend of mine who, for obvious reasons, cannot post here.

Source: https://habr.com/ru/post/51209/

