I will continue the series of articles "How to entertain yourself with the help of the Wikipedia frequency dictionary and the interpreter of Python, if nothing else is at hand and is not foreseen in the near future."
I will try to recreate that wonderful evening, when my Wikipedia parser worked, I got the coveted dictionary, opened Python interactively and started typing various queries in order to get words with all sorts of unusual properties. That one, two years ago, the session with the shell, unfortunately, has not been preserved, so I will do it all over again.
So. Put the wiki_freq.txt file in some folder, open the Python shell in this folder. Go.
First, let's download the dictionary:
>>> import codecs >>> d={} >>> for l in codecs.open("wiki_freq.txt","r","utf-8-sig"): w,f=l.split() d[w]=int(f)
And try something quite simple, for example, to get the longest words. Let's output the beginning of the list of all dictionary words sorted by length in the reverse order (if the length of two words is the same, they are sorted by frequency):
')
>>> print ', '.join(sorted(d.keys(),key=lambda w:(len(w),d[w]),reverse=True)[:100])
the wordsPori-ikaalinen-kuru-vilpula-lankapokhya-padasyoki-heinola-mantyukharyu-savitaypale-lappenranta-antrea-rauta, in the “ipd” of “ip”, ​​“ip” of “ipk”, “iegla-baltezers-ropazh-krievupe-vangazh-i” Shala's lava-pyatikhatki, baturino-rakino-dyshkin-zadayki-stepalkovo-kaseli-cheshikhino-poekhnova-kozlin, mikhailovsky-sevsk-dmitrovsk-kromy-naryshkino-eagle-novosil-borki-kostorny, Russian-Georgian-German-Dutch-Italian-Italian Finnish, ultra-high-tech-indestructible-super-cosmo-cyber-suit, Komsomol-Moscow-Leningrad-Markovtsev-Komsomol, Gelendzhik-Anapa-Tunnel-Ubinsk-Novodmitrievskaya-Gelendzhik, Tula-Gorbachev-Belev-Kozo-Novo-Zmitne-Gelendzhik, Tula-Gorbachev-Belev-Kozo-Novo-Dmitriyivka-Gelendzhik, Tula-Gorbachev-Belev-Kozo-Koz-Novo-Dmitriyevskaya-Gelendzhik, Tula-Gorbachev-Belev-Kozo-Novo-Dmitriyevskaya-Gelendzhik, Tula-Gorbachev-Belev-Kozo-Koz-Novo-Dmitriy-Gelendzhik Tula, Copenhagen-Hamburg-Bremen-Dortmund-Fra kfurt-stuttgart-zurich, methionyl-glutamyl-histidyl-phenylalanyl-prolyl-glycyl-proline, fly-mura-tur-tara-para-park-spider-paut-plut-raft-slot-elephant, tsivi-tsivi-tsivi- tsivi-wiss-wiss-wiss-wiss-wiss-wiss-cirr, threonyl-lysyl-prolyl-arginyl-prolyl-glycyl-prolyl-diacetate, October-tram-nevsky-nezhinskaya-cosmonauts-Pushkin, white-cornflower-blue-yellow- red-yellow-cornflower-white, rehberger-etelyavuori-koivusaari-kholopainen-kallio-youtsen, printsmarmabĂĽkpetralbergangansiosfbernhardtwilgelmsberg, llanvayrpullgvingilgogerihuhirndro Ulliteline Helsinki-Moscow, another-promising-good-senator-as-his-there, Villa de Nuestra-Senora-de-La-Candelaria-de MedellĂn, Krasnoyarsk-Kirensk-Yakutsk-Nagaevo-Nizhnekolymsk-Uelen, Pizhanka- Mari-Oshaevo-Akhmanov-Voya-Paigishevo-Kazakovo, Georg-Alex Ndp; md Mirador-Miramar-Soler-Center-Plazario-Otei-Airport, Arginil-Alpha-Aspartyl-Lysyl-Valyl-Tyrosyl-Arginine, Adam-Karl-Wilhelm-Stanislav-Eugene-Pavel-Ludvig, Kharkiv-Lozovaya-Aleksandrovsk-Perekop Sevastopol, Mukha-Mura-Fura-Fora-ko a-corn-koan-clan-clone-elephant, Bryansk-starodubsko-trubchevsko-Novgorod-Seversk, in F-minor-a-flat-major-major-s-flat-major-major-f-minor, station-Marx-greenhouse-baharova- Lane, Rostov-on-Don-Vladivostok-San Francisco-New York, Fly-Muse-Louza-Vine-Pose-Por-Port-Sort-Ston-Elephant, Dubasari-Quarantine-Koshnitsa-Dorotsky-Grigoriopol, espionage Japanese-German-Trotsky sabotage, sodium bicarbonate-calcium-magnesia, Ukrainian-Crimean Tatar-Russian-Belarusian, Wakamaya-Silva-Maradona-Mariya-Nina-Sanches-Raht, Polish-It liano-french-german-canadian, network-transformer-rectifier-inverter-traction, ham-and-cheese cyber-cooperative sandwiches, italy-slovenia-hungary-slovakia-ukraine-russia, musical-choreographic-cinematographic, hevie pop-up jordan but-cowboy-superheroic-soldier, bicarbonate-sodium chloride, calcium-magnesia, Schleswig-holstein-Sonderburg-Augustinburg, el Santuario de Nuestra-senorara de la Soledad, lions-ternopol-Vinnytsia-Dnipropetrovtev-Dnipropetrov-Lviv. Bogdanovich-Yegorshino-Alapaevsk-Goroblagodatskaya, Kazan-Chistopol-Almetyevsk-Bugulma-Orenburg, geo-bio-filo-was-fantas-mantas-parabolic, visiting-unbeliever-with-explanatory-pamphlet, communid-de-vilia-i Tierra-Antigua-de-Cuellar, Barnaul-Ekibastuz-Kokchetav-Kustanay-Chelyabinsk, Schleswig-Holsht Single Sonderbourne, Augustineland zonderburg-avugustenbrgsky, Petropavlovsk Kokchetau-Akmolinsk karaganda-shu, EQ-compressor-distortion equalizer fuzz, Irish-Swedish-Austrian-Polish-Czech, Yermilova-Glebych seaside-of-Soviet-vyborg, Hungary-Belikova manastir-osek- Zhyakovo-Slavonian, Edison-and-Svan-United-Electric-Light-Company, Skuratov-Eagle-Kursk-Kharkiv-Lozovaya-Ilovaisk, Sulfate-Hydrocarbate-Magnesica-Calcium, Religious-Enlightenment-Charitable, Sevzapenergoset-Project-West-Selcenegrane, Calcist-Settler-Calcium Omsk-Irkutsk-Khabarovsk-Vladivostok, Semernikovo-Crimea-Chaltyr-Ekaterinovka-Sambek, Schleswig-Holstein-Sonderburg-Augustenburg, Circassian-Abkhaz-Kabarda-Adygei, Municipality and Tehjur-VA-El-Swa-i-Kabarda-Adygei, Municipality and Tehjur-VA-El-Swa-i-Kabarda-Adygei, municipality vocal-plastic Stia-kontrsuggestiya-kontrkontrsuggestiya, Sterlitamak-Petrine-Beloretsk Magnitogorsk
Dashes spoil the picture, although “spy-Japanese-German-Trotsky-sabotage” is certainly wonderful. Since The dictionary was compiled two years ago, now it is possible to check whether this or that unusual word was a trick of hooligans, this particular adjective was not, it still exists in the article about Kaganovich Lazar Moiseevich.
But perhaps I will remove hyphens. I will not scoff at the dictionary, better beat it and create a new one, without hyphens:
>>> for l in codecs.open("wiki_freq.txt","r","utf-8-sig"): w,f=l.split() if '-' in w:continue d[w]=int(f)
run the search for long words again.
the wordstaumatauakatangiangakoauauotamateaturipukakapikimaungahoronukupokanuenuakitanatahu, printsmarmabyukpetralbertgansiosifbernhardtvilgelmsberg, llanvayrpullgvingillgogerihuirndrobullllantisiliogogogo, taumatauakatangiangakoauauotamateapokanuenuakitanatahu, gidrazinokarbonilmetilbromfenildigidrobenzdiazepin, dekametilendimetilmentoksikarbonilmetilammoniya, glyukopiranozidmetilbuteniltrigidroksiflavanol, metoksihlordietilaminometilbutilaminoakridin, italbenzokladbischemotosilikonfarmometalhimiya, chargoggagoggmanch auggagoggchaubunagungamaugg, enterogematogepatogematopulmoenteralnogo, denlilogumomishitszyuylupigachzhunchzhenkehan, raspylyatslabokontsentrirovannyyekstrakt, politetraftoretilenatsetoksipropilbutan, hexachloro, fosforibozilpirofosfatamidotransferaza, katastrofanarhistoriyazvandalkogolny, atsetoksipentametiltsiklogeksadienonov, lesnayazatenennayaglubokayachernayadolina, ultravysokotemperaturnoobrabotannoe, nicotinamide, prevysokomnogorassmotritelstvuyuschy, superarhieks traultramegagrandiozno, skonstruirovanteplovozostroitelnym, uridindifosfatglyukuroniltransferazu, lizofosfatidilholinatsiltransferazoy, uridindifosfatglyukoroniltransferazy, superkalifradzhilistikekspialidoshes, lizofosfatidilholinatsiltransferaza, tetrametiltetraazobitsiklooktandion, nicotinamide, superkalifradzhilistikekspialidoshes, popreblagorassmotritelstvuyuschemusya, tetragidroksiglyukopiranozilksanten, bisetilenditiotetraselenafulvalen, adenozinfosfatizopentiltransferaza, soyuzglavstroydorm ashzagranpostavka, ishvarapratyabhidzhnyavritivimarshini, nitropentametiltsiklogeksadienonov, perftormetiltsiklogeksilpiperidina, polidigidroksifenilentiosulfonat, galogenoserebryanomfotograficheskom, nikotinamidadenindinukleotidfosfa, poliamidbenzimidazoltereftalamida, rentgenoelektrokardiograficheskogo, bromdigidrohlorfenilbenzodiazepin, undzibtsigtauzendzibenhundertziben, lecithin-cholesterol acyltransferase, lecithin-cholesterol acyltransferase, alyuminievokobaltovomolibdenovyh, predsedatelstvovaliarheanaktidy, kominefteen ergomontazhavtomatikoy, pesnitruschobnadezhdyrazbityhserdets, induktotermogalvanogryazelechenie, lecithin-cholesterol acyltransferase, methylenetetrahydrofolate, verhnevolzhskselelektrosetstroy, obrazovaniemmonobromproizvodnogo, kominefteenergomontazhavtomatika, mintraktorselhozmashinostroeniya, geksakosioygeksekontageksafobiya, methylenetetrahydrofolate, pankreatoholangiorentgenografiya, kominefteenergomontazhavtomatiki, fosfoenolpiruvatkarboksikinazoy, raboteballastoraspredelitelnaya, electrogastrogram iCal, proizvodstveelektrooborudovaniya, adeninfosforiboziltransferaznoy, tsiklodekstringlyukanotransferazy, dioksifenilalanindekarboksilazy, psihiatriideinstitutsionalizatsiya, dimetilalkilbenzilammoniyhlorid, natsionalisticheskiorientirovanoy, samoprovozglashonoyturkestanskoy, diatsilglitseriltrimetilgomoserin, entsefalomielopoliradikulonevrit, gidrolektrosvetoterapevticheekie, mahachkalinskogogosudarstvennogo, gidroksimetilhinoksilindioksid, dihlordifeniltrihlormetilmetan, poperechnomorschinistoyskulpture, iketikpautiyavt iktstuzobrazhalsya, pornograficheskoyvideoproduktsii, entsefaloduroarteriosinangioziz, avtotraktorelektrooborudovanie, mahachulalongkornradzhavidyalayya, poproshaynichestvomprikormlennaya, novoderevenkovskagropromhimiya, hexanitrohexaazaisowurtzitane, elektrogastroenterografichesky, duodenogastralnoezofagealnym, bisbenziltetragidroizohinolina, novosibirskneftegazpererabotka, duodenogastralnoezofagealny, adenosylmethionine decarboxylase, neyroimmunovegetodistonicheskih, shestisotvosmidesyatistranichnoy, tsiklobutadienzhelezo tricarbonyl, fumarylacetoacetate hydroxylase
The absolute record-breaker is the word Taumatauakatangangakoauautamatatatatupakapikami maunghovorukupokanuenuikitanatahu, the name of a small hill in New Zealand. Locals for some reason prefer to shorten the name to Thaumat. The second word from Wikipedia disappeared - probably a mistake or tricks of hooligans. When I entered the third word - llanvairpullgvingillgogeriuhirndrobullllantisilogogogy, Wikipedia took an interest -
Did you mean “llanvairpullgwingingglogeruhirndrollllllistilogogogh”? Yes! It was Llanvayrpullgvingilgogerihuhirndrobullllantisilogogogh I meant! Apparently the word has changed, a more correct transliteration has appeared. This is the name of the village in Wales. I also liked the name of the lake in North America - Chargoggoggmanchauggagggchaubunagungamugg. Most of the list consists of the names of all kinds of substances, the furious Soviet abbreviations, such as the Upper Volga, elektrosetrostroy, or simply the cases of keypad squeezing.
Look for long palindrome words. Add a filter to the last query:
>>> print ', '.join(sorted([w for w in d.keys() if w==w[-1::-1]], key=lambda w:(len(w),d[w]),reverse=True)[:100])
the wordsaqueducts, hydrogens, monahan, onsnonsno, eeeeeee, sologol, monaghan, holomolokh, razinizar, sunilinus, xxxxxxxx, noossoon, utility, aniline, tartrate, akasaka, ogogo, anisin, anikin, kammak, apami, apami xxxxxxx, Glenelg tenenet, ateleta, iiii, Nenonen, yoninoy, Abel, anapana, Markram, ravivar, arutura, aaaaaaa, Motet, dovyvod, Aracar, nemtmen, malyalam, avavava, macaques agubuga, montnom, kkkkkkk, kalilak, undergo, Dunuud, tatat, mokik, oooogoooo, Moror, kinonik, atarata, nananan, pppvppp, masses , Tommot, Nissin, Renner, Repper, Sakkas, Moss, Kellek, Semmes, Mottom, Kashak, Resser, Kukkuk, Necken, Serres, Noppon, Killik, xxxxxx, Aveeva, Millime, Tennet, Callac, Hutts, Molle, Kattak, Merkel , tipit, karak, nabban, pyrrip, tibbit, savvas, ptetp, remmer, nannan, rollur, yonna, sassos, harrah, kammak, rannar
I took the words to Google, walked on the links, found out about Holomolokh Lake, and also about the planned language of Sunilinus, a language where every word is a palindrome. I also learned that the longest word palindrome is saippuakauppias, the seller of soap in Finnish. With the word "onsnonsno" it turned out interesting, the author of the article about glycerin was too lazy to switch the keyboard layout and wrote the formula of glycerin aldehyde in Cyrillic - 2 (so it is still there, the difference with the naked eye is not visible) and my parser took the two as a separator.
Look for anagram words.
To begin with, we will compile a supplemental anagram dictionary, set the threshold of occurrence 20, so that we can cut off any rubbish (with long words and palindromes, I could not do this, because these words are extremely rare):
>>> from collections import defaultdict >>> anagrams = defaultdict(list) >>> for k in d: if d[k]>20: anagrams[''.join(sorted(k))].append(k)
And we are looking for the 10 largest sets of anagrams:
>>> for l in sorted(anagrams.values(), key=len, reverse=True)[:10]: print ', '.join(l)
anagramsdoran, ardon, dorn, rodan, nodar, andro, people, drone, radon, daron, rhonda, nardo, odran, norad
irka, rica, arch, crayfish, caviar, acre, iraq, kira, kari, ikar, cairo, arik
vrane, arwen, equals, avren, right, narva, avner, narev, nerve, varna, revan, waren
Ryan, Inar, Nari, Iran, Rain, Henry, Arin, Rina, Nair, Rani, Arni
measure, marek, mark, ermak, gloom, cameras, cream, rivers, edge, karma, frame
anchor, butra, karon, koran, crown, mink, acron, ranco, wounds, carnot, craon
Satyr, Sitar, Grow, Tyras, Sirta, Sarti, Trias, Tarius, Trasi, Ritas, Istra
Krua, hand, arch, raku, kaur, urka, acre, pun, ukra, urak, kura
sector, corset, roquets, surroundings, campfire, bonfire, row, Cortes, stoker, construction sites, cross
noir, naru, urn, fleece, nura, wound, uranium, naur, ruan, arun, arnu
And also sets of anagrams with words of maximum length:
>>> for k in sorted([k for k in anagrams if len(anagrams[k])>1], key=len, reverse=True)[:10]: print ', '.join(anagrams[k])
anagramsshort films, short films
astrological, realistic
characterized, characterized
psychopathology, pathopsychology
ontological, neolithic
oligarchic, archaeologically
topological, typological
isothermal, isometric
canine, oncological
Oskograd, Stragorod
We are looking for words in which all the letters are different, again changing the same long-suffering line. I'm already starting to regret that I have not made a function of it:
>>> print ', '.join(sorted([k for k in d if len(set(k))==len(k)],key=lambda w:(len(w),d[w]),reverse=True)[:100])
the wordshard-core, quadrangle congratulatory, quad pulse, highly specialized, multibrand, sound pickups, four ohpolyusnik, sloppiness, obschefrantsuzsky, Gumpoldskirchen, Bolshenarymskoe, congratulatory, Cotton, obschefrantsuzskim, incendiary, connected, yuzhnoalbertsky, extremophilic, fastsikulyatornye, highly specialized, non-Newtonian, subvertical, podkantselyaristy, viscoplastic, chetyrohabsidnoy, ritmendblyuzovyh, recline obscheitalyanskuyu, extremophilic, Germanophile, dvuhprestolnym, viscoelastic, film producers, Ministry of Fisheries, four-bedroom, normalizing, non-degenerating, temperature smoothie, hae exposed, crushing, pickup, handling, coal mining, pickup, pulitzer, braunschweig , , , , , ,
, — , :
>>> def output(lst): print ', '.join(sorted(lst,key=lambda w:(len(w),d[w]),reverse=True)[:100]) >>> output([k for k in d if min(Counter(k).values())>=2])
, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
, — , , , , , ..
:
>>> output([k for k in d if set(Counter(k).values())==set([2])])
, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
, , …
, … , . , , .
PS —
yadi.sk/d/cRHJvinF7Vt8w