Digital Dictionary from A to Z

To my mind, one of the most useful programs on a PC or a smartphone is an electronic dictionary. In the ancient times when I was learning a foreign language, every word had to be looked up in a paper dictionary. I performed this trivial operation hundreds of times, and some pesky words I had to look up again and again because I kept forgetting their meaning. How frustrating it was! Nowadays, whoosh, and the translation is on the screen before my eyes. There is even a search history, in case the word you looked up has not made it from short-term to long-term memory.

Stardict

Let's create our own electronic dictionary for StarDict / GoldenDict. It may take many man-hours or only a few, depending on the quality of the source material.

Step One: OCR

Unlike in mountaineering, the hardest step in digitizing a dictionary is not the last but the first. If you have to OCR a paper dictionary with faded pages, tiny print, various traces of casual use, or an exotic language, even FineReader will not help much. On some pages, the time difference between typing by hand and OCR followed by error correction is negligible.

I advise you to save everything in plain text files, since advanced search and error correction, tagging, sorting, conversion and other operations on a body of text are all but impossible with a binary file.

At this step, it is important to determine the structure of the dictionary entries. In the simplest case, there will be only two fields: a key and a value. This is sufficient, but if you need to distinguish the various elements of an entry, you have to mark up all such elements in some consistent way.

It's time to talk a bit about formats. There are many electronic dictionary formats.

We will not analyze all formats here, as most of them are proprietary. We are interested in open standards and open source software.

Dictd

Dictd originated in an era when TCP/IP network protocols were multiplying freely, and it is now mostly of archaeological interest. This client-server protocol, which uses TCP port 2628, is defined in RFC 2229.

The source file for the dictionary is formatted as follows.

 :headword: definition

For example, such a dictionary

 :catalysis: "increase in the rate of a chemical reaction due to the participation of an additional substance called a catalyst, which is not consumed in the catalyzed reaction and can continue to act repeatedly. " <a href="is.gd/v6a22Q">ref</a>.
 :deconstruction:
 :rendered: eg. "rendered irrelevant."
 :reading: cf. 'reading of'
 :minor: a minor reading.
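If your entries already live in a script rather than in a text editor, a source file in this shape can be generated. Below is a minimal sketch in Python; the function name `to_dictfmt_source` is my own, and I assume one entry per line with the headword wrapped in colons, as in the example above.

```python
def to_dictfmt_source(entries):
    """Render {headword: definition} as ':headword: definition' lines."""
    lines = []
    for headword, definition in entries.items():
        if ":" in headword:
            # A colon inside the headword would break the delimiter.
            raise ValueError(f"headword contains a colon: {headword!r}")
        # Collapse newlines and runs of spaces so each entry stays on one line.
        text = " ".join(definition.split())
        lines.append(f":{headword}: {text}".rstrip())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {"minor": "a minor reading.", "reading": "cf. 'reading of'"}
    print(to_dictfmt_source(sample), end="")
```

Redirect the output to a file such as mydict.txt and it is ready for the next step.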

The finished dictionary file is created with the dictfmt command.

 dictfmt --utf8 -s "  " -j dict-name < mydict.txt

As a result, two files are generated: dict-name.index and dict-name.dict. The first is obviously an index file, and nothing needs to be done with it; the second can be compressed with the dictzip command, which compresses the *.dict file using the gzip format. The question immediately arises: why is it needed when there is regular gzip?

The fact is that dictzip uses extra bytes in the header of the archive files to provide pseudo-random access to the file.
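To see those extra bytes for yourself, here is a rough Python sketch that reads the optional "extra field" of a gzip header (RFC 1952). The function names are mine. The layout of the `RA` subfield (version, chunk length, chunk count, then one 16-bit compressed size per chunk) is taken from my reading of the dictzip man page, so treat it as an assumption and check it against your own files.

```python
import struct


def gzip_extra_subfields(data):
    """Return {subfield_id: payload} from the extra field of a gzip header."""
    if data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = data[3]
    if not flags & 0x04:  # FEXTRA bit is not set, so there is no extra field
        return {}
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    subfields = {}
    pos = 0
    while pos + 4 <= len(extra):
        sub_id = extra[pos:pos + 2]
        (sub_len,) = struct.unpack_from("<H", extra, pos + 2)
        subfields[sub_id] = extra[pos + 4:pos + 4 + sub_len]
        pos += 4 + sub_len
    return subfields


def dictzip_chunks(data):
    """Return (chunk_length, compressed_sizes), or None without an 'RA' subfield."""
    payload = gzip_extra_subfields(data).get(b"RA")
    if payload is None:
        return None
    # Assumed layout: version, chunk length, chunk count (little-endian uint16).
    _version, chunk_len, chunk_count = struct.unpack_from("<HHH", payload, 0)
    sizes = struct.unpack_from(f"<{chunk_count}H", payload, 6)
    return chunk_len, list(sizes)
```

With the chunk length and the list of compressed sizes, a reader can work out which chunk holds a given offset and decompress only that chunk, which is where the pseudo-random access comes from.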

Finally, place the files in the appropriate directory, /usr/lib/dict, restart the dictd service, and voila. The search syntax is simple; just type

 dict WORD
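Since the protocol is plain text, the same lookup can be done without the dict client. What follows is a hedged sketch of the RFC 2229 DEFINE exchange in Python; the function names and the defaults (`localhost`, every database via `*`) are my assumptions, and error handling is left out.

```python
import socket


def build_define(word, database="*"):
    """Format a DEFINE command; "*" asks for matches from every database."""
    quoted = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'DEFINE {database} "{quoted}"\r\n'.encode("utf-8")


def read_definitions(lines):
    """Consume reply lines up to the final status and return definition texts."""
    definitions = []
    body = None  # lines of the definition being read, or None between them
    for line in lines:
        line = line.rstrip("\r\n")
        if body is not None:
            if line == ".":  # a lone dot ends the text block
                definitions.append("\n".join(body))
                body = None
            else:
                # A leading dot is doubled on the wire; undo that here.
                body.append(line[1:] if line.startswith("..") else line)
        elif line.startswith("151"):  # one definition follows
            body = []
        elif line.startswith("150"):  # "n definitions retrieved"
            continue
        else:  # 250 ok, 552 no match, or an error status
            break
    return definitions


def define(word, host="localhost", port=2628):
    """Ask a DICT server for a word and return the definitions as strings."""
    with socket.create_connection((host, port), timeout=10) as sock:
        reader = sock.makefile("r", encoding="utf-8", newline="")
        reader.readline()  # skip the 220 greeting banner
        sock.sendall(build_define(word))
        definitions = read_definitions(reader)
        sock.sendall(b"QUIT\r\n")
        return definitions
```

Calling `define("catalysis")` against a local dictd should return the same text that `dict catalysis` prints, minus the decorations.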

Browsing dictd links is like a safari through the Internet of the 90s: alive and still kicking!

Sdict

A daring attempt by Alexei Semenov to change the world for the better with Perl magic, back when Microsoft was not yet flirting with Linux and the open source community, and ABBYY Lingvo was the main source of dictionaries.

The header of the dictionary source file:

 <header>
 title = Sample 1 test dictionary - dictionary name;
 copyright = GNU Public License - copyright information;
 version = 0.1 - version;
 w_lang = en - language for words;
 a_lang = fi - language for articles. For further information about language codes refer 'C:\Sdict\share\doc\iso639.htm' file;
 # charset = ... - use if your source file is not in UTF-8 encoding.
 </header>

The body is formatted as follows:

 word___article

You can even download a version for Symbian OS, if it comes to that. The project is no longer alive, and the dictionaries themselves can now only be retrieved from the Wayback Machine.

XDXF

Well, that's it: we are done with archaeology and move on to dictionary formats and programs suitable for use IRL.

XDXF has all the advantages and disadvantages of XML, which is what it is. The full format syntax and examples can be found in the XDXF specification.

The skeleton of a dictionary file consists of two parts, meta_info and lexicon, and looks as follows.

 <xdxf ...>
 <meta_info>
 ...
 </meta_info>
 <lexicon>
 <ar>article 1</ar>
 <ar>article 2</ar>
 <ar>article 3</ar>
 <ar>article 4</ar>
 ...
 </lexicon>
 </xdxf>
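A skeleton like this is easy to generate from a script. Here is a small sketch using Python's standard xml.etree.ElementTree. The elements xdxf, meta_info, lexicon and ar come from the skeleton above; the `lang_from` / `lang_to` attributes, the `title` element, and the `k` (key) and `def` (definition) children are how I remember the XDXF draft, so verify them against the specification before relying on them.

```python
import xml.etree.ElementTree as ET


def build_xdxf(title, entries, lang_from="ENG", lang_to="ENG"):
    """Build a minimal XDXF document from a {key: definition} mapping."""
    # Attribute names are assumptions based on the XDXF draft.
    root = ET.Element("xdxf", {"lang_from": lang_from, "lang_to": lang_to})
    meta = ET.SubElement(root, "meta_info")
    ET.SubElement(meta, "title").text = title
    lexicon = ET.SubElement(root, "lexicon")
    for key, definition in entries.items():
        article = ET.SubElement(lexicon, "ar")
        ET.SubElement(article, "k").text = key           # headword
        ET.SubElement(article, "def").text = definition  # article text
    # ElementTree escapes &, < and > in the text for us.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_xdxf("Sample dictionary", {"minor": "a minor reading."}))
```

Building the tree through a library rather than by string concatenation means a stray ampersand in an article cannot break the file.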

There are a huge number of dictionaries in this format. The big advantage of the format is that there is no need to convert anything further. The program GoldenDict recognizes XDXF files along with a large number of other supported formats.

TSV / StarDict

StarDict and its clones are less about an electronic dictionary format than about quality software for viewing, converting and creating dictionaries.

A TSV file is enough to create an electronic dictionary with StarDict, and that is what I chose for my digital copy of an Armenian-Russian dictionary.

Some formatting and layout is still possible in the dictionary file, though it is no match for XDXF.

 a	1\n2\n3
 b	4\\5\n6
 c	789

The format defines the escape sequence \n for a line break, for cases where an article is divided into paragraphs.
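If the entries are produced by a script, the escaping is best done in one place. A minimal Python sketch follows; the function names are mine, and I assume only the two escapes visible in the sample above (\\ for a backslash, \n for a line break), with a tab between the word and the article.

```python
def escape_article(text):
    """Escape an article for a one-line TSV record."""
    # Backslashes go first, so the backslash added for "\n" is not doubled.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def to_tsv_line(word, article):
    """Join a headword and its article with a tab, one record per line."""
    if "\t" in word or "\n" in word:
        raise ValueError(f"headword must not contain tabs or newlines: {word!r}")
    return f"{word}\t{escape_article(article)}"


if __name__ == "__main__":
    for word, article in [("a", "1\n2\n3"), ("b", "4\\5\n6"), ("c", "789")]:
        print(to_tsv_line(word, article))
```

Run as a script, it reproduces the three sample lines shown above.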

Step Two: Adjustment

After the first step, there will most likely be dozens or even hundreds of spelling, grammatical and other errors, strange characters and other OCR artifacts.

The peculiarity of dictionaries is that spell checking is needed simultaneously in two languages. Even now, in 2018, surprisingly few text editors and even office suites are able to perform this simple operation.

Not to start a holy war, but I recommend doing the text processing in Vim. If your favorite text editor handles it just as well, that's fine. In Vim, a single command is enough

 :setlocal spell spelllang=en,ru

to check spelling in two languages at once, in this case Russian and English. Next, a list of pitfalls.

Text sorting is unreliable for non-Latin locales, and especially bad where a single letter is written with more than one character, such as the Armenian ու = ո + ւ. In such cases you have to sort the word list yourself with a simple Perl or other script.
Pattern matching can also behave unexpectedly in some locales, even when both the text and the console are in UTF-8.
When digitizing a printed dictionary, be prepared not only for digitization errors but also for errors in the printed dictionary itself. There may be plenty of them!
If the headwords are printed in capital letters, it may be worth converting them to lowercase when digitizing. Not all letters have uppercase forms; in fact, not all scripts have uppercase at all.
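The "sort it yourself" advice from the first pitfall can be sketched in a few lines. I use Python here rather than Perl, and the function name `make_sort_key` is my own. The idea is to split each word into letters by longest match against an explicit alphabet, then sort by letter position. As a stand-in I use a fragment of the Czech alphabet, where "ch" is a single letter that comes after "h"; for Armenian you would supply the letter order of your printed dictionary, with ու listed as one letter.

```python
def make_sort_key(alphabet):
    """Return a key function that sorts by the given letter order.

    Letters may span several characters; the longest match wins.
    """
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def sort_key(word):
        key = []
        i = 0
        while i < len(word):
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    key.append((rank[chunk], 0))
                    i += len(chunk)
                    break
            else:
                # Unknown character: place it after every known letter.
                key.append((len(alphabet), ord(word[i])))
                i += 1
        return key

    return sort_key


if __name__ == "__main__":
    czech_key = make_sort_key(["a", "c", "h", "ch", "i", "l", "o", "r", "t"])
    print(sorted(["chata", "hora", "cihla"], key=czech_key))
    # prints ['cihla', 'hora', 'chata']
```

A plain `sorted()` on the same list puts "chata" first, which is exactly the kind of quiet mistake that is hard to spot in a dictionary of several thousand entries.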

Step Three: Compile the Dictionary

For the XDXF format, as already mentioned, this step is not required. Just drop the file into /usr/share/goldendict, where the program will pick it up.

For a TSV file, use the stardict-editor utility supplied with the StarDict toolkit.

stardict-editor

The program outputs the following files, much like the ancient dictd.

somedict.ifo
somedict.idx or somedict.idx.gz
somedict.dict or somedict.dict.dz
somedict.syn (optional)

The files are copied to the /usr/share/stardict/dic directory and that's it.

P.S. On Android, GoldenDict has unexpectedly become a paid app, but you can still find the last free version on the Internet.

Source: https://habr.com/ru/post/421075/
