In my view, one of the most useful programs on a PC or a smartphone is an electronic dictionary. In the ancient times when I was learning a foreign language, every word had to be looked up in a paper dictionary. I performed this trivial operation hundreds of times, and some pesky words I had to look up again and again, because I kept forgetting their meaning. How frustrating that was! Now it is just whoosh, and the translation is on the screen before my eyes. There is even a search history, in case the word you looked up did not make it from short-term to long-term memory.
Let's create an electronic dictionary for StarDict / GoldenDict programs on our own. It may take many or few man-hours, depending on the quality of the source material.
Unlike in mountaineering, the hardest step here is not the last but the first: digitizing the dictionary. If you have to OCR a paper dictionary with faded pages, tiny print, the various marks of casual use, or an exotic language, even FineReader will not help much. On some pages, the difference in time between typing by hand and OCR followed by error correction is negligible.
I advise you to save everything in plain text files, since advanced search and error correction, tagging, sorting, conversion and other operations on a body of text are practically impossible with a binary file.
At this step, it is important to determine the structure of the dictionary entries. In the simplest case, there will be only two fields: a key and a value. This is sufficient, but if you need to highlight the various elements of entries, you need to mark up all such elements in a consistent way.
It's time to talk a bit about formats. There are many electronic dictionary formats; here is a list of them.
We will not analyze all formats here, as most of them are proprietary. We are interested in open standards and open source software.
Dictd originated in an era when TCP/IP network protocols multiplied freely, and it is now only of archaeological interest. Its client-server protocol, which uses TCP port 2628, is defined in RFC 2229.
The source file for the dictionary is formatted as follows.
:key: value
For example, a dictionary like this:
:catalysis: "increase in the rate of a chemical reaction due to the participation of an additional substance called a catalyst, which is not consumed in the catalyzed reaction and can continue to act repeatedly. " <a href="is.gd/v6a22Q">ref</a>.
:deconstruction:
:rendered: eg. "rendered irrelevant."
:reading: cf. 'reading of'
:minor: a minor reading.
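If the entries are kept in some other structure while you clean them up, producing such a source file is a few lines of scripting. Below is a minimal sketch in Python, assuming entries are stored as (headword, definition) pairs and that the layout is the `:headword: definition` form seen in the example; the function name `to_dictfmt_source` is made up for illustration.

```python
def to_dictfmt_source(entries):
    """Render (headword, definition) pairs as ':headword: definition' lines."""
    lines = []
    for headword, definition in entries:
        # A colon inside the headword would make the ':key:' delimiters ambiguous.
        if ":" in headword:
            raise ValueError(f"headword contains a colon: {headword!r}")
        # Collapse whitespace so that every entry stays on a single line.
        lines.append(f":{headword}: {' '.join(definition.split())}")
    return "\n".join(lines) + "\n"


sample = [
    ("rendered", 'eg. "rendered irrelevant."'),
    ("minor", "a minor reading."),
]
print(to_dictfmt_source(sample), end="")
```

The output can be redirected to a text file and passed on to the next step.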
The finished dictionary file is created with the dictfmt command.
dictfmt --utf8 -s " " -j dict-name < mydict.txt
As a result, two files are generated: dict-name.index and dict-name.dict. The first is obviously an index file and needs no further processing, while the second can be compressed with the dictzip command, which writes the *.dict file in a gzip-compatible format. The question immediately arises: why is it needed when there is regular gzip?
The fact is that dictzip uses the extra field in the gzip header to provide pseudo-random access to the compressed file.
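The idea behind such pseudo-random access can be shown with a simplified sketch. It is not the actual dictzip layout: here every chunk is an independent zlib stream and the table of compressed sizes lives in a Python list rather than in the gzip header. The chunk size and the function names are assumptions made for the illustration.

```python
import zlib

CHUNK = 1024  # uncompressed bytes per chunk (arbitrary value for this sketch)


def compress_chunked(data, chunk=CHUNK):
    """Compress each chunk independently; return (blob, list of compressed sizes)."""
    parts = [zlib.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    return b"".join(parts), [len(p) for p in parts]


def read_range(blob, sizes, start, length, chunk=CHUNK):
    """Return data[start:start+length], decompressing only the chunks it touches."""
    if length <= 0:
        return b""
    first = start // chunk
    last = min((start + length - 1) // chunk, len(sizes) - 1)
    offset = sum(sizes[:first])  # where the first needed chunk begins in blob
    out = bytearray()
    for i in range(first, last + 1):
        out += zlib.decompress(blob[offset:offset + sizes[i]])
        offset += sizes[i]
    skip = start - first * chunk  # position of `start` inside the first chunk
    return bytes(out[skip:skip + length])


text = b"".join(b"entry %05d: some definition text\n" % n for n in range(2000))
blob, sizes = compress_chunked(text)
assert read_range(blob, sizes, 40000, 50) == text[40000:40050]
```

With plain gzip, reading an entry near the end of the file means decompressing everything before it; with a chunk table, only the chunks that cover the requested bytes are touched.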
Finally, the files are placed in the appropriate directory, /usr/lib/dict, the dictd service is restarted, and voila. The search syntax is simple, just type
dict WORD.
Running through dictd links is like a safari on the Internet of the 90s, alive and still kicking!
Sdict was a daring attempt by Alexei Semenov to change the world for the better with Perl magic, back in the days when Microsoft had not yet started flirting with Linux and the open source community, and ABBYY Lingvo was the main source of dictionaries.
The header of the dictionary source file:
<header>
title = Sample 1 test dictionary - dictionary name;
copyright = GNU Public License - copyright information;
version = 0.1 - version;
w_lang = en - language for words;
a_lang = fi - language for articles. For further information about language codes refer 'C:\Sdict\share\doc\iso639.htm' file;
# charset = ... - use if your source file is not in UTF-8 encoding.
</header>
The body is formatted as follows:
word___article
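A source file of this shape is easy to generate from a list of pairs. The sketch below is only an illustration: it assumes that the text after " - " in the header sample is annotation rather than part of the value, and the function name `sdict_source` and its parameters are made up.

```python
def sdict_source(entries, title, w_lang, a_lang,
                 version="0.1", licence="GNU Public License"):
    """Return the text of a source file for (word, article) pairs."""
    lines = [
        "<header>",
        f"title = {title}",
        f"copyright = {licence}",
        f"version = {version}",
        f"w_lang = {w_lang}",
        f"a_lang = {a_lang}",
        "</header>",
    ]
    for word, article in entries:
        # Three underscores separate the word from its article, so the word
        # must not contain them, and an entry must stay on one line.
        if "___" in word or "\n" in word or "\n" in article:
            raise ValueError(f"cannot serialize entry: {word!r}")
        lines.append(f"{word}___{article}")
    return "\n".join(lines) + "\n"


print(sdict_source([("cat", "kissa"), ("dog", "koira")],
                   title="Sample 1 test dictionary", w_lang="en", a_lang="fi"),
      end="")
```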
You can even download a version for Symbian OS, if you need one. The project is no longer alive, and even the dictionaries themselves can only be retrieved from the Wayback Machine.
Well, that's enough archaeology; let's move on to dictionary formats and programs suitable for real-life use.
XDXF has all the advantages and disadvantages of XML, on which it is based. The full format syntax and examples can be viewed here.
The skeleton of the dictionary file consists of two parts, meta_info and lexicon, and looks as follows.
<xdxf ...>
  <meta_info>
    ...
  </meta_info>
  <lexicon>
    <ar>1</ar>
    <ar>2</ar>
    <ar>3</ar>
    <ar>4</ar>
    ...
  </lexicon>
</xdxf>
There are a huge number of dictionaries in this format. The big advantage of the format is that there is no need to convert anything further. The program GoldenDict recognizes XDXF files along with a large number of other supported formats.
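Such a skeleton can be generated with any XML library, which also takes care of escaping. Here is a minimal sketch in Python; the `<title>` element inside meta_info and the `<k>` element holding the key inside each `<ar>` are my assumptions about the format and should be checked against the specification, the attributes of the root element are simply omitted, and the function name `build_xdxf` is made up.

```python
import xml.etree.ElementTree as ET


def build_xdxf(entries, title):
    """Build an XDXF-like skeleton from (key, definition) pairs."""
    root = ET.Element("xdxf")  # root attributes are left out in this sketch
    meta = ET.SubElement(root, "meta_info")
    ET.SubElement(meta, "title").text = title  # assumed meta_info child
    lexicon = ET.SubElement(root, "lexicon")
    for key, definition in entries:
        ar = ET.SubElement(lexicon, "ar")
        k = ET.SubElement(ar, "k")  # assumed element for the headword
        k.text = key
        k.tail = definition  # the article text follows the key inside <ar>
    return ET.tostring(root, encoding="unicode")


xml_text = build_xdxf([("minor", "a minor reading."), ("lt", "1 < 2")], "Sample")
print(xml_text)
```

Special characters such as < in a definition are escaped by the library, which is one less thing to clean up by hand.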
StarDict and its clones are not so much about an electronic dictionary format as about quality software for viewing, converting and creating dictionaries.
To create an electronic dictionary with StarDict, a TSV file is enough, and that is what I chose for making a digital copy of the Armenian-Russian dictionary.
Some formatting and layout of the dictionary file is still possible, although it cannot compare with XDXF.
a	1\n2\n3
b	4\\5\n6
c	789
The format defines the escape sequence \n for a line break, for the case when an article is divided into paragraphs; as the example shows, a literal backslash is written as \\.
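When the TSV file is produced by a script, the escaping is easy to get wrong by hand. A minimal sketch, assuming one entry per line with the key and the article separated by a tab; the helper names are made up, and tabs inside the article are deliberately left unhandled here.

```python
def escape_article(text):
    """Escape an article for a one-line TSV record."""
    # Order matters: double the backslashes first, then encode line breaks,
    # otherwise the backslash of every '\n' sequence would be doubled too.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def tsv_line(key, article):
    """Return 'key<TAB>escaped article' for one dictionary entry."""
    if "\t" in key or "\n" in key:
        raise ValueError(f"key must not contain tabs or line breaks: {key!r}")
    return f"{key}\t{escape_article(article)}"


print(tsv_line("a", "1\n2\n3"))
print(tsv_line("b", "4\\5\n6"))
print(tsv_line("c", "789"))
```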
After the first step, there will most likely be dozens, or even hundreds of spelling, grammatical and any other errors, strange characters and other OCR artifacts.
The peculiarity of dictionaries is that spell checking is needed simultaneously in two languages. Even now, in 2018, surprisingly few text editors and even office suites are able to perform this simple operation.
Not to start a holy war, but I recommend processing the text with Vim. If your favorite text editor handles it just as well, that's fine. In Vim, a single command is enough
:setlocal spell spelllang=en,ru
to check spelling against two dictionaries at once, in this case English and Russian. Next, a list of pitfalls.
Some letters are digraphs made of two characters, for example ու = ո + ւ. In such cases you have to sort the word list yourself using a simple Perl script or a script in another language.
For the XDXF format, as already mentioned, no conversion is required. Just put the file into /usr/share/goldendict, where the program will pick it up.
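Returning to the digraph pitfall mentioned earlier: sorting by raw code points files every word that starts with ու under ո. A minimal sketch of a custom sort key in Python follows; the alphabet here is a short illustrative subset, the position given to ու is an assumption that should be checked against the paper dictionary, and the sample strings are made up rather than real words.

```python
def make_sort_key(alphabet):
    """Build a sort key that treats multi-character letters as single units."""
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        units, i = [], 0
        while i < len(word):
            # Try the longest candidate first so that 'ու' wins over 'ո'.
            for size in range(longest, 0, -1):
                piece = word[i:i + size]
                if piece in rank:
                    units.append((0, rank[piece]))
                    i += size
                    break
            else:
                # Unknown characters sort after all known letters.
                units.append((1, ord(word[i])))
                i += 1
        return units

    return key


ALPHABET = ["ա", "բ", "ո", "ց", "ու", "փ"]  # illustrative subset, order assumed
words = ["ուա", "փա", "ցա", "ոբ"]            # made-up strings

print(sorted(words))                               # plain code-point order
print(sorted(words, key=make_sort_key(ALPHABET)))  # digraph-aware order
```

The same key function can be passed to sorted() when ordering the lines of the TSV file by their first field.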
For a TSV file, use the stardict-editor utility supplied with the StarDict toolkit.
The program outputs a set of files, much like the ancient dictd.
The files are copied to the /usr/share/stardict/dic directory and that's it.
P.S. On the Android mobile platform, GoldenDict has unexpectedly become a paid app, but you can still find the last free version of the program on the Internet.
Source: https://habr.com/ru/post/421075/