📜 ⬆️ ⬇️

Add a furigana to a kanji Python macro for LibreOffice

Ladies and gentlemen, the plan is:





In modern Japanese, there are mainly three written systems.
')
First, these are two syllabary alphabets: hiragana and katakana. Hiragana is more rounded, it looks like this: こ れ は ひ ら が な で す and is, as it were, the main alphabet. Katakana is more angular (カ タ カ ナ デ) and is used mainly for borrowed words, but the whole set of hiragana and katakana is almost the same. Then we will call all this simply “kana”. “The syllabary alphabet” means that instead of our vowels and consonants “a”, “b” and “c” - only whole syllables like “ka”, “ca” and “that”. There are five vowels, though there are five pieces (“a”, “i”, “u”, “e“, “o” + “I”, “y” and “e”) and only one consonant “n” in exception order.

That is why it is very difficult for the Japanese to pronounce the words with consecutive consonants - they are simply not used to this, but this does not matter to us now. In principle, one cana can write any phrase in Japanese.

Another system is the hieroglyphs borrowed from China, which we will further call kanji , after that they are so called. After borrowing, the Japanese, and sobsno the Chinese, too, the kanji changed significantly, and now they are quite different, although of course, on the other hand, they were largely similar. Let's just say, looking at the Chinese text, the Japanese can more or less understand what they are talking about. Kanji looks like this: 友 達 、 日本 酒 、 世界。 Yes, in Japanese - a round dot.

Here is the key point for understanding: Japanese and Chinese at the grammar level are not at all related. So just take Chinese signs and start writing with them was not possible. Actually, using kanji, you can write individual words, or rather even the basics of words, and kana is still used to indicate grammatical forms and the connection between words. It looks like this: 送 り が な は と っ て も 便利 で す. If you look closely, you can see that the first character is kanji, followed by several kana characters, etc. This trick is easy to visually distinguish Japanese text from Chinese, which looks graphically more “dense” because there is only kanji. This kana, which clings to kanji to indicate grammatical form, is called “okurigana”.

Well, finally ... The number of kanji is quite large, and if you are not a robot, then everything is difficult to remember. If a word is written in a kanji, it is often not obvious how to read it yourself, despite the fact that in oral speech the word could easily occur and the person knows it. To help in this situation, especially for rare kanji or when the text is intended for children, foreigners, or other mentally limited categories of citizens - reading kanji is signed from above using kana. This is called furigana . It looks like the picture at the beginning of the post.

Fuh, go to the next item.

To add annotation over the text is used the so-called ruby. It has no relation to the programming language. As I just learned from Wikipedia - in Russian is called “agate”

Ruby support is in html using a ruby ​​tag:
<ruby>  <rt></rt> </ruby> 


But now we are interested in LibreOffice. In manual mode, you can add a Ruby annotation to the text through the menu Format -> Asian Phonetic Guide. This is somewhat strange, because it is possible to use the field not only for phonetics, but figs with them. If this is not in the menu, then you can try to add support for Asian languages ​​in Tools -> Options -> Language Settings.

Next, we want to do this automatically for the selected text. LibreOffice is beautiful because you can write macros in Python. To do this, there should be a module libreoffice-script-provider-python (put through apt-get), which is not worth it by default. Oh yeah, I do everything under Ubuntu, if you have another operating system, then you can share a recipe for it in the comments :)

Actually, the macro is written as a normal function on Python. The document is visible through a global variable with the instance of the corresponding class and, in fact, it contains all the necessary methods.

Here is a simple example:
 def HelloWorldPython(): desktop = XSCRIPTCONTEXT.getDesktop() model = desktop.getCurrentComponent() if not hasattr(model, "Text"): model = desktop.loadComponentFromURL("private:factory/swriter","_blank", 0, () ) text = model.Text tRange = text.End tRange.String = "Hello World (in Python)" return None g_exportedScripts = HelloWorldPython, 


Save to file, put it or make a symbolic link in the folder in which LibreOffice holds the scripts. In my case, this is “~ / .config / libreoffice / 4 / user / Scripts / python”.

Open LibreOffice Writer (OpenOffice should work too), go to Tools -> Macros -> Run Macro and see our script there, if everything worked out.

It remains to write a script that would take the kanji from the document and add their readings in the corresponding characters in ruby. Everything is simple: to generate the reading there are special programs, we simply run them from our macro script, run the Japanese text through standard I / O and insert an output into the document.

A program called kakasi takes the Japanese text and gives a reading to the whole piece, but this is not exactly what is needed, because I want to distribute the fragments of the phonetic prompt between the ruby ​​fields of the corresponding symbols. To do this, with the help of mecab, you can tokenize the Japanese text, and therefore already feed it kakasi in parts. In fact, the accuracy of reading generation slightly deteriorates from this, but the layout of the document improves. Any flaws can then be corrected manually.

That's all, set up apt-get kakasi, mecab,
go to github.com/undertherain/furiganize , download the clumsy script that I wrote and which myself, but this is all done. Put it in the desired folder and enjoy. If someone shares his experience with other operating systems, it will be great in general.

Source: https://habr.com/ru/post/395153/


All Articles