
In projects that process a lot of multilingual content, it is almost impossible to do without a good grammar checker. Writing a separate module for each language takes a lot of time and money. You also need to know the language very well, practically as a linguist, and at the same time be able to express its rules formally. The task is not easy. In this case, an effective approach is to take a ready-made solution and integrate it into the application. In this article I will show a simple way to do this using the tools of OpenOffice.
To begin with, let's figure out what the grammar packages in OpenOffice actually consist of. As an example, take the Russian language files. From the OpenOffice site we download the archive with the language pack. Inside the archive there are three files: readme.txt, ru_RU.dic and ru_RU.aff. The readme is self-explanatory; read it if necessary. The file with the .dic extension is a dictionary of Russian words. Naturally, it does not contain every word and will need to be extended over time. The second file, with the .aff extension, is the grammar file: it describes the affix rules (prefixes and suffixes) used to build word forms from the dictionary entries. OpenOffice dictionaries come in the MySpell format or its Hunspell extension; either one is suitable for our module.
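To make the layout of the .dic file more concrete, here is a minimal sketch of how its lines could be read. It assumes the usual Hunspell/MySpell layout: the first line holds an approximate word count, and each following line holds one word, optionally followed by a slash and the affix flags that refer to rules in the .aff file. The class and method names (DicFileSketch, parseEntry, parse) are hypothetical, the sample lines are invented, and the sketch ignores extras such as escaped slashes inside words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DicFileSketch {

    /** One dictionary entry: the word itself and its (possibly empty) affix flags. */
    static final class Entry {
        final String word;
        final String flags;

        Entry(String word, String flags) {
            this.word = word;
            this.flags = flags;
        }
    }

    /** Splits a line such as "cat/AB" into the word and its flags. */
    static Entry parseEntry(String line) {
        // Drop anything after a tab (optional extra data in some dictionaries).
        int tab = line.indexOf('\t');
        String core = (tab >= 0 ? line.substring(0, tab) : line).trim();
        int slash = core.indexOf('/');
        if (slash < 0) {
            return new Entry(core, "");
        }
        return new Entry(core.substring(0, slash), core.substring(slash + 1));
    }

    /** Parses the lines of a .dic file; the first line (the word count) is skipped. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<Entry>();
        for (int i = 1; i < lines.size(); i++) {
            if (!lines.get(i).trim().isEmpty()) {
                entries.add(parseEntry(lines.get(i)));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Invented sample content, not taken from a real dictionary.
        List<String> sample = Arrays.asList("2", "cat/AB", "very");
        for (Entry e : parse(sample)) {
            System.out.println(e.word + " flags=" + e.flags);
        }
    }
}
```

In a real application the Hunspell library reads these files itself; the sketch is only meant to show what the two files contain.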
Now that we have dealt with the file formats, let's proceed to the module itself. For it we need the Hunspell API, which can be found via the link at the bottom of the article. The API was written by a third-party developer and is sufficient for a very basic grammar checker. We will write in Java. For testing, we will write a simple servlet that receives a word as a parameter, checks it, and returns the corrected variants if the word is misspelled. Plenty of articles already explain how to create a web application in Java, so we will not dwell on that and will start right away with the most important part. Let's make a small form consisting of an input field and a button:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>JSP Page</title>
</head>
<body>
<form method="POST" action="http://localhost:8080/SpellChecker/spellChecker">
<input type="text" name="input" />
<input type="submit" name="submit" />
</form>
</body>
</html>
Now let's write the servlet itself. In the servlet we receive the data from the form, pass it to the Hunspell library, process the response, and send it back to the browser. Consider the methods that do this:
1.  protected void processRequest(HttpServletRequest request, HttpServletResponse response)
2.          throws ServletException, IOException {
3.      response.setContentType("text/html; charset=UTF-8");
4.      PrintWriter out = response.getWriter();
5.      try {
6.          String input = request.getParameter("input");
7.          List<String> list = check(input);
8.          request.setAttribute("output", list);
9.          Iterator<String> iter = list.iterator();
10.         while (iter.hasNext()) {
11.             out.println(iter.next());
12.         }
13.
14.     } finally {
15.         out.close();
16.     }
17. }
Let's go through it step by step:
Lines 1 and 2 declare the method that processes the request. On line 3 we state that the response is an HTML page encoded in UTF-8, to avoid problems with special characters. On line 4 we obtain the writer that is used to output the data. Here the fun begins. On line 6 we read the parameter with the input data, i.e. the word that needs to be checked. We pass it to the method check(String input), which we will analyze later. In short, this method checks the word for errors and returns a list of corrections. The list is also stored in the request attribute "output" (line 8). After that, we iterate over the list of corrections (lines 9-12) and write each entry to the stream, that is, display the list on the screen. On line 15 we close the stream.
1.  public List<String> check(String input) {
2.      List<String> list = new ArrayList<String>();
3.      List<String> stemList = new ArrayList<String>();
4.      try {
5.          String newInput = "";
6.          if (input != null) {
7.              newInput = new String(input.getBytes("ISO-8859-1"), "UTF-8");
8.          }
9.          Hunspell hunspell = Hunspell.getInstance();
10.         Dictionary dict = hunspell.getDictionary("ru_RU"); // base name of ru_RU.dic / ru_RU.aff
11.         if (dict.misspelled(newInput)) {
12.             list = dict.suggest(newInput);
13.         } else {
14.             list.add(newInput);
15.         }
16.
17.     } catch (FileNotFoundException ex) {
18.         Logger.getLogger(spellChecker.class.getName()).log(Level.SEVERE, null, ex);
19.     } catch (UnsupportedEncodingException ex) {
20.         Logger.getLogger(spellChecker.class.getName()).log(Level.SEVERE, null, ex);
21.     }
22.     list.addAll(stemList);
23.     return list;
24. }
And here is the method check(String input). On lines 2 and 3 we create two new lists. The final result goes into the first one, and the second one serves as a temporary buffer (in this listing it stays empty). Lines 5-8 convert the request parameter to UTF-8, because the dictionary understands only this encoding: the bytes that were read as ISO-8859-1 are decoded again as UTF-8. On line 9 we initialize Hunspell, and on line 10 we specify which dictionary to use. In this case, it is the Russian dictionary. The grammar checker files (ru_RU.dic, ru_RU.aff) must be in the same directory as the program, so let's not forget to copy them there. Next (line 11) we check whether the word is spelled correctly. If the word is misspelled (lines 11-13), we get the list of suggested spellings. If the word is spelled correctly (lines 13-15), we return it unchanged. Everything is simple and brief. Lines 17 through 21 are the error handlers. The first one (FileNotFoundException) covers missing grammar files, and the second one (UnsupportedEncodingException) relates to the conversion of the request to UTF-8 and is triggered when an encoding is not supported. On line 22 we append everything from the temporary buffer (stemList) to the result (list), and on line 23 we return it.
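The conversion on line 7 is the least obvious part of check, so here is a small self-contained sketch of what it does. It assumes that the container decoded the UTF-8 bytes of the form as ISO-8859-1, which is the default in the servlet specification when no request encoding is set. The sketch also restates the misspelled/suggest branch against a tiny hypothetical interface, SpellDictionary, so that it runs without the Hunspell library; the interface, the class name CheckSketch and the toy dictionary are illustrations, not the Hunspell API itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CheckSketch {

    /** Hypothetical stand-in for the dictionary object used in the article. */
    interface SpellDictionary {
        boolean misspelled(String word);
        List<String> suggest(String word);
    }

    /** Undoes a wrong ISO-8859-1 decoding of bytes that were really UTF-8. */
    static String redecode(String wronglyDecoded) {
        // ISO-8859-1 maps every byte to exactly one char, so the original bytes come back intact.
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    /**
     * Suggestions for a misspelled word, otherwise the word itself.
     * Unlike the article's listing, a null input simply yields an empty list (an assumption).
     */
    static List<String> check(SpellDictionary dict, String input) {
        List<String> result = new ArrayList<String>();
        if (input == null) {
            return result;
        }
        String word = redecode(input);
        if (dict.misspelled(word)) {
            result.addAll(dict.suggest(word));
        } else {
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        // "\u043a\u043e\u0442" is the Russian word for "cat", written with escapes
        // so that the source file stays pure ASCII.
        final String known = "\u043a\u043e\u0442";
        // Toy dictionary that knows a single word; purely for demonstration.
        SpellDictionary toy = new SpellDictionary() {
            @Override
            public boolean misspelled(String word) {
                return !known.equals(word);
            }

            @Override
            public List<String> suggest(String word) {
                return Arrays.asList(known);
            }
        };
        // Simulate the container: UTF-8 bytes decoded with the wrong charset.
        String mangled = new String(known.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(check(toy, mangled).equals(Arrays.asList(known)));
    }
}
```

If you control the servlet, calling request.setCharacterEncoding("UTF-8") before the first parameter is read usually makes the manual conversion unnecessary.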
Let me remind you once again that the .aff and .dic files must be copied to the application folder so that the program can use them. If that is not possible, you can specify the full path to the dictionary on line 10. The file extension must not be included.
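Since the dictionary is addressed by its base name without an extension, a quick check that both files are actually present can save some debugging. The following sketch uses only the standard library; DictionaryPathSketch, baseName and the temporary directory are assumptions made for illustration and are not part of the Hunspell API.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class DictionaryPathSketch {

    /**
     * Builds the base path (directory plus language code, no extension) and
     * verifies that the matching .dic and .aff files exist.
     */
    static String baseName(Path directory, String language) throws FileNotFoundException {
        String base = directory.resolve(language).toString();
        Path dic = Paths.get(base + ".dic");
        Path aff = Paths.get(base + ".aff");
        if (!Files.isRegularFile(dic) || !Files.isRegularFile(aff)) {
            throw new FileNotFoundException("Missing " + dic + " or " + aff);
        }
        return base;
    }

    public static void main(String[] args) throws IOException {
        // Throwaway directory with empty placeholder files, just for the demo.
        Path dir = Files.createTempDirectory("dict");
        Files.createFile(dir.resolve("ru_RU.dic"));
        Files.createFile(dir.resolve("ru_RU.aff"));
        System.out.println(baseName(dir, "ru_RU"));
    }
}
```

The returned string is the kind of value that would be passed to getDictionary on line 10 of the listing.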
If desired, this servlet can be turned into a small service that communicates via XML. You can plug in more Hunspell languages, which are free and constantly updated. You can also connect a language recognition library, which I will discuss in the next article.
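As a rough idea of what such an XML reply might look like, the sketch below builds a small document from a word and its corrections. The element names (spellcheck, word, suggestion) and the class XmlReplySketch are invented for this example; no particular schema is implied.

```java
import java.util.Arrays;
import java.util.List;

class XmlReplySketch {

    /** Escapes the five characters that XML reserves in text and attribute values. */
    static String escapeXml(String text) {
        // The ampersand must be replaced first so that later entities are not escaped twice.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("'", "&apos;");
    }

    /** Builds the reply: the checked word followed by one element per suggestion. */
    static String toXml(String word, List<String> suggestions) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<spellcheck>\n");
        sb.append("  <word>").append(escapeXml(word)).append("</word>\n");
        for (String s : suggestions) {
            sb.append("  <suggestion>").append(escapeXml(s)).append("</suggestion>\n");
        }
        sb.append("</spellcheck>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toXml("wrod", Arrays.asList("word", "rod")));
    }
}
```

A servlet using this approach would also set the content type to an XML type instead of text/html.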
Finally, here are the links where you can get everything mentioned above:
Hunspell API - dren.dk/hunspell.html
OpenOffice dictionaries - wiki.services.openoffice.org/wiki/Dictionaries
Dictionaries can also be found on their developers' pages, where they are usually much "fresher".