
How I solved an old problem with ML in Python and .Net


It happens that some tasks follow you around for years. For me, one such task was gluing back together the sentences of texts in which line breaks are hard-coded, often with words hyphenated across lines as well. In practice this is text extracted from PDFs or obtained via OCR. Such texts could often be found on the sites of online libraries or in archives of old documents edited in DOS-era editors. This kind of formatting then badly hinders correct splitting into sentences (and, with hyphenation, into tokens) for subsequent NLP processing. And, frankly, showing such a document in search results just looks ugly.


I have solved this problem several times before, in Delphi and in C#. Back then it was a hard-coded algorithm in which I specified by hand, for example, what width of text should be treated as intentionally formatted. It did not always work perfectly, but in general it was enough.


At the moment I am working on some ML projects in Python. At one point it turned out that the next corpus of documents consisted of texts extracted from the PDF versions of scientific articles. Naturally, the text had been extracted with hard line breaks inside paragraphs and with hyphenation, so it was no longer possible to work with it normally. Python is attractive because it already has almost everything, but a couple of hours of searching turned up nothing sensible (perhaps I simply searched badly). So I decided to write a postprocessor for such documents once again. There were two options: port my old C# code, or write something that could be trained. I was finally pushed toward the second approach by the fact that the scientific texts had been extracted partly from two-column layouts and partly from single-column ones, and with different font sizes as well. Because of this the old version, with its hard-wired admissible boundaries, often worked incorrectly. Sit down and hand-tune the parameters again? No, the singularity is coming soon, I have no time for that! So it was decided: we are writing a library that uses machine learning.


All the code can be found in the repository: https://github.com/serge-sotnyk/pdf-lines-gluer



Markup


The beauty, and the difficulty, of machine learning is that when the algorithm fails somewhere, you often do not need to change the program itself. It is enough to collect new data (which usually has to be annotated) and rerun model training; the computer does the rest for you. Of course, there is a chance that the new data will require inventing new features or changing the architecture, but in most cases it is enough to verify that everything has started working well again. The challenge is that collecting and labeling the data can be difficult. Or very difficult. And also terribly boring :-)


So the most boring part is the markup. The corpus folder contains documents that I simply took from the Krapivin2009 corpus I was working with at the time. There are 10 documents that seemed typical to me. I annotated only 3 of them, because training on this base already gave sufficient gluing quality. If it later turns out that things are not so simple, new annotated documents will appear in this folder and the training process will be repeated.


In this case it seemed convenient to me that the files stay plain text, so the markup format is a marker at the beginning of each line: the '+' character if the line should be glued to the previous one, the '*' character if not. Here is a fragment (file 1005058.txt):


*Introduction
*Customers on the web are often overwhelmed with options and flooded with promotional messages for
+products or services they neither need nor want. When users cannot find what they are searching for, the
+e-commerce site struggles to maintain good customer relations.
*Employing a recommender system as part of a site's Customer Relationship Management (CRM) activities
+can overcome the problems associated with providing users with too little information, or too much of
+the wrong information. Recommender systems are able to assist customers during catalog browsing and are
+an effective way to cross-sell and improve customer loyalty.
*In this paper, we will compare several recommender systems being used as an essential component of
+CRM tools under development at Verizon. Our solutions are purposely for the current customers and current
+products - recommendations for new customers and new products are out of the scope of this paper.

A couple of hours of tedious work, and 3 files with 2300 examples (one line - one sample) are ready. In many cases this is already enough for simple classifiers such as logistic regression, which is what is used further on.
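For illustration, here is a minimal sketch of how such annotated files could be turned into training samples. The repository's actual helper is _featurize_text_with_annotation (described below); this fragment, with an assumed name, only shows how the '+' / '*' markers map to labels.

 def parse_annotated(raw_text: str):
     """Split annotated text into lines and boolean 'glue to previous line' labels."""
     lines, glue = [], []
     for line in raw_text.splitlines():
         if line.startswith('+'):    # '+': glue this line to the previous one
             glue.append(True)
         elif line.startswith('*'):  # '*': this line starts a new sentence / paragraph
             glue.append(False)
         else:
             continue                # skip empty or unmarked lines
         lines.append(line[1:])
     return lines, glue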


Features


Classifiers do not work with textual data directly. They are fed features: either numbers or boolean flags (which, again, are turned into the numbers 0/1) indicating whether some property is present or not. Designing the right features from good data is the key to success in machine learning. A peculiarity of our case is that the corpus consists of English texts, while I want at least minimal language independence, at least within the European languages. Therefore we apply a small trick to the textual features.


Converting the text into a list of features and labels (whether to glue a line to the previous one) is performed by the auxiliary function _featurize_text_with_annotation:


 x, y = pdf_lines_gluer._featurize_text_with_annotation(raw_text) 

Note: from here on these are mostly Python code snippets, which you can see in full in the notebook.


Used features:



An example of a feature set for a single line:


 {'this_len': 12, 'mean_len': 75.0, 'prev_len': 0, 'first_chars': 'Aa', 'isalpha': False, 'isdigit': False, 'islower': False, 'punct': ' '} 
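To make this example concrete, here is one plausible way such a dictionary could be built for each line. These definitions are assumptions of mine, not the exact code of _featurize_text_with_annotation: in this sketch isalpha, isdigit, islower and punct are computed from the last character of the previous line, and first_chars encodes the character classes of the first two characters of the current line, which at least agrees with the example above (the first line of a document has an empty previous line).

 import string

 def char_class(c: str) -> str:
     # encode a character by class: 'A' upper, 'a' lower, '0' digit, '.' anything else
     if c.isupper(): return 'A'
     if c.islower(): return 'a'
     if c.isdigit(): return '0'
     return '.'

 def line_features(lines, i, mean_len):
     prev = lines[i - 1] if i > 0 else ''
     cur = lines[i]
     last = prev[-1] if prev else ''
     return {
         'this_len': len(cur),                                    # length of the current line
         'mean_len': mean_len,                                    # mean line length in the document
         'prev_len': len(prev),                                   # length of the previous line
         'first_chars': ''.join(char_class(c) for c in cur[:2]),  # e.g. 'Aa' for "Introduction"
         'isalpha': last.isalpha(),
         'isdigit': last.isdigit(),
         'islower': last.islower(),
         'punct': last if last and last in string.punctuation else ' ',
     }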

In order for a classifier from the sklearn package to work with them, we use the DictVectorizer class: it converts string features (in our case 'first_chars') into several columns named like 'first_chars=Aa' or 'first_chars=0.' (the names can be obtained via get_feature_names()). Boolean features turn into zeros and ones, and numeric values remain numbers; their field names do not change. As a result the method returns a numpy.array of roughly the following form (only one line is shown):


 [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 39.1 30. 0. 1. 36. ]] 
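For completeness, a small sketch of how the vectorization and the train/test split could look, assuming x from _featurize_text_with_annotation is the list of per-line dicts and y the glue labels. The intermediate name x_vec, the 30% test size and the reuse of random_state=1974 are my assumptions (690 test samples out of roughly 2300 is about 30%, which matches the report below); in recent sklearn versions get_feature_names() is called get_feature_names_out().

 from sklearn.feature_extraction import DictVectorizer
 from sklearn.model_selection import train_test_split

 v = DictVectorizer(sparse=False)
 x_vec = v.fit_transform(x)         # one row per text line
 print(v.get_feature_names_out())   # ['first_chars=Aa', 'first_chars=0.', ..., 'this_len', ...]

 x_train, x_test, y_train, y_test = train_test_split(x_vec, y, test_size=0.3, random_state=1974)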

Classifier Training


Having received the set of features as an array of floating-point numbers, we can start training. We use logistic regression as the classifier. The classes are unbalanced, so we set class_weight='balanced' and check the result on the test part of the corpus:


 from sklearn.linear_model import LogisticRegression
 from sklearn.metrics import classification_report

 clf = LogisticRegression(random_state=1974, solver='liblinear', max_iter=2000,
                          class_weight='balanced')
 clf.fit(x_train, y_train)
 y_pred = clf.predict(x_test)
 print(classification_report(y_true=y_test, y_pred=y_pred))

We get the following quality indicators:


               precision    recall  f1-score   support

        False       0.82      0.92      0.86       207
         True       0.96      0.91      0.94       483

     accuracy                           0.91       690
    macro avg       0.89      0.91      0.90       690
 weighted avg       0.92      0.91      0.91       690

As you can see, we get errors of one kind or another in roughly one case out of ten. In practice this is not as scary as it sounds. The point is that even when annotating by eye it is not always clear where a paragraph ends and where a sentence ends, so the markup itself may contain such errors. The really critical mistakes are not those that happen at a sentence boundary, but those where a sentence remains broken, and in reality there are very few of them.


Restore text


It is time to recover text corrupted by extraction from a PDF. We can already determine whether a line needs to be glued to the previous one, but there is one more thing: hyphenation. Everything here is straightforward enough that I hard-coded this part (in pseudo-code):


      if the line should be glued to the previous one:
          if the previous line ends with a hyphen:
              glue it without a space, dropping the hyphen
          else:
              glue it with a space
      else:
          keep the line break \n
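A minimal Python sketch of the same strategy (assuming the lines and the classifier's glue decisions are already available; the repository's _preprocess_pdf is the real implementation):

 def restore_text(lines, glue_flags):
     """glue_flags[i] is True when line i should be attached to the previous one."""
     out = []
     for line, glue in zip(lines, glue_flags):
         line = line.strip()
         if glue and out:
             if out[-1].endswith('-'):
                 out[-1] = out[-1][:-1] + line    # hyphenated word: drop the hyphen, no space
             else:
                 out[-1] = out[-1] + ' ' + line   # ordinary line wrap: join with a space
         else:
             out.append(line)                     # keep the line break
     return '\n'.join(out)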

Armed with this strategy, we restore the English text (the ligature defects such as the missing "ff" and "fi" are in the original: it is simply copied as-is from the Krapivin2009 corpus):


Original English text
 text = """The rapid expansion of wireless services such as cellular voice, PCS (Personal Communications Services), mobile data and wireless LANs in recent years is an indication that signicant value is placed on accessibility and portability as key features of telecommunication (Salkintzis and Mathiopoulos (Guest Ed.), 2000). devices have maximum utility when they can be used any- where at anytime". One of the greatest limitations to that goal, how- ever, is nite power supplies. Since batteries provide limited power, a general constraint of wireless communication is the short continuous operation time of mobile terminals. Therefore, power management is y Corresponding Author: Dr. Krishna Sivalingam. Part of the research was supported by Air Force Oce of Scientic Research grants F-49620-97-1- 0471 and F-49620-99-1-0125; by Telcordia Technologies and by Intel. Part of the work was done while the rst author was at Washington State Univer- sity. The authors' can be reached at cej@bbn.com, krishna@eecs.wsu.edu, pagrawal@research.telcordia.com, jcchen@research.telcordia.com c 2001 Kluwer Academic Publishers. Printed in the Netherlands. Jones, Sivalingam, Agrawal and Chen one of the most challenging problems in wireless communication, and recent research has addressed this topic (Bambos, 1998). Examples include a collection of papers available in (Zorzi (Guest Ed.), 1998) and a recent conference tutorial (Srivastava, 2000), both devoted to energy ecient design of wireless networks. Studies show that the signicant consumers of power in a typical laptop are the microprocessor (CPU), liquid crystal display (LCD), hard disk, system memory (DRAM), keyboard/mouse, CDROM drive, oppy drive, I/O subsystem, and the wireless network interface card (Udani and Smith, 1996, Stemm and Katz, 1997). A typical example from a Toshiba 410 CDT mobile computer demonstrates that nearly 36% of power consumed is by the display, 21% by the CPU/memory, 18% by the wireless interface, and 18% by the hard drive. Consequently, energy conservation has been largely considered in the hardware design of the mobile terminal (Chandrakasan and Brodersen, 1995) and in components such as CPU, disks, displays, etc. Signicant additional power savings may result by incorporating low-power strategies into the design of network protocols used for data communication. This paper addresses the incorporation of energy conservation at all layers of the protocol stack for wireless networks. The remainder of this paper is organized as follows. Section 2 introduces the network architectures and wireless protocol stack considered in this paper. Low-power design within the physical layer is brie y discussed in Section 2.3. Sources of power consumption within mobile terminals and general guidelines for reducing the power consumed are presented in Section 3. Section 4 describes work dealing with energy ecient protocols within the MAC layer of wireless networks, and power conserving protocols within the LLC layer are addressed in Section 5. Section 6 discusses power aware protocols within the network layer. Opportunities for saving battery power within the transport layer are discussed in Section 7. Section 8 presents techniques at the OS/middleware and application layers for energy ecient operation. Finally, Section 9 summarizes and concludes the paper. 2. Background This section describes the wireless network architectures considered in this paper. 
Also, a discussion of the wireless protocol stack is included along with a brief description of each individual protocol layer. The physical layer is further discussed. """
 corrected = pdf_lines_gluer._preprocess_pdf(text, clf, v)
 print(corrected)

After recovery we get:


Recovered English text

In recent years, there has been an indication of the number of mobile services that have been given to mobile communication and wireless LANs. .), 2000). It’s not a problem that it can be used at any time. It is the Corresponding Author: Dr. Krishna Sivalingam who has been a research team supported by F-49620-97-10471 and F-49620-99-1-0125; by Telcordia Technologies and by Intel. At the State University of the United States of America, the author of the article was at cej@bbn.com, krishna@eecs.wsu.edu, pagrawal@research.telcordia.com, jcchen@research.telcordia.com c
2001 Kluwer Academic Publishers. Printed in the Netherlands.
Jones, Sivalingam, Agrawal and Chen have agreed to choose this topic (Bambos, 1998). Examples include a collection of papers available in (Zorzi, (Guest Ed.), 1998) and a recent conference tutorial (Srivastava, 2000),
Studies of the microprocessor (CPU), liquid crystal display (LCD), hard disk, system memory (DRAM), keyboard / mouse, CDROM drive, oppy drive, I / O subsystem, and the wireless network interface card (Udani and Smith, 1996, Stemm and Katz, 1997). A typical example from a Toshiba 410 CDT mobile computer demonstrates that nearly 36% of the CPU power consumption,
18% by the wireless interface, and 18% by the hard drive. Consequently, it has been financed by its mobile terminal design (Chandrakasan and Brodersen, 1995) and in components such as CPU, disks, displays, etc. Signicant additional power savings can be achieved by incorporating low-power strategies used for data communication. Addresses for wireless networks.
It is organized as follows. Section 2 introduces the network protocol and wireless protocol stack in this paper. Low-power design
discussed in Section 2.3. SECTION 4: SECTION 4: SECTION 4: SECTION 3
5. Section 6 discusses power aware protocols within the network layer. There is a possibility to discuss the use of energy.
Finally, Section 9 summarizes and concludes the paper.
2. Background
This section describes the wireless network architectures considered in this paper. It is also possible to review the individual protocol layer. The physical layer is further discussed.


There is one debatable spot, but overall the sentences have been restored, and such text can already be processed sentence by sentence.


But we planned to make the solution language-independent, and that is exactly what our feature set is aimed at. Let's check it on a Russian text (also a fragment extracted from a PDF):


Original Russian text
 ru_text = """       -       (. 1.10),    ,   - .       ,      , -   .       ,       .          : 1.        ,       -  (   ,   . 1.10,    ). 2.    ( )           ,     .      ,     .""" corrected = pdf_lines_gluer._preprocess_pdf(ru_text, clf, v) print(corrected) 

Got:


Recovered Russian text

The support vector method is designed to solve classification problems by searching for good decision boundaries (Fig. 1.10) separating two sets of points belonging to different categories. The decision boundary can be a line or a surface dividing the training data into regions belonging to two categories. To classify new points, you only need to check on which side of the boundary they lie.
The support vector machine searches for such boundaries in two stages:
1. The data are mapped into a new space of higher dimension, where the boundary can be represented as a hyperplane (if the data were two-dimensional, as in Figure 1.10, the hyperplane degenerates into a line).
2. A good decision boundary (the separating hyperplane) is computed by maximizing the distance from the hyperplane to the nearest points of each class; this stage is called margin maximization. This allows the classification to generalize to new samples that do not belong to the training data set.


Everything is perfect here.


How to use (code generation)


Initially I planned to make a package installable with pip, but then a way that was easier (for me) came to mind. The feature set is not very large, and the logistic regression and the DictVectorizer themselves have a simple internal structure: essentially the coefficients and intercept of the model and the feature-name vocabulary of the vectorizer.



Therefore another variant was born, with code generation (in the notebook it is in the "Serialize as code" section):


  1. We read the pdf_lines_gluer.py file, which contains auxiliary code for vectorization and text recovery using a trained classifier.
  2. At the place marked in the source code as "# inject code here #" we insert code that initializes the DictVectorizer and the LogisticRegression in the state obtained in the notebook after training (a sketch of what such generated initialization might look like is shown after this list). We also inject here the only public (as far as that is possible in Python) function, preprocess_pdf:
     def preprocess_pdf(text: str) -> str:
         return _preprocess_pdf(text, _clf, _v)
  3. The resulting code is written to the pdf_preprocessor.py file.
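To give an idea of what step 2 produces, here is a minimal sketch of how a trained LogisticRegression and DictVectorizer could be emitted as initialization code. This is not the author's actual generator, just the general idea: the fitted attributes are written out as literals, so the generated module needs no pickle files.

 def model_to_code(clf, v) -> str:
     # Emit Python source that recreates the fitted objects from literal values.
     lines = [
         "import numpy as np",
         "from sklearn.linear_model import LogisticRegression",
         "from sklearn.feature_extraction import DictVectorizer",
         "",
         "_clf = LogisticRegression()",
         f"_clf.classes_ = np.array({clf.classes_.tolist()})",
         f"_clf.coef_ = np.array({clf.coef_.tolist()})",
         f"_clf.intercept_ = np.array({clf.intercept_.tolist()})",
         "",
         "_v = DictVectorizer(sparse=False)",
         f"_v.feature_names_ = {v.feature_names_!r}",
         f"_v.vocabulary_ = {v.vocabulary_!r}",
     ]
     return "\n".join(lines)

The returned string is what gets spliced in at the "# inject code here #" marker before writing pdf_preprocessor.py.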

It is this generated pdf_preprocessor.py file that contains everything we need. To use it, just take this one file and drop it into your project. Usage:


 from pdf_preprocessor import preprocess_pdf
 ...
 print(preprocess_pdf(text))

If you have any problems with some texts, this is what you need to do:


  1. Put your texts in the corpus folder, annotate them.
  2. Run the notebook https://github.com/serge-sotnyk/pdf-lines-gluer/blob/master/pdf_gluer.ipynb - on the current texts it takes me less than 5 seconds.
  3. Pick up and test the new version of the pdf_preprocessor.py file.

Perhaps something will go wrong and the quality will not satisfy you. Then it will be somewhat more difficult - you will need to add new features until you find their correct combination.


C# and ML.NET


In our company most of the backend code is built on .Net, so interacting with Python adds inconvenience, and I wanted to have a similar solution in C#. I had been following the development of the ML.NET framework for a long time. I made small attempts to do something with it last year, but was put off by the insufficient coverage of different cases, the scarce documentation and the API instability. Since this spring the framework has reached release status, so I decided to try it again, especially since the most tedious part, annotating the corpus, had already been done.


At first glance the framework has gained in convenience. I now find the documentation I need more often (although it is still far from sklearn in quality and quantity). But most importantly, a year ago I hardly knew sklearn, and now I can see that many things in ML.NET are made in its image and likeness (as far as the difference in platforms allows). These analogies made it much easier to pick up the principles of ML.NET in practice.


A working project on this platform can be found at https://github.com/serge-sotnyk/pdf-postprocess.cs


The general principles remain the same: the corpus folder contains annotated (and not-so-annotated) documents. After running the ModelCreator project, a models folder appears next to the corpus folder, and the archive with the trained model is placed there. It is all the same logistic regression with the same features.


But here I did not bother with code generation. To use the trained model, take the PdfPostprocessor project (it has the PdfPostprocessModel.zip model compiled in as a resource). After that the model can be used as shown in the minimal example, https://github.com/serge-sotnyk/pdf-postprocess.cs/blob/master/MinimalUsageExample/Program.cs :


 using PdfPostprocessor;
 ...
 static void Main(string[] args)
 {
     var postprocessor = new Postprocessor();

     Console.WriteLine();
     Console.WriteLine("Restored paragraphs in the English text:");
     Console.WriteLine(postprocessor.RestoreText(EnText));

     Console.WriteLine();
     Console.WriteLine("Restored paragraphs in the Russian text:");
     Console.WriteLine(postprocessor.RestoreText(RuText));
 }

For now the model is copied from the models folder into the PdfPostprocessor project manually: it was more convenient for me to control exactly which model ends up in the final project.


There is also a NuGet package, PdfPostprocessor. To use the package with a model that you trained yourself, use the overloaded version of the Postprocessor constructor.


Comparing the Python and C# options


While the impressions of developing on the two platforms are still fresh, it is worth briefly comparing them. I have long since stopped being a militant supporter of any particular platform, and I treat the feelings of adherents of the different camps with understanding. Keep in mind, though, that I have worked with statically typed languages for most of my life, so they are simply a little closer to me.


What I did not like when switching to C#



What I liked when switching to C#







Conclusion






Source: https://habr.com/ru/post/457072/

