I once stumbled upon this post and thought: since we have such a beautiful, completely open gallery of private data (Radikal.ru), why not try to extract that data in a form convenient for processing? That is:
- Download pictures;
- Recognize text on them;
- Extract useful information from this text and classify it for further analysis.
As a result, after a few evenings, I had a working prototype. Lots of technical details follow.
Everything was done in C# with ASP.NET MVC 5, simply because that is what I work in all the time and it is more convenient for me.
Stage 1: Download the images
As expected, after digging through the source code of the gallery pages, I found no listing of any kind, which means every page has to be downloaded individually and the link to the picture torn out of its code. At least the address of a page with an image can be generated automatically: it is just a URL with the image's sequence number. OK, we take HtmlAgilityPack and write a parser; fortunately, there are enough CSS classes on the image page, so pulling out the desired node is not hard.
We pull out the node, take a look - and there is no link. It turns out the link is generated by JavaScript, which we never ran. This is sad, because the scripts are obfuscated, and I did not have enough patience to figure out how they work.
OK, there is another way: open the page in a browser, wait for the scripts to run, and take the link from the rendered page. Fortunately, there is a wonderful combination of Selenium and PhantomJS (a browser without a graphical shell); doing everything through, say, Firefox would be both slower and more inconvenient. Unfortunately, this is still very slow - it is hard to imagine a slower approach :( About 1 second per picture.
Parser:
public static string Parse_Radikal_ImagePage(IWebDriver wd, string Url)
{
    // Open the page in PhantomJS and wait (up to 3 s) for the obfuscated
    // scripts to generate the image link.
    wd.Url = Url;
    try
    {
        new WebDriverWait(wd, TimeSpan.FromSeconds(3))
            .Until(d => d.FindElements(By.XPath("//div[@class='show_pict']//div//a//img")).Count > 0);
    }
    catch (WebDriverTimeoutException) { } // no image appeared: deleted or private

    // Load the rendered page into HtmlAgilityPack and take out the image node.
    HtmlDocument html = new HtmlDocument();
    html.OptionOutputAsXml = true;
    html.LoadHtml(wd.PageSource);
    HtmlNodeCollection Blocks = html.DocumentNode.SelectNodes("//div[@class='show_pict']//div//a//img");
    if (Blocks == null || Blocks.Count == 0) return null;
    return Blocks[0].Attributes["src"].Value;
}
* All code is greatly simplified, with non-critical details removed. More in the source.
Handler controller:
IWebDriver wd = new PhantomJSDriver("C:\\PhantomJS");
for (var imageCode = data.imgCode; imageCode > data.imgCode - data.imgCount; imageCode--)
{
    // Skip pictures we have already processed.
    if (ParserResult.Processed(imageCode)) continue;
    var Url = "http://radikal.ru/Img/ShowGallery#aid=" + imageCode.ToString() + "&sm=true";
    var imageUrl = Parser.Parse_Radikal_ImagePage(wd, Url);
    if (imageUrl != null)
    {
        // Download the picture and save it to the temp directory.
        var image = Parser.GetImageFromUrl(imageUrl);
        var Filename = TempFilesRepository.TempFilesDirectory() + "Radikal_" + imageCode.ToString() + "." + Parser.GetImageFormat(image);
        image.Save(Filename);
    }
}
wd.Quit();
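The Parser.GetImageFromUrl helper is omitted from the post (it lives in the project source); a minimal sketch of what it presumably does, assuming a plain WebClient download, is:
// Hypothetical reconstruction of the GetImageFromUrl helper used above.
// Requires System.Drawing, System.IO, System.Net.
public static Image GetImageFromUrl(string url)
{
    using (var client = new WebClient())
    {
        var bytes = client.DownloadData(url);
        // GDI+ needs the stream to stay alive as long as the Image does,
        // so the MemoryStream is deliberately not disposed here.
        return Image.FromStream(new MemoryStream(bytes));
    }
}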
All of this has to be stored and processed somewhere. The logical choice is the already deployed MS SQL Server: create a small database on it and record the links to the pictures and the paths to the downloaded files. We write a small class for storing and saving the result of parsing a picture. Why not keep the pictures themselves in the database? More on that below, in the section on recognition.
[Table(Name = "ParserResults")]
public class ParserResult
{
    [Key]
    [Column(Name = "id", IsPrimaryKey = true, IsDbGenerated = true)]
    public long id { get; set; }
    [Column(Name = "Url")]
    public string Url { get; set; }
    [Column(Name = "Code")]
    public long Code { get; set; }
    [Column(Name = "Filename")]
    public string Filename { get; set; }
    [Column(Name = "Date")]
    public DateTime Date { get; set; }
    [Column(Name = "Text")]
    public string Text { get; set; }
    [Column(Name = "Extracted")]
    public bool Extracted { get; set; }

    public ParserResult() { }

    // Creating a record immediately persists it via LINQ to SQL.
    public ParserResult(string Url, long Code, string Filename, string Text)
    {
        this.Url = Url;
        this.Code = Code;
        this.Filename = Filename;
        this.Date = DateTime.Now;
        this.Text = Text;
        this.Extracted = false;
        DataContext Context = DataEngine.Context();
        Context.GetTable<ParserResult>().InsertOnSubmit(this);
        Context.SubmitChanges();
    }

    public static bool Processed(long imgCode)
    {
        return DataEngine.Data<ParserResult>().Where(x => x.Code == imgCode).Count() > 0;
    }
}
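In the download loop above (where this bookkeeping was simplified out), each saved picture would presumably be registered with a call like the following; the Text field stays empty until stage 2:
// Hypothetical call, reconstructing the omitted bookkeeping from the loop above.
new ParserResult(imageUrl, imageCode, Filename, null);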
Stage 2: Recognize the text
This, too, would seem not the most difficult task. We take Tesseract (more precisely, a .NET wrapper for it), download the data for the Russian language, and... bummer! As it turns out, for Tesseract to work properly with Russian, conditions close to ideal are required: an excellent scan, not a photo of a document taken on a lousy mobile phone. The recognition rate was good if it approached 10 percent.
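For reference, the attempt looked roughly like this. The sketch assumes the popular Tesseract NuGet wrapper (the post does not name the exact wrapper used) and rus.traineddata in a ./tessdata folder:
// Sketch of the Tesseract attempt; assumes the Tesseract NuGet package
// (using Tesseract;) and Russian traineddata in ./tessdata.
using (var engine = new TesseractEngine(@"./tessdata", "rus", EngineMode.Default))
using (var img = Pix.LoadFromFile(filename))
using (var page = engine.Process(img))
{
    var text = page.GetText();
    // On photos of documents, the mean confidence was hopelessly low.
    var confidence = page.GetMeanConfidence();
}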
In general, all acceptable Cyrillic recognition is covered by just three products: CuneiForm, Tesseract, and FineReader. Reading forums and blogs reinforced the idea that trying CuneiForm was pointless (many people write that its recognition quality is not far from Tesseract's), so I decided to try FineReader right away. Its main drawback is that it is paid, and very much so. In addition, the FineReader Engine (which provides a recognition API) was not at hand, so I had to build a terrible kludge: run ABBYY Hot Folder, which watches a specified folder, recognizes the pictures that appear there, and puts text files next to them. So, after waiting a bit once the images are downloaded, we can pick up the ready recognition results and put them into the database. Very slow, very much a crutch - but the quality of recognition, I hope, pays for these costs.
var data = DataEngine.Data<ParserResult>().Where(x => x.Text == null && x.Filename != null).ToList();
foreach (var result in data)
{
    // ABBYY Hot Folder puts a .txt file next to each recognized image.
    var textFilename = result.Filename.Replace(Path.GetExtension(result.Filename), ".txt");
    if (System.IO.File.Exists(textFilename))
    {
        result.Text = System.IO.File.ReadAllText(textFilename, Encoding.Default).Trim();
        result.Update();
    }
}
By the way, it is precisely because of these crutches that the images are not stored in the database: ABBYY Hot Folder, unfortunately, cannot work from a database.
Stage 3: Extract information from the text
Surprisingly, this stage turned out to be the easiest. Probably because I knew what to look for: a year ago I took the Natural Language Processing course at Coursera.org, so I had an idea of how such problems are solved and what the terminology is. That is why I decided not to reinvent any wheels; after a short search I took the PullEnti library, which:
- is tailored to the Russian language;
- comes wrapped for use from C# out of the box;
- is free for non-commercial use.
Extracting entities with it turned out to be very easy:
public static List<Referent> ExtractEntities(string source)
{
    // Run the PullEnti processor over the raw text and return the recognized
    // entities. (Calls as in the PullEnti SDK samples; exact member names
    // may differ between versions.)
    var processor = ProcessorService.CreateProcessor();
    var analysis = processor.Process(new SourceOfAnalysis(source));
    return analysis.Entities;
}
The extracted entities have to be stored and analyzed, so we write them into a simple table in the database: image ID / entity type / entity value. After parsing, you get something like this:
Docid | EntityType | Value
63 | Territorial education | city of Ussuriysk
63 | Address | Dzer street, 1; city of Ussuriysk
63 | Date | November 17, 2014
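For completeness, here is a sketch of how such rows could be written. The Entity class is hypothetical (the real code is in the source), mapped via LINQ to SQL the same way as ParserResult above; Referent.TypeName and ToString() are PullEnti members giving the entity's type and a readable value:
// Hypothetical Entity class for the Docid / EntityType / Value table above.
[Table(Name = "Entities")]
public class Entity
{
    [Key]
    [Column(Name = "id", IsPrimaryKey = true, IsDbGenerated = true)]
    public long id { get; set; }
    [Column(Name = "DocId")]
    public long DocId { get; set; }
    [Column(Name = "EntityType")]
    public string EntityType { get; set; }
    [Column(Name = "Value")]
    public string Value { get; set; }
}

// For each recognized document, store one row per extracted entity.
foreach (var referent in ExtractEntities(result.Text))
{
    var context = DataEngine.Context();
    context.GetTable<Entity>().InsertOnSubmit(new Entity
    {
        DocId = result.id,
        EntityType = referent.TypeName,
        Value = referent.ToString()
    });
    context.SubmitChanges();
}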
PullEnti can extract quite a few such entities from text (automatically correcting errors along the way): Bank details, Territorial education, Street, Address, URI, Date, Period, Designation, Money amount, Person, Organization, and so on. Then you can sit down with these tables and think: select documents for a specific city, search for a specific organization, and so on. The main task is done - the data has been extracted and prepared.
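For instance, selecting all documents that mention a particular city could look like the following sketch, written against the hypothetical Entity table above (the entity-type strings PullEnti actually emits may differ):
// All document ids whose extracted entities mention Ussuriysk as a
// territorial education.
var docIds = DataEngine.Data<Entity>()
    .Where(x => x.EntityType == "Territorial education" && x.Value.Contains("Ussuriysk"))
    .Select(x => x.DocId)
    .Distinct()
    .ToList();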
Results
Let's see what happened on a small test sample.
- Gallery pages processed: 2,263;
- Images downloaded: 1,972 (on the remaining pages the images had been deleted or hidden by privacy settings);
- Texts extracted: 773 (in the other images FineReader found nothing suitable for recognition);
- Entities extracted from the text: 293.
The right figure to judge by is the last one, since quite often a picture rich in graphics yields "text" like "^ 71 1 /" and the like. It turns out that roughly every tenth image contains text suitable for analysis. Not bad for such a messy repository!
And here, for example, is the list of extracted cities (quite often the documents they came from are photos of passports): Ankara, Bobruisk, Warsaw, Zlatoust, Kazan, Kiev, Krasnoyarsk, Minsk, Moscow, Omsk, St. Petersburg, Sukhum, Tver, Ussuriysk, Ust-Kamenogorsk, Chelyabinsk, Shuya, Yaroslavl.
Conclusions
- The problem is solved; a working prototype has been created.
- The speed of this prototype leaves much to be desired :( One picture per second is very slow.
- And, of course, there are a number of unsolved problems: for example, a crash once PhantomJS eats up all the memory.
Source code (a Visual Studio 2013 project): download.