
From searching to examining documents on network shares and in file dumps


In the previous article we described our open-source product for searching data on network shares and in file dumps. Since then we have significantly improved the search by adding named entities, tags, on-demand statistics and a folder structure view. These improvements let you move from searching to analyzing data; this article covers them in more detail.


Theoretical part


I will start with the theory, namely how tags and named entities work in Ambar.


Tags in Ambar are additional meta information at the file level. For example, suppose you found a scan of last year's accounting report. So that you do not lose it again, you can add a "report" tag to it. Once all the reports are tagged, you can easily find them by searching by tag.


tagging


To make life easier, Ambar can also assign tags automatically according to internal rules. Examples of rules:



To summarize, with the help of tags Ambar can answer search requests such as: show all images (request: tags:image ), show all files in which the word 'confidential' was recognized by OCR (request: tags:ocr ), show all report scans (request: tags:image, ).


Named entities in Ambar work at the document content level. For example, Ambar can now find IP addresses, taxpayer identification numbers (TIN, known in Russia as INN), company names, phone numbers, car registration numbers, URIs (links) and email addresses in document contents.


A named entity is defined by a rule that makes it possible to determine with high probability that a given word or group of words in a text denotes an entity of a certain type. For example, the rule for an INN can be described in simplified form as: 10 or 12 digits that satisfy a special checksum rule. After a named entity is found, we bring it to a normal form, so the following two phone numbers are the same entity: +7 999 111 22 33 and 8999111-22-33.
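The two rules mentioned above can be sketched in a few lines of Python. This is only an illustration, not Ambar's actual code: the function names are hypothetical, the INN check uses the standard published check-digit weights for 10- and 12-digit numbers, and the "+7XXXXXXXXXX" normal form for phones is an assumption, since the article does not say which form Ambar stores.

```python
import re

# Standard weights of the INN check-digit algorithm.
W10 = (2, 4, 10, 3, 5, 9, 4, 6, 8)            # 10-digit INN, checks digit 10
W11 = (7, 2, 4, 10, 3, 5, 9, 4, 6, 8)         # 12-digit INN, checks digit 11
W12 = (3, 7, 2, 4, 10, 3, 5, 9, 4, 6, 8)      # 12-digit INN, checks digit 12


def _check_digit(digits, weights):
    """Weighted sum of the leading digits, modulo 11, then modulo 10."""
    return sum(d * w for d, w in zip(digits, weights)) % 11 % 10


def is_valid_inn(candidate):
    """Return True for a 10- or 12-digit string whose check digits are valid."""
    if not re.fullmatch(r"[0-9]{10}|[0-9]{12}", candidate):
        return False
    d = [int(c) for c in candidate]
    if len(d) == 10:
        return _check_digit(d, W10) == d[9]
    return _check_digit(d, W11) == d[10] and _check_digit(d, W12) == d[11]


def normalize_phone(raw):
    """Reduce a Russian phone number to +7XXXXXXXXXX (assumed normal form).

    Returns None if the input does not look like an 11-digit number
    starting with 7 or 8.
    """
    digits = re.sub(r"[^0-9]", "", raw)
    if len(digits) == 11 and digits[0] in "78":
        return "+7" + digits[1:]
    return None


if __name__ == "__main__":
    # Both spellings from the article collapse to one entity.
    print(normalize_phone("+7 999 111 22 33"))
    print(normalize_phone("8999111-22-33"))
    print(is_valid_inn("7707083893"))
```

Normalizing before indexing is what allows a single search to match every spelling of the same number.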


You can see which entities Ambar found in a document using the "View" button. It is also worth noting that the types of named entities found in a document are immediately added as tags: if IP addresses are found in the document content, the file is guaranteed to receive an "ip" tag.
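The link between entity types and tags can be illustrated with a small sketch. It is not Ambar's implementation: the extractors are deliberately simplistic regular expressions, the function names are hypothetical, and only the "ip" tag name comes from the article (the "email" tag name is an assumption).

```python
import ipaddress
import re

# Loose candidates; IPv4 candidates are validated separately below.
IP_CANDIDATE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def find_ips(text):
    """Return the set of valid IPv4 addresses mentioned in the text."""
    found = set()
    for candidate in IP_CANDIDATE.findall(text):
        try:
            found.add(str(ipaddress.IPv4Address(candidate)))
        except ValueError:
            # Looks like an address but is not one, e.g. 999.1.1.1.
            pass
    return found


def extract_entities(text):
    """Map each entity type to the set of entities found in the text."""
    return {
        "ip": find_ips(text),
        "email": set(EMAIL.findall(text)),  # tag name is an assumption
    }


def tags_from_entities(entities):
    """Every entity type with at least one hit becomes a tag on the file."""
    return sorted(kind for kind, values in entities.items() if values)


if __name__ == "__main__":
    sample = "Gateway 192.168.1.1, contact admin@example.com. Bogus 999.1.1.1"
    entities = extract_entities(sample)
    print(entities)
    print(tags_from_entities(entities))
```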


entity view


To summarize, with named entities Ambar can answer search queries such as: show all files in which the IP address 192.168.1.1 is found (request: entities:"192.168.1.1" ), show the scans of documents in which the TIN of a particular company is found (request: entities:"123123123123" tags:ocr ). Finally, I will let you in on a secret: in the next Ambar release we plan to support connecting third-party entities as plug-ins.


From theory to practice


Suppose you have already set up Ambar and indexed a number of files. To understand what is stored on these shares, I suggest entering the search query * (show all) and switching to the "Statistics" view. This view immediately shows how many files were found and their total size, as well as which file types were found (torrents and movies are sure to turn up!).
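To make clear what kind of aggregation such a view is built on, here is a rough local approximation in Python. It does not call Ambar in any way: the function name and the input format (a list of path and size pairs) are assumptions made for the example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def summarize(files):
    """Aggregate a list of (path, size_in_bytes) pairs.

    Returns the total file count, the total size, and per-extension
    counts and sizes, roughly what a statistics view would display.
    """
    by_ext = defaultdict(lambda: {"count": 0, "bytes": 0})
    for path, size in files:
        # Extensions are compared case-insensitively; files without
        # an extension are grouped under "(none)".
        ext = PurePosixPath(path).suffix.lower() or "(none)"
        by_ext[ext]["count"] += 1
        by_ext[ext]["bytes"] += size
    return {
        "count": len(files),
        "bytes": sum(size for _, size in files),
        "by_extension": dict(by_ext),
    }


if __name__ == "__main__":
    files = [
        ("/share/hr/party.avi", 700),
        ("/share/hr/report.PDF", 10),
        ("/share/it/readme", 1),
    ]
    print(summarize(files))
```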


View statistics


Suppose you find that 30% of your shares are occupied by .avi files from last year's corporate party. How do you find out which folders they are in? Enter the query size>500M filename:*.avi and go to the folder view. There we see which folders contain the most hits and delete them with a clear conscience.
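The logic behind this query and the folder view can be sketched locally as follows. This is an illustration rather than Ambar's code: the function name is hypothetical, the input is an assumed list of path and size pairs, and "500M" is interpreted as 500 * 1024 * 1024 bytes, which the article does not specify.

```python
from collections import Counter
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

MIB = 1024 * 1024  # assumed meaning of the "M" suffix in size>500M


def folder_hits(files, pattern="*.avi", min_size=500 * MIB):
    """Count matching files per folder, most populated folder first.

    A file matches when its size is strictly greater than min_size and
    its lowercased name matches the glob pattern.
    """
    hits = Counter()
    for path, size in files:
        p = PurePosixPath(path)
        if size > min_size and fnmatchcase(p.name.lower(), pattern):
            hits[str(p.parent)] += 1
    return hits.most_common()


if __name__ == "__main__":
    files = [
        ("/share/party/a.avi", 600 * MIB),
        ("/share/party/b.AVI", 700 * MIB),
        ("/share/docs/c.avi", 100 * MIB),
        ("/share/misc/d.avi", 501 * MIB),
        ("/share/party/e.pdf", 900 * MIB),
    ]
    for folder, count in folder_hits(files):
        print(folder, count)
```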


Folder view


Consider a more complex example: you need to find an employee's phone number. Enter the query " " tags:phone and go to the "Statistics" tab. Select the found named entities of the phone type and switch to the detailed view to read the text of the document. If there are too many results, use the table view or refine the query.


Detailed view


In the future we plan to develop the analytical part of Ambar further: custom user-defined tagging rules, custom entities (we have already had requests to add car brands) and visualization of the links between found entities.


Thanks for your attention!



Source: https://habr.com/ru/post/342978/

