Data mining: what's inside

Information levels

It will hardly be news that not all information is equally useful. Sometimes a concept takes pages of text to explain, and sometimes a simple diagram is enough to clarify the most complicated question. Mathematical formulas, drawings, symbols, program code and so on were invented to reduce the redundancy of information. Moreover, it is not only the information itself that matters, but also how it is presented: stock quotes are easier to grasp as a chart, and mathematical formulas state Newton's laws more compactly than words.

As information technology has developed, along with systems for collecting and storing data (databases, data warehouses and, more recently, cloud storage), a problem has emerged: the volume of data has grown so large that an analyst or manager can no longer process it manually and make a decision. The analyst therefore needs to present the source information in a more compact form that the human brain can handle in a reasonable time.

We can distinguish several levels of information:

raw data (also historical data or just data) - unprocessed data sets obtained by monitoring some dynamic system or object and reflecting its state at specific points in time (for example, stock prices over the past year);
information - processed data that has some informational value for the user; raw data presented in a more compact form (for example, search results);
knowledge - carries a kind of know-how and reveals hidden relationships between objects that are not publicly available (otherwise it would be just information); data with high entropy (a measure of uncertainty).

Consider an example. Suppose we have data on currency transactions in the Forex market for a certain period of time. This data can be stored as text, as XML, in a database or in binary form, and by itself it carries no useful meaning. The analyst then loads the data into, say, Excel and plots a chart of the changes, thus obtaining information. Next, the analyst loads the data (fully or partially processed in Excel) into, for example, Microsoft SQL Server and, with the help of Analysis Services, obtains the knowledge that it is better to sell the currency tomorrow. After that, the analyst can use this knowledge for new assessments, creating a feedback loop in the information process.
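The three levels can be illustrated with a small Python sketch. The exchange rates are invented for illustration, and the moving-average rule is only a toy stand-in for real analysis; it does not reproduce what Analysis Services does. The function names `summarize` and `naive_signal` are hypothetical.

```python
from statistics import mean

# Raw data: hypothetical daily closing rates (invented values).
raw_rates = [1.10, 1.12, 1.11, 1.15, 1.17, 1.16, 1.20]

def summarize(rates):
    """Information: a compact description of the raw series."""
    return {
        "min": min(rates),
        "max": max(rates),
        "mean": round(mean(rates), 4),
        "change": round(rates[-1] - rates[0], 4),
    }

def naive_signal(rates, window=3):
    """Knowledge (toy version): compare the average of the last
    `window` values with the overall average to suggest an action."""
    recent = mean(rates[-window:])
    overall = mean(rates)
    return "sell" if recent > overall else "hold"

info = summarize(raw_rates)
signal = naive_signal(raw_rates)
print(info)
print(signal)
```

The summary plays the role of the Excel chart, and the signal plays the role of the conclusion drawn from the analysis; a real system would use a far more careful model.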
There are no clear boundaries between the levels, but this classification will help us avoid confusion over terminology later on.

Data mining

Historically, the term Data Mining has been translated (and understood) in several ways:

data extraction, data collection (the term Information Retrieval, or IR, is also used);
knowledge extraction, intelligent data analysis (Knowledge Discovery in Databases or KDD, Business Intelligence).

IR operates on the first two levels of information, while KDD works with the third. In terms of implementation, the first belongs to the applied field, where the main goal is the data itself; the second belongs to mathematics and analytics, where the aim is to derive new knowledge from a large amount of existing data. Most often, data extraction (collection) is a preparatory stage for knowledge extraction (analysis).

I will take the liberty of introducing one more term for the first item - Data Extracting, which I will use from now on.

Tasks solved by Data Mining:

Classification - assigning an input vector (object, event, observation) to one of several previously known classes.
Clustering - dividing a set of input vectors into groups (clusters) according to how similar they are to each other.
Description reduction - for data visualization, simpler computation and interpretation, and compression of the collected and stored information.
Association - searching for recurring patterns, for example "strong associations between items in a shopping cart".
Prediction - determining the future state of an object from its previous states (historical data).
Deviation analysis - for example, detecting atypical network activity makes it possible to find malware.
Data visualization.
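As an illustration of the association task from the list above, here is a minimal Python sketch that counts how often pairs of items appear together in shopping baskets. The baskets, the `frequent_pairs` helper and the 0.5 support threshold are all assumptions made for the example; real association mining uses more elaborate algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (invented data).
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]

def frequent_pairs(baskets, min_support=0.5):
    """Return item pairs whose support (the share of baskets that
    contain both items) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {
        pair: n / total
        for pair, n in counts.items()
        if n / total >= min_support
    }

pairs = frequent_pairs(baskets)
print(pairs)
```

With this invented data only the pair bread and milk passes the threshold, since it occurs in three of the five baskets.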

Information retrieval

Information retrieval is used to obtain structured data or a smaller, representative sample. According to our classification, information retrieval operates on data of the first level, and as a result, it produces information of the second level.

The simplest example of information retrieval is a search engine, which uses certain algorithms to select a subset of the information from the complete set of documents. In addition, any system that works with text data, meta-information or databases uses information retrieval tools in one way or another. Such tools include indexing, filtering and sorting methods, parsers, and so on.
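Indexing and filtering can be sketched with a tiny inverted index in Python. This is a simplified illustration, not how any particular search engine works; the documents and the `build_index` and `search` helpers are invented for the example.

```python
from collections import defaultdict

# Hypothetical document collection (invented texts).
documents = {
    1: "data mining extracts knowledge from data",
    2: "information retrieval finds documents",
    3: "web mining extracts data from web pages",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return sorted ids of the documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    # .get() avoids creating empty entries for unknown words.
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

index = build_index(documents)
print(search(index, "mining data"))
```

The raw documents are first-level data; the list of matching ids returned by `search` is the smaller, more useful selection that the article calls information.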

Text mining

Other names: text data mining, text analysis; a closely related concept is concern mining.

Text mining can work with both raw and partially processed data, but unlike information retrieval, it analyzes textual information using mathematical methods, which makes it possible to obtain results that contain elements of knowledge.

Tasks that text mining solves: finding patterns in data, obtaining structured information, building object hierarchies, classifying and clustering data, determining a topic or area of knowledge, automatic document summarization, automatic content filtering, determining semantic links, and others.

Text mining problems are solved using statistical methods, interpolation, approximation and extrapolation methods, fuzzy methods, and content analysis methods.
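As one possible example of a statistical method, the Python sketch below weights words with TF-IDF (term frequency multiplied by inverse document frequency). The article does not name this technique; it is chosen here only as a common illustration. The sample texts and the `tf_idf` and `top_terms` helpers are invented.

```python
import math
from collections import Counter

# Hypothetical documents (invented texts).
docs = [
    "stock prices rise on the stock market",
    "the football team wins the match",
    "market analysts watch stock prices",
]

def tf_idf(docs):
    """Return one {word: weight} dictionary per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            w: (c / len(tokens)) * math.log(n / df[w])
            for w, c in tf.items()
        })
    return scores

def top_terms(weights, k=2):
    """Return the k highest-weighted words, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

scores = tf_idf(docs)
print(top_terms(scores[1]))
```

Words that appear in several documents, such as "the", receive a lower weight than words specific to one document, which is a first step toward defining the topic of a text.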

Web mining

And finally we come to web mining - a set of approaches and techniques for extracting data from web resources.
Since web sources are, as a rule, not plain text, the approach to data extraction is different here. The first thing to remember is that information on the web is stored in a special markup language, HTML (there are other formats too - RSS, Atom, SOAP - but we will talk about them later). Web pages may carry additional meta-information as well as information about the structure (semantics) of the document. Each web document also resides within a certain domain, and search engine optimization (SEO) rules can be applied to it.
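To show what working with HTML markup and meta-information can look like, here is a minimal Python sketch based on the standard `html.parser` module. The `PageExtractor` class, the sample markup and the URLs are invented for illustration; a real extractor would also have to cope with malformed pages, encodings and network access.

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collect the title, named meta tags and link targets of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is kept only while we are inside <title>.
        if self._in_title:
            self.title += data

# Hypothetical page (invented markup and URLs).
sample_html = """
<html><head>
<title>Example page</title>
<meta name="description" content="A sample document">
<meta name="keywords" content="data, mining">
</head><body>
<a href="https://example.com/a">First</a>
<a href="/b">Second</a>
</body></html>
"""

extractor = PageExtractor()
extractor.feed(sample_html)
print(extractor.title, extractor.meta, extractor.links)
```

The title and meta tags correspond to the meta-information mentioned above, and the collected links hint at how documents are connected within and across domains.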

This is the first article in a series on data mining / extracting / web mining. Suggestions and constructive criticism are welcome.

Source: https://habr.com/ru/post/95209/
