Greetings
I got curious about how well the topic of Data Mining is covered on Habr, and I found only one article devoted to it. I would like to make a small contribution to the topic.
Historically, the term Data Mining has been translated in several ways:
- data retrieval
- knowledge extraction, data mining
In terms of implementation, the first option belongs to the applied area and the second to mathematics and science; as a rule, the two overlap only a little. In terms of possible applications, there are plenty of options. As it happens, I have worked with the first option (at the university, as research work) and with the other (at work and as a freelancer). Let's look at each in more detail.
Data retrieval
Data retrieval (data extraction) is the process of finding and collecting information and saving (converting) it in different formats. Simply put, programs that extract data are called parsers, grabbers, spiders, crawlers, and so on. Such programs make life much easier, because they let you systematize data (data, mind you, not knowledge!). They can collect the addresses of companies in your industry, gather links from relevant forums, parse entire directories, and serve as an excellent tool for building databases.
Having done this for a long time, I can say that data mining in this sense has many applications. As a rule, the data is taken from open sources, without violating anyone's intellectual property rights.
Examples:
- compiling a list of a country's banks
- compiling a database of schools
- list of sites on a specific topic
Basically, this is a “list”, “catalog”, or “database” of something you need at the moment.
In future posts I will describe real examples in more detail.
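As a rough illustration of what such a parser does, here is a minimal Python sketch that collects links from a page using only the standard library. The page content, the `LinkCollector` class, and the `extract_links` helper are hypothetical names made up for this example; a real grabber would also download the pages and handle errors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the unique links found in `html`, in order of appearance."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(collector.links))


# Hypothetical page content; a real grabber would download it first.
SAMPLE_PAGE = """
<ul>
  <li><a href="/banks/alpha">Alpha Bank</a></li>
  <li><a href="https://example.org/banks/beta">Beta Bank</a></li>
  <li><a href="/banks/alpha">Alpha Bank (duplicate)</a></li>
</ul>
"""

for link in extract_links(SAMPLE_PAGE, "https://example.com/"):
    print(link)
```

The result is exactly the kind of “list” mentioned above: raw, systematized data, with no knowledge extracted from it yet.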
Extracting knowledge
The essence of “knowledge extraction”: we have huge amounts of data, and we need to get knowledge out of it. A real-life example: we have a lot of data on Forex currency quotes (“a lot” here means several gigabytes of text per day). The text files are the data, while the statement “a fall in stock A leads to a fall in stock B” is already knowledge obtained from that data. Needless to say, convenient tools for obtaining this kind of knowledge would help many a manager make decisions.
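To make the difference between data and knowledge a bit more tangible, here is a small Python sketch that checks how often a fall in one series is followed by a fall in another one step later. The quotes are invented toy values, the one-step lag is an assumption, and the function names are hypothetical; a high share only suggests a candidate rule that would still need proper statistical checking.

```python
def falls(prices):
    """Mark each step of a price series: True if the price went down."""
    return [later < earlier for earlier, later in zip(prices, prices[1:])]


def fall_follows_fall(prices_a, prices_b):
    """Estimate how often a fall in A is followed by a fall in B one step later.

    Returns (number of falls in A, share of them followed by a fall in B).
    The share is None when A never falls.
    """
    falls_a = falls(prices_a)
    falls_b = falls(prices_b)
    # Pair step i of A with step i + 1 of B (assumed lag of one step).
    pairs = list(zip(falls_a, falls_b[1:]))
    b_after_a_fell = [b for a, b in pairs if a]
    if not b_after_a_fell:
        return 0, None
    return len(b_after_a_fell), sum(b_after_a_fell) / len(b_after_a_fell)


# Invented toy quotes, not real market data.
QUOTES_A = [10, 9, 9.5, 9, 8, 8.5, 8]
QUOTES_B = [20, 20, 19, 19.5, 19, 19.2, 18.5]

count, share = fall_follows_fall(QUOTES_A, QUOTES_B)
print(f"A fell {count} times; B fell right after in {share:.0%} of those cases")
```

The two lists are the data; the printed share is a first, very rough step towards a statement like the one quoted above.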
The main categories of data mining are:
- data clustering (dividing objects into similar groups)
- data classification (assignment of objects to predetermined groups)
- neural networks, genetic algorithms (universal optimizers)
- association rules (rules of the form "if ... then ...")
- decision trees
- time series analysis
I would also include regression, multivariate, and other kinds of analysis here, since they can be used to solve similar problems. Each of these categories has its own mathematical and algorithmic apparatus and lets you solve a certain range of tasks.
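To give a feel for one of these categories, here is a bare-bones Python sketch of clustering with k-means. This is just one common clustering method chosen for illustration, not something taken from the projects described below; the points are invented, and seeding the centers with the first k points is a simplification.

```python
import math


def kmeans(points, k, iterations=20):
    """Bare-bones k-means for 2D points; the first k points seed the centers."""
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move every center to the mean of its points.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                xs, ys = zip(*members)
                new_centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # nothing moved, so the grouping is stable
        centers = new_centers
    return labels, centers


# Invented toy points forming two visible groups.
POINTS = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]

labels, centers = kmeans(POINTS, k=2)
print(labels)
print(centers)
```

Note that the groups are not given in advance: the algorithm discovers them, which is what separates clustering from classification, where objects are assigned to predetermined groups.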
What do we have at the moment?
To be honest, not much, but still:
The rest is fragments of data, examples and code scattered throughout the network.
Data Mining Source Code
As a .NET developer, I needed implementations of these algorithms in a .NET language, but in 90% of cases they were written in either C++ (mostly for Linux) or Java. The lack of examples in C# (or VB.NET) led me to write everything myself.
Above all, I wanted to systematize what I had and what I could find on the Internet. That is how an open source project on CodePlex called Data Mining Source Code appeared, along with the “Data Mining Source Code Blog” as a short companion to it. It contains source code in C#, VB.NET, Java and JavaScript, although most of it is in C#. There is also an additional project, Numerical Methods on C#, which implements a large number of numerical methods.
The projects are non-commercial: I simply enjoyed the work (and needed it for my university studies), so I publish them openly. The projects are still alive; students who need programming experience work on them, so if you have source code to share or want to learn the algorithms and methods, you can join in and send in your work.
Finally, I would like to ask how interesting this topic is to you and which of the above you would like to read more about.