You know, I love books about all sorts of interesting algorithms, and just recently another such book came across. The book “Programming Collective Intelligence” is mainly devoted to classification and clustering algorithms, although there are chapters on other topics, such as creating your own search engine, genetic algorithms, and genetic programming. Almost all the described algorithms are applied in the spirit of Web 2.0, using the analysis of user behavior on different sites that provide their API. But what was especially pleasantly surprised was that all the examples were written in Python. Here are the algorithms described in the book:
Collaborative filtering. Or, in human terms, algorithms that can recommend you some purchases, websites or music, depending on the ratings that you put to other similar things. According to such algorithms, the imposition of purchases in online stores or the selection of music on last.fm works. At the end of the chapter is an example that will recommend you links from the del.icio.us service.
Algorithms grouping (clustering). The created example analyzes the RSS feeds of blogs and tries to automatically divide them into groups in the form of a tree depending on the frequency of the words that come across in the blog. At the same time, Segaran tells how it is possible to make the names of blogs located on a plane in groups depending on their proximity in terms of the topics under consideration.
A separate chapter is devoted to the construction of search engines - the creation of a spider and, most importantly, algorithms for ranking links are considered, including taking into account the links of pages to each other, thus creating an analogue of Google PageRank. It is also interesting that in this same chapter there is an example where a neural network is used to issue the most relevant links, which learns as the user clicks on the links he likes.
Another chapter is devoted to optimization techniques. Here, in principle, there is nothing special, the algorithms of random search, descent from the mountain, annealing, and the genetic algorithm are briefly described. An example of a web specific mission is to search for flights for a rather specific task using the Kayak site API. But in this chapter I was more interested in the algorithms for the resettlement of students in a hostel in order to satisfy as many people as possible, and to draw connections between people in such a way that the lines denoting connections do not overlap.
Sorting algorithms are considered on the example of spam filtering. Given the Bayesian algorithm, which usually comes to mind first when solving a similar problem, and its modification is the Fisher method. As an example, is the work with the site Akismet .
It tells about the algorithms for building decision trees, and at the end of the chapter an example is provided that loads data on real estate prices from the Zillow website, and then builds a decision tree that shows which house parameters influence the price, and if you want, you can predict what price at home with the specified parameters.
One chapter is devoted to numerical predictions. Here we consider the algorithm of k nearest neighbors and the weighted algorithm of k nearest neighbors. The example at the end of the chapter tries to guess the most likely price for the goods on eBay auction by the specified parameters.
A separate chapter is devoted to such classification algorithms as the support vector machine and nuclear methods. Unfortunately, the most interesting thing is that the support vector method is almost not considered. More precisely, literally on the fingers it is told what he does, but not how he does it, and the example uses the http://www.csie.ntu.edu.tw/~cjlin/libsvm/ library, where this algorithm is already programmed. At the end of the chapter is an example of matching pairs using the Facebook API.
A separate chapter is devoted to the identification of independent features by data arrays, but here I, frankly, did not really understand what the described algorithms do.
And the last chapter is devoted to genetic programming — types of genetic algorithms, when classes simulating different programs that intersect and mutate between themselves compete for a place under the sun (in RAM).
As a result, I liked the book, not only that I am interested in similar topics by themselves, but also the examples are written in their favorite language. From the pleasant trifles it can be noted that the author has tried and, for many examples, selected ready-made data for analysis, which can be downloaded from the author’s site . By the way, he recently released two more books about data processing , but if we see them in Russian is a big question. Among the shortcomings it can be noted that I would still like to read about some of the described algorithms in more detail, but as a starting point, the book is quite suitable.