Download the translation as a Mathematica document containing all the code used in the article here (archive, ~76 MB).
Introduction
Some time ago (515 days ago, to be exact), Matthias Odisio published a post entitled "Random and Optimal Mathematica Walks on IMDb's Top Films". It describes how to obtain an optimal order for watching the films on that list, based on how close the films are in genre and how close their posters are in color.
In [1]: =
Out [1] =
The idea of this post seemed quite interesting to me, but I wanted to expand and deepen it significantly, along two lines:
- A distance function between films based only on the color similarity of their posters and on their genres is not objective enough. It seems more reasonable to build the distance from several factors: genres, description, cast, director(s), year of production, screenwriter(s), and so on.
- Matthias's article used only Wolfram|Alpha data, which certainly simplifies the task and keeps the code compact. I also want to show how to use data taken from anywhere in the calculations, for example data obtained by parsing Wikipedia pages or loaded from text databases.
I will not discuss how to build an optimal viewing sequence for the KinoPoisk list of the 250 best films, simply because I do not want problems with that site's terms of use, which state quite clearly (see Section 6) that you cannot just take their list of films and analyze it without their consent. That said, applying the algorithms given below to that list is straightforward. I would also like to note that, during my work with one of the domestic film companies, a parser was written in Wolfram Language that loaded film information from the KinoPoisk site (the legal side of the question was settled) in order to automatically generate an advertising booklet covering several thousand films owned by that company. Below you can see an example of one such fully automatically generated booklet page (a non-final version is shown because of an NDA).
This article uses film information from Wikipedia, which avoids any problems with copyright holders. On the one hand this complicates the task (a parser for a centralized source such as IMDB or KinoPoisk is easier to write), but on the other it makes it possible to build some additional, interesting programs.
Import data from Wikipedia website
To begin with, load the symbolic HTML representation of the Wikipedia page "250 best films in IMDb version" (in the document we display only part of the result, using the Short function):
In [2]: =
Out [3] =
Now select the links to the films listed in the table on that page:
In [4]: =
Out [4] =
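The cell contents are not reproduced in this text, so the link-selection step can only be sketched. The following is a minimal illustration in Python (not the Wolfram Language code of the notebook), using only the standard library. The class name, the function name, and the assumption that film links are `/wiki/...` anchors inside a table are all hypothetical; the real page layout may differ.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collects href values of <a> tags that appear inside a <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while we are inside at least one <table>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            # Assumption: article links on the wiki start with "/wiki/".
            if href and href.startswith("/wiki/"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def film_links(html):
    """Return the wiki links found inside tables of the given HTML string."""
    parser = TableLinkCollector()
    parser.feed(html)
    return parser.links
```

In practice the HTML string would come from downloading the list page; here the function only takes a string, so it can be tried on any saved copy of the page.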
Let's create a function that will load and save a symbolic representation of the HTML code of the pages of each of the films:
In [5]: =
Auxiliary functions
Let's create a set of auxiliary functions that we need in order to process the imported symbolic HTML:
- Function to remove HTML wrappers, leaving only the data:
In [8]: =
- A function that determines whether a string can be a word in Russian (that is, consists of letters of the Russian alphabet or a hyphen):
In [9]: =
In [10]: =
- A function that determines whether a string can be a word in Russian or English (that is, consists of letters of the Russian, English alphabet or hyphen):
In [12]: =
In [13]: =
- A function that converts the capital letters of the Russian alphabet in a string into lowercase letters:
In [15]: =
In [16]: =
- To analyze the film descriptions, we need information about the words of the Russian language and the links between the forms of the same word. Let's load the morphological dictionary of the Russian language created by Academician Andrei Anatolyevich Zaliznyak:
In [17]: =
Out [17] =
- We process the dictionary data, making on its basis a list of words of the Russian language ( russianWords ) and a list of rules for replacing the forms of words of the Russian language into their standard form ( russianWords Standard TrackForm )
In [18]: =
The dictionary contains 2 645 347 words:
In [19]: =
Out [19] =
Out [20] =
- Create a function that checks whether a word is contained in the dictionary, as well as a function that converts a Russian word into its standard form:
In [21]: =
In [22]: =
Examples of these functions in use:
In [23]: =
Out [23] =
In [24]: =
Out [24] =
- Create a function that determines whether a word is an adjective:
In [25]: =
In [26]: =
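Since the cells themselves are not shown here, the following is a rough sketch of what helpers like the ones listed above could look like. It is written in Python rather than Wolfram Language, and the function names, the regular expressions, and the `forms` mapping passed to `to_standard_form` are assumptions made for illustration, not the notebook's actual code.

```python
import re

# Assumption: a "word" is a non-empty run of letters and hyphens.
_RUSSIAN = re.compile(r"[А-Яа-яЁё-]+")
_RUSSIAN_OR_ENGLISH = re.compile(r"[A-Za-zА-Яа-яЁё-]+")


def is_russian_word(s):
    """True if s consists only of Russian letters and hyphens."""
    return _RUSSIAN.fullmatch(s) is not None


def is_russian_or_english_word(s):
    """True if s consists only of Russian or English letters and hyphens."""
    return _RUSSIAN_OR_ENGLISH.fullmatch(s) is not None


def to_lower(s):
    """Convert capital letters (including Russian ones) to lowercase."""
    return s.lower()


def to_standard_form(word, forms):
    """Map a word form to its standard form using a {form: standard} dict.

    Words missing from the (hypothetical) dictionary are returned unchanged.
    """
    return forms.get(word, word)
```

With a real morphological dictionary, `forms` would hold millions of entries, so a hash-based lookup like this keeps each conversion cheap.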
Data processing
Now we can process the data for each film. The result, stored in the filmsData variable, is a database of film information built on the Association function, which makes the data very easy to access.
In [27]: =
In [29]: =
An example of a call to the generated database by movie number:
In [31]: =
Out [31] =
An example of a query for the director and the release year of each film:
In [32]: =
Out [32] // Short =
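To give an idea of how such a key-value database is queried, here is a small Python analogue using nested dictionaries. It is not the notebook's Association-based code; the field names and the two placeholder records are hypothetical and carry no real film data.

```python
# Hypothetical placeholder records keyed by the film's position in the list.
films_data = {
    1: {"title": "Film A", "director": ["Director X"], "year": 1994},
    2: {"title": "Film B", "director": ["Director Y"], "year": 1972},
}


def directors_and_years(db):
    """Return {film number: (directors, year)} for every film in db."""
    return {n: (film["director"], film["year"]) for n, film in db.items()}


# Access by film number, then by field name.
first_title = films_data[1]["title"]
summary = directors_and_years(films_data)
```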
Some statistics based on data.
For a start, let's just create a collage of posters of all films:
In [33]: =
Out [33] =
We construct the distribution of the number of films depending on the year:
In [34]: =
Out [34] =
We construct the distribution of films by their duration:
In [35]: =
Out [35] =
We construct the distribution of films according to their duration and year of release:
In [36]: =
Out [36] =
The first 10 actors by the number of films in which they played:
In [37]: =
Out [37] =
The first 10 filmmakers by the number of films they made:
In [38]: =
Out [38] =
The first 10 screenwriters by the number of films they wrote scripts for:
In [39]: =
Out [39] =
The first 10 composers by the number of films they wrote music for:
In [40]: =
Out [40] =
The first 10 countries by the number of films produced in them:
In [41]: =
Out [41] =
The first 10 genres by the number of films belonging to them:
In [42]: =
Out [42] =
For those who are interested in film genres, I can recommend the article "Films and Mathematica: importing and processing information from the IMDB database", written some time ago, in which, among other things, the following distribution of films by genre is obtained:

The function that determines the distance between the films
To measure the difference between two lists of objects, we will use a generalization of the Czekanowski-Sørensen coefficient (measure):
In [43]: =
Example:
In [45]: =
Out [45] =
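The exact generalization used in the notebook is not shown in this text. One common way to extend the coefficient from sets to lists with repeated elements is to treat the lists as multisets, which gives 2|A ∩ B| / (|A| + |B|) with multiset intersection. A minimal Python sketch under that assumption (the function names are hypothetical):

```python
from collections import Counter


def sorensen_similarity(a, b):
    """Czekanowski-Sørensen similarity of two lists treated as multisets.

    Returns a value in [0, 1]: 1 for identical multisets, 0 for disjoint ones.
    """
    ca, cb = Counter(a), Counter(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        # Assumption: two empty lists are treated as identical.
        return 1.0
    overlap = sum((ca & cb).values())  # multiset intersection size
    return 2 * overlap / total


def sorensen_distance(a, b):
    """Corresponding dissimilarity: 0 for identical lists, 1 for disjoint."""
    return 1.0 - sorensen_similarity(a, b)
```

For example, two genre lists that share two of their three entries each have an overlap of 2 and a combined size of 6, so their similarity is 2·2/6.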
To measure how close two descriptions are using this coefficient, we will create a function that extracts the Russian words from a film description and reduces each of them to its standard form:
In [46]: =
An example of the function at work (the frequency of each word was additionally counted with the Tally function, and the results were sorted by decreasing frequency):
In [47]: =
Out [47] =
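The word-extraction step with a frequency tally can be sketched as follows. This is a Python illustration, not the notebook's code: the function names are hypothetical, and `lemma_of` stands in for the dictionary-based mapping from word forms to standard forms (words absent from it are dropped, which is an assumption).

```python
import re
from collections import Counter

# Lowercase Russian letters, optionally joined by single hyphens.
_RUSSIAN_TOKEN = re.compile(r"[а-яё]+(?:-[а-яё]+)*")


def description_words(text, lemma_of):
    """Extract Russian words from text and map them to standard forms.

    lemma_of is a dict {word form: standard form}.
    """
    tokens = _RUSSIAN_TOKEN.findall(text.lower())
    return [lemma_of[t] for t in tokens if t in lemma_of]


def word_frequencies(words):
    """Tally the words and sort by decreasing frequency."""
    return Counter(words).most_common()
```

The resulting word lists for two films can then be compared with the same list-difference measure used for genres or cast.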
Now let's create a function that determines how close two films are to each other. It is a weighted sum of several parameters, each normalized to one. In total, 11 similarity parameters are used: film description, genre(s), director, screenwriter(s), actors, cinematographer(s), composer(s), country (or countries) of production, release year, duration, and similarity of the posters. Each can be assigned its own weight; by default all weights are equal.
In [48]: =
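The idea of a weighted sum of normalized partial distances can be illustrated with a short Python sketch. It is not the notebook's function: the name `combined_distance`, the criterion keys, and the choice to divide by the sum of the weights are assumptions made for illustration.

```python
def combined_distance(partial, weights=None):
    """Weighted average of per-criterion distances.

    partial: dict {criterion: distance in [0, 1]}.
    weights: dict {criterion: non-negative weight}; equal weights by default.
    The result stays in [0, 1] because the weights are normalized to sum to 1.
    """
    if weights is None:
        weights = {name: 1.0 for name in partial}
    total = sum(weights[name] for name in partial)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(weights[name] * d for name, d in partial.items()) / total
```

Setting the weight of every criterion but one to zero reduces the combined distance to that single criterion, which is one way to obtain sequences based on an individual parameter.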
For further work, we will keep only those films for which at least some information is known (the Wikipedia pages of several films are empty):
In [62]: =
Calculate all measures of proximity (distance) between the films:
In [63]: =
Movie linkage analysis
We will study the relationships between films using methods of graph theory, namely the theory of community structure in graphs. To do this, let's create a function based on CommunityGraphPlot:
In [64]: =
Using the distance function built earlier, this function finds communities in the graph; the redder and thicker the edge between two vertices, the closer the corresponding films are. Hovering over a vertex shows a tooltip with the film's poster and title (the document with interactive graphs and source code can be downloaded from the link at the very beginning of the post).
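Community detection is commonly framed as finding a partition of the vertices with high modularity, a score that compares the weight of edges inside communities with what would be expected at random. The Python sketch below only evaluates that score for a given partition; it does not search for communities and is not what CommunityGraphPlot does internally. The function name and the input format are assumptions, and for films the edge weights would be similarities (for instance, one minus the distance).

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: dict {(u, v): weight}, each undirected edge listed once.
    community: dict {vertex: community label}.
    Q = sum over communities of (internal weight / m) - (degree sum / 2m)^2,
    where m is the total edge weight.
    """
    m = sum(edges.values())
    if m == 0:
        return 0.0
    internal = {}    # total weight of edges inside each community
    degree_sum = {}  # sum of weighted vertex degrees per community
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0.0) + w
        degree_sum[cv] = degree_sum.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )
```

A partition that groups tightly linked films together scores higher than one that lumps everything into a single community, whose modularity is zero.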
In [65]: =
Out [65] =
In [66]: =
Out [66] =
In [67]: =
Out [67] =
In [68]: =
Out [68] =
In [69]: =
Out [69] =
In [70]: =
Out [70] =
In [71]: =
Out [71] =
In [72]: =
Out [72] =
Construction of the optimal sequence of watching movies
We have done quite a lot of work and now, finally, we can build an optimal sequence of watching movies:
In [73]: =
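The notebook cell is not reproduced here, so the exact method is not shown. Ordering films so that neighbours in the sequence are close is a shortest-path (travelling salesman type) problem over the pairwise distances. As a stand-in, the Python sketch below uses the simple greedy nearest-neighbour heuristic, which generally gives a reasonable but not optimal order. The function names and the list-of-lists distance matrix are assumptions.

```python
def nearest_neighbour_order(dist, start=0):
    """Greedy ordering: always move to the closest not-yet-visited item.

    dist: square matrix (list of lists) of pairwise distances.
    Ties are broken by the smaller index to keep the result deterministic.
    """
    n = len(dist)
    if n == 0:
        return []
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def path_length(dist, order):
    """Total distance travelled when visiting items in the given order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))
```

Trying every film as the starting point and keeping the order with the smallest `path_length` is a cheap way to improve on a single greedy run.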
Now we can obtain it (the function can output the result either as a table or as a collage of posters):
In [74]: =
Table of the optimal sequence of watching the films from the IMDb list of the 250 best films
Out [74] =
It can also be displayed as a collage of posters (the viewing order runs from left to right, top to bottom):
In [75]: =
Out [75] =
We can also consider optimal sequences for individual criteria:
Viewing sequence based on film descriptions
In [76]: =
Out [76] =
Viewing sequence based on film genres
In [77]: =
Out [77] =
Viewing sequence based on film casts
In [78]: =
Out [78] =
Viewing sequence based on film directors
In [79]: =
Out [79] =
Viewing sequence based on film screenwriters
In [80]: =
Out [80] =
Viewing sequence based on film composers
In [82]: =
Out [82] =
Viewing sequence based on film duration
In [83]: =
Out [83] =
Viewing sequence based on film posters
In [84]: =
Out [84] =
Viewing sequence based on country of production
In [85]: =
Out [85] =
Conclusion
I hope that this post was interesting and that some of the ideas and programs presented in it prove useful to you. There are, of course, many ways to use these algorithms and to extend and improve them. I deliberately simplified many things, since not all of the finished code can be published in full. If you are interested, you can write a parser for KinoPoisk or IMDB directly (in the latter case, the article about loading and analyzing the freely available IMDB databases may help) and use it to carry out a more detailed, higher-quality analysis of the films, as well as to improve the optimal viewing sequence obtained in this article. I hope that all these tasks will interest you!
Resources for learning Wolfram Language ( Mathematica ) in Russian: http://habrahabr.ru/post/244451