
Finding the best viewing order for the list of the 250 best films using the Wolfram Language (Mathematica)


Download the translation as a Mathematica document containing all the code used in the article here (archive, ~76 MB).

Introduction


Some time ago (515 days ago, to be exact) Matthias Odisio published a post entitled "Random and Optimal Mathematica Walks on IMDb's Top Films". It describes how to obtain an optimal sequence for watching the films on that list, based on how close the films are in genre and how close their posters are in color.

In [1]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_2.png
Out [1] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_3.png

I found the idea of that post quite interesting, but I wanted to expand and deepen it significantly along several lines:



In this article I will not describe how to build an optimal viewing sequence for the KinoPoisk list of the 250 best films, simply because I do not want to run into problems with that site's terms of use, which state quite clearly (see Section 6) that its list of films cannot be taken and analyzed without the site's consent. Applying the algorithms given below to that list is nevertheless straightforward. I would also note that, while working with one of the domestic film companies, I wrote a parser in the Wolfram Language that loaded film information from the KinoPoisk site (the legal side of the matter was settled) in order to automatically generate an advertising booklet covering several thousand films owned by that company. Below is an example of one such fully automatically generated booklet page (a non-final version is shown because of an NDA).

Sample page
Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_4.png

This article uses the information about the films available on Wikipedia, which avoids any problems with copyright holders. On the one hand this complicates the task (a parser for a centralized repository such as IMDb or KinoPoisk is easier to write), but on the other hand it makes some additional, interesting programs possible.

Importing data from Wikipedia


To begin with, load the symbolic HTML representation of the Wikipedia page "250 best films according to IMDb" (in the document we display only part of the result, using the Short function):

In [2]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_5.gif

Out [3] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_6.png

Now select the links to the films listed in the table on that page:

In [4]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_7.png

Out [4] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_8.png
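The notebook's own code for this step survives only as an image, so here is a minimal Python sketch of the same idea: walk the HTML and keep the article links that sit inside a table. The class name, the function name and the sample fragment are all hypothetical, and a real list page would need extra filtering (its table also links to years, directors and so on).

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags that appear inside <table> elements."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while the parser is inside a <table>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            # Keep internal article links only (an assumption about the page layout).
            if href and href.startswith("/wiki/"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def film_links(html):
    """Return article links found in tables, in page order, without duplicates."""
    parser = TableLinkCollector()
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


# Hypothetical fragment standing in for the real list page.
sample = """
<p><a href="/wiki/Outside">not in a table</a></p>
<table>
  <tr><td>1</td><td><a href="/wiki/Film_A">Film A</a></td></tr>
  <tr><td>2</td><td><a href="/wiki/Film_B">Film B</a></td></tr>
  <tr><td>3</td><td><a href="http://example.org/">external</a></td></tr>
</table>
"""
print(film_links(sample))
```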

Let's create a function that loads and saves the symbolic representation of the HTML code of each film's page:

In [5]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_9.gif
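A load-and-save function of this kind is essentially a download cache: fetch a page once, store it on disk, and read the stored copy on later calls. The Python sketch below shows that pattern under stated assumptions; `load_page`, `fetch_url` and the hashed file naming are hypothetical choices, not the notebook's implementation.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def fetch_url(url):
    """Download a page and return its text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def load_page(url, cache_dir, fetch=fetch_url):
    """Return the page text, downloading it only if no cached copy exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The file name is a hash of the URL, so any URL maps to a safe name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / name
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = fetch(url)
    cache_file.write_text(text, encoding="utf-8")
    return text


# Demonstration with a fake fetcher, so the example runs offline.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>" + url + "</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = load_page("https://example.org/film", tmp, fetch=fake_fetch)
    second = load_page("https://example.org/film", tmp, fetch=fake_fetch)

print(len(calls))  # the second call is served from the cache
```

Passing the fetcher as a parameter keeps the caching logic testable without touching the network.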

Auxiliary functions


Create a set of auxiliary functions that we need to process the loaded symbolic HTML:


In [8]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_10.png


In [9]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_11.png

In [10]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_12.gif


In [12]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_13.png

In [13]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_14.gif


In [15]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_15.png

In [16]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_16.png


In [17]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_17.png

Out [17] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_18.png


In [18]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_19.png

The dictionary contains 2 645 347 words:

In [19]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_20.gif

Out [19] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_21.png

Out [20] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_22.png


In [21]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_23.png

In [22]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_24.png

Examples of these functions in use:

In [23]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_25.png

Out [23] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_26.png

In [24]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_27.png

Out [24] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_28.png


In [25]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_29.png

In [26]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_30.png

Data processing


Now we can process the data for each film. The result, stored in the filmsData variable, is a database of film information built on the Association function, which makes the data very easy to access.

In [27]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_31.png

In [29]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_32.gif

An example of querying the resulting database by film number:

In [31]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_33.png

Out [31] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_34.png

An example of requesting the director and release year of every film:

In [32]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_35.png

Out [32] // Short =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_36.png
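In Python terms, an Association-based database behaves much like a dictionary of dictionaries. The sketch below mirrors the two queries above; the records and field names are invented placeholders, not data from the article.

```python
# Hypothetical records; the real database is built from the parsed Wikipedia pages.
films_data = {
    1: {"title": "Film A", "director": ["Director X"], "year": 1994, "duration": 142},
    2: {"title": "Film B", "director": ["Director Y"], "year": 1972, "duration": 175},
    3: {"title": "Film C", "director": ["Director X"], "year": 2008, "duration": 152},
}

# Look up one film by its number in the list.
film = films_data[2]
print(film["title"], film["year"])

# Ask every record for the same two fields.
directors_and_years = [(f["director"], f["year"]) for f in films_data.values()]
print(directors_and_years)
```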

Some statistics based on the data


To start with, let's simply create a collage of the posters of all the films:

In [33]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_37.png

Out [33] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_38.png

Plot the distribution of the number of films by year:

In [34]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_39.png

Out [34] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_40.png

Plot the distribution of films by duration:

In [35]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_41.png

Out [35] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_42.png

Plot the distribution of films by duration and release year:

In [36]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_43.png

Out [36] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_44.png

The top 10 actors by the number of films in which they appeared:

In [37]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_45.png

Out [37] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_46.png
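Rankings like this one come down to tallying how often each name occurs across the films and keeping the largest counts. A minimal Python sketch, assuming each film record stores a list of names under a field such as "actors" (the field name and sample records are hypothetical):

```python
from collections import Counter


def top_n(films, field, n=10):
    """Count in how many films each value of a list-valued field appears."""
    counts = Counter()
    for film in films:
        # set() makes sure a name listed twice for one film is counted once.
        counts.update(set(film.get(field, [])))
    # Sort by count (descending), then by name, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical records; the last one has no cast information at all.
films = [
    {"actors": ["Actor A", "Actor B"]},
    {"actors": ["Actor A", "Actor C"]},
    {"actors": ["Actor A", "Actor B", "Actor B"]},
    {},
]
print(top_n(films, "actors", n=2))
```

The same function would serve for directors, screenwriters, composers, countries and genres by changing the field name.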

The top 10 directors by the number of films they made:

In [38]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_47.png

Out [38] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_48.png

The top 10 screenwriters by the number of films they wrote the script for:

In [39]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_49.png

Out [39] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_50.png

The top 10 composers by the number of films they wrote the music for:

In [40]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_51.png

Out [40] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_52.png

The top 10 countries by the number of films produced in them:

In [41]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_53.png

Out [41] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_54.png

The top 10 genres by the number of films that belong to them:

In [42]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_55.png

Out [42] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_56.png

For those interested in film genres, I can recommend the article "Films and Mathematica: importing and processing information from the IMDB database", written some time ago, which obtains, among other things, the following distribution of films by genre:

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_57.png

A function that determines the distance between films


To measure the difference between two lists of objects, we will use a generalization of the Czekanowski-Sørensen coefficient (measure):

In [43]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_58.gif

Example:

In [45]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_59.png

Out [45] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_60.png
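For reference, the classical Sørensen coefficient of two sets is 2|A∩B| / (|A| + |B|). One natural way to extend it to lists with repeated elements is to count shared elements with multiplicity. The Python sketch below implements that variant; it may differ from the generalization used in the notebook, whose code is only available as an image, and the handling of two empty lists is a convention chosen here.

```python
from collections import Counter


def sorensen_similarity(a, b):
    """Sorensen-style similarity of two lists, counting repeated elements."""
    total = len(a) + len(b)
    if total == 0:
        # Convention chosen for this sketch: no data means no evidence of similarity.
        return 0.0
    # Counter & Counter keeps the minimum count of every shared element.
    shared = sum((Counter(a) & Counter(b)).values())
    return 2 * shared / total


def sorensen_distance(a, b):
    """Turn the similarity (0..1) into a dissimilarity (0..1)."""
    return 1.0 - sorensen_similarity(a, b)


print(sorensen_similarity(["drama", "crime"], ["drama", "thriller"]))
print(sorensen_similarity(list("aab"), list("ab")))
print(sorensen_distance(["drama"], ["drama"]))
```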

To measure how close two descriptions are with this coefficient, we create a function that extracts the Russian words from a film's description and reduces them to their base (dictionary) form:

In [46]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_61.png

An example of the function at work (the frequency of each word was additionally computed with the Tally function, and the words were sorted by decreasing frequency):

In [47]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_62.png

Out [47] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_63.png
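The two steps (pick out Russian words, reduce each to a base form, then tally) can be sketched in Python as follows. The tiny `LEMMAS` table is a hypothetical stand-in for a full word-form dictionary, and words missing from it are simply dropped, which is an assumption about the behaviour rather than a description of the notebook's code.

```python
import re
from collections import Counter

# Hypothetical miniature dictionary mapping word forms to their base form.
LEMMAS = {
    "фильм": "фильм", "фильма": "фильм", "фильмы": "фильм",
    "герой": "герой", "героя": "герой", "герои": "герой",
    "город": "город", "городе": "город",
}

# Runs of Cyrillic letters; "ё" lies outside the а-я range, so it is listed separately.
WORD = re.compile(r"[а-яё]+")


def normalized_words(text, lemmas=LEMMAS):
    """Extract Russian words and map each known form to its base form."""
    words = WORD.findall(text.lower())
    return [lemmas[w] for w in words if w in lemmas]


def word_frequencies(text, lemmas=LEMMAS):
    """Tally the base forms and sort them by decreasing frequency."""
    counts = Counter(normalized_words(text, lemmas))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


text = "Герои фильма живут в городе. Главный герой фильма (Hero) уезжает."
print(word_frequencies(text))
```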

Now create a function that determines how close two films are to each other. It is a weighted sum of several parameters, each normalized to one. In total, 11 similarity parameters are used: film description, genre(s), director, screenwriter(s), actors, cinematographer(s), composer(s), country (or countries) of production, release year, duration, and closeness of the posters. The parameters can be given different weights, but by default all weights are equal.

In [48]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_64.gif
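A weighted combination of normalized per-criterion similarities can be sketched in Python as below. Only four of the eleven criteria are included to keep the example short, and every name, the normalization spans (100 years, 200 minutes) and the sample films are assumptions made for illustration.

```python
from collections import Counter


def list_similarity(a, b):
    """Sorensen-style overlap of two lists, in the range 0..1."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # convention for this sketch: missing data adds no similarity
    return 2 * sum((Counter(a) & Counter(b)).values()) / total


def number_similarity(x, y, span):
    """Closeness of two numbers, 1 when equal and 0 when `span` or more apart."""
    return max(0.0, 1.0 - abs(x - y) / span)


# Each criterion maps a pair of film records to a value in 0..1.
CRITERIA = {
    "genres": lambda f, g: list_similarity(f["genres"], g["genres"]),
    "director": lambda f, g: list_similarity(f["director"], g["director"]),
    "year": lambda f, g: number_similarity(f["year"], g["year"], span=100),
    "duration": lambda f, g: number_similarity(f["duration"], g["duration"], span=200),
}


def film_similarity(f, g, weights=None):
    """Weighted average of the criteria; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 for name in CRITERIA}
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    score = sum(w * CRITERIA[name](f, g) for name, w in weights.items())
    return score / total_weight


# Hypothetical film records.
film_a = {"genres": ["drama", "crime"], "director": ["X"], "year": 1994, "duration": 140}
film_b = {"genres": ["drama"], "director": ["X"], "year": 1974, "duration": 180}

print(film_similarity(film_a, film_b))
print(film_similarity(film_a, film_b, weights={"director": 1.0}))
```

Dividing by the sum of the weights keeps the result in the range 0..1 whatever weights are chosen, so a distance can be taken as one minus the similarity.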

For the rest of the work we keep only those films for which at least some information is known (the Wikipedia pages of several films are empty):

In [62]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_65.png

Compute the measures of closeness (distance) between all pairs of films:

In [63]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_66.png
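Because the closeness measure is symmetric, each unordered pair needs to be evaluated only once; for 250 items that is 250·249/2 = 31,125 pairs. A minimal Python sketch of filling a symmetric matrix this way, with a toy similarity function standing in for the real one:

```python
from itertools import combinations


def similarity_matrix(items, similarity):
    """Build a symmetric matrix, calling `similarity` once per unordered pair."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = similarity(items[i], items[i])
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = similarity(items[i], items[j])
    return matrix


# Toy example: "films" are release years, closeness falls off over 100 years.
pair_calls = []


def year_similarity(x, y):
    pair_calls.append((x, y))
    return max(0.0, 1.0 - abs(x - y) / 100)


years = [1990, 2000, 2010]
matrix = similarity_matrix(years, year_similarity)
for row in matrix:
    print(row)
```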

Analysis of links between films


We will study the relationships between the films using graph-theoretic methods, namely the theory of community structure in graphs. To do this, create a function based on CommunityGraphPlot:

In [64]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_67.png

Using the film distance function constructed earlier, this function finds the communities in the graph; the redder and thicker the edge between two vertices, the closer the corresponding films are. Hovering over a vertex shows a tooltip with the poster and title of the film (the document with the interactive graphs and the source code can be downloaded from the link at the very beginning of the post).

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_68.png
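One common criterion for judging a division of a graph into communities is modularity: the share of edge weight that falls inside communities minus the share expected if edges were placed at random with the same vertex strengths. Whether the notebook relies on this criterion is not visible in the extracted text, so the Python sketch below is only an illustration of the concept; the function name and the toy graph are hypothetical.

```python
def modularity(edges, community_of):
    """Weighted modularity Q of a partition of an undirected graph.

    edges: dict mapping (u, v) pairs to positive weights, each edge listed once.
    community_of: dict mapping every vertex to its community label.
    """
    total = sum(edges.values())  # m: total edge weight
    if total == 0:
        return 0.0
    internal = {}  # edge weight lying inside each community
    strength = {}  # summed vertex strength (weighted degree) per community
    for (u, v), w in edges.items():
        cu, cv = community_of[u], community_of[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    # Q = sum over communities of  L_c / m - (d_c / 2m)^2
    return sum(
        internal.get(c, 0.0) / total - (strength[c] / (2 * total)) ** 2
        for c in strength
    )


# Toy graph: two triangles joined by a single edge, all weights equal to 1.
edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {v: 0 for v in "abcdef"}

print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping all vertices together, which is the behaviour a community search looks for.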

In [65]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_69.png

Out [65] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_70.png

In [66]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_71.png

Out [66] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_72.png

In [67]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_73.png

Out [67] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_74.png

In [68]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_75.png

Out [68] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_76.png

In [69]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_77.png

Out [69] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_78.png

In [70]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_79.png

Out [70] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_80.png

In [71]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_81.png

Out [71] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_82.png

In [72]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_83.png

Out [72] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_84.png

Building the optimal viewing sequence


We have done quite a lot of work, and now we can finally build an optimal sequence for watching the films:

In [73]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_85.png
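Ordering films so that neighbours in the sequence are as close as possible is a shortest-path problem over the distance matrix, of the travelling-salesman type. The solver used in the notebook is not visible in the extracted text; the Python sketch below swaps in two simple heuristics instead, a nearest-neighbour construction followed by 2-opt segment reversals. It gives a good order, not a guaranteed optimum, and all names are hypothetical.

```python
def path_length(order, dist):
    """Total distance between consecutive items of an open path."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def nearest_neighbour_order(dist, start=0):
    """Greedy path: always move to the closest item not yet visited."""
    order = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        last = order[-1]
        # Ties are broken by index so the result is deterministic.
        nxt = min(unvisited, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


def two_opt(order, dist):
    """Reverse segments while doing so shortens the path."""
    best = list(order)
    best_len = path_length(best, dist)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = path_length(candidate, dist)
                if cand_len < best_len - 1e-12:
                    best, best_len = candidate, cand_len
                    improved = True
    return best


# Toy data: five "films" placed on a line; distance is the gap between them.
positions = [0, 10, 1, 11, 2]
dist = [[abs(p - q) for q in positions] for p in positions]

greedy = nearest_neighbour_order(dist, start=1)
refined = two_opt(greedy, dist)
print(greedy, path_length(greedy, dist))
print(refined, path_length(refined, dist))
```

Recomputing the whole path length for every candidate keeps the code short but is slow; for a few hundred films an incremental length update would be the usual refinement.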

Now we can obtain it (the function can output the result either as a table or as a collage of posters):

In [74]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_86.png

Table of the optimal viewing sequence for the list of the 250 best films according to IMDb
Out [74] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_87.png

You can also display it as a collage of posters (the viewing order runs from left to right, top to bottom):

In [75]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_88.png

Out [75] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_89.png

We can also consider optimal sequences for individual criteria:

Viewing sequence based on the film's description
In [76]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_90.png

Out [76] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_91.png

Viewing sequence based on the film's genre
In [77]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_92.png

Out [77] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_93.png

Viewing sequence based on the film's cast
In [78]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_94.png

Out [78] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_95.png

Viewing sequence based on the film's director
In [79]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_96.png

Out [79] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_97.png

Viewing sequence based on the film's screenwriters
In [80]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_98.png

Out [80] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_99.png

Viewing sequence based on the film's composers
In [82]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_102.png

Out [82] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_103.png

Viewing sequence based on the film's duration
In [83]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_104.png

Out [83] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_105.png
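Duration is a single number per film, which makes this case easy to reason about. If the distance between two films is taken to be the absolute difference of their durations (an assumption; the notebook's normalization may differ), then sorting by duration already gives a shortest path, because the gaps between neighbours add up to exactly the longest duration minus the shortest. A small Python check with invented durations:

```python
def adjacent_gap_total(values):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


# Hypothetical durations in minutes.
durations = {"Film A": 142, "Film B": 96, "Film C": 201, "Film D": 118}

by_duration = sorted(durations, key=durations.get)
sorted_total = adjacent_gap_total([durations[t] for t in by_duration])
original_total = adjacent_gap_total(list(durations.values()))

print(by_duration)
print(sorted_total, original_total)
```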

Viewing sequence based on the film's poster
In [84]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_106.png

Out [84] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_107.png

Viewing sequence based on the film's country of production
In [85]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_108.png

Out [85] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_109.png

Conclusion


I hope this post has caught your interest and that some of the ideas and programs presented in it will be useful to you. You can, of course, think of many ways to use these algorithms and to extend and improve them. I deliberately simplified many things, since not all of the finished code can be published in full. If you are interested, you can write a parser for KinoPoisk or IMDb directly (in the latter case, an article about loading and analyzing information from the freely available IMDb databases may help) and use it for a more detailed, higher-quality analysis of the films, as well as to improve the optimal viewing sequence obtained in this article. I hope all of these tasks will interest you!

Resources for learning the Wolfram Language (Mathematica) in Russian: http://habrahabr.ru/post/244451

Source: https://habr.com/ru/post/245735/

