Finding the optimal order for watching the 250 best films using the Wolfram Language (Mathematica)

Download the translation as a Mathematica document containing all the code used in the article here (archive, ~76 MB).

Introduction

Some time ago (515 days, to be exact), Matthias Odisio published a post entitled “Random and Optimal Mathematica Walks on IMDb's Top Films”. It describes how to obtain an optimal order for watching the films on IMDb's list of the 250 best films, based on how close the films are in genre and how close their posters are in color.

In [1]: =

Out [1] =

I found the idea of that post quite interesting, but I wanted to expand and deepen it significantly, along the following lines:

A distance function between films that relies only on their genres and on the colors of their posters is not objective enough. It seems more reasonable to me to build the distance from several factors: genres, description, cast, director(s), year of production, screenwriter(s), and so on.

Matthias's article used only Wolfram|Alpha data, which certainly simplifies the task and shortens the code. I also want to show how to use data from any source in the calculations, for example data obtained by scraping Wikipedia pages or loaded from text databases.

In this article I will not describe how to build an optimal viewing order for the KinoPoisk list of the 250 best films, simply because I do not want problems with that site's terms of use, which state quite clearly (see Section 6) that you cannot take their list of films and analyze it without their consent. That said, applying the algorithms given below to that list is quite simple. I would also note that, while working with a domestic film company, I wrote a parser in the Wolfram Language that loaded information about films from the KinoPoisk site (the legal side of the matter was settled) in order to automatically generate a promotional booklet covering several thousand films owned by that company. Below is an example of one such fully automatically generated booklet page (a non-final version is shown because of an NDA).

Sample page

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_4.png

This article uses the information about films available on Wikipedia, which avoids any problems with copyright holders. On the one hand, this complicates the task (a parser for a centralized repository such as IMDb or KinoPoisk is easier to write), but on the other it makes it possible to build some additional, interesting programs.

Importing data from Wikipedia

To begin, load the symbolic HTML representation of the Wikipedia page “250 best films according to IMDb” (in the document we display only part of the result, using the Short function):

In [2]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_5.gif


Out [3] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_6.png

Now select the links to the films listed in the table on that page:

In [4]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_7.png


Out [4] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_8.png
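
The original notebook cell for this step survives only as an image, so here is a minimal sketch of the same idea in Python rather than in the Wolfram Language. It collects the links that sit inside a table in an HTML string. The class name `TableLinkParser`, the helper `table_links` and the sample markup are all hypothetical and only illustrate the technique.

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <table> elements."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # how many <table> elements we are currently inside
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def table_links(html):
    """Return the links found inside tables, in document order."""
    parser = TableLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical sample markup standing in for the real page.
sample_html = """
<p><a href="/wiki/Outside">a link outside the table</a></p>
<table>
  <tr><td>1</td><td><a href="/wiki/Film_A">Film A</a></td></tr>
  <tr><td>2</td><td><a href="/wiki/Film_B">Film B</a></td></tr>
</table>
"""
print(table_links(sample_html))
```

Links outside the table are ignored, which is the point of filtering by table: navigation and footer links on the page should not end up in the list of films.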

Let's create a function that loads and saves a symbolic representation of the HTML code of each film's page:

In [5]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_9.gif


Auxiliary functions

Let's create a set of auxiliary functions that we need to process the loaded symbolic HTML:

A function that removes the HTML wrappers, leaving only the data:

In [8]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_10.png

A function that determines whether a string can be a Russian word (that is, whether it consists only of letters of the Russian alphabet and hyphens):

In [9]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_11.png

In [10]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_12.gif

A function that determines whether a string can be a Russian or English word (that is, whether it consists only of letters of the Russian or English alphabet and hyphens):

In [12]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_13.png

In [13]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_14.gif
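
As an illustration of these two checks, here is a small Python sketch (not the original Wolfram Language code, which is only available as images). It follows the definitions above literally: a string passes if it is non-empty and contains nothing but the allowed letters and hyphens. The function names are hypothetical.

```python
import re

# Letters of the Russian alphabet (including the letter yo) and the hyphen.
_RUSSIAN = re.compile(r"[А-Яа-яЁё-]+")
# The same, plus the letters of the English alphabet.
_RUSSIAN_OR_ENGLISH = re.compile(r"[A-Za-zА-Яа-яЁё-]+")


def is_russian_word(s):
    """True if s is non-empty and consists only of Russian letters and hyphens."""
    return bool(_RUSSIAN.fullmatch(s))


def is_russian_or_english_word(s):
    """True if s is non-empty and consists only of Russian/English letters and hyphens."""
    return bool(_RUSSIAN_OR_ENGLISH.fullmatch(s))


print(is_russian_word("кто-то"), is_russian_word("film"))
print(is_russian_or_english_word("sci-fi"), is_russian_or_english_word("фильм2"))
```

Note that this is only a character-level filter: a string such as a lone hyphen also passes, so a dictionary check is still needed to decide whether something really is a word.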

A function that converts the uppercase letters of the Russian alphabet in a string to lowercase:

In [15]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_15.png

In [16]: =

To analyze the film descriptions, we need information about Russian words and about the links between the different forms of the same word. Let's load the morphological dictionary of the Russian language compiled by Academician Andrei Anatolyevich Zaliznyak:

In [17]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_17.png

Out [17] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_18.png

We process the dictionary data, building from it a list of Russian words ( russianWords ) and a list of rules that replace the forms of Russian words with their standard form ( russianWords Standard TrackForm ):

In [18]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_19.png

The dictionary contains 2 645 347 words:

In [19]: =

Out [19] =

Out [20] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_22.png

Create a function that checks whether a word is in the dictionary, and a function that converts a Russian word to its standard form:

In [21]: =

In [22]: =

Examples of these functions in use:

In [23]: =

Out [23] =

In [24]: =

Out [24] =
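
The cells with these two functions and their outputs did not survive, so here is a rough Python sketch of how such a lookup could work. The miniature dictionary below is hypothetical and stands in for the real morphological dictionary; the function names are also my own.

```python
# Hypothetical miniature dictionary: word form -> standard (dictionary) form.
TO_STANDARD_FORM = {
    "фильм": "фильм",
    "фильма": "фильм",
    "фильмы": "фильм",
    "лучший": "лучший",
    "лучших": "лучший",
}
# A set gives fast membership tests even for millions of word forms.
RUSSIAN_WORDS = set(TO_STANDARD_FORM)


def is_in_dictionary(word):
    """Check whether a word form (case-insensitively) is present in the dictionary."""
    return word.lower() in RUSSIAN_WORDS


def to_standard_form(word):
    """Map a word form to its standard form; unknown words are returned lowercased."""
    w = word.lower()
    return TO_STANDARD_FORM.get(w, w)


print(is_in_dictionary("Фильмы"), to_standard_form("Фильмы"))
print(is_in_dictionary("кино"), to_standard_form("кино"))
```

Returning unknown words unchanged (apart from lowercasing) is a design choice of this sketch; dropping them instead would be equally reasonable when only dictionary words are wanted.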

Create a function that determines whether a word is an adjective:

In [25]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_29.png

In [26]: =

Data processing

Now we can process the data for each film. The result, stored in the filmsData variable, is a database of film information built on Association, which makes the data very easy to access.

In [27]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_31.png


In [29]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_32.gif

An example of querying the resulting database by film number:

In [31]: =

Out [31] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_34.png

An example of querying the director and release year of each film:

In [32]: =

Out [32] // Short =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_36.png

Some statistics based on the data

To start, let's simply create a collage of the posters of all the films:

In [33]: =

Out [33] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_38.png

Plot the distribution of the number of films by year:

In [34]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_39.png


Out [34] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_40.png

Plot the distribution of films by running time:

In [35]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_41.png


Out [35] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_42.png

Plot the distribution of films by running time and release year:

In [36]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_43.png


Out [36] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_44.png

The top 10 actors by the number of films in which they appeared:

In [37]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_45.png


Out [37] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_46.png
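
All of the "top 10" rankings in this section follow the same pattern: flatten one field across all films, count the occurrences, and keep the most frequent values. Here is a minimal Python sketch of that pattern (not the original Wolfram Language cell). The records and names below are hypothetical placeholders, not data from the article.

```python
from collections import Counter


def top_n(films, field, n=10):
    """Return the n most frequent values of a list-valued field across all films."""
    counts = Counter(name for film in films for name in film.get(field, []))
    return counts.most_common(n)


# Hypothetical placeholder records.
films = [
    {"title": "Film 1", "actors": ["Actor A", "Actor B"]},
    {"title": "Film 2", "actors": ["Actor A"]},
    {"title": "Film 3", "actors": ["Actor A", "Actor C", "Actor B"]},
]
print(top_n(films, "actors", n=2))
```

The same helper works for directors, screenwriters, composers, countries and genres by changing the `field` argument, provided each field is stored as a list.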

The top 10 directors by the number of films they directed:

In [38]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_47.png


Out [38] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_48.png

The top 10 screenwriters by the number of films they wrote:

In [39]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_49.png


Out [39] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_50.png

The top 10 composers by the number of films they scored:

In [40]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_51.png


Out [40] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_52.png

The top 10 countries by the number of films produced in them:

In [41]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_53.png


Out [41] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_54.png

The top 10 genres by the number of films that belong to them:

In [42]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_55.png


Out [42] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_56.png

For those interested in film genres, I can recommend the article “Films and Mathematica: importing and processing information from the IMDB database”, written some time ago, in which, among other things, the following distribution of films by genre is obtained:

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_57.png


A function for the distance between films

To measure the difference between two lists of objects, we will use a generalization of the Czekanowski-Sørensen coefficient (measure):

In [43]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_58.gif


Example:

In [45]: =

Out [45] =
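
Since the cell with the definition is available only as an image, here is a Python sketch of one common way to generalize the coefficient to lists with repeated elements (multisets): twice the size of the multiset intersection divided by the total number of elements. Whether this matches the exact generalization used in the notebook is an assumption on my part, as are the function names.

```python
from collections import Counter


def sorensen_similarity(a, b):
    """Sørensen-type similarity of two lists treated as multisets, in [0, 1]."""
    ca, cb = Counter(a), Counter(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        # Convention of this sketch: two empty lists count as identical.
        return 1.0
    # Counter & Counter keeps the minimum count of each common element.
    overlap = sum((ca & cb).values())
    return 2.0 * overlap / total


def sorensen_distance(a, b):
    """Distance derived from the similarity: 0 for identical lists, 1 for disjoint ones."""
    return 1.0 - sorensen_similarity(a, b)


print(sorensen_similarity(["drama", "crime"], ["drama", "thriller"]))
print(sorensen_distance(["drama"], ["comedy"]))
```

For two genre lists that share one of their two genres, the similarity is 2 * 1 / (2 + 2) = 0.5. The treatment of two empty lists is a convention; for films with missing data it may be preferable to exclude that criterion instead.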

To measure how close two descriptions are using this coefficient, we create a function that extracts the Russian words from a film description and converts them to their standard form:

In [46]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_61.png


An example of the function in use (the frequency of each word is additionally computed with the Tally function, and the words are sorted by decreasing frequency):

In [47]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_62.png


Out [47] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_63.png

Now create a function that determines how close films are to each other. It is a weighted sum of several parameters, each normalized to one. In total, 11 similarity parameters are used: film description, genre(s), director, screenwriter(s), actors, cinematographer(s), composer(s), country or countries of production, release year, running time, and closeness of the posters. You can assign them different weights, but by default all the weights are equal.

In [48]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_64.gif
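
To make the idea of a weighted combination concrete, here is a Python sketch with just two of the criteria (not the original Wolfram Language function). Dividing by the sum of the weights, so that the result stays between 0 and 1, is an assumption of this sketch, and so are the helper names, the 100-year span used to scale the year difference, and the sample records.

```python
def set_distance(key):
    """Build a per-criterion distance in [0, 1] for a list-valued field."""
    def distance(a, b):
        x, y = set(a[key]), set(b[key])
        if not x and not y:
            return 0.0
        return 1.0 - 2.0 * len(x & y) / (len(x) + len(y))
    return distance


def year_distance(a, b, span=100.0):
    """Difference in release year, scaled by an assumed span and capped at 1."""
    return min(abs(a["year"] - b["year"]) / span, 1.0)


def combined_distance(film_a, film_b, distances, weights=None):
    """Weighted average of per-criterion distances; equal weights by default."""
    names = list(distances)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[n] * distances[n](film_a, film_b) for n in names)
    return weighted / total_weight


# Hypothetical records and criteria.
criteria = {"genres": set_distance("genres"), "year": year_distance}
film_a = {"genres": ["drama", "crime"], "year": 1990}
film_b = {"genres": ["drama"], "year": 2000}

print(combined_distance(film_a, film_b, criteria))
print(combined_distance(film_a, film_b, criteria, {"genres": 1.0, "year": 0.0}))
```

Setting a weight to zero switches a criterion off, which is how single-criterion viewing orders can be obtained from the same function.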


For further work, we select those films for which at least some information is available (for several films the Wikipedia pages are empty):

In [62]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_65.png


Compute all the pairwise measures of proximity (distance) between the films:

In [63]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_66.png


Analysis of the links between films

We will study the relationships between films with the methods of graph theory, namely the theory of community structure in graphs. To do this, create a function based on CommunityGraphPlot:

In [64]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_67.png


Using the previously constructed distance function, this function finds communities in the graph; the redder and thicker the edge between two vertices, the closer the films are. Hovering over a vertex of the graph shows a tooltip with the film's poster and title (the document with interactive graphs and source code can be downloaded from the link at the very beginning of the post).

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_68.png


In [65]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_69.png

Out [65] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_70.png

In [66]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_71.png

Out [66] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_72.png

In [67]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_73.png

Out [67] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_74.png

In [68]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_75.png

Out [68] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_76.png

In [69]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_77.png

Out [69] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_78.png

In [70]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_79.png

Out [70] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_80.png

In [71]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_81.png

Out [71] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_82.png

In [72]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_83.png

Out [72] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_84.png

Constructing the optimal sequence for watching the films

We have done quite a lot of work and now, finally, we can build an optimal sequence of watching movies:

In [73]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_85.png
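
Finding a viewing order that minimizes the total distance between consecutive films is a shortest-path variant of the travelling salesman problem. The Wolfram Language has a built-in FindShortestTour function for such problems; whether the cell above uses it is my assumption, since the code is only available as an image. As a language-neutral illustration, here is a Python sketch of a simple heuristic (nearest neighbor followed by 2-opt improvement). It is a swapped-in technique, not the method from the notebook, and it does not guarantee a truly optimal order.

```python
def path_length(order, dist):
    """Total distance between consecutive items of an open path."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def nearest_neighbor_order(dist, start=0):
    """Greedy order: always move to the closest item not yet visited."""
    unvisited = set(range(len(dist))) - {start}
    order = [start]
    while unvisited:
        last = order[-1]
        # Ties are broken by the smaller index to keep the result deterministic.
        nxt = min(unvisited, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


def two_opt(order, dist):
    """Reverse segments of the path for as long as that shortens it."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            for j in range(i + 1, len(order)):
                candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if path_length(candidate, dist) < path_length(order, dist) - 1e-12:
                    order = candidate
                    improved = True
    return order


# Hypothetical example: five "films" placed on a line; distance = gap between them.
positions = [0, 10, 1, 9, 2]
dist = [[abs(p - q) for q in positions] for p in positions]

order = two_opt(nearest_neighbor_order(dist), dist)
print(order, path_length(order, dist))
```

This version recomputes the whole path length for every candidate, which keeps the code short but is slow for large inputs; for a couple of hundred films an incremental length update or a dedicated solver would be a better choice.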


Now we can obtain it (the function can output the result either as a table or as a collage of posters):

In [74]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_86.png


Table of the optimal sequence of watching movies from the list of 250 best films according to IMDb

Out [74] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_87.png

You can also display it as a collage of posters (the viewing order runs from left to right, top to bottom):

In [75]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_88.png


Out [75] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_89.png

We can also consider optimal sequences for individual criteria:

Viewing sequence based on the film description

In [76]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_90.png

Out [76] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_91.png

Viewing sequence based on the film genre

In [77]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_92.png

Out [77] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_93.png

Viewing sequence based on the film cast

In [78]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_94.png

Out [78] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_95.png

Viewing sequence based on the film director

In [79]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_96.png

Out [79] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_97.png

Viewing sequence based on the film screenwriters

In [80]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_98.png

Out [80] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_99.png

Viewing sequence based on the film composers

In [82]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_102.png

Out [82] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_103.png

Viewing sequence based on the film running time

In [83]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_104.png

Out [83] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_105.png

Viewing sequence based on the film poster

In [84]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_106.png

Out [84] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_107.png

Viewing sequence based on the country of production

In [85]: =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_108.png

Out [85] =

Poisk-posledovatelnosti-prosmotra-spiska-250-luchshih-filmov-Wolfram-Language-Mathematica_109.png

Conclusion

I hope this post has interested you and that some of the ideas and programs in it will prove useful. There are, of course, many ways to use these algorithms and to extend and improve them. I deliberately simplified many things, since not all of the finished code can be published in full. If you are interested, you can write a parser for KinoPoisk or IMDb directly (in the latter case, an article on loading and analyzing information from the freely available IMDb databases may help) and use it to carry out a more detailed, higher-quality analysis of the films, as well as to improve the optimal viewing sequence obtained in this article. I hope all of these tasks will interest you!

Resources for learning Wolfram Language ( Mathematica ) in Russian: http://habrahabr.ru/post/244451

Source: https://habr.com/ru/post/245735/
