
Some people are waiting for Christmas, others for the new Star Wars film, "The Force Awakens." In the meantime, I decided to analyze the six existing films quantitatively and extract the social networks they contain, both for each film separately and for the whole Star Wars universe combined. A close look at these networks reveals interesting differences between the original trilogy and the prequels.
Below is the social network extracted from all six films combined.
You can open an interactive page where the visualization lets you drag individual nodes with the mouse. Hovering over a node shows the character's name.
Nodes are characters. A line between two characters means that they speak in the same scene; the more often they speak together, the thicker the line. The size of each node is proportional to the number of scenes in which the character appears. I had to make a number of difficult decisions: for example, Anakin and Darth Vader are obviously the same character, but they are represented by separate nodes in the visualization, since this distinction is important for the plot. Conversely, I deliberately merged Palpatine with Darth Sidious, and Amidala with Padme.
The characters of the original trilogy are mostly on the right and are almost entirely separated from the characters of the prequels, since most characters appear in only one of the trilogies. The main nodes connecting the two networks are Obi-Wan Kenobi, R2-D2 and C-3PO. The droids are clearly important to the plot, because they appear in the films most often. The two subnetworks also differ in structure. The original trilogy has fewer important nodes (Luke, Han, Leia, Chewbacca and Darth Vader), and they are tightly interconnected. The prequels show more nodes and more links.
Character timelines
Since the same characters appear in different films, I created a comparative timeline, broken down by episode.

The timeline shows every mention of a character, including mentions of their name in other characters' dialogue. Anakin appears alongside Darth Vader in Episode III, after which Darth Vader takes over. Anakin reappears at the end of Episode VI, when Darth Vader turns away from the Dark Side.
The characters who appear consistently across all the films sit at the center of the social network: Obi-Wan, C-3PO and R2-D2. Yoda and the Emperor also appear in all the films, but they speak to only a small number of people.
Networks for individual episodes
Now consider the episodes separately. Notice how the number of nodes and the complexity of the networks change from the prequels to the original episodes.






Again, it is clear that the prequels have more characters and more interactions between different characters. The original films have fewer characters, but they interact more often.
George Lucas once said:
In fact, this is the story of the tragedy of Darth Vader, it begins when he is nine years old, and ends with his death.
But is Darth Vader/Anakin really the central character? Let's apply the methods of network analysis to identify the central characters and the social structure around them. I calculated two measures of a character's importance in the network.
- Degree centrality: the number of links attached to a node, that is, the number of other characters the character speaks with in a scene.
- Betweenness centrality: the number of shortest paths that pass through a node. For example, if you are Leia and you want to send a message to Greedo, the shortest path runs through Han Solo. To send a message to Luke, you do not need to go through Han, since Leia knows Luke personally. Han's betweenness is computed as the number of shortest paths between all other pairs of characters that pass through him.
The first measure shows how many characters a character is in contact with, and the second shows how important the character is for holding the story together. Characters with high betweenness connect different parts of the social network.
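To make the two measures concrete, here is a minimal sketch in plain F# that computes them for a tiny network. The four edges are invented for illustration and are not taken from the film data, and the helper names (`shortestPaths`, `betweenness`) are my own. The actual analysis relies on the igraph package rather than on code like this.

```fsharp
// Toy undirected network; the edges are illustrative, not the real film data.
let edges = [ "Leia", "Han"; "Leia", "Luke"; "Han", "Luke"; "Han", "Greedo" ]

let nodes = edges |> List.collect (fun (a, b) -> [ a; b ]) |> List.distinct

// Adjacency lookup: every edge is stored in both directions.
let neighbours =
    edges
    |> List.collect (fun (a, b) -> [ a, b; b, a ])
    |> List.groupBy fst
    |> List.map (fun (n, pairs) -> n, pairs |> List.map snd)
    |> Map.ofList

// Degree centrality: number of links attached to each node.
let degree = neighbours |> Map.map (fun _ ns -> List.length ns)

// Breadth-first search from `source`. For every reachable node it returns
// the distance and the number of distinct shortest paths leading to it.
let shortestPaths (source: string) =
    let rec loop (frontier: string list) (found: Map<string, int * float>) dist =
        if List.isEmpty frontier then found
        else
            // Nodes one step further away, with path counts summed per node.
            let next =
                frontier
                |> List.collect (fun n ->
                    let (_, count) = found.[n]
                    neighbours.[n]
                    |> List.filter (fun m -> not (found.ContainsKey m))
                    |> List.map (fun m -> m, count))
                |> List.groupBy fst
                |> List.map (fun (m, counts) -> m, counts |> List.sumBy snd)
            let found' =
                next |> List.fold (fun acc (m, c) -> Map.add m (dist + 1, c) acc) found
            loop (List.map fst next) found' (dist + 1)
    loop [ source ] (Map.ofList [ source, (0, 1.0) ]) 0

let paths = nodes |> List.map (fun n -> n, shortestPaths n) |> Map.ofList

// Betweenness of v: for every unordered pair (s, t) that does not contain v,
// add the fraction of shortest s-t paths that pass through v.
let betweenness (v: string) =
    [ for s in nodes do
        for t in nodes do
            if s < t && s <> v && t <> v then
                let fromS = paths.[s]
                let fromV = paths.[v]
                match Map.tryFind t fromS, Map.tryFind v fromS, Map.tryFind t fromV with
                | Some (dst, total), Some (dsv, viaS), Some (dvt, viaT) when dsv + dvt = dst ->
                    yield viaS * viaT / total
                | _ -> () ]
    |> List.sum

for n in nodes do
    printfn "%-7s degree=%d betweenness=%.1f" n degree.[n] (betweenness n)
```

In this toy network Han has degree 3 and betweenness 2, because both shortest paths to Greedo (from Leia and from Luke) run through him, while every other character has betweenness 0.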
The larger the value, the more important the character. Below are the top five characters for each film, ranked by each measure.

In the first three episodes, Anakin is the most connected character. At the same time, he plays almost no part in holding the network together: his betweenness is so low that he does not even make the top five. In other words, the other characters communicate directly rather than through him. What does this look like for the original trilogy?

The centrality analysis puts numbers on the impression given by the network visualizations. In the prequels the social structure is more complex, with more characters. Anakin is not a central figure: some storylines develop in parallel or relate to him only indirectly. The original trilogy, on the other hand, looks more coherent, with fewer characters linking the story together.
Perhaps this is why the original trilogy is more popular. Its plots are more coherent and are driven by the main characters. The prequels are less centralized and have no central character.
What do these measures look like when applied to all the films at once? I ran the calculation in two versions: with Anakin and Darth Vader as separate characters, and with the two merged.
On the left, two separate characters; on the right, the characters merged:

In the first case, Anakin remains the most connected character, but not the central one. When the two are merged, he becomes the third most important character in the betweenness ranking. Either way, it turns out that the character who actually ties the films together is Obi-Wan Kenobi.

How it was done
For the most part I used F#, combining it with D3.js for the social network visualizations and R for the network centrality analysis. All the source code is available on GitHub. Here I will go through only the most interesting parts of the code.
Parsing scripts
I downloaded all the scripts for free from The Internet Movie Script Database (IMSDb) (example: the script of Episode IV: A New Hope). Admittedly, these are mostly drafts, which often differ from the final versions.
The first step was parsing the scripts. It turned out that the files have slightly different formats. They are all HTML, with the script text placed either between <pre> tags or between another pair of tags, depending on the file. I used the HTML parser from the F# Data library, which gives access to individual tags through queries like:
open FSharp.Data

let url = "http://www.imsdb.com/scripts/Star-Wars-A-New-Hope.html"
HtmlDocument.Load(url).Descendants("pre")

The code is available in the parseScripts.fs file.
The next step is to extract the necessary information from the scripts. Usually they look like this:
INT. GANTRY - OUTSIDE CONTROL ROOM - REACTOR SHAFT
Luke moves along the railing and up to the control room.
[...]
LUKE
He told me enough! It was you
who killed him.
VADER
No. I am your father.
Shocked, Luke looks at Vader in utter disbelief.
LUKE
No. No. That's not true!
That's impossible!
Each scene begins with a scene heading marked INT. (interior) or EXT. (exterior). It may be followed by descriptive text. In the dialogue, character names are written in capital letters and in bold.
The bold INT. and EXT. markers can therefore serve as scene separators.
I wrote a recursive function that takes the entire script and searches for this pattern: a bold EXT. or INT., optionally preceded by a scene number. It splits the lines into the current scene and the rest of the text, and then repeats the procedure recursively on the rest.
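The recursive splitting described above could look roughly like the sketch below. It is a reconstruction under assumptions, not the code from parseScripts.fs: I assume the script is available as an array of lines, that bold text is marked with a literal `<b>` tag, and that the heading line itself can be dropped. The name and argument order of `splitByScene` follow the way the function is called elsewhere in this article.

```fsharp
open System.Text.RegularExpressions

// Assumed heading pattern: a bold tag, an optional scene number, then INT. or EXT.
let sceneHeading = Regex(@"<b>\s*[0-9]*\s*(INT\.|EXT\.)")

// Splits script lines into scenes. Scenes are accumulated in reverse order,
// so callers apply List.rev to restore the order of the script.
let rec splitByScene (script: string[]) (scenes: string[] list) =
    match script |> Array.tryFindIndex (fun line -> sceneHeading.IsMatch(line)) with
    | Some i ->
        // Everything before the heading closes the current scene;
        // everything after it is processed recursively.
        let currentScene = Array.take i script
        let rest = Array.skip (i + 1) script
        splitByScene rest (currentScene :: scenes)
    | None -> script :: scenes

// Usage on a few made-up lines.
let sample =
    [| "opening crawl"
       "<b>INT. GANTRY - OUTSIDE CONTROL ROOM"
       "LUKE"
       "<b>12 EXT. DESERT"
       "HAN" |]

let sampleScenes = splitByScene sample [] |> List.rev
```

Note that in this sketch any text before the first heading becomes a scene of its own, which may or may not be what the original function does.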
Getting the list of characters
In some scenes, character names follow the format I described above. Others use only the name followed by a colon. Both formats can even appear on the same line. The only common feature is that the names are written in capital letters.
I had to use regular expressions.
Each regular expression matches not only capital letters but also digits, dashes, spaces and slashes, because character names vary: "R2-D2" or even "FODE / BEED".
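As an illustration, a single regular expression of this kind might look like the sketch below. The pattern and the `findNames` helper are my own assumptions rather than the expressions used in the repository. The sketch only shows how capitals, digits, dashes, spaces and slashes can be combined into one character class.

```fsharp
open System.Text.RegularExpressions

// Assumed pattern: a run of capitals, digits, dashes, slashes and spaces
// that starts and ends with a capital letter or a digit (at least two characters).
let namePattern = Regex(@"[A-Z0-9][A-Z0-9/\- ]*[A-Z0-9]")

// Returns every candidate name found in a line of the script.
let findNames (line: string) =
    namePattern.Matches(line)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> Seq.toList

// Usage on made-up lines in both formats.
let fromHeading = findNames "R2-D2"
let fromColon = findNames "FODE / BEED: Hello there"
let fromProse = findNames "He told me enough! It was you"
```

A pattern like this will also pick up any other run of capitals in the text, so its output still has to be filtered against a list of real character names.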
I also had to account for characters who go by several names: Palpatine, Darth Sidious and the Emperor, or Amidala and Padme. I created an alias file, aliases.csv, in which I listed the names to be merged.
let aliasFile = __SOURCE_DIRECTORY__ + "/data/aliases.csv"
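A file like this can be turned into a lookup table with a few lines of F#. The sketch below assumes a layout of one "ALIAS,CANONICAL NAME" pair per line, which may differ from the real aliases.csv, and the `parseAliases` and `mapName` helpers are hypothetical names used only for illustration.

```fsharp
open System.Collections.Generic

// Assumed file layout: one "ALIAS,CANONICAL NAME" pair per line.
let parseAliases (lines: seq<string>) : IDictionary<string, string> =
    lines
    |> Seq.map (fun line -> line.Split(','))
    |> Seq.filter (fun parts -> parts.Length = 2)
    |> Seq.map (fun parts -> parts.[0].Trim(), parts.[1].Trim())
    |> dict

// Replaces an alias with its canonical name; other names pass through unchanged.
let mapName (aliases: IDictionary<string, string>) (name: string) =
    match aliases.TryGetValue name with
    | true, canonical -> canonical
    | _ -> name

// Usage with in-memory lines; the real data would come from
// System.IO.File.ReadAllLines aliasFile.
let sampleAliases = parseAliases [ "DARTH SIDIOUS,EMPEROR"; "PALPATINE,EMPEROR"; "AMIDALA,PADME" ]
let merged = [ "PALPATINE"; "LUKE"; "AMIDALA" ] |> List.map (mapName sampleAliases)
```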
And now, finally, you can extract the names of the characters from the scenes. The following function retrieves all character names from all scripts for which URLs are specified.
let allNames =
    scriptUrls
    |> List.map (fun (episode, url) ->
        let script = getScript url
        let scriptParts = script.Elements()
        let mainScript =
            scriptParts
            |> Seq.map (fun element -> element.ToString())
            |> Seq.toArray
        // ...

There was one more problem: some of the extracted names were not real names but generic labels such as “Pilot”, “Officer” or “Captain”. I had to filter out the real names by hand. This is how the characters.csv list came about.
Character interaction
To build the networks, I needed to identify every case where characters speak to each other, which I defined as speaking in the same scene. (I dropped the cases where characters talk over an intercom or radio and are therefore in different scenes.)

open System.IO

let characters =
    File.ReadAllLines(__SOURCE_DIRECTORY__ + "/data/characters.csv")
    |> Array.append (Seq.append aliasDict.Keys aliasDict.Values |> Array.ofSeq)
    |> set

Here I created a set of all character names and their aliases for searching and filtering. Then I used it to search for characters in each of the scenes.
let scenes = splitByScene mainScript [] |> List.rev

let namesInScenes =
    scenes
    |> List.map getCharacterNames
    |> List.map (fun names -> names |> Array.filter (fun n -> characters.Contains n))

Then I used the filtered list of characters to define the social network.
This gave me the list of nodes together with the number of times each character speaks throughout the script; this count determines the size of the node. Then I created a link between every two characters who speak in the same scene and counted how many scenes they share. Together, the nodes and links define the entire social network.
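The counting step can be sketched as follows. The input is a hand-made stand-in for `namesInScenes`, and the exact definitions (one node count per spoken line, one link count per shared scene) are my reading of the description above rather than a copy of getInteractions.fsx.

```fsharp
// Hypothetical input: speaker names per scene, one entry per line of dialogue.
let namesInScenes =
    [ [| "LUKE"; "VADER"; "LUKE" |]
      [| "LUKE"; "LEIA" |]
      [| "LUKE"; "VADER" |] ]

// Node size: how many times each character speaks across all scenes.
let nodes =
    namesInScenes
    |> Seq.collect id
    |> Seq.countBy id
    |> Seq.toList

// Link weight: number of scenes in which both characters speak.
// Names are sorted so that each pair is always written in the same order.
let links =
    namesInScenes
    |> List.collect (fun names ->
        let speakers = names |> Array.distinct |> Array.sort |> Array.toList
        [ for a in speakers do
            for b in speakers do
                if a < b then yield (a, b) ])
    |> List.countBy id

printfn "%A" nodes
printfn "%A" links
```

With this input Luke speaks four times, and the Luke-Vader link has weight 2 because the two share two scenes.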
Finally, I output the data in JSON format. All the social networks, both the global one and those for individual episodes, can be found on my GitHub. The complete code for this step is in the getInteractions.fsx file.
Character mentions
I also decided to find every mention of each character in order to build the timeline. The code turned out to be similar to the code that extracts the characters' dialogue, except that here I looked for all mentions, not only those in dialogue. I also kept track of scene numbers. The following code returns a list of scene numbers and the characters mentioned in them.
let scenes = splitByScene mainScript [] |> List.rev
let totalScenes = scenes.Length

scenes
|> List.mapi (fun sceneIdx scene ->
    // ...

To build the timelines, I used the scene numbers to place each episode on an interval of the form [episode index - 1, episode index]. This gave me a relative scale for the appearance of characters across the episodes: times in the interval [0,1] belong to Episode I, times in [1,2] to Episode II, and so on.
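One simple way to obtain such a scale is to spread the scenes of each episode evenly over its interval. The function below is a sketch under that assumption. The name `sceneTime` and the linear spacing are mine, and the repository may compute the positions differently.

```fsharp
// Maps a scene to a point on the timeline: Episode k occupies [k - 1, k].
// `episodeIdx` is 1-based; `sceneIdx` is 0-based, as produced by List.mapi.
let sceneTime (episodeIdx: int) (sceneIdx: int) (totalScenes: int) =
    float (episodeIdx - 1) + float sceneIdx / float totalScenes

// Usage: the first scene of Episode I and the middle scene of Episode II.
let firstSceneOfEpisode1 = sceneTime 1 0 100
let middleOfEpisode2 = sceneTime 2 50 100
```

Here the first scene of Episode I lands at 0.0 and the middle scene of a 100-scene Episode II lands at 1.5.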
I saved the result as a CSV file, in which each line contains the name of a character followed by the comma-separated times at which the character appears in the films. The full code is available in the getMentions.fsx file.
Adding characters without lines
Looking through the statistics of character conversations, I saw that R2-D2 and Chewbacca were missing. The Wookiee not only got no medal, he also vanished from all the dialogue. Both are of course mentioned in the scripts, but only as characters without lines of their own.
Ignoring them was out of the question, so I decided to add them to the dialogue-based social network.
From the network built on mentions, I extracted the node sizes and the links for the two missing characters. To turn these into links in the dialogue-based network, I decided to scale the values in proportion to those of similar characters who do speak in the script. I chose C-3PO as the counterpart of R2-D2 and Han as the counterpart of Chewie, assuming that their interactions would be similar. I applied the following formula to calculate the strength of the links in the interaction network:

Visualization
After manually restoring Chewbacca and R2-D2, I had a complete set of social networks, both for the individual films and for the entire franchise. To visualize the networks I used the Force... well, actually, the force-directed network layout from the D3.js library. This method uses a physical simulation of charged particles. The most important part of the code is the following:
d3.json("starwars-episode-1-interactions-allCharacters.json", function(error, graph) {
  // One line per link; its width grows with the link value.
  var link = svg.selectAll(".link")
      .data(graph.links)
    .enter().append("line")
      .attr("class", "link")
      .style("stroke-width", function(d) { return Math.sqrt(d.value); });

  // One circle per node; its radius grows with the node value.
  var node = svg.selectAll(".node")
      .data(graph.nodes)
    .enter().append("circle")
      .attr("class", "node")
      .attr("r", 5)
      .style("fill", function(d) { return d.colour; })
      .attr("r", function(d) { return 2 * Math.sqrt(d.value) + 2; })
      .call(force.drag);
});
In the previous steps I saved all the networks as JSON. Here I load them and define the nodes and links. Each node carries a colour and a value denoting its importance (the number of lines the character speaks). This value determines the radius r, so all nodes are scaled by importance. The same goes for the links: the thickness of each link is stored in the JSON and is shown here as the width of the line.
Centrality analysis
Finally, I analyzed the centrality of each character. For this I used RProvider together with the R igraph package, which let me analyze the networks from F#. First, I loaded the network from JSON via FSharp.Data:
open FSharp.Data
open RProvider
open RProvider.igraph

// A sample file gives the JSON type provider the shape of the data.
let [<Literal>] linkFile = __SOURCE_DIRECTORY__ + "/networks/starwars-episode-1-interactions.json"
type Network = JsonProvider<linkFile>

let file = __SOURCE_DIRECTORY__ + "/networks/starwars-full-interactions-allCharacters.json"
let nodes = Network.Load(file).Nodes |> Seq.map (fun node -> node.Name)
let links = Network.Load(file).Links

The links variable contains all the links in the network, and each link refers to its nodes by index. To simplify things, I mapped the indices to character names:

let nodeLookup = nodes |> Seq.mapi (fun i name -> i, name) |> dict

// Flat array of names: each consecutive pair describes one edge.
let edges =
    links
    |> Array.collect (fun link ->
        let n1 = nodeLookup.[link.Source]
        let n2 = nodeLookup.[link.Target]
        [| n1 ; n2 |] )

Then I created a graph object using the igraph library:
let graph = namedParams["edges", box edges; "dir", box "undirected"] |> R.graph
Computing betweenness and degree centrality:
let centrality = R.betweenness(graph)
let degreeCentrality = R.degree(graph)
The complete code can be found here.
Results
As is always the case in research, the hardest part is getting the data into a usable form. Since the Star Wars scripts all had slightly different formats, I spent most of the time working out the common properties of the documents so that a single function could process them. After that, the only remaining fiddly part was the problem of the Wookiee and the droid, who had no lines at all.
The networks in JSON format can be downloaded from GitHub.
Links
Sources: github.com/evelinag/StarWars-social-network
Social networks in JSON format: github.com/evelinag/StarWars-social-network/tree/master/networks
Scripts: www.imsdb.com