
Star Wars Universe Social Network

image

Some people are waiting for Christmas, others for the new Star Wars film, "The Force Awakens". In the meantime, I decided to look at the whole six-film cycle from a quantitative point of view and extract the social networks it contains, both from each film separately and from the whole universe together. A close look at these networks reveals interesting differences between the original trilogy and the prequels.

Below is the social network extracted from all six films combined.
image


An interactive version of the visualization lets you drag individual nodes with the mouse. Hovering over a node shows the character's name.

Nodes are characters. A line between two nodes means the characters speak in the same scene; the more often they speak together, the thicker the line. The size of each node is proportional to the number of scenes in which the character appears. I had to make a number of difficult decisions: for example, Anakin and Darth Vader are obviously the same character, but they are represented by separate nodes because the distinction matters for the plot. Conversely, I deliberately merged Palpatine with Darth Sidious, and Amidala with Padme.

The characters of the original trilogy sit mostly on the right and are almost entirely separated from the characters of the prequels, since most characters appear in only one of the trilogies. The main nodes connecting the two networks are Obi-Wan Kenobi, R2-D2 and C-3PO. The droids are clearly important to the plot, because they appear in the films more often than anyone else. The two subnetworks also differ in structure. The original trilogy has fewer important nodes (Luke, Han, Leia, Chewbacca and Darth Vader), and they are tightly interconnected. The prequels have more nodes and more links between them.

Character timelines


Since the same characters are found in different films, I created a comparative timeline, broken down into episodes.

image

This shows every reference to each character, including mentions of their names in other characters' conversations. Anakin appears alongside Darth Vader in episode 3, after which Darth Vader takes over. Anakin reappears at the end of episode 6, when Darth Vader turns away from the Dark Side.

The characters who appear throughout all the films sit at the center of the social network: Obi-Wan, C-3PO and R2-D2. Yoda and the Emperor are also in all the films, but they speak to only a small number of people.

Networks for individual episodes

Now consider the episodes separately. Notice how the number of nodes and the complexity of the networks change between the prequels and the original episodes.

image

image

image

image

image

image

Again it is clear that the prequels have more characters and more interactions between different characters. In the original films there are fewer characters, but they interact more often.

George Lucas once said:
"In fact, this is the story of the tragedy of Darth Vader; it begins when he is nine years old and ends with his death."


But is Darth Vader / Anakin really the central character? Let's apply network analysis methods to identify the central characters and their social structure. I calculated two measures of a character's importance in the network: degree centrality and betweenness centrality.

The first measure shows how many other characters a character is in contact with; the second shows how important the character is for holding the story together. Characters with high betweenness connect different parts of the social network.
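For reference, the two measures can be written down using their usual textbook definitions. The article itself does not spell them out, so the notation here is an assumption: $\deg(v)$ is the number of links touching node $v$, $\sigma_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through $v$.

```latex
C_D(v) = \deg(v),
\qquad
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```

A character who talks to many others has a high $C_D$; a character who lies on many shortest paths between otherwise distant characters has a high $C_B$.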

The larger the value, the more important the character. Below are the top five characters by each measure, for each film.

image

In the first three episodes, Anakin is the most connected character. At the same time, he hardly takes part in holding the network together: his betweenness is so low that he does not even make the top five. In other words, the other characters communicate directly rather than through him. And what does it look like for the original trilogy?

image

The centrality analysis puts numbers on the impression we get from the network visualizations. In the prequels the social structure is more complex, with more characters, and Anakin is not a central figure: some storylines develop in parallel or touch him only indirectly. The original trilogy, by contrast, looks more coherent, with fewer characters tying the story together.

Perhaps this is why the original trilogy is more popular. Its plots are more consistent and are driven by the main characters. The structure of the prequels is less centralized, with no single central character.

And what do these measures look like when applied to all the films at once? I ran two versions of the calculation: one with Anakin and Darth Vader as separate characters, and one with the two merged.

On the left, the two are separate characters; on the right, they are combined:

image

In the first case, Anakin remains the most connected character, but not the most central one. When the two are merged, he becomes the third most important character by betweenness. Either way, it turns out that the character who really ties the films together is Obi-Wan Kenobi.

image

How is it done


For the most part I used F#, combined with D3.js for the network visualizations and R for the centrality analysis. All sources are available on GitHub. Here I will go through only the most interesting parts of the code.

Parsing scripts

I downloaded all the scripts for free from The Internet Movie Script Database (IMSDb) (for example, the script of Episode IV: A New Hope). Admittedly, these are mostly drafts, which often differ from the final versions.

The first step is parsing the scripts. It turned out that the files are formatted slightly differently. They are all HTML, but the script text sits inside different tags depending on the file; <pre> is the one queried in the example below. I used the HTML parser from the F# Data library, which gives access to individual tags with queries like:

    open FSharp.Data

    let url = "http://www.imsdb.com/scripts/Star-Wars-A-New-Hope.html"
    HtmlDocument.Load(url).Descendants("pre")


The code is available in the parseScripts.fs file.

The next step is to extract the necessary information from the scripts. Usually they look like this:

INT. GANTRY - OUTSIDE CONTROL ROOM - REACTOR SHAFT

Luke moves along the railing and up to the control room.

[...]
LUKE
He told me enough! It was you
who killed him.

VADER
No. I am your father.

Shocked, Luke looks at Vader in utter disbelief.

LUKE
No. No. That's not true!
That's impossible!


Each scene begins with a scene heading marked INT. (interior) or EXT. (exterior). It may be followed by descriptive text. In the dialogue, character names are written in capital letters and in bold.

The bold INT. and EXT. markers can therefore serve as scene separators.

    // split the script by scene
    // each scene starts with either INT. or EXT.
    let rec splitByScene (script : string[]) scenes =
        let scenePattern = "<b>[ 0-9]*(INT.|EXT.)"
        let idx =
            script
            |> Seq.tryFindIndex (fun line -> Regex.Match(line, scenePattern).Success)
        match idx with
        | Some i ->
            let remainingScenes = script.[i+1 ..]
            let currentScene = script.[0..i-1]
            splitByScene remainingScenes (currentScene :: scenes)
        | None -> script :: scenes


This is a recursive function that takes the entire script and searches for the pattern: a bold INT. or EXT., optionally preceded by a scene number. It splits the lines into the current scene and the rest of the text, and then repeats the procedure recursively on the rest.
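To see the idea of splitting on bold INT. / EXT. headings in isolation, here is a minimal sketch on a tiny in-memory script. It is not the author's function: `sampleScript`, `isSceneHeading`, `step` and `splitScenes` are hypothetical names, the dots in the pattern are escaped, and unlike the original this version keeps the heading line as the first line of each scene.

```fsharp
open System.Text.RegularExpressions

// Hypothetical sample: a few script lines with bold scene headings.
let sampleScript =
    [| "<b>INT. GANTRY - REACTOR SHAFT</b>"
       "Luke moves along the railing."
       "<b>EXT. CLOUD CITY - WEATHER VANE</b>"
       "Luke hangs upside down." |]

// Escaped dots, so only a literal "INT." or "EXT." counts as a heading.
let isSceneHeading (line: string) =
    Regex.IsMatch(line, @"<b>[ 0-9]*(INT\.|EXT\.)")

// Fold step: a heading closes the scene collected so far and starts a new one.
// `current` holds the lines of the open scene in reverse order.
let step (scenes: string list list, current: string list) (line: string) =
    if isSceneHeading line then
        let closed = if List.isEmpty current then scenes else List.rev current :: scenes
        closed, [ line ]
    else
        scenes, line :: current

let splitScenes (script: string[]) : string[] list =
    let scenes, current = script |> Array.fold step ([], [])
    let all = if List.isEmpty current then scenes else List.rev current :: scenes
    all |> List.rev |> List.map Array.ofList

let scenes = splitScenes sampleScript
printfn "%d scenes" scenes.Length
```

A fold avoids the explicit recursion and returns the scenes in script order, so no `List.rev` is needed at the call site.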

Getting a list of characters

In some scripts, character names follow the format I described earlier. Others use only a name followed by a colon. Both formats can even appear on a single line. The only common feature is that the names are written in capital letters.

I had to resort to regular expressions.

    // Extract names of characters that speak in scenes.
    // A) Extract names of characters in the format "[name]:"
    let getFormat1Names text =
        let matches = Regex.Matches(text, "[/A-Z0-9 -]+ *:")
        let names =
            seq { for m in matches -> m.Value }
            |> Seq.map (fun name -> name.Trim([|' '; ':'|]))
            |> Array.ofSeq
        names

    // B) Extract names of characters in the format "<b> [name] </b>"
    let getFormat2Names text =
        let m = Regex.Match(text, "<b>[ ]*[/A-Z0-9 -]+[ ]*</b>")
        if m.Success then
            let name = m.Value.Replace("<b>","").Replace("</b>","").Trim()
            [| name |]
        else [||]


Each regular expression matches not only capital letters but also digits, dashes, spaces and slashes, because character names vary: "R2-D2" or even "FODE / BEED".
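A quick way to check that the character class really covers such names is to run the bold-name pattern on a few lines. This is only a sketch: `tryBoldName` and the `samples` list are hypothetical, while the pattern is the one from the second extraction format above.

```fsharp
open System.Text.RegularExpressions

// Same character class as in the article: capitals, digits, spaces, dashes, slashes.
let namePattern = "<b>[ ]*[/A-Z0-9 -]+[ ]*</b>"

// Return the trimmed name if the line contains a bold, upper-case name.
let tryBoldName (line: string) =
    let m = Regex.Match(line, namePattern)
    if m.Success then
        Some (m.Value.Replace("<b>", "").Replace("</b>", "").Trim())
    else
        None

// Hypothetical sample lines.
let samples =
    [ "<b>    R2-D2 </b>"
      "<b>  FODE / BEED</b>"
      "Shocked, Luke looks at Vader." ]

let extracted = samples |> List.map tryBoldName
// extracted = [Some "R2-D2"; Some "FODE / BEED"; None]
```

Returning an option instead of an empty array makes the "no name on this line" case explicit; the article's version returns an array so that it can be used with `Seq.collect`.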

I also had to take into account that some characters go by several names: Palpatine, Darth Sidious and the Emperor; Amidala and Padme. I created an alias file, aliases.csv, in which I listed the names to be merged.

    // The type provider needs a compile-time constant, hence [<Literal>]
    let [<Literal>] aliasFile = __SOURCE_DIRECTORY__ + "/data/aliases.csv"
    // Use csv type provider to access the csv file with aliases
    type Aliases = CsvProvider<aliasFile>

    /// Dictionary for translating character names between aliases
    let aliasDict =
        Aliases.Load(aliasFile).Rows
        |> Seq.map (fun row -> row.Alias, row.Name)
        |> dict

    /// Map character names onto unique set of names
    let mapName name =
        if aliasDict.ContainsKey(name) then aliasDict.[name] else name

    /// Extract character names from the given scene
    let getCharacterNames (scene: string []) =
        let names1 = scene |> Seq.collect getFormat1Names
        let names2 = scene |> Seq.collect getFormat2Names
        Seq.append names1 names2
        |> Seq.map mapName
        |> Seq.distinct
        |> Array.ofSeq // the code below treats the result as an array
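The alias lookup can be tried out without the CSV file. The sketch below uses a hard-coded dictionary; which of the names is the canonical one in aliases.csv is not stated in the article, so the entries here are an assumption made for illustration.

```fsharp
// Hypothetical alias table (alias -> canonical name); the real one lives in aliases.csv.
let aliasDict =
    dict [ "DARTH SIDIOUS", "PALPATINE"
           "EMPEROR", "PALPATINE"
           "PADME", "AMIDALA" ]

// Map an alias to its canonical name; unknown names pass through unchanged.
let mapName (name: string) =
    match aliasDict.TryGetValue name with
    | true, canonical -> canonical
    | _ -> name

let merged =
    [ "EMPEROR"; "LUKE"; "PADME"; "PALPATINE" ]
    |> List.map mapName
    |> List.distinct
// merged = ["PALPATINE"; "LUKE"; "AMIDALA"]
```

`TryGetValue` does the lookup once, where `ContainsKey` followed by indexing does it twice; for a table this small the difference is purely stylistic.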


With that in place, the character names can finally be extracted from the scenes. The following function retrieves all character names from all the scripts whose URLs are specified.

    let allNames =
        scriptUrls
        |> List.map (fun (episode, url) ->
            let script = getScript url
            let scriptParts = script.Elements()
            let mainScript =
                scriptParts
                |> Seq.map (fun element -> element.ToString())
                |> Seq.toArray

            // Now every element of the list is a single scene
            let scenes = splitByScene mainScript []

            // Extract names appearing in each scene
            scenes |> List.map getCharacterNames |> Array.concat )
        |> Array.concat
        |> Seq.countBy id
        |> Seq.filter (snd >> (<) 1) // filter out characters that speak in only one scene


One problem remained: some of the extracted names were not really names, but labels like "Pilot", "Officer" or "Captain". I had to filter the real names by hand. This is how the characters.csv list came about.

Character interaction

To build the networks, I needed to identify all the cases where characters talk to each other. I treated characters as talking to each other when they speak in the same scene (I dropped the cases where people talk over an intercom or comlink and are therefore in different scenes).

    let characters =
        File.ReadAllLines(__SOURCE_DIRECTORY__ + "/data/characters.csv")
        |> Array.append (Seq.append aliasDict.Keys aliasDict.Values |> Array.ofSeq)
        |> set


Here I created a set of all character names and their aliases for searching and filtering. Then I used it to search for characters in each of the scenes.

    let scenes = splitByScene mainScript [] |> List.rev

    let namesInScenes =
        scenes
        |> List.map getCharacterNames
        |> List.map (fun names -> names |> Array.filter (fun n -> characters.Contains n))


Then I used a filtered list of characters to determine the social network.

    // Create weighted network
    let nodes =
        namesInScenes
        |> Seq.collect id
        |> Seq.countBy id
        // optional threshold on minimum number of mentions
        |> Seq.filter (fun (name, count) -> count >= countThreshold)

    let nodeLookup = nodes |> Seq.map fst |> set

    let links =
        namesInScenes
        |> List.collect (fun names ->
            [ for i in 0..names.Length - 1 do
                for j in i+1..names.Length - 1 do
                    let n1 = names.[i]
                    let n2 = names.[j]
                    if nodeLookup.Contains(n1) && nodeLookup.Contains(n2) then
                        // order nodes alphabetically
                        yield min n1 n2, max n1 n2 ])
        |> Seq.countBy id


The result is a list of nodes, each with the number of scenes in which the character speaks across the script; this count determines the size of the node. I then created a link between every two characters who speak in the same scene and counted how many times each pair occurs. Together, the nodes and links define the entire social network.

Finally, I output this data in JSON format. All the social networks, global and per episode, can be found on my GitHub. The complete code for this step is in the getInteractions.fsx file.
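The counting of nodes and links is easiest to follow on a toy input. The sketch below uses three made-up scenes and leaves out the `countThreshold` filter; the scene contents are hypothetical, the counting logic follows the code above.

```fsharp
// Hypothetical toy input: the speaking characters of three scenes.
let namesInScenes =
    [ [| "LUKE"; "VADER" |]
      [| "LUKE"; "HAN"; "LEIA" |]
      [| "VADER"; "LUKE" |] ]

// Node size: number of scenes in which each character speaks.
let nodes =
    namesInScenes
    |> Seq.collect id
    |> Seq.countBy id
    |> Map.ofSeq

// Link weight: number of scenes shared by each unordered pair of characters.
let links =
    namesInScenes
    |> List.collect (fun names ->
        [ for i in 0 .. names.Length - 1 do
            for j in i + 1 .. names.Length - 1 do
                // order each pair alphabetically so (A, B) and (B, A) are counted together
                yield min names.[i] names.[j], max names.[i] names.[j] ])
    |> Seq.countBy id
    |> Map.ofSeq

printfn "LUKE speaks in %d scenes" nodes.["LUKE"]
printfn "LUKE-VADER link weight: %d" links.[("LUKE", "VADER")]
```

Here Luke gets a node value of 3, and the Luke-Vader link gets weight 2 because the pair shares the first and the third scene, regardless of the order in which the names were listed.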

Character mentions

I also decided to find every reference to each character in order to build a timeline. The code turned out to be similar to the code that extracts the speaking characters, except that here I looked for all mentions, not only those in the dialogue. I also kept track of the scene numbers. The following code returns a list of scene numbers together with the characters mentioned in them.

    let scenes = splitByScene mainScript [] |> List.rev
    let totalScenes = scenes.Length

    scenes
    |> List.mapi (fun sceneIdx scene ->
        // some names contain typos with lower-case characters
        let lscene = scene |> Array.map (fun s -> s.ToLower())

        characters
        |> Array.map (fun name ->
            lscene
            |> Array.map (fun contents ->
                if containsName contents name then Some name else None )
            |> Array.choose id)
        |> Array.concat
        |> Array.map (fun name -> mapName (name.ToUpper()))
        |> Seq.distinct
        |> Seq.map (fun name -> sceneIdx, name)
        |> List.ofSeq)
    |> List.collect id,
    totalScenes


To build the timelines, I used the scene numbering to assign each episode an interval of the form [episode index - 1, episode index]. This gave me a relative scale for character appearances within the episodes. Times in the interval [0,1] belong to Episode I, times in [1,2] to Episode II, and so on.

    // extract timelines
    [0 .. 5]
    |> List.map (fun episodeIdx -> getSceneAppearances episodeIdx)
    |> List.mapi (fun episodeIdx (sceneAppearances, total) ->
        sceneAppearances
        |> List.map (fun (scene, name) ->
            float episodeIdx + float scene / float total, name))


I saved the result as a CSV file in which each line contains a character's name followed by the exact times at which the character appears in the films, separated by commas. The full code is available in the getMentions.fsx file.
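The mapping from scene numbers to positions on the timeline is a single line of arithmetic. A small sketch with made-up scene counts shows where a given scene lands; `timelinePosition` is a hypothetical helper, and the episode index is zero-based as in the timeline code above.

```fsharp
// Position on the timeline: the integer part is the (zero-based) episode,
// the fractional part is how far into that episode the scene occurs.
let timelinePosition (episodeIdx: int) (sceneIdx: int) (totalScenes: int) =
    float episodeIdx + float sceneIdx / float totalScenes

// Hypothetical numbers: scene 50 of 200 in the first film, scene 30 of 120 in the fourth.
let p1 = timelinePosition 0 50 200   // 0.25, inside [0,1]: Episode I
let p4 = timelinePosition 3 30 120   // 3.25, inside [3,4]: Episode IV
```

Dividing by the number of scenes in each film puts films of different lengths on the same scale, so each episode occupies exactly one unit of the timeline.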

Adding characters without dialogue

Looking through the dialogue statistics, I saw that R2-D2 and Chewbacca were missing. The Wookiee not only failed to get a medal, he vanished from all the dialogues as well. Both are of course mentioned in the scripts, but only as characters without dialogue.

Naturally, they could not be left out, so I decided to add them to the dialogue-based social network.

For the two missing characters, I took the node sizes and links from the network built from mentions. To turn these into links within the dialogue-based network, I scaled them in proportion to similar characters who do speak in the scripts. I chose C-3PO as a proxy for R2-D2 and Han as a proxy for Chewie, assuming that their interactions would be similar. I applied the following formula to calculate the strength of the links in the interaction network:

image

Visualization


After manually restoring Chewbacca and R2-D2, I had a complete set of social networks for the individual films and for the entire franchise. To visualize them I used the Force... well, actually the force-directed network layout from the D3.js library. This method uses a physical simulation of charged particles. The most important part of the code is the following:

    d3.json("starwars-episode-1-interactions-allCharacters.json", function(error, graph) {
      /* More code here */
      var link = svg.selectAll(".link")
          .data(graph.links)
        .enter().append("line")
          .attr("class", "link")
          .style("stroke-width", function(d) { return Math.sqrt(d.value); });

      var node = svg.selectAll(".node")
          .data(graph.nodes)
        .enter().append("circle")
          .attr("class", "node")
          .attr("r", 5)
          .style("fill", function (d) { return d.colour; })
          .attr("r", function (d) { return 2*Math.sqrt(d.value) + 2; })
          .call(force.drag);
      /* More code here */
    });


In the previous steps I saved all the networks as JSON. Here I load them and define the nodes and links. Each node gets a color and a value denoting its importance (the number of times the character speaks). This value determines the radius r, so all nodes are scaled by importance. The same goes for links: the weight of each link is stored in the JSON and is shown here as the width of the line.

Centrality analysis


Finally, I analyzed the centrality of each character. For this I used RProvider together with the R package igraph to analyze the networks from F#. First, I loaded the network from JSON using FSharp.Data:

    open RProvider.igraph

    let [<Literal>] linkFile = __SOURCE_DIRECTORY__ + "/networks/starwars-episode-1-interactions.json"
    type Network = JsonProvider<linkFile>

    let file = __SOURCE_DIRECTORY__ + "/networks/starwars-full-interactions-allCharacters.json"
    let nodes = Network.Load(file).Nodes |> Seq.map (fun node -> node.Name)
    let links = Network.Load(file).Links


The links variable contains all the links in the network, with the nodes identified by their indices. To make things easier, I mapped the indices to character names:

    let nodeLookup = nodes |> Seq.mapi (fun i name -> i, name) |> dict

    let edges =
        links
        |> Array.collect (fun link ->
            let n1 = nodeLookup.[link.Source]
            let n2 = nodeLookup.[link.Target]
            [| n1 ; n2 |] )


Then I created a graph object using the igraph library:

    let graph =
        namedParams["edges", box edges; "dir", box "undirected"]
        |> R.graph


Computing betweenness and degree centrality:

    let centrality = R.betweenness(graph)
    let degreeCentrality = R.degree(graph)
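Degree is simple enough to cross-check without R. The sketch below counts it directly from a flat edge array laid out like `edges` above, where each consecutive pair of names is one link. The sample data is made up, and this is a plain F# illustration of what degree means, not a replacement for the igraph call.

```fsharp
// Hypothetical flat edge list: consecutive pairs of names are links.
let edges = [| "LUKE"; "VADER"; "LUKE"; "HAN"; "HAN"; "LEIA" |]

// Recover the (source, target) pairs from the flat layout.
let pairs =
    edges
    |> Array.chunkBySize 2
    |> Array.map (fun p -> p.[0], p.[1])

// Degree: the number of links touching each node.
let degree =
    pairs
    |> Array.collect (fun (a, b) -> [| a; b |])
    |> Array.countBy id
    |> Map.ofArray

printfn "LUKE: %d, LEIA: %d" degree.["LUKE"] degree.["LEIA"]
```

With each pair of characters listed once, the degree equals the number of distinct characters someone talks to. Betweenness requires shortest-path counting and is better left to igraph.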


The complete code can be found in the GitHub repository.

Results


As is always the case in research, the hardest part is getting the data into a usable form. Since the Star Wars scripts each had a slightly different format, I spent most of the time working out the common properties of the documents so that a single function could process them all. After that, the only thing left to tinker with was the problem of the Wookiee and the droid, who had no lines. The networks in JSON format can be downloaded from GitHub.

Links


Sources: github.com/evelinag/StarWars-social-network
Social networks in JSON format: github.com/evelinag/StarWars-social-network/tree/master/networks
Scripts: www.imsdb.com

Source: https://habr.com/ru/post/273319/

