Developers Club geek daily blog

2 years, 10 months ago

Some people are waiting for Christmas, others for the new Star Wars episode, "The Force Awakens". In the meantime, I decided to look at the whole six-film cycle from a quantitative point of view and extract the social networks it contains, both from each film separately and from the whole Star Wars universe together. A close look at the social networks reveals interesting differences between the original films and their prequels.

Below is the social network extracted from all six films combined.



You can open an interactive page with a visualization in which individual nodes can be dragged with the mouse. Hovering over a node shows the character's name.

Nodes are characters. A line between two nodes means that the characters speak in the same scene. The more they speak together, the thicker the line. The size of each node is proportional to the number of scenes in which the character appears. I had to make many difficult decisions: for example, Anakin and Darth Vader are obviously the same character, but they are shown as separate nodes in the visualization because the split matters for the plot. By contrast, I deliberately merged Palpatine with Darth Sidious, and Amidala with Padmé.

Characters from the original trilogy sit mainly on the right and are almost separated from the prequel characters, because most characters appear in only one of the trilogies. The main nodes connecting the two networks are Obi-Wan Kenobi, R2-D2 and C-3PO. The droids are evidently especially important to the plot, since they appear in the films most often. The two subnetworks have different structures. The original trilogy has fewer important nodes (Luke, Han, Leia, Chewbacca and Darth Vader), and they are densely connected to one another. The prequels show more nodes and more connecting lines.

Character timelines

Since the same characters appear in different films, I created a comparative timeline broken down by episode.


All mentions of the characters are collected here, including places where their names are merely mentioned. Anakin appears together with Darth Vader in Episode III, and then Darth Vader takes over. Anakin appears again at the end of Episode VI, when Darth Vader turns away from the Dark Side.

The characters who are consistently present in all the films also sit at the center of the social network: Obi-Wan, C-3PO and R2-D2. Yoda and the Emperor are also in all the films, but they talk to only a few people.

Networks for individual episodes

Now let's look at the episodes separately. Notice how the number of nodes and the complexity of the networks change from the prequels to the original episodes.







Again, the prequels clearly have more characters and more interactions between different characters. The original films have fewer characters, but they interact more often.

George Lucas once said:
Actually, it is the story of the tragedy of Darth Vader: it begins when he is nine years old and ends with his death.

But did Darth Vader / Anakin really turn out to be the central character? Let's apply network analysis methods to identify the central characters and their social structure. I computed two measures of a character's importance in the network.

  • degree centrality: the number of links a node has in the network, that is, how many other characters the character talks to in shared scenes.
  • betweenness: the number of shortest paths that pass through a node. For example, if you are Leia and want to send a message to Greedo, the shortest route goes through Han Solo. To send a message to Luke, there is no need to go through Han, because Leia knows Luke personally. Han's betweenness is computed by counting the shortest paths between all other characters that pass through him.
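
As a rough illustration of the two measures, here is a minimal sketch that computes them by hand on a toy graph. The edge list is hypothetical (it only mirrors the Leia / Han / Greedo example above), and the helper names neighbours, degree, bfs and betweenness are my own; the analysis in this article relies on the igraph package instead.

```fsharp
open System.Collections.Generic

// Hypothetical toy network, loosely following the Leia / Han / Greedo example.
let edges = [ "LEIA", "HAN"; "LEIA", "LUKE"; "HAN", "LUKE"; "HAN", "GREEDO" ]

let nodes = edges |> List.collect (fun (a, b) -> [ a; b ]) |> List.distinct

// Undirected adjacency list: every edge is visible from both of its ends.
let neighbours =
    nodes
    |> List.map (fun n ->
        n, edges |> List.collect (fun (a, b) -> if a = n then [ b ] elif b = n then [ a ] else []))
    |> Map.ofList

/// Degree: the number of links attached to a node.
let degree (n: string) = neighbours.[n].Length

/// Breadth-first search from `source`: distance to every reachable node
/// and the number of distinct shortest paths leading to it.
let bfs (source: string) =
    let dist = Dictionary<string, int>()
    let sigma = Dictionary<string, float>()
    let queue = Queue<string>()
    dist.[source] <- 0
    sigma.[source] <- 1.0
    queue.Enqueue source
    while queue.Count > 0 do
        let v = queue.Dequeue()
        for w in neighbours.[v] do
            if not (dist.ContainsKey w) then
                dist.[w] <- dist.[v] + 1
                sigma.[w] <- 0.0
                queue.Enqueue w
            if dist.[w] = dist.[v] + 1 then
                sigma.[w] <- sigma.[w] + sigma.[v]
    dist, sigma

/// Betweenness of `v`: for every pair of other nodes, the fraction of
/// their shortest paths that pass through `v`, summed over all pairs.
let betweenness (v: string) =
    let info = nodes |> List.map (fun n -> n, bfs n) |> dict
    let others = nodes |> List.filter (fun n -> n <> v)
    [ for s in others do
        for t in others do
            if s < t then
                let distS, sigmaS = info.[s]
                let distT, sigmaT = info.[t]
                let onShortestPath =
                    distS.ContainsKey t && distS.ContainsKey v
                    && distS.[v] + distT.[v] = distS.[t]
                if onShortestPath then
                    yield sigmaS.[v] * sigmaT.[v] / sigmaS.[t] ]
    |> List.sum

for n in nodes do
    printfn "%-7s degree=%d betweenness=%.1f" n (degree n) (betweenness n)
```

In this toy graph Han has three links, and both the Greedo-Leia and the Greedo-Luke shortest paths run through him, so he is the only node with non-zero betweenness.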

The first measure shows how many characters a character is connected to, and the second shows how important the character is for holding the story together. Characters with high betweenness connect different parts of the social network.

The larger the value, the more important the character. Below are the top five characters ranked by each measure for every film.


In the first three episodes Anakin is the most connected character. At the same time he plays almost no part in holding the network together: his betweenness is so low that he does not even make the top five. It turns out that the other characters communicate directly rather than through him. And what does this look like for the original trilogy?


The centrality analysis puts numbers on the impression we got from visualizing the social networks. In the prequels the social structure is more complex and there are more characters. Anakin is not the central figure: several storylines develop in parallel or concern him only indirectly. The original trilogy, on the other hand, looks more coherent. Fewer characters hold the story together.

Perhaps this is why the original trilogy is more popular. The plots are more consistent and are driven by the main characters. The prequels, with their less centralized structure, have no central character.

And what do these measures look like when applied to all the films at once? I ran two versions of the calculation: one with Anakin and Darth Vader as separate characters, and one with them merged.

On the left, two separate characters; on the right, the characters are merged:


In the first case Anakin remains the most connected character, but not the most central one. When the two are merged, he becomes the third most important character in the betweenness ranking. Either way, it turns out that the films are really tied together by the character of Obi-Wan Kenobi.


How it was done

I mostly used F#, combined with D3.js for visualizing the social network and with R for the network centrality analysis. All the source code is available on GitHub. Here I will go through only some of the most interesting parts of the code.

Parsing the scripts

I downloaded all the scripts for free from The Internet Movie Script Database (IMSDb) (for example, the script for Episode IV: A New Hope). However, what is available there is mostly drafts, which often differ from the final versions.

The first step is parsing the scripts. It turned out that the files have slightly different formats. All of them are in HTML, with the script either between <td class="srctext"></td> tags or between <pre></pre> tags. I used the HTML parser from the F# Data library, which lets you access individual tags with queries like:

open FSharp.Data
let url = ""

The code is available in the parseScripts.fs file
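
Since the query itself did not survive in the snippet above, here is a minimal sketch of how the two page layouts could be handled with FSharp.Data. It parses an inline, made-up HTML string rather than a real IMSDb page, and it is only an assumption about the approach, not the contents of parseScripts.fs.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Made-up stand-in for a downloaded page; the real scripts come from IMSDb.
let html = """<html><body><table><tr><td class="srctext"><pre><b>INT. GANTRY</b>
Luke moves along the railing.</pre></td></tr></table></body></html>"""

let doc = HtmlDocument.Parse html

// Layout 1: the script text sits inside <pre> tags
let preBlocks = doc.Descendants("pre") |> Seq.toList

// Layout 2: the script text sits inside <td class="srctext">
let srcCells =
    doc.Descendants("td")
    |> Seq.filter (fun td -> td.HasClass("srctext"))
    |> Seq.toList

printfn "%d <pre> block(s), %d srctext cell(s)" preBlocks.Length srcCells.Length
```

For a real page, HtmlDocument.Load with the script URL would take the place of HtmlDocument.Parse.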

The next step is extracting the necessary information from the scripts. They usually look like this:

INT. GANTRY - OUTSIDE CONTROL ROOM - REACTOR SHAFT

Luke moves along the railing and up to the control room.

[...]

LUKE
He told me enough! It was you who killed him.

VADER
No. I am your father.

Shocked, Luke looks at Vader in utter disbelief.

LUKE
No. No. That's not true! That's impossible!

Each scene begins with a heading that names the location and is marked INT. (interior) or EXT. (exterior). There may also be explanatory text. In dialogue, character names are written in capital letters and in bold.

The INT. and EXT. markers, written in bold, can therefore serve as scene separators.

open System.Text.RegularExpressions

// split the script by scene
// each scene starts with either INT. or EXT. 
let rec splitByScene (script : string[]) scenes =
    let scenePattern = "<b>[ 0-9]*(INT.|EXT.)"
    let idx = 
        script
        |> Seq.tryFindIndex (fun line -> Regex.Match(line, scenePattern).Success)
    match idx with
    | Some i ->
        let remainingScenes = script.[i+1 ..]
        let currentScene = script.[0..i-1]
        splitByScene remainingScenes (currentScene :: scenes)
    | None -> script :: scenes 

This recursive function takes the whole script and looks for the pattern: EXT. or INT. in bold, optionally preceded by a scene number. It splits the lines into the current scene and the remaining text, and then repeats the procedure recursively.
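
To see what the scene pattern actually accepts, here is a small sketch that runs it against a few made-up lines. One detail worth noticing: the dots in INT. and EXT. are unescaped, so they match any character. The escaped variant below is my own suggestion for comparison, not part of the original code, and the sample lines are hypothetical.

```fsharp
open System.Text.RegularExpressions

let original = "<b>[ 0-9]*(INT.|EXT.)"      // pattern used in splitByScene
let escaped = @"<b>[ 0-9]*(INT\.|EXT\.)"    // assumed stricter variant

// Made-up sample lines in the style of the HTML scripts
let samples =
    [ "<b>INT. GANTRY - OUTSIDE CONTROL ROOM"
      "<b> 12 EXT. TATOOINE - DESERT"
      "<b>        LUKE"
      "<b>INTO THE TRENCH" ]

// For each line: (matched by original pattern, matched by escaped pattern)
let results =
    samples
    |> List.map (fun line -> Regex.IsMatch(line, original), Regex.IsMatch(line, escaped))

for line, (o, e) in List.zip samples results do
    printfn "%-40s original=%b escaped=%b" line o e
```

Both patterns accept the two scene headings and reject the character name, but only the original one would also treat a bold line starting with "INTO" as a scene boundary.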

Getting the list of characters

In some scenes the character names are written in the format I described above. Others use just the names followed by a colon. And all of this can appear on a single line. The only common feature was that the names are written in capital letters.

I had to resort to regular expressions.

// Extract names of characters that speak in scenes. 
// A) Extract names of characters in the format "[name]:"
let getFormat1Names (text: string) =
    let matches = Regex.Matches(text, "[/A-Z0-9 -]+ *:")
    let names = 
        seq { for m in matches -> m.Value }
        |> Seq.map (fun name -> name.Trim([|' '; ':'|]))
        |> Array.ofSeq
    names

// B) Extract names of characters in the format "<b> [name] </b>"
let getFormat2Names text =
    let m = Regex.Match(text, "<b>[ ]*[/A-Z0-9 -]+[ ]*</b>")
    if m.Success then
        let name = m.Value.Replace("<b>","").Replace("</b>","").Trim()
        [| name |]
    else [||]

Each regular expression matches not only capital letters but also digits, hyphens, spaces and slashes, because character names vary: "R2-D2" or even "FODE/BEED".

I also had to take into account that some characters go by several names: Palpatine, Darth Sidious and the Emperor; Amidala and Padmé. I created an alias file, aliases.csv, listing the names that should be merged.
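
A quick way to check these patterns is to run them on a few sample strings. The sketch below reuses the two regular expressions from the code above; the helper names colonNames and boldName and the sample lines are made up for illustration.

```fsharp
open System.Text.RegularExpressions

// Format A: "[name]:" - may occur several times in one line
let colonNames (text: string) =
    [ for m in Regex.Matches(text, "[/A-Z0-9 -]+ *:") -> m.Value.Trim([| ' '; ':' |]) ]

// Format B: "<b> [name] </b>" - at most one name per line
let boldName (text: string) =
    let m = Regex.Match(text, "<b>[ ]*[/A-Z0-9 -]+[ ]*</b>")
    if m.Success then Some (m.Value.Replace("<b>", "").Replace("</b>", "").Trim()) else None

printfn "%A" (colonNames "FODE/BEED: Welcome! R2-D2: (beeps)")
printfn "%A" (boldName "<b>        C-3PO </b>")
printfn "%A" (boldName "<b>Luke</b>")
```

The first call picks up both FODE/BEED and R2-D2, the second returns C-3PO, and the third returns nothing because the name is not fully capitalized.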

let [<Literal>] aliasFile = __SOURCE_DIRECTORY__ + "/data/aliases.csv"
// Use csv type provider to access the csv file with aliases
type Aliases = CsvProvider<aliasFile>

/// Dictionary for translating character names between aliases
let aliasDict = 
    Aliases.GetSample().Rows
    |> Seq.map (fun row -> row.Alias, row.Name)
    |> dict

/// Map character names onto unique set of names
let mapName name = if aliasDict.ContainsKey(name) then aliasDict.[name] else name

/// Extract character names from the given scene
let getCharacterNames (scene: string []) =
    let names1 = scene |> Seq.collect getFormat1Names 
    let names2 = scene |> Seq.collect getFormat2Names 
    Seq.append names1 names2
    |> Seq.map mapName
    |> Seq.distinct
    |> Array.ofSeq  // later steps filter and concatenate the result as arrays
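
The alias step can be tried out without the CSV file. This is a minimal sketch with an in-memory dictionary; the alias pairs and the choice of canonical names are hypothetical and do not reproduce the actual contents of aliases.csv.

```fsharp
// Hypothetical alias table: alias -> canonical name
let aliasDict =
    dict [ "DARTH SIDIOUS", "EMPEROR"
           "PALPATINE", "EMPEROR"
           "PADME", "AMIDALA" ]

/// Replace an alias by its canonical name; leave other names untouched
let mapName (name: string) =
    match aliasDict.TryGetValue name with
    | true, canonical -> canonical
    | _ -> name

// Names as they might come out of one scene
let merged =
    [ "PALPATINE"; "ANAKIN"; "DARTH SIDIOUS"; "PADME"; "AMIDALA" ]
    |> List.map mapName
    |> List.distinct

printfn "%A" merged
```

After mapping, the five raw names collapse to three distinct characters, which is what keeps one person from showing up as several nodes.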

Now, at last, we can extract the character names from the scenes. The following function extracts all character names from all the scripts whose URLs are given.

let allNames =
  // scriptUrls: assumed name for the list of (episode, url) pairs;
  // the original identifier is missing from this copy
  scriptUrls
  |> List.map (fun (episode, url) ->
    let script = getScript url
    let scriptParts = script.Elements()

    let mainScript = 
        scriptParts
        |> Seq.map (fun element -> element.ToString())
        |> Seq.toArray

    // Now every element of the list is a single scene
    let scenes = splitByScene mainScript [] 

    // Extract names appearing in each scene
    scenes |> List.map getCharacterNames |> Array.concat )
  |> Array.concat
  |> Seq.countBy id
  |> Seq.filter (snd >> (<) 1)  // filter out characters that speak in only one scene
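
The last two steps of this pipeline are compact, so here is a small sketch of what they do on a made-up list of speakers. The point-free filter snd >> (<) 1 reads as "take the count and test 1 < count"; the lambda below is an equivalent, more explicit spelling.

```fsharp
// Made-up list: one entry per scene in which a character speaks
let speakers = [ "LUKE"; "VADER"; "LUKE"; "PILOT"; "VADER"; "LUKE" ]

// Count in how many scenes each name occurs
let counts = speakers |> Seq.countBy id |> Seq.toList

// Point-free form used in the article: (<) 1 count means 1 < count
let frequent = counts |> List.filter (snd >> (<) 1)

// Equivalent explicit form
let frequentExplicit = counts |> List.filter (fun (_, count) -> count > 1)

printfn "%A" frequent
```

PILOT occurs only once and is dropped, while LUKE and VADER survive with their counts.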

There was one more problem: some of the character names were not really names. They were labels such as "Pilot", "Officer" or "Captain". I had to pick out the real names by hand. This is how the characters.csv list came about.

Character interactions

To build the networks I needed to find all the cases where characters talk to each other. I treat characters as talking to each other when they speak in the same scene (I left out the cases where people talk over an intercom or a handheld transceiver and are therefore in different scenes).

let characters = 
    File.ReadAllLines(__SOURCE_DIRECTORY__ + "/data/characters.csv") 
    |> Array.append (Seq.append aliasDict.Keys aliasDict.Values |> Array.ofSeq)
    |> set

Here I created a set of all character names and their aliases for lookup and filtering. Then I used it to find the characters in each scene.

let scenes = splitByScene mainScript [] |> List.rev

let namesInScenes = 
    scenes
    |> List.map getCharacterNames
    |> List.map (fun names -> names |> Array.filter (fun n -> characters.Contains n)) 

Then I used the filtered list of characters to build the social network.

// Create weighted network
let nodes = 
    namesInScenes
    |> Seq.collect id
    |> Seq.countBy id        
    // optional threshold on minimum number of mentions
    |> Seq.filter (fun (name, count) -> count >= countThreshold)

let nodeLookup = nodes |> Seq.map fst |> set

let links = 
    namesInScenes
    |> List.collect (fun names -> 
        [ for i in 0..names.Length - 1 do 
            for j in i+1..names.Length - 1 do
                let n1 = names.[i]
                let n2 = names.[j]
                if nodeLookup.Contains(n1) && nodeLookup.Contains(n2) then
                    // order nodes alphabetically
                    yield min n1 n2, max n1 n2 ])
    |> Seq.countBy id

This gives the list of nodes together with the number of times each character talks throughout the script; this count determines the size of the node. Then I created a link between every two characters who speak in the same scene and counted how many times this happens. Together, the nodes and links define the whole social network.

Finally, I output the data in JSON format. All the social networks, both the global one and those for individual episodes, can be found on my GitHub. The complete code for this step is in the getInteractions.fsx file.

Character mentions

I also decided to find all mentions of the characters in order to build the timeline. The code is similar to the code that extracts the characters' dialogue, except that here I looked for all mentions, not only those in dialogue. I also kept track of the scene numbers. The following code returns a list of scene numbers and the characters mentioned in them.
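
To make the counting concrete, here is a minimal worked example on three made-up scenes. It follows the same idea as the code above (every pair of speakers in a scene gives one link, with the pair ordered alphabetically so that A-B and B-A are counted together), but leaves out the threshold and lookup steps.

```fsharp
// Made-up input: the speakers found in each of three scenes
let namesInScenes =
    [ [| "LUKE"; "VADER" |]
      [| "LEIA"; "HAN"; "LUKE" |]
      [| "VADER"; "LUKE" |] ]

// Node size: number of scenes in which each character speaks
let nodes = namesInScenes |> Seq.collect id |> Seq.countBy id |> Seq.toList

// Link weight: number of scenes shared by each pair of characters
let links =
    namesInScenes
    |> List.collect (fun names ->
        [ for i in 0 .. names.Length - 1 do
            for j in i + 1 .. names.Length - 1 do
                // order the pair alphabetically so both directions coincide
                yield min names.[i] names.[j], max names.[i] names.[j] ])
    |> List.countBy id

printfn "nodes: %A" nodes
printfn "links: %A" links
```

Luke and Vader share two scenes, so their link has weight 2; every other pair shares one scene.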

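let scenes = 
    splitByScene mainScript [] |> List.rev
let totalScenes = scenes.Length

// NOTE: the pipeline inputs (scenes, characters, lscene) and the returned
// totalScenes are missing from this copy and are restored from context
scenes
|> List.mapi (fun sceneIdx scene -> 
    // some names contain typos with lower-case characters
    let lscene = scene |> Array.map (fun s -> s.ToLower()) 

    characters
    |> Array.ofSeq
    |> Array.map (fun name -> 
        lscene
        |> Array.map (fun contents -> if containsName contents name then Some name else None )
        |> Array.choose id)
    |> Array.concat
    |> Array.map (fun name -> mapName (name.ToUpper()))
    |> Seq.distinct 
    |> Seq.map (fun name -> sceneIdx, name)
    |> List.ofSeq)
|> List.collect id,
totalScenes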

To extract the timelines, I used the scene numbering to map each episode onto an interval of the form [episode index − 1, episode index]. This gave me a relative scale for the characters' appearances within the episodes. Times in the interval [0,1] belong to Episode I, times in [1,2] to Episode II, and so on.

// extract timelines
[0 .. 5]
|> List.map (fun episodeIdx -> getSceneAppearances episodeIdx)
|> List.mapi (fun episodeIdx (sceneAppearances, total) ->
    sceneAppearances
    |> List.map (fun (scene, name) -> 
        float episodeIdx + float scene / float total, name))      
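
As a worked example of this scaling: with a zero-based episode index, a mention in scene 50 of a 200-scene script for the fifth film lands at 4 + 50/200 = 4.25. The helper name toTimeline and the numbers below are made up for illustration.

```fsharp
/// Map (scene number, name) pairs of one episode onto the global time axis.
/// episodeIdx is zero-based, so Episode I occupies the interval [0,1].
let toTimeline (episodeIdx: int) (totalScenes: int) (appearances: (int * string) list) =
    appearances
    |> List.map (fun (scene, name) ->
        float episodeIdx + float scene / float totalScenes, name)

// Hypothetical mentions in a 200-scene script of the fifth film (index 4)
let timeline = toTimeline 4 200 [ 50, "LUKE"; 150, "VADER" ]

printfn "%A" timeline
```

Dividing by the total number of scenes is what makes films of different lengths comparable on one axis.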

I saved this to a CSV file in which every line contains a character's name followed by the exact times at which the character appears in the films, separated by commas. The complete code is available in the getMentions.fsx file.

Adding characters without lines

Looking through the dialogue statistics per character, I noticed that R2-D2 and Chewbacca were missing. Not only did the Wookiee not get a medal, he also vanished from all the dialogue. Of course, they are mentioned in the script, but only as characters without dialogue.

I certainly could not just ignore them, so I decided to insert them into the dialogue-based social network.

I took the node sizes and the links of the two missing characters from the network defined by their mentions. To turn these into links in the social network, I decided to scale all the extracted data in proportion to other, similar characters who do speak in the script. I chose C-3PO as the proxy for R2-D2 and Han as the proxy for Chewie, assuming that their interactions would be similar. I applied the following formula to calculate the strength of the links in the dialogue-based social network:



After adding Chewbacca and R2-D2 back by hand, I had the complete set of social networks, both for the individual films and for the whole franchise. To visualize the social networks I used the Force... well, actually, the force-directed network layout from the D3.js library. This method uses a physical simulation of charged particles. The most important part of the code is the following:

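d3.json("starwars-episode-1-interactions-allCharacters.json", function(error, graph) {
  /* More code here */
  // bind the links from the JSON file to SVG lines (standard D3 data join,
  // restored here because the join lines are missing from this copy)
  var link = svg.selectAll(".link")
      .data(graph.links)
    .enter().append("line")
      .attr("class", "link")
      .style("stroke-width", function(d) { return Math.sqrt(d.value); });

  // bind the nodes to SVG circles
  var node = svg.selectAll(".node")
      .data(graph.nodes)
    .enter().append("circle")
      .attr("class", "node")
      .attr("r", 5)
      .style("fill", function (d) { return d.colour; })
      .attr("r", function (d) { return 2*Math.sqrt(d.value) + 2; });
  /* More code here */
});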

In the previous steps I saved all the networks as JSON. Here I load them and define the nodes and links. Each node has a colour and a value denoting its importance (the number of the character's lines). This value determines the radius r, so all nodes are scaled by importance. The same goes for the links: the thickness of each link is stored in the JSON and is shown here as the line width.

Centrality analysis

Finally, I ran the centrality analysis for each character. For this I used RProvider together with the R igraph package to do the network analysis from F#. First I loaded the network from JSON using FSharp.Data:

open FSharp.Data
open RProvider
open RProvider.igraph

let [<Literal>] linkFile = __SOURCE_DIRECTORY__ + "/networks/starwars-episode-1-interactions.json"
type Network = JsonProvider<linkFile>

let file = __SOURCE_DIRECTORY__ + "/networks/starwars-full-interactions-allCharacters.json"
let nodes = Network.Load(file).Nodes |> Array.map (fun node -> node.Name) 
let links = Network.Load(file).Links

The links variable contains all the links in the network, with the nodes identified by their indices. To make things easier, I mapped the indices to the character names:

let nodeLookup = nodes |> Seq.mapi (fun i name -> i, name) |> dict
let edges = 
    links
    |> Array.collect (fun link ->
        let n1 = nodeLookup.[link.Source]
        let n2 = nodeLookup.[link.Target]
        [| n1 ; n2 |] )
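
The reason each link is emitted as a two-element array is that igraph's graph constructor takes the edges as one flat vector in which every consecutive pair of entries is one edge. A minimal sketch of that flattening on made-up data (plain tuples stand in for the JSON link objects):

```fsharp
// Made-up node names and index-based links, standing in for the JSON contents
let nodes = [| "LUKE"; "VADER"; "LEIA" |]
let links = [| (0, 1); (0, 2) |]

// Flatten the links into [source1; target1; source2; target2; ...]
let edges =
    links
    |> Array.collect (fun (source, target) -> [| nodes.[source]; nodes.[target] |])

printfn "%A" edges
```

Here the two links become a four-element vector, read by igraph as the edges LUKE-VADER and LUKE-LEIA.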

Then I created a graph object using the igraph library:

let graph =
    namedParams["edges", box edges; "dir", box "undirected"]
    |> R.graph

Computing betweenness and degree centrality:

let centrality = R.betweenness(graph)
let degreeCentrality = R.degree(graph)

The complete code can be found here.


As always in scientific research, the hardest part is getting the data into a usable form. Since the Star Wars scripts had slightly different formats, I spent most of the time working out the common properties of the documents in order to write a single function to process them. After that, I only had to deal with the problem of the Wookiee and the droid who had no lines. The networks in JSON format can be downloaded from GitHub.


Source codes of
Social networks in the JSON format:

This article is a translation of the original post at
