wiki:soc/2007/UserFriendlyGraps/IMDb

Fun with the Internet Movie Database

It turns out that you can download the content of the IMDb as text files - so I did. I also wrote a parser, and turned it into a gigantic semi-social network (graph). This isn't the typical social network where people are connected to people by relation edges - there's just too much data. In this graph, actor vertices are connected to movie vertices which are connected to actor vertices - nicely bipartite, but it changes some of the measures.

Also, I'm doing a significant amount of filtering to remove straight-to-video releases, tv series and shows, video games, archive footage, voice roles, etc. It shrinks the graph considerably. Even still, if I had tried to build a typical social network, there would have been something like 100K vertices, but 20M edges. As it is, there are roughly 120K actors and 280K movies in the graph, and 2.9M roles (edges).

That said, distance measures only really work on connected graphs, so I had to run connected components and stripped the small ones out of the graph, leaving one gigantic connected graph. I actually removed 224 little components, none of which had more than 9 or 10 actors. Most were individual people that were no longer connected after the filter was applied.

First, the mean geodesic distance was reported at 5.8 - very, close to the 6-degrees that we might expect from a social network. Of course, this measure is defined over a network with only one vertex type, and doesn't take into account the bipartite structure of this graph. Oops. Consdier that every path between two actors includes at least one movie and must be a multiple of 2. This means that we can effectively cut the mean geodesic distance in half, which is 2.9. It seems small, but it's actually true that most actor-to-actor queries can be solved in 2-3 steps.

I also ran a query to find who the most central actors are. Centrality is the property of being "central" or "important" within a social network. In the case of the Six Degrees of Kevin Bacon game, central actors are important those whose names would appear the most often as you connect actors. There are a number of different ways in which centrality is defined.

Degree centrality is a measure of connectivity. Actors with high degree centrality have more connections with other actors. This is trivially defined as the degree of a vertex for typical social networks. I redefined it for this graph as the number of distinct actors that an actor has worked with. More formally, it's the number of distinct vertices reached in exactly two hops from an actor.

Closeness centrality is a measure of distance. Actors with high closeness centrality have a shorter distance to all other actors in the network. Normally, this can be trivially computed over the resulting distance matrix of an all-pairs shortest paths (e.g., floyd warshall). Then I realized that it would require allocating a matrix that was about 400K squared elements - which is about 160GB of mem. I just got a new lab computer (with 2GB RAM), but that's still a little shy of the mark. As a result, I just computed the shortest distance to every other vertex for all actors in the graph - it took 2 days.

Here are the top 20 actors in for both types of centrality (obviously, based on a 0 index):

Rank Degree Centrality Closeness Centrality
0 Depardieu, Gerard Fonda, Henry
1 Lee, Christopher (I) Lee, Christopher
2 Fonda, Henry Quinn, Anthony
3 Galabru, Michel von Sydow, Max
4 Williams, Guinn 'Big Boy' Pleasence, Donald
5 Howard, Rance Borgnine, Ernest
6 Sutherland, Donald (I) Steiger, Rod
7 Carradine, John Carradine, John
8 Zardi, Dominique Hopper, Dennis
9 Cassel, Seymour Caine, Michael (I)
10 Walsh, M. Emmet Sutherland, Donald (I)
11 Corrigan, Kevin (I) Sharif, Omar
12 Brialy, Jean-Claude Heston, Charlton
13 Hopper, Dennis Lawrence, Marc (I)
14 Riehle, Richard Mitchum, Robert
15 Piccoli, Michel Gazzara, Ben
16 Starr, Mike (I) Keitel, Harvey
17 Caine, Michael (I) Connery, Sean
18 Durning, Charles Saxon, John
19 Howard, Clint Mifune, Toshiro

I'm not sure how interesting this is, but I think it provides a good basis for reasoning about the differences in centrality measures and how they can produce such different rankings. Otherwise, it's a fun exercise.

Last modified 15 years ago Last modified on Aug 9, 2007, 4:01:16 PM
Note: See TracWiki for help on using the wiki.