Introducing the IMDb

The Internet Movie Database — for those who still have not used it.

A few years ago I acquired a copy of Halliwell’s Film & Video Guide (13th edition), but such books are obsolete the moment they are printed and I shall never need another one: of the myriad wonders of the World Wide Web, one that fascinates me is the Internet Movie Database. If you’ve never used a real database, this is a perfect example of how that term means so much more than just a heck of a lot of information (which is how the term gets abused).

In a true database, you can cross-reference anything against anything else, because you can search on each class of information. In the IMDb, when you want to know about a movie, you select “title” and type in some or all of the title and it does a search. Searching is the main task of a database system, and this one does it well. It’ll find you all the exact matches, and also (shown separately underneath) all the partial matches. It’ll do the same for (real) people by name, fictional characters by name or description, jobs in movie production, and so on.
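The exact-then-partial behaviour described above can be sketched in a few lines of Python. This is only an illustration: the function name `search_titles`, the sample list and the case-insensitive matching rule are my own assumptions, not a description of how the IMDb actually does it.

```python
def search_titles(query, titles):
    """Return (exact, partial) matches for query, ignoring case.

    Hypothetical helper: the matching rule is an assumption for illustration.
    """
    wanted = query.strip().lower()
    # Exact matches: the whole title equals the query.
    exact = [t for t in titles if t.lower() == wanted]
    # Partial matches: the query appears somewhere inside a longer title.
    partial = [t for t in titles if wanted in t.lower() and t.lower() != wanted]
    return exact, partial


# A tiny stand-in for the Titles table.
SAMPLE_TITLES = ["Die Hard", "Die Hard 2", "Casablanca", "Matilda", "A Night in Casablanca"]

exact, partial = search_titles("casablanca", SAMPLE_TITLES)
print("Exact:", exact)
print("Partial:", partial)
```

The two lists are kept apart so that they can be shown separately, exact matches first, just as the search results page does.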

To do this, all the basic information in the database is held in tables, each table containing one kind of information: all the movie titles in a huge Titles table; all the people’s names in an even larger People table; and, in a Roles table, all the different roles (and here I mean job titles, not names of characters) that people can perform on a movie project, such as writer, actor, producer, art director, cinematographer, electrician, carpenter, stunt performer, and so on.

Each table has a number for each record (each name or each title). Clearly, these run into millions. Each record is referred to by means of this number.

Now here is the important bit, which is true of all powerful databases: the facts about who did what are held separately!

This means that, if there is an entry in the People table for Humphrey Bogart, that entry holds some facts about that man personally, such as his date of birth, but it holds no information about what he did in movies! Similarly, there is an entry in the Titles table for a movie made in 1942 with the title Casablanca, but that entry holds no information about which actors appeared in it. (By the way, other movies or TV series with the title Casablanca were made in 1955, 1961, 1983, 1998, 2002 and 2003! Look them up.)

All that information is held separately. To be specific, there clearly must be a separate (absolutely vast) table in the database — perhaps called (for the sake of argument) the “Worked On” table — which contains records something like this (I guess — I am not privy to the internal design details):

Title number: the number of the record in the Titles table for a movie (or TV movie, or TV drama series, or other TV show covered by the IMDb criteria for inclusion)
Person number: the number of the record in the People table for a person who worked on the movie
Role number: the number of the record in the Roles table for the role in which the person worked on the movie
Character number: if the role was “actor”, the number of the record in the Characters table for the character’s name (such as “James Bond”) or description (such as “Chauffeur to Lord Peter Oldham” or “Mexican Street Hooker”)

For example, the fact that Bruce Willis appeared in a movie called Die Hard is a record in the Worked On table. And, as the description shows, the record for that fact doesn’t contain the name “Willis” or the word “Die” at all. Rather, it contains the record number for Mr Willis in the People table, the record number for Die Hard in the Titles table, and the record number for “Actor” in the Roles table.
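To make that concrete, here is a minimal sketch using Python’s built-in sqlite3 module. Like the Worked On table itself, every table and column name here is my guess for the sake of argument, and the record numbers are invented; the real IMDb design may be quite different.

```python
import sqlite3

# Hypothetical schema: names and record numbers are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles     (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE people     (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE roles      (id INTEGER PRIMARY KEY, role TEXT);
    CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE worked_on (
        title_id     INTEGER REFERENCES titles(id),
        person_id    INTEGER REFERENCES people(id),
        role_id      INTEGER REFERENCES roles(id),
        character_id INTEGER REFERENCES characters(id)  -- NULL unless the role is Actor
    );
    INSERT INTO titles     VALUES (1, 'Die Hard', 1988);
    INSERT INTO people     VALUES (1, 'Bruce Willis');
    INSERT INTO roles      VALUES (1, 'Actor');
    INSERT INTO characters VALUES (1, 'John McClane');
    INSERT INTO worked_on  VALUES (1, 1, 1, 1);
""")

# The stored fact is nothing but record numbers.
raw_fact = conn.execute("SELECT * FROM worked_on").fetchone()

# Names appear only when the numbers are looked up in the other tables.
readable_fact = conn.execute("""
    SELECT p.name, r.role, t.title, t.year, c.name
    FROM worked_on AS w
    JOIN titles AS t ON t.id = w.title_id
    JOIN people AS p ON p.id = w.person_id
    JOIN roles  AS r ON r.id = w.role_id
    LEFT JOIN characters AS c ON c.id = w.character_id
""").fetchone()

print(raw_fact)
print(readable_fact)
```

The LEFT JOIN on the characters table is there because, on this guess at the design, a record for a director or a carpenter would have no character number at all.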

What’s more, where someone does several jobs on one movie, they do not have a special extra-big record with several role numbers! No! They have several records for that same movie in the Worked On table, one for each job. Remember, each fact can be got independently. If, every time you wanted to find some fact, you had to search through records of different structures, the search routine would have to perform a more complicated task to cope with these different structures, and that would take longer. Therefore all big databases are designed so that the complicated work of breaking down the information into the most basic facts is done just once, when the information is added to the database (such as when work on a new movie begins and somebody adds the information about it to the IMDb).

In this way, the task that is done once is allowed to be bigger so that the tasks that are done many times, the searches by people consulting the database, are as simple as possible and can be performed as quickly as possible with the smallest amount of complicated processing.

So for his version of Roald Dahl’s children’s story Matilda (starring little Mara Wilson as Matilda Wormwood — I just love watching this movie), Danny De Vito would have three records in the Worked On table: one as Actor, one as Director, and one as Producer. 
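Those three records can be sketched as follows. This is a simplified illustration under my own assumptions: the `WorkedOn` class and its field names are hypothetical, and I have written names in place of record numbers purely to keep the example readable.

```python
from typing import NamedTuple, Optional


class WorkedOn(NamedTuple):
    """One basic fact: one person, one job, one title (field names are hypothetical)."""
    title: str
    person: str
    role: str
    character: Optional[str] = None


# One record per job, never one extra-big record with several roles in it.
WORKED_ON = [
    WorkedOn("Matilda (1996)", "Danny De Vito", "Actor"),
    WorkedOn("Matilda (1996)", "Danny De Vito", "Director"),
    WorkedOn("Matilda (1996)", "Danny De Vito", "Producer"),
    WorkedOn("Matilda (1996)", "Mara Wilson", "Actor", "Matilda Wormwood"),
]


def jobs_on(title, person, records=WORKED_ON):
    """Every record has the same shape, so the search is one uniform filter."""
    return [r.role for r in records if r.title == title and r.person == person]


print(jobs_on("Matilda (1996)", "Danny De Vito"))
```

Because every record has exactly the same structure, `jobs_on` needs no special case for people with one job or with three; the work of splitting the information up was done when the records were added.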

This structure is what makes it possible for the computer to come back in seconds, with answers, to each of tens of thousands of people asking it questions every moment, all the time.

If there is one amazing fact about the entertainment industry that emerges for me from all this, it is the amount of duplication. Because a database works on searches, it gives you all the answers, and you pick the one you want.

I referred above to the fact that there were seven different titles (movies or TV series) called Casablanca. The database has to distinguish these answers somehow, so for titles it appends the year of release; for people, it adds an arbitrary Roman numeral (starting with I for the first person it is told about), and so on. Thus the Humphrey Bogart movie is listed as Casablanca (1942). Similarly, if I search for the title Die Hard, I find that the movie I am looking for is Die Hard (1988). There was also a Die Hard (1997) made in Russia, which was just 2 minutes of animation. The sequels are listed with their subtitles, such as Die Hard: With A Vengeance (1995); aka “Die Hard 3”, it says in the list.

And in 2004 something called Die Hard 4.0 was listed as due out in 2006, but by January 2007 that had changed to the USA working title. Evidently the movie was behind its 2004 schedule, and its revised official title was Live Free or Die Hard (2007), which also indicated when it was expected to be released. Its status was listed as “filming”. When a film is new but finished, its status goes to “Completed”, but when it is history (as with Casablanca (1942)) no status is listed at all, for the obvious reason that it would be redundant. By the way, as you may know now, long after its release, the movie listed at the time as Live Free or Die Hard (2007) starred our Bruce as the same character, and was directed by Len Wiseman rather than John McTiernan (who did the original and Die Hard 3, but not Die Hard 2).
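The labelling rules for titles and people described above can be sketched like this. It is a guess at the general idea only: `title_label`, `PeopleRegistry` and the sample name “John Smith” are all invented for illustration.

```python
def to_roman(n):
    """Convert a positive integer to a Roman numeral."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def title_label(title, year):
    """Titles are told apart by appending the year of release."""
    return f"{title} ({year})"


class PeopleRegistry:
    """Hands out numerals in order of arrival (hypothetical class)."""

    def __init__(self):
        self._seen = {}  # name -> how many people with that name so far

    def add(self, name):
        self._seen[name] = self._seen.get(name, 0) + 1
        return f"{name} ({to_roman(self._seen[name])})"


registry = PeopleRegistry()
print(title_label("Casablanca", 1942))
print(registry.add("John Smith"))
print(registry.add("John Smith"))
```

The numeral says nothing about the person; it only records the order in which the database was told about people sharing that name.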

A true web-interface characteristic of IMDb, something I would expect of any web database interface, is that once you are looking at any page of information you can click on key words such as titles or people's names and instantly jump to a page of information. On the page for each Die Hard movie, you can click on the name Bruce Willis and get a list of all his other work in movies, in whatever capacity.
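Following a link in this way amounts to looking the same facts up from the other direction. Here is a small sketch of that idea, assuming a list of (title, person, role) facts standing in for the hypothetical Worked On table; the function `follow_link` and the two indexes are my own invention.

```python
from collections import defaultdict

# (title, person, role) facts; a stand-in for the hypothetical Worked On table.
CREDITS = [
    ("Die Hard (1988)", "Bruce Willis", "Actor"),
    ("Die Hard: With A Vengeance (1995)", "Bruce Willis", "Actor"),
    ("Die Hard (1988)", "John McTiernan", "Director"),
    ("Die Hard: With A Vengeance (1995)", "John McTiernan", "Director"),
]

# Index the same facts twice: once by person, once by title.
by_person = defaultdict(list)
by_title = defaultdict(list)
for title, person, role in CREDITS:
    by_person[person].append((title, role))
    by_title[title].append((person, role))


def follow_link(kind, key):
    """Simulate clicking a person's name or a title on a page."""
    index = by_person if kind == "person" else by_title
    return index.get(key, [])


print(follow_link("person", "Bruce Willis"))
print(follow_link("title", "Die Hard (1988)"))
```

Each result contains more names and titles, each of which could be clicked in turn, which is what makes browsing from page to page possible.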

That applies to every last person in every credits list, from the dozens of people in the art department, or sound, or special effects, down to the assistant electricians and drivers. You can click and see each career mapped out: a lifetime in wardrobe, or additional special effects, or sometimes an actor who worked on one movie, in a small part, and then never had another job in movies at all. Note that the IMDb credit list often includes people who went uncredited in the credits at the beginning or end of the movie itself.

For each significant movie, there are also collections of reviews, from “loved it” to “hated it”, from members of the public. You just sign on and add your two pennyworth!

On duplication, De Vito’s Matilda is Matilda (1996). The IMDb also lists Matilda (1991), an Italian production in which Matilda is played by Carla Benedetti and “is an unlucky girl whose boyfriends keep dying”. In 2004 Jonathan Ross interviewed Elliott Gould and brought up his appearance as a small-time talent agent in Matilda (1978), in which the title role is a boxing kangaroo. There it is, in IMDb. And there are also many cases of people sharing names.

The moral of all this is simple: read the package carefully when you shop for DVDs!

Revised January 2007; first published in SEMantics Issue 168, October 2004