Background

Tim is secretly a Hollywood producer on the weekends. He's holding a Hollywood mini golf tournament, and he needs to figure out who to invite. He has an initial list of guests, but he needs to make sure that they know each other well enough for the tournament to go smoothly. He wants to determine how closely the actors have worked together using the films they have played in. He's made his own database of film data to make the task easier.

mini golf course

Overview

In the movie business, it is traditional for actors to present their costars with gifts of golf balls at the premieres of the movies they appear in together. In fact, the same ball is sometimes regifted many times at successive movie premieres.

For any two actors, it is interesting to ask, “What is the most obvious way that a ball could have passed from one to the other?” Given two actors, the most obvious answer is that the actors starred in the same film, so a ball could have been given from one to the other at that movie's premiere.

Next most obvious, the first actor may have worked with an actor (we’ll call him “Kevin”) in one film, and then in a later film, Kevin worked with the second actor. It is clear that a ball is more likely to pass through films with more actors, as well. Tim loves attending movie premieres and taking actors on trips throughout time, so the order of the movie release dates is irrelevant.

There is one caveat: people in the movie business want the recipient of the ball to have some relation to them. This manifests itself as actors only passing a ball to another actor if the first character of the recipient's first name matches up with the first character of the donor's last name. If an actor has only one word in their name, just use that word as both the first and last name. If an actor has more than two words in their name, the first word is the first name and the last word is the last name. For example, “Kevin Norwood Bacon” could pass a ball to “BeyoncĂ© Knowles”, and “BeyoncĂ© Knowles” could also pass a ball to “Kevin Norwood Bacon”, but “Kevin Norwood Bacon” could not pass a ball to “Jason”.

Given that, we would like you to help us find the most obvious way that a ball might have been passed between two actors.

A note on partners: We expect the work for this assignment to be divided up fairly. To that end, you should document in your README what both partner's primary contributions were with your final submission. Your submission must have a functioning Stars REPL and GUI in addition to the specifications for tIMDB. Thus, you should first be integrating your code from Stars then developing tIMDb functionality together. A fair division and contribution of labor will be part of your grades. If you do not have a partner, please reach out to the HTAs immediately.

Setup

You should have received an invitation to join a GitHub repo with your partner by the evening of Friday February 21. If you do not see one by then (check your email!), please reach out to the HTAs.

This repo is basically empty. You should be moving over code and dependencies from your Stars repos to this new repo. We will be only be providing two SQL files and a folder with the public system test suite. These can be found at /course/cs0320/data/timdb. You should copy:

  1. smallTimdb.sqlite3 - a small SQL database with actors and movies
  2. timdb.sqlite3 - a much larger SQL database with actors and movies
  3. tests/* - a folder with the five public system tests

If you're working on a department machine, you'll need to copy the timdb.sqlite3 database to the "/ltmp" directory. This directory is local to the specific machine you're working on, so if you switch machines (including if you ssh to another machine), you'll need to copy the database again. Specifically, we'd like you to mkdir /ltmp/<your-login> and copy timdb.sqlite3 to this new, personal directory. This way, your copy of the database will have permissions that protect it from being modified by other users on the department network. From within your program's REPL, you'll be able to load the copied database from "/ltmp/<your-login>/timdb.sqlite3".

If you're working on a department machine and you do not copy the full tIMDb database to /ltmp/<your-login>/, your program's database interactions will be very slow because they'll be sent across the department network, rather than the local machine's hard drive.

Program Specification

Your REPL will need to support these two new commands (in addition to the commands from Stars):

  • mdb <sql_db> should load the database file named <sql_db>
  • connect <name1> <nameN> should describe the most obvious path of a ball between the actor named <name1> (who starts with the ball) and the actor named <nameN> (who ends with the ball)

When loading a database file, you can use the example files provided. For example: mdb data/timdb/smallTimdb.sqlite3. Unless there is an error, the output of this command should be: db set to data/timdb/smallTimdb.sqlite3. If the mdb command is run again with a different database, it should replace the original database with the new one. If you're caching any database information within your program, be sure to clear these caches when mdb is run with a new database.

As actor names generally contain spaces, you'll generally need quotes around the names.

connect “Samuel L. Jackson” “Sylvester Stallone”

The output will be one of two options, depending on whether or not it was able to find a path. In cases where a path exists, output should be formatted in the following way:

<name1> -> <name2> : <movie1>
<name2> -> <name3> : <movie2>
.
.
.
<nameN-1> -> <nameN> : <movieN-1> 

When no path is found, the output should be:

<name1> -/- <nameN> 

The names of actors and films should be unquoted when they are printed out. For example:

connect “Samuel L. Jackson” “Sylvester Stallone”
Samuel L. Jackson -> John Travolta : Pulp Fiction
John Travolta -> Tony Shalhoub : Primary Colors
Tony Shalhoub -> Sylvester Stallone : Men in Black
connect “David S. Howard” “Sigourney Weaver”
David S. Howard -/- Sigourney Weaver
connect “Abigail Mason” “Laurel Bryce”
Abigail Mason -> Martin Landau : You’ll Never Amount to Anything
Martin Landau -> Lisa Marie : Sleepy Hollow
Lisa Marie -> Mike Starr : Ed Wood
Mike Starr -> Scarlett Johansson : The Black Dahlia
Scarlett Johansson -> Justin Long : He’s Just Not That Into You
Justin Long -> Laurel Bryce : After.Life
connect “Sam Worthington” “Lance Henriksen”
Sam Worthington -> Wes Studi : Avatar
Wes Studi -> Sigourney Weaver : Avatar
Sigourney Weaver -> William Finley : Terror in the Aisles
William Finley -> Faye Dunaway : Terror in the Aisles
Faye Dunaway -> Donald Sutherland : Terror in the Aisles
Donald Sutherland -> Steve Lemme : Beerfest
Steve Lemme -> Lance Henriksen : The Slammin’ Salmon

Note that these actor names are case sensitive (Sylvester is different from sylvester, for example). If you can't find an actor by name, print an error and do not exit. Remember to error check for bad commands and handle errors cleanly!

If there are multiple actors with the same name but different IDs, you can simply use whichever the database returns when you query.

If the user inputs the same starting and ending actor name, it's up to you to handle the output for this case!

Just like in the previous projects, your server should be run when given the --gui flag. Please see the GUI specification section for more details.

Caching

Querying information from a database is a fairly optimized process, however there is nothing quite as fast as accessing locally stored data. Caching is the process of temporarily storing information to reduce latency next time the data is accessed. While in this assignment caching is not required for minimum functionality, we strongly recommend implementing some sort of cache to improve query time. You can check out Google Guava caching, but feel free to implement your own as well.

Keep in mind that a cache should never contain the entire database, as this defeats the purpose of a database in the first place!

Data Format

The actor and film data are provided by Freebase. This data will be formatted in a way that is new to you. Instead of text files like Stars, we will be providing you with a SQLite3 database to query.

In both databases we provide, there are three tables: actor, film and the joint table actor_film.

actor

The actor table stores all of the actors in our dataset. There are two columns, id and name. IDs will always be unique, names will not necessarily be unique.

film

The film table stores all of the films in our dataset. Like the actor table, there are two columns, id and name, the first of which is always unique and the second of which is not.

actor_film

This third table is called a junction table, because it exists to link rows from the actor table with rows from the film table. It describes all the relationships between actors and films. It has two columns, actor and film. For each movie each actor is in, there is a row pairing that actor's ID with that film's ID.

For example, if the movie Groundhog Day has twenty actors in it, then there will be twenty entries in the actor_film table where the film is set to the ID of Groundhog Day, /m/011x_4. Likewise, if Bill Murray is in 60 movies (wow!), then there will be 60 entries in the table where the actor is set to the ID of Bill Murray, /m/0p_pd. Since we know Bill Murray acted in Groundhog Day, we know there will be an entry somewhere in our table that has /m/0p_pd in the actor column and /m/011x_4 in the film column, representing that Bill Murray acted in Groundhog Day.

Grad Student Feature

Grad Functionality: Grad students will also implement update functionality. From the actor and film pages of your website, it should be possible to change the name of actors and films, and to add new actors and films. For example, Taylor Swift is in the film "The Giver." From her page, we should be able to change the name of the film "The Giver". If we find that her role in "The Giver" is not currently in our database, we should also be able to add "The Giver" to the list of films she has acted in. Similarly, from the page for "The Giver", we should be able to change Taylor Swift's name, or add Meryll Strepp to the actor list.

Implementation

Dijkstra’s Algorithm

Dijkstra’s algorithm is used to find the shortest path between two nodes in a graph. For this assignment, we want you to put actors at the nodes of some graph, with edges representing possible ball passes at movie premieres. We would like the edge weight for a particular movie premiere to be the reciprocal of the number of people who starred in it, so that movies with a larger number of actors have a lower edge weight.

Memory & Time Usage

The Freebase data set is large. We may use an even larger dataset when we grade your project. This means that reading in the entire graph from the database, and then doing a search entirely in memory is likely out of the question. Instead, you should keep data you haven't used yet on disk, and query it in when you need it. Any data structures you implement should account for this.

We expect your code to be able to identify both connections and non-connections in under 60 seconds on department machines using timdb.sqlite3. This will be enforced only when we test your implementation against our public and private tests. As a hint, you can modify Maven's default test timeout using mvn test --timeout <n> to limit each test to <n> seconds instead of the default 5.

Design Questions

  1. How could you modify your project, so other developers can easily add new graph search algorithms without having to worry about other constraints of the project (e.g. structure of the database, first initial -> last initial)?

  2. How could you improve your code to make it able to accept multiple types of files? For example, what if you wanted your program to be able to accept both a SQL database or a number of CSV files containing the same data?

  3. What would you need to change if movies now had an associated year of release and the chain of movies had to go in chronological order?

GUI Specification

This time around, you'll be building a website with a more complicated URL structure. For the GUI, we would like a home screen, /timdb, that has two search boxes for the starting actor and ending actor. There should also be some way to submit your query. You should display the query's results in a nice, intuitive way that shows the path through movies and actors from the starting name to the ending name. Each mention of movie or actor should link to an actor or movie's "personal page". The personal pages should contain a nice display of what movies the actor is in or what actors are in the movie. These lists should also be links to those respective movies and actors pages. And so on!

You can choose the URL structure for everything besides /timdb, as long as any URLs specific to this project are prefixed with /timdb/. For actor and movie pages, we would recommend checking out Spark's documentation on route patterns. In Spark, you can easily account for dynamic URLs using route patterns. Instead of creating a Spark handler for every possible movie and actor URL, you can create a get with a parameter pattern like: get("/some_url/:something", ...). Then, inside that handler, calling request.params(":something") will return whatever the user typed in after /some_url/.

One tip: all actor and film IDs have slashes in them, which could cause trouble when you include them in your generated URLs. Make sure to encode and decode the IDs in appropriate places in the frontend and/or backend!

Note: Your GUI must work on the large actor database.

Other Note: As your submission should be able to process both tIMDb and Stars commands in the REPL, you GUI should have a route that brings you to a tIMDb page and a route that brings you to a Stars page. The Stars GUI should still be able to process the four basic Stars queries, give descriptive error messages on bad input, and never crash.

Testing

As usual, you must thoroughly test your program using both JUnit and system tests. The public test suite should be copied to your repo as described in the Setup section, but remember that you are required to come up with your own, more thorough set of tests. Note that we will be grading you on the efficiency of your code as well as its correctness, so make sure you write tests for large data sets!

Minimum Functionality

As specified in the course missive you must meet minimum functionality for each assignment in order to pass CS32. If you hand in a project that misses minimum functionality, you have until the end of semester to implement minimum functionality and get it checked off at TA hours.

For tIMDb, minimum functionality requires passing the basic system tests provided. Additionaly, you need a minimally functionally GUI which supports searching between actors and displays actor/movie pages. Each actor name and movie name must hyperlink to a page that displays their movies/actors, respectively.

Handing In

Before you hand in, update your README.md with the following:

  • Known bugs
  • Design details specific to your code
  • Any runtime/space optimizations you made beyond the minimum requirements
  • How to build and run your program from the command line
  • How to run your tests
  • Any tests you wrote and tried by hand
  • Design questions and answers
  • Checkstyle appeals
  • Partner division of labor

When you are ready to hand in, follow the handin guide (just like previous projects)!

Only one of you and your partner needs to hand it in.