Foreword

This lab appears to be very long, but don't fret; most of what is below requires only that you read it. You will be learning something today that you have (likely) not ever seen before, and is very complex and interesting. If you would like to learn more about the topics covered in this lab, or if you find them particularly intriguing, consider taking the fall course CS127: Database Management Systems.

Overview

In this lab, we will be giving you a crash course in databases. We are going to cover what databases are, why you should use and love them, how to interact with them, and then finally how we do all of this in Java! You will be creating a small database from a dataset that we provide, and then practice queries on a much larger database that we will also provide.

Why Databases?

Fair question. A "flat" file representation of millions of datapoints might be formatted as comma or tab-separated values: essentially a plain text way to store a table, where each row is a datapoint and each column is a certain aspect of that datapoint.

At first thought this seems to be a perfectly fine way to store data; however, when trying to use data that is stored in this fashion it becomes evident that it is not. Searching through the data, a very common process, requires a binary search through the entire file, if it is sorted on the field that you are searching for. And if it isn't sorted, you must traverse the whole file, one line at a time, to find what you are looking for. And as your dataset grows, say to a terabyte or greater, these operations become infeasible.

So, what's the better way?

Databases! The idea behind databases is that they are (nearly) infinitely scalable. No matter how large your dataset grows to be, accesses and searches will always be quick. On the highest level, databases are still just tables, but more intelligently represented.

Each row ("tuple") of the table ("relation") is still one datapoint, and each column ("attribute") is still a certain aspect of that datapoint. The difference is that a database stores more information about these tables, in a structure called an index. The index basically says where in the table the tuple with a specific value for a specific attribute is located. For example, assuming we have a relation R which is sorted on integer attribute A, our index would give us pointers into the table for the first tuple where A=0, A=1, A=2, etc. So then when we ask our database for every tuple where 56 < A < 95, we can jump right to the first tuple where A=56 and read until A=95.

Keys to the Kingdom

Another very important aspect of databases is the idea of keys. There are multiple types of keys, but we are only going to cover two: primary keys and foreign keys.

Primary Keys

The primary key is the attribute of a relation that is chosen by you to be unique for each tuple; most of the time this will be a unique ID of some sort. By declaring an attribute to be a primary key, we are guaranteeing that no two tuples will have the same value for that attribute. You can have multiple attributes of a single relation be primary keys; this would reject any duplicates of those many attributes. In other words, if we had two primary keys, any new tuple for which both of those keys were the same as an existing tuple would be rejected.

Foreign Keys

A foreign key is a little bit different than a primary key. Rather than being an attribute of this relation, a foreign key refers to an attribute of another relation. In essence a foreign key is saying, "anything assigned to this attribute of a tuple must be a value of this other attribute in this other relation." Let's see an example; suppose we have the following two tables:

persons
P_Id LastName FirstName Address City
1 Hansen Ola Timoteivn 10 Sandnes
2 Svendson Tove Borgvn 23 Sandnes
3 Pettersen Kari Storgt 20 Stavanger
orders
O_Id OrderNo P_Id
1 77895 3
2 44678 3
3 22456 2
4 24562 1

For these two tables, P_Id is a primary key of persons, and O_Id is a primary key of orders. Also, P_Id in orders references the attribute P_Id in persons; thus it is a foreign key. To reiterate the explanation above in more concrete terms, this means that for any insertion into orders, the value specified for P_Id must be a value that is already present in the P_Id attribute of persons.

Interacting with a Database

We interact with a database using a language called SQL (Structured Query Language, pronounced like "sequel"). SQL gives us all of the tools needed to create relations, insert tuples into the relation, and access data stored in a relation.

As you read through the following, feel free to play around with the commands yourself by launching the interactive command line shell for sqlite3. Simply run sqlite3 on your terminal (if this doesn't work on your laptop, you can SSH in)!

Creation

To create a relation cs_classes with attributes number, title, semester, and instructor where number is the primary key and instructor is a foreign key, we execute the following command:

			CREATE TABLE "cs_classes"(
			"number" INTEGER,
			"title" TEXT,
			"semester" TEXT,
			"instructor" TEXT,
			PRIMARY KEY ("number"),
			FOREIGN KEY ("instructor") REFERENCES instructor(instr_id)
			ON DELETE CASCADE ON UPDATE CASCADE
			);
			

Let's break this down:

CREATE TABLE "cs_classes"(...);is the command to create a relation called cs_classes.

"number" INTEGER

declares an attribute called number, which is of type INTEGER.

"title" TEXT

declares an attribute called title, which is of type TEXT.

"semester" TEXT

declares an attribute called semester, which is of type TEXT.

PRIMARY KEY ("number")

declares the attribute number to be the primary key of this relation.

FOREIGN KEY ("instructor") REFERENCES instructor(instr_id)

declares the attribute instructor to reference the attribute instr_id of the relation instructor. It also tells our database to immediately verify that things inserted or deleted from this relation match with what it references. In other words, any new tuple will have its instructor attribute checked against instructor(instr_id) right away, or will be rejected.

ON DELETE CASCADE ON UPDATE CASCADE

declares the behavior of the foreign key when instructor(instr_id) is deleted or updated. If a tuple in instructor is deleted, it tells our database to also delete (cascade) the tuples with a instructor attribute that matched the deleted instructor(instr_id). Similarly, if a tuple's instructor(instr_id) is updated, then all referenced instructor attributes are also updated.

Note: If you don't want to replace an existing table, you can use the keywords "IF NOT EXISTS" which will abort table creation if there already exists one under the name.

CREATE TABLE IF NOT EXISTS "cs_classes" ...

Sidenote: Types in SQL

In SQL, there are many types that an attribute can be declared to be. You can read all about them here, but the ones that you will need to know about are TEXT, INTEGER, and REAL. TEXT is just what is sounds like: text, or a string. INTEGER and REAL store numbers — integers and floating point numbers respectively.

Insertion

For our basic insertion, we can use the following command:

INSERT INTO "cs_classes" VALUES (32, 'Software Engineering', 'Spr20','tn');

For this command, we do not need to specify the attributes of the relation; only what we want their values for this new tuple to be.

We can then execute additional insertions:

INSERT INTO "cs_classes" VALUES (33, 'Systems Programming', 'Fll18', 'twd');
INSERT INTO "cs_classes" VALUES (167, 'Operating Systems', 'Spr20', 'twd');

However, if we try to do this:

INSERT INTO "cs_classes" VALUES (32, 'Software Engineering', 'Spr20', 'tn');

Then we will get an error, as there is already a tuple in our relation with the attribute number=32. Since it is a primary key for the relation, there can only be one tuple with each value, and this insertion will be rejected.

Note: When inserting into a field that is autoincremented, a unique id for example, you should either have that value be NULL, or specify what fields you want and leave the autoincremented field out.

For example, if "cs_classes" contains "number" field that is auto-incremented:

INSERT INTO "cs_classes" VALUES (NULL, 'Machine Learning', 'Spr10', 'abc');

Or

INSERT INTO "cs_classes" (title, semester, instructor) VALUES ('Machine Learning',
			'Spr19', 'abc');

Update

If there already exists a relation, it's also possible to edit its values by using an UPDATE statement.

orders
O_Id OrderNo P_Id
1 77895 3
2 44678 3
3 22456 2
4 24562 1

Say we have the orders table above where O_Id is a primary key. If we made a mistake on the quantity of the order with O_Id=2, we can update the quantity by executing:

UPDATE "orders" SET ORDERNO = 44679 WHERE O_Id = 2;

Sometimes when dealing with primary keys or uniqueness constraints, we might want to always insert a new relation, but we don't know if there already exists an old relation with that key. For example, suppose we reuse O_Id numbers and only want to keep track of the latest order with that Id. We can use a REPLACE statement in SQLite which will first delete any row with the existing primary key and the insert our new relation.

Consider inserting a new order with O_Id=3, OrderNo=12345 in the existing table above. Because O_Id is a primary key, there can only be one relation with the value O_Id=3. However, since we only want to keep track of the latest order with that O_Id, we can do the following:

REPLACE INTO "orders" (O_Id, OrderNo) VALUES (3, 12345);

The REPLACE statement will check uniqueness constraints (violated by O_Id in this case), delete the current row violating that constraint, and then insert the new values.

Querying

Now that we know how to create a database, we need to learn how to use it! The bulk of SQL is related to querying a database; that is, asking a database for information. There are a lot of commands and keywords for querying, and the statements for queries can get very, very complicated, so let's start simple.

SELECT and FROM

Referring to the relation that we creating above, let's see a very simple selection:

SELECT name FROM cs_classes;

This query introduces two new keywords: SELECT and FROM.

The SELECT keyword indicates which attributes of the relation to include in the output. In this case, we are including only the name attribute of every tuple in our relation, but you can specify as many as you like, separated by commas, or use the wildcard * to select every attribute.

The FROM keyword is just what it appears to be: it specifies the name of the relation that we are looking at. (This can also be multiple, comma-separated values, but we'll get to that in a bit!)

WHERE

Now we will add a third keyword to our queries: WHERE. The WHERE keyword specifies constraints on the values of the attributes that must be satisfied for the corresponding tuple to be a member of the result. For example:

SELECT * FROM cs_classes WHERE semester = 'Spr20';

The result of this query will be every attribute of every tuple of the relation cs_classes such that the attribute semester has the value Spr20. So this would exclude the tuple for CS33, but include CS32 and CS167.

You can also have multiple constraints, by stringing them together with the keywords AND or OR. In addition, many common comparison operators exist: = != < > <= >=

ORDER BY

ORDER BY is used to sort the results of a query on a given attribute. It's usage is as follows:

SELECT * FROM cs_classes ORDER BY number DESC;

So, we select every attribute of every tuple of the relation cs_classes, and then sort them based on their course number. The last word, DESC, is short for descending, meaning that the largest course number will be first, and they will descend as one travels down the results. The other option there is ASC, but it is rarely used since it is the default.

A keyword that very often goes along with ORDER BY is LIMIT this is a simple one, that only has a number after it, which specifies how many of the top results to return after they are sorted. It's usage is as follows:

SELECT * FROM cs_classes ORDER BY number DESC LIMIT 5;

This would grab the five classes with the highest numbers.

Aggregate Functions and Grouping

Another super powerful aspect of SQL is that it provides a set of functionalities that can summarize the resulting numeric data from your SELECT statement. They basically perform useful computations on a particular column of selected data. We're only going to be looking over one quick example in this lab, but feel free to look up other available commands online!

COUNT

The COUNT function returns the number of rows that matches a specific criteria. For example,

SELECT COUNT(*) FROM cs_classes WHERE instructor='tn';

would return the total number of courses taught by Tim!

GROUP BY

Now, with aggregate functions, we can use the keyword GROUP BY to collect data across multiple rows and group the results together by column. Take a look at the following:

SELECT semester, COUNT(*) FROM cs_classes GROUP BY semester;

This query would give us two columns, one with the semester and the other with the number of courses taught during the corresponding semester. Neat, huh?

Cross-Products

SQL wouldn't be that great if you could only pick from one relation at a time, right? So it allows syntax like this:

SELECT * FROM students, faculty;

Since we did not specify any WHERE conditions, we will end up with a table that is the cross-product of the relations students and faculty; that is, it has every tuple of one concatenated with every tuple of the other. On the surface this may not seem terribly useful, but things change when we add in some WHERE conditions. Suppose we have the following query:

SELECT * FROM students, faculty WHERE students.student_name =
			faculty.faculty_name;

What is this doing? Well, similar to above, it is going to compute a cross-product of the two relations. But as it is going through the motions of creating each tuple, it checked to see whether the new tuple matches all of the given WHERE restrictions; if it does, then it ends up in the result relation. If it doesn't, then it won't. So for the above query, we will end up with a relation that gives all the information about every person who is both a student and a faculty member.

Let's look further at this example. Suppose we have the following two tables:

students
student_id student_name age hometown
B01234567 Prithu 21 Boston
B01234568 Lena 21 New York
B01234569 Jacob 21 Anaheim
faculty
faculty_id faculty_name department
7789 Tim Nelson Computer Science
1237 Jacob Alternative Athletics

If we get the cross product of these two relations, like this:

SELECT * FROM students, faculty;
we get the following output:

students ⋈ faculty
student_id student_name age hometown faculty_id faculty_name department
B01234567 Prithu 21 Boston 7789 Tim Nelson Computer Science
B01234567 Prithu 21 Boston 1237 Jacob Alternative Athletics
B01234568 Lena 21 New York 7789 Tim Nelson Computer Science
B01234568 Lena 21 New York 1237 Jacob Alternative Athletics
B01234569 Jacob 21 Anaheim 7789 Tim Nelson Computer Science
B01234569 Jacob 21 Anaheim 1237 Jacob Alternative Athletics

This is not particularly helpful! But if we limit our cross product like this:

SELECT * FROM students, faculty WHERE students.student_name = faculty.faculty_name;
we get a narrowed down result:

student_id student_name age hometown faculty_id faculty_name department
B01234569 Jacob 21 Anaheim 1237 Jacob Alternative Athletics

Joins

All of the cross products that we have seen so far can be written as explicit joins. For example:
SELECT * FROM students, faculty
is the same as
SELECT * FROM students JOIN faculty
Additionally,
SELECT * FROM students, faculty WHERE students.student_name =
			faculty.faculty_name;
can also be written as:
SELECT * FROM students JOIN faculty ON students.student_name =
			faculty.faculty_name;
Using the keyword JOIN is an "explicit" join, whereas using a comma is an "implicit" join. Using the JOIN keyword gives you more control over your cross-product and could be useful in upcoming assignments. You can read about different kinds of SQL joins here.


Putting It All Together

Now that you know the basics of SQL, take a moment to think about what the following query is doing, and then continue reading to make sure that you are correct:

SELECT * FROM orders AS o, suppliers AS s
			WHERE o.supplier_id = s.id AND o.date < '2011-06-14 17:17:23' ;

Think you've got it? Let's find out!

The first few facts are simple enough:

  • Whatever the output is, we are selecting every attribute of it.
  • We are using two relations in out query, which are called orders and suppliers, but we give them a new name, or something?
  • We have two constraints on the result, but they look a little funny.

So, one potential confusion at a time.

First, we are using yet another keyword, AS. This allows us to rename a relation just for the purposes of this query; as far as the database is concerned, its name stays the same. This is mostly used to prevent oneself from needing to write out the whole name of a relation multiple times. So here we are referring to orders as o and suppliers as s in the rest of the query.

The other weird thing about this query is this stuff with the dots in the WHERE portion. What we are doing here is being very explicit about exactly which attributes are being referred to, by specifying the relation to which they belong. This is done in the format <relation>.<attribute>; however, you can also use the temporary name for a relation instead, as is done above.

SQL in Java

So, now that you know some SQL, you need to learn how to use your newfound knowledge for some real projects in Java! There are a lot of steps to this, and some of them are a little weird, so be sure to follow this guide closely to minimize problems!

Connecting to the Database

The first step is to establish a connection to our database. Assuming that it is a file called db.sqlite3, we can connect to it by doing the following:

// this line loads the driver manager class, and must be
// present for everything else to work properly
Class.forName("org.sqlite.JDBC");
String urlToDB = "jdbc:sqlite:" + "<path/to/db>.sqlite3"; 
Connection conn=DriverManager.getConnection(urlToDB); 
// these two lines tell the database to enforce foreign keys during operations, and should be present 
Statement stat = conn.createStatement(); 
stat.executeUpdate("PRAGMA foreign_keys=ON;"); 

We now have a Connection object that we can use to interact with the database. The Connection is really only used for one thing, which is to create another object called a PreparedStatement. The PreparedStatement is what we use to send queries to our database, and to receive the results back.

Creating a Relation

In order to create a relation using this new Connection, we can do the following:

PreparedStatement prep;
	prep = conn.prepareStatement("CREATE TABLE cs_classes("
	+ "number INTEGER,"
	+ "title TEXT,"
	+ "semester TEXT,"
	+ "instructor TEXT,"
	+ "PRIMARY KEY (number),"
	+ "FOREIGN KEY (instructor) REFERENCES instructor(instr_id)"
	+ "ON DELETE CASCADE ON UPDATE CASCADE);");
	prep.executeUpdate();

We have now successfully sent a query to the server, one which has caused a new table, called cs_classes, to be created. Note the use of the method executeUpdate() here; this is important, because when we are executing queries for which we want some results, like SELECT statements, we will use a different method!

Adding to a Relation

Now that we've got a database, let's add some things to it, using more features of prepared statements: setting and batches.

prep = conn.prepareStatement(
	"INSERT INTO cs_classes VALUES (?, ?, ?, ?);");
	prep.setInt(1, 32);
	prep.setString(2, "Software Engineering");
	prep.setString(3, "Spr20");
	prep.setString(4, "tn");
	prep.addBatch();

	prep.setInt(1, 33);
	prep.setString(2, "Systems Programming");
	prep.setString(3, "Fll18");
	prep.setString(4, "twd");
	prep.addBatch();

	prep.setInt(1, 167);
	prep.setString(2, "Operating Systems");
	prep.setString(3, "Spr20");
	prep.setString(4, "twd");
	prep.addBatch();

	prep.executeBatch();

So, what went on here? First off, there are some weird question marks in the PreparedStatement. These are placeholders: we know that every query we want to make takes that form, but we want to be able to reuse the same PreparedStatement for multiple queries. So we use the question marks, and then fill them using the methods like setInt() and setString(). The first argument of any set method is the index of the '?' that we want to fill (indexed beginning at 1), and the second is what to fill it with. Then we use the method addBatch() to save the current state of the PreparedStatement so that it can be executed later. We are now free to keep modifying the PreparedStatement, and continue saving states of it with addBatch(). Once we are finally done, we can execute all of those statements at once by calling executeBatch().

You should use this "pattern filling" technique even if you only intend to execute one query, especially if it uses data from user input. The SQL library knows how to properly quote parameters. If you just concatenate strings, you would probably have a problem if someone queried the course named "America's History", because of that apostrophe.

Querying the Database

Okay, now let's see a query where we have to retrieve some information from the database.

prep = conn.prepareStatement(
	"SELECT * FROM cs_classes WHERE semester = ?;");
	prep.setString(1, "spr20");
	ResultSet rs = prep.executeQuery();

	while (rs.next()) {
	int resCourseNumber = rs.getInt(1);
	String resCourseTitle = rs.getString(2);
	String resCourseSemester = rs.getString(3);
	// Do something else with these results...
	}
	rs.close();

The first part of this query is not much different from the previous ones: we use a PreparedStatement, and fill the placeholder with what we want to query for. But that is where the similarities end.

We then use a method called executeQuery(), as opposed to executeUpdate() or executeBatch() like before. This method returns something to us, an object of type ResultSet. This object contains, as you have likely guessed, the results of the query that was made. The ResultSet object can be stepped through using a while loop as demonstrated above; next() will return false when there are no more elements. Since you know the attributes that you are getting back (here, every attribute of cs_classes) you can use the get() methods on the ResultSet to grab each attribute of each tuple of the result, and then act upon them.

The final step is crucial: you must close the ResultSet after you are done with it. You also need to close() PreparedStatements. Due to the way that JDBC works, each of these is an open reading reference to the database, and if it is not closed it will interfere with operations later on and slow things down considerably. In general, most resource-related JDBC objects require closure when you are done with them. If in doubt, check to see if it has a close() method!

The Assignment

About time, huh? Stencil can be cloned through git here.

Part 1: Setup

The National Table Tennis Federation wants to partner with the International Quidditch League to create a database of player names, and they want to be able to autocorrect on each name. However, they want to be able to only use one set of names or the other, without overlapping! They've contracted you to update their existing autocorrection software to implement this functionality.

We have provided a new class for you: Database. Database is a wrapper class for the SQLite database you'll be using to store words. You won't need to change anything in Main - we've made all necessary updates for you! Note: this lab is conducted entirely through the command line! While you can still run the gui, it won't be affected by the changes you make to Database.

If you run Autocorrect with the database argument and the data argument like ./run --database=path to database file --data=path to txt file , it should load the .txt files requested into the database if they have not already been loaded and print a message on success. Otherwise, it will autocorrect normally on the specified database if given other Autocorrect flags (prefix, whitespace, led). An empty database file is provided for you at data/data.sqlite3.

In order to run Autocorrect with a database, use ./run --database=path to database --prefix --whitespace --led=1 . To do this, we want you to create the following relations:

corpus
id filename
word
corpus_id word
where word.corpus_id is a foreign key referencing corpus.id

To complete this part of the assignment, follow the TODOs in the constructor of the Database file! You'll know you're done when you're autocorrecting normally using the --database flag.

If you are receiving a NullPointerException, ensure you are assigning the private instance variable conn in your constructor instead of making an entirely new Database connection.

Note: We recommend not loading dictionary.txt or any of the long text files into your database, as even with a correct implementation this could take as long as 10 minutes. We've provided some shorter corpus files in cthulu.txt and matchstick.txt for you to test on.

Part 2: Statistics.

Now that you have a working database, your clients want you to provide some metadata about the database! Specifically, they want to know how many words are in each corpus, and how many times each word occurs across all corpi. We're only asking you to write the SQL for this part - all other code has been provided for you.

You can print the statistics for your database using the --stats flag when the --database flag has been provided. This will print out the results of the getInstanceMap() and getFrequencyMap() functions.

Getting Checked Off

Once you are done with both parts, get checked off by a TA at lab hours! The TA will first verify that your autocorrect works properly with a database, and then that you print the right statistics for a specific corpus set.