Syllabus:
-
Introduction
Database Management Systems (DBMSs) have become an enormously successful industry with an estimated $40B in annual sales. The big players include IBM (DB2), Microsoft (SQL Server), Oracle, and Sybase. It is hard to imagine a major company running its IT department without a major investment in database technology. It is probably fair to say that most students graduating from Brown in Computer Science will need to interact with a DBMS at some point in their careers.
Standard database systems from the big players all share a similar architecture, developed over the last 30 years and motivated by the prevailing business applications and by the characteristics of the available hardware. This architecture (as studied in CS127) has served the needs of the industry well.
However, there have been some very significant shifts in the database landscape over the last several years. First, the amount of data that needs to be managed has exploded. Whereas in the old days multiple megabytes or perhaps a few gigabytes were sufficient, some applications now need to manage multiple terabytes or, in extreme cases, petabytes (e.g., scientific applications such as astronomy). Second, applications have evolved to the point that we can identify broad classes, each with very different data access patterns and performance needs. Third, the hardware available today has very different cost and performance characteristics from what was available in the 1980s and 1990s, when most of the major database systems were originally conceived. All of these shifts have a major impact on DBMS architecture and design.
It has been observed that, to deal with the changes listed above, it might be time to rethink database architectures. The idea is that the "one size fits all" mentality of the big vendors is no longer appropriate. To meet the performance requirements of new applications, each application class needs its own database architecture. There should be a multiplicity of DBMS types available. This leads to what we are calling non-standard DBMSs, which are the topic of this course.
-
Course Topics
We will look at a few broad classes of these non-standard database systems. These will include:
- Column stores (for OLAP)
- Parallel systems (for OLTP)
- Cloud-based systems
- The so-called "NoSQL" systems (for web applications)
The bulk of the course will be centered on this last category. We will be looking at all of this from a database perspective. Thus, we will take a skeptical stance on the new technology, asking ourselves why a traditional DBMS would not work just as well. In other words, is there an intrinsic reason that relational technology could not be made responsive to the needs of the new applications? This will help us separate the innovation from the hype.
-
How will the Course Work?
This is fundamentally a seminar course. There will be no exams. You will be graded on the basis of your participation in projects and presentations. There will be readings assigned for each class. In some cases these will be assigned papers; in others, you will be given a topic and will be responsible for finding information about it on the Internet or in other sources. Each class will have one or more presenters whose job it will be to lead the discussion in class. The non-presenters should prepare a one-page position paper on the topic.
We will, as a class exercise (and as a service to the world), develop a website that describes and compares the similarities and differences among the NoSQL systems. No such reasoned comparison exists (as far as I know), and it would be useful if someone (us) produced one. The comparison will likely involve benchmarking the performance of these systems, possibly against some of the traditional DBMSs.
A more detailed description of a typical class follows.
- Everyone reads papers with special attention to primary papers.
- Each class will have 1-2 presenters for a topic.
- Non-presenters will write and submit a one-page description of the three most important ideas from the readings. Each idea should be described in a well-written paragraph that states the new idea and describes what it is about the idea that makes it worth remembering.
- Class will raise questions. Any question that we deem important and for which we cannot come up with an answer in class will be deferred until the following class.
- Second half of the class will be used to discuss the NoSQL project.
Presentations should have the following content (at least):
- Introduce the background and the problem that the paper(s) are trying to solve.
- Summarize the key points from the paper(s) with examples if possible.
- Compare with a corresponding relational solution to the same problem(s).