CS2270 - Advanced Topics in Database Management

Syllabus:

Background

Over the last thirty years, database management systems (DBMS’s) have matured into an industry with sales in excess of $20 billion and a technology that has a well-defined place in every major corporation worldwide. Formal courses in this area are commonplace (CS127), and research continues to produce innovations and new product lines.

Database management systems provide a one-size-fits-all solution to the problems of managing large volumes of data. When you purchase a modern DBMS, you get support for complex data structuring (e.g., indices), fully optimized query processors (SQL), data evolution (e.g., views), concurrency control, recovery, and security. The modern DBMS is a package of many technologies that can be used in concert to solve real data management problems within a given organization. Vendors would like you to believe that their tools will solve all problems related to managing large amounts of data.

Lately, however, evidence has emerged that one-size-fits-all solutions do not scale. In fact, we have shown in several application areas that special-purpose data management engines can achieve at least an order of magnitude better performance than a generic solution.

This term we will study how database systems might best serve the demanding data management needs of scientific applications. Science applications are data-intensive. Astronomers are awaiting the completion of a new telescope (LSST) that will produce a petabyte of data per month. This amount of data is very hard to archive, let alone analyze with any reasonable performance guarantees. It seems obvious that specialized DBMS’s are needed to handle this data deluge. Yet commercial DBMS’s seem to have made little or no impact in this arena. Why is that?

This course is intended to get us closer to an answer to this question. Furthermore, we would like to understand better what the requirements of a database system for science might be.

Course Content

As the growth of information within the science community continues to explode, we begin to see the appearance of tools and technologies that can help in some particular way to manage the large volumes of data. Data interchange formats act as the data model, and special tools have evolved to operate on files that conform to these formats. Ad hoc processing pipelines are built using scripting languages. Uncertainty is handled in the application instead of in the database. No one has yet managed to build an integrated DBMS that will operate effectively in this environment.

This course will investigate some of the main requirements and technologies that are evolving within this space. They include, but are not limited to:
1. Data Models
2. Provenance and Lineage
3. Uncertainty Management
4. Data Integration
5. Cluster Computing and Storage Systems
6. Spatial Databases

The course will focus on each of these technologies individually and will try to understand better how they might be integrated into a DBMS. Much of what we will study will be results from recent research papers. Research papers often present incomplete results that do not necessarily agree with results from other research groups. It will be our job to try to understand the roots of this disagreement as well as what seems to be commonly accepted and what requires further research.

Game Plan

CS227 has always been a seminar course. In the past, we read current research papers together and discussed them in class. This year, however, the format will be different. We are starting with the premise that the area of scientific data management is still very early in its development. We will uncover the state of the art and will try to synthesize a written tutorial on each of several broad topics (likely the topics listed above). We will assemble the tutorials into a single work and publish it for the benefit of others.

At the very least, we will publish our book on the web. If it is very successful, we will look into the possibility of getting a traditional publisher to put it out as a printed volume. The quality required for the latter is quite a bit higher than that of the typical term paper in a one-semester course. Yes, writing style and good organization matter – in fact, they are required. The chapters in this book must be pedagogically sound. This, of course, requires that the authors have a very good mastery of the material.

The topics listed above will be the basis for the chapters (although I am open to suggestion). Each topic will be addressed by a team. The ideal team size is three, but we could live with teams of size two or four. It will be the team’s responsibility to research the topic and organize the material. This requires the ability to discriminate between what’s important and what’s not. It also requires the ability to step back and try to abstract what is going on as distinct from the details. These are not easy things to do, and I will help with this.

This course is designed to be anti-competitive. We will have a common goal, that of producing a quality book, and we will all work together to that end. Thus, groups will help other groups to make their chapters better. The overall quality is the goal, not individual performance.

Another basic premise is that we should never do anything that feels like “busy-work”. It’s hard enough to write a book! Thus, if it seems like a class is not necessary, or it will not contribute to the goal, we will cancel it. Classes should be used for us to educate each other and for us to give each other feedback. They are working sessions – not opportunities for us to put each other to sleep!
