Over the last thirty years, database management systems (DBMS's) have matured into an industry with sales in excess of $20 billion and a technology that has a well-defined place in every major corporation worldwide. Formal courses in this area are commonplace (CS127), and research continues to produce innovations and new product lines.
Database management systems provide a one-size-fits-all solution to the problems of managing large volumes of data. When you purchase a modern DBMS, you get support for complex data structuring (e.g., indices), fully optimized query processors (SQL), data evolution (e.g., views), concurrency control, recovery, and security. The modern DBMS is a package of many technologies that can be used in concert to solve real data management problems within a given organization. There are many advantages to this tightly integrated approach.
What is data management? Data management is the use of application-level knowledge to provide users with appropriate functionality and performance in the face of severe problems of scale. Examples of data management techniques from modern database systems include indexing, clustering, caching, replication and query optimization. In a network setting this list might expand to include things like the use of broadcast, push-based delivery, and request scheduling.
The web has caused a revolution in how people think about information systems. While large corporate databases are not going to disappear, the web makes a panoply of data available to ordinary users through their web browsers. Most of this data is rather disorganized, but the promise of a worldwide information repository that ordinary folks can contribute to is so appealing that users have been very forgiving of its lack of discipline.
The database community, of course, has been busy trying to figure out how to respond to the challenges of the web. Some researchers have proposed XML-based query languages. Others have worked on XML-based storage systems. All of these topics are interesting and have been the subject of previous versions of CS227. This year we are trying something different.
As the volume of information on the Internet continues to explode, we begin to see the appearance of tools and technologies that each help in some particular way to manage large volumes of data. Search engines help us search for information; web proxy caches help keep "hot" data nearby. Unlike the components of a DBMS, these tools are in no way integrated with one another. They are instead point solutions.
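As a rough illustration of what keeping "hot" data nearby can mean, here is a minimal sketch of a cache that discards the least recently used entry when it runs out of room. The class name `LRUCache`, the `fetch` callback, and the choice of least-recently-used replacement are all assumptions made for this example; real proxy caches use a variety of replacement policies.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that evicts the least recently used entry when full.

    Hypothetical illustration only, not the design of any real proxy.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, url, fetch):
        """Return the cached body for url, calling fetch(url) on a miss."""
        if url in self._entries:
            # Hit: mark this entry as the most recently used.
            self._entries.move_to_end(url)
            return self._entries[url]
        # Miss: go to the origin, then remember the result.
        body = fetch(url)
        self._entries[url] = body
        if len(self._entries) > self.capacity:
            # Over capacity: drop the least recently used entry.
            self._entries.popitem(last=False)
        return body
```

A caller would pass whatever function actually retrieves the page as `fetch`; repeated requests for a popular page are then served without going back to the origin server.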
This course will investigate some of the main technologies that are evolving within this space. They include, but are not limited to:
The course will focus on each of these technologies individually and will try to understand better why they are distinct services and not a more integrated system as in DBMS's. Perhaps it is historical or perhaps there are some good reasons for this division of the world.
CS227 has always been a seminar course. In the past, we read current research papers together and discussed them in class. This year, however, the format will be different. We are starting with the premise that the area of network data services is new and has never been looked at as a unit. We will uncover the state of the art in each technology and will try to synthesize a tutorial on each. We will assemble the tutorials into a single work and publish it for the benefit of others.
At the very least, we will publish our book on the web. If it is very successful, we will look into the possibility of getting a traditional publisher to put it out as a printed volume. The quality required for the latter is quite a bit higher than that of the typical term paper in a one-semester course. Yes, writing style and good organization matter - in fact, they are required. The chapters in this book must be pedagogically sound. This, of course, requires that the authors have a very good mastery of the material.
The topics listed above will be the basis for the chapters (although I am open to suggestion). Each topic will be addressed by a team. The ideal team size is three, but we could live with teams of size two or four. It will be the team's responsibility to research the topic and organize the material. This requires the ability to discriminate between what's important and what's not. It also requires the ability to step back and try to abstract what is going on as distinct from the details. These are not easy things to do, and I will help with this.
This course is designed to be anti-competitive. We will have a common goal, that of producing a quality book, and we will all work together to that end. Thus, groups will help other groups to make their chapters better. The overall quality is the goal, not individual performance.
Another basic premise is that we should never do anything that feels like "busy-work". It's hard enough to write a book! Thus, if it seems like a class is not necessary, or it will not contribute to the goal, we will cancel it. Classes should be used for us to educate each other and for us to give each other feedback. They are working sessions - not opportunities for us to put each other to sleep!
The following schedule is subject to change. We will need a good way to dynamically reconfigure.
Date | Activity | Group | Due |
---|---|---|---|
Jan 24 | Intro to CS227 | | |
Jan 29 | Brainstorm 1 | | |
Jan 31 | Brainstorm 2 | | |
Feb 5 | Brainstorm 3 | | |
Feb 7 | Brainstorm 4 | | |
Feb 12 | no class | | |
Feb 14 | Field overviews 1 | 1 & 2 | |
Feb 19 | President's Day - no class | | |
Feb 21 | Field overviews 2 | 3 & 4 | |
Feb 26 | Field overviews 3 | 5 & 6 | |
Feb 28 | Field overviews 4 | 7 & 8 | Annotated bibliography |
Mar 5 | Outline exchange | | Initial outline |
Mar 7 | Outline exchange | | |
Mar 12 | work day | | Draft outline (midnight) |
Mar 14 | Critique rotation | | |
Mar 19 | no class | | Final outline due |
Mar 21 | no class | | |
Mar 26 | Spring Break - no class | | |
Mar 28 | Spring Break - no class | | |
Apr 2 | no class | | |
Apr 4 | no class | | Draft 1 (midnight) |
Apr 9 | Critiques of Draft 1 (1/2) | | |
Apr 11 | Critiques of Draft 1 (2/2) | | |
Apr 16 | no class | | |
Apr 18 | no class | | Draft 2 (midnight) |
Apr 23 | Critiques of Draft 2 (1/2) | | |
Apr 25 | Critiques of Draft 2 (2/2) | | |
Apr 30 | no class | | |
May 2 | no class | | Final Draft (midnight) |
May 7 | Postmortem - Final Decision about book | | should have read full book |
Brainstorm 1 - Discuss the topics as given above and determine if these are the correct ones. We will entertain suggestions from the class, but remember the guiding principle: a topic should transcend more than one application.
Brainstorm 2 - What kinds of things should go into each area? This is very broad. We are looking for a list to guide each group's initial research into the literature. Each group should come with a list of issues for their area (e.g., for proxy caching: replacement policies).
Brainstorm 3 - Each group comes prepared with a strategy for where to obtain information. This could be from internet sites, journals, magazines, personal interviews, using products, etc. Please be specific.
Brainstorm 4 - What are the dominant issues that seem to come up most in the literature?
Each group will prepare a list and present this list to the class for a critique.
Field overview n - Group n gives a 30-minute presentation of the most important aspects of their field. This is meant to be preliminary, but should be a good summary of what they have discovered to date. Questions from the class will be useful to challenge each group's thinking. This is not a competition. It is a way for us all to help each other.
Tutorial talk n - Group n gives a short talk that would serve as a tutorial to someone who is computer science savvy, but who knows nothing about the specific field at hand. This is an exercise to make sure that each group understands the material since it is impossible to give a great tutorial talk unless you really understand the area.
Oral Progress reports - This is a chance for each group to say where they are in the process. Groups can talk about how confident they are at this point in the progress that they have made. If confidence is low, this is an opportunity to ask for help from others.
Chapter discussions - Groups will sit down with each other for short roundtable discussions about how well the chapter as it stands meets the goals of the book. Is it understandable? Does it present the non-obvious aspects of the area well?
March 5 - Annotated bibliography
March 12 - Final Outline
April 4 - First draft of chapter
April 16 - Second draft of chapter
April 30 - Final draft of chapter