CS227 - Topics in Database Management

Topic: Network Data Services

Spring, 2001

Brown University
Department of Computer Science

Background

Over the last thirty years, database management systems (DBMS's) have matured into an industry with sales in excess of $20 billion and a technology that has a well-defined place in every major corporation worldwide. Formal courses in this area are commonplace (CS127), and research continues to produce innovations and new product lines.

Database management systems provide a one-size-fits-all solution to the problems of managing large volumes of data. When you purchase a modern DBMS, you get support for complex data structuring (e.g., indices), fully optimized query processors (SQL), data evolution (e.g., views), concurrency control, recovery, and security. The modern DBMS is a package of many technologies that can be used in concert to solve real data management problems within a given organization. There are many advantages to this tightly integrated approach.

What is data management? Data management is the use of application-level knowledge to provide users with appropriate functionality and performance in the face of severe problems of scale. Examples of data management techniques from modern database systems include indexing, clustering, caching, replication and query optimization. In a network setting this list might expand to include things like the use of broadcast, push-based delivery, and request scheduling.

The web has caused a revolution in how people think about information systems. While large corporate databases are not going to disappear, the web makes a panoply of data available to ordinary users through their web browsers. Most of this data is rather disorganized, but the promise of a worldwide information repository that ordinary folks can contribute to is so appealing that users have been very forgiving of its lack of discipline.

The database community, of course, has been busy trying to figure out how to respond to the challenges of the web. Some researchers have proposed XML-based query languages. Others have worked on XML-based storage systems. All of these topics are interesting and have been the subject of previous versions of CS227. This year we are trying something different.

Course Content

As the growth of information within the Internet continues to explode, we begin to see the appearance of tools and technologies that can help in some particular way to manage the large volumes of data. Search engines help us search for information; web proxy caches help keep "hot" data nearby. These tools are in no way integrated as they were in the days of the DBMS. They are instead point solutions.

This course will investigate some of the main technologies that are evolving within this space. They include, but are not limited to:

Web search engines
Web proxy caches
Data warehouses and data integration services
Web servers and dynamic content generation
Publish/subscribe services
Customization and enterprise portals
Synchronization services (AvantGo)

The course will focus on each of these technologies individually and will try to understand better why they are distinct services and not a more integrated system as in DBMS's. Perhaps it is historical or perhaps there are some good reasons for this division of the world.

The Game Plan

CS227 has always been a seminar course. In the past, we read current research papers together and discussed them in class. This year, however, the format will be different. We are starting with the premise that the area of network data services is new and has never been looked at as a unit. We will uncover the state of the art in each of the technologies listed above and will try to synthesize a tutorial on each one. We will assemble the tutorials into a single work and publish it for the benefit of others.

At the very least, we will publish our book on the web. If it is very successful, we will look into the possibility of getting a traditional publisher to put it out as a printed volume. The quality required for the latter is quite a bit higher than that of the typical term paper in a one-semester course. Yes, writing style and good organization matter - in fact, they are required. The chapters in this book must be pedagogically sound. This, of course, requires that the authors have a very good mastery of the material.

The topics listed above will be the basis for the chapters (although I am open to suggestion). Each topic will be addressed by a team. The ideal team size is three, but we could live with teams of size two or four. It will be the team's responsibility to research the topic and organize the material. This requires the ability to discriminate between what's important and what's not. It also requires the ability to step back and try to abstract what is going on as distinct from the details. These are not easy things to do, and I will help with this.

This course is designed to be anti-competitive. We will have a common goal, that of producing a quality book, and we will all work together to that end. Thus, groups will help other groups to make their chapters better. The overall quality is the goal, not individual performance.

Another basic premise is that we should never do anything that feels like "busy-work". It's hard enough to write a book! Thus, if it seems like a class is not necessary, or it will not contribute to the goal, we will cancel it. Classes should be used for us to educate each other and for us to give each other feedback. They are working sessions - not opportunities for us to put each other to sleep!

Tentative Schedule

The following schedule is subject to change. We will need a good way to dynamically reconfigure.

Date	Activity	Group	Due
Jan 24	Intro to CS227
Jan 29	Brainstorm 1
Jan 31	Brainstorm 2
Feb 5	Brainstorm 3
Feb 7	Brainstorm 4
Feb 12	NO CLASS
Feb 14	Field overviews 1	1 & 2
Feb 19	President's Day - no class
Feb 21	Field overviews 2	3 & 4
Feb 26	Field overviews 3	5 & 6
Feb 28	Field overviews 4	7 & 8	Annotated bibliography
Mar 5	Outline exchange		Initial outline
Mar 7	Outline exchange
Mar 12	work day		Draft outline (midnight)
Mar 14	Critique rotation
Mar 19	no class		Final outline due
Mar 21	no class
Mar 26	Spring Break - no class
Mar 28	Spring Break - no class
Apr 2	no class
Apr 4	no class		Draft 1 (midnight)
Apr 9	Critiques of Draft 1 (1/2)
Apr 11	Critiques of Draft 1 (2/2)
Apr 16	no class
Apr 18	no class		Draft 2 (midnight)
Apr 23	Critiques of Draft 2 (1/2)
Apr 25	Critiques of Draft 2 (2/2)
Apr 30	no class
May 2	no class		Final Draft (midnight)
May 7	Postmortem - Final Decision about book		should have read full book

Brainstorm 1 - Discuss the topics given above and determine whether they are the correct ones. We will entertain suggestions from the class, but remember the guiding principle: a topic should span more than one application.

Brainstorm 2 - What kinds of things should go into each area? This is very broad. We are looking for a list to guide each group's initial research into the literature. Each group should come with a list of issues for their area (e.g., for proxy caching: replacement policies).

Brainstorm 3 - Each group comes prepared with a strategy for where to obtain information. This could be from internet sites, journals, magazines, personal interviews, using products, etc. Please be specific.

Brainstorm 4 - What are the dominant issues that seem to come up most in the literature?

Each group will prepare a list and present this list to the class for a critique.

Field overview n - Group n gives a 30-minute presentation of the most important aspects of their field. This is meant to be preliminary, but should be a good summary of what they have discovered to date. Questions from the class will be useful to challenge each group's thinking. This is not a competition. It is a way for us all to help each other.

Tutorial talk n - Group n gives a short talk that would serve as a tutorial to someone who is computer science savvy, but who knows nothing about the specific field at hand. This is an exercise to make sure that each group understands the material since it is impossible to give a great tutorial talk unless you really understand the area.

Oral Progress reports - This is a chance for each group to say where they are in the process. Groups can talk about how confident they are in the progress they have made so far. If confidence is low, this is an opportunity to ask for help from others.

Chapter discussions - Groups will sit down with each other for short roundtable discussions about how well the chapter as it stands meets the goals of the book. Is it understandable? Does it present the non-obvious aspects of the area well?

DUE DATES

March 5 - Annotated bibliography

March 12 - Final Outline

April 4 - First draft of chapter

April 16 - Second draft of chapter

April 30 - Final draft of chapter

Example High-Level Outline

Introduction
The Technical Problem
    Precise statement
    What makes it a data management problem?
What Makes the Problem Hard?
    Issues
    Hardware and resource restrictions
    Research topics
Some abstract solutions to the above problems
Some systems and how they do/do not make use of the above solutions
Literature survey/Previous work
Open questions
Summary and conclusions