Syllabus:
-
Background
This semester, CS227 will consider topics related to data ingestion. Data ingestion is the process by which data is prepared and loaded into a DBMS for future use. On the surface, this seems obvious or trivial, but in practice it may represent the biggest problem that DBAs face in supporting data processing.
The 4 V’s have been used to characterize Big Data: (1) volume, (2) variety, (3) velocity, and (4) veracity. Of course, data ingestion must consider all four problems, but we will be especially interested in variety and velocity. Velocity is a big problem because data is created so quickly that both the storage media and the data management software can struggle to keep up. Variety is also a key element of the data ingestion problem, because input data typically arrives in many different forms and must be appropriately translated to work with data from other sources.
Traditionally, the data ingestion problem has been handled by ETL (Extract, Transform, Load) tools. ETL tools normally work in batch mode: they collect new data in files that are then pulled into the data management platform as needed. We are interested in online data ingestion, in which new data is loaded as it arrives. Of course, many complex transformations must be performed before loading. For example, data must be cleaned, schemas must be integrated, and duplicate data must be located and removed. We are interested in doing this on the fly.
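To make the contrast concrete, here is a minimal Python sketch of the online approach: each record is cleaned, mapped onto a target schema, and de-duplicated as it arrives, rather than being accumulated in files for a later batch ETL run. The field names, schema mapping, and helper functions are illustrative assumptions, not part of S-Store or any real tool.

```python
# Hypothetical sketch of on-the-fly ingestion (not a real system's API).
# Each incoming record is cleaned, renamed to a target schema, and
# checked against previously seen records before being emitted.

import hashlib


def clean(record):
    """Normalize string fields: strip whitespace, lowercase."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}


def to_target_schema(record, field_map):
    """Rename source fields to the target schema's field names."""
    return {field_map.get(k, k): v for k, v in record.items()}


def ingest(stream, field_map):
    """Yield cleaned, schema-mapped, de-duplicated records one at a time."""
    seen = set()
    for record in stream:
        row = to_target_schema(clean(record), field_map)
        # Hash the canonicalized row to detect exact duplicates on the fly.
        key = hashlib.sha1(repr(sorted(row.items())).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield row


# Example: a source whose 'uid' field maps to the target's 'user_id'.
# The two records differ only in whitespace/case, so one is dropped.
source = [{"uid": "A1", "email": " Bob@Example.com "},
          {"uid": "A1", "email": "bob@example.com"}]
rows = list(ingest(source, {"uid": "user_id"}))
```

A real online ingestion engine would of course need far more: fuzzy (not just exact) duplicate detection, bounded state rather than an ever-growing `seen` set, and schema integration across many sources — which is precisely the design space this seminar explores.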
This seminar is designed to investigate prior work in related technologies with the aim of forming a better picture of what a full-featured, online data ingestion tool might look like. We assume that such a tool might be built on a stream processing engine like S-Store (developed at Brown and MIT).
-
Course Layout
Students will organize themselves into groups (approximately three students per group) for the purpose of doing a project. The project will typically involve building a simple proof-of-concept system. More than one group will work on the same project, which creates some competition to see how different approaches compare.
The paper presentations will occupy the first two-thirds of each class. The last third will be reserved for discussion of the projects and their progress, as well as any interesting ideas inspired by our readings.
During the semester, you will be responsible for the following deliverables:
- Paper Summaries (1-2 pages written per class meeting)
- Paper Presentations
- Project Design Document (1-2 pages written + talk)
- Project Status Update
- Project Demo
-
Paper Presentations
Each group will choose papers (related to their project choice) and present them to the class. This talk is expected to be an in-depth description and analysis of the paper. Because it is the presenters' responsibility to teach the class about the system described, they will be expected to know and understand all aspects of that system. Thus, it is important to be prepared. If you have questions regarding the content of a paper, you should arrange to meet with John well in advance of your talk date.
WARNING: It is acceptable for students to use information and content (e.g., images and graphics) found on the Internet, but the original source must be properly attributed and cited. No credit will be given for presentations without proper citations.
-
Projects
The main component of this course will be the project. All projects will involve modern data ingestion in some capacity, particularly relating to the subjects discussed in this class. Beyond that, however, the projects will vary greatly in both scope and topic. We will discuss this in more depth during class and provide a list of topics that will contribute meaningfully to data ingestion research.