Project areas:

Project Group

Nested Transactions

Nested transactions have been studied in the past. Briefly, a nested transaction is a transaction that is made up of other transactions. The parent can only commit when all of its children successfully commit. If any child aborts, the parent must abort. S-Store has a need for a nested transaction facility – especially when the child transactions need to share state (i.e., tables). This project involves implementing a nested transaction facility for S-Store. There is an interesting question about how much of the distributed transaction code can be reused.
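The commit/abort rule described above can be sketched in a few lines. This is an illustrative model only (class and method names are assumptions, not the S-Store API): a parent may commit only if every child committed, and any child abort forces the parent to abort.

```python
class Transaction:
    """Toy model of nested-transaction commit semantics (not S-Store code).

    A parent may commit only when all children have committed; if any
    child aborts, the parent must abort as well.
    """

    def __init__(self, name):
        self.name = name
        self.children = []
        self.state = "active"   # active -> committed | aborted

    def begin_child(self, name):
        child = Transaction(name)
        self.children.append(child)
        return child

    def abort(self):
        self.state = "aborted"
        for child in self.children:
            if child.state == "active":
                child.abort()    # aborting the parent aborts live children

    def commit(self):
        if any(c.state != "committed" for c in self.children):
            self.abort()         # a failed child dooms the parent
        else:
            self.state = "committed"
        return self.state
```

The real engineering question, as noted above, is how much of this can piggyback on the existing distributed-transaction machinery rather than being built from scratch.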
Project involves: heavy systems work on S-Store internals.
Chenggang Wu
Da Yu

Understanding Latency vs. Throughput

Traditional stream processors have largely been concerned with what we will call monitoring applications. When some condition is detected, the system should report it as quickly as possible. Often the value of a result degrades quickly over time (e.g., arbitrage opportunities; the arrival of a missile), and thus the system must try to minimize latency. Systems like S-Store must process a very large number of small transactions, and thus are concerned with throughput (transactions per second). Are these two criteria at odds? Can we build a system that can do either or both? What are the essential roadblocks (i.e., where are the bottlenecks)?
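One place the tension shows up is batching. A back-of-the-envelope model (the parameters and formulas here are assumptions for illustration, not S-Store measurements) shows why: larger batches amortize fixed per-batch overhead, raising throughput, but the first tuple in a batch must wait for the batch to fill, raising latency.

```python
def batch_tradeoff(batch_size, arrival_rate, per_batch_overhead, per_tuple_cost):
    """Toy latency/throughput model for batched stream processing.

    arrival_rate is tuples/sec; per_batch_overhead and per_tuple_cost are
    seconds. All numbers are illustrative, not measured.
    """
    fill_time = batch_size / arrival_rate             # time to collect a batch
    service = per_batch_overhead + batch_size * per_tuple_cost
    throughput = batch_size / service                 # tuples/sec processed
    worst_latency = fill_time + service               # first tuple: wait + process
    return throughput, worst_latency

# Under this model, growing the batch from 1 to 100 tuples raises
# throughput and worst-case latency together:
t1, l1 = batch_tradeoff(1, 1000, 0.01, 0.0001)
t100, l100 = batch_tradeoff(100, 1000, 0.01, 0.0001)
```

A project experiment would replace this analytic sketch with actual S-Store measurements to see where the model breaks down.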
Project involves: experimentation on S-Store platform, some research
Cansu Aslantas
Sima Zhu

MIMIC monitoring (3 groups)

MIMIC II is a very large dataset that was gathered over 10 years at Mass. General Hospital from thousands of ICU patients. It contains clinical data (e.g., drug therapy) as well as waveform data (e.g., ECG data sampled at 125Hz). We would like to have several teams implement some monitoring application using MIMIC and S-Store. When an interesting result is found, S-Store will push that result to a RESULT table. The groups will compete for the most impressive monitoring task. This will involve taking the stored MIMIC data and turning it into a stream by sending tuples from the stored data to a TCP socket (or equivalent).
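The replay step described above can be sketched simply. This is a minimal illustration (the row format and host/port are assumptions): each stored row is sent as a comma-delimited line over a TCP socket, and a real waveform replay would pace the sends at the sampling rate (delay = 1/125 for the 125Hz ECG data).

```python
import socket
import time

def replay_rows(rows, host, port, delay=0.0):
    """Turn stored rows into a stream: send each row as one
    comma-delimited line over a TCP connection. Illustrative only;
    the real S-Store input format may differ."""
    with socket.create_connection((host, port)) as sock:
        for row in rows:
            line = ",".join(str(v) for v in row) + "\n"
            sock.sendall(line.encode())
            if delay:
                time.sleep(delay)   # pace replay at the sampling rate
```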
Project involves: interacting with MIMIC dataset, writing a workload for S-Store, some front-end graphical UI work
Group 1
Jonathan Lessinger
Krishna Aluru
Young-Rae Kim

Group 2
Andrew Crotty

Evaluating Anti-Caching/NVM

Anti-caching is a technique for spilling data to lower levels of the memory hierarchy when the memory of a main-memory database fills up. This project would involve evaluating the use of Non-Volatile Memory (NVM) as a part of the memory hierarchy in the context of anti-caching. There is already work in this direction, so you would likely be working with a team. NVM is not commercially available yet, so we would run experiments using a simulator that is supplied by Intel. An extension is to study the use of NVM in the context of S-Store.
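To make the spilling idea concrete, here is a toy sketch (illustrative only, not the actual anti-caching implementation): when the in-memory tier exceeds its capacity, the least-recently-used tuples are evicted to a slower tier, here a plain dict standing in for NVM or disk, and are pulled back on access.

```python
from collections import OrderedDict

class AntiCache:
    """Toy anti-caching store: hot tier in memory, cold tier standing
    in for NVM/disk. Eviction is LRU; cold tuples are un-evicted on
    access. Illustrative only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hot = OrderedDict()   # in-memory tier, LRU order
        self.cold = {}             # stand-in for the NVM/disk tier

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)              # newest is hottest
        while len(self.hot) > self.capacity:
            k, v = self.hot.popitem(last=False)
            self.cold[k] = v                   # spill the coldest tuple

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)          # touch: now hottest
            return self.hot[key]
        value = self.cold.pop(key)             # un-evict on access
        self.put(key, value)
        return value
```

An evaluation project would measure where the crossover lies when the cold tier is simulated NVM rather than disk.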
Project involves: low-level evaluation of NVM caching performance, some research
Sam Zhao
William Truong
Andrew Osgood
Harsha Yeddanapudy
Xinwei Liu

Graphical Workflow design environment

Currently, putting together an S-Store workflow is a long and laborious task. It would be wonderful to have a graphical tool that lets the user draw the workflow and easily edit it as needed. This tool could also provide an easy way to define parameters like the batch size or the window size and to specify where the result should show up.
Project involves: creating a graphical tool, S-Store file manipulation
Craig Hawkins
Christian Mathiesen
Dave Lee

Query API for S-Store (like ODBC)

S-Store can also be considered a data storage system, since it has the ability to create and update tables. Natively, S-Store accesses these tables through streaming workflows of transactions. It would be useful to provide a query interface on the tables that S-Store creates, much like ODBC or JDBC, which use SQL. You might be able to borrow the Postgres query parser for use here. Linking that up with S-Store tables is an interesting exercise. A complex extension that you might want to think about is how this would interact with a running S-Store system.
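The target is the familiar connect/execute/fetch shape of the DB-API world. As a sketch of what the interface could look like (sqlite3 stands in for the hypothetical S-Store driver, and the RESULT table schema here is an assumption):

```python
import sqlite3

# sqlite3 is a stand-in for a hypothetical S-Store connector; the point
# is the ODBC/JDBC-style shape of the API, not the backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (patient_id INTEGER, alert TEXT)")
conn.execute("INSERT INTO result VALUES (42, 'bradycardia')")

# Parameterized SQL query against a table the system maintains:
cursor = conn.execute("SELECT alert FROM result WHERE patient_id = ?", (42,))
rows = cursor.fetchall()   # [('bradycardia',)]
```

The project's real work is behind this facade: parsing the SQL (perhaps with the borrowed Postgres parser) and mapping it onto live S-Store tables.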
Project involves: interfacing with S-Store internals, working with query APIs
Brigitte Harder
Liyun Zhang

Evaluate and invent distributed cost functions

It is crucial for optimization and for automatic partitioning to have an accurate cost function that evaluates to some measure of cost for a given partitioning. We have some simple such cost models. This project would involve evaluating the cost models that we have and, if they are not especially accurate (almost guaranteed), adjusting them to do better. This would require that the student understand the sources of cost in a distributed streaming data management system.
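A simple cost model of the flavor described above might look like the following sketch (the weights and the shape of the function are assumptions for illustration, not S-Store's actual model): each transaction pays a small cost per partition it touches, plus a large coordination penalty whenever it spans more than one partition.

```python
def partition_cost(txn_partitions, net_cost=10.0, local_cost=1.0):
    """Toy partitioning cost model (weights are illustrative).

    txn_partitions: one list of touched partition ids per transaction.
    Multi-partition transactions pay an extra coordination penalty,
    reflecting cross-node messaging in a distributed engine.
    """
    total = 0.0
    for partitions in txn_partitions:
        touched = len(set(partitions))
        total += touched * local_cost
        if touched > 1:
            total += net_cost   # distributed-transaction penalty
    return total
```

The project would test models like this against measured S-Store runtimes and refine the terms (network, queuing, skew) that the toy version ignores.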
Project involves: in-depth understanding of database internals, measuring cost of S-Store internals, some research
Lixiang Zhang

Main Memory Databases on Infiniband

Exact project TBD. Will involve evaluating distributed main-memory database architectures on Infiniband.
Erfan Zamanian
Yeounoh Chung

Streaming ETL

ETL (Extract, Transform, Load) is an important part of the data warehouse environment. It is the subsystem that takes data from the live transactional systems (note the plural), grabs the data that is relevant for the warehouse, potentially from multiple sources (extract), transforms the data formats to some common representation (transform), and loads this data into the data warehouse (load). We would like to identify an ETL-like task in the context of MIMIC to do this kind of data preparation and cleaning. We could entertain more than one project in this area.
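The three stages can be sketched end to end. The field names and the common representation below are assumptions for illustration (MIMIC's actual schemas differ): records are pulled from several source systems, normalized to one shape, and appended to a warehouse-side table.

```python
def etl(sources, sink):
    """Toy extract/transform/load pipeline (field names are made up).

    Each source is an iterable of dict records in its own format; all
    records are normalized to (patient_id, heart_rate_bpm) tuples and
    loaded into `sink`.
    """
    for extract in sources:                      # extract: one live system each
        for record in extract:
            if "hr_bpm" in record:               # transform: unify formats
                row = (record["pid"], record["hr_bpm"])
            else:
                row = (record["patient_id"], record["heart_rate"])
            sink.append(row)                     # load
    return sink
```

The streaming twist in this project is that extract and load become continuous: tuples arrive on a stream and are cleaned and loaded incrementally rather than in a nightly batch.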
Project involves: interacting with MIMIC dataset, writing a workload for S-Store

S-Store ingest + egress / get data from TCP socket

The S-Store system needs a way to efficiently ingest batches of tuples. This facility should take data from a file (maybe comma-delimited) and create a stream that is sent to the S-Store input ports. Right now, this has to go through the client process. Similarly, S-Store needs a reliable place to put any results that it computes. This will likely be through some kind of RESULT table.
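The file side of this ingest path can be sketched as a batching reader (function and parameter names are assumptions): read a comma-delimited file and yield fixed-size batches of tuples, ready to be pushed onto an S-Store input stream without going through the client process.

```python
import csv

def read_batches(path, batch_size):
    """Read a comma-delimited file and yield lists of `batch_size`
    tuples (the final batch may be short). Illustrative sketch of the
    ingest side; the actual S-Store input port protocol is separate."""
    with open(path, newline="") as f:
        batch = []
        for row in csv.reader(f):
            batch.append(tuple(row))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch   # flush the final partial batch
```

On the egress side, the symmetric piece would drain the RESULT table into a file or socket in the same batched form.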
Project involves: some work with S-Store internals, creating reusable client-side scripts

Tool for managing data and procedure partitioning

Right now, it is cumbersome in S-Store to specify to the system how to partition data across multiple nodes. Furthermore, it is difficult to specify the node on which a part of a streaming application should run. It would be great to have a graphical tool that would allow the application designer to specify these things and perhaps even move them around to facilitate experimentation with different configurations.
Project involves: creating a graphical tool, S-Store file manipulation

Abbreviated TPC-E or TPC-DI

This project would involve taking one or both of two well-known benchmarks, TPC-E (an OLTP benchmark) and/or TPC-DI (a data integration benchmark), and making them into a streaming benchmark that eliminates some of the complexity of the original, yet provides a plausible version for streaming. You would build a stream generator that shoots tuples from the benchmark at S-Store running a streaming application.
Project involves: modifying an existing benchmark design, implementing benchmark on S-Store

Recovery management

This is an open-ended topic. It requires some deep distributed system understanding. It will involve working with the S-Store team to design an appropriate recovery mechanism and to assist in implementing these algorithms.
Project involves: heavy research, some work with S-Store internals