Project topics:

Streaming Ingestion via Kafka

Streaming ETL requires a data-ingestion module that can take a raw data stream via a socket and begin the ETL process. This input data may come from a variety of sources and appear in different formats.

Apache Kafka is a popular publish/subscribe messaging system with a diverse set of use cases such as data logging and ingestion. Similar to S-Store, Kafka provides a scalable, partitioned architecture as well as a number of guarantees (e.g., durability, ordering). This project involves writing software modules that can read data feeds from Kafka and ingest them into S-Store, and vice versa. While doing that, the project team will explore a number of interesting research questions, such as:

a) How do guarantees of Kafka and S-Store compare, and what can be done to integrate them seamlessly? For example, how do we ensure exactly-once exchange of ordered data between S-Store and Kafka? In case of failures, could Kafka reliably serve as an “upstream backup” facility for S-Store?

b) How does the ingest throughput of the two systems compare? Where are the bottlenecks, if any? Any ideas on how to remove them?
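
As a starting point for the Kafka-to-S-Store direction, the sketch below uses the standard Kafka consumer API; the topic name and the ingestIntoSStore placeholder are assumptions, standing in for whatever client interface the team ends up building on the S-Store side.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToSStore {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sstore-ingest");
        // Commit offsets manually, only after S-Store has accepted the batch,
        // so a crash between poll() and ingest does not silently drop tuples.
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-input"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    ingestIntoSStore(record.key(), record.value());
                }
                consumer.commitSync();
            }
        }
    }

    /** Placeholder: the real version would invoke an S-Store ingestion procedure. */
    static void ingestIntoSStore(String key, String value) {
        System.out.printf("ingest key=%s value=%s%n", key, value);
    }
}
```

Committing offsets only after a successful hand-off gives at-least-once delivery; the exactly-once and upstream-backup questions above ask what it would take to tighten that guarantee.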

GROUP 1
Bikong Wang
Shuai Wang
Han Zhang

Caching between S-Store and Postgres

Assume that S-Store receives all input as streams, and that ETL is performed in real time by S-Store through the use of its stored procedures and its dataflow graph. It then passes the resulting tuples to an external database system like Postgres. In addition, S-Store has its own storage system, serving as an in-memory database. Suppose that part of the ETL being performed by S-Store requires access to historical data (e.g., comparing yesterday's values to today's). S-Store could reach out to Postgres to retrieve yesterday's values, or it could have kept yesterday's data in its own local cache (relations), which should make the ETL more efficient. In this scenario, coordination is required between S-Store and Postgres so that each tuple exists in exactly one system.

The first part of this project is to link up S-Store and Postgres so that data can flow back and forth between them. It also requires some method of cache management, including deciding when to cache something and when to purge it. Finally, the project should demonstrate the performance advantages gained by caching in S-Store.
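
A minimal sketch of the cache-on-miss logic, assuming Postgres is reached over standard JDBC; the table, column, and key names are illustrative only, and the in-memory LRU map stands in for the S-Store cache relation that the real project would maintain.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Cache-on-miss between S-Store and Postgres (all names are placeholders). */
public class HistoricalCache {
    private static final int MAX_ENTRIES = 10_000;

    // Stand-in for an S-Store cache relation: an LRU map keyed by "symbol|day".
    private final Map<String, Double> cache =
            new LinkedHashMap<String, Double>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                    return size() > MAX_ENTRIES;  // purge policy: evict least recently used
                }
            };

    private final Connection pg;

    public HistoricalCache(String jdbcUrl, String user, String pass) throws SQLException {
        this.pg = DriverManager.getConnection(jdbcUrl, user, pass);
    }

    /** Return yesterday's value from the local cache if present, else from Postgres. */
    public Double lookup(String symbol, String day) throws SQLException {
        String key = symbol + "|" + day;
        Double cached = cache.get(key);
        if (cached != null) {
            return cached;                        // hit: no round trip to Postgres
        }
        String sql = "SELECT value FROM daily_history WHERE symbol = ? AND day = ?::date";
        try (PreparedStatement ps = pg.prepareStatement(sql)) {
            ps.setString(1, symbol);
            ps.setString(2, day);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                double v = rs.getDouble(1);
                cache.put(key, v);                // miss: pull the value into the cache
                return v;
            }
        }
    }
}
```

A real implementation would also have to handle the coordination mentioned above (moving rather than copying tuples, and writing purged values back to Postgres), which a simple lookaside cache like this sidesteps.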

GROUP 2
Ian Stewart
Lezhi Qu
Qian Liu

S-Store as Streaming ETL

This project, like the previous one, assumes a multi-system combination in which all new data is first fed as streams to S-Store, then loaded into Postgres. However, this project focuses more on the ETL functions. It investigates how easy it is to set up S-Store as an ETL engine, complete with data cleaning, de-duplication, schema integration, format translation, etc.

Using S-Store as ETL first requires a connection between S-Store and Postgres, such that data can be sent between the two. One major question is how best to feed data into Postgres; the new data would obviously have to be converted to a format that Postgres understands. The most efficient way to accomplish this would probably involve the Postgres bulk loader. Another question is how to send data to Postgres in a fault-tolerant manner. Both S-Store and Postgres are transactional on their own, but the connection between the two requires transactional guarantees as well.
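
On the bulk-loading question, one plausible route is the COPY interface exposed by the PostgreSQL JDBC driver. The sketch below assumes the cleaned tuples have already been rendered as CSV lines; the database URL, table, and column names are made up for illustration.

```java
import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Arrays;
import java.util.List;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class BulkLoad {
    /** Push a batch of already-cleaned CSV rows into Postgres via COPY. */
    public static long loadBatch(Connection conn, List<String> csvRows) throws Exception {
        CopyManager copier = conn.unwrap(PGConnection.class).getCopyAPI();
        String payload = String.join("\n", csvRows) + "\n";
        // COPY streams the whole batch in one call, which is far cheaper than
        // issuing one INSERT per tuple.
        return copier.copyIn(
                "COPY trades_clean (symbol, ts, price, volume) FROM STDIN WITH (FORMAT csv)",
                new StringReader(payload));
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/etl", "etl_user", "etl_pass")) {
            long rows = loadBatch(conn, Arrays.asList(
                    "AAPL,2017-04-03 09:30:00,143.65,100",
                    "IBM,2017-04-03 09:30:01,174.14,200"));
            System.out.println("Loaded " + rows + " rows");
        }
    }
}
```

Making the exchange fault-tolerant would then mean tying each COPY batch to a committed unit of work on the S-Store side, so a failed load can be retried without duplicating tuples.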

GROUP 3
Andy Ly
Kevin Shu
Jessica Fu

Benchmarking Streaming TPC-DI

Benchmarks are always needed to demonstrate how well an experimental system performs. This typically involves taking a realistic application and implementing it to simulate a real, functioning system. In this case, the simulation is a subset of TPC-DI, a data integration benchmark. Getting something like this working for both S-Store and Postgres would be the thrust of this project.

A subset of TPC-DI has been implemented in S-Store as a streaming ETL application. We would like to implement the same subset in Postgres as standard ETL and compare the performance between the two. Performance metrics would include total run time, latency, and throughput.

As an additional (stretch) goal, since we expect the system to handle high-velocity input sources, we would want to be able to run TPC-DI across multiple nodes. S-Store has built-in support for this, but the benchmark itself will need to be extended. Configuring the benchmark to run on multiple nodes is the main part of this stretch goal.
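
However the comparison is run, the three metrics can be computed the same way on both systems. A minimal sketch, assuming the harness records per-tuple ingress and egress timestamps in nanoseconds:

```java
/** Computes the three benchmark metrics from per-tuple timestamps (nanoseconds). */
public class Metrics {
    /** ingressNs[i] / egressNs[i]: when tuple i entered and left the ETL pipeline. */
    public static void report(long[] ingressNs, long[] egressNs) {
        int n = ingressNs.length;
        double runtimeSec = (egressNs[n - 1] - ingressNs[0]) / 1e9;  // total run time

        double sumLatencyMs = 0;
        for (int i = 0; i < n; i++) {
            sumLatencyMs += (egressNs[i] - ingressNs[i]) / 1e6;      // per-tuple latency
        }

        System.out.printf("total run time: %.2f s%n", runtimeSec);
        System.out.printf("throughput:     %.0f tuples/s%n", n / runtimeSec);
        System.out.printf("mean latency:   %.2f ms%n", sumLatencyMs / n);
    }
}
```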

GROUP 4
Yana Hrytsenko
JJ Chen
Tommy Yao

ECA Rule Language

S-Store executes a series of transactions (sometimes called stored procedures) in an order specified by a dataflow graph. The stored procedures are specified by writing a combination of Java and SQL. For the purposes of ETL, many of the common tasks and workflows can be thought of as rules. Rules are a more intuitive way for users to specify their queries.

Rule languages have been proposed in the past, sometimes under the name Event/Condition/Action (ECA) rules. If a known event happens and some condition is met, then a specified action is triggered. The task here is to specify a simple rule language that can be used to carry out some or all of the ETL functions.

Your job is to specify a simple rule language and to write a translator that transforms rules into stored procedures. The language need not be complete (i.e., there may be some things that are not expressible), but it should include many of the common functionalities of ETL workloads (filters, data translation, etc.).
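
To make the translation concrete, here is a sketch of one hypothetical rule and the stored procedure a translator might generate from it, assuming the H-Store/VoltDB-style Java+SQL procedure API that S-Store builds on; the rule syntax and all table names are invented for illustration.

```java
// Hypothetical ECA rule (the source text the translator would consume):
//
//   ON   new_trade(symbol, price, volume)
//   IF   volume > 10000
//   THEN INSERT INTO large_trades VALUES (symbol, price, volume)
//
// One plausible translation target: a stored procedure that the dataflow
// graph schedules whenever a new_trade tuple arrives.

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class Rule_LargeTrade extends VoltProcedure {
    // The THEN action becomes a parameterized SQL statement.
    public final SQLStmt insertLarge =
            new SQLStmt("INSERT INTO large_trades VALUES (?, ?, ?);");

    // The ON event supplies the parameters; the IF condition becomes a Java guard.
    public VoltTable[] run(String symbol, double price, long volume) {
        if (volume > 10000) {
            voltQueueSQL(insertLarge, symbol, price, volume);
            return voltExecuteSQL(true);
        }
        return new VoltTable[0];
    }
}
```

The translator's job is then largely mechanical: parse the ON/IF/THEN clauses, map the event's attributes to procedure parameters, and emit the guard and queued SQL.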

GROUP 5
Cansu Aslantas
Kylee Hench
Alan Hwang