Tim Kraska, Cloud Computing Database Research

In 2017, we are offering the following projects as independent studies/reading and research. If you are interested, please email alexander_galakatos@brown.edu with the project(s) that interest you most, your CV, and your transcript. Because of the increased interest, we will hold interviews in May and again in September before the semester starts. Please also state in your email which interview period you prefer.

Visualizations are one of the most important tools for exploring, understanding, and conveying facts about data. However, the rapidly increasing volume of data often exceeds our capabilities to digest and interpret it, even with sophisticated visualizations. Traditional OLAP queries offer high-level summaries about the data but cannot reveal more meaningful insights. On the other hand, complex analytics tasks, such as machine learning (ML) and advanced statistics, can help to uncover hidden signals.

Unfortunately, these techniques are not magical tools that can miraculously produce incredible insights on their own; instead, they must be guided by the user to unfold their full potential. For example, even the best ML algorithms are doomed to fail without proper feature selection, but the process of finding these features is often the result of iterative trial-and-error, where a domain expert tests different subsets of features until finding those that work best.
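The trial-and-error loop over feature subsets can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the scoring function (leave-one-out accuracy of a 1-nearest-neighbour classifier), the exhaustive search, and the toy data with the made-up feature names `signal` and `noise`. It is not how HILDE selects features.

```python
from itertools import combinations


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the given features."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_dist = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            if dist < best_dist:
                best_j, best_dist = j, dist
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def best_feature_subset(rows, labels, all_features, score=loo_accuracy):
    """Try every non-empty subset and keep the best-scoring one.
    Ties keep the smaller subset that was tried first."""
    best_subset, best_score = (), -1.0
    for k in range(1, len(all_features) + 1):
        for subset in combinations(all_features, k):
            s = score(rows, labels, subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score


# Hypothetical toy data: "signal" separates the classes, "noise" misleads.
rows = [
    {"signal": 0.0, "noise": 5.0},
    {"signal": 0.1, "noise": 0.0},
    {"signal": 1.0, "noise": 5.1},
    {"signal": 1.1, "noise": 0.1},
]
labels = [0, 0, 1, 1]

subset, accuracy = best_feature_subset(rows, labels, ["signal", "noise"])
print(subset, accuracy)
```

The exhaustive search evaluates 2^n - 1 subsets for n features, which is one reason the process is usually steered by a domain expert rather than fully automated.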

We envision a new system called HILDE (Human-In-the Loop Data Exploration) that helps users perform complex analytical tasks, including ML, through an easy-to-use pen-and-touch interface (see picture). We will build on two existing systems that you might already know from the Data Science class: PanoramicData and Tupleware. PanoramicData is a visual front-end to SQL that allows users to rapidly search through datasets using visual queries constructed by pen-and-touch manipulation. For HILDE, we are extending PanoramicData's functionality to include sophisticated ML and statistical operators. Tupleware is a new high-performance distributed analytics framework designed at Brown to efficiently support complex analytics tasks. It leverages code generation to improve the performance of complex CPU-intensive analytics tasks. Over the next few months, we will bring these two systems together to build a first version of HILDE for a particular medical use case.

Yet, to make HILDE really work, we need to do much more, and that is where you come in. First, we need to extend the functionality and available operations of HILDE (e.g., K-Means, SVMs) so that it works with other use cases. This is crucial and will be an easy way to get started with this research project. In addition, there are many open research challenges, from exploring new and better visualizations for interactive ML to developing system techniques that make ML more interactive across a spectrum of algorithms. Furthermore, interactive speeds are often only possible with approximation techniques (e.g., visual approximation techniques, incremental result propagation), for which we require new ways to quantify the uncertainty.
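To make the link between incremental results and uncertainty concrete, here is a minimal sketch of one well-known approach: a running mean updated row by row (Welford's online algorithm) with a normal-approximation confidence interval. It assumes rows arrive in random order, so that the rows seen so far form a random sample. The class name `RunningMean` is hypothetical, and this is not HILDE's technique.

```python
import math


class RunningMean:
    """Incrementally maintained mean with an approximate confidence interval."""

    def __init__(self):
        self.n = 0       # number of values seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def add(self, x):
        """Fold one more value into the estimate (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def interval(self, z=1.96):
        """Approximate 95% interval for the true mean (z = 1.96).
        Undefined for fewer than two values, so return an unbounded interval."""
        if self.n < 2:
            return (float("-inf"), float("inf"))
        sample_variance = self.m2 / (self.n - 1)
        std_err = math.sqrt(sample_variance / self.n)
        return (self.mean - z * std_err, self.mean + z * std_err)


estimate = RunningMean()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    estimate.add(value)
    print(estimate.n, round(estimate.mean, 3), estimate.interval())
```

A front end could redraw a bar with its error band after every batch. The band narrows as more rows are processed, which lets the user decide when the answer is good enough.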

For next fall, we are looking for 3-4 people to help us build a first usable version of HILDE. You will work in a team with graduate students and have regular meetings with faculty members from the DB group. This is a hands-on project for getting involved in systems and data science research, and it has high potential to lead to a top-tier conference publication. If you are interested, please send your CV and transcript to alexander_galakatos@brown.edu, and mention which aspect of the system interests you most and how confident you are with C++, Python, C#, database internals (indexes, joins, etc.), LLVM, and front-end development.

There has been an explosion of Big Data engines over the last few years, ranging from engines for structured data to engines for text and graph data. Each of these engines uses a separate programming model to implement jobs for data-parallel processing. For example, Hadoop-like systems offer a MapReduce-based programming model, while graph engines provide a wide range of programming models, from declarative pattern-based query languages like Cypher to the more state-based programming models used in Pregel-like systems.

Most real-world big data applications must deal with data that sits in different engines and combine it to gain relevant insights. For example, enterprise applications often need to combine data from a classical ERP system (OLTP engine) with hierarchical or graph-like data such as bills of material (graph engine). Another example is a use case from a major oil company that evaluates data from sensing the sea bottom, where structured data (SQL engine) must be combined with matrices representing the sensing data (array database).

Today, big data applications use a lot of glue code, implemented in a host language in the application layer, to combine the data in different engines. This approach has major drawbacks: (1) High Development Effort: Developers need to write code for different engines, which requires expertise in the different programming models and data models (e.g., SQL for the relational model, Scala for Spark RDDs, and Python for GraphLab graphs). Moreover, a lot of glue code is required to integrate the different engines and to combine the results. This bloats programs with low-level run-time primitives (e.g., connecting to each engine via a different driver framework) and reduces programmer productivity. (2) Expensive Execution: Results are copied into the application layer and then combined using glue code. The result sizes of individual queries are often huge, which leads to expensive data copies. Moreover, glue code is often implemented naively; i.e., application code is often neither parallelized nor well optimized. For example, a join of two different data sources is sometimes implemented as a nested loop join in the application layer.
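The cost of the naive join can be seen in a small sketch. The two inputs stand in for result sets already copied out of two engines. The record layout and the names `orders` and `parts` are made up for illustration. The nested loop compares every pair of rows, while the hash join builds an index on one side and probes it once per row of the other.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """Naive glue-code join: compares every left row with every right row,
    so the work grows with len(left) * len(right)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Build a hash index on `right`, then probe it once per left row,
    so the work grows with len(left) + len(right) plus the output size."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], ())]


# Hypothetical results fetched from two different engines.
orders = [
    {"part_id": 1, "qty": 3},
    {"part_id": 2, "qty": 1},
    {"part_id": 1, "qty": 7},
]
parts = [
    {"part_id": 1, "name": "bolt"},
    {"part_id": 3, "name": "gear"},
]

print(hash_join(orders, parts, "part_id"))
```

Both functions return the same rows in the same order. Only the amount of work differs, and the gap becomes significant once each side holds millions of rows.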

In this project, we want to implement a novel language called BABELQL. BABELQL is a high-level language for writing cross-engine programs. It enables data transformations on typed objects (e.g., extracting a sub-graph from an object of type Graph). Moreover, BABELQL offers more complex constructs to express the iterations and shared state that are required for implementing complex analytics. As a first step, real-world cross-engine use cases should be implemented with the traditional approach (i.e., using glue code to combine programs in different engines). Based on these use cases, a first version of BABELQL should then be defined. Moreover, a compiler/optimizer as well as a runtime for BABELQL will be developed.

For next fall, we are looking for 1-3 people to help us build a first version of BABELQL. This is a great project for getting involved in systems and database research.

Home automation is here to stay, and it will become even more prevalent in the future. For example, Prof. Tim Kraska's house is already equipped with several sensors to greet the first person coming down to the kitchen with the weather forecast, to automatically turn off the heating (and the cooling in the summer) when nobody is at home, or to watch whether the basement is flooded. Recently, in addition to talking, the house acquired the ability to listen using Amazon Echo. For example, you can say "Alexa, open house and turn off the lights in the living room" and it does so (note that the "open house" is an artifact of the current Amazon Echo API). However, the house also contains bugs. For example, the door lock closes automatically after 30 seconds, even if the door is open, and the camera successfully records rain drops but not people actually entering the house. Even worse, a lot of functionality that would be great to have is simply missing. For example, it would be nice to be able to say "Alexa, turn off the lights in 5 min" before going to bed. Removing the bugs or adding functionality is not necessarily hard (even for a professor ;) but it is time-consuming. Furthermore, people who are not that familiar with programming will have a hard time programming the house in languages like Groovy or Python. Finally, the more functionality is added to the home, the harder it becomes to maintain. For example, the home automation hub SmartThings uses an app concept, where every piece of functionality is implemented as a separate app. But after a while it is no longer clear which functionality is implemented in which app.
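The door-lock bug comes down to a missing precondition. As a minimal sketch, assuming the house also has a door contact sensor that reports whether the door is closed (an assumption, since the text does not say which sensors are installed), the rule could be written as follows. The function name and parameters are hypothetical.

```python
def should_auto_lock(door_closed: bool, seconds_since_unlock: float,
                     delay: float = 30.0) -> bool:
    """Lock only when the delay has elapsed AND the door is actually closed.

    `door_closed` would come from a hypothetical door contact sensor;
    the 30-second default mirrors the delay described in the text.
    """
    return door_closed and seconds_since_unlock >= delay


# The buggy behaviour corresponds to ignoring `door_closed` entirely.
print(should_auto_lock(door_closed=False, seconds_since_unlock=45))
print(should_auto_lock(door_closed=True, seconds_since_unlock=45))
```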

The goal of this project is to explore and develop a new way to program home automation systems. First, we will explore new ways to implement and manage the functionality of the house beyond the app concept. Second, and arguably even more interesting, we will investigate how we can use crowd-sourcing to help users create individualized programs. For example, if somebody requests new functionality to support the phrase "Alexa, turn off the lights in 5 min", we will ask the crowd to create and test a program for a small amount of money (e.g., a dollar). It will be interesting to see not only whether crowd workers (e.g., from Amazon Mechanical Turk) can be trained to perform such a complex task, but also how we can guarantee the quality and correctness of the program. Furthermore, using crowd-sourcing to program homes opens up a whole new set of research challenges, from security to how best to support the workers in implementing the functionality.
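One conceivable way to check a crowd-written program is to run it against example utterances supplied by the requester. The sketch below shows what a small submission for the delayed command might look like. The function `parse_command`, the returned dictionary layout, and the supported units are all hypothetical, and the code only parses text; it does not use the Amazon Echo or SmartThings APIs.

```python
import re

# Seconds per supported time unit (an illustrative, incomplete list).
UNIT_SECONDS = {
    "s": 1, "sec": 1, "second": 1, "seconds": 1,
    "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hour": 3600, "hours": 3600,
}

COMMAND = re.compile(
    r"^(?:alexa,?\s+)?"                  # optional wake word
    r"turn (?P<state>on|off) the "
    r"(?P<device>.+?)"                   # device name, as short as possible
    r"(?: in (?P<amount>\d+) ?(?P<unit>[a-z]+))?$"  # optional delay
)


def parse_command(utterance):
    """Return {'device', 'state', 'delay_seconds'}, or None if not understood."""
    match = COMMAND.match(utterance.strip().lower())
    if match is None:
        return None
    delay = 0
    if match.group("amount") is not None:
        unit = match.group("unit")
        if unit not in UNIT_SECONDS:
            return None  # e.g. "in 2 fortnights"
        delay = int(match.group("amount")) * UNIT_SECONDS[unit]
    return {
        "device": match.group("device"),
        "state": match.group("state"),
        "delay_seconds": delay,
    }


print(parse_command("Alexa, turn off the lights in 5 min "))
print(parse_command("turn off the lights in the living room"))
```

Note that "in the living room" is kept as part of the device name because the delay clause only matches when a number follows "in". Edge cases like this are the kind of thing requester-supplied tests would need to cover.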

Again, for next fall we are looking for 2-4 people to help us build a first version of NextHome. You will work in a team with graduate students and have regular meetings with faculty members. This is a hands-on project for getting involved in HCI research.

Tim Kraska

Assistant Professor, Computer Science Department, Brown University

Independent Study / Reading and Research

HILDE: Human-In-the Loop Data Exploration

BABELQL: A Novel Programming Language for Cross-Engine Analytics

NextHome: The Next Generation Home Automation System