Awards

  • 2017 Early Career Research Achievement Award from Brown University
  • 2017 Alfred P. Sloan Research Fellow in Computer Science
  • 2017 VMware Early Career Faculty Grant
  • 2016 ACM TODS - Best of SIGMOD invitation
  • 2015 Google Research Award
  • 2015 NSF CAREER Award
  • 2015 AFOSR Young Investigator Research Award
  • 2015 Honorable Mention for TCDE Early Career Award
  • 2015 VLDB Best Demo Award
  • 2015 Robotics Science and Systems (AAAI-RSS) Blue Sky Award
  • 2013 ICDE Best Paper Award
  • 2011 VLDB Best Demo Award
  • 2010 Prospective Researcher Fellowship, Swiss National Science Foundation
  • 2008 ACM TODS - Best of SIGMOD invitation
  • 2006 DAAD Short-Term Scholarship, DAAD, Germany
  • 2005 School of Information Technology Scholarship for outstanding achievements, University of Sydney, Australia
  • 2005 Siemens Prize for Solving an Industry Problem in Research Project Work for the master's thesis, University of Sydney

Grants

Google Faculty Research Award (PI)


Together with SAP, we are exploring new data models for polystores. (Co-PI)


Together with Oracle and Mellanox, we are investigating the implications of RDMA for OLTP and OLAP data management systems. (PI)

III: Medium: Quantifying the Unknown Unknowns for Data Integration (PI): As the amount and variety of data available online explode, it is common practice for data scientists to acquire and integrate disparate data sources to achieve higher-quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete, and (2) what is the impact of any unknown (i.e., unobserved) data on query results? This project will develop and analyze techniques to estimate the impact of unknown data (a.k.a. unknown unknowns) on analytical queries. This will help users better understand answers in the presence of incomplete information across fields ranging from business and the military to medical applications.

This project will develop and exploit the following paradoxical statistical phenomenon: the ability to see certain data items more than once (across multiple data sets) enables one to estimate parameters of data items that have never been seen at all. The project will therefore develop new statistical techniques that take advantage of overlapping datasets, and software backed by both theory and experiments. This will enable users with overlapping incomplete data sets to actively "see the unseen," and in many cases perform as though they had access to missing information not represented in any of their data sources. The project will also focus on data validation, and on how to use multiple unreliable data sources to correct each other. Further, as the proposed analysis is nuanced and novel, the project will explore how best to convey these insights to the user via interactive visualizations of the predictions.
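The phenomenon described above, where items observed in more than one data set inform an estimate of items never observed at all, is closely related to species-richness estimation. As an illustrative sketch only (not the project's actual method), a minimal Chao1-style estimator over overlapping data sets might look like this in Python; the function name and the presence/absence treatment of each data set are assumptions for the example:

```python
from collections import Counter

def chao1_unseen(datasets):
    """Illustrative Chao1-style lower bound on the number of items
    missing from every data set: f1^2 / (2*f2), where f1 and f2 are
    the counts of items observed in exactly one and exactly two
    data sets, respectively."""
    freq = Counter()
    for ds in datasets:
        for item in set(ds):      # presence/absence per data set
            freq[item] += 1
    counts = Counter(freq.values())
    f1, f2 = counts.get(1, 0), counts.get(2, 0)
    if f2 == 0:
        # bias-corrected variant used when there are no "doubletons"
        return f1 * (f1 - 1) / 2
    return f1 * f1 / (2 * f2)

# Three overlapping sources: "c" appears in all three, "b" in two,
# while "a", "d", "e" are singletons, so f1 = 3 and f2 = 1.
print(chao1_unseen([["a", "b", "c"], ["b", "c", "d"], ["c", "e"]]))
```

The intuition matches the abstract: the ratio of singletons to doubletons across sources calibrates how much of the population the combined sources have likely missed.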

Data Management for Small High-Performance Clusters (PI): Two current hardware trends will fundamentally change the design of modern parallel analytical systems: (1) high-performance RDMA-capable networks such as InfiniBand FDR/EDR and (2) high-end many-core machines with considerable amounts of main memory. Existing parallel analytical systems, such as Spark and Hadoop, are fundamentally ill-suited to leverage these trends, because they target the wrong hardware: huge cloud deployments of cheap, low-end machines connected via high-latency, low-bandwidth networks. This is not the infrastructure that most businesses or defense agencies operate. Instead, given the increasing need for advanced statistical machine-learning techniques and agile analytics, we see the future in Small High-Performance Computing (SHPC) clusters. Already today, SHPC clusters equipped with fast networks and terabytes of main memory are reasonably affordable. Together with the Air Force, we are exploring how data management systems, especially analytical systems, have to change for SHPC clusters.

CAREER: Query Compilation Techniques for Complex Analytics on Enterprise Clusters (PI)

BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing (PI): Sharing data sets can provide tremendous mutual benefits for industry, researchers, and nonprofit organizations. For example, companies can profit when university researchers explore their data sets and make discoveries that help the company improve its business. At the same time, researchers are always searching for real-world data sets to show that their newly developed techniques work in practice. Unfortunately, many attempts to share relevant data sets between different stakeholders in industry and academia fail or require a large investment to make data sharing possible. A major obstacle is that data often comes with prohibitive restrictions on how it can be used (requiring, e.g., the enforcement of legal terms or other policies, handling of data privacy issues, etc.). To enforce these requirements today, lawyers are usually involved in negotiating the terms of each contract. It is not atypical for this process of creating an individual contract for data sharing to end up in protracted negotiations, which are both disconnected from what the actual stakeholders aim to do and fraught as both sides struggle with the implications and possibilities of modern security, privacy, and data sharing techniques. Worse, fear of missing a loophole in how the data might be (mis)used often prevents many data sharing efforts from even getting off the ground. To address these challenges, our new data sharing spoke will enable data providers to easily share data while enforcing constraints on its use. This effort has two key components: (1) creating a licensing model for data that facilitates sharing data that is not necessarily open or free between different organizations, and (2) developing a prototype data sharing software platform, ShareDB, which enforces the terms and restrictions of the developed licenses.
We believe these efforts will have a transformative impact on how data sharing takes place. By moving data out of the silos of individuals and single organizations and into the hands of broader society, we can tackle many societally significant problems.

III: Medium: 20/20: A System for Human-in-the-Loop Data Exploration (Co-PI): Exploratory data analysis plays a key role in data-driven discovery in a wide range of domains, including science, engineering, and business. For data analysis to become a commodity while its user base continually expands and diversifies, human productivity and ease of use must become first-class design considerations for any database system. Unfortunately, data tools that are user-friendly and designed to improve human productivity are still sorely lacking. This project will enable users at different skill levels to interact with and explore their large datasets far more easily and quickly than they can today. Rather than requiring users to spend precious time building complex analytics tasks, this work will offer a more agile, responsive, and user-friendly system based on direct manipulation of visual representations (e.g., charts, graphs, maps) of the data sets and analysis results. The system can also be used as a learning tool: e.g., a teacher could walk students through a complex dataset to verify specific hypotheses. This project will make large-scale data exploration more accessible to more users. Overall, it will accelerate discovery and breakthroughs in many domains, such as e-commerce, finance, and science. This research will be incorporated into undergraduate and graduate coursework. The outreach activities include special research- and education-focused programs geared towards undergraduates and high school girls.