The Beckman Report on Database Research

The meeting took place in 2013.

Big Data emerged due to three major trends:

"The new era of Big Data has drawn many communities into the 'data management game.'" (cf. Hadoop, NoSQL). DBMS principles have increasingly been recognized and incorporated in big data systems, e.g. Hive has become more popular than MapReduce, and NoSQL tools are moving to higher-level languages and ACID transactions.

Research Challenges

Scalable Big/Fast Data Infrastructures

Distributed computing achieved success in scaling up on commodity machines via constrained programming models like MapReduce. Higher-level languages like Hive were implemented on top. Thus we need DB-style query processing for big data, e.g. scale-out (and scale-up) query optimizers and execution engines.
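To make the contrast concrete, here is a minimal single-machine sketch that computes the same aggregate twice: once in a hand-written map/reduce style and once as a declarative query. The `ROWS` data, the `sales` table, and all function names are made up for illustration. SQLite stands in for a Hive-like SQL layer and says nothing about how Hive itself is implemented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) pairs.
ROWS = [("eu", 10), ("us", 5), ("eu", 7), ("us", 1), ("asia", 4)]


def map_phase(rows):
    """Emit one (key, value) pair per input record."""
    for region, amount in rows:
        yield region, amount


def reduce_phase(pairs):
    """Group the pairs by key (the 'shuffle') and sum each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


def total_by_region_mapreduce(rows):
    # The programmer spells out *how* to compute the result.
    return reduce_phase(map_phase(rows))


def total_by_region_sql(rows):
    # The programmer states *what* is wanted; the engine picks the plan.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    result = dict(cursor)
    conn.close()
    return result
```

Both functions return the same totals per region. The declarative version leaves the execution strategy to the engine, which is where a scale-out query optimizer would come in.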

New tech: RDMA, specialized processors (GPU, FPGA, ASICs), NUMA

Schema-on-read might scale better than schema-on-write (no ingest overhead), e.g. for rarely read data.
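The trade-off can be sketched with two toy stores, one validating records at ingest and one deferring parsing until the data is read. The class names, the `RAW_LINES` sample, and the `user`/`ms` fields are hypothetical and chosen only for illustration. Real systems differ in many ways not shown here.

```python
import json

# Hypothetical raw input; the second record has a malformed "ms" field.
RAW_LINES = [
    '{"user": "a", "ms": "12"}',
    '{"user": "b", "ms": "oops"}',
    '{"user": "c", "ms": "30"}',
]


def parse(line):
    """Apply the schema: 'user' is a string, 'ms' is an integer."""
    record = json.loads(line)
    return {"user": str(record["user"]), "ms": int(record["ms"])}


class SchemaOnWriteStore:
    """Pays the parsing and validation cost once, at ingest time."""

    def __init__(self):
        self.rows = []
        self.rejected = 0

    def ingest(self, line):
        try:
            self.rows.append(parse(line))
        except (KeyError, ValueError):
            self.rejected += 1  # bad records are caught early

    def read(self):
        return list(self.rows)


class SchemaOnReadStore:
    """Ingest only appends raw bytes; parsing happens on every read."""

    def __init__(self):
        self.lines = []

    def ingest(self, line):
        self.lines.append(line)  # no parsing, no validation

    def read(self):
        rows = []
        for line in self.lines:
            try:
                rows.append(parse(line))
            except (KeyError, ValueError):
                continue  # bad records surface only at read time
        return rows
```

If the data is rarely read, the schema-on-read store never pays for parsing most of it. If it is read often, the parsing cost is paid again on every read.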

Also: measure scalability not only in petabytes but also in total cost of ownership (TCO).

Diversity in the Data Management Landscape

Traditionally there was a single data warehouse; now there are many systems. Probably no one-size-fits-all big data system will suffice. Instead there will be multiple classes of systems: data deduplication, graph analysis, stream processing, ...

Many different languages: SQL, R, Python, etc. Research opportunity: tools to develop new scalable, data-parallel languages.

Hadoop YARN is interesting for interoperable and interconnectable ("Lego") big data systems.

End-to-end Processing and Understanding of Data

The data-to-knowledge pipeline relies on domain-specific knowledge; thus knowledge bases should be built.

I don't get it.

Cloud Services

They identify the following challenges:

Roles of Humans in the Data Life Cycle

They suggest that the "traditional" split of "devs build, analysts query, and DBAs tune" has dramatically changed, with less central control and more complexity.

Call for new interfaces ("multi-touch", combined visualization, querying, and navigation).

DB Community Challenges

DB Education

"we still teach 80s tech" designed for different hardware.

The meeting reached consensus that change is necessary, but not on what this change should be. Suggestions include:

Research Culture

"Alarming increase in emphasis on publication and citation counts instead of research impact. This discourages large systems projects, end-to-end tool building, and sharing of large data sets due to the longer times required and the resulting lower publication density."

Program committees (PCs) no longer meet face-to-face, which reduces individual accountability and makes it difficult for younger members to learn from more senior ones.

"The field should strive to return to a state where fewer publications per researcher per time unit is the norm."

Data Science is a Thing Now

Goal: "Transform large volumes of data into actionable knowledge."

Not only "data" skills but a broader, cross-disciplinary skill set is required.

Computer science might become more important in curricula of many other sciences, with "data" maybe in a more prominent role.