The Claremont Report on Database Research

SIGMOD 2008


Page 1: The Claremont Report on Database Research

The Claremont Report on Database Research

SIGMOD 2008

Page 2: The Claremont Report on Database Research

What is it?

• In May 2008, prominent DB researchers, architects, users, and pundits met at the Claremont Resort in Berkeley, CA
• Seventh such meeting in 20 years
• Report based on discussion of new directions in DBs

Page 3: The Claremont Report on Database Research

Turning point in DB Research

• New opportunities for technical advances, impact on society, etc.

1. Big Data
  – Not only traditional enterprises, but also e-science, digital entertainment, natural language processing, social network analysis
  – Design new custom data management solutions from simpler components

Page 4: The Claremont Report on Database Research

2. Data analysis as profit center
  – Barriers between IT dept. and business units dropping
  – Data is the business
  – Data capture, integration, etc. are keys to efficiency and profit
  – BI vendors: $10B (front-end only)
  – Also need better analytics, sophisticated analysis
  – Non-technical decision makers want data

Page 5: The Claremont Report on Database Research

3. Ubiquity of structured and unstructured data
  – Structured data: extracted from text, software logs, sensors, and deep web crawls
  – Semi-structured data: blogs, Web 2.0 communities, instant messaging
  – Publish and curate structured data
  – Develop techniques to extract useful data, enable deeper explorations, connect datasets

Page 6: The Claremont Report on Database Research

4. Expanded developer demands
  – Adoption of relational DBMSs and query languages has grown
    • MySQL, PostgreSQL, Ruby on Rails
    • Less interest in SQL; developers view DBMSs as too much to learn relative to other open source components
  – Need new programming models for data management

Page 7: The Claremont Report on Database Research

5. Architectural shifts in computing
  – Computing substrates for DM are shifting
  – Macro: rise of cloud computing
    • Democratizes access to parallel clusters
  – Micro: shift from increasing chip clock speed to increasing the number of cores and threads
    • Changes in memory hierarchy
    • Power consumption
  – New DM technologies

Page 8: The Claremont Report on Database Research

Research Opportunities
• Impact of DB research has not evolved beyond traditional DBs
• Reformation
  – Reform data-centric ideas for new applications and architectures
• Synthesis
  – Data integration, information extraction, data privacy
• Some topics not mentioned, because still part of significant effort
  – Must continue with these efforts
  – Also must continue with
    • Uncertain data, data privacy and security, e-science, human-centric interactions, social networks, etc.

Page 9: The Claremont Report on Database Research

DB Engines

• Big-market relational DBs have well-known limitations
• Peak performance:
  – OLTP with lots of small, concurrent transactions (debit/credit workloads)
  – OLAP with a few read-mostly, large join and aggregation workloads
• Bad for:
  – Text indexing, serving web pages, media delivery
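The two workload shapes named above can be contrasted in a minimal sketch. It uses Python's built-in `sqlite3` module purely for illustration; the `accounts` and `branches` tables and the helper names are assumptions, not part of the report.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """OLTP-style debit/credit: one small transaction touching two rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))

def balance_by_branch(conn):
    """OLAP-style read-mostly query: a join plus an aggregation over all rows."""
    cur = conn.execute(
        "SELECT b.name, SUM(a.balance) FROM accounts a "
        "JOIN branches b ON a.branch_id = b.id "
        "GROUP BY b.name ORDER BY b.name"
    )
    return cur.fetchall()

# Hypothetical toy data, held in memory only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE branches (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, branch_id INTEGER, balance INTEGER)")
conn.executemany("INSERT INTO branches VALUES (?, ?)", [(1, "north"), (2, "south")])
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", [(1, 1, 100), (2, 1, 50), (3, 2, 200)])
conn.commit()

transfer(conn, 1, 3, 30)
print(balance_by_branch(conn))
```

A real OLTP system would run many such transfers concurrently, while the aggregation would scan far larger tables; the sketch only shows the difference in query shape.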

Page 10: The Claremont Report on Database Research

• DB engine technology could be useful in sciences and Web 2.0 applications, but not in current bundled DB systems

• Petabytes of storage and 1000s of processors are available, but current DBs cannot scale to that level

• Need schema evolution, versioning, etc.

• Currently, many DB engine startup companies

Page 11: The Claremont Report on Database Research

1. Broaden range for multi-purpose DBs
2. Design special-purpose DBs
• Topics in DB engine area:
  – Systems for clusters of many processors
  – Exploit remote RAM and Flash as persistent media
  – Treat query optimization and data layout as a continuous task
  – Compress and encrypt data, integrated with data layout and optimization
  – Embrace non-relational DB models
  – Trade off consistency/availability for performance
  – Design power-aware DBMSs

Page 12: The Claremont Report on Database Research

• Declarative programming for emerging platforms

• Programmer productivity is important
  – Non-experts must be able to write robust code
  – Data-centric programming techniques
    • MapReduce: language and data parallelism
    • Declarative languages: Datalog
    • Enterprise application programming: Ruby on Rails, LINQ
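As a rough illustration of the MapReduce style mentioned above, the sketch below counts words with separate map, shuffle, and reduce steps. It runs in a single process and all function names are hypothetical; a real framework would distribute each phase across machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)

def word_count(docs):
    # Each document can be mapped independently, which is where the data parallelism comes from.
    mapped = chain.from_iterable(map_phase(d) for d in docs)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())

print(word_count(["a b a", "b c"]))
```

The programmer writes only `map_phase` and `reduce_phase`; grouping and distribution are left to the platform, which is the productivity argument the slide makes.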

Page 13: The Claremont Report on Database Research

• New challenges: programming across multiple machines
• Data independence valuable, no assumptions about where data is stored
• XQuery for declarative programming?
• Also need language design, efficient compilers, code optimization across parallel processors and vertical distribution of tiers
• Need more expressive languages
• Attractive syntax, development tools, etc.
• Data management: not only a storage service, but a programming paradigm

Page 14: The Claremont Report on Database Research

Interplay of Structured and Unstructured Data

• Data behind forms (Deep Web)
• Data items in HTML
• Data in Web 2.0 services (photo, video sites)

• Transition from traditional DBs to managing structured, semi-structured and unstructured data in enterprises and on the web

• Challenge of managing dataspaces

Page 15: The Claremont Report on Database Research

• On the web
  – Vertical search engines
  – Domain-independent technology for crawling
• Within the enterprise
  – Discover relationships between structured and unstructured data

Page 16: The Claremont Report on Database Research

• Extract structure and meaning from un- and semi-structured data

• Information extraction technology – pull entities and relationships from unstructured text

• Need: apply and manage predictions from independent extractors
  – Algorithms to determine correctness of extraction
  – Join with IR and ML communities
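One simple way to manage predictions from independent extractors is to keep only the facts that several of them agree on. The sketch below assumes two hypothetical pattern-based extractors and a plain vote threshold; it is an illustration of the idea, not a method taken from the report.

```python
import re
from collections import Counter

def extract_works_at(text):
    """Hypothetical extractor for the pattern '<Person> works at <Org>'."""
    return {(m.group(1), "works_at", m.group(2))
            for m in re.finditer(r"(\w+) works at (\w+)", text)}

def extract_employee_of(text):
    """Hypothetical extractor for the pattern '<Person>, an employee of <Org>'."""
    return {(m.group(1), "works_at", m.group(2))
            for m in re.finditer(r"(\w+), an employee of (\w+)", text)}

def combine(extractions, min_votes=2):
    """Keep a (subject, relation, object) triple only if enough extractors produced it."""
    votes = Counter(triple for result in extractions for triple in result)
    return {triple for triple, n in votes.items() if n >= min_votes}

text = "Ada works at Acme. Ada, an employee of Acme, met Bob. Bob works at Initech."
results = [extract_works_at(text), extract_employee_of(text)]
print(combine(results))
```

Agreement between extractors is only a crude proxy for correctness; weighting each extractor by its measured accuracy would be a natural refinement.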

Page 17: The Claremont Report on Database Research

• Better DB technology needed to manage data in context
  – Discover implicit relationships, maintain context through storage and computation
• Query and derive insight from heterogeneous data
  – Answer keyword queries over heterogeneous data sources
  – Analysis to extract semantics
  – Cannot assume that semantic mappings exist or that the domain is known

Page 18: The Claremont Report on Database Research

• Develop algorithms to provide best-effort services on loosely integrated data
  – Pay as you go, as semantic relationships are discovered

• Develop index structures to support querying hybrid data

• New notions of correctness and consistency
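A keyword query over hybrid data can be served by one inverted index that covers both structured records and free text. The sketch below is a minimal best-effort version; the `HybridIndex` class, its method names, and the sample sources are all hypothetical.

```python
import re
from collections import defaultdict

def tokens(value):
    """Lowercase a value and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", str(value).lower())

class HybridIndex:
    """Inverted index mapping each token to the (source, key) items containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_record(self, source, key, record):
        # Structured item: index every field value of the record.
        for value in record.values():
            for tok in tokens(value):
                self.postings[tok].add((source, key))

    def add_text(self, source, key, text):
        # Unstructured item: index every word of the text.
        for tok in tokens(text):
            self.postings[tok].add((source, key))

    def search(self, query):
        """Return the items that contain every query term."""
        terms = tokens(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

index = HybridIndex()
index.add_record("customers", 7, {"name": "Ada Lovelace", "city": "London"})
index.add_text("emails", "msg-1", "Meeting with Ada in London next week")
index.add_text("emails", "msg-2", "Invoice for Bob")
print(index.search("ada london"))
```

No schema mapping between the two sources is assumed, which matches the pay-as-you-go idea: answers are available immediately and can improve as relationships are discovered.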

Page 19: The Claremont Report on Database Research

• Innovate on creating data collections
• Ad-hoc communities to collaborate
  – Schema will be dynamic
  – Consensus to guide users
  – Need visualization tools that make it easy to create data
• As a result of such tools, info may be easier to extract

Page 20: The Claremont Report on Database Research

Cloud Data Services

• Infrastructures providing software and computing facilities as a service

• Efficient for applications
  – Limit up-front capital expenses
  – Reduce cost of ownership over time

• Services hosted in a data center
  – Shared commodity hardware for computation and storage

Page 21: The Claremont Report on Database Research

Cloud services available today

• Application services (salesforce.com)
• Storage services (Amazon S3)
• Compute services (Google App Engine, Amazon EC2)
• Data services (Amazon SimpleDB, SQL Server Data Services, Google’s Datastore)

Page 22: The Claremont Report on Database Research

• Cloud data services offer APIs more restricted than traditional DBs
  – Minimalist query languages, limited consistency
  – More predictable services
    • Difficult if they had to provide a full-function SQL data service
  – Manageability important in cloud environments
    • Limited human intervention
    • High workloads
    • Variety of shared infrastructures

Page 23: The Claremont Report on Database Research

• No DBA or system admin
  – Administration done automatically by the platform
• Large variations in workloads
  – Economical to use more resources for short bursts
  – Service tuning depends upon virtualization
    • HW virtual machines as programming interface (EC2)
    • Multi-tenant hosting of many independent schemas in a single managed DBMS (salesforce.com)
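Multi-tenant hosting can be sketched as one shared physical table that holds every tenant's logical tables. The layout below is an assumption made for illustration only (it does not describe how salesforce.com implements it), and the `MultiTenantStore` class is hypothetical.

```python
import json
import sqlite3

class MultiTenantStore:
    """Many tenants, each with its own logical tables, in one managed database."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        # One physical table; tenant and logical table name are ordinary columns.
        self.conn.execute("CREATE TABLE rows (tenant TEXT, tbl TEXT, data TEXT)")

    def insert(self, tenant, table, record):
        with self.conn:  # commit on success, roll back on error
            self.conn.execute(
                "INSERT INTO rows VALUES (?, ?, ?)",
                (tenant, table, json.dumps(record)),
            )

    def select(self, tenant, table):
        # Every query is scoped by tenant, so tenants never see each other's rows.
        cur = self.conn.execute(
            "SELECT data FROM rows WHERE tenant = ? AND tbl = ? ORDER BY rowid",
            (tenant, table),
        )
        return [json.loads(data) for (data,) in cur]

store = MultiTenantStore()
store.insert("acme", "leads", {"name": "Ada"})
store.insert("initech", "tickets", {"id": 1, "open": True})
print(store.select("acme", "leads"))
```

Because tenants add logical tables without any DDL, the platform can host many independent schemas with no per-tenant administration, at the cost of weaker typing and indexing than dedicated tables would give.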

Page 24: The Claremont Report on Database Research

• Need for manageability
• Adaptive online techniques
• New architectures and APIs
  – Depart from SQL and transaction semantics where possible

• SQL DBs cannot scale to thousands of nodes
  – Different transactional implementation techniques or different storage semantics?

Page 25: The Claremont Report on Database Research

• Query processing and optimization
  – Cannot exhaustively search the plan space with 1000s of sites

• More work needed to understand scaling realities

• Data security and privacy
  – No longer physical boundaries of machines or networks
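The point about query optimization at thousands of sites can be made concrete with a deliberately simplified count. The model below is an assumption, not a formula from the report: it considers only left-deep join orders and lets each join run at any site.

```python
import math

def plan_count(num_tables, num_sites):
    """Plans in a simplified model: every left-deep join order of the tables,
    times an independent site choice for each of the (num_tables - 1) joins."""
    return math.factorial(num_tables) * num_sites ** (num_tables - 1)

for sites in (1, 10, 1000):
    print(sites, plan_count(5, sites))
```

Even a five-table query goes from 120 candidate plans on one site to 120 × 10^12 on 1000 sites in this model, which is why an optimizer cannot enumerate the space exhaustively.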

Page 26: The Claremont Report on Database Research

• New scenarios
  – Specialized services with pre-loaded data sets (stock prices, weather)

• Combine data from private and public domains

• Reaching across clouds (scientific grids)
  – Federated cloud architectures

Page 27: The Claremont Report on Database Research

Mobile applications and virtual worlds
• Manage massive amounts of diverse user-created data, synthesize it intelligently, and provide real-time services

• Mobile space
  – Large user bases
  – Emergence of mobile search and social networks

• Timely information to users depending on their locations, preferences, social circles, extraneous factors, and the context in which they operate

• Synthesize user input and behavior to determine location and intent

Page 28: The Claremont Report on Database Research

• Virtual worlds (Second Life)
  – Began as simulations for multiple users
    • Blur distinction with the real world
    • Co-space, for both virtual and physical worlds
  – Events in the physical world captured by sensors, materialized in the virtual world
  – Events in the virtual world can affect the physical world
• Need to process heterogeneous data streams
• Balance privacy against sharing personal real-time info
• Virtual actors require large-scale parallel programs
  – Efficient storage, data processing, power sensitivity

Page 29: The Claremont Report on Database Research

Moving Forward
• DB research community doubled in size over the last decade
• Increasing technical scope makes it difficult to keep track of the field
• Review load for papers growing
  – Quality of reviews decreasing over time
• Need more technical books, blogs, wikis
• Open source software development in DB
  – Competition: system components for cloud computing
  – Large-scale information extraction
