Page 1

Database-as-a-Service for Long Tail Science

Bill Howe, Garret Cole, Nodira Khoussainova, Luke Zettlemoyer, Shaminoo Kapoor, Patrick Michaud

Page 2

All science is reducing to a database problem

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)

New model: “Download the world” (Data acquired en masse, in support of many hypotheses)

Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)

Oceanography: high-resolution models, cheap sensors, satellites

Biology: lab automation, high-throughput sequencing,

Page 3

The Long Tail

[Figure: data volume vs. rank. Labeled points: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Seismologists, Microbiologists, Ocean Modelers, <Spreadsheet users>]

“The future is already here; it’s just not very evenly distributed.” -- William Gibson

Page 4

The other “Large Scale”

[Figure: # of bytes vs. # of types, # of apps, by field (Astronomy, Oceanography, Biology). Labeled projects: LSST, SDSS, PanSTARRS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons]

Work placed on the figure:

Client + Cloud Viz, SSDBM 2010

Science Dataspaces, CIDR 2007, IIMAS 2008

This talk

Mesh Algebra, VLDB 2004, VLDBJ 2005, ICDE 2005, eScience 2008

HaLoop, VLDB 2010

see also:

Skew handling, SOCC 2010

Clustering, SSDBM 2010

Science Mashups, SSDBM 2009

Cloud Viz, UltraScale Viz 2009, Visualization 2010

Page 5

Ad Hoc Research Data

5/18/10 Garret Cole, eScience Institute

FASTA format

Spreadsheets

Delimited ASCII

Page 6

Problem

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”

Page 7

Simple Example

###query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
chr_4[480001-580000].287 | 4500
chr_4[560001-660000].1 | 3556
chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein CobT (nicotinate-mononucleotide:5, 6-dimethylbenzimidazole phosphoribosyltransferase)
chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN, SPT16 subunit
chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein CobT (nicotinate-mononucleotide:5, 6-dimethylbenzimidazole phosphoribosyltransferase)
chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf family, translational repressor
chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf family, translational repressor
chr_24[160001-260000].65 | 3542
chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf family, translational repressor
chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hydrolase
chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and protein kinases of the PI-3 kinase family
chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and protein kinases of the PI-3 kinase family
chr_11[1-100000].70 | 2886
chr_11[80001-180000].100 | 1523

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

id | query | hit | e_value | identity_ | score | query_start | query_end | hit_start | hit_end | hit_length
1 | FHJ7DRN01A0TND.1 | COG0414 | 1.00E-08 | 28 | 51 | 1 | 74 | 180 | 257 | 285
2 | FHJ7DRN01A1AD2.2 | COG0092 | 3.00E-20 | 47 | 89.9 | 6 | 85 | 41 | 120 | 233
3 | FHJ7DRN01A2HWZ.4 | COG3889 | 0.0006 | 26 | 35.8 | 9 | 94 | 758 | 845 | 872
…
2853 | FHJ7DRN02HXTBY.5 | COG5077 | 7.00E-09 | 37 | 52.3 | 3 | 77 | 313 | 388 | 1089
2854 | FHJ7DRN02HZO4J.2 | COG0444 | 2.00E-31 | 67 | 127 | 1 | 73 | 135 | 207 | 316
…
3566 | FHJ7DRN02FUJW3.1 | COG5032 | 1.00E-09 | 32 | 54.7 | 1 | 75 | 1965 | 2038 | 2105

COGAnnotation_coastal_sample.txt

Page 8

Complex Example

id | query | hit | e_value | query_start | query_end | hit_start | hit_end | hit_length
6409 | FHJ7DRN01BYA61.1 | TIGR00149 | 2.20E-21 | 1 | 84 | 43 | 125 | 134
6410 | FHJ7DRN01BDTEA.1 | TIGR00149 | 3.40E-09 | 3 | 42 | 30 | 69 | 134
6411 | FHJ7DRN02HEUGQ.1 | TIGR00149 | 1.70E-05 | 4 | 46 | 1 | 46 | 134
6412 | FHJ7DRN01CA4BO.1 | TIGR00149 | 5.30E-05 | 4 | 45 | 1 | 45 | 134
6413 | FHJ7DRN01DM2FK.3 | TIGR01651 | 5.70E-64 | 1 | 76 | 511 | 586 | 606
6414 | FHJ7DRN01B8BPS.1 | TIGR01651 | 1.20E-36 | 1 | 52 | 500 | 551 | 606
6415 | FHJ7DRN02JM54P.1 | TIGR01651 | 2.20E-24 | 15 | 80 | 301 | 366 | 606
6416 | FHJ7DRN02FK6C5.2 | TIGR00039 | 2.70E-16 | 1 | 45 | 37 | 85 | 153
6417 | FHJ7DRN01D019A.1 | TIGR00039 | 8.90E-12 | 5 | 65 | 48 | 118 | 153
6418 | FHJ7DRN02FYAFO.1 | TIGR00039 | 1.60E-11 | 1 | 76 | 67 | 153 | 153

coastal sample

…
[H] COG4547 Cobalamin biosynthesis protein CobT (nicotinate-mononucleotide:5, 6-dimethylbenzimidazole phosphoribosyltransferase)
  Ype: YPMT1.87
  Atu: AGl2410
  Sme: SMc00701
  Bme: BMEI0050
  Mlo: mll3561
  Ccr: CC0672
…
[J] COG5099 RNA-binding protein of the Puf family, translational repressor
  Sce: YGL014w YGL178w YJR091c YLL013c YPR042c
  Spo: SPAC1687.22c SPAC4G8.03c SPAC4G9.05 SPAC6G9.14 SPBC56F2.08c SPBP35G2.14 SPCC1682.08c
  Ecu: ECU11g1730
…

COG database

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome (the same table shown in the Simple Example on Page 7)

SwissProt web service

Browser Cross-Reference

TIGR01650 | GO:0051116 | contributes_to
TIGR01651 | GO:0009236 | NULL
TIGR01651 | GO:0051116 | NULL
TIGR01660 | GO:0008940 | NULL
TIGR01660 | GO:0009061 | NULL
TIGR01660 | GO:0009325 | NULL
TIGR01663 | GO:0000012 | NULL
TIGR01663 | GO:0046403 | NULL

TIGRFAM to GO Mapping

Page 9

An observation about “handling data”

How many plasmids were bombarded in July and have a rescue and expression?

SELECT count(*)
FROM [bombardment_log]
WHERE bomb_date BETWEEN '7/1/2010' AND '7/31/2010'
AND [rescue clone] IS NOT NULL
AND [expression?] = 'yes'
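The query above can be tried end to end on a toy table. The following is a minimal sketch that uses Python's built-in sqlite3 module rather than the SQL Server dialect on the slide. The rows and the plasmid column are invented for illustration, and dates are stored as ISO-8601 text (an assumption) so that BETWEEN compares them correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names with spaces or punctuation are double-quoted in SQLite
# (the slide uses SQL Server's [bracket] quoting for the same purpose).
conn.execute(
    "CREATE TABLE bombardment_log ("
    ' plasmid TEXT, bomb_date TEXT, "rescue clone" TEXT, "expression?" TEXT)'
)
# Hypothetical rows: only pA and pE satisfy all three conditions.
rows = [
    ("pA", "2010-07-02", "rc1", "yes"),
    ("pB", "2010-07-15", None, "yes"),   # no rescue clone
    ("pC", "2010-07-20", "rc3", "no"),   # no expression
    ("pD", "2010-08-01", "rc4", "yes"),  # bombarded in August
    ("pE", "2010-07-31", "rc5", "yes"),
]
conn.executemany("INSERT INTO bombardment_log VALUES (?, ?, ?, ?)", rows)

july_count = conn.execute(
    """
    SELECT count(*)
    FROM bombardment_log
    WHERE bomb_date BETWEEN '2010-07-01' AND '2010-07-31'
      AND "rescue clone" IS NOT NULL
      AND "expression?" = 'yes'
    """
).fetchone()[0]
print(july_count)
```

With dates kept as M/D/YYYY text, as a spreadsheet export might have them, a string comparison would not order them chronologically, which is one reason the sketch normalizes them first.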

Page 10

An observation about “handling data”

Which samples have not been cloned?

SELECT *
FROM plasmiddb
WHERE NOT (ISDATE(cloned) = 1 OR cloned = 'yes')
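The "cloned" column mixes dates, 'yes', and free text, which is why the query tests for both. A minimal sketch of the same filter in Python's sqlite3 follows. SQLite has no ISDATE, so the example registers a simplified stand-in that accepts only M/D/YYYY strings; that function, the sample column, and the rows are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime


def isdate(value):
    """Return 1 if value parses as M/D/YYYY, else 0 (simplified stand-in for ISDATE)."""
    if value is None:
        return 0
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return 1
    except ValueError:
        return 0


conn = sqlite3.connect(":memory:")
conn.create_function("ISDATE", 1, isdate)
conn.execute("CREATE TABLE plasmiddb (sample TEXT, cloned TEXT)")
conn.executemany(
    "INSERT INTO plasmiddb VALUES (?, ?)",
    [
        ("s1", "yes"),        # cloned, marked with a flag
        ("s2", "3/14/2010"),  # cloned, marked with a date
        ("s3", "no"),
        ("s4", "pending"),
        ("s5", None),         # blank cell
    ],
)
not_cloned = [
    row[0]
    for row in conn.execute(
        "SELECT sample FROM plasmiddb "
        "WHERE NOT (ISDATE(cloned) = 1 OR cloned = 'yes') "
        "ORDER BY sample"
    )
]
print(not_cloned)
```

Note that s5 is not returned: for a NULL value the predicate evaluates to unknown rather than true, so blank cells need an explicit "OR cloned IS NULL" if they should count as not cloned.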

Page 11

An observation about “handling data”

How often does each RNA hit appear inside the annotated surface group?

SELECT hit, COUNT(*) as cnt
FROM tigrfamannotation_surface
GROUP BY hit
ORDER BY cnt DESC
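As a small worked instance of this aggregation, the sketch below runs the same query in Python's sqlite3. The surface-group table itself is not shown in the slides, so the ten rows are borrowed from the coastal sample excerpt on Page 8 as a stand-in; a secondary sort on hit is added (an addition to the slide's query) so that ties come back in a fixed order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tigrfamannotation_surface (id INTEGER, query TEXT, hit TEXT)")
# Stand-in rows taken from the coastal sample excerpt (id, query, hit only).
rows = [
    (6409, "FHJ7DRN01BYA61.1", "TIGR00149"),
    (6410, "FHJ7DRN01BDTEA.1", "TIGR00149"),
    (6411, "FHJ7DRN02HEUGQ.1", "TIGR00149"),
    (6412, "FHJ7DRN01CA4BO.1", "TIGR00149"),
    (6413, "FHJ7DRN01DM2FK.3", "TIGR01651"),
    (6414, "FHJ7DRN01B8BPS.1", "TIGR01651"),
    (6415, "FHJ7DRN02JM54P.1", "TIGR01651"),
    (6416, "FHJ7DRN02FK6C5.2", "TIGR00039"),
    (6417, "FHJ7DRN01D019A.1", "TIGR00039"),
    (6418, "FHJ7DRN02FYAFO.1", "TIGR00039"),
]
conn.executemany("INSERT INTO tigrfamannotation_surface VALUES (?, ?, ?)", rows)

hit_counts = conn.execute(
    """
    SELECT hit, COUNT(*) AS cnt
    FROM tigrfamannotation_surface
    GROUP BY hit
    ORDER BY cnt DESC, hit
    """
).fetchall()
print(hit_counts)
```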

Page 12

An observation about “handling data”

For a given promoter (or protein fusion), how many expressing lines have been generated? (They would all have different strain designations.)

SELECT strain, count(distinct line)
FROM glycerol_stocks
GROUP BY strain
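The difference between count(*) and count(distinct ...) matters when the same line is logged more than once or a cell is blank. A minimal sqlite3 sketch with invented strain and line names shows the behaviour: duplicates are counted once and NULLs are ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE glycerol_stocks (strain TEXT, line TEXT)")
# Hypothetical stock records: strainA lists line1 twice, strainB has a blank line.
conn.executemany(
    "INSERT INTO glycerol_stocks VALUES (?, ?)",
    [
        ("strainA", "line1"),
        ("strainA", "line1"),
        ("strainA", "line2"),
        ("strainB", "line3"),
        ("strainB", None),
    ],
)
lines_per_strain = conn.execute(
    """
    SELECT strain, count(DISTINCT line)
    FROM glycerol_stocks
    GROUP BY strain
    ORDER BY strain
    """
).fetchall()
print(lines_per_strain)
```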

Page 13

An observation about “handling data”

Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)

SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
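The query computes "union of the three samples minus their intersection". It reads correctly in SQL Server because INTERSECT is evaluated before UNION and EXCEPT there. SQLite instead groups compound SELECTs strictly left to right, so the sketch below (Python's sqlite3, with shortened hypothetical table names and invented membership) makes the grouping explicit with subqueries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical contents of the three samples; only TIGR00039 is in all of them.
samples = {
    "refseq": ["TIGR00039", "TIGR00149", "TIGR01651"],
    "est": ["TIGR00039", "TIGR00149"],
    "combo": ["TIGR00039", "TIGR01651", "TIGR01660"],
}
for name, ids in samples.items():
    conn.execute(f"CREATE TABLE {name} (col0 TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?)", [(i,) for i in ids])

missing_somewhere = [
    row[0]
    for row in conn.execute(
        """
        SELECT col0 FROM (
            SELECT col0 FROM refseq UNION
            SELECT col0 FROM est UNION
            SELECT col0 FROM combo
        )
        EXCEPT
        SELECT col0 FROM (
            SELECT col0 FROM refseq INTERSECT
            SELECT col0 FROM est INTERSECT
            SELECT col0 FROM combo
        )
        ORDER BY col0
        """
    )
]
print(missing_somewhere)
```

Writing the grouping out this way also keeps the query portable across engines whose precedence rules differ.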

Page 14

Long Tail Science DaaS Requirements

Schema-Later or Schema-Free
  Schema represents a shared consensus on structure, semantics, data model, usage modalities
  By definition, no such consensus exists at the frontier of research
  By definition, lots of schema churn
  By definition, dirty data

Consistency?
  Read mostly, appends, versioning/batch replace

Scale?
  Relatively small (<100GB)

Dataspace abstraction attractive [Halevy, Maier, Franklin 2005]
  anecdotally well-received
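One way to read "schema-later" is: load the file as-is with generic column names, then attach meaning afterwards with a view. The sketch below illustrates that idea in Python with sqlite3. The ingest_delimited helper, the col0..colN naming (suggested by the col0 column in the Page 13 query), and the all-TEXT typing are assumptions for illustration, not the service's actual ingest code. The sample rows are adapted from the TIGRFAM to GO mapping on Page 8, with the NULL fields left out to show ragged input.

```python
import csv
import io
import sqlite3


def ingest_delimited(conn, table, text, delimiter="\t"):
    """Load headerless delimited text into `table` as col0..colN, all TEXT."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    width = max(len(r) for r in rows)
    columns = ", ".join(f"col{i} TEXT" for i in range(width))
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" * width)
    # Pad short rows with NULL so dirty, ragged files still load.
    padded = [r + [None] * (width - len(r)) for r in rows]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', padded)
    return width


conn = sqlite3.connect(":memory:")
raw = (
    "TIGR01650\tGO:0051116\tcontributes_to\n"
    "TIGR01651\tGO:0009236\n"
    "TIGR01651\tGO:0051116\n"
)
width = ingest_delimited(conn, "raw_go_map", raw)

# The "schema" arrives later, as a view that names the columns.
conn.execute(
    "CREATE VIEW go_map AS "
    "SELECT col0 AS tigrfam, col1 AS go_term, col2 AS qualifier FROM raw_go_map"
)
terms = [
    row[0]
    for row in conn.execute(
        "SELECT go_term FROM go_map WHERE tigrfam = 'TIGR01651' ORDER BY go_term"
    )
]
print(width, terms)
```

If the column meanings change, only the view is redefined; the raw upload stays untouched.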

Page 15

Some Science DaaS Motivations

Chronic IT poverty + exploding data volumes
  especially in the long tail

Data sharing is the whole point
  mandated by funding agencies
  in the cloud, sharing reduces to policy

Public reference databases
  Globally accessible in the cloud

Page 16

Chavi dataspace

Page 17

More Examples

What is the location of the E. coli glycerol stock(s) for gene X promoter fusion?

What is the -80 freezer and liquid nitrogen location of worm strain for gene X promoter fusion and/or protein fusion?

Show me all worm strains currently in storage?

Show me all worm strains for gene X?

Show me all worm strains for gene X promoter fusion?

Show me all worm strains for gene X protein fusion?

Show me a table of all worm strains with early embryonic expression?

Show me the location of the imaging data for gene X?

What strains have been shipped to Yale, Stanford, etc., and when were they shipped?

Show me a list of all primers with PCR failure?

What genes have midiprep stocks but no worm strains?

Page 18

Discovery: SQL Does not Terrify Scientists

Page 19

What’s the point?

Databases are underused in (long tail) science

Conventional wisdom says “Scientists won’t write SQL”
  This is utter horseshit
  witness SDSS if you don’t trust us

Instead, we implicate difficulty in
  installation
  configuration
  schema design
  performance tuning
  data ingest
  app-building (over-reliance on GUIs)

So we ask “What kind of platform can support ad hoc scientific Q&A?”

Page 20

Example Workflow: Environmental Metagenomics

Page 21

Page 22

Page 23

Page 24

metadata

search results

sequence data

Page 25

SQL

Page 26

Old UI (1)

Page 27

Old UI (2)

Page 28

New UI (1)

Page 29

Usage

about 5 months old

8 labs around UW campus

~200 tables

~400 views

Page 30

Implementation

Windows Azure app serves GUI and RESTful API for uploading data, saving queries

SQL Azure Database

SQL Server on AWS for spill-over beyond 50GB and to manage distributed queries

shared database, separate schemas per account

Accounts 1:1 with DB roles
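To illustrate what "shared database, separate schemas per account" buys, the sketch below gives two hypothetical accounts (alice, bob) their own namespace on a single connection, so both can own a table called samples without conflict. It uses Python's sqlite3 with attached in-memory databases as a stand-in for schemas. SQLite has no roles, so the 1:1 account-to-role mapping and its access control are not modelled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
accounts = ("alice", "bob")  # hypothetical account names

# Each attached database acts as a per-account namespace on one connection.
for account in accounts:
    conn.execute(f"ATTACH DATABASE ':memory:' AS {account}")
for account in accounts:
    conn.execute(f"CREATE TABLE {account}.samples (id INTEGER, note TEXT)")

conn.execute("INSERT INTO alice.samples VALUES (1, 'coastal')")
conn.executemany(
    "INSERT INTO bob.samples VALUES (?, ?)",
    [(1, "surface"), (2, "deep")],
)

# The same table name resolves to different data depending on the namespace.
counts = {
    account: conn.execute(f"SELECT count(*) FROM {account}.samples").fetchone()[0]
    for account in accounts
}
print(counts)
```

Because every account lives in the same database, a query can still join across namespaces (for example alice.samples with bob.samples), which is what makes sharing a matter of permissions rather than data movement.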

Page 31

View Semantics and Features

“Saved query” = View with attached metadata

Unify views and tables as “datasets”
  table = “select * from [raw_table]”

Replacement semantics for name conflicts
  old versions materialized and archived

Materialize downstream views
  when dependencies deleted
  when dependencies become incompatible

Permissions
  public vs. private vs. ACLs vs. groups

Sharing, social querying, CQMS*
  search, recent queries, friends’ queries, favorites, ratings
  facilitate sharing and recommendations of not just whole queries, but common predicates, join patterns, etc.

Discover and expose implicit relationships between datasets
  View synthesis [Garcia-Molina, Widom, ICDT 2010]
  Proactively create views for potential joins, unions, filters

* [Khoussainova, CIDR 2009]
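The dataset semantics on this slide (a table is a trivial view, a saved query is a view over other datasets, and a downstream view is materialized when its dependency goes away) can be sketched with Python's sqlite3. The materialize helper, the table and view names, and the snapshot strategy are assumptions for illustration; the rows come from the coastal sample excerpt on Page 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_hits (query TEXT, hit TEXT, e_value REAL)")
conn.executemany(
    "INSERT INTO raw_hits VALUES (?, ?, ?)",
    [
        ("FHJ7DRN01BYA61.1", "TIGR00149", 2.20e-21),
        ("FHJ7DRN01BDTEA.1", "TIGR00149", 3.40e-09),
        ("FHJ7DRN01DM2FK.3", "TIGR01651", 5.70e-64),
        ("FHJ7DRN02HEUGQ.1", "TIGR00149", 1.70e-05),
    ],
)
# A "table" dataset is a trivial view over the raw upload.
conn.execute("CREATE VIEW hits AS SELECT * FROM raw_hits")
# A "saved query" is a view over another dataset.
conn.execute(
    "CREATE VIEW strong_hits AS SELECT query, hit FROM hits WHERE e_value < 1e-10"
)


def materialize(conn, view):
    """Replace `view` with a table holding a snapshot of its current rows."""
    conn.execute(f'CREATE TABLE "{view}__tmp" AS SELECT * FROM "{view}"')
    conn.execute(f'DROP VIEW "{view}"')
    conn.execute(f'CREATE TABLE "{view}" AS SELECT * FROM "{view}__tmp"')
    conn.execute(f'DROP TABLE "{view}__tmp"')


# Before deleting the upstream dataset, snapshot the downstream view.
materialize(conn, "strong_hits")
conn.execute("DROP VIEW hits")
conn.execute("DROP TABLE raw_hits")

surviving = conn.execute("SELECT query, hit FROM strong_hits ORDER BY query").fetchall()
kind = conn.execute(
    "SELECT type FROM sqlite_master WHERE name = 'strong_hits'"
).fetchone()[0]
print(kind, surviving)
```

After the upstream dataset is deleted, strong_hits still answers queries because it now holds its own rows instead of depending on hits.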

Page 32

SQLShare as a Research Platform

SQL Autocomplete (Nodira Khoussainova, YongChul Kwon, Magda Balazinska)

English to SQL (Bill Howe, Luke Zettlemoyer, Shaminoo Kapoor)

Automatic Mashups and Visualization (Bill Howe, Alicia Key)

Semi-Automatic Logical Design Join, Union Recommendations (Bill Howe, Garret Cole)

View Synthesis: Find Q given result R and database D s.t. R = Q(D)

Crowdsourced SQL authoring

Information Extraction

Logs -> Snippets

English -> Snippets

Crowd -> Snippets

Schema, Data -> Snippets

Raw Data -> Snippets
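The view synthesis item above ("find Q given result R and database D s.t. R = Q(D)") can be made concrete with a deliberately naive search. The sketch below is a brute-force illustration, not the method of the cited work: it only considers queries of the form SELECT * FROM t WHERE col = value, and the synthesize_filters function and table name are hypothetical. The rows are taken from the coastal sample excerpt on Page 8.

```python
import sqlite3


def synthesize_filters(conn, table, target_rows):
    """Return (column, value) pairs whose equality filter reproduces target_rows."""
    target = sorted(target_rows)
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    columns = [d[0] for d in cursor.description]
    found = []
    for col in columns:
        values = [
            row[0]
            for row in conn.execute(
                f'SELECT DISTINCT "{col}" FROM "{table}" '
                f'WHERE "{col}" IS NOT NULL ORDER BY "{col}"'
            )
        ]
        for value in values:
            rows = conn.execute(
                f'SELECT * FROM "{table}" WHERE "{col}" = ?', (value,)
            ).fetchall()
            if sorted(rows) == target:  # compare as sets of rows, order-insensitive
                found.append((col, value))
    return found


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coastal (query TEXT, hit TEXT, hit_length INTEGER)")
conn.executemany(
    "INSERT INTO coastal VALUES (?, ?, ?)",
    [
        ("FHJ7DRN01BYA61.1", "TIGR00149", 134),
        ("FHJ7DRN01DM2FK.3", "TIGR01651", 606),
        ("FHJ7DRN02FK6C5.2", "TIGR00039", 153),
        ("FHJ7DRN01D019A.1", "TIGR00039", 153),
        ("FHJ7DRN02FYAFO.1", "TIGR00039", 153),
    ],
)
# R: the rows a scientist pasted in as "the result I want".
wanted = [
    ("FHJ7DRN02FK6C5.2", "TIGR00039", 153),
    ("FHJ7DRN01D019A.1", "TIGR00039", 153),
    ("FHJ7DRN02FYAFO.1", "TIGR00039", 153),
]
candidates = synthesize_filters(conn, "coastal", wanted)
print(candidates)
```

Two different filters reproduce the same result here, which shows why a practical system would need to rank or present several candidate queries rather than return a single answer.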