Crowdsourcing to structure biological knowledge (USC/ISI)

Crowdsourcing to structure biological knowledge. Andrew Su, Ph.D., Department of Molecular and Experimental Medicine, The Scripps Research Institute. ISI, USC, August 16, 2012

description

Talk given at USC's Information Sciences Institute (http://www.isi.edu). The AV recording is pretty horrible, but for anyone interested: http://webcasterms1.isi.edu/mediasite/SilverlightPlayer/Default.aspx?peid=89751f8537c44f2fa241db99c793cd231d

Transcript of Crowdsourcing to structure biological knowledge (USC/ISI)

Page 1: Crowdsourcing to structure biological knowledge (USC/ISI)

Crowdsourcing to structure biological knowledge

Andrew Su, Ph.D.

Department of Molecular and Experimental Medicine

The Scripps Research Institute

ISI, USC

August 16, 2012

Page 2: Crowdsourcing to structure biological knowledge (USC/ISI)

Human genetics underlies human health

~3 billion bases

~23,000 genes

Molecular diagnostics & therapeutics

Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …

“Gene annotation”

Page 3: Crowdsourcing to structure biological knowledge (USC/ISI)

Structured gene annotations enable computation

Structured annotations

Page 4: Crowdsourcing to structure biological knowledge (USC/ISI)

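Few genes are well annotated

[Figure: counts per gene for Gene Ontology (GO) and PubMed. x-axis: genes, sorted by decreasing counts; y-axis: counts. Plot labels: 38%, 59%. Genes labeled on the plot: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE.]

Data: NCBI gene2pubmed, August 2010

23,278 protein-coding genes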
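Per-gene article counts like those plotted on this slide can be tallied from a gene2pubmed-style table. The following is a minimal sketch, assuming tab-separated rows of taxonomy ID, Entrez Gene ID, and PubMed ID. The sample rows, the PubMed IDs in them, and the function name `count_articles_per_gene` are made up for illustration and are not taken from the talk.

```python
from collections import Counter

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def count_articles_per_gene(lines, tax_id=HUMAN_TAX_ID):
    """Return (gene_id, article_count) pairs sorted by decreasing count."""
    seen = set()
    for line in lines:
        # Skip the header line and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        tax, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
        if tax == tax_id:
            # A set drops repeated (gene, article) rows.
            seen.add((gene_id, pmid))
    counts = Counter(gene_id for gene_id, _ in seen)
    return counts.most_common()


# Made-up rows in the assumed three-column layout.
sample = [
    "#tax_id\tGeneID\tPubMed_ID",
    "9606\t7157\t1000001",
    "9606\t7157\t1000002",
    "9606\t7157\t1000002",   # duplicate row, counted once
    "9606\t7124\t1000003",
    "10090\t22059\t1000004",  # non-human row, filtered out
]
ranked = count_articles_per_gene(sample)
```

Sorting by decreasing count gives the x-axis ordering used in the figure: a handful of heavily studied genes first, followed by a long run of genes with few linked articles.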

Page 5: Crowdsourcing to structure biological knowledge (USC/ISI)

Biocuration is a key annotation bottleneck

[Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000]

Page 6: Crowdsourcing to structure biological knowledge (USC/ISI)

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Page 7: Crowdsourcing to structure biological knowledge (USC/ISI)

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 8: Crowdsourcing to structure biological knowledge (USC/ISI)

The Long Tail is a prolific source of content

[Figure: content produced vs. contributors (sorted), divided into a Short Head and a Long Tail]

Category | Short Head | Long Tail
News | Newspapers | Blogs
Video | TV/Hollywood | YouTube
Product reviews | Consumer reports | Amazon reviews
Food reviews | Food critics | Yelp
Talent judging | Olympics | American Idol

Page 9: Crowdsourcing to structure biological knowledge (USC/ISI)

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Page 10: Crowdsourcing to structure biological knowledge (USC/ISI)

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Page 11: Crowdsourcing to structure biological knowledge (USC/ISI)

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Utility

Users

Contributors

Page 12: Crowdsourcing to structure biological knowledge (USC/ISI)

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Page 13: Crowdsourcing to structure biological knowledge (USC/ISI)

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words

Approximately equal to 230 full-length articles

Good, NAR, 2011

[Figure: editor count (Editors) and edit count (Edits)]

Page 14: Crowdsourcing to structure biological knowledge (USC/ISI)

A review article for every gene is powerful

Hyperlinks to related concepts

References to the literature

Reelin: 68 editors, 543 edits since July 2002

Heparin: 175 editors, 320 edits since June 2003

AMPK: 44 editors, 84 edits since March 2004

RNAi: 232 editors, 708 edits since October 2002

Page 15: Crowdsourcing to structure biological knowledge (USC/ISI)

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 16: Crowdsourcing to structure biological knowledge (USC/ISI)

Document- and concept-centric text mining

Subject Object

Predicate

Page 17: Crowdsourcing to structure biological knowledge (USC/ISI)

Simple text mining for gene annotations

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

6319 novel Gene Ontology annotations

2147 novel Disease Ontology annotations
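The pipeline on this slide (wikilink, GO exact match, Gene Wiki mapping, candidate assertion) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the talk: the one-entry term lookup, the sample article text, and the function name `candidate_annotations` are hypothetical, and the only identifiers taken from the slide are Entrez Gene 334 and GO:0006897.

```python
import re

# Toy lookup of GO term names to GO IDs. A real run would load the full
# ontology; the term name paired with this ID is assumed for illustration.
GO_TERMS = {"endocytosis": "GO:0006897"}

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]],
# capturing only the link target.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def candidate_annotations(wikitext, entrez_gene_id, go_terms=GO_TERMS):
    """Return sorted (gene, GO ID) pairs for wikilinks that exactly match a GO term name."""
    candidates = set()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = go_terms.get(target)
        if go_id is not None:
            candidates.add((entrez_gene_id, go_id))
    return sorted(candidates)


# Hypothetical article text for the gene page mapped to Entrez Gene 334.
article = "The protein is involved in [[endocytosis]] and binds [[Heparin|heparin]]."
found = candidate_annotations(article, 334)
```

Each returned pair is only a candidate assertion: an exact string match says the article links to a concept, not what the relationship is, so the output would still need review before being counted as an annotation.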

Page 18: Crowdsourcing to structure biological knowledge (USC/ISI)

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Page 19: Crowdsourcing to structure biological knowledge (USC/ISI)

Dynamic queries across genes, diseases, SNPs

Page 20: Crowdsourcing to structure biological knowledge (USC/ISI)

Page 21: Crowdsourcing to structure biological knowledge (USC/ISI)

TOP 100 GENES

Page 22: Crowdsourcing to structure biological knowledge (USC/ISI)

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}

OMIM, PharmGKB

Page 23: Crowdsourcing to structure biological knowledge (USC/ISI)

OMIM, PharmGKB

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Page 24: Crowdsourcing to structure biological knowledge (USC/ISI)

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Page 25: Crowdsourcing to structure biological knowledge (USC/ISI)

Not just the biomedical literature…

Page 26: Crowdsourcing to structure biological knowledge (USC/ISI)

BioGPS aggregates gene-centric information

http://biogps.org

Page 27: Crowdsourcing to structure biological knowledge (USC/ISI)

The plugin interface is simple and universal

KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}

STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}

Pubmed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}

URL template

Gene entity

Rendered URL
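The idea on this slide, combining a URL template with a gene entity to produce a rendered URL, can be sketched in a few lines. This is a minimal illustration and not BioGPS's own implementation: the function name `render_plugin_url` and the example gene record are assumptions, while the KEGG template and the `{{EntrezGene}}` placeholder syntax come from the slide.

```python
import re

# A placeholder is a field name wrapped in double braces, e.g. {{EntrezGene}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Substitute each {{Field}} in the template with the gene's value for that field."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no field {key!r}")
        return str(gene[key])

    return PLACEHOLDER.sub(substitute, template)


kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Illustrative gene entity; the field names mirror the placeholders on the slide.
gene = {"EntrezGene": 1017, "Symbol": "CDK2"}

url = render_plugin_url(kegg_template, gene)
```

Because a plugin is nothing more than a template string, registering a new resource needs no code on the resource's side, which is what makes the interface "simple and universal".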

Page 28: Crowdsourcing to structure biological knowledge (USC/ISI)

The plugin interface is simple and universal

Page 29: Crowdsourcing to structure biological knowledge (USC/ISI)

The plugin interface is simple and universal

Page 30: Crowdsourcing to structure biological knowledge (USC/ISI)

The plugin interface is simple and universal

Page 31: Crowdsourcing to structure biological knowledge (USC/ISI)

The plugin interface is simple and universal

Page 32: Crowdsourcing to structure biological knowledge (USC/ISI)

The plugin interface is simple and universal

Total of 389 gene-centric online databases registered as BioGPS plugins

Page 33: Crowdsourcing to structure biological knowledge (USC/ISI)

BioGPS has a critical mass of users

• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week

Top 10 organizations

1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

Daily pageviews

Page 34: Crowdsourcing to structure biological knowledge (USC/ISI)

All resources should provide RDF…

Page 35: Crowdsourcing to structure biological knowledge (USC/ISI)

Mining structured content from HTML

Page 36: Crowdsourcing to structure biological knowledge (USC/ISI)

Defining a data extraction template

TP53 TNF APOE IL6 VEGF …EGFR TGFB1

Page 37: Crowdsourcing to structure biological knowledge (USC/ISI)

The BioGPS Semantic Annotator

http://50.112.124.237

Page 38: Crowdsourcing to structure biological knowledge (USC/ISI)

All resources should provide flat files…

Page 39: Crowdsourcing to structure biological knowledge (USC/ISI)

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Page 40: Crowdsourcing to structure biological knowledge (USC/ISI)

http://www.flickr.com/photos/archana3k1/4124330493/

Seven million human hours

Page 41: Crowdsourcing to structure biological knowledge (USC/ISI)

Twenty million human hours

http://www.flickr.com/photos/ableman/2171326385/

Page 42: Crowdsourcing to structure biological knowledge (USC/ISI)

150 billion human hours per year

http://www.flickr.com/photos/rvp-cw/6243289302/

Page 43: Crowdsourcing to structure biological knowledge (USC/ISI)

Using games to fold proteins

Fold.it players have successfully:
• Outperformed state of the art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously-intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of de novo designed enzyme (Eiben, Nat Biotechnol, 2011)

Page 44: Crowdsourcing to structure biological knowledge (USC/ISI)

Using games to fold RNAs

http://eterna.cmu.edu/

Page 45: Crowdsourcing to structure biological knowledge (USC/ISI)

Using games to align sequences

http://phylo.cs.mcgill.ca

Page 46: Crowdsourcing to structure biological knowledge (USC/ISI)

Using games to annotate gene-disease links

http://genegames.org

If it's ‘right’, you get points

then on to the next question

Click the related disease

hurry!

Page 47: Crowdsourcing to structure biological knowledge (USC/ISI)

Dizeez players seem pretty smart…

In total:
• 207 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences Gene Disease

7 GAST gastrinoma

7 RBP3 retinoblastoma

7 SSX1 synovial sarcoma

6 TG Graves' disease

6 CRYGC Cataract

6 SOX8 mental retardation

6 WRN Werner syndrome

6 ABL1 leukemia

6 MLL3 leukemia

6 SNAI2 breast carcinoma

Pubmed OMIM PharmGKB Gene Wiki

Page 48: Crowdsourcing to structure biological knowledge (USC/ISI)

Dizeez players seem pretty smart…

# Occurrences Gene Disease

5 MECOM sarcoma

4 ATF7 cancer

3 ABCB5 acute myeloid leukemia

3 SART1 glioblastoma

3 NCK1 leukemia

3 NEK1 cancer

Pubmed OMIM PharmGKB Gene Wiki

In total:
• 207 unique gamers
• 1045 games played
• 8525 guesses
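The "# Occurrences" tables on these two slides amount to counting how often players made the same gene-disease guess and comparing each pair against existing sources. A rough sketch of that tally follows. The placeholder gene and disease names, the `min_occurrences` cutoff, and the function name `tally_guesses` are all assumptions made for illustration; no real Dizeez data is used.

```python
from collections import Counter


def tally_guesses(guesses, known_links, min_occurrences=2):
    """Split repeated gene-disease guesses into already-known and candidate-novel pairs.

    guesses: iterable of (gene, disease) tuples, one per player guess.
    known_links: set of (gene, disease) tuples already present in reference sources.
    """
    counts = Counter(guesses)
    supported, novel = [], []
    for pair, n in counts.most_common():
        if n < min_occurrences:
            continue  # ignore pairs guessed too rarely to be informative
        (supported if pair in known_links else novel).append((pair, n))
    return supported, novel


# Placeholder data: three repeated guesses, two repeated guesses, one singleton.
guesses = (
    [("GENE_A", "disease X")] * 3
    + [("GENE_B", "disease Y")] * 2
    + [("GENE_C", "disease Z")]
)
known = {("GENE_A", "disease X")}

supported, novel = tally_guesses(guesses, known)
```

Pairs that land in `supported` act as a sanity check on the players, while pairs in `novel` are the candidate annotations that reference sources do not yet contain.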

Page 49: Crowdsourcing to structure biological knowledge (USC/ISI)

GenESP: Two-player annotation games

Page 50: Crowdsourcing to structure biological knowledge (USC/ISI)

COMBO: Genomic predictors for disease

cancer normal

find patterns

make predictions on new samples

cancer

normal
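The two steps named on this slide, finding patterns in labeled samples and then predicting labels for new samples, describe a supervised classification task. As a generic illustration only, here is a nearest-centroid classifier in plain Python. It is not the method used in COMBO, and the feature values and function names are invented for the example.

```python
def centroid(rows):
    """Average each feature column across a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train(samples):
    """Find patterns: compute one centroid per class from (features, label) pairs."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, features):
    """Make a prediction: return the label whose centroid is closest to the sample."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))

    return min(model, key=lambda label: squared_distance(model[label]))


# Invented two-feature training data (e.g. expression levels of two genes).
training = [
    ([5.0, 1.0], "cancer"),
    ([6.0, 2.0], "cancer"),
    ([1.0, 5.0], "normal"),
    ([2.0, 6.0], "normal"),
]
model = train(training)
label = predict(model, [5.5, 1.5])
```

Which features (genes) go into the vectors largely determines how well such a predictor generalizes to new samples, whatever the classifier.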

Page 51: Crowdsourcing to structure biological knowledge (USC/ISI)

COMBO: Genomic predictors for disease

Page 52: Crowdsourcing to structure biological knowledge (USC/ISI)

COMBO: Genomic predictors for disease

Page 53: Crowdsourcing to structure biological knowledge (USC/ISI)

COMBO: Genomic predictors for disease

Page 54: Crowdsourcing to structure biological knowledge (USC/ISI)

COMBO: Genomic predictors for disease

Page 55: Crowdsourcing to structure biological knowledge (USC/ISI)

COMBO: Genomic predictors for disease

Page 56: Crowdsourcing to structure biological knowledge (USC/ISI)

COMBO: Genomic predictors for disease

Page 57: Crowdsourcing to structure biological knowledge (USC/ISI)

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Page 58: Crowdsourcing to structure biological knowledge (USC/ISI)

Collaborators

Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members

Erik Clarke
Ben Good
Salvatore Loguercio
Ian Macleod
Chunlei Wu

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820)

Contact

http://sulab.org
[email protected]
@andrewsu
+Andrew Su

Summer internships for students!

Recruiting graduate students in quantitative biology! See http://education.scripps.edu/