Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Cultivating and mining the Gene Wiki for crowdsourced gene annotation ISMB Bio-Ontologies SIG July 14, 2011 Andrew Su, Ph.D.

description

Keynote presentation at the ISMB Bio-ontologies SIG (Vienna, Austria) on July 15, 2011. (Apologies, I occasionally use animations that obscure some slide content, so feel free to download the PowerPoint version to see what's underneath...)

Transcript of Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Page 1: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Cultivating and mining the Gene Wiki for crowdsourced gene annotation

ISMB Bio-Ontologies SIG

July 14, 2011

Andrew Su, Ph.D.

Page 2: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

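Few genes are well annotated…

[Figure: counts per gene (y-axis: Counts; x-axis: Genes, sorted by decreasing counts), plotted for Gene Ontology and PubMed. Labeled values: 38%, 59%]

Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE

Data: NCBI gene2pubmed, August 2010

23,278 protein-coding genes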
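The per-gene counts behind a plot like this can be tallied from gene-to-article links. The sketch below is only an illustration: it assumes the tab-separated tax_id / GeneID / PubMed_ID layout of gene2pubmed, and the rows in SAMPLE and the function name articles_per_gene are made up for the example, not taken from the talk.

```python
from collections import Counter

# Toy rows in the assumed gene2pubmed layout (tax_id, GeneID, PubMed_ID).
# All values here are invented for illustration.
SAMPLE = """#tax_id\tGeneID\tPubMed_ID
9606\t7157\t1001
9606\t7157\t1002
9606\t7157\t1003
9606\t7124\t1002
9606\t7124\t1004
9606\t348\t1005
10090\t22059\t1006
"""


def articles_per_gene(text, tax_id="9606"):
    """Count distinct PubMed IDs linked to each gene of one taxon."""
    seen = set()
    counts = Counter()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        tax, gene, pmid = line.split("\t")
        if tax == tax_id and (gene, pmid) not in seen:
            seen.add((gene, pmid))
            counts[gene] += 1
    return counts


counts = articles_per_gene(SAMPLE)
ranked = counts.most_common()  # genes sorted by decreasing counts
```

Sorting the genes by decreasing count, as in `ranked`, gives the x-axis ordering named on the slide.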

Page 3: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

… because the literature is sparsely curated?

[Figure: Number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000]

Page 4: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

… because the literature is sparsely curated?

[Figure labels: Average capacity of human scientist; Number of articles read by typical scientist. Axis ticks: 0, 10, 20]

Page 5: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Page 6: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 7: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

The Long Tail is a prolific source of content

[Figure: Content produced vs. Contributors (sorted), with a Short Head and a Long Tail]

Category: Short Head / Long Tail
Publishing: Newspapers / Blogs
Video: TV/Hollywood / YouTube
Product reviews: Consumer reports / Amazon reviews
Food reviews: Food critics / Yelp
Judging: Olympics / American Idol

Page 8: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Wikipedia is reasonably accurate

Page 9: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Wikipedia has breadth and depth

[Figure: Wikipedia vs. Britannica Online, compared by Articles, Words (millions), and Words/article]

http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

Page 10: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Page 11: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Page 12: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Wiki success depends on a positive feedback loop

Gene wiki page utility

Number of users

Number of contributors

1001

2002

Page 13: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 14: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

A review article for every gene is powerful

Hyperlinks to related concepts

References to the literature

Reelin: 68 editors, 543 edits since July 2002

Heparin: 175 editors, 320 edits since June 2003

AMPK: 44 editors, 84 edits since March 2004

RNAi: 232 editors, 708 edits since October 2002

Page 15: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Gene Wiki has a diverse critical mass of readers

Utility

Users

Contributors

Rank 1-10: General society

Insulin, Titin, Human chorionic gonadotropin, Vasopressin, ANKH, CLOCK, Catalase, Erythropoietin, Glucagon, Parathyroid hormone

Rank 101-110: Scientists

Tau protein, Interleukin 10, APC, C-Met, Factor V, Interleukin 8, CD44, Histamine H1 receptor, Kappa Opioid receptor, Dihydrofolate reductase

Rank 1001-1010: Specialists

CSDA, CNTNAP2, IGSF8, Adenosine A3 receptor, RYR1, ETV6, Small heterodimer partner, 5-HT1D receptor, TRPC6, Interleukin-6 receptor

Total: 5.0 million views / month

Page 16: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Readership is poised to grow

Utility

Users

Contributors

Page 17: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

The Gene Wiki has a critical mass of editors

Utility

Users

Contributors

In Jan – Jun 2010 …

… 7474 edits were made by 2109 unique users

… total increase in text ≈ 20 PLoS Biology research articles

[Figure axes: Editor count (Editors); Edit count (Edits)]

Page 18: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Making the Gene Wiki more reliable

The company name is derived from old Greek, and means "destroyer of birds".

Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …

Page 19: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Making the Gene Wiki more reliable

http://www.wikitrust.net/

The company name is derived from old Greek, and means "destroyer of birds".

Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …

High-trust author: 36211 total edits

Low-trust author: 36 total edits

Page 20: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Making the Gene Wiki more computable

Structured annotations

Free text

Page 21: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Example text from 5-HT1A receptor

Snippet from article on 5-HT1A receptor:

“…5-HT1A receptor agonists decrease blood pressure and heart rate or cause hypotension via a central mechanism, by inducing peripheral vasodilation, and by stimulating the vagus nerve…”

Receptor

Agonists

Heart rate

Blood pressure

Hypotension

Vagus nerve

Vasodilation

Page 22: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Example text from 5-HT1A receptor

5-HT1A receptor

Receptor

Agonists

Heart rate

Blood pressure

Hypotension

Vagus nerve

Vasodilation

Page 23: Cultivating and mining the Gene Wiki for crowdsourced gene annotation


Page 24: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Re-discovering common knowledge

Wikilink

GO exact synonym

Gene Wiki mapping

NCBI Entrez Gene: 3362

GO:0004993

Candidate assertion

Page 25: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Mining the most recent literature

Wikilink

GO related concept

Gene Wiki mapping

NCBI Entrez Gene: 57620

GO:0030154

Candidate assertion

Page 26: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Filling the gaps in gene annotation

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

GO:0006897

Candidate assertion

Page 27: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Disease associations mined from the Gene Wiki

Gene Wiki Articles (10,271)

Filter out seeded text

NCBO Annotator

Matched Disease Ontology terms (2983)

Compare to DO database

2147 candidate annotations

23% exact match

5% match parent

2% match child

70% have no match
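The comparison step on this slide (exact match, match parent, match child, no match) could be sketched as follows. This is a minimal illustration rather than the pipeline used in the talk: the toy ontology, the term IDs, and the reading of "match parent" as "the mined term is an ancestor of an existing annotation" are all assumptions.

```python
# Hypothetical toy ontology: term -> set of direct parent terms.
PARENTS = {
    "T:leaf": {"T:mid"},
    "T:mid": {"T:root"},
    "T:root": set(),
    "T:other": {"T:root"},
}


def ancestors(term, parents=PARENTS):
    """Return every ancestor of a term by walking the parent links."""
    found, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def classify(candidate, known_terms):
    """Compare one mined term with the terms already annotated to a gene."""
    if candidate in known_terms:
        return "exact match"
    if any(candidate in ancestors(k) for k in known_terms):
        return "match parent"  # mined term is more general than a known term
    if any(k in ancestors(candidate) for k in known_terms):
        return "match child"   # mined term is more specific than a known term
    return "no match"
```

For example, `classify("T:leaf", {"T:mid"})` reports a child match under these assumptions, because the mined term refines an annotation the gene already has.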

Page 28: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Disease associations mined from the Gene Wiki

Expert curation

Correct Maybe Incorrect

86%

4%

10%

Overall specificity: 90-93%

Page 29: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

GO associations mined from the Gene Wiki

Gene Wiki Articles (10,271)

Filter out seeded text

NCBO Annotator

Matched Gene Ontology terms (11,022)

Compare to GO database

6319 candidate annotations

17% exact match

26% match parent

2% match child

55% have no match

Page 30: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

GO associations mined from the Gene Wiki

Expert curation

Correct Maybe Incorrect

60%

26%

14%

Overall specificity: 48-64%

Page 31: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Common sources of error in GO associations

OR2F1: “Olfactory receptors … are responsible for the recognition and G protein-mediated transduction of odorant signals.”

1) Incorrect concept recognition

Transduction (GO:0009293)

The transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.

Signal transduction (GO:0007165)

The cellular process in which a signal is conveyed to trigger a change in the activity or state of a cell. Signal transduction begins with reception of a signal, e.g. a ligand binding to a receptor or receptor activation by a stimulus such as light, and ends with regulation of a downstream cellular process…
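The error on this slide can be reproduced with a naive label matcher. The sketch below is not the NCBO Annotator's algorithm; it is a hypothetical verbatim matcher (the function name recognize and the two-entry dictionary are assumptions) that shows why the quoted sentence yields the phage-related term: the words "signal" and "transduction" are not adjacent in it, so only the bare label "transduction" matches.

```python
import re

# Two term labels taken from the slide; a real annotator uses a full dictionary.
TERMS = {
    "transduction": "GO:0009293",
    "signal transduction": "GO:0007165",
}


def recognize(text, terms=TERMS):
    """Return IDs of labels found verbatim, trying longer labels first and
    skipping any label that falls inside an already matched span."""
    lowered = text.lower()
    taken, hits = [], []
    for label in sorted(terms, key=len, reverse=True):
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, lowered):
            overlaps = any(s < m.end() and m.start() < e for s, e in taken)
            if not overlaps:
                taken.append(m.span())
                hits.append(terms[label])
    return hits


sentence = ("Olfactory receptors … are responsible for the recognition and "
            "G protein-mediated transduction of odorant signals.")
hits = recognize(sentence)
```

Preferring the longest label helps when "signal transduction" appears as a phrase, but it cannot rescue a sentence where the concept is expressed with the words in a different order.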

Page 32: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Common sources of error in GO associations

MEF2C: “Several post translational modifications have been identified including phosphorylation on serine-59 …”

2) Incorrect sentence context

Dephosphorylation, Excretion, Gene expression, Glycosylation, Localization, Methylation, Proteolysis, Secretion, Transport, Transcription, Translation

MEF2C

Myelination

Phosphorylation

Neurogenesis

Page 33: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Is 48-64% specificity useful?

[Diagram labels: GO term; Concept recognition; PubMed abstracts + Gene Wiki; Gene list; Enrichment analysis]

muscle contraction (GO:0006936)

87 genes

Linked genes by PubMed only: P = 1.0

Linked genes by PubMed + Gene Wiki: P = 1.22E-09

5449 articles

87 articles
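The slide does not say which statistical test produced these p-values. A common choice for enrichment analysis is a one-sided hypergeometric test, sketched here under that assumption; the function name and the toy numbers in the usage note are hypothetical and are not the talk's data.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(population, sample)
```

With a toy population of 20 genes, 5 of them linked to the term, and a gene list of 5, an overlap of 0 gives a p-value of 1.0, while an overlap of 5 gives 1/15504. The same effect is at work on the slide: when additional text sources link more of the gene list to the term, the overlap grows and the p-value falls.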

Page 34: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

GO associations improve enrichment analyses

p-value (PubMed only)

p-value (PubMed + Gene Wiki)

Muscle contraction

Page 35: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

“Like the image of the [mammoth] hairball, it is equally unhelpful in understanding the object’s properties. You can guess that the network is large and its connectivity is complex, but not more. At best, the visualization is merely decorative.”

- Martin Krzywinski

http://mkweb.bcgsc.ca/linnet/talks/linnet-informatics2010.pdf

Page 36: Cultivating and mining the Gene Wiki for crowdsourced gene annotation


TOP 100 GENES

Page 37: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Mapping to many biomedical semantic groups

Page 38: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

From text mining to a Semantic Gene Wiki

Community contributions

Semantics (Semantic representation)

Semantic querying

Gene Wiki / Wikipedia ✓ ✗

Semantic Gene Wiki – ✓ ✓

Home-grown wiki ✗ ✓ ✓

✓

?

Page 39: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Semantic Wiki Links

Gene Wiki (based on MediaWiki)

[[apoptosis]]

Rendered text: apoptosis

Semantic Gene Wiki (based on Semantic MediaWiki (SMW))

[[promote::apoptosis]]

[[repress::apoptosis]]

[[modulate::apoptosis]]

Rendered text: apoptosis

{{SWL|target=apoptosis|type=promotes}}

Mirror and translate

Semantic queries, RDF, etc
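The three link forms on this slide (plain wikilink, typed Semantic MediaWiki link, and the SWL template) can be pulled out of wikitext with regular expressions. The sketch below is an illustration only: the function name extract_links and the patterns are assumptions, and real wikitext has nesting and edge cases that a proper parser would handle.

```python
import re

# [[type::target]] is a typed link; [[target]] or [[target|label]] is a plain wikilink.
LINK = re.compile(
    r"\[\[(?:(?P<type>[^\[\]:|]+)::)?(?P<target>[^\[\]|]+)(?:\|[^\[\]]*)?\]\]"
)
# {{SWL|target=...|type=...}} in the form shown on the slide.
SWL = re.compile(r"\{\{SWL\|(?P<body>[^{}]*)\}\}")


def extract_links(wikitext):
    """Return (type, target) pairs; type is None for untyped wikilinks.
    Bracket links are listed first, then SWL templates."""
    pairs = [(m.group("type"), m.group("target")) for m in LINK.finditer(wikitext)]
    for m in SWL.finditer(wikitext):
        params = dict(
            part.split("=", 1) for part in m.group("body").split("|") if "=" in part
        )
        pairs.append((params.get("type"), params.get("target")))
    return pairs
```

Each (type, target) pair, combined with the gene the article describes, is the kind of triple that semantic queries or an RDF export could consume.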

Page 40: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

For community-based science, data is king

Data without structure is valuable, but structure without data is not.

Page 41: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

For community-based science, data is king

Data without structure is valuable, but structure without data is not.

Domain expert

Information scientist

Wikipedia

Copy-editing: WP:MCB, Boghog
Figures: Artists and illustrators
Structure: Wiki links, infoboxes
Citations: DOI bot, CitationBot
Provenance: WikiTrust

Page 42: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

The Gene Wiki successfully harnesses the Long Tail of scientists for community annotation of gene function

Page 43: Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Collaborators

Doug Howe, ZFIN
Salvatore Loguercio (*), TU Dresden
John Hogenesch, U Penn
Jon Huss, GNF
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members

Erik Clarke
Ben Good (*)
Ian Macleod
Chunlei Wu

WikiTrust (UCSC)

Luca de Alfaro
Bo Adler
Ian Pye

(*) See talk on SNPedia mashup at 1:55 PM

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820)
ISMB travel support

Contact

http://sulab.org
[email protected]
@andrewsu
+Andrew Su