Gene Wiki and Mark2Cure update for BD2K

Gene Wiki and Mark2Cure update for BD2K Benjamin Good, Ph.D. @bgood [email protected] April 17, 2015


Page 1: Gene Wiki and Mark2Cure update for BD2K

Gene Wiki and Mark2Cure update for BD2K

Benjamin Good, Ph.D. @bgood

[email protected]

April 17, 2015

Page 2: Gene Wiki and Mark2Cure update for BD2K

The challenge: make biomedical knowledge organized, accessible, and computable

[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000]

Page 3: Gene Wiki and Mark2Cure update for BD2K

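Our strategy taps into the Long Tail

[Figure: content produced (y-axis) by contributors, sorted (x-axis), split into a Short Head and a Long Tail]

• News: newspapers (Short Head); blogs (Long Tail)
• Video: TV/Hollywood (Short Head); YouTube (Long Tail)
• Product reviews: Consumer reports (Short Head); Amazon reviews (Long Tail)
• Food reviews: food critics (Short Head); Yelp (Long Tail)
• Talent judging: Olympics (Short Head); American Idol (Long Tail)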

Page 4: Gene Wiki and Mark2Cure update for BD2K

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 5: Gene Wiki and Mark2Cure update for BD2K

From crowdsourcing to structured data

The Gene Wiki

Mark2Cure

Page 6: Gene Wiki and Mark2Cure update for BD2K

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Review article

Page 7: Gene Wiki and Mark2Cure update for BD2K

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 8: Gene Wiki and Mark2Cure update for BD2K

Wikis depend on a positive feedback loop

Gene wiki page utility

Number of users

Number of contributors

1001

2002

Page 9: Gene Wiki and Mark2Cure update for BD2K

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Utility

Users

Contributors

Page 10: Gene Wiki and Mark2Cure update for BD2K

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Utility

Users

Contributors

Page 11: Gene Wiki and Mark2Cure update for BD2K

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words

Approximately equal to 230 full-length articles

Good, NAR, 2011

Utility

Users

Contributors

[Figure: editor count (Editors) and edit count (Edits)]

Page 12: Gene Wiki and Mark2Cure update for BD2K

A review article for every gene is powerful

References to the literature

Hyperlinks to related concepts

Reelin: 98 editors, 703 edits since July 2002

Heparin: 358 editors, 654 edits since June 2003

AMPK: 109 editors, 203 edits since March 2004

RNAi: 394 editors, 994 edits since October 2002

Page 13: Gene Wiki and Mark2Cure update for BD2K

Collaborating with the journal Gene for recruiting

• Authors write standard review article for Gene

• Also required to create or update Gene Wiki article

• 10 complete, 20 more in process

Su, Good and van Wijnen (2013)

Page 14: Gene Wiki and Mark2Cure update for BD2K

Gene Wiki as a tool

• Mechanism for collaboration amongst teams working on gene annotations

• Don’t roll your own wiki if you can do the same job on Wikipedia!


Page 15: Gene Wiki and Mark2Cure update for BD2K

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Text-mining

Good, BMC Genomics, 2011

Page 16: Gene Wiki and Mark2Cure update for BD2K

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Text-mining

http://fiehnlab.ucdavis.edu/projects/rice_metabolome/

Page 17: Gene Wiki and Mark2Cure update for BD2K

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Text-mining

Page 18: Gene Wiki and Mark2Cure update for BD2K

Making the Gene Wiki more computable

Structured annotations

Free text

Databases

Page 19: Gene Wiki and Mark2Cure update for BD2K

Making the Gene Wiki more computable

Structured annotations

Free text

Page 20: Gene Wiki and Mark2Cure update for BD2K

Making the Gene Wiki more computable

Structured annotations

Free text

Page 21: Gene Wiki and Mark2Cure update for BD2K

Wikidata

“Provide a database of the world’s knowledge that anyone can edit”

- Denny Vrandečić

Page 22: Gene Wiki and Mark2Cure update for BD2K

Centralizing key data storage

Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf

Page 23: Gene Wiki and Mark2Cure update for BD2K

Centralizing key data storage

Page 24: Gene Wiki and Mark2Cure update for BD2K

Centralizing key data storage

Page 25: Gene Wiki and Mark2Cure update for BD2K

Centralizing key data storage

287 language editions of Wikipedia

Bioinformatics community

Page 26: Gene Wiki and Mark2Cure update for BD2K

Loading biological data into Wikidata

Entrez Gene

Ensembl

UniProt

UCSC

PDB

RefSeq

Page 27: Gene Wiki and Mark2Cure update for BD2K

Wikidata for biology

is a

regulates

Interacts with

Protein

Glycoprotein

Neural development

VLDL receptor

Amyloid precursor protein

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

Reelin

http://www.wikidata.org/wiki/Q414043

Page 28: Gene Wiki and Mark2Cure update for BD2K

Wikidata for biology

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
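The request on this slide can also be issued from a script. The following is a minimal sketch, not code from the talk: the helper names (`entity_url`, `english_label`, `fetch_entity`) and the trimmed sample response are illustrative assumptions, and the `format=json` parameter is added so the API returns JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://www.wikidata.org/w/api.php"

def entity_url(qid, language="en"):
    """Build a wbgetentities request URL for one item (hypothetical helper)."""
    params = {"action": "wbgetentities", "ids": qid,
              "languages": language, "format": "json"}
    return API + "?" + urlencode(params)

def english_label(response, qid):
    """Extract the English label from a parsed wbgetentities response."""
    return response["entities"][qid]["labels"]["en"]["value"]

def fetch_entity(qid):
    """Fetch and parse the entity document.

    Needs network access; the server may also expect a descriptive
    User-Agent header, which this sketch does not set.
    """
    with urlopen(entity_url(qid)) as handle:
        return json.load(handle)

# Hand-written stand-in for the shape of a real response, trimmed to the label.
sample = {"entities": {"Q414043": {"labels": {"en": {"language": "en",
                                                      "value": "Reelin"}}}}}
print(entity_url("Q414043"))
print(english_label(sample, "Q414043"))
```

The full response also carries the item's claims, keyed by property identifiers such as P31.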

Page 29: Gene Wiki and Mark2Cure update for BD2K

Current progress

• All human and mouse genes and proteins loaded

• All diseases (Human Disease Ontology) loaded

• Dataset of FDA-approved drugs in preparation

• Datasets for gene-disease, drug-disease, and drug-protein relationships in preparation


Page 30: Gene Wiki and Mark2Cure update for BD2K

Gene Wiki(Data) as a tool

• Mechanism for collaboration amongst teams working on biomedical data

• Don’t roll your own open public database if you can do the same job on Wikidata!

Page 31: Gene Wiki and Mark2Cure update for BD2K

The Long Tail of scientists is a valuable source of information on gene function

Page 32: Gene Wiki and Mark2Cure update for BD2K

From crowdsourcing to structured data

The Gene Wiki

Mark2Cure

Page 33: Gene Wiki and Mark2Cure update for BD2K

The biomedical literature is growing fast…

[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000]

Page 34: Gene Wiki and Mark2Cure update for BD2K

… but it is very hard to query and compute

Page 35: Gene Wiki and Mark2Cure update for BD2K

… but it is very hard to query and compute

Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib

AND

Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma
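To make the difficulty concrete, here is a rough sketch of the boolean search string needed to cover the two term lists on this slide. It assumes that terms within each list are combined with OR (the slide only shows the AND between the lists) and that each term is quoted as a phrase; the function names are hypothetical.

```python
DRUGS = ["Imatinib", "Crizotinib", "Erlotinib", "Gefitinib",
         "Sorafenib", "Lapatinib", "Dasatinib"]
DISEASES = ["Acute myeloid leukemia", "Acute lymphoblastic leukemia",
            "Chronic myelogenous leukemia", "Chronic lymphocytic leukemia",
            "Hodgkin lymphoma", "Non-Hodgkin lymphoma", "Myeloma"]

def any_of(terms):
    """Join terms with OR, quoting each so multi-word names stay phrases."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

def build_query(drugs, diseases):
    """Combine the two groups with AND, as on the slide."""
    return any_of(drugs) + " AND " + any_of(diseases)

query = build_query(DRUGS, DISEASES)
print(query)
print(len(DRUGS) * len(DISEASES), "drug-disease pairs covered by one query")
```

Even with such a query, the result is a list of documents rather than the drug-disease relationships themselves, which is the gap the following slides address.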

Page 36: Gene Wiki and Mark2Cure update for BD2K

Extracting semantic networks from PubMed with the crowd’s help

Documents

Network of linked concepts

Page 37: Gene Wiki and Mark2Cure update for BD2K

Information Extraction

1. Find mentions of high level concepts in text

2. Map mentions to specific terms in ontologies

3. Identify relationships between concepts

Page 38: Gene Wiki and Mark2Cure update for BD2K

Disease mentions in PubMed abstracts

NCBI Disease corpus

• 793 PubMed abstracts

• (100 development, 593 training, 100 test)

• 12 expert annotators (2 annotate each abstract)

• 6,900 “disease” mentions

Doğan, Rezarta Islamaj, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Page 39: Gene Wiki and Mark2Cure update for BD2K

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?


Page 40: Gene Wiki and Mark2Cure update for BD2K

The Mechanical Turk

http://en.wikipedia.org/wiki/The_Turk

Page 41: Gene Wiki and Mark2Cure update for BD2K

The Mechanical Turk

http://en.wikipedia.org/wiki/The_Turk

Page 42: Gene Wiki and Mark2Cure update for BD2K

Amazon Mechanical Turk (AMT)

Requester

Amazon

For each task, specify:

• a qualification test

• how many workers per task

• how much we will pay per task

Manages:

• parallel execution of jobs

• worker access to tasks via qualification tests

• payments

• task advertising

Workers

1. Create tasks

2. Execute

3. Aggregate

Page 43: Gene Wiki and Mark2Cure update for BD2K

Instructions to workers

• Highlight all diseases and disease abbreviations
– “...are associated with Huntington disease ( HD )... HD patients received...”
– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”

• Highlight the longest span of text specific to a disease
– “... contains the insulin-dependent diabetes mellitus locus …”

• Highlight disease conjunctions as single, long spans.
– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”

• Highlight symptoms - physical results of having a disease
– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Page 44: Gene Wiki and Mark2Cure update for BD2K

Qualification test

Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3'-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”

Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”

Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”

26 yes / no questions

Page 45: Gene Wiki and Mark2Cure update for BD2K

Simple annotation interface

Click to see instructions

Highlight disease mentions

Page 46: Gene Wiki and Mark2Cure update for BD2K

Experimental design

• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
– $0.06 per Human Intelligence Task (HIT)
– HIT = annotate one abstract from PubMed
– multiple workers annotate each abstract

Page 47: Gene Wiki and Mark2Cure update for BD2K

Aggregation function based on simple voting

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

[Figure: the passage above shown once per vote threshold, with the spans retained at each: 1 or more votes (K=1), K=2, K=3, K=4]
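The voting rule on this slide can be sketched in a few lines: a highlighted span is kept when at least K workers marked it. This is an illustration under stated assumptions rather than the project's code. Annotations are represented as (start, end) character offsets, and two spans count as the same only when both offsets match exactly; the slide does not say how partial overlaps were handled.

```python
from collections import Counter

def aggregate_by_votes(worker_annotations, k):
    """Keep every span highlighted by at least k workers.

    worker_annotations: one set of (start, end) character spans per worker.
    Exact-offset matching is an assumption made for this sketch.
    """
    votes = Counter()
    for spans in worker_annotations:
        votes.update(set(spans))  # each worker votes at most once per span
    return {span for span, count in votes.items() if count >= k}

# Three hypothetical workers annotating the same abstract.
workers = [
    {(10, 18), (40, 62)},
    {(10, 18), (40, 62), (70, 75)},
    {(10, 18)},
]
for k in (1, 2, 3):
    print(k, sorted(aggregate_by_votes(workers, k)))
```

Raising K trades recall for precision: K=1 keeps anything any worker marked, while a high K keeps only spans with broad agreement.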

Page 48: Gene Wiki and Mark2Cure update for BD2K

Comparison to gold standard

F = 0.87, k = 6

• 593 documents
• 15 users / doc
• 9 days
• $630.96

Precision

Recall

Good, PSB, 2015
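For readers unfamiliar with the metric, the sketch below shows how precision, recall, and F are typically computed when crowd output is compared with a gold standard. It assumes F denotes the balanced F-measure (harmonic mean of precision and recall) and that spans are scored by exact match; the spans are made-up values, not data from the study.

```python
def precision_recall_f(predicted, gold):
    """Score predicted spans against gold-standard spans (exact match)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

gold = {(10, 18), (40, 62), (90, 99)}        # hypothetical expert spans
predicted = {(10, 18), (40, 62), (70, 75)}   # hypothetical crowd output
p, r, f = precision_recall_f(predicted, gold)
print(round(p, 3), round(r, 3), round(f, 3))
```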

Page 49: Gene Wiki and Mark2Cure update for BD2K

In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.

Page 50: Gene Wiki and Mark2Cure update for BD2K

Information Extraction

1. Find mentions of high level concepts in text

2. Map mentions to specific terms in ontologies

3. Identify relationships between concepts

Page 51: Gene Wiki and Mark2Cure update for BD2K

Annotating the relationships

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

subject

predicate

object

GENE

DISEASE
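One way to picture the output of this step is as a subject-predicate-object record linking two typed concepts. The sketch below is only an illustration: the class names and the predicate label are hypothetical, and the example relation is one possible reading of the passage above.

```python
from typing import NamedTuple

class Concept(NamedTuple):
    text: str   # mention as it appears in the abstract
    kind: str   # concept type, e.g. "GENE" or "DISEASE"

class Relation(NamedTuple):
    subject: Concept
    predicate: str
    object: Concept

relation = Relation(
    subject=Concept("CDK9", "GENE"),
    predicate="therapeutic target in",   # hypothetical predicate label
    object=Concept("hematologic malignancies", "DISEASE"),
)
print(f"{relation.subject.text} --[{relation.predicate}]--> {relation.object.text}")
```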

Page 52: Gene Wiki and Mark2Cure update for BD2K

Does Mechanical Turk scale?

1,000,000 articles per year

10 annotators / article

4 tasks / doc

$0.06 / task

$ 2,400,000 / year
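The yearly figure follows from multiplying the four quantities on this slide. A short check, using integer cents to avoid floating-point rounding (the variable and function names are illustrative):

```python
ARTICLES_PER_YEAR = 1_000_000
ANNOTATORS_PER_ARTICLE = 10
TASKS_PER_DOC = 4
CENTS_PER_TASK = 6  # $0.06 per task

def yearly_cost_dollars(articles, annotators, tasks, cents_per_task):
    """Total yearly cost in whole dollars."""
    total_cents = articles * annotators * tasks * cents_per_task
    return total_cents // 100

total = yearly_cost_dollars(ARTICLES_PER_YEAR, ANNOTATORS_PER_ARTICLE,
                            TASKS_PER_DOC, CENTS_PER_TASK)
print(f"${total:,} / year")
```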

Page 53: Gene Wiki and Mark2Cure update for BD2K


http://mark2cure.org

Page 54: Gene Wiki and Mark2Cure update for BD2K

Key stats

• Launched Jan 19, 2015
• Stopped Feb 16, 2015
• In 4 weeks
– 10,275 documents annotated
– (589 docs, 15+ annotators per doc)
– 212 unique users
– Reproduced AMT results
– Paid zero dollars

Page 55: Gene Wiki and Mark2Cure update for BD2K

Current work in progress

• Expanding to identify genes, drugs, and diseases.

• Targeting a new volunteer campaign about May 1.

• Ongoing experiments with relationship identification/verification.


Page 56: Gene Wiki and Mark2Cure update for BD2K

Mark2Cure as a tool?

• Seeking specific use cases for information extraction, and collaborators in text mining interested in exploring the interplay with the crowd.

Page 57: Gene Wiki and Mark2Cure update for BD2K


Funding and Support

BioGPS: GM83924

Gene Wiki: GM089820

BD2K COE: GM114833

Max Nanis

Ginger Tsueng

Chunlei Wu

Andrew Su

Andra Waagmeester

Elvira Mitraka

Lynn Schriml

Sebastian Burgstaller

Gang Fu

Evan Bolton

Paul Pavlidis

Peter Robinson

Many WikiDatans

John Huss

Erik Clarke

Many Wikipedians

The Prince of Crowdsourcing