Using Citizen Science to organize biomedical knowledge

Andrew Su, Ph.D. (@andrewsu)
[email protected]
http://sulab.org
March 5, 2015, Future of Genomic Medicine
Slides posted at slideshare.net/andrewsu



Candidate genes

FLNB, CTNNB1, EPHA3, SMAD3, XPO1, RPS27, FLCN, ATR, FLT3, BRD2, ERG, RAF1, EGFR, ERBB4, RARA, JAK3, LRP1, WT1, PML, SMARCA4

The biomedical literature is growing fast…

[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; vertical axis from 0 to 1,200,000]

… but it is very hard to query and compute

Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib

AND

Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma
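To make the query on this slide concrete, here is a minimal sketch of how the two term lists could be combined into one boolean search string. It assumes the terms within each list are joined with OR (the slide only shows the AND between the lists), and the function name `build_boolean_query` is hypothetical. The quoting is generic boolean syntax, not a guaranteed match for any particular search engine.

```python
# Hypothetical helper: combine two term lists into "(a OR b ...) AND (x OR y ...)".
# Assumption: terms inside each list are alternatives (OR); the slide shows only the AND.

def build_boolean_query(drugs, diseases):
    def group(terms):
        # Quote each term so multi-word names stay together as phrases.
        return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"
    return group(drugs) + " AND " + group(diseases)

DRUGS = [
    "Imatinib", "Crizotinib", "Erlotinib", "Gefitinib",
    "Sorafenib", "Lapatinib", "Dasatinib",
]
DISEASES = [
    "Acute myeloid leukemia", "Acute lymphoblastic leukemia",
    "Chronic myelogenous leukemia", "Chronic lymphocytic leukemia",
    "Hodgkin lymphoma", "Non-Hodgkin lymphoma", "Myeloma",
]

QUERY = build_boolean_query(DRUGS, DISEASES)
print(QUERY)
```

Even this small case expands to 7 × 7 = 49 drug and disease pairings, and a keyword match says nothing about how the drug and the disease are related in the text.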

Pathways, Diseases, Proteins, Variants, Genes, Drugs

Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.

Information Extraction

1. Identify high level concepts in text

2. Identify relationships between concepts

NCBI Disease Corpus

593 PubMed abstracts

12 expert annotators (2 per document)

6,900 “disease concept” mentions

Doğan and Lu. Proceedings of the 2012 Workshop on BioNLP, 2012, 91-9.

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

Amazon Mechanical Turk (AMT)

Requester, Amazon, Workers

1. Create tasks

2. Execute

3. Aggregate

Experimental design

Task: Identify the “disease concepts” in the 593 abstracts from the NCBI disease corpus

– $0.06 per Human Intelligence Task (HIT)

– HIT = annotate one abstract from PubMed

– 15 workers annotate each abstract

Comparison to gold standard

[Figure: precision and recall against the gold standard]

K = 6, F score = 0.87

• 593 documents

• 15 users / doc

• 9 days

• 145 workers

• $630.96
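The aggregation and scoring behind this comparison can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the experiment: it reads K as a voting threshold (an annotation is kept when at least K workers marked it), it compares annotations by exact match, and the toy annotations and function names are hypothetical.

```python
from collections import Counter

def aggregate_by_votes(worker_annotations, k):
    """Keep every annotation that at least k workers marked."""
    votes = Counter()
    for annotations in worker_annotations:
        votes.update(set(annotations))  # one vote per worker per annotation
    return {annotation for annotation, count in votes.items() if count >= k}

def precision_recall_f(predicted, gold):
    """Exact-match precision, recall and balanced F score between two sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Toy data (hypothetical): three workers annotating one abstract.
WORKERS = [
    {"breast cancer", "ovarian cancer"},
    {"breast cancer", "tumor"},
    {"breast cancer", "ovarian cancer", "BRCA1"},
]
GOLD = {"breast cancer", "ovarian cancer", "hereditary cancer"}

CONSENSUS = aggregate_by_votes(WORKERS, k=2)
PRECISION, RECALL, F_SCORE = precision_recall_f(CONSENSUS, GOLD)
print(sorted(CONSENSUS), PRECISION, RECALL, F_SCORE)
```

Raising the threshold trades recall for precision: fewer annotations survive the vote, but the ones that do are backed by more workers.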

Comparisons to text-mining algorithms

[Figure: F score of text-mining algorithms and of the AMT experiments]

Comparisons to human annotators

Average level of agreement between expert annotators (stage 1): F = 0.76

Average level of agreement between expert annotators (stage 2): F = 0.87

Does Mechanical Turk scale?

1,000,000 articles per year × 10 annotators / article × 4 tasks / doc × $0.06 / task = $2,400,000 / year
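The cost estimate on this slide is a straight product of the four factors. A minimal sketch that reproduces it, working in integer cents to avoid floating-point rounding (the variable names are illustrative):

```python
# Factors as given on the slide.
ARTICLES_PER_YEAR = 1_000_000
ANNOTATORS_PER_ARTICLE = 10
TASKS_PER_DOC = 4
COST_PER_TASK_CENTS = 6  # $0.06 per task

annual_cost_cents = (
    ARTICLES_PER_YEAR * ANNOTATORS_PER_ARTICLE * TASKS_PER_DOC * COST_PER_TASK_CENTS
)
annual_cost_dollars = annual_cost_cents // 100
print(f"${annual_cost_dollars:,} / year")
```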

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts, and will they do it for free?

http://mark2cure.org

Mark2Cure Campaign #0

• Goal: replicate the NCBI disease corpus

– 593 documents, 15x redundancy

• Launched Jan 19, 2015

• Completed Feb 16, 2015

– 4 weeks

– 10,275 document annotation events

– 212 unique users

Comparison to gold standard

[Figure: precision and recall, axes from 0 to 1; label: voting threshold]

k = 6, F score = 0.84

Total cost: $0

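Does Citizen Science scale?

AE = Annotation events

$$
\text{volunteers needed}
= \frac{\text{number of annotation events per year}}{\text{number of annotation events per year per volunteer}}
= \frac{1{,}000{,}000\ \text{articles} \times 10\ \text{AE/article}}{\dfrac{10{,}275\ \text{AE} \times 365\ \text{days}}{212\ \text{annotators} \times 28\ \text{days}}}
\approx 15{,}828
$$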
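The volunteer estimate can be reproduced directly from the Campaign #0 numbers. A minimal sketch, assuming volunteers keep contributing at the average rate observed during the 28-day campaign (variable names are illustrative):

```python
# Demand: annotation events (AE) needed per year.
ARTICLES_PER_YEAR = 1_000_000
AE_PER_ARTICLE = 10
ae_needed_per_year = ARTICLES_PER_YEAR * AE_PER_ARTICLE

# Supply: AE per volunteer per year, extrapolated from Campaign #0.
CAMPAIGN_AE = 10_275
CAMPAIGN_ANNOTATORS = 212
CAMPAIGN_DAYS = 28
ae_per_volunteer_per_year = CAMPAIGN_AE * 365 / (CAMPAIGN_ANNOTATORS * CAMPAIGN_DAYS)

volunteers_needed = ae_needed_per_year / ae_per_volunteer_per_year
print(round(volunteers_needed))
```

Each volunteer contributes roughly 632 annotation events per year at the campaign rate, so 10,000,000 events per year require about 15,828 volunteers.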

Does Citizen Science scale?

15,828 volunteers needed

[Figure: volunteer counts shown alongside: 175,000 volunteers; 300,000 volunteers; 37,000 volunteers; 1,000,000 volunteers]

Annotating the relationships

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

subject: GENE

predicate: therapeutic target

object: DISEASE
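The subject, predicate and object structure on this slide maps naturally onto a small data type. The sketch below is illustrative only: the class names are hypothetical, and pairing CDK9 with "hematologic malignancies" is one plausible reading of the passage rather than an annotation taken from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    text: str          # the mention as it appears in the abstract
    concept_type: str  # e.g. "GENE" or "DISEASE"

@dataclass(frozen=True)
class Relation:
    subject: Concept
    predicate: str
    obj: Concept

    def as_triple(self):
        """Return the relation as a plain (subject, predicate, object) tuple."""
        return (self.subject.text, self.predicate, self.obj.text)

# Assumed example reading of the passage.
cdk9 = Concept("CDK9", "GENE")
malignancies = Concept("hematologic malignancies", "DISEASE")
relation = Relation(cdk9, "therapeutic target", malignancies)
print(relation.as_triple())
```

Storing relations as typed triples is what makes the extracted knowledge computable: triples from many abstracts can be merged into one network and queried by concept type.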

Candidate genes

FLNB, CTNNB1, EPHA3, SMAD3, XPO1, RPS27, FLCN, ATR, FLT3, BRD2, ERG, RAF1, EGFR, ERBB4, RARA, JAK3, LRP1, WT1, PML, SMARCA4

Mark2Cure: Ben Good, Max Nanis, Ginger Tsueng, Chunlei Wu, All Mark2Curators!

Other group members: Cyrus Afrasiabi, Sebastian Burgstaller, Ramya Gamini, Louis Gioia, Salvatore Loguercio, Adam Mark, Erick Scott, Greg Stupp, Andra Waagmeester, Kevin Xin

Matt and Cristina Might, NGLY1 community

Funding and Support: BioGPS: GM83924; Gene Wiki: GM089820; BD2K Center of Excellence: GM114833

Contact: http://sulab.org, [email protected], @andrewsu, +Andrew Su

Icon credits (Noun Project, Wikimedia Commons): Zach VanDeHey, hunotika, Viktorvoigt, Alberto Rojas, Lloyd Humphreys

Why do I Mark2Cure?

I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use.

My 4 year old daughter Phoebe is living with and battling rare disease.

I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care.

Take part in something that helps humanity.

I Mark2Cure in memory of my son Mike who had type 1 diabetes.

Studied biology in college and I really miss it!

In memory of my daughter who had Cystic Fibrosis

To give back