Beiko dcsi2013

Classifying biological information: the promise and perils of DNA sequences. Rob Beiko, September 19, DCSI 2013. Norm MacDonald, Donovan Parks.

Rob Beiko's keynote presentation at DCSI 2013 (http://dcsi.cs.dal.ca/DCSI2013/index.php)

Transcript of Beiko dcsi2013

Page 1: Beiko dcsi2013

Classifying biological information

the promise and perils of DNA sequences

Rob Beiko

September 19

DCSI 2013

Norm MacDonald, Donovan Parks

1

Page 2: Beiko dcsi2013

From Francis Crick’s letter to his son Michael, 1953

2

Page 3: Beiko dcsi2013

Your Genome and You

23 chromosomes, 20,000 genes

3.1 billion nucleotides

Mycobacterium tuberculosis

1 chromosome, 4,000 genes

4.4 million nucleotides

Tremblaya princeps

1 chromosome, 110 genes

138,931 nucleotides

Daphnia pulex

12 chromosomes, 31,000 genes

200 million nucleotides

Paris japonica

?? chromosomes, ??? genes

150 billion nucleotides

3

Page 4: Beiko dcsi2013

DNA Encodes the Business of the Cell

Chromosome

Chromosome region

Gene GGATCCTATGGATGCATGCCGCCGTAGTATAAT…

Protein

Protein functions:
- Copying the genome and the cell
- Transport into and out of the cell
- Energy production and storage
- Cellular defense
- etc…

4

Page 5: Beiko dcsi2013

Three key questions

(1) What genes in an organism’s genome are responsible for its unique properties? For example:
- Ability to withstand environmental challenges
- Developmental “plan”
- Sources of nutrients

(2) How can we use properties of an organism’s genome as a “fingerprint” to identify that organism?

(3) What mutations to an organism’s genome (including single base changes) are responsible for altered properties of that organism?

5

Page 6: Beiko dcsi2013

Microbes: hot or not?

+ ++ +++ +++++++

Strain 121

MacDonald, NJ and Beiko, RG (2010). Efficient learning of microbial genotype–phenotype association rules. Bioinformatics 26: 1834-1840.

6

Page 7: Beiko dcsi2013

Beating the heat

Proteins tend to stop working at temperatures above 37-40° C

Heat shock: “Things are getting uncomfortable here”

Extreme heat shock: “Make it stop make it stop make it stop!!!!”

What does an organism need to get by at higher temperatures?

(1) Specific proteins that help keep everything working
(2) Changes to all proteins that make them more heat tolerant
(3) Various other things

7

Page 8: Beiko dcsi2013

The “genotype-phenotype association” problem

Genotype: An organism’s DNA sequence, somehow defined

Phenotype: An organism’s physical properties

In this case, “genotype” will refer to the presence of genes that are similar enough that they likely share the same function

8

Page 9: Beiko dcsi2013

The “genotype-phenotype association” problem

Gene 1 Gene 2 Gene 3 Gene 4 Gene 5

9

Page 10: Beiko dcsi2013

A suitable approach

Problem: a typical dataset will contain between 50 and 500 genomes, and presence / absence data for >10,000 genes

We need an approach that can detect interactions among genes, so the potential feature space is very large. Searching all 2^10,000 rule combinations is obviously not going to happen.

ASSOCIATION RULE MINING (Agrawal et al., 1993): Discover associative rules between items, e.g. {Milk, Eggs} -> {Flour}

Classification Based on Predictive Association Rules (Yin and Han, 2003): iteratively generate rules to “cover” each subset of the data
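
As a rough illustration of what an association rule such as {Milk, Eggs} -> {Flour} measures, the sketch below computes the two standard rule statistics, support and confidence, on a made-up list of shopping baskets. The basket contents and the function names are assumptions for illustration only; they are not taken from the talk or from the cited papers.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, body, head):
    """Estimated P(head | body): support(body and head) / support(body)."""
    body_support = support(transactions, body)
    if body_support == 0:
        return 0.0
    return support(transactions, set(body) | set(head)) / body_support


# Hypothetical baskets, invented for this example.
baskets = [
    frozenset({"Milk", "Eggs", "Flour"}),
    frozenset({"Milk", "Eggs"}),
    frozenset({"Milk", "Eggs", "Flour", "Butter"}),
    frozenset({"Milk", "Bread"}),
    frozenset({"Eggs", "Flour"}),
]

rule_support = support(baskets, {"Milk", "Eggs", "Flour"})
rule_confidence = confidence(baskets, {"Milk", "Eggs"}, {"Flour"})
print(rule_support, rule_confidence)
```

In the genomic setting the "items" would be gene presence calls and the rule head would be the phenotype label, so the same two quantities apply unchanged.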

10

Page 11: Beiko dcsi2013

11

Classification based on Predictive Association Rules (CPAR)

[Figure: rule-search tree with candidate rule bodies F; F, Q; F, Z; and A. Branches end at "None above gain threshold".]

Rules discovered:
1. F, Q -> POSITIVE
2. F, Z -> POSITIVE
3. A -> POSITIVE

Covered samples get their weight reduced before the next iteration
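
The greedy loop on this slide (grow a rule gene by gene until nothing is above the gain threshold, then down-weight the covered samples and repeat) can be sketched as follows. This is a simplified stand-in and not the CPAR implementation used in the talk: it follows only the single best gene at each step, scores candidates with FOIL gain, and uses a toy dataset. The gene letters, `min_gain`, `decay` and every function name are assumptions.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """FOIL gain when weighted coverage narrows from (p0, n0) to (p1, n1)."""
    if p1 <= 0 or p0 <= 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


def grow_rule(samples, weights, genes, min_gain):
    """Greedily add the best gene until none exceeds the gain threshold."""
    body = frozenset()
    covered = list(range(len(samples)))
    while True:
        p0 = sum(weights[i] for i in covered if samples[i][1])
        n0 = sum(weights[i] for i in covered if not samples[i][1])
        best_gene, best_gain = None, min_gain
        for gene in genes:
            if gene in body:
                continue
            subset = [i for i in covered if gene in samples[i][0]]
            p1 = sum(weights[i] for i in subset if samples[i][1])
            n1 = sum(weights[i] for i in subset if not samples[i][1])
            gain = foil_gain(p0, n0, p1, n1)
            if gain > best_gain:
                best_gene, best_gain = gene, gain
        if best_gene is None:
            return body  # nothing above the gain threshold
        body = body | {best_gene}
        covered = [i for i in covered if best_gene in samples[i][0]]


def learn_rules(samples, min_gain=0.5, decay=0.5, max_rules=20):
    """Learn rules for the positive class, down-weighting covered positives."""
    weights = [1.0] * len(samples)
    genes = sorted(set().union(*(g for g, _ in samples)))
    rules = []
    while len(rules) < max_rules:
        body = grow_rule(samples, weights, genes, min_gain)
        if not body or body in rules:
            break
        rules.append(body)
        for i, (present, positive) in enumerate(samples):
            if positive and body <= present:
                weights[i] *= decay  # covered samples count less next time
    return rules


# Toy data: (genes present, is_positive). Invented for this example.
toy = [
    (frozenset("FQ"), True), (frozenset("FQB"), True),
    (frozenset("FZ"), True), (frozenset("FZB"), True),
    (frozenset("A"), True), (frozenset("AB"), True),
    (frozenset("FB"), False), (frozenset("Q"), False),
    (frozenset("Z"), False), (frozenset("B"), False),
    (frozenset("QB"), False), (frozenset("ZB"), False),
]
rules = learn_rules(toy)
print([sorted(r) for r in rules])
```

The weight decay is what lets later iterations find rules for positives that the first rule missed, instead of rediscovering the same rule.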

Page 12: Beiko dcsi2013

CPAR results

One example for now: THERMOPHILY – the ability of an organism to grow at temperatures above 42° C

427 genomes in the dataset: 376 mesophiles (negative set), 51 thermophiles (positive set)

26,290 genes to consider

Use CPAR to learn rules, submit identified genes to SVM for classification. 10x 5-fold cross-validation

CPAR accuracy: 84.3% (obtained in 10.6 seconds)

Best competitor (NETCAR): 79.3% (obtained in 1250.9 seconds)

12

Page 13: Beiko dcsi2013

CPAR Results

Aeropyrum_pernix_K1 YES 0

Archaeoglobus_fulgidus_DSM_4304 YES 0

Caldicellulosiruptor_saccharolyticus_DSM_8903 YES 0

Fervidobacterium_nodosum_Rt17-B1 YES 0

Hyperthermus_butylicus_DSM_5456 YES 0

Ignicoccus_hospitalis_KIN4/I YES 0

Metallosphaera_sedula_DSM_5348 YES 0

Pyrobaculum_arsenaticum_DSM_13514 YES 0

Pyrobaculum_calidifontis_JCM_11548 YES 0

Pyrobaculum_islandicum_DSM_4184 YES 0

Pyrococcus_abyssi_GE5 YES 0

Pyrococcus_furiosus_DSM_3638 YES 0

Pyrococcus_horikoshii_OT3 YES 0

Staphylothermus_marinus_F1 YES 0

Sulfolobus_acidocaldarius_DSM_639 YES 0

Sulfolobus_solfataricus_P2 YES 0

Thermoanaerobacter_tengcongensis_MB4 YES 0

Thermofilum_pendens_Hrk_5 YES 0

Thermotoga_maritima_MSB8 YES 0

Thermotoga_petrophila_RKU-1 YES 0

Thermus_thermophilus_HB27 YES 0

Thermus_thermophilus_HB8 YES 0

Roseiflexus_castenholzii_DSM_13941 YES 0

Thermosipho_melanesiensis_BI429 YES 0

Roseiflexus_sp._RS-1 YES 0

Moorella_thermoacetica_ATCC_39073 YES 0

Streptococcus_thermophilus_LMG_18311 YES 0

Thermoplasma_volcanium_GSS1 YES 1

Methanosaeta_thermophila_PT YES 1

Nanoarchaeum_equitans_Kin4-M YES 1

Thermoplasma_acidophilum_DSM_1728 YES 1

Picrophilus_torridus_DSM_9790 YES 1

Carboxydothermus_hydrogenoformans_Z-2901 YES 1

Streptococcus_thermophilus_CNRZ1066 YES 1

Aquifex_aeolicus_VF5 YES 2

Methanopyrus_kandleri_AV19 YES 2

Pelotomaculum_thermopropionicum_SI YES 2

Rubrobacter_xylanophilus_DSM_9941 YES 3

Geobacillus_kaustophilus_HTA426 YES 3

Nitratiruptor_sp._SB155-2 YES 4

Synechococcus_sp._JA-3-3Ab YES 6

Geobacillus_thermodenitrificans_NG80-2 YES 7

Methanocaldococcus_jannaschii_DSM_2661 YES 8

Acidothermus_cellulolyticus_11B YES 8

Deinococcus_geothermalis_DSM_11300 YES 9

Clostridium_thermocellum_ATCC_27405 YES 9

Thermosynechococcus_elongatus_BP-1 YES 9

Sulfurovum_sp._NBC37-1 YES 10

Thermobifida_fusca_YX YES 10

Chlorobium_tepidum_TLS YES 10

Symbiobacterium_thermophilum_IAM_14863 YES 10

(Final column: # of misclassifications in 10 replicate runs)

Sulfurovum_sp._NBC37-1 YES 10

The classifier is right; the database is wrong!!

13

Page 14: Beiko dcsi2013

A complication

Organisms are not independent observations!

They share common ancestry

Gene 1 Gene 2 Gene 3 Gene 4 Gene 5

14

Page 15: Beiko dcsi2013

What to do?

MUTUAL INFORMATION:

CONDITIONAL MUTUAL INFORMATION:

Weight CMI by total MI: CONDITIONAL WEIGHTED MUTUAL INFORMATION (CWMI)

Reweight CPAR rules to reflect MI or CWMI: what patterns emerge?
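
The formulas on this slide did not survive extraction. As a hedged illustration, the sketch below computes the textbook definitions of mutual information I(X;Y) and conditional mutual information I(X;Y|Z) from discrete observations. Reading X as gene presence, Y as the phenotype and Z as a taxonomic group is an assumption made for the example, and the CWMI weighting itself is not reproduced here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired discrete observations."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z) in bits, estimated from discrete observation triples."""
    n = len(xs)
    pxyz = Counter(zip(xs, ys, zs))
    pxz, pyz, pz = Counter(zip(xs, zs)), Counter(zip(ys, zs)), Counter(zs)
    return sum(
        (c / n) * math.log2((c * pz[z]) / (pxz[(x, z)] * pyz[(y, z)]))
        for (x, y, z), c in pxyz.items()
    )


# Hypothetical data: four genomes, one gene, one phenotype.
gene = [1, 1, 0, 0]
thermophile = [1, 1, 0, 0]
clade_confounded = [0, 0, 1, 1]   # gene and phenotype both track the clade
clade_unrelated = [0, 1, 0, 1]    # clade carries no information about either

mi = mutual_information(gene, thermophile)
cmi_confounded = conditional_mutual_information(gene, thermophile, clade_confounded)
cmi_unrelated = conditional_mutual_information(gene, thermophile, clade_unrelated)
print(mi, cmi_confounded, cmi_unrelated)
```

When gene and phenotype both simply follow the clade, the conditional value drops to zero even though the unconditional value is high, which is the kind of shared-ancestry effect the previous slide warns about.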

15

Page 16: Beiko dcsi2013

What genes are identified?

16

Highlighted boxes: genes identified in “A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis” (Makarova et al., Nucleic Acids Research, 2002, 30(2), 482-496)

[Figure panels: Top CWMI, Top MI]

Page 17: Beiko dcsi2013

Wrong, but in different ways

17

Organism CPAR MI CWMI

Streptococcus thermophilus LMG 18311 0 10 10

Streptococcus thermophilus CNRZ1066 1 10 10

Carboxydothermus hydrogenoformans Z-2901 1 8 5

Geobacillus kaustophilus HTA426 3 10 9

Synechococcus sp. JA-3-3Ab 6 8 2

Methanocaldococcus jannaschii DSM 2661 8 0 0

Acidothermus cellulolyticus 11B 8 9 6

Deinococcus geothermalis DSM 11300 9 8 5

Clostridium thermocellum ATCC 27405 9 10 4

Chlorobium tepidum TLS 10 10 8

Page 18: Beiko dcsi2013

Summary

18

Misclassifications (10 replicates)


- CPAR is FAST and fairly accurate, but the problem is challenging: no “magic” set of genes that automatically make you a thermophile

- But we can investigate what pops up in the rules to find out which genes are most likely associated with heat tolerance

- The hardest organisms to classify are from weird groups, with few or no close relatives that are also thermophilic

- Different weighting schemes, especially those that consider the confounding effects of taxonomy, have different strengths and can identify different candidate genes

Page 19: Beiko dcsi2013

What’s next?

19

- Much larger microbial datasets with much broader taxonomic coverage are now available
- Will give us more precise models of what genes make a thermophile, pathogen, etc.

- Consider other lines of evidence: variation WITHIN genes in addition to gene presence/absence

- Apply to emerging pathogen data: classify outbreak isolates based on antibiotic resistance, virulence and other properties (SFU, BCCDC, National Microbiology Laboratory)

Jie (Jessie) Ning

Page 20: Beiko dcsi2013

METAGENOMICS: Because one genome at a time is too easy

MacDonald NJ, Parks DH, and Beiko, RG (2012). Rapid identification of high-confidence taxonomic assignments for metagenomic data. Nucleic Acids Research 40: e111.

Parks DH, MacDonald NJ, and Beiko, RG (2011). Classifying short genomic fragments from novel lineages using composition and homology. BMC Bioinformatics 12: 328.

20

Page 21: Beiko dcsi2013

The microbial community problem

- Microbes almost never act alone; samples will typically contain dozens or hundreds of different species

- How can we answer the following questions:
- What microbes are present in a given sample?
- What functions do they carry out?
- How do they interact with one another?

21

Page 22: Beiko dcsi2013

Metagenomics

Sample -> Extract DNA -> Sequence DNA -> Assign sequences

[Figure: sequenced fragments such as GATAA, with their source organisms marked "?"]

22

Page 23: Beiko dcsi2013

The species assignment problem

[Figure: sequence GATAAATCTGG, with candidate source organisms marked "?"]

- UNSUPERVISED (clustering-ish) and SUPERVISED approaches

- For supervised classification, we need a set of known genomes

- Two attributes provide key clues:
(i) Genomic composition of k-mers (aka n-grams)
(ii) Comparison with known gene sequences

23

Page 24: Beiko dcsi2013

The species assignment problem

GATAAATCTGG

24

Mystery sequence

Where did I come from?

COMPOSITION (k-mers)

Genome models (the same k-mer frequency table is shown for each of five genomes):

k-mer | frequency
AA | 2/10
AC | 0/10
AG | 0/10
AT | 1/10
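
For illustration, k-mer frequencies of the kind tabulated for each genome model can be computed with a sliding window. The sketch below applies this to the mystery sequence GATAAATCTGG with k = 2; the function name is an assumption, and a real genome model would be built from a whole reference genome and not from an 11-nucleotide fragment.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Return ({k-mer: relative frequency}, number of k-mers) for `seq`."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    # Slide a window of width k across the sequence, one position at a time.
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: c / total for kmer, c in counts.items()}, total


freqs, n_kmers = kmer_frequencies("GATAAATCTGG", 2)
for kmer in sorted(freqs):
    print(kmer, round(freqs[kmer] * n_kmers), "/", n_kmers)
```

Overlapping windows mean a sequence of length L yields L - k + 1 k-mers, which is why an 11-nucleotide fragment gives 10 dinucleotides.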

SIMILARITY

Sequence from metagenome:

GATAAATCTGG

Sequences from reference genomes:

GATAAGTCTGG

GACCAATCTGG

GATAAACTTAG

CAAGGATAAGC
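
As a toy version of the similarity idea, the sketch below ranks the four reference sequences from this slide by ungapped, position-by-position identity to the metagenome sequence. This is a deliberate simplification chosen for the example: it is not BLAST, which performs local alignment with gaps and scoring statistics, and the function name is an assumption.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


query = "GATAAATCTGG"
references = ["GATAAGTCTGG", "GACCAATCTGG", "GATAAACTTAG", "CAAGGATAAGC"]

# Most similar reference first.
ranked = sorted(references, key=lambda ref: identity(query, ref), reverse=True)
for ref in ranked:
    matches = round(identity(query, ref) * len(query))
    print(ref, f"{matches}/{len(query)} positions match")
```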

Page 25: Beiko dcsi2013

Metagenomes - the first few years

25

[Figure: Cost of DNA sequencing (note log scale)]

Study | Author, Year | # of nucleotides | Size of each “read”
Acid mine drainage | Tyson et al, 2004 | 7.62 x 10^7 | 737 nt
Obese / Lean twins | Turnbaugh et al, 2009 | 1.83 x 10^9 | 341 nt
Human gut “catalogue” | Qin et al, 2010 | 5.77 x 10^11 | 75 nt

Page 26: Beiko dcsi2013

Summary of challenges

26

- Datasets are already huge, and getting bigger and more numerous

- DNA sequences that we need to classify are SHORT: unstable estimates of composition and similarity

- Our predictions depend on the coverage in our reference database

- We need to combine different lines of evidence into a coherent prediction scheme

Page 27: Beiko dcsi2013

Two approaches

27

PhymmBL: Brady and Salzberg, 2010

- Similarity of sequences assessed through the BLAST algorithm

- Composition assessed using interpolated context models

- Predictions are combined using a formula

RITA: MacDonald, Parks and Beiko, 2012

- Similarity of sequences assessed using UBLAST and BLAST

- Composition assessed using naïve Bayes approach

- Look for agreement between predictors; if no agreement, decide based on best evidence

Page 28: Beiko dcsi2013

The naïve Bayes approach

28

- Build k-mer profiles for each reference genome

- The probability that a given DNA sequence fragment F originated from a given genome G_i is:

$$P(F \mid G_i) = \prod_{j=1}^{M} P(w_j \mid G_i)$$

- (that is, the combined frequencies of all k-mers from F in genome G_i; here w_j is the j-th of the M k-mers in F)

- Note that naïve Bayes assumes INDEPENDENCE, which is a bit funny with overlapping k-mers (But We Did It Anyway)

AGGCTTGTCAA
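
A minimal sketch of this kind of classifier follows: each reference genome gets a k-mer profile, and a fragment is assigned to the genome whose profile gives the highest product of k-mer probabilities. Two details are assumptions added for the example and are not taken from the talk: probabilities are summed in log space to avoid underflow, and a pseudocount keeps unseen k-mers from zeroing the product. The toy genomes and all names are hypothetical, and only the letters A, C, G and T are handled.

```python
import math
from collections import Counter
from itertools import product


def kmers(seq, k):
    """All overlapping k-mers of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_profile(genome, k, pseudocount=1.0):
    """P(w | G) for every possible k-mer w over ACGT, with add-one smoothing."""
    counts = Counter(kmers(genome, k))
    total = sum(counts.values()) + pseudocount * 4 ** k
    return {
        "".join(p): (counts["".join(p)] + pseudocount) / total
        for p in product("ACGT", repeat=k)
    }


def log_likelihood(fragment, profile, k):
    """log P(F | G): sum of log P(w_j | G) over the k-mers of the fragment."""
    return sum(math.log(profile[w]) for w in kmers(fragment, k))


def classify(fragment, profiles, k):
    """Name of the genome whose profile best explains the fragment."""
    return max(profiles, key=lambda name: log_likelihood(fragment, profiles[name], k))


# Toy "genomes", invented for this example.
k = 2
profiles = {
    "AT_rich": build_profile("ATATATTAATATTATA", k),
    "GC_rich": build_profile("GCGCGGCCGCGCCGGC", k),
}
print(classify("ATTATAAT", profiles, k), classify("GCCGGC", profiles, k))
```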

Page 29: Beiko dcsi2013

Naïve Bayes in action

29

Build fake metagenomes by chopping up real sequenced genomes into pieces of length 200

Build a reference database that excludes the chopped up genomes AND their close relatives (leave-one-out)
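
The evaluation setup described here can be sketched in a few lines: cut a held-out genome into fixed-length fragments, then build a reference set that leaves out that genome and its close relatives. How "close relatives" were defined in the study is not stated on the slide, so the `groups` mapping below is a hypothetical stand-in (for example a genus label), as are the function names and the choice to use non-overlapping pieces and drop a trailing partial fragment.

```python
def chop(genome, length=200):
    """Non-overlapping fragments of exactly `length`; a short tail is dropped."""
    return [genome[i:i + length] for i in range(0, len(genome) - length + 1, length)]


def leave_one_out_reference(genomes, groups, held_out):
    """Reference genomes excluding `held_out` and everything in its group."""
    excluded = groups[held_out]
    return {name: seq for name, seq in genomes.items() if groups[name] != excluded}


# Placeholder sequences and group labels, invented for this example.
genomes = {"g1": "A" * 450, "g2": "C" * 400, "g3": "G" * 600}
groups = {"g1": "genusX", "g2": "genusX", "g3": "genusY"}

fragments = chop(genomes["g1"])
reference = leave_one_out_reference(genomes, groups, "g1")
print(len(fragments), sorted(reference))
```

Excluding relatives as well as the genome itself makes the test harder and closer to the real situation, where a metagenome often contains organisms with no sequenced near neighbour.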

How accurate is the classifier, for different values of k?

[Figure: average proportion of sequences correctly classified, plotted against k]

Page 30: Beiko dcsi2013

Composition versus Similarity

30

Similarity approaches (three right-hand sets) are more accurate (and slower) than the composition approaches NB and P

[Figure panels: 1000 nt, 200 nt]

Page 31: Beiko dcsi2013

RITA: Rapid Identification of Taxonomic Assignments

31

[Flowchart]

Query DNA sequence fragment

Run naïve Bayes classifier

UBLAST filter (fast, imprecise)

BLAST comparisons (slower, better)

Decision points (each answered Yes! or No!):
- Is there a BLAST match?
- Is there a strong naïve Bayes preference?
- Do BLAST and naïve Bayes agree?
- Is there a strong BLAST preference?

Outcomes: Group 1a, Group 1b, Group 2, Group 3

Page 32: Beiko dcsi2013

Performance on different sequence lengths

32

Page 33: Beiko dcsi2013

Running time

33

[Figure: running time (h), axis ticks from 0.01 to 100]

Running times on 116,244 sequences

Page 34: Beiko dcsi2013

Application to human microbiome data sets

34

[Figure panels: Homology + Composition; Composition]

Without HMP genomes: Clostridium, Bacteroides and Eubacterium, but lots of low-confidence calls too

With HMP reference genomes: Add Ruminococcus, Faecalibacterium, Lachnospiraceae

Good Less Good

Data from Turnbaugh et al., 2010

Page 35: Beiko dcsi2013

Application to bioremediation metagenome

35

Hug et al., 2012

Three sets of microbes, all of which can clean up PCEs. Are there differences in the composition of these sets?

Page 36: Beiko dcsi2013

Summary

36

- Naïve Bayes is FAST and performs as well as alternative, more complicated approaches

- The combination of composition and similarity is superior to either approach in isolation

- The accuracy on short reads is good, but a substantial minority of reads are misclassified so the question of “who is doing what” remains somewhat open

Page 37: Beiko dcsi2013

What’s next?

37

- Apply to emerging metagenomic data sets:
- Bioremediation
- Aging and frailty in mice and humans

- Refine the approach to include both unsupervised and supervised components

Page 38: Beiko dcsi2013

Coda #1: mammalian fertility

38

Random mating: CONTROL (105)

Selective breeding: SELECTED (344)

Starting colony

30 years of…

Examine genetic variation at >8000 positions within the genome.

Are there any genetic differences at one or more sites that distinguish the populations and individuals within the populations?

Alex Keddy

Katherine Rutherford

Page 39: Beiko dcsi2013

Machine-learning results

39

Different ML approaches with feature selection

Observed vs Predicted reproductive rate for RF regression model

Page 40: Beiko dcsi2013

What’s next?

40

Jeremy Koenig

- Expand the project: more data, and more types of data!

- Integrating lines of evidence from multiple sources will be a significant challenge – each yields overlapping / different predictions

- Map interesting results into the cow genome and test effectiveness

Developer to be named later

Page 41: Beiko dcsi2013

Coda #2: data retrieval and GIS

41

20,304 samples

1.7 billion sequences

Page 42: Beiko dcsi2013

42

Conor Meehan

Page 43: Beiko dcsi2013

Objectives

43

- Automated classification of data from sources such as the EMP

- Retrieval of data from EMP via Web services under development (some plugins already completed – come in October for the story)

Page 44: Beiko dcsi2013

What’s next?

44

My Dal Homecoming lecture on October 4

Page 45: Beiko dcsi2013

Classifying DNA: Adventures in Multidisciplinarity

45

Genetics

Evolution

Statistics

Machine Learning

Throw in the challenges of massive data sets, data retrieval challenges, emerging technologies, and uncertain reliability of some data sets,

And there is a lot of work still to be done!!

Chris Whidden

Donovan Parks

Morgan Langille

Page 46: Beiko dcsi2013

Open Science

46

@rob_beiko

Github

Preprint servers

This presentation: http://www.slideshare.net/beiko/beiko-dcsi2013

Page 47: Beiko dcsi2013

Fin

47


Page 48: Beiko dcsi2013

Image credits

Please follow links for copyright information

Slide 1: http://commons.wikimedia.org/wiki/File:DNA Overview2.png

Slide 2: s3.documentcloud.org/documents/706661/francis-crick-letter.pdf

Slide 3:
http://www.nature.com/nature/journal/v393/n6685/full/393537a0.html
http://www.ncbi.nlm.nih.gov/sutils/static/GP IMAGE/Mycobacterium.jpg
http://commons.wikimedia.org/wiki/File:Shakespeare.jpg
http://commons.wikimedia.org/wiki/File:Wolllaus.jpg
http://phenomena.nationalgeographic.com/files/2013/06/Tremblaya Moranella.jpg (Ryuichi Koga, National Institute of Advanced Industrial Science and Technology, Japan)
http://commons.wikimedia.org/wiki/File:Daphnia pulex.png
http://commons.wikimedia.org/wiki/File:Paris japonica Kinugasasou in Hakusan 2003 7 27.jpg

Slide 4:
http://upload.wikimedia.org/wikipedia/commons/2/21/DNA human male chromosomes.gif
http://commons.wikimedia.org/wiki/File:LKB1 complex structure 2WTK.png

Slide 6:
http://commons.wikimedia.org/wiki/File:Yogurt of the Bulgarija Pavilion of Expo 2005 Aichi Japan.jpg
http://en.wikipedia.org/wiki/File:Grand prismatic spring.jpg
http://www.nsf.gov/od/lpa/news/03/images/scsmoker2th.jpg
http://www.nsf.gov/od/lpa/news/03/images/strain121 thin th.jpg

Slide 21:
http://commons.wikimedia.org/wiki/File:EPA_TECHNICIAN_COLLECTS_WATER_SAMPLE_FROM_PAHRANAGAT_LAKE_ABOUT_10_MILES_SOUTH_OF_ALAMO_-_NARA_-_549007.jpg
http://commons.wikimedia.org/wiki/File:DNA_orbit_animated_small.gif
http://commons.wikimedia.org/wiki/File:DNA_sequence.svg

Slide 24: Stein, L. Genome Biology 2010 11:207

Slide 36: http://commons.wikimedia.org/wiki/File:Mouse-19-Dec-2004.jpg

48