Mining Plant Pathogen Genomes for Effectors
![Page 1: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/1.jpg)
Mining pathogen genomes for effectors
Leighton Pritchard
![Page 2: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/2.jpg)
The overall goal
- Starting from a genome sequence, identify genes that code for candidate effectors (or, starting from gene product complement, identify candidate effectors)
![Page 6: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/6.jpg)
What is an effector?
- Molecule produced by pathogen that (directly?) modifies host molecular/biochemical ‘behaviour’, e.g.
  - Inhibits enzyme action (Cladosporium fulvum AVR2, AVR4; Phytophthora infestans EPIC1, EPIC2B; P. sojae glucanase inhibitors)
  - Cleaves protein target (Pseudomonas syringae AvrRpt2)
  - (De-)phosphorylates protein target (Pseudomonas syringae AvrRPM1, AvrB)
  - Additional component in/retargeting host system, e.g. E3 ligase activity (P. syringae AvrPtoB; P. infestans Avr3a)
  - Regulatory control (Xanthomonas campestris AvrBs3, TAL effectors)
![Page 8: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/8.jpg)
What is an effector?
- No unifying biochemical mechanism; may act inside or outwith host cell
- No formal, agreed definition (direct/indirect action; structural damage – PCWDEs, etc.)
- No single ‘test for candidate effectors’
  - Really testing for protein family membership and/or evidence of ‘effector-like behaviour’
- A general sequence classification problem (functional annotation)
- Many possible bioinformatic/computational approaches
- No big red button
![Page 11: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/11.jpg)
Surgery without knife skills?
![Page 12: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/12.jpg)
Before we start…
Cards: A, F, 4, 7
“If a card has a vowel on one side, it has an even number on the other side.”
Which card(s) are useful to turn over to test this proposition?
![Page 13: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/13.jpg)
Before we start…
A F
4 7
A 7
F 4
A 4
F 7
![Page 14: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/14.jpg)
Before we start…
Wason Selection Task: confirmation bias, context
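The logic of the card task can be spelled out in a few lines. This is only an illustrative sketch (the function name is made up for this example): a card is worth turning only if its hidden face could falsify the rule, which is the case for a visible vowel or a visible odd number.

```python
def cards_worth_turning(visible_faces):
    """Return the cards whose hidden face could falsify the rule
    'if a card has a vowel on one side, it has an even number on the other'."""
    useful = []
    for face in visible_faces:
        if face.isalpha():
            # A visible vowel might hide an odd number; a consonant cannot break the rule.
            if face.upper() in "AEIOU":
                useful.append(face)
        elif int(face) % 2 == 1:
            # A visible odd number might hide a vowel; an even number cannot break the rule.
            useful.append(face)
    return useful


print(cards_worth_turning(["A", "F", "4", "7"]))
```

Turning the 4 can only ever confirm the rule, which is why choosing it is the classic confirmation-bias answer.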
![Page 15: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/15.jpg)
Why is this relevant?
Cards: effector, not effector, RxLR, not RxLR
“If a protein has an RxLR motif, it is an effector.”
Which experiments are useful to perform to test this proposition?
![Page 16: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/16.jpg)
Effector Club
The first rule of finding effectors is:
You are not finding effectors
![Page 17: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/17.jpg)
Effector Club
- Classification of sequences is modelling
  - simplified representation of reality
  - criteria based on known effectors
- Identifies candidate effectors
  - experimental verification required
- General bioinformatic problem
  - specifics vary for each classifier (model)
![Page 20: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/20.jpg)
Sequence space
An abstract concept
![Page 21: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/21.jpg)
Sequence space
Each point is a sequence
![Page 22: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/22.jpg)
Sequence space
Distance reflects sequence similarity (d1 < d2)
![Page 23: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/23.jpg)
Sequence space
Known exemplar: red
![Page 24: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/24.jpg)
Sequence space
Define distance from the example ≈ ‘similar’
![Page 25: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/25.jpg)
Sequence space
‘similar’ sequences are same class (e.g. function)
![Page 26: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/26.jpg)
Sequence space
Known exemplars: red
![Page 27: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/27.jpg)
Sequence space
Define a centre, and a distance that includes the examples
![Page 28: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/28.jpg)
Sequence space
Classify ‘similar’ sequences
![Page 29: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/29.jpg)
Finding effectors
- Simple:
  1. Have one or more examples of your effector (class)
  2. Define some kind of appropriate threshold of similarity
  3. Check all the gene/gene product sequences in the genome against that threshold
There are 50 slides to go… it’s not that simple
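The three ‘simple’ steps can be sketched as a toy classifier. Everything here is an assumption made for illustration: the similarity measure is Python's generic `difflib` string ratio (not a biological alignment score), and the function names, sequences and threshold are invented.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Generic string similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def candidate_effectors(exemplars, proteome, threshold):
    """Return {name: best similarity} for sequences within the threshold of any exemplar."""
    hits = {}
    for name, seq in proteome.items():
        best = max(similarity(seq, exemplar) for exemplar in exemplars)
        if best >= threshold:
            hits[name] = best
    return hits


# Step 1: known example(s); step 2: a threshold; step 3: check every sequence.
exemplars = ["MKLSRLRAEER"]                      # hypothetical exemplar
proteome = {
    "p1": "MKLSRLRAEER",                         # identical to the exemplar
    "p2": "GGGGGGGGGGG",                         # unrelated
    "p3": "MKLSRLRADER",                         # one substitution
}
print(candidate_effectors(exemplars, proteome, threshold=0.8))
```

The sketch makes the open questions visible: the answer depends entirely on which `similarity` function and which `threshold` are chosen.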
![Page 33: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/33.jpg)
It’s not that simple
- How do we define ‘distance’?
- How large a ‘distance’ do we take?
- How do we know we’ve chosen a sensible ‘distance’?
![Page 36: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/36.jpg)
Characteristics of known effectors
- Modularity
  - Delivery: localisation/translocation domain(s)
  - Activity: functional/interaction domain(s)
- Sequence motifs
  - Localisation/translocation domain(s) often common to effector class (e.g. RxLR, T3E)
  - Functional domain(s) may be common to effector class (e.g. TAL), or divergent (e.g. RxLR, T3E)
![Page 37: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/37.jpg)
Characteristics of known effectors
- Modularity
  - Delivery: localisation/translocation domain(s)
  - Activity: functional/interaction domain(s)
Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Collmer A, Lindeberg M, Petnicki-Ocwieja T, Schneider DJ, Alfano JR (2002) Genomic mining type III secretion system effectors in Pseudomonas syringae yields new picks for all TTSS prospectors. Trends in Microbiology 10: 462–469.
![Page 38: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/38.jpg)
Characteristics of known effectors
- Modularity
  - Delivery: localisation/translocation domain(s)
  - Activity: functional/interaction domain(s)
Dong S, Yu D, Cui L, Qutob D, Tedman-Jones J, et al. (2011) Sequence Variants of the Phytophthora sojae RXLR Effector Avr3a/5 Are Differentially Recognized by Rps3a and Rps5 in Soybean. PLoS ONE 6: e20172. doi:10.1371/journal.pone.0020172.t004.
Bouwmeester K, Meijer HJG, Govers F (2011) At the frontier; RXLR effectors crossing the Phytophthora-host interface. Frontiers in Plant-Microbe Interactions 10.3389
![Page 39: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/39.jpg)
Characteristics of known effectors
- Modularity
  - Delivery: localisation/translocation domain(s)
  - Activity: functional/interaction domain(s)
- Sequence motifs
  - Localisation/translocation domain(s) typically common to effector class (e.g. RxLR, T3E, CHxC)
  - Functional domain(s) may be common to effector class (e.g. TAL), or divergent (e.g. RxLR, T3E in general)
Boch J, Scholze H, Schornack S, Landgraf A, Hahn S, et al. (2009) Breaking the code of DNA binding specificity of TAL-type III effectors. Science 326: 1509–1512. doi:10.1126/science.1178811.
![Page 40: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/40.jpg)
Characteristics of known effectors
- “Arms races” occur:
  - Host defences track effector evolution
  - Effectors evade host defences
- Divergence of effectors under selection pressure
  - Diversifying selection; divergence may result from evasion of detection, rather than change of biochemical ‘function’
- Effectors may be found preferentially in characteristic locations
  - P. infestans ‘gene sparse’ regions
Raffaele S, Win J, Cano LM, Kamoun S (2010) Analyses of genome architecture and gene expression reveal novel candidate virulence factors in the secretome of Phytophthora infestans. BMC Genomics 11: 637. doi:10.1186/1471-2164-11-637.
![Page 41: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/41.jpg)
Characteristics of known effectors
- Application of ‘filters’: reduce the number of sequences to check
- Presence/absence filters:
  - SignalP (export signal)
  - RxLR/T3SS (translocation signal)
  - Expression (used by pathogen)
  - Positive selection (suggests arms race)
  - etc…
- Workflows (e.g. Galaxy, Taverna) useful here
Fabro G, Steinbrenner J, Coates M, Ishaque N, Baxter L, et al. (2011) Multiple candidate effectors from the oomycete pathogen Hyaloperonospora arabidopsidis suppress host plant immunity. PLoS Pathog 7: e1002348. doi:10.1371/journal.ppat.1002348.
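A chain of presence/absence filters might be sketched as below. All names are hypothetical: the `Protein` fields are assumed to have been filled in beforehand (for example from a separate SignalP run and from expression data), and the `R.LR` pattern and search window are illustrative placeholders, not a published RxLR model.

```python
import re
from dataclasses import dataclass


@dataclass
class Protein:
    name: str
    sequence: str
    has_signal_peptide: bool    # assumed precomputed, e.g. by a separate SignalP run
    expressed_in_planta: bool   # assumed precomputed from expression data


RXLR = re.compile(r"R.LR")      # illustrative motif pattern only


def has_rxlr_near_n_terminus(protein, window=(20, 60)):
    """True if the motif occurs inside a (hypothetical) N-terminal window."""
    start, end = window
    return RXLR.search(protein.sequence[start:end]) is not None


FILTERS = [
    ("signal peptide", lambda p: p.has_signal_peptide),
    ("RxLR motif", has_rxlr_near_n_terminus),
    ("expressed", lambda p: p.expressed_in_planta),
]


def apply_filters(proteins, filters=FILTERS):
    """Apply each filter in turn, reporting how many sequences survive."""
    remaining = list(proteins)
    for label, keep in filters:
        remaining = [p for p in remaining if keep(p)]
        print(f"after {label}: {len(remaining)} remaining")
    return remaining
```

Each filter only shrinks the list of candidates to check; what survives is still a set of candidates, not confirmed effectors.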
![Page 42: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/42.jpg)
Redefining sequence space
- Effectors may share common module, but otherwise be dissimilar.
- We can emphasise sequence similarity by focusing on the common region
  - this is essentially ‘redefining’ sequence space
  - brings known effectors ‘together’
  - may bring non-effectors with similar sequence closer, too
SSMMMAAAAAAAA
SSMMMBBBBBBBB
![Page 43: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/43.jpg)
Sequence space
Comparing whole sequences
AAAAAAAA
BBBBBBB
![Page 44: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/44.jpg)
Redefining sequence space
- Effectors may share common module, but otherwise be dissimilar.
- We can emphasise similarity by focusing on regions common to an effector class, e.g. T3SS, L-FLAK
  - this is essentially redefining sequence space
  - brings known effectors ‘closer together’
  - may bring non-effectors with similar sequence closer, too
SSMMMAAAAAAAA
SSMMMBBBBBBBB
![Page 45: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/45.jpg)
Sequence space
Pull domains together, push non-domains away
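The effect of restricting the comparison to the shared module can be seen with the toy sequences from the slides. This is a minimal sketch: `identity` is a made-up helper that assumes the two strings are already aligned and of equal length.

```python
def identity(a, b):
    """Fraction of identical positions in two equal-length, pre-aligned strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq_a = "SSMMMAAAAAAAA"
seq_b = "SSMMMBBBBBBBB"
module = slice(0, 5)            # the shared 'SSMMM' region

whole = identity(seq_a, seq_b)
shared = identity(seq_a[module], seq_b[module])
print(f"whole sequence: {whole:.2f}, shared module only: {shared:.2f}")
```

Measured over the whole sequence the two look distant; measured over the module alone they are identical, and so would be any non-effector that happens to carry the same module.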
![Page 46: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/46.jpg)
Building a classifier
- How do we define ‘distance’?
- How large a ‘distance’ do we take?
- How do we know we’ve chosen a sensible ‘distance’?
![Page 47: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/47.jpg)
Defining a distance
- Sequence identity (optimal alignment)
- Derived score (based on sequence identity/alignment)
  - Bit score in BLAST
  - E-value in BLAST
- Derived score (based on other measures)
  - Bit score in HMMer
- Clustering
  - Sequence identity (e.g. CD-HIT)
  - MCL
![Page 49: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/49.jpg)
Defining a distance
- Sequence identity
- Derived score (based on sequence identity/alignment)
  - Bit score in BLAST
  - E-value in BLAST
- Derived score (based on other measures) [not alignment]
  - Bit score in HMMer
- Clustering
  - Sequence identity (e.g. CD-HIT)
  - MCL
![Page 50: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/50.jpg)
Defining a distance
- Sequence identity
- Derived score (based on sequence identity/alignment)
  - Bit score in BLAST
  - E-value in BLAST
- Derived score (based on other measures)
  - Bit score in HMMer
- Clustering (not strictly a distance)
  - Sequence identity (e.g. CD-HIT)
  - MCL
(we’re really assessing criteria for class membership)
![Page 51: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/51.jpg)
Defining a distance: sequence identity
- Distance between sequences ≈ difference between sequences
  - sequence identity: proportion of identical symbols
  - e.g. BLAST output
- Gotchas: not always symmetrical; dependent on alignment parameters!

    Score = 95.3 bits (51), Expect = 3e-24
    Identities = 161/212 (76%), Gaps = 15/212 (7%)
    Strand=Plus/Plus

    Query  1    GCCGCCGGGGTTCAGGTTCCACCCGACGGACGAGGAGCTGGTGATGC-ACTACCT-CTGC  58
                |||||||||||||||||||||||||||||||||||||||  | |  | ||||||| | ||
    Sbjct  1    GCCGCCGGGGTTCAGGTTCCACCCGACGGACGAGGAGCTCATCAC-CTACTACCTGCGGC  59

    Query  59   CGGCGGT-GC-GCCGGCCTCCCCATCGCCGTCCCCATCATCGCCGAGGTCGACCTCTACA  116
                 |  | | || | ||||  |  || || ||    |  ||||||||||||||| ||| |||
    Sbjct  60   AGAAGATCGCCGACGGCGGCTTCA-CGGCGAGGGC--CATCGCCGAGGTCGATCTCAACA  116

    Query  117  AGTTCGATCCATGGCATCTCCCA-AGAATGGCGCTGTACGGC-GAGAAGGAGTGGTACTT  174
                ||| ||| |||||| |||||||| |||| |||     | ||  |||||||| ||||||||
    Sbjct  117  AGTGCGAGCCATGGGATCTCCCAGAGAA-GGCAAAA-ATGGGAGAGAAGGAATGGTACTT  174

    Query  175  CTTCTCCCCTC-GGGACCGCAAGTACCCGAAC  205
                ||| |  |||  |||| || |||||||| |||
    Sbjct  175  CTT-TAGCCTAAGGGATCGAAAGTACCC-AAC  204
![Page 52: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/52.jpg)
Defining a distance: sequence identity
- Distance between sequences ≈ difference between sequences
  - sequence identity: proportion of identical symbols
  - e.g. BLAST output
- Gotchas: not always symmetrical; dependent on alignment parameters!

    Score = 4970
    Length of alignment = 533
    Sequence Solyc11g008000.1.1 : 1 - 529 (Sequence length = 529)
    Sequence Solyc11g005920.1.1 : 1 - 688 (Sequence length = 688)
    Percentage ID = 32.83

    Score = 5040
    Length of alignment = 533
    Sequence Solyc11g005920.1.1 : 1 - 688 (Sequence length = 688)
    Sequence Solyc11g008000.1.1 : 1 - 529 (Sequence length = 529)
    Percentage ID = 32.46

(pairwise alignment in Jalview)
![Page 53: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/53.jpg)
Defining a distance: sequence identity
- Distance between sequences ≈ difference between sequences
  - sequence identity: proportion of identical symbols
  - e.g. BLAST output
- Gotchas: not always symmetrical; dependent on alignment parameters!

    Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
    Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182

     Score = 31.4 bits (64), Expect = 4e-05, Method: Composition-based stats.
     Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)

    Query  34   GFRFHPTDEELVLYYLKRKICRRRILLDA---IAETDVY-KWEPEDLPDLSKLKTGD---  86
                GFRF PTD E V + L  +    + + D+       D Y + EP D+         D
    Sbjct  7    GFRFSPTDAEAVTFLL--RFIAGKFMDDSGFITTHVDTYSEQEPWDIYSHGVPCCNDDND  64

    Query  87   -RQWFFFSPRDRKYPNGARSNRASKHGYWKATGKDRIITCNSRAV-GVKKTLVFYKGRAP  144
                  Q+ FF    +K      S      G WK   K + +      V G KK++  YK +
    Sbjct  65   CTQYRFFITTLKKKSESRYSRNVGNKGSWKQQDKSKPVRKKGGPVIGYKKSMC-YKNKGY  123

    Query  145  VGERTDWVMHEYTM  158
                  E   W+M EY +
    Sbjct  124  KQEDGHWLMKEYDL  137

(BLASTP, BLOSUM80 matrix)
![Page 54: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/54.jpg)
Defining a distance: sequence identity
- Distance between sequences ≈ difference between sequences
  - sequence identity: proportion of identical symbols
  - e.g. BLAST output
- Gotchas: not always symmetrical; dependent on alignment parameters!

    Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
    Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182

     Score = 36.8 bits (113), Expect = 7e-07, Method: Composition-based stats.
     Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)

    Query  31   FPPGFRFHPTDEELVLYYLKRKICRRRILLDAIAETDVYKW---EPEDLPDLSKLKTGDR  87
                +  GFRF PTD E V + L R I  + +       T V  +   EP D+         D
    Sbjct  4    LEEGFRFSPTDAEAVTFLL-RFIAGKFMDDSGFITTHVDTYSEQEPWDIYSHGVPCCNDD  62

    Query  88   ----QWFFFSPRDRKYPNGARSNRASKHGYWKATGKDRIITCNSRAVGVKKTLVFYKGRA  143
                    Q+ FF    +K      S      G WK   K + +      V   K  + YK +
    Sbjct  63   NDCTQYRFFITTLKKKSESRYSRNVGNKGSWKQQDKSKPVRKKGGPVIGYKKSMCYKNKG  122

    Query  144  PVGERTDWVMHEYTMDEEELKRCQNAQDYYALYKVFKKS  182
                   E   W+M EY +    L +         L  + K++
    Sbjct  123  YKQEDGHWLMKEYDLSTYILDKFDKDCRDIVLCAIKKRT  161

(BLASTP, BLOSUM45 matrix)
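The ‘gotchas’ can be made concrete with the counts reported in the two BLASTP runs above (37 identities over a 134-column alignment with BLOSUM80; 39 over 159 columns with BLOSUM45; query length 660, subject length 182). The sketch below is illustrative only; the function names are invented, and the choice of denominators is an assumption about what one might reasonably normalise by.

```python
def percent_identity(identical, denominator):
    """Identical positions as a percentage of a chosen denominator."""
    return 100.0 * identical / denominator


QUERY_LEN, SUBJECT_LEN = 660, 182
hsps = {"BLOSUM80": (37, 134), "BLOSUM45": (39, 159)}   # (identities, alignment length)


def identity_table(hsps, query_len, subject_len):
    """Percent identity per scoring matrix, normalised three different ways."""
    table = {}
    for matrix, (identical, aln_len) in hsps.items():
        table[matrix] = {
            "alignment": percent_identity(identical, aln_len),
            "subject": percent_identity(identical, subject_len),
            "query": percent_identity(identical, query_len),
        }
    return table


for matrix, row in identity_table(hsps, QUERY_LEN, SUBJECT_LEN).items():
    print(matrix, {key: round(value, 1) for key, value in row.items()})
```

Relative to alignment length the BLOSUM80 hit looks more similar; relative to either sequence length the BLOSUM45 hit does. The same pair of sequences gives different ‘distances’ depending on parameters and on the denominator.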
![Page 55: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/55.jpg)
Defining a distance: beyond identity

    Query  1    GCCGCCGGGGTTCAGGTTCCACCCGACGGACGAGGAGCTGGTGATGC-ACTACCT-CTGC  58
                |||||||||||||||||||||||||||||||||||||||  | |  | ||||||| | ||
    Sbjct  1    GCCGCCGGGGTTCAGGTTCCACCCGACGGACGAGGAGCTCATCAC-CTACTACCTGCGGC  59

    Query  59   CGGCGGT-GC-GCCGGCCTCCCCATCGCCGTCCCCATCATCGCCGAGGTCGACCTCTACA  116
                 |  | | || | ||||  |  || || ||    |  ||||||||||||||| ||| |||
    Sbjct  60   AGAAGATCGCCGACGGCGGCTTCA-CGGCGAGGGC--CATCGCCGAGGTCGATCTCAACA  116

    Query  117  AGTTCGATCCATGGCATCTCCCA-AGAATGGCGCTGTACGGC-GAGAAGGAGTGGTACTT  174
                ||| ||| |||||| |||||||| |||| |||     | ||  |||||||| ||||||||
    Sbjct  117  AGTGCGAGCCATGGGATCTCCCAGAGAA-GGCAAAA-ATGGGAGAGAAGGAATGGTACTT  174

    Query  175  CTTCTCCCCTC-GGGACCGCAAGTACCCGAAC  205
                ||| |  |||  |||| || |||||||| |||
    Sbjct  175  CTT-TAGCCTAAGGGATCGAAAGTACCC-AAC  204

Identity ≈ yes/no
We can quantify similarity in ‘bits’
![Page 56: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/56.jpg)
Defining a distance: bit score and E-value
- Bit score and E-value can be used as distance measures.
  - I prefer (normalised) bit scores
  - Small changes in score → large changes in E
  - E varies linearly with database size; λS independent of database size
$E = kmn\,e^{-\lambda S}$
![Page 57: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/57.jpg)
Defining a distance: bit score and E-value
- Bit score and E-value can be used as distance measures.
  - I prefer (normalised) bit scores
  - Small changes in score → large changes in E
  - E varies linearly with database size and query length; λS independent of database size
$E = kmn\,e^{-\lambda S}$

    Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
    Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182

BLOSUM80:

     Score = 31.4 bits (64), Expect = 4e-05, Method: Composition-based stats.
     Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)

BLOSUM45:

     Score = 36.8 bits (113), Expect = 7e-07, Method: Composition-based stats.
     Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)
![Page 58: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/58.jpg)
Defining a distance: bit score and E-value
- Bit score and E-value can be used as distance measures.
  - I prefer (normalised) bit scores
  - Small changes in score → large changes in E
  - E varies linearly with database size and query length; λS independent of database size
$E = kmn\,e^{-\lambda S}$

    Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
    Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182

     Score = 36.8 bits (113), Expect = 7e-07, Method: Composition-based stats.
     Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)
     Score = 30.0 bits (89), Expect = 8e-05, Method: Composition-based stats.
     Identities = 23/114 (21%), Positives = 36/114 (32%), Gaps = 0/114 (0%)
    (db size: 5 sequences)

     Score = 36.8 bits (113), Expect = 1e-06, Method: Composition-based stats.
     Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)
     Score = 30.0 bits (89), Expect = 1e-04, Method: Composition-based stats.
     Identities = 23/114 (21%), Positives = 36/114 (32%), Gaps = 0/114 (0%)
    (db size: 483 sequences)

    ***** No hits found *****
    (db size: 644 sequences)
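Why the same bit score gives different E-values can be sketched from the formula. Expressed with a normalised bit score S′, the relation becomes E = m·n·2^(−S′), with m the query length and n the total database length. The code below is a simplification under stated assumptions: the database lengths are invented, and real BLAST uses effective lengths and composition-based statistics, so it will not reproduce the reported values exactly.

```python
def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for a normalised bit score S' (simplified form)."""
    return query_len * db_len * 2.0 ** (-bit_score)


BIT_SCORE = 36.8          # bit score reported in the BLOSUM45 run
QUERY_LEN = 660           # query length from the same output

small_db = expect_value(BIT_SCORE, QUERY_LEN, db_len=1_000)      # hypothetical database length
large_db = expect_value(BIT_SCORE, QUERY_LEN, db_len=100_000)    # hypothetical database length

print(f"E (small database): {small_db:.2e}")
print(f"E (large database): {large_db:.2e}")
print(f"ratio: {large_db / small_db:.0f}")

# One extra bit of score halves E, independent of database size.
print(expect_value(BIT_SCORE + 1, QUERY_LEN, 1_000) / small_db)
```

The bit score stays fixed while E scales linearly with the search space, which is one reason to prefer bit scores when comparing hits across databases of different size.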
![Page 60: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/60.jpg)
Defining a distance: alignment v profile
Alignments compare two sequences. Profiles capture information from several sequences.

Aligned sequences:
ACATAT
TCAACT
ACACGC
AGAATC
ACAGAA

Consensus: ACAAAT

Regular expression: [AT][CG]A[ACGT][ACGT][TCA]
Same pattern in PROSITE-style notation: [AT]-[CG]-A-X(2)-{G}

PSSM (counts per alignment position):
Position 1 2 3 4 5 6
A 4 0 5 2 2 1
C 0 4 0 1 1 2
G 0 1 0 1 1 0
T 1 0 0 1 1 2

Hidden Markov model (HMM)
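The profile representations on these slides can all be derived from the column counts of the five aligned sequences. A minimal sketch follows; the function names are illustrative and not taken from any library.

```python
from collections import Counter

SEQS = ["ACATAT", "TCAACT", "ACACGC", "AGAATC", "ACAGAA"]
ALPHABET = "ACGT"

def column_counts(seqs):
    """Count each base at every alignment column (a count-based PSSM)."""
    return [Counter(col) for col in zip(*seqs)]

def consensus(counts):
    """Most common base per column; ties are broken arbitrarily."""
    return "".join(c.most_common(1)[0][0] for c in counts)

def regex(counts):
    """Character class of every base observed in each column."""
    parts = []
    for c in counts:
        seen = "".join(b for b in ALPHABET if c[b] > 0)
        parts.append(seen if len(seen) == 1 else f"[{seen}]")
    return "".join(parts)

counts = column_counts(SEQS)
for base in ALPHABET:
    print(base, " ".join(str(c[base]) for c in counts))
print(regex(counts))
```

The regular expression only records which bases occur; the count matrix also keeps how often each occurs, which is the extra information a profile carries.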
![Page 65: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/65.jpg)
Defining a distance: bit scores in HMMer
• HMMer works differently to BLAST: profile HMMs
• Statistical model of multiple sequence alignment (not pairwise sequence alignment)
• phmmer and jackhmmer are the equivalents of BLASTP and PSI-BLAST
• Explicit statistical representation of alignment uncertainty
• Sequence scores, not alignment scores
• Bit score is a ‘log-odds’ score:

$$\text{log-odds} = \log_2\left(\frac{P(\text{sequence matches alignment})}{P(\text{sequence matches null model})}\right)$$
Goritschnig S, Krasileva KV, Dahlbeck D, Staskawicz BJ (2012) Computational prediction and molecular characterization of an oomycete effector and the cognate Arabidopsis resistance gene. PLoS Genetics 8: e1002502. doi:10.1371/journal.pgen.1002502.
Haas BJ, Kamoun S, Zody MC, Jiang RHY, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393–398. doi:10.1038/nature08358.
Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.
• The null model is a control; the choice of null model can be important
• Sequence matches alignment better than control (null) → log-odds > 0
• Sequence matches control (null) better than alignment → log-odds < 0
• Sequence matches alignment and control (null) equally → log-odds ≈ 0
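The sign of the log-odds score can be illustrated with a toy profile. This is a sketch under stated assumptions: it uses independent per-column probabilities (a PSSM, not a full profile HMM as in HMMer), a pseudocount of 1, and a uniform null model. The two test sequences are made up.

```python
import math

SEQS = ["ACATAT", "TCAACT", "ACACGC", "AGAATC", "ACAGAA"]
ALPHABET = "ACGT"

def position_probs(seqs, pseudocount=1.0):
    """Per-column base probabilities, smoothed with a pseudocount."""
    probs = []
    for col in zip(*seqs):
        total = len(col) + pseudocount * len(ALPHABET)
        probs.append({b: (col.count(b) + pseudocount) / total for b in ALPHABET})
    return probs

def log_odds_bits(seq, probs, null=None):
    """log2 of P(seq | profile) / P(seq | null), summed over columns."""
    null = null or {b: 1.0 / len(ALPHABET) for b in ALPHABET}
    return sum(math.log2(p[b] / null[b]) for b, p in zip(seq, probs))

probs = position_probs(SEQS)
for s in ("ACAAAT", "GTGTGG"):   # profile-like vs. unlike (hypothetical)
    print(s, round(log_odds_bits(s, probs), 2))
```

Changing the `null` dictionary (for example to a strongly AT-biased background) shifts every score, which is one way to see why the choice of null model matters.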
![Page 70: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/70.jpg)
Defining a distance: bit scores in HMMer
• Easy to read bit scores from HMMer output

    Query:       NAM  [M=129]
    Accession:   PF02365.10
    Description: No apical meristem (NAM) protein
    Scores for complete sequences (score includes all domains):
       --- full sequence ---   --- best 1 domain ---    -#dom-
        E-value  score  bias    E-value  score  bias    exp  N  Sequence Description
        ------- ------ -----    ------- ------ -----   ---- --  -------- -----------
        3.1e-54  171.0   0.1    5.3e-54  170.3   0.1    1.4  1  StNac1_5
        5.5e-54  170.2   0.1    8.8e-54  169.6   0.1    1.3  1  NbNac1_1
          4e-53  167.4   0.1    6.3e-53  166.8   0.1    1.3  1  NbNac2_1
        1.5e-52  165.6   0.1    3.3e-52  164.5   0.1    1.6  1  StNac2_5

    Domain annotation for each sequence (and alignments):
    >> StNac1_5
       #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
     ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
       1 !  170.3   0.1   5.3e-54   5.3e-54       1     128 [.      28     156 ..      28     157 .. 0.97

      Alignments for each domain:
      == domain 1  score: 170.3 bits;  conditional E-value: 5.3e-54
    PF02365.10 1 lppGfrFhPtdeelvveyLkkkvegkkleleevikevdiykvePwdLp..akvkaeekewyfFskrdkkyatgkrknratksgyWkatgkdkevlskkg 97
    lp+G+rF+Ptdeelv++yL+ k++g + ++ +vi+evdi+k+ePwdLp ++v+++++ew+fF+++d+ky++g+r nrat++gyWkatgkd+++++kkg
    StNac1_5 28 LPVGYRFRPTDEELVNHYLRLKINGADSQV-SVIREVDICKLEPWDLPdlSVVESHDNEWFFFCPKDRKYQNGQRLNRATERGYWKATGKDRNIVTKKG 125
    699************************999.99***************888899999****************************************** PP

    PF02365.10 98 elvglkktLvfykgrapkgektdWvmheyrl 128
    +++g+kktLv+y grap+g++t+Wv+heyr+
    StNac1_5 126 AKIGMKKTLVYYIGRAPEGKRTHWVIHEYRA 156
    *****************************96 PP
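The per-sequence bit scores can also be pulled out of the report programmatically. The sketch below assumes the human-readable layout shown on this slide (a dashed rule followed by one row per sequence); the function name is hypothetical. For real work a dedicated parser is a safer choice than ad hoc splitting.

```python
REPORT = """\
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------
    3.1e-54  171.0   0.1    5.3e-54  170.3   0.1    1.4  1  StNac1_5
    5.5e-54  170.2   0.1    8.8e-54  169.6   0.1    1.3  1  NbNac1_1
      4e-53  167.4   0.1    6.3e-53  166.8   0.1    1.3  1  NbNac2_1
    1.5e-52  165.6   0.1    3.3e-52  164.5   0.1    1.6  1  StNac2_5

"""

def full_sequence_scores(report):
    """Map sequence name -> (E-value, bit score) from the per-sequence table."""
    scores, in_table = {}, False
    for line in report.splitlines():
        fields = line.split()
        if fields and all(set(f) == {"-"} for f in fields):
            in_table = True      # the all-dashes rule precedes the rows
            continue
        if in_table:
            if len(fields) < 9:
                break            # a blank or short line ends the table
            scores[fields[8]] = (float(fields[0]), float(fields[1]))
    return scores

for name, (evalue, bits) in full_sequence_scores(REPORT).items():
    print(f"{name}: {bits} bits (E = {evalue:g})")
```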
![Page 71: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/71.jpg)
Defining a distance: composition
• Sometimes, sequence comparison doesn’t tell you much (e.g. T3 effector signals)
• Can use ‘bulk properties’ of sequence composition
• Many ways to derive a ‘distance’

Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, et al. (2009) Sequence-based prediction of type III secreted proteins. PLoS Pathog 5: e1000376. doi:10.1371/journal.ppat.1000376.
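One of the many possible composition-based distances is the Euclidean distance between amino acid frequency vectors. This is only a sketch of the idea, not the method of the cited papers; the function names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in `seq`."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def composition_distance(seq_a, seq_b):
    """Euclidean distance between two composition vectors."""
    return math.dist(composition(seq_a), composition(seq_b))
```

Because the measure ignores residue order, a sequence and any shuffled copy of it are at distance 0. That is the point of a bulk property, and also its main limitation.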
![Page 72: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/72.jpg)
Defining a ‘distance’: clustering
Not really a distance, more a bound: sequences that cluster with your known examples
![Page 74: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/74.jpg)
Defining a ‘distance’: CD-HIT clusters
• Clustering tool, online at http://weizhong-lab.ucsd.edu/cd-hit/
• Sequences sorted by decreasing length
• First sequence is representative of first cluster: ‘seen’
• Consider each remaining sequence in turn: compare with ‘seen’ set
  • Similarity of sequence with ‘seen’ sequence > threshold? Merge into cluster
  • Otherwise start new cluster: ‘seen’
• Fast, but can be sensitive to sequence set composition (use multi-step).
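The greedy procedure described in these bullets can be sketched in a few lines. This is an illustration of the idea only, not CD-HIT's implementation: the `similarity` function is a stand-in built on Python's `difflib`, whereas CD-HIT uses word filtering and alignment identity.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; CD-HIT itself uses alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def greedy_cluster(seqs, threshold=0.9):
    """Return clusters; the first member of each cluster is its representative."""
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):   # longest first
        for cluster in clusters:
            if similarity(cluster[0], seq) > threshold:
                cluster.append(seq)                   # join the first match
                break
        else:
            clusters.append([seq])                    # new 'seen' representative
    return clusters
```

Each sequence joins the first representative it matches, so the result depends on the order in which sequences are processed. This is the sensitivity to sequence set composition noted above.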
• Need to test clusters for robustness
![Page 82: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/82.jpg)
Defining a ‘distance’: MCL clustering
• Clustering algorithm (used in TribeMCL, OrthoMCL)
• Markov Clustering Algorithm
• Finds clusters in networks
• Use BLAST to generate all-vs-all pairwise comparisons
• Results are a network (similarity graph)
• Given such a network:
  • Expansion (raise to power) – ‘spreads links’
  • Inflation (scaling) – ‘thickens strong links’
Repeated application of the expansion/inflation cycle results in the formation of clusters.
![Page 85: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/85.jpg)
Defining a ‘distance’: MCL clustering
[Figure: an input network is transformed by alternating expansion and inflation steps into the final clustering.]
• One key parameter: inflation value
• Need to cluster over several inflation values to confirm robustness (consistency of clustering)

Inflation value | Clusters
1.4 | 3
2.0 | 6
4.0 | 18
6.0 | 33
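The expansion/inflation cycle can be written out directly. This is a minimal pure-Python sketch for small graphs, assuming a symmetric matrix of non-negative similarity weights, added self-loops, and a fixed number of iterations in place of a convergence test. It is not the `mcl` program itself.

```python
def normalise(m):
    """Scale every column of a square matrix to sum to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl(adjacency, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster a weighted similarity graph given as a square matrix."""
    n = len(adjacency)
    # Self-loops are a common convention to stabilise the random walk.
    m = normalise([[adjacency[i][j] + (i == j) for j in range(n)]
                   for i in range(n)])
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power, then rescale the columns.
        m = normalise([[x ** inflation for x in row] for row in m])
    # Rows with weight on the diagonal are attractors; the columns they
    # retain form one cluster each.
    clusters = {frozenset(j for j in range(n) if m[i][j] > eps)
                for i in range(n) if m[i][i] > eps}
    return sorted(sorted(c) for c in clusters)
```

Running the same graph through `mcl` at several `inflation` values and comparing the returned clusters is the robustness check suggested above.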
![Page 87: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/87.jpg)
Defining a distance
• Sequence identity – scores alignment (symmetry?)
• Derived score (based on sequence identity/alignment)
  • Bit score in BLAST – scores alignment (substitution matrix)
  • E-value in BLAST – scores alignment (sensitive to query/db size, substitution matrix)
• Derived score (based on other measures)
  • Bit score in HMMer – scores sequence relative to model (null model?)
• Clustering
  • Sequence identity (e.g. CD-HIT) – can be sensitive to sequence order (multi-step? test for robustness? CD-HIT uses sequence identity)
  • MCL – needs all-v-all pairwise (test for robustness; uses BLAST E-value by default)
![Page 88: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/88.jpg)
Many definitions of distance
• How do we define ‘distance’?
• How large a ‘distance’ (or what clustering resolution) do we take?
• How do we know we’ve chosen a sensible ‘distance’?
![Page 89: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/89.jpg)
How large a distance do we allow?
![Page 90: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/90.jpg)
Confusion Matrix
• Our distance/boundary classifies sequences as ‘in’ or ‘out’
  • ‘red’ or ‘blue’
• Changing distance/bound results in various degrees of success…

Confusion matrix:
IN OUT
Red 1 5
Blue 1 36
Red, IN: true positive (TP)
Red, OUT: false negative (FN)
Blue, IN: false positive (FP)
Blue, OUT: true negative (TN)
False positive rate = FP/(FP+TN)
False negative rate = FN/(TP+FN)
Sensitivity = TP/(TP+FN)
Specificity = TN/(FP+TN)
False discovery rate = FP/(FP+TP)
False positive rate = 1/37 = 0.03
False negative rate = 5/6 = 0.83
Sensitivity = 1/6 = 0.17
Specificity = 36/37 = 0.97
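These statistics are simple ratios of the four cells, so they are easy to recompute for any boundary. The sketch below treats Red as the positive class and IN as the positive prediction, as in the slides. The function name and the labels `tight`, `medium` and `loose` are hypothetical; the counts are the three matrices shown on these slides.

```python
def rates(tp, fn, fp, tn):
    """Summary statistics for a 2x2 confusion matrix."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false_discovery_rate": fp / (fp + tp),
    }

# (TP, FN, FP, TN) read row by row: Red IN, Red OUT, Blue IN, Blue OUT.
matrices = {"tight": (1, 5, 1, 36), "medium": (5, 2, 4, 33), "loose": (7, 0, 14, 23)}
for name, cells in matrices.items():
    print(name, {k: round(v, 2) for k, v in rates(*cells).items()})
```

Sensitivity and false negative rate always sum to 1, as do specificity and false positive rate, which gives a quick consistency check on any reported table.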
IN OUT
Red 5 2
Blue 4 33
False positive rate = 4/37 = 0.11
False negative rate = 2/7 = 0.29
Sensitivity = 5/7 = 0.71
Specificity = 33/37 = 0.89
IN OUT
Red 7 0
Blue 14 23
False positive rate = 14/37 = 0.38
False negative rate = 0/7 = 0
Sensitivity = 7/7 = 1
Specificity = 23/37 = 0.62
![Page 96: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/96.jpg)
ROC Curve
• To assess how well a method performs, can use ROC (Receiver Operating Characteristic) curve
• Typically, we use area under the curve (AUC) to choose between methods

[Figure: ROC curve. x-axis: false positive rate (0 to 1); y-axis: sensitivity (0 to 1). Series: classifier, random.]
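A ROC curve is traced by relaxing the acceptance threshold one step at a time and recording the false positive rate and sensitivity at each step. The sketch below assumes that a higher score means a more confident ‘positive’ call (for a distance, negate it first). The function names are illustrative.

```python
def roc_points(scores, labels):
    """(false positive rate, sensitivity) pairs as the threshold is relaxed.

    Higher score = more confident 'positive'; labels are True/False.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += not label
        # Emit a point only once all items tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An AUC of 1 means every positive outranks every negative; 0.5 corresponds to the ‘random’ diagonal in the figure.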
[Figure: the same ROC curve, annotated with the direction of better performance.]
• The ‘best’ parameter setting for a method is typically near the apex of the curve.
![Page 99: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/99.jpg)
F-measure
IN OUT
Red 1 5
Blue 1 36

False positive rate = FP/(FP+TN)
False negative rate = FN/(TP+FN)
Sensitivity = TP/(TP+FN)
Specificity = TN/(FP+TN)

• We can ‘game’ ROC statistics by increasing irrelevant ‘negative’ examples
• Increasing TN ‘improves’ false positive rate and specificity
• Can use precision and recall instead
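The ‘gaming’ effect is easy to reproduce. The sketch below starts from the matrix on this slide and pads it with extra negatives, assuming (hypothetically) that every added negative is trivially rejected and so counts as a true negative.

```python
def specificity(fp, tn):
    return tn / (fp + tn)

def precision(tp, fp):
    return tp / (tp + fp)

tp, fn, fp, tn = 1, 5, 1, 36           # matrix from the slide
for extra in (0, 1_000, 100_000):      # hypothetical padding with easy negatives
    print(extra, round(specificity(fp, tn + extra), 5), precision(tp, fp))
```

Specificity climbs towards 1 as negatives are added, while precision stays at 0.5 because it does not involve TN at all.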
Precision (PPV) = TP/(TP+FP)
Recall = sensitivity = TP/(TP+FN)
FDR = 1 − PPV = FP/(TP+FP)
• Precision: proportion of positive predictions that are correct
• Recall: proportion of positive examples recovered (sensitivity)
• F1 = 2 × (precision × recall)/(precision + recall)
![Page 102: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/102.jpg)
F-measure
• The F-measure indicates which set of parameters (which distance) is ‘best’
• Several F-measures available that weight precision and recall differently

[Figure: F-measure (y-axis, 0 to 0.7) plotted for settings 1, 2 and 3.]
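The family of F-measures that weight precision and recall differently is usually written as F-beta, with F1 as the special case beta = 1. A minimal sketch follows; the function name is illustrative, and it assumes at least one true positive so that no denominator is zero.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-measure; beta > 1 favours recall, beta < 1 favours precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (TP, FP, FN) for three example boundaries from the confusion-matrix slides.
for cells in [(1, 1, 5), (5, 4, 2), (7, 14, 0)]:
    print(cells, round(f_beta(*cells), 3), round(f_beta(*cells, beta=2.0), 3))
```

True negatives do not appear anywhere in the calculation, so padding the negative set cannot inflate the score.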
![Page 103: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/103.jpg)
How large a distance do we allow?
• Assign known ‘positive’ and ‘negative’ examples
• Vary distances and take F-measure
• Choose distance that gives the best performance
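The three steps above amount to a threshold sweep. The sketch below assumes each labelled example has a single distance to the reference and that smaller distances mean ‘in’. The function names and the example data in the usage note are hypothetical.

```python
def f1_at(threshold, distances, is_positive):
    """F1 when every item with distance <= threshold is called 'in'."""
    pairs = list(zip(distances, is_positive))
    tp = sum(d <= threshold and p for d, p in pairs)
    fp = sum(d <= threshold and not p for d, p in pairs)
    fn = sum(d > threshold and p for d, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(distances, is_positive):
    """Try each observed distance as the cut-off; keep the best F1."""
    return max(sorted(set(distances)),
               key=lambda t: f1_at(t, distances, is_positive))
```

For example, `best_threshold([0.1, 0.2, 0.3, 0.6, 0.7, 0.9], [True, True, False, True, False, False])` picks the cut-off that recovers all three positives at the cost of one false positive. The chosen value is only as good as the labelled examples it was tuned on.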
![Page 106: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/106.jpg)
Confusion Matrix
• BUT: how do we know that we’ve chosen a suitable distance?
• Training set choice is critical

IN OUT
Red 5 2
Blue 4 33

False positive rate 0.11
False negative rate 0.29
Sensitivity 0.71
Specificity 0.89
![Page 107: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/107.jpg)
Training set choice
Train classifier on known examples: looks good…
![Page 108: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/108.jpg)
Unrepresentative examples
…but training set biased/unrepresentative sample…
![Page 109: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/109.jpg)
Overfitting
…or ‘fits’ known positives unfeasibly tightly
![Page 111: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/111.jpg)
How do we know we’ve chosen a suitable distance?
![Page 112: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/112.jpg)
A trip to the doctor, part I
• Routine medical checkup
• Test for disease X (horrible, unpleasant, potentially suppurating)
• Test has sensitivity (i.e. predicts disease where there is disease) of 95%
• Test has false positive rate (i.e. predicts disease where there is no disease) of 1%
• Your test is positive
• What is the probability that you have disease X?
Candidate answers: 0.01, 0.05, 0.95, 0.99, 0.50
![Page 118: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/118.jpg)
Cross-validation
• Estimation of classifier performance depends on
  • distance measure
  • composition of training set (‘positives’ and ‘negatives’)
• Cross-validation gives objective measure of performance
• Many strategies available, including:
  • leave-one-out (LOO)
  • k-fold crossvalidation
  • repeated (random) subsampling
• Essentially: always keep a hold-out set (not used to train)
![Page 122: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/122.jpg)
k-fold crossvalidation
• No crossvalidation:
  • One training set
  • No test (hold-out/validation) set
  • Risks overfitting
[Figure legend: training set, test set.]
![Page 123: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/123.jpg)
k-fold crossvalidation
• Validation:
  • One training set, one test (hold-out/validation) set
  • Test performance of classifier on unseen data
[Figure legend: training set, test set.]
![Page 124: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/124.jpg)
k-fold crossvalidation
• 2-fold crossvalidation:
  • Two runs, each with one training set, one test set
  • Swap training and test sets, collate results
[Figure: training and test sets for run1 and run2.]
![Page 125: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/125.jpg)
k-fold crossvalidation
• 3-fold crossvalidation:
  • Three runs, each with one training set, one test set
[Figure: training and test sets for run1, run2 and run3.]
![Page 126: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/126.jpg)
k-fold crossvalidation
• k-fold crossvalidation:
  • k runs, each with one training set, one test set (n items in dataset, k>1)
[Figure: runs 1 to k; in each run the test set holds n/k items and the training set the remaining n − (n/k).]
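The splitting scheme can be sketched as an index generator. This is a minimal illustration with a hypothetical function name. In practice the items are usually shuffled first, and often stratified so that each fold keeps a similar ratio of positives to negatives.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for each of k runs over n items."""
    if not 1 < k <= n:
        raise ValueError("need 1 < k <= n")
    folds = [list(range(i, n, k)) for i in range(k)]   # interleaved folds
    for test in folds:
        train = sorted(set(range(n)) - set(test))
        yield train, test

for run, (train, test) in enumerate(k_fold_splits(n=10, k=3), start=1):
    print(f"run{run}: train={train} test={test}")
```

Every item is used for testing exactly once. Setting k = n gives leave-one-out.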
![Page 127: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/127.jpg)
After crossvalidation
False positive rate 0.11
False negative rate 0.29
Sensitivity 0.81
Specificity 0.89
Precision 0.56
- Use crossvalidation to find ‘best’ method & parameters
- Crossvalidation gives you estimated performance metrics on unseen data
- Apply ‘best’ method to complete dataset for prediction
![Page 128: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/128.jpg)
A trip to the doctor, part II
- Test for disease X (horrible, unpleasant, potentially suppurating)
  - Test has sensitivity (i.e. predicts disease where there is disease) of 95%
  - Test has false positive rate (i.e. predicts disease where there is no disease) of 1%
- Your test is positive
- To calculate the probability that the test correctly determines whether you have the disease, you need to know the baseline occurrence.
- Baseline occurrence: 1% ⇒ P(disease|+ve) = 0.490
- Baseline occurrence: 80% ⇒ P(disease|+ve) = 0.997
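The two posterior probabilities quoted for this test follow directly from Bayes' theorem. A minimal sketch of the arithmetic, using the slide's sensitivity and false positive rate (the function name `posterior` is an illustrative choice):

```python
def posterior(sensitivity, false_positive_rate, baseline):
    """P(disease | positive test), by Bayes' theorem."""
    true_pos = sensitivity * baseline                    # P(+ve | disease) P(disease)
    false_pos = false_positive_rate * (1.0 - baseline)   # P(+ve | no disease) P(no disease)
    return true_pos / (true_pos + false_pos)


for baseline in (0.01, 0.80):
    p = posterior(sensitivity=0.95, false_positive_rate=0.01, baseline=baseline)
    print(f"baseline {baseline:.0%}: P(disease|+ve) = {p:.3f}")
```

The same test gives a near coin-flip answer at a 1% baseline and a near-certain one at 80%, which is why the baseline has to be known before a positive result can be interpreted.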
![Page 131: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/131.jpg)
What is the baseline occurrence for effectors?
- Usually rely on predictions for expected baseline
- Bacterial genomes: ≈4500 genes
  - Type III effectors: 1-10% (Arnold et al. 2009); 1-2% (Collmer et al. 2002); 1% (Boch and Bonas, 2010)
- Oomycete/fungal genomes: ≈20000 genes
  - RxLRs: 120-460 (1-2%; Whisson et al. 2007); ≤563 (≲2%; Haas et al. 2009)
  - CRNs: 19-196 (≲1%; Haas et al. 2009)
  - CHxC: ≈30 (<1%; Kemen et al. 2011)
- We need to take care over result interpretation:
  - Prediction method with 5% false negative rate and 1% false positive rate, with 1% baseline, predicting 500 effectors:
    - P(effector|positive test) ≈ 0.5
![Page 135: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/135.jpg)
A lesson from the literature?
- “The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of 71% and [specificity] of 85%.”
  - Sensitivity [P(+ve|T3E)] = 0.71; FPR [1-Specificity; P(+ve|not T3E)] = 0.15
  - Base rate [P(T3E)] ≈ 3%; Genes = 4500
- We expect P(T3E|+ve) ≈ 0.13
  - (and a significant number, up to 15% of the genome, of false positives…)
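$$P(\mathrm{T3E} \mid +\mathrm{ve}) = \frac{P(+\mathrm{ve} \mid \mathrm{T3E})\,P(\mathrm{T3E})}{P(+\mathrm{ve} \mid \mathrm{T3E})\,P(\mathrm{T3E}) + P(+\mathrm{ve} \mid \text{not T3E})\,P(\text{not T3E})}$$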
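The figures on this slide can also be read as expected gene counts rather than probabilities. The sketch below uses the slide's numbers (4500 genes, 3% base rate, sensitivity 0.71, false positive rate 0.15); the function name `expected_calls` is an illustrative choice, and the counts are expectations, not results from any real genome.

```python
def expected_calls(n_genes, base_rate, sensitivity, false_positive_rate):
    """Expected numbers of true and false positive calls across a genome."""
    n_effectors = n_genes * base_rate
    n_other = n_genes - n_effectors
    true_pos = sensitivity * n_effectors
    false_pos = false_positive_rate * n_other
    return true_pos, false_pos


tp, fp = expected_calls(n_genes=4500, base_rate=0.03,
                        sensitivity=0.71, false_positive_rate=0.15)
ppv = tp / (tp + fp)  # P(T3E | +ve)
print(f"expected true positives:  {tp:.0f}")
print(f"expected false positives: {fp:.0f} ({fp / 4500:.1%} of the genome)")
print(f"P(T3E|+ve) = {ppv:.2f}")
```

With a rare class, the false positives drawn from the large non-effector pool outnumber the true positives several times over, even though the classifier's headline figures look respectable.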
![Page 138: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/138.jpg)
A lesson from the literature?
- “The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate (Table 1)”
0.038 × 5169 × 0.13 ≈ 26 [No. +ve × P(T3E|+ve)]
![Page 139: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/139.jpg)
Director’s Commentary: Finding RxLRs
- Supplementary from Whisson et al. (2007)
  - Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.
- Not perfect
- Detail of one way to construct a classifier
![Page 142: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/142.jpg)
Building a training set
- Starting point: 49 candidate sequences (reference set)
- Known:
  - Contain (putatively) RxLR-EER motif
  - All but one transcribed (i.e. not bad gene calls)
- Assumed:
  - Presence of signal peptide and RxLR-EER categorises effectors
![Page 145: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/145.jpg)
Building a training set
- SignalP 3.0 (Bendtsen et al. 2004) to predict locations of signal peptides.
- SignalP also has statistical performance estimates:
- Settings:
  - HMM cutoff probability = 0.9
  - Cleavage site between positions 10 and 40 inclusive
- Justification: use in previous studies by others
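The two settings above amount to a simple filter over signal peptide predictions. The sketch below assumes the SignalP output has already been parsed into records: the `SignalPCall` type, its field names, and the example values are all hypothetical and are not SignalP's own output format. It also assumes the probability cutoff is inclusive.

```python
from typing import NamedTuple


class SignalPCall(NamedTuple):
    """Hypothetical, already-parsed summary of one SignalP HMM prediction."""
    seq_id: str
    sp_probability: float    # HMM signal peptide probability
    cleavage_position: int   # predicted cleavage site position


def passes_filter(call, cutoff=0.9, first=10, last=40):
    """Apply the slide's settings: probability cutoff and cleavage-site window."""
    # Assumption: the cutoff is inclusive (>= 0.9); the slide does not say.
    return call.sp_probability >= cutoff and first <= call.cleavage_position <= last


# Invented example values, for illustration only.
calls = [
    SignalPCall("cand_1", 0.98, 21),  # passes both tests
    SignalPCall("cand_2", 0.85, 19),  # probability below the cutoff
    SignalPCall("cand_3", 0.95, 47),  # cleavage site outside positions 10-40
]
kept = [c.seq_id for c in calls if passes_filter(c)]
print(kept)
```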
![Page 147: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/147.jpg)
Building a training set
- Of 49, four sequences failed
  - One carried forward on experimental grounds (highly-expressed)
- Training set now has 46 sequences
- But seven of these actually have no recognisable RxLR-EER motif, so are discarded
- Training set now has 39 sequences
![Page 151: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/151.jpg)
Building a classifier
- We have a recognisable motif, with substantial local variation and indels
  - Therefore chose profile HMM
  - Use HMMER software
- Profile HMMs sensitive to quality of alignment
  - Therefore treat alignment as a parameter of the HMM (much difference between alignments!)
![Page 154: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/154.jpg)
Building a classifier
- Anchored at RxLR and EER
![Page 155: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/155.jpg)
Building a classifier
- ClustalW
![Page 156: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/156.jpg)
Building a classifier
- T-Coffee
![Page 157: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/157.jpg)
Building a classifier
- Parameters modified for HMM
  - Alignment package (no alignment, anchored, Clustal, DiAlign, T-Coffee) on default settings
  - Full-length and truncated (no signal peptide) alignments to test for influence of signal peptide region on classifier
    - Plus one alignment of RxLR-EER plus flanking region only (‘cropped’)
- HMM built for each of eleven alignments
  - Default parameters
- Once built, the HMM is the classifier.
![Page 158: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/158.jpg)
Building a classifier
Truncating sequences reshapes sequence space
![Page 160: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/160.jpg)
Building a classifier
- HMM built for each of eleven alignments
  - Default parameters
- Once built, the HMM is the classifier.
hmmbuild --amino <output> <alignment>
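The eleven alignments are five alignment methods in two forms (full-length and truncated), plus the single cropped alignment. The sketch below enumerates them and assembles one `hmmbuild --amino <output> <alignment>` command for each. The short names, the `.hmm` and `.aln` file names, and the helper functions are all hypothetical, and the commands are only built here, not executed.

```python
from itertools import product

# Illustrative short names for the five alignment methods listed on the slide.
ALIGNERS = ["unaligned", "anchored", "clustal", "dialign", "tcoffee"]
VARIANTS = ["full_length", "truncated"]  # with and without the signal peptide


def alignment_names():
    """Five methods x two variants, plus the single 'cropped' alignment."""
    names = [f"{aligner}_{variant}" for aligner, variant in product(ALIGNERS, VARIANTS)]
    names.append("cropped")
    return names


def hmmbuild_command(name):
    """Argument list mirroring: hmmbuild --amino <output> <alignment>."""
    # File names are hypothetical placeholders.
    return ["hmmbuild", "--amino", f"{name}.hmm", f"{name}.aln"]


commands = [hmmbuild_command(name) for name in alignment_names()]
for command in commands:
    print(" ".join(command))
```

Each argument list could be handed to `subprocess.run` on a machine with HMMER installed.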
![Page 161: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/161.jpg)
Testing the classifiers
Only positive examples: How well does a classifier cover them?
![Page 162: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/162.jpg)
Testing the classifiers
- Eleven classifiers to test
- Step 1: Consistency test
  - Does the classifier correctly call as positive the sequences used to train it?
  - Estimates recovery of the information in the training set
- Step 2: Recovery of full sequences
  - Estimates performance of classifier on complete sequence data
[Figure labels: SigP-RxLR-Cterm, RxLR-Cterm, RxLR]
![Page 163: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/163.jpg)
Testing the classifiers
Only positive examples: How well does a classifier recover unseen sequence?
![Page 165: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/165.jpg)
Testing the classifiers
- Step 3: Leave-One-Out Crossvalidation
  - But only have positive examples!
  - Removes possibility that classifier matches on basis of having ‘seen’ a sequence before
![Page 166: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/166.jpg)
Testing the classifiers
- Leave-one-out (LOO) crossvalidation:
  - k runs, each with one training set, one test set (n items in dataset, k=n)
[Figure: run1, run2, …, runk; each run holds out a single item as the Test Set]
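With only positive examples, leave-one-out crossvalidation reduces to a recovery rate: how often a classifier trained on the other n-1 sequences still calls the held-out one positive. The sketch below shows that loop. The 4-mer "classifier" is a deliberately crude stand-in for building and searching a profile HMM, and the three sequences are invented strings, not real effectors.

```python
def leave_one_out(items):
    """Yield (training_set, held_out_item): k-fold crossvalidation with k = n."""
    for i, held_out in enumerate(items):
        yield items[:i] + items[i + 1:], held_out


def loo_recovery(sequences, train, is_positive):
    """Fraction of held-out positives recovered by classifiers that never saw them."""
    hits = 0
    for training_set, held_out in leave_one_out(sequences):
        classifier = train(training_set)
        hits += bool(is_positive(classifier, held_out))
    return hits / len(sequences)


# Toy stand-in for training a profile HMM: remember every 4-mer in the training set.
def train_kmers(seqs, k=4):
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


# Toy stand-in for an HMM search: positive if any 4-mer was seen in training.
def shares_kmer(kmers, seq, k=4):
    return any(seq[i:i + k] in kmers for i in range(len(seq) - k + 1))


toy_sequences = ["MKRSLRAAEER", "MTRQLRGGEER", "MARSLRTTDEK"]  # invented
recovery = loo_recovery(toy_sequences, train_kmers, shares_kmer)
print(f"LOO recovery: {recovery:.2f}")
```

Because the held-out sequence is never in the training set, a hit cannot come from the classifier having memorised that sequence.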
![Page 167: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/167.jpg)
Testing the classifiers
- Step 3: Leave-One-Out Crossvalidation
[Figure labels: SigP-RxLR-Cterm, RxLR-Cterm, RxLR]
Better match to classifier than to control
![Page 168: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/168.jpg)
Testing the classifiers
- Step 4: Tests on negative samples
  - Completely shuffled sequences
  - Shuffled downstream of the signal peptide only
  - Replace RxLR-EER with AAAA-AAA
No classifier identifies a false positive (no classifier matches on sequence composition alone)
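The three kinds of negative sample can each be generated from a positive sequence. A minimal sketch follows, assuming the cleavage site is already known and that the motif appears as a literal `R.LR` followed later by a literal `EER`; the function names and the example sequence are invented for illustration.

```python
import random
import re


def shuffle_all(seq, rng):
    """Completely shuffled sequence: same composition, order destroyed."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def shuffle_downstream(seq, cleavage_site, rng):
    """Keep the signal peptide (first cleavage_site residues), shuffle the rest."""
    return seq[:cleavage_site] + shuffle_all(seq[cleavage_site:], rng)


def mask_motif(seq):
    """Replace the first RxLR with AAAA and the first EER after it with AAA."""
    match = re.search(r"R.LR", seq)
    if match is None:
        return seq
    head = seq[:match.start()] + "AAAA"
    tail = seq[match.end():].replace("EER", "AAA", 1)
    return head + tail


rng = random.Random(0)
seq = "MKLSAVLLAAVASASATDRSLRGAEERLK"  # invented; signal peptide taken as first 16 aa
print(shuffle_all(seq, rng))
print(shuffle_downstream(seq, 16, rng))
print(mask_motif(seq))
```

The first control tests whether a classifier responds to composition alone, the second isolates the contribution of the signal peptide, and the third isolates the contribution of sequence outside the motif.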
![Page 169: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/169.jpg)
Testing the classifiers
- Step 4: Tests on negative samples
  - Shuffled downstream of the signal peptide only (some recognition on basis of signal peptide)
[Figure labels: SigP-RxLR-Cterm, RxLR-Cterm, RxLR]
![Page 170: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/170.jpg)
Testing the classifiers
- Step 4: Tests on negative samples
  - Replace RxLR-EER with AAAA-AAA (some recognition on sequence other than motif)
[Figure labels: SigP-RxLR-Cterm, RxLR-Cterm, RxLR]
![Page 171: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/171.jpg)
Choosing a classifier
- The ‘cropped’ classifier has:
  - 100% recovery of positive training sequences
  - 0% recovery of negative test sequences
- Some variation in classifier performance on whole genome:
![Page 173: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/173.jpg)
Oranges are not the only fruit
- Other classifiers had been proposed, e.g. Bhattacharjee et al. (2006):
  - Presence of signal peptide, with cleavage site in first 40aa
  - Regular expression test:
    - `R.LR.{,40}[ED][ED][KR]` in first 100aa after cleavage site
- Can choose between methods, or report range of predictions
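The regular expression test above translates almost directly into Python's `re` module. The sketch below is one reading of the rule, not the published implementation: it assumes the cleavage site comes from a separate signal peptide predictor, and that the whole match must fall inside the 100-residue window. The function name and the example sequence are invented.

```python
import re

# The pattern as given on the slide; {,40} means "0 to 40 of any residue".
RXLR_EER = re.compile(r"R.LR.{,40}[ED][ED][KR]")


def regex_classifier(seq, cleavage_site, window=100):
    """True if the RxLR-EER pattern lies in the window after the cleavage site."""
    # cleavage_site is None when no signal peptide was predicted (assumption).
    if cleavage_site is None or cleavage_site > 40:
        return False
    # Assumption: the entire match must fit within the window.
    region = seq[cleavage_site:cleavage_site + window]
    return RXLR_EER.search(region) is not None


seq = "MKLSAVLLAAVASASATDRSLRGAEERLK"  # invented; cleavage site taken as 16
print(regex_classifier(seq, cleavage_site=16))
print(regex_classifier(seq, cleavage_site=None))
```

A rule like this is cheap to run alongside an HMM-based classifier, which makes it practical to report the range of predictions rather than a single list.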
![Page 175: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/175.jpg)
So how did it work out…?
- Refined all RxLR predictions to ‘priority set’ of ≈200 for cloning
- First set of 46 candidate effectors (07/11):
  - 25 host interactors detected by Y2H
  - Localisation data for 41 candidates
  - Silencing phenotypes for 19 candidates
  - 22 putative orthologues with P. capsici
- Currently:
  - 44 silencing phenotypes
[Figure: transient expression in leaf of GFP-fused RxLR candidate, showing plasma membrane localisation]
![Page 176: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/176.jpg)
Acknowledgements
- Phytophthora groups at JHI
  - (Paul Birch, Steve Whisson, Dave Cooke)
- Bacteriology groups at JHI
  - (Ian Toth, Nicola Holden)
- Imaging at JHI
  - (Petra Boevink)
- Numerous statisticians
  - (David Broadhurst, Andy Woodward, BioSS)
![Page 177: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/177.jpg)
Sequence space
![Page 178: Mining Plant Pathogen Genomes for Effectors](https://reader033.fdocuments.in/reader033/viewer/2022052321/54c639544a7959476b8b4573/html5/thumbnails/178.jpg)
CD-Hit sequence ordering
- “Algorithm limitations: […] Let say, there are two clusters: cluster #1 has A, X and Y where A is the representative, and cluster #2 has B and Z where B is the representative. The problem is that even if Y is more similar to B than to A, it can still be in cluster #1 because Y first hits A during the clustering process.”
  - http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide
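The limitation quoted above is a property of any greedy, first-hit clustering scheme, and a toy example makes it concrete. The sketch below is not CD-HIT's code: it clusters labelled points on a line, with distance standing in for sequence dissimilarity, and the coordinates, threshold, and function names are invented so that Y sits closer to B than to A.

```python
def greedy_cluster(items, distance, threshold):
    """Assign each item to the first representative within threshold, in input order."""
    clusters = {}  # representative -> members; dicts preserve insertion order
    for name in items:
        for representative in clusters:
            if distance(name, representative) <= threshold:
                clusters[representative].append(name)
                break  # first hit wins, even if a later representative is closer
        else:
            clusters[name] = [name]  # no hit: the item founds a new cluster
    return clusters


# Toy 1-D coordinates standing in for sequences.
positions = {"A": 0.0, "B": 10.0, "X": 1.0, "Y": 5.5, "Z": 11.0}


def dist(p, q):
    return abs(positions[p] - positions[q])


clusters = greedy_cluster(["A", "B", "X", "Y", "Z"], dist, threshold=6.0)
print(clusters)
print("Y to A:", dist("Y", "A"), "| Y to B:", dist("Y", "B"))
```

Y joins cluster A because A is tested first and lies within the threshold, although B is nearer. Cluster membership therefore depends on processing order as well as on similarity.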