Large scale genomes comparisons Bioinformatics aspects (Introduction)

Large scale genomes comparisons Bioinformatics aspects (Introduction) Fredj Tekaia Institut Pasteur [email protected] EMBO Bioinformatic and Comparative Genome Analysis Course Institut Pasteur Paris June 27 - July 9, 2011



Starting from genomes (the whole sequence, the complete set of gene sequences or the complete set of protein sequences of a given species), what do large-scale genome comparisons include?


Large-scale genome comparisons:

Comparing a genome (in terms of whole sequence, whole set of predicted genes or whole set of predicted proteins) to itself (intra-species comparisons) or to another genome (inter-species comparisons).


Large scale genome comparisons

-Duplication;

-Conservation;

-Specificity (species-specific genes, proteins);

-Paralogues, orthologues;

-Families (clusters) of paralogues, of orthologues;

-Genome organisation (duplicated, conserved genes);

-Search for shared motifs in proteins of the same cluster;

-Protein conservation profiles;

-Selection pressure analyses (synonymous and non-synonymous substitutions, ...).


Evolution


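Speciation - Duplication

[Figure: gene tree drawn along a time axis. An ancestral gene G is duplicated into G1 and G2; a speciation event then separates species A and B; G2 is duplicated again in species B. Leaves: A-G1, A-G2, B-G1, B-G21, B-G22, with orthologs, inparalogs and outparalogs labelled.]

•Speciation

•Duplication

•Inparalogs

•Orthologs

•Outparalogs

•Loss of genes

Can these events be predicted by comparing genomes?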


Orthologs / Paralogs

• How to detect orthologous genes?

- easy way: reciprocal best hit (RBH)

[Figure: genes 1a, 2.1a, 2.2a and 3a of Organism A paired with genes 1b, 2.1b, 2.2b and 3b of Organism B.]


• Large-scale comparative analysis of predicted proteomes revealed significant evolutionary processes: Expansion, Exchange and Deletion.

[Figure: from the ancestor to a species genome. Evolutionary processes include Phylogeny*, Expansion* (duplication, genesis), Exchange* (HGT), Deletion* (loss) and selection*.]


Kellis et al. Nature, 2004

S. cerevisiae genome: colours reveal duplications


Kellis et al. Nature, 2004

[Figure: speciation, duplication and deletion; actual content of the 2 copies; reconstruction of the ancestral organization.]


Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.

Original version

Actual version


Search for similarity


Methods:

• It is important to know how the algorithms that perform sequence comparisons work;

• There are many comparison methods;

• Among the most used:

  • BLAST

  • FASTA

  • Smith-Waterman algorithm (dynamic programming method)

  • HMM (Hidden Markov Model)


Sequence Comparisons

V I T K L G T C V G S
V I S . . . T Q V G S

V I T K L G T C V G S
V . S K . G T Q V . S

• Identity • Similarity • Homology


Comparison of 2 sequences

• The aim is to find the optimal alignment, i.e. the one that best reveals both the most similar regions and the less similar ones.

• In describing sequence comparisons, three different terms are commonly used: Identity, Similarity and Homology.

Need for a score that evaluates:

- matches

- mismatches

- gaps

and a method that evaluates the numerous possible alignments.


Identity

• Refers to the occurrence of identical nucleotides or amino acids at the same position in aligned sequences;

• Identity is objective and well defined;

• Identity can be quantified as a percentage, i.e. the number of identical matches divided by the length of the aligned region.
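The percentage defined above can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the function name `percent_identity`, the use of "-" as gap character and the two example sequences are assumptions made for the example.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the aligned region.

    Both arguments are rows of one pairwise alignment, so they must have
    the same length; `gap` is the (assumed) gap character.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    # A column counts as identical when both residues agree and are not gaps.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


# Hypothetical ungapped example: 9 of the 11 columns are identical.
print(round(percent_identity("VITKLGTCVGS", "VISKLGTQVGS"), 1))
```

Note that the denominator here is the full alignment length, gaps included, which matches the definition on the slide; other conventions (shorter sequence length, ungapped columns only) give different percentages.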


Similarity

• Sequence similarity takes approximate matches into account, and is meaningful only when such substitutions are scored according to some measure of «difference», with conservative substitutions assigned more favorable scores than non-conservative ones (substitution matrices).

• Given a number of parameters (alphabet, scoring matrix, filtering procedure, etc.), the similarity of an aligned region is defined by a score calculated on that region;

• The score depends on the chosen parameters;

• In contrast to homology, expressions such as "significant similarity" or "weak similarity" are often used.


Homology

• Sequence homology implies common ancestry and sequence conservation;

• Homology can be inferred, under suitable conditions, from sequence similarity;

• The main objective of sequence similarity searches is to infer homology between sequences;

• Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions such as "significant homology" or "weak homology" are meaningless!).

Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.


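Local Alignment / Global Alignment

[Figure: two sequences A and B shown twice, once with a local alignment (a sub-region of each sequence) and once with a global alignment (over their full lengths).]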
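The difference between the two alignment modes can be illustrated with a small dynamic programming sketch. It computes only the optimal score (no traceback), uses a linear gap penalty, and the scoring values (match = 2, mismatch = -1, gap = -2) as well as the function name `align_score` are assumptions chosen for the example, not values from the course.

```python
def align_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                gap: int = -2, local: bool = False) -> int:
    """Optimal pairwise alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), both
                 sequences are aligned end to end.
    local=True : local alignment (Smith-Waterman style), the best
                 scoring pair of sub-regions is reported.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalised.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            value = max(H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                        H[i - 1][j] + gap,     # gap in b
                        H[i][j - 1] + gap)     # gap in a
            if local:
                value = max(value, 0)          # a local alignment may restart
            H[i][j] = value
            best = max(best, value)
    return best if local else H[-1][-1]


# The shorter sequence is an exact sub-region of the longer one.
print(align_score("XXGATTACAYY", "GATTACA", local=True))   # 7 matches
print(align_score("XXGATTACAYY", "GATTACA", local=False))  # 7 matches, 4 gaps
```

With these assumed scores the local mode ignores the flanking residues, whereas the global mode has to pay for the four unaligned positions.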


Compare one query sequence to a BLAST formatted database


Amino acid scoring schemes (substitution matrices)

• All algorithms comparing protein sequences rely on some scheme to score the equivalence of each of the 210 possible pairs of amino acids (190 pairs of different amino acids plus 20 identical pairs).

As a result, what a local alignment program produces depends strongly on the scores it uses.

• Implicitly, a scheme may represent a particular theory of evolution;

• The choice of a matrix can strongly influence the outcome of an analysis.

• The scores in the matrix are integer values: identical or similar character pairs receive a positive score, dissimilar character pairs a negative one.

$S_{ij} = \ln\left(q_{ij} / (p_i p_j)\right) / u$, where the $q_{ij}$ are the target frequencies for aligned pairs of amino acids, $p_i$ and $p_j$ are the background frequencies, and $u$ is a statistical (scaling) parameter.
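As a worked illustration of the log-odds formula, the sketch below evaluates it for made-up frequencies. The function name `log_odds_score` and all frequency values are hypothetical; the scale u = ln(2)/3 is the one quoted in the PAM250 header shown in this deck.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, u: float) -> float:
    """S_ij = ln(q_ij / (p_i * p_j)) / u (before rounding to an integer)."""
    return math.log(q_ij / (p_i * p_j)) / u


u = math.log(2) / 3  # scale quoted for the PAM250 matrix in this deck

# Hypothetical frequencies: both residues have background frequency 0.05,
# so a pair is expected by chance with frequency 0.05 * 0.05 = 0.0025.
for q in (0.01, 0.0025, 0.00125):
    print(q, round(log_odds_score(q, 0.05, 0.05, u)))
```

A pair observed more often than expected by chance gets a positive score, a pair observed exactly as often as expected scores about 0, and a pair observed less often gets a negative score, which is the sign convention stated above.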


Examples of substitution matrices

# PAM250 substitution matrix, scale = ln(2)/3 = 0.231049
# Expected score = -0.844, Entropy = 0.354 bits
# Lowest score = -8, Highest score = 17

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   0   0   0  -8
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2  -1   0  -1  -8
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   2   1   0  -8
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   3   3  -1  -8
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -4  -5  -3  -8
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   1   3  -1  -8
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   3   3  -1  -8
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   0   0  -1  -8
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   1   2  -1  -8
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -2  -2  -1  -8
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -3  -3  -1  -8
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   1   0  -1  -8
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -2  -2  -1  -8
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -4  -5  -2  -8
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1  -1   0  -1  -8
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   0   0   0  -8
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   0  -1   0  -8
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -5  -6  -4  -8
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -3  -4  -2  -8
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4  -2  -2  -1  -8
B   0  -1   2   3  -4   1   3   0   1  -2  -3   1  -2  -4  -1   0   0  -5  -3  -2   3   2  -1  -8
Z   0   0   1   3  -5   3   3   0   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3  -1  -8
X   0  -1   0  -1  -3  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1   0   0  -4  -2  -1  -1  -1  -1  -8
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1


• PAM matrices (Dayhoff et al. (1978))

PAM stands for “point accepted mutation”.

• 1 PAM corresponds to 1 amino acid change per 100 residues;
• 1 PAM ~ 1% divergence;
• Extrapolate to predict patterns at longer distances.

Assumptions:

• replacements are independent of surrounding residues;
• sequences being compared are of average composition;
• all sites are equally mutable.

Sources of error:

• small, globular proteins were used to derive PAM matrices (departure from average composition);
• errors in PAM1 are magnified up to PAM250;
• does not account for conserved blocks or motifs.

Strategy:

• PAM40: short alignments, highly similar;
• PAM120: average similarity;
• PAM250: longer, weaker local alignments.


• BLOSUM matrices (Henikoff, S., and Henikoff, J., G. (1992))

BlosumX denotes a matrix obtained from alignments of clustered sequence segments with more than X% identity.

Examples:

- Blosum62 is obtained from clustered sequences with identity greater than 62%.
- Blosum80 is obtained from clustered sequences with identity greater than 80%.

Which substitution matrix to choose?

Blosum80         Blosum62         Blosum45
PAM10            PAM120           PAM250
Less divergent <------ searching ------> More divergent


BLAST (Basic Local Alignment Search Tool)

Nucleotide BLAST
• Nucleotide query - nucleotide database [blastn]

Protein BLAST
• Protein query - protein database [blastp]
• PSI-BLAST: Position-Specific Iterative BLAST

Translated BLAST searches
• Nucleotide query - Protein db [blastx]
• Protein query - Translated db [tblastn]
• Nucleotide query - Translated db [tblastx]

Search for conserved domains
• Search the Conserved Domain Database [RPS-BLAST]

Pairwise BLAST
• BLAST 2 Sequences


• Position Specific Scoring Matrix (PSSM)

- Conserved motifs are identified and an amino acid profile matrix is calculated for each motif.

- This matrix (n positions x 20 amino acids) represents the relative amino acid probabilities at specific positions and is characteristic of a protein family.

- Such matrices are used by profile database searching programs (including PSI-BLAST and HMM-based programs).


Example of a PSSM determined by the PSI-BLAST program:

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
  1 M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
  2 S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
  3 S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
  4 S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
  5 S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
  6 G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
  7 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
  8 K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
  9 Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
 10 Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
 11 G   0 -2  0 -1 -2 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
 12 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 13 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
 14 Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
 15 K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
 16 K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
 17 K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
 18 F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
 19 Q  -1  1  0  0 -3  5  3 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
 20 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 21 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 22 F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
 23 D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
.....................................................................
573 I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
574 P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
575 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
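To show how such a matrix is used, the sketch below scores a short peptide against the first three positions of the PSSM above by summing the position-specific score of each residue. The function name `pssm_score` and the ungapped, fixed-offset scoring are simplifying assumptions; profile search programs such as PSI-BLAST do considerably more than this.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM shown above

# First three positions (M, S, S) copied from the PSSM example.
PSSM_ROWS = [
    [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
    [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
]


def pssm_score(peptide: str, rows=PSSM_ROWS) -> int:
    """Sum of position-specific scores for an ungapped peptide.

    The peptide is aligned to the first len(peptide) PSSM positions.
    """
    if len(peptide) > len(rows):
        raise ValueError("peptide is longer than the PSSM fragment")
    return sum(rows[pos][AMINO_ACIDS.index(res)]
               for pos, res in enumerate(peptide))


print(pssm_score("MSS"))  # residues favoured at positions 1-3
print(pssm_score("LTT"))  # conservative replacements score lower
```

Unlike a general substitution matrix, the same residue can score differently at different positions, which is what makes the profile characteristic of a protein family.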


BLAST algorithm:

(1) Query sequence: build the list of high-scoring words of length w.
- Query sequence of length L: maximum of L-w+1 words; w = 3 or 11 (typically 3 for proteins and 11 for nucleotides).
- List the words that score at least T using a substitution matrix (Blosum62, PAM250, ...).

(2) Compare the word list to the database (DB sequences) and identify exact matches.
- Extract matches of words from the word list.

(3) For each word match, extend the alignment in both directions to find alignments with scores > S.
- Maximal Segment Pairs (MSPs): HSPs.
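Steps (1) and (2) can be sketched as follows. This is a deliberately simplified illustration: it lists only the words that occur in the query (it does not add neighbourhood words scoring at least T with a substitution matrix) and it stops at the seed hits, without the extension of step (3). The function names and the short subject sequence are hypothetical; the query fragment is the start of the YAL005c sequence shown in this deck.

```python
from collections import defaultdict


def query_words(query: str, w: int = 3) -> list:
    """All overlapping words of length w: at most L - w + 1 of them."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def exact_seed_hits(query: str, subject: str, w: int = 3) -> list:
    """(query_pos, subject_pos) pairs where a query word occurs exactly."""
    index = defaultdict(list)
    for pos, word in enumerate(query_words(query, w)):
        index[word].append(pos)
    hits = []
    for s_pos in range(len(subject) - w + 1):
        for q_pos in index.get(subject[s_pos:s_pos + w], []):
            hits.append((q_pos, s_pos))
    return hits


query = "MSKAVGIDLG"    # first 10 residues of the YAL005c query
subject = "AVGID"       # hypothetical database fragment
print(len(query_words(query)))          # L - w + 1 = 10 - 3 + 1
print(exact_seed_hits(query, subject))  # seeds lying on one diagonal
```

Hits that share the same difference between query and subject positions lie on one diagonal, which is where an extension step would start.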


BLASTP 2.2.1 [Apr-13-2001]

............................

Query= YAL005c SSA1 heat shock protein of HSP70 family,cytosolic (642 letters)

Database: S. cerevisiae proteome version 22/05/2002
          5829 sequences; 2,798,770 total letters
................................................
                                                         Score    E
Sequences producing significant alignments:              (bits) Value

YAL005c SSA1 heat shock protein of HSP70 family, cyt...   674   0.0
YLL024c SSA2 heat shock protein of HSP70 family, cyt...   663   0.0
YER103w SSA4 heat shock protein of HSP70 family, cyt...   589   e-169
YBL075c SSA3 heat shock protein of HSP70 family, cyt...   588   e-169
YJL034w KAR2 nuclear fusion protein                       480   e-136
YDL229w SSB1 heat shock protein of HSP70 family           428   e-120
YNL209w SSB2 heat shock protein of HSP70 family, cyt...   427   e-120
YJR045c SSC1 mitochondrial heat shock protein 70-rel...   336   5e-93
YEL030w heat shock protein of HSP70 family                324   2e-89
YLR369w SSQ1 mitochondrial heat shock protein 70          296   4e-81
YBR169c SSE2 heat shock protein of the HSP70 family       173   7e-44
YPL106c SSE1 heat shock protein of HSP70 family           172   1e-43
YHR064c regulator protein involved in pleiotro...         143   6e-35
YKL073w LHS1 chaperone of the ER lumen                    100   4e-22
YLR135w subunit of SLX1P/Ybr228p-SLX4P complex...          33   0.13
...................


>YLL024c SSA2 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic

Length = 639

Score = 663 bits (2508), Expect = 0.0
Identities = 558/607 (91%), Positives = 570/607 (92%)

Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MSKAVGIDLGTTYSCVAHF+NDRVDIIANDQGNRTTPSFV+FTDTERLIGDAAKNQAAMN
Sbjct: 1   MSKAVGIDLGTTYSCVAHFSNDRVDIIANDQGNRTTPSFVGFTDTERLIGDAAKNQAAMN 60
..........................................................................
Query: 601 IMSKLYQ 607
           IMSKLYQ
Sbjct: 601 IMSKLYQ 607

>YER103w SSA4 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic

Length = 642

Score = 589 bits (2224), Expect = e-169
Identities = 473/609 (77%), Positives = 539/609 (87%), Gaps = 3/609 (0%)

Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MSKAVGIDLGTTYSCVAHFANDRV+IIANDQGNRTTPS+VAFTDTERLIGDAAKNQAAMN
Sbjct: 1   MSKAVGIDLGTTYSCVAHFANDRVEIIANDQGNRTTPSYVAFTDTERLIGDAAKNQAAMN 60
....................................................................
Query: 598 ANPIMSKLY 606
           ANPIMSK+Y
Sbjct: 601 ANPIMSKFY 609

>YBL075c SSA3 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic

Length = 649

Score = 588 bits (2220), Expect = e-169
Identities = 467/609 (76%), Positives = 539/609 (87%), Gaps = 3/609 (0%)

Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MS+AVGIDLGTTYSCVAHF+NDRV+IIANDQGNRTTPS+VAFTDTERLIGDAAKNQAA+N
Sbjct: 1   MSRAVGIDLGTTYSCVAHFSNDRVEIIANDQGNRTTPSYVAFTDTERLIGDAAKNQAAIN 60
........................................
Query: 598 ANPIMSKLY 606
           ANPIM+K+Y
Sbjct: 601 ANPIMTKFY 609

>YJL034w KAR2 P14.1.f13.1 nuclear fusion protein

Length = 682
...........................................


Large-scale proteome comparisons


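Systematic comparisons (blastp, blosum62, SEG filter)

Proteomes: NG (new proteome); GS1 (proteome1), GS2 (proteome2), ..., GSn (proteomen).

- Comparenewg2eachg (ng list): NG is compared with each GSi; output files bestnggs1 / allnggs1, bestnggs2 / allnggs2, ..., bestnggsn / allnggsn.
- Compareeachg2newg (ng list): each GSi is compared with NG; output files bestgs1ng / allgs1ng, bestgs2ng / allgs2ng, ..., bestgsnng / allgsnng.

File contents:

- bestnggsi (best match of each NG protein): NG1 size GSij blast p HS/IS/NS
- allnggsi (all matches): NG1 size GSij blast p HS/IS/NS; NG2 size GSik blast p HS/IS/NS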

- fast determination of significant matches; multiple matches; orthologs determination;

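The expected number of HSPs with score at least S is given by $E = K m n e^{-\lambda S}$, where m and n are the query sequence and database lengths, and K and $\lambda$ are statistical parameters of the scoring system.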
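The expected-number formula can be evaluated directly. The sketch below assumes the usual Karlin-Altschul form E = K m n exp(-lambda S) with a raw score S; the function name `expected_hsps` and the values of K and lambda are illustrative assumptions (BLAST reports the values it actually used in its output), while m and n are the query and database sizes of the YAL005c example in this deck.

```python
import math


def expected_hsps(score: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of HSPs with score >= `score` (E-value)."""
    return K * m * n * math.exp(-lam * score)


m = 642          # query length (YAL005c)
n = 2_798_770    # total letters in the S. cerevisiae proteome database
K, lam = 0.041, 0.267   # assumed values, for illustration only

for raw_score in (50, 100, 200):
    print(raw_score, expected_hsps(raw_score, m, n, K, lam))
```

E decreases exponentially with the score and grows linearly with the size of the search space m x n, so the same alignment is less significant in a larger database.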


Systematic Analysis of Completely Sequenced Organisms

• In silico species specific comparisons;

• Degree of ancestral duplication and of ancestral conservation between pairs of species;

• Families of paralogs (Partition-MCL);

• Families of orthologs (Partition-MCL);

• Determination of the protein dictionary (orthologs);

• Determination of protein conservation profiles;


Working Examples

Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome


SC vs SC

BLASTP 2.2.1 [Apr-13-2001]

Query= YAL005c SSA1 heat shock protein of HSP70 family, cytosolic (642 letters)

Database: S. cerevisiae proteome version 22/05/2002; 5829 sequences; 2,798,770 total letters

(Same hit list as in the BLASTP output for YAL005c shown earlier: YAL005c, YLL024c, YER103w, YBL075c, ..., YLR135w.)


bestscsc (SC / SC)

YAL002w  1176  -        NS
YAL003w   206  -        NS
YAL004w   215  -        NS
YAL005c   642  YLL024c  HS  0.0
YAL007c   215  YOR016c  HS  1e-44

allscsc (SC / SC)

YAL002w  1176  -        NS

YAL003w   206  -        NS

YAL004w   215  -        NS

YAL005c   642  YLL024c  HS  0.0
YAL005c   642  YER103w  HS  0.0
YAL005c   642  YBL075c  HS  0.0
YAL005c   642  YJL034w  HS  e-147
YAL005c   642  YDL229w  HS  e-130
YAL005c   642  YNL209w  HS  e-130
YAL005c   642  YJR045c  HS  e-100
YAL005c   642  YEL030w  HS  2e-96
YAL005c   642  YLR369w  HS  1e-87
YAL005c   642  YBR169c  HS  2e-47
YAL005c   642  YPL106c  HS  4e-47
YAL005c   642  YHR064c  HS  7e-38
YAL005c   642  YKL073w  HS  5e-24

YAL007c   215  YOR016c  HS  1e-44
YAL007c   215  YGL200c  IS  5e-05
YAL007c   215  YHR110w  IS  0.017
YAL007c   215  YDL018c  IS  0.021

- Paralogs - multiple matches

- Partitions/clustering

Multiple matches of sc in sc

ORF      matches in sc
YAL005c  13
YAL007c   1
YDR214w   1
YDR216w   2
YDR399w   1
YDR406w   9
YDR409w   1
YCR040w   1
YKL218c   1
YKL219w  14
YKL220c   6
YKL221w   2
YKL222c   3
YKL223w   5
YKL224c  22
YKR001c   2
YKR003w   5
YBR104w   6
YBR105c   1
YKR013w   2
YKR014c  13
....................................
Max : YDR477w  77
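A table of multiple matches like the one above can be derived from an "all" file by counting lines per query. The sketch below assumes the whitespace-separated layout of the allscsc excerpt (query, size, hit, class, e-value) and counts only the HS class, which appears consistent with the counts shown for YAL005c and YAL007c; the function name `count_hs_matches` and that counting rule are assumptions, not the documented behaviour of the course scripts.

```python
from collections import Counter

# Excerpt of allscsc copied from this deck.
ALLSCSC = """\
YAL004w 215 - NS
YAL005c 642 YLL024c HS 0.0
YAL005c 642 YER103w HS 0.0
YAL005c 642 YBL075c HS 0.0
YAL005c 642 YJL034w HS e-147
YAL005c 642 YDL229w HS e-130
YAL005c 642 YNL209w HS e-130
YAL005c 642 YJR045c HS e-100
YAL005c 642 YEL030w HS 2e-96
YAL005c 642 YLR369w HS 1e-87
YAL005c 642 YBR169c HS 2e-47
YAL005c 642 YPL106c HS 4e-47
YAL005c 642 YHR064c HS 7e-38
YAL005c 642 YKL073w HS 5e-24
YAL007c 215 YOR016c HS 1e-44
YAL007c 215 YGL200c IS 5e-05
YAL007c 215 YHR110w IS 0.017
YAL007c 215 YDL018c IS 0.021
"""


def count_hs_matches(text: str) -> Counter:
    """Number of HS-class matches per query ORF."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Matched lines have 5 fields; "no match" lines (class NS) have 4.
        if len(fields) >= 5 and fields[3] == "HS":
            counts[fields[0]] += 1
    return counts


counts = count_hs_matches(ALLSCSC)
print(counts["YAL005c"], counts["YAL007c"], counts["YAL004w"])
```

ORFs with several matches within their own proteome are candidates for paralog families, which is the input to the partition/clustering step listed above.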


bestscce (SC / CE)

YAL002w  1176  C42C1.4   HS  2e-15
YAL003w   206  F54H12.6  HS  4e-22
YAL004w   215  -         NS
YAL005c   642  F26D10.3  HS  e-172
YAL007c   215  F57B10.5  HS  9e-08
YAL009w   259  F16D3.7   IS  0.013
YAL019w  1131  M03C11.8  HS  7e-92
YAL020c   333  F07C3.4   IS  7e-04
YAL021c   837  ZC518.3   HS  5e-47

allscce (SC / CE)

YAL002w  1176  C42C1.4   HS  2e-15

YAL003w   206  F54H12.6  HS  4e-22
YAL003w   206  Y41E3.10  HS  2e-17

YAL004w   215  -         NS

YAL005c   642  F26D10.3  HS  e-172
YAL005c   642  F44E5.4   HS  e-153
YAL005c   642  F44E5.5   HS  e-153
YAL005c   642  C12C8.1   HS  e-152
YAL005c   642  C15H9.6   HS  e-148
YAL005c   642  F43E2.8   HS  e-144
YAL005c   642  C37H5.8   HS  e-104
YAL005c   642  F11F1.1   HS  1e-77
YAL005c   642  F54C9.2   HS  4e-51
YAL005c   642  K09C4.3   HS  4e-47
YAL005c   642  T28F3.2   HS  2e-45
YAL005c   642  C30C11.4  HS  7e-43
YAL005c   642  T24H7.2   HS  2e-34
YAL005c   642  T14G8.3   HS  8e-33

bestcesc (CE / SC)

C42C1.4   1259  YAL002w  HS  8e-16
F54H12.6   213  YAL003w  HS  4e-20
F26D10.3   640  YER103w  HS  e-174
F57B10.5   203  YAL007c  HS  7e-13
F16D3.7    516  YHL003c  IS  9e-04
M03C11.8  1038  YAL019w  HS  2e-87
AC3.1      356  -        NS
AC3.2      949  YLR189c  IS  0.038
AC3.3      425  -        NS
AC3.4      600  YNL326c  HS  1e-12

allcesc (CE / SC)

C42C1.4   1259  YAL002w  HS  8e-16

F54H12.6   213  YAL003w  HS  4e-20

F26D10.3   640  YER103w  HS  e-174
F26D10.3   640  YBL075c  HS  e-174
F26D10.3   640  YLL024c  HS  e-172
F26D10.3   640  YAL005c  HS  e-171
F26D10.3   640  YJL034w  HS  e-141
F26D10.3   640  YDL229w  HS  e-129
F26D10.3   640  YNL209w  HS  e-129
F26D10.3   640  YJR045c  HS  e-100
F26D10.3   640  YEL030w  HS  2e-97
F26D10.3   640  YLR369w  HS  1e-83
F26D10.3   640  YPL106c  HS  2e-45
F26D10.3   640  YBR169c  HS  5e-45
F26D10.3   640  YHR064c  HS  8e-36
F26D10.3   640  YKL073w  HS  3e-22

SC/CE CE/SC

Reciprocal Best Hits (RBH)
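The reciprocal best hit criterion can be sketched with the best-hit pairs listed in bestscce and bestcesc. The function name `reciprocal_best_hits` and the dictionary representation are assumptions made for this example; only the SC proteins whose CE best hit also appears as a query in the bestcesc excerpt are included.

```python
def reciprocal_best_hits(best_ab: dict, best_ba: dict) -> list:
    """Pairs (a, b) such that b is the best hit of a AND a is the best hit of b."""
    return sorted(
        (a, b) for a, b in best_ab.items()
        if b is not None and best_ba.get(b) == a
    )


# Best hits copied from the bestscce (SC / CE) excerpt; None means "no match" (NS).
best_sc_ce = {
    "YAL002w": "C42C1.4",
    "YAL003w": "F54H12.6",
    "YAL004w": None,
    "YAL005c": "F26D10.3",
    "YAL007c": "F57B10.5",
    "YAL009w": "F16D3.7",
    "YAL019w": "M03C11.8",
}

# Best hits copied from the bestcesc (CE / SC) excerpt.
best_ce_sc = {
    "C42C1.4": "YAL002w",
    "F54H12.6": "YAL003w",
    "F26D10.3": "YER103w",
    "F57B10.5": "YAL007c",
    "F16D3.7": "YHL003c",
    "M03C11.8": "YAL019w",
}

for sc, ce in reciprocal_best_hits(best_sc_ce, best_ce_sc):
    print(sc, ce)
```

In this excerpt YAL005c is not retained: its best CE hit F26D10.3 has YER103w as its own best SC hit, a typical situation inside a family of closely related paralogs such as the HSP70 proteins.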


segmatchSCCE

Test      siz  Hit        siz  e-val   %id  %sim  gap  Ssiz   dT    eT   dH    eH
YAL002w  1176  C42C1.4   1259  5e-14    16    44    7   674  438  1111  547  1196

YAL005c   642  F26D10.3   640  1e-159   73    84    0   605    3   607    5   613
YAL005c   642  F44E5.5    645  1e-142   63    79    0   605    3   607    5   611
YAL005c   642  F44E5.4    645  1e-142   63    79    0   605    3   607    5   611
YAL005c   642  C12C8.1    643  1e-141   62    79    0   605    3   607    5   611
YAL005c   642  C15H9.6    661  1e-137   60    78    1   603    5   607   36   641
YAL005c   642  F43E2.8    657  1e-134   58    76    1   606    1   606   29   637
YAL005c   642  C37H5.8    657  1e-96    46    67    2   606    2   607   31   632
YAL005c   642  F11F1.1b   607  1e-73    36    60    0   599    4   602    2   600
YAL005c   642  F11F1.1a   614  8e-72    36    60    2   599    4   602    2   607
YAL005c   642  F54C9.2    469  3e-47    38    66    2   379    2   380   52   433
YAL005c   642  K09C4.3    310  2e-43    71    88    0   186    4   189    6   192
YAL005c   642  K09C4.3    310  1e-04    54    70         61  327   387  189   249
YAL005c   642  C30C11.4   776  1e-39    26    50    8   600    5   604    4   647
YAL005c   642  T24H7.2    925  1e-31    24    50    3   506    4   509   26   548
YAL005c   642  T14G8.3    926  3e-30    24    51    6   510    4   513   28   560


Conclusion

Large-scale analyses of completely sequenced genomes provide a systematic view of genes, of genome organization, and of their macro- as well as micro-evolution.

They are the starting point for the further evolutionary analyses that will be dealt with during this course.


Practical sessions (see text)