Introduction to Bioinformatics

Page 1: Introduction to  Bioinformatics

Introduction to Bioinformatics

Lecture 12: Iterative homology searching and Protein Structure-Function Relationships

Centre for Integrative Bioinformatics VU (IBIVU)

Page 2: Introduction to  Bioinformatics

PSI (Position Specific Iterated) BLAST

• basic idea
  – use results from BLAST query to construct a profile matrix
  – search database with profile instead of query sequence
• iterate

Page 3: Introduction to  Bioinformatics

A Profile Matrix (Position Specific Scoring Matrix – PSSM)

This is the same as a profile without position-specific gap penalties

Page 4: Introduction to  Bioinformatics

PSI BLAST

• Searching with a Profile
• aligning profile matrix to a simple sequence
  – like aligning two sequences
  – except score for aligning a character with a matrix position is given by the matrix itself
  – not a substitution matrix
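The point that the score comes from the matrix itself can be sketched in a few lines of Python. This is a minimal illustration only: it assumes an ungapped alignment, and the three-column toy matrix, its values, and the function names are invented for the example.

```python
def score_window(pssm, segment):
    """Sum the position-specific scores for an ungapped alignment.

    pssm is a list of columns; each column maps a residue to its score
    at that position, so no substitution matrix is consulted.
    """
    if len(segment) != len(pssm):
        raise ValueError("segment and PSSM must have the same length")
    return sum(column[residue] for column, residue in zip(pssm, segment))


def best_ungapped_hit(pssm, sequence):
    """Slide the PSSM along the sequence; return (best_score, start_index)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    scored = [(score_window(pssm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored, key=lambda pair: pair[0])


# Toy PSSM over a four-letter alphabet (values are invented).
toy_pssm = [
    {"A": 2, "C": -1, "D": -2, "Y": -3},
    {"A": -1, "C": 3, "D": -2, "Y": -1},
    {"A": -2, "C": -2, "D": 4, "Y": 0},
]

print(best_ungapped_hit(toy_pssm, "YYACDA"))
```

The same residue can score differently in different columns, which is what distinguishes a profile search from a search with a single substitution matrix.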

Page 5: Introduction to  Bioinformatics

PSI BLAST: Constructing the Profile Matrix

Figure from: Altschul et al. Nucleic Acids Research 25, 1997

Page 6: Introduction to  Bioinformatics

PSI BLAST: Determining Profile Elements

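• the value for a given element of the profile matrix is given by:

$\log \left( \frac{\Pr(a_i \mid \text{column } j)}{\Pr(a_i)} \right)$

• where the probability of seeing amino acid $a_i$ in column $j$ is estimated as:

$\Pr(a_i \mid \text{column } j) = \frac{\alpha \cdot f_{ij} + \beta \cdot g_{ij}}{\alpha + \beta}$

with $f_{ij}$ the observed frequency and $g_{ij}$ the pseudocount

e.g. $\alpha$ = number of sequences in profile, $\beta$ = 1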
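A small sketch may help to see how the observed frequency and the pseudocount combine. It assumes the estimate Q = (α·f + β·g) / (α + β) from Altschul et al. (1997) and a log-odds score log(Q / background). As a simplification that is not in the slides, the pseudocount term g is taken to be the background frequency, and the four-letter alphabet and function name are invented for the example.

```python
import math
from collections import Counter


def pssm_column(column_residues, background, alpha=None, beta=1.0):
    """Log-odds scores for one alignment column.

    column_residues: residues observed in this column, one per sequence.
    background: dict residue -> background probability Pr(a_i).
    alpha defaults to the number of sequences, beta to 1 (slide example).
    """
    n = len(column_residues)
    if alpha is None:
        alpha = float(n)
    counts = Counter(column_residues)
    scores = {}
    for residue, p_background in background.items():
        f = counts[residue] / n        # observed frequency f_ij
        g = p_background               # pseudocount g_ij (simplifying assumption)
        q = (alpha * f + beta * g) / (alpha + beta)
        scores[residue] = math.log(q / p_background)
    return scores


uniform = {"A": 0.25, "C": 0.25, "D": 0.25, "Y": 0.25}
column_scores = pssm_column("AAAC", uniform)
print(column_scores)
```

Because of the pseudocount, residues that never occur in the column still get a finite (negative) score instead of minus infinity.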

Page 7: Introduction to  Bioinformatics

PSI-BLAST iteration

[Figure: PSI-BLAST iteration. The query sequence Q is used in a gapped BLAST search; the database hits are used to build a PSSM (rows labelled with amino acids A, C, D, .., Y); the PSSM is used in a further gapped BLAST search that yields new database hits; iterate.]

Page 8: Introduction to  Bioinformatics

PSI-BLAST

• Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these regions are excluded from alignment.

• The program then initially operates on a single query sequence by performing a gapped BLAST search.

• Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.

• The database is rescanned in a subsequent round, using the PSSM, to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged.
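The control flow of these steps can be sketched as a loop that stops when the set of hits no longer changes. This is not the PSI-BLAST implementation: `search` and `build_model` are hypothetical callables standing in for the gapped BLAST search and the PSSM construction, and the toy data below are invented.

```python
def iterate_search(query, search, build_model, max_rounds=5):
    """Repeat search -> model building until the hit set stops changing.

    search(model) returns an iterable of hit identifiers;
    build_model(query, hits) returns the model for the next round.
    Both are hypothetical callables supplied by the caller.
    """
    model = query          # round 1 uses the plain query sequence
    hits = frozenset()
    for round_number in range(1, max_rounds + 1):
        new_hits = frozenset(search(model))
        if new_hits == hits:
            return hits, round_number, True   # converged: hits unchanged
        hits = new_hits
        model = build_model(query, hits)      # stands in for PSSM construction
    return hits, max_rounds, False            # stopped by the round limit


# Toy stand-ins: each known sequence "detects" a fixed set of relatives.
DETECTS = {"q": {"a"}, "a": {"b"}, "b": set()}


def toy_search(model):
    members = {model} if isinstance(model, str) else model
    found = set()
    for name in members:
        found |= DETECTS[name]
    return found


def toy_build_model(query, hits):
    return frozenset({query}) | hits


result = iterate_search("q", toy_search, toy_build_model)
print(result)
```

In the toy run the distant relative "b" is only found in the second round, once "a" has been folded into the model, which mirrors why iteration can find more remote homologues than a single search.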

Page 11: Introduction to  Bioinformatics

1 - This portion of each description links to the sequence record for a particular hit.

2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).

3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117 meaning that a sequence with a similar score is very unlikely to occur simply by chance.

4 - These links provide the user with direct access from BLAST results to related entries in other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's Molecular Modeling DataBase.

Page 12: Introduction to  Bioinformatics

‘X’ residues denote low-complexity sequence fragments that are ignored

Page 14: Introduction to  Bioinformatics

Alignment Bit Score

• S is the raw alignment score
• The bit score (‘bits’) B has a standard set of units
• The bit score B is calculated from the raw score S, which reflects the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment
• λ and K are the statistical parameters of the scoring system (BLOSUM62 in BLAST)
• See Altschul and Gish, 1996, for a collection of values for λ and K over a set of widely used scoring matrices
• Because bit scores are normalised with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices)

B = (λS – ln K) / ln 2
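As a worked example, the conversion can be written as a one-line function, assuming the usual Karlin-Altschul form B = (λS – ln K) / ln 2. The parameter values in the usage line are illustrative assumptions of the order often quoted for gapped BLOSUM62 searches; they are not taken from the slides.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw alignment score S to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Illustrative parameters only (assumed, not from the lecture).
example_bits = bit_score(100, lam=0.267, K=0.041)
print(round(example_bits, 1))
```

Two searches with different scoring matrices have different λ and K, so their raw scores are not comparable, but their bit scores are.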

Page 15: Introduction to  Bioinformatics

Normalised sequence similarity

The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability follows the Poisson distribution (Waterman and Vingron, 1994):

$P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$,

where n is the number of sequences in the database. The p-value thus depends on x and on n, which is fixed for a given database.

Page 16: Introduction to  Bioinformatics

Normalised sequence similarity

Statistical significance

The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences:

$E(x, n) = n \cdot P(S \ge x)$

For example, if E-value = 0.01, then the expected number of random hits with score S ≥ x is 0.01, which means that a hit with such a score is expected by chance only once in 100 independent searches over the database. If the E-value of a hit is 5, then five fortuitous hits with S ≥ x are expected within a single database search, which renders the hit not significant.
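The relation between the two quantities, E(x, n) = n·P(S ≥ x) and P(x, n) = 1 – exp(–E), can be checked numerically. The sketch below uses invented numbers for the database size and the per-sequence probability; only the two formulas come from the slides.

```python
import math


def e_value(n, p_single):
    """Expected number of chance hits: E(x, n) = n * P(S >= x)."""
    return n * p_single


def p_value(n, p_single):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) and stays accurate for very small E.
    return -math.expm1(-e_value(n, p_single))


for n, p_single in [(100_000, 1e-7), (100_000, 5e-5)]:
    print(e_value(n, p_single), p_value(n, p_single))
```

For small E the two values nearly coincide (E = 0.01 gives P of about 0.00995), whereas for E = 5 the p-value is already above 0.99, i.e. at least one chance hit is almost certain.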

Page 17: Introduction to  Bioinformatics

A model for database searching score probabilities

• Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955).

• Using the EVD, the raw alignment scores are converted to a statistical score (E value) that keeps track of the database amino acid composition and the scoring scheme (a.a. exchange matrix)

Page 18: Introduction to  Bioinformatics

Extreme Value Distribution

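Extreme value distribution resulting from parameter values $\mu = 0$ and $\lambda = 1$, where $\mu$ is the characteristic value and $\lambda$ is the decay constant. For these values the curve $y = 1 - \exp(-e^{-x})$ gives the probability of a score of at least x; it is a tail probability rather than a density. In general:

$y = 1 - \exp(-e^{-\lambda(x-\mu)})$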
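To make the roles of the characteristic value μ and the decay constant λ concrete, the sketch below evaluates the tail probability y = 1 – exp(–e^(–λ(x–μ))) together with the corresponding density, λ·t·exp(–t) with t = e^(–λ(x–μ)). The density formula is the standard one for this distribution and is added here for illustration; the function names are invented.

```python
import math


def evd_tail(x, mu=0.0, lam=1.0):
    """P(S >= x) = 1 - exp(-exp(-lam * (x - mu)))."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def evd_density(x, mu=0.0, lam=1.0):
    """Density f(x) = lam * t * exp(-t), with t = exp(-lam * (x - mu))."""
    t = math.exp(-lam * (x - mu))
    return lam * t * math.exp(-t)


for x in (-2, -1, 0, 1, 2):
    print(x, round(evd_density(x), 4), round(evd_tail(x), 4))
```

With μ = 0 and λ = 1 the density peaks at x = 0 and is clearly asymmetric: its value at x = 2 is far larger than at x = –2, so the right tail falls off more slowly than the left.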

Page 19: Introduction to  Bioinformatics

Extreme Value Distribution (EVD)

You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit.

[Figure: score distribution of real data compared with the EVD approximation]

Page 20: Introduction to  Bioinformatics

Extreme Value Distribution

The probability of a score S to be larger than a given value x can be calculated following the EVD as:

$P(S \ge x) = 1 - \exp(-e^{-\lambda(x-\mu)})$,

where $\mu = (\ln Kmn)/\lambda$, and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices).

Page 21: Introduction to  Bioinformatics

Extreme Value Distribution

Using the equation for $\mu$ (preceding slide), the probability for the raw alignment score S becomes

$P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$.
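The substitution step can be written out explicitly, using only the definition μ = (ln Kmn)/λ given on the preceding slide:

```latex
\begin{aligned}
e^{-\lambda(x-\mu)} &= e^{\lambda\mu}\, e^{-\lambda x}
                     = e^{\ln(Kmn)}\, e^{-\lambda x}
                     = Kmn\, e^{-\lambda x} \\
P(S \ge x) &= 1 - \exp\!\left(-e^{-\lambda(x-\mu)}\right)
            = 1 - \exp\!\left(-Kmn\, e^{-\lambda x}\right)
\end{aligned}
```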

In practice, the probability $P(S \ge x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of x. This leads to a simplification of the equation for $P(S \ge x)$:

$P(S \ge x) \approx e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$.

The lower this probability (E value) for a given threshold value x, the more significant the score S.
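A short numerical check shows how good the approximation is. The sketch assumes that m and n denote the query length and the database length (the usual reading in Karlin-Altschul statistics, not stated on the slide), and the parameter values are illustrative assumptions rather than values from the lecture.

```python
import math


def karlin_altschul_e(raw_score, m, n, lam, K):
    """Simplified form: K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def tail_probability(raw_score, m, n, lam, K):
    """Unsimplified form: 1 - exp(-K * m * n * exp(-lam * S))."""
    return -math.expm1(-karlin_altschul_e(raw_score, m, n, lam, K))


# Illustrative values only (assumed): query of 300 residues,
# database of 1e8 residues, lam and K of the order used for BLOSUM62.
m, n, lam, K = 300, 1e8, 0.267, 0.041
for score in (60, 100, 200):
    print(score,
          karlin_altschul_e(score, m, n, lam, K),
          tail_probability(score, m, n, lam, K))
```

At low scores the simplified value can exceed 1, so there it behaves as an expected count rather than a probability; at high scores the two forms agree to many digits, which is the regime where the approximation is used.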

Page 22: Introduction to  Bioinformatics

Normalised sequence similarityStatistical significance

• Database searching is commonly performed using an E-value threshold between 0.001 and 0.1.

• Low E-value thresholds decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.

Page 23: Introduction to  Bioinformatics

Words of Encouragement

• “There are three kinds of lies: lies, damned lies, and statistics” – Benjamin Disraeli

• “Statistics in the hands of an engineer are like a lamppost to a drunk – they’re used more for support than illumination”

• “Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates

Page 24: Introduction to  Bioinformatics

Protein structure-function relationships

Page 25: Introduction to  Bioinformatics

Protein function

[Figure: levels of organisation from Genome/DNA through Transcriptome/mRNA, Proteome and Metabolome to Physiome, annotated with the protein classes transcription factors, ribosomal proteins, chaperonins and enzymes]

Page 26: Introduction to  Bioinformatics

Protein function

Not all proteins are enzymes:

-crystallin: eye lens protein – needs to stay stable and transparent for a lifetime (very little turnover in the eye lens)

Page 27: Introduction to  Bioinformatics

Protein function groups

• Catalysis (enzymes)
• Binding – transport (active/passive)
  – Protein-DNA/RNA binding (e.g. histones, transcription factors)
  – Protein-protein interactions (e.g. antibody-lysozyme)
  – Protein-fatty acid binding (e.g. apolipoproteins)
  – Protein – small molecules (drug interaction, structure decoding)
• Structural component (e.g. -crystallin)
• Regulation
• Transcription regulation
• Signalling
• Immune system
• Motor proteins (actin/myosin)

Page 28: Introduction to  Bioinformatics

What can happen to protein function through evolution

Proteins can have multiple functions (and sometimes many -- Ig).

Enzyme function is defined by specificity and activity

Through evolution:
• Function and specificity can stay the same
• Function stays same but specificity changes
• Change to some similar function (e.g. somewhere else in metabolic system)
• Change to completely new function

Page 29: Introduction to  Bioinformatics

How to arrive at a given function

• Divergent evolution – homologous proteins – proteins have same structure and “same-ish” function

• Convergent evolution – analogous proteins – different structure but same function

• Question: can homologous proteins change structure (and function)?

Page 30: Introduction to  Bioinformatics

How to evolve

Important distinction:
• Orthologues: homologous proteins in different species (all deriving from same ancestor)
• Paralogues: homologous proteins in same species (internal gene duplication)

• In practice: to recognise orthology, bi-directional best hit is used in conjunction with database search program (this is called an operational definition)
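The bi-directional best hit criterion can be sketched as follows. The code assumes that the search results of genome A against genome B, and of B against A, are already available as nested dictionaries of scores; the identifiers and scores are invented, and real pipelines usually add E-value and coverage filters that are left out here.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject.

    scores has the form {query: {subject: score}}.
    """
    return {query: max(subjects, key=subjects.get)
            for query, subjects in scores.items() if subjects}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented bit scores for two small "genomes".
a_vs_b = {"a1": {"b1": 250.0, "b2": 90.0},
          "a2": {"b1": 120.0, "b2": 80.0}}
b_vs_a = {"b1": {"a1": 245.0, "a2": 118.0},
          "b2": {"a1": 88.0, "a2": 85.0}}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here a2 has b1 as its best hit, but b1 prefers a1, so only the pair (a1, b1) is reported as putative orthologues; a2 would be a candidate paralogue of a1.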

Page 31: Introduction to  Bioinformatics

How to evolve

By addition of domains (at either end of protein sequence) – Lesk book page 108

Often through gene duplication followed by divergence

Multi-domain proteins are result of gene fusion

Page 32: Introduction to  Bioinformatics

Protein structure evolution

Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

Page 33: Introduction to  Bioinformatics

Flavodoxin fold

5(βα) fold

Page 34: Introduction to  Bioinformatics

Flavodoxin family - TOPS diagrams (Flores et al., 1994)

[Figure: TOPS diagrams, elements numbered 1 to 5]

Page 35: Introduction to  Bioinformatics

Protein structure evolution

Insertion/deletion of structural domains can ‘easily’ be done at loop sites

[Figure: protein chain with N and C termini labelled]

Page 36: Introduction to  Bioinformatics

The basic functional unit of a protein is the domain

A domain is a:
• Compact, semi-independent unit (Richardson, 1981).
• Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary module (Bork, 1992).

“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

Page 37: Introduction to  Bioinformatics

Delineating domains is essential for:

• Obtaining high resolution structures (x-ray, NMR)
• Sequence analysis
• Multiple sequence alignment methods
• Prediction algorithms (SS, Class, secondary/tertiary structure)
• Fold recognition and threading
• Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method)
• Structural/functional genomics
• Cross genome comparative analysis

Page 38: Introduction to  Bioinformatics

Pyruvate kinase

Phosphotransferase

barrel regulatory domain

barrel catalytic substrate binding domain

nucleotide binding domain

1 continuous + 2 discontinuous domains

Structural domain organisation can be nasty…

Page 39: Introduction to  Bioinformatics

Complex protein functions are a result of multiple domains

• An example is the so-called swivelling domain in pyruvate phosphate dikinase (Herzberg et al., 1996), which brings an intermediate enzymatic product over about 45 Å from the active site of one domain to that of another.

• This enhances the enzymatic activity

Page 40: Introduction to  Bioinformatics

The DEATH Domain

• Present in a variety of Eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.

http://www.mshri.on.ca/pawson

Page 42: Introduction to  Bioinformatics

Globin fold, α protein: myoglobin, PDB: 1MBN

Page 43: Introduction to  Bioinformatics

β sandwich, β protein: immunoglobulin, PDB: 7FAB

Page 44: Introduction to  Bioinformatics

TIM barrel, α/β protein: Triosephosphate IsoMerase, PDB: 1TIM

Page 45: Introduction to  Bioinformatics

A fold in α+β protein: ribonuclease A, PDB: 7RSA

The red balls represent waters that are ‘bound’ to the protein based on polar contacts

Page 47: Introduction to  Bioinformatics

434 Cro protein complex (phage)

PDB: 3CRO

Page 48: Introduction to  Bioinformatics

Zinc finger DNA recognition

(Drosophila) PDB: 2DRP

..YRCKVCSRVY THISNFCRHY VTSH...

Page 49: Introduction to  Bioinformatics

Zinc-finger DNA binding protein family

Characteristics of the family:

Function: The DNA-binding motif is found as part of transcription regulatory proteins.

Structure: One of the most abundant DNA-binding motifs. Proteins may contain more than one finger in a single chain. For example, Transcription Factor TF3A was the first zinc-finger protein discovered to contain 9 C2H2 zinc-finger motifs (tandem repeats). Each motif consists of 2 antiparallel beta-strands followed by an alpha-helix. A single zinc ion is tetrahedrally coordinated by conserved histidine and cysteine residues, stabilising the motif.
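The conserved cysteine and histidine residues give the C2H2 motif a recognisable spacing that can be searched for with a regular expression. The sketch below uses the commonly quoted consensus spacing C-x(2,4)-C-x(12)-H-x(3,5)-H, which is an assumption brought in for illustration and is looser than curated signatures; it is applied to the 2DRP fragment shown on the zinc finger DNA recognition slide.

```python
import re

# Simplified C2H2 spacing: Cys, 2-4 residues, Cys, 12 residues,
# His, 3-5 residues, His. An illustrative pattern, not a curated signature.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2_motifs(sequence):
    """Return (start, matched_text) for non-overlapping C2H2-like matches."""
    return [(match.start(), match.group())
            for match in C2H2_PATTERN.finditer(sequence)]


# Fragment from the 2DRP slide, with the spaces removed.
fragment = "YRCKVCSRVYTHISNFCRHYVTSH"
print(find_c2h2_motifs(fragment))
```

In this fragment the two cysteines are separated by two residues and the two histidines by four, with twelve residues in between, which fits the assumed spacing.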

Page 50: Introduction to  Bioinformatics

     

  

  

Zinc-finger DNA binding protein family

Characteristics of the family:

Binding: Fingers bind to 3 base-pair subsites and specific contacts are mediated by amino acids in positions -1, 2, 3 and 6 relative to the start of the alpha-helix.

Contacts mainly involve one strand of the DNA.

Where proteins contain multiple fingers, the fingers bind to adjacent subsites within a larger DNA recognition site, thus allowing a relatively simple motif to specifically bind to a wide range of DNA sequences.

This means that the number and the type of zinc fingers dictate the specificity of binding to DNA.

Page 51: Introduction to  Bioinformatics

Leucine zipper (yeast)

PDB: 1YSA

..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...