Probability and Statistical Significance of Sequence Analysis
8/9/2019 Probability and Statistical Significance of Sequence Analysis
http://slidepdf.com/reader/full/probability-and-statistical-significance-of-sequence-analysis 1/26
MOLECULAR BIOLOGY (Applied)
(Z-8102) Probability and Statistical Analysis of Sequence Alignment
By
Muhammad Tariq Zahid
38-GCU-Z-10
Submitted to
Prof. Dr. Muhammad Sharif Mughal
Sequence Analysis
The term "sequence analysis" in biology implies subjecting a DNA or peptide sequence to sequence alignment, sequence database searches, repeated sequence searches, or other bioinformatics methods on a computer. Sequence analysis in molecular biology and bioinformatics is an automated, computer-based examination of characteristic fragments, e.g. of a DNA strand.
Sequence Alignment Analysis
- It is the process of lining up two or more sequences to achieve the maximum level of identity, in order to assess the degree of similarity and possible homology.
- It is a string-matching procedure in which a sequence of interest (the query) is compared with sequences in a databank.
The main objective of sequence analysis is to analyze data in order to make reliable predictions about the structure, function and evolution of the given sequence.
In an alignment, two sequences are written across a page in two rows. Identical characters are placed in the same column, while non-identical characters are placed either in the same column as a mismatch or opposite a gap in the other sequence.
For highly divergent sequences, multiple sequence alignment and profile-matching searches are tried to obtain meaningful results.
A chain of n amino acids can form 20^n different polypeptides. This means that the search space for structural prediction, even for a small protein of 100 amino acids, is astronomical (20^100, roughly 1.3 x 10^130 possible sequences).
Statistical methods are therefore used to search for structural similarities, based on probabilities calculated from the observed frequencies of amino acids in the family or class.
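As a quick check of the scale involved, the count of possible chains can be evaluated directly. This is only a minimal sketch; the function name `count_polypeptides` is illustrative and not part of the original slides.

```python
def count_polypeptides(n: int, alphabet_size: int = 20) -> int:
    """Number of distinct linear chains of length n over an alphabet of 20 amino acids."""
    if n < 0:
        raise ValueError("chain length must be non-negative")
    return alphabet_size ** n


if __name__ == "__main__":
    total = count_polypeptides(100)
    # Python integers are arbitrary precision, so the exact value is available.
    print(f"20^100 has {len(str(total))} decimal digits")
```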
A high level of sequence similarity is a strong indication of homology.
Sequence alignment is essential for:
- Phylogenetic relationships
- Secondary structure prediction of proteins
- Protein family classification and tertiary structure prediction
Problems and solutions in sequence similarity alignment
Complications arise from insertions and deletions, which appear as gaps in the alignment.
A gap penalty is deducted from the alignment score to prevent too many gaps.
The quality of an alignment can be measured in terms of the number and length of the gaps introduced and the mismatches remaining.
Mutation matrices, dot plots, and global and local alignment algorithms are available to address sequence alignment problems.
Scoring matrix
AIWQH
:  ::
AL-QH

AIWQH
:  ::
A-LQH
To quantify the similarity achieved by an alignment, scoring matrices are used: they contain a value for each possible substitution, and the alignment score is the sum of the matrix entries for each aligned amino acid pair.
PAM (Percent Accepted Mutation) matrices are a common family of scoring matrices. "Accepted" means that the mutation has been adopted by the sequence in question. Examples are PAM 250 (for distantly related sequences) and PAM 10 (for closely related sequences).
PAM matrix values: if the alignment score is greater than zero, the sequences are considered to be related; if the score is negative, it is assumed that they are not related.
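The idea that an alignment score is the sum of matrix entries over aligned pairs can be sketched as follows. The scores used here are toy values chosen for illustration, not entries from PAM or any published matrix, and the names `pair_score` and `alignment_score` are hypothetical.

```python
GAP = "-"

# Toy scoring values (assumptions for illustration, NOT real PAM entries).
MATCH, MISMATCH, GAP_PENALTY = 2, -1, -2
# Assumed bonus for one conservative substitution (isoleucine <-> leucine).
CONSERVATIVE = {frozenset("IL"): 1}


def pair_score(a: str, b: str) -> int:
    """Score one aligned column."""
    if a == GAP or b == GAP:
        return GAP_PENALTY
    if a == b:
        return MATCH
    return CONSERVATIVE.get(frozenset((a, b)), MISMATCH)


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))


if __name__ == "__main__":
    print(alignment_score("AIWQH", "AL-QH"))  # I aligned with L
    print(alignment_score("AIWQH", "A-LQH"))  # W aligned with L
```

With these toy values the first alignment scores higher, because pairing I with L is treated as a conservative substitution while pairing W with L is an ordinary mismatch.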
Similarity Matrix
A similarity matrix is a matrix of scores that express the similarity between two data points.
Similarity matrices are used for database searching. The simplest form assigns a value of 1 to identical letters and 0 to different letters.
Scoring penalties are introduced to minimize the number of gaps.
A compilation of the similarity scores used in pairwise alignment into a matrix is called a scoring matrix, which helps in:
- Evaluating the match or mismatch between any two residues.
- Assigning a score for insertions and deletions.
- Optimizing the total score.
- Evaluating the significance of the total score.
Distance Matrix
Distance matrices are used for constructing phylogenetic trees.
The distance score is usually calculated as the number of mismatches in an alignment divided by the total number of matches and mismatches:
D = mismatches / (matches + mismatches)
Similarity (S) and distance (D) are interconvertible (S = 1 - D).
Less similar sequences have a higher distance score.
The number of mutational changes gives a quantitative measure of the distance between two gene sequences, which helps in constructing a phylogenetic tree.
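The distance formula above can be computed from two aligned rows as in the sketch below. It assumes that columns containing a gap are skipped, which the slides do not specify; the function names are illustrative.

```python
def distance(row_a: str, row_b: str, gap: str = "-") -> float:
    """D = mismatches / (matches + mismatches) over ungapped columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns do not count either way
        if a == b:
            matches += 1
        else:
            mismatches += 1
    compared = matches + mismatches
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def similarity(row_a: str, row_b: str) -> float:
    """S = 1 - D."""
    return 1.0 - distance(row_a, row_b)


if __name__ == "__main__":
    print(distance("AIWQH", "AL-QH"), similarity("AIWQH", "AL-QH"))
```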
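The degree of match between two letters can be represented in a matrix. The score is based on the odds ratio
Score = p_{ij} / (q_i q_j)
where p_{ij} is the probability that residue i is substituted by residue j, and q_i and q_j are the background probabilities of residues i and j. Scoring matrices usually store the logarithm of this ratio.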
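A small numerical sketch of this ratio and its logarithm is given below. The probabilities in the example are made-up numbers chosen to give round results, and the base-2 logarithm is an assumption; published matrices use their own scaling.

```python
import math


def odds_ratio(p_ij: float, q_i: float, q_j: float) -> float:
    """p_ij / (q_i * q_j): observed pairing frequency relative to chance."""
    if min(p_ij, q_i, q_j) <= 0:
        raise ValueError("probabilities must be positive")
    return p_ij / (q_i * q_j)


def log_odds(p_ij: float, q_i: float, q_j: float, base: float = 2.0) -> float:
    """Logarithm of the odds ratio; positive means 'more often than chance'."""
    return math.log(odds_ratio(p_ij, q_i, q_j), base)


if __name__ == "__main__":
    # Hypothetical values: the pair occurs twice as often as expected by chance.
    print(log_odds(0.02, 0.1, 0.1))
    # Hypothetical values: the pair occurs half as often as expected by chance.
    print(log_odds(0.005, 0.1, 0.1))
```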
Nucleotide bases are purines and pyrimidines.
- A mutation that conserves the ring number (purine to purine, or pyrimidine to pyrimidine) is called a transition, while a mutation that changes the ring number (purine to pyrimidine, or vice versa) is called a transversion.
- Use of a transition/transversion matrix reduces the noise when comparing distantly related sequences.
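The transition/transversion distinction can be expressed as a short classifier. This is a minimal sketch for DNA bases only; the function name `classify_substitution` is illustrative.

```python
PURINES = frozenset("AG")      # two-ring bases
PYRIMIDINES = frozenset("CT")  # one-ring bases


def classify_substitution(x: str, y: str) -> str:
    """Return 'identity', 'transition' or 'transversion' for two DNA bases."""
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "identity"
    # Same ring class on both sides means the ring number is conserved.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for pair in ("AG", "CT", "AC", "GT"):
        print(pair, classify_substitution(*pair))
```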
Distances among amino acid sequences are more difficult to calculate, because an amino acid substitution may require one, two or three nucleotide mutations, and some nucleotide mutations are silent.
Pair-wise Sequence Alignment
It is a fundamental process in sequence comparison analysis.
It is a relatively straightforward computational problem.
It is used to find the best-matching piecewise (local) or global alignment of two query sequences. Pairwise alignments can only be used between two sequences at a time; they are often used for methods that do not require extreme precision.
An alignment of two sequences is not proof that a relationship exists between them, but statistical values are used to indicate the level of confidence.
The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods.
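Of the three methods, the dot-matrix method is the simplest to illustrate: one sequence labels the rows, the other the columns, and a dot is placed wherever the two characters are identical. The sketch below uses no window or threshold filtering, which practical dot-plot tools usually add; the function names are illustrative.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[list[bool]]:
    """grid[i][j] is True where seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a: str, seq_b: str) -> str:
    """Plain-text dot plot: '*' marks an identical pair, '.' anything else."""
    lines = ["  " + " ".join(seq_b)]
    for a, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("AIWQH", "ALQH"))
```

Runs of dots along a diagonal indicate a region where the two sequences align without gaps.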
Dynamic Programming Method
Dynamic programming is the standard approach for finding an optimal pairwise alignment under a given scoring scheme.
The technique of dynamic programming can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. These methods allow artificial gaps in the sequences.
A key part of alignment methods is the scoring of insertions and deletions:
- A positive score for matching residues.
- A negative score (penalty) for opening a gap in one sequence so that other regions can be matched.
- A negative score (penalty) each time a gap is extended.
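The separate penalties for opening and extending a gap are often combined into an affine gap penalty. The sketch below assumes one common convention (the first gapped position costs the opening penalty, each further position costs the extension penalty); the default values are hypothetical and conventions differ between programs.

```python
def affine_gap_penalty(length: int, gap_open: int = -10, gap_extend: int = -1) -> int:
    """Total (negative) score contributed by one gap of the given length."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    # One opening charge, then one extension charge per additional position.
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, affine_gap_penalty(n))
```

Under this scheme one long gap is cheaper than several short gaps of the same total length, which discourages scattering gaps through the alignment.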
Global Alignment
It is the alignment of two nucleic acid or protein sequences over their entire length. It compares all characters of one sequence to all characters of the other.
It is most useful when the sequences in the query set are similar and of roughly equal size.
All possible pairs are represented in a two-dimensional array.
Statistical significance is assessed from the alignment score, which is determined by the scoring system:
- Match = 1, mismatch = 0, gap = penalty
Limitations:
- Not effective for divergent sequences.
- Not valid for all biological sequences.
- Short, highly similar regions may be missed in a global alignment.
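A compact sketch of global alignment by dynamic programming (in the style of Needleman-Wunsch) is shown below, using the slide's match = 1 and mismatch = 0. The gap penalty of -1 and the linear (non-affine) gap model are assumptions made here for brevity.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = needleman_wunsch("AIWQH", "ALQH")
    print(total)
    print(row_a)
    print(row_b)
```

Several alignments can share the optimal score; the traceback here simply returns the first one it finds.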
Local Alignment
It is the alignment of some portion of two nucleic acid or protein sequences.
It is more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.
The Smith-Waterman algorithm is a general local alignment method.
With sufficiently similar sequences, there is no difference between local and global alignments.
Hybrid methods, known as semiglobal or "glocal" methods, attempt to find the best possible alignment that includes the start and end of one or the other sequence.
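Local alignment in the style of Smith-Waterman can be sketched with the same dynamic programming table, with two changes: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell rather than the corner. The scoring values (match 2, mismatch -1, gap -2) are assumptions for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # Two otherwise unrelated strings sharing the motif MQHL.
    print(smith_waterman("KKKMQHLKKK", "DDMQHLDD"))
```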
k-Tuple Method
It is an alternative to full alignment of two sequences when searching for common patterns.
Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences.
The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; several distinct words producing the same offset indicate a region of alignment.
The BLAST and FASTA algorithms use the word or k-tuple method.
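The word-matching idea can be sketched as follows: cut the query into nonoverlapping words of length k, look each word up in an index of the target, and count how often each offset (query position minus target position) occurs. This is an illustration of the principle only, not the actual BLAST or FASTA implementation, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def word_positions(seq: str, k: int) -> dict:
    """Index every length-k word of seq by its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def offset_counts(query: str, target: str, k: int = 2) -> Counter:
    """Count word hits per offset (query position minus target position)."""
    if k <= 0:
        raise ValueError("k must be positive")
    index = word_positions(target, k)
    counts = Counter()
    # Step by k so that the query words do not overlap.
    for q_pos in range(0, len(query) - k + 1, k):
        for t_pos in index.get(query[q_pos:q_pos + k], ()):
            counts[q_pos - t_pos] += 1
    return counts


if __name__ == "__main__":
    print(offset_counts("MQHLKK", "DDMQHLDD", k=2).most_common(1))
```

An offset that collects several hits marks a diagonal worth examining with a more careful alignment.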
Multiple Sequence Alignment
It is an extension of pairwise alignment that incorporates more than two sequences at a time, with gaps inserted so that common structural positions and ancestral residues are aligned in the same column.
Alignment of a large number of sequences by extending pairwise dynamic programming is computationally almost impossible.
It is used to identify conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.
The goal of the multiple sequence alignment process is to generate a concise, information-rich table of sequence data from which the relatedness of sequences to a gene family can be determined.
Basic Local Alignment Search Tool (BLAST)
BLAST is a set of similarity search programs designed to explore all of the available sequence databases.
BLAST is provided by NCBI. It consists of a suite of algorithms that provide fast, accurate and sensitive database searching.
The general procedure is:
- The query sequence is optionally filtered to remove low-complexity regions and is broken into words (3 amino acids or 11 nucleotides); all similar words are then located in the database sequences.
- Each word hit is extended in both directions to try to lengthen the alignment.
- High-scoring segment pairs (HSPs) are generated, and a set of HSPs is chosen for each database sequence.
- Several non-overlapping HSPs may be combined to create a longer, more significant match.
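The extension step can be illustrated with a simplified ungapped extension of an exact word hit. This sketch is not NCBI's implementation: the scoring values, the drop-off rule (`x_drop`) and the function name `extend_seed` are all assumptions chosen to show the principle.

```python
def extend_seed(query: str, target: str, q_start: int, t_start: int, k: int,
                match: int = 1, mismatch: int = -1, x_drop: int = 2):
    """Extend an exact word hit of length k in both directions without gaps.

    Returns (score, q_begin, q_end), where query[q_begin:q_end] is the segment.
    """
    def extend(q: int, t: int, step: int):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            running += match if query[q] == target[t] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score has fallen too far below the best seen so far
            q += step
            t += step
        return best, best_len

    seed_score = k * match  # the seed itself is assumed to match exactly
    right, right_len = extend(q_start + k, t_start + k, +1)
    left, left_len = extend(q_start - 1, t_start - 1, -1)
    return seed_score + left + right, q_start - left_len, q_start + k + right_len


if __name__ == "__main__":
    query, target = "KKMQHLKK", "DDMQHLDD"
    score, begin, end = extend_seed(query, target, q_start=3, t_start=3, k=2)
    print(score, query[begin:end])
```

Each direction keeps only the extension up to its best running score, so trailing mismatches are trimmed from the reported segment.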
Suite of BLAST Programs
BLAST
- Ungapped BLAST. This program may miss the similarity if two sequences don't have a highly conserved region.
Gapped BLAST
- Dynamic programming is used to extend a central pair of aligned residues in both directions.
PSI-BLAST (Position-Specific Iterated BLAST)
- Incorporates both pairwise and multiple sequence alignment methods. Used for detecting weak sequence similarities.
BLASTN
- Compares a nucleotide query against all nucleotide sequences in the databases (DNA vs. DNA).
BLASTP
- Compares a protein query against protein sequences in the databases (protein vs. protein).
BLASTX
- The nucleotide query is translated in all six reading frames and then compared with all protein sequences in the database. Suitable for ESTs and for finding novel proteins.
TBLASTN
- Compares a protein query sequence against a nucleotide sequence database translated in all six reading frames.
TBLASTX
- Compares the six-frame translation of a nucleotide query sequence against the six-frame translation of a nucleotide sequence database.
BEAUTY (BLAST Enhanced Alignment Utility)
- Predicts the function of the protein being tested. Provides information on sequence family membership, location of conserved domains, etc.
BLAST-2
- New release of BLAST that allows insertions or deletions. Biologically more significant.
PhyloBLAST
- Compares the query protein sequence to the SWISS-PROT and TrEMBL protein databases and then performs phylogenetic analysis.
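The six-frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as below: three frames on the given strand and three on its reverse complement. The sketch uses the standard genetic code and assumes the input contains only A, C, G and T; ambiguity codes such as N are not handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; trailing bases are dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to their translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```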
FASTA Algorithm
For rapid alignment of pairs of DNA or protein sequences.
Based on k-tuples (words):
- Match identical words (k-tuples) from each list and create diagonals by joining adjacent matches.
- Find the sum of identical words along each diagonal.
- Rescore using a PAM matrix and retain the top-scoring segments.
- Join segments using gaps and eliminate other segments.
Related algorithms:
- FASTX and FASTY translate a DNA query sequence in all three forward reading frames and compare it with a protein database.
- TFASTX and TFASTY compare a query protein sequence to a DNA sequence database by translating each DNA sequence in all six reading frames.
PILEUP algorithm
It estimates the best alignment for a group of sequences using a pairwise approach.
It uses a global alignment procedure:
- First, a similarity score is calculated between all pairs of sequences to be aligned.
- The most similar pairs of sequences are aligned, and averages are calculated for the aligned pairs.
- The multiple alignment is achieved by a series of progressive, pairwise alignments between sequences.
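The first step, scoring every pair and starting with the most similar one, can be sketched as follows. To stay short, the sketch uses a crude position-by-position identity fraction instead of a real global alignment score; that substitution, the sequence names and the function names are all assumptions for illustration.

```python
from itertools import combinations


def identity_fraction(a: str, b: str) -> float:
    """Identical positions divided by the longer length (no alignment is performed)."""
    longest = max(len(a), len(b))
    if longest == 0:
        raise ValueError("sequences must not both be empty")
    return sum(x == y for x, y in zip(a, b)) / longest


def rank_pairs(sequences: dict) -> list:
    """Return (similarity, name_a, name_b) for every pair, most similar first."""
    scored = [
        (identity_fraction(sequences[m], sequences[n]), m, n)
        for m, n in combinations(sorted(sequences), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    example = {"s1": "ACGTAC", "s2": "ACGTAA", "s3": "TTGTCC"}
    for similarity, m, n in rank_pairs(example):
        print(f"{m}-{n}: {similarity:.2f}")
```

A progressive method would align the top-ranked pair first and then add the remaining sequences in order of decreasing similarity.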
CLUSTAL Algorithm
Based on local alignment, which is advantageous for highly divergent sequences with some regions of homology.
Based on the premise that similar sequences are likely to be evolutionarily related.
Similar sequences are aligned first, and more distantly related sequences are added later.
Advantages:
- Gap opening and gap length are treated separately, with different weights.
- Different weights for different mismatches.
- Can use FASTA or a slower and more accurate method.
- Can add individual sequences to an existing alignment.
- Secondary structure features (e.g. hydrophobicity) are also incorporated.
Strategies for Sequence Similarity Search
Decide whether to search a nucleic acid or a protein database.
The choice of a protein or nucleic acid query sequence depends on the biological information desired.
If the sequence is a protein or a protein-coding gene, the search should be performed at the protein level, as proteins allow far more distant similarities to be detected.
The initial search should be performed with a heuristic program, e.g. BLAST or FASTA.
If the query is an uncharacterized nucleotide sequence, translate it into amino acid sequences and then compare.
PRACTICAL WORK
Nucleotide BLAST
BLASTX
BLASTP
THE END
Thanks For Your Patience