Probability and Statistical Significance of Sequence Analysis

8/9/2019 Probability and Statistical Significance of Sequence Analysis

MOLECULAR BIOLOGY (Applied)

(Z-8102) Probability and Statistical Analysis of Sequence Alignment

By

Muhammad Tariq Zahid

38-GCU-Z-10

Submitted to

Prof. Dr. Muhammad Sharif Mughal



The term "sequence analysis" in biology means subjecting a DNA or peptide sequence to sequence alignment, sequence database searches, repeated-sequence searches, or other bioinformatics methods on a computer.

Sequence analysis in molecular biology and bioinformatics is an automated, computer-based examination of characteristic fragments, e.g. of a DNA strand.

Sequence Alignment Analysis

- It is the process of lining up two or more sequences to achieve the maximum level of identity, in order to assess the degree of similarity and possible homology.

- It is a string-matching procedure in which a sequence of interest (the query) is compared with sequences in a databank.

Sequence Analysis



The main objective of sequence analysis is to analyze data in order to make reliable predictions about the structure, function and evolution of a given sequence.

In an alignment, two sequences are written across a page in two rows. Identical characters are placed in the same column, while non-identical characters are placed either in the same column as a mismatch or opposite a gap in the other sequence.

For highly divergent sequences, multiple sequence alignment and profile-matching searches are tried in order to obtain meaningful results.



A chain of n amino acids can form 20^n different polypeptides. Even for a small protein of 100 amino acids, the number of possible sequences (20^100, about 1.3 x 10^130) is astronomical, so structure cannot be predicted by enumerating the possibilities.

Statistical methods are therefore used to search for structural similarities, using probabilities calculated from the observed frequencies of amino acids in the family class.
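The size of the sequence space can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `sequence_space` is a hypothetical helper, not part of any library.

```python
def sequence_space(n, alphabet_size=20):
    """Number of distinct sequences of length n over the given alphabet."""
    return alphabet_size ** n

count = sequence_space(100)
print(f"{count:.3e}")    # 1.268e+130
print(len(str(count)))   # 131 decimal digits
```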

A high level of sequence similarity is a strong indication of homology.

Sequence alignment is essential for

- Phylogenetic relationships

- Secondary structure of proteins

- Protein family classification and tertiary structure prediction



Problems and Solutions in Sequence Similarity Alignment

Complications arise from insertions and deletions, which appear as gaps in the alignment.

A gap penalty is deducted from the alignment score to prevent too many gaps.

The quality of an alignment can be measured in terms of the number and length of the gaps introduced and the mismatches remaining.

Mutation matrices, dot plots, and global and local alignment algorithms are available to address sequence alignment problems.



Scoring Matrix

AIWQH
:  ::
AL-QH

AIWQH
:  ::
A-LQH

To quantify the similarity achieved by an alignment, scoring matrices are used: they contain a value for each possible substitution, and the alignment score is the sum of the matrix entries for each aligned amino acid pair.

PAM (Percent Accepted Mutation, also expanded as Point Accepted Mutation) matrices are a common family of scoring matrices. "Accepted" means that the mutation has been adopted by the sequence in question. Examples are PAM250 and PAM10.

Interpreting a PAM-based score: if the alignment score is greater than zero, the sequences are considered to be related; if the score is negative, it is assumed that they are not related.
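As a rough illustration of how an alignment score is summed from matrix entries, the sketch below scores the two alternative alignments of AIWQH against ALQH. The substitution values in `TOY_SCORES`, the default mismatch score and the flat `GAP_PENALTY` are invented for the example and are not taken from a real PAM matrix.

```python
# Toy substitution scores: hypothetical values, NOT from a real PAM matrix.
TOY_SCORES = {
    ("A", "A"): 2, ("Q", "Q"): 4, ("H", "H"): 6,
    ("I", "L"): 2, ("W", "L"): -2,
}
GAP_PENALTY = -3        # assumed flat penalty per gap position
DEFAULT_MISMATCH = -1   # assumed score for pairs missing from the toy table

def pair_score(a, b):
    """Score one aligned column; '-' marks a gap."""
    if a == "-" or b == "-":
        return GAP_PENALTY
    # The table is treated as symmetric, so try both orders.
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), DEFAULT_MISMATCH))

def alignment_score(row1, row2):
    """Sum the column scores of two equal-length alignment rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    return sum(pair_score(a, b) for a, b in zip(row1, row2))

print(alignment_score("AIWQH", "AL-QH"))  # 11
print(alignment_score("AIWQH", "A-LQH"))  # 7
```

With these toy values the first placement of the gap scores higher, because pairing I with L is rewarded while pairing W with L is penalized.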



Similarity Matrix

A similarity matrix is a matrix of scores that express the similarity between two data points.

Similarity matrices are used for database searching. The simplest ones assign a value of 1 to identical letters and 0 to different letters.

Gap penalties are introduced to minimize the number of gaps.

The compilation of similarity scores from pairwise alignments into a matrix is called a scoring matrix, which helps in

- Evaluating the match or mismatch between any two residues.

- Assigning a score for insertions and deletions.

- Optimizing the total score.

- Evaluating the significance of the total score.



Distance Matrix

Distance matrices are used for constructing phylogenetic trees.

The distance score is usually calculated as the number of mismatches in an alignment divided by the total number of matches and mismatches:

D = mismatches / (matches + mismatches)

Similarity (S) and distance (D) are interconvertible (S = 1 - D).

Less similar sequences have a higher distance score.

The number of mutational changes gives a quantitative measure of the distance between two gene sequences, which helps in constructing a phylogenetic tree.
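The distance formula can be written out directly. The sketch below assumes the two rows are already aligned and that columns containing a gap are simply skipped; the slide does not say how gaps are counted, so that choice is an assumption.

```python
def distance(row1, row2):
    """D = mismatches / (matches + mismatches) for two aligned rows."""
    matches = mismatches = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are ignored
        if a == b:
            matches += 1
        else:
            mismatches += 1
    total = matches + mismatches
    if total == 0:
        raise ValueError("no comparable columns")
    return mismatches / total

d = distance("ACGTACGT", "ACGTTCGA")
print(d)      # 0.25
print(1 - d)  # similarity S = 1 - D = 0.75
```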



The degree of match between two letters can be represented in a matrix. The score is

$$\text{Score} = \frac{p_{ij}}{q_i \, q_j}$$

($p_{ij}$ is the probability that residue i is substituted by residue j; $q_i$ and $q_j$ are the background probabilities of residues i and j.)
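A small numerical sketch of this score follows. The probabilities are made-up values chosen for easy arithmetic. Taking the logarithm of the ratio (a log-odds score) is an assumption that goes beyond the slide; it is shown because scores that are summed along an alignment are normally logarithms of such ratios.

```python
import math

def odds_ratio(p_ij, q_i, q_j):
    """Score = p_ij / (q_i * q_j), as given in the text."""
    if q_i <= 0 or q_j <= 0:
        raise ValueError("background probabilities must be positive")
    return p_ij / (q_i * q_j)

def log_odds(p_ij, q_i, q_j):
    """Assumption beyond the slide: base-2 logarithm of the odds ratio."""
    return math.log2(odds_ratio(p_ij, q_i, q_j))

# Hypothetical probabilities, chosen so the results are exact.
print(odds_ratio(0.0625, 0.125, 0.125))   # 4.0: pair seen more often than by chance
print(log_odds(0.0625, 0.125, 0.125))     # 2.0
print(log_odds(0.0078125, 0.125, 0.125))  # -1.0: pair seen less often than by chance
```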

Nucleotide bases are either purines (two rings) or pyrimidines (one ring).

- A mutation that conserves the ring number (purine to purine, or pyrimidine to pyrimidine) is called a transition, while a mutation that changes the ring number is called a transversion.

- Use of a transition/transversion matrix reduces the noise when comparing distantly related sequences.
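The transition/transversion distinction above can be sketched as a small classifier for DNA bases. The function name is a hypothetical helper.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases

def classify_substitution(a, b):
    """Return 'identical', 'transition' or 'transversion' for two DNA bases."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("expected one of A, C, G, T")
    if a == b:
        return "identical"
    same_ring_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_ring_class else "transversion"

print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```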

Distance between amino acid sequences is more difficult to calculate, because an amino acid substitution may require one, two or three nucleotide mutations, and some nucleotide mutations are silent.



Pair-wise Sequence Alignment

It is the fundamental process in sequence comparison analysis.

It is a relatively straightforward computational problem.

It is used to find the best-matching piecewise (local) or global alignments of two query sequences. Pairwise alignment can only be applied to two sequences at a time, and is often used for methods that do not require extreme precision.

The fact that two sequences can be aligned is not proof that a relationship exists between them; statistical values are used to indicate the level of confidence.

The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods.



Dynamic Programming Method

The best solution for pairwise sequence alignment is generally considered to be an approach called dynamic programming.

The technique of dynamic programming can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. These methods allow artificial gaps in the sequence.

A key part of these alignment methods is the scoring of insertions and deletions:

- A positive score for matching residues.

- A negative score (penalty) for opening a gap in one sequence so that other regions can be matched.

- A negative score (penalty) each time a gap is extended.
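The two gap-related scores above are often combined into what is called an affine gap penalty. The sketch below shows one common way to write it; the default values of `gap_open` and `gap_extend` are hypothetical and are not taken from the slides.

```python
def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Penalty for a single gap of the given length.

    The first gap position costs gap_open; each further position costs
    gap_extend. Both defaults are illustrative assumptions.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

print(affine_gap_penalty(1))  # -10
print(affine_gap_penalty(4))  # -13
```

Under this scheme one long gap is cheaper than several short ones of the same total length, which favors alignments with few, contiguous insertions or deletions.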



Global Alignment

It is the alignment of two nucleic acid or protein sequences over their entire length. It compares all characters of one sequence to all characters of the other.

It is most useful when the sequences in the query set are similar and of roughly equal size.

All possible pairs are represented in a two-dimensional array.

The alignment score, on which statistical significance is based, is determined by the scoring system, e.g. match = 1, mismatch = 0, gap = penalty.

Limitations

- Not effective for divergent sequences.

- Not valid for all biological sequences.

- Short, highly similar regions may be missed in a global alignment.
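A minimal sketch of global alignment scoring in the style of Needleman-Wunsch is given below. It uses the match = 1, mismatch = 0 scheme mentioned above; the linear gap penalty of -1 is an assumption, and the function returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score of s and t (linear gap penalty)."""
    rows, cols = len(s) + 1, len(t) + 1
    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap          # t[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            F[i][j] = max(diagonal,            # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,   # gap in t
                          F[i][j - 1] + gap)   # gap in s
    return F[-1][-1]

print(needleman_wunsch_score("AIWQH", "ALQH"))  # 2 (three matches, one gap)
```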



Local Alignment

It is the alignment of some portion of two nucleic acid or protein sequences.

It is more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

The Smith-Waterman algorithm is a general local alignment method.

For sufficiently similar sequences, there is no difference between local and global alignments.

Hybrid methods, known as semiglobal or "glocal" methods, attempt to find the best possible alignment that includes the start and end of one or the other sequence.
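A minimal sketch of local alignment scoring in the style of Smith-Waterman follows. The scoring values (match = 2, mismatch = -1, gap = -2) are illustrative assumptions. The essential difference from global alignment is that a cell score is never allowed to drop below zero, so an alignment can start and end anywhere.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of s and t (linear gap penalty)."""
    best = 0
    previous = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        current = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diagonal = previous[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The 0 lets a new local alignment start at this cell.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("XXHELLOXX", "YYHELLOYY"))  # 10: the shared "HELLO"
print(smith_waterman_score("AAAA", "CCCC"))            # 0: nothing in common
```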



k-Tuple Method

It is an alternative to full alignment of two sequences, used to search for common patterns.

Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence, which are then matched to candidate database sequences.

The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; several distinct words sharing the same offset indicate a region of alignment.

The BLAST and FASTA algorithms use the word or k-tuple method.
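The word-matching idea can be sketched as follows: index the words of the query, look up each word of the target, and count how often each position offset occurs. For simplicity this sketch uses overlapping words of length `k`, which is an assumption; it is not the actual BLAST or FASTA implementation.

```python
from collections import Counter, defaultdict

def word_offsets(query, target, k=3):
    """Count matching k-letter words by offset (target position - query position)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    offsets = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            offsets[j - i] += 1
    return offsets

hits = word_offsets("ACGTAC", "TTACGTACTT")
print(hits.most_common(1))  # [(2, 4)]: four words agree on an offset of 2
```

An offset shared by many words corresponds to a diagonal in a dot plot, which is where an alignment is likely to be found.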



Multiple Sequence Alignment

It is an extension of pairwise alignment to incorporate more than two sequences at a time, with gaps inserted so that common structural positions and ancestral residues are aligned in the same column.

Alignment of a large number of sequences by pairwise dynamic programming is almost impossible.

It is used to identify conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.

The goal of the multiple sequence alignment process is to generate a concise and information-rich table of sequence data, from which the relatedness of sequences to a gene family can be assessed.



Basic Local Alignment Search Tool (BLAST)

BLAST is a set of similarity search programs designed to explore all of the available sequence databases.

BLAST is provided by NCBI. It consists of a suite of algorithms that provide fast, accurate and sensitive database searching.

The general procedure is

- The query sequence is optionally filtered to remove low-complexity regions, and each word of the query (3 amino acids or 11 nucleotides) is used to locate all similar words in the test sequence.

- The alignment is extended on both sides of each similar word.

- High-scoring segment pairs (HSPs) are generated, and a set of HSPs is chosen for that database.

- Several non-overlapping HSPs may be combined to create a longer, more significant match.
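The seed-and-extend procedure above can be sketched in a simplified form. This is not the NCBI implementation: seeds here are exact word matches only, extension is ungapped, and the scoring values and the `x_drop` cutoff are illustrative assumptions.

```python
def find_seeds(query, subject, w=3):
    """Yield (i, j) where the w-letter word at query[i:] equals subject[j:]."""
    positions = {}
    for i in range(len(query) - w + 1):
        positions.setdefault(query[i:i + w], []).append(i)
    for j in range(len(subject) - w + 1):
        for i in positions.get(subject[j:j + w], []):
            yield i, j

def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Extend in one direction; stop once the score falls x_drop below its best."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break
        qi += step
        sj += step
    return best, best_len

def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Return (score, query_start, query_end) of the ungapped segment pair."""
    right, right_len = _extend(query, subject, i + w, j + w, 1, match, mismatch, x_drop)
    left, left_len = _extend(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    return w * match + left + right, i - left_len, i + w + right_len

query, subject = "TTACGTACGG", "CCACGTACAA"
i, j = next(find_seeds(query, subject))
score, start, end = extend_seed(query, subject, i, j)
print(score, query[start:end])  # 6 ACGTAC
```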



Suite of BLAST Programs

BLAST

- Ungapped BLAST. This program may miss the similarity if two sequences don't have a highly conserved region.

Gapped BLAST

- Dynamic programming is used to extend a central pair of aligned residues in both directions.

PSI-BLAST (Position-Specific Iterated BLAST)

- Incorporates both pairwise and multiple sequence alignment methods. Used for weak sequence similarities.

BLASTN

- Compares a nucleotide query against all nucleotide sequences in the databases (DNA-DNA).

BLASTP

- Compares a protein query against protein sequence databases (protein-protein).



BLASTX

- The nucleotide query is translated in all six reading frames and then compared with all protein sequences in the database. Suitable for analyzing ESTs and finding novel proteins.

TBLASTN

- Compares a protein query sequence against a nucleotide sequence database translated in all six reading frames.

TBLASTX

- Compares the six-frame translation of a nucleotide query sequence against the six-frame translation of a nucleotide sequence database.

BEAUTY (BLAST Enhanced Alignment Utility)

- Helps predict the function of the protein being tested by providing information on sequence family membership, location of conserved domains, etc.

BLAST-2

- New release of BLAST that allows insertions and deletions, giving biologically more significant alignments.

PhyloBLAST

- Compares the query protein sequence to the SWISS-PROT and TrEMBL protein databases and then performs phylogenetic analysis.
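The six-frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as below. The sketch assumes the standard genetic code and plain A, C, G, T input; codons containing any other character are translated as "X". The function names are hypothetical helpers, not part of BLAST.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse)."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for shift in range(3):
        frames[f"+{shift + 1}"] = translate(dna[shift:])
        frames[f"-{shift + 1}"] = translate(reverse[shift:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```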



FASTA Algorithm

For rapid alignment of pairs of DNA or protein sequences.

Based on k-tuples (words):

- Match identical words (k-tuples) from each list and create diagonals by joining adjacent matches.

- Find the sum of identical words.

- Rescore using a PAM matrix and retain the top-scoring segments.

- Join segments using gaps and eliminate the other segments.

Related algorithms

- FASTX and FASTY translate a DNA query sequence in all three forward reading frames and compare it with a protein database.

- TFASTX and TFASTY compare a query protein sequence to a DNA sequence database by translating each DNA sequence into six reading frames.



PILEUP Algorithm

It estimates the best alignment for a group of sequences using a pairwise approach.

It uses a global alignment procedure.

- First, a similarity score is calculated between all pairs of sequences to be aligned.

- The most similar pairs of sequences are aligned, and averages are calculated for the aligned pairs.

- The multiple alignment is achieved by a series of progressive, pairwise alignments between sequences.
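The first step of this progressive approach, ranking all pairs by similarity so that the most similar pair is aligned first, can be sketched as follows. As a simplifying assumption, similarity here is plain percent identity between equal-length sequences; PILEUP itself scores pairwise alignments. The sequence names and data are invented.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pairs_by_similarity(sequences):
    """Return (similarity, name1, name2) tuples, most similar pair first."""
    scored = [
        (identity(sequences[p], sequences[q]), p, q)
        for p, q in combinations(sorted(sequences), 2)
    ]
    return sorted(scored, key=lambda item: -item[0])

# Hypothetical example data.
sequences = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACCC"}
for similarity, p, q in pairs_by_similarity(sequences):
    print(p, q, similarity)
```

In this example s1 and s2 would be aligned first, and s3 would be added to that pair afterwards.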



CLUSTAL Algorithm

Based on local alignment, which is advantageous for highly divergent sequences with some regions of homology.

Based on the premise that similar sequences are likely to be evolutionarily related.

Similar sequences are aligned first, and more distantly related sequences are added later.

Advantages

- Gap opening and gap length are penalized separately, with different weights.

- Different weights for different mismatches.

- Can use FASTA or a slower and more accurate method.

- Can add individual sequences to an existing alignment.

- Secondary structure features (e.g. hydrophobicity) are also incorporated.



Strategies for Sequence Similarity Search

Decide whether to search a nucleic acid or a protein database.

The choice of a protein or nucleic acid query sequence depends on the biological information desired.

If the sequence is a protein or a protein-coding gene, the search should be performed at the protein level, because protein comparisons can detect far more distant similarity.

The initial search should be performed with a heuristic program, e.g. BLAST or FASTA.

If the query sequence is uncharacterized, translate it into amino acid sequences and then compare.



PRACTICAL WORK

Nucleotide BLAST

BLASTX

BLASTP



THE END

Thanks For Your Patience
