Post on 22-Dec-2015
Sequence alignment
Gabor T. Marth
Department of Biology, Boston Collegemarth@bc.edu
BI420 – Introduction to Bioinformatics
Biologically significant alignment
hbb_human
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
1. Find two truly related sequences (subunits of human hemoglobin) in GenBank:
hba_human
2. Save sequences on the Desktop and rename: hba_human.fasta & hbb_human.fasta
Biologically significant alignment
http://artedi.ebc.uu.se/programs/pairwise.html
4. Upload our two proteins:
3. Visit a web-based pair-wise alignment program:
Biologically significant alignment
5. Create a pair-wise alignment between the two protein sequences:
Biologically plausible alignment
Leg hemoglobin
Retrieve another sequence, leghemoglobin:
Create a pair-wise alignment with human hemoglobin A:
Biologically plausible alignment
http://en.wikipedia.org/wiki/Leghemoglobin
Spurious alignment
Retrieve the sequence of a human BRCA1 gene variant, clearly not related to hemoglobin:
Examples from: Biological sequence analysis. Durbin, Eddy, Krogh, Mitchison
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein&cmd=search&term=NP_009225.1
Make the pair-wise alignment:
Alignment types
Examples from: BLAST. Korf, Yandell, Bedell
How do we align the words: CRANE and FRAME?
CRANE || |FRAME
3 matches, 2 mismatches
How do we align words that are different in length?
COELACANTH || |||P-ELICAN--
COELACANTH || |||-PELICAN--
5 matches, 2 mismatches, 3 gaps
In this case, if we assign +1 points for matches, and -1 for mismatches or gaps, we get 5 x 1 + 1 x (-1) + 3 x (-1) = 0. This is the alignment score.
Finding the “best” alignment
COELACANTH || |||P-ELICAN--
COELACANTH | |||PE-LICAN--
COELACANTH || P-EL-ICAN-
COELACANTH PELICAN--
S=-2 S=-6 S=-10
S=0
Global vs. local alignment
Example from: Higgs and Attwood
Aligning words: SHAKE and SPEARE
1. Global alignment: aligning the two sequences along their entire length (even if it means adding many “gaps”):
SH-AKE| | |SPEARE
SHAKE---| |SP--EARE
-OR-
1. Local alignment: aligning only a nicely matching section between the two sequences (possibly leaving the ends un-aligned): SHAKE
SPEARE
SHAKE | |SPEARE
Global alignment – Needleman-Wunsch
Example from: Higgs and Attwood
Pair-wise amino-acid scores S(ai,bi) (PAM250 scoring scheme) plus gap score g.
+ gap score g = -6
Global alignment – Needleman-Wunsch
Example from: Higgs and Attwood
Recursion scheme to calculate scores from already known scores:
H(i-1,j-1) + S(ai,bi) diagonalH(i,j) = best of: H(i-1,j) – g vertical
H(I,j-1) – g horizontal{
Global alignment – Needleman-Wunsch
Example from: Higgs and Attwood
Initialization (filling the top row and left column from gap scores):
Align the two sequences: AAGATTCAC and CCGCTCAA
Global alignment – Needleman-Wunsch
Example from: Higgs and Attwood
Initialization (filling the top row and left column from gap scores):
Global alignment – Needleman-Wunsch
Example from: Higgs and Attwood
Filling the rest of the cells (i,j):
Global alignment – Needleman-Wunsch
Example from: Higgs and Attwood
Tracing back to read out the alignment:
S-HAKESPEARE
Best global alignment:
Local alignment – Smith-Waterman
Example from: Higgs and Attwood
Recursion scheme changes:1. if the best score for a cell is negative, we replace it by 0 (start over)2. gaps at the boundary are ignored they get 0 score
H(i-1,j-1) + S(ai,bi) diagonalH(i,j) = best of: H(i-1,j) – g vertical
H(I,j-1) – g horizontal0 start over
{
Local alignment – Smith-Waterman
Example from: Higgs and Attwood
Initialization
Align the two sequences: AAGATTCAC and CCGCTCAA
Local alignment – Smith-Waterman
Example from: Higgs and Attwood
Trace-back:
SHAKESPEARE
Best local alignment:
Visualizing pair-wise alignments
Visit a web server running a dot-plotter:
http://bioweb.pasteur.fr/seqanal/interfaces/dotmatcher.html
Upload hba_human and hbb_human, and create dot-plot:
Scoring schemes
Match-mismatch-gap penalties: e.g. Match = 1 Mismatch = -5 Gap = -10
Scoring matrices
Multiple alignments
Fetch HXK (hexokinase) sequences from NCBI; save as hxk.fasta on the Desktop
Multiple alignments
Visit a web-hosted clustalW site (e.g.: http://artedi.ebc.uu.se/programs/clustalw.html) and upload the HXK sequences
The BLAST algorithms
Program Database Query Typical Uses
BLASTN Nucleotide Nucleotide Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying related transcripts.
BLASTP Protein Protein Identifying common regions between proteins. Collecting related proteins for phylogenetic analysis.
BLASTX Protein Nucleotide Finding protein-coding genes in genomic DNA.
TBLASTN Nucleotide Protein Identifying transcripts similar to a known protein (finding proteins not yet in GenBank). Mapping a protein to genomic DNA.
TBLASTX Nucleotide Nucleotide Cross-species gene prediction. Searching for genes missed by traditional methods.
The BLAST algorithm
Sequence alignment takes place in a 2-dimensional space where diagonal lines represent regions of similarity. Gaps in an alignment appear as broken diagonals. The search space is sometimes considered as 2 sequences and somtimes as query x database.
Sequence 1
alignments gapped alignment
Search space
• Global alignment vs. local alignment
– BLAST is local
• Maximum scoring pair (MSP) vs. High-scoring pair (HSP)
– BLAST finds HSPs (usually the MSP too)
• Gapped vs. ungapped
– BLAST can do both
The BLAST algorithm
Sequence 1
word hits
RGD 17
KGD 14
QGD 13
RGE 13
EGD 12
HGD 12
NGD 12
RGN 12
AGD 11
MGD 11
RAD 11
RGQ 11
RGS 11
RND 11
RSD 11
SGD 11
TGD 11
BLOSUM62 neighborhood
of RGD
T=12
• Speed gained by minimizing search space
• Alignments require word hits
• Neighborhood words
• W and T modulate speed and sensitivity
2-hit seeding
word clustersisolated words
• Alignments tend to have multiple word hits.
• Isolated word hits are frequently false leads.
• Most alignments have large ungapped regions.
• Requiring 2 word hits on the same diagonal (of 40 aa for example), greatly increases speed at a slight cost in sensitivity.
Extension of the seed alignments
extension
alignment
• Alignments are extended from seeds in each direction.
• Extension is terminated when the maximum score drops below X.
The quick brown fox jumps over the lazy dog.The quiet brown cat purrs when she sees him.
X = 5
length of extension
trim to max
Text examplematch +1mismatch -1no gaps
BLAST statistics
>gi|23098447|ref|NP_691913.1| (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis] Length = 253
Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1
Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++ISbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49
How significant is this similarity?
Scoring the alignment
Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++ISbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49
4
4-1
S (score)
The Karlin-Altschul equation
A minor constant
Expected number of alignments
Length of query
Length of database
Search space
Raw score
Scaling factor
Normalized score
The “Expect” or “E-value”
The “P-value” EeP 1
The sum-statistics
Sum statistics increases the significance (decreases the E-value) for groups of consistent alignments.