Sequence alignment

Gabor T. Marth

Department of Biology, Boston College

marth@bc.edu

BI420 – Introduction to Bioinformatics

Biologically significant alignment

1. Find two truly related sequences (subunits of human hemoglobin) in GenBank: hba_human and hbb_human

http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

2. Save sequences on the Desktop and rename: hba_human.fasta & hbb_human.fasta

3. Visit a web-based pair-wise alignment program:

http://artedi.ebc.uu.se/programs/pairwise.html

4. Upload our two proteins:

5. Create a pair-wise alignment between the two protein sequences:

Biologically plausible alignment

Leg hemoglobin

Retrieve another sequence, leghemoglobin:

Create a pair-wise alignment with human hemoglobin A:

Biologically plausible alignment

http://en.wikipedia.org/wiki/Leghemoglobin

Spurious alignment

Retrieve the sequence of a human BRCA1 gene variant, clearly not related to hemoglobin:

Examples from: Biological sequence analysis. Durbin, Eddy, Krogh, Mitchison

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein&cmd=search&term=NP_009225.1

Make the pair-wise alignment:

Alignment types

Examples from: BLAST. Korf, Yandell, Bedell

How do we align the words: CRANE and FRAME?

CRANE
 || |
FRAME

3 matches, 2 mismatches

How do we align words that are different in length?

COELACANTH
  || |||
P-ELICAN--

COELACANTH
  || |||
-PELICAN--

5 matches, 2 mismatches, 3 gaps

In this case, if we assign +1 point for each match and -1 for each mismatch or gap, we get 5 x 1 + 2 x (-1) + 3 x (-1) = 0. This is the alignment score.
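The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming the alignment is given as two equal-length strings with `-` marking a gap; the function name `score_alignment` is made up for this example.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Count matches, mismatches and gaps in an alignment and return its score.

    `top` and `bottom` are the two aligned rows; '-' marks a gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, score


# (matches, mismatches, gaps, score) for the COELACANTH / PELICAN example
print(score_alignment("COELACANTH", "P-ELICAN--"))  # (5, 2, 3, 0)
```

Changing the `match`, `mismatch` and `gap` arguments shows how the same alignment scores under a different scheme.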

Finding the “best” alignment

COELACANTH
  || |||
P-ELICAN--
S = 0

COELACANTH
   | |||
PE-LICAN--
S = -2

COELACANTH
  ||
P-EL-ICAN-
S = -6

COELACANTH
PELICAN---
S = -10

Global vs. local alignment

Example from: Higgs and Attwood

Aligning words: SHAKE and SPEARE

1. Global alignment: aligning the two sequences along their entire length (even if it means adding many “gaps”):

SH-AKE
|  | |
SPEARE

SHAKE---
|   |
SP--EARE

2. Local alignment: aligning only a nicely matching section between the two sequences (possibly leaving the ends un-aligned):

 SHAKE
   | |
SPEARE

Global alignment – Needleman-Wunsch

Pair-wise amino-acid scores S(a_i, b_j) (PAM250 scoring scheme), plus a gap score g = -6.

Recursion scheme to calculate scores from already known scores:

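$$
H(i,j) = \max \begin{cases}
H(i-1,j-1) + S(a_i,b_j) & \text{(diagonal)} \\
H(i-1,j) + g & \text{(vertical)} \\
H(i,j-1) + g & \text{(horizontal)}
\end{cases}
$$

Here g is the (negative) gap score, so adding g penalizes each gap.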

Initialization (filling the top row and left column from gap scores):

Align the two sequences: AAGATTCAC and CCGCTCAA

Initialization (filling the top row and left column from gap scores):

Filling cell (1,1):

Filling the rest of the cells (i,j):

Tracing back to read out the alignment:

Best global alignment:

S-HAKE
SPEARE
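The initialization, cell-filling and trace-back steps can be sketched as follows. This is a minimal illustration, not the PAM250 calculation from the slides: to stay self-contained it assumes the simple +1 / -1 / -1 match / mismatch / gap scheme from the COELACANTH example, and the function name `needleman_wunsch` is chosen for this sketch.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap score (gap is a negative number)."""
    n, m = len(a), len(b)
    # H[i][j] = best score for aligning a[:i] with b[:j]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: left column and top row are filled from gap scores.
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap
    # Fill every other cell from its three already-known neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,   # diagonal
                          H[i - 1][j] + gap,     # vertical
                          H[i][j - 1] + gap)     # horizontal
    # Trace back from the bottom-right corner to read out one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
print(score)
print(top)
print(bottom)
```

Several alignments can share the best score, so the trace-back returns one of them; which one depends on the order in which the three moves are checked.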

Local alignment – Smith-Waterman

Recursion scheme changes:

1. If the best score for a cell is negative, we replace it by 0 (start over).

2. Gaps at the boundary are ignored: they get a score of 0.

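$$
H(i,j) = \max \begin{cases}
H(i-1,j-1) + S(a_i,b_j) & \text{(diagonal)} \\
H(i-1,j) + g & \text{(vertical)} \\
H(i,j-1) + g & \text{(horizontal)} \\
0 & \text{(start over)}
\end{cases}
$$

Here g is the (negative) gap score, so adding g penalizes each gap.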

Initialization

Align the two sequences: AAGATTCAC and CCGCTCAA

Filling the cells:

Trace-back:

Best local alignment:

 SHAKE
SPEARE
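The local variant can be sketched the same way. This is a minimal illustration under assumed toy scores (match +2, mismatch -1, gap -2) rather than PAM250, so the aligned segment it reports for SHAKE and SPEARE need not be the one obtained with the slide's scoring scheme; the name `smith_waterman` is chosen for this sketch.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment: returns (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # The top row and left column stay 0: boundary gaps are not penalized.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start over
                          H[i - 1][j - 1] + s,    # diagonal
                          H[i - 1][j] + gap,      # vertical
                          H[i][j - 1] + gap)      # horizontal
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("SHAKE", "SPEARE"))  # (3, 'AKE', 'ARE')
```

The only differences from the global version are the extra 0 in the `max`, the zero boundary, and the trace-back that starts at the best cell instead of the corner.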

Visualizing pair-wise alignments

Visit a web server running a dot-plotter:

http://bioweb.pasteur.fr/seqanal/interfaces/dotmatcher.html

Upload hba_human and hbb_human, and create dot-plot:
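The idea behind a dot-plot can be sketched in text form. This is a minimal illustration that marks exact matches of short windows; real dot-plot programs typically score each window with a substitution matrix and a threshold, so treat the function `dotplot` and its `window` parameter as assumptions of this sketch.

```python
def dotplot(a, b, window=1):
    """Return one text row per window of `a`; '*' marks an identical window in `b`."""
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows


# Rows follow SHAKE, columns follow SPEARE.
for row in dotplot("SHAKE", "SPEARE"):
    print(row)
```

Runs of `*` along a diagonal correspond to regions of similarity; a larger `window` removes isolated dots at the price of missing short matches.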

Scoring schemes

Match-mismatch-gap penalties: e.g. Match = 1, Mismatch = -5, Gap = -10

Scoring matrices

Multiple alignments

Fetch HXK (hexokinase) sequences from NCBI; save as hxk.fasta on the Desktop

Multiple alignments

Visit a web-hosted clustalW site (e.g.: http://artedi.ebc.uu.se/programs/clustalw.html) and upload the HXK sequences

Multiple alignments

The multiple alignment of 24 hexokinase protein sequences from various species

Anchored multiple alignment

Similarity searching vs. alignment

Alignment

Similarity search

database

The BLAST algorithms

| Program | Database | Query | Typical uses |
|---|---|---|---|
| BLASTN | Nucleotide | Nucleotide | Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying related transcripts. |
| BLASTP | Protein | Protein | Identifying common regions between proteins. Collecting related proteins for phylogenetic analysis. |
| BLASTX | Protein | Nucleotide | Finding protein-coding genes in genomic DNA. |
| TBLASTN | Nucleotide | Protein | Identifying transcripts similar to a known protein (finding proteins not yet in GenBank). Mapping a protein to genomic DNA. |
| TBLASTX | Nucleotide | Nucleotide | Cross-species gene prediction. Searching for genes missed by traditional methods. |

BLAST report

gi|7428631

http://www.ncbi.nlm.nih.gov/BLAST/

BLAST report

The BLAST algorithm

Sequence alignment takes place in a 2-dimensional space where diagonal lines represent regions of similarity. Gaps in an alignment appear as broken diagonals. The search space is sometimes considered as two sequences and sometimes as query x database.

[Figure: search space, with Sequence 1 along one axis; diagonals mark alignments, and a broken diagonal marks a gapped alignment]

• Global alignment vs. local alignment

– BLAST is local

• Maximum scoring pair (MSP) vs. High-scoring pair (HSP)

– BLAST finds HSPs (usually the MSP too)

• Gapped vs. ungapped

– BLAST can do both

The BLAST algorithm

[Figure: word hits marked along Sequence 1]

BLOSUM62 neighborhood of RGD (word, score):

RGD 17, KGD 14, QGD 13, RGE 13, EGD 12, HGD 12, NGD 12, RGN 12, AGD 11, MGD 11, RAD 11, RGQ 11, RGS 11, RND 11, RSD 11, SGD 11, TGD 11

• Speed gained by minimizing search space

• Alignments require word hits

• Neighborhood words

• W and T modulate speed and sensitivity
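How the threshold T shapes the neighborhood of a word can be sketched as follows. This is a minimal illustration for the word RGD only: the per-position scores are the ones implied by the word scores on the slide, residues not listed are assumed to score too low to matter at T = 11, and the names `POSITION_SCORES` and `neighborhood` are made up for this sketch.

```python
from itertools import product

# Score of each candidate residue against R, G and D respectively,
# as implied by the BLOSUM62 neighborhood scores listed on the slide.
# Unlisted residues are assumed (in this sketch) to score too low for T = 11.
POSITION_SCORES = [
    {"R": 5, "K": 2, "Q": 1, "E": 0, "H": 0, "N": 0,
     "A": -1, "M": -1, "S": -1, "T": -1},
    {"G": 6, "A": 0, "N": 0, "S": 0},
    {"D": 6, "E": 2, "N": 1, "Q": 0, "S": 0},
]


def neighborhood(position_scores, threshold):
    """List (word, score) pairs whose summed score is at least `threshold`."""
    words = []
    for combo in product(*(d.items() for d in position_scores)):
        word = "".join(residue for residue, _ in combo)
        score = sum(s for _, s in combo)
        if score >= threshold:
            words.append((word, score))
    # Highest scores first, alphabetical within a score.
    return sorted(words, key=lambda ws: (-ws[1], ws[0]))


for t in (11, 12, 13):
    print(t, len(neighborhood(POSITION_SCORES, t)))
```

Raising T shrinks the neighborhood, which means fewer word hits to follow up (faster) but also fewer chances to seed a weak alignment (less sensitive).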

Word length

2-hit seeding

[Figure: word clusters vs. isolated words]

• Alignments tend to have multiple word hits.

• Isolated word hits are frequently false leads.

• Most alignments have large ungapped regions.

• Requiring 2 word hits on the same diagonal (within 40 aa of each other, for example) greatly increases speed at a slight cost in sensitivity.
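The two-hit rule can be sketched as follows. This is a simplified illustration: it uses exact word matches rather than neighborhood words, and the function names, the word length `w` and the `window` parameter are assumptions of this sketch, not BLAST's actual implementation.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def two_hit_seeds(hits, w=3, window=40):
    """Keep pairs of non-overlapping hits on the same diagonal within `window`."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)   # hits with equal j - i share a diagonal
    seeds = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        for first, second in zip(starts, starts[1:]):
            if w <= second - first <= window:
                seeds.append((diagonal, first, second))
    return seeds


hits = word_hits("RGDAAAKGD", "RGDCCCKGD")
print(hits)                 # [(0, 0), (6, 6)]
print(two_hit_seeds(hits))  # [(0, 0, 6)]
```

Isolated hits, which have no partner on their diagonal inside the window, never reach the extension step, and that is where the speed gain comes from.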

Extension of the seed alignments

extension

alignment

• Alignments are extended from seeds in each direction.

• Extension is terminated when the score drops more than X below the maximum score seen so far.

Text example (match +1, mismatch -1, no gaps):

The quick brown fox jumps over the lazy dog.
The quiet brown cat purrs when she sees him.

[Figure labels: length of extension; trim to max]
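The extension rule can be tried on the text example. This is a minimal sketch of an ungapped, rightward-only extension that assumes the seed sits at the start of both sentences; the function name `extend_right` and the parameter `x_drop` are chosen for this illustration.

```python
def extend_right(a, b, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment from position 0 to the right.

    Stops once the running score has dropped more than `x_drop` below the
    best score seen so far, then trims back to that best-scoring position.
    Returns (best score, length of the trimmed alignment).
    """
    score = best = 0
    best_len = 0
    for i, (x, y) in enumerate(zip(a, b), start=1):
        score += match if x == y else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break
    return best, best_len


s1 = "The quick brown fox jumps over the lazy dog."
s2 = "The quiet brown cat purrs when she sees him."

best, length = extend_right(s1, s2, x_drop=5)
print(best, repr(s1[:length]))   # 12 'The quick brown '

best, length = extend_right(s1, s2, x_drop=1)
print(best, repr(s1[:length]))   # 7 'The qui'
```

With a small X the extension gives up at the first short run of mismatches; a larger X lets it cross that run and reach the longer, higher-scoring alignment.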

BLAST statistics

>gi|23098447|ref|NP_691913.1| (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis]
Length = 253

Score = 38.9 bits (89), Expect = 3e-05
Identities = 17/40 (42%), Positives = 26/40 (64%)
Frame = -1

Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

How significant is this similarity?

Scoring the alignment

Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

S (score)

The Karlin-Altschul equation

$$E = K\,m\,n\,e^{-\lambda S}$$

E: expected number of alignments

K: a minor constant

m: length of query

n: length of database

mn: search space

S: raw score

λ: scaling factor

λS: normalized score

The “Expect” or “E-value”

The “P-value”: $P = 1 - e^{-E}$
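These quantities can be computed in a few lines. The sketch below assumes the standard form of the Karlin-Altschul equation, E = K m n e^(-λS), and the usual bit-score conversion S' = (λS - ln K) / ln 2. The values λ = 0.267 and K = 0.041 are assumed here (they are commonly quoted for gapped BLOSUM62 searches), and the query and database lengths in the usage lines are hypothetical, since the report above does not state them.

```python
import math


def e_value(raw_score, m, n, lam, k):
    """Expected number of alignments scoring at least raw_score: K*m*n*exp(-lam*S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score in bits: (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e):
    """Probability of at least one such alignment: P = 1 - exp(-E)."""
    return -math.expm1(-e)   # expm1 keeps precision when E is tiny


LAM, K = 0.267, 0.041        # assumed parameters, see the note above

# Raw score 89 from the report, converted to bits.
print(round(bit_score(89, LAM, K), 1))   # 38.9

# Hypothetical search space: a 120-residue query against 1,000,000 residues.
e = e_value(89, m=120, n=1_000_000, lam=LAM, k=K)
print(e, p_value(e))
```

For small E the P-value is almost identical to E, which is why BLAST reports can quote the E-value alone.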

The sum-statistics

Sum statistics increase the significance (decrease the E-value) for groups of consistent alignments.

The sum-statistics

The sum score is not reported by BLAST!
