Date posted: 18-Dec-2015
![Page 1: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/1.jpg)
Sequence Analysis
Hemant Kelkar
Center for Bioinformatics
University of North Carolina
Chapel Hill, NC 27599
![Page 2: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/2.jpg)
Scope of Series
Talk I
• Overview and BLAST
Talk II
• Protein analysis/Sequence Alignment
Talk III
• Evolution
• Genomics and challenges
![Page 3: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/3.jpg)
Bioinformatics
• Mathematical, statistical, and computational methods used to solve biological problems
• Glue that holds the “omics” data together
![Page 4: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/4.jpg)
Help …
• Is “my sequence” in the databases?
• Is it similar to any sequence in the DB?
• Does it have any known motifs/domains that can help in identification?
• Is there a structural homolog?
• Are there any polymorphisms?
• Genetic map location?
Bioinformatics TOOLS!
![Page 5: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/5.jpg)
Bioinformatics Tools
• Genetic Code
• Protein Structure
• Protein Evolution
Similarity search e.g. BLAST, FASTA
http://restools.sdsc.edu/biotools/biotools9.html
e.g. CLUSTALW, T-COFFEE, Phylip
![Page 6: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/6.jpg)
Primary Sequence Databases
• GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
• PIR (http://pir.georgetown.edu/)
• Swiss-Prot (http://us.expasy.org/sprot/)
Sequence information as generated in the laboratory
![Page 7: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/7.jpg)
Derived Sequence Databases
• Pfam (http://www.sanger.ac.uk/Software/Pfam/) : Protein families based on hidden Markov models (HMMs)
• InterPro (http://www.ebi.ac.uk/interpro/) : Protein families and domains based on functional sites
• TRANSFAC (http://www.gene-regulation.com/) : Transcription factor database
• Cytochrome P450 database (http://drnelson.utmem.edu/CytochromeP450.html)
Databases based on functional or phylogenetic analysis
![Page 8: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/8.jpg)
Derived Sequence Databases
• Flybase (http://www.flybase.org/) : Fly Genome
• Wormbase (http://www.wormbase.org/) : C. elegans
• Genome Browser (http://genome.ucsc.edu/) : Human and Mouse
• MGI (http://www.informatics.jax.org/) : Mouse
• Microbial Genome Resource (http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl)
Databases based on taxonomy
![Page 9: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/9.jpg)
Sequence Alignments
• Provide a measure of relatedness between nucleotide or protein sequences
• This allows us to decipher:
Structural relationships
Functional relationships
Evolutionary relationships
![Page 10: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/10.jpg)
Sequence Similarity Searches
• Information conserved evolutionarily
• DNA sequences NOT coding for proteins/rRNAs diverge rapidly
• When possible use protein sequences for similarity searches
• Non-homologous protein identification is much less reliable
• What is measured and what is inferred?
![Page 11: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/11.jpg)
Similarity
• Is always based on an observable
• Usually expressed as % identity
• Quantifies the divergence of two sequences
• substitutions/insertions/deletions
• Residues crucial for structure and/or function
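Since similarity is usually expressed as % identity, a minimal sketch of that calculation may help. It assumes the two sequences are already aligned and that columns containing a gap are left out of the denominator, which is only one of several conventions in use. The function name `percent_identity` is hypothetical; the example pair is the Query/Sbjct alignment shown on later slides.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gap columns are excluded from the count
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


query = "SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE"
sbjct = "TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS"
print(f"{percent_identity(query, sbjct):.1f}% identity")
```

The number is an observable of one particular alignment; whether the two molecules are homologous remains an inference.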
![Page 12: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/12.jpg)
Homology
• Homology always implies that the molecules share a common ancestor
• Absolute answer
• Molecules ARE or ARE NOT homologous
• No degrees
![Page 13: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/13.jpg)
How to Find Similar Sequences
• Global Sequence Alignments
• Sequence comparison along entire length
• Homolog of similar length
• Local Sequence Alignments
• Similar regions in two sequences
• Regions outside the local alignment excluded
• Sequences of different length/similarity
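The difference between the two approaches can be sketched with one dynamic-programming routine that returns only the optimal score. This is an illustrative sketch, not the method of any particular tool: the scoring values (match +1, mismatch -1, linear gap -2) and the function name `alignment_score` are assumptions chosen for readability. The global variant is the Needleman-Wunsch recurrence and the local variant is the Smith-Waterman recurrence.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Optimal alignment score: global (end to end) or local (best sub-region)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the sequence ends as well.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


a, b = "TTTACGTTT", "GGGACGGGG"
print("global:", alignment_score(a, b))
print("local: ", alignment_score(a, b, local=True))
```

Here the shared core `ACG` dominates the local score, while the global score is pulled down by the unrelated flanking residues that it is forced to include.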
![Page 14: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/14.jpg)
Dotplot
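A dotplot places one sequence along each axis and marks every cell where the two residues match, so shared regions appear as diagonal runs. The following is a minimal text-only sketch with no window or threshold filtering; the function name `dotplot` and the example sequences are hypothetical.

```python
def dotplot(seq_x, seq_y):
    """One text row per residue of seq_y; '*' marks a match with seq_x."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


seq_x = "GKSTLLSA"
seq_y = "GKSSLLSA"
print("  " + seq_x)
for residue, row in zip(seq_y, dotplot(seq_x, seq_y)):
    print(residue, row)
```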
![Page 15: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/15.jpg)
Scoring Matrices
• Empirical weighting schemes
• Considers important biology
• Side chain chemistry/structure/function
• Functional/Structural Conservation
• Ile/Val – small and hydrophobic
• Ser/Thr – both polar
• Size/Charge/Hydrophobicity
![Page 16: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/16.jpg)
Nucleotide Matrix
|   | A | C | G | T |
|---|---|---|---|---|
| A | 5 | -4 | -4 | -4 |
| C | -4 | 5 | -4 | -4 |
| G | -4 | -4 | 5 | -4 |
| T | -4 | -4 | -4 | 5 |
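As a small illustration of how such a matrix is applied, the sketch below scores an ungapped nucleotide alignment with the +5 match and -4 mismatch values from the slide. The function names are hypothetical and gaps are deliberately not handled.

```python
def nucleotide_score(x, y, match=5, mismatch=-4):
    """Score one aligned pair of bases using the match/mismatch values above."""
    return match if x == y else mismatch


def ungapped_score(a, b):
    """Sum of pair scores for two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(nucleotide_score(x, y) for x, y in zip(a, b))


print(ungapped_score("ACGT", "ACGA"))  # three matches and one mismatch
```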
![Page 17: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/17.jpg)
PAM Scoring Matrices
• Margaret Dayhoff (1978)
• Point accepted mutations (PAM)
• Patterns of substitutions in highly related proteins (>85% identical), based on multiple sequence alignments
• New side chains must function similarly
• 1 PAM = 1 AA change per 100 AA
• 1 PAM ~ 1 % Divergence
![Page 18: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/18.jpg)
BLOSUM Matrices
• Henikoff and Henikoff (1992)
• Blocks Substitution Matrices
• Differences in conserved ungapped regions
• Directly calculated, no extrapolations
• Sensitive to structural/functional substitutions
• Generally perform better for local similarity searches
![Page 19: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/19.jpg)
Scoring Matrix – BLOSUM62
![Page 20: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/20.jpg)
BLOSUM n
• Calculated from sequences sharing no more than n% identity
• Sequences with more than n% identity are clustered and weighted to 1
• Reducing the value of “n” yields more divergent/distantly-related sequences
• BLOSUM62 used as default by many of the online search sites
![Page 21: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/21.jpg)
Matrices and more
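PAM Matrices (Altschul, 1991)

| Matrix | Best use | Similarity |
|---|---|---|
| PAM 40 | Short alignments | >70% |
| PAM 120 | | >50% |
| PAM 250 | Longer, weaker local areas | >30% |

BLOSUM Matrices (Henikoff, 1993)

| Matrix | Best use | Similarity |
|---|---|---|
| BLOSUM 90 | Short alignments | >60% |
| BLOSUM 80 | | >50% |
| BLOSUM 62 | Commonly used | >35% |
| BLOSUM 30 | Longer, weaker local alignments | |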
![Page 22: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/22.jpg)
Gaps
• Compensate for insertions and deletions
• Improve alignments
• Must be kept to a reasonably small number
• 1 per 20 residues is logical
• Need a different scoring scheme
![Page 23: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/23.jpg)
Gap Penalties
• Penalty for gap introduction
• Penalty for Gap extension
Deduction for a gap = G + Ln

where G = gap-opening penalty
L = gap-extension penalty
n = length of gap

| Penalty | Nuc | Prot |
|---|---|---|
| G (gap opening) | 5 | 11 |
| L (gap extension) | 2 | 1 |
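A worked instance of the deduction G + Ln, as a minimal sketch. It assumes the values on the slide read as G = 5, L = 2 for nucleotides and G = 11, L = 1 for proteins; the function name `gap_deduction` is hypothetical.

```python
def gap_deduction(n, gap_open, gap_extend):
    """Deduction for one gap of length n: G + L * n."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * n


# Assumed reading of the slide: (G, L) per sequence type.
PENALTIES = {"Nuc": (5, 2), "Prot": (11, 1)}

for kind, (g, l) in PENALTIES.items():
    for n in (1, 3, 10):
        print(f"{kind}: gap of length {n} costs {gap_deduction(n, g, l)}")
```

Under this reading, opening a protein gap is expensive and extending it is cheap, so one long gap costs less than several short ones.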
![Page 24: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/24.jpg)
BLAST
• Basic Local Alignment Search Tool
• Seeks high-scoring segment pairs (HSPs)
• Sequences that can be aligned w/o gaps
• have a maximal aggregate score
• have a score above the score threshold S
• Many HSPs reported for ungapped BLAST
![Page 25: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/25.jpg)
BLAST Algorithms
| Program | Query | Target |
|---|---|---|
| BLASTN | Nucleotide | Nucleotide |
| BLASTP | Protein | Protein |
| BLASTX | Nucleotide (6-frame) | Protein |
| TBLASTN | Protein | Nucleotide (6-frame) |
| TBLASTX | Nucleotide (6-frame) | Nucleotide (6-frame) |
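The "6-frame" entries refer to translating a nucleotide sequence in three reading frames on each strand. The sketch below shows the idea with the standard genetic code; it is not BLAST's own implementation, the function names are hypothetical, and codons containing anything other than A, C, G, T are translated as `X` by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations for frames +1..+3 (forward) and -1..-3 (reverse strand)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCATTGTAATGGGCCGC").items():
    print(name, protein)
```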
![Page 26: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/26.jpg)
Neighborhood Words
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
Query Word (W = 3): STL
Neighborhood Score Threshold (T = 8)
Neighborhood words:
STL 13 (= 4 + 5 + 4)
SAL 8
SNL 8
SVL 8
SBL 7
SCL 7
SDL 7
Etc.
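The neighborhood for the query word STL can be sketched by scoring candidate words against it and keeping those that reach T. To stay short, this sketch varies only the middle residue, as the slide's examples do. The threonine row is an excerpt of the standard BLOSUM62 matrix entered by hand (an assumption to verify against a reference copy), the S/S and L/L self-scores of 4 come from the slide's 4 + 5 + 4, and "score >= T" is assumed as the cutoff rule.

```python
# Assumed excerpt: BLOSUM62 scores of threonine (T) against each amino acid.
BLOSUM62_T_ROW = {
    "A": 0, "R": -1, "N": 0, "D": -1, "C": -1, "Q": -1, "E": -1,
    "G": -2, "H": -2, "I": -1, "L": -1, "K": -1, "M": -1, "F": -2,
    "P": -1, "S": 1, "T": 5, "W": -2, "Y": -2, "V": 0,
}
S_SELF = 4  # S aligned with S, from the slide's 4 + 5 + 4
L_SELF = 4  # L aligned with L


def middle_position_neighbors(threshold=8):
    """Words S?L whose score against the query word STL is at least threshold."""
    scored = [(f"S{aa}L", S_SELF + s + L_SELF) for aa, s in BLOSUM62_T_ROW.items()]
    kept = [(word, score) for word, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


for word, score in middle_position_neighbors(threshold=8):
    print(word, score)
```

A full neighborhood would vary all three positions and needs the whole matrix; the structure of the calculation is the same.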
![Page 27: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/27.jpg)
High-Scoring Segment Pairs
STL 13
SAL 8
SNL 8
SVL 8
SBL 7
SCL 7
SDL 7
Etc.
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS
![Page 28: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/28.jpg)
Extension
Significance Decay
• Mismatches
• Gap penalties
[Figure: cumulative score plotted against extension, with the levels X, S, and T marked]
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS
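The extension step can be sketched as an ungapped walk outward from a word hit that tracks the cumulative score and stops once it has decayed too far. This sketch reads X as the allowed drop below the best score seen so far, which is how BLAST's drop-off is commonly described; the +1/-1 scoring and the names `extend_right` and `simple_score` are assumptions for illustration, and only the rightward direction is shown.

```python
def simple_score(x, y):
    """Toy scoring: +1 for a match, -1 for a mismatch (assumed values)."""
    return 1 if x == y else -1


def extend_right(query, subject, q_start, s_start, x_drop, score_fn=simple_score):
    """Extend without gaps; stop when the score falls more than x_drop below its best.

    Returns (best cumulative score, number of aligned positions at that best).
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += score_fn(query[q_start + i], subject[s_start + i])
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x_drop:
            break  # decayed too far: mismatches have outweighed the gains
        i += 1
    return best, best_len


print(extend_right("AAAATTTT", "AAAAGGGG", 0, 0, x_drop=2))
```

The segment is trimmed back to the point of the best score, so the trailing mismatches that triggered the stop are not part of the reported pair.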
![Page 29: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/29.jpg)
Karlin-Altschul Equation
$E = k m N e^{-\lambda s}$
m = number of letters in query
N = number of letters in db
mN = size of search space
λs = normalized score
k = minor constant
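A minimal numeric sketch of the equation above. The values used for λ and k are placeholders chosen only to make the arithmetic visible; real values depend on the scoring matrix and gap penalties. The function name `expected_hits` is hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, lam, k):
    """E = k * m * N * exp(-lambda * s) for raw score s."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters (assumptions, not values from any real search).
LAM, K = 0.3, 0.1
m, N = 250, 1_000_000

for s in (30, 60, 90):
    print(f"s = {s}: E = {expected_hits(s, m, N, LAM, K):.3g}")
```

E grows linearly with the search space mN and falls exponentially with the score, so the same score is less significant against a larger database.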
![Page 30: Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill, NC 27599.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d235503460f949f9eda/html5/thumbnails/30.jpg)
http://www.ncbi.nlm.nih.gov