Bioinformatics
• Sequence alignment• Database searching• Significance, e-values• Trees• Gene ontology• Protein Structure
1
Experimental origins of sequence data
F
Each color is one lane of an electrophoresis gel.
The Sanger dideoxynucleotide method
New technology: Pyrosequencing• http://www.youtube.com/watch?
v=nFfgWGFe0aA&NR=1• ..or search youtube for “pyrosequencing”
• Whole genome sequencing in < 1 day!!
3
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
Plant.
Bug.
Aligning two sequences tells us how they are related.
An alignment is a one-to-one association, or a set of one-to-one associations. Aligned sequences are assumed to be homologous (having a common ancestor). Furthermore, aligned positions within the sequences are assumed to have a common ancestor position.
Positions that align in sequence usually align in space
have a common ancestor
superimpose in spaceTGCTA TGCAA
TGCTA
a Venn diagram
Simple alignment• Simple similarity score:
Identity match = 1 point
mismatch = 0 points
gap = -1 points
• Optimal alignment = The highest-scoring alignment given the similarity score.
AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC
Building an alignment starts with a scoring matrix. In its simplest form, a dot plot.
Everything aligned to everything.
AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC
An alignment is a path through the scoring matrix, always proceeding to the right and
down. (no non-sequential alignments allowed.)
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
Unbroken diagonals represent “blocks” of sequence without indels.
AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC
blocks
indels
insertion, A
deletion of T
mutation, T->G
The path records, and scores, all mutational events, incl. insertions, deletions, mutations.
BLOSUM62: protein substitution matrix
PAM250
Protein versus DNA alignments
• Protein alphabet = 20, DNA alphabet = 4.– Protein alignment is more informative– Less chance of homoplasy with proteins.– Homology detectable at greater edit distance– Protein alignment more informative
• Better Gold Standard alignments are available for proteins. – Better statistics from G.S. alignments.
• On the other hand, DNA alignments are more sensitive to short evolutionary distances. 13
Are protein alignment better?
Bioinformatics
• Sequence alignment• Database searching• Significance, e-values• Trees• Gene ontology• Protein Structure
14
Database searching
Why do a database search?Mol. Bio: Determination of gene function. Primer design.
Pathology, epidemiology, ecology: Determination of species, strain, lineage, phylogeny.
Biophysics: Prediction of RNA or protein structure, effect of mutation.
one sequenceGenBank, PIR,
Swissprot,GenEMBL, DDBJ
lots of sequences
Searching millions of sequences
Given a protein or DNA sequence, we want to find all of the sequences in GenBank (over 17 million sequences!!) that have a good alignment score.
Each alignment score should be the optimal score (or a close approximation).
How do we do it?
Fast Database SearchingBLAST S. Altschul et al.
First make a set of lookup tables for all 3-letter (protein) or 11-letter (DNA) matches.
Make another lookup table: the locations of all 3-letter words in the database.
Start with a match, extend to the left and right until the score no longer increases.
Very fast. Selective, but not as sensitive as slower search methods (SSEARCH). Reliable statistics. Heuristic, not optimal.
BLAST, precalculations
PGQ
...
PGQ PGR PGS ... PGT PGV PGW PGY PAQ PCQ PDQ PEQ PFQ ... ...
All 8000 possible 3-tuples
50 high-scoring
3-tuples
Each 3-tuple is scored against all 8000 possible 3-tuples using BLOSUM. The top scoring 50 are kept as that 3tuple’s “neighborhood words”
BLASTquery sequence
identity matches
seeds HSPs
a 3-tuple
For every 3-residue window, we get the set of 50 nearest neighbors. Use each word to get identity matches (seeds). Then extend the seed alignments as long as the score increases.
neighborhood words for 3-tuple
target sequence
BLAST
HSPs alignment
The best extended seeds are called HSPs (high scoring pairs). The top scoring HSP is picked first, then the second (as long as it falls "northwest" or "southeast" of the first.), and so on.
Other forms of BLAST
21
BLAST query databaseblastn nucleotide nucleotideblastp protein proteintblastn protein translated DNAblastx translated DNA proteintblastx translated DNA translated DNA
psi-blast protein, profile proteinphi-blast pattern protein
transitive blast* any any*not really a blast. Just a way of using blast.
Psi-BLAST: Blast with profiles
Psi-BLAST searches the database iteratively.(Cycle 1) Normal BLAST (with gaps)
(Cycle 2) (a) Construct a profile from the results of Cycle 1.
(b) Search the database using the profile.
(Cycle 3) (a) Construct a profile from the results of Cycle 2.
(b) Search the database using the profile.
And So On... (user sets the number of cycles)
Psi-BLAST is much more sensitive than BLAST.
Also more vulnerable to low-complexity.
PHI-BLAST --Patterned Hit Initiated BLAST
23
DNA or Protein search?•Advantages of searching DNA databases
Larger database. Does not assume a reading frame. Can find similarity in non-coding regions (introns, promotor regions). Can find frameshift mutations. Can find pseudogenes.
•Disadvantages
Slower. Not as sensitive. Ignores selective pressure at the protein level.
•Advantages of searching protein sequences
Faster. More sensitive. More biologically relevant.
•Disadvantages
Not applicable to non-coding DNA (promotors, introns, etc)
Bioinformatics
• Sequence alignment• Database searching• Significance, e-values• Trees• Gene ontology• Protein Structure
25
How significant is that?
Please give me a number for...
...how likely the data would not have been the result of chance,...
...as opposed to...
...a specific inference. Thanks.
Dayhoff's randomization experiment
Aligned scrambled Protein A versus scrambled Protein B
100 times (re-scrambling each time).
NOTE: scrambling does not change the AA composition!
Results: A Normal Distributionsignificance of a score is measuredas the probability of getting this score in a random alignment
score
freq
Lippman's randomization experimentAligned Protein A to 100 natural sequences, not scrambled.
Results: A wider normal distribution (Std dev = ~3 times larger)WHY? Because natural sequences are different than random.
Even unrelated sequences have similar local patterns, and uneven amino acid composition.
Lippman got a similar result if he randomized the sequences by words instead of letters.
Was the significance over-estimated using Dayhoff's method?
score
freq
P(S > x)E(M) gives us the expected length of the longest number of matches in a row. But, what we really want is the answer to this question:
How good is the score x? (i.e. how significant)
So, we need to model the whole distribution of chance scores, then ask how likely is it that my score or greater comes from that model.
score
freq
A normal distribution
Suppose you had a Gaussian distribution “dart-board”. You throw 1000 darts randomly. Score your darts according the number on the X-axis where it lands. What is the probability distribution of scores? Answer:The same Gaussian distribution! (duh)
Extreme values from a normal distribution
What if we throw 10 darts at a time and keep only the highest-scoring dart (extreme value)? What is the distribution of the extreme values?
The Extreme Value Distribution
Normal distributions (Dayhoff, Lippman) overestimate significance when the scores are extreme values. EVD is the correct null model.
Fitting the EVD to random alignments
log(P(S≥x)) = log(Kmn) - λx
• Generate a large number of known false alignment scores S, (all alignments with the same two lengths m and n), • Plot log(P(S≥x)) versus x , fit to a line!
Estimated P (integral of the EVD): P(S≥x) ≈ Kmne-λx
Taking the log,
x
x x
xx
x
x
x
x x x xx
xxx x xxx
x
logP
(S≥x
)
The slope is −λ, the intercept is log(Kmn). Now we can calculate P for any score x.
where K=constant, m=size of database, n=length of sequence, λ=constant
Pop-quiz
You did a BLAST search using a sequence that has absolutely no homologs in the database. Absolutely none.
The BLAST search gave you false “hits” with the top e-values ranging from 0 to 20. You look at them and you notice a pattern in the e-values.
How many of your hits have e-value ≤ 10.?
Bioinformatics
• Sequence alignment• Database searching• Significance, e-values• Trees• Gene ontology• Protein Structure
35
Evolutionary time
A
B
C
D
11
1
6
3
5
genetic change
A
B
C
D
time
A
B
C
D
no meaning
Cladogram Phylogram Ultrametric tree
(D:5,(A:1,(C:1,B:6):1):3)parenthesis (notation can have both labels and distances.
A multiple sequence alignment is made using many pairwise sequence alignments
Multiple Sequence Alignment
Multiple sequence alignment
1. align all pairs2. pairwise align two most similar first3. align next most similar4. repeat until all sequences are aligned
38
A G H I . W W P FA G H I I F W P Y
AWPY
S(P,[W,F]) =(1/2)(S(P,W) + S(P,F))
Construct a distance-based tree
97 8177
82 59 3280 55 3190 65 40
61 4233
ABCDEF
A B C D E F ABCDEF
Draw tree heredistances
CLUSTALW
• Start with unrooted tree, using Neighbor joining.
• choose root to get guide tree• progressive alignment
– matches are scored using sequence weights– gaps are position dependent
• GOP lower for polar residues• GOP zero where there is already a gap
40
JD Thompson, DG Higgins, TJ Gibson - Nucleic acids research, 1994
CLUSTALW Position specific gap penalty
41
Parsimony: Finding the tree with the minimum number of mutations
Given a tree and a set of taxa, one-letter each (1) choose optional characters for each ancestor. (2) Select the root character that minimizes the number of mutations by selecting each and propagating it through the tree.
T
C
T
C
C
T/C
T/C
T/C
T/C/C C
C
C
C
T
C
T
C
C
minimum 2 mutations minimum 1 mutation
Columns in a MSA have a common evolutionary history
By aligning the sequences, we assert that the aligned residues in each column had a common ancestor.
Orthologs/paralogsOrthologs: homologs originating from a speciation event
Paralogs: homologs originating from a gene duplication event.
clam
duck
crab
fish
clam
A
duck
A
crab
Bfis
h A
duck
B
fish
B
Sequence treecl
am A
crab
B
duck
Adu
ck B
fish
Afis
h B
duplication
speciation
speciationgene loss
Species tree reconciled trees
How do I know it’s a paralog?
• If it’s a paralog, then at some point in evolutionary history, a species existed with two identical genes in it.
• One may have been lost since then. (Descendants are still paralogs!)
• Paralogs can be from different species.
• Paralogous genes have more than the expected sequence divergence.
• Because they are more likely to have different functions
• Because they diverged earlier than the speciation event.
• Without species information or functional information, it’s impossible to tell
Life is not strictly a tree -- horizontal gene transfer
46
BF Smets, T Barkay (2005) “Horizontal gene transfer: perspectives at a crossroads of scientific disciplines” Nature Reviews Microbiology.
Discrete Steps Needed for Stability of Gene TransferStably incorporating horizontally transferred genes into a recipient genome involves five distinct steps (Fig. 1). 1. First, a particular segment of DNA or RNA is prepared for transfer from the donor strain through one of several processes, including excision and circularization of conjugative transposons, initiation of conjugal plasmid transfer by synthesis of a mating pair-formation protein complex, or packaging of nucleic acids into phage virions. 2. Next, the segment is transferred either by conjugation, which requires contact between the donor and recipient cells, or by transformation and transduction without direct contact. 3. During the third step, genetic material enters the recipient cell, where cell exclusion may abort the transfer. 4. Otherwise, during the fourth step, the incoming gene is integrated into the recipient genome by legitimate or sitespecific recombination or by plasmid circularization and complementary strand
synthesis. Barriers to transfer during this step come from restriction modification systems, failure to integrate and replicate within the new host genome, and incompatibility with resident plasmids. 5. In the final step, transferred genes are replicated as part of the recipient genome and transmitted to daughter cells in stable fashion over successive generations. Researchers from different disciplines tend to focus on specific stages within this five-step sequence. Thus, evolutionary biologists who examine microbial genomes for evidence of past transfers tend to look at HGTs from the perspective of step five. Molecular biologists are more likely to examine the details of the transfer events, while microbial ecologists look more broadly when they describe the magnitude and diversity of the mobile gene pool, sometimes called the mobilome.
“Boot strap analysis”
• A method to validate a phylogenetic tree, branchpoint by branchpoint.
• Requires a means to generate independent trees. (For example trees generated from different regions of the mitochondrial genome.)
• Choose the representative tree as the ‘parent’. Calculate the following:
For each branchpoint in the parent tree, For each tree, ask Is there a branchpoint having the same subclade contents (i.e. same taxa, any order)Bootstrap value = number of trees having the branchpoint / total trees.
Comparing branchpoints
A B C D E E DBCA AB C D E E DCAB
B A C D EE DABBB A A EE CCC DD
= P((A,B),C) = 5/8For each branchpoint in the parent tree, For each tree, ask Is there a branchpoint having the same subclade contents (i.e. same taxa, any order)Bootstrap value = number of trees having the branchpoint / total trees.
Bioinformatics
• Sequence alignment• Database searching• Significance, e-values• Trees• Gene ontology• Protein Structure
49
Ontology
• Ontologies relate facts to knowledge• facts
– may be known/unknown/little known– not attached to knowers– unchanging
• Knowledge– attached to knower– may disappear
50
Gene Ontology
- Gene annotation system- Controlled vocabulary that can be
applied to all organisms- Used to describe gene products
What is the Gene Ontology?
A (part of the) solution:
- A controlled vocabulary that can be applied to all organisms
- Used to describe gene products - proteins and RNA - in any organism
GO: Three ontologies
Where does it act?
What processes is it involved in?
What does it do? Molecular Function
Cellular Component
Biological Process
gene product
Cellular Component
• where a gene product acts
Mitochondrial membraneCellular Component
Biological Process
GluconeogenesisBiological Process
Molecular Function
• A single reaction or activity, not a gene product
• A gene product may have several functions• Sets of functions make up a biological
process
Molecular Function
hexose kinase
Filter queries by organism, data source or evidence
Search for GO terms or by Gene symbol/name
Querying the GO
• Access gene product functional information
• Find how much of a proteome is involved in a process/ function/ component in the cell
• Map GO terms and incorporate manual annotations into own databases
• Provide a link between biological knowledge and …
• gene expression profiles
• proteomics data
What can scientists do with GO?
attacked
time
control
Puparial adhesionMolting cyclehemocyanin
Defense responseImmune responseResponse to stimulusToll regulated genesJAK-STAT regulated genes
Immune responseToll regulated genes
Amino acid catabolismLipid metobolism
Peptidase activityProtein catabloismImmune response
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
…analysis of high-throughput data according to GOMicroArray data analysis
AnalysisofFunc.onalAnnota.on–DownregulatedGenes
Figuremodifiedfromh;p://en.wikipedia.org/wiki/Image:Microarray‐schema.jpg
Notreatment Treatedcells
Selec%ngmicroarraysubsetsbasedonGOrevealsdrugtarget
courtesyofShabanaShabeer,AlbertEinsteingSchoolofMedicine
Bioinformatics
• Sequence alignment• Database searching• Significance, e-values• Trees• Gene ontology• Correlation• Protein structure
64
65
Trioxx non-users
Trioxx users
RLS
no RLS
4 0
12 16
€
ui − u( ) ri − r( )∑ui − u( )2 ri − r( )2∑∑
Biology Grad Core course: Discussion Topic Merck Smith-Kline was the author on a study of Trioxx, an anti-inflammatory drug used to treat arthritis, for which it was know to be effective. The study followed over 500 long-time Trioxx users and an equal number of control subjects who had never used the drug. Dr. Smith-Kline was looking for correlations between the use of Trioxx and the incidence of any disease other than arthritis, in any demographic group. He noted in the study that Tunisian Americans, in the age range from 45-55, male or female, and who had been a vegetarian for more than 6 months at any time in their lives, had a "strong negative correlation" between the use of Trioxx and the incidence of restless leg syndrome (RLS), and began touting Trioxx as an effective anti-RLS drug.
The numbers were as follows:
Total Tunisian American vegetarians age 45-55 : 32Total Tunisian American vegetarians age 45-55 Trioxx users : 16Total Tunisian American vegetarians age 45-55 Trioxx non-users : 16Total Tunisian American vegetarians age 45-55 who have RLS : 4Total Tunisian American vegetarians age 45-55 who do not have RLS : 28
Dr. Smith-Kline correctly calculated the correlation between Trioxx and RLS as follows:Corr =, where ui = 1 if subject i is a user, and 0 otherwise, ri is 1 if the subject has RLS and 0 otherwise.
The sums were carried out over all 32 subjects in the subset, and the resulting correlation was -0.378. This confidence level was cited as 99%, since the p-value for this correlation was 0.01, The sample size of 32 and the uneven distribution of subjects with RLS were taken into account.
The data itself was collected correctly and the calculations were correct, both for the correlation and its confidence. Yet Merck Smith-Kline did something dishonest in this study. What was it and what specific question would you ask him to reveal his dishonesty?
Correlation
x2
(∑ - <x>) y( - <y> )i ii
∑ x( - <x>) y( - <y> )i ii∑
i
2r =
Pearson’s correlations coefficient. Or
Pearson's product moment correlation
Correlation using metric data
Correlation cancels baseline and scale.Correlation is insensitive to non-linear relationships.
Non-linearity is not picked up by correlation
All of these examples have r=0.816Re-sampling will fix some of these.
Correlation Confidence by Resampling
• Start with paired data (x,y), calculate r. Let’s say r=0.511
• Randomly associate x and y values.
• Calculate rran
• Repeat 10000 times.
• Significance is p = number of times rran > 0.511, divided by 10000. (If r is negative, count rran < r)
Resampling exampleThe y-values have been randomly swapped, 10000 times. 10% of the time, r is >=
0.816, therefore p=0.10
0.816
0.816 0.101 -0.321
-0.2670.020
-0.199 -0.331
-0.204
Resampling example
0.816
-0.130
Scrambling randomizes sampling in the gray shaded area. If there is one data point set apart from the others, correlation can be high by chance.
Correlation using Booelan data
0 00011000101
x y0 10011001101
<x>=4/12=0.33
<y>=6/12=0.50
True is assigned 1, False 0.
x \ y 1 0
1 4 0
0 2 6
x
2
(∑ - <x> ) y( - <y> )i ii
∑ x( - <x> ) y( - <y> )i ii∑
i
2r =
r=0.71p=0.025
Bioinformatics
• Sequence alignment• Database searching• Significance, e-values• Trees• Gene ontology• Correlation• Protein structure
73
Photosystem I: 1JB0
Classes of membrane proteins
•Single transmembrane helix
•several transmembrane helices
•beta-barrel or channel
•Anchored by one (not-transmembrane) helix or a covalently attached fatty acid
Photosystem I: Guided tour
Download and display 1JB0.pdb (one jay bee zero)
restrict not protein and not hohcolor cpkDisplay -> ball and stick
select magnesiumlabel %rset fontsize 12set fontstroke 2color labels yellow
Find the pseudo 2-fold axis
How many Mg are there?
What are the residue numbers of the “special pair” of chlorophylls?
Photosystem I : Guided tour
(select the special pair using select XXX or YYY)spacefillselect hetero and not hohlabels offcolor temperature
How are the B-factorsdistributed?
Was NCS 2-fold symmetry enforced during refinement?
Guess what: 2-fold symmetry was not
enforced during evolution!
Which side is more ordered? Chain A or chain B?
Photosystem I : Guided tour
Find the name of the lipid that does not havea phosphate group.
Characterize the environmentof the lipid. Could it have a role in the light harvest process?
Unix shortcut: use grepgrep ^”HETNAM” 1JB0.pdb
select [LMG]restrict selectedcenter selectedselect within (11., [LMG]) and proteinDisplay -> ball_and_stickcolor cpkselect within (11., [LMG]) and ligand
Photosystem I : Guided tourselect within (11., [LMG]) and ligandspacefill 1.5color green select within (11., [LMG]) and *.MGspacefill 1.5color white select within (11., [LMG]) and [PQN]color red select within (11., [LMG]) and solventspacefill 1.0color cyan What is PQN?
How close is it to the nearestmagnesium
Photosystem I : Guided tourrestrict ligandwireframecolor cpkDisplay-> ball and stickselect [CL1] or [CL2]wireframe color greenselect [PQN]color magentaspacefill 1.0select [BCR]color orangespacefill 1.5select *.MGspacefill 1.0color white
Trace the path of the electronsfrom the special pair to the twoquinones.
Light harvesting complex
Are any of the pigmentsconnected to the special pair?
Photosystem I : Guided tourrestrict [PQN]spacefill color cpkselect within (11.,[PQN]) and proteinwireframe 0.5color cpkselect within (11.,[PQN]) and ligand and not [PQN]color greenwireframe 0.5select within (11.,[PQN]) and solvent spacefill 0.6color cyan
Which quinone is more loosely-bound
Environment of the quinones
How does the electron getfrom one quinone to theother? What protein sidechainforms a bridge?
Top Related