Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence...

Bioinformatics

• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Protein Structure


Experimental origins of sequence data


Each color is one lane of an electrophoresis gel.

The Sanger dideoxynucleotide method

New technology: Pyrosequencing

• http://www.youtube.com/watch?v=nFfgWGFe0aA&NR=1
• ...or search YouTube for “pyrosequencing”

• Whole genome sequencing in < 1 day!!


AAAGAGATTCTGCTAGCGGTCGG

AGAGATGCTGCAGCGAGTCGGCC

Plant.

Bug.

Aligning two sequences tells us how they are related.

An alignment is a one-to-one association, or a set of one-to-one associations. Aligned sequences are assumed to be homologous (having a common ancestor). Furthermore, aligned positions within the sequences are assumed to have a common ancestor position.

Positions that align in sequence usually align in space

[Figure: a Venn diagram relating positions that have a common ancestor to positions that superimpose in space, illustrated with the sequences TGCTA, TGCAA and TGCTA.]

Simple alignment

• Simple similarity score:

Identity match = 1 point

mismatch = 0 points

gap = -1 points

• Optimal alignment = The highest-scoring alignment given the similarity score.

AAAGAGATTCTGCTAGCGGTCGG

AGAGATGCTGCAGCGAGTCGGCC

Building an alignment starts with a scoring matrix. In its simplest form, a dot plot.

Everything aligned to everything.


An alignment is a path through the scoring matrix, always proceeding to the right and down. (No non-sequential alignments allowed.)

AAAGAGATTCTGCTAGCGGTCGG

AGAGATGCTGCAGCGAGTCGGCC

Unbroken diagonals represent “blocks” of sequence without indels.


blocks

indels

insertion, A

deletion of T

mutation, T->G

The path records, and scores, all mutational events, incl. insertions, deletions, mutations.
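The path idea above can be sketched as a small dynamic-programming routine. This is a minimal illustration that assumes the slide's simple scores (match = 1, mismatch = 0, gap = -1) and a global alignment; the function name `align` is hypothetical, and real tools use substitution matrices such as BLOSUM62 and more elaborate gap penalties.

```python
def align(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming using the slide's simple scores."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace one optimal path back from the bottom-right corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1   # gap in b
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1   # gap in a
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    s, x, y = align("AAAGAGATTCTGCTAGCGGTCGG", "AGAGATGCTGCAGCGAGTCGGCC")
    print(s)
    print(x)
    print(y)
```

Diagonal steps in the traceback correspond to the unbroken "blocks"; horizontal and vertical steps correspond to indels.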

BLOSUM62: protein substitution matrix

PAM250

Protein versus DNA alignments

• Protein alphabet = 20, DNA alphabet = 4.
  – Protein alignment is more informative
  – Less chance of homoplasy with proteins.
  – Homology detectable at greater edit distance

• Better Gold Standard alignments are available for proteins.
  – Better statistics from G.S. alignments.

• On the other hand, DNA alignments are more sensitive to short evolutionary distances.

Are protein alignments better?


Database searching

Why do a database search?

Mol. Bio: Determination of gene function. Primer design.

Pathology, epidemiology, ecology: Determination of species, strain, lineage, phylogeny.

Biophysics: Prediction of RNA or protein structure, effect of mutation.

[Figure: one sequence is searched against databases (GenBank, PIR, Swissprot, GenEMBL, DDBJ) holding lots of sequences.]

Searching millions of sequences

Given a protein or DNA sequence, we want to find all of the sequences in GenBank (over 17 million sequences!!) that have a good alignment score.

Each alignment score should be the optimal score (or a close approximation).

How do we do it?

Fast Database Searching

BLAST (S. Altschul et al.)

First make a set of lookup tables for all 3-letter (protein) or 11-letter (DNA) matches.

Make another lookup table: the locations of all 3-letter words in the database.

Start with a match, extend to the left and right until the score no longer increases.

Very fast. Selective, but not as sensitive as slower search methods (SSEARCH). Reliable statistics. Heuristic, not optimal.
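The lookup-table and extension steps can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it uses exact word matches only (no neighborhood words), ungapped extension that stops at the first mismatch, and hypothetical function names `build_index`, `extend` and `find_hsps`.

```python
from collections import defaultdict


def build_index(target, w):
    """Lookup table: every w-letter word in the target -> list of start positions."""
    index = defaultdict(list)
    for pos in range(len(target) - w + 1):
        index[target[pos:pos + w]].append(pos)
    return index


def extend(query, target, q, t, w):
    """Grow an exact-word seed left and right while the letters keep matching.

    With match = +1 and mismatch = -1 this is the same as extending until the
    score no longer increases (an assumption made for this sketch).
    """
    left = 0
    while q - left > 0 and t - left > 0 and query[q - left - 1] == target[t - left - 1]:
        left += 1
    right = w
    while (q + right < len(query) and t + right < len(target)
           and query[q + right] == target[t + right]):
        right += 1
    return q - left, t - left, left + right   # query start, target start, length


def find_hsps(query, target, w=3):
    """Seed with every w-letter word of the query, extend, and rank by length."""
    index = build_index(target, w)
    hsps = set()
    for q in range(len(query) - w + 1):
        for t in index.get(query[q:q + w], []):
            hsps.add(extend(query, target, q, t, w))
    return sorted(hsps, key=lambda h: -h[2])


if __name__ == "__main__":
    print(find_hsps("XXACGTACYY", "ZZZACGTACWW"))
```

Because the index is built once per database, each query word costs only a dictionary lookup, which is where the speed comes from.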

BLAST, precalculations

[Figure: the 3-tuple PGQ is compared with all 8000 possible 3-tuples (PGQ, PGR, PGS, ..., PGT, PGV, PGW, PGY, PAQ, PCQ, PDQ, PEQ, PFQ, ...); the 50 high-scoring 3-tuples are kept.]

Each 3-tuple is scored against all 8000 possible 3-tuples using BLOSUM. The top-scoring 50 are kept as that 3-tuple’s “neighborhood words”.

BLAST

For every 3-residue window, we get the set of 50 nearest neighbors. Use each word to get identity matches (seeds). Then extend the seed alignments as long as the score increases.

[Figure: query sequence plotted against target sequence; the neighborhood words for each 3-tuple give identity matches (seeds), which are extended into HSPs.]

BLAST

[Figure: HSPs combined into an alignment.]

The best extended seeds are called HSPs (high-scoring segment pairs). The top-scoring HSP is picked first, then the second (as long as it falls "northwest" or "southeast" of the first), and so on.

Other forms of BLAST

BLAST               query             database
blastn              nucleotide        nucleotide
blastp              protein           protein
tblastn             protein           translated DNA
blastx              translated DNA    protein
tblastx             translated DNA    translated DNA
psi-blast           protein, profile  protein
phi-blast           pattern           protein
transitive blast*   any               any

*not really a blast. Just a way of using blast.

Psi-BLAST: Blast with profiles

Psi-BLAST searches the database iteratively.

(Cycle 1) Normal BLAST (with gaps)

(Cycle 2) (a) Construct a profile from the results of Cycle 1.
          (b) Search the database using the profile.

(Cycle 3) (a) Construct a profile from the results of Cycle 2.
          (b) Search the database using the profile.

And so on... (user sets the number of cycles)

Psi-BLAST is much more sensitive than BLAST.

Also more vulnerable to low-complexity sequence.

PHI-BLAST: Pattern-Hit Initiated BLAST


DNA or Protein search?

• Advantages of searching DNA databases

Larger database. Does not assume a reading frame. Can find similarity in non-coding regions (introns, promoter regions). Can find frameshift mutations. Can find pseudogenes.

• Disadvantages

Slower. Not as sensitive. Ignores selective pressure at the protein level.

• Advantages of searching protein sequences

Faster. More sensitive. More biologically relevant.

• Disadvantages

Not applicable to non-coding DNA (promoters, introns, etc.)


How significant is that?

Please give me a number for...

...how likely the data would not have been the result of chance,...

...as opposed to...

...a specific inference. Thanks.

Dayhoff's randomization experiment

Aligned scrambled Protein A versus scrambled Protein B

100 times (re-scrambling each time).

NOTE: scrambling does not change the AA composition!
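A randomization experiment of this kind can be sketched as follows. The sketch assumes simple identity scoring (match = 1, mismatch = 0, gap = -1) rather than the substitution matrix Dayhoff would have used, and the names `nw_score`, `scrambled_scores` and `z_score` are hypothetical.

```python
import random
import statistics


def nw_score(a, b, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score, computed one matrix row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def scrambled_scores(a, b, trials=100, seed=0):
    """Align re-scrambled copies of a and b; scrambling keeps the composition."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        sa = "".join(rng.sample(a, len(a)))
        sb = "".join(rng.sample(b, len(b)))
        scores.append(nw_score(sa, sb))
    return scores


def z_score(a, b, trials=100, seed=0):
    """How many standard deviations the real score lies above the scrambled mean."""
    scores = scrambled_scores(a, b, trials, seed)
    mu, sd = statistics.mean(scores), statistics.stdev(scores)
    return (nw_score(a, b) - mu) / sd


if __name__ == "__main__":
    print(z_score("AAAGAGATTCTGCTAGCGGTCGG", "AGAGATGCTGCAGCGAGTCGGCC"))
```

Reading the z-score as a probability assumes the scrambled scores are normally distributed, which is exactly the assumption the following slides question.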

Results: A Normal Distribution

The significance of a score is measured as the probability of getting this score in a random alignment.

[Plot: frequency versus score.]

Lippman's randomization experiment

Aligned Protein A to 100 natural sequences, not scrambled.

Results: A wider normal distribution (Std dev = ~3 times larger)

WHY? Because natural sequences are different than random.

Even unrelated sequences have similar local patterns, and uneven amino acid composition.

Lippman got a similar result if he randomized the sequences by words instead of letters.

Was the significance over-estimated using Dayhoff's method?

[Plot: frequency versus score.]

P(S > x)

E(M) gives us the expected length of the longest run of matches. But what we really want is the answer to this question:

How good is the score x? (i.e. how significant)

So, we need to model the whole distribution of chance scores, then ask how likely is it that my score or greater comes from that model.

[Plot: frequency versus score.]

A normal distribution

Suppose you had a Gaussian distribution “dart-board”. You throw 1000 darts randomly. Score each dart according to the number on the X-axis where it lands. What is the probability distribution of scores? Answer: The same Gaussian distribution! (duh)

Extreme values from a normal distribution

What if we throw 10 darts at a time and keep only the highest-scoring dart (extreme value)? What is the distribution of the extreme values?

The Extreme Value Distribution

Normal distributions (Dayhoff, Lippman) overestimate significance when the scores are extreme values. EVD is the correct null model.
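The dart-board thought experiment is easy to simulate. The sketch below assumes a standard normal "dart-board" (mean 0, standard deviation 1) and compares single throws with the best of 10; the function names and the threshold 2.0 are illustrative choices.

```python
import random
import statistics


def single_throws(n, rng):
    """n independent darts, each scored by where it lands on the X-axis."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def best_of(k, n, rng):
    """n rounds; each round throws k darts and keeps only the highest score."""
    return [max(rng.gauss(0.0, 1.0) for _ in range(k)) for _ in range(n)]


def tail(scores, x):
    """Empirical P(S >= x)."""
    return sum(s >= x for s in scores) / len(scores)


rng = random.Random(1)
singles = single_throws(10000, rng)
maxima = best_of(10, 10000, rng)

if __name__ == "__main__":
    print("mean of single throws:", statistics.mean(singles))
    print("mean of best-of-10:   ", statistics.mean(maxima))
    print("P(S >= 2), singles:   ", tail(singles, 2.0))
    print("P(S >= 2), best-of-10:", tail(maxima, 2.0))
```

A score that looks rare under the single-throw distribution is much less rare among extreme values, which is why judging an optimal alignment score against a normal null model overstates its significance.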

Fitting the EVD to random alignments

Estimated P (integral of the EVD):

$P(S \ge x) \approx K m n \, e^{-\lambda x}$

where K = constant, m = size of database, n = length of sequence, λ = constant.

Taking the log,

$\log P(S \ge x) = \log(Kmn) - \lambda x$

• Generate a large number of known false alignment scores S (all alignments with the same two lengths m and n).
• Plot log(P(S≥x)) versus x, fit to a line!

[Plot: log P(S≥x) versus x; the points scatter about a straight line.]

The slope is −λ, the intercept is log(Kmn). Now we can calculate P for any score x.
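The line fit can be sketched with ordinary least squares. This is a minimal illustration with hypothetical function names; in practice only the high-scoring tail would be fitted, since the exponential form is an approximation that holds for small P.

```python
import math


def empirical_tail(scores):
    """Return (x, log P(S >= x)) for each distinct score in a list of null scores."""
    n = len(scores)
    points = []
    for x in sorted(set(scores)):
        p = sum(s >= x for s in scores) / n
        points.append((x, math.log(p)))
    return points


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def evd_parameters(points):
    """slope = -lambda, intercept = log(Kmn)."""
    slope, intercept = fit_line(points)
    return -slope, math.exp(intercept)


def p_value(x, lam, kmn):
    """P(S >= x) ~ Kmn * exp(-lambda * x)."""
    return kmn * math.exp(-lam * x)


if __name__ == "__main__":
    # Synthetic points lying exactly on a line, used only to exercise the fit.
    pts = [(x, math.log(50.0) - 0.3 * x) for x in range(20, 40)]
    lam, kmn = evd_parameters(pts)
    print(lam, kmn, p_value(45, lam, kmn))
```

Once λ and Kmn are known, a P value for any new score needs no further simulation.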

Pop-quiz

You did a BLAST search using a sequence that has absolutely no homologs in the database. Absolutely none.

The BLAST search gave you false “hits” with the top e-values ranging from 0 to 20. You look at them and you notice a pattern in the e-values.

How many of your hits have e-value ≤ 10?


Evolutionary time

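[Figure: the same four taxa (A, B, C, D) drawn three ways. Cladogram: branch lengths have no meaning. Phylogram: branch lengths show genetic change (lengths 1, 1, 1, 6, 3, 5). Ultrametric tree: branch lengths show time.]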

(D:5,(A:1,(C:1,B:6):1):3)

Parenthesis notation can have both labels and distances.
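To make the notation concrete, here is a small sketch that reads such a string and sums branch lengths from the root to each leaf. The parser handles only the plain form shown on the slide (labels, colons, numbers, commas, parentheses); `parse_newick` and `root_to_leaf` are hypothetical names, not a library API.

```python
def parse_newick(text):
    """Parse parenthesis notation into nested (label, length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                          # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                      # consume ","
                children.append(node())
            pos += 1                          # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos]               # empty for unlabeled branchpoints
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (label, length, children)

    return node()


def root_to_leaf(tree, so_far=0.0, out=None):
    """Sum the branch lengths along the path from the root to every leaf."""
    out = {} if out is None else out
    label, length, children = tree
    if not children:
        out[label] = so_far + length
    for child in children:
        root_to_leaf(child, so_far + length, out)
    return out


if __name__ == "__main__":
    tree = parse_newick("(D:5,(A:1,(C:1,B:6):1):3)")
    print(root_to_leaf(tree))
```

In a phylogram these root-to-leaf sums differ between taxa; in an ultrametric tree they would all be equal.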

Multiple sequence alignment

A multiple sequence alignment is made using many pairwise sequence alignments.

1. Align all pairs.
2. Pairwise align the two most similar sequences first.
3. Align the next most similar.
4. Repeat until all sequences are aligned.

A G H I . W W P F
A G H I I F W P Y

AWPY

S(P,[W,F]) =(1/2)(S(P,W) + S(P,F))

Construct a distance-based tree

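[Figure: pairwise distance matrix for sequences A-F, with space to draw the tree. Matrix entries in the order printed on the slide: 97, 81, 77, 82, 59, 32, 80, 55, 31, 90, 65, 40, 61, 42, 33.]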

CLUSTALW

• Start with unrooted tree, using Neighbor joining.
• Choose root to get guide tree
• Progressive alignment
  – matches are scored using sequence weights
  – gaps are position dependent
    • GOP (gap opening penalty) lower for polar residues
    • GOP zero where there is already a gap

JD Thompson, DG Higgins, TJ Gibson - Nucleic acids research, 1994

CLUSTALW Position specific gap penalty


Parsimony: Finding the tree with the minimum number of mutations

Given a tree and a set of taxa, one letter each: (1) choose optional characters for each ancestor; (2) select the root character that minimizes the number of mutations by selecting each and propagating it through the tree.

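[Figure: two trees over the same leaf characters (T, C, T, C, C). Ancestors with a free choice are labeled T/C. One tree needs a minimum of 2 mutations, the other a minimum of 1 mutation.]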
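The mutation count for a given tree can be sketched with Fitch's bottom-up set rule, a standard small-parsimony technique that is swapped in here for illustration; the slide describes the procedure more loosely. The two tree shapes below are hypothetical arrangements of the leaves T, C, T, C, C, not necessarily the ones drawn on the slide.

```python
def fitch(tree):
    """Return (candidate_states, min_mutations) for a binary tree.

    Leaves are one-letter strings; internal nodes are (left, right) tuples.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:
        # Children agree on at least one character: no mutation needed here.
        return common, left_cost + right_cost
    # Children disagree: keep both options (e.g. T/C) and count one mutation.
    return left_set | right_set, left_cost + right_cost + 1


if __name__ == "__main__":
    tree_1 = (((("T", "C"), "T"), "C"), "C")     # hypothetical topology
    tree_2 = ((("T", "T"), "C"), ("C", "C"))     # hypothetical topology
    print(fitch(tree_1))
    print(fitch(tree_2))
```

Running the same leaves through different topologies and keeping the tree with the fewest mutations is the core of a parsimony search.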

Columns in a MSA have a common evolutionary history

By aligning the sequences, we assert that the aligned residues in each column had a common ancestor.

Orthologs/paralogs

Orthologs: homologs originating from a speciation event.

Paralogs: homologs originating from a gene duplication event.

[Figure: a species tree (clam, duck, crab, fish) and a sequence tree of gene copies (clam A, crab B, duck A, duck B, fish A, fish B) combined into reconciled trees, with duplication, speciation and gene loss events marked.]

How do I know it’s a paralog?

• If it’s a paralog, then at some point in evolutionary history, a species existed with two identical genes in it.

• One may have been lost since then. (Descendants are still paralogs!)

• Paralogs can be from different species.

• Paralogous genes have more than the expected sequence divergence.

• Because they are more likely to have different functions

• Because they diverged earlier than the speciation event.

• Without species information or functional information, it’s impossible to tell.

Life is not strictly a tree -- horizontal gene transfer


BF Smets, T Barkay (2005) “Horizontal gene transfer: perspectives at a crossroads of scientific disciplines” Nature Reviews Microbiology.

Discrete Steps Needed for Stability of Gene Transfer

Stably incorporating horizontally transferred genes into a recipient genome involves five distinct steps (Fig. 1).

1. First, a particular segment of DNA or RNA is prepared for transfer from the donor strain through one of several processes, including excision and circularization of conjugative transposons, initiation of conjugal plasmid transfer by synthesis of a mating pair-formation protein complex, or packaging of nucleic acids into phage virions.

2. Next, the segment is transferred either by conjugation, which requires contact between the donor and recipient cells, or by transformation and transduction without direct contact.

3. During the third step, genetic material enters the recipient cell, where cell exclusion may abort the transfer.

4. Otherwise, during the fourth step, the incoming gene is integrated into the recipient genome by legitimate or site-specific recombination or by plasmid circularization and complementary strand synthesis. Barriers to transfer during this step come from restriction modification systems, failure to integrate and replicate within the new host genome, and incompatibility with resident plasmids.

5. In the final step, transferred genes are replicated as part of the recipient genome and transmitted to daughter cells in stable fashion over successive generations.

Researchers from different disciplines tend to focus on specific stages within this five-step sequence. Thus, evolutionary biologists who examine microbial genomes for evidence of past transfers tend to look at HGTs from the perspective of step five. Molecular biologists are more likely to examine the details of the transfer events, while microbial ecologists look more broadly when they describe the magnitude and diversity of the mobile gene pool, sometimes called the mobilome.

“Bootstrap analysis”

• A method to validate a phylogenetic tree, branchpoint by branchpoint.

• Requires a means to generate independent trees. (For example trees generated from different regions of the mitochondrial genome.)

• Choose the representative tree as the ‘parent’. Calculate the following:

For each branchpoint in the parent tree,
    For each tree, ask:
        Is there a branchpoint having the same subclade contents (i.e. same taxa, any order)?
    Bootstrap value = number of trees having the branchpoint / total trees.

Comparing branchpoints

[Figure: eight trees over the taxa A, B, C, D, E, each compared with the parent tree.]

P((A,B),C) = 5/8
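The counting rule can be sketched by reducing every branchpoint to the set of taxa below it, so that order within a subclade does not matter. Trees are written as nested tuples, and the four example trees are hypothetical (the slide's eight trees are not reproduced here); `clades` and `bootstrap_values` are illustrative names.

```python
def clades(tree):
    """Return the set of clades (frozensets of taxa) below every branchpoint."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def bootstrap_values(parent, trees):
    """For each branchpoint of the parent: trees having it / total trees."""
    tree_clades = [clades(t) for t in trees]
    return {
        clade: sum(clade in tc for tc in tree_clades) / len(trees)
        for clade in clades(parent)
    }


if __name__ == "__main__":
    parent = ((("A", "B"), "C"), ("D", "E"))
    trees = [
        parent,
        ((("B", "A"), "C"), ("E", "D")),   # same branchpoints, different order
        ((("A", "C"), "B"), ("D", "E")),   # A groups with C instead of B
        (("A", "B"), ("C", ("D", "E"))),   # C groups with D and E
    ]
    for clade, value in bootstrap_values(parent, trees).items():
        print(sorted(clade), value)
```

Using sets of taxa as the comparison key is what implements "same taxa, any order".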


Ontology

• Ontologies relate facts to knowledge
• Facts
  – may be known/unknown/little known
  – not attached to knowers
  – unchanging
• Knowledge
  – attached to knower
  – may disappear

Gene Ontology

What is the Gene Ontology?

A (part of the) solution:

- A gene annotation system

- A controlled vocabulary that can be applied to all organisms

- Used to describe gene products - proteins and RNA - in any organism

GO: Three ontologies

For a gene product:

Molecular Function: What does it do?

Cellular Component: Where does it act?

Biological Process: What processes is it involved in?

Cellular Component

• where a gene product acts
• Example: mitochondrial membrane

Biological Process

• Example: gluconeogenesis

Molecular Function

• A single reaction or activity, not a gene product
• A gene product may have several functions
• Sets of functions make up a biological process
• Example: hexose kinase

Querying the GO

• Search for GO terms or by gene symbol/name
• Filter queries by organism, data source or evidence

What can scientists do with GO?

• Access gene product functional information
• Find how much of a proteome is involved in a process/function/component in the cell
• Map GO terms and incorporate manual annotations into own databases
• Provide a link between biological knowledge and …
  • gene expression profiles
  • proteomics data

…analysis of high-throughput data according to GO

[Figure: gene expression over time in attacked versus control samples, with gene clusters labeled by GO term: puparial adhesion, molting cycle, hemocyanin; defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes; immune response, Toll regulated genes; amino acid catabolism, lipid metabolism; peptidase activity, protein catabolism, immune response.]

Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

MicroArray data analysis

Analysis of Functional Annotation – Downregulated Genes

[Figure: microarray schema comparing no treatment with treated cells. Figure modified from http://en.wikipedia.org/wiki/Image:Microarray-schema.jpg]

Selecting microarray subsets based on GO reveals drug target

courtesy of Shabana Shabeer, Albert Einstein School of Medicine

Bioinformatics

• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Correlation
• Protein structure

Biology Grad Core course: Discussion Topic

Merck Smith-Kline was the author on a study of Trioxx, an anti-inflammatory drug used to treat arthritis, for which it was known to be effective. The study followed over 500 long-time Trioxx users and an equal number of control subjects who had never used the drug. Dr. Smith-Kline was looking for correlations between the use of Trioxx and the incidence of any disease other than arthritis, in any demographic group. He noted in the study that Tunisian Americans, in the age range from 45-55, male or female, and who had been a vegetarian for more than 6 months at any time in their lives, had a "strong negative correlation" between the use of Trioxx and the incidence of restless leg syndrome (RLS), and began touting Trioxx as an effective anti-RLS drug.

The numbers were as follows:

Total Tunisian American vegetarians age 45-55 : 32
Total Tunisian American vegetarians age 45-55 Trioxx users : 16
Total Tunisian American vegetarians age 45-55 Trioxx non-users : 16
Total Tunisian American vegetarians age 45-55 who have RLS : 4
Total Tunisian American vegetarians age 45-55 who do not have RLS : 28

            Trioxx non-users    Trioxx users
RLS                 4                 0
no RLS             12                16

Dr. Smith-Kline correctly calculated the correlation between Trioxx and RLS as follows:

$\mathrm{Corr} = \frac{\sum_i (u_i - \bar{u})(r_i - \bar{r})}{\sqrt{\sum_i (u_i - \bar{u})^2 \sum_i (r_i - \bar{r})^2}}$

where $u_i$ = 1 if subject i is a user and 0 otherwise, and $r_i$ = 1 if the subject has RLS and 0 otherwise.

The sums were carried out over all 32 subjects in the subset, and the resulting correlation was -0.378. The confidence level was cited as 99%, since the p-value for this correlation was 0.01. The sample size of 32 and the uneven distribution of subjects with RLS were taken into account.

The data itself was collected correctly and the calculations were correct, both for the correlation and its confidence. Yet Merck Smith-Kline did something dishonest in this study. What was it and what specific question would you ask him to reveal his dishonesty?
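The quoted correlation can be checked directly from the four cell counts. The sketch below expands the 2x2 table into 32 pairs of 0/1 values and applies the product-moment formula; the helper name `pearson` and the variable names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson's product-moment correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


# (user, has_RLS) -> number of subjects, taken from the table of counts
counts = {(1, 1): 0, (1, 0): 16, (0, 1): 4, (0, 0): 12}

u, r = [], []
for (ui, ri), n in counts.items():
    u += [ui] * n
    r += [ri] * n

corr = pearson(u, r)

if __name__ == "__main__":
    print(len(u), round(corr, 3))
```

This confirms only the arithmetic; the discussion question concerns how the subgroup was chosen, not the calculation.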

Correlation

$r = \frac{\sum_i (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{\sqrt{\sum_i (x_i - \langle x \rangle)^2 \sum_i (y_i - \langle y \rangle)^2}}$

Pearson’s correlation coefficient, or Pearson's product-moment correlation.

Correlation using metric data

Correlation cancels baseline and scale.

Correlation is insensitive to non-linear relationships.

Non-linearity is not picked up by correlation

All of these examples have r = 0.816.

Re-sampling will fix some of these.

Correlation Confidence by Resampling

• Start with paired data (x,y), calculate r. Let’s say r = 0.511.

• Randomly associate x and y values.

• Calculate r_ran.

• Repeat 10000 times.

• Significance is p = number of times r_ran > 0.511, divided by 10000. (If r is negative, count r_ran < r.)
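The resampling steps can be sketched as below. The function names are hypothetical, the random seed is fixed only to make runs repeatable, and the comparison counts ties (r_ran >= r), which is an assumption that follows the worked example on the next slide rather than the strict inequality in the list.

```python
import math
import random


def pearson(xs, ys):
    """Pearson's product-moment correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


def resampling_p(xs, ys, trials=10000, seed=0):
    """Return (r, p): the observed r and the fraction of shuffles at least as extreme."""
    rng = random.Random(seed)
    r = pearson(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)                 # randomly re-associate x and y
        r_ran = pearson(xs, shuffled)
        if (r >= 0 and r_ran >= r) or (r < 0 and r_ran <= r):
            hits += 1
    return r, hits / trials


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    ys = [2.1, 1.9, 3.5, 3.0, 5.2, 4.8, 6.1, 7.4]   # made-up illustrative data
    print(resampling_p(xs, ys))
```

Shuffling y keeps both marginal distributions intact, so the null model differs from the data only in how x and y are paired.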

Resampling example

The y-values have been randomly swapped 10000 times. 10% of the time, r is >= 0.816, therefore p = 0.10.

[Figure: the original data (r = 0.816) beside resampled data sets with r = 0.101, -0.321, -0.267, 0.020, -0.199, -0.331 and -0.204.]

Resampling example

[Figure: the original data (r = 0.816) and a resampled data set (r = -0.130), with a gray shaded area.]

Scrambling randomizes sampling in the gray shaded area. If there is one data point set apart from the others, correlation can be high by chance.

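Correlation using Boolean data

True is assigned 1, False 0.

x: 0 0 0 0 1 1 0 0 0 1 0 1
y: 0 1 0 0 1 1 0 0 1 1 0 1

<x> = 4/12 = 0.33

<y> = 6/12 = 0.50

x \ y    1    0
1        4    0
0        2    6

$r = \frac{\sum_i (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{\sqrt{\sum_i (x_i - \langle x \rangle)^2 \sum_i (y_i - \langle y \rangle)^2}}$

r = 0.71, p = 0.025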
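For 0/1 data the product-moment formula reduces to a shortcut on the 2x2 table, often called the phi coefficient. The sketch below rebuilds the table from the twelve paired values on the slide (as read from the two columns) and computes r that way; the variable names are illustrative.

```python
import math
from collections import Counter

# The twelve paired Boolean observations, True = 1 and False = 0
x = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1]
y = [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1]

# Contingency table: how often each (x, y) combination occurs
table = Counter(zip(x, y))
a, b = table[(1, 1)], table[(1, 0)]
c, d = table[(0, 1)], table[(0, 0)]

# Pearson's r for two 0/1 variables, written in terms of the table cells
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

if __name__ == "__main__":
    print(a, b, c, d, round(phi, 2))
```

The shortcut gives the same value as applying the full formula to the twelve pairs, which makes it convenient when only the table of counts is reported.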


Photosystem I: 1JB0

Classes of membrane proteins

•Single transmembrane helix

•several transmembrane helices

•beta-barrel or channel

•Anchored by one (not-transmembrane) helix or a covalently attached fatty acid

Photosystem I: Guided tour

Download and display 1JB0.pdb (one jay bee zero)

restrict not protein and not hoh
color cpk
Display -> ball and stick

select magnesium
label %r
set fontsize 12
set fontstroke 2
color labels yellow

Find the pseudo 2-fold axis

How many Mg are there?

What are the residue numbers of the “special pair” of chlorophylls?

Photosystem I : Guided tour

(select the special pair using select XXX or YYY)
spacefill
select hetero and not hoh
labels off
color temperature

How are the B-factors distributed?

Was NCS 2-fold symmetry enforced during refinement?

Guess what: 2-fold symmetry was not enforced during evolution!

Which side is more ordered? Chain A or chain B?

Photosystem I : Guided tour

Find the name of the lipid that does not have a phosphate group.

Characterize the environment of the lipid. Could it have a role in the light harvest process?

Unix shortcut: use grep

grep "^HETNAM" 1JB0.pdb

select [LMG]
restrict selected
center selected
select within (11., [LMG]) and protein
Display -> ball_and_stick
color cpk
select within (11., [LMG]) and ligand

Photosystem I : Guided tour

select within (11., [LMG]) and ligand
spacefill 1.5
color green
select within (11., [LMG]) and *.MG
spacefill 1.5
color white
select within (11., [LMG]) and [PQN]
color red
select within (11., [LMG]) and solvent
spacefill 1.0
color cyan

What is PQN?

How close is it to the nearest magnesium?

Photosystem I : Guided tour

restrict ligand
wireframe
color cpk
Display -> ball and stick
select [CL1] or [CL2]
wireframe
color green
select [PQN]
color magenta
spacefill 1.0
select [BCR]
color orange
spacefill 1.5
select *.MG
spacefill 1.0
color white

Trace the path of the electrons from the special pair to the two quinones.

Light harvesting complex

Are any of the pigments connected to the special pair?

Photosystem I : Guided tour

restrict [PQN]
spacefill
color cpk
select within (11.,[PQN]) and protein
wireframe 0.5
color cpk
select within (11.,[PQN]) and ligand and not [PQN]
color green
wireframe 0.5
select within (11.,[PQN]) and solvent
spacefill 0.6
color cyan

Which quinone is more loosely bound?

Environment of the quinones

How does the electron get from one quinone to the other? What protein sidechain forms a bridge?
