BIOL3014 Review Advanced Bioinformatics. Protein Structure.

BIOL3014 Review

Advanced Bioinformatics

Protein Structure

Proteins are linear polymers that fold up by themselves…mostly.

The amino acids

They can be grouped in many ways according to the chemical and physical properties (e.g. size) of the side chain.

Here is one grouping based on chemical properties:

• Basic: proton acceptors
• Acidic: proton donors
• Uncharged polar: have polar groups like CONH2 or CH2OH
• Nonpolar: tend to be hydrophobic
• Weird: proline links to the N in the main chain
• Strong: cysteine can make “disulphide bridges”

Protein Secondary Structure

Alpha Helix

• 3.6 amino acids (residues) per turn

• The backbone O of residue i hydrogen-bonds to the backbone N of residue i+4: O(i) to N(i+4)

From book…correct?

Wikipedia

Beta Sheet

A. Three strands shown

B. Anti-parallel sheet

C. Parallel sheet

Sheets are usually curved and can even form barrels.

Beta Turns: getting around tight corners

• Steric hindrance determines whether a tight turn is possible

• R3’s side chain is usually Hydrogen (R3 is glycine)

X-ray crystallography

• Needs crystallized proteins

• Hard to get crystals
• Very tough for hydrophobic (e.g. transmembrane) proteins

• Better accuracy than NMR

• Expensive: $100,000/protein

NMR spectroscopy

• Protons resonate at a frequency that depends on their chemical environment.

• This can be used to determine structure.

• Does not require crystallization; protein may be in solution.

• Lower resolution than X-ray crystallography

Obtaining secondary structure from sequence

Predict what?

• There are many types of secondary structure.
• Which do we want to predict?
– Alpha helix
– Beta strand
– Beta turn
– Random coil
– Pi-helices
– 3₁₀-helices
– Type I turns
– …

Start with some proteins of known structure

• Get some good X-ray or NMR models of proteins.

• Since we know their tertiary structures, certainly we can assign each residue in each protein a secondary state.

• Or can we?

DSSP to the rescue!

• In 1983 Kabsch and Sander introduced DSSP (Dictionary of Protein Secondary Structure; the acronym order is not a typo).

• It automated the assignment of secondary structure from tertiary structure to make it less arbitrary.

Rules

• Chou-Fasman: created tables of breaking/forming propensity and the relative frequency of each residue type in helices and strands.

• Self information (what the identity of a residue tells you about its likely secondary structure state) is not the only thing we can extract from the known structures.
– Maybe certain residues have a strong influence on (or are strongly correlated with) what the secondary state is several residues away. So, look at “long-distance” relationships:

• Directional information: information about the conformation at position i carried by the residue at position j, where i≠j, independent of the type of residue at position i.

• Pair information: like directional information, but also takes account of the type of residue at position i.

Don’t forget about evolution!

• Sequence evolves faster than structure.
• So, imagine a position in an alpha helix (or other conformation) that recently mutated.
– If we could find the orthologous residue in the same protein in other species, those residues would give us a much better picture.
– So, we should look at the distribution of residues at that position, not just the residue in a particular protein.

PSI-BLAST is often used to get residue distributions

• The simplest way to get an estimate of the distribution of residues at each position in the protein we are trying to predict is to use PSI-BLAST.
– PSI-BLAST will output a “profile” containing an estimate of the residue distribution at each position in the query protein.

– Each column of the profile is a multinomial probability vector.

• The PSI-BLAST profile can be used in place of the protein in prediction rules.

• PSI-BLAST also outputs a multiple alignment, and it, too, can be used in prediction rules.
– You could predict the secondary structure for each protein in the alignment, and choose the “majority” or “average” prediction.
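To make the idea of a per-column residue distribution concrete, here is a minimal sketch that turns a toy multiple alignment into one probability vector per column. It is not how PSI-BLAST itself builds its profile (PSI-BLAST uses sequence weighting and more careful pseudocounts); the function name `column_profile`, the flat pseudocount and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(alignment, pseudocount=1.0):
    """Return one residue probability vector (a dict) per alignment column.

    Assumption: plain counts plus a flat pseudocount; gaps ('-') are skipped.
    """
    profile = []
    for col in range(len(alignment[0])):
        # Count only the twenty standard residues in this column.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

# Hypothetical four-sequence alignment, four columns wide.
toy_alignment = ["ACDE", "ACDF", "AC-E", "GCDE"]
profile = column_profile(toy_alignment)
print(round(profile[1]["C"], 3))  # fully conserved column: C dominates
```

Each entry of `profile` sums to 1, so it can stand in for the single residue at that position when applying a prediction rule.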

Why use HMMs for transmembrane topology?

• Transmembrane proteins have a simple, repetitive topology.

• The topology can be subdivided into a small set of regions.
– Helices
– Inside
– Outside
– Tails/Caps (at ends of helices)

• The helices tend to have lengths in a limited range.

HMM design: Modeling sequences of varying lengths

• Self-loops can model sequences of length 1 to infinity: L = [1,…,infinity]

• Each time through the self-loop generates one more letter.

• This 1-state model generates sequences of length L with probability:

Pr(L) = p^{L-1} (1-p)

• So, you control the length of the sequences (sort of…).

(Figure: a single state with self-loop probability p and exit probability 1-p.)
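The length distribution of a single self-looping state is geometric, Pr(L) = p^(L-1) (1-p), which is why the control over length is only "sort of". The sketch below evaluates that formula and simulates the loop; the function names are made up for this example.

```python
import random

def length_probability(length, p):
    """Pr(L) = p^(L-1) * (1-p) for a single state with self-loop probability p."""
    if length < 1:
        return 0.0
    return p ** (length - 1) * (1 - p)

def expected_length(p):
    """Mean of the geometric length distribution: 1 / (1-p)."""
    return 1.0 / (1.0 - p)

def sample_length(p, rng):
    """Simulate the state: emit one letter, then loop again with probability p."""
    length = 1
    while rng.random() < p:
        length += 1
    return length

rng = random.Random(0)
lengths = [sample_length(0.9, rng) for _ in range(10000)]
print(expected_length(0.9), sum(lengths) / len(lengths))
```

Choosing p fixes the mean length, but the most probable length is always 1, so a chain of several states is needed to model helices with a limited length range.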

Grouping states

• To avoid over-fitting, we want to reduce the number of parameters.
– Each emitting state has nineteen free parameters (one for each amino acid - 1).
• If a group of states is modeling regions with very similar amino acid preferences, why not require that they all use the same parameters?
– If you tie n states together, you “save” 19(n-1) parameters (19n untied versus 19 shared), so the model is less prone to over-fitting when you train it.

Generalization

• We want to know how well a model will generalize to data it has never “seen”.

• If we test (measure accuracy) on the same data we trained on:
– We overestimate the generalization accuracy
– We will tend to over-fit the training data (by adjusting the model design to fit it)

Sample questions

1. Obtaining protein secondary structure
– a. Define the protein secondary structure task.
– b. List five types of secondary structure element.
– c. Describe what is meant by the ideas of “self information”, “directional information” and “pair information” when predicting secondary structure using a sliding-window method.
– d. What is a PSI-BLAST profile and why is it used in secondary structure prediction?

– e. What kinds of proteins are HMMs particularly suited to modeling?

Proteome and Gene Expression Analysis

The Goals

• Functional Genomics:
– To know when, where and how much genes are expressed.
– To know when, where, what kind and how much of each protein is present.
• Systems Biology:
– To understand the transcriptional and translational regulation of RNA and proteins in the cell.

Measuring Gene Expression

• What we want to do is measure the number of copies of each RNA transcript in a cell at a given point in time.
– Extract the RNA from the cell.
– Measure each type of transcript quantitatively.
• How do you measure it?
– Sequence it in a quantitative way
– But sequencing is (used to be) very expensive

• So, use technology and tricks…

Low-throughput Sequencing

• qPCR (quantitative PCR, also called real-time PCR) allows you to accurately measure a given transcript.
– But you have to decide which transcript you want to measure and make primers for it.
– So it is very expensive and low-throughput.

• So the “array technologies” were born…

Gene Arrays

• Put a bunch of different, short single-stranded DNA sequences at predefined positions on a substrate.

• Let the unknown mixture of tagged DNA or RNA molecules hybridize to the DNAs.

• Measure the amount of hybridized material.

Measuring Protein Expression

• In order to measure all the types of protein in a cell we must
– Extract the proteins
– Purify the proteins
– Identify the individual proteins
• How do we accomplish purification and identification of proteins?

The Technologies: Protein Expression

• Low-throughput
– 2D Gel Electrophoresis + Mass Spectrometry
– Liquid Chromatography + Mass Spectrometry
• Protein microarrays
– Limited in application at this point
– Can be used for things other than protein expression, like protein-protein interactions

Separating the Proteins: 2D Gel Electrophoresis

• First step: pI/pH
– Proteins are introduced to a gel with an immobilized pH gradient.
– A voltage is applied.
– Proteins migrate until the pH causes them to lose their net charge (isoelectric point) and then stop.
• Second step: mass
– First gel transferred to second gel
– SDS (detergent) breaks structure and charges the proteins in proportion to their mass.


Steps of Mass Spectrometry

• Digest:
– Sample (spot) is digested with a proteolytic enzyme
• Spectrum:
– Peaks correspond to the mass-to-charge ratio of protein fragments
– These provide a fingerprint
• Identify:
– Compare fingerprint to theoretical fingerprints
– Post-translational modifications screw things up.

Goals

• We’ve measured the expression of genes or proteins using the technologies discussed previously.

• What can we do with that information?
– Identify significant differences in expression
– Identify similar patterns of expression (clustering)

Analysis steps

1. Data normalization

2. Statistical Analysis

3. Cluster Analysis

Data Normalization

• Why normalize?
– Removes systematic errors
– Makes the data easier to analyze statistically

Sources of Error

• Measurements always contain errors.
– Systematic (oops)
– Random (noise!)
• Subtracting the background level can remove some systematic error
– Using the ratio in two-channel experiments does this
– Subtracting the overall average intensity can be used with one-channel data.

• Taking averages over replicates of the experiment reduces the random error.

Statistical Analysis

• Determining what differences in expression are statistically significant

• Controlling false positives

When are two measurements significantly different?

• We want to say that an expression ratio is significant if it is big enough (>1) or small enough (<1).

• A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small.

• The significance is related to the area of the overlap of the underlying distributions.



The Z-test

• If the data is approximately normal, convert it to a Z-score: Z = (X̄ - μ) / (σ / √n)
– X can be the log expression ratio; μ is then 0; σ is the sample standard deviation; n is the number of repeats

• The Z-score is distributed N(0,1) (standard normal).
• The significance level is the area in the tail(s) of the standard normal distribution.
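As a worked illustration, the sketch below computes a Z-score for a set of replicate log expression ratios and the two-sided tail area under the standard normal. It assumes the usual form Z = (mean - μ) / (σ / √n) with μ = 0 under the null hypothesis; the function name `z_test` and the example numbers are invented for this sketch.

```python
import math
import statistics

def z_test(log_ratios, mu=0.0):
    """Return (Z, two-sided p-value) for replicate log expression ratios.

    Assumption: Z = (mean - mu) / (sd / sqrt(n)), with sd the sample
    standard deviation and n the number of repeats.
    """
    n = len(log_ratios)
    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)
    z = (mean - mu) / (sd / math.sqrt(n))
    # Area in both tails of N(0,1) beyond |z|.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Hypothetical log2 ratios from five replicate arrays.
z, p = z_test([1.0, 1.2, 0.8, 1.1, 0.9])
print(z, p)
```

With few replicates the sample standard deviation is itself uncertain, which is one reason the t-test is often preferred in practice.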


The t-test

• The t-test makes fewer assumptions about the data than the Z-test

• It can be applied to compare two average measurements which can have
– Different variances
– Different numbers of observations
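One form of the t-test that allows different variances and different numbers of observations is Welch's t-test. The slides do not say which variant is meant, so treating it as Welch's is an assumption. This sketch computes only the t statistic and the approximate degrees of freedom; turning those into a p-value needs the t distribution, which is left out here.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical expression levels of one gene under two conditions.
treated = [5.1, 4.9, 5.0, 5.2]
control = [4.0, 4.2, 3.9, 4.1, 3.8]
print(welch_t(treated, control))
```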

Cluster Analysis

• Similar expression patterns
– Groups of genes/proteins with similar expression profiles
• Similar expression sub-patterns
– Groups of genes/proteins with similar expression profiles in a subset of conditions

Distance Measures Between Pairs of Points

• In order to cluster the points (genes or conditions), we need some concept of which points are “close” to each other.

• So we need a measure of distance (or, conversely, similarity) between two rows (or columns) in our n by m matrix.

• We can then compute all the pair-wise distances between rows (or columns).

Standard Distance Measures

• Euclidean Distance

• Pearson Correlation Coefficient

• Mahalanobis Distance

Euclidean Distance

• Standard, everyday distance
– Treats all dimensions equally
– If some genes vary more than others (have higher variance), they influence the distance more.

Mahalanobis Distance

• The “normalized” Euclidean distance
• Scales each dimension by the variance in that dimension.
– This is useful if the genes tend to vary much more in one sample than in others since it reduces the effect of that sample on the distances.

Pearson Correlation Coefficient

• Distances are small when two genes have similar patterns of change even if the sizes of the changes are different.

• This is accomplished by centering each gene’s expression levels and scaling them by their sample standard deviation across the different conditions.
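The three measures can be compared on a pair of toy expression profiles. This is a sketch under stated assumptions: the Pearson "distance" is taken as 1 - r, which is one common convention, and `scaled_euclidean` is the simplified per-dimension form of the Mahalanobis distance described above (the full version uses the whole covariance matrix). All function names are illustrative.

```python
import math
import statistics

def euclidean(x, y):
    """Everyday distance: every dimension counts equally."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def scaled_euclidean(x, y, variances):
    """Euclidean distance with each dimension divided by its variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

def pearson_distance(x, y):
    """1 - r, where r is the Pearson correlation coefficient (assumed convention)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Two hypothetical genes with the same pattern but different magnitudes.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
print(euclidean(gene_a, gene_b), pearson_distance(gene_a, gene_b))
```

The two genes are far apart in Euclidean terms but essentially at distance zero under the correlation-based measure, which is the behaviour described for the Pearson coefficient.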

Types of Linkage

• A. Single Linkage
• B. Complete Linkage
• C. Centroid Method
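The slide only names the linkage types, so the definitions in this sketch are the standard textbook ones rather than anything taken from the slide: single linkage uses the closest pair of points between two clusters, complete linkage the farthest pair, and the centroid method the distance between the cluster means. Euclidean distance between points is assumed.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(cluster_1, cluster_2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(p, q) for p in cluster_1 for q in cluster_2)

def complete_linkage(cluster_1, cluster_2):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(p, q) for p in cluster_1 for q in cluster_2)

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]

def centroid_linkage(cluster_1, cluster_2):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(cluster_1), centroid(cluster_2))

# Two hypothetical clusters of two points each.
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (5.0, 2.0)]
print(single_linkage(c1, c2), complete_linkage(c1, c2), centroid_linkage(c1, c2))
```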



Sample Questions

1. Gene expression analysis
– a. What kind of molecules do expression microarrays measure?

– b. Expression microarray data is known to be “noisy”. Describe as many ways as you can of reducing this problem.

– c. What experimental technique is commonly used to validate the results of expression microarrays?

– d. The “Z-test” or “t-test” is usually applied to expression microarray data. Why is this done and what do these tests tell us?

– e. Principal components analysis is often applied to microarray data as well. What is its purpose and what can it tell us?

– f. Name two types of distance measures that can be used with microarray data for clustering expression profiles.

Phylogeny

Overview

• Evolution and sequence variation

• Phylogenetic trees
– The meaning of distance
– Evolutionary sequence models

• Constructing trees

Rooted and Unrooted Trees

• “Leaves” are extant species
• Internal nodes are ancestral species
• Adding a root gives time a direction

But how can two species be at different “evolutionary distances” from their ancestor?


Distance ≠ Time

• The rate of evolution, r, can vary over time.

• The distance is equal to the rate times the time:

d=rt

What is the evolutionary distance between two DNA sequences?

• Align the two DNA sequences.

• Count the number of places where they differ (ignoring gaps)

p = D/L
– D is the number of differences and
– L is the total number of aligned positions
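The counting procedure can be sketched in a few lines. One assumption is made explicit here: because gaps are ignored, L is taken to be the number of aligned columns in which neither sequence has a gap. The function name and the example sequences are made up.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Observed fraction of differing sites, p = D / L, between two aligned sequences.

    Assumption: columns with a gap in either sequence are excluded from both D and L.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)

print(p_distance("ACGTACGT", "ACGTTCGA"))  # 2 differences over 8 sites
print(p_distance("AC-TAC", "ACGTTC"))      # 1 difference over 5 ungapped sites
```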

Is p the evolutionary distance?

• NO!

• p is just the observed fraction of differences.
– What value will p tend towards as evolutionary distance increases?

Relationship between p-distance and d-distance

• So the branch lengths of the tree are “d=rt”.

• We must propose an evolutionary model to compute “d” from the observed p-distance.

Jukes-Cantor Evolutionary Model

• Assumes all base frequencies are ¼

• Has one parameter, α, the substitution rate (per unit time).

• Distance formula: d = -¾ ln(1 - 4⁄3 p)
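A short sketch of the standard Jukes-Cantor correction, d = -¾ ln(1 - 4/3 p), shows how the corrected distance grows faster than the observed p-distance. The function name is illustrative; the guard on p reflects the fact that the logarithm is undefined once p reaches ¾.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4/3 * p) from an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.05, 0.25, 0.50, 0.70):
    print(p, round(jukes_cantor_distance(p), 4))
```

For small p the two distances nearly agree; as p approaches ¾ the estimated d grows without bound, because multiple substitutions at the same site hide more and more of the real change.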

Transitions and Transversions

Kimura Two-Parameter Model

• Models transversions and transitions separately because the former are less common in reality.
– Transitions: A<->G, C<->T
– Transversions: purine<->pyrimidine (A<->C, A<->T, G<->C, G<->T)
– Two parameters: transition rate α, transversion rate β.

• Distance formula:

d = -½ ln(1-2P-Q) - ¼ ln(1-2Q)

where P and Q are the fractions of sites that differ by a transition and by a transversion, respectively.
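The sketch below classifies each differing site as a transition or a transversion and then applies the standard Kimura two-parameter formula, d = -½ ln(1 - 2P - Q) - ¼ ln(1 - 2Q). It assumes the input contains only A, C, G, T and the gap character; the function names and example sequences are illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def transition_transversion_fractions(seq_a, seq_b, gap="-"):
    """Return (P, Q): fractions of ungapped sites differing by a transition / transversion.

    Assumption: sequences contain only A, C, G, T and the gap character.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T stay within one chemical class: transitions.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / compared, transversions / compared

def kimura_distance(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) fractions."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

P, Q = transition_transversion_fractions("AACCGGTTAA", "GACCGGTCAT")
print(P, Q, round(kimura_distance(P, Q), 4))
```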

Constructing Phylogenetic Trees

First, construct a multiple alignment

• A good multiple alignment is key.
• The p-distances between pairs of sequences can then be computed.
• This allows the d-distances between pairs of sequences to be computed.
• Some tree-building methods use the multiple alignment directly
– Parsimony Methods

Tree-building methods

• UPGMA (1958)
– Builds rooted, ultrametric trees
– Assumes constant rate of evolution in all branches
• Neighbor-joining (1987)
– Builds unrooted, additive trees
– Assumes the best tree has the shortest total branch length.
– Principle of minimum evolution, as with maximum parsimony trees.
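To show how a distance-based method turns pairwise d-distances into a tree, here is a minimal sketch of UPGMA as it is commonly described: repeatedly merge the two closest clusters and average their distances to every other cluster, weighted by cluster size. The slides do not spell out the algorithm, so the details are an assumption; the data layout, the function name and the four-taxon matrix are invented for the example, and only the root height is reported.

```python
def upgma(labels, matrix):
    """Cluster with UPGMA; return (tree, root_height).

    labels: leaf names; matrix: symmetric list of lists of d-distances.
    The tree is a nested tuple of labels. Ties are broken arbitrarily.
    """
    n = len(labels)
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (labels[i], 1) for i in range(n)}
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    height = 0.0
    while len(clusters) > 1:
        # Find the closest pair of active clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        height = d_ab / 2.0  # the new node sits halfway between the two clusters
        # Size-weighted average distance from the merged cluster to each other cluster.
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height

# Hypothetical ultrametric distances between four taxa.
labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(labels, matrix))
```

Because the method places every new node halfway between the clusters it joins, all leaves end up equidistant from the root, which is the constant-rate assumption in action.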

Ultrametric Trees

• Simplest type of rooted, additive tree.

• Assumes that the rate of evolution is constant over time.
– With sequences, called the “molecular clock”.

– Horizontal lines have no meaning.

Example Questions

1. Phylogeny
– a. What does adding a root to a tree tell us?
– b. In an additive tree, what is the meaning of branch length?
– c. What is the difference between p-distance and d-distance?
– d. What is the difference between the Jukes-Cantor evolutionary model and the Kimura two-parameter model?

– e. What does the neighbor-joining tree building method minimize?

– f. What kind of trees does it build?
