CISC 841 Bio Informatics Non-additivity in protein–DNA binding R. A. O’Flanagan, G. Paillard, R....

CISC 841 Bio Informatics
Non-additivity in protein-DNA binding
R. A. O'Flanagan, G. Paillard, R. Lavery and A. M. Sengupta
13 April 2006
Presentation: Manoj Pillay (A Review)

Slide 2
Research Orientation: GENOMIC ANNOTATION
Concentration: Identifying Protein Binding Sites
Introduction: brushing up the prerequisite molecular biology basics
DNA-Binding Protein: any protein that binds to double- or single-stranded DNA
Binding Site: a region on a protein, DNA or RNA at which chemical bonds are formed

Slide 3
Protein Classification
- Enzymes
- Structural Proteins
- Receptors, Kinases etc.
- Transcription Factors
A transcription factor is a protein that binds DNA at a specific promoter or enhancer region or site, where it regulates transcription. Transcription factors regulate the production of all proteins.
Classes of Transcription Factors
- General
- Upstream
- Inducible
So, why are we interested in just transcription factors?

Slide 4
Why does this paper exist?
With current technology, it is impossible to predict exactly where transcription factors are going to bind.
Approaches to the problem in the past
- Using PWM
- Using PSSM
- Using HMM
A significant limitation: all these methods assume that the overall binding affinity of a given protein is made up of additive contributions from interactions at each nucleotide position within the binding site.
Experimental results challenge the assumption
- Mnt Repressor
- EGR1 Zn-finger
Observation: correlation exists between neighbouring nucleotide positions.

Slide 5
So then, do we really need to formulate a new approach? Why don't we modify one of the existing approaches instead of spending a lot of money and energy on a new research line? Why don't we just adhere to the existing conventions? Why don't we take care of those correlations by replacing mononucleotide PWM representations with those based on dinucleotides or longer sequence elements? Can't we extend the HMM approach by adding hidden layers to HMM formulations?
Alternatively, can we pioneer an SVM-based approach to generalize the problem and then resolve it? Why do we have to make the thought process complicated for people? Why?

Slide 6
The Answer to all the WHYs
LACK OF EXPERIMENTAL DATA: for most transcription factors, only a few binding sites are experimentally characterized.
Big deal! So then, why don't we get all those so-called experimental characterizations and results and then modify our existing approach? Do we really have to innovate?
Yes! Because technology for studying a complete binding site comprising 10-20 nucleotide positions is still under development:
- DNA Microarrays
- Genome SELEX
- Microarray-based chromatin immunoprecipitation assays
- SELEX SAGE

Slide 7
OUR APPROACH
- Theoretical
- Uses ADAPT, a methodology for analyzing protein-DNA recognition mechanisms
- Takes into account sequences of 10 to 20 nucleotide positions
- Only sequences corresponding to the most stable complexes are studied (the ones which fall within 5 kcal/mol of the best sequence) to generate a weight matrix
ADAPT allows us to calculate:
E_int: protein-DNA interaction energy
E_def: DNA deformation energy, the energy necessary to deform a free DNA segment to the structure it adopts when bound to the protein
BINDING ENERGY: E_tot = E_int + E_def
E_int: direct recognition
E_def: indirect recognition
Direct and indirect components of protein-DNA recognition

Slide 8
E_def is important to us because
- in some cases, such as TBP, binding introduces severe deformation.
- apart from E_int, it is the other component that influences interactions between neighboring nucleotide pairs.
- correlation can occur only if there is degeneracy in sequence preference, i.e. existence of E_def.
We therefore understand that E_def is a necessary but insufficient condition for correlation to exist.
Investigation of this Research
Correlation effects on the binding specificity of some prototypical protein-DNA complexes.
How effectively can correlation be incorporated into binding site prediction?

Slide 9
METHODOLOGY IN DETAIL
CALCULATING protein-DNA binding energies
Tests were carried out on 3 proteins: TBP, BamH1 and GCN4.
An optimal sequence, which exhibits the best binding characteristics, is generated from the set of all sequences. Let us call the binding energy of this sequence E_opt.
(Energy diagram: levels E_opt, E_opt + 5 kcal/mol and E_opt + 10 kcal/mol; regions labeled True Binding Sites, True Non-Binding Sites and Discarded Sequences; the Optimal Sequence is marked.)
Table columns: Protein, Binding Site, Non-Binding Site
TBP8816515
BamH13681582
GCN44764091

Slide 10
Binding Site Length
L_tot = N - log M / log 4
N = total length of the DNA fragment
M = number of sequences with energies < cutoff
Derivation
N base pairs give 4^N possible sequences.
If B base pairs remain after the M sequences are selected using the cutoff criterion, then 4^B = M, so B = log M / log 4.
L_tot = N - B = N - log M / log 4, the effective length of the protein binding site.
ANALYZING CORRELATION
- Analysis is limited to neighboring nucleotide positions.
- Correlation is defined in terms of the dinucleotide probability P_{i,i+1} and the mononucleotide probabilities P_i and P_{i+1}; in practice it can be calculated as a change in entropy.
Mononucleotide entropy S_i lies in (0, 2); dinucleotide entropy S_{i,i+1} lies in (0, 4).

Slide 11
We introduce sequence lengths (symbols not preserved). These lengths yield a quantitative measure of the correlation. It may be noted that (relation not preserved).
EXTRACTING WEIGHT MATRIX PARAMETERS
w_ia = (expression not preserved)
C_0 is chosen in such a way that the best binding site scores 0 and therefore poorer sites have positive scores.
Note: this rests on the assumption that the binding probability is proportional to the exponential of W_m.
The assumption may not hold for our distribution, as we sample training set sequences using a cutoff criterion.
Therefore, given the sharp cutoffs, we use an SVM for our mononucleotides.

Slide 12
The binding energy of the protein to a sequence is then given by (equation not preserved), where the term for position i is the free energy contribution from the ith base. It is chosen to minimize the variance of (quantity not preserved) over the background distribution
of sequences, subject to a constraint (not preserved) that determines which sequences are binding sites.
Generalizing the WM and SVM approaches to dinucleotides gives analogous equations (not preserved), whose parameters are chosen to minimize the variances.

Slide 13
EVIDENCE FOR NON-ADDITIVITY
The optimal sequence for TBP is AGTATAATTAAA.
C_0 is now calculated as (expression not preserved).

Slide 14
A diagonal, implying a perfect correlation between W_m scores and binding energies, would have refuted our hypothesis that dinucleotide (or higher) dependencies are considerable. The variation in 3A and 3B confirms that non-additivity arises in the process of binding and is dominated by interaction between adjacent nucleotides. 3C further reinforces this.

Slide 15
ANALYZING NON-ADDITIVITY WITHIN BINDING SITE
- analysis by calculating binding site lengths
- analysis by calculating entropies

Slide 16
PREDICTING BINDING SITES TAKING NON-ADDITIVITY INTO ACCOUNT
- 200 sequences among the 880 binding sites are used as inputs (TBP).
- The resulting weight matrices and energy matrices are used to assign information scores and predicted binding energies to every candidate site.
So we then have an SVM with the following characteristics:
True Positives: when the algorithm identifies a true binding site as such
True Negatives: when the algorithm identifies a true non-binding site as such
False Positives: when the algorithm declares a true non-binding site to be a binding site
False Negatives: when the algorithm declares a true binding site to be a non-binding site

Slide 17
The probability of misclassification is much higher with a mononucleotide model, i.e. when the existence of non-additivity is overlooked. Recall the discussion on Slide 12: the threshold parameter u can be adjusted to yield even better results.
The positive predictive value is given by TP/(TP+FP). This differs from the measure often reported for other SVMs, FP/(FP+TN). This is because FPs tend to be very small for most reasonable values in our calculations.
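The four outcome counts and the two ratios from Slides 16 and 17 can be sketched in a few lines of Python. This is only an illustration: the `Confusion` class, the helper names and the toy labels are assumptions made up for this example, not data or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of the four outcomes listed on Slide 16."""
    tp: int = 0
    tn: int = 0
    fp: int = 0
    fn: int = 0


def tally(truth, predicted):
    """Count outcomes from parallel lists of booleans (True = binding site)."""
    c = Confusion()
    for is_site, called_site in zip(truth, predicted):
        if is_site and called_site:
            c.tp += 1
        elif not is_site and not called_site:
            c.tn += 1
        elif not is_site and called_site:
            c.fp += 1
        else:
            c.fn += 1
    return c


def positive_predictive_value(c):
    """TP / (TP + FP): fraction of predicted sites that are true sites."""
    called = c.tp + c.fp
    return c.tp / called if called else float("nan")


def false_positive_rate(c):
    """FP / (FP + TN): fraction of true non-sites wrongly called sites."""
    negatives = c.fp + c.tn
    return c.fp / negatives if negatives else float("nan")


def misclassified_fraction(c):
    """(FP + FN) / total: the quantity plotted against training set size."""
    total = c.tp + c.tn + c.fp + c.fn
    return (c.fp + c.fn) / total if total else float("nan")


# Toy labels, invented for illustration only.
truth = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
counts = tally(truth, predicted)
print(counts)
print(positive_predictive_value(counts), false_positive_rate(counts))
```

When true non-binding sites greatly outnumber false positives, FP/(FP+TN) stays close to zero whatever the model does, which is one reading of why the slide prefers TP/(TP+FP).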
Slide 18
(Plots: Mononucleotide Model and Dinucleotide Model.)
Training sets ranging from 2 to 200 TBP binding sites.
Notice that the dinucleotide model outperforms the mononucleotide model as the size of the training set increases. WHY?

Slide 19
When the size of the training set increases, the fraction of misclassified sequences increases.
(Plots: FP+FN versus size of training set, for TBP and GCN4.)
And unless nearest-neighbor interactions are taken into account, the increases are sharp!
- We are therefore in a position to say that a minimum number of binding sites is necessary before it becomes advantageous to introduce correlations.

Slide 20
- The point to note from this graph is that there is significant improvement over the results in TABLE 2, where only 3 or 4 sequences were considered.

Slide 21
EXPERIMENTAL SIGNATURE OF CORRELATION FROM EXPERIMENT
- A preliminary analysis of the dimeric protein CAP is conducted, as our proteins cannot be subjected to experiments without high-throughput technological aid.
- 76 binding sites are confirmed in CAP using our cutoff criterion.

Slide 22
CONCLUSION
- DNA deformation within a protein-DNA complex leads to significant non-additivity.
- Effects are more or less limited to neighboring nucleotide interactions.
- Non-additivity may be relevant only to a limited number of dinucleotide steps in the target site.
- Both the SVM and WM approaches may be used, although the SVM approach is found to slightly outperform the WM approach.
- Improvement in predictive power depends upon the size of the training set.
- Non-additivity should be taken into account only for those steps where it is really needed; otherwise the dinucleotide model overfits, which means poor predictive power.
- All findings depend on the reliability of ADAPT. ADAPT has never failed in the past, as experimental results generally do not vary much from its simulated results, but of course we do not overlook the possibility that one day there might be a protein which.
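The conclusion that dinucleotide terms should be added only at the steps that need them can be illustrated with a small scoring sketch. The functional form below (an additive weight-matrix score plus optional corrections at chosen steps) is an assumption, since the slides' own dinucleotide equations are not preserved; the function names, weights and corrections are all hypothetical toy values.

```python
def additive_score(seq, weights, c0=0.0):
    """Mononucleotide score: c0 plus one weight per position.

    weights[i] maps a base to its contribution at position i. Following the
    slides' convention, the best site scores 0 and poorer sites score higher.
    """
    if len(seq) != len(weights):
        raise ValueError("sequence length must match the weight matrix")
    return c0 + sum(weights[i][base] for i, base in enumerate(seq))


def corrected_score(seq, weights, step_corrections, c0=0.0):
    """Additive score plus dinucleotide corrections at selected steps only.

    step_corrections maps a step index i (positions i and i+1) to a table of
    dinucleotide corrections. Steps that are absent stay purely additive,
    which keeps the number of fitted parameters small.
    """
    score = additive_score(seq, weights, c0)
    for i, table in step_corrections.items():
        score += table.get(seq[i:i + 2], 0.0)
    return score


# Hypothetical 3-position site whose best sequence is "ATA".
toy_weights = [
    {"A": 0.0, "C": 1.0, "G": 1.0, "T": 1.0},
    {"A": 1.0, "C": 1.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 1.0, "T": 1.0},
]
# A single correction, at the step between positions 1 and 2.
toy_corrections = {1: {"TT": -0.5}}

print(additive_score("ATT", toy_weights))
print(corrected_score("ATT", toy_weights, toy_corrections))
```

With an empty `step_corrections` the two functions agree, so the additive model is recovered as a special case and correction tables can be introduced one step at a time as the training set grows.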
Slide 23
Thank you, everyone, for your participation and patient attention to my presentation of the review on Non-additivity in protein-DNA binding.
