Posted on 23-Mar-2020
TRANSCRIPTION FACTOR BINDING POSITIONS PREDICTION WITH CNN
Bowen Hu
TRANSCRIPTION FACTORS (TF)
➤ Proteins that bind to specific DNA sequences.
➤ Key in regulating (turning on or off) gene expression.
➤ Understanding TF binding locations will help us understand which part of a gene is expressed in specific cell lines (skin tissue, brain tissue…).
➤ Together with genotype and expression analysis, this takes us a step forward in understanding biological processes and disease states.
CHIP-SEQ DATA
➤ ChIP-sequencing is a method used to analyze protein interactions with DNA.
➤ Running experiments for all possible combinations of TFs and cell lines is expensive and time-consuming.
➤ A method for accurately predicting whether a TF will bind to a given sequence is therefore needed.
DATA PREPARATION
➤ ChIP-seq data were downloaded from ENCODE (https://www.encodeproject.org), experiment ENCSR101FJT.
➤ The TF in the experiment is human ZNF143, with a sample size of 21679.
➤ Data cleaning:
Remove missing values.
Generate negative samples.
Truncate each sequence to a fixed length of 60 so that it can be represented as an image.
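The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how sequences were truncated or how negative samples were generated, so the center crop and the shuffling of each positive sequence are hypothetical choices, and `clean_and_truncate` and `shuffled_negative` are made-up helper names.

```python
import random

SEQ_LEN = 60  # fixed sequence length from the slides


def clean_and_truncate(seqs, length=SEQ_LEN):
    """Drop missing, too-short, or non-ACGT entries and center-crop the rest.

    The center crop is an assumption; the slides only say "truncate".
    """
    kept = []
    for s in seqs:
        if not s:  # missing value (None or empty string)
            continue
        s = s.upper()
        if len(s) < length or set(s) - set("ACGT"):
            continue
        start = (len(s) - length) // 2
        kept.append(s[start:start + length])
    return kept


def shuffled_negative(seq, rng):
    """Assumed negative sample: same base composition, random order."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


rng = random.Random(0)
peaks = ["ACGT" * 20, None, "ACGTN" * 15, "ACG"]  # toy input, not ENCODE data
positives = clean_and_truncate(peaks)
negatives = [shuffled_negative(s, rng) for s in positives]
```

Shuffling keeps the base composition of each positive sequence, which is one common way to build a background set; other choices (for example flanking genomic regions) would give different negatives.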
DATA TRANSFORMATION
➤ One-hot encoding.
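➤ The DNA sequence “ACTA” is expressed as a binary matrix in which each position is marked by a single 1 in the entry for its nucleotide.
➤ The first DNA sequence input, AAAGAATCCAGCTTAAATCGA, is shown on the next page to illustrate the CNN model.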
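As a small illustration of the one-hot encoding described above, the sketch below encodes “ACTA” with NumPy. The row order A, C, G, T and the rows-by-positions orientation are assumptions; the slides do not state either.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed row order


def one_hot(seq):
    """Return a (4, len(seq)) matrix with a single 1 per column."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    mat = np.zeros((len(ALPHABET), len(seq)), dtype=np.float32)
    for j, base in enumerate(seq.upper()):
        mat[index[base], j] = 1.0
    return mat


encoded = one_hot("ACTA")
print(encoded)
# Rows are A, C, G, T; columns are sequence positions:
# [[1. 0. 0. 1.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 1. 0.]]
```

A 60-base sequence encoded this way becomes a 4 x 60 matrix, which is what lets the CNN treat it like an image.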
CONVOLUTIONAL NEURAL NETWORK
WHY CNN?
➤ Traditional methods for predicting TF binding positions are based on position weight matrices (PWMs) or motifs.
➤ A likelihood ratio test or score test is then used to make the decision.
➤ These methods do not use ChIP-seq data directly, only summary statistics.
➤ Other biological features also influence binding behavior.
➤ I would expect higher accuracy from a model built directly on ChIP-seq data (CNN).
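For comparison, the PWM-based approach mentioned above can be sketched as a log-likelihood ratio score. The 3-position matrix below is a toy example invented for illustration, not the ZNF143 motif, and the uniform background of 0.25 per base is an assumption.

```python
import math


def pwm_log_odds(window, pwm, background=0.25):
    """Sum of log2(P(base | motif) / P(base | background)) over positions."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(window))


def best_window_score(seq, pwm):
    """Slide the PWM along the sequence and keep the highest score."""
    width = len(pwm)
    return max(
        pwm_log_odds(seq[i:i + width], pwm)
        for i in range(len(seq) - width + 1)
    )


# Toy PWM that prefers "ACG" (hypothetical probabilities).
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
score = best_window_score("TTACGTT", toy_pwm)
```

A sequence is then called bound when its best score exceeds a chosen threshold. The PWM itself is a summary of the binding sites, which is the limitation the bullets above point to.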
CNN TRAINING
➤ Hyper-parameter selection.
Sequence length: 60;
Feature matrix (motif) length: 10;
Number of features: 600;
Window size of max pooling layer: 60;
Fully connected layer size: 50;
➤ Data separation:
70% for training, 20% for validation, 10% for testing.
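The 70/20/10 separation above could be done with a shuffled index split, sketched below. The fixed seed and the helper name `split_indices` are assumptions, and 21679 is used only as an example size (the actual total would also include the negative samples).

```python
import random


def split_indices(n, train=0.7, val=0.2, seed=0):
    """Shuffle 0..n-1 and cut into train / validation / test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_val = int(n * val)
    # Whatever remains after train and validation goes to the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(21679)
```

Shuffling before cutting keeps the three sets comparable when the input file is ordered, for example by peak strength.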
CNN VISUALIZATION
Convolution layer
Max pooling
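The convolution and max pooling steps can be sketched as a NumPy forward pass that uses the hyper-parameters from the training slide (sequence length 60, motif length 10, 600 features, fully connected size 50). The ReLU activations, the sigmoid output, and the random untrained weights are assumptions; the slides do not specify them. Without padding, the convolution gives 60 - 10 + 1 = 51 positions, so a pooling window of 60 is treated here as pooling over all positions.

```python
import numpy as np

SEQ_LEN, MOTIF_LEN, N_FILTERS, FC_SIZE = 60, 10, 600, 50


def init_params(rng):
    """Random, untrained weights (illustration only)."""
    return {
        "conv_w": rng.normal(0, 0.1, (N_FILTERS, 4, MOTIF_LEN)),
        "conv_b": np.zeros(N_FILTERS),
        "fc_w": rng.normal(0, 0.1, (N_FILTERS, FC_SIZE)),
        "fc_b": np.zeros(FC_SIZE),
        "out_w": rng.normal(0, 0.1, FC_SIZE),
        "out_b": 0.0,
    }


def forward(x, p):
    """x: one-hot matrix of shape (4, SEQ_LEN); returns a value in (0, 1)."""
    n_pos = x.shape[1] - MOTIF_LEN + 1
    # Stack every length-10 window: shape (n_pos, 4, MOTIF_LEN).
    windows = np.stack([x[:, i:i + MOTIF_LEN] for i in range(n_pos)])
    # Convolution layer: each filter scores each window, then ReLU (assumed).
    conv = np.einsum("pck,fck->fp", windows, p["conv_w"]) + p["conv_b"][:, None]
    conv = np.maximum(conv, 0.0)
    # Max pooling over all positions: one value per filter.
    pooled = conv.max(axis=1)
    # Fully connected layer with ReLU (assumed), then sigmoid output (assumed).
    hidden = np.maximum(pooled @ p["fc_w"] + p["fc_b"], 0.0)
    logit = hidden @ p["out_w"] + p["out_b"]
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
params = init_params(rng)
x = np.eye(4)[rng.integers(0, 4, SEQ_LEN)].T  # random one-hot input, shape (4, 60)
prob = forward(x, params)
```

Each of the 600 filters plays the role of a learned motif of length 10, and the max pooling keeps only its best match along the sequence, much like taking the best window score of a PWM.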
RESULTS
➤ The CNN achieves a prediction accuracy of 60%.
➤ Compared with the 90% accuracy of gkm-SVM, a currently prevailing method, this is not a desirable result.
DISCUSSION
➤ Possible reasons for the lower accuracy of the CNN:
Misleading negative samples.
The CNN failed to capture non-linear features because of its limited number of layers.
Hyper-parameters can be improved.
Tissue information was not included.
➤ Improvements & future work:
Generate new negative samples.
Add more layers to the CNN.