TRANSCRIPTION FACTOR BINDING POSITIONS PREDICTION WITH CNN Bowen Hu

Page 1: TRANSCRIPTION FACTOR BINDING POSITIONS ... - CAE Usershomepages.cae.wisc.edu/~ece539/project/f17/Hu_presentation.pdf · CHIP-SEQ DATA ChIP-sequencing is a method used to analyze protein

TRANSCRIPTION FACTOR BINDING POSITIONS PREDICTION WITH CNN

Bowen Hu

TRANSCRIPTION FACTORS (TF)

➤ Proteins that bind to specific DNA sequences.

➤ Key to regulating gene expression (turning genes on or off).

➤ Knowing where TFs bind helps us understand which genes are expressed in specific cell lines (skin tissue, brain tissue, …).

➤ Together with genotype and expression analysis, this takes us a step closer to understanding biological processes and disease states.

CHIP-SEQ DATA

➤ ChIP-sequencing is a method used to analyze protein interactions with DNA.

➤ Running ChIP-seq for every possible combination of TF and cell line is expensive and time-consuming.

➤ A method that accurately predicts whether a TF will bind to a given sequence is therefore needed.

DATA PREPARATION

➤ ChIP-seq data were downloaded from ENCODE (https://www.encodeproject.org), experiment ENCSR101FJT.

➤ The TF in the experiment is human ZNF143, with a sample size of 21,679.

➤ Data cleaning:

Remove missing values.

Generate negative samples.

Truncate each sequence to a fixed length of 60 so that it can be represented as an image.
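
The cleaning steps above could be sketched as follows. This is only an illustration: the slides do not say how the negative samples were generated or which 60 bases were kept, so shuffling the bases of each positive sequence and keeping the central window are both assumptions, and the helper names (clean, truncate_center, shuffled_negative) are hypothetical.

```python
import random

SEQ_LEN = 60  # fixed sequence length given in the slides


def clean(seqs):
    """Drop missing entries and sequences containing anything other than A, C, G, T."""
    return [s.upper() for s in seqs if s and set(s.upper()) <= set("ACGT")]


def truncate_center(seq, length=SEQ_LEN):
    """Keep the central `length` bases (assumed choice); None if the sequence is too short."""
    if len(seq) < length:
        return None
    start = (len(seq) - length) // 2
    return seq[start:start + length]


def shuffled_negative(seq, rng):
    """Assumed negative-sampling scheme: shuffle the bases of a positive sequence."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


rng = random.Random(0)
raw = ["ACGT" * 20, None, "ACGTN" * 16, "TTGACC" * 12]  # toy input, not ENCODE data
positives = [t for t in (truncate_center(s) for s in clean(raw)) if t is not None]
negatives = [shuffled_negative(s, rng) for s in positives]
```

Shuffled negatives keep the base composition of the positives but destroy any motif, which is one common way to build a background set; other schemes (for example, sampling flanking genomic regions) would give a different and possibly harder classification task.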

DATA TRANSFORMATION

➤ One-hot encoding.

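➤ The DNA sequence “ACTA” is expressed as a binary matrix with one entry per (position, base) pair, where each position has a single 1 marking its base.

➤ The first input DNA sequence, AAAGAATCCAGCTTAAATCGA, is shown on the next page to illustrate the CNN model.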
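
As a small worked example of one-hot encoding, the sketch below encodes “ACTA”. The row order A, C, G, T and the layout (one row per base, one column per position) are assumptions; the slides do not specify either.

```python
BASES = "ACGT"  # assumed row order


def one_hot(seq):
    """Return a 4 x len(seq) matrix: row i is 1 wherever seq has base BASES[i]."""
    return [[1 if base == b else 0 for base in seq] for b in BASES]


for b, row in zip(BASES, one_hot("ACTA")):
    print(b, row)
# A [1, 0, 0, 1]
# C [0, 1, 0, 0]
# G [0, 0, 0, 0]
# T [0, 0, 1, 0]
```

Each column contains exactly one 1, so a length-60 sequence becomes a 4 × 60 binary "image" that a convolution layer can scan.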

CONVOLUTIONAL NEURAL NETWORK

WHY CNN?

➤ Traditional methods for predicting TF binding positions are based on position weight matrices (PWMs), or motifs.

➤ A likelihood ratio test or score test is then used to make the binding decision.

➤ These methods do not use ChIP-seq data directly, only summary statistics derived from it.

➤ Other biological features also influence binding behavior.

➤ I would expect higher accuracy from a model built directly on ChIP-seq data (a CNN).
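
For comparison, a PWM-based decision can be sketched as a log-likelihood ratio scan. The 3-position PWM, the uniform background, and the threshold below are all made-up values for illustration; they are not taken from the slides or from any ZNF143 motif.

```python
import math

# Hypothetical PWM: probability of each base at each of 3 motif positions.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every base


def log_odds(window):
    """Log2 likelihood ratio of the window under the motif vs. the background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, window))


def best_hit(seq):
    """Scan every window of motif width; return (score, start) of the best one."""
    width = len(PWM)
    return max((log_odds(seq[i:i + width]), i) for i in range(len(seq) - width + 1))


score, start = best_hit("TTAGCTT")
is_bound = score > 3.0  # hypothetical decision threshold
```

The PWM treats positions as independent and reduces the ChIP-seq data to per-position frequencies, which is the "summary statistics" limitation noted above.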

CNN TRAINING

➤ Hyper-parameter selection.

Sequence length: 60;

Feature matrix (motif) length: 10;

Number of features: 600;

Window size of max pooling layer: 60;

Fully connected layer size: 50;

➤ Data separation:

70% for training, 20% for validation, 10% for testing.
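
A 70/20/10 split like the one above could be produced as follows. This is a minimal sketch: the random shuffle, the seed, and the function name split_indices are assumptions, and 21679 is used as the dataset size purely for illustration.

```python
import random


def split_indices(n, train=0.7, val=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into train / validation / test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_val = int(n * val)
    # Whatever remains after the train and validation cuts becomes the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(21679)
print(len(train_idx), len(val_idx), len(test_idx))  # 15175 4335 2169
```

The validation part is what hyper-parameters are tuned against; the test part should be used only once, for the final accuracy figure.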

CNN VISUALIZATION

Convolution layer

Max pooling
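
The two stages can be sketched as a plain-Python forward pass. Several things here are assumptions rather than details from the slides: the convolution is "valid" (no padding) and followed by ReLU, max pooling is taken over the whole feature map, the weights are random rather than trained, and only 8 filters are used instead of 600 to keep the example short.

```python
import random

BASES = "ACGT"  # assumed row order of the one-hot matrix


def one_hot(seq):
    """4 x len(seq) one-hot matrix, one row per base."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def conv1d_relu(x, kernel, bias=0.0):
    """Slide a 4 x K kernel along a 4 x L input (no padding), then apply ReLU."""
    length, width = len(x[0]), len(kernel[0])
    out = []
    for start in range(length - width + 1):
        total = bias + sum(
            kernel[r][k] * x[r][start + k] for r in range(4) for k in range(width)
        )
        out.append(max(0.0, total))
    return out


def global_max_pool(feature_map):
    """Keep only the strongest response of a filter along the sequence."""
    return max(feature_map)


rng = random.Random(0)
seq = "".join(rng.choice(BASES) for _ in range(60))  # toy length-60 input
kernels = [[[rng.gauss(0, 0.1) for _ in range(10)] for _ in range(4)] for _ in range(8)]

x = one_hot(seq)
feature_maps = [conv1d_relu(x, k) for k in kernels]  # 8 maps of length 60 - 10 + 1 = 51
pooled = [global_max_pool(fm) for fm in feature_maps]  # one value per filter
```

Each 4 × 10 kernel plays the role of a learned motif, and the pooled vector (one value per filter) is what a fully connected layer would take as input.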

RESULTS

➤ The CNN reaches a prediction accuracy of 60%.

➤ Compared with the 90% accuracy of gkm-SVM, a currently prevailing method, this is not a desirable result.

DISCUSSION

➤ Possible reasons for the CNN's lower accuracy:

Misleading negative samples.

The CNN failed to capture non-linear features because of its limited number of layers.

The hyper-parameters can be tuned further.

Tissue information was not included.

➤ Improvements and future work:

Generate new negative samples.

Add more layers to the CNN.