Machine Learning in Bioinformatics 2008. 2. 21. Rhee, Je-Keun.
© 2008 SNU CSE Biointelligence Lab, http://bi.snu.ac.kr/
Biological Background: Central Dogma
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/central_dogma.html
DNA, RNA, Protein
E.g., DNA sequence:
AGGATTTAGAACAAAATCCGAAAAGGAGTGACATAACATTACAACATTAGGAATAAAGTAGATAAAACATTGATCAAAGGAAATTTAGTTATAGTTGAAAATTTTTATTATAAAAAGGGAACGAAGGGAGATTTTTTCAAGGGCATTTTGGTCCACCCTCTTGAGTTTTCCAGTTGTTGTAGCAGGAGCAAACTTGTTTGTTCCCATAGTAACCCGGAGGCACACAGAGACACTTCCTGCAGCATTTGTTGCAGAACGTAATGCAAGCCTTGTGGTACTGTGTCTTTTTACACCTCCTATCACATTCCGATGGGCATTGGGTACGTTTCCAGGCTTCCTGGGTCCATAACCGTTTCTGGCTCCAACTTCAACATTAGATCCACCTTGGAGGCCATAACCATGGGTTTGAAAGCATGAAGAGGGGCAATGAAGGGCCAAGAGGNAGATAGNCCCATATGGCCTANNCATTTCCAGGTTTGGGGNATTGGTATCCAAAGACCAACAACCCCCCAAACCCCCCAAACAGGTTTAGCCCCTTGGGG
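As a minimal sketch of the central dogma applied to a DNA sequence such as the one above, the following Python snippet transcribes a coding strand into mRNA and translates it with the standard genetic code. The function names `transcribe` and `translate` are illustrative and do not come from the slides. The sketch assumes the input is the coding strand read in frame from its first base, and it maps any codon containing an ambiguous base such as N to X.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate in frame from the first base, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are reported as X.
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGGCC"))
print(translate("ATGGCCAAGTAA"))
```

Real pipelines would also handle reverse complements and reading-frame detection, which this sketch leaves out.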
Main topics in Bioinformatics
Biological data analysis
• Sequence analysis (DNA, RNA, protein, ...)
• Structural bioinformatics (protein structure, RNA structure, ...)
• Gene expression
• Systems biology
• Text mining
• Etc.
Machine learning methods for biological data analysis
• Feature selection
• Classification: decision trees, artificial neural networks, Bayesian networks, support vector machines, k-NN, etc.
• Clustering: k-means, hierarchical clustering, PCA, SOM, etc.
Protein structure prediction
Gene expression analysis
microarray
Data preparation
[Figure: microarray image samples (Sample 1, Sample 2, ..., Sample i, ..., Sample k, ..., Sample n) are converted by image analysis into numerical data for data mining, indexed by sample and gene.]
An Example: Hierarchical Clustering
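The slide itself shows only a figure, so here is a rough sketch of what hierarchical clustering of gene expression profiles can look like in code. It assumes average linkage and a Pearson-correlation distance, both common choices for expression data but not stated in the slides. The gene names and expression values are invented for illustration.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation; assumes neither profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(pearson_distance(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented toy profiles: A and B rise together, C and D fall together.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.5, 4.0, 2.0],
}
merges = average_linkage(profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 4))
```

The merge history is the information a dendrogram draws: the co-varying pairs join first, and the two anti-correlated groups join last at a distance close to 2.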
Example: Bayesian network classifier
<Network structure> Nodes: Zyxin, Leukemia, MB-1, C-myb, LTC4S
Zyxin: Zyxin
Leukemia: ALL or AML
LTC4S: Leukotriene C4 synthase (LTC4S) gene
C-myb: C-myb gene extracted from Human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds
MB-1: MB-1 gene
P(X) = \prod_{i=1}^{n} P(X_i \mid Pa_i)
A Bayesian network classifier for acute leukemias [Hwang et al. 2001]
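A Bayesian network factorizes the joint distribution as a product of each node's probability given its parents, P(X) = ∏ P(X_i | Pa_i). The sketch below shows how that factorization can be used for classification. Note that the edge structure (Leukemia as the only parent of each gene) and every probability value are assumptions made up for illustration. They are not the network or parameters learned in Hwang et al. 2001, and only two of the gene nodes are included to keep the example short.

```python
# Hypothetical structure: Leukemia -> each gene (NOT the learned network from the slide).
PARENTS = {"Leukemia": (), "Zyxin": ("Leukemia",), "MB-1": ("Leukemia",)}

# Illustrative conditional probability tables: CPT[node][parent_values][value].
CPT = {
    "Leukemia": {(): {"ALL": 0.6, "AML": 0.4}},
    "Zyxin": {("ALL",): {"high": 0.1, "low": 0.9},
              ("AML",): {"high": 0.8, "low": 0.2}},
    "MB-1": {("ALL",): {"high": 0.7, "low": 0.3},
             ("AML",): {"high": 0.2, "low": 0.8}},
}


def joint_probability(assignment):
    """P(X) as the product of P(X_i | Pa_i) over all nodes."""
    p = 1.0
    for node, parents in PARENTS.items():
        parent_values = tuple(assignment[q] for q in parents)
        p *= CPT[node][parent_values][assignment[node]]
    return p


def classify(evidence):
    """Posterior over the class node, obtained by normalizing the joint."""
    scores = {label: joint_probability({**evidence, "Leukemia": label})
              for label in CPT["Leukemia"][()]}
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posterior = classify({"Zyxin": "high", "MB-1": "low"})
print(posterior)
```

With a richer structure only `PARENTS` and `CPT` change; the factorized product in `joint_probability` stays the same.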
microRNA prediction using a probabilistic learning model
Probabilistic co-learning model
• The first miRNA prediction based on machine learning methods: a general-purpose miRNA prediction algorithm
• Predicts miRNAs for 8 species, including human, mouse, and fly, and provides an annotated genome browser
[Figure, panels (a)-(c): hidden Markov model used for miRNA prediction. Recoverable labels: structural-state chains M+/N+/I+/D+ and M-/N-/I-/D- between a start state S and an end state E, true states, false states, emission symbols (aligned base pairs such as G|U and -|C), numbered transitions 1-9, state sequence π, transition probabilities, and emission probabilities.]
<Structural states> M: match state; N: mismatch state; D: deletion state; I: insertion state
<Hidden states> T: true state (mature miRNA); F: false state
Human microRNA prediction through a probabilistic co-learning model of sequence and structure, Jin-Wu Nam, Ki-Roo Shin, Jinju Han, Yoontae Lee, V. Narry Kim, Byoung-Tak Zhang, Nucleic Acids Research, 33(11):3570-3581, 2005.
ProMiR II: A web server for the probabilistic prediction of clustered, nonclustered, conserved and nonconserved microRNAs, Jin-Wu Nam, Jin-Han Kim, Sung-Kyu Kim, Byoung-Tak Zhang, Nucleic Acids Research, 34:W455-W458, 2006.
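To give a feel for how hidden states such as T (true, mature miRNA) and F (false) can be decoded from a sequence of structural symbols (M, N, D, I), here is a small Viterbi sketch. It is a plain two-state HMM and not the paired co-learning model of Nam et al. The probabilities are invented for illustration, and the assumption that true regions emit mostly match symbols is a simplification of ours.

```python
from math import log

STATES = ("T", "F")
# All probabilities below are made-up illustrative values.
START = {"T": 0.5, "F": 0.5}
TRANS = {"T": {"T": 0.9, "F": 0.1}, "F": {"T": 0.1, "F": 0.9}}
EMIT = {"T": {"M": 0.7, "N": 0.1, "D": 0.1, "I": 0.1},
        "F": {"M": 0.1, "N": 0.3, "D": 0.3, "I": 0.3}}


def viterbi(observations):
    """Most probable hidden-state path for a non-empty string over M/N/D/I."""
    # score[s] = best log-probability of any path ending in state s.
    score = {s: log(START[s]) + log(EMIT[s][observations[0]]) for s in STATES}
    backpointers = []
    for symbol in observations[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + log(TRANS[r][s]))
            new_score[s] = score[prev] + log(TRANS[prev][s]) + log(EMIT[s][symbol])
            pointer[s] = prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("MMMMNDND"))
```

Working in log space avoids numerical underflow on long sequences, which matters for hairpin-length inputs.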
Prediction for cardiovascular disease
• Using aptamer chip data
• Disease prediction by decision trees, artificial neural networks, Bayesian networks, and support vector machines
microRNA target prediction
Feature set for target prediction
• Learning from validated data: machine learning applied to experimentally validated data
• Prediction of many-to-many miRNA:target networks
Target prediction by support vector machine
Multiobjective Optimization-Based Oligonucleotide Optimization
[Figure: Multi-Objective Evolutionary Algorithm. Recoverable labels: Parent_t + Offspring_t, non-dominated sorting, Parent_{t+1}, genetic operation, Offspring_{t+1}, resulting probes.]
• Uses the user's multiple requirements as multiple objective functions
• Objective functions can be changed, and new ones added, depending on the application
• Usefulness demonstrated by applying the method to probe design for Biomedlab's HPV detection chip
Multiobjective evolutionary optimization of DNA sequences for reliable DNA computing, S.-Y. Shin, I.-H. Lee, D. Kim, and B.-T. Zhang, IEEE Trans. Evo. Comp., 2005.
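The non-dominated sorting step named in the figure can be sketched as follows. This is a simple quadratic-time version for illustration, not the authors' implementation. It assumes every objective is minimized, and the two-objective candidate scores are invented placeholders (for example, a cross-hybridization score and a melting-temperature deviation).

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points):
    """Split point indices into Pareto fronts; front 0 is the non-dominated set."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


# Invented objective vectors for five candidate probes (lower is better).
candidates = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 5.0)]
print(non_dominated_sort(candidates))
```

In an evolutionary loop, the next parent population would be filled front by front from the combined parents and offspring, so no single objective has to be weighted against the others in advance.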
Phylogenetic Tree Construction based on kernel methods
Construction of phylogenetic trees by kernel-based comparative analysis of metabolic networks, S. J. Oh, J.-G. Joung, J.-H. Chang and B.-T. Zhang, BMC Bioinformatics, 7:284, 2006.
• Phylogenetic analysis based on metabolic pathways
• Proposes a new methodology that compares similarity between species using a graph kernel
• Contributes to the discovery of candidate targets of interest through biological pathway analysis
Tree-Based Biochemical Network Identification
[Figure: Genetic Programming cycle. Recoverable labels: Initialization, Population, Selection, Genetic operators, Offspring, Survival of fitness.]
[Figure: Biochemical Network Modeling. Recoverable labels: Time-series profiles, S-Tree representation.]
• Proposes the S-tree representation for efficiently representing biochemical networks
• Develops a genetic programming technique that can learn S-trees from time-series data
• Enables system-level analysis when identifying new network structures
Identification of biochemical networks by S-tree based genetic programming, D.-Y. Cho, K.-H. Cho, and B.-T. Zhang, Bioinformatics, 22(13):1631-1640, 2006.
Co-clustering by probabilistic evolutionary learning for module detection
Population-based Probabilistic Learning
• Proposes a co-clustering analysis technique that uses diverse genome-wide data
• Searches for mRNA-miRNA modules with a co-clustering evolutionary algorithm
• Contributes to elucidating functional relationships between mRNAs and miRNAs
[Figure: Heterogeneous Datasets and Coherent Transcriptional Modules. Recoverable labels: miRNA expression (miRNA1-miRNA8 across Array1-Array6), mRNA expression (mRNA1-mRNA13 across Array1-Array6), target scores of miRNAs, miRNA module.]
Discovery of microRNA-mRNA modules via population-based probabilistic learning, J.-G. Joung, K.-B. Hwang, J.-W. Nam, S.-J. Kim and B.-T. Zhang, Bioinformatics, 23(9):1141-1147, 2007.
Co-clustering by probabilistic latent variable model with Heterogeneous Datasets
[Figure: recoverable labels: Position Weight Matrices (PWMs), Transcription Factors, Stem Cell Subpopulations, Gene Expression Datasets.]
• Proposes a co-clustering analysis technique that uses heterogeneous data
• Searches for regulatory modules with a co-clustering latent variable model
• Contributes to elucidating systematic regulatory mechanisms
Identification of regulatory modules by co-clustering latent variable models: stem cell differentiation, J.-G. Joung, D. Shin, R.-H. Seong and B.-T. Zhang, Bioinformatics, 22(16):2005-2011, 2006.
Modeling for temporal gene expression profiles
Self-organizing latent lattice models for temporal gene expression profiling, B.-T. Zhang, J. Yang, and S. W. Chi, Machine Learning, 52(1/2):67-89, 2003.
[Figure: processing flow. Recoverable labels: SMD, Gene Expression Profiles, Gene Pair Selection, Update Pattern Prototype, Update Latent Grid & Build Interaction Site Map, Visualize Gene Pairs in the Interaction Site Map, Clustering, Extract Interactive Gene Pairs.]
• Enables visualization of the expression patterns of multiple related genes
• Generates, from hidden nodes, the expression values that correspond to a specific dynamic expression pattern
• Searches for functionally related gene pairs
Hierarchical Bayesian Networks for Large-scale Network Construction
• Hierarchical probabilistic graphical models for large-scale data analysis, Hwang, Kyu-Baek, Ph.D. Thesis, School of Computer Science and Engineering, Seoul National University, August 2005.
• Learning hierarchical Bayesian networks for large-scale data analysis, K.-B. Hwang, B.-H. Kim, and B.-T. Zhang, Lecture Notes in Computer Science, 4232:670-679, 2006.
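• Compresses the information in large-scale data so that summary Bayesian networks of arbitrary size can be built
• Stepwise visualization of large-scale Bayesian networks
• Reflects the characteristics of complex networks found in nature: the characteristics of Bayesian networks (a probabilistic and statistical representation of the relationships among factors) combined with those of clustering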
Text Mining
• Development of an integrated text mining platform that performs recognition, extraction, inference, and visualization of gene/protein interactions in the biomedical literature
• Development of a tree kernel-based interaction sentence classifier for efficient text mining
• Development of a protein-protein interaction prediction model that predicts new interactions by applying data mining to known interaction data
A tree kernel-based method for protein-protein interaction mining from biomedical literature, Jae-Hong Eom, Sun Kim, Seong-Hwan Kim, Byoung-Tak Zhang, Lecture Notes in Bioinformatics, 3886:42-52, 2006.
PubMiner: machine learning-based text mining system for biomedical information mining, J.-H. Eom and B.-T. Zhang, Genomics & Informatics, 2(2):99-106, 2004.
Prediction of implicit protein-protein interaction by optimal associative feature mining, J.-H. Eom, J.-H. Chang and B.-T. Zhang, Lecture Notes in Computer Science, 3177:85-91, 2004.