Laboratory for InterNet Computing
Representations for Large-scale (Big) Sequence Data Mining: A Tale of Two
Vector Space Models
Vijay Raghavan, Ryan Benton,
Tom Johnsten, Ying Xie
October 13, 2013
Outline
• Introduction
• Related Work
• Generalized Multi-Layered Vector Spaces
– Lossless Decomposition Model of Sequence
Data (Granular computing Paradigm)
– Applications of LD to Bioinformatics
– Multi-Layered Vector Spaces (Turkish Music)
– Applications of MLVS to Bioinformatics
• Conclusions and Future Work
Introduction
Motivation
• Traditional Turkish Music
• Can Akkoç
– Institute of Applied Mathematics; The Middle
East Technical University, Ankara, Turkey
– Mathematics and Statistics Department,
University of South Alabama
Improv Melody Line Turkish Music
RDCSDCSDRSRDSDSCNCNCSNANACDSNCSCSDSCSCRDSCSCNANSDHAGHANANENCNMGAGANANCSDNCNCSCSDSDRDSANCSNSDCRDNCSDSCSD
R = Rast, D = Dügâh, S = Segâh, C = Çargah, N = Neva, H = Hüseyni, A = Acem, E = Eviç, G = Gerdaniye, M = Muhayyer
Motivation
• Traditional Turkish Music
– How to automatically categorize into families?
– How can it be represented?
Importance
• To analyze and classify arbitrary sequences
– Using structural similarities and differences
– Mathematical problem of importance
• Examples
– Music sequences
– Bio-sequences
– Web & computer Network access logs
– Click streams
Related Work
• Vector Space Model in Information Retrieval
• Convolution Kernels in NLP
Generalized Multi-Layered Vector Spaces (GMLVS)
GMLVS Properties
• Discover patterns
– Defined in terms of the alphabet
– Collection of sequences
– Local and global
• Reconstruct a sequence from model representation
• Facilitate data mining tasks
– Descriptive
– Predictive
GMLVS Model Formulation
• A sequence S of finite length |S|
– Defined over a finite alphabet β
– Viewed as a collection of generated
subsequences
– Length t where t = 1,..., |S|-1.
GMLVS Model Formulation
• A feature (f)
– Pair of subsequences f = (i, j)
• i and j ∈ $\beta^{*}_{t}$
– Specified step value m
• m is the step size of a given feature
• 0 ≤ m ≤ k
Interpretation of m
• m = 1
– f represents a consecutive subsequence
• m > 1
– f is a subsequence with a gap
– Gap is filled by an arbitrary sequence of (m – 1) symbols
– i and j are called, respectively, the leading and trailing subsequences
• m = 0
– The leading subsequence consists of element(s) from the alphabet
– The trailing element is a null symbol, whose length is zero
– m is not meaningful, since (m – 1) is negative
Example
• Sequence
– ILLNQNLVRSIKDSFVVTLNSNLVLSF
• Feature 1
– i = ILL
– j = NL
– m = 3
Example
• Sequence
– ILLNQNLVRSIKDSFVVTLNSNLVLSF
• Feature 3
– i = LN
– j = NL
– m =2
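The two example features can be checked mechanically. The following is a minimal sketch, assuming 0-based positions and m ≥ 1; the function name `find_feature` is hypothetical and not part of the original slides.

```python
def find_feature(seq, lead, trail, m):
    """Return 0-based start positions where `lead` is followed by `trail`
    with exactly (m - 1) arbitrary symbols in between (assumes m >= 1)."""
    if m < 1:
        raise ValueError("this sketch only handles m >= 1")
    gap = m - 1
    span = len(lead) + gap + len(trail)
    hits = []
    for p in range(len(seq) - span + 1):
        q = p + len(lead) + gap  # start of the trailing subsequence
        if seq[p:p + len(lead)] == lead and seq[q:q + len(trail)] == trail:
            hits.append(p)
    return hits


S = "ILLNQNLVRSIKDSFVVTLNSNLVLSF"
print(find_feature(S, "ILL", "NL", 3))  # [0]
print(find_feature(S, "LN", "NL", 2))   # [2, 18]
```

With m = 3 the gap holds two symbols (here "NQ"); with m = 2 it holds one.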
Validity
• Feature f = (i, j) is valid if
– |i| + |j| + m < |S|
• Result: number of possible features is a
function of the number of possible
subsequences and the range of step sizes.
Families and Clusters
• Allowing m to vary generates a
multitude of m-step pairs (families)
• A multi-layered k-clustering $C_k$
– Composed of $P_{m|(i,j)}$, where m = 1, 2, ..., k
– Size of a cluster: $|C_k| = |\beta^{*}_{t}| \cdot k$
• t = |i| + |j|
Lossless Decomposition of Sequence Data
• Many research works in granular computing
– Focus on information granulation
– Forms granules from structured data.
• We focus on the top-down process
– Generates granules via feature (information)
extraction.
Principle of Lossless Decomposition
• Require that the original sequence data can
be rebuilt from the generated granules.
• For example, granules decomposed from a
protein sequence may be defined as
– Individual amino acids, plus
– Positions of the corresponding amino acid in
the sequence.
More Formally
• Set of feature vectors G from a set of extracted
features of the form f = (i, NULL),
– where i ∈ $\beta^{*}_{t}$
– m = 0, such that
– G = {<f_p>}
– where f_p is the starting position of the pth instance of
feature f in S.
• S can be reconstructed directly from G.
Lossless Decomposition of a Protein Sequence into Granules
• The primary structure of a protein
– a linear sequence of 20 amino acids
– Each acid represented by a letter such as L or K.
• Example sequence segment:
ILLNQNLVRSIKDSFVVTLISSEVLSF
• The sequence “ILLNQNLVRSI” is decomposed into the following granules:
<I>:{0, 10}, <L>:{1, 2, 6}, <N>:{3, 5}, <Q>:{4}
<V>:{7}, <R>:{8}, <S>:{9}, <A>:{-1}, <D>:{-1}
• -1 indicates that the corresponding amino acid does not appear in the protein sequence.
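A minimal sketch of this decomposition and its inverse, assuming 0-based positions and the standard one-letter amino acid codes; the names `decompose` and `reconstruct` are hypothetical.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def decompose(seq, alphabet=AMINO_ACIDS):
    """Map each symbol to the list of positions where it occurs.
    Symbols of `alphabet` that never occur get [-1], as on the slide."""
    granules = defaultdict(list)
    for pos, symbol in enumerate(seq):
        granules[symbol].append(pos)
    result = dict(granules)
    for symbol in alphabet:
        result.setdefault(symbol, [-1])
    return result


def reconstruct(granules):
    """Rebuild the sequence from its granules, ignoring the -1 markers."""
    placed = {}
    for symbol, positions in granules.items():
        for pos in positions:
            if pos >= 0:
                placed[pos] = symbol
    return "".join(placed[p] for p in range(len(placed)))


G = decompose("ILLNQNLVRSI")
print(G["I"], G["L"], G["A"])  # [0, 10] [1, 2, 6] [-1]
print(reconstruct(G))          # ILLNQNLVRSI
```

The round trip back to the original string is what makes the decomposition lossless.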
More formally
• Given a protein sequence S
• Assuming A denotes the set of 20 amino acids,
• We conduct a lossless decomposition of S into a set of granules G such that info(S) = info(G).
• Each granule can be decomposed into
granules of finer resolution.
• Given the sequence “ILLNQNLVRSI”
• Granule <L>:{1, 2, 6} can be decomposed into
<LL>:{1}, <LN>:{2}, <LV>:{6}, <LA>:{-1},
<LD>:{-1}, …
• Multi-Layer
Granules: Pairs of Amino Acids
• Given a protein sequence S
• Assuming A denotes the set of 20 amino
acids,
• We conduct a lossless decomposition of S into a set of granules G such that info(S) = info(G)
• Decomposing a protein sequence based on
positions of all possible combinations of
– 3 consecutive amino acids
• yields a set of 8,000 granules for each protein sequence
– 4 consecutive amino acids
• yields a set of 160,000 granules for each protein sequence
– n consecutive amino acids
• yields a set of $20^n$ granules.
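The finer-resolution granules can be sketched as position lists of n consecutive symbols. This is an illustrative sketch with 0-based positions; `kmer_granules` is a hypothetical name, and only the combinations that actually occur are stored.

```python
def kmer_granules(seq, n):
    """Map every run of n consecutive symbols to its 0-based start positions."""
    granules = {}
    for pos in range(len(seq) - n + 1):
        granules.setdefault(seq[pos:pos + n], []).append(pos)
    return granules


pairs = kmer_granules("ILLNQNLVRSI", 2)
print(pairs["LL"], pairs["LN"], pairs["LV"])  # [1] [2] [6]

# Number of possible granules over 20 amino acids for n = 3 and n = 4.
print(20 ** 3, 20 ** 4)  # 8000 160000
```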
Applications to Bioinformatics: Lossless Decomposition
• Protein Sequence Alignment
Protein Sequence Alignment - Background
• A fundamental task in protein sequence
analysis.
– Evaluating similarity relationship between two
protein sequences
• Similarity in sequences may indicate
homology in protein structures or functions.
• The Needleman-Wunsch algorithm
– Identify global optimal pairwise sequence
alignment
– Uses dynamic programming process.
– A maximum similarity score can be derived
from the optimal pairwise alignment.
• Smith-Waterman algorithm
– Variation of Needleman-Wunsch algorithm
– Finds the highest scoring local alignment
between two sequences.
• Both methods
– Guaranteed to obtain optimal alignments
between two sequences
– Time complexity of both algorithms is O(MN), where M and N are the lengths of the two sequences
• If a sequence
– Is used as a query against a sequence database
– And the search uses either Needleman-Wunsch or
Smith-Waterman
– Search time complexity is O(MNK), where K is the number of sequences in the database
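To make the O(MN) cost concrete, here is a minimal sketch of the Needleman-Wunsch score computation. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; protein alignment in practice normally uses a substitution matrix.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via dynamic programming.
    Fills an (M+1) x (N+1) table row by row, hence O(MN) time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("AC", "AGC"))           # 1
```

Running this against each of K database sequences gives the O(MNK) search cost noted above.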
Protein sequence alignment based on granular representation
• Given two protein sequences S1 and S2
• Apply Lossless Decomposition
– Convert S1 into G1 (set of granules)
– Convert S2 into G2 (another set of granules)
Protein sequence alignment based on granular representation
• Need: compute pairwise similarity/
distance between protein sequences
• Solution: a method that distributes
– The process of pairwise sequence alignment
– Into processes that use the individual granules generated
by lossless decomposition
• Example: Decompositions are positions of all possible combinations of 2 consecutive amino acids.
• Advantages of granular position sequence over original protein sequence
– Much shorter
– Alignment computation much more efficient
• Distance between the sequence S1 and the
sequence S2
– Aggregation of the distances between granules
– Hence, calculation of the distance
• Between two sequences with 400 granule pairs
• Can be distributed to 400 parallel calculations
• The actual number present in a given collection is
likely much fewer
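The aggregation step can be sketched as a sum of independent per-granule terms. The per-granule measure used here (Jaccard distance between position sets) is only a stand-in assumption, not the measure used in the original work, and both function names are hypothetical.

```python
def granule_distance(pos_a, pos_b):
    """Stand-in per-granule distance: Jaccard distance between position sets."""
    a, b = set(pos_a), set(pos_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sequence_distance(g1, g2, dist=granule_distance):
    """Aggregate (sum) the per-granule distances over all granules present
    in either sequence. Each term depends on one granule only, so the terms
    could be computed in parallel."""
    keys = set(g1) | set(g2)
    return sum(dist(g1.get(k, []), g2.get(k, [])) for k in keys)


g1 = {"LL": [1], "LN": [2]}
g2 = {"LL": [1], "LV": [6]}
print(sequence_distance(g1, g1))  # 0.0
print(sequence_distance(g1, g2))  # 2.0
```

Only granules that actually occur contribute, which is why the number of terms is typically well below the 400 possible pairs.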
Preliminary Experimental Studies
• We studied the performance of the
proposed approach in protein sequence
classification on 53 SCOP protein families.
• Used 1-nearest neighbor (1NN) approach
– Predict whether a test sequence belongs to a given family
• More specifically,
– For each test sequence,
– Compare its similarity to each training sequence
– Assign the class label of the most similar training sequence
• The accuracy rate for each family is reported.
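The 1NN procedure can be sketched as follows. The similarity function is a toy assumption (count of shared 2-symbol substrings), not the granule-based similarity from the study, and all names and data are hypothetical.

```python
def shared_pairs(a, b):
    """Toy similarity: number of distinct 2-symbol substrings in common."""
    def pairs(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}
    return len(pairs(a) & pairs(b))


def predict_1nn(test_seq, training, similarity):
    """Return the label of the training sequence most similar to test_seq.
    `training` is a list of (sequence, label) tuples."""
    best_label, best_score = None, float("-inf")
    for seq, label in training:
        score = similarity(test_seq, seq)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [("ILLNQ", "family_1"), ("KDSFV", "family_2")]
print(predict_1nn("LLNQN", training, shared_pairs))  # family_1
```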
Discussion
• Scale up
• Distributed framework
• Protein sequence kernels (similarity)
– Pairwise distance calculation
Multi-Layered Vector Spaces
• Sequence
– Disassembled into ordered pairs
– Separated by (m-1) spaces
• Features
– f = (i, j), where i and j ∈ $\beta^{*}_{1}$
– Location of leading elements can serve as
components of feature vector.
Multi-Layered Vector Spaces
• Sequence S
– Multi-layered structure
– Set of m-step ordered pairs (of features) (i, j)
• Denoted by $P_{m|(i,j)}$, where 1 ≤ m ≤ k.
• Total number of feature vectors that can be
generated from an alphabet β is $|\beta|^2$.
Example
m=1 G A
m=2 G * A
m=3 G * * A
m=4 G * * * A
m=5 G * * * * A
m=6 G * * * * * A
Example: m-Step Ordered Pairs
• Given S:
1 2 3 4 5 6 7 8 9 10
g c t g g g c t c a
11 12 13 14 15 16 17 18 19 20
g c t a a t g a g c
Example: m-Step Ordered Pairs
• Given S:
• 1-step ordered pairs (g,c)
– [1,2], [6,7], [11,12], [19,20]
1 2 3 4 5 6 7 8 9 10
g c t g g g c t c a
11 12 13 14 15 16 17 18 19 20
g c t a a t g a g c
Example: m-Step Ordered Pairs
• Given S:
• 2-step ordered pair (g,t)
– [1,3], [6,8], [11,13]
1 2 3 4 5 6 7 8 9 10
g c t g g g c t c a
11 12 13 14 15 16 17 18 19 20
g c t a a t g a g c
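The position pairs in these examples can be reproduced with a short sketch. It uses 1-based indices to match the slides; the function name `m_step_pairs` is hypothetical.

```python
def m_step_pairs(seq, lead, trail, m):
    """Return 1-based [leading, trailing] index pairs where `lead` is
    followed by `trail` exactly m positions later."""
    return [[p + 1, p + 1 + m]
            for p in range(len(seq) - m)
            if seq[p] == lead and seq[p + m] == trail]


S = "gctgggctcagctaatgagc"
print(m_step_pairs(S, "g", "c", 1))  # [[1, 2], [6, 7], [11, 12], [19, 20]]
print(m_step_pairs(S, "g", "t", 2))  # [[1, 3], [6, 8], [11, 13]]
```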
Feature Vectors
• Boolean vectors
– Record locations of anchor positions as Boolean
values (1,0)
• Suitable for equal length sequences.
Example: Feature Vectors
• Given S:
• 1-step ordered pairs (g,c)
– [1,2], [6,7], [11,12], [19,20]
– <1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0>
1 2 3 4 5 6 7 8 9 10
g c t g g g c t c a
11 12 13 14 15 16 17 18 19 20
g c t a a t g a g c
Example: Feature Vectors
• Given S:
• 2-step ordered pair (g,t)
– [1,3], [6,8], [11,13]
– <1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0>
1 2 3 4 5 6 7 8 9 10
g c t g g g c t c a
11 12 13 14 15 16 17 18 19 20
g c t a a t g a g c
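A minimal sketch of the Boolean encoding: one entry per sequence position, set to 1 where the leading element of the pair is anchored. The function name `boolean_vector` is hypothetical.

```python
def boolean_vector(seq, lead, trail, m):
    """Boolean vector of length len(seq); entry p is 1 when seq[p] == lead
    and seq[p + m] == trail (p is the anchor position)."""
    vec = [0] * len(seq)
    for p in range(len(seq) - m):
        if seq[p] == lead and seq[p + m] == trail:
            vec[p] = 1
    return vec


S = "gctgggctcagctaatgagc"
print(boolean_vector(S, "g", "c", 1))
print(boolean_vector(S, "g", "t", 2))
```

Because the vector length equals the sequence length, vectors are directly comparable only for equal-length sequences.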
Combinations
• Fix pair and step size
– One Vector
• Fix step size, all pairs
– Multiple Vectors
• Number of vectors: $A^2$
• A is the size of the alphabet
– “Multi-Layered Vector Spaces”
Combinations
• Fix pair, multiple step sizes
– Multiple Vectors
• Number of vectors: k
• k is the number of step sizes
• All pairs, multiple step sizes
– Multiple Vectors
• Number of vectors: $A^2 \cdot k$
• A is the size of the alphabet
Concern
• Initially, have a ‘compact’ representation
• Moved to very large, non-compact, sparse
representation
– Bad for most machine learning/data mining
algorithms
Feature Vectors
• Partition sequence into n equal segments
– Count number of anchor positions falling into
each segment.
• n can be adjusted to meet expectations on resolution
• Suitable for unequal length sequences
Example: Feature Vectors
• (g,c) (m=1) (n = 4)
– [1,2], [6,7], [11,12], [19,20]
– <1,0,0,0,0|1,0,0,0,0|1,0,0,0,0|0,0,0,1,0>
– <1,1,1,1>
• (g,t) (m=2) (n = 4)
– [1,3], [6,8], [11,13]
– <1,0,0,0,0|1,0,0,0,0|1,0,0,0,0|0,0,0,0,0>
– <1,1,1,0>
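The binned counts can be sketched as below, assuming 1-based anchor positions as on the slides. The function name `binned_vector` is hypothetical; when the sequence length is not a multiple of n, the segments are only approximately equal.

```python
def binned_vector(anchors, seq_len, n):
    """Count 1-based anchor positions falling into each of n segments."""
    counts = [0] * n
    for pos in anchors:
        counts[(pos - 1) * n // seq_len] += 1
    return counts


print(binned_vector([1, 6, 11, 19], seq_len=20, n=4))  # [1, 1, 1, 1]
print(binned_vector([1, 6, 11], seq_len=20, n=4))      # [1, 1, 1, 0]
```

The output length depends only on n, which is what makes this form usable for sequences of unequal length.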
Applications to Bioinformatics: MLVS
• Signal Prediction
• Cancer Mutation
• Eukaryota versus Euglenozoa
Eukaryota versus Euglenozoa
• Eukaryota
– Domain
• Euglenozoa
– Phylum
– Belongs to
• Kingdom: Excavata
• Domain: Eukaryota
Data
• Number of sequences
– 43 Eukaryota
– 44 Euglenozoa
• Each specimen
– Sequence of DNA
• Alphabet: A, T, C, G
• Alphabet size = 4
Preprocessing
• Create vectors
– Step size: 1, 2, 3, 10
– All possible combinations of pairs
– 100 Bins
• Result
– 64 files
• One per pair/step size.
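The count of 64 follows from 4 × 4 ordered pairs times 4 step sizes. A small sketch enumerating the configurations (the variable names are illustrative only):

```python
from itertools import product

alphabet = "atcg"
step_sizes = [1, 2, 3, 10]

# One (leading symbol, trailing symbol, step size) triple per output file.
configs = [(lead, trail, m) for lead, trail, m in product(alphabet, alphabet, step_sizes)]
print(len(configs))  # 64
```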
Classification Techniques
• Considered two methods:
– C4.5
• A popular decision tree method
– Ensemble
• Take K trees
• Each tree votes.
• The ‘class’ with the most votes wins
• Considered ensembles of 3, 5, 7, 9, 11, 13, and 15 trees.
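The voting rule can be sketched in a few lines. This is an illustration only; `majority_vote` is a hypothetical name, and the inputs stand for the class labels predicted by the individual trees.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


votes = ["Eukaryota", "Euglenozoa", "Eukaryota"]
print(majority_vote(votes))  # Eukaryota
```

With two classes, an odd number of trees (3, 5, ..., 15) rules out tied votes.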
Experiments
• Hold-Out Method
– Divide data, randomly, into
• Training set – Used to create C4.5 classifier
– Tree could also be part of ensemble
• Test set – Used to evaluate C4.5 classifier
– Also used to evaluate ensemble
– Repeat 5 times and average results
• Measure
– Accuracy
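A sketch of the repeated hold-out procedure. The 30% test fraction and the fixed seed are assumptions (the slides do not state the split), and `holdout_accuracy` and `train_fn` are hypothetical names; `train_fn` stands for any learner that returns a prediction function.

```python
import random


def holdout_accuracy(data, train_fn, test_fraction=0.3, repeats=5, seed=0):
    """Average accuracy over `repeats` random train/test splits.
    `data` is a list of (features, label) tuples; `train_fn(train)` must
    return a function mapping features to a predicted label."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = train_fn(train)
        correct = sum(predict(x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Toy usage: a "learner" that ignores its training data and applies a fixed rule.
toy_data = [(i, i % 2) for i in range(20)]
print(holdout_accuracy(toy_data, lambda train: (lambda x: x % 2)))  # 1.0
```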
Vm|(i,j)   m=1   m=2   m=3   m=10
(a,a)       75    82    75    69
(a,c)       77    63    69    64
(a,g)       75    89    83    75
(a,t)       77    78    82    71
(c,a)       69    68    71    67
(c,c)       76    75    82    87
(c,g)       64    78    74    67
(c,t)       76    68    77    67
(g,a)       70    78    74    72
(g,c)       75    67    82    82
(g,g)       70    66    84    85
(g,t)       76    64    69    76
(t,a)       66    68    72    67
(t,c)       87    75    74    70
(t,g)       72    70    67    67
(t,t)       76    61    75    72
Average     74    72    76    72
Decision tree accuracy values (%) for selected feature vectors
(k,m)-mismatch method accuracy values (Baseline)
K     m=0 (%)   m=1 (%)
4     90        89
5     93        88
6     93        90
7     91        93
8     91        93
9     90        91
10    86        90
Accuracy (%) by step size m (rows) and number of classifiers (columns)
m      3    5    7    9    11   13   15
1      90   93   93   94   92   92   96
2      90   87   87   90   87   83   79
3      87   92   91   92   93   94   94
10     92   90   92   92   90   89   87
Ensemble decision tree accuracy values
Results
• Individual Tree
– Average accuracy of 72% – 76%
– Note: Making decision using only a single pair
• a, a
• t, g
• Etc.
• Ensemble
– Best accuracy achieved: 96%
• Step size 1
• 15 trees
Conclusions and Future Work
• Effective in Classification
• Potential to be several orders of magnitude
faster
• Other mining tasks (Anomaly Detection)
• Other Application Domains (Click streams,
Network Intrusion Detection, etc.)
Questions
Thank you