Representations for large-scale (Big) Sequence Data Mining

Vijay Raghavan, Ryan Benton, Tom Johnsten, Ying Xie · October 13, 2013

Description

Analyzing and classifying sequence data based on structural similarities and differences is a mathematical problem of escalating relevance. Indeed, a primary challenge in designing machine learning algorithms to analyze sequence data is the extraction and representation of significant features. This paper introduces a generalized sequence feature extraction model, referred to as the Generalized Multi-Layered Vector Spaces (GMLVS) model. Unlike most models, which represent sequence data based on subsequence frequency, the GMLVS model represents a given sequence as a collection of features, where each individual feature captures the spatial relationship between two subsequences and can be mapped into a feature vector. The utility of this approach is demonstrated via two special cases of the GMLVS model, namely Lossless Decomposition (LD) and the Multi-Layered Vector Spaces (MLVS).

Transcript of Representations for large-scale (Big) Sequence Data Mining

Page 1:

Representations for Large-scale (Big) Sequence Data Mining: A Tale of Two Vector Space Models

Laboratory for InterNet Computing

Vijay Raghavan, Ryan Benton, Tom Johnsten, Ying Xie

October 13, 2013

Page 2:

Outline

• Introduction

• Related Work

• Generalized Multi-Layered Vector Spaces

– Lossless Decomposition (LD) Model of Sequence Data (Granular Computing Paradigm)

– Applications of LD to Bioinformatics

– Multi-Layered Vector Spaces (Turkish Music)

– Applications of MLVS to Bioinformatics

• Conclusions and Future Work

Page 3:

Introduction

Page 4:

Motivation

• Traditional Turkish Music

• Can Akkoç

– Institute of Applied Mathematics, The Middle East Technical University, Ankara, Turkey

– Mathematics and Statistics Department, University of South Alabama

Page 5:

Improv Melody Line (Turkish Music)

RDCSDCSDRSRDSDSCNCNCSNANACDSNCSCSDSCSCRDSCSCNANSDHAGHANANENCNMGAGANANCSDNCNCSCSDSDRDSANCSNSDCRDNCSDSCSD


R_Rast, D_Dügâh, S_Segâh, C_Çargah, N_Neva, H_Hüseyni, A_Acem, E_Eviç, G_Gerdaniye, M_Muhayyer

Page 6:

Motivation

• Traditional Turkish Music

– How to automatically categorize into families?

– How can it be represented?

Page 7:

Importance

• To analyze and classify arbitrary sequences

– Using structural similarities and differences

– Mathematical problem of importance

• Examples

– Music sequences

– Bio-sequences

– Web & computer network access logs

– Click streams

Page 8:

Related Work

• Vector Space Model in Information Retrieval

• Convolution Kernels in NLP

Page 9:

Generalized Multi-Layered Vector Spaces (GMLVS)

Page 10:

GMLVS Properties

• Discover patterns

– Defined in terms of the alphabet

– Collection of sequences

– Local and global

• Reconstruct a sequence from model representation

• Facilitate data mining tasks

– Descriptive

– Predictive

Page 11:

GMLVS Model Formulation

• A sequence S of finite length |S|

– Defined over a finite alphabet β

– Viewed as a collection of generated subsequences

– Of length t, where t = 1, ..., |S| − 1

Page 12:

GMLVS Model Formulation

• A feature f

– A pair of subsequences f = (i, j)

• i and j ∈ β*

– A specified step value m

• m stands for the step size of a given feature

• 0 ≤ m ≤ k

Page 13:

Interpretation of m

• m = 1

– f represents a consecutive subsequence

• m > 1

– f is a subsequence with a gap

– The gap is filled by an arbitrary sequence of (m − 1) symbols

– i and j are called, respectively, the leading and trailing subsequences

• m = 0

– The leading subsequence is element(s) from the alphabet

– The trailing element is the null symbol, whose length is zero

– The gap interpretation is not meaningful, since (m − 1) is negative

Page 14:

Example

• Sequence

– ILLNQNLVRSIKDSFVVTLNSNLVLSF

• Feature 1

– i = ILL

– j = NL

– m = 3

Page 15:

Example

• Sequence

– ILLNQNLVRSIKDSFVVTLNSNLVLSF

• Feature 3

– i = LN

– j = NL

– m = 2
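To make the feature definition concrete, here is a minimal Python sketch (our own naming, not the authors' implementation) that extracts the start positions of a GMLVS feature f = (i, j) with step m; positions are 0-based, unlike the 1-based positions used on later slides.

```python
def feature_occurrences(s, i, j, m):
    """Return 0-based start positions of feature f = (i, j) with step m:
    the leading subsequence i, followed (m - 1) arbitrary symbols later
    by the trailing subsequence j (m = 1 means i and j are consecutive)."""
    span = len(i) + (m - 1) + len(j)      # symbols one occurrence covers
    positions = []
    for p in range(len(s) - span + 1):
        q = p + len(i) + (m - 1)          # start of the trailing subsequence
        if s[p:p + len(i)] == i and s[q:q + len(j)] == j:
            positions.append(p)
    return positions

seq = "ILLNQNLVRSIKDSFVVTLNSNLVLSF"
print(feature_occurrences(seq, "ILL", "NL", 3))   # Feature 1: [0]
print(feature_occurrences(seq, "LN", "NL", 2))    # Feature 3: [2, 18]
```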

Page 16:

Validity

• A feature f = (i, j) is valid if

– |i| + |j| + m < |S|

• Result: the number of possible features is a function of the number of possible subsequences and the range of step sizes.

Page 17:

Families and Clusters

• Allowing m to vary generates a multitude of m-step pairs (families)

• A multi-layered k-clustering C_k

– Composed of P_m|(i,j), where m = 1, 2, ..., k

– Size of a cluster: |C_k| = |β|^t · k, where t = |i| + |j|

Page 18:

Lossless Decomposition of Sequence Data

Page 19:

Lossless Decomposition of Sequence Data

• Many research works in granular computing

– Focus on information granulation

– Form granules from structured data

• We focus on the top-down process

– Generating granules via feature (information) extraction

Page 20:

Principle of Lossless Decomposition

• Require that the original sequence data can be rebuilt from the generated granules.

• For example, granules decomposed from a protein sequence may be defined as

– Individual amino acids, plus

– Positions of the corresponding amino acid in the sequence

Page 21:

More Formally

• Set of feature vectors G from a set of extracted features of the form f = (i, NULL),

– where i ∈ β* and m = 0, such that

– G = {<f_p>},

– where f_p is the starting position of the p-th instance of feature f in S.

• S can be reconstructed directly from G.

Page 22:

Lossless Decomposition of a Protein Sequence into Granules

• The primary structure of a protein

– A linear sequence over an alphabet of 20 amino acids

– Each amino acid represented by a letter such as L or K

• Example sequence segment:

ILLNQNLVRSIKDSFVVTLISSEVLSF

Page 23:

• Sequence “ILLNQNLVRSI” decomposed into the following granules:

<I>:{0, 10}, <L>:{1, 2, 6}, <N>:{3, 5}, <Q>:{4}

<V>:{7}, <R>:{8}, <S>:{9}, <A>:{-1},<D>:{-1}

• -1 indicates that the corresponding amino acid does not appear in the protein sequence.
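The decomposition above, and the losslessness it must satisfy, fit in a few lines of Python. A minimal sketch (our own function names; 0-based positions as on this slide; the alphabet is restricted to the symbols listed in the example rather than all 20 amino acids):

```python
def decompose(s, alphabet):
    """Lossless decomposition: one granule per symbol, holding the
    0-based positions where the symbol occurs ([-1] if absent)."""
    granules = {a: [] for a in alphabet}
    for pos, symbol in enumerate(s):
        granules[symbol].append(pos)
    return {a: p if p else [-1] for a, p in granules.items()}

def reconstruct(granules):
    """Rebuild the original sequence from its granules, demonstrating
    that info(S) = info(G)."""
    cells = {}
    for symbol, positions in granules.items():
        for pos in positions:
            if pos >= 0:
                cells[pos] = symbol
    return "".join(cells[pos] for pos in sorted(cells))

g = decompose("ILLNQNLVRSI", "ILNQVRSAD")
print(g["L"], g["A"])        # [1, 2, 6] [-1]
print(reconstruct(g))        # ILLNQNLVRSI
```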

Page 24:

More formally

• Given a protein sequence S

• Assuming A denotes the set of 20 amino acids,

• We conduct lossless decomposition of S into a set of granules G such that info(S) = info(G).

Page 25:

• Each granule can be decomposed into granules of finer resolution.

• Given the sequence “ILLNQNLVRSI”

• Granule <L>:{1, 2, 6} can be decomposed into

<LL>:{1}, <LN>:{2}, <LV>:{6}, <LA>:{-1}, <LD>:{-1}, …

• Multi-Layer
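The same idea, one layer down: order-2 granules record the positions of each ordered pair of consecutive symbols. A sketch reproducing the decomposition of <L> shown above (hypothetical naming, 0-based positions):

```python
def decompose_pairs(s, alphabet):
    """Finer-resolution granules: 0-based positions of each ordered
    pair of consecutive symbols ([-1] for pairs that never occur)."""
    granules = {a + b: [] for a in alphabet for b in alphabet}
    for pos in range(len(s) - 1):
        granules[s[pos:pos + 2]].append(pos)
    return {pair: p if p else [-1] for pair, p in granules.items()}

g2 = decompose_pairs("ILLNQNLVRSI", "ILNQVRSAD")
print(g2["LL"], g2["LN"], g2["LV"], g2["LA"])   # [1] [2] [6] [-1]
```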

Page 26:

Granules: Pairs of Amino Acids

• Given a protein sequence S

• Assuming A denotes the set of 20 amino acids,

• We conduct lossless decomposition of S into a set of granules G such that info(S) = info(G).

Page 27:

• If decomposing a protein sequence based on positions of all possible combinations of

– 3 consecutive amino acids:

• A set of 20^3 = 8,000 granules for each protein sequence

– 4 consecutive amino acids:

• A set of 20^4 = 160,000 granules for each protein sequence

– n consecutive amino acids:

• A set of 20^n granules

Page 28:

Applications to Bioinformatics: Lossless Decomposition

• Protein Sequence Alignment

Page 29:

Protein Sequence Alignment: Background

• A fundamental task in protein sequence analysis

– Evaluating the similarity relationship between two protein sequences

• Similarity in sequences may indicate homology in protein structures or functions.

Page 30:

• The Needleman-Wunsch algorithm

– Identifies a globally optimal pairwise sequence alignment

– Uses a dynamic programming process

– A maximum similarity score can be derived from the optimal pairwise alignment (see the sketch below)
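For reference, a compact sketch of the Needleman-Wunsch scoring recurrence. It computes the maximum global similarity score only; the full algorithm also records traceback pointers to recover the alignment itself, and real protein alignment would use a substitution matrix (e.g., BLOSUM) rather than the flat match/mismatch scores assumed here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via dynamic programming: cell (r, c) holds
    the best score for aligning a[:r] with b[:c]."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        score[r][0] = r * gap                 # align a[:r] against gaps
    for c in range(cols):
        score[0][c] = c * gap                 # align b[:c] against gaps
    for r in range(1, rows):
        for c in range(1, cols):
            diag = score[r - 1][c - 1] + (match if a[r - 1] == b[c - 1] else mismatch)
            score[r][c] = max(diag, score[r - 1][c] + gap, score[r][c - 1] + gap)
    return score[-1][-1]

print(needleman_wunsch("ILLNQNLVRSI", "ILLNQNLVRS"))   # 9: ten matches, one gap
```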

Page 31:

• Smith-Waterman algorithm

– A variation of the Needleman-Wunsch algorithm

– Finds the highest-scoring local alignment between two sequences.

Page 32:

• Both methods

– Are guaranteed to obtain certain optimal alignments between two sequences

– Have time complexity O(MN), where M and N are the lengths of the two sequences

• If a sequence

– Is used as a query against a sequence database of K sequences

– And the search uses either Needleman-Wunsch or Smith-Waterman

– Then the search time complexity is O(MNK)

Page 33:

Protein sequence alignment based on granular representation

• Given two protein sequences S1 and S2

• Apply Lossless Decomposition

– Convert S1 into G1 (set of granules)

– Convert S2 into G2 (another set of granules)

Page 34:

Protein sequence alignment based on granular representation

• Need: compute pairwise similarity/distance between protein sequences

• Solution: a method that distributes the process of pairwise sequence alignment across the individual granules generated by lossless decomposition

Page 35:

• Example: Decompositions are positions of all possible combinations of 2 consecutive amino acids.

Pages 36–38: [figure slides continuing the granule-based alignment example]

Page 39:

• Advantages of granular position sequence over original protein sequence

– Much shorter

– Alignment computation much more efficient

Page 40:

• Distance between the sequence S1 and the sequence S2

– Aggregation of the distances between granules

– Hence, calculation of the distance between two sequences with 400 granule pairs can be distributed across 400 parallel calculations (see the sketch below)

– The actual number of granule pairs present in a given collection is likely much smaller
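A sketch of how the aggregation distributes. The slides leave the per-granule distance measure open, so the placeholder below simply scores two position lists by the size of their symmetric set difference; any per-granule alignment score could be dropped in. concurrent.futures stands in for whatever parallel framework is actually used.

```python
from concurrent.futures import ProcessPoolExecutor

def granule_distance(p1, p2):
    """Placeholder per-granule distance: symmetric difference of the
    two position sets (absent granules, marked -1, contribute nothing)."""
    s1, s2 = set(p1) - {-1}, set(p2) - {-1}
    return len(s1 ^ s2)

def sequence_distance(g1, g2, workers=4):
    """Distance between two sequences as the sum of per-granule
    distances; each of the up to 400 amino-acid-pair granules is
    scored independently, so the map parallelizes trivially."""
    keys = sorted(set(g1) | set(g2))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(granule_distance,
                         [g1.get(k, [-1]) for k in keys],
                         [g2.get(k, [-1]) for k in keys])
        return sum(parts)
```

With decompose_pairs from the earlier sketch, sequence_distance(decompose_pairs(s1, A), decompose_pairs(s2, A)) splits one pairwise comparison into up to 400 independent granule comparisons.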

Page 41:

Preliminary Experimental Studies

• We studied the performance of the proposed approach in protein sequence classification on 53 SCOP protein families.

Page 42:

• Used the 1-nearest neighbor (1NN) approach

– Predict whether a test sequence belongs to a given family

• More specifically,

– For each test sequence,

– Compare its similarity to each training sequence

– Assign the class label of the most similar training sequence

• The accuracy rate for each family is reported (a sketch of the 1NN rule follows).
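A sketch of the 1NN rule, built on the (hypothetical) sequence_distance from the previous sketch:

```python
def predict_1nn(test_granules, training_set):
    """1-nearest-neighbor classification: return the family label of the
    training sequence with the smallest granular distance to the test
    sequence. training_set is a list of (label, granules) pairs."""
    best_label, best_dist = None, float("inf")
    for label, train_granules in training_set:
        d = sequence_distance(test_granules, train_granules)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```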

Page 43: [figure slide: per-family accuracy results]

Page 44:

Discussion

• Scale up

• Distributed framework

• Protein sequence kernels (similarity)

– Pairwise distance calculation

Page 45:

Multi-Layered Vector Spaces

Page 46:

Multi-Layered Vector Spaces

• Sequence

– Disassembled into ordered pairs

– Separated by (m-1) spaces

• Features

– f = (i, j), where i and j ∈ β*

– Locations of leading elements can serve as components of a feature vector

Page 47:

Multi-Layered Vector Spaces

• Sequence S

– Multi-layered structure

– Set of m-step ordered pairs (of features) (i, j)

• Denoted by P_m|(i,j), where 1 ≤ m ≤ k

• Total number of feature vectors that can be generated from an alphabet β is |β|².

Page 48:

Example

m=1 G A

m=2 G * A

m=3 G * * A

m=4 G * * * A

m=5 G * * * * A

m=6 G * * * * * A

Page 49:

Example: m-Step Ordered Pairs

• Given S:

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c

Page 50:

Example: m-Step Ordered Pairs

• Given S:

• 1-step ordered pairs (g,c)

– [1,2], [6,7], [11,12], [19,20]

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c

Page 51:

Example: m-Step Ordered Pairs

• Given S:

• 2-step ordered pair (g,t)

– [1,3], [6,8], [11,13]

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c
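These ordered-pair lists are mechanical to generate. A short sketch (our own naming; 1-based positions to match the slides):

```python
def m_step_pairs(s, a, b, m):
    """1-based anchor pairs [p, p + m] where symbol a at position p is
    followed m positions later by symbol b."""
    return [[p + 1, p + 1 + m]
            for p in range(len(s) - m)
            if s[p] == a and s[p + m] == b]

s = "gctgggctcagctaatgagc"
print(m_step_pairs(s, "g", "c", 1))   # [[1, 2], [6, 7], [11, 12], [19, 20]]
print(m_step_pairs(s, "g", "t", 2))   # [[1, 3], [6, 8], [11, 13]]
```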

Page 52:

Feature Vectors

• Boolean vectors

– Record locations of anchor positions as Boolean values (1, 0)

• Suitable for equal-length sequences.

Page 53:

Example: Feature Vectors

• Given S:

• 1-step ordered pairs (g,c)

– [1,2], [6,7], [11,12], [19,20]

– <1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0>

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c

Page 54:

Example: Feature Vectors

• Given S:

• 2-step ordered pair (g,t)

– [1,3], [6,8], [11,13]

– <1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0>

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c
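A sketch turning anchor positions into the Boolean vectors shown above, one component per sequence position:

```python
def boolean_vector(s, a, b, m):
    """Boolean MLVS feature vector: component p is 1 iff position p + 1
    (1-based) anchors an m-step occurrence of the ordered pair (a, b)."""
    return [1 if p + m < len(s) and s[p] == a and s[p + m] == b else 0
            for p in range(len(s))]

s = "gctgggctcagctaatgagc"
print(boolean_vector(s, "g", "c", 1))   # 1s at positions 1, 6, 11, 19
print(boolean_vector(s, "g", "t", 2))   # 1s at positions 1, 6, 11
```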

Page 55:

Combinations

• Fix pair and step size

– One vector

• Fix step size, all pairs

– Multiple vectors

– Number of vectors: A², where A is the size of the alphabet

– “Multi-Layered Vector Spaces”

Page 56:

Combinations

• Fix pair, multiple step sizes

– Multiple vectors

– Number of vectors: k, where k is the number of step sizes

• All pairs, multiple step sizes

– Multiple vectors

– Number of vectors: A² · k, where A is the size of the alphabet

Page 57: [figure slide]

Page 58:

Concern

• Initially, we have a ‘compact’ representation

• Moved to a very large, non-compact, sparse representation

– Bad for most machine learning/data mining algorithms

Page 59:

Feature Vectors

• Partition sequence into n equal segments

– Count the number of anchor positions falling into each segment

• n can be adjusted to meet expectations on resolution

• Suitable for unequal-length sequences

Page 60:

Example: Feature Vectors

• (g,c), m = 1, n = 4

– [1,2], [6,7], [11,12], [19,20]

– <1,0,0,0,0|1,0,0,0,0|1,0,0,0,0|0,0,0,1,0>

– <1,1,1,1>

• (g,t), m = 2, n = 4

– [1,3], [6,8], [11,13]

– <1,0,0,0,0|1,0,0,0,0|1,0,0,0,0|0,0,0,0,0>

– <1,1,1,0>
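The binned variant is a small change: count anchors per segment instead of recording them positionally. A sketch (our own naming; in the slide's example len(s)/n divides evenly, and the min() guard handles ragged final segments otherwise):

```python
def binned_vector(s, a, b, m, n):
    """Partition the sequence into n equal segments and count the m-step
    (a, b) anchor positions falling into each, yielding a fixed-length
    vector even for unequal-length sequences."""
    counts = [0] * n
    seg = len(s) / n
    for p in range(len(s) - m):
        if s[p] == a and s[p + m] == b:
            counts[min(int(p / seg), n - 1)] += 1
    return counts

s = "gctgggctcagctaatgagc"
print(binned_vector(s, "g", "c", 1, 4))   # [1, 1, 1, 1]
print(binned_vector(s, "g", "t", 2, 4))   # [1, 1, 1, 0]
```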

Page 61:

Applications to Bioinformatics: MLVS

• Signal Prediction

• Cancer Mutation

• Eukaryota versus Euglenozoa

Page 62:

Eukaryota versus Euglenozoa

Page 63:

Eukaryota versus Euglenozoa

• Eukaryota

– Domain

• Euglenozoa

– Phylum

– Belongs to

• Kingdom: Excavata

• Domain: Eukaryota

Page 64:

Data

• Number of sequences

– 43 Eukaryota

– 44 Euglenozoa

• Each specimen

– Sequence of DNA

• Alphabet: A, T, C, G

• Alphabet size = 4

Page 65:

Preprocessing

• Create vectors

– Step sizes: 1, 2, 3, 10

– All possible combinations of pairs (4² = 16)

– 100 bins

• Result

– 64 files (16 pairs × 4 step sizes)

• One per pair/step-size combination

Page 66:

Classification Techniques

• Considered two methods:

– C4.5

• A popular decision tree method

– Ensemble

• Take K trees

• Each tree votes.

• The ‘class’ with the most votes wins

• Considered ensembles of 3, 5, 7, 9, 11, 13, 15 trees.
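A runnable sketch of the voting scheme on stand-in data. scikit-learn's decision tree (CART) substitutes for C4.5, and bagging is one plausible way to diversify the K trees; the slides do not say how the ensemble's trees actually differ (they may, for instance, each be trained on a different pair/step-size file), so none of this should be read as the authors' exact setup.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 87 sequences (43 Eukaryota, 44 Euglenozoa), 100 bins each.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(87, 100))
y = np.array([0] * 43 + [1] * 44)

# Hold-out evaluation, as on the next slide.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
for k in (3, 5, 7, 9, 11, 13, 15):
    ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=k)
    ensemble.fit(X_train, y_train)          # each of the k trees votes;
    print(k, "trees:", ensemble.score(X_test, y_test))  # majority wins
```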

Page 67:

Experiments

• Hold-Out Method

– Divide data, randomly, into

• Training set – used to create the C4.5 classifier

– Tree could also be part of an ensemble

• Test set – used to evaluate the C4.5 classifier

– Also used to evaluate the ensemble

– Repeat 5 times and average the results

• Measure

– Accuracy

Page 68:

V_m|(i,j)   m=1   m=2   m=3   m=10
(a,a)        75    82    75    69
(a,c)        77    63    69    64
(a,g)        75    89    83    75
(a,t)        77    78    82    71
(c,a)        69    68    71    67
(c,c)        76    75    82    87
(c,g)        64    78    74    67
(c,t)        76    68    77    67
(g,a)        70    78    74    72
(g,c)        75    67    82    82
(g,g)        70    66    84    85
(g,t)        76    64    69    76
(t,a)        66    68    72    67
(t,c)        87    75    74    70
(t,g)        72    70    67    67
(t,t)        76    61    75    72
Average      74    72    76    72

Decision tree accuracy values for selected feature vectors

Page 69:

(k,m)-mismatch method accuracy values (Baseline)

k     m=0 (%)   m=1 (%)
4       90        89
5       93        88
6       93        90
7       91        93
8       91        93
9       90        91
10      86        90

Page 70:

m # Classifiers : Accuracy (%)

1 3:90; 5:93; 7:93; 9:94; 11:92; 13:92; 15:96

2 3:90; 5:87; 7:87; 9:90; 11:87; 13:83; 15:79

3 3:87; 5:92; 7:91; 9:92; 11:93; 13:94; 15:94

10 3:92; 5:90; 7:92; 9:92; 11:90; 13:89; 15:87

Ensemble decision tree accuracy values

Page 71:

Results

• Individual Tree

– Average accuracy of 72% – 76%

– Note: decisions are made using only a single pair

• a, a

• t, g

• Etc.

• Ensemble

– Best achieved 96%

• Step size 1

• 15 trees

Page 72:

Conclusions and Future Work

• Effective in Classification

• Potential to be several orders of magnitude faster

• Other mining tasks (anomaly detection)

• Other application domains (click streams, network intrusion detection, etc.)

Page 73:

Questions

Thank you