Representations for large-scale (Big) Sequence Data Mining

Vijay Raghavan, Ryan Benton, Tom Johnsten, Ying Xie · October 13, 2013

Description

Analyzing and classifying sequence data based on structural similarities and differences is a mathematical problem of escalating relevance. Indeed, a primary challenge in designing machine learning algorithms to analyze sequence data is the extraction and representation of significant features. This paper introduces a generalized sequence feature extraction model, referred to as the Generalized Multi-Layered Vector Spaces (GMLVS) model. Unlike most models, which represent sequence data based on subsequence frequency, the GMLVS model represents a given sequence as a collection of features, where each individual feature captures the spatial relationship between two subsequences and can be mapped into a feature vector. The utility of this approach is demonstrated via two special cases of the GMLVS model, namely Lossless Decomposition (LD) and the Multi-Layered Vector Spaces (MLVS).

Transcript of Representations for large-scale (Big) Sequence Data Mining

Page 1:

Representations for Large-scale (Big) Sequence Data Mining: A Tale of Two Vector Space Models

Laboratory for InterNet Computing

Vijay Raghavan, Ryan Benton, Tom Johnsten, Ying Xie

October 13, 2013

Page 2:

Outline

• Introduction

• Related Work

• Generalized Multi-Layered Vector Spaces

– Lossless Decomposition (LD) Model of Sequence Data (Granular Computing Paradigm)

– Applications of LD to Bioinformatics

– Multi-Layered Vector Spaces (Turkish Music)

– Applications of MLVS to Bioinformatics

• Conclusions and Future Work

Page 3:

Introduction

Page 4:

Motivation

• Traditional Turkish Music

• Can Akkoç

– Institute of Applied Mathematics, The Middle East Technical University, Ankara, Turkey

– Mathematics and Statistics Department, University of South Alabama

Page 5:

Improv Melody Line (Turkish Music)

RDCSDCSDRSRDSDSCNCNCSNANACDSNCSCSDSCSCRDSCSCNANSDHAGHANANENCNMGAGANANCSDNCNCSCSDSDRDSANCSNSDCRDNCSDSCSD


R_Rast, D_Dügâh, S_Segâh, C_Çargah, N_Neva, H_Hüseyni, A_Acem, E_Eviç, G_Gerdaniye, M_Muhayyer

Page 6:

Motivation

• Traditional Turkish Music

– How to automatically categorize into families?

– How can it be represented?

Page 7:

Importance

• To analyze and classify arbitrary sequences

– Using structural similarities and differences

– Mathematical problem of importance

• Examples

– Music sequences

– Bio-sequences

– Web & computer network access logs

– Click streams

Page 8:

Related Work

• Vector Space Model in Information Retrieval

• Convolution Kernels in NLP

Page 9:

Generalized Multi-Layered Vector Spaces (GMLVS)

Page 10:

GMLVS Properties

• Discover patterns

– Defined in terms of the alphabet

– Collection of sequences

– Local and global

• Reconstruct a sequence from model representation

• Facilitate data mining tasks

– Descriptive

– Predictive

Page 11:

GMLVS Model Formulation

• A sequence S of finite length |S|

– Defined over a finite alphabet β

– Viewed as a collection of generated subsequences

– Of length t, where t = 1, ..., |S| − 1

Page 12:

GMLVS Model Formulation

• A feature f

– A pair of subsequences f = (i, j)

• i and j ∈ β*

– A specified step value m

• m stands for the step size of a given feature

• 0 ≤ m ≤ k

Page 13:

Interpretation of m

• m = 1

– f represents a consecutive subsequence

• m > 1

– f is a subsequence with a gap

– The gap is filled by an arbitrary sequence of (m − 1) symbols

– i and j are called, respectively, the leading and trailing subsequences

• m = 0

– The leading subsequence is element(s) from the alphabet

– The trailing element is the null symbol, whose length is zero

– The gap interpretation is not meaningful, since (m − 1) is negative

Page 14:

Example

• Sequence

– ILLNQNLVRSIKDSFVVTLNSNLVLSF

• Feature 1

– i = ILL

– j = NL

– m = 3

Page 15:

Example

• Sequence

– ILLNQNLVRSIKDSFVVTLNSNLVLSF

• Feature 3

– i = LN

– j = NL

– m = 2
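To make the feature definition concrete, here is a minimal Python sketch (our own naming, not the authors' implementation) that extracts the start positions of a GMLVS feature f = (i, j) with step m; positions are 0-based, unlike the 1-based positions used on later slides.

```python
def feature_occurrences(s, i, j, m):
    """Return 0-based start positions of feature f = (i, j) with step m:
    the leading subsequence i, followed (m - 1) arbitrary symbols later
    by the trailing subsequence j (m = 1 means i and j are consecutive)."""
    span = len(i) + (m - 1) + len(j)      # symbols one occurrence covers
    positions = []
    for p in range(len(s) - span + 1):
        q = p + len(i) + (m - 1)          # start of the trailing subsequence
        if s[p:p + len(i)] == i and s[q:q + len(j)] == j:
            positions.append(p)
    return positions

seq = "ILLNQNLVRSIKDSFVVTLNSNLVLSF"
print(feature_occurrences(seq, "ILL", "NL", 3))   # Feature 1: [0]
print(feature_occurrences(seq, "LN", "NL", 2))    # Feature 3: [2, 18]
```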

Page 16:

Validity

• A feature f = (i, j) is valid if

– |i| + |j| + m < |S|

• Result: the number of possible features is a function of the number of possible subsequences and the range of step sizes.

Page 17:

Families and Clusters

• Allowing m to vary generates a multitude of m-step pairs (families)

• A multi-layered k-clustering C_k

– Composed of P_m|(i,j), where m = 1, 2, ..., k

– Size of a cluster: |C_k| = |β|^t · k, where t = |i| + |j|

Page 18:

Lossless Decomposition of Sequence Data

Page 19:

Lossless Decomposition of Sequence Data

• Many research works in granular computing

– Focus on information granulation

– Form granules from structured data

• We focus on the top-down process

– Generating granules via feature (information) extraction

Page 20:

Principle of Lossless Decomposition

• Require that the original sequence data can be rebuilt from the generated granules.

• For example, granules decomposed from a protein sequence may be defined as

– Individual amino acids, plus

– Positions of the corresponding amino acid in the sequence

Page 21:

More Formally

• Set of feature vectors G from a set of extracted features of the form f = (i, NULL),

– where i ∈ β* and m = 0, such that

– G = {<f_p>},

– where f_p is the starting position of the p-th instance of feature f in S.

• S can be reconstructed directly from G.

Page 22:

Lossless Decomposition of a Protein Sequence into Granules

• The primary structure of a protein

– A linear sequence over an alphabet of 20 amino acids

– Each amino acid represented by a letter such as L or K

• Example sequence segment:

ILLNQNLVRSIKDSFVVTLISSEVLSF

Page 23:

• Sequence “ILLNQNLVRSI” decomposed into the following granules:

<I>:{0, 10}, <L>:{1, 2, 6}, <N>:{3, 5}, <Q>:{4}

<V>:{7}, <R>:{8}, <S>:{9}, <A>:{-1},<D>:{-1}

• -1 indicates that the corresponding amino acid does not appear in the protein sequence.
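The decomposition above, and the losslessness it must satisfy, fit in a few lines of Python. A minimal sketch (our own function names; 0-based positions as on this slide; the alphabet is restricted to the symbols listed in the example rather than all 20 amino acids):

```python
def decompose(s, alphabet):
    """Lossless decomposition: one granule per symbol, holding the
    0-based positions where the symbol occurs ([-1] if absent)."""
    granules = {a: [] for a in alphabet}
    for pos, symbol in enumerate(s):
        granules[symbol].append(pos)
    return {a: p if p else [-1] for a, p in granules.items()}

def reconstruct(granules):
    """Rebuild the original sequence from its granules, demonstrating
    that info(S) = info(G)."""
    cells = {}
    for symbol, positions in granules.items():
        for pos in positions:
            if pos >= 0:
                cells[pos] = symbol
    return "".join(cells[pos] for pos in sorted(cells))

g = decompose("ILLNQNLVRSI", "ILNQVRSAD")
print(g["L"], g["A"])        # [1, 2, 6] [-1]
print(reconstruct(g))        # ILLNQNLVRSI
```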

Page 24:

More formally

• Given a protein sequence S

• Assuming A denotes the set of 20 amino acids,

• We conduct lossless decomposition of S into a set of granules G such that info(S) = info(G).

Page 25:

• Each granule can be decomposed into granules of finer resolution.

• Given the sequence “ILLNQNLVRSI”

• Granule <L>:{1, 2, 6} can be decomposed into

<LL>:{1}, <LN>:{2}, <LV>:{6}, <LA>:{-1}, <LD>:{-1}, …

• Multi-Layer
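The same idea, one layer down: order-2 granules record the positions of each ordered pair of consecutive symbols. A sketch reproducing the decomposition of <L> shown above (hypothetical naming, 0-based positions):

```python
def decompose_pairs(s, alphabet):
    """Finer-resolution granules: 0-based positions of each ordered
    pair of consecutive symbols ([-1] for pairs that never occur)."""
    granules = {a + b: [] for a in alphabet for b in alphabet}
    for pos in range(len(s) - 1):
        granules[s[pos:pos + 2]].append(pos)
    return {pair: p if p else [-1] for pair, p in granules.items()}

g2 = decompose_pairs("ILLNQNLVRSI", "ILNQVRSAD")
print(g2["LL"], g2["LN"], g2["LV"], g2["LA"])   # [1] [2] [6] [-1]
```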

Page 26:

Granules: Pairs of Amino Acids

• Given a protein sequence S

• Assuming A denotes the set of 20 amino acids,

• We conduct lossless decomposition of S into a set of granules G such that info(S) = info(G).

Page 27:

• If decomposing a protein sequence based on positions of all possible combinations of

– 3 consecutive amino acids:

• A set of 20^3 = 8,000 granules for each protein sequence

– 4 consecutive amino acids:

• A set of 20^4 = 160,000 granules for each protein sequence

– n consecutive amino acids:

• A set of 20^n granules

Page 28:

Applications to Bioinformatics: Lossless Decomposition

• Protein Sequence Alignment

Page 29:

Protein Sequence Alignment: Background

• A fundamental task in protein sequence analysis

– Evaluating the similarity relationship between two protein sequences

• Similarity in sequences may indicate homology in protein structures or functions.

Page 30:

• The Needleman-Wunsch algorithm

– Identifies a globally optimal pairwise sequence alignment

– Uses a dynamic programming process

– A maximum similarity score can be derived from the optimal pairwise alignment (see the sketch below)
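For reference, a compact sketch of the Needleman-Wunsch scoring recurrence. It computes the maximum global similarity score only; the full algorithm also records traceback pointers to recover the alignment itself, and real protein alignment would use a substitution matrix (e.g., BLOSUM) rather than the flat match/mismatch scores assumed here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via dynamic programming: cell (r, c) holds
    the best score for aligning a[:r] with b[:c]."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        score[r][0] = r * gap                 # align a[:r] against gaps
    for c in range(cols):
        score[0][c] = c * gap                 # align b[:c] against gaps
    for r in range(1, rows):
        for c in range(1, cols):
            diag = score[r - 1][c - 1] + (match if a[r - 1] == b[c - 1] else mismatch)
            score[r][c] = max(diag, score[r - 1][c] + gap, score[r][c - 1] + gap)
    return score[-1][-1]

print(needleman_wunsch("ILLNQNLVRSI", "ILLNQNLVRS"))   # 9: ten matches, one gap
```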

Page 31:

• Smith-Waterman algorithm

– A variation of the Needleman-Wunsch algorithm

– Finds the highest-scoring local alignment between two sequences.

Page 32:

• Both methods

– Are guaranteed to obtain certain optimal alignments between two sequences

– Have time complexity O(MN), where M and N are the lengths of the two sequences

• If a sequence

– Is used as a query against a sequence database of K sequences

– And the search uses either Needleman-Wunsch or Smith-Waterman

– Then the search time complexity is O(MNK)

Page 33:

Protein sequence alignment based on granular representation

• Given two protein sequences S1 and S2

• Apply Lossless Decomposition

– Convert S1 into G1 (set of granules)

– Convert S2 into G2 (another set of granules)

Page 34:

Protein sequence alignment based on granular representation

• Need: compute pairwise similarity/distance between protein sequences

• Solution: a method that distributes the process of pairwise sequence alignment across the individual granules generated by lossless decomposition

Page 35:

• Example: Decompositions are positions of all possible combinations of 2 consecutive amino acids.

Pages 36–38: [figure slides continuing the granule-based alignment example]

Page 39:

• Advantages of granular position sequence over original protein sequence

– Much shorter

– Alignment computation much more efficient

Page 40:

• Distance between the sequence S1 and the sequence S2

– Aggregation of the distances between granules

– Hence, calculation of the distance between two sequences with 400 granule pairs can be distributed across 400 parallel calculations (see the sketch below)

– The actual number of granule pairs present in a given collection is likely much smaller
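A sketch of how the aggregation distributes. The slides leave the per-granule distance measure open, so the placeholder below simply scores two position lists by the size of their symmetric set difference; any per-granule alignment score could be dropped in. concurrent.futures stands in for whatever parallel framework is actually used.

```python
from concurrent.futures import ProcessPoolExecutor

def granule_distance(p1, p2):
    """Placeholder per-granule distance: symmetric difference of the
    two position sets (absent granules, marked -1, contribute nothing)."""
    s1, s2 = set(p1) - {-1}, set(p2) - {-1}
    return len(s1 ^ s2)

def sequence_distance(g1, g2, workers=4):
    """Distance between two sequences as the sum of per-granule
    distances; each of the up to 400 amino-acid-pair granules is
    scored independently, so the map parallelizes trivially."""
    keys = sorted(set(g1) | set(g2))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(granule_distance,
                         [g1.get(k, [-1]) for k in keys],
                         [g2.get(k, [-1]) for k in keys])
        return sum(parts)
```

With decompose_pairs from the earlier sketch, sequence_distance(decompose_pairs(s1, A), decompose_pairs(s2, A)) splits one pairwise comparison into up to 400 independent granule comparisons.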

Page 41:

Preliminary Experimental Studies

• We studied the performance of the proposed approach in protein sequence classification on 53 SCOP protein families.

Page 42:

• Used the 1-nearest neighbor (1NN) approach

– Predict whether a test sequence belongs to a given family

• More specifically,

– For each test sequence,

– Compare its similarity to each training sequence

– Assign the class label of the most similar training sequence

• The accuracy rate for each family is reported (a sketch of the 1NN rule follows).
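A sketch of the 1NN rule, built on the (hypothetical) sequence_distance from the previous sketch:

```python
def predict_1nn(test_granules, training_set):
    """1-nearest-neighbor classification: return the family label of the
    training sequence with the smallest granular distance to the test
    sequence. training_set is a list of (label, granules) pairs."""
    best_label, best_dist = None, float("inf")
    for label, train_granules in training_set:
        d = sequence_distance(test_granules, train_granules)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```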

Page 43: [figure slide: per-family accuracy results]

Page 44:

Discussion

• Scale up

• Distributed framework

• Protein sequence kernels (similarity)

– Pairwise distance calculation

Page 45:

Multi-Layered Vector Spaces

Page 46:

Multi-Layered Vector Spaces

• Sequence

– Disassembled into ordered pairs

– Separated by (m-1) spaces

• Features

– f = (i, j), where i and j ∈ β*

– Locations of leading elements can serve as components of a feature vector

Page 47:

Multi-Layered Vector Spaces

• Sequence S

– Multi-layered structure

– Set of m-step ordered pairs (of features) (i, j)

• Denoted by P_m|(i,j), where 1 ≤ m ≤ k

• Total number of feature vectors that can be generated from an alphabet β is |β|².

Page 48:

Example

m=1 G A

m=2 G * A

m=3 G * * A

m=4 G * * * A

m=5 G * * * * A

m=6 G * * * * * A

Page 49:

Example: m-Step Ordered Pairs

• Given S:

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c

Page 50:

Example: m-Step Ordered Pairs

• Given S:

• 1-step ordered pairs (g,c)

– [1,2], [6,7], [11,12], [19,20]

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c

Page 51:

Example: m-Step Ordered Pairs

• Given S:

• 2-step ordered pair (g,t)

– [1,3], [6,8], [11,13]

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c
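These ordered-pair lists are mechanical to generate. A short sketch (our own naming; 1-based positions to match the slides):

```python
def m_step_pairs(s, a, b, m):
    """1-based anchor pairs [p, p + m] where symbol a at position p is
    followed m positions later by symbol b."""
    return [[p + 1, p + 1 + m]
            for p in range(len(s) - m)
            if s[p] == a and s[p + m] == b]

s = "gctgggctcagctaatgagc"
print(m_step_pairs(s, "g", "c", 1))   # [[1, 2], [6, 7], [11, 12], [19, 20]]
print(m_step_pairs(s, "g", "t", 2))   # [[1, 3], [6, 8], [11, 13]]
```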

Page 52:

Feature Vectors

• Boolean vectors

– Record locations of anchor positions as Boolean values (1, 0)

• Suitable for equal-length sequences.

Page 53:

Example: Feature Vectors

• Given S:

• 1-step ordered pairs (g,c)

– [1,2], [6,7], [11,12], [19,20]

– <1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0>

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c

Page 54:

Example: Feature Vectors

• Given S:

• 2-step ordered pair (g,t)

– [1,3], [6,8], [11,13]

– <1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0>

1 2 3 4 5 6 7 8 9 10

g c t g g g c t c a

11 12 13 14 15 16 17 18 19 20

g c t a a t g a g c
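A sketch turning anchor positions into the Boolean vectors shown above, one component per sequence position:

```python
def boolean_vector(s, a, b, m):
    """Boolean MLVS feature vector: component p is 1 iff position p + 1
    (1-based) anchors an m-step occurrence of the ordered pair (a, b)."""
    return [1 if p + m < len(s) and s[p] == a and s[p + m] == b else 0
            for p in range(len(s))]

s = "gctgggctcagctaatgagc"
print(boolean_vector(s, "g", "c", 1))   # 1s at positions 1, 6, 11, 19
print(boolean_vector(s, "g", "t", 2))   # 1s at positions 1, 6, 11
```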

Page 55:

Combinations

• Fix pair and step size

– One vector

• Fix step size, all pairs

– Multiple vectors

– Number of vectors: A², where A is the size of the alphabet

– “Multi-Layered Vector Spaces”

Page 56:

Combinations

• Fix pair, multiple step sizes

– Multiple vectors

– Number of vectors: k, where k is the number of step sizes

• All pairs, multiple step sizes

– Multiple vectors

– Number of vectors: A² · k, where A is the size of the alphabet

Page 57: [figure slide]

Page 58:

Concern

• Initially, we have a ‘compact’ representation

• Moved to a very large, non-compact, sparse representation

– Bad for most machine learning/data mining algorithms

Page 59:

Feature Vectors

• Partition sequence into n equal segments

– Count the number of anchor positions falling into each segment

• n can be adjusted to meet expectations on resolution

• Suitable for unequal-length sequences

Page 60:

Example: Feature Vectors

• (g,c), m = 1, n = 4

– [1,2], [6,7], [11,12], [19,20]

– <1,0,0,0,0|1,0,0,0,0|1,0,0,0,0|0,0,0,1,0>

– <1,1,1,1>

• (g,t), m = 2, n = 4

– [1,3], [6,8], [11,13]

– <1,0,0,0,0|1,0,0,0,0|1,0,0,0,0|0,0,0,0,0>

– <1,1,1,0>
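The binned variant is a small change: count anchors per segment instead of recording them positionally. A sketch (our own naming; in the slide's example len(s)/n divides evenly, and the min() guard handles ragged final segments otherwise):

```python
def binned_vector(s, a, b, m, n):
    """Partition the sequence into n equal segments and count the m-step
    (a, b) anchor positions falling into each, yielding a fixed-length
    vector even for unequal-length sequences."""
    counts = [0] * n
    seg = len(s) / n
    for p in range(len(s) - m):
        if s[p] == a and s[p + m] == b:
            counts[min(int(p / seg), n - 1)] += 1
    return counts

s = "gctgggctcagctaatgagc"
print(binned_vector(s, "g", "c", 1, 4))   # [1, 1, 1, 1]
print(binned_vector(s, "g", "t", 2, 4))   # [1, 1, 1, 0]
```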

Page 61:

Applications to Bioinformatics: MLVS

• Signal Prediction

• Cancer Mutation

• Eukaryota versus Euglenozoa

Page 62:

Eukaryota versus Euglenozoa

Page 63:

Eukaryota versus Euglenozoa

• Eukaryota

– Domain

• Euglenozoa

– Phylum

– Belongs to

• Kingdom: Excavata

• Domain: Eukaryota

Page 64:

Data

• Number of sequences

– 43 Eukaryota

– 44 Euglenozoa

• Each specimen

– Sequence of DNA

• Alphabet: A, T, C, G

• Alphabet size = 4

Page 65:

Preprocessing

• Create vectors

– Step sizes: 1, 2, 3, 10

– All possible combinations of pairs (4² = 16)

– 100 bins

• Result

– 64 files (16 pairs × 4 step sizes)

• One per pair/step-size combination

Page 66:

Classification Techniques

• Considered two methods:

– C4.5

• A popular decision tree method

– Ensemble

• Take K trees

• Each tree votes.

• The ‘class’ with the most votes wins

• Considered ensembles of 3, 5, 7, 9, 11, 13, 15 trees.
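A runnable sketch of the voting scheme on stand-in data. scikit-learn's decision tree (CART) substitutes for C4.5, and bagging is one plausible way to diversify the K trees; the slides do not say how the ensemble's trees actually differ (they may, for instance, each be trained on a different pair/step-size file), so none of this should be read as the authors' exact setup.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 87 sequences (43 Eukaryota, 44 Euglenozoa), 100 bins each.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(87, 100))
y = np.array([0] * 43 + [1] * 44)

# Hold-out evaluation, as on the next slide.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
for k in (3, 5, 7, 9, 11, 13, 15):
    ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=k)
    ensemble.fit(X_train, y_train)          # each of the k trees votes;
    print(k, "trees:", ensemble.score(X_test, y_test))  # majority wins
```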

Page 67:

Experiments

• Hold-Out Method

– Divide data, randomly, into

• Training set – used to create the C4.5 classifier

– Tree could also be part of an ensemble

• Test set – used to evaluate the C4.5 classifier

– Also used to evaluate the ensemble

– Repeat 5 times and average the results

• Measure

– Accuracy

Page 68:

V_m|(i,j)   m=1   m=2   m=3   m=10
(a,a)        75    82    75    69
(a,c)        77    63    69    64
(a,g)        75    89    83    75
(a,t)        77    78    82    71
(c,a)        69    68    71    67
(c,c)        76    75    82    87
(c,g)        64    78    74    67
(c,t)        76    68    77    67
(g,a)        70    78    74    72
(g,c)        75    67    82    82
(g,g)        70    66    84    85
(g,t)        76    64    69    76
(t,a)        66    68    72    67
(t,c)        87    75    74    70
(t,g)        72    70    67    67
(t,t)        76    61    75    72
Average      74    72    76    72

Decision tree accuracy values for selected feature vectors

Page 69:

(k,m)-mismatch method accuracy values (Baseline)

k     m=0 (%)   m=1 (%)
4       90        89
5       93        88
6       93        90
7       91        93
8       91        93
9       90        91
10      86        90

Page 70:

m # Classifiers : Accuracy (%)

1 3:90; 5:93; 7:93; 9:94; 11:92; 13:92; 15:96

2 3:90; 5:87; 7:87; 9:90; 11:87; 13:83; 15:79

3 3:87; 5:92; 7:91; 9:92; 11:93; 13:94; 15:94

10 3:92; 5:90; 7:92; 9:92; 11:90; 13:89; 15:87

Ensemble decision tree accuracy values

Page 71:

Results

• Individual Tree

– Average accuracy of 72% – 76%

– Note: decisions are made using only a single pair

• a, a

• t, g

• Etc.

• Ensemble

– Best achieved 96%

• Step size 1

• 15 trees

Page 72:

Conclusions and Future Work

• Effective in Classification

• Potential to be several orders of magnitude faster

• Other mining tasks (anomaly detection)

• Other application domains (click streams, network intrusion detection, etc.)

Page 73:

Questions

Thank you