PRABINA KUMAR MEHER, SCIENTIST, DIVISION OF STATISTICAL GENETICS, INDIAN AGRICULTURAL STATISTICS RESEARCH INSTITUTE, INDIAN COUNCIL OF AGRICULTURAL RESEARCH, NEW DELHI-110012

A HYBRID APPROACH FOR IDENTIFYING 5’ SPLICING JUNCTION WITH HIGHER ACCURACY

6th World Congress on Biotechnology

Page 2: THE CENTRAL DOGMA

DNA → (Transcription) → Pre-mRNA → (Splicing) → mRNA → (Translation) → Protein

Every GT dinucleotide in a gene is a candidate donor (5’) splice site and needs to be classified as either a true or a false splice site.
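The candidate-site idea above can be sketched in a few lines of Python. This is only an illustration: the function name `candidate_donor_sites` and the example sequence are hypothetical and do not come from the slides.

```python
def candidate_donor_sites(seq):
    """Return the 0-based start index of every GT dinucleotide in seq.

    Each returned position is a candidate donor site that a classifier
    would later label as a true or a false splice site.
    """
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]


if __name__ == "__main__":
    # Hypothetical toy sequence, for illustration only.
    print(candidate_donor_sites("AGGTAAGTCC"))
```

Every index returned is a candidate only; most of them will be false sites, which is why a downstream classifier is needed.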

Page 3: RATIONALE AND GENESIS

Probabilistic: WMM, WAM, MM1, MEM, SAE

Machine Learning: MM1-SVM, WD-SVM, LIK-SVM, DS-SVM

Page 4: RATIONALE AND GENESIS…

Zhang et al. (Expert Systems with Applications, 2006)

Window size: 100 bp

Page 5: RATIONALE AND GENESIS…

Huang et al. (Biochimie, 2006)

Window size: 88 bp

[Figure: encoding workflow. Dinucleotide scoring matrices are built separately from the true splice sites (TSS) and the false splice sites (FSS) of the training data set. Columns are the adjacent position pairs (-44,-43), (-43,-42), …, (42,43), (43,44); rows are the dinucleotides (AA), (AT), (AG), (AC), …, (CG), (CC). The difference matrix of the TSS and FSS scoring matrices is used to encode the training sites and the test sites, giving the encoded training set and the encoded test set.]
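The workflow in this figure can be illustrated with a minimal Python sketch. It rests on assumptions the slide does not confirm: that a "scoring matrix" holds the relative frequency of each dinucleotide at each adjacent position pair, and that a site is encoded by looking up its dinucleotides in the difference matrix. All function names are hypothetical.

```python
from collections import Counter


def scoring_matrix(seqs):
    """Relative dinucleotide frequency at each adjacent position pair (assumed definition)."""
    n_pairs = len(seqs[0]) - 1
    matrix = []
    for i in range(n_pairs):
        counts = Counter(s[i:i + 2] for s in seqs)
        matrix.append({d: c / len(seqs) for d, c in counts.items()})
    return matrix


def difference_matrix(tss_seqs, fss_seqs):
    """TSS scoring matrix minus FSS scoring matrix, column by column."""
    diff = []
    for t_col, f_col in zip(scoring_matrix(tss_seqs), scoring_matrix(fss_seqs)):
        keys = set(t_col) | set(f_col)
        diff.append({d: t_col.get(d, 0.0) - f_col.get(d, 0.0) for d in keys})
    return diff


def encode(seq, diff):
    """Encode a site as the difference-matrix value of each observed dinucleotide."""
    return [diff[i].get(seq[i:i + 2], 0.0) for i in range(len(seq) - 1)]


if __name__ == "__main__":
    # Hypothetical toy training sites, far shorter than the 88 bp window.
    diff = difference_matrix(["AGGT", "AGGT"], ["CCGT", "ACGT"])
    print(encode("AGGT", diff))
```

The same difference matrix, built from the training sites only, would be applied to both training and test sites, so no information from the test set enters the encoding.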

Page 6: RATIONALE AND GENESIS…

1. Determining a threshold is difficult in probabilistic approaches.
2. Setting a threshold is easy in machine learning approaches (MLA).
3. Most of the approaches are species-specific.
4. Accuracy is lower with a sub-optimal window length.

Page 7: DATA for Validation

Species   TSS     FSS
Human     2796    90923
Bovine    10000   10000
Fish      10000   10000
Worm      1000    19000

Data sources: HS3D, UCSC Genome Browser, Kamath et al. 2014

Page 8: DATA for Comparison

NN269       TSS    FSS
Training    1116   4140
Testing     208    782

Each sequence is 15 nt long, with the conserved GT at the 8th and 9th positions.

Page 9: Sequence Encoding

POSITIONAL FEATURE

WMM:

\[ f_1^{P\text{-}T} = \sum_{i=1}^{L} \log_2\left(p_i^{t}\right) \]

\[ f_2^{P\text{-}TF} = \sum_{i=1}^{L} \log_2\left(p_i^{t}\right) - \sum_{i=1}^{L} \log_2\left(p_i^{f}\right) \]

Shapiro and Senapathy:

\[ f_3^{P\text{-}T} = 100 \times \frac{\sum_{i=1}^{L} p_i^{t} - M_t}{M_t - N_t} \]

\[ f_4^{P\text{-}TF} = 100 \times \left( \frac{\sum_{i=1}^{L} p_i^{t} - M_t}{M_t - N_t} - \frac{\sum_{i=1}^{L} p_i^{f} - M_f}{M_f - N_f} \right) \]

Here p_i^t and p_i^f are the frequencies, in the TSS and FSS frequency matrices respectively, of the nucleotide (A, C, G or T) observed at position i of the sequence. M is the sum of the highest frequencies at positions 1 to L and N is the sum of the lowest frequencies at positions 1 to L, both obtained from the nucleotide frequency matrix (subscript t for TSS, f for FSS).
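As a rough illustration of the two WMM-based positional features, the sketch below builds position-wise nucleotide frequency matrices from TSS and FSS training sequences and scores a sequence against them. The function names, the toy sequences and the pseudocount (added so that no logarithm of zero occurs) are assumptions made for this example, not details taken from the slides.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"


def position_frequencies(seqs, pseudocount=1.0):
    """Per-position nucleotide frequencies; the pseudocount is an assumption."""
    length = len(seqs[0])
    total = len(seqs) + pseudocount * len(NUCLEOTIDES)
    matrix = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        matrix.append({n: (counts[n] + pseudocount) / total for n in NUCLEOTIDES})
    return matrix


def log_score(seq, matrix):
    """Sum over positions of log2 of the frequency of the observed nucleotide."""
    return sum(math.log2(matrix[i][base]) for i, base in enumerate(seq))


def positional_features(seq, tss_matrix, fss_matrix):
    """Return (f1, f2): the TSS log score and the TSS-minus-FSS log score."""
    f1 = log_score(seq, tss_matrix)
    f2 = f1 - log_score(seq, fss_matrix)
    return f1, f2


if __name__ == "__main__":
    # Hypothetical toy training sets; real windows are much longer.
    tss = position_frequencies(["AGGTA", "AGGTG", "CGGTA"])
    fss = position_frequencies(["TTGTC", "ACGTT", "CCGTA"])
    print(positional_features("AGGTA", tss, fss))
```

A positive second feature means the sequence looks more like the true sites than the false sites under this simple position-independent model.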

Page 10: CONTD…

DEPENDENCY FEATURE

WAM:

\[ f_5^{D\text{-}T} = \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \log_2\left(p_{ij}^{t}\right) \]

\[ f_6^{D\text{-}TF} = \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \log_2\left(p_{ij}^{t}\right) - \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \log_2\left(p_{ij}^{f}\right) \]

SAE:

\[ f_7^{D\text{-}T} = 2L(L-1) - 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} p_{ij}^{t} \]

\[ f_8^{D\text{-}TF} = 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} p_{ij}^{f} - 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} p_{ij}^{t} \]

Here p_{ij}^t and p_{ij}^f are the TSS-based and FSS-based dependency probabilities for the nucleotides (A, C, G or T) observed at positions i and j.
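A short consistency check on the SAE-based features may help here. It assumes that the double sums run over the L(L-1) ordered pairs of distinct positions and that the eighth feature is the TSS-based SAE score minus the same score computed from the FSS. The symbol f_7^{D-F} is hypothetical and introduced only for this check.

```latex
\[
f_7^{D\text{-}F} = 2L(L-1) - 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} p_{ij}^{f}
\]
\[
f_7^{D\text{-}T} - f_7^{D\text{-}F}
  = 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} p_{ij}^{f}
  - 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} p_{ij}^{t}
  = f_8^{D\text{-}TF}
\]
```

Under these assumptions the constant 2L(L-1) cancels, so the true-versus-false feature depends only on the two probability sums.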

Page 11: CONTD…

COMPOSITIONAL FEATURE

Dimers:

\[ f_9^{C\text{-}I}(\sigma_1\sigma_2) = \frac{n(\sigma_1\sigma_2)}{L-1}; \quad \sigma_1, \sigma_2 \in \{A, C, G, T\} \]

Triplets:

\[ f_{10}^{C\text{-}I}(\sigma_1\sigma_2\sigma_3) = \frac{n(\sigma_1\sigma_2\sigma_3)}{L-2}; \quad \sigma_1, \sigma_2, \sigma_3 \in \{A, C, G, T\} \]

Tetramers:

\[ f_{11}^{C\text{-}I}(\sigma_1\sigma_2\sigma_3\sigma_4) = \frac{n(\sigma_1\sigma_2\sigma_3\sigma_4)}{L-3}; \quad \sigma_1, \sigma_2, \sigma_3, \sigma_4 \in \{A, C, G, T\} \]

Here n(·) is the number of occurrences of the motif in a sequence of length L.
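The compositional features are motif counts divided by the number of overlapping windows of the motif's length (L-1 for dimers, L-2 for triplets, L-3 for tetramers). A minimal sketch, assuming overlapping occurrences are counted and using hypothetical function names:

```python
from itertools import product


def kmer_frequency(seq, kmer):
    """Count of kmer in seq (overlapping) divided by the number of windows, L - k + 1."""
    k = len(kmer)
    windows = len(seq) - k + 1
    count = sum(1 for i in range(windows) if seq[i:i + k] == kmer)
    return count / windows


def composition(seq, k):
    """Frequencies of all 4**k possible k-mers over the alphabet A, C, G, T."""
    return {"".join(p): kmer_frequency(seq, "".join(p)) for p in product("ACGT", repeat=k)}


if __name__ == "__main__":
    # Hypothetical toy sequence, for illustration only.
    seq = "AGGTAAGT"
    print(kmer_frequency(seq, "GT"))
    print(len(composition(seq, 3)))
```

With k = 2, 3 and 4 this yields 16, 64 and 256 values per sequence, which matches the 16+64+256 compositional features counted before feature selection.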

Page 12: Feature Selection

Before selection:

Total   Positional   Dependency   Compositional
344     4            4            16+64+256

F-score of the j-th feature:

\[ F(j) = \frac{\bar{x}_j^{TSS} - \bar{x}_j^{FSS}}{s_j^{TSS} + s_j^{FSS}} \]

After selection:

Positional   Dependency   Compositional
4            4            14+15+12
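A feature-ranking step of this kind can be sketched as follows. The exact F-score used is an assumption here: the sketch takes the absolute difference between the class means of a feature divided by the sum of the class standard deviations. The function names and toy values are hypothetical.

```python
from statistics import mean, stdev


def f_score(tss_values, fss_values):
    """Assumed F-score: |mean difference| / (sum of sample standard deviations)."""
    return abs(mean(tss_values) - mean(fss_values)) / (stdev(tss_values) + stdev(fss_values))


def select_top(features, k):
    """Rank features by F-score and keep the k highest.

    features maps a feature name to a (tss_values, fss_values) pair.
    """
    ranked = sorted(features, key=lambda name: f_score(*features[name]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical toy feature values, for illustration only.
    toy = {
        "f_a": ([1, 2, 3], [4, 5, 6]),
        "f_b": ([1, 2, 3], [1, 2, 4]),
    }
    print(select_top(toy, 1))
```

A feature whose values barely differ between true and false sites gets a score near zero and is dropped first.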

Page 13: Feature Selection…

Feature Type    #Features   Features
Positional      4           f_1^{P-T}, f_2^{P-TF}, f_3^{P-T}, f_4^{P-TF}
Dependency      4           f_5^{D-T}, f_6^{D-TF}, f_7^{D-T}, f_8^{D-TF}
Compositional   41          f_9^{C-I} (dimers): AA, AC, AG, CA, CC, CT, GA, GC, GG, GT, TA, TC, TG, TT
                            f_10^{C-I} (triplets): AAG, AGG, AGT, CAG, GAG, GGG, GGT, GTA, GTC, GTG, TAA, TGA, TGC, TGG, TGT
                            f_11^{C-I} (tetramers): AAGG, AGGT, CAGG, GAGG, GGGT, GGTA, GGTG, GTAA, GTGA, GTGG, TAAG, TGAG

Page 14: Cross validation

[Figure: five-fold cross-validation scheme. The TSS and FSS sets are each divided into five subsets (1 to 5). Subsets 1 to 4 of each class form the training set and subset 5 of each class forms the test set. The classifiers are trained on the training set and used for prediction on the test set.]
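The partitioning shown on this page can be sketched as below. The sketch assumes the standard rotation in which each of the five subsets serves once as the test set; the slide itself only shows subset 5 in that role. The function name and the round-robin assignment of items to subsets are illustrative choices.

```python
def five_fold_splits(tss, fss, k=5):
    """Yield (train, test) lists of (item, label) pairs, one per fold.

    TSS items are labelled 1 and FSS items 0. Each class is split
    separately so that every fold keeps the class proportions.
    """
    for test_fold in range(k):
        train, test = [], []
        for label, items in ((1, tss), (0, fss)):
            for idx, item in enumerate(items):
                # Round-robin assignment of items to the k subsets.
                target = test if idx % k == test_fold else train
                target.append((item, label))
        yield train, test


if __name__ == "__main__":
    # Hypothetical toy data: integers stand in for encoded sites.
    for train, test in five_fold_splits(list(range(10)), list(range(100, 120))):
        print(len(train), len(test))
```

Splitting the two classes separately keeps the TSS to FSS ratio the same in every training and test set, which matters for the imbalanced data sets.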

Page 15: Parameter Optimization

Page 16: Performance measure

Page 17: Performance measure…

            Balanced                          Imbalanced
Measure     Human   Bovine   Fish    Worm     Human   Bovine   Fish    Worm
AUC-ROC     96.05   96.94    96.95   96.24    97.21   97.45    97.41   98.06
AUC-PR      97.64   97.89    97.91   97.90    93.24   93.34    93.38   92.29
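For readers unfamiliar with the two measures, the sketch below computes them from labels and classifier scores. It is illustrative only: AUC-ROC is computed by pairwise comparison of true and false sites, and AUC-PR is approximated by average precision, which may differ from the estimator used for the table. The functions return fractions, whereas the table reports percentages.

```python
def auc_roc(labels, scores):
    """Probability that a true site scores higher than a false site (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true site."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


if __name__ == "__main__":
    # Hypothetical toy predictions, for illustration only.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
    print(auc_roc(labels, scores), average_precision(labels, scores))
```

AUC-PR is sensitive to the ratio of true to false sites while AUC-ROC is not, which is one reason both are reported for the balanced and imbalanced settings.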

Page 18: Comparative Analysis

NN269       #TSS   #FSS
Training    1116   4140
Testing     208    782

Approaches   AUC-ROC   AUC-PR   References
MM1-SVM      97.62     89.58    Baten et al., 2006
LIK-SVM      98.04     92.65    Sonnenburg et al., 2007
WD-SVM       98.50     92.86    Sonnenburg et al., 2007
WDS-SVM      98.13     92.47    Sonnenburg et al., 2007
EFFECT       98.20     92.81    Kamath et al., 2014
Proposed     96.53     93.54

Page 19: Prediction Server

HSplice: http://cabgrid.res.in:8080/hsplice

Page 20: ACKNOWLEDGEMENT

DIRECTOR, INDIAN AGRICULTURAL STATISTICS RESEARCH INSTITUTE, NEW DELHI