Integration II

Prediction

Kernel-based data integration

• SVMs and the kernel “trick”
• Multiple-kernel learning
• Applications
  – Protein function prediction
  – Clinical prognosis

SVMs

These are expression measurements from two genes for two populations (cancer types).

The goal is to define a cancer type classifier...

[Noble, Nat. Biotechnology, 2006]

One type of classifier is a “hyperplane” that separates the measurements from the two cancer types.


E.g., a one-dimensional hyperplane


E.g., a two-dimensional hyperplane



Suppose that the measurements are separable: there exists a hyperplane that separates the two types.

Then there are an infinite number of separating hyperplanes.

Which to use?


The maximum-margin hyperplane

Equivalently: minimizer of
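
A standard way to write this objective is sketched below. The symbols are assumptions, not taken from the original: samples x_i with labels y_i in {-1, +1}, weight vector w, and offset b. With this scaling the margin width is 2/||w||, so maximizing the margin amounts to minimizing ||w||.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1,\qquad i = 1,\dots,n
```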


Which hyperplane to use?

In reality: the minimizer of a trade-off between
1. classification error (the loss term), and
2. margin size (the penalty term)
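
The trade-off above can be sketched in code. This is a minimal illustration, assuming the usual hinge loss for the classification error and 0.5 * ||w||^2 for the margin penalty; the function name, the weight `C`, and the toy data are hypothetical and not from the original.

```python
def soft_margin_objective(w, b, X, y, C=1.0):
    """Return C * (sum of hinge losses) + 0.5 * ||w||^2 for a linear classifier."""
    # Loss: zero for samples on the correct side of the hyperplane with margin >= 1.
    loss = 0.0
    for x_i, y_i in zip(X, y):
        score = sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b
        loss += max(0.0, 1.0 - y_i * score)
    # Penalty: a small ||w|| corresponds to a large margin (width 2 / ||w||).
    penalty = 0.5 * sum(w_k * w_k for w_k in w)
    return C * loss + penalty


# Toy data (assumed): two "genes" per sample, labels +1 / -1 for the two types.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (1.0, 0.0)]
y = [1, 1, -1, -1]

separating = soft_margin_objective((1.0, 1.0), -3.0, X, y)   # no hinge loss incurred
shifted = soft_margin_objective((1.0, 1.0), 0.0, X, y)       # two samples violate the margin
```

A larger `C` puts more weight on classification error; a smaller `C` favors a wider margin.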

This is the primal problem

This is the dual problem
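
For reference, the textbook soft-margin forms of the two problems are sketched below. The notation is an assumption (slack variables \xi_i, trade-off constant C, dual variables \alpha_i, labels y_i in {-1, +1}) and may differ from the slide's. Note that the dual depends on the samples only through the entries K_{ij}.

```latex
% Primal (soft margin)
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\quad \xi_i \ge 0

% Dual
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K_{ij}
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i y_i = 0,
\qquad K_{ij} = x_i^{\top}x_j
```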

What is K?

The kernel matrix:
• each entry is an inner product between two samples
• one interpretation: sample similarity
• the measurements enter the problem only through K
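
As a small illustration, the kernel matrix for the plain (linear) inner product can be computed as follows. The function name and the toy samples are assumptions made for this sketch.

```python
def linear_kernel_matrix(X):
    """Return K with K[i][j] = <x_i, x_j> for a list of sample vectors X."""
    n = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]


# Three toy samples with two measurements each (assumed values).
X = [(1.0, 2.0), (3.0, 0.0), (0.0, 1.0)]
K = linear_kernel_matrix(X)
```

K is symmetric, and its diagonal holds the squared norm of each sample.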

Implication: non-linearity is obtained by appropriately defining the kernel matrix K.

E.g. quadratic kernel:
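
One common form, assumed here, is the homogeneous quadratic kernel k(x, z) = (x . z)^2. The sketch below checks, for two-dimensional inputs, that it equals an ordinary inner product after an explicit quadratic feature map. The names `quadratic_kernel` and `phi` are hypothetical.

```python
import math


def quadratic_kernel(x, z):
    """Homogeneous quadratic kernel k(x, z) = (<x, z>)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def phi(x):
    """Explicit feature map for 2-D inputs: <phi(x), phi(z)> equals k(x, z)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = quadratic_kernel(x, z)
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
```

The kernel gives the same value as the explicit map without ever building the higher-dimensional vectors, so a linear separator in the mapped space is a quadratic boundary in the original measurements.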

Another implication: there is no need for measurement vectors. All that is required is a similarity between samples.

E.g. string kernels
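
The original does not say which string kernel is meant. As one illustration, a k-mer "spectrum" kernel counts the substrings of length k that two sequences share. The function names and the toy sequences below are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of two strings."""
    counts_s, counts_t = kmer_counts(s, k), kmer_counts(t, k)
    # Counter returns 0 for k-mers that do not occur in t.
    return sum(count * counts_t[kmer] for kmer, count in counts_s.items())


# Toy amino-acid strings (assumed, not real proteins).
shared = spectrum_kernel("MKVLA", "KVLAM")
self_similarity = spectrum_kernel("MKVLA", "MKVLA")
```

The similarity is computed directly from the sequences, so no fixed-length measurement vector is ever constructed by hand.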

Protein Structure Prediction

Protein structure

Protein sequence

Sequence similarity

Kernel-based data fusion

Core idea: use different kernels for different genomic data sources. A linear combination of kernel matrices is itself a kernel (under certain conditions, e.g. non-negative coefficients).


Kernel to use in prediction:
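
A minimal sketch of such a combined kernel, assuming a weighted sum K = sum_m mu_m * K_m with non-negative weights. The function name, the weights, and the two toy matrices are hypothetical.

```python
def combine_kernels(kernels, weights):
    """Weighted sum of same-sized kernel matrices: K = sum_m weights[m] * kernels[m]."""
    if len(kernels) != len(weights):
        raise ValueError("need one weight per kernel matrix")
    if any(w < 0 for w in weights):
        # Non-negative weights are a sufficient condition for K to stay a valid kernel.
        raise ValueError("weights must be non-negative")
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]


# Toy 2x2 kernel matrices for two data sources (assumed values).
K_expression = [[1.0, 0.5], [0.5, 1.0]]
K_sequence = [[2.0, 0.0], [0.0, 2.0]]
K = combine_kernels([K_expression, K_sequence], [0.5, 0.25])
```

The combined matrix can then be passed to any kernel method that accepts a precomputed kernel.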


In general, the task is to estimate the SVM function along with the coefficients of the kernel matrix combination.

This is a well-studied type of optimization problem (a semi-definite program).
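
One way this joint problem is commonly written is sketched below. The notation is an assumption: kernel weights \mu_m, SVM dual variables \alpha_i, labels y_i in {-1, +1}, trade-off constant C, and a fixed trace c. The inner maximization is the usual SVM dual, and the outer minimization chooses the kernel weights.

```latex
\min_{\mu}\ \max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K_{ij}(\mu)
\quad\text{s.t.}\quad
0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i y_i = 0,

K(\mu) = \sum_{m}\mu_m K_m \succeq 0,\qquad \operatorname{trace} K(\mu) = c
```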


Same idea applied to cancer classification from expression and proteomic data


• Prostate cancer dataset
  – 55 samples
  – Expression from microarray
  – Copy number variants

• Outcomes predicted:
  – Grade, stage, metastasis, recurrence
