
Text Classification Using String Kernels

Presented by Dibyendu Nath & Divya Sambasivan

CS 290D, Spring 2014

Huma Lodhi, Craig Saunders, et al., Department of Computer Science, Royal Holloway, University of London

Intro: Text Classification

• The task of assigning a document to one or more categories.

• Done manually (library science) or algorithmically (information science, data mining, machine learning).

• Learning systems (neural networks or decision trees) work on feature vectors transformed from the input space.

• Text documents cannot readily be described by explicit feature vectors.

Problem Definition

• Input: a corpus of documents.

• Output: a kernel representing the documents. This kernel can then be used to classify, cluster, etc. with existing algorithms that work on kernels, e.g. SVM, perceptron.

• Methodology: find a mapping and a kernel function so that we can apply any of the standard kernel methods for classification, clustering, etc. to the corpus of documents.

Overview

• Motivation

• Kernel Methods

• Algorithms, with increasingly better efficiency

• Approximation

• Evaluation

• Follow Up

• Conclusion


Motivation

• Text documents cannot readily be described by explicit feature vectors.

• Feature extraction requires extensive domain knowledge and risks losing important information.

• Kernel methods are an alternative to explicit feature extraction.


The Kernel Trick

• Map data into a feature space via a mapping φ.

• The mapping need not be computed explicitly; it is accessed through a kernel function.

• Construct a linear function in the feature space.

slide from Huma Lodhi

Kernel Function

slide from Huma Lodhi

A kernel function is a measure of similarity: it returns the inner product between mapped data points,

K(xi, xj) = <Φ(xi), Φ(xj)>
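As a minimal sketch of the kernel trick (not from the slides; the quadratic map and all names here are illustrative), the snippet below computes the same kernel value both through an explicit φ and implicitly, without ever forming φ:

```python
import numpy as np

# Toy illustration (not the string kernel): phi is the quadratic feature map
# R^2 -> R^3, whose inner product can also be computed implicitly as <x, z>^2.

def phi(x):
    """Explicit feature map: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel_explicit(xi, xj):
    """K(xi, xj) = <phi(xi), phi(xj)>, via the explicit mapping."""
    return float(phi(xi) @ phi(xj))

def kernel_implicit(xi, xj):
    """The same value without ever forming phi: K(xi, xj) = <xi, xj>^2."""
    return float(xi @ xj) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 1.0])
assert np.isclose(kernel_explicit(xi, xj), kernel_implicit(xi, xj))  # both 25.0
```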

Kernels for Sequences

• Word Kernel [WK]: bag of words; a word is a sequence of characters delimited by punctuation or spaces.

• n-Gram Kernel [NGK]: all contiguous substrings of n characters. Example, 3-grams of "quick brown": qui, uic, ick, ck_, _br, bro, row, own.

• String Subsequence Kernel [SSK]: all (possibly non-contiguous) subsequences of n symbols.
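A quick sketch of the n-gram extraction (the function name is ours):

```python
def char_ngrams(text, n=3):
    """All contiguous character n-grams of `text`, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("quick brown"))
# ['qui', 'uic', 'ick', 'ck ', 'k b', ' br', 'bro', 'row', 'own']
# (the slide writes spaces as '_' and omits the gram 'k b')
```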

Word Kernels

• Documents are mapped to a very high-dimensional space whose dimensionality equals the number of unique words in the corpus.

• Each entry of the vector is the frequency of the corresponding word in the document.

• Kernel: the inner product between mapped documents gives a sum over all common (weighted) words.

        fish  tank  sea
Doc 1     2     0     1
Doc 2     1     1     0
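A minimal sketch of this word kernel; the two documents below are invented so that their counts match the table rows:

```python
from collections import Counter

def word_kernel(doc1, doc2):
    """Inner product of term-frequency vectors: a sum over common words."""
    c1, c2 = Counter(doc1.split()), Counter(doc2.split())
    return sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())

doc1 = "fish fish sea"  # fish=2, tank=0, sea=1, as in row Doc 1
doc2 = "fish tank"      # fish=1, tank=1, sea=0, as in row Doc 2
print(word_kernel(doc1, doc2))  # 2*1 + 0*1 + 1*0 = 2
```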

String Subsequence Kernels

Basic idea: non-contiguous subsequences. The subsequence c-a-r occurs in:

card, with spanning length 3
custard, with spanning length 6

The more subsequences (of length n) two strings have in common, the more similar they are considered.

Decay factor: subsequences are weighted according to their degree of contiguity in the string by a decay factor λ ∈ (0, 1).

Example

Documents to compare: "car" and "cat", with n = 2. The feature space is spanned by the length-2 subsequences c-a, c-t, a-t, c-r, a-r:

        c-a   c-t   a-t   c-r   a-r
car     λ²     0     0    λ³    λ²
cat     λ²    λ³    λ²     0     0

K(car, car) = 2λ⁴ + λ⁶
K(cat, cat) = 2λ⁴ + λ⁶
K(car, cat) = λ⁴ (only c-a is shared)


Algorithm Definitions

• Alphabet: let Σ be a finite alphabet.

• String: a string s is a finite sequence of characters from the alphabet, with length |s|.

• Subsequence: a vector of indices i = (i₁, ..., iₙ), sorted in ascending order, picking out characters of a string s that form a string u (written u = s[i]).

E.g., the subsequence "car" in "lancasters" has i = [4, 5, 9]; its length is l(i) = iₙ − i₁ + 1 = 9 − 4 + 1 = 6.

Algorithm Definitions

• Feature space: indexed by the set of all strings u ∈ Σⁿ.

• Feature mapping: the feature mapping φ for a string s is given by defining the u coordinate φᵤ(s) for each u ∈ Σⁿ:

φᵤ(s) = Σ_{i : u = s[i]} λ^l(i)

These features measure the number of occurrences of subsequences in the string s, weighting them according to their lengths.

String Kernel

The inner product between two mapped strings is a sum over all common weighted subsequences:

Kₙ(s, t) = Σ_{u ∈ Σⁿ} φᵤ(s) φᵤ(t) = Σ_{u ∈ Σⁿ} Σ_{i : u = s[i]} Σ_{j : u = t[j]} λ^(l(i) + l(j))

As in the example above, K(car, cat) = λ⁴.
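A brute-force sketch of φᵤ and Kₙ as just defined (function names are ours; it enumerates every index vector, so it is usable only for tiny strings). It reproduces the car/cat example:

```python
from collections import defaultdict
from itertools import combinations

def ssk_features(s, n, lam):
    """phi_u(s) = sum over index vectors i with s[i] = u of lam**l(i),
    where l(i) = i_n - i_1 + 1. Brute force over all index vectors."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def ssk_brute(s, t, n, lam):
    """K_n(s, t) = sum_u phi_u(s) * phi_u(t)."""
    ps, pt = ssk_features(s, n, lam), ssk_features(t, n, lam)
    return sum(ps[u] * pt[u] for u in ps.keys() & pt.keys())

lam = 0.5  # powers of 0.5 are exact floats, so == comparisons are safe here
assert ssk_brute("car", "car", 2, lam) == 2 * lam**4 + lam**6
assert ssk_brute("car", "cat", 2, lam) == lam**4  # only c-a is shared
```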

Intermediate Kernel

The recursion uses an intermediate kernel K'ᵢ that counts the length from the beginning of the subsequence through to the end of the strings s and t:

K'ᵢ(s, t) = Σ_{u ∈ Σⁱ} Σ_{i : u = s[i]} Σ_{j : u = t[j]} λ^(|s| + |t| − i₁ − j₁ + 2)

For "car" and "cat" with i = 2:

        c-a   c-t   a-t   c-r   a-r
car     λ³     0     0    λ³    λ²
cat     λ³    λ³    λ²     0     0

K'(car, cat) = λ⁶

Recursive Computation

Base cases:

K'₀(s, t) = 1, for all s, t (null subsequence)
K'ᵢ(s, t) = 0, if min(|s|, |t|) < i (target string is shorter than the search subsequence)
Kₙ(s, t) = 0, if min(|s|, |t|) < n

Recursive step, extending s by one character x (t[1 : j−1] denotes the prefix of t before position j):

K'ᵢ(sx, t) = λ K'ᵢ(s, t) + Σ_{j : t[j] = x} K'ᵢ₋₁(s, t[1 : j−1]) λ^(|t| − j + 2)
Kₙ(sx, t) = Kₙ(s, t) + λ² Σ_{j : t[j] = x} K'ₙ₋₁(s, t[1 : j−1])

Worked example (n = 2), extending s = "car" to sx = "cart" against t = "cat":

K'(car, cat) = λ⁶        K'(cart, cat) = λ⁷ + λ⁷ + λ⁵
K(car, cat) = λ⁴         K(cart, cat) = λ⁴ + λ⁵ + λ⁷
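A direct transcription of these recursions into Python (a sketch for checking small examples; names are ours, and code indices are 0-based where the formulas are 1-based):

```python
def k_prime(s, t, i, lam):
    """K'_i(s, t): base cases plus the recursion on the last character of s."""
    if i == 0:
        return 1.0                      # null subsequence
    if min(len(s), len(t)) < i:
        return 0.0                      # target shorter than search subsequence
    s_, x = s[:-1], s[-1]
    total = lam * k_prime(s_, t, i, lam)
    for j, u in enumerate(t):           # code j is 0-based; formula's j is j + 1
        if u == x:
            total += k_prime(s_, t[:j], i - 1, lam) * lam ** (len(t) - j + 1)
    return total

def ssk_rec(s, t, n, lam):
    """K_n(s, t) by direct recursion."""
    if min(len(s), len(t)) < n:
        return 0.0
    s_, x = s[:-1], s[-1]
    total = ssk_rec(s_, t, n, lam)
    for j, u in enumerate(t):
        if u == x:
            total += k_prime(s_, t[:j], n - 1, lam) * lam ** 2
    return total

lam = 0.5
print(ssk_rec("car", "cat", 2, lam))   # lam**4
print(ssk_rec("cart", "cat", 2, lam))  # lam**4 + lam**5 + lam**7
```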

Efficiency

• Direct computation over all subsequences of length n: O(|Σ|ⁿ)

• Recursive computation: O(n |s| |t|²)

• Dynamic programming: O(n |s| |t|)
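A sketch of the O(n |s| |t|) dynamic program, following the recursion above with an auxiliary table that caches the inner sum (variable names are ours):

```python
def ssk_dp(s, t, n, lam):
    """K_n(s, t) by dynamic programming in O(n * |s| * |t|).

    kp[i][j][k] holds K'_i(s[:j], t[:k]); kpp caches the inner sum
    K''_i(sx, t) so that each cell costs O(1) instead of O(|t|)."""
    m, p = len(s), len(t)
    if min(m, p) < n:
        return 0.0
    # K'_0 = 1 for all prefix pairs.
    kp = [[[1.0] * (p + 1) for _ in range(m + 1)]]
    for i in range(1, n):
        kp.append([[0.0] * (p + 1) for _ in range(m + 1)])
        kpp = [[0.0] * (p + 1) for _ in range(m + 1)]
        for j in range(1, m + 1):
            for k in range(1, p + 1):
                # K''_i(sx, tu) = lam*K''_i(sx, t) + [u == x] lam^2 K'_{i-1}(s, t)
                kpp[j][k] = lam * kpp[j][k - 1]
                if s[j - 1] == t[k - 1]:
                    kpp[j][k] += lam ** 2 * kp[i - 1][j - 1][k - 1]
                # K'_i(sx, t) = lam*K'_i(s, t) + K''_i(sx, t)
                kp[i][j][k] = lam * kp[i][j - 1][k] + kpp[j][k]
    # K_n(sx, t) = K_n(s, t) + lam^2 * sum over matches of K'_{n-1}
    total = 0.0
    for j in range(1, m + 1):
        for k in range(1, p + 1):
            if s[j - 1] == t[k - 1]:
                total += lam ** 2 * kp[n - 1][j - 1][k - 1]
    return total

print(ssk_dp("car", "cat", 2, 0.5))  # lam**4, matches the recursive version
```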

Kernel Normalization

To remove bias due to document length, the kernel is normalized:

K̂(s, t) = K(s, t) / √(K(s, s) · K(t, t))

Setting Algorithm Parameters

The subsequence length n and the decay factor λ are free parameters, set empirically (see Evaluation).
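Normalization as a small wrapper over the ssk_dp sketch above:

```python
import math

def ssk_normalized(s, t, n, lam):
    """Length-normalized kernel: K(s, t) / sqrt(K(s, s) * K(t, t))."""
    return ssk_dp(s, t, n, lam) / math.sqrt(
        ssk_dp(s, s, n, lam) * ssk_dp(t, t, n, lam))

# ssk_normalized("car", "cat", 2, 0.5) gives lam**4 / (2*lam**4 + lam**6)
```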


Kernel Approximation

Suppose we have training points (xᵢ, yᵢ) ∈ X × Y and a kernel function K(x, z) corresponding to a feature space mapping φ : X → F such that K(x, z) = ⟨φ(x), φ(z)⟩.

Consider a set of vectors S = {sᵢ} ⊆ X. If the cardinality of S equals the dimensionality of the space F and the vectors φ(sᵢ) are orthogonal, i.e. K(sᵢ, sⱼ) = C δᵢⱼ for some constant C (δᵢⱼ is the Kronecker delta), then {φ(sᵢ)} forms an orthogonal basis of F and the kernel can be expanded exactly:

K(x, z) = (1/C) Σᵢ K(x, sᵢ) K(sᵢ, z)

Kernel Approximation

If instead of forming a complete orthogonal basis the cardinality of S_Q ⊆ S is less than the dimensionality of F, or the vectors φ(sᵢ) are not fully orthogonal, we can construct an approximation to the kernel K:

K̃(x, z) = (1/C) Σ_{sᵢ ∈ S_Q} K(x, sᵢ) K(sᵢ, z)

If the set S_Q is carefully constructed, a Gram matrix closely aligned with the true Gram matrix can be produced at a fraction of the computational cost.

Problem: choose the set S_Q so that the vectors φ(sᵢ) are orthogonal.

Selecting Feature Subset

A heuristic for obtaining the set S_Q:

1. Choose a substring size n.

2. Enumerate all possible contiguous strings of length n.

3. Choose the x strings of length n which occur most frequently in the dataset; these form the set S_Q.

By definition, all such strings of length n are orthogonal (i.e. K(sᵢ, sⱼ) = C δᵢⱼ for some constant C) when used in conjunction with the string kernel of degree n; a sketch of the heuristic follows below.
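A sketch of this heuristic and the resulting approximate kernel, reusing char_ngrams and ssk_dp from the earlier sketches. For a contiguous length-n basis string the only length-n subsequence is the string itself, with weight λⁿ, so C = K(sᵢ, sᵢ) = λ^(2n):

```python
from collections import Counter

def select_basis(corpus, n, x):
    """Steps 1-3: the x most frequent contiguous length-n strings in the corpus."""
    counts = Counter(g for doc in corpus for g in char_ngrams(doc, n))
    return [g for g, _ in counts.most_common(x)]

def ssk_approx(s, t, basis, n, lam):
    """K~(s, t) = (1/C) * sum_i K(s, s_i) * K(s_i, t), with C = lam**(2*n)."""
    C = lam ** (2 * n)
    return sum(ssk_dp(s, b, n, lam) * ssk_dp(b, t, n, lam) for b in basis) / C
```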

Kernel Approximation Results


Evaluation

Dataset: Reuters-21578, ModApte split, on a selection of categories.

Precision = relevant documents categorized as relevant / total documents categorized as relevant

Recall = relevant documents categorized as relevant / total relevant documents

F1 = 2 · Precision · Recall / (Precision + Recall)
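The same metrics in code (names are ours):

```python
def precision_recall_f1(tp, fp, fn):
    """Metrics from raw counts: tp = relevant categorized relevant,
    fp = irrelevant categorized relevant, fn = relevant missed."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```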


Evaluation: Effect of Sequence Length

Best-performing subsequence lengths per category (from the result plots): k = 5 for most categories, with k = 6 or k = 7 best for a few.

Evaluation: Effect of Decay Factor

Best-performing decay factors per category (from the result plots): λ = 0.03, λ = 0.05, and λ = 0.3.


Follow Up

• String kernels over sequences of words rather than characters: less computationally demanding, no fixed decay factor, and combinations of string kernels.

Cancedda, Nicola, et al. "Word Sequence Kernels." Journal of Machine Learning Research 3 (2003): 1059-1082.

• Extracting semantic relations between entities in natural language text, based on a generalization of subsequence kernels.

Bunescu, Razvan, and Raymond J. Mooney. "Subsequence Kernels for Relation Extraction." NIPS, 2005.

Follow Up

• Homology: a computational biology task of identifying the ancestry of proteins. The model should tolerate up to m mismatches; the kernels used in this method measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches (see the sketch below).
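A brute-force sketch of such a (k, m)-mismatch kernel (our naive version: it enumerates all of Σᵏ, which is exponential in k; practical implementations avoid this):

```python
from itertools import product

def mismatch_features(s, k, m, alphabet):
    """phi_u(s) = number of k-mers of s within m mismatches of u, for u in alphabet^k."""
    phi = {}
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        for u in map("".join, product(alphabet, repeat=k)):
            if sum(a != b for a, b in zip(kmer, u)) <= m:
                phi[u] = phi.get(u, 0) + 1
    return phi

def mismatch_kernel(s, t, k, m, alphabet):
    """K(s, t) = sum_u phi_u(s) * phi_u(t)."""
    ps, pt = mismatch_features(s, k, m, alphabet), mismatch_features(t, k, m, alphabet)
    return sum(ps[u] * pt[u] for u in ps.keys() & pt.keys())

print(mismatch_kernel("ACGT", "AGGT", 3, 1, "ACGT"))
```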


Conclusion

Key idea: use non-contiguous string subsequences to compute similarity between documents, with a decay factor that discounts similarity according to the degree of contiguity.

• A highly computationally intensive method; the authors reduced the time complexity from O(|Σ|ⁿ) to O(n|s||t|) with a dynamic programming approach.

• A still less intensive method: kernel approximation by feature subset selection.

• k and λ are estimated empirically from experimental results.

• Showed promising results only for small datasets.

• Seems to mimic stemming for small datasets.

Any Questions? Thank You :)