Page 1:

Pattern Recognition using Support Vector Machine and

Principal Component Analysis

Ahmed Abbasi

MIS 510

3/21/2007

Page 2:

Outline

• Background
• Support Vector Machine
  – Classification
    • Linear Kernel – Applications: Text Categorization
    • Non-Linear Kernels – Applications: Document Categorization
    • Ensemble Methods – Applications: Image Recognition
  – Regression and Feature Selection
• Principal Component Analysis
  – Standard PCA – Applications: Style Categorization
  – Kernel PCA – Applications: Image Categorization
  – PCA Ensembles – Applications: Style Categorization
• SVM and PCA Resources

Page 3:

Background

• Statistical Pattern Recognition
  – Includes classic problems such as character recognition and medical diagnosis.
  – Machine learning algorithms have become popular for pattern recognition, due to enhanced computational power over the past 30-40 years.
  – Machines are effective for structured and (in some cases) semi-structured problems.
  – Popular recent data mining applications include credit scoring, text categorization, and image recognition.

Page 4:

Background

• Data Mining Terminology
  – It is important to first review some common data mining terms.
  – Data mining data is typically represented using a feature matrix.
• Features
  – Attributes used for analysis
  – Represented by columns in the feature matrix
• Instances
  – Entities with certain attribute values
  – Represented by rows in the feature matrix
  – A single row of the matrix is also called a feature vector.
• Class Labels
  – Indicate the category for each instance.
  – The example below has two classes (C1 and C2).
  – Only used for supervised learning.

The Feature Matrix

Class   F1    F2    F3   F4   F5
C1      41    1.2   2    1    3.6
C2      63    1.5   4    0    3.5
C1      109   0.4   6    1    2.4
C1      34    0.2   1    0    3.0
C1      33    0.9   6    1    5.3
C2      565   4.3   10   0    3.2
C2      21    4.3   1    0    1.2
C2      35    5.6   2    0    9.1

Columns (F1-F5) are the features: the attributes used to classify instances. Rows are the instances, and each instance has a class label.
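To make the layout concrete, here is a minimal sketch in Python with NumPy. NumPy is my choice, not something the slides prescribe; the values are the ones from the table above.

```python
import numpy as np

# Feature matrix: one row per instance, one column per feature (F1-F5).
X = np.array([
    [41, 1.2, 2, 1, 3.6],
    [63, 1.5, 4, 0, 3.5],
    [109, 0.4, 6, 1, 2.4],
    [34, 0.2, 1, 0, 3.0],
    [33, 0.9, 6, 1, 5.3],
    [565, 4.3, 10, 0, 3.2],
    [21, 4.3, 1, 0, 1.2],
    [35, 5.6, 2, 0, 9.1],
])

# Class labels: one per instance; only needed for supervised learning.
y = np.array(["C1", "C2", "C1", "C1", "C1", "C2", "C2", "C2"])

print(X.shape)  # (8, 5): 8 instances, 5 features
```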

Page 5:

Background

• Loan Application Data Example
  – Machine learning algorithms are often used by financial institutions for making loan decisions.
  – Loan data is represented using a feature matrix.
  – Features: credit score, loan amount, loan type, applicant's income, etc.
  – Instances: each instance represents a prior loan.
  – Class Labels: two classes, indicating whether the borrower honored the loan or defaulted.

The Loan Data Feature Matrix

Class Label      Income (F1)   Loan Amount (F2)   Credit Score (F3)   Loan Type (F4)
Honored (C1)     34,000        10,000             685                 Mortgage
Honored (C1)     63,050        49,000             700                 Stafford
Defaulted (C2)   20,565        35,000             730                 Stafford
Honored (C1)     50,021        10,000             664                 Mortgage
Defaulted (C2)   100,350       129,000            705                 Car Loan
Honored (C1)     800,000       300,000            800                 Yacht Loan

Columns are the attributes used to classify a loan decision; rows are prior loan instances used to classify future loans.

Page 6:

Background

• There are two broad categories of machine learning algorithms.
• Supervised learning algorithms
  – Also called discriminant methods
  – Require training data with class labels
  – Examples discussed in previous lectures include Neural Networks and the ID3/C4.5 decision tree algorithms.
• Unsupervised learning algorithms
  – Also called non-discriminant methods
  – Build models based on training data, without use of class labels

Page 7:

Background

• In this lecture, we will discuss two popular machine learning algorithms.
• Support Vector Machine
  – A supervised learning method
• Principal Component Analysis
  – An unsupervised learning method

Page 8:

Support Vector Machine: Background

• Grounded in Statistical Learning Theory, or VC (Vapnik-Chervonenkis) Theory.
• The technique was introduced in the mid-1990s.
  – Developed at AT&T Bell Labs.
  – Some interesting extensions done at Microsoft Research.
  – The idea is to select a separating function (the hyperplane) that minimizes the sum of the empirical risk and the VC confidence, which grows with the VC dimension.

Page 9:

Support Vector Machine: Background

• The intuition behind SVM: VC Theory

With probability 1 − η, the expected risk R(α) is bounded by:

$$R(\alpha) \;\le\; \frac{1}{l}\sum_{i=1}^{l}\frac{1}{2}\,\bigl|y_i - f(x_i,\alpha)\bigr| \;+\; \sqrt{\frac{h\bigl(\log(2l/h)+1\bigr)-\log(\eta/4)}{l}}$$

where:
  – l is the number of instances in our training data set
  – x_i is a particular training instance
  – y_i is the class label of instance x_i, where y_i ∈ {−1, +1}
  – α is a parameter denoting the set of selected functions f(x, α)
  – h is the VC dimension of the set of functions f(x)
  – η is a number in the range 0-1 signifying the confidence level

The first term is the empirical risk of the training data, an indicator of the function set's effectiveness. The second term is the VC confidence for a set of functions, proportional to the "capacity" of the function set.

The best training model is one that minimizes these two: lowest risk and lowest VC dimension should hopefully result in the most accurate and generalizable model.
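A small numeric illustration of the bound's two terms, based on the reconstruction above; the Python helper, its name, and the example values are mine, not the slides':

```python
import numpy as np

def vc_bound(emp_risk, l, h, eta=0.05):
    """Upper bound on expected risk: empirical risk + VC confidence.

    l: number of training instances, h: VC dimension,
    eta: 1 minus the confidence level.
    """
    vc_confidence = np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)
    return emp_risk + vc_confidence

# A richer function class (larger h) pays a larger capacity penalty
# for the same empirical risk.
for h in (10, 100, 1000):
    print(h, round(vc_bound(emp_risk=0.05, l=10_000, h=h), 4))
```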

Page 10:

Support Vector Machine: Background

• Linear Kernel
  – Uses a linear hyperplane to separate the different class instances.
  – The circled instances in the figure represent the support vectors.
    • These are the instances that define the boundaries of the margin around the hyperplane.
    • The distance between the hyperplane and the support vectors is the margin.
  – The hyperplane which maximizes this margin is used.
  – The greater the margin, the greater the likelihood that the SVM model will be generalizable (see the sketch below).
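A minimal sketch of fitting a maximum-margin linear SVM and inspecting its support vectors, assuming scikit-learn (not one of the tools the slides name) and toy data of my own:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable classes.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A linear kernel with a large C approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)    # the "circled" boundary instances
w = clf.coef_[0]
print(2 / np.linalg.norm(w))   # width of the maximized margin
```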

Page 11:

Support Vector Machine: Classification

• Linear Kernels for Text Categorization
  – Linear SVM has been used for a plethora of important text categorization problems:
    • Topic Categorization: classifying a set of documents by topic
    • Sentiment Classification: classifying online movie and/or product reviews as "positive" or "negative"
    • Style Classification: categorizing text based on authorship (writing style)

Page 12:

Support Vector Machine: Classification

• Topic Categorization
  – Motivation: Digital Libraries!!!
    • Arranging documents by topic is a natural way to organize information in online libraries.
    • Dumais et al. (1998) at Microsoft Research conducted an in-depth topic categorization study comparing linear SVM with other techniques on the Reuters corpus.
      – Found that SVM outperformed the other techniques on most topics as well as overall (see the table and pipeline sketch below).

Category       Findsim   NBayes   BayesNets   Trees   LinearSVM
Earn           92.9%     95.9%    95.8%       97.8%   98.0%
Acq            64.7%     87.8%    88.3%       89.7%   93.6%
Money-fx       46.7%     56.6%    58.8%       66.2%   74.5%
Grain          67.5%     78.8%    81.4%       85.0%   94.6%
Crude          70.1%     79.5%    79.6%       85.0%   88.9%
Trade          65.1%     63.9%    69.0%       72.5%   75.9%
Interest       63.4%     64.9%    71.3%       67.1%   77.7%
Ship           49.2%     85.4%    84.4%       74.2%   85.6%
Wheat          68.9%     69.7%    82.7%       92.5%   91.8%
Corn           48.2%     65.3%    76.4%       91.8%   90.3%
Avg. Top 10    64.4%     81.5%    85.0%       88.4%   92.0%
Avg. All Cat   61.7%     75.2%    80.0%       N/A     87.0%
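A minimal sketch of a linear-SVM topic categorizer in the spirit of this study, assuming a bag-of-words representation; scikit-learn and the toy documents are my additions, not the study's setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "wheat harvest exports rose sharply",
    "corn futures fell on weak demand",
    "quarterly earnings beat analyst forecasts",
    "company reports record profit and dividends",
]
topics = ["grain", "grain", "earn", "earn"]

# TF-IDF features + linear SVM: a standard text categorization recipe.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, topics)

print(model.predict(["profit forecasts for the year"]))  # ['earn']
```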

Page 13:

Support Vector Machine: Classification

• Sentiment Categorization
  – Motivation: Market Research!!!
    • Gathering consumer preference data is expensive.
    • Yet it's also essential when introducing new products or improving existing ones.
      – Software for mining online review forums….$10,000
      – Information gathered…….priceless.

(www.epinions.com)

Page 14:

Support Vector Machine: Classification

• Sentiment Classification Experiment
  – Objective: to test the effectiveness of features and techniques for capturing opinions.
  – Test bed of 2,000 digital camera product reviews taken from www.epinions.com.
    • 1,000 positive (4-5 star) and 1,000 negative (1-2 star) reviews
    • 500 for each star level (i.e., 1, 2, 4, 5)
  – Two experimental settings were tested:
    • Classifying 1 star versus 5 star (extreme polarity)
    • Classifying 1+2 star versus 4+5 star (milder polarity)
  – Feature set encompassed a lexicon of 3,000 positively or negatively oriented adjectives, plus word n-grams.
  – Compared a C4.5 decision tree against SVM, both run using 10-fold cross-validation (see the sketch below).
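A minimal sketch of this evaluation protocol, assuming scikit-learn, with its DecisionTreeClassifier standing in for C4.5 (a related but not identical algorithm) and synthetic data standing in for the review features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # stand-in for lexicon/n-gram features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in polarity labels

# 10-fold cross-validation, as in the experiment.
for name, clf in [("SVM", LinearSVC()),
                  ("Decision tree", DecisionTreeClassifier())]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f}")
```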

Page 15:

Support Vector Machine: Classification

• Sentiment Classification Experimental Results
  – SVM significantly outperformed C4.5 on both experimental settings.
  – The improved performance of SVM was attributable to its ability to better detect reviews containing sentiments with less polarity.
  – Many of the milder (2 and 4 star) reviews contained positive and negative comments about different aspects of the product.
    • It was more difficult for the C4.5 technique to detect the overall orientation of many of these reviews.

Classification Accuracy (%)

Sentiments         SVM     C4.5
Extreme Polarity   93.00   91.05
Mild Polarity      89.40   85.20

Page 16:

Support Vector Machine: Classification

• Style Categorization
  – Motivation: Online Anonymity Abuse!!!
    • The ability to identify people based on writing style can allow the use of stylometric authentication.
    • Important for many online text-based applications:
      – Email scams (email body text)
      – Online auction fraud (feedback comments)
      – Cybercrime (forum and instant messaging logs)
      – Computer hacking (program code)

Page 17:

Support Vector Machine: Classification

Style Categorization

Experimental Results: Stylometric Identification using SVM
Classification Accuracy (%)

Test Bed          25 Authors   50 Authors   100 Authors
Enron Email       87.2         86.6         69.7
eBay Comments     95.6         93.8         90.4
Java Forum        94.0         86.6         41.1
CyberWatch Chat   40.0         33.3         19.8

The linear SVM kernel was fairly effective for identifying up to 50 authors. However, performance fell as the number of authors increased (e.g., 100 authors). Thus, the use of a single SVM may not be appropriate as the number of author classes increases. Another problem is that the use of supervised techniques may not be suitable for online settings.

Page 18:

Support Vector Machine: Classification

• More Complex Problems: Fraudulent Escrow Website Categorization
  – Motivation: online escrow fraud nets billions of dollars in revenue annually!!!
  – Given the growing number of fraudulent sellers/traders online, people are told to use escrow services for security.
  – So naturally, fake escrow websites have started to pop up.
    • Online fraud databases such as Artists-Against-419 document an average of 30-40 new sites every day!!!
    • Especially prevalent for online sales of larger goods, such as vehicles.

Page 19:

Support Vector Machines: Classification

• Fraudulent Escrow Website Categorization
  – Which of the following escrow websites are fake?

***All Of Them***

Page 20:

Support Vector Machine: Classification

[Figure: fake escrow sites sharing the same page design (HTML and URLs), the same image and banner, and the same text and icon.]

Page 21:

Support Vector Machine: Classification

• More Complex Problems: Fraudulent Escrow Website Categorization
  – Websites contain many pages.
    • Each page contains HTML, body text, images, URL and anchor text, and in/out links.
    • Each of these forms of content is important for detecting fake escrow websites.
    • Not necessarily more complex in terms of classification difficulty, but more representational complexity.

Page 22:

Support Vector Machines: Classification

• Fraudulent Escrow Website Categorization
  – Using individual feature categories with a single linear SVM is no problem in this case.
  – However, if we wish to use all features, the one-to-many relationship between pages and images is problematic.
  – Also, what about site structure features?
    • E.g., in/out links, page level, etc.

The Web Page Feature Matrix

Class            Body Text   HTML        URL     Image Features
Real Page (C1)   1,2,1,4     3,4,5,2     9,2,3   Image1: 1,3,5,5
                                                 Image2: 8,3,4,1
                                                 Image3: 9,4,2,4
Fake Page (C2)   63,50,4,5   49,10,5,2   3,2,4   Image1: 43,43,6,4
                                                 Image2: 92,54,6,3

Columns are the attributes used to classify web pages; rows are prior instances used to classify future pages.

Page 23:

Support Vector Machine: Classification

• Fraudulent Escrow Website Categorization
  – A website contains many pages, and a page can contain many images, along with HTML, body text, URLs and anchor text, and site structure.
  – Important fake escrow classification characteristics:
    • Requires use of a rich feature set (text, HTML, images, URLs, etc.)
      – Some feature patterns/trends across fake sites
      – Some content duplication across fake sites
    • Web site structure may be important
  – A single linear SVM cannot handle such information….
  – Two solutions:
    • Ensemble Classifiers
    • Non-Linear Kernel

Page 24:

Support Vector Machines: Classification

• Fraudulent Escrow Website Categorization
• Ensemble Classifiers
  – Also referred to as voting-based techniques.
  – Use multiple SVMs to distribute complex features.
  – This is called a feature-based ensemble.
  – Each SVM classifier is an "expert" on one feature category (see the sketch below).

The Web Page Feature Matrix (as above), with one SVM assigned per feature category:

Body Text Features → BodyTextSVM
HTML Features      → HTMLSVM
URL Features       → URLSVM
Image Features     → ImageSVM
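A minimal sketch of a feature-based ensemble, assuming scikit-learn; the column splits, the synthetic data, and the majority-vote combiner are my illustrative choices (the slides only say the technique is voting-based):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 14))
y = rng.integers(0, 2, size=100)

# Column ranges for each feature category (illustrative).
categories = {"body_text": slice(0, 4), "html": slice(4, 8),
              "url": slice(8, 11), "image": slice(11, 14)}

# Train one "expert" SVM per feature category.
experts = {name: SVC(kernel="linear").fit(X[:, cols], y)
           for name, cols in categories.items()}

# Majority vote across the experts' predictions (ties go to class 0).
votes = np.array([clf.predict(X[:, categories[name]])
                  for name, clf in experts.items()])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((ensemble_pred == y).mean())  # training accuracy of the ensemble
```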

Page 25:

Support Vector Machines: Classification

• Fraudulent Escrow Website Categorization
• Non-Linear Kernel
  – We can define our own kernel function.
  – Using this function, we can compute a similarity score between every pair of pages.
  – This similarity matrix can then be input into a linear SVM (see the sketch below).
  – Notice that the features are now the similarity scores for the pages.

The Web Page Feature Matrix

Class            Body Text   HTML        URL     Image Features
Real Page (C1)   1,2,1,4     3,4,5,2     9,2,3   Image1: 1,3,5,5
                                                 Image2: 8,3,4,1
                                                 Image3: 9,4,2,4
Fake Page (C2)   63,50,4,5   49,10,5,2   3,2,4   Image1: 43,43,6,4
                                                 Image2: 92,54,6,3
Real Page (C1)   2,3,5,5     4,7,8,2     9,3,1   Image1: 4,5,5,3

Kernel Function (maps the feature matrix to pairwise similarity scores):

Class            Similarity P1   Similarity P2   Similarity P3
Real Page (C1)   1.000           0.134           0.531
Fake Page (C2)   0.134           1.000           0.157
Real Page (C1)   0.531           0.157           1.000
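A minimal sketch of feeding a custom similarity matrix to an SVM, assuming scikit-learn's precomputed-kernel interface; the 3x3 matrix is the one from the slide:

```python
import numpy as np
from sklearn.svm import SVC

# Pairwise page similarities produced by a custom kernel function.
K = np.array([
    [1.000, 0.134, 0.531],
    [0.134, 1.000, 0.157],
    [0.531, 0.157, 1.000],
])
y = np.array(["C1", "C2", "C1"])

# kernel="precomputed" tells the SVM that K already holds similarity
# scores rather than raw feature values.
clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K))  # each row holds a page's similarities to the training pages
```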

Page 26:

Support Vector Machine: Classification

• Fraudulent Escrow Website Categorization
  – An example kernel, called the "Escrow Kernel"
  – This kernel is customized to handle fraudulent escrow pages.
  – It considers the page structure, the average page-site similarity, and the maximum page-site similarity.
• The Escrow Kernel is defined as follows (a code sketch follows the definition):

For a page $a$ and a web site $b$ in the training set, with pages $p$ in site $b$, define the page-to-page distance

$$d(a,p) \;=\; |lv_a - lv_p| \cdot \frac{|in_a - in_p|}{in_a + in_p} \cdot \frac{|out_a - out_p|}{out_a + out_p} \cdot \frac{1}{n}\sum_{i=1}^{n}\,|a_{ki} - p_{ki}|$$

and from it the average and maximum page-site similarities

$$\mathrm{Sim}_{ave}(a,b) \;=\; \frac{1}{m}\sum_{p \in b} d(a,p), \qquad \mathrm{Sim}_{max}(a,b) \;=\; \min_{p \in b}\, d(a,p)$$

Represent each page $a$ with the feature vector

$$\{\mathrm{Sim}_{ave}(a,b_1),\ \mathrm{Sim}_{max}(a,b_1),\ \mathrm{Sim}_{ave}(a,b_2),\ \dots,\ \mathrm{Sim}_{ave}(a,b_k),\ \mathrm{Sim}_{max}(a,b_k)\}$$

where: $b_1,\dots,b_k$ are the web sites in the training set; $m$ is the number of pages in site $b$; $lv$, $in$, and $out$ are the page level and the number of in/out links for a page; and $a_{k1},\dots,a_{kn}$ and $p_{k1},\dots,p_{kn}$ are page $a$'s and page $p$'s feature category vectors.
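A minimal sketch of this page-site similarity in Python, based on the reconstruction above; the multiplicative weighting of the structure terms, the function names, and the toy pages are assumptions, not the published definition:

```python
import numpy as np

def page_distance(a, p):
    """Distance between pages a and p: structure terms times the
    mean absolute feature difference (reconstructed form)."""
    structure = (abs(a["lv"] - p["lv"])
                 * abs(a["in"] - p["in"]) / (a["in"] + p["in"])
                 * abs(a["out"] - p["out"]) / (a["out"] + p["out"]))
    features = np.mean(np.abs(a["vec"] - p["vec"]))
    return structure * features

def site_similarities(a, site_pages):
    """Average and best-match (minimum) distance of page a to one
    training site: two entries of the kernel feature vector."""
    d = [page_distance(a, p) for p in site_pages]
    return np.mean(d), np.min(d)

page = {"lv": 1, "in": 3, "out": 5, "vec": np.array([1.0, 2.0, 1.0, 4.0])}
site = [{"lv": 2, "in": 4, "out": 6, "vec": np.array([2.0, 3.0, 5.0, 5.0])},
        {"lv": 3, "in": 2, "out": 7, "vec": np.array([9.0, 2.0, 3.0, 1.0])}]
print(site_similarities(page, site))
```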

Page 27:

Support Vector Machine: Classification

• Fraudulent Escrow Website Categorization
  – Experimental Design
  – 50 bootstrap instances
    • Randomly select 50 real escrow sites and 50 fake web sites in each instance.
    • Use all the web pages from the selected 100 sites as the instances.
  – In each instance, use 10-fold CV for page categorization.
    • 90% of pages used for training, 10% for testing in each fold.
  – Compare the individual feature categories discussed, as well as the use of all features with the ensemble and kernel approaches.

Page 28:

Support Vector Machine

• Fraudulent Escrow Website Categorization
  – Experimental Results (page level)
  – The linear kernel outperformed the Escrow Kernel on the body text and HTML features.
  – The Escrow Kernel outperformed linear SVM on all other feature sets.
  – Both the ensemble and the all-feature kernel outperformed the use of individual feature categories.

Average classification accuracy (%) across 50 bootstrap runs

Kernel/Features   Body Text   HTML    URL     Image   All
Linear SVM        96.92       97.08   93.99   72.26   97.69*
Escrow Kernel     95.98       95.98   95.93   78.18   98.85

*Linear ensemble with 4 SVM classifiers

Page 29:

Support Vector Machines: Classification

• Style Categorization Revisited
• Ensemble Classifiers
  – Can also be used across instances.
  – Use multiple SVMs to distribute complex classes.
  – This is called an instance-based or class-based ensemble.
  – Each SVM classifier is an "expert" on one class.
  – Could be useful for the style categorization scalability problem (a one-vs-rest sketch follows the matrices below).

Identity Feature Matrix

Class       Lexical (F1)   Syntax (F2)   Topic (F3)   Structure (F4)
ID 1 (C1)   1.25           3.41          3.90         2.12
ID 2 (C2)   2.31           5.42          4.35         1.65
ID 3 (C3)   2.23           4.31          8.42         5.03

ID 1 SVM (ID 1 vs. all others):

Class        F1     F2     F3     F4
ID 1 (C1)    1.25   3.41   3.90   2.12
Other (C2)   2.31   5.42   4.35   1.65
Other (C2)   2.23   4.31   8.42   5.03

ID 3 SVM (ID 3 vs. all others):

Class        F1     F2     F3     F4
Other (C2)   1.25   3.41   3.90   2.12
Other (C2)   2.31   5.42   4.35   1.65
ID 3 (C1)    2.23   4.31   8.42   5.03
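A minimal sketch of a class-based (one-vs-rest) ensemble, assuming scikit-learn; the matrix is the identity feature matrix above:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Identity feature matrix: lexical, syntax, topic, structure scores.
X = np.array([[1.25, 3.41, 3.90, 2.12],
              [2.31, 5.42, 4.35, 1.65],
              [2.23, 4.31, 8.42, 5.03]])
y = np.array(["ID1", "ID2", "ID3"])

# One SVM per author class: each is an "expert" on one identity
# versus all other identities.
ensemble = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ensemble.estimators_))  # 3 per-class SVMs
print(ensemble.predict(X))
```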

Page 30:

Support Vector Machine: Classification

Experimental Results: Stylometric Identification using SVM and Ensemble
Classification Accuracy (%)

Test Bed          Technique    25 Authors   50 Authors   100 Authors
Enron Email       Ensemble     88.0         88.2         76.7
                  Single SVM   87.2         86.6         69.7
eBay Comments     Ensemble     96.0         94.0         90.9
                  Single SVM   95.6         93.8         90.4
Java Forum        Ensemble     92.4         85.2         53.5
                  Single SVM   94.0         86.6         41.1
CyberWatch Chat   Ensemble     46.0         36.6         22.6
                  Single SVM   40.0         33.3         19.8

The class-based ensemble outperformed the single SVM on three of the four data sets, the exception being the Java Programming Forum. Generally, the performance gap widened as the number of classes increased.

Page 31:

Support Vector Machine: Classification

• Kernel Function Examples
  – In both of the examples on the right, the classes cannot be separated by a linear hyperplane in the original space.
  – The top one uses the following second-order monomials as features:

$$x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2$$

  – The bottom one shows how a 3rd-degree polynomial kernel can be used.

Page 32:

Support Vector Machine: Classification

• Popular Non-Linear Kernel Functions
  – Polynomial Kernels
  – Gaussian Radial Basis Function (RBF) Kernels
  – Sigmoidal Kernels
  – Tree Kernels
  – Graph Kernels
• Always be careful when designing a kernel (see the sketch below).
  – A poorly designed kernel can often reduce performance.
  – The kernel should be designed such that the similarity scores or structure created by the transformation places related instances in a manner separable from unrelated instances.
  – Garbage in, garbage out.
  – Live by the kernel… die by the kernel…
  – ***Insert preferred idiom here***
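A minimal sketch contrasting two of these kernels on data that a linear hyperplane cannot separate; scikit-learn and the ring-shaped toy data are my additions:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for name, clf in [("linear", SVC(kernel="linear")),
                  ("poly (degree 2)", SVC(kernel="poly", degree=2, coef0=1)),
                  ("rbf", SVC(kernel="rbf"))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.2f}")  # linear lags; the non-linear kernels separate the rings
```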

Page 33:

Support Vector Machine: Feature Selection

• Most machine learning algorithms can also be used for feature selection.
• Trained classifiers assign each feature a weight.
  – This can be used as an indicator of the feature's effectiveness or importance.
  – For example, decision tree models (DTMs) have been used extensively this way.
• Similarly, SVM is also highly effective.
  – Iteratively shrink the feature space by selecting only the features over a threshold weight, or the n best features (see the sketch below).

Feature Set → SVM → SVM Weights → Selected Features
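A minimal sketch of weight-based selection with a linear SVM, assuming scikit-learn and keeping the n highest-|weight| features; the synthetic data is mine:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 30))
y = (X[:, 0] - 2 * X[:, 5] > 0).astype(int)  # only features 0 and 5 matter

# Fit a linear SVM and rank features by absolute weight.
weights = np.abs(LinearSVC().fit(X, y).coef_[0])
n_best = 2
selected = np.argsort(weights)[::-1][:n_best]
print(selected)  # expected to recover features 5 and 0
```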

Page 34:

Support Vector Machine: Feature Selection

• Sentiment Categorization
  – 2,000 movie review test bed
    • Performed 10-fold CV, and 50 bootstrap instances with a 1900-100 review split.
  – Used SVM to test sentiment polarity classification performance (positive vs. negative).
  – Compared a no-feature-selection baseline with feature selection using information gain (IG), a genetic algorithm (GA), and SVM weights (SVMW).
  – SVMW performed well, significantly outperforming the baseline and achieving the best overall accuracy while using the smallest set of features.

Technique   10-Fold CV   Bootstrap   Std. Dev.   # Features
Base        87.95%       88.05%      4.133       26,870
IG          92.50%       92.08%      2.523       2,316
GA          92.55%       92.29%      2.893       2,017
SVMW        92.86%       92.34%      2.080       2,000

Page 35:

Support Vector Machine: Regression

• SVM regression is designed to handle continuous data predictions.
• Useful for problems where the classes lie along a continuum instead of being discrete.
  – Stock Prediction
    • Predicting the impact a news story will have on a company's stock price.
  – Sentiment Categorization
    • Differentiating 1, 2, 3, 4, and 5 star movie and product reviews.
    • Often the difference between a 1 and a 2 star review is very subtle.
    • Being able to make more precise predictions can be useful here (see the sketch below).
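A minimal sketch of support vector regression on a continuous target, assuming scikit-learn; the star-rating data is made up:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))                       # stand-in review features
stars = np.clip(3 + X[:, 0] + 0.1 * rng.normal(size=300), 1, 5)

# Epsilon-SVR fits a continuous function instead of discrete classes.
reg = SVR(kernel="rbf", epsilon=0.2).fit(X, stars)
print(reg.predict(X[:3]).round(2))                   # continuous star estimates
```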

Page 36:

Principal Component Analysis: Background

• PCA is a popular dimensionality reduction technique.
  – It has been around since the early 1900s.
  – Still used extensively for text and image processing.
  – The idea is to project the data into a lower-dimensional feature space, where the variables are transformed into a smaller set of principal components that account for the important variance in the feature matrix.
  – Used extensively for:
    • Data preprocessing/filtering
    • Feature selection/reduction
    • Classification and clustering
    • Visualization

Page 37:

Principal Component Analysis: Background

1) Derive the covariance matrix Σ of the feature matrix X.

2) Extract the set of eigenvalues {λ_1, λ_2, ..., λ_n} by finding the points where the characteristic polynomial of Σ is zero:

$$\det(\Sigma - \lambda I) = 0$$

For each eigenvalue λ_m, extract the eigenvector $a_m = (a_{m1}, a_{m2}, a_{m3}, \dots, a_{mn})$ by solving the system

$$(\Sigma - \lambda_m I)\,a_m = 0$$

resulting in a set of n eigenvectors {a_1, a_2, ..., a_n}.

Compute the k-dimensional representation of each instance by extracting a principal component score for each retained dimension k ≤ n:

$$p_{ik} = a_k^{T} x_i$$

These steps are sketched in code below the matrices.

The Feature Matrix

Class   F1    F2    F3   F4   F5    F6
C1      41    1.2   2    1    3.6   1.5
C2      63    1.5   3    0    3.5   2.4
C1      109   0.4   6    1    2.4   3.2

PCA ↓

The Projected Matrix

Class   P1    P2    P3
C1      2.6   9.2   1.2
C2      3.2   5.6   2.4
C1      4.4   5.1   3.1

The direction of greatest variance will load heavily on P1.
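A minimal sketch of these steps with NumPy (my choice of tool), via eigendecomposition of the covariance matrix; with only three instances, at most two components carry variance:

```python
import numpy as np

X = np.array([[41, 1.2, 2, 1, 3.6, 1.5],
              [63, 1.5, 3, 0, 3.5, 2.4],
              [109, 0.4, 6, 1, 2.4, 3.2]])

# 1) Covariance matrix of the mean-centered feature matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2) Eigenvalues/eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)   # returned in ascending order
order = np.argsort(eigvals)[::-1]        # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3) Principal component scores: p_ik = a_k^T x_i.
k = 2
P = Xc @ eigvecs[:, :k]
print(P)  # the projected matrix (instances x components)
```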

Page 38:

Principal Component Analysis: Classification

• Use of principal component analysis for authorship and genre analysis of texts, using 50 function words and 2D plots.
  – Some structure based on the education level of the author.
  – Some clustering based on genre: fiction differs from description and argument.
  – No authorship structure or clustering using the top 3 components, due to the lack of feature richness.

Page 39:

Principal Component Analysis: Classification

[Figure: Author PCA scores (using richer features) with anonymous message scores overlaid: one cluster of 5 messages and one of 1 message, matched against Author A and Author B.]

Page 40:

Principal Component Analysis: Classification

• Kernel Functions
  – Kernel functions can be used with PCA in a manner similar to SVM.
  – This example shows how a polynomial kernel can be used (see the sketch below).
  – Polynomial PCA has been used extensively for image recognition.
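A minimal sketch of kernel PCA with a polynomial kernel, assuming scikit-learn; the ring-shaped data is illustrative:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Polynomial kernel PCA: the components live in the implicit
# monomial feature space rather than the original one.
kpca = KernelPCA(n_components=2, kernel="poly", degree=2, coef0=1)
X_proj = kpca.fit_transform(X)
print(X_proj.shape)  # (200, 2)
```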

Page 41:

Principal Component Analysis: Applications

Writeprint Illustration

Page 42:

Principal Component Analysis: Applications

Various Writeprint Views: Standard View, Density View, Temporal View, Multidimensional View

Page 43:

Principal Component Analysis: Applications

[Figure: a Writeprint built from All Features, alongside category prints for Punctuation, Letter Freq., Word Length, and Content Spec.]

Writeprints are made using all features, while individual categories can also be used for identification or analysis purposes (category prints).

Page 44:

Principal Component Analysis: Applications

Category Print Views: Content Specific, Word Length, Character Bigrams, Punctuation

This author has a fairly consistent set of discussion topics, based on the tighter pattern (less variation of the content-specific features).

Page 45:

Principal Component Analysis: Applications

Page 46:

Principal Component Analysis: Applications

Interpreting Writeprints

Special Character Eigenvectors

Feature   x          y
~         0          0
@         0.022814   -0.01491
#         0          0
$         -0.01253   -0.17084
%         0          0
^         -0.01227   -0.01744
&         -0.01753   -0.0777
*         -0.03017   -0.05931
-         -0.12656   0.991784
_         0.998869   0.047184
=         -0.05113   -0.07576
+         0.142534   0.021726
>         -0.1077    0.392182
<         -0.10618   0.213193
[         0          0
]         0          0
{         0          0
}         0          0
/         -0.05075   -0.09065
\         0          0
|         -0.05965   0.428848

[Figure: Special character writeprints for Authors A, B, C, and D.]

Page 47:

Principal Component Analysis: Applications

[Figure: Author writeprints with anonymous messages overlaid: 10 messages plotted against Author A's writeprint and 10 against Author B's.]

Page 48:

Principal Component Analysis: Applications

Experimental Results: Stylometric Identification Task
Classification Accuracy (%)

Test Bed          Technique    25 Authors   50 Authors   100 Authors
Enron Email       Writeprint   92.0         90.4         83.1
                  Ensemble     88.0         88.2         76.7
                  SVM/EF       87.2         86.6         69.7
                  Baseline     64.8         54.4         39.7
eBay Comments     Writeprint   96.0         95.2         91.3
                  Ensemble     96.0         94.0         90.9
                  SVM/EF       95.6         93.8         90.4
                  Baseline     90.6         86.4         83.9
Java Forum        Writeprint   88.8         66.4         52.7
                  Ensemble     92.4         85.2         53.5
                  SVM/EF       94.0         86.6         41.1
                  Baseline     84.8         60.2         23.4
CyberWatch Chat   Writeprint   50.4         42.6         31.7
                  Ensemble     46.0         36.6         22.6
                  SVM/EF       40.0         33.3         19.8
                  Baseline     37.6         30.8         17.5

Writeprint outperformed the single SVM and the Ensemble SVM in most settings.

Page 49:

Principal Component Analysis: Applications

• Temporal Writeprint views of the two authors across all features (lexical, syntactic, structural, content-specific, n-grams, etc.).
• Each circle denotes a text window, colored according to the point in time at which it occurred.
• The bright green points represent text windows from emails written after the scandal had broken out, while the red points represent text windows from before.
• Author B shows greater overall feature variation, attributable to a distinct difference in the spatial location of points prior to the scandal as opposed to afterwards.
• In contrast, Author A shows no such difference, with his newer (green) text points placed directly on top of his older (red) ones.
• Consequently, Author B underwent a profound change with respect to the text in his emails, while there do not appear to be any major changes for Author A.

The Enron Case: Author A vs. Author B

Page 50:

Principal Component Analysis: Applications

Page 51:

Principal Component Analysis: Applications

Page 52:

Principal Component Analysis: Applications

Other PCA-based visualization techniques: Themescapes, Galaxies, Text Blobs, ThemeRiver

Page 53:

PCA and SVM Resources

• You can "google" these terms…
• SVM
  – Weka (University of Waikato, New Zealand)
  – SVM Light (Cornell University)
  – LibSVM (National Taiwan University)
• PCA
  – Weka (University of Waikato, New Zealand)
  – Matlab (MathWorks)