
Project Report on

Building Descriptor Based SVM for Document Categorization

Student: Naveen Kumar Ratkal

Advisor: Dr. Zubair


ABSTRACT

INTRODUCTION

SUPPORT VECTOR MACHINE
    Introduction
    SVM Model

SVM IMPLEMENTATION
    Transformation
    Scaling
    RBF Kernel

APPROACH

ARCHITECTURE
    SVM Based Architecture
    Training
    Assignment

EXPERIMENTS AND RESULTS

CONCLUSION

REFERENCES


Abstract

Automated document categorization is extensively used and studied, and various approaches to it have been implemented. SVM is one machine learning technique that is promising for text categorization. Even so, text categorization systems are not widespread; one reason is that it is often unclear how to adapt a machine learning approach to an organization's collection with its unique needs. DTIC currently uses a dictionary-based approach to assign DTIC thesaurus descriptors to an acquired document based on its title and abstract; the thesaurus currently consists of around 14K descriptors. The main objective of this project is to build a framework for automatically downloading documents from the DTIC website, OCRing (Optical Character Recognition) the downloaded files, and finally training SVMs on these files using positive documents and negative documents (the latter drawn from the positive sets of the other classes).

Introduction

Automated document categorization has been extensively studied, and a good survey article discusses the evolution of various techniques for document categorization with a particular focus on machine learning approaches. One machine learning technique, Support Vector Machines (SVMs), is promising for text categorization (Dumais 1998, Joachims 1998). Dumais et al. evaluated SVMs on the Reuters-21578 collection and found them to be the most accurate for text categorization and quick to train.

The automatic text categorization area has matured, and a number of experimental prototypes are available. However, most of these prototypes, built for the purpose of evaluating different techniques, have been restricted to a standard collection such as Reuters. As has been pointed out in the literature, commercial text categorization systems are not widespread. One reason is that very often it is not clear how to adapt a machine learning approach to an organization's collection with its unique needs.

Here we describe a framework we have created that allows one to evaluate various machine learning approaches to the categorization problem for a specific collection. Specifically, we evaluate the applicability of SVMs for the needs of the Defense Technical Information Center (DTIC). DTIC currently uses a dictionary-based approach to assign DTIC thesaurus descriptors to an acquired document based on its title and abstract. The descriptor assignment is validated by humans, and one or more fields/groups (subject categories) are assigned to the target document. The thesaurus currently consists of around 14K descriptors. For subject categorization, DTIC uses 25 main categories (fields) and 251 subcategories (groups). Typically a document is assigned two or three fields/groups, along with five or six descriptors from the DTIC thesaurus.


The dictionary-based approach currently used by DTIC relies on a mapping of document terms to DTIC thesaurus descriptors and is not efficient, as it requires continuous updates to the mapping table. It also suffers from poor quality of the descriptors assigned this way. There is a need to improve this process in terms of both time and the quality of the descriptors assigned to documents.

Support Vector Machine

Introduction

The SVM (Support Vector Machine) has its roots in work by V. Vapnik in the late 1970s. It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, and text categorization [11]. The goal of text categorization is the classification of documents into a fixed number of predefined categories; each document can be in multiple categories, in exactly one, or in none at all [8]. Text categorization with SVMs eliminates the need for manual, difficult, and time-consuming classification by learning classifiers from examples and performing the category assignments automatically. This is a supervised learning problem.

SVMs are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence the SVM is also known as a maximum-margin classifier [8].

The main idea of the Support Vector Machine is to find, from pre-classified data, an optimal hyperplane that separates the two classes with the largest margin [12]. The use of the maximum-margin hyperplane is motivated by Vapnik-Chervonenkis theory, which provides a probabilistic test error bound that is minimized when the margin is maximized [1].

SVM Model

SVM models can be divided into four distinct groups based on the error function:

(1) C-SVM classification
(2) nu-SVM classification
(3) epsilon-SVM regression
(4) nu-SVM regression

Compared to regular C-SVM, the formulation of nu-SVM is more complicated, so up to now there have been no effective methods for solving large-scale nu-SVM [3]. We use C-SVM classification in this project. For C-SVM, training involves the minimization of the error function:

    \min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i


subject to the constraints:

    y_i \, (w^T \phi(x_i) + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, N

where C is the capacity constant (the penalty parameter of the error function), w is the vector of coefficients, b is a constant, and the slack variables \xi_i handle non-separable data (inputs). The index i labels the N training cases; the y_i \in \{-1, +1\} are the class labels and the x_i are the independent variables. The mapping \phi, defined implicitly by the kernel, transforms the data from the input (independent) space to the feature space [13].

SVM kernels:

Kernels are a class of algorithms whose task is to detect and exploit complex patterns in data (e.g., by clustering, classifying, ranking, or cleaning the data). The typical problems are how to represent complex patterns, which is a computational problem, and how to exclude spurious (unstable) patterns, i.e. overfitting, which is a statistical problem [13].

The class of kernel methods implicitly defines the class of possible patterns by introducing a notion of similarity between data, for example, similarity between documents by length, topic, language, etc. Kernel methods exploit information about the inner products between data items; many standard algorithms can be rewritten so that they require only inner products between data (inputs). When a kernel is given, there is no need to specify what features of the data are being used [13].

Kernel functions are inner products in some (potentially very complex) feature space.

In SVM there are four common kernels:

- linear
- polynomial
- radial basis function (RBF)
- sigmoid
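For reference, these kernels are conventionally defined as follows (the notation follows the LIBSVM documentation; u and v are input vectors, and \gamma, r, and d are kernel parameters):

    K(u, v) = u^T v                          (linear)
    K(u, v) = (\gamma u^T v + r)^d           (polynomial)
    K(u, v) = \exp(-\gamma \|u - v\|^2)      (RBF)
    K(u, v) = \tanh(\gamma u^T v + r)        (sigmoid)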

In general, RBF is a reasonable first choice: the kernel matrix produced by the sigmoid kernel may not be positive definite and its accuracy is in general not better than RBF's [12], the linear kernel is a special case of RBF, and the polynomial kernel may have numerical difficulties if a high degree is used. In this implementation we use an RBF kernel, whose kernel function is:

    K(x, y) = \exp(-\gamma \|x - y\|^2)

Kernel trick:


In machine learning, the kernel trick is a method for easily converting a linear classifier algorithm into a non-linear one, by mapping the original observations into a higher-dimensional non-linear space so that linear classification in the new space is equivalent to non-linear classification in the original space [1]. Let’s look at this with an example.

Here we are interested in classifying data as part of a machine-learning process. These data points may not necessarily be points in \mathbb{R}^2 but may be multidimensional points in \mathbb{R}^n [1]. We are interested in whether we can separate them by an (n-1)-dimensional hyperplane; this is a typical form of linear classifier. Consider a two-class, linearly separable classification problem, as below:

Figure 1: Linear Classifier [12].

But not all classification problems are as simple as this.

Figure 2: Non Linear Input space [12].

In the case of complex problems, where the classes are not linearly separable in the input space, the decision boundary is a complex curve, as below.

Figure 3: Kernel Trick [12].


In the case of such a non-linear input space, the data are mapped into a linear feature space. This mapping into the feature space is performed by a class of algorithms called kernels.

Good Decision Boundary:

As discussed above, the input data points may not necessarily be points in \mathbb{R}^2 but may be multidimensional points in \mathbb{R}^n. We are interested in whether we can separate them by an (n-1)-dimensional hyperplane [1]. This is a typical form of linear classifier, but a given set of linearly separable data points may admit more than one boundary that separates the classes. A Perceptron algorithm, for example, can be used to find these multiple boundaries.

Figure 4: Good Decision Boundary

SVM resolves this by finding the hyperplane with the maximum separation (margin) between the two classes; this separating hyperplane is known as the maximum-margin hyperplane.
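To connect this with the C-SVM objective given earlier: for a separating hyperplane w^T x + b = 0, the distance between the two supporting hyperplanes w^T x + b = +1 and w^T x + b = -1 is 2 / \|w\|, so minimizing \frac{1}{2} w^T w is exactly what maximizes the margin.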



Figure 5: Maximum-margin hyperplane

SVM Implementation

The SVM implementation in this project follows these steps [7]:

1. Transform the data to the format required by the SVM software.
2. Conduct simple scaling/normalization of the data.
3. Consider the RBF kernel.
4. Find the best parameters C and gamma.
5. Use the best C and gamma to train on the training set.
6. Test.
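The report's parameter-search code is not listed; the following is a minimal sketch of step 4 using LIBSVM's Java API, assuming an already-populated svm_problem named prob holding the scaled training data (the exponentially spaced grid is a common convention, not a requirement):

import libsvm.*;

// Grid search over C and gamma using 5-fold cross validation.
public class GridSearch {
    public static svm_parameter bestParams(svm_problem prob) {
        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.cache_size = 100;   // kernel cache size in MB
        param.eps = 1e-3;         // stopping tolerance

        double bestC = 1, bestGamma = 1, bestAccuracy = -1;
        for (int logC = -5; logC <= 15; logC += 2) {
            for (int logG = -15; logG <= 3; logG += 2) {
                param.C = Math.pow(2, logC);
                param.gamma = Math.pow(2, logG);
                double[] target = new double[prob.l];
                svm.svm_cross_validation(prob, param, 5, target);
                int correct = 0;
                for (int i = 0; i < prob.l; i++)
                    if (target[i] == prob.y[i]) correct++;
                double accuracy = (double) correct / prob.l;
                if (accuracy > bestAccuracy) {
                    bestAccuracy = accuracy;
                    bestC = param.C;
                    bestGamma = param.gamma;
                }
            }
        }
        param.C = bestC;          // keep the winning pair for training
        param.gamma = bestGamma;
        return param;
    }
}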

Transformation

There are many existing SVM libraries we could use, such as Rainbow [9], LIBSVM, etc. LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR), and more [5]. We selected LIBSVM for this project for the following reasons: (1) cross validation for model selection, (2) different SVM formulations, and (3) Java sources.

The data format supported by LIBSVM represents every document as a line of feature id/value pairs:

[label] [featureId]:[featureValue] [featureId]:[featureValue] ...

The label is the class, e.g. {0, 1} or {-1, +1}, that the document belongs to. The input here is a text file containing the articles or the metadata of the images; LIBSVM, however, takes data files in the format described above. The input files are transformed into the LIBSVM format by processing each document as follows: every document is scanned to pick out its features (the keywords that represent the document), and every term's frequency (TF) and inverse document frequency (IDF) are calculated.


These are then used to compute the weight of every term in the training set. There are many tools, such as Lucene, that calculate the TF and DF for a given document and can be used to build its SVM data model. In this project I developed my own API to calculate the weights and create the SVM data files (based on the required feature sets).
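That API is not reproduced in the report. As an illustration only, a minimal sketch of the transformation step might look like the following; the class and method names are hypothetical, and the weighting shown, tf * log(N/df), is one common TF-IDF variant:

import java.util.*;

public class SvmDataWriter {
    // Emit one document as a LIBSVM-format line: label id:weight id:weight ...
    // featureIds assigns each vocabulary term a stable integer id.
    public static String toSvmLine(int label,
                                   Map<String, Integer> termFreqs,
                                   Map<String, Integer> docFreqs,
                                   Map<String, Integer> featureIds,
                                   int totalDocs) {
        // LIBSVM expects feature ids in ascending order, hence the TreeMap.
        TreeMap<Integer, Double> weights = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer id = featureIds.get(e.getKey());
            Integer df = docFreqs.get(e.getKey());
            if (id == null || df == null || df == 0) continue;
            double idf = Math.log((double) totalDocs / df);
            weights.put(id, e.getValue() * idf);
        }
        StringBuilder sb = new StringBuilder(Integer.toString(label));
        for (Map.Entry<Integer, Double> e : weights.entrySet())
            sb.append(' ').append(e.getKey()).append(':')
              .append(String.format(Locale.ROOT, "%.6f", e.getValue()));
        return sb.toString();
    }
}

A document with label +1 and two indexed terms would then emit a line such as "+1 3:0.124000 17:0.518000".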

Scaling

The weighted frequencies are scaled to prevent attributes with greater numeric ranges from dominating those with smaller ranges.
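The exact scaling is not spelled out in the report; a common choice (LIBSVM's svm-scale tool works this way, defaulting to the range [-1, +1]) is to map each attribute j linearly to a fixed range, e.g. to [0, 1]:

    x'_{ij} = \frac{x_{ij} - \min_j}{\max_j - \min_j}

where \min_j and \max_j are the minimum and maximum values of attribute j over the training set, and the same \min_j and \max_j are reused to scale the test data.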

RBF Kernel

In this project we use the RBF kernel, as it is a simple model to start with: the number of hyperparameters influences the complexity of model selection, and the polynomial kernel has more hyperparameters than the RBF kernel.

Approach

There are several ways one can apply SVMs to the problem of assigning fields/groups and descriptors to new documents based on a learning set. In one approach, we would treat the problems of categorization and descriptor selection as independent problems, each one solved by independent SVMs. A combined approach would start with either the categorization problem or the descriptor problem and then solve the other as a restricted-domain problem. For instance, we could solve the categorization problem first and then, for the resulting specific field/group, solve the descriptor selection problem for the document at hand. Here, we discuss the "independent" approach and how to resolve inconsistencies that can result when the two problems are solved independently. An equally important task, which is not the focus of this report, is to study different ways to apply the SVM approach and their tradeoffs, along with other machine learning techniques that could result in better quality descriptors and fields/groups mappings. The overall approach consists of the following major steps.

Step 1. We use the existing DTIC collection for the training phase. For this, we first need a representation of a document: a document is represented by a vector of weighted terms, where the weights are determined by term frequency and inverse document frequency, a standard technique used in the IR area. Using this representation, we train SVMs for the 251 fields/groups and the 14000 descriptors.

Step 2. We use the trained SVMs to identify the fields and groups for a document. We also assign a likelihood factor (varying from 0 to 1) to each assigned field/group based on the document's distance from the hyperplane (see the SVM background above). We sort the assigned fields/groups by likelihood factor and select the first k fields/groups based on a threshold.
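The report does not specify how the distance to the hyperplane is mapped into [0, 1]. One standard choice, which LIBSVM implements for its probability estimates, is Platt scaling: fit a sigmoid to the SVM decision value f(x),

    P(y = +1 \mid x) \approx \frac{1}{1 + \exp(A f(x) + B)}

with the parameters A and B fitted on held-out training data.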


Step 3. Similar to Step 2, we identify descriptors, sort them, and select m descriptors based on a threshold.

Step 4. Note that the fields/groups identified in Step 2 and the descriptors identified in Step 3 may be inconsistent; that is, we may have a field/group assignment for a new document without its descriptors as identified by the DTIC thesaurus. One straightforward way to resolve this is to take the intersection of the descriptors implied by the fields/groups mapping and the descriptors identified by the SVMs. The likelihood factor can then be used to select a few fields/groups (around two or three) and five or six descriptors.
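A sketch of this reconciliation, with hypothetical names that are not the report's API, might be:

import java.util.*;
import java.util.stream.Collectors;

public class Reconciler {
    // Keep only SVM-suggested descriptors consistent with the assigned
    // fields/groups, rank by likelihood, and keep the top m.
    public static List<String> reconcile(Map<String, Double> svmLikelihoods,
                                         Set<String> allowedByFieldsGroups,
                                         int m) {
        return svmLikelihoods.entrySet().stream()
            .filter(e -> allowedByFieldsGroups.contains(e.getKey()))
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(m)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}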

As we are using SVMs to classify a large number of classes (251 + 14000), there can be performance issues. As part of this work, we investigate ways to improve performance without sacrificing the quality of the fields/groups and descriptor assignments. In this report, we focus on Step 3 of the overall process, with particular attention to the automated framework that allows a collection to be analyzed, trained, and processed, and the results presented.

Architecture

SVM Based Architecture

The process of identifying descriptors for a document consists of two phases: a training phase and an assignment phase. In the training phase, we train as many SVMs as there are thesaurus terms. In the assignment phase, we present a document to all the trained SVMs and select m descriptors based on a threshold. We now give details of the two phases.


Training

As the DTIC thesaurus consists of 14000 terms, the training process to build 14000 SVMs needs to be automated. Figure 2 illustrates the training process. For each thesaurus term, we construct a URL and use it to search for documents from the DTIC collection in that category. For our testing purposes, we have enabled this process for only five thesaurus terms. From the search results, we download a randomly selected subset of the documents. These documents, which are in PDF format, are converted to text using Omnipage, an OCR engine. We use the traditional IR document model, which is based on term frequency (TF) and document frequency (DF); in this project, we use the Apache Lucene package to determine TF and DF and produce a document representation suitable for training an SVM. These documents form the positive training set for the selected thesaurus term. The negative training set is created by randomly selecting documents from the positive training sets of terms other than the selected term.
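Assembling a per-term training problem from those two sets and handing it to LIBSVM could look like the following sketch (the class name is hypothetical; each document is assumed to already be a sparse svm_node[] vector of scaled TF-IDF weights):

import libsvm.*;
import java.util.*;

public class DescriptorTrainer {
    // Train one SVM for one thesaurus term: positives get label +1,
    // negatives -1. `param` would come from the grid search shown earlier.
    public static svm_model train(List<svm_node[]> positives,
                                  List<svm_node[]> negatives,
                                  svm_parameter param) {
        svm_problem prob = new svm_problem();
        prob.l = positives.size() + negatives.size();
        prob.x = new svm_node[prob.l][];
        prob.y = new double[prob.l];
        int i = 0;
        for (svm_node[] doc : positives) { prob.x[i] = doc; prob.y[i++] = +1; }
        for (svm_node[] doc : negatives) { prob.x[i] = doc; prob.y[i++] = -1; }
        return svm.svm_train(prob, param);
    }
}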


Assignment

Figure 3 illustrates the assignment phase. The input document is represented using TF and DF, as in the training phase. The document is presented to all the trained SVMs, each of which outputs an estimate in the range 0 to 1 indicating how likely the corresponding term is to apply to the test document. Based on a threshold, we can then assign m thesaurus terms (descriptors) to the test document.
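A minimal sketch of this step (hypothetical names again; it assumes each model was trained with param.probability = 1 so that LIBSVM's probability estimates are available):

import libsvm.*;
import java.util.*;

public class DescriptorAssigner {
    // Score one document against every per-descriptor SVM and keep the
    // descriptors whose positive-class probability clears the threshold.
    public static List<String> assign(svm_node[] doc,
                                      Map<String, svm_model> modelsByTerm,
                                      double threshold) {
        List<String> assigned = new ArrayList<>();
        for (Map.Entry<String, svm_model> e : modelsByTerm.entrySet()) {
            svm_model model = e.getValue();
            int[] labels = new int[2];
            svm.svm_get_labels(model, labels);   // class order used in prob[]
            double[] prob = new double[2];
            svm.svm_predict_probability(model, doc, prob);
            double pPositive = (labels[0] == 1) ? prob[0] : prob[1];
            if (pPositive >= threshold) assigned.add(e.getKey());
        }
        return assigned;
    }
}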

Experiments and Results

We use the recall and precision metrics that are commonly used by the information extraction and data mining communities. The general definitions of recall and precision are:

Recall = Correct Answers / Total Possible Answers
Precision = Correct Answers / Answers Produced

To compute recall and precision, one typically uses a confusion matrix C [10], which is a K x K matrix for a K-class classifier. In our case it is a 5 x 5 matrix, and element C_ij indicates how many documents of class i have been classified as class j. For an ideal classifier all off-diagonal entries are zero. If there are n_i documents in class i, then:

    Recall_i = C_ii / n_i = C_ii / \sum_j C_ij
    Precision_i = C_ii / \sum_j C_ji
    Correct rate = \sum_i C_ii / \sum_{i,j} C_ij

We created testbeds with various descriptors; the details are given below.

Testbed 1: In this case we took 5 descriptors (damage tolerance, fabrication, machine, military history and tactical analysis).


Training size: 50 positive documents and 50 negative documents.
Test set size: 20 documents.
Number of pages used: first five and last five.

D: Damage Tolerance
F: Fabrication
M: Machine
H: Military History
T: Tactical Analysis

        D    F    M    H    T   Recall  Precision
D      18    2    0    0    0    0.90     0.90
F       2   14    1    0    3    0.70     0.77
M       0    1   18    0    1    0.90     0.81
H       0    1    1   16    2    0.80     0.66
T       0    0    2    8   10    0.50     0.62

Correct rate: 0.76 (the diagonal of the matrix sums to 76 correctly classified documents out of 100).

Testbed 2: We took 5 narrower descriptors (military budgets, military capabilities, military commanders, military government and military operations).
Training size: 50 positive documents and 50 negative documents.
Test set size: 40 documents.
Number of pages used: first five and last five.

        MB   MC  MCO   MG   MO   Recall  Precision
MB      22    6    4    3    5    0.55     0.79
MC       1   28    4    4    3    0.70     0.61
MCO      1    2   24   10    3    0.60     0.52
MG       2    6    5   25    2    0.63     0.53
MO       2    4    9    5   20    0.50     0.61

Correct rate: 0.59

Next, we multiplied the weights of terms occurring in the first two pages of a document by a constant (using the same descriptors as Testbed 2).

        MB  MCO   MC   MG   MO   Recall  Precision
MB      15    0    4    0    1    0.75     0.65
MCO      2   10    1    3    3    0.53     0.50
MC       4    3   10    2    1    0.50     0.59
MG       1    1    0    8    0    0.80     0.50
MO       1    6    2    3    8    0.40     0.62


Correct rate: 0.57

We then added a constant to the weights of terms occurring in the first two pages of a document (again using the same descriptors as Testbed 2).

        MB  MCO   MC   MG   MO   Recall  Precision
MB      16    0    2    1    1    0.80     0.73
MCO      1   13    1    3    2    0.65     0.57
MC       3    2   14    1    0    0.70     0.67
MG       1    1    0    8    0    0.80     0.53
MO       1    7    4    2    6    0.30     0.67

Correct rate: 0.63

Conclusion

In this report, we proposed a framework to evaluate the effectiveness of SVMs for both the subject categorization and the descriptor selection problems for the DTIC collection. We have improved on the existing DTIC process, which uses a dictionary to assign thesaurus descriptors to an acquired document based on its title and abstract. Our preliminary results are encouraging, but we still need to do more testing to determine the right training set sizes for the various SVMs.

References


[1] Sebastiani, F. (2002). "Machine learning in automated text categorization." ACM Computing Surveys, Vol. 34(1), pp. 1-47.

[2] Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M. (1998). "Inductive learning algorithms and representations for text categorization." In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Washington, US, 1998), pp. 148-155.

[3] Joachims, T. (1998). "Text categorization with support vector machines: learning with many relevant features." In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 137-142.

[4] Reuters-21578 collection. URL: http://www.research.att.com/~lewis/reuters21578.htm

[5] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, Berlin.

[6] Burges, C. J. C. (1998). "A tutorial on support vector machines for pattern recognition." Data Mining and Knowledge Discovery, 2(2), pp. 121-167.

[7] Joachims, T. (1998). "Text Categorization with Support Vector Machines: Learning with Many Relevant Features." In Proceedings of the European Conference on Machine Learning, Springer, 1998.

[8] Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines. Dissertation, Kluwer.

[9] Kwok, J. T. (1998). "Automated text categorization using support vector machine." In Proceedings of the International Conference on Neural Information Processing (Kitakyushu, Japan, Oct. 1998), pp. 347-351.

[10] Kohavi, R. and Provost, F. (1998). "Glossary of terms." Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Machine Learning, Vol. 30, No. 2/3, February/March 1998.