
Term Project of CS570 Artificial Intelligence

Webpage Categorization: An Application of Text Categorization

Team Golf
20043561 Ha-Yong, Jung
20040000 Hoang, G. Vu.
20043350 Jung-Ki, Yoo

2004 / 12 / 19


1. Introduction

Text Categorization (TC) is the task of automatically classifying documents into a number of pre-defined categories based on their contents (i.e., based on text only, not on images, metadata, etc.). Each document may be assigned to several categories, exactly one category, or no category at all. TC has a wide range of applications, including document organization, document filtering, web page categorization, and word sense disambiguation.

In this project, we develop a web page categorization application capable of automatically classifying Korean scientific web pages into a number of 'favorite' categories. We choose Support Vector Machines (SVM) as our classification technique because SVM is powerful, theoretically rigorous, and able to handle very high-dimensional feature spaces. Furthermore, recent TC systems have shown that SVM outperforms most other machine learning techniques such as neural networks, RBF networks, and decision trees. A text classifier using SVM represents each text as a document vector, which serves as the feature vector both in the training step and in the classifying step. Our target documents, web pages, contain far more unnecessary text than the content text we actually need, so we have to transform the representation of the text through several steps before a web page can be used as a document vector by the classifier.

We implemented our system with a client-server structure, which has several advantages. First, the training set and its categories can be changed independently of the application. We can also replace the classifier with one using another machine learning algorithm, or optimize it, without affecting the application at all. So the performance and coverage of our system can be improved without changing the clients; that is, users of the clients never need to change their client programs. The whole system flow is as follows:

Figure 1: System Flow

This paper is organized as follows. In Section 2, we discuss text representation. Section 3 introduces SVM and the use of SVM in our application. In Section 4, the details of our application are described.


2. Text Representation

In this section, we describe the text representation needed for text categorization. The text representation module consists of several steps, and the partial result of each step can be viewed on the web, so our system can be accessed not only from the client application but also from a general web browser.

Figure 2: Web Interface of the Text Representation Module

2.1. Contents Extraction from Webpage

The target documents of our application are web pages, so we need to extract only the contents from a web page, because it contains much text that is not related to the contents, for example, HTML tags, JavaScript, and so on. As is well known, web pages are made of HTML, and HTML is not a well-organized language, so extracting the contents is not easy: there are many exceptions and many different ways to express the same thing in HTML.

There are two kinds of text that should be eliminated from HTML-formatted text. One is meta text, including HTML tags; the other is content text that is not related to the real contents. First, we eliminate HTML tags marked by the '<' and '>' characters. Then we eliminate JavaScript and style sheets, but there is some ambiguity in deciding their entry and end points, because the entry point is expressed in various ways and they contain various special characters, including '<' and '>'. We eliminate not only this meta text but also content text that is not related to the real contents. For example, the text between anchor tags represents a link to another web page. The linked page is of course meaningful, but in many cases the text representing the link itself is not, so we exclude it.
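As an illustration of this step only (our actual module runs on the server and handles many more special cases), the following minimal Python sketch drops tags, script and style sheet bodies, and anchor (link) text using the standard html.parser library:

    # Sketch of the content-extraction step: strips tags, <script>/<style>
    # bodies, and anchor (link) text, keeping only the remaining visible text.
    # Illustration only; the real module handles many more exceptions.
    from html.parser import HTMLParser

    class ContentExtractor(HTMLParser):
        SKIP = {"script", "style", "a"}   # 'a': link text is dropped (Section 2.1)

        def __init__(self):
            super().__init__()
            self.depth = 0                # nesting depth inside skipped elements
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self.depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self.depth > 0:
                self.depth -= 1

        def handle_data(self, data):
            if self.depth == 0 and data.strip():
                self.chunks.append(data.strip())

    extractor = ContentExtractor()
    extractor.feed("<html><body><p>Real content.</p><a href='x'>link text</a>"
                   "<script>var junk;</script></body></html>")
    print(" ".join(extractor.chunks))     # -> "Real content."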

The issues mentioned above are just a small part of the whole problem. There are many such exceptions, and fixing them takes a long time because many exceptions are discovered only in the parsing step, not in this step. The 'contents' part marked by the blue box in Figure 2 shows the content text after this process.

2.2. Parsing Contents

The contents produced by the previous step could be used to build the document vector representing the document, but they are not sufficient to represent the features of each document, because they contain all the terms in the contents. Most of the important features of a document are carried by noun terms and a few verb terms, yet the contents include not only nouns and verbs but also adjectives, adverbs, prepositions, conjunctions, and so on. Moreover, one term with the same meaning may appear in various surface forms, because this is natural language used by humans. So we need to parse the contents and extract only the noun terms. The reasons for extracting only nouns are that verbs carry little of a document's meaning relative to nouns, and much of the meaning expressed by verbs is covered by nouns. For parsing, we used the Korean parser tool "HanNaNum." The "Parsed Data" part marked by the light blue box in Figure 2 shows the parsed result of the content text after this process.
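For illustration only: the KoNLPy package provides a Python wrapper class, Hannanum, around the same analyzer, so the noun-extraction step can be sketched as below (our system integrates the Java tool directly rather than going through this wrapper):

    # Illustration of the parsing step: extract only noun terms from the content.
    from konlpy.tag import Hannanum

    parser = Hannanum()
    content = "우주 탐사선이 화성에 착륙했다"   # example content text
    nouns = parser.nouns(content)               # keep noun terms only
    print(nouns)                                # e.g. ['우주', '탐사선', '화성', '착륙']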

2.3. Document Vector

Each document is represented by a sparse vector of weights

$d_j = \langle w_{1j}, w_{2j}, \ldots, w_{rj} \rangle$,

where $r$ is the number of terms that occur at least once in the collection of documents (so $r$ also serves as an index over terms) and $j$ is an index over documents. We call this sparse vector a Document Vector. The vector is very sparse because the index $r$ is counted over the whole training document set, while one document of course contains only a few of those terms. Each term index must be globally unique over the whole training document set, because each index will be used as one feature dimension in the training step of machine learning. So we made an index table, implemented as a hash table, to assign an index to each term; to get the index of a term, we look it up in this table. An index table for the category of each document is made in the same way.
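A minimal sketch of this index table, assuming a Python dict as the hash table (the helper name index_of is hypothetical, for illustration only):

    # Sketch of the index table: a hash table (dict) assigning each term a
    # globally unique integer index over the whole training collection.
    term_index = {}                       # term -> global integer index
    category_index = {}                   # category name -> integer label

    def index_of(term, table):
        """Return the index of `term`, assigning the next free index if new."""
        if term not in table:
            table[term] = len(table) + 1  # 1-based, as required by the SVM format
        return table[term]

    for doc in [["우주", "탐사선"], ["화성", "우주"]]:   # parsed noun lists
        for term in doc:
            index_of(term, term_index)
    print(term_index)                     # {'우주': 1, '탐사선': 2, '화성': 3}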

In the training step of text classification, we need a document vector matrix in which each row is the document vector of a parsed training document. So we built the training document vector matrix containing each document vector together with its category. We also need a document vector for each input document in the application. The "Document Vector" part marked by the dark brown box in Figure 2 shows the document vector of the parsed text after this process; it shows only the terms used in one document, and the index assigned to each term appears in the next step.

2.4. Document Vector for SVM

We already made the document vector in the previous step, but SVM needs a slightly different document vector format, so the document vector has to be converted. The first constraint of the document vector format used by SVM is that the feature terms of the document vector and its category must be the integer indices described above, and the features must be ordered by index. So we converted the document vector format to fit SVM. We did one more thing in this step to obtain higher precision: we use TFIDF as the weight of each term. The TFIDF is computed as follows:

$w_{rj} = \mathrm{tfidf}(t_r, d_j) = \#(t_r, d_j) \cdot \log \frac{|C|}{\#_C(t_r)}$,

where $\#(t_r, d_j)$ is the number of times term $t_r$ occurs in document $d_j$, $|C|$ is the number of documents in the collection $C$, and $\#_C(t_r)$ is the number of documents in which $t_r$ occurs.

Using this weighting function generally yields higher precision, because it assigns higher weights to more discriminative terms: if a term occurs in only a few documents, it has more discriminative power, and TFIDF exploits exactly this idea. The "Document Vector for SVM" part marked by the purple box in Figure 2 shows the document vector for SVM after this process.
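A minimal sketch combining this weighting with the ordered "label index:value" format of Section 2.4 (the exact normalization in our system may differ from this illustration):

    # Sketch of TFIDF weighting and conversion to the SVM input format:
    # "label index1:value1 index2:value2 ..." with indices in increasing order.
    import math
    from collections import Counter

    def tfidf_svm_line(label, terms, term_index, doc_freq, num_docs):
        counts = Counter(terms)                  # #(t_r, d_j): term frequency
        weights = {
            term_index[t]: tf * math.log(num_docs / doc_freq[t])
            for t, tf in counts.items()
        }
        feats = " ".join(f"{i}:{w:.4f}" for i, w in sorted(weights.items()))
        return f"{label} {feats}"

    # toy collection: 3 documents, term '우주' occurs in 2 of them
    doc_freq = {"우주": 2, "탐사선": 1}
    term_index = {"우주": 1, "탐사선": 2}
    print(tfidf_svm_line(3, ["우주", "탐사선", "우주"], term_index, doc_freq, 3))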

3. Text Categorization with Support Vector Machines

In this section, we introduce SVM and TC with SVM. The specific use of SVM in our application is described afterward.

3.1. Support Vector Machines

SVM is a new machine learning technique based on statistical learning theory. SVM has proved to be very effective in dealing with high-dimensional feature spaces, the most challenging problem for other machine learning techniques due to the so-called curse of dimensionality.

For the sake of simplicity, let us examine the basic idea of SVM in the linearly separable case (i.e., the training sample is separable by a hyperplane). Based on the Structural Risk Minimization principle, the SVM algorithm tries to find a hyperplane such that the margin (i.e., the minimal distance of any training point to the hyperplane, see Figure 3) is optimal. To find the optimal margin, quadratic optimization is used, in which the basic computation is the dot product of two points in the input space.
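This optimization can be stated compactly. For training points $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, the standard hard-margin formulation is

$\min_{w,b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1 \ \text{for all } i,$

and the resulting margin is $1/\lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.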

Figure 3: Linear classifier and margins.

For the nonlinearly separable case, SVM uses kernel-based methods to map the input space to a so-called feature space. The basic idea of kernel methods is to find a map Φ from the input space, which is not linearly separable, to a linearly separable feature space (see Figure 4). However, the problem with the feature space is that it usually has very many or even infinitely many dimensions, so computing dot products in this space directly is intractable. Fortunately, kernel methods overcome this problem by finding maps such that computing dot products in the feature space reduces to computing kernel functions in the input space:

$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$,


where $k(x, y)$ is a kernel function in the input space and $\langle \Phi(x), \Phi(y) \rangle$ is a dot product in the feature space. Therefore, the dot product in the feature space can be computed even if the map Φ is unknown. The most widely used kernel functions are Gaussian RBF, polynomial, sigmoidal, and B-splines.
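For reference, two of these kernels have simple closed forms: the Gaussian RBF kernel $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$, which is the kernel we use in Section 3.3, and the polynomial kernel $k(x, y) = (\langle x, y \rangle + c)^d$.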

Figure 4: Example of mapping from input space to feature space.

3.2. Why Does SVM Work Well for Text Categorization?

SVM has been known to be very efficient in TC for the following reasons:

High-dimensional input spaces: Normally, in TC the dimension of the input space is very large, which is very challenging for other machine learning techniques. However, SVM does not depend on the number of features, and thus has the potential to handle large feature spaces.

Few irrelevant features: In TC, there are very few irrelevant features. Therefore, using feature selection, as other machine learning approaches do to reduce the number of features, would also decrease the accuracy of the classifiers. In contrast, SVM can handle a very large feature space, so feature selection is only optional.

Document vectors are sparse: In TC, each document vector has only a few entries which are non-zero (i.e., it is sparse). Moreover, there is much theoretical and empirical evidence that SVM is well suited to problems with dense concepts and sparse instances, such as TC.

Most TC problems are linearly separable: In practice, many TC problems are known to be linearly separable. Therefore, SVM is suitable for these tasks.

3.3. Using SVM in Our Application

The SVM classifier of our application is implemented on top of the LIBSVM library. The library implements C-support vector classification (C-SVC), ν-support vector classification (ν-SVC), and ν-support vector regression (ν-SVR); it incorporates many efficient features such as caching, chunking, and sequential minimal optimization, and performs well on moderate-sized problems (about tens of thousands of training data points). In our application, we use only C-SVC with the RBF kernel function.
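For illustration, an equivalent C-SVC with RBF kernel can be trained through LIBSVM's own Python interface; the file names below are hypothetical, and our actual classifier calls the library from within the application:

    # Illustration: training a C-SVC with RBF kernel via LIBSVM's Python interface.
    from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

    y_train, x_train = svm_read_problem("train.svm")   # sparse "label i:v" format
    y_test,  x_test  = svm_read_problem("test.svm")

    # -s 0: C-SVC, -t 2: RBF kernel, -c: C parameter, -g: gamma = 1/k
    gamma = 1.0 / len(y_train)
    model = svm_train(y_train, x_train, f"-s 0 -t 2 -c 1024 -g {gamma} -q")

    labels, (accuracy, _, _), _ = svm_predict(y_test, x_test, model)
    print(f"test accuracy: {accuracy:.2f}%")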

To train our SVM classifier we use a collection of 7566 Korean scientific web documents that have already been categorized into 16 broad categories by domain experts. After parsing and preprocessing, the total number of terms is 189815, meaning that each document is represented by a 189815-dimensional sparse vector, a dimension large enough to be considered extremely difficult for other machine learning techniques such as neural networks or decision trees. Aside from the training data set, a separate testing data set consisting of 490 web documents, already classified into the same 16 categories, is used to evaluate the classifier.

To decide the best parameters (i.e., C and γ) for the classifier, a grid search (i.e., a search over various pairs (C, γ)) needs to be performed. However, since a full grid search is very time-consuming, we fix the γ parameter at 1/k (where k is the number of training data points) and only try various values of C. A recommended range of values of C is $2^0, 2^1, 2^2, \ldots, 2^{10}$, which is known to be good enough in practice. The classification accuracies obtained over the training and testing data with various values of C are shown in Table 1 below.

Table 1: Classification accuracy over training and testing data for various values of the C parameter.

C       Accuracy over training data    Accuracy over testing data
1       59.91%                         44.90%
2       69.53%                         53.06%
4       78.09%                         57.35%
8       85.47%                         61.22%
16      90.92%                         64.69%
32      94.14%                         68.57%
64      96.52%                         71.22%
128     97.82%                         71.02%
256     98.23%                         71.02%
512     98.49%                         72.45%
1024    98.61%                         72.45%

As we can see in Table 1, with C=1024 the classifier performs most accurately. Therefore, we choose C=1024 for our classifier.
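The parameter sweep itself is a short loop; continuing the previous sketch:

    # Sweep C over 2^0 .. 2^10 with gamma fixed at 1/k, as in Table 1.
    for exp in range(11):
        C = 2 ** exp
        model = svm_train(y_train, x_train, f"-s 0 -t 2 -c {C} -g {gamma} -q")
        _, (train_acc, _, _), _ = svm_predict(y_train, x_train, model)
        _, (test_acc, _, _), _  = svm_predict(y_test, x_test, model)
        print(f"C={C}: train {train_acc:.2f}%  test {test_acc:.2f}%")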

4. Application Implementation

This section describes how we implemented our application and its functions. We used a simple web-browser-style program to demonstrate the functions of our modules.

4.1. Basic Structure

Our program is composed of a client side (a web browser) and a server side (a processing module).

Figure 5: Basic concept of the program


The client part is implemented as a simple web browser. During web surfing, if the user wants to add the current page to his or her favorites list, he or she can do so simply by pressing the 'Add Favorite' button at the top of our program. After showing some of the HTML code of the current page, the program sends that code to the web server. The web server processes the code into a form appropriate for SVM (by parsing and converting) and sends it back to the client program. After receiving that data (a preprocessed array of numbers suitable for the SVM module), the client side categorizes it into one of the 16 pre-defined categories using the SVM module. Finally, the client program creates a .URL file for the page (a common format that can be used directly in the Explorer favorites list) in the directory of the assigned category.
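This final step can be sketched as follows; the [InternetShortcut] format is the standard format of Explorer favorites, and the paths and names below are only examples:

    # Sketch of the final client step: writing an Explorer-compatible .URL file
    # into the directory of the predicted category. Paths are hypothetical.
    import os

    def save_favorite(base_dir, category, name, url):
        folder = os.path.join(base_dir, category)
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, name + ".url")
        with open(path, "w", encoding="utf-8") as f:
            # The standard [InternetShortcut] format used by Explorer favorites.
            f.write("[InternetShortcut]\nURL=" + url + "\n")
        return path

    print(save_favorite("Favorites", "Bio", "혈액형", "http://example.com/page"))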

We focused especially on the science domain, because we cannot cover all the subjects of news on the Internet. Pages are categorized into the 16 categories listed below.

"Sub","Spa","Phy","Mea","Mat","Geo","Eng","Ene","Ecl","Ear","Che","Bio","Ast","Agr","Aer","Aco"Sub : Time travel Spa : Space Phy : Physics Mea: Measure Mat : MathematicsGeo : Geology Eng : Engineering Ene :Energy Ecl : Ecology Ear: EarthChe : chemistry Bio: Biology Ast :Astronology Agr : agricultureAer : Aero Aco : Acoustic

4.2. How to Use the Program

The following procedure shows how to use the program.

Figure 6: Main window of Client Program

The client program has an interface similar to Internet Explorer. If you want to add the current page to the favorites list and have it categorized automatically, press the 'Add Favorite' button (Figure 6).


Figure 7: Source Code View

After pressing the button, you can see a dialog box (Figure 7) showing the HTML code of the current web page. Because the code contains so many unnecessary parts, such as tags, it has to be sent to the preprocessing server to be preprocessed.

Moreover, you can enter a specific name for this page in the '바로가기' (shortcut) box. Unless you write a specific name here, the page will be added to its category as 'temporary.url'.

Figure 8: Result Dialog

After a few tens of seconds, you can see a dialog box (Figure 8) indicating the result of processing (the category: you can see that this page has been categorized into the Bio part of the directory shown above) and the name of your favorite file (혈액형.url).


You can use this result (the base directory and its categorized subdirectories) by simply copying and pasting it into your favorites list. If you do not want to make such an effort, press the 'Setup' button and modify the base directory of the favorites list of this program (Figure 9), as shown below.

Figure 9: Setup Dialog

Figure 10: After changing the base directory

If you set the base directory of our program to the Favorites directory of Explorer, you can use the result directly, as shown above (Figure 10).


5. Conclusion and Further Work

We tested the feasibility of applying a learning algorithm to web-browser favorites categorization using simple modules. In particular, SVM was used to classify HTML documents because SVM is known to be good for character and document classification; we chose this classifier to sort the HTML code obtained at browsing time into the user's favorites lists.

In our current structure, the client program has too much work to do, such as waiting for the preprocessing result from the server and running the SVM module. These functions were originally planned to be implemented on the web server. However, because of some problems (perhaps differences in protocol usage, or in how data such as floating-point values are handled), we could not put the whole processing part on our server. If this problem were solved, we think our client module could handle these parts without any time delay (see Figure 11).

If the server could do all the heavy computation and just send the result to the client, there would be no time delay for such processing on the client. Moreover, this is good for managing, fixing, and updating the module, since only the SVM model on the server would need to be replaced.

Figure 11: Ideal Structure of server-client relationship

Moreover, this structure can easily be extended to other applications such as spam-mail filtering, e-mail categorization, and patent classification.