Document Classification and Clustering

Document Clustering

By: Ankur Shrivastava, Ritesh Modi, Vinayak Bharti

Introduction

• A document clustering scheme aims to minimize within-cluster (intra-cluster) distances and maximize between-cluster (inter-cluster) distances.

• Given a heterogeneous data set, clustering is performed on the relevant features.

• The document clusters are represented in different visual forms, as required.

Block Diagram

• Raw corpus (heterogeneous documents): Text data extraction from multimedia documents

• Homogeneous Data: Documents in plain text format

• Preprocessing: Removing stop words from documents and stemming

• Feature Extraction: Relevant features of documents

• Document Clustering: Clustered documents

Part 1: Conversion to Homogeneous form

The heterogeneous data is converted into a plain text file using the tool Apache Tika. Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.

• Parsing: The auto-detect parser (AutoDetectParser) determines the kind of content, such as a PDF or HTML file, and hands it to the appropriate parser.

• Plain-text Conversion: A function returns the content of the document's body as a plain-text string.

Together, these steps produce a plain text file.
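The detect-then-parse flow described above can be sketched in a few lines of Python. This is not Tika's API: it is a minimal stand-in that uses only the standard library, picks a parser from the file suffix (Tika also inspects the content itself), and handles only HTML and plain text. The names `PARSERS`, `html_to_text`, and `to_plain_text` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class BodyText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw):
    """Return the visible text of an HTML string as one plain-text string."""
    parser = BodyText()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical registry: file suffix -> function that returns plain text.
PARSERS = {".html": html_to_text, ".htm": html_to_text, ".txt": str.strip}


def to_plain_text(filename, raw):
    """Choose a parser from the file suffix and return the plain text."""
    suffix = Path(filename).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"no parser registered for {suffix!r}") from None
    return parser(raw)
```

Calling `to_plain_text("page.html", html_string)` returns the body text without markup; a format with no registered parser, such as PDF in this sketch, raises `ValueError`.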

Part 2: Feature Extraction

Apache UIMA (Unstructured Information Management Architecture) and the Stanford NLP library are used to extract the features.

List of features extracted from the text files:

• Unigrams, Bigrams, Trigrams: An n-gram is a contiguous sequence of n words. N-grams of sizes 1, 2, and 3 are extracted from the corpus.

• Punctuation: Number of punctuation marks in the text.

• Capitals: Words with all capital letters.

• #Sentences: Number of sentences in the text.

Preprocessing: Stop word removal and stemming (Porter Stemmer), as required.
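As an illustration of the surface features listed above, here is a minimal Python sketch using only the standard library. It is not the UIMA or Stanford NLP pipeline: the tokenizer and the sentence splitter are simple regular expressions chosen for this example, "Capitals" is assumed to mean a count of all-uppercase words longer than one letter, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def surface_features(text):
    """Compute n-gram counts and simple surface counts for one document."""
    # Assumption: a token is a run of letters, digits, or apostrophes.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    lowered = [t.lower() for t in tokens]
    return {
        "unigrams": Counter(ngrams(lowered, 1)),
        "bigrams": Counter(ngrams(lowered, 2)),
        "trigrams": Counter(ngrams(lowered, 3)),
        # Every ASCII punctuation character counts once.
        "punctuation": sum(ch in string.punctuation for ch in text),
        # Words written entirely in capitals, ignoring single letters.
        "capitals": sum(1 for t in tokens if len(t) > 1 and t.isupper()),
        # Assumption: a sentence ends with ., ! or ? followed by space or end.
        "sentences": len(re.findall(r"[.!?]+(?:\s|$)", text)),
    }
```

For the text "NASA launched a probe. It works!" this sketch reports 2 punctuation marks, 1 all-capital word, and 2 sentences.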

Part 2: Feature Extraction (continued)

• Parts-of-Speech (POS) Tagging: Identification of words as nouns, verbs, adjectives, adverbs, etc. The Stanford POS Tagger is used and a count of POS tags is maintained.

• Named Entities: Identification of named entities such as Person, Location, or Organization. The Stanford NER is used and a count of named entities is maintained.

• Positive and Negative words: Count of positive and negative words in the text.

• URLs: URLs found in the text.
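The lexicon-based features and the stop-word removal step can be sketched as follows. The word lists here are tiny hypothetical placeholders, not the lists used by the authors, and the URL pattern is a simple assumption. POS tagging, named-entity recognition, and Porter stemming need dedicated libraries and are left out of this sketch.

```python
import re

# Hypothetical placeholder lists; a real system would load full lexicons.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "and", "but", "of", "to", "see"}
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

# Assumption: a URL starts with http:// or https:// and runs to whitespace.
URL_RE = re.compile(r"https?://\S+")


def lexical_features(text):
    """Count URLs and sentiment words after removing stop words."""
    urls = URL_RE.findall(text)
    # Strip URLs first so their fragments are not counted as words.
    words = re.findall(r"[a-z']+", URL_RE.sub(" ", text).lower())
    content = [w for w in words if w not in STOP_WORDS]
    return {
        "urls": len(urls),
        "positive": sum(w in POSITIVE for w in content),
        "negative": sum(w in NEGATIVE for w in content),
        "content_words": content,
    }
```

For "The demo was great but the docs are poor, see http://example.com/docs" the sketch finds one URL, one positive word, and one negative word.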

Part 3: Clustering

K-means clustering is performed on the feature space using the tool Weka.

• Clustering is based on the Euclidean distance between each document's feature vector and the cluster means.

• The algorithm automatically normalizes numerical attributes when doing distance computations.

• Input documents are stored in folders titled with their cluster number.
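The clustering step can be illustrated with a small k-means sketch in plain Python. It is not Weka's implementation: min-max scaling is assumed as the normalization, the first k points are used as initial means to keep the example deterministic, and the function names are hypothetical. It requires Python 3.8 or later for `math.dist`.

```python
import math


def min_max_normalize(rows):
    """Scale every numeric column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def kmeans(rows, k, max_iter=100):
    """Cluster feature vectors; returns (labels, means in normalized space)."""
    points = min_max_normalize(rows)
    # Assumption: seed with the first k points (real tools seed randomly).
    means = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest mean (Euclidean).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the means are stable
        labels = new_labels
        # Update step: move each mean to the centroid of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means
```

Each returned label is a cluster number, which could then serve as the name of the folder that receives the corresponding document.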

Thank You
