Document Classification and Clustering
Document Clustering
By: Ankur Shrivastava, Ritesh Modi, Vinayak Bharti
Introduction
• A document clustering scheme aims to minimize within-cluster (intra-cluster) distances and maximize between-cluster (inter-cluster) distances.
• Given a heterogeneous data set, clustering is performed on relevant features.
• The document clusters are presented in different visual forms, as required.
Block Diagram
• Raw corpus (heterogeneous documents): Text data extraction from multimedia documents
• Homogeneous Data: Documents in plain text format
• Preprocessing: Removing stop words from documents and stemming
• Feature Extraction: Relevant features of documents
• Document Clustering: Clustered documents
Part 1: Conversion to Homogeneous form
The heterogeneous data is converted into plain text using Apache Tika. Tika offers several ways to parse a file, which differ in their level of control, flexibility, and complexity.
• Parsing: The auto-detect parser determines the content type, such as a PDF or HTML file, and hands the file to the appropriate parser.
• Plain-text conversion: The function returns the content of the document's body as a plain-text string.
Together, these steps produce a plain text file.
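The detect-then-parse flow above can be sketched in a few lines. This is not Tika and does not use its API; it is a standard-library Python analogue that only recognizes HTML and plain text. The names detect_type and to_plain_text are hypothetical, and the byte signatures checked are a simplifying assumption.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def detect_type(raw: bytes) -> str:
    """Guess a media type from the leading bytes (assumed, simplified rules)."""
    head = raw[:512].lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "application/pdf"
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "text/plain"


def to_plain_text(raw: bytes) -> str:
    """Dispatch to a parser based on the detected type and return plain text."""
    kind = detect_type(raw)
    if kind == "text/html":
        parser = _TextExtractor()
        parser.feed(raw.decode("utf-8", errors="replace"))
        parser.close()
        return " ".join(parser.chunks)
    if kind == "text/plain":
        return raw.decode("utf-8", errors="replace")
    # PDF and other formats need a real parser, which this sketch omits.
    raise NotImplementedError(f"no parser registered for {kind}")
```

A real converter would register one parser per supported format; the point here is only the separation between type detection and format-specific parsing.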
Part 2: Feature Extraction
Apache UIMA (Unstructured Information Management Architecture) and the Stanford NLP library are used to extract the following features from the text files:
• Unigrams, Bigrams, Trigrams: An n-gram is a contiguous sequence of n words. N-grams of sizes 1, 2, and 3 are extracted from the corpus.
• Punctuation: Number of punctuation marks in the text.
• Capitals: Words written entirely in capital letters.
• #Sentences: Number of sentences in the text.
Preprocessing: Stop word removal and stemming (Porter Stemmer), as required.
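As a rough illustration of the surface features listed above, the following standard-library Python sketch counts n-grams, punctuation marks, all-capital words, and sentences. It does not use UIMA or the Stanford library. The tokenizer, the sentence splitter, and the tiny STOP_WORDS set are simplifying assumptions, and Porter stemming is left out because it is not in the standard library.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Split text into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def surface_features(text, remove_stop_words=False):
    tokens = tokenize(text)
    words = [t.lower() for t in tokens]
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return {
        "unigrams": Counter(ngrams(words, 1)),
        "bigrams": Counter(ngrams(words, 2)),
        "trigrams": Counter(ngrams(words, 3)),
        # Count every ASCII punctuation character in the raw text.
        "punctuation": sum(ch in string.punctuation for ch in text),
        # Words of two or more characters written entirely in capitals.
        "capitals": sum(1 for t in tokens if len(t) > 1 and t.isupper()),
        # Naive sentence split on runs of '.', '!' and '?'.
        "sentences": len([s for s in re.split(r"[.!?]+", text) if s.strip()]),
    }
```

For example, surface_features("NASA launched the rocket. The rocket landed!") reports two sentences, two punctuation marks, one all-capital word, and the bigram ("the", "rocket") twice.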
Part 2: Feature Extraction (continued)
• Parts-of-Speech (POS) Tagging: Identification of words as nouns, verbs, adjectives, adverbs, etc. The Stanford POS Tagger is used, and a count of each POS tag is maintained.
• Named Entities: Identification of named entities such as Person, Location, or Organization. The Stanford NER is used, and a count of named entities is maintained.
• Positive and Negative words: Count of positive and negative words in the text.
• URLs: URLs in the text.
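The word-polarity and URL features can be approximated with a lexicon lookup and a regular expression. The sketch below is standard-library Python under stated assumptions: POSITIVE_WORDS and NEGATIVE_WORDS are hypothetical placeholder lexicons, and URL_PATTERN is a loose pattern rather than a full URL grammar. POS tagging and named entity recognition are not shown because they need a trained tagger.

```python
import re

# Hypothetical placeholder lexicons; a real system would load a full word list.
POSITIVE_WORDS = {"good", "great", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "sad"}

# Loose pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def lexical_features(text):
    # Strip trailing sentence punctuation that the loose pattern picks up.
    urls = [u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(text)]
    # Remove URLs first so words inside them are not counted as sentiment.
    text_without_urls = URL_PATTERN.sub(" ", text)
    words = re.findall(r"[a-z']+", text_without_urls.lower())
    return {
        "urls": urls,
        "url_count": len(urls),
        "positive": sum(w in POSITIVE_WORDS for w in words),
        "negative": sum(w in NEGATIVE_WORDS for w in words),
    }
```

Removing URLs before counting is a design choice: otherwise a path segment such as "/good" would inflate the positive count.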
Part 3: Clustering
K-means clustering is performed on the feature space using Weka.
• Clustering is based on the Euclidean distance between each document's feature vector and the cluster means.
• The algorithm automatically normalizes numerical attributes when computing distances.
• After clustering, each input document is stored in a folder named after its cluster number.
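The clustering step can be sketched without Weka as plain k-means over normalized feature vectors. This is an illustrative standard-library Python version, not Weka's implementation: min-max scaling to [0, 1] is assumed for the normalization, the farthest-first initialization is a choice made here to keep the result deterministic, and the function names and "cluster<N>" folder labels are hypothetical.

```python
import math


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def k_means(rows, k, max_iter=100):
    """Return one cluster label per row, using Euclidean distance to the means."""
    points = min_max_normalize(rows)
    # Deterministic farthest-first initialization (an assumption of this sketch).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: assignments did not change.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_by_cluster(names, labels):
    """Map a folder name per cluster to the documents assigned to it."""
    folders = {}
    for name, label in zip(names, labels):
        folders.setdefault(f"cluster{label}", []).append(name)
    return folders
```

Given feature rows for six documents that form two well-separated groups, k_means(rows, 2) assigns one label per group, and group_by_cluster turns those labels into the folder layout described above. Writing the files to disk is left out of the sketch.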
Thank You