Text Mining with D2K/T2K

Duane Searsmith
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
Office: (217) 244-9129
http://alg.ncsa.uiuc.edu

Michael Welge, Director
Loretta Auvil, Project Manager, lauvil@ncsa.uiuc.edu, (217) 265-8021

July 9, 2004


Page 1: Text Mining with D2K/T2K

Duane Searsmith
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
Office: (217) 244-9129
http://alg.ncsa.uiuc.edu

Michael Welge, Director
Loretta Auvil, Project Manager, lauvil@ncsa.uiuc.edu, (217) 265-8021

July 9, 2004

Text Mining with D2K/T2K

Page 2: Text Mining with D2K/T2K

alg | Automated Learning Group

Outline

• Text Mining Brief Intro
  • Unsupervised
  • Supervised
  • Information Extraction
  • …

• ALG Technology Pieces

• Demonstrations

• Discussion

Page 3: Text Mining with D2K/T2K

What is text mining?

• In simplified and practical terms, it is the extraction of a relatively small amount of information of interest from a large amount of text data.

But …

• You might not know what you’re looking for.
  • Discovering patterns in the haystack (clustering, mining associations).
• How to recognize a needle.
  • Sifting through the haystack (model building, supervised learning).
• Just the facts please.
  • Enumerating the make and model of every needle (information extraction).

Page 4: Text Mining with D2K/T2K


Common Tasks for Text Mining & Analysis

• Information retrieval

• Automatic grouping (clustering) of documents

• (Active) Classification

• Information extraction

• Topic detection and tracking

• Automatic summarization

• “Understanding” text and question answering

• Machine Translation

Page 5: Text Mining with D2K/T2K

Text Preprocessing

• Preprocessing (Text -> Numeric Representation)
  • Tokenization
  • Sentence Splitting
  • Part-of-Speech Tagging
  • Term Normalization (Stemming)
  • Filtering (Stops)
  • Chunking
  • Term Extraction
  • Filtering (Again)
  • Term Weighting
  • Other Transformations

• Resource Taxing
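A few of the preprocessing steps above (tokenization, stop filtering, stemming, term weighting) can be sketched in a few lines. This is a minimal illustration only, not T2K code: the stop list, the `naive_stem` suffix stripper, and the tf * log(N / df) weighting are assumptions chosen for brevity, and a real pipeline would use a proper stemmer and a much larger stop list.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real systems use larger ones.
STOP_WORDS = {"a", "an", "and", "is", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    term_lists = [preprocess(d) for d in docs]
    n_docs = len(term_lists)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors

docs = [
    "Clustering groups the documents",
    "Mining the text documents",
    "Extracting facts",
]
vectors = tfidf(docs)
```

Terms that occur in fewer documents receive higher weights, so in the first document "cluster" outweighs "document", which also appears in the second.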

Page 6: Text Mining with D2K/T2K

Clustering: Document Self-Organization

• Agglomerative (bottom up)
  • Quadratic time complexity
  • Sampling
    • Random
    • Partition

• Hard vs. Soft

• Unsupervised method

• Basic to all of these approaches is some heuristic for measuring similarity between documents and document groups (e.g., term co-occurrence)

[Figure labels: strongly similar arcs kept; weakly similar arcs broken]
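The idea of keeping strongly similar arcs and breaking weak ones can be sketched as thresholded single-link clustering: compare every pair of documents, keep the arcs whose similarity meets a threshold, and read the clusters off as connected components. This is the same grouping a single-link agglomerative tree gives when cut at that similarity. The cosine measure on raw term counts and the 0.4 threshold are assumptions for illustration, not the heuristic T2K uses.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_link_clusters(docs, threshold):
    """Keep arcs with similarity >= threshold; clusters are connected components."""
    vectors = [Counter(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Every pair is compared, hence the quadratic cost noted on the slide.
    for i, j in combinations(range(len(docs)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)  # strongly similar arc kept

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

docs = [
    "text mining finds patterns in text",
    "mining patterns from text data",
    "image analysis of satellite data",
    "satellite image segmentation",
]
clusters = single_link_clusters(docs, threshold=0.4)
```

With this threshold the two text-mining documents and the two image documents form separate clusters; the weak arc between the second and third documents (they share only "data") is broken. Lowering the threshold far enough merges everything into one cluster.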

Page 7: Text Mining with D2K/T2K

How to Recognize a Needle

• To classify your data you often need to build a model.

• To build a model you typically need examples from a “teacher” – metaphorically speaking.

• Finding good examples can be hard.

• T2K can also use active learning to help find good examples faster, making model building easier.
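One common form of active learning is uncertainty sampling: train on the few labeled examples available, then ask the teacher to label the unlabeled document the model is least sure about. The sketch below assumes a deliberately simple term-overlap model and a margin-based uncertainty measure; the function names and data are hypothetical, and the source does not say which strategy T2K applies.

```python
from collections import Counter

def train(labeled):
    """Sum term counts per class: a very simple centroid-style model."""
    model = {}
    for text, label in labeled:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model

def scores(model, text):
    """Score each class by the share of its term mass matched by the text."""
    words = text.lower().split()
    return {
        label: sum(counts[w] for w in words) / sum(counts.values())
        for label, counts in model.items()
    }

def most_uncertain(model, pool):
    """Uncertainty sampling: pick the pool item with the smallest score margin."""
    def margin(text):
        ranked = sorted(scores(model, text).values(), reverse=True)
        return ranked[0] - ranked[1]  # requires at least two classes
    return min(pool, key=margin)

labeled = [
    ("stock market shares fall", "finance"),
    ("team wins the final match", "sports"),
]
pool = [
    "shares rise on the stock market",
    "the match ended in a draw",
    "market reacts as team owner sells shares",
]
model = train(labeled)
query = most_uncertain(model, pool)
```

The selected document mixes vocabulary from both classes, so its label tells the model more than another clear-cut example would. In a full loop the teacher's answer is appended to `labeled`, the model is retrained, and the query step repeats.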

Page 8: Text Mining with D2K/T2K


Pattern Mining

• Finding frequent item sets -> Rule Discovery

• Many methods: Apriori, CHARM, FP-growth, CLOSET

• Working with Jiawei Han and students -- Hwanjo Yu and Xiaolei Li

• Application: topic tree construction
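Of the methods listed, Apriori is the simplest to sketch: count single terms, keep those that meet a minimum support, then repeatedly join the survivors into larger candidates and prune any candidate that has an infrequent subset. The version below treats each document as a set of terms; the sample data and the support threshold are made up for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        """Count candidate occurrences and keep only the frequent ones."""
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        prev = list(frequent)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

docs = [
    {"text", "mining", "cluster"},
    {"text", "mining", "rule"},
    {"text", "cluster"},
    {"mining", "rule"},
]
frequent = apriori(docs, min_support=2)
```

Rules follow from the frequent item sets: the confidence of a rule X -> Y is support(X ∪ Y) / support(X), so here "rule" -> "mining" has confidence 2/2, while "mining" -> "rule" has confidence 2/3.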

Page 9: Text Mining with D2K/T2K


Just the Facts Please

• Finding a document that has the information you need is often not the end goal.

• To extract information you must first recognize it – you need to build a model, and that means you need to have examples.

• Levels of IE: What’s hard and what’s harder?
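As one possible reading of the easy end of that scale, facts with a fixed surface form, such as phone numbers and URLs, can be pulled out with hand-written patterns and no training examples. The patterns and the `extract_facts` helper below are assumptions for illustration; names, events, and relations between entities are the harder cases where a learned model becomes necessary.

```python
import re

# Hand-written patterns (assumptions): a US-style phone number and a URL.
PHONE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")
URL = re.compile(r"https?://[^\s,]+")

def extract_facts(text):
    """Return structured fields pulled from free text by fixed patterns."""
    return {"phones": PHONE.findall(text), "urls": URL.findall(text)}

text = "Office: (217) 244-9129, see http://alg.ncsa.uiuc.edu for details."
facts = extract_facts(text)
```

The result is a small record rather than a document, which is the point of the slide: the goal is the fact itself, not the page that contains it.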

Page 10: Text Mining with D2K/T2K


D2K

Page 11: Text Mining with D2K/T2K

D2K Overview

D2K Features

• Extension of existing API
  • Provides the capability to programmatically connect modules and set properties.
  • Allows D2K-driven applications to be developed.
  • Provides ability to pause and restart an itinerary.

• Enhanced Distributed Computing
  • Allows modules that are re-entrant to be executed remotely.
  • Uses Jini services to look up distributed resources.
  • Includes interface for specifying the runtime layout of a distributed itinerary.

• Processor Status Overlay
  • Shows utilization of distributed computing resources.

• Distributed Checkpointing

• Resource Manager
  • Provides a mechanism for treating selected data structures as if they were stored in global memory.
  • Provides memory space that is accessible from multiple modules running locally as well as remotely.

• Batch Processing / Web Services
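To make the first feature group concrete, the sketch below shows what connecting modules programmatically, setting their properties, and pausing and resuming an itinerary could look like in general. It is not the D2K API: the `Module` and `Itinerary` classes, their methods, and the `pause_after` parameter are all hypothetical names invented for this illustration.

```python
class Module:
    """Hypothetical processing step with named properties (not a D2K class)."""

    def __init__(self, name, func, **properties):
        self.name = name
        self.func = func
        self.properties = properties

    def run(self, data):
        return self.func(data, **self.properties)


class Itinerary:
    """Hypothetical linear chain of modules that can pause and resume."""

    def __init__(self):
        self.modules = []
        self.position = 0  # index of the next module to run

    def connect(self, module):
        self.modules.append(module)
        return self  # allow chained connect() calls

    def run(self, data, pause_after=None):
        """Run from the current position; stop early after `pause_after`."""
        while self.position < len(self.modules):
            module = self.modules[self.position]
            data = module.run(data)
            self.position += 1
            if module.name == pause_after:
                break
        return data


itinerary = (
    Itinerary()
    .connect(Module("tokenize", lambda text: text.lower().split()))
    .connect(Module("filter",
                    lambda toks, stops: [t for t in toks if t not in stops],
                    stops={"the", "of"}))
    .connect(Module("count", lambda toks: len(toks)))
)

partial = itinerary.run("The mining of the text", pause_after="filter")
total = itinerary.run(partial)  # resumes at the "count" module
```

Because the itinerary remembers its position, the second `run` call picks up where the first stopped instead of starting over.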

Page 12: Text Mining with D2K/T2K


D2K/T2K/I2K - Data, Text, and Image Analysis

Information Visualization

Page 13: Text Mining with D2K/T2K

The Technology Pieces

• The Engine (distributed, parallelized, persistent)
• Core Modules (building blocks)
• T2K is a specialized set of modules for text analysis
• I2K is a specialized set of modules for image analysis
• D2K Toolkit (rapid development environment)
• ThemeWeaver is an independent application that uses the D2K engine to run algorithms constructed from T2K modules. It is a demonstration platform.
• Other D2K-driven applications (StreamLined, EMO, …)

[Diagram labels: D2K Engine, Core Modules, T2K, I2K, Toolkit, Applications]

Page 14: Text Mining with D2K/T2K

T2K Core

T2K Core 1.0 (Beta)

• Tokenization
• POS Tagging
• Stemming
• Chunking
• Filters
• Term Weighting
• Supervised / Unsupervised Learning
• GATE Integration
• Pattern Mining
• Text Streams
• Summarization

Page 15: Text Mining with D2K/T2K


ThemeWeaver

Page 16: Text Mining with D2K/T2K

ThemeWeaver: Prototype Text Clustering Application

• Hard clustering algorithms
  • Modified k-means (3 sampling methods)

• Soft clustering
  • Suffix tree based algorithm
  • Can be used for longer documents

• Visualizations
  • “Single link” graph representation
  • Dendrogram cluster tree
  • Clusters over time

• Drill down and backtrack UI

• D2K/T2K Driven

Page 17: Text Mining with D2K/T2K

The ALG Team

Staff: Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge

Students: Tyler Alumbaugh, Bradley Berkin, Jacob Biehl, John Cassel, Peter Groves, Olubanji Iyun, Sang-Chul Lee, Young-Jin Lee, Xiaolei Li, Brian Navarro, Scott Ramon, Sunayana Saha, Martin Urban, Bei Yu, Hwanjo Yu

Page 18: Text Mining with D2K/T2K


* Demo / Discussion *