Text Mining with D2K/T2K
Duane Searsmith
Automated Learning Group
National Center for Supercomputing Applications
University of [email protected]: (217) 244-9129
http://alg.ncsa.uiuc.edu
Michael Welge, Director, [email protected]
Loretta Auvil, Project Manager, [email protected], (217) 265-8021
July 9, 2004
alg | Automated Learning Group
Outline
• Text Mining Brief Intro
  • Unsupervised
  • Supervised
  • Information Extraction
  • …
• ALG Technology Pieces
• Demonstrations
• Discussion
What is text mining?
• In simplified and practical terms, it is the extraction of a relatively small amount of information of interest from a large mass of text data.
But …
• You might not know what you’re looking for.
  • Discovering patterns in the haystack (clustering, mining associations).
• How to recognize a needle.
  • Sifting through the haystack (model building, supervised learning).
• Just the facts, please.
  • Enumerating the make and model of every needle (information extraction).
Common Tasks for Text Mining & Analysis
• Information retrieval
• Automatic grouping (clustering) of documents
• (Active) Classification
• Information extraction
• Topic detection and tracking
• Automatic summarization
• “Understanding” text and question answering
• Machine Translation
Text Preprocessing
• Preprocessing (Text -> Numeric Representation)
  • Tokenization
  • Sentence Splitting
  • Part-of-Speech Tagging
  • Term Normalization (Stemming)
  • Filtering (Stops)
  • Chunking
  • Term Extraction
  • Filtering (Again)
  • Term Weighting
  • Other Transformations
• Resource Taxing
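A few of the preprocessing steps listed above (tokenization, stop filtering, term normalization, term weighting) can be sketched in a few lines. This is not T2K code: the function names, the tiny stop list, and the crude suffix-stripping rule are assumptions made for illustration, and a real pipeline would use a proper stemmer and tagger.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}  # illustrative stop list


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Very rough normalization: strip a few common suffixes (not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then normalize each remaining term."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def tfidf(docs):
    """Turn raw documents into sparse {term: weight} vectors using TF-IDF."""
    term_lists = [preprocess(d) for d in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append({t: c * math.log(n / doc_freq[t]) for t, c in counts.items()})
    return vectors


docs = ["The mining of text", "Mining data streams", "Clustering text documents"]
vectors = tfidf(docs)
```

Here `vectors[0]` holds weights for the terms `min` and `text`. Even this toy version makes one pass to count document frequencies and another to weight terms, which hints at why preprocessing is resource taxing on large collections.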
Clustering: Document Self-Organization
• Agglomerative (bottom up)
• Quadratic time complexity
• Sampling
  • Random
  • Partition
• Hard vs. Soft
• Unsupervised method
• The basic notion behind all of these approaches is some heuristic for measuring similarity between documents and document groups (term co-occurrence).
[Figure: strongly similar arcs kept, weakly similar arcs broken]
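The idea of keeping strongly similar arcs and breaking weak ones can be sketched as a thresholded similarity graph whose connected components become the clusters. This is an illustrative assumption, not the T2K implementation: the cosine measure, the `threshold` parameter, and the function names are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_by_arcs(vectors, threshold):
    """Keep arcs with similarity >= threshold and return the connected components."""
    n = len(vectors)
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Comparing every pair of documents is the quadratic step.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # strong arc: merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


docs = [
    {"text": 1.0, "mining": 1.0},
    {"text": 1.0, "mining": 0.5},
    {"image": 1.0, "pixel": 1.0},
    {"image": 0.8, "pixel": 1.0},
]
clusters = cluster_by_arcs(docs, threshold=0.5)  # [[0, 1], [2, 3]]
```

The all-pairs loop is where the quadratic cost comes from, which is why running the comparison on a sample of the collection is attractive for large document sets.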
How to Recognize a Needle
• To classify your data you often need to build a model.
• To build a model you typically need examples from a “teacher” – metaphorically speaking.
• Finding good examples can be hard.
• T2K can also use active learning to help find good examples faster, which makes model building easier.
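One common form of active learning is uncertainty sampling: the model asks the teacher to label the example it is least sure about. The sketch below assumes that strategy. The slides do not say which strategy T2K uses, and the word-count scorer, the function names, and the `teacher` oracle are hypothetical stand-ins.

```python
from collections import Counter


def train(labeled):
    """Count how often each word appears in positive vs. negative examples."""
    pos, neg = Counter(), Counter()
    for text, label in labeled:
        (pos if label else neg).update(text.split())
    return pos, neg


def score(model, text):
    """A positive score leans 'relevant'; a negative score leans 'not relevant'."""
    pos, neg = model
    return sum(pos[w] - neg[w] for w in text.split())


def most_uncertain(model, pool):
    """Uncertainty sampling: pick the pooled document whose score is closest to zero."""
    return min(pool, key=lambda text: abs(score(model, text)))


def active_learn(labeled, pool, teacher, rounds):
    """Repeatedly ask the teacher to label the example the model is least sure of."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(min(rounds, len(pool))):
        model = train(labeled)
        query = most_uncertain(model, pool)
        pool.remove(query)
        labeled.append((query, teacher(query)))
    return train(labeled), labeled


def teacher(text):
    """Hypothetical oracle standing in for a human labeler."""
    return any(word in text.split() for word in ("needle", "steel", "sharp"))


seed = [("needle sharp steel", True), ("hay dry grass", False)]
pool = ["steel needle", "dry hay", "sharp grass", "steel wire"]
model, labeled = active_learn(seed, pool, teacher, rounds=2)
```

With this data the first query is "sharp grass", the one document that mixes a positive and a negative word, so the teacher's effort goes to the ambiguous case instead of the easy ones.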
Pattern Mining
• Finding frequent item sets -> Rule Discovery
• Many methods: Apriori, Charm, FPGrowth, CLOSET
• Working with Jiawei Han and students -- Hwanjo Yu and Xiaolei Li
• Application: topic tree construction
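As an illustration of the frequent-item-set step, here is a minimal Apriori sketch that treats each document as a set of terms. It is a textbook-style version written for this example, not the code used in T2K, and the sample term sets are invented.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


docs = [
    {"text", "mining", "cluster"},
    {"text", "mining"},
    {"text", "cluster"},
    {"image", "cluster"},
]
frequent = apriori(docs, min_support=2)

# Rule discovery: confidence of the rule "mining -> text".
confidence = frequent[frozenset({"text", "mining"})] / frequent[frozenset({"mining"})]
```

Rules are then read off the frequent sets: here every document containing "mining" also contains "text", so the rule "mining -> text" has confidence 1.0.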
Just the Facts Please
• Finding a document that has the information you need is often not the end goal.
• To extract information you must first recognize it – you need to build a model, and that means you need to have examples.
• Levels of IE: What’s hard and what’s harder?
D2K
D2K Features
• Extension of existing API
  • Provides the capability to programmatically connect modules and set properties.
  • Allows D2K-driven applications to be developed.
  • Provides ability to pause and restart an itinerary.
• Enhanced Distributed Computing
  • Allows modules that are re-entrant to be executed remotely.
  • Uses Jini services to look up distributed resources.
  • Includes interface for specifying the runtime layout of a distributed itinerary.
• Processor Status Overlay
  • Shows utilization of distributed computing resources.
• Distributed Checkpointing
• Resource Manager
  • Provides a mechanism for treating selected data structures as if they were stored in global memory.
  • Provides memory space that is accessible from multiple modules running locally as well as remotely.
• Batch Processing / Web Services
D2K Overview
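To make the idea of programmatically connecting modules, setting properties, and pausing an itinerary more concrete, here is a toy sketch. It is not the D2K API (D2K is a Java framework and its real class and method names are not shown in these slides). The `Module` and `Itinerary` classes and all their methods are hypothetical, and the chain is linear for simplicity.

```python
class Module:
    """A processing step with named properties (hypothetical, not the D2K API)."""

    def __init__(self, name, func, **properties):
        self.name = name
        self.func = func
        self.properties = properties

    def set_property(self, key, value):
        self.properties[key] = value

    def run(self, data):
        return self.func(data, **self.properties)


class Itinerary:
    """A linear chain of connected modules that can be paused and resumed."""

    def __init__(self):
        self.modules = []
        self.position = 0  # index of the next module to run
        self.data = None

    def connect(self, module):
        self.modules.append(module)
        return self  # allow chained connect() calls

    def start(self, data):
        self.data, self.position = data, 0

    def step(self):
        """Run one module; return False once the itinerary has finished."""
        if self.position >= len(self.modules):
            return False
        self.data = self.modules[self.position].run(self.data)
        self.position += 1
        return True


lowercase = Module("lowercase", lambda text: text.lower())
split = Module("split", lambda text, sep: text.split(sep), sep=" ")
keep_long = Module(
    "filter", lambda toks, min_len: [t for t in toks if len(t) >= min_len], min_len=3
)

itinerary = Itinerary().connect(lowercase).connect(split).connect(keep_long)
keep_long.set_property("min_len", 5)
itinerary.start("Text Mining with D2K")
itinerary.step()              # run the first module, then "pause"
paused_state = itinerary.data
while itinerary.step():       # resume and run the remaining modules
    pass
result = itinerary.data       # ["mining"]
```

Because the itinerary keeps its position and intermediate data, the driving application decides when each module runs. That is the property that makes pausing, restarting, and checkpointing possible in this toy model.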
D2K/T2K/I2K - Data, Text, and Image Analysis
Information Visualization
The Technology Pieces
• The Engine (distributed, parallelized, persistent)
• Core Modules (building blocks)
• T2K is a specialized set of modules for text analysis
• I2K is a specialized set of modules for image analysis
• D2K Toolkit (rapid development environment)
• ThemeWeaver is an independent application that uses the D2K engine to run algorithms constructed from T2K modules. It is a demonstration platform.
• Other D2K-driven applications (StreamLined, EMO, …)
[Diagram labels: D2K Engine, Core Modules, T2K, I2K, Toolkit, Applications]
T2K Core
• Tokenization
• POS Tagging
• Stemming
• Chunking
• Filters
• Term Weighting
• Supervised / Unsupervised Learning
• GATE Integration
• Pattern Mining
• Text Streams
• Summarization
T2K Core 1.0 (Beta)
ThemeWeaver
ThemeWeaver: Prototype Text Clustering Application
• Hard clustering algorithms
  • Modified K-means (3 sampling methods)
• Soft clustering
  • Suffix tree based algorithm
  • Can be used for longer documents
• Visualizations
  • “Single link” graph representation
  • Dendrogram cluster tree
  • Clusters over time
• Drill down and backtrack UI
• D2K/T2K Driven
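The slides do not say how K-means was modified or which three sampling methods ThemeWeaver offers. As one assumed illustration of combining hard clustering with sampling, the sketch below fits centroids on every n-th point (systematic sampling, chosen here only to keep the example deterministic) and then assigns each point in the full collection to exactly one cluster. All names are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans(points, k, iterations=10):
    """Plain Lloyd's K-means, seeded with the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def sampled_kmeans(points, k, step):
    """Fit centroids on every step-th point, then label the full collection."""
    sample = points[::step]  # must contain at least k points
    centroids = kmeans(sample, k)
    return [nearest(p, centroids) for p in points]


points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (1, 1), (11, 11)]
labels = sampled_kmeans(points, k=2, step=3)  # [0, 1, 0, 1, 0, 1, 0, 1]
```

Each point receives exactly one label, which is what makes this a hard clustering. A soft method, by contrast, would let a document belong to several clusters.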
The ALG Team
Staff
Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge
Students
Tyler Alumbaugh, Bradley Berkin, Jacob Biehl, John Cassel, Peter Groves, Olubanji Iyun, Sang-Chul Lee, Young-Jin Lee, Xiaolei Li, Brian Navarro, Scott Ramon, Sunayana Saha, Martin Urban, Bei Yu, Hwanjo Yu
* Demo / Discussion *