Text Data Mining: Introduction
Transcript of Text Data Mining: Introduction
Text Data Mining: Introduction
  Hao Chen
  School of Information Systems
  University of California at Berkeley
The KDD Process for Extracting Useful Knowledge from Volumes of Data
  Large databases have become ubiquitous
    grocery store checkout registers
    credit card authorization
  Computer technology allows efficient and inexpensive data storage and access
  But our ability to analyze and understand large datasets lags far behind.
Manual Data Analysis Impractical
  Slow, expensive, and highly subjective
  Becomes impractical as data volumes grow
    N: number of records (10^9)
    D: number of fields (10^2 to 10^3)
  Need computer technology to automate the bookkeeping.
  First KDD Workshop in 1989
Definitions of KDD
  Knowledge Discovery from Data
    The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
KDD Process: Selection
  Learning the application domain
  Creating a target dataset
KDD Process: Preprocessing
  Data cleaning & preprocessing
    remove noise
    handle missing data fields
    time sequence information
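The preprocessing step can be illustrated with a small sketch. The records, the median fill, and the "ten times the median" noise rule below are all assumptions chosen for illustration; the slides do not prescribe any particular cleaning method.

```python
from statistics import median

# Made-up records; None marks a missing data field.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 50000},
    {"age": 38, "income": 9900000},   # implausible value, treated as noise
]

def fill_missing(rows, field):
    """Replace missing values of `field` with the median of the observed ones."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def drop_noise(rows, field, factor=10):
    """Drop rows whose `field` exceeds `factor` times the median (a crude rule)."""
    limit = factor * median(r[field] for r in rows)
    return [r for r in rows if r[field] <= limit]

# Handle missing fields first, then remove the noisy record.
cleaned = fill_missing(fill_missing(records, "age"), "income")
cleaned = drop_noise(cleaned, "income")
print(cleaned)
```

A median is used here rather than a mean so that the one implausible income does not distort the fill value; other choices may suit other data.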
KDD Process: Transformation
  Data reduction & projection
    features extraction
    dimensionality reduction
    invariant representation
KDD Process: Data Mining
  Choosing function of data mining
  Choosing data mining algorithms
  Data mining: searching for patterns of interest
KDD Process: Interpretation / Evaluation
  Interpretation
  Using discovered knowledge
What is Data Mining?
  Fitting models to or determining patterns from very large datasets.
  A “regime” which enables people to interact effectively with massive data stores.
  Deriving new information from data.
    finding patterns across large datasets
    discovering heretofore unknown information
What is Data Mining?
  Potential point of confusion:
    The extracting ore from rock metaphor does not really apply to the practice of data mining
    If it did, then standard database queries would fit under the rubric of data mining
      Find all employee records in which the employee earns $300/month less than their manager
  In practice, DM refers to:
    finding patterns across large datasets
    discovering heretofore unknown information
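The employee query mentioned on this slide can be written as an ordinary SQL self-join. The sketch below uses Python's sqlite3 module; the table name, column names, and rows are invented for illustration. Its result is only a subset of rows that were already stored, which is why such a query does not count as data mining.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, "
    "monthly_salary INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", 5000, None),   # top-level manager
        (2, "Ben", 4700, 1),      # 300 less than Ada
        (3, "Cy", 4900, 1),       # 100 less than Ada
        (4, "Di", 4400, 2),       # 300 less than Ben
    ],
)

# Self-join: pair each employee with their manager and compare salaries.
rows = conn.execute(
    """
    SELECT e.name, m.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary = 300
    ORDER BY e.id
    """
).fetchall()
conn.close()
print(rows)
```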
Another Definition of DM
  What SQL currently cannot do.
    A standard query does not infer new information
      It retrieves a subset of what is already present and known.
    SQL originally intended for business apps
    DM requires sophisticated aggregate queries
DM Touchstone Applications
  Finding patterns across data sets:
    Reports on changes in retail sales
      to improve sales
    Patterns of sizes of TV audiences
      for marketing
    Patterns in NBA play
      to alter, and so improve, performance
    Deviations in standard phone calling behavior
      to detect fraud
      for marketing
DM Touchstone Applications
  Separating signal from noise:
    Classifying faint astronomical objects
    Finding genes within DNA sequences
    Discovering novel tectonic activity
Components of Data Mining
  The model
    function of the model
      classification
      clustering
    representational form of the model
      linear function of multiple variables
      Gaussian probability density function
  The preference criterion
    goodness of fit
    avoiding overfitting
  The search algorithm
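The three components can be seen side by side in a toy fit. This is only a sketch under simplifying assumptions: the data are made up, the model is a linear function of a single variable with one parameter, the preference criterion is the sum of squared errors with no overfitting penalty, and the search is an exhaustive scan over a grid of candidate values.

```python
# Made-up training data, roughly y = 2x; not from the slides.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.8]

def model(w, x):
    """Representational form: a linear function with one parameter w."""
    return w * x

def criterion(w):
    """Preference criterion: sum of squared errors (lower is better)."""
    return sum((model(w, x) - y) ** 2 for x, y in zip(xs, ys))

def search(candidates):
    """Search algorithm: exhaustive parameter search over a candidate grid."""
    return min(candidates, key=criterion)

grid = [i / 10 for i in range(0, 41)]   # w in 0.0, 0.1, ..., 4.0
best_w = search(grid)
print(best_w, criterion(best_w))
```

Swapping any one component (a different model form, a penalized criterion, a smarter search) leaves the other two unchanged, which is the reason for separating them.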
Model Function
  Classification
  Regression
  Clustering
  Summarization
  Dependency modeling
  Link analysis
  Sequence analysis
Model Representation
  Decision tree
  Linear model
  Nonlinear model (e.g. Neural Network)
  Example-based method (e.g. Nearest Neighbor)
  Probabilistic graphical dependency model (e.g. Bayesian Network)
  Relational attribute model
Search Algorithm
  Parameter search, given a model
  Model search over model space
  predictive
  descriptive
What’s New Here?
  Sounds like statistical modeling or machine learning.
  Main difference: scale and availability
    Datasets too large for classical analysis
    Increased opportunity for access
      end user is often not a statistician
    New issues in sampling
Statistician’s Viewpoint
  What’s new about DM?
    Returns statisticians to their empirical roots
      exploration rather than modeling
    Hypothesis testing may be irrelevant
      given the large data sizes, everything is significant
    Data was collected for some other purpose than what it is being analyzed for now
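The remark that everything becomes significant at large data sizes can be checked with a short calculation. The sketch below assumes a one-sample z-test with known standard deviation and an observed mean shift of 0.01 standard deviations, a difference that would rarely matter in practice; the function name and the numbers are illustrative assumptions.

```python
import math

def two_sided_p_value(effect, sigma, n):
    """Two-sided p-value of a z-test for an observed mean shift `effect`."""
    z = effect / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# The same tiny shift, tested at growing sample sizes.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(n, two_sided_p_value(effect=0.01, sigma=1.0, n=n))
```

The effect size never changes; only the number of records does, and the p-value falls from clearly non-significant to vanishingly small.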
The Statistician’s Viewpoint (David Hand 97)
  Statistics vs. Machine Learning
    Statistics: conservative, rigorous, abstract, idealized
    Machine Learning: adventurous, engineering, practical, real solutions
Research Challenges
  Massive datasets & high dimensionality
  User interaction & prior knowledge
  Overfitting & assessing statistical significance
  Missing data
  Understandability of patterns
  Managing changing data and knowledge
  Integration
  Nonstandard, multimedia, object-oriented data
A Database Perspective on Knowledge Discovery
  Concept of data mining as a querying process
  First steps toward efficient development of knowledge discovery applications
New Research Frontier
  Short term: efficient algorithms implementing machine learning tools on top of large databases
  Long term: building optimizing compilers for ad hoc queries and embedding queries in application programming interfaces
KDDMS
  KDD objects
    a rule
    a classifier
    a clustering
  KDD queries
    a predicate returning a set of KDD or DB objects
Examples of KDD Query
  Generate a classifier
  Generate the strongest rule
  Generate all rules with consequent attribute values computed by SQL query
  Find tuples that belong to the largest cluster
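A query such as "generate the strongest rule" can be sketched over a toy transaction set. The slides do not define "strongest"; the code below assumes it means the highest-confidence single-item association rule among those meeting a minimum support, and the transactions and function names are made up for illustration.

```python
from itertools import permutations

# Made-up market-basket transactions; not from the slides.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def rules(min_support=2):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in permutations(items, 2):
        both = sum(1 for t in transactions if a in t and b in t)
        count_a = sum(1 for t in transactions if a in t)
        if both >= min_support:
            found.append((a, b, both, both / count_a))
    return found

def strongest_rule():
    """Assumed meaning of 'strongest': highest confidence, ties broken by support."""
    return max(rules(), key=lambda r: (r[3], r[2]))

print(strongest_rule())
```

Here each rule is a KDD object and `strongest_rule` plays the role of a KDD query that returns one of them; a real KDDMS would evaluate such queries inside the database rather than in application code.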
Future Directions
  KDD applications need development support
    query KDD objects
    data mining operations
      nearest neighbors
      clustering
  Development of querying tools is a big challenge
  Provide developers with the means to build applications using a KDD query language
Text Data Mining
  People’s first thought:
    Make it easier to find things on the Web.
    But this is information retrieval!
  The metaphor of extracting ore from rock:
    Does make sense for extracting documents of interest from a huge pile.
    But does not reflect notions of DM in practice:
      finding patterns across large collections
      discovering heretofore unknown information
Real Text DM
  What would finding a pattern across a large text collection really look like?
  Bill Gates + MS-DOS in the Bible! (William Gates, agitator, leader)
  From: “The Internet Diary of the man who cracked the Bible Code”, Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
Real Text DM
  The point:
    Discovering heretofore unknown information is not what we usually do with text.
    (If it weren’t known, it could not have been written by someone!)
  However:
    There is a field whose goal is to learn about patterns in text for its own sake ...
Observation
  Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.
TDM using Metadata (instead of Text)
  Data:
    Reuters newswire (22,000 articles, late 1980s)
    Categories: commodities, time, countries, people, and topic
  Goals:
    distributions of categories across time (trends)
    distributions of categories between collections
    category co-occurrence (e.g., topic|country)
  Interactive Interface:
    lists, pie charts, 2D line plots
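Two of the goals above, category co-occurrence and trends over time, reduce to counting over article metadata. The sketch below uses a handful of hypothetical records; the field names and values are assumptions and are not taken from the newswire collection itself.

```python
from collections import Counter

# Hypothetical article metadata in the spirit of the categories above.
articles = [
    {"year": 1987, "country": "usa", "topic": "grain"},
    {"year": 1987, "country": "usa", "topic": "trade"},
    {"year": 1987, "country": "japan", "topic": "trade"},
    {"year": 1988, "country": "japan", "topic": "trade"},
    {"year": 1988, "country": "usa", "topic": "grain"},
    {"year": 1988, "country": "japan", "topic": "money-fx"},
]

def topic_given_country(country):
    """Estimate the distribution topic|country from co-occurrence counts."""
    topics = Counter(a["topic"] for a in articles if a["country"] == country)
    total = sum(topics.values())
    return {topic: count / total for topic, count in topics.items()}

def trend(topic):
    """Share of each year's articles that carry the given topic."""
    per_year = Counter(a["year"] for a in articles)
    hits = Counter(a["year"] for a in articles if a["topic"] == topic)
    return {year: hits[year] / n for year, n in sorted(per_year.items())}

print(topic_given_country("japan"))
print(trend("trade"))
```

The resulting dictionaries are the kind of data an interactive interface could render as lists, pie charts, or line plots.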
Combining Text with Metadata (images, hyperlinks)
  Examples
    Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
    Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
    Images + Text to improve image search
True Text Data Mining: Don Swanson’s Medical Work
  Given
    medical titles and abstracts
    a problem (incurable rare disease)
    some medical expertise
  find causal links among titles
    symptoms
    drugs
    results
Swanson Example (1991)
  Problem: Migraine headaches (M)
    stress associated with M
    stress leads to loss of magnesium
    calcium channel blockers prevent some M
    magnesium is a natural calcium channel blocker
    spreading cortical depression (SCD) implicated in M
    high levels of magnesium inhibit SCD
    M patients have high platelet aggregability
    magnesium can suppress platelet aggregability
  All extracted from medical journal titles
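The chain of links in this example can be mimicked mechanically: find intermediate concepts that appear with migraine in one title and with magnesium in another, while the two never appear together. The sketch below is an illustration of that idea only, not Swanson's actual procedure; the "titles" paraphrase the bullet points above and the concept vocabulary is hand-picked.

```python
# Toy "titles" paraphrasing the links listed above; not real journal titles.
titles = [
    "stress is associated with migraine",
    "stress leads to loss of magnesium",
    "calcium channel blockers prevent some migraine",
    "magnesium is a natural calcium channel blocker",
    "spreading cortical depression implicated in migraine",
    "magnesium inhibits spreading cortical depression",
    "migraine patients have high platelet aggregability",
    "magnesium can suppress platelet aggregability",
]

# Hand-picked concept vocabulary; choosing it is where expertise comes in.
concepts = [
    "migraine", "magnesium", "stress", "calcium channel",
    "spreading cortical depression", "platelet aggregability",
]

def mentions(title):
    """Concepts that occur as substrings of a title."""
    return {c for c in concepts if c in title}

def directly_linked(a, c):
    """True if some single title mentions both concepts."""
    return any({a, c} <= mentions(t) for t in titles)

def linking_concepts(a, c):
    """Concepts that co-occur with `a` in one title and with `c` in another."""
    with_a, with_c = set(), set()
    for title in titles:
        found = mentions(title)
        if a in found:
            with_a |= found - {a}
        if c in found:
            with_c |= found - {c}
    return sorted((with_a & with_c) - {a, c})

print(directly_linked("migraine", "magnesium"))
print(linking_concepts("migraine", "magnesium"))
```

Substring matching on a fixed vocabulary is far cruder than reading the literature, which is consistent with the technique being only partially automated.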
Swanson’s TDM
  Two of his hypotheses have received some experimental verification.
  His technique
    Only partially automated
    Required medical expertise
  Few people are working on this.
Conclusions
  Currently, what might be construed as Text Data Mining is really Computational Linguistics
    Text is tricky to process, but rich and abundant (now)
    There are many CL tools available
  Data Mining directly from text
    tells us about language
    produces meta-information that may be useful for information access
Conclusions
  Information Access != Text Data Mining
    IA = finding needle in haystack
    TDM = finding patterns or new information
  However, Information Access may potentially be served by Text Data Mining techniques:
    automated metadata assignment
    collection overviews
  The synthesis of ideas from TDM and IA:
    Perhaps a new field of exploratory data analysis over text!
Promising Research Directions
  Text Data Mining Problems:
    Patterns within sets of documents:
      What is the latest in this field?
      How is this field related to that field?
    Chains of evidence embedded in text:
      What drugs have been tested for this symptom?
      What effects did this funding have on that field?
    Human use of information over time
      How does information diffuse across the web?
Needed from Systems
  Support for linking chains of associations
  Support for combined structured and unstructured data
  Support for combining disparate collections
Statistical Themes & Lessons for Data Mining
  Statistical themes
  Statistical lessons
  Cooperation between statistical and computational communities
Overview of Statistical Science
  Probability distributions
  Estimation, consistency, uncertainty, assumptions, robustness, and model averaging
  Hypothesis testing
  Model scoring
  Markov Chain Monte Carlo
  Generalized model classes
Overview of Statistical Science
  Rational decision making and planning
  Inference to causes
  Prediction
Important Themes of Statistics to Data Mining
  Clarity about goals
  Use of models that are reliable means to the goal, understandable and plausible to users
  Sense of uncertainties of models and predictions
Lessons
  Data can lie
  Sometimes it’s not what’s in the data that matters
  Perversity of the pervasive P-value
  Intervention and prediction