Text Mining: the technology to convert text into knowledge
Stan Matwin
School of Information Technology and Engineering
University of Ottawa
Canada [email protected]
codata 2002 2
Plan
• What?
• Why?
• How?
• Who?
What?
• Text Mining (TM) = Data Mining from textual data
• Finding nuggets in otherwise uninteresting mountains of ore
• DM = finding interesting knowledge (relationships, facts) in large amounts of data
What? (cont'd)
• Working with large corpora
• …and little knowledge
• Discovering new knowledge
• …e.g. in Grimm's fairy tales
• vs. uncovering of existing knowledge
• …e.g. find mySQL developers with 1 yr experience in a file of 5000 CVs
• Has to treat data as natural language (NL)
What? (cont'd)
• Uncovering aspect of TM
• TM = Information Extraction from Text
• Text -> Data Base mapping
• TM and XML
Examples
• Extracting information from CVs: skills, systems, technologies etc
• Personal news filtering agent
• Research in functional genomics about protein interaction
Why?
• Moore’s law, and…
• Storage law
How?
A combination of
– Machine learning
– Linguistic analysis
• Stemming
• Tagging
• Parsing
• Semantic analysis
Some TM-related tasks
• Text segmentation
• Topic identification and tracking
• Text summarization
• Language identification
• Author identification
Two case studies
• CADERIGE
• Spam detection (with AmikaNow)
Caderige
Knowledge extraction from natural language texts
[Figure: MedLine; keyword search; subset of potentially relevant abstracts; query in NL; filled-out forms; data or knowledge base]
Example of a filled-out form:
Control
Agent: sigma E
Object: expression of the dacB gene
Justification: homology
Caderige: « Catégorisation Automatique de Documents pour l'Extraction de Réseaux d'Interactions Géniques » (Automatic Categorization of Documents for the Extraction of Gene Interaction Networks)
Caderige
• Objective: to extract information of interest to geneticists from on-line abstract and/or paper databases (e.g. Medline)
• Ensure acceptable recall and precision
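As a minimal sketch of the two measures named above: precision is the share of retrieved items that are relevant, recall the share of relevant items that were retrieved. The abstract identifiers below are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against the relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical abstract identifiers, for illustration only.
retrieved = {"a1", "a2", "a3", "a4"}   # what the system returned
relevant = {"a2", "a3", "a5"}          # what a geneticist would have wanted
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Returning more abstracts tends to raise recall and lower precision, which is why both have to be kept acceptable at once.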
The araR gene is monocistronic, and the promoter region contains -10 and -35 regions (as determined by primer extension analysis) similar to those recognized by RNA polymerase containing the major vegetative cell sigma factor sigmaA. An insertion-deletion mutation in the araR gene leads to constitutive expression of the L-arabinose metabolic operon. We demonstrate that the araR gene codes for a negative regulator of the ara operon and that the expression of araR is repressed by its own product.
The relevant fragment (italicized on the original slide) can be selected by means of keywords
"What are the proteins involved in the regulation of tagA?"
It has been proposed that Pho-P plays a key role in the activation of tuA and in the repression of tagA and tagD.
This question cannot be answered with keywords alone; semantic knowledge that repression is a type of regulation is required
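The semantic knowledge that repression is a type of regulation could be held in a small is-a thesaurus. The sketch below is only an illustration: the `IS_A` table and the function names are hypothetical and are not taken from the Caderige system.

```python
# Toy is-a thesaurus (hypothetical entries): each concept maps to its parent.
IS_A = {
    "repression": "regulation",
    "activation": "regulation",
}

def is_kind_of(concept, ancestor):
    """True if `concept` equals `ancestor` or reaches it by following is-a links."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept == ancestor:
            return True
        seen.add(concept)
        concept = IS_A.get(concept)
    return False

def matches_query(sentence_concepts, query_concept):
    """True if any concept found in the sentence is a kind of the queried concept."""
    return any(is_kind_of(c, query_concept) for c in sentence_concepts)

# The sentence mentions "repression"; the query asks about "regulation".
print(matches_query(["repression"], "regulation"))
```

A plain keyword match on "regulation" would miss the sentence; the is-a lookup is what bridges the two terms.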
After determination of the nucleotide sequence and deduction of the purR reading frame, the PurR product was found to be highly similar to the purR-encoded repressor from Bacillus subtilis.
does not answer
"What are the proteins involved in the regulation of purR?"
In fact, parsing is needed to see that PurR and the purR-encoded repressor are objects of the verb "to be similar"
Conceptual interpretation is needed to see that
RNA isolated from a sigma B deletion mutant revealed that the transcription of gspA is sigmaB dependent.
is an answer to
"What are the proteins involved in the regulation of gspA?"
"gspA is sigmaB dependent" is interpreted as "protein sigmaB regulates gspA"
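One way to picture this interpretation step is a rewrite rule that turns the surface form "X is Y dependent" into the relation "Y regulates X". The regular expression below is a hypothetical stand-in for a real extraction grammar and covers only this one phrasing.

```python
import re

# Hypothetical surface pattern: "<target> is <regulator> dependent" (hyphen optional).
DEPENDENT = re.compile(r"(\w+) is (\w+)[ -]dependent")

def normalize(sentence):
    """Return (regulator, 'regulates', target) triples found in the sentence."""
    return [(m.group(2), "regulates", m.group(1))
            for m in DEPENDENT.finditer(sentence)]

text = ("RNA isolated from a sigma B deletion mutant revealed that "
        "the transcription of gspA is sigmaB dependent.")
print(normalize(text))
```

Once the sentence is reduced to such a triple, a query about the regulation of gspA can be matched against the relation rather than against the words.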
CADERIGE Architecture
[Figure: architecture diagram. Components: MedLine abstracts; acquisition of linguistic resources by text mining; linguistic resources (thesaurus, extraction grammars, fragment selectors, …); extraction query; labeling; fragment selection; index extraction using extraction grammars; conceptual normalization; query-text matching; forms]
3 steps
1. Focusing: learned filters
2. Linguistic analysis: lexical/syntactic/semantic
• Syntax-semantics mapping
3. Extraction
Caderige: example
Semantic class: Positive interaction
Pattern: [NP(<protein> </protein>)1] [Verb(<interaction> </interaction>)2] [NP(<gene_expression> </gene_expression>)3]
POSITIVE INTERACTION: Subject($2,$1), Dobj($2,$3)
Current stage
• Step 1 (focusing): done
• Step 3 (extraction): XML designed
• Step 2 (linguistic analysis): tools chosen
Email filters
• Spam elimination
• Automatic filing
• Compliance enforcement
• …
Email…
• The trick: cast it as a text classification problem
• Build a training set
• Train your favourite classifier
• Deploy it
State of the art
• Current accuracy 80%
Difficulties
• multi-class problem where
• classes overlap
• and are hierarchical
• recall vs precision
TM: who – academically?
• David Lewis
• Yiming Yang – CMU
• Ray Mooney - UT Austin
• Nick Cercone - Waterloo
• Guy Lapalme – U. de Montréal
• TAMALE - University of Ottawa
Who – industrially?
• Clearforest
• AmikaNow
Conclusion
• Text mining – a necessity (so “!” instead of “?”)
• Still in its infancy
• Methods must exploit linguistic knowledge
Classification
• Prevalent practice: examples are represented as vectors of values of attributes
• Theoretical wisdom, confirmed empirically: the more examples, the better the predictive accuracy
ML/DM at U of O
• Learning from imbalanced classes: applications in remote sensing
• A relational, rather than propositional, representation: learning the maintainability concept
• Learning in the presence of background knowledge. Bayesian belief networks and how to get them. Application to distributed databases
Why text classification?
• Automatic file saving
• Internet filters
• Recommenders
• Information extraction
• …
Bag of words
Text classification: standard approach
1. Remove stop words and markings
2. Remaining words are all attributes
3. A document becomes a vector <word, frequency>
4. Train a boolean classifier for each class
5. Evaluate the results on an unseen sample
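Steps 1 to 3 above can be sketched in a few lines. This is an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, "markings" is taken to mean markup tags and punctuation, and the two example documents are made up from phrases in the abstracts quoted earlier.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def bag_of_words(text):
    """Strip markup and punctuation, drop stop words, count the remaining words."""
    text = re.sub(r"<[^>]+>", " ", text)              # remove markup tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # lowercase word tokens
    return Counter(t for t in tokens if t not in STOP_WORDS)

def to_vector(bag, vocabulary):
    """Turn a bag into a frequency vector over a fixed, ordered vocabulary."""
    return [bag[w] for w in vocabulary]

docs = ["<p>The araR gene is monocistronic.</p>",
        "Expression of the araR gene is repressed; the gene is a regulator."]
bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))   # every remaining word is an attribute
vectors = [to_vector(b, vocabulary) for b in bags]
print(vocabulary)
print(vectors)
```

Word order is discarded entirely, which is what makes the representation simple and also what limits it.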
Text classification: tools
• RIPPER
A rule-based learner
Works well with large sets of binary features
• Naïve Bayes
Efficient (no search)
Simple to program
Gives “degree of belief”
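To make the Naïve Bayes points concrete, here is a minimal sketch of a multinomial variant with add-one smoothing. The smoothing choice, the class name `NaiveBayes`, and the toy spam/ham messages are assumptions made for illustration, not details from the talk.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, tokens):
        """Return a posterior probability ("degree of belief") for each class."""
        n_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)          # class prior
            for t in tokens:
                if t in self.vocab:               # skip words unseen in training
                    count = self.word_counts[label][t]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            log_scores[label] = score
        top = max(log_scores.values())            # subtract max for stability
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

# Hypothetical training messages, already tokenized.
train_docs = [["cheap", "pills", "buy", "now"], ["win", "money", "now"],
              ["meeting", "agenda", "monday"], ["project", "meeting", "notes"]]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_docs, train_labels)
belief = model.predict_proba(["buy", "cheap", "money"])
print(belief)
```

Training is a single counting pass with no search over hypotheses, and the output is a probability per class rather than a bare label.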
“Prior art”
• Yang: best results using k-NN: 82.3% microaveraged accuracy
• Joachims’ results using Support Vector Machine + unlabelled data
• SVM insensitive to high dimensionality, sparseness of examples
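Micro-averaging pools the decisions of all classes before computing a score, so frequent classes dominate; macro-averaging scores each class first and then takes the mean. The sketch below shows the difference on precision only, with made-up counts; which measure Yang actually micro-averaged is not stated on the slide.

```python
def micro_macro_precision(per_class):
    """per_class maps a class name to (true_positives, false_positives)."""
    tp = sum(t for t, _ in per_class.values())
    fp = sum(f for _, f in per_class.values())
    micro = tp / (tp + fp)   # pool all decisions, then score
    macro = sum(t / (t + f) for t, f in per_class.values()) / len(per_class)
    return micro, macro

# Hypothetical counts: one large class classified well, one small class poorly.
counts = {"large_class": (90, 10), "small_class": (5, 5)}
micro, macro = micro_macro_precision(counts)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

With these counts the micro-average stays high because the large class outweighs the small one, while the macro-average exposes the weak class.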
SVM in text classification
SVM: maximum separation
Transductive SVM: margin for test set
Training with 17 examples in 10 most frequent categories gives test performance of 60% on 3000+ test cases available during training
Combining classifiers
Comparable to best known results (Yang)
#   Reuters representations   b.e.    DigiTrad representations   b.e.
1   NP                        .827    BWS                        .360
3   BW, NP, NPS               .845    BW, BWS, NP                .404e
5   BW, NP, NPS, KP, KPS      .849    BW, BWS, NP, KPS, KP       .422e