
Text Mining: the technology to convert text into knowledge

Stan Matwin

School of Information Technology and Engineering

University of Ottawa

Canada [email protected]

codata 2002 2

Plan

• What?

• Why?

• How?

• Who?


What?

• Text Mining (TM) = Data Mining from textual data

• Finding nuggets in otherwise uninteresting mountains of ore

• DM = finding interesting knowledge (relationships, facts) in large amounts of data


What? (cont'd)

• Working with large corpora

• …and little knowledge

• Discovering new knowledge

• …e.g. in Grimm's fairy tales

• vs. uncovering existing knowledge

• …e.g. find MySQL developers with 1 year of experience in a file of 5000 CVs

• Has to treat data as natural language (NL)


What? (cont'd)

• Uncovering aspect of TM

• TM = Information Extraction from Text

• Text -> Data Base mapping

• TM and XML


Examples

• Extracting information from CVs: skills, systems, technologies etc

• Personal news filtering agent

• Research in functional genomics about protein interaction


Why?

• Moore's law, and…

• Storage law


How?

A combination of:

– Machine learning

– Linguistic analysis

• Stemming

• Tagging

• Parsing

• Semantic analysis
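To make the first linguistic step concrete, here is a minimal sketch of stemming in Python. It is a toy suffix stripper, not Porter's algorithm or any tool named in this talk; the suffix list and the function names `stem` and `stem_all` are assumptions chosen for illustration.

```python
# Toy suffix-stripping stemmer (an illustrative assumption, not Porter's algorithm).
SUFFIXES = ("ion", "ing", "ed", "es", "s")  # hypothetical suffix list, checked in order


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def stem_all(text):
    """Stem every whitespace-separated token of a text."""
    return [stem(token) for token in text.split()]


# Different surface forms collapse onto one attribute.
print(stem("regulation"), stem("regulates"), stem("regulated"))
```

The point of the step is that "regulation", "regulates" and "regulated" end up as a single feature, which shrinks the vocabulary a learner has to deal with.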


Some TM-related tasks

• Text segmentation

• Topic identification and tracking

• Text summarization

• Language identification

• Author identification


Two case studies

• CADERIGE

• Spam detection (with AmikaNow)


Caderige

Knowledge extraction from Natural Language texts

[Figure: pipeline diagram with the labels: Query in NL; Keyword search; MedLine; Subset of potentially relevant abstracts; Filled-out forms; Data or knowledge base.]

Example of a filled-out form:

Control agent: sigma E

Object: expression of the dacB gene

Justification: homology

Caderige = « Catégorisation Automatique de Documents pour l'Extraction de Réseaux d'Interactions Géniques » (Automatic Categorization of Documents for the Extraction of Gene Interaction Networks)


Caderige

• Objective: to extract information of interest to geneticists from on-line abstract and/or paper databases (e.g. Medline)

• Ensure acceptable recall and precision
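For reference, recall and precision can be computed from the set of retrieved items and the set of truly relevant ones. The sketch below is in Python; the function name and the toy abstract identifiers are illustrative assumptions, not part of Caderige.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical abstract identifiers, for illustration only.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 6})
print(p, r)
```

Retrieving more abstracts tends to raise recall and lower precision, which is why both need to be kept acceptable at the same time.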


The araR gene is monocistronic, and the promoter region contains -10 and -35 regions (as determined by primer extension analysis) similar to those recognized by RNA polymerase containing the major vegetative cell sigma factor sigmaA. An insertion-deletion mutation in the araR gene leads to constitutive expression of the L-arabinose metabolic operon. We demonstrate that the araR gene codes for a negative regulator of the ara operon and that the expression of araR is repressed by its own product.

The relevant fragment (in italics on the original slide) can be selected by means of keywords


"What are the proteins involved in the regulation of tagA?"

It has been proposed that Pho-P plays a key role in the activation of tuA and in the repression of tagA and tagD.

This question cannot be answered with keywords alone; semantic knowledge that repression is a type of regulation is required.


"What are the proteins involved in the regulation of purR?"

After determination of the nucleotide sequence and deduction of the purR reading frame, the PurR product was found to be highly similar to the purR-encoded repressor from Bacillus subtilis.

This sentence does not answer the question, although it mentions purR. In fact, parsing is needed to see that PurR and the purR-encoded repressor are the arguments of the verb "to be similar".


"What are the proteins involved in the regulation of gspA?"

RNA isolated from a sigma B deletion mutant revealed that the transcription of gspA is sigmaB dependent.

Conceptual interpretation is needed to see that this sentence is an answer to the question: "gspA is sigmaB dependent" is interpreted as "protein sigmaB regulates gspA".
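The semantic knowledge mentioned for the tagA example (repression is a type of regulation) can be held in a small is-a thesaurus and used to widen a keyword query. The sketch below is in Python; the dictionary `IS_A`, the function names and the one-level expansion are assumptions for illustration, not the Caderige resources themselves.

```python
# Hypothetical mini-thesaurus: each term maps to its more general concept.
IS_A = {
    "repression": "regulation",
    "activation": "regulation",
}


def expand_query_term(term):
    """Return the term plus every term listed as one of its kinds (one level only)."""
    narrower = {t for t, parent in IS_A.items() if parent == term}
    return {term} | narrower


def matches(query_term, sentence):
    """True if the sentence mentions the query term or any narrower term."""
    text = sentence.lower()
    return any(t in text for t in expand_query_term(query_term))


sentence = ("It has been proposed that Pho-P plays a key role in the activation "
            "of tuA and in the repression of tagA and tagD.")

# The plain keyword "regulation" is absent, but the expanded query matches.
print("regulation" in sentence.lower(), matches("regulation", sentence))
```

This only covers the thesaurus part. Deciding which protein regulates which gene still needs the parsing and conceptual interpretation illustrated by the purR and gspA examples.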


CADERIGE Architecture

[Figure: architecture diagram. Inputs: extraction query, MedLine abstracts. Linguistic resources (thesaurus, extraction grammars, fragment selectors, …), acquired by text mining. Processing steps: fragment selection, labeling, conceptual normalization, index extraction using extraction grammars, query-text matching. Output: forms.]


3 steps

1. Focusing: learned filters

2. Linguistic Analysis: lexical/syntactic/semantic

• Syntax-semantics mapping

3. Extraction


Caderige: example

[Figure: a sentence annotated with the XML tags <protein> </protein>, <interaction> </interaction> and <gene_expression> </gene_expression>, matched against an extraction pattern.]

Semantic class: Positive interaction

[NP( )1 [Verb( )2 [NP( )3 ]]]

POSITIVE INTERACTION: Subject($2,$1), Dobj($2,$3)

Tags shown on the pattern: <Protein>, <Gene expression>
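The NP-Verb-NP pattern for the "positive interaction" class can be approximated in a few lines of Python. This is a sketch under strong assumptions: each NP is reduced to a single token, the verb list is hypothetical, and a regular expression stands in for the parser and extraction grammar that the real system would use.

```python
import re

# Hypothetical verbs for the semantic class "positive interaction".
POSITIVE_VERBS = ("activates", "induces", "stimulates")

# NP Verb NP, with each NP simplified to one token (an assumption).
PATTERN = re.compile(
    r"\b(?P<subject>\w+)\s+(?P<verb>" + "|".join(POSITIVE_VERBS) + r")\s+(?P<dobj>\w+)"
)


def extract_positive_interaction(sentence):
    """Fill a form from the first Subject-Verb-Dobj match, or return None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return {
        "class": "positive interaction",
        "protein": m.group("subject"),       # Subject($2,$1)
        "gene_expression": m.group("dobj"),  # Dobj($2,$3)
    }


# Invented sentence, for illustration only.
print(extract_positive_interaction("sigmaB activates gspA"))
```

A sentence whose verb is outside the list, such as "sigmaB represses gspA", yields no form for this class.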


Current stage

• Step 1 (focusing): done

• Step 3 (extraction): XML designed

• Step 2 (linguistic analysis): tools chosen


Email filters

• Spam elimination

• Automatic filing

• Compliance enforcement

• ….


Email…

• The trick: cast it as a text classification problem

• Build a training set

• Train your favourite classifier

• Deploy it


State of the art

• Current accuracy 80%


Difficulties

• multi-class problem where

• classes overlap

• and are hierarchical

• recall vs precision


TM: who – academically?

• David Lewis

• Yiming Yang – CMU

• Ray Mooney - UT Austin

• Nick Cercone - Waterloo

• Guy Lapalme – U. de Montréal

• TAMALE - University of Ottawa


Who – industrially?

• Google

• Clearforest

• AmikaNow


Conclusion

• Text mining – a necessity (so “!” instead of “?”)

• Still in its infancy

• Methods must exploit linguistic knowledge


Classification

• Prevalent practice: examples are represented as vectors of attribute values

• Theoretical wisdom, confirmed empirically: the more examples, the better the predictive accuracy


ML/DM at U of O

• Learning from imbalanced classes: applications in remote sensing

• A relational, rather than propositional, representation: learning the maintainability concept

• Learning in the presence of background knowledge. Bayesian belief networks and how to get them. Application to distributed databases


Why text classification?

• Automatic file saving

• Internet filters

• Recommenders

• Information extraction

• …


Bag of words

Text classification: standard approach

1. Remove stop words and markings

2. Remaining words are all attributes

3. A document becomes a vector of <word, frequency> pairs

4. Train a boolean classifier for each class

5. Evaluate the results on an unseen sample
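Steps 1 to 3 above can be sketched in a few lines of Python. The stop-word list, the tag-stripping rule and the function name `bag_of_words` are assumptions made for illustration; a real system would use a much larger stop list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def bag_of_words(document):
    """Map a document to {word: frequency}, dropping markup and stop words."""
    text = re.sub(r"<[^>]+>", " ", document.lower())  # step 1: strip markings
    tokens = re.findall(r"[a-z]+", text)              # keep alphabetic words only
    # steps 2-3: every remaining word is an attribute, its count is the value
    return Counter(t for t in tokens if t not in STOP_WORDS)


print(bag_of_words("The gene and the <b>gene</b> product"))
```

Word order is discarded entirely, which is what makes the representation cheap and also what the linguistic methods discussed earlier try to improve on.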


Text classification: tools

• RIPPER

– A rule-based learner

– Works well with large sets of binary features

• Naïve Bayes

– Efficient (no search)

– Simple to program

– Gives "degree of belief"
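The three properties listed for Naïve Bayes can be seen in a minimal Python sketch. It assumes the multinomial variant with add-one smoothing, which the slides do not specify; the class name, method names and the four toy training messages are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (a minimal sketch)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one class label per document.
        # Training is just counting: no search over hypotheses is involved.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)      # log prior
            for w in tokens:
                if w in self.vocab:                    # skip unseen words
                    score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def predict_proba(self, tokens):
        """Normalised posterior per class: the 'degree of belief'."""
        scores = self.log_scores(tokens)
        top = max(scores.values())
        unnorm = {c: math.exp(s - top) for c, s in scores.items()}
        z = sum(unnorm.values())
        return {c: v / z for c, v in unnorm.items()}

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training set, invented for illustration.
docs = [["cheap", "pills", "offer"], ["meeting", "agenda", "monday"],
        ["cheap", "offer", "now"], ["project", "meeting", "notes"]]
labels = ["spam", "ham", "spam", "ham"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["cheap", "offer"]), model.predict_proba(["cheap", "offer"]))
```

The posterior returned by `predict_proba` is what allows a filter to act only above a chosen confidence threshold, trading recall against precision.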


“Prior art”

• Yang: best results using k-NN: 82.3% microaveraged accuracy

• Joachims's results using Support Vector Machines + unlabelled data

• SVM insensitive to high dimensionality, sparseness of examples
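Since the k-NN figure above is microaveraged, it may help to recall how micro- and macro-averaging differ. The Python sketch below assumes per-class counts of correct and total decisions; the function name and the numbers are invented for illustration and are not taken from the cited results.

```python
def micro_macro_accuracy(per_class):
    """per_class maps a label to (n_correct, n_total).

    Micro-averaging pools all decisions, so large classes dominate.
    Macro-averaging averages the per-class scores, so each class counts equally.
    """
    correct = sum(c for c, _ in per_class.values())
    total = sum(n for _, n in per_class.values())
    micro = correct / total
    macro = sum(c / n for c, n in per_class.values()) / len(per_class)
    return micro, macro


# Hypothetical counts: one large, easy class and one small, hard class.
micro, macro = micro_macro_accuracy({"large": (90, 100), "small": (1, 10)})
print(round(micro, 3), round(macro, 3))
```

With skewed category sizes, as in news corpora, the two averages can differ widely, so a reported figure should always say which one it is.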


SVM in Text classification

[Figure labels: SVM; Transductive SVM; Maximum separation; Margin for test set.]

Training with 17 examples in the 10 most frequent categories gives a test performance of 60% on 3000+ test cases that are available during training


Combining classifiers

Comparable to best known results (Yang)

#   Reuters representations   b.e.    DigiTrad representations   b.e.
1   NP                        .827    BWS                        .360
3   BW, NP, NPS               .845    BW, BWS, NP                .404e
5   BW, NP, NPS, KP, KPS      .849    BW, BWS, NP, KPS, KP       .422e
