Recent Trends in Text Mining Girish Keswani [email protected].
Recent Trends in Text Mining
Girish Keswani
[email protected] (micron.com)
Text Mining?
- What? Data Mining on Text Data
- Why?
  - Information Retrieval
  - Confusion Set Disambiguation
  - Topic Distillation
- How? Data Mining
Organization
- Text Mining Algorithms
- Jargon Used
- Background
  - Data Modeling, Text Classification, and Text Clustering
- Applications
- Experiments {NBC, NN and ssFCM}
- Further work
- References
Text Mining Algorithms
- Classification Algorithms
  - Naïve Bayes Classifier
  - Decision Trees
  - Neural Networks
- Clustering Algorithms
  - EM Algorithms
  - Fuzzy
Jargon
- DM: Data Mining
- IR: Information Retrieval
- NBC: Naïve Bayes Classifier
- EM: Expectation Maximization
- NN: Neural Networks
- ssFCM: Semi-Supervised Fuzzy C-Means
- Labeled Data (Training Data)
- Unlabeled Data
- Test Data
Background: Modeling
- Vector Space Model
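As a rough illustration of the vector space model named on this slide, the sketch below turns each document into a vector of term counts over a shared vocabulary. The whitespace tokenizer, the raw-count weighting, and the helper names `build_vocabulary` and `to_vector` are assumptions made for this example, not details from the slides.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of the distinct lowercase tokens in all documents."""
    vocab = set()
    for text in documents:
        vocab.update(text.lower().split())
    return sorted(vocab)

def to_vector(text, vocabulary):
    """Map one document to a list of term counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Toy corpus (hypothetical); each document becomes one row of counts.
docs = ["the cat sat", "the dog sat on the mat"]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(d, vocabulary) for d in docs]
```

Each position in a vector corresponds to one vocabulary word, so documents of different lengths become points in the same space and can be compared or fed to a classifier.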
Background: Modeling
- Generative Models of Data [13]: Probabilistic
  - “to generate a document, a class is first selected based on its prior probability and then a document is generated using the parameters of the chosen class distribution”
- NBC and EM Algorithms are based on this model
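A minimal sketch of a Naïve Bayes classifier under this generative view: class priors and per-class word probabilities are estimated from labeled documents, and a new document is assigned to the class with the highest posterior score. The multinomial word model, the add-one (Laplace) smoothing, the toy data, and the names `train_nbc` and `classify` are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_nbc(labeled_docs):
    """Estimate class priors and add-one-smoothed word probabilities per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())
    priors = {c: n / total_docs for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        likelihoods[c] = {
            w: (word_counts[c][w] + 1) / (total_words + len(vocab)) for w in vocab
        }
    return priors, likelihoods

def classify(tokens, priors, likelihoods):
    """Return the class maximizing log P(c) + sum of log P(w | c).

    Words never seen in training are skipped.
    """
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(likelihoods[c][w]) for w in tokens if w in likelihoods[c]
        )
    return max(priors, key=score)

# Toy labeled data (hypothetical).
training = [
    ("goal match team".split(), "sport"),
    ("team win score".split(), "sport"),
    ("cpu memory disk".split(), "tech"),
    ("disk driver cpu".split(), "tech"),
]
priors, likelihoods = train_nbc(training)
prediction = classify("team score goal".split(), priors, likelihoods)
```

The prior corresponds to "a class is first selected", and the per-class word probabilities correspond to "the parameters of the chosen class distribution" in the quoted description.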
Importance of Unlabeled Data?
[Figure: diagram of regions A to G, with a legend for Labeled Data, Unlabeled Data, and Test Data]
- Provides access to feature distribution in set F using joint probability distributions
How to make use of Unlabeled Data?
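One well-known way to use unlabeled data with the generative model, along the lines of the EM approach in [3], is to train a Naïve Bayes classifier on the labeled documents, use it to assign soft class memberships to the unlabeled documents, and re-estimate the parameters from both sets, repeating until the estimates settle. The sketch below is an illustration under those assumptions and is not taken from the slide; the function names, the smoothing, the fixed iteration count, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def m_step(docs, posteriors, classes, vocab):
    """Re-estimate priors and word probabilities from (soft) class memberships."""
    priors, likelihoods = {}, {}
    for c in classes:
        weight = sum(p[c] for p in posteriors)
        priors[c] = (weight + 1) / (len(docs) + len(classes))
        counts = defaultdict(float)
        for tokens, p in zip(docs, posteriors):
            for w in tokens:
                counts[w] += p[c]
        total = sum(counts.values())
        likelihoods[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, likelihoods

def e_step(tokens, priors, likelihoods):
    """Return P(c | document); assumes every token is in the training vocabulary."""
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in tokens)
        for c in priors
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

def semi_supervised_em(labeled, unlabeled, iterations=10):
    """Fit naive Bayes on labeled docs, then refine it with unlabeled docs via EM."""
    classes = sorted({label for _, label in labeled})
    labeled_docs = [tokens for tokens, _ in labeled]
    vocab = {w for tokens in labeled_docs + unlabeled for w in tokens}
    # Labeled documents keep fixed 0/1 memberships throughout.
    hard = [{c: 1.0 if c == label else 0.0 for c in classes} for _, label in labeled]
    priors, likelihoods = m_step(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(tokens, priors, likelihoods) for tokens in unlabeled]
        priors, likelihoods = m_step(labeled_docs + unlabeled, hard + soft, classes, vocab)
    return priors, likelihoods

# Toy data (hypothetical): "match" and "win" never occur in a labeled document.
labeled = [("goal team".split(), "sport"), ("cpu disk".split(), "tech")]
unlabeled = [
    "team match goal".split(),
    "match win team".split(),
    "disk driver cpu".split(),
    "driver memory disk".split(),
]
priors, likelihoods = semi_supervised_em(labeled, unlabeled)
posterior = e_step("match win".split(), priors, likelihoods)
```

In this toy setup the words "match" and "win" carry no class evidence in the labeled set alone; they pick it up only through co-occurrence with labeled-set words in the unlabeled documents, which is the effect the slide title asks about.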
Experimental Results [1]
- Using NBC, EM and ssFCM

Experimental Results [2]
- Using NBC and EM
Extensions and Variants of these approaches
- Authors in [6] propose a concept of Class Distribution Constraint matrix
  - Results on Confusion Set Disambiguation
- Automatic Title Generation [7]:
  - Using EM Algorithm
  - Non-extractive approach
Relational Data [9]
- Relational data: a collection of data in which the relations between entities are stated explicitly
- Probabilistic Relational Models
Commercial Use/Products
- IBM Text Analyzer [11]
  - Decision Tree Based
- SAS Text Miner [12]
  - Singular Value Decomposition
- Filtering Junk Email
  - Hotmail, Yahoo
- Advanced Search Engines
Applications: Search Engines
- Vivisimo Search Engine: (www.vivisimo.com)
Experiments
- NBC: Naïve Bayes Classifier (Probabilistic)
- NN: Neural Networks
- ssFCM: Semi-Supervised Fuzzy Clustering (Fuzzy)
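For readers unfamiliar with fuzzy clustering, the sketch below shows the standard fuzzy c-means updates (memberships from relative distances, centers as membership-weighted means). To make it semi-supervised it simply clamps the memberships of labeled points to their known class; that clamping rule is an assumption chosen for brevity and is not claimed to be the ssFCM formulation used in [1]. The names `memberships`, `update_centers`, and `ss_fcm`, the fuzzifier `m=2.0`, and the toy points are hypothetical.

```python
def memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster (fuzzifier m > 1)."""
    dists = [
        sum((x - c) ** 2 for x, c in zip(point, center)) ** 0.5 for center in centers
    ]
    zeros = dists.count(0.0)
    if zeros:  # point sits exactly on one or more centers: share membership among them
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

def update_centers(points, U, m=2.0):
    """Each center is the mean of all points weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for i in range(len(U[0])):
        weights = [u[i] ** m for u in U]
        total = sum(weights)  # assumes every cluster has some nonzero membership
        centers.append(
            [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
        )
    return centers

def ss_fcm(labeled, unlabeled, n_clusters, m=2.0, iterations=20):
    """labeled: list of (point, cluster_index); unlabeled: list of points.

    Assumes at least one labeled point per cluster.
    """
    fixed = [[1.0 if i == label else 0.0 for i in range(n_clusters)] for _, label in labeled]
    labeled_points = [p for p, _ in labeled]
    centers = update_centers(labeled_points, fixed, m)  # initialise from labels only
    free = []
    for _ in range(iterations):
        free = [memberships(p, centers, m) for p in unlabeled]
        centers = update_centers(labeled_points + unlabeled, fixed + free, m)
    return centers, free

# Toy 2-D data (hypothetical): one labeled point per cluster, four unlabeled points.
labeled = [([0.0, 0.0], 0), ([10.0, 10.0], 1)]
unlabeled = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [10.0, 9.0]]
centers, free = ss_fcm(labeled, unlabeled, n_clusters=2)
```

An unlabeled document would then be assigned to the cluster, and hence the class, in which it has the largest membership.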
Datasets (20 Newsgroups Data)

Sampling I:

Dataset       min2    min4    min6
# Features    --      9467    5685

Sampling II:

Dataset     Sampling Percentage    Number of Features
Sample25    25%                    13925
Sample30    30%                    15067
Sample35    35%                    16737
Sample40    40%                    16871
Sample45    45%                    17712
Sample50    50%                    19135

[Figure: flow diagram with labels Raw Data, Sampling I, Sampling II, and Vectors]
Naïve Bayes Classifier

SAMPLE      % TRAINING    % TEST    ACCURACY %
Sample25    20            80        34.4637
            63            36        48.4945
            76            23        50.9322
            82            17        47.7728
            86            13        48.9971
            20            80        31.5436
            63            36        48.0729
            76            23        47.8661
            82            17        50.5568
            86            13        50.4587
Sample30    33            66        39.1137
            66            33        46.4233
            77            22        48.5528
            83            16        52.7383
            86            13        51.2136
            33            66        39.26
            66            33        47.0192
            77            22        48.8439
            83            16        49.6907
            86            13        51.6169
Naïve Bayes Classifier
[Figure: Accuracy % by Sample (Sample25, Sample30), with a normal quantile plot]
NBC
[Figure: four panels for Sample25 and Sample30: Accuracy % vs. % TRAINING and Accuracy % vs. % TEST]
ssFCM
[Figure: two panels: Effect of Labeled Data (Accuracy % vs. % LABELED) and Effect of Unlabeled Data (Accuracy % vs. % UNLABELED)]
ssFCM
[Figure: Accuracy % by Sample (sample25, sample30, sample35, sample40, sample45, sample50)]
Further Work
- Ensemble of Classifiers [16]
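The slide only names ensembles as future work. As one simple illustration, the sketch below combines several classifiers by plurality vote. The voting rule, the helper names `majority_vote` and `by_keyword`, and the toy keyword classifiers are assumptions for this example and are not drawn from the slides or from [16].

```python
from collections import Counter

def majority_vote(classifiers, document):
    """Return the label predicted by the most classifiers.

    On a tie, the label that was voted first wins.
    """
    votes = [clf(document) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

def by_keyword(keyword, label, default):
    """Build a toy classifier: predict `label` if `keyword` occurs, else `default`."""
    return lambda doc: label if keyword in doc.split() else default

# Three deliberately weak classifiers (hypothetical).
classifiers = [
    by_keyword("cpu", "tech", "sport"),
    by_keyword("disk", "tech", "sport"),
    by_keyword("goal", "sport", "tech"),
]
first = majority_vote(classifiers, "cpu disk goal")
second = majority_vote(classifiers, "team goal")
```

In practice the members could be the NBC, NN, and ssFCM models from the experiments, each exposed as a function from document to label.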
Further Work
- Knowledge Gathering from Experts
- E.g. 3 class Data:
[Figure: classes C1, C2, C3; Input Data {C1,C2,C3} feeding a Classifier; Test Data?]
References
[1] “Text Classification using Semi-Supervised Fuzzy Clustering,” Girish Keswani and L.O. Hall, appeared in IEEE WCCI 2002 conference.
[2] “Using Unlabeled Data to Improve Text Classification,” Kamal Paul Nigam.
[3] “Text Classification from Labeled and Unlabeled Documents using EM,” Kamal Paul Nigam et al.
[4] “The Value of Unlabeled Data for Classification Problems,” Tong Zhang.
[5] “Learning from Partially Labeled Data,” Martin Szummer et al.
[6] “Training a Naïve Bayes Classifier via the EM Algorithm with a Class Distribution Constraint,” Yoshimasa Tsuruoka and Jun’ichi Tsujii.
[7] “Automatic Title Generation using EM,” Paul E. Kennedy and Alexander G. Hauptmann.
[8] “Unlabeled Data can degrade Classification Performance of Generative Classifiers,” Fabio G. Cozman and Ira Cohen.
[9] “Probabilistic Classification and Clustering in Relational Data,” Ben Taskar et al.
[10] “Using Clustering to Boost Text Classification,” Y.C. Fang et al.
[11] IBM Text Analyzer: “A decision-tree-based symbolic rule induction system for text categorization,” D.E. Johnson et al.
[12] “SAS Text Miner,” Reincke.
[13] “Pattern Recognition,” Duda and Hart, 2000.
[14] “Machine Learning,” Tom Mitchell.
[15] “Data Mining,” Margaret Dunham.
[16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/