Text Mining, Association Rules and Decision Tree Learning


Transcript of Text Mining, Association Rules and Decision Tree Learning

Page 1: Text Mining, Association Rules and Decision Tree Learning

Decision Tree Learning: Supervised Learning

Adrian Cuyugan, Information Analytics

Page 2: Text Mining, Association Rules and Decision Tree Learning

Multidisciplinary Subject

Statistics

Data Mining

Machine Learning

AI

Text Mining

Business Process Mining

Natural Language Processing

Database Management

Library Science

Mathematics

Computer Science

Page 3: Text Mining, Association Rules and Decision Tree Learning

Machine Learning

Supervised vs Unsupervised Learning
• Supervised learning assumes labeled data, i.e. there is a response variable that labels each record.
• Unsupervised learning, on the other hand, does not expect a response variable because the algorithm learns from the distinct patterns within the data itself. Examples are clustering and pattern discovery.

Supervised Learning Techniques
• Regression techniques assume a numerical response variable. The most frequently used is linear regression, which fits a model by minimizing the sum of squared errors.
• Classification techniques assume a categorical response variable. A foundational classification technique is the decision tree algorithm.

Page 4: Text Mining, Association Rules and Decision Tree Learning

Entropy

In other words, the algorithm splits the set of instances into subsets such that the variation within each subset becomes smaller.

Entropy is an information-theoretic measure for the uncertainty in a multi-set of elements.

If the multi-set contains many different elements and each element is unique, then variation is maximal and it takes many bits to encode the individual elements. Hence, the entropy is considered high.

If, on the other hand, all elements are the same, then no bits are needed to encode the individual elements; hence the entropy is low.


Page 5: Text Mining, Association Rules and Decision Tree Learning

Entropy

[Figure: candidate splits of a decision node into Y/N subsets; the unsplit node has high entropy, while the subsets after a good split have low entropy]

Page 6: Text Mining, Association Rules and Decision Tree Learning

Entropy

$E = -\sum_{i=1}^{k} \frac{c_i}{n} \log_2 \frac{c_i}{n}$
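As a rough sketch in Python (not part of the original slides), the entropy formula above can be computed directly from a node's class counts; the class counts used below are inferred from the entropy values (1 and 0.811) that appear on the later slides.

from math import log2

def entropy(counts):
    # E = -sum_i (c_i / n) * log2(c_i / n), computed from the class counts of a node
    # (written as +log2(n / c) so that a pure node yields 0.0 rather than -0.0)
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)

print(entropy([3, 3]))   # 1.0   -> evenly mixed node, high entropy
print(entropy([1, 3]))   # 0.811 -> mixed subset
print(entropy([4, 0]))   # 0.0   -> pure subset, low entropy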

Page 7: Text Mining, Association Rules and Decision Tree Learning

Entropy

$E = -\sum_{i=1}^{k} \frac{c_i}{n} \log_2 \frac{c_i}{n}$

Page 8: Text Mining, Association Rules and Decision Tree Learning

Entropy

$E = -\sum_{i=1}^{k} \frac{c_i}{n} \log_2 \frac{c_i}{n}$

Page 9: Text Mining, Association Rules and Decision Tree Learning

Weighted Average Entropy

$\hat{E}_{\mu_1} = \frac{6}{6} \cdot 1 = 1$

$\hat{E}_{\mu} = \sum_{j=1}^{k} \frac{c_j}{n} E_j$, where $c_j$ is the number of elements in subset $j$ and $E_j$ is its entropy.

Page 10: Text Mining, Association Rules and Decision Tree Learning

Weighted Average Entropy

$\hat{E}_{\mu_2} = \frac{2}{6} \cdot 0 + \frac{4}{6} \cdot 0.811 = 0.54$

$\hat{E}_{\mu} = \sum_{j=1}^{k} \frac{c_j}{n} E_j$

Page 11: Text Mining, Association Rules and Decision Tree Learning

Weighted Average Entropy

$\hat{E}_{\mu_3} = \frac{2}{6} \cdot 0 + \frac{1}{6} \cdot 0 + \frac{3}{6} \cdot 0 = 0$

$\hat{E}_{\mu} = \sum_{j=1}^{k} \frac{c_j}{n} E_j$

Page 12: Text Mining, Association Rules and Decision Tree Learning

Information Gain

[Figure: the three candidate splits from the previous slides, each a decision node with Y/N subsets]

$IG = E_{\mu}(T) - E_{\mu}(T, a)$

$E_{\mu_1} = 1$, $E_{\mu_2} = 0.54$, $E_{\mu_3} = 0$

Stop! The third split produces pure subsets ($E_{\mu_3} = 0$) and therefore the largest information gain, so no further splitting is needed.
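The following sketch (again an illustration, not from the slides) reproduces the weighted average entropies and the resulting information gain for the candidate splits; the subset class counts are inferred from the entropies 0 and 0.811 shown above.

from math import log2

def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)

def weighted_entropy(subsets):
    # E_mu = sum_j (c_j / n) * E_j, weighting each subset's entropy by its relative size
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * entropy(s) for s in subsets)

root = [3, 3]                         # the unsplit node: E_mu1 = 1
split_2 = [[2, 0], [1, 3]]            # E_mu2 = 2/6 * 0 + 4/6 * 0.811 = 0.54
split_3 = [[2, 0], [1, 0], [0, 3]]    # E_mu3 = 0

for name, split in [("split 2", split_2), ("split 3", split_3)]:
    ig = entropy(root) - weighted_entropy(split)    # IG = E_mu(T) - E_mu(T, a)
    print(name, round(weighted_entropy(split), 2), round(ig, 2))
# split 2 -> weighted entropy 0.54, IG 0.46
# split 3 -> weighted entropy 0.0,  IG 1.0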

Page 13: Text Mining, Association Rules and Decision Tree Learning

Different Variations

Additional Settings
• Minimal size of the nodes
• Maximum depth of the tree
• Bootstrapping at nodes
• Setting a minimal threshold of IG
• Using the Gini index instead of information gain
• Post-pruning of the tree
(a short scikit-learn sketch of these settings follows the algorithm list below)

Different Algorithms
• ID3 (Iterative Dichotomiser 3)
The first decision tree classifier.
• CART (Classification and Regression Trees)
A binary classifier; the generic decision tree learning algorithm, as in the example.
• C4.5 and C5.0
Can handle numerical independent variables. The latter offers more computational speed and varies in its splitting rule.
• CHAID (Chi-square Automatic Interaction Detector)
Uses significance testing in splitting.
• Ensembles, i.e. Random Forest, AdaBoost, Gradient Boosting
Use bagging, bootstrapping and weighting. Very flexible and the most recent innovations in decision tree learning.
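As an illustrative sketch (not from the slides), scikit-learn's DecisionTreeClassifier exposes several of the settings listed above, such as the splitting criterion (Gini index vs. information gain), maximum depth and minimal node size; the iris data set is used here only as a stand-in.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(
    criterion="entropy",    # "gini" would use the Gini index instead of information gain
    max_depth=3,            # maximum depth of the tree
    min_samples_leaf=5,     # minimal size of the leaf nodes
    random_state=42,
)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))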

Page 14: Text Mining, Association Rules and Decision Tree Learning

Suggested Topics to Read
1. Dividing datasets for model evaluation
a) Training and testing sets
b) Cross-validation
2. Confusion matrix for binary classifiers
a) True Positive and True Negative
b) False Positive and False Negative
3. Quality measures in evaluating classification models
a) Error and Accuracy
b) Precision and Recall
c) F1 score (harmonic mean)
d) ROC Chart
e) Area Under the Curve
4. Ensemble methods
5. Bootstrapping and resampling statistics

Page 15: Text Mining, Association Rules and Decision Tree Learning

Text Mining and Analytics: Unsupervised Learning

Adrian Cuyugan, Information Analytics

Page 16: Text Mining, Association Rules and Decision Tree Learning

Text Mining Overview

Data Extraction
• File Types and Sources (Spreadsheet, Word Documents, HTML, JSON, API, etc.)
• Regular expressions
• Data File Systems (RDBMS, Google File System, Hadoop, MapReduce)

Information Retrieval
• Intro to Natural Language Analysis
• Vector Space Model – Bag of Words
• Term Frequency Matrix
• Inverse Document Frequency Matrix
• TF-IDF Matrix
• Stop words and Stemming
• Document Length Normalization (PL2, Okapi/BM25)
• Evaluation (Average Precision, Reciprocal Rank, F-measure and nDCG)
• Query Likelihood, Statistical Language Probability, Unigram Language Model
• Rocchio Feedback and KL Divergence
• Recommender Systems

Pattern Analysis
• Pattern Discovery Concepts (Frequent, Closed and Max)
• Association Rules
• Quantitative Measures (Support, Confidence and Lift)
• Other measures
• Apriori, ECLAT and FPGrowth Algorithms
• Multi-level and Multi-dimensional Patterns, Compressed and Colossal Patterns
• Sequential Patterns
• Graph Patterns
• Topic Modelling for Text Data

Clustering
• Partitioning, Hierarchical and Density-based methods
• Spectral Clustering
• Probabilistic Models and EM Algorithm
• Evaluating Clustering Models
• Clustering streaming data
• Graph Theory
• Social Network Analysis

Analytics
• Text clustering, categorization and summarization
• Topic-based modelling
• Sentiment analysis
• Integration of free-form text and structured data

Visualization
• Basic charts and graphs
• Animation and interactivity
• Visualizing relationships (hierarchies, clusters and networks)
• Visualizing text

Page 17: Text Mining, Association Rules and Decision Tree Learning

Text Retrieval

Text Mining and Analytics

Page 18: Text Mining, Association Rules and Decision Tree Learning

Natural Language Analysis

The quick brown fox jumped over the lazy dog.

Lexical Analysis (part-of-speech tagging): article, adjective, adjective, noun, verb, preposition, article, adjective, noun

Syntactic Analysis (parsing): Subject and Predicate; Noun phrase and Prepositional phrase

Semantic Analysis: fox(f1), dog(d1), jump(f1, d1)

Pragmatic Analysis: How quick was the fox that it jumped over the dog? Could the dog have escaped the quick fox if it wasn't lazy? Why did the fox jump over the dog?

Page 19: Text Mining, Association Rules and Decision Tree Learning

Vector Space Model

Document (d) – The quick brown fox jumped over the lazy dog.

Query (q) – How many times does "dog" occur in the document?

Term frequency (tf) – Count of a query term in a document. Example: count("dog", d)

Document length |d| – How long is the document?

Document frequency (df) – How often do we see "dog" in the entire collection? Example: df("dog") = p("dog" | collection)

Page 20: Text Mining, Association Rules and Decision Tree Learning

Simplest VSM: Bag of Words

$VSM(q, d) = q \cdot d = x_1 y_1 + \dots + x_n y_n = \sum_{i=1}^{n} x_i y_i$

quick fox over dog

… The quick brown …

The quick brown and over cunny fox…

… the fox is brown and quick…

The quick brown fox … fox … over…

The quick fox … the … the … over brown fox ... fox

How would you rank the documents based on bit-vector term frequency?

1 = word is present, 0 = word is absent

Page 21: Text Mining, Association Rules and Decision Tree Learning

Bit-Vector Term Frequency Matrix

quick fox over dog

… The quick brown …

The quick brown and over cunny fox…

… the fox is brown and quick…

The quick brown fox … fox … over…

The quick fox … the … the … over brown fox … fox

Term:  the  quick  brown  and  is  cunny  fox  over  dog
q:     0    1      0      0    0   0      1    1     1
d1:    0    1      0      0    0   0      0    0     0
d2:    0    1      0      0    0   0      1    1     0
d3:    0    1      0      0    0   0      1    0     0
d4:    0    1      0      0    0   0      1    1     0
d5:    0    1      0      0    0   0      1    1     0

f(q, d1) = 0∗0 + 1∗1 + 0∗0 + 0∗0 + 0∗0 + 0∗0 + 0∗0 + 0∗0 + 0∗0 = 1
f(q, d3) = 0∗0 + 1∗1 + 0∗0 + 0∗0 + 0∗0 + 0∗0 + 1∗1 + 0∗0 + 0∗0 = 2
f(q, d5) = 0∗0 + 1∗1 + 0∗0 + 0∗0 + 0∗0 + 0∗0 + 1∗1 + 1∗1 + 0∗0 = 3

1 = word is present, 0 = word is absent
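A small sketch of the bag-of-words dot product; the document strings stand in for the example sentences above, and the scores reproduce f(q, d1) = 1, f(q, d3) = 2 and f(q, d5) = 3 from the matrix.

vocab = ["the", "quick", "brown", "and", "is", "cunny", "fox", "over", "dog"]
query = {"quick", "fox", "over", "dog"}

docs = {
    "d1": "the quick brown",
    "d2": "the quick brown and over cunny fox",
    "d3": "the fox is brown and quick",
    "d4": "the quick brown fox fox over",
    "d5": "the quick fox the the over brown fox fox",
}

def bit_vector(words):
    # 1 = word is present, 0 = word is absent
    return [1 if term in words else 0 for term in vocab]

q_vec = bit_vector(query)
for name, text in docs.items():
    d_vec = bit_vector(set(text.split()))
    score = sum(x * y for x, y in zip(q_vec, d_vec))   # f(q, d) = q . d
    print(name, score)   # d1=1, d2=3, d3=2, d4=3, d5=3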

Page 22: Text Mining, Association Rules and Decision Tree Learning

Raw Term Frequency Matrix

quick fox over dog

… The quick brown …

The quick brown and over cunny fox…

… the fox is brown and quick…

The quick brown fox … fox … over…

The quick fox … the … the … over brown fox … fox

Term:  the  quick  brown  and  is  cunny  fox  over  dog  | score
q:     0    1      0      0    0   0      1    1     1    |
d1:    0    1      0      0    0   0      0    0     0    | 1
d2:    0    1      0      0    0   0      1    1     0    | 3
d3:    0    1      0      0    0   0      1    0     0    | 2
d4:    0    1      0      0    0   0      2    1     0    | 4
d5:    0    1      0      0    0   0      3    1     0    | 5

score = sum of term frequencies over the query terms; 0 = word is absent

Page 23: Text Mining, Association Rules and Decision Tree Learning

Limitation of Term Frequency

quick fox over dog

… The quick brown …

The quick brown and over cunny fox…

… the fox is brown and quick…

The quick brown fox … fox … over…

The quick fox … the … the … over brown fox … fox

• fox deserves more credit in the matrix.
• fox is perceived to have higher importance compared to over.

Page 24: Text Mining, Association Rules and Decision Tree Learning

TF Weighting Matrix

quick fox over dog

… The quick brown …

The quick brown and over cunny fox…

… the fox is brown and quick…

The quick brown fox … fox … over…

The quick fox … the … the … over brown fox … fox

Term:  the  quick  brown  and  is  cunny  fox  over  dog  | score
q:     0    1      0      0    0   0      1    1     1    |
w:     0    2.0    0      0    0   0      5.0  1.0   5.0  |
d1:    0    1      0      0    0   0      0    0     0    | 2.0
d2:    0    1      0      0    0   0      1    1     0    | 8.0
d3:    0    1      0      0    0   0      1    0     0    | 7.0
d4:    0    1      0      0    0   0      2    1     0    | 13.0
d5:    0    1      0      0    0   0      3    1     0    | 18.0

w = weight of term; score = sum of weighted term frequencies; 0 = word is absent

Page 25: Text Mining, Association Rules and Decision Tree Learning

Inverse Document Frequency w/ Smoothing

$IDF = \log\left[\frac{M+1}{k}\right]$

TF – term frequency
IDF – inverse document frequency
M – total number of documents in the collection
k – document frequency

Page 26: Text Mining, Association Rules and Decision Tree Learning

Term Frequency-Inverse Document Frequency

Term:  the    quick  brown  and    is     cunny  fox    over   dog
M:     5      5      5      5      5      5      5      5      5
k:     5      5      5      1      1      1      4      3      0
IDF:   0.08   0.08   0.08   0.78   0.78   0.78   0.18   0.30   0.00

quick fox over dog

… The quick brown …

The quick brown and over cunny fox…

… the fox is brown and quick…

The quick brown fox … fox … over…

The quick fox … the … the … over brown fox … fox

Term:  the    quick  brown  and    is     cunny  fox    over   dog    | score
d1:    0.000  0.079  0.000  0.000  0.000  0.000  0.000  0.000  0.000  | 0.08
d2:    0.000  0.079  0.000  0.000  0.000  0.000  0.176  0.301  0.000  | 0.56
d3:    0.000  0.079  0.000  0.000  0.000  0.000  0.176  0.000  0.000  | 0.26
d4:    0.000  0.079  0.000  0.000  0.000  0.000  0.352  0.301  0.000  | 0.73
d5:    0.000  0.079  0.000  0.000  0.000  0.000  0.528  0.301  0.000  | 0.91

$IDF = \log\left[\frac{M+1}{k}\right]$

$TF\text{-}IDF(q, d) = \sum_{i=1}^{n} x_i\, y_i \log\left[\frac{M+1}{k_i}\right]$
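A sketch of the TF-IDF scoring shown above: raw term frequencies come from the stand-in documents, IDF uses the smoothed log10[(M + 1) / k] from the previous slide, and only the query terms quick, fox, over and dog contribute to the score.

from math import log10

docs = {
    "d1": "the quick brown",
    "d2": "the quick brown and over cunny fox",
    "d3": "the fox is brown and quick",
    "d4": "the quick brown fox fox over",
    "d5": "the quick fox the the over brown fox fox",
}
query = ["quick", "fox", "over", "dog"]
M = len(docs)   # 5 documents in the collection

def idf(term):
    k = sum(1 for text in docs.values() if term in text.split())   # document frequency
    return log10((M + 1) / k) if k else 0.0   # "dog" never occurs, so it contributes nothing

for name, text in docs.items():
    words = text.split()
    score = sum(words.count(t) * idf(t) for t in query)   # sum of tf * idf over query terms
    print(name, round(score, 2))   # 0.08, 0.56, 0.26, 0.73, 0.91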

Page 27: Text Mining, Association Rules and Decision Tree Learning

Comparing Matrices

quick fox over dog

… The quick brown …

The quick brown and over cunny fox…

… the fox is brown and quick…

The quick brown fox … fox … over…

The quick fox … the … the … over brown fox … fox

Scores per document under each weighting:

Doc   Bit-Vector TF   Term Frequency   TF Weighting   TF-IDF
d1    1               1                2.0            0.08
d2    3               3                8.0            0.56
d3    2               2                7.0            0.26
d4    3               4                13.0           0.73
d5    3               5                18.0           0.91

Page 28: Text Mining, Association Rules and Decision Tree Learning

Stop Words

First person•I, me, myself•We, us, ourselves

Second person•You, yours, yourself, yourselves

Third person•He, him, his, himself•She, her, hers, herself•It, its, itself•They, them, themselves

Interrogatives and Demonstratives •What, which, who, whom•This, that, those, these

Be•Am, is, are, were•Be, been, being

Have•Have, has, had, having

Do•Do, does, did, doing

Auxiliary•Will, would, shall, should, can, could•May, might, must, ought

Pronoun + Verb•I’m, you’re, she’s, they’d, we’ll

Verb + Negation•Isn’t, aren’t, haven’t, doesn’t, didn’t

Auxiliary + Negation•Won’t, wouldn’t, can’t, cannot, mustn’t•Daren’t, oughtn’t

Miscellaneous•Let’s, there’s, how’s, what’s, here’s

Articles / Determiners•A, an, the

Conjunctions•For, and, nor, but, or, yet, so

Groupings: Pronouns, Verbs, Compound

Page 29: Text Mining, Association Rules and Decision Tree Learning

Stemming

Original: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Lovins such an analys can reve featur ar not eas vis from th vari in the individu gen and can lead to a pictur of expres that is mor biolog transpar and access to interpres

Paice such an analys can rev feat that are not easy vis from the vary in the invdivid gen and can lead to a pict of express that is mor biolog transp and access to interpret

Porter such an analysi can reveal featur that ar not easili visibl from the variat in the individ gene and can lead to a pictur of express that is more biolog transpar and access to interpret
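As an illustrative sketch (assuming the NLTK package is installed), the Porter stemmer can be applied token by token; its output closely matches the Porter row above, though a few stems may differ by implementation version.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = ("such an analysis can reveal features that are not easily visible "
        "from the variations in the individual genes")
print(" ".join(stemmer.stem(word) for word in text.split()))
# -> "such an analysi can reveal featur that ..." (compare with the Porter row above)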

Page 30: Text Mining, Association Rules and Decision Tree Learning

Association Rules

Text Mining and Analytics

Page 31: Text Mining, Association Rules and Decision Tree Learning

Pattern Discovery

What is Pattern Discovery?

• A pattern is a set of items, subsequences, or substructures that occur frequently together (or strongly correlated) in a data set.

• Patterns represent intrinsic and important properties of data sets.
• Pattern discovery uncovers patterns from massive data sets.

Why do Pattern Discovery?

• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Mining sequential and structural (e.g., sub-graph) patterns
• Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
• Classification: discriminative pattern-based analysis
• Cluster analysis: pattern-based subspace clustering

Page 32: Text Mining, Association Rules and Decision Tree Learning

Pattern Discovery

Motivation

• Which products were often purchased together?
• What are the subsequent purchases after buying an iPhone?
• What software scripts likely contain copy-and-paste expressions?
• What word sequences likely form phrases in the corpus?

Applications

• Market basket analysis, cross-marketing, sale campaign analysis, Web log analysis, biochemistry sequence analysis.

Page 33: Text Mining, Association Rules and Decision Tree Learning

Frequent Itemsets

ID   Product Names
10   Outlook, SAP, Active Directory
20   Outlook, Desktop, Active Directory
30   Outlook, Active Directory, Sharepoint
40   SAP, Sharepoint, Voicemail
50   SAP, Desktop, Active Directory, Sharepoint, Voicemail

Itemset – a set of one or more items.
k-itemset – an itemset containing k items.

Absolute Support – frequency of occurrences of an itemset X.
Relative Support – the fraction of transactions that contain X.

An itemset X is frequent if the support of X is no less than the minimum support threshold (minsup).

Let the minimum support threshold require an itemset to appear in at least 3 of the 5 transactions (60%).

Frequent 1-itemsets:
Outlook: 3 (60%)
SAP: 3 (60%)
Active Directory: 4 (80%)
Sharepoint: 3 (60%)

Frequent 2-itemsets:
{Outlook, Active Directory}: 3 (60%)
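A small sketch computing absolute and relative support over the five transactions above; the values for Outlook, Active Directory and {Outlook, Active Directory} match the 60%, 80% and 60% listed.

transactions = [
    {"Outlook", "SAP", "Active Directory"},
    {"Outlook", "Desktop", "Active Directory"},
    {"Outlook", "Active Directory", "Sharepoint"},
    {"SAP", "Sharepoint", "Voicemail"},
    {"SAP", "Desktop", "Active Directory", "Sharepoint", "Voicemail"},
]

def support(itemset):
    # absolute support: number of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t)

for itemset in [{"Outlook"}, {"Active Directory"}, {"Outlook", "Active Directory"}]:
    s = support(itemset)
    print(sorted(itemset), s, f"{s / len(transactions):.0%}")   # 3 (60%), 4 (80%), 3 (60%)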

Page 34: Text Mining, Association Rules and Decision Tree Learning

Association Rules

ID   Product Names
10   Outlook, SAP, Active Directory
20   Outlook, Desktop, Active Directory
30   Outlook, Active Directory, Sharepoint
40   SAP, Sharepoint, Voicemail
50   SAP, Desktop, Active Directory, Sharepoint, Voicemail

[Figure: Venn diagram of the transactions containing Outlook and Active Directory]

{Outlook} ∪ {Active Directory} = {Outlook, Active Directory}

Support (s) – the probability that a transaction contains X ∪ Y; written s(X ⇒ Y).

Confidence (c) – the conditional probability that a transaction containing X also contains Y; written c(X ⇒ Y).

Page 35: Text Mining, Association Rules and Decision Tree Learning

Association Rule Mining

ID   Product Names
10   Outlook, SAP, Active Directory
20   Outlook, Desktop, Active Directory
30   Outlook, Active Directory, Sharepoint
40   SAP, Sharepoint, Voicemail
50   SAP, Desktop, Active Directory, Sharepoint, Voicemail

Frequent itemset mining – finding all the itemsets that meet the minimum support threshold.

Association rule mining – finding all the rules X ⇒ Y that meet both the minimum support and minimum confidence thresholds.

1-itemsets
Outlook: 3 (60%)
SAP: 3 (60%)
Active Directory: 4 (80%)
Sharepoint: 3 (60%)

2-itemsets
{Outlook, Active Directory}: 3 (60%)

From frequent itemsets to association rules:
Outlook ⇒ Active Directory: (60%, 100%)
Active Directory ⇒ Outlook: (60%, 75%)
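A sketch of the support and confidence computation for the two rules above, reusing the same five transactions; it reproduces (60%, 100%) and (60%, 75%).

transactions = [
    {"Outlook", "SAP", "Active Directory"},
    {"Outlook", "Desktop", "Active Directory"},
    {"Outlook", "Active Directory", "Sharepoint"},
    {"SAP", "Sharepoint", "Voicemail"},
    {"SAP", "Desktop", "Active Directory", "Sharepoint", "Voicemail"},
]

def rel_support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule(x, y):
    s = rel_support(x | y)      # support of the rule X => Y
    c = s / rel_support(x)      # confidence = s(X u Y) / s(X)
    return s, c

print(rule({"Outlook"}, {"Active Directory"}))    # (0.6, 1.0)  -> 60%, 100%
print(rule({"Active Directory"}, {"Outlook"}))    # (0.6, 0.75) -> 60%, 75%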

Page 36: Text Mining, Association Rules and Decision Tree Learning

Downward Closure of Frequent Patterns

Scenario:
• A database contains two transactions with itemsets.
• We get a frequent itemset.
• Also, its subsets are all frequent.
• This is equivalent to a vigintillion of subsets.

Efficient mining:
• If {Outlook, SAP, Active Directory} is frequent, so is {Outlook, Active Directory}.
• This is because every transaction containing {Outlook, SAP, Active Directory} also contains {Outlook, Active Directory}.
• Any subset of a frequent itemset must be frequent.
• So, if any subset of an itemset S is infrequent, then there is no chance for S to be frequent.

Page 37: Text Mining, Association Rules and Decision Tree Learning

Limitation of Support-Confidence Framework

Scenario:

• Active Directory ⇒ Password Reset

                   Active Directory   ¬Active Directory   Sum of Rows
Password Reset     400                350                 750
¬Password Reset    200                50                  250
Sum of Columns     600                400                 1000

• Active Directory ⇒ Password Reset has support 400/1000 = 40% and confidence 400/600 = 66.7%, yet, as the next slides show, the two items are in fact negatively correlated.

Page 38: Text Mining, Association Rules and Decision Tree Learning

Lift

lift(Active Directory, Password Reset)

                   Active Directory   ¬Active Directory   Sum of Rows
Password Reset     400                350                 750
¬Password Reset    200                50                  250
Sum of Columns     600                400                 1000

$lift(X, Y) = \frac{c(X \Rightarrow Y)}{s(Y)} = \frac{s(X \cup Y)}{s(X)\, s(Y)}$

lift = 1: X and Y are independent
lift > 1: X and Y are positively correlated
lift < 1: X and Y are negatively correlated

lift(Active Directory, Password Reset) = 0.40 / (0.60 × 0.75) ≈ 0.89, so the two items are negatively correlated.
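A sketch of the lift computation on the contingency table above; with s(X ∪ Y) = 0.40, s(X) = 0.60 and s(Y) = 0.75, the lift is roughly 0.89, below 1, which flags the negative correlation despite the 66.7% confidence.

total = 1000
s_x = 600 / total     # s(Active Directory)
s_y = 750 / total     # s(Password Reset)
s_xy = 400 / total    # s(Active Directory u Password Reset)

confidence = s_xy / s_x           # c(X => Y)
lift = s_xy / (s_x * s_y)         # lift(X, Y) = s(X u Y) / (s(X) * s(Y))
print(round(confidence, 3), round(lift, 3))   # 0.667 0.889 -> lift < 1: negatively correlated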

Page 39: Text Mining, Association Rules and Decision Tree Learning

Expected Value for Chi-Square

                   Active Directory   ¬Active Directory   Sum of Rows
Password Reset     400                350                 750
¬Password Reset    200                50                  250
Sum of Columns     600                400                 1000

$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$

$\chi^2 = 0$: independent; $\chi^2 > 0$: correlated, either positively or negatively, therefore it needs more tests (e.g., lift).

$E_{i,j} = \frac{T_i\, T_j}{Total}$, where $T_i$ is the total of the i-th row and $T_j$ is the total of the j-th column.

$E_{1,1} = \frac{750 \times 600}{1000} = 450$
$E_{1,2} = \frac{750 \times 400}{1000} = 300$
$E_{2,1} = \frac{250 \times 600}{1000} = 150$
$E_{2,2} = \frac{250 \times 400}{1000} = 100$

Page 40: Text Mining, Association Rules and Decision Tree Learning

Chi-Square

                   Active Directory   ¬Active Directory   Sum of Rows
Password Reset     400 (450)          350 (300)           750
¬Password Reset    200 (150)          50 (100)            250
Sum of Columns     600                400                 1000

$\chi^2 = \frac{(400-450)^2}{450} + \frac{(350-300)^2}{300} + \frac{(200-150)^2}{150} + \frac{(50-100)^2}{100} = 55.56$

$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$

A χ² of 55.56 shows Active Directory and Password Reset are negatively correlated, since the expected value (450) is higher than the observed value (400).
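A sketch of the χ² computation on the same 2×2 table; the expected counts and the statistic (55.56) match the values worked out above.

observed = [[400, 350],    # rows: Password Reset, not Password Reset
            [200, 50]]     # columns: Active Directory, not Active Directory

row_totals = [sum(row) for row in observed]          # 750, 250
col_totals = [sum(col) for col in zip(*observed)]    # 600, 400
total = sum(row_totals)                              # 1000

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total   # E_ij = T_i * T_j / total
        chi_square += (obs - expected) ** 2 / expected

print(round(chi_square, 2))   # 55.56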

Page 41: Text Mining, Association Rules and Decision Tree Learning

Apriori Algorithm Pseudo Code

C_k : candidate itemsets of size k
F_k : frequent itemsets of size k

k = 1;
F_1 = {frequent items};                          // frequent 1-itemsets
while (F_k is not empty) do {                    // repeat while F_k is non-empty
    C_{k+1} = candidates generated from F_k;     // candidate generation
    Derive F_{k+1} by counting candidates in C_{k+1} with respect to the database at minsup;
    k = k + 1;
}
return the union of all F_k                      // return the F_k generated at each level
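A compact Python sketch of the pseudocode above (an illustration, not the original implementation); run on the four-transaction example on the next slide with minsup = 2, it returns the same F1, F2 and F3, ending with {B, C, E}: 2.

from itertools import combinations

def apriori(transactions, minsup):
    # Level-wise Apriori: count candidates C_k, keep frequent F_k, join to form C_{k+1}.
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    candidates = {frozenset([item]) for t in transactions for item in t}   # C1
    frequent, k = {}, 1
    while candidates:
        f_k = {c: support(c) for c in candidates}
        f_k = {c: s for c, s in f_k.items() if s >= minsup}                # F_k
        frequent.update(f_k)
        k += 1
        # C_{k+1}: join pairs of frequent k-itemsets; prune candidates with any infrequent subset
        candidates = {
            a | b
            for a, b in combinations(f_k, 2)
            if len(a | b) == k
            and all(frozenset(sub) in f_k for sub in combinations(a | b, k - 1))
        }
    return frequent

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, s in apriori(transactions, minsup=2).items():
    print(sorted(itemset), s)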

Page 42: Text Mining, Association Rules and Decision Tree Learning

Apriori Algorithm

ID   Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

Let minsup = 2 (an itemset must appear in at least 2 of the 4 transactions).

C1 (after 1st scan):
Itemset   support
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

F1:
Itemset   support
{A}       2
{B}       3
{C}       3
{E}       3

C2 (after 2nd scan):
Itemset   support
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

F2:
Itemset   support
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (after 3rd scan):
Itemset   support
{B, C, E} 2

C_k : candidate itemsets of size k
F_k : frequent itemsets of size k

k = 1;
F_1 = {frequent items};
while (F_k is not empty) do {
    C_{k+1} = candidates generated from F_k;
    Derive F_{k+1} by counting candidates in C_{k+1} with respect to the database at minsup;
    k = k + 1;
}
return the union of all F_k

Page 43: Text Mining, Association Rules and Decision Tree Learning

Transactions Sparse Matrix

ID   Product Names                                            Outlook   SAP   Active Directory   Desktop   Sharepoint   Voicemail
10   Outlook, SAP, Active Directory                           1         0     1                  0         0            0
20   Outlook, Desktop, Active Directory                       1         0     1                  1         0            0
30   Outlook, Active Directory, Sharepoint                    1         0     1                  0         1            0
40   SAP, Sharepoint, Voicemail                               0         1     0                  0         1            1
50   SAP, Desktop, Active Directory, Sharepoint, Voicemail    0         1     1                  1         1            1