Page 1:

Using Text Mining to Infer Semantic Attributes for Retail Data Mining

Authors: Rayid Ghani & Andrew E. Fano

Presenter: Vishal Mahajan, INFS795

Page 2:

Agenda
Drawbacks in Current Data Mining Techniques
Purpose
Assumptions and Constraints
Methodology or Approach
Extraction of Feature Set
Labeling
Classification Techniques
  Naïve Bayes
  EM
Experimental Results
Recommender System

Page 3:

Drawbacks in Current Data Mining Techniques
Semantic features are not automatically considered.
Transactional data is analyzed without analyzing the customer.
Trending is partial.
Retail items are treated as objects with no associated semantics.
Data mining techniques (association rules, decision trees, neural networks) ignore the meaning of items and the semantics associated with them.

Page 4:

Purpose of the Presentation
Describe a system that extracts semantic features.
Populate the knowledge base with the semantic features.
Show the use of text mining in retailing to extract semantic features from retailers' websites.
Show how profiles of customers or groups of customers can be built using text mining.

Page 5:

Assumptions & Constraints
Focus on the apparel retail segment only.
Results focus on extracting those semantic features that are deemed important by CRM or retail experts.
Data is extracted from the retailer's website.
The models generated can be extended beyond the apparel retail segment.

Page 6:

Approach
Collect information about products.
Define the set of features to be extracted.
Label the data with values of the features.
Train a classifier/extractor on the labeled training data so that it can extract features from unseen data.
Extract semantic features from new products by using the trained classifier.
Populate a knowledge base with the products and their corresponding features.

Page 7:

Data Collection Methodology
Use of a web crawler to extract the following from large retailers' websites:
  Names, URLs, descriptions, prices, and categories of all products available
Use of wrappers.
Extracted information is stored in a database and a subset is chosen.

Page 8:

Extraction of Feature Set
Feature selection based on expert systems.
Use of extensive domain knowledge.
Features selected with the retail apparel segment in mind.
Features selected for the project:
  Age Group
  Functionality
  Price
  Formality
  Degree of Conservativeness
  Degree of Sportiness
  Degree of Trendiness
  Degree of Brand Appeal

Page 9:

Labeling Training Data
Database created with data collected from the retailer website.
Subset of 600 products chosen and labeled.
Labeling guidelines provided.

Page 10:

Details of Features Extracted from Each Product Description

Age Group
  Values: Juniors, Teens, GenX, Mature, All Ages
  Guideline: For what ages is this item most appropriate?
Functionality
  Values: Loungewear, Sportswear, Eveningwear, Business Casual, Business Formal
  Guideline: How will the item be used?
Price Point
  Values: Discount, Average, Luxury
  Guideline: Compared to other items of this kind, is this item cheap or expensive?
Formality
  Values: Informal, Somewhat Formal, Very Formal
  Guideline: How formal is this item?
Conservativeness
  Values: 1 (gray suits) to 5 (loud, flashy clothes)
  Guideline: Does this suggest the person is conservative or flashy?
Sportiness
  Values: 1 to 5
Trendiness
  Values: 1 (timeless classic) to 5 (current favorite)
  Guideline: Is this item popular now but likely to go out of style? Or is it more timeless?
Brand Appeal
  Values: 1 (brand makes the product unappealing) to 5 (high brand appeal)
  Guideline: Is the brand known, and does it make the item appealing?
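
The labeling scheme above can be written down as a small lookup table, which makes it easy to check a labeled product before it enters the training set. This is only a sketch: the dictionary keys and the `validate_label` helper are hypothetical names chosen for illustration, and the numeric 1 to 5 scales are assumed to be integers.

```python
# Hypothetical encoding of the eight semantic attributes and their allowed values.
SEMANTIC_ATTRIBUTES = {
    "age_group": ["Juniors", "Teens", "GenX", "Mature", "All Ages"],
    "functionality": ["Loungewear", "Sportswear", "Eveningwear",
                      "Business Casual", "Business Formal"],
    "price_point": ["Discount", "Average", "Luxury"],
    "formality": ["Informal", "Somewhat Formal", "Very Formal"],
    "conservativeness": [1, 2, 3, 4, 5],
    "sportiness": [1, 2, 3, 4, 5],
    "trendiness": [1, 2, 3, 4, 5],
    "brand_appeal": [1, 2, 3, 4, 5],
}


def validate_label(label):
    """Return a list of problems found in one product's label (empty if none)."""
    problems = []
    for attribute, allowed in SEMANTIC_ATTRIBUTES.items():
        if attribute not in label:
            problems.append("missing " + attribute)
        elif label[attribute] not in allowed:
            problems.append("invalid {}: {!r}".format(attribute, label[attribute]))
    return problems


# Made-up label for illustration only.
example_label = {
    "age_group": "Teens", "functionality": "Sportswear", "price_point": "Average",
    "formality": "Informal", "conservativeness": 3, "sportiness": 4,
    "trendiness": 4, "brand_appeal": 3,
}
problems = validate_label(example_label)
```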

Page 11:

Verifying Training Data
The labeled datasets are disjoint, as labeling was done by different individuals.
Association rules (between features) are used to obtain consistency in the labeled data.
Apriori algorithm
  Implemented with single- and two-feature antecedents and consequents.
The desired consistency in labeling is achieved by applying association rules.

Page 12:

Apriori Algorithm
Find the frequent itemsets: the sets of items that have minimum support.
  A subset of a frequent itemset must also be a frequent itemset, i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Use the frequent itemsets to generate association rules.

Page 13:

The Apriori Algorithm: Example
Minimum support count is 2 in this example, as implied by the pruning of {4}, {1 2} and {1 5}.

Database D
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D to count C1
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (candidates generated from L1)
{1 2}  {1 3}  {1 5}  {2 3}  {2 5}  {3 5}

Scan D to count C2
itemset  sup.
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2
itemset  sup.
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (candidates generated from L2)
{2 3 5}

Scan D to get L3
itemset  sup.
{2 3 5}  2
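
The example can be reproduced with a short level-wise search. The sketch below is a minimal illustration of the join, prune and scan steps, not the implementation used by the authors; the function name `apriori` and the use of an absolute support count (2) are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C1 -> L1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {c: n for c, n in count({frozenset([i]) for i in items}).items()
               if n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Database D from the example.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_support=2)
```

With this database, `frequent` holds the four itemsets of L1, the four of L2 and the single itemset {2 3 5} of L3.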

Page 14:

Training from Labeled Data
The learning problem is treated as a text classification problem.
One text classifier is trained for each semantic feature.
  E.g., the price of a product is classified as Discount, Average, or Luxury.
  Age group is classified as Juniors, Teens, GenX, Mature, or All Ages.
Classification was performed using Naïve Bayes.

Page 15:

Sample Association Rules

Rule                                Support  Confidence
Informal <- Sportswear              24.5%    93.6%
Informal <- Loungewear              16.1%    82.3%
Informal <- Juniors                 12.1%    89.4%
PricePoint=Ave <- BrandAppeal=2     8.8%     79.0%
BrandAppeal=5 <- Trendy=5           16.3%    91.2%
Sportswear <- Sporty=4              9.0%     85.7%
AgeGroup=Mature <- Trendy=1         9.4%     78.8%
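
To make the two columns concrete, the sketch below computes support and confidence for a rule of the form `consequent <- antecedent` over a list of labeled products. It assumes the usual definitions (support is the fraction of all records matching both sides, confidence is the fraction of antecedent matches that also match the consequent); the slides do not spell these out. The function name and the toy records are made up and are not the data behind the table.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule `consequent <- antecedent`.

    records: list of dicts mapping attribute name -> label.
    antecedent, consequent: dicts mapping attribute name -> required label.
    """
    def matches(record, condition):
        return all(record.get(key) == value for key, value in condition.items())

    n_antecedent = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(1 for r in records
                 if matches(r, antecedent) and matches(r, consequent))
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up labels for illustration only.
labels = [
    {"functionality": "Sportswear", "formality": "Informal"},
    {"functionality": "Sportswear", "formality": "Informal"},
    {"functionality": "Sportswear", "formality": "Somewhat Formal"},
    {"functionality": "Eveningwear", "formality": "Very Formal"},
]
support, confidence = rule_stats(
    labels, {"functionality": "Sportswear"}, {"formality": "Informal"})
```

In a consistency check of the kind described earlier, records that match the antecedent of a high-confidence rule but not its consequent (the third record here) would be the ones to review.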

Page 16:

Naïve Bayes
Simple but effective text classification method.
A class is selected according to the class prior probabilities.
The model assumes each word in a document is generated independently of the others, given the class.

$$\Pr(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\,\Pr(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\,\Pr(c_j \mid d_i)} \qquad (1)$$

where N(w_t, d_i) = count of times word w_t occurs in document d_i, |V| is the vocabulary size, |D| is the number of documents,

and Pr(c_j | d_i) ∈ {0, 1}
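
Equation 1 translates almost directly into code. The following is a minimal sketch for a single class c_j, assuming documents are already tokenized; the function name `word_probabilities` and the toy documents are hypothetical.

```python
from collections import Counter


def word_probabilities(docs, class_weights, vocabulary):
    """Estimate Pr(w_t | c_j) for one class c_j with Laplace smoothing.

    docs: list of token lists, one per document d_i.
    class_weights: list of Pr(c_j | d_i), one per document
                   (0 or 1 when every document is hand-labeled).
    vocabulary: set of words V.
    """
    weighted_counts = Counter()
    for tokens, weight in zip(docs, class_weights):
        for word, n in Counter(tokens).items():
            if word in vocabulary:
                weighted_counts[word] += n * weight  # N(w_t, d_i) * Pr(c_j | d_i)
    total = sum(weighted_counts.values())  # the double sum in the denominator
    return {w: (1 + weighted_counts[w]) / (len(vocabulary) + total)
            for w in vocabulary}


# Made-up example: two documents, the second one labeled with the class of interest.
docs = [["silk", "robe"], ["denim", "jeans", "denim"]]
vocabulary = {"silk", "robe", "denim", "jeans"}
probs = word_probabilities(docs, class_weights=[0, 1], vocabulary=vocabulary)
```

Here "denim" occurs twice in the only document of the class, so its estimate is (1 + 2) / (4 + 3) = 3/7, while a word never seen in the class still gets 1/7 rather than zero.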

Page 17:

Incorporating Unlabeled Data
The initial labeled sample covered 600 products only.
Unlabeled products need to be taken into account to make any meaningful predictions.
Use of semi-supervised learning algorithms.
  These algorithms have been shown to reduce classification error considerably.
Use of the Expectation-Maximization (EM) algorithm as the semi-supervised technique.

Page 18:

Expectation-Maximization (EM) Method
EM is an iterative statistical technique for maximum likelihood estimation with incomplete data.
In the retail classification problem, unlabeled data is treated as incomplete data.
EM
  Finds a local maximum of the likelihood of the parameters.
  Gives estimates for the missing values.

Page 19:

Expectation-Maximization (EM) Method (cont.)
EM is a two-step iterative process.
Initial parameters are set using Naïve Bayes from just the labeled documents.
Subsequent iterations alternate E- and M-steps.
E-step
  Calculates probabilistically weighted class labels, Pr(c_j | d_i), for every unlabeled document.
M-step
  Estimates new classifier parameters using all documents (Equation 1).
The E- and M-steps are iterated until the classifier converges.
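
The loop described on this slide can be sketched as follows. This is an illustrative simplification rather than the authors' implementation: it runs a fixed number of iterations instead of testing for convergence, it smooths the class priors with an add-one estimate (an assumption, since the slides only give the word-probability formula), and all function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, posteriors, classes, vocab):
    """M-step: estimate class priors and word probabilities from (soft) labels."""
    priors, word_probs = {}, {}
    for c in classes:
        weight_sum = sum(p[c] for p in posteriors)
        priors[c] = (1 + weight_sum) / (len(classes) + len(docs))  # assumed smoothing
        counts = Counter()
        for tokens, p in zip(docs, posteriors):
            for w in tokens:
                counts[w] += p[c]
        total = sum(counts[w] for w in vocab)
        word_probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, word_probs


def posterior(tokens, priors, word_probs):
    """E-step for one document: Pr(c_j | d_i) under the current parameters."""
    log_scores = {
        c: math.log(priors[c])
        + sum(math.log(word_probs[c][w]) for w in tokens if w in word_probs[c])
        for c in priors
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: s / norm for c, s in exp_scores.items()}


def em_naive_bayes(labeled, unlabeled, classes, iterations=10):
    """labeled: list of (tokens, class) pairs; unlabeled: list of token lists."""
    vocab = {w for tokens, _ in labeled for w in tokens}
    vocab |= {w for tokens in unlabeled for w in tokens}
    labeled_docs = [tokens for tokens, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Initial parameters come from the labeled documents only.
    priors, word_probs = train_nb(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [posterior(tokens, priors, word_probs) for tokens in unlabeled]  # E-step
        priors, word_probs = train_nb(
            labeled_docs + unlabeled, hard + soft, classes, vocab)  # M-step
    return priors, word_probs


# Made-up product descriptions for illustration only.
classes = ["Informal", "Somewhat Formal"]
labeled = [(["denim", "jeans", "tee"], "Informal"),
           (["crepe", "jacket", "skirt"], "Somewhat Formal")]
unlabeled = [["denim", "tee", "pocket"], ["jacket", "crepe", "button"],
             ["jeans", "pocket"]]
priors, word_probs = em_naive_bayes(labeled, unlabeled, classes)
prediction = posterior(["pocket", "denim"], priors, word_probs)
```

The word "pocket" never appears in a labeled document, so its association with the Informal class is learned entirely from the unlabeled descriptions it shares with "denim", "tee" and "jeans".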

Page 20:

Experimental Results

Algorithm    Age Group  Functionality  Formality  Conservativeness  Sportiness  Trendiness  Brand Appeal
Baseline     29%        24%            68%        39%               49%         29%         36%
Naïve Bayes  66%        57%            76%        80%               70%         69%         82%
EM           78%        70%            82%        84%               78%         80%         91%

Page 21:

Experimental Results

Brand Appeal=5 (high): Lauren, Ralph, DKNY, Kenneth, Cole, imported
Conservative=5 (high): Lauren, Ralph, Breasted, Seasonless, Trouser, Jones, Sport, Classic, blazer
Conservative=1 (low): Rose, Special, Leopard, Chemise, Straps, Flirty, Spray, Silk, platform
Formality=Informal: Jean, Tommy, Jeans, Denim, Sweater, Pocket, Neck, Tee, Hilfiger
Formality=Somewhat Formal: Jacket, Fully, Button, Skirt, Lines, York, Seam, Crepe, leather
AgeGroup=Junior: Jrs, Dkny, Jeans, Tee, Colligate, Logo, Tommy, Polo, Short, sneaker
Functionality=Loungewear: Chemise, Silk, Kimono, Calvin, Klein, August, Lounge, Hilfiger, Robe, gown
Functionality=Partywear: Rock, Dress, Sateen, Length, Skirt, Shirtdress, Open, Platform, Plaid, flower
Sportiness=5 (high): Sneaker, Camp, Base, Rubber, Sole, White, Miraclesuite, Athletic, Nylon, Mesh
Trendiness=1 (low): Lauren, Seasonless, Breasted, Trouser, Pocket, Carefree, Ralph, Blazer, button

Page 22:

Results on New Data Set
The subset of data used earlier was from a single retailer.
Another sample of data was collected from a variety of retailers.
The results are as follows.

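Compared with Naïve Bayes on the single-retailer data, the figures are higher for Age Group, Sportiness, Trendiness and Brand Appeal, and lower for Functionality, Formality and Conservativeness.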

Algorithm    Age Group  Functionality  Formality  Conservativeness  Sportiness  Trendiness  Brand Appeal
Naïve Bayes  83%        45%            61%        70%               81%         80%         87%

Page 23:

Recommender System
Creating customer profiles in real time is feasible by analyzing the text associated with products and mapping it to pre-defined semantic features.
The identity of the customer is not known, and prior transaction history is unknown.
Semantic features are inferred from the "browsing" pattern of the customer.
This helps in suggesting new products to customers.

Page 24:

Recommender System
Mathematically, P(A_{i,j} | Product) is calculated, where A_{i,j} is the jth value of the ith attribute (i = semantic attributes, j = possible values).
The user profile is constructed as follows:

$$\Pr(U_{i,j} \mid \text{Past } N \text{ Items}) = \frac{1}{N} \sum_{k=1}^{N} \Pr(A_{i,j} \mid \text{Item}_k)$$
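
The profile formula is a plain average of the per-item attribute distributions over the last N items browsed. A minimal sketch, assuming each item has already been scored by the per-attribute classifiers; the function name, data layout and probabilities below are made up for illustration.

```python
def user_profile(item_distributions):
    """Average Pr(A_ij | Item_k) over the past N items.

    item_distributions: list of N dicts, each mapping
                        attribute -> {value: probability} for one browsed item.
    """
    n = len(item_distributions)
    profile = {}
    for item in item_distributions:
        for attribute, distribution in item.items():
            slot = profile.setdefault(attribute, {})
            for value, p in distribution.items():
                slot[value] = slot.get(value, 0.0) + p / n  # the 1/N * sum over k
    return profile


# Two browsed items with made-up classifier outputs for one attribute.
browsed = [
    {"formality": {"Informal": 0.9, "Somewhat Formal": 0.1}},
    {"formality": {"Informal": 0.5, "Somewhat Formal": 0.5}},
]
profile = user_profile(browsed)
```

For these two items the profile puts about 0.7 on Informal and 0.3 on Somewhat Formal, and it can be updated as each new item is viewed without knowing who the customer is.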

Page 25:

Types of Recommender Systems
Two types of recommender systems.
Collaborative Filtering
  Collects user feedback in terms of ratings.
  Exploits similarities and differences between customers to recommend items.
  Issues
    Sparsity problem.
    New items.
Content Filtering
  Compares the contents.
  Issues
    Narrow in scope.
    Recommends similar products only.
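
As an illustration of content filtering over semantic attributes, the sketch below ranks candidate products by how closely their attribute distributions match a user profile. The dot-product score is an assumption chosen for simplicity and is not taken from the paper; the function names, product names and probabilities are hypothetical.

```python
def match_score(profile, product):
    """Sum, over attributes, of the dot product between profile and product distributions."""
    score = 0.0
    for attribute, user_distribution in profile.items():
        product_distribution = product.get(attribute, {})
        score += sum(p * product_distribution.get(value, 0.0)
                     for value, p in user_distribution.items())
    return score


def recommend(profile, catalog, top_n=1):
    """Return the names of the top_n products with the highest match score."""
    ranked = sorted(catalog, key=lambda name: match_score(profile, catalog[name]),
                    reverse=True)
    return ranked[:top_n]


# Made-up profile and catalog for illustration only.
profile = {"formality": {"Informal": 0.7, "Somewhat Formal": 0.3}}
catalog = {
    "denim jacket": {"formality": {"Informal": 0.8, "Somewhat Formal": 0.2}},
    "crepe skirt": {"formality": {"Informal": 0.1, "Somewhat Formal": 0.9}},
}
best = recommend(profile, catalog)
```

Because the score only rewards similarity to what the customer has already browsed, a ranking like this shows the "recommends similar products only" limitation listed above.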

Page 26:

Conclusions
The system learns through the use of supervised and semi-supervised techniques.
Major assumption: do products accurately convey the semantic attributes?
A small sample of data was used to infer the results.
Practical applications were not verified.
The system is bootstrapped from a small number of labeled training examples.
An interesting application that could be evolved to generate trends for retail marketers.