Semi-automatic Product Attribute Extraction from Store Website Yan Liu Carnegie Mellon University Sep 2, 2005 yanliu@cs.cmu.edu

Page 1:

Semi-automatic Product Attribute Extraction from Store Website

Yan Liu
Carnegie Mellon University

Sep 2, 2005
yanliu@cs.cmu.edu

Page 2:

Example from Dick’s Sporting Goods

[Figure: product webpage. Labels: Product name, Description, Features, Free text, Structured data]

Page 3:

Applications

Direct application
  Product recommendation systems for customers
  Price estimates for auction
  Sales amount prediction
More general applications
  Document organization
  Email prioritization
  Question answering
  And many more text mining tasks

Page 4:

Relationship with Previous Work

Information extraction
  Extract from the documents salient facts about prespecified types of events, entities or relationships
  Different from information retrieval
Previous work
  Finite state machines
  Sliding windows
  Sequential models, such as HMMs or CRFs
  Association and clustering
Major challenges
  Little training data
  Unclear attribute definition

Page 5:

Outline

Introduction
General framework
Detailed algorithms
Experiment results
Conclusion and discussion

Page 6:

General Framework

Attribute Identification (Semi-supervised learning)
Name-value Assignment (Statistical and grammatical association)
Feedback (Active learning)

Example:
  Input: free text
    9.68-lb total weight (4.4-kg)
  Output: structured data
    weight: 9.68-lb, 4.4-kg
    weight: CD-lb
    weight: CD-kg
    …
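
To make the input/output example concrete, here is a minimal Python sketch of the free-text-to-structured-data step for the weight example only. It is not the system described in the talk: the unit table `UNIT_TO_ATTRIBUTE`, the regular expression, and the function names `extract` and `generalize` are assumptions chosen for illustration, and `CD` is read here as the part-of-speech tag for a cardinal number.

```python
import re

# Assumed toy knowledge base mapping measure units to attribute names.
UNIT_TO_ATTRIBUTE = {"lb": "weight", "kg": "weight", "oz": "weight",
                     "in": "length", "cm": "length"}

# Matches values written as <number>-<unit>, e.g. "9.68-lb".
VALUE = re.compile(r"(\d+(?:\.\d+)?)-([a-z]+)")

def extract(text):
    """Return {attribute name: [values]} for every known unit found in text."""
    result = {}
    for match in VALUE.finditer(text.lower()):
        number, unit = match.groups()
        name = UNIT_TO_ATTRIBUTE.get(unit)
        if name is not None:
            result.setdefault(name, []).append(f"{number}-{unit}")
    return result

def generalize(value):
    """Replace the leading number with the tag CD, e.g. '9.68-lb' -> 'CD-lb'."""
    return re.sub(r"^\d+(?:\.\d+)?", "CD", value)

pairs = extract("9.68-lb total weight (4.4-kg)")
for name, values in pairs.items():
    print(f"{name}: {', '.join(values)}")
```

A real pipeline would learn which tokens are attributes rather than rely on a fixed unit table; the table only stands in for the knowledge database mentioned later in the deck.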

Page 7:

Attribute Identification

Initial label acquisition
  Template matching
  Knowledge database
Semi-supervised learning
  Yarowsky’s algorithm
  Co-training
  Co-EM
  Co-boosting
  Graph-based methods
Phrase identification
  Statistical associations between adjacent words
  Heuristic grammatical rules

Page 8:

Attribute Identification (1): Initial Label Acquisition

Positive labels
  Template matching
    Extracted templates from data with special format
    Noisy data
  Knowledge database
    Measure units: length, weight, volume, etc.
    Material
    Country
    Color
Negative labels
  Partial stop word list
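
The seeding step above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the talk's implementation: the word lists in `KNOWLEDGE` and `STOP_WORDS`, the `name: value` template, and the function names are all hypothetical.

```python
import re

# Assumed toy knowledge database; the real lists are not given in the slides.
KNOWLEDGE = {
    "unit": {"lb", "kg", "oz", "inch", "inches"},
    "material": {"leather", "nylon", "polyester"},
    "country": {"usa", "china"},
    "color": {"red", "black", "white"},
}
# Assumed partial stop word list used for negative labels.
STOP_WORDS = {"the", "a", "an", "of", "and", "with", "for", "to"}

# Assumed template for specially formatted lines such as "Color: Black/Red".
TEMPLATE = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$")

def seed_labels(tokens):
    """Label tokens 1 (attribute-related) or 0 (not); others stay unlabeled."""
    labels = {}
    for token in tokens:
        lowered = token.lower()
        if any(lowered in words for words in KNOWLEDGE.values()):
            labels[token] = 1
        elif lowered in STOP_WORDS:
            labels[token] = 0
    return labels

def template_pairs(lines):
    """Collect (name, value) pairs from lines that match the template."""
    pairs = []
    for line in lines:
        match = TEMPLATE.match(line)
        if match:
            pairs.append((match.group(1).lower(), match.group(2)))
    return pairs
```

Tokens that match neither list stay unlabeled and are left for the semi-supervised step. Template matches are noisy, as the slide notes, so they are better treated as candidate seeds than as ground truth.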

Page 9:

Attribute Identification (1): Semi-supervised Learning

Co-training [Blum & Mitchell, 1998; Collins & Singer, 1999]
Separation of two views
  Contextual features
  Spelling features
Two kinds of features
  Stemmed words (Porter stemmer)
  POS tagging (Brill’s tagger)
Algorithm pseudocode
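
As a rough illustration of the co-training loop named on this slide, here is a minimal Python sketch. It is not the slide's pseudocode: the tiny naive-Bayes-style scorer, the feature strings (`prev=CD`, `suffix=kg`), the confidence threshold, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

VIEWS = ("context", "spelling")  # the two feature views (key names assumed)

def train_view(labeled, view):
    """Count how often each feature of one view occurs with each label."""
    counts = {0: Counter(), 1: Counter()}
    for example, label in labeled:
        counts[label].update(example[view])
    return counts

def score(counts, features):
    """Smoothed log-odds that the features indicate an attribute (label 1)."""
    vocab = len(set(counts[0]) | set(counts[1])) + 1
    total = {y: sum(counts[y].values()) for y in (0, 1)}
    log_odds = 0.0
    for f in features:
        p1 = (counts[1][f] + 1) / (total[1] + vocab)
        p0 = (counts[0][f] + 1) / (total[0] + vocab)
        log_odds += math.log(p1 / p0)
    return log_odds

def co_train(labeled, unlabeled, rounds=5, per_round=1, threshold=0.3):
    """Each view labels its most confident unlabeled examples for the other."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        added = False
        for view in VIEWS:
            model = train_view(labeled, view)
            ranked = sorted(pool, key=lambda x: abs(score(model, x[view])),
                            reverse=True)
            for example in ranked[:per_round]:
                s = score(model, example[view])
                if abs(s) >= threshold:
                    labeled.append((example, 1 if s > 0 else 0))
                    pool.remove(example)
                    added = True
        if not added:
            break
    return labeled, pool

seeds = [
    ({"id": "s1", "context": ["prev=CD"], "spelling": ["suffix=lb"]}, 1),
    ({"id": "s2", "context": ["prev=IN"], "spelling": ["suffix=he"]}, 0),
]
unlabeled = [
    {"id": "u1", "context": ["prev=CD"], "spelling": ["suffix=kg"]},
    {"id": "u2", "context": ["prev=JJ"], "spelling": ["suffix=kg"]},
]
labeled, pool = co_train(seeds, unlabeled)
```

In this toy run the context view labels `u1` first (it follows a number, like the seed), and the spelling view then uses the newly learned `suffix=kg` feature to label `u2`, whose context was uninformative. That hand-off between views is the point of co-training.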

Page 10:

Attribute Identification (1): Phrase Identification

Phrase identification
  Difference from chunking
  Label propagation
  Category dependent
  Example phrases: "display team colors", "up to 12 inches"
Statistical association
  Information gain
    $$IG(X, Y) = H(X) - H(X \mid Y)$$
  Mutual information
    $$MI(X; Y) = \frac{P(X, Y)}{P(X)\,P(Y)}$$
  Yule’s statistic
    $$Yule(X, Y) = \frac{C(X, Y)\,C(\bar{X}, \bar{Y}) - C(X, \bar{Y})\,C(\bar{X}, Y)}{C(X, Y)\,C(\bar{X}, \bar{Y}) + C(X, \bar{Y})\,C(\bar{X}, Y)}$$
    where $C(\cdot,\cdot)$ is a co-occurrence count and a bar marks the absence of a word.
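
The three association measures named on this slide can be computed from a 2x2 contingency table of adjacent word pairs. The Python sketch below assumes X means "the first word is x" and Y means "the second word is y"; the function names and the toy sentence are illustrative only, and the mutual information is computed as the plain ratio P(X,Y) / (P(X) P(Y)), without a logarithm.

```python
import math

def entropy(p):
    """Binary entropy in bits of an event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def contingency(bigrams, x, y):
    """Return (a, b, c, d) = C(X,Y), C(X,not Y), C(not X,Y), C(not X,not Y)."""
    a = b = c = d = 0
    for first, second in bigrams:
        if first == x and second == y:
            a += 1
        elif first == x:
            b += 1
        elif second == y:
            c += 1
        else:
            d += 1
    return a, b, c, d

def information_gain(a, b, c, d):
    """IG(X, Y) = H(X) - H(X | Y) for the two binary events."""
    n = a + b + c + d
    h_x = entropy((a + b) / n)
    h_x_given_y = 0.0
    if a + c:
        h_x_given_y += (a + c) / n * entropy(a / (a + c))
    if b + d:
        h_x_given_y += (b + d) / n * entropy(b / (b + d))
    return h_x - h_x_given_y

def mi_ratio(a, b, c, d):
    """P(X, Y) / (P(X) P(Y)), written with counts to avoid rounding."""
    denom = (a + b) * (a + c)
    return a * (a + b + c + d) / denom if denom else 0.0

def yule(a, b, c, d):
    """Yule's statistic: (ad - bc) / (ad + bc), in [-1, 1]."""
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0

tokens = "display team colors up to 12 inches display team colors".split()
bigrams = list(zip(tokens, tokens[1:]))
table = contingency(bigrams, "team", "colors")
```

A score near 1 for an adjacent pair such as ("team", "colors") suggests the two words belong to one phrase, while a score near -1 suggests they should be split.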

Page 11:

Name-value Assignment

Combination of three information sources
  Semantic association
    Knowledge database
    Attribute name generation and pair assignment
  Grammatical association
    Parsing tree (Minipar)
    Attribute name/value generation
  Statistical association scores
    Yule’s statistic (category dependent)
    Pair assignment
Other association sources
  WordNet
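
One way to picture how these sources might be combined is the hedged Python sketch below. It covers only two of the three sources (a knowledge-database lookup, then a statistical score as a fallback) and leaves out the Minipar-based grammatical association entirely. The table `UNIT_NAMES`, the score dictionary, and the precedence order are assumptions, not details taken from the talk.

```python
# Assumed toy knowledge database: measure unit -> attribute name.
UNIT_NAMES = {"lb": "weight", "kg": "weight", "inches": "length"}

def assign_pairs(values, candidate_names, association):
    """Pair each value with an attribute name.

    A knowledge-database match wins; otherwise the candidate name with the
    highest statistical association score is used, if that score is positive.
    """
    pairs = []
    for value in values:
        unit = value.rsplit("-", 1)[-1]
        if unit in UNIT_NAMES:
            pairs.append((UNIT_NAMES[unit], value))
            continue
        best = max(candidate_names,
                   key=lambda name: association.get((name, value), 0.0),
                   default=None)
        if best is not None and association.get((best, value), 0.0) > 0.0:
            pairs.append((best, value))
    return pairs

# Hypothetical association scores, e.g. Yule's statistic per (name, value).
association = {("strap", "adjustable"): 0.8, ("pocket", "adjustable"): -0.2}
pairs = assign_pairs(["4.4-kg", "adjustable"], ["strap", "pocket"], association)
```

Values with no knowledge-base match and no positively associated name are left unassigned rather than forced onto a weak candidate.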

Page 12:

User Feedback

Clustering-based active learning
  Novel attribute identification
  Merging and splitting attributes
  Better use of labeled examples
Clustering algorithm
  Sparse data problem
  Multiple clustering algorithms
Cluster selection
  Within-cluster coherence
  Novelty-based measurement

Page 13:

User Feedback: Clustering Algorithm

Latent semantic indexing (LSI) [Deerwester et al., 1990]
  Singular value decomposition on term-document matrix
  Mapping the words into hidden semantic concepts
  Similarity measure: cosine similarity
Clustering algorithms using CLUTO
  K-means
  Bisecting K-means
  Agglomerative algorithm
    Single linkage
    Complete linkage
    Average linkage
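
To show how the three linkage options differ, here is a small pure-Python stand-in for agglomerative clustering on cosine similarity. It does not call CLUTO and skips the LSI/SVD projection (the vectors are plain sparse term weights); the function names and the toy vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster_similarity(a, b, vectors, linkage):
    """Similarity of two clusters (lists of indices) under a linkage rule."""
    sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
    if linkage == "single":      # closest pair of members
        return max(sims)
    if linkage == "complete":    # farthest pair of members
        return min(sims)
    return sum(sims) / len(sims)  # average over all pairs

def agglomerate(vectors, k, linkage="average"):
    """Merge the most similar pair of clusters until only k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_similarity(clusters[x], clusters[y], vectors, linkage)
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

vectors = [
    {"color": 1.0, "red": 1.0},
    {"color": 1.0, "blue": 1.0},
    {"strap": 1.0, "adjust": 1.0},
    {"strap": 1.0, "pad": 1.0},
]
clusters = agglomerate(vectors, k=2)
```

On well-separated toy data all three linkages agree; they diverge on chained or noisy data, which is one reason to run several clustering algorithms when the data is sparse.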

Page 14:

User Feedback: Cluster Selection

Novel concepts
  Major difference from previous task
  Supervised novelty detection is difficult
Tradeoff between novelty and relevancy
  Recently studied by the IR community [Carbonell and Goldstein, 1995; Zhang et al., 2003; Zhai et al., 2004]
Cluster selection criterion using maximal marginal relevance (MMR)
  $$V_{MMR} = \lambda \max_{i} \mathrm{sim}(C, C_i) - (1 - \lambda) \max_{j} \mathrm{sim}(C, D_j)$$
Similarity measure
  Cosine similarity
  KL-divergence
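
A minimal Python sketch of MMR-style selection follows, assuming the standard form: a weight lambda on relevance minus (1 - lambda) on redundancy. The slide does not define its symbols, so the roles given here are assumptions: `references` plays the part of the items a candidate should be relevant to, and `selected` the items already chosen. Cosine similarity is used as the similarity measure, and clusters are represented as plain numeric tuples.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_value(candidate, references, selected, lam):
    """lam * (best relevance) - (1 - lam) * (worst redundancy)."""
    relevance = max((cosine(candidate, r) for r in references), default=0.0)
    redundancy = max((cosine(candidate, s) for s in selected), default=0.0)
    return lam * relevance - (1 - lam) * redundancy

def select_clusters(candidates, references, n, lam=0.5):
    """Greedily pick n candidates, re-scoring after every pick."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < n:
        best = max(remaining,
                   key=lambda c: mmr_value(c, references, selected, lam))
        selected.append(best)
        remaining.remove(best)
    return selected

A, B, C = (1.0, 0.1), (1.0, 0.2), (0.6, 0.8)
references = [(1.0, 0.0)]
novelty_heavy = select_clusters([A, B, C], references, n=2, lam=0.3)
relevance_only = select_clusters([A, B, C], references, n=2, lam=1.0)
```

With a small lambda the second pick is the dissimilar candidate `C` rather than the near-duplicate `B`; with lambda set to 1 the criterion reduces to ranking by relevance alone. This is the novelty versus relevancy tradeoff the slide refers to.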

Page 15:

Outline

Introduction
General framework
Detailed algorithms
Experiment results
Conclusion and discussion

Page 16:

Experiment Setup

Dataset
  Free text extracted from product descriptions on http://www.dickssportinggoods.com
  Subsets from two categories
    Football (largest category)
      52339 entries, 194273 words, 2926 predicted feature-value pairs
    Tennis (medium category)
      3840 entries, 12533 words, 419 predicted feature-value pairs
Evaluation measures
  Direct evaluation
    Precision on feature-value pairs
  Indirect evaluation in other applications
    Recommender systems

Page 17:

Experiment Results

Initial label acquisition

Semi-supervised learning

Phrase identification

Semantic association

Grammatical association

Statistical association scores

Examples by steps

Page 18:

Experiment Results

Human feedback
  Sample files (link to file)
  Total labeling time of 5 mins
  Identified concepts
    color, graphics, logo, design, fit, size, pocket, pad, set, adjustment, attachment, construction, strap
Examples by active learning

Page 19:

Experiment Results

Precision on most frequent feature-value pairs
  Most frequent 600 pairs
  Assignment of 5 labels
    Fully correct, incorrect names, incorrect values, incorrect associations, nonsense
  Human labeling of approximately 6 hours
    Thanks to Katharin and Marko
Results

Page 20:

Conclusion

Product attribute identification is a difficult task
  Little training data
    Making use of labeled and unlabeled data by semi-supervised learning
  Unclear attribute definition
    Novel attribute identification by active learning
A framework of active learning combined with semi-supervised learning

Page 21:

Text Learning Techniques

Text processing
  Stemming (Porter stemmer)
  POS tagging (Brill’s tagger)
  Text chunking and parsing (Minipar)
  Word semantics (WordNet, dependency-based thesaurus)
  Latent semantic indexing (SVDPack)
Machine learning
  Semi-supervised learning (Co-training)
  Active learning (MMR)
  Classification (C4.5 decision tree, FOIL)
  Clustering (K-means, CLUTO)
  Information theory and statistical associations (Information gain, Yule’s statistic)

Page 22:

Future Work

Associations of product attributes across categories or websites

More effective active learning algorithms

Graphical models with application to information extraction

Page 23:

Questions?