Assertion
Creation and maintenance of large ontologies will require the capability to merge taxonomies
This problem is similar to the problem of merging e-commerce catalogs
R. Agrawal, R. Srikant: On Catalog Integration. WWW-10
Catalog Integration Problem
Integrate products from new catalog into master catalog.
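[Figure: the master catalog (ICs, with categories Logic, Mem., and DSP holding products a to f) beside the new catalog (ICs, with categories Cat 1 and Cat 2 holding products x, y, z).]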
Desired Solution
Automatically integrate products:
  little or no effort on part of user.
  domain independent.
Problem size:
  million products
  thousands of nodes in the hierarchy
How do we do it
Build classification model (rules) using product descriptions in master catalog.
  Example: If the product description contains "DRAM", the product is likely to be in the "Memory" category.
Use classification model to predict categories for products in the new catalog.
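The two steps above can be sketched with simple keyword rules of the "DRAM" implies "Memory" kind. This is only an illustration, not the classifier used in the talk: the function names, the `min_share` threshold, and the toy catalog entries are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def learn_keyword_rules(master_catalog, min_share=0.8):
    """Map a word to a category when at least min_share of the
    descriptions containing that word belong to that category."""
    word_category_counts = defaultdict(Counter)
    for description, category in master_catalog:
        for word in set(description.lower().split()):
            word_category_counts[word][category] += 1
    rules = {}
    for word, counts in word_category_counts.items():
        category, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= min_share:
            rules[word] = category
    return rules

def predict_category(rules, description):
    """Let every matching rule vote; return None if no rule fires."""
    words = set(description.lower().split())
    votes = Counter(rules[w] for w in words if w in rules)
    return votes.most_common(1)[0][0] if votes else None

# Toy master catalog (invented entries, for illustration only).
master_catalog = [
    ("4M DRAM module", "Memory"),
    ("16M DRAM chip", "Memory"),
    ("quad NAND gate", "Logic"),
    ("hex inverter gate", "Logic"),
]
rules = learn_keyword_rules(master_catalog)
print(predict_category(rules, "64M DRAM"))
```

Hard rules like these ignore how strongly each word points to a category, which is one reason a probabilistic model is the more natural choice for catalogs of this size.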
[Figure: the model's prediction for new product x over the categories Logic and DSP, with probabilities 5% and 95%.]
National Semiconductor Files
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.
Part: LM3940 1A Low Dropout Regulator
Part: Wide Adjustable Range PNP Voltage Regulator
Part: LM2940/LM2940C 1A Low Dropout Regulator
...
...
...
National Semiconductor Files with Categories
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
  Choice 1: Transceiver
  Choice 2: Line Receiver
  Choice 3: Line Driver
  Choice 4: General-Purpose Silicon Rectifier
  Choice 5: Tapped Delay Line
Part: LM3940 1A Low Dropout Regulator
Pangea Category:
  Choice 1: Positive Fixed Voltage Regulator
  Choice 2: Voltage-Feedback Operational Amplifier
  Choice 3: Voltage Reference
  Choice 4: Voltage-Mode SMPS Controller
  Choice 5: Positive Adjustable Voltage Regulator
...
...
Accuracy on Pangea Data
B2B Portal for electronic components:
  1200 categories, 40K training documents.
  500 categories with < 5 documents.
Accuracy:
  72% for top choice.
  99.7% for top 5 choices.
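The "top choice" and "top 5 choices" figures are top-k accuracies: a product counts as correct when its true category appears among the first k ranked choices. A minimal sketch of that measurement follows; the function name and the toy rankings and labels are assumptions for illustration, not Pangea data.

```python
def top_k_accuracy(ranked_predictions, true_categories, k):
    """Fraction of products whose true category is among the first k choices."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_categories)
        if truth in ranked[:k]
    )
    return hits / len(true_categories)

# Invented rankings and labels, for illustration only.
ranked = [
    ["Memory", "Logic", "DSP"],
    ["Logic", "DSP", "Memory"],
]
truth = ["Memory", "DSP"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
print(top1, top2)
```

A large gap between top-1 and top-5 accuracy suggests that presenting a short ranked list to the user recovers most of the remaining errors at little cost.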
Enhanced Algorithm
Use affinity information in catalog to be integrated:
  Products in same category are similar.
  Bias the classifier to incorporate this information.
Accuracy boost depends on quality of current catalog:
  Use tuning set to determine amount of bias.
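Choosing the amount of bias from a tuning set can be sketched as a search over candidate weights. Everything named here is hypothetical: `pick_weight`, the callback `accuracy_on_tune_set`, and the accuracy numbers, which are invented purely to make the example runnable.

```python
def pick_weight(candidate_weights, accuracy_on_tune_set):
    """Return (weight, accuracy) for the candidate that scores best on the
    tuning set. Ties go to the smaller weight, i.e. the weaker bias."""
    best_w, best_acc = None, -1.0
    for w in sorted(candidate_weights):
        acc = accuracy_on_tune_set(w)
        if acc > best_acc:  # strict comparison keeps the smaller weight on ties
            best_w, best_acc = w, acc
    return best_w, best_acc

# Invented accuracies for illustration only; in practice the callback would
# classify the hand-labeled tuning products with the given weight.
toy_accuracy = {1: 0.72, 2: 0.75, 5: 0.80, 10: 0.83, 25: 0.83, 50: 0.79}
best_w, best_acc = pick_weight(toy_accuracy, toy_accuracy.get)
print(best_w, best_acc)
```

Preferring the smaller weight on ties is a design choice of this sketch: it keeps the classifier closer to the unbiased model when the tuning set cannot distinguish the candidates.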
Empirical Results
[Figure: % errors (y-axis, 0 to 20) by purity, i.e. number of classes and their distribution (x-axis: 71-22-6, 79-21, 100), for the Standard and Enhanced classifiers.]
Improvement in Accuracy (Pangea)
[Figure: accuracy (y-axis, 65 to 100) versus weight (x-axis: 1, 2, 5, 10, 25, 50, 100, 200) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base.]
Improvement in Accuracy (Reuters)
[Figure: accuracy (y-axis, 82 to 100) versus weight (x-axis: 1, 2, 5, 10, 25, 50, 100, 200) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base.]
Improvement in Accuracy (Google.Outdoors)
[Figure: accuracy (y-axis, 50 to 100) versus weight (x-axis: 1, 5, 25, 100, 400, 1000) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base.]
Tune Set Size (Pangea)
[Figure: accuracy (y-axis, 70 to 95) versus tune set size (x-axis: 0, 5, 10, 20, 35, 50) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base.]
Similar results for Reuters and Google.
Summary
The catalog integration technology can be directly used for creating and evolving large taxonomies
See WWW-2000 paper for experimental results on merging Yahoo and Google categorizations
Naive Bayes Classifier
Estimates the probability of a product belonging to a class:
$$\Pr(\text{class} \mid \text{product}) = \frac{\Pr(\text{class}) \, \Pr(\text{product} \mid \text{class})}{\Pr(\text{product})}$$
Pr(class): # products in class / total products
Pr(product): same for all classes, $\sum_{\text{classes}} \Pr(\text{class}) \, \Pr(\text{product} \mid \text{class})$
How to compute Pr(product | class)?
Naive Bayes Classifier (cont.)
$$\Pr(\text{product} \mid \text{class}) = \Pr(\text{product description} \mid \text{class}) \cdot \Pr(\text{product attributes} \mid \text{class})$$
$$= \prod_{\text{words in description}} \Pr(\text{word} \mid \text{class}) \cdot \prod_{\text{attributes}} \Pr(A_i = v_k \mid \text{class})$$
assumption: words, attributes occur independently
$$\Pr(\text{word} \mid \text{class}) \approx \frac{n + \lambda}{t + \lambda \cdot |\text{Vocabulary}|}$$
n: number of times word occurs in class
t: total number of words in class
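The description part of this model can be sketched in a few lines. The sketch makes several assumptions: the class name `NaiveBayesText` and its methods are hypothetical, the smoothing constant defaults to 1.0, the attribute term is left out, words never seen in training are skipped, and the toy training data is invented. Log probabilities are used so that long descriptions do not underflow.

```python
import math
from collections import Counter

class NaiveBayesText:
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # assumed value of the smoothing constant

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)  # products per class
        self.word_counts = {c: Counter() for c in self.class_counts}
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocabulary = set().union(*self.word_counts.values())
        return self

    def log_word_prob(self, word, cls):
        n = self.word_counts[cls][word]           # occurrences of word in class
        t = sum(self.word_counts[cls].values())   # total words in class
        s = self.smoothing
        return math.log((n + s) / (t + s * len(self.vocabulary)))

    def log_scores(self, doc):
        """log Pr(class) + sum of log Pr(word | class); Pr(product) is dropped
        because it is the same for every class."""
        total = sum(self.class_counts.values())
        scores = {}
        for cls, count in self.class_counts.items():
            score = math.log(count / total)
            for word in doc.lower().split():
                if word in self.vocabulary:  # skip unseen words (assumption)
                    score += self.log_word_prob(word, cls)
            scores[cls] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)

# Invented training data, for illustration only.
docs = ["4M DRAM module", "16M DRAM chip", "quad NAND gate"]
labels = ["Memory", "Memory", "Logic"]
model = NaiveBayesText().fit(docs, labels)
print(model.predict("DRAM chip"))
```

Sorting the classes by `log_scores` instead of taking only the maximum would give a ranked list of choices per part.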
Enhanced Classifier
S: node in new hierarchy
$$\Pr(\text{class} \mid \text{product}, S) \approx \frac{\Pr(\text{class} \mid S) \, \Pr(\text{product} \mid \text{class})}{\Pr(\text{product} \mid S)}$$
Ignore Pr(product | S)
$$\Pr(C_i \mid S) \approx \frac{(|C_i| \cdot N_i)^w}{\sum_j (|C_j| \cdot N_j)^w}$$
N_i: number of products in S predicted to be from C_i
w determines the weight
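The node-specific prior can be sketched as follows, reading the formula as each class's size times its predicted count in S, raised to the power w and normalized over all classes. The function name `enhanced_priors`, its arguments, and the toy numbers are assumptions for illustration.

```python
from collections import Counter

def enhanced_priors(class_sizes, predicted_in_node, w):
    """Pr(C_i | S) proportional to (|C_i| * N_i) ** w, where N_i is the number
    of products in node S that a first classification pass assigned to C_i."""
    counts = Counter(predicted_in_node)
    raw = {c: (size * counts[c]) ** w for c, size in class_sizes.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no product in S was predicted into a known class")
    return {c: value / total for c, value in raw.items()}

# Invented example: 20 products in node S, 19 of them first predicted as DSP.
class_sizes = {"Logic": 100, "DSP": 100}
first_pass = ["DSP"] * 19 + ["Logic"]

priors_w1 = enhanced_priors(class_sizes, first_pass, w=1)
priors_w2 = enhanced_priors(class_sizes, first_pass, w=2)
print(priors_w1, priors_w2)
```

Raising w sharpens the prior toward the class that dominates the node, so a product whose description is ambiguous follows its neighbors in S more strongly. In a second pass these priors would replace Pr(class) when scoring each product in S.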