Merging Taxonomies

Assertion

Creation and maintenance of large ontologies will require the capability to merge taxonomies

This problem is similar to the problem of merging e-commerce catalogs

R. Agrawal, R. Srikant: On Catalog Integration. WWW-10

Catalog Integration Problem

Integrate products from new catalog into master catalog.

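[Figure: two catalog trees. Master Catalog: root ICs with categories Logic, Mem., and DSP, holding products a, b, c, d, e, f. New Catalog: root ICs with categories Cat 1 and Cat 2, holding products x, y, z.]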

The Problem (cont.)

After integration:

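[Figure: the master catalog after integration. Root ICs with categories Logic, Mem., and DSP, now holding products a, b, c, d, e, f together with x, y, z.]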

Desired Solution

Automatically integrate products: little or no effort on the part of the user; domain independent.

Problem size: a million products, thousands of nodes in the hierarchy.

How do we do it

Build classification model (rules) using product descriptions in master catalog.

Example: If the product description contains "DRAM", the product is likely to be in the "Memory" category.

Use classification model to predict categories for products in the new catalog.

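[Figure: product x from the new catalog is scored against the master categories Logic and DSP; the two predictions are labeled 5% and 95%.]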

National Semiconductor Files

Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.

Part: LM3940 1A Low Dropout Regulator
...

Part: Wide Adjustable Range PNP Voltage Regulator
...

Part: LM2940/LM2940C 1A Low Dropout Regulator
...

National Semiconductor Files with Categories

Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
    Choice 1: Transceiver
    Choice 2: Line Receiver
    Choice 3: Line Driver
    Choice 4: General-Purpose Silicon Rectifier
    Choice 5: Tapped Delay Line

Part: LM3940 1A Low Dropout Regulator
Pangea Category:
    Choice 1: Positive Fixed Voltage Regulator
    Choice 2: Voltage-Feedback Operational Amplifier
    Choice 3: Voltage Reference
    Choice 4: Voltage-Mode SMPS Controller
    Choice 5: Positive Adjustable Voltage Regulator

...

...

Accuracy on Pangea Data

B2B Portal for electronic components: 1200 categories, 40K training documents. 500 categories with < 5 documents.

Accuracy: 72% for top choice. 99.7% for top 5 choices.
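The "top choice" and "top 5 choices" figures are top-k accuracies: a prediction counts as correct if the true category appears among the first k ranked choices. A minimal sketch of that measurement follows; the function name, the ranked lists, and the "true" labels are made up for illustration and are not Pangea data.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of items whose true label is among the first k ranked choices."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked choices for three parts (illustrative only).
ranked = [
    ["Transceiver", "Line Receiver", "Line Driver"],
    ["Voltage Reference", "Positive Fixed Voltage Regulator", "Line Driver"],
    ["Line Driver", "Transceiver", "Tapped Delay Line"],
]
# Hypothetical correct categories for the same three parts.
truth = ["Transceiver", "Positive Fixed Voltage Regulator", "Line Receiver"]

top1 = top_k_accuracy(ranked, truth, 1)
top2 = top_k_accuracy(ranked, truth, 2)
print(top1, top2)
```

Top-k accuracy can only stay level or grow as k grows, which is why the top-5 figure is so much higher than the top-choice figure.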

Enhanced Algorithm

Use affinity information in catalog to be integrated: Products in same category are similar. Bias the classifier to incorporate this information.

Accuracy boost depends on quality of current catalog: Use tuning set to determine amount of bias.
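One plausible reading of "use tuning set to determine amount of bias" is a simple search: try several candidate weights and keep the one that scores best on the tuning set. The sketch below assumes that reading. `pick_weight` and `toy_tune_accuracy` are hypothetical names, and the scoring function is a stand-in, not a measured result.

```python
def pick_weight(candidate_weights, tune_accuracy):
    """Return the candidate weight with the highest accuracy on the tuning set."""
    return max(candidate_weights, key=tune_accuracy)


def toy_tune_accuracy(w):
    # Stand-in for "classify the tuning set with weight w and measure accuracy".
    # This toy score simply peaks at w = 10.
    return -abs(w - 10)


# Candidate weights as listed on the Weight axis of the accuracy plots in this deck.
candidates = [1, 2, 5, 10, 25, 50, 100, 200]
best = pick_weight(candidates, toy_tune_accuracy)
print(best)
```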

Algorithm

Extension of the Naive-Bayes classification to incorporate affinity information

Empirical Results

[Figure: % Errors (axis 0 to 20) for the Standard and Enhanced classifiers, plotted against Purity (No. of classes & their distribution) at 71-22-6, 79-21, and 100.]

Improvement in Accuracy (Pangea)

[Figure: Accuracy (axis 65 to 100) against Weight (1, 2, 5, 10, 25, 50, 100, 200). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

Improvement in Accuracy (Reuters)

[Figure: Accuracy (axis 82 to 100) against Weight (1, 2, 5, 10, 25, 50, 100, 200). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

Improvement in Accuracy (Google.Outdoors)

[Figure: Accuracy (axis 50 to 100) against Weight (1, 5, 25, 100, 400, 1000). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

Tune Set Size (Pangea)

[Figure: Accuracy (axis 70 to 95) against Tune Set Size (0, 5, 10, 20, 35, 50). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

Similar results for Reuters and Google.

Summary

The catalog integration technology can be directly used for creating and evolving large taxonomies

See WWW-2000 paper for experimental results on merging Yahoo and Google categorizations

Naive Bayes Classifier

Estimates the probability of a product belonging to a class:

$$\Pr(\text{class} \mid \text{product}) = \frac{\Pr(\text{class}) \cdot \Pr(\text{product} \mid \text{class})}{\Pr(\text{product})}$$

Pr(class): # products in class / total products

Pr(product): same for all classes, $\Pr(\text{product}) = \sum_{\text{classes}} \Pr(\text{class}) \cdot \Pr(\text{product} \mid \text{class})$

How to compute Pr(product | class)?

Naive Bayes Classifier (cont.)

$$\Pr(\text{product} \mid \text{class}) = \Pr(\text{product description} \mid \text{class}) \cdot \Pr(\text{product attributes} \mid \text{class})$$

$$= \prod_{\text{words in description}} \Pr(\text{word} \mid \text{class}) \cdot \prod_{\text{attributes}} \Pr(A_i = v_k \mid \text{class})$$

Assumption: words and attributes occur independently.

$$\Pr(\text{word} \mid \text{class}) = \frac{n + \lambda}{t + \lambda \cdot |\text{Vocabulary}|}$$

n: number of times word occurs in class
t: total number of words in class
λ: smoothing parameter
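The word-based part of this classifier can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it uses only description words (product attributes are left out), it works in log space to avoid underflow, and the function names, the default smoothing value `lam=1.0`, and the toy catalog are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train(docs, lam=1.0):
    """docs: list of (words, class_label) pairs; lam is the smoothing parameter."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return {
        "class_counts": class_counts,  # products per class
        "word_counts": word_counts,    # word occurrences per class
        "vocab_size": len(vocab),
        "total": len(docs),
        "lam": lam,
    }


def log_likelihood(model, words, label):
    """log Pr(product | class), assuming words occur independently."""
    counts = model["word_counts"][label]
    t = sum(counts.values())  # total number of words in the class
    lam, v = model["lam"], model["vocab_size"]
    # Each word contributes (n + lam) / (t + lam * |Vocabulary|).
    return sum(math.log((counts[w] + lam) / (t + lam * v)) for w in words)


def classify(model, words):
    """Return (best class, log scores). Pr(product) is dropped: same for all classes."""
    scores = {}
    for label, n_c in model["class_counts"].items():
        log_prior = math.log(n_c / model["total"])  # Pr(class)
        scores[label] = log_prior + log_likelihood(model, words, label)
    return max(scores, key=scores.get), scores


# Toy master catalog (made-up descriptions, categories from the example tree).
docs = [
    (["dram", "memory", "chip"], "Mem."),
    (["sram", "memory"], "Mem."),
    (["nand", "gate", "logic"], "Logic"),
    (["signal", "processor", "dsp"], "DSP"),
]
model = train(docs)
label, scores = classify(model, ["dram", "chip"])
print(label)
```

With this toy data, a description containing "dram" lands in "Mem.", matching the rule of thumb given earlier for "DRAM" and the "Memory" category.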

Enhanced Classifier

S: node in new hierarchy

$$\Pr(\text{class} \mid \text{product}, S) = \frac{\Pr(\text{class} \mid S) \cdot \Pr(\text{product} \mid \text{class})}{\Pr(\text{product} \mid S)}$$

Ignore Pr(product | S), which is the same for all classes.

$$\Pr(C_i \mid S) = \frac{\left(|C_i| \cdot \text{number of products in } S \text{ predicted to be from } C_i\right)^{w}}{\sum_j \left(|C_j| \cdot \text{number of products in } S \text{ predicted to be from } C_j\right)^{w}}$$

w determines the weight given to the affinity information from the new hierarchy.
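The node-specific prior Pr(C_i | S) is simple to compute once the tentative predictions for S are counted. The sketch below follows the formula as it appears on this slide, with the exponent w applied to the whole product. The function and argument names and the example numbers are hypothetical.

```python
def class_prior_given_node(class_sizes, predicted_counts, w):
    """Pr(C_i | S) = (|C_i| * n_i)^w / sum_j (|C_j| * n_j)^w.

    class_sizes[c]      : |C_c|, number of products already in master class c
    predicted_counts[c] : number of products in S tentatively predicted to be in c
    w                   : weight of the new hierarchy's grouping
    """
    raw = {
        c: (size * predicted_counts.get(c, 0)) ** w
        for c, size in class_sizes.items()
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no product in S was predicted into any class")
    return {c: value / total for c, value in raw.items()}


# Made-up example: three equally sized master classes, ten products in node S.
sizes = {"Logic": 100, "Mem.": 100, "DSP": 100}
predicted = {"Logic": 1, "DSP": 9}

prior_w1 = class_prior_given_node(sizes, predicted, 1)
prior_w2 = class_prior_given_node(sizes, predicted, 2)
print(prior_w1["DSP"], prior_w2["DSP"])
```

Raising w sharpens the prior toward the class that most of the products in S were predicted into. With w = 0 every term becomes 1 (Python evaluates `0 ** 0` as 1), so the prior is uniform over the classes.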

Algorithm Outline

For each node S in the member hierarchy:

i. For each product p in S, tentatively classify p using the standard model.

ii. Use the results of step i to compute Pr(class | S).

iii. Re-classify each product in S using the enhanced model.
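The three steps can be put together for a single node S as follows. This is a sketch under stated assumptions and not the authors' code: it uses description words only, fixes the smoothing parameter at 1.0 by default, and all function names and the toy catalog are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(master, lam=1.0):
    """master: list of (words, class_label) pairs from the master catalog."""
    sizes = Counter(label for _, label in master)
    words_in = defaultdict(Counter)
    for words, label in master:
        words_in[label].update(words)
    vocab = {w for words, _ in master for w in words}
    return sizes, words_in, len(vocab), lam


def log_likelihood(model, words, label):
    """log Pr(product | class) with smoothed word probabilities."""
    _, words_in, v, lam = model
    t = sum(words_in[label].values())
    return sum(math.log((words_in[label][w] + lam) / (t + lam * v)) for w in words)


def classify(model, words, prior):
    """Pick the class maximizing prior[c] * Pr(product | c); skip zero-prior classes."""
    scores = {
        c: math.log(p) + log_likelihood(model, words, c)
        for c, p in prior.items()
        if p > 0
    }
    return max(scores, key=scores.get)


def integrate_node(model, products, w):
    """Classify all products of one node S of the new catalog."""
    if not products:
        return []
    sizes = model[0]
    total = sum(sizes.values())
    base_prior = {c: n / total for c, n in sizes.items()}
    # Step i: tentative classification with the standard model.
    tentative = Counter(classify(model, p, base_prior) for p in products)
    # Step ii: Pr(class | S) from the tentative predictions.
    raw = {c: (sizes[c] * tentative[c]) ** w for c in sizes}
    norm = sum(raw.values())
    node_prior = {c: r / norm for c, r in raw.items()}
    # Step iii: re-classify with Pr(class | S) in place of Pr(class).
    return [classify(model, p, node_prior) for p in products]


# Toy master catalog and one node S of a new catalog (made-up data).
master = [
    (["dram", "memory"], "Mem."),
    (["sram", "memory"], "Mem."),
    (["gate", "logic"], "Logic"),
    (["buffer", "logic"], "Logic"),
]
node_s = [["dram", "memory"], ["sram", "memory"], ["dram"], ["buffer"]]

model = train(master)
standard = integrate_node(model, node_s, 0)  # w = 0: uniform Pr(class | S)
enhanced = integrate_node(model, node_s, 1)
print(standard, enhanced)
```

In this toy run the last product, described only as "buffer", goes to "Logic" when the grouping in S is ignored, but moves to "Mem." once the other products in S pull the prior toward that class. That is the affinity bias the enhanced classifier is meant to add.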