Castanet: Using WordNet to Build Facet Hierarchies Emilia Stoica and Marti Hearst School of Information, Berkeley

Date posted: 19-Dec-2015

Page 1

Castanet: Using WordNet to Build Facet Hierarchies

Emilia Stoica and Marti Hearst
School of Information, Berkeley

Page 2

Focus: Search and Navigation of Large Collections

- Image Collections
- E-Government Sites
- Shopping Sites
- Digital Libraries

Page 3

Problems with Site Search

Study by Vividence in 2001 on 69 sites
- 70% eCommerce
- 31% Service
- 21% Content
- 2% Community

Poorly organized search results
- Frustration and wasted time

Poor information architecture
- Confusion
- Dead ends
- "Back and forthing"
- Forced to search

Page 4

The Problem With Hierarchy

- Most things can be classified in more than one way.
- Most organizational systems do not handle this well.

Example: Animal Classification

[Figure: the animals otter, penguin, robin, salmon, wolf, cobra, bat (and seal) grouped three different ways, by Skin Covering, by Locomotion, and by Diet. Groupings shown: (robin, bat, wolf), (penguin, otter, seal), (salmon); (robin, bat), (salmon), (wolf, cobra), (otter, penguin, seal); (robin, penguin), (salmon, cobra), (bat, otter, wolf).]

Page 5

The Problem With Hierarchy

[Figure: a single combined tree rooted at "start" whose levels are skin covering (fur, scales, feathers), locomotion (swim, fly, run, slither), and diet (fish, rodents, insects); the lower-level labels repeat under every branch. Leaves shown: salmon, bat, robin, wolf.]

Page 6

The Idea of Facets

Facets are a way of labeling data
- A kind of metadata (data about data)
- Can be thought of as properties of items

Facets vs. Categories
- Items are placed INTO a category system
- Multiple facet labels are ASSIGNED TO items

Page 7

The Idea of Facets

Hot and Sweet Chicken: 1 pepper, 2 apricots, 1 pound chicken breast, 1 Tbsp gingerroot

- Meat: chicken
- Vegetables: pepper
- Fruit: apricot
- Flavor: gingerroot

Page 8

Using Facets

Now there are multiple ways to get to each item

- Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
- Desserts: Cakes, Cookies, Dairy, Ice Cream, Sherbet, Flan
- Fruits: Cherries, Berries, Blueberries, Strawberries, Bananas, Pineapple

Labels on the first example item:
- Fruit > Pineapple
- Dessert > Cake
- Preparation > Bake

Labels on the second example item:
- Dessert > Dairy > Sherbet
- Fruit > Berries > Strawberries
- Preparation > Freeze
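
The "multiple ways to get to each item" idea can be sketched in a few lines of Python. This is only an illustration, not the Flamenco implementation: the item names "pineapple cake" and "strawberry sherbet" and the helper functions `build_index` and `find` are hypothetical; only the facet paths come from the slide.

```python
from collections import defaultdict

# Hypothetical item names; the facet paths are the ones shown on the slide.
ITEMS = {
    "pineapple cake": [
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "strawberry sherbet": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}

def build_index(items):
    """Map every prefix of every facet path to the items labeled with it."""
    index = defaultdict(set)
    for name, paths in items.items():
        for path in paths:
            for depth in range(1, len(path) + 1):
                index[path[:depth]].add(name)
    return index

def find(index, *paths):
    """Return the items that carry all of the given facet paths."""
    matches = [index.get(tuple(p), set()) for p in paths]
    return set.intersection(*matches) if matches else set()

INDEX = build_index(ITEMS)
```

Because every prefix of a path is indexed, a user can start from any facet (for example `find(INDEX, ("Fruit",))`) and narrow down with any other facet in any order.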

Page 9

Castanet

- Semi-automatic algorithm for creating hierarchical faceted metadata
- Carves out a structure from the hypernym (IS-A) relations within WordNet
- Produces surprisingly good results for a wide range of subjects, e.g., arts, medicine, recipes, math, news, bibliographical records

Page 10

WordNet Challenges

A word may have more than one sense

- Fine granularity of word sense distinctions
  e.g., newspaper (#1) - daily publication on folded sheets
        newspaper (#3) - physical object

- Ambiguity for the same sense
  e.g., tuna #1: cactus
        tuna #2: fish (food fish; bony fish)

Page 11

WordNet Challenges (cont.)

The hypernym path may be quite long (e.g., sense #3 of tuna has 14 nodes)

Sparse coverage of proper names and noun phrases (not addressed)

Page 12

Algorithm Goals

- Build a set of facet hierarchies
- Balance depth and breadth
  - Avoid “skinny” paths
  - Don’t go too deep or too broad
- Choose understandable labels
- Disambiguate words
  - Currently a word can take on only one sense

Page 13

Our Approach

[Figure: processing pipeline with the boxes Documents, Select terms, Build core tree, Augment core tree, Compress tree, Remove top-level categories, and Divide into facets; WordNet is shown as a second input.]

Page 14

1. Select Terms

Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold (default: top 10%)
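
A minimal sketch of the term-selection step, under stated assumptions: "distribution" is read here as document frequency (the number of documents containing the term), tokenization is plain whitespace splitting, and the stopword list is a tiny illustrative one. The function name `select_terms` and its parameters are hypothetical, not taken from the authors' code.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the authors' list).
STOPWORDS = {"a", "an", "and", "of", "the", "with"}

def select_terms(docs, top_fraction=0.10, stopwords=STOPWORDS):
    """Return the top fraction of non-stopword terms ranked by document frequency."""
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document.
        doc_freq.update(set(text.lower().split()) - stopwords)
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    keep = max(1, math.ceil(top_fraction * len(ranked))) if ranked else 0
    return {t: doc_freq[t] for t in ranked[:keep]}
```

The returned counts are kept because the next step adds the number of documents containing each term to every node on that term's hypernym path.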

Page 15

2. Build Core Tree

Get hypernym path if term:
- has only one sense, or
- matches a pre-selected WordNet domain

Adding a new term increases a count at each node on its path by # of docs with the term.

Build a “backbone”
- Create paths from unambiguous terms only
- Bias the structure towards appropriate senses of words

Example hypernym paths:
- sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
- sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet

Page 16

2. Build Core Tree (cont.)

Merge hypernym paths to build a tree

The sundae and sherbet paths share their first five nodes, so they merge into:

entity
  substance, matter
    nutriment
      dessert
        frozen dessert
          ice cream sundae
            sundae
          sherbet, sorbet
            sherbet
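
The merge of hypernym paths, together with the per-node document counts, can be sketched as a simple prefix tree. This is a sketch under assumptions rather than the authors' code: the `Node` class and `add_path` function are hypothetical names, the two paths are copied from the slide instead of being looked up in WordNet, and the document counts (5 and 3) are made up for illustration.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.count = 0        # number of documents accumulated at this node
        self.children = {}    # child label -> Node

def add_path(root, path, n_docs):
    """Merge one hypernym path into the tree, adding n_docs to each node on it."""
    if path[0] != root.label:
        raise ValueError("path must start at the root label")
    node = root
    node.count += n_docs
    for label in path[1:]:
        # Reuse an existing child so that shared prefixes are merged.
        node = node.children.setdefault(label, Node(label))
        node.count += n_docs
    return root

SUNDAE = ["entity", "substance, matter", "nutriment", "dessert",
          "frozen dessert", "ice cream sundae", "sundae"]
SHERBET = ["entity", "substance, matter", "nutriment", "dessert",
           "frozen dessert", "sherbet, sorbet", "sherbet"]

core = Node("entity")
add_path(core, SUNDAE, n_docs=5)    # made-up document count
add_path(core, SHERBET, n_docs=3)   # made-up document count
```

After both calls the shared nodes down to "frozen dessert" hold the sum of the two counts, and the tree branches only where the paths diverge.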

Page 17

3. Augment Core Tree

- Attach to the core tree the terms with more than one sense
- Favor the more common path over other alternatives

Page 18

Augment Core Tree (cont.)

- Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date
- Date (p2): abstraction > measure, quantity > fundamental quality > time period > calendar day (18) > date

Choose path p1 (edible fruit, 78 items) since it has more items assigned
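
One way to read "favor the more common path" is sketched below. It is an assumption, not the authors' scoring rule: each candidate path is scored by the item count already assigned to the term's direct parent in the core tree (78 for "edible fruit", 18 for "calendar day" on the slide). The function `choose_path` and the `core_counts` dictionary are hypothetical names.

```python
def choose_path(candidate_paths, core_counts):
    """Pick the candidate whose parent node holds the most items in the core tree."""
    def parent_count(path):
        # Paths whose parent is not in the core tree score 0.
        return core_counts.get(tuple(path[:-1]), 0)
    return max(candidate_paths, key=parent_count)

# The two hypernym paths for "date", as listed on the slide.
P1 = ["entity", "substance, matter", "food, nutrient", "nutriment",
      "food", "edible fruit", "date"]
P2 = ["abstraction", "measure, quantity", "fundamental quality",
      "time period", "calendar day", "date"]

# Item counts at the parent nodes, taken from the slide (78 and 18).
core_counts = {tuple(P1[:-1]): 78, tuple(P2[:-1]): 18}

best = choose_path([P2, P1], core_counts)
```

With the slide's counts, `best` is the food path (p1), matching the choice shown in the figure.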

Page 19

Optional Step: Domains

To disambiguate, use Domains
- WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini 2000
  - Assigns a domain to every noun synset

Procedure
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use or may add own
- Paths for terms that match the selected domains are added to the core tree

Page 20

Using Domains

dip glosses:
- Sense 1: A depression in an otherwise level surface
- Sense 2: The angle that a magnet needle makes with horizon
- Sense 3: Tasty mixture into which bite-size foods are dipped

dip hypernyms:
- Sense 1: solid => concave shape => depression
- Sense 2: shape, form => space => angle
- Sense 3: food => ingredient, fixings => flavorer

Given domain “food”, choose sense 3
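
The "dip" example can be sketched as a lookup over the hypernym chains listed on the slide. This is a simplification and an assumption: the slides describe domains as labels assigned to synsets, whereas this sketch simply checks whether the domain name occurs as a label in a sense's hypernym chain. `DIP_HYPERNYMS` and `senses_in_domain` are hypothetical names.

```python
# Hypernym chains for the three senses of "dip", copied from the slide.
DIP_HYPERNYMS = {
    1: ["solid", "concave shape", "depression"],
    2: ["shape, form", "space", "angle"],
    3: ["food", "ingredient, fixings", "flavorer"],
}

def senses_in_domain(hypernyms_by_sense, domain):
    """Return the sense numbers whose hypernym chain contains the domain label."""
    return [
        sense
        for sense, chain in sorted(hypernyms_by_sense.items())
        # A node such as "shape, form" lists synonyms; compare against each one.
        if any(domain in [part.strip() for part in label.split(",")]
               for label in chain)
    ]
```

A term is treated as disambiguated only when exactly one sense is returned; an empty or multi-sense result would leave the term for the augmentation step.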

Page 21

4. Compress Tree

Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*maxdist

Before:

dessert
  frozen dessert
    ice cream sundae
      sundae
    sherbet, sorbet
      sherbet
    parfait

After:

dessert
  frozen dessert
    sundae
    parfait
    sherbet
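
Rule 1 can be sketched as a bottom-up pass over the tree. The sketch makes several assumptions that the slide does not spell out: an eliminated parent's children are re-attached to its own parent, leaves are never treated as parents, `maxdist` is supplied by the caller, and the `dist` values in the example are made up. `Node` and `compress` are hypothetical names.

```python
class Node:
    def __init__(self, label, dist=0, children=None):
        self.label = label
        self.dist = dist                  # number of documents under this node
        self.children = children or []

def compress(node, k, maxdist, is_root=True):
    """Apply Rule 1 bottom-up; return the list of nodes that replace `node`."""
    kids = []
    for child in node.children:
        kids.extend(compress(child, k, maxdist, is_root=False))
    node.children = kids
    # A node is removed if it is a non-root parent with fewer than k children
    # and its distribution is not larger than 0.1 * maxdist.
    removable = (not is_root and 0 < len(kids) < k
                 and node.dist <= 0.1 * maxdist)
    return kids if removable else [node]

# The "before" tree from the slide, with made-up distributions.
tree = Node("dessert", 100, [
    Node("frozen dessert", 40, [
        Node("ice cream sundae", 5, [Node("sundae", 5)]),
        Node("sherbet, sorbet", 3, [Node("sherbet", 3)]),
        Node("parfait", 2),
    ]),
])

compress(tree, k=2, maxdist=100)
```

With k = 2, "ice cream sundae" and "sherbet, sorbet" each have a single child and a small distribution, so they are removed and their children move up under "frozen dessert".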

Page 22

4. Compress Tree (cont.)

Rule 2: Eliminate a child whose name appears within the parent’s name

Before:

dessert
  frozen dessert
    sundae
    parfait
    sherbet

After:

dessert
  sundae
  parfait
  sherbet

Page 23

5. Divide into Facets

Page 24

5. Divide into Facets (Remove top levels)

[Figure: before and after trees. Upper-level labels in the "before" tree: entity; substance, matter; food, nutriment; ingredient, fixings; food stuff, food product. Lower-level labels: flavorer, sweetening, herb, with leaves sugar, syrup, parsley, oregano. The "after" tree keeps only the lower levels.]

Rule 1: Manually eliminate the top t levels (t = 4 for recipe collection).

Rule 2: For each resulting tree, test if it has more than n children (n = 2)
- If yes, stop.
- If not, delete the root and test again.
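
The two rules above can be sketched on a nested-dictionary tree. This is a sketch under assumptions, not the authors' implementation: a tree is a one-key dict mapping the root label to a dict of children, a node with no children is kept rather than deleted, and the example tree is hypothetical ("basil" and "honey" are added only so that some nodes have more than two children). The function names are also hypothetical.

```python
def remove_top_levels(tree, t):
    """Rule 1: drop the top t levels of a tree and return the forest below them."""
    forest = [tree]
    for _ in range(t):
        forest = [{label: sub}
                  for node in forest
                  for children in node.values()
                  for label, sub in children.items()]
    return forest

def split_facet(tree, n):
    """Rule 2: keep a tree whose root has more than n children, else delete the root."""
    children = next(iter(tree.values()))
    if len(children) > n or not children:   # assumption: childless nodes are kept
        return [tree]
    result = []
    for label, sub in children.items():
        result.extend(split_facet({label: sub}, n))
    return result

def divide_into_facets(tree, t, n):
    """Apply Rule 1, then Rule 2 to every resulting tree."""
    return [facet
            for top in remove_top_levels(tree, t)
            for facet in split_facet(top, n)]

# Hypothetical example tree, loosely modeled on the slide's labels.
TREE = {"entity": {"food": {"flavorer": {
    "herb": {"parsley": {}, "oregano": {}, "basil": {}},
    "sweetening": {"sugar": {}, "syrup": {}, "honey": {}},
}}}}

facets = divide_into_facets(TREE, t=2, n=2)
```

Here removing two levels leaves the tree rooted at "flavorer"; it has only two children, so its root is deleted, and the two resulting trees ("herb" and "sweetening") each have more than two children and become facets.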

Page 25

Example: Recipes (3500 docs)

Page 26

Castanet Output (shown in Flamenco)

Page 27

Castanet Output

Page 28

Castanet Output

Page 29

Castanet Output

Page 30

Castanet Output

Page 31

Page 32

Castanet Evaluation

- This is a tool for information architects, so people of this type did the evaluation
- We compared output on:
  - Recipes
  - Biomedical journal titles
- We compared to two state-of-the-art algorithms:
  - LDA (Blei et al. 04)
  - Subsumption (Sanderson & Croft ’99)

Page 33

Subsumption Output

Page 34

Subsumption Output

Page 35

Subsumption Output

Page 36

Subsumption Output

Page 37

LDA Output

Page 38

LDA Output

Page 39

LDA Output

Page 40

Evaluation Method

Information architects assessed the category systems

For each of 2 systems’ output:
- Examined and commented on top-level
- Examined and commented on two sub-levels

Then comment on overall properties
- Meaningful?
- Systematic?
- Likely to use in your work?

Page 41

Evaluation (cont.)

Sample questions for top level categories:
- Would you add/remove/rename any category?
- Did this category match your expectations?

Sample questions for a specific category:
- Would you add/move/remove any sub-categories?
- Would you promote any sub-category to top level?

General questions:
- Would you use Castanet?
- Would you use LDA?
- Would you use Subsumption?
- Would you use list of most frequent terms?

Page 42

Evaluation Results

Results on recipes collection for “Would you use this system in your work?”
Number answering “Yes in some cases” or “yes, definitely”:
- Castanet: 29/34
- LDA: 0/18
- Subsumption: 6/16
- Baseline: 25/34

Average response to questions about quality (4 = “strongly agree”)

Page 43

Evaluation Results

Average responses for top-level categories 4= no changes, 1 = change many

Average responses for 2 subcategories

Page 44

Needed Improvements

- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve algorithm for assigning documents to categories

Page 45

Opportunities for Tagging

New opportunity: Tagging, folksonomies (flickr, del.icio.us)
- People are creating facets in a decentralized manner
- They are assigning multiple facets to items
- This is done on a massive scale
- This leads naturally to meaningful associations

Page 46

Conclusions

- Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet.
- The method has been tested on various domains: medicine, recipes, math, news, arts, bibliographical records
- Usability study shows:
  - Castanet is preferred to other state-of-the-art solutions.
  - Information architects want to use the tool in their work.

Page 47

Learn More

Funding
- This work supported in part by NSF (IIS-9984741)

For more information:
- Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007
- See http://flamenco.berkeley.edu

Page 48

Motivation

Want to assign labels from multiple hierarchies

Page 49

The Problem with Hierarchy

- Inflexible
  - Force the user to start with a particular category
  - What if I don’t know the animal’s diet, but the interface makes me start with that category?
- Wasteful
  - Have to repeat combinations of categories
  - Makes for extra clicking and extra coding
- Difficult to modify
  - To add a new category type, must duplicate it everywhere or change things everywhere