A REFERENCE-SET APPROACH TO INFORMATION EXTRACTION...

A REFERENCE-SET APPROACH TO INFORMATION EXTRACTION FROMUNSTRUCTURED, UNGRAMMATICAL DATA SOURCES

by

Matthew Michelson

A Dissertation Presented to theFACULTY OF THE GRADUATE SCHOOL

UNIVERSITY OF SOUTHERN CALIFORNIAIn Partial Fulfillment of theRequirements for the Degree

DOCTOR OF PHILOSOPHY(COMPUTER SCIENCE)

May 2009

Copyright 2009 Matthew Michelson

Dedication

To Sara, whose patience and support

never waiver.

ii

Acknowledgments

First and foremost, thanks to Sara, my wife. Her love, patience, and support made this

thesis possible.

Of course, many thanks to my family for their patience. I know it took almost 90%

of my life so far to finish school, but I promise this is my last degree! (Unless...)

I would especially like to thank my thesis adviser, Craig Knoblock, who taught me the

mechanics of research, the elements of style and grammar, and the patience required to

find the right solution. Craig was (and still is) always free to talk to me when I needed it,

especially when I felt like I was floundering in the muck of trial-and-error (usually more

error, less trial). Most importantly, he gave me enough freedom to try my own solutions,

but also helped reign my ideas back to practicality. I am grateful for all his help.

Many thanks to the rest of my thesis committee: Kevin Knight, Cyrus Shahabi and

Daniel O’Leary. My committee helped me focus my thesis during my qualification exam

with their comments and questions. Their patience and kind demeanor made the practical

aspects of my defense, such as scheduling, painless!

To my ISI-mates, both past (Marty, Snehal, Pipe, Mark, and Wesley) and present

(Yao-yi, Anon, and Amon), thanks for all the laughs, games, Web-based time wasters,

iii

and general harassment, without which the life of a graduate student would feel more like

“real work!”

This research is based upon work supported in part by the National Science Foun-

dation under award number IIS-0324955; in part by the Air Force Office of Scientific

Research under grant number FA9550-07-1-0416; and in part by the Defense Advanced

Research Projects Agency (DARPA) under Contract No. FA8750-07-D-0185/0004.

The United States Government is authorized to reproduce and distribute reports for

Governmental purposes not withstanding any copyright annotation thereon. The views

and conclusions contained herein are those of the authors and should not be interpreted as

necessarily representing the official policies or endorsements, either expressed or implied,

of any of the above organizations or any person connected with them.

iv

Table of Contents

Dedication ii

Acknowledgments iii

List Of Tables vii

List Of Figures x

Abstract xi

Chapter 1: Introduction 11.1 Motivation and Problem Statement . . . . . . . . . . . . . . . . . . . . . . 11.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.4 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 91.5 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Chapter 2: An Automatic Approach to Extraction from Posts 112.1 Automatically Choosing the Reference Sets . . . . . . . . . . . . . . . . . 122.2 Matching Posts to the Reference Set . . . . . . . . . . . . . . . . . . . . . 152.3 Unsupervised Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.4.1 Results: Selecting reference set(s) . . . . . . . . . . . . . . . . . . . 212.4.2 Results: Matching posts to the reference set . . . . . . . . . . . . . 302.4.3 Results: Extraction using reference sets . . . . . . . . . . . . . . . 37

Chapter 3: Automatically Constructing Reference Sets for Extraction 413.1 Reference Set Construction . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603.3 Applicability of the ILA Method . . . . . . . . . . . . . . . . . . . . . . . 67

3.3.1 Reference-Set Properties . . . . . . . . . . . . . . . . . . . . . . . . 68

Chapter 4: A Machine Learning Approach to Reference-Set Based Extrac-tion 814.1 Aligning Posts to a Reference Set . . . . . . . . . . . . . . . . . . . . . . . 83

v

4.1.1 Generating Candidates by LearningBlocking Schemes for Record Linkage . . . . . . . . . . . . . . . . 84

4.1.2 The Matching Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 934.2 Extracting Data from Posts . . . . . . . . . . . . . . . . . . . . . . . . . . 1024.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.3.1 Record Linkage Results . . . . . . . . . . . . . . . . . . . . . . . . 1104.3.1.1 Blocking Results . . . . . . . . . . . . . . . . . . . . . . . 1104.3.1.2 Matching Results . . . . . . . . . . . . . . . . . . . . . . 117

4.3.2 Extraction Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

Chapter 5: Related Work 130

Chapter 6: Conclusions 138

Bibliography 143

vi

List Of Tables

1.1 Three posts for Honda Civics from Craig’s List . . . . . . . . . . . . . . . 3

1.2 Methods for extraction/annotation under various assumptions . . . . . . . 8

2.1 Automatically choosing a reference set . . . . . . . . . . . . . . . . . . . . . . 14

2.2 Vector-space approach to finding attributes in agreement . . . . . . . . . . . . 17

2.3 Cleaning an extracted attribute . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4 Reference Set Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5 Posts Set Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.6 Using Jensen-Shannon distance as similarity measure . . . . . . . . . . . . . . 24

2.7 Using TF/IDF as the similarity measure . . . . . . . . . . . . . . . . . . . . . 26

2.8 Using Jaccard as the similarity measure . . . . . . . . . . . . . . . . . . . . . 27

2.9 Using Jaro-Winkler TF/IDF as the similarity measure . . . . . . . . . . . . . . 28

2.10 A summary of each method choosing reference sets . . . . . . . . . . . . . . . 28

2.11 Annotation results using modified Dice similarity . . . . . . . . . . . . . . . . 32

2.12 Annotation results using modified Jaccard similarity . . . . . . . . . . . . . . . 33

2.13 Annotation results using modified TF/IDF similarity . . . . . . . . . . . . . . 35

2.14 Modified Dice using Jaro-Winkler versus Smith-Waterman . . . . . . . . . . . . 37

2.15 Extraction results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

vii

3.1 Constructing a reference set . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.2 Posts with a general trim token: LE . . . . . . . . . . . . . . . . . . . . . 48

3.3 Locking attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.4 ILA method for building reference sets . . . . . . . . . . . . . . . . . . . . 54

3.5 Reference set constructed by flattening the hierarchy in Figure 3.4 . . . . 57



3.8 Post Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.9 Field-level extraction results . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3.10 Manually constructed reference-sets . . . . . . . . . . . . . . . . . . . . . . . 64

3.11 Comparing field-level results . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.12 Comparing the ILA method . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.13 Two domains where manual reference sets outperform . . . . . . . . . . . 68

3.14 Results on the BFT Posts and eBay Comics domains . . . . . . . . . . . . 69

3.15 Bigram types that describe domain characteristics . . . . . . . . . . . . . 71

3.16 Number of bootstrapped posts . . . . . . . . . . . . . . . . . . . . . . . . 73

3.17 K-L divergence between domains using average distributions . . . . . . . . 77

3.18 Percent of trials where domains match (under K-L thresh) . . . . . . . . . 78

3.19 Method to determine whether to automatically construct a reference set. . 79

3.20 Two domains for testing the Bootstrap-Compare method . . . . . . . . . 79

3.21 Extraction results using ILA on Digicams and Cora . . . . . . . . . . . . . 80

4.1 Modified Sequential Covering Algorithm . . . . . . . . . . . . . . . . . . . 87

4.2 Learning a conjunction of {method, attribute} pairs . . . . . . . . . . . . 89

viii

4.3 Learning a conjunction of {method, attribute} pairs using weights . . . . 92

4.4 Blocking results using the BSL algorithm (amount of data used for trainingshown in parentheses). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

4.5 Some example blocking schemes learned for each of the domains. . . . . . 113

4.6 A comparison of BSL covering all training examples, and covering 95% ofthe training examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4.7 Record linkage results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

4.8 Matching using only the concatenation . . . . . . . . . . . . . . . . . . . . 120

4.9 Using all string metrics versus using only the Jensen-Shannon distance . . 122

4.10 Record linkage results with and without binary rescoring . . . . . . . . . . 124

4.11 Field level extraction results: BFT domain . . . . . . . . . . . . . . . . . 126

4.12 Field level extraction results: eBay Comics domain. . . . . . . . . . . . . . 127

4.13 Field level extraction results: Craig’s Cars domain. . . . . . . . . . . . . . 128

4.14 Summary results for extraction showing the number of times each systemhad highest F-Measure for an attribute. . . . . . . . . . . . . . . . . . . . 128

ix

List Of Figures

1.1 Fitting posts into information integration . . . . . . . . . . . . . . . . . . 2

1.2 Framework for reference-set based extraction and annotation . . . . . . . 6

2.1 Unsupervised extraction with reference sets . . . . . . . . . . . . . . . . . . . 21

3.1 Hierarchies to reference sets . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2 Creating a reference set from posts . . . . . . . . . . . . . . . . . . . . . . 45

3.3 Locking car attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.4 Hierarchy constructed by ILA from car classified ads . . . . . . . . . . . . 55

3.5 Hierarchy constructed by ILA from laptop classified ads . . . . . . . . . . 56

3.6 Hierarchy constructed by ILA from ski auction listings . . . . . . . . . . . 56

3.7 Average distribution values for each domain using 10 random posts . . . . 74

4.1 The traditional record linkage problem . . . . . . . . . . . . . . . . . . . . 94

4.2 The problem of matching a post to the reference set . . . . . . . . . . . . 94

4.3 Two records with equal record level but different field level similarities . . 95

4.4 The full vector of similarity scores used for record linkage . . . . . . . . . 100

4.5 Our approach to matching posts to records from a reference set . . . . . . 103

4.6 Extraction process for attributes . . . . . . . . . . . . . . . . . . . . . . . 104

4.7 Algorithm to clean an extracted attribute . . . . . . . . . . . . . . . . . . 108

x

Abstract

This thesis investigates information extraction from unstructured, ungrammatical text on

the Web such as classified ads, auction listings, and forum postings. Since the data is un-

structured and ungrammatical, this information extraction precludes the use of rule-based

methods that rely on consistent structures within the text or natural language processing

techniques that rely on grammar. Instead, I describe extraction using a “reference set,”

which I define as a collection of known entities and their attributes. A reference set can be

constructed from structured sources, such as databases, or scraped from semi-structured

sources such as collections of Web pages. In some cases, as I shown in this thesis, a ref-

erence set can even be constructed automatically from the unstructured, ungrammatical

text itself. This thesis presents methods to exploit reference sets for extraction using

both automatic techniques and machine learning techniques. The automatic technique

provides a scalable and accurate approach to extraction from unstructured, ungrammat-

ical text. The machine learning approach provides even higher accuracy extractions and

deals with ambiguous extractions, although at the cost of requiring human effort to label

training data. The results demonstrate that reference-set based extraction outperforms

the current state-of-the-art systems that rely on structural or grammatical clues, which

is not appropriate for unstructured, ungrammatical text. Even the fully automatic case,

xi

which constructs its own reference set for automatic extraction, is competitive with the

current state-of-the-art techniques that require labeled data. Reference-set based extrac-

tion from unstructured, ungrammatical text allows for a whole category of sources to

be queried, allowing for their inclusion in data integration systems that were previously

limited to structured and semi-structured sources.

xii

Chapter 1

Introduction

This chapter introduces the issue of information extraction from unstructured, ungram-

matical text on the Web. In this chapter, I first motivate and define the problem explicitly.

Then, I briefly outline my approach to the problem and describe the general framework

for reference-set based information-extraction. I then describe my contributions to the

field of information extraction, and then present the outline of this thesis.

1.1 Motivation and Problem Statement

As more and more information comes online, the ability to process and understand this

information becomes more and more crucial. Data integration attacks this problem by

letting users query heterogeneous data sources within a unified query framework, com-

bining the results to ease understanding. However, while data integration can integrate

data from structured sources such as databases, semi-structured sources such as that

extracted from Web pages, and even Web Services [58], this leaves out a large class of

useful information: unstructured and ungrammatical data sources. I call such unstruc-

tured, ungrammatical data “posts.” Posts range in source from classified ads to auction

1

listings to forum postings to blog titles to paper references. The goal of this thesis

is to structure sources of posts, such that they can be queried and included

in data integration systems. Figure 1.1 shows an example integration system that

combines a database of car safety information, semi-structured Web pages of car reviews,

and posts of car classifieds. The box labeled “THESIS” is the mechanism by which to tie

the posts of classified ads into the integration system via structured queries, and this is

the motivation behind my research.

Figure 1.1: Fitting posts into information integration

Posts are full of useful information, as defined by the attributes that compose the

entity within the post. For example, consider the posts about cars from the online

classified service Craigslist,1 shown in Table 1.1. Each used car for sale is composed of

attributes that define this car; and if we could access the individual attributes we could

include such sources in data integration systems, such as that of Figure 1.1, and answer1www.craigslist.org

2

interesting queries such as “return the classified ads and car reviews for used cars that are

made by Honda and have a 5 star crash rating.” Such a query might requires combining

the structured database of safety ratings with the posts of the classified ads and the car

review websites. Clearly, there is a case to be made for the utility of structured queries

such as joins and aggregate selections of posts.

Craig’s List Post93 civic 5speed runs great obo (ri) $180093- 4dr Honda Civc LX Stick Shift $180094 DEL SOL Si Vtec (Glendale) $3000

Table 1.1: Three posts for Honda Civics from Craig’s List

However, currently accessing the data within posts does not go much beyond keyword

search. This is precisely because the ungrammatical, unstructured nature of posts makes

extraction difficult, so the attributes remain embedded within the posts. These data

sources are ungrammatical, since they do not conform to the proper rules of written

language (English in this case). Therefore, Natural Language Processing (NLP) based

information extraction techniques are not appropriate [57; 9]. Further, the posts are

unstructured since the structure can vary vastly between each listing. (That is, we can

not assume that the ordering of the attributes across the posts are preserved.) So, wrapper

based extraction techniques will not work either [49; 19] since the assumption that the

structure can be exploited does not hold. Lastly, even if one can extract the data from

within posts, you would need to assure that the extracted values map to the same value

for accurate querying. Table 1.1 shows three example posts about Honda Civics, but each

refers to the model differently. The first post mentions a “civic,” the next writes “Civc”

which is a spelling mistake, and the last calls the model “DEL SOL.” Yet, if we do a

3

query for all “civic” cars, all of these posts should be returned. Therefore, not only is it

important to extract the values from the posts, but we should clean the values in that all

values for a given attribute should be transformed to a single, uniform value. This thesis

presents an extraction methodology for posts that handles both the difficulties imposed

by the lack of grammar and structure, and also can assign a single, uniform value for an

attribute (called semantic annotation).

1.2 Proposed Approach

As stated, the method of extraction presented in this thesis involves exploiting structured,

relational data to aid the extraction, rather than relying on NLP or structure based meth-

ods. I call this structured data a “reference set.” A reference set consists of collections

of known entities with the associated, common attributes. Examples of reference sets in-

clude online (or offline) reference documents, such as the CIA World Fact Book;2 online

(or offline) databases, such as the Comics Price Guide;3 and with the Semantic Web one

can envision building reference sets from the numerous ontologies that already exist.

These are all cases of manually constructed reference sets since they require user

effort to construct the database, scrape the data off of Web pages, or create the ontology.

As this thesis shows, however, in many cases we can automatically generate a reference

set from the posts themselves, which can then be used for extraction. Nonetheless,

manual reference sets provide a stronger intuition into the strengths of reference-set based

information extraction.2http://www.cia.gov/cia/publications/factbook/3www.comicspriceguide.com

4

Using the car example, a manual reference set might be the Edmunds car buying

guide,4 which defines a schema for cars (the set of attributes such as make and trim)

as well as standard values for attributes themselves. Manually constructing reference

sets from Web sources such as the Edmunds car buying guide is done using wrapper

technologies such as Agent Builder5 to scrape data from the Web source using the schema

the source defines for the car.

Regardless of whether the reference set is constructed manually or automatically, the

high-level procedure for exploiting a reference set for extraction is the same. To use the

reference sets, the methods in this thesis first find the best match between each post and

the tuples of the reference set. By matching a post to a member of the reference set, the

algorithm can use the reference set’s schema and attribute values as semantic annotation

for the post (a single, uniform value for that attribute). Further, and importantly, these

reference set attributes provide clues for the information extraction step.

After the matching step, the methods perform information extraction to extract the

actual values in the post that match the schema elements defined by the match from the

reference set. This is the information extraction step. During the information extraction

step, the parts of the post are extracted that best match the attribute values from the

reference set member chosen during the matching step.

Figure 1.2 presents the basic methodology for reference-set based extraction, using

the second post of Table 1.1 as an example, and assuming it matches the reference

set tuple with the attributes {MAKE=HONDA, MODEL=CIVIC, TRIM=2 DR LX,

YEAR=1993}. As shown, first, the algorithm selects the best match from a reference set.4www.edmunds.com5A product of Fetch Technologies http://www.fetch.com/products.asp

5

Next, it exploits the match to aid information extraction. While this basic framework

stays consistent across the different methods described in this thesis, the details vary in

accordance with the different algorithms.

Figure 1.2: Framework for reference-set based extraction and annotation

This thesis presents three approaches to reference-set based extraction that are each

suitable depending on certain circumstances. The main algorithm of this thesis, presented

in Chapter 2 automatically matches a post to a reference set, and then uses the results

of this automatic matching to aid automatic extraction. This method assumes that there

exists a repository of manually created reference sets from which the system itself can

choose the appropriate one(s) to for matching/extraction. By using a repository, this

may lessen the burden on users who do not need to create new reference sets each time,

but can reuse previously created ones.

6

However, requiring a manually created reference set is not always even necessary. As

I show in Chapter 3, in many cases the algorithm I describe can construct the reference

set automatically, directly from the posts. It can then use the automatically constructed

reference set using the automatic matching and extraction techniques of Chapter 2.

There are two cases where a manually constructed reference set is preferred over

one that is automatically constructed. First, it’s possible that a user requires a certain

attribute to be added as a semantic annotation to a post in support of future, structured

queries (that is, after extraction). For example, suppose an auto-parts company has

an internal database of parts for cars. This company might want to join their parts

database with the classified ads about cars in a effort to advertise their mechanics service.

Therefore, they could use reference-set based extraction to identify the cars for sale in

the classified ads, which they can then join with their internal database. However, each

car in their database might have its own internal identification and this attribute might

be needed for their query. In this case, since none of the classified ads would already

contain this company-specific attribute, the company would have to manually create their

own reference set and include this attribute. So, in this case, the user would manually

construct a reference set and then use it for extraction.

Second, the textual characteristics of the posts might make it difficult to automatically

create a high quality reference set using the technique of Chapter 3. In fact, in Chapter 3

I present an analysis of the characteristics that define these domains where automatically

creating a reference set is appropriate. Further, that chapter describes a method that

can easily decide whether to use the automatic approach of Chapter 3, or suggest to the

user to manually construct a reference set.

7

The automatic extraction approach can be improved using machine learning tech-

niques at the cost of labeling data, as I describe in Chapter 4. This is especially use-

ful when very high quality extractions are required. Further, a user may require high-

accuracy extraction of attributes that are not easily represented in a reference set, such

as prices or dates. I call these “common attributes.” In particular, common attributes

are usually a very large set (such as prices or dates), but they contain characteristics that

can be exploited for extraction. For instance, regular expressions can identify prices.

However, when extracting both reference-set and common attributes there are ambi-

guities that must be accounted for to extract common attributes accurately. For instance,

while a regular expression for prices might pull out the token “626” from a classified ad for

cars, it is more likely that the “626” in the post refers to the car model, since the Mazda

company makes a car called the 626. By using a reference set to aid disambiguation, the

system handles extraction from posts with high accuracy. For instance, the system can

be trained that when a token matches both a regular expression for prices and the model

attribute from the reference set, it should be labeled as a car model. Chapter 4 describes

my machine learning approach to matching and extraction.

Table 1.2 summarizes and compares all three methods.

Table 1.2: Methods for extraction/annotation under various assumptionsSummary Situation

Chapter 2 Requires repository of manually Cannot use automatically constructed reference set(“ARX” approach) constructed reference sets; because require specific, unattainable attribute

Automatically selects appropriate or characteristics of text preventreference set(s); Automatic extraction automatic construction

Chapter 3 Automatically construct reference set; Data set conforms to characteristics(“ILA” algorithm) Use constructed reference set with for constructing reference set,

automatic method from Chapter 2 which can be testedChapter 4 Supervised machine learning; Require high-accuracy extraction(“Phoebus” approach) Requires manually constructed including common attributes that may be

reference set and labeled training data ambiguous

8

1.3 Thesis Statement

I demonstrate that we can exploit reference sets for information extraction on

unstructured, ungrammatical text, which overcomes the difficulties encoun-

tered by current extraction methods that rely on grammatical and structural

clues in the text for extraction; furthermore, I show that we can automatically

construct reference sets from the unstructured, ungrammatical text itself.

1.4 Contributions of the Research

The key contribution of the thesis is a method for information extraction that exploits

reference sets, rather than grammar or structure based techniques. The research includes

the following contributions:

• An automatic matching and extraction algorithm that exploits a given reference set

(Chapter 2)

• A method that selects the appropriate reference sets from a repository and uses

them for extraction and annotation without training data (Chapter 2)

• An automatic method for constructing reference sets from the posts themselves.

This algorithm also is able to suggest the number of posts required to automatically

construct the reference set sufficiently (Chapter 3)

9

• An automatic method whereby the machine can decide whether to construct its

own reference set, or require a person to do it manually (Chapter 3)

• A method that uses supervised machine learning to perform high-accuracy extrac-

tion on posts, even in the face of ambiguity. (Chapter 4)

1.5 Outline of the thesis

The remainder of this thesis is organized as follows. Chapter 2 presents my automatic

approach to matching posts to a reference set and then using those matches for automatic

extraction. This technique requires no more manual effort beyond constructing a reference

set (and even that can be avoided if the repository of reference sets is comprehensive

enough). Chapter 3 describes the technique for automatically constructing a reference set,

which can then be exploited using the methods of Chapter 2. Chapter 3 also describes an

algorithm for deciding whether or not the reference set can be automatically constructed.

Chapter 4 describes the approach to reference-set based extraction that uses machine

learning. Chapter 5 reviews the work related to the research in this thesis. Chapter 6

finishes with a conclusion of the contributions of this thesis and points out future courses

of research.

10

Chapter 2

An Automatic Approach to Extraction from Posts

Unsupervised Information Extraction (UIE) has recently witnessed significant progress [8;

28; 51]. However, current work on UIE relies on the redundancy of the extractions to

learn patterns in order to make further extractions. Such patterns rely on the assumption

that similar structures will occur again to make the learned extraction patterns useful.

Initially, the approaches can be seeded with manual rules [8], example extractions [51],

or sometimes nothing at all, relying solely on redundancy [28]. Regardless of how the

extraction process starts, extractions are validated via redundancy, and they are then

used to generalize patterns for extractions.

This approach to UIE differs from the one in this thesis in important ways. First, the

use of patterns makes assumptions about the structure of the data. However, I cannot

make any structural assumptions because my target data for extraction is defined by its

lack of structure and grammar. Second, since these systems are not clued into what can

be extracted, they rely on redundancy for their confidence in their extractions. In my

case, the confidence in the extracted values comes from their similarity to the reference-

set attributes exploited during extraction. Lastly, the goals differ. Previous UIE systems

11

seek to build knowledge bases through extraction. For instance, they aim to find all types

of cars on the Web. Our extraction creates relational data, which allows us to classify

and query posts. For this reason, we believe the previous work complements ours well

because we could use their techniques to automatically build our reference sets.

This chapter presents my algorithm for unsupervised information extraction of un-

structured, ungrammatical text. The algorithm has three distinct steps. As stated in

the introduction, reference-set based extraction relies on exploiting relational data from

reference sets, so the first step of the algorithm is to choose the applicable reference set

from a repository of reference sets. This is done so that users can easily reuse reference

sets they may have constructed in the past. Of course, a user can skip this step and just

provide his or her reference set directly to the algorithm instead, but by having the ma-

chine select the appropriate reference sets, this eases some of the burden and guesswork

from a human user. Once the reference sets are chosen, the second step in the algorithm is

for the system to match each post to members of the reference set, which allows it to use

these members’ attributes as clues to aid in the extraction. Finally, the system exploits

these reference-set attributes to perform the unsupervised extraction. It is important to

note that the approach of this chapter is totally automated beyond the construction of

the reference sets.

2.1 Automatically Choosing the Reference Sets

As stated above, the first step in the algorithm is to select the appropriate reference sets.

Given that the repository of reference sets grows over time, the system should choose the

12

reference sets to exploit for a given set of posts. The algorithm chooses the reference sets

based on the similarity between the set of posts and the reference sets in the repository.

Intuitively, the most appropriate reference set is the one with the most useful tokens in

common with the posts. For example, if we have a set of posts about cars, we expect a

reference set with car makes, such as Honda or Toyota, to be more similar to the posts

than a reference set of hotels.

To choose the reference sets, the system treats each reference set in the repository as

a single document and the set of posts as a single document. This way, the algorithm

calculates a similarity score between each reference set and the full set of posts. Next

these similarity scores are sorted in descending order, and the system traverses this list,

computing the percent difference between the current similarity score and the next. If

this percent difference is above a threshold, and the score of the current reference set is

greater than the average similarity score for all reference sets, the algorithm terminates.

Upon termination, the algorithm returns as matches the current reference set and all

reference sets that preceded it. If the algorithm traverses all of the reference sets with-

out terminating, then no reference sets are relevant to the posts. Table 2.1 shows the

algorithm.

I choose the percent difference as the splitting criterion between the relevant and

irrelevant reference sets because it is a relative measure. Comparing only the actual

similarity values might not capture how much better one reference set is as compared

to another. Further, I require that the score at the splitting point be higher than the

average score. This requirement captures cases where the scores are so small at the end of

the list that their percent differences can suddenly increase, even with a small difference

13

Table 2.1: Automatically choosing a reference set

Given posts P , threshold T , and reference set repository Rp← SingleDocument(P )For all reference sets ref ∈ Rri ← SingleDocument(ref)SIM(ri, p)← Similarity(ri, p)

For all SIM(ri, p) in descending orderIf PercentDiff (SIM(ri, p), SIM(ri+1, p)) > T ANDSIM(ri, p) > AV G(SIM(R, p))

Return SIM(rx, p), 1 > x > iReturn nothing (No matching reference sets)

in score. This difference does not mean that the algorithm found relevant reference sets,

rather it just means that the next reference set is that much worse than the current, bad

one.

By treating each reference set as a single document, the algorithm of Table 2.1 scales

linearly with the size of the repository. That is, each reference set in the repository is

scored against the posts only once. Furthermore, as the number of reference sets increases,

the percent difference still determines which reference sets are relevant. If an irrelevant

reference set is added to the repository, it will score low, so it will still be relatively that

much worse than the relevant one. If a new relevant set is added, the percent difference

between the new one and the one already chosen will be small, but both of them will still

be much better than the irrelevant sets in the repository. Thus, the percent difference

remains a good splitting point.

I do not require a specific similarity measure for this algorithm. Instead, the ex-

periments of Section 2.4 demonstrate that certain classes of similarity metrics perform

14

well. So, rather than picking just one as the best I try many different metrics and draw

conclusions about what types of similarity scores should be used and why.

2.2 Matching Posts to the Reference Set

As outlined in Chapter 1, reference-set based extraction hinges upon exploiting the best

matching member of a reference set for a given post. Therefore, after choosing the

relevant reference sets, the algorithm matches each post to the best matching members of

the reference set. When selecting multiple reference sets, the matching algorithm executes

iteratively, matching the set of posts once to each chosen reference set. However, if two

chosen reference sets have the same schema, it only selects the higher ranked one to

prevent redundant matching.

To match the reference set records to the posts, I employ a vector-space model. By

using a vector-space approach to matching, rather than machine learning, I can lessen the

burden required by tasks such as labeling training matches. Furthermore, a vector-space

model allows me to use information-retrieval techniques such as inverted indexes, which

are fast and scalable.

In my model, I treat each post as a “query” and each record of the reference set as

a “document,” and I use token-based similarity metrics to define their likeness. Again, I

do not tie this part of the algorithm to a specific similarity metric because I find a class

of token-based similarity metrics works well, which I justify in the experiments.

However, for the matching algorithm I do modify the similarity metrics I use. This

modification considers two tokens as matching if their Jaro-Winkler [62] similarity is

15

greater than a threshold. For example, consider the classic Dice similarity, defined over

a post p and a record of the reference set r, as:

Dice(p, r) =2 ∗ (p ∩ r)|p|+ |r|

If the threshold is 0.95, two tokens are put into the intersection of the modified Dice

similarity if those two tokens have a Jaro-Winkler similarity above 0.95. This Jaro-

Winkler modification captures tokens that might be misspelled or abbreviated, which is

common in posts. The underlying assumption of the Jaro-Winkler metric is that certain

attributes are more likely similar if they share a certain prefix. This works particularly

well for proper nouns, which many reference set attributes are, such as “Honda Accord”

cars. Using the modified similarity metric, the algorithm compares each post, pi, to each

member of the reference set and returns the reference set record(s) with the maximum

score, called rmaxi . In the experiments, I vary this threshold to test its effect on the

performance of matching the posts. I also justify using the Jaro-Winkler metric to modify

the Dice similarity, rather than another edit distance metric.

However, because more than one reference set record can have a maximum similarity

score with a post (rmaxi is a set), an ambiguity problem exists with the attributes provided

by the reference set records. For example, consider a post “Civic 2001 for sale, look!” If

we have the following 3 matching reference records: {HONDA, CIVIC, 4 Dr LX, 2001},

{HONDA, CIVIC, 2 Dr LX, 2001} and {HONDA, CIVIC, EX, 2001}, then we have an

ambiguity problem with the trim. We can confidently assign HONDA as the make, CIVIC

as the model, and 2001 as the year, because all of the matching records agree on these

16

attributes. We say that these attributes are “in agreement.” Yet, there is disagreement

on the trim because we cannot determine which value is best for this attribute. All the

reference records are equally acceptable from the vector-space perspective, but they differ

in value for this attribute. Therefore, the algorithm removes all attributes that do not

agree across all matching reference set records from the annotation (e.g., the trim in

our example). Once this process executes, the system has all of the attributes from the

reference set that it can use to aid extraction. The full automatic matching algorithm is

shown in Table 2.2.

Table 2.2: Vector-space approach to finding attributes in agreement

Given posts P and reference set RFor all pi ∈ Prmaxi ← MAX( SIM (pi,R))Remove attributes not in agreement from rmaxi

Selecting all the reference records with the maximum score, without pruning possible

false positives, introduces noisy matches. These false positives occur because posts with

small similarity scores still ‘match’ certain reference set records. For instance, the post

“We pick up your used car” matches the Renault Le Car, Lincoln Town Car, and Isuzu

Rodeo Pick-up, albeit with small similarity scores. However, since none of these attributes

are in agreement, this post gets no annotation. Therefore, by using only the attributes in

agreement, I essentially eliminate these false positive matches because no annotation will

be returned for this post. That is, because no annotation is returned for such posts, it is

as if there are no matching records for it. In this manner, by using only the attributes

“in agreement,” I separate the true matches from the false positives.

17

Once the system matches the posts to a reference set, users can query the posts

structurally like a database. This is a tremendous advantage over the traditional keyword

search approach to searching unstructured, ungrammatical text. For example, keyword

search cannot return records for which an attribute is missing, whereas the approach in

this thesis can. If a post were “2001 Accord for sale,” and the keyword search was Honda,

this record would not be returned. However, after matching posts to the reference set,

if we select all records where the matched-reference-set attribute is “Honda” this record

would be returned. Another benefit of matching the posts is that aggregate queries are

possible. This mechanism is not supported by keyword search.

Perhaps one of the most useful aspects of matching posts to the reference set is that one

can include the posts in an information-integration system (e.g., [38; 58]). In particular,

my approach is well suited for Local-as-View (LAV) information integration systems [37],

which allow for easy inclusion of new sources by defining their “source description” as a

view over known domain relations. For example, assume a user has a reference set of cars

from Edmunds car buying guide for the years 1990-2005 in his repository. If one includes

this source in an LAV integration system, the source description is:

Edmunds(make, model, year, trim) :-

Cars(make, model, year, trim) ∧

year ≥ 1995 ∧ year ≤ 2005

Now, assume a set of posts comes in and the system matches them to the records of

this Edmunds source. Before matching, the user had no idea how to include this source in

the integration system, but after matching the user can use the same source description

18

of the matching reference set, along with a variable to output the “post” attribute, to

define a source description of the unstructured, ungrammatical source. This is possible

because the algorithm appends records from the unstructured source with the attributes

in agreement from the reference set matches. So, the new source description becomes:

UnstructuredSource(post, make, model, year, trim) :-

Cars(make, model, year, trim) ∧

year ≥ 1995 ∧ year ≤ 2005

In this manner, one can collect and include new sources of unstructured, ungrammat-

ical text in LAV information integration systems without human intervention.

2.3 Unsupervised Extraction

Yet, the annotation that results from the matching step may not be enough. There are

many cases where a user wants to see the actual extracted values for a given attribute,

rather than the reference set attribute from the matching record. Therefore, once the

system retrieves the attributes in agreement from the reference set matches, it exploits

these attributes for information extraction. First, for each post, each token is compared

to each of the attributes in agreement from the matches. The token is labeled as the

attribute for which it has the maximum Jaro-Winkler similarity. I label a token as ‘junk’

(to be ignored) if it receives a score of zero against all reference-set attributes.

However, this initial labeling generates noise because the attributes are labeled in

isolation. To remedy this, the algorithm uses the modified Dice similarity, and generates

a baseline score between the extracted field and the reference-set field. Then, it goes

19

Table 2.3: Cleaning an extracted attribute

CLEAN-ATTRIBUTE (extracted attribute E, reference set attribute A)ExtCands ← {}baseline ← DICE(E,A)For all tokens ei ∈ E

score ← DICE(E/ei,A)IF score > baseline

ExtCands ← ExtCands ∪ eiIF ExtCands is empty

return EELSE

select ci ∈ ExtCands that yieds max scoreCLEAN-ATTRIBUTE (E/ci,A)

through the extracted attribute, removing one token at a time, and calculates a new

Dice similarity value. If this new score is higher than the baseline, the removed token

is a candidate for permanent removal. Once all tokens are processed in this way, the

candidate for removal that yielded the highest new score is removed. Then, the algorithm

updates the baseline to the new score, and repeats the process. When none of the tokens

yield improved scores when removed, this process terminates. This cleaning is shown in

Table 2.3. For this part of the algorithm, I use the modified Dice since our experiments in

the annotation section show this to be one of the best performing similarity metrics. Note,

my unsupervised extraction is O(‖p‖2) per post, where ‖p‖ is the number of tokens in

the post, because, at worst, all tokens are initially labeled, and then each one is removed,

one at a time. However, because most posts are relatively short, this is an acceptable

running time.

The whole multi-pass procedure for unsupervised extraction is given by Figure 2.1.

By exploiting reference sets, the algorithm performs unsupervised information extraction

without assumptions regarding repeated structure in the data.

20

Figure 2.1: Unsupervised extraction with reference sets

2.4 Experimental Results

This section presents results for my unsupervised approach to selecting a reference set,

finding the attributes in agreement, and exploiting these attributes for extraction.

2.4.1 Results: Selecting reference set(s)

The algorithm for choosing the reference sets is not tied to a particular similarity function.

Rather, I apply the algorithm using many different similarity metrics and draw conclusions

about which ones work best and why. This yields insight as to which types of metrics

users can plug in and have the algorithm perform accurately. 1

The experiments for selecting reference sets uses six reference sets and three sets posts

to test out the various situations where the algorithm should work correctly: returning a

single matching reference set, multiple matching reference sets, and no matching reference

sets.1Note that for the following metrics: Jensen-Shannon distance, Jaro-Winkler similarity, TF/IDF, and

Jaro-Winkler TF/IDF, I use the SecondString package’s implementation [17].

21

I use six reference sets, most of which have been used in past research. First, I use the

Hotels reference set from [44], which consists of 132 hotels each with a star rating, a hotel

name, and a local area. Another reference set from the same paper is the Comics reference

set, which has 918 Fantastic Four and Incredible Hulk comics from the Comic Books Price

guide, each with a title, issue number, and publisher. I also use two restaurant reference

sets, which were both used previously for record linkage [6]. One I call Fodors, which

contains 534 restaurant records, each with a name, address, city, and cuisine. The other

is Zagat, with 330 records.

I also have two reference sets of cars. The first, called Cars, contains 20,076 cars from

the Edmunds Car Buying Guide for 1990-2005. The attributes in this set are the make,

model, year, and trim. I supplement this reference set with cars from before 1990, taken

from the auto-accessories company, Super Lamb Auto. This supplemental list contains

6,930 cars from before 1990. I call this combined set of 27,006 records the Cars reference

set. The other reference set of cars also has the attributes make, model, year, and trim.

However, it is a subset of the cars covered by Cars. This data set comes from the Kelly

Blue Book car pricing service containing 2,777 records for Japanese and Korean cars from

1990-2003. I call this set KBBCars. This data set has also been used in the record linkage

community [46]. A summary of the reference sets is given in Table 2.4.

Table 2.4: Reference Set DescriptionsName Source Website Attributes RecordsFodors Fodors Travel Guide www.fodors.com name, address, city, cuisine 534Zagat Zagat Restaurant Guide www.zagat.com name, address, city, cuisine 330Comics Comics Price Guide www.comicspriceguide.com title, issue, publisher 918Hotels Bidding For Travel www.biddingfortravel.com star rating, name, local area 132Cars Edmunds and www.edmunds.com and make, model, trim, year 27,006

Super Lamb Auto www.superlambauto.comKBBCars Kelly Blue Book Car Prices www.kbb.com make, model, trim, year 2,777

22

I chose sets of posts to test different cases that exist for finding the appropriate

reference sets. One set of posts matches only a single reference set in the experimental

repository. It contains 1,125 posts from the forum Bidding For Travel. These posts, called

BFT, match only the Hotels reference set. This data set was used previously in [44].

Because my approach can also select multiple relevant reference sets, I use a set

of posts that matches both reference sets of cars. This set, which I call Craig’s Cars,

contains 2,568 car posts from Craigslist classifieds. Note that while there may be multiple,

appropriate reference sets, they also might have an internal ranking. In this case I expect

that both the Cars and KBBCars reference sets are selected, but Cars should be ranked

first (since KBBCars ⊂ Cars).

Lastly, I must examine whether the algorithm suggests that no relevant reference sets

exist in the repository. To test this feature, I collected 1,099 posts about boats from

Craigslist, called Craig’s Boats. Boats are similar enough to cars to make this task non-

trivial, because boats and cars are both made by Honda, for example, so that keyword

appears in both sets of posts. However, boats also differ from each of the reference sets

so that no reference set should be selected.

All three sets of posts are summarized in Table 2.5.

Table 2.5: Posts Set DescriptionsName Source Website Reference Set Match RecordsBFT Bidding For Travel www.biddingfortravel.com Hotels 1,125Craig’s Cars Craigslist Cars www.craigslist.org Cars, KBBCars 2,568Craig’s Boats Craigslist Boats www.craigslist.org 1,099

23

The first metric I test is the Jensen-Shannon distance (JSD) [39]. This information-

theoretic metric quantifies the difference in probability distributions between the tokens

in the reference set and those in the set of posts. Because JSD requires probability distri-

butions, I define the distributions as the likelihood of tokens occurring in each document.

Table 2.6 shows the results for choosing relevant reference sets using JSD. The ref-

erence set names in bold reflect those that are selected for the given set of posts. (This

means Craig’s Boats should have no bold names.) The scores in bold are the similarity

scores for the chosen reference sets. The percent difference in bold is the point at which

the algorithm breaks out and returns the appropriate reference sets. In particular, us-

ing JSD the algorithm successfully identifies the multiple cases of a single appropriate

reference set, multiple reference sets, or no reference set. Note that in all experiments I

maintain the percent-difference splitting-threshold at 0.6.

Table 2.6: Using Jensen-Shannon distance as similarity measureBFT Posts Craig’s Cars

Ref. Set Score % Diff. Ref. Set Score % DiffHotels 0.622 2.172 Cars 0.520 0.161Fodors 0.196 0.050 KBBCars 0.447 1.193Cars 0.187 0.248 Fodors 0.204 0.144KBBCars 0.150 0.101 Zagat 0.178 0.365Zagat 0.136 0.161 Hotels 0.131 0.153Comics 0.117 Comics 0.113Average 0.234 Average 0.266

Craig’s BoatsRef. Set Score % Diff.Cars 0.251 0.513Fodors 0.166 0.144KBBCars 0.145 0.088Comics 0.133 0.025Zagat 0.130 0.544Hotels 0.084Average 0.152

24

The next metric I test is cosine similarity using TF/IDF for the weights. These

results are shown in Table 2.7. Similarly to JSD, TF/IDF is able to identify all of the

cases correctly. However, in choosing both car reference sets for the Craig’s Cars posts,

TF/IDF incorrectly determines the internal ranking, placing KBBCars ahead of Cars.

This due to the IDF weights that are calculated for the reference sets. Although the

algorithm treats the set of posts and the reference set as a single document for comparison,

the IDF weights are based on individual records in the reference set. Therefore, since

Cars is a superset of KBBCars, certain tokens are weighted more in KBBCars than

in Cars, resulting in higher matching scores. These results also justify the need for a

double stopping criterion. It is not sufficient to only consider the percent difference as

an indicator of relative superiority amongst the reference sets. The scores must also be

compared to an average to ensure that the algorithm does not errantly choose a bad

reference set simply because it is relatively better than an even worse one. The last two

rows of the Craig’s Boats posts and the BFT Posts in Table 2.7 show this behavior.

I also tried a simpler bag-of-words metric that does not use any sort of token prob-

abilities or weights. Namely, I use the Jaccard similarity, which is defined as the tokens

in common divided by the union of the token sets. That is, given token set S and token

set T , Jaccard is:

Jaccard(S, T ) =S ∩ TS ∪ T

The results using the Jaccard similarity are shown in Table 2.8. One of the most

surprising aspects of these results is that the Jaccard similarity actually does pretty well.

It gets all but one of the cases correct. It is able to link the BFT Posts to the Hotels

25

Table 2.7: Using TF/IDF as the similarity measureBFT Posts Craig’s Cars

Ref. Set Score % Diff. Ref. Set Score % DiffHotels 0.500 1.743 KBBCars 0.122 0.239Fodors 0.182 0.318 Cars 0.099 1.129Comics 0.134 0.029 Zagat 0.046 0.045Zagat 0.134 0.330 Fodors 0.044 0.093Cars 0.100 1.893 Comics 0.041 0.442KBBCars 0.035 Hotels 0.028Average 0.182 Average 0.063

Craig’s BoatsRef. Set Score % Diff.Cars 0.200 0.189Comics 0.168 0.220Fodors 0.138 0.296Zagat 0.107 0.015KBBCars 0.105 0.866Hotels 0.056Average 0.129

set only and it is able to determine that no reference set is appropriate for the Craig’s

Boats posts. However, with the Craig’s Cars posts, it is only able to determine one of the

reference sets. It ranks the Fodors and Zagat restaurants ahead of KBBCars because the

restaurants have city names in common with some of the classified car listings. However,

it is unable to determine that city name tokens are not as important as car makes and

models. This is a problem if, for instance, the post is small and it contains a few more

city name tokens than car tokens.

Lastly, I investigate whether requiring that tokens match strictly, as in the above

metrics, is more useful than a soft matching technique that considers tokens matching

based on string similarities. To test this idea I use a modified version of TF/IDF where

tokens are considered a match when their Jaro-Winkler score is greater than 0.9. These

results are shown in Table 2.9. This metric is able to correctly retrieve the reference set

26

Table 2.8: Using Jaccard as the similarity measureBFT Posts Craig’s Cars

Ref. Set Score % Diff. Ref. Set Score % DiffHotels 0.272 1.489 Cars 0.207 1.339Comics 0.109 0.166 Fodors 0.088 0.126Fodors 0.094 0.004 Zagat 0.078 0.005Zagat 0.093 0.640 KBBCars 0.078 0.089Cars 0.057 0.520 Comics 0.072 2.536KBBCars 0.037 Hotels 0.020Average 0.110 Average 0.091

Craig’s BoatsRef. Set Score % Diff.Cars 0.129 0.366Comics 0.095 0.347Fodors 0.070 0.112Zagat 0.063 0.261KBBCars 0.050 1.917Hotels 0.017Average 0.071

for the BFT Posts, and for the Craig’s Boats posts this metric chooses no appropriate

reference set. However, for the Craig’s Boats case, this metric is close to returning many

incorrect reference sets since with the Zagat reference set the score is very close to the

average while the percent difference is huge. The largest failure is with the Craig’s Cars

posts. For this set of posts the ordering of the reference sets is correct because the Cars

and KBBCars have the highest scores, but no reference set is chosen because the percent

difference is never above the threshold with a similarity score above the average. The

percent differences are low because of the soft nature of the token matching. For example,

the Comics contains many matches because the issue number of the comics, such as #99,

#199, #299, etc. match an abbreviated car year in a post such as ’99. So, the similarity

scores between the cars and comics reference sets are close. From these sets of results

it becomes clear that it is better to have strict token matches, although they may be

27

less frequent. The strict nature of the matches ensures that the tokens particular to a

reference set are used for matches, which helps differentiate reference sets.

Table 2.9: Using Jaro-Winkler TF/IDF as the similarity measureBFT Posts Craig’s Cars

Ref. Set Score % Diff. Ref. Set Score % DiffHotels 0.593 1.232 Cars 0.699 0.174Fodors 0.266 0.173 KBBCars 0.595 0.174Zagat 0.227 0.110 Comics 0.505 0.050Cars 0.204 0.068 Fodors 0.481 0.107Comics 0.191 0.777 Zagat 0.435 0.948KBBCars 0.108 Hotels 0.223Average 0.265 Average 0.490

Craig’s BoatsRef. Set Score % Diff.Cars 0.469 0.096Comics 0.428 0.106Fodors 0.387 0.110KBBCars 0.349 0.082Zagat 0.322 1.212Hotels 0.146Average 0.350

The performance of each metric is shown in Table 2.10. This table shows each metric

I tested, and whether or not it correctly identified the reference set(s) for that set of

posts. In the case of Craig’s Boats, I consider it correct if no reference set is chosen. The

last column shows whether the method was able to also rank the chosen reference sets

correctly for the Craig’s Cars posts.

Table 2.10: A summary of each method choosing reference setsMethod BFT Craig’s Boats Craig’s Cars Craig’s Cars rank

JSD√ √ √ √

TF/IDF√ √ √

XJ-W TF/IDF

√ √X X

Jaccard√ √

X X

28

Based on these sets of results, I draw some conclusions about which metrics should be

used and why. Comparing the Jaccard similarity results to the success of both JSD and

TF/IDF, it is clear that it is necessary to include some notion of importance regarding

the matching tokens. The results argue that probability distributions of the tokens as

defined in JSD are a better metric, since TF/IDF can be overly sensitive, i.e., ignoring

tokens that may be important, even though they are frequent. This claim is also justified

by the fact that using JSD the algorithm does not run into the situation where it needs to

use the average stopping criterion, although it does with the TF/IDF metric. However,

since JSD and TF/IDF both do well, I can say that if a similarity metric can differentiate

important tokens from those that are not, then it can be used successfully in the algorithm

to choose reference sets. This is why I do not tie this algorithm to any particular metric,

since many could work.

Another interesting aspect of these results is the poor performance of TF/IDF using

the Jaro-Winkler modification. It seems that boosting the number of tokens that match

by using a less strict token matching method actually harms the ability to differentiate

between reference sets. This suggests that the tokens that define reference sets need to be

emphasized by matching them exactly. Lastly, across different domains and even across

different similarity metrics, the chosen threshold of 0.6 is appropriate for the cases where

the algorithm chooses the correct reference set. In all of the cases where the correct

reference set(s) is chosen, this threshold is exceeded.

29

2.4.2 Results: Matching posts to the reference set

Once the relevant reference sets are chosen, I use them to find the attributes in agreement

for the different sets of posts, since the algorithm will use these attributes during extrac-

tion. For the set of posts with reference sets (i.e., BFT and Craig’s Cars), I compare the

attributes in agreement found by the vector-space model to the attributes in agreement

for the true matches between the posts and the reference set. To evaluate the annotation,

I use the traditional measures of precision, recall, and F-measure (i.e., the harmonic mean

between precision and recall). A correct match occurs when the attributes match between

the true matches and the predictions of the vector-space model. As stated previously, I

try multiple modified token-similarity metrics to draw conclusions about what types of

metrics work and why.

Table 2.11 shows the results using the modified Dice similarity for the BFT and Craig’s

Cars posts, varying the Jaro-Winkler token-match threshold from 0.85 to 0.99. That is,

above this threshold two tokens will be considered a match for the modified Dice. In

both domains there is an improvement in F-measure as the threshold increases. This

is because more relevant tokens are being included when the threshold increases. If the

threshold is low then many irrelevant tokens are considered matches, so the algorithm

makes incorrect record level matches. For instance, as the threshold becomes low in the

Craig’s Cars domain, the results for the year attribute drop steeply because often the

difference in years is a single digit, which in turn yields errantly matched tokens for a

low threshold string similarity. Since errant matches are ignored (because they are not

“in agreement”) the scores are low. Note, however, that once the threshold becomes too

30

high, at 0.99, the results start to decrease. This happens because a very strict (high

value) edit-distance is too restrictive, so does not capture some of the possible matching

tokens that might be slightly misspelled.

The only attribute where the F-measure increases at 0.99 versus 0.95 is the year

attribute of the Craig’s Cars domain. In this case, the recall slightly increases at 0.99,

but the precisions are almost the same, yielding a higher F-measure for the matching

threshold of 0.99. This is due to more year values being “in agreement” with the higher

threshold since there will be less variation in terms of which reference set values can match

for this attribute, so those that do will likely be in agreement. For an example where

the threshold of 0.95 includes years that are not in agreement with high Jaro-Winkler

scores, consider a post with the token “19964” which might be a price or a year. If the

reference set record’s year attribute is “1994,” the Jaro-Winkler score between “19964”

and “1994” is 0.953. If the reference set record’s year is “1996” the Jaro-Winkler score is

0.96. In both cases, a threshold of 0.95 includes both years, so if this post matches two

reference set records with the same make, model and trim, but differing years of 1994 and

1996, then the year is discarded because it is not in agreement. We almost see the same

behavior with the trim attribute as well. This is because with both of these attributes, a

single difference in a character, say “LX” versus “DX” for a trim (or a digit for the year)

yields a completely different attribute, which can then become not “in agreement.”

Table 2.12 shows the results using the modified Jaccard similarity. As with the Dice

similarity, the modification is such that two tokens are put into the intersection of the

Jaccard similarity if their Jaro-Winkler score is above the threshold. The most striking

31

Table 2.11: Annotation results using modified Dice similarityBFT posts

Threshold Attribute Recall Precision F-Measure0.85 Hotel name 76.46 78.13 77.29

Star rating 80.74 77.86 79.27Local area 88.04 85.46 86.73

0.9 Hotel name 88.23 88.49 88.36Star rating 91.73 87.64 89.64Local area 93.09 89.36 91.19



Craig’s Cars postsThreshold Attribute Recall Precision F-Measure0.85 make 81.57 77.64 79.55

model 57.61 61.15 59.33trim 38.76 29.80 33.70year 2.38 8.14 3.68

0.9 make 88.41 82.82 85.52model 76.28 77.16 76.72trim 65.57 48.12 55.51year 69.45 82.88 75.58



result is that the scores match exactly to those using the Dice similarity. The Dice sim-

ilarity and Jaccard similarity can be used interchangeably. Further investigation reveals

that the actual similarity scores between the posts and their reference set matches are

different, which should be the case, but the resulting attributes that are “in agreement”

are the same using either metric. Therefore, they yield the same annotation from the

matches.

32

Table 2.12: Annotation results using modified Jaccard similarityBFT posts







model 57.61 61.15 59.33trim 38.76 29.80 33.70year 2.38 8.14 3.68




Table 2.13 shows results using the Jaro-Winkler TF/IDF similarity measure. Similarly

to the other metrics, for the BFT domain as the threshold increases so does the F-measure,

until the threshold peaks at 0.95 after which it decreases in accuracy. However, with the

Craig’s Cars domain the modified TF/IDF performs the best with a threshold of 0.99.

From these results, across all metrics, a threshold of 0.95 performs the best for the

BFT domain. In the Cars domain, the 0.95 threshold works best for the modified Dice

and Jaccard, and at this threshold both methods outperform the TF/IDF metric, even

33

when its threshold is at its best at 0.99. Therefore, the most meaningful comparisons

between the different metrics can be made at the threshold 0.95. In the BFT domain, the

modified TF/IDF outperforms the Dice and Jaccard metrics, until this threshold of 0.95.

At this threshold level, the Jaccard and Dice metrics outperform the TF/IDF metric on

the two harder attributes, the hotel name and the hotel area. In the Cars domain, the

TF/IDF metric is outperformed at every threshold level, except on the year attribute.

Interestingly, at the lowest threshold levels TF/IDF performs terribly because the tokens

that match in the computation have very low IDF scores since they match so many other

tokens in the corpus, resulting in very low TF/IDF scores. If the scores are low, then

many records will be returned and almost no attributes will ever be in agreement, yielding

very few correct annotations.

These three sets of results allows me to draw some conclusions about the utility of

different metrics for the vector-space matching task. The biggest difference between the

TF/IDF metric and the other two is that the TF/IDF metric uses term weights computed

from the set of tokens in the reference set. The key insight of IDF weights are their ability

to discern meaningful tokens from non-meaningful ones based on the inter-document

frequency. The assumption is that more meaningful tokens occur less frequently. However,

almost all tokens in a reference set are meaningful, and it is sometimes the case that very

meaningful tokens in a reference set occur very often. The most glaring instances of this

occur with the make and year attributes in the Cars reference set used for the Craig’s

Cars posts. Makes such as “Honda” occur quite frequently in the data set, and given

that for 20,076 car records the years only range from 1990 to 2005, the re-occurrence

of the same year tokens are very, very frequent. These attributes will be deemphasized

34

Table 2.13: Annotation results using modified TF/IDF similarityBFT posts







model 41.93 35.54 38.47trim 43.10 15.50 22.80year 35.58 29.18 32.06




significantly because of their frequency. If the matching token metric ignores the year,

this attribute will often not be in agreement since multiple records of the same car for

different years will be returned. Thus, TF/IDF has low scores for the year attribute.

TF/IDF also creates problems by overemphasizing unimportant tokens that occur rarely.

Consider the following post, “Near New Ford Expedition XLT 4WD with Brand New 22

Wheels!!! (Redwood City - Sale This Weekend !!!) $26850” which TF/IDF matches to

the reference set record

35

{VOLKSWAGEN, JETTA, 4 Dr City Sedan, 1995}. In this case, the very rare token

“City” causes an errant match because it is weighted so heavily. In the case of the

BFT posts, since the Hotel reference set has few commonly occurring tokens amongst a

small set of records, this phenomena is not as observable. Since weights, whether based

on TF/IDF or probabilities, rely on frequencies, such an issue will likely occur in most

matching methods that rely on the frequencies of tokens to determine their importance.

Therefore, I draw the following conclusions. In the matching step, an edit-distance

should be used to make soft matches between the tokens of the post and the reference set

records. If the Jaro-Winkler metric is used, the threshold should be set to 0.95, since that

yields the highest improvement using the best metrics. Lastly, and most importantly,

reference sets do not adhere to the assumptions made by weighting schemes, so only

metrics that do not use such schemes, such as the Dice and Jaccard similarities, should

be used, rather than TF/IDF.

Earlier I stated that the Jaro-Winkler metric emphasizes matching proper nouns,

rather than more common words, because it considers the prefix of words to be an im-

portant indicator of matching. This is in contrast to traditional edit distances that define

a transformation over a whole string, which are better for generic words where the prefix

might not indicate matching. Table 2.14 compares using Dice similarity modified with the

Jaro-Winkler metric to Dice modified with the Smith-Waterman distance [56]. (Smith-

Waterman is a classic edit distance originally developed to align DNA sequences.) As

with the Jaro-Winkler score, if two tokens have a Smith-Waterman distance above 0.95

they are considered a match in the modified Dice similarity. As the table shows, the

Jaro-Winkler Dice score outperforms the Smith-Waterman variant. Since many of the

36

reference set attributes are proper nouns, the Jaro-Winkler is better suited for matching,

which is especially apparent when using the Cars reference set.

Table 2.14: Modified Dice using Jaro-Winkler versus Smith-WatermanBFT posts

Method Attribute Recall Precision F-MeasureJaro-Winkler Hotel name 88.42 88.51 88.47


Smith-Waterman Hotel name 67.80 69.56 68.67Star rating 75.39 73.88 74.63Local area 84.34 81.72 83.01

Craig’s Cars postsMethod Attribute Recall Precision F-MeasureJaro-Winkler make 93.96 86.35 89.99

model 82.62 81.35 81.98trim 71.62 51.95 60.22year 78.86 91.01 84.50

Smith-Waterman make 24.12 27.18 25.56model 14.71 18.82 16.52trim 11.17 5.81 7.64year 29.17 52.30 37.45

2.4.3 Results: Extraction using reference sets

The main research exercise of the this thesis is information extraction from unstructured,

ungrammatical text. Therefore, Table 2.15 shows our “field-level” results for performing

information extraction exploiting the attributes in agreement. Field-level extractions

are counted as correct only if all tokens that compromise that field in the post are

correctly labeled. Although this is a much stricter rubric of correctness (versus token-

level correctness), it more accurately models how useful an extraction system would be.

The results from my system are presented as ARX 2

I compare ARX’s results to three systems that rely on supervised machine learn-

ing for extraction. I compare against two different Conditional Random Field (CRF)2Automatic Reference-set based eXtraction

37

[32] extractors built using MALLET [42], since CRFs are a comparison against both

state-of-the-art extraction and methods that rely on structure. The first method, called

“CRF-Orth” includes only orthographic features, such as whether a token is capitalized

or contains punctuation. The second CRF method, called “CRF-Win” contains ortho-

graphic features and sliding window features such that it considers both the two-tokens

preceding the current token, and two-tokens afterward for labeling. CRF-Win in partic-

ular will demonstrate how the unstructured nature of posts makes extraction difficult. I

trained these extractors using 10% of the labeled posts for training, since this is small

enough amount to make it a fair comparison to the automatic method, but is also large

enough such that the extractor can learn well enough.

I also compare ARX to Amilcare [14], which relies shallow NLP parsing for extraction.

Amilcare also provides a good comparison because it can be supplied with “gazetteers”

(lists of nouns) to aid extraction, so it was provided with the unique set of attribute

values (such as Hotel areas) for each attribute. Note, however, that Amilcare was trained

using 30% of the labeled posts for training so that it could learn a good model (however,

even this large amount of training did not seem to help as it only outperformed ARX on

two attributes). All supervised methods are run 10 times, and the results reported are

averages.

Although the results are not directly comparable since I compare systems that require

labeled data (supervised machine learning) to ARX, they still point to the fact that ARX

performs very well. Although the ARX system is completely automatic from selecting

its own reference set all the way through to extraction, it still performs the best on 5/7

attributes. The two attributes where it is not the best, (star rating of BFT and year of

38

Table 2.15: Extraction resultsCraig’s Cars Posts

Attribute Recall Prec. F-Mes.Make ARX 95.99 100.00 97.95

CRF-Orth 82.03 85.40 83.66CRF-Win 80.09 77.38 78.67Amilcare 97.58 91.76 94.57

Model ARX 83.02 95.01 88.61CRF-Orth 71.04 77.81 74.25CRF-Win 67.87 69.67 68.72Amilcare 78.44 84.31 81.24

Trim ARX 39.52 66.94 49.70CRF-Orth 47.94 47.90 47.88CRF-Win 38.77 39.35 38.75Amilcare 27.21 53.99 35.94

Year ARX 76.28 99.80 86.47CRF-Orth 84.99 91.33 88.04CRF-Win 83.03 86.12 84.52Amilcare 86.32 91.92 88.97

BFT PostsAttribute Recall Prec. F-Mes.Star Rating ARX 83.94 99.44 91.03

CRF-Orth 94.37 95.17 94.77CRF-Win 93.77 94.67 94.21Amilcare 95.58 97.35 96.46

Hotel Name ARX 70.09 77.16 73.46CRF-Orth 66.90 68.07 67.47CRF-Win 39.84 42.95 41.33Amilcare 58.96 67.44 62.91

Local Area ARX 62.23 85.36 71.98CRF-Orth 64.51 77.14 70.19CRF-Win 28.51 39.79 33.07Amilcare 64.78 71.59 68.01

Craig’s Cars), it is competitive (within 5% of the F-measure) of the best method, which in

these two cases is Amilcare, which was supplied with 30% of the labeled data for training.

Further, for these two attributes, all of the methods were competitive indicating that there

is some regularity (such as the fact that these are the two attributes with numbers in

them) that can be easily exploited making extraction less difficult for these two attributes

than the others.

In fact, lending credibility to this fact is that the CRF-Win method is actually com-

petitive for these two attributes, although it performs the worst on 6/7 of the attributes.

39

The fact that CRF-Win does so poorly indicates that the sliding window actually confused

the extraction method, implying that there is not a regular structure (i.e. neighborhood

window) that can be used to aid extraction. This is precisely the point of this thesis, that

structure and grammar based methods will not do well for extraction of data that does

not conform to structural assumptions, and therefore new methods are required, such as

reference-set based extraction. The results of Table 2.15 support this claim.

One interesting point about these results involves the comparison of the recall and

precision for the ARX method. In 5/7 attributes, the difference between the precision

and recall is at least 10% in favor of the precision, yet the ARX method does well.

This suggests that if the algorithm can increase the discovery of the extractions, it will

increase the recall, and get even better results. For example, one of the attributes where

this occurs is the Local Area of the BFT posts. In this attribute, one often sees acronyms

such as “AP” for airport or “DT” for downtown. Supervised systems can be trained to

identify such cases, but my approach would need some sort of acronym and synonym

discovery method to find those. However, it could find these, it could bridge this gap

between precision and recall.

40

Chapter 3

Automatically Constructing Reference Sets for Extraction

Chapter 2 described a method for automatically extracting data from unstructured, un-

grammatical text. The process was automatic except for the necessity in manually con-

structing the reference set (which can be kept in the repository). However, as this Chapter

will show, even this reference set creation process can be automated.

In fact, in many cases, manually constructing a reference set is not necessary, and

instead the machine can construct its own reference set which it can then use for extraction

and matching using the algorithms of Chapter 2. Further, in the cases where it might

be unclear whether to manually create a reference set or let the machine create one, an

algorithm of this Chapter can determine the answer on behalf of the user.

However, before I delve into the details for automatically constructing a reference set,

I will first describe the cases where a manually constructed reference set is necessary.

I call the first situation where a manual reference set is necessary the “SKU” problem,

since SKUs are usually internal identifiers used by companies to identify products. The

generalized version of the SKU problem is that it is possible that a user requires that a

certain attribute must be added as semantic annotation to a post in support of future,

41

structured queries (that is, after extraction). For example, suppose an auto-parts com-

pany has an internal database of parts for cars. This company might want to join their

parts database with the classified ads about cars in a effort to advertise their mechanics

service. Therefore, they could use reference-set based extraction to identify the cars for

sale in the classified ads, which they can then join with their internal database. Yet,

their internal database might have its own internal identification attribute (SKU), and

since this attribute would be needed to join the cars identified in the classifieds with their

database, it would need to be included as annotation for the post. In this case, since

none of the classified ads would already contain this attribute since it’s specific to the

company, it would not be possible to include it as an automatically constructed attribute,

and therefore the company would have to manually create their own reference set to in-

clude this attribute. So, in this case, the user would manually construct a reference set

and then use it for extraction.

There is a second scenario that requires a manual reference set. Specifically, in some

cases the textual characteristics of the posts are such that the automatic method of

constructing a reference set encounters difficulty. As stated above, not only does this

chapter analyze these textual characteristics that define the domains where automati-

cally constructing a reference set is appropriate, but further, I describe a method that

can determine whether to use the automatic approach of this chapter for reference set

construction, or suggest to the user to manually construct a reference set. This determi-

nation algorithm uses just a single, labeled example post.

42

3.1 Reference Set Construction

This chapter presents a method to automatically create a reference set with the intention

of using that reference set for extraction. The process starts solely with just a set of

posts, with no a priori knowledge or labeled data. In order to construct a reference set

from a set of posts, I borrow from the work on concept discovery. The concept discovery

task takes Web pages as input, and outputs a hierarchy of concepts, in tree form, such

that some of the concepts are subsumed (children) of others. This task generally breaks

into two distinct steps. First, the set of candidate terms (concepts) is selected from the

corpus of Web pages. Second, these candidate terms are turned into a hierarchy. The

techniques for the second step are varied. Examples include methods that range from

conditional probabilities over the terms [54] to Principal Component Analysis over a term

matrix [23] to smoothed Formal Concept Analysis [13]. Regardless, however, the general

structure to solve the problem is clear: first find the terms to construct the hierarchy,

then construct it.

I borrow from this process by noting that reference sets often form a type of hierarchy.

Each column of a reference set will typically subsume the columns that come after it.

This “subsumption” happens because of a specificity definition for certain attributes.

For example, consider the reference set shown in Figure 3.1. In this reference set, the

Civic and the Accord model are specific versions of a Honda make, and the Focus model

is a specific Ford make. Therefore, one can think of Honda as a node in the hierarchy

that subsumes the Civic and the Accord, while Ford subsumes Focus. Note, however, I

do not claim that a reference set truly conforms to a conceptual hierarchy. Rather it is

43

hierarchy-like in that certain attributes may relate to others in such a way that it can be

represented as a compact tree.

Figure 3.1: Hierarchies to reference sets

Consider the hierarchies for cars shown in Figure 3.1. Note that although these hier-

archies are independent, I can combine them to form a single reference set by traversing

each tree and setting each node as a column value. The columns of each hierarchy are

aligned in the reference set by their position in the tree. Therefore, my task becomes

constructing independent hierarchies for each entity represented by the posts, which I

can then combine into a single reference set. This whole process is shown in Figure 3.2.

Mirroring concept discovery, the first step in my algorithm is to consider which tokens

in my set of posts might represent actual attributes for the reference set. For this step,

the algorithm simply breaks up the set of posts into bigrams, keeping only those bigrams

that occur in more than one post. The condition that a bigram must occur in at least

two documents is similar to condition on candidate tokens in Sanderson and Croft [54].

44

As an example, the second post of Table 1.1 becomes the set of bigrams: {“93- 4dr,” “4dr

Honda,” “Honda Civc,” ...}

Figure 3.2: Creating a reference set from posts

Next, the method uses these bigrams to discover possible terms subsumptions for the

hierarchies. I use bigrams because in many cases, in posts, users list the attributes of the

entities. That is, because of the compactness of posts, users tend to use mostly attribute

terms to fill in their posts. Further, they sometimes order the attributes in a natural way.

For instance, in the second car classified ad of Table 1.1 the car make (Honda) is followed

by the model (Civc) followed by the trim (LX). These are useful clues for building the

Honda hierarchy. Note, however, that this structure does not always exist, otherwise it

could always be exploited for extraction.

Although the algorithm exploits the structure of some of the posts in order to discover

the reference set, this does not imply that all of the posts maintain this structure. The

point of the algorithm is to construct a reference set, without labeled data or a priori

knowledge, which can then be used for information extraction without any assumption

45

on the structure of the posts. So, by exploiting a little of the structure, the algorithm

performs information extraction from the posts that do not have structure.

Once the first step completes, the algorithm has a set of bigrams from which to

construct the independent hierarchies which will be translated into a reference set. To

construct the hierarchies, I use a slightly modified version of the conditional probability

test of Sanderson and Croft [54], rewritten later [33] as1:

x subsumes y IF P (x|y) ≥ 0.75 and P (y|x) ≤ P (x|y)

For example, using the bigram “Honda Civc” I would test if P (Honda|Civc) ≥ 0.75

and P (Civc|Honda) ≤ P (Honda|Civc). However, unlike the concept hierarchy work, the

attributes are not necessarily single token terms. So, I modify the conditional probability

heuristic to account for the fact that I may need to merge multiple terms to form a single

attribute.

Intuitively, if two terms almost always co-occur with each other, but only with each

other, then they are likely to be part of the same attribute. Therefore, I create the

heuristic that if x subsumes y, and P (y|x) is also greater than the threshold, then I

should merge the terms.2 So, if both terms are subsuming each other, and therefore

likely co-occur frequently, but only with each other (otherwise P (x|y) would be greater

but not P (y|x)), they should be merged. So, the merging heuristic is defined as:

Merge(x,y) IF x subsumes y and P (y|x) ≥ 0.75

Once this process finishes, I have multiple, independent hierarchies representing the

entities described by the posts. The algorithm flattens these hierarchies into the reference1The threshold was originally set as 0.8 by Sanderson and Croft [54], but I found that 0.75 worked

better empirically for my problem.2This requires that relax the assumption that P (y|x) < P (x|y), since they could be equal as well.

46

set by aligning columns along the tree depth, as shown in Figure 3.1. The whole algorithm

in shown in Table 3.1. Since this algorithm constructs a reference set using a single pass

over the posts, I call it the “Single Pass” algorithm.

Table 3.1: Constructing a reference set

BuildReferenceSet(Posts P)# The Single Pass algorithmBigrams B ← {}Hierarchies H ← {}∀ posts p ∈ P∀ tokens ti ∈ p, 1 ≤ i < NumToks(p)

B ← B ∪(ti, ti+1)For all bigrams b ∈ Bt1, t2 ← SplitBigram(b)If BigramCount(b) > 1 andP (t1|t2) ≥ 0.75 and P (t2|t1) ≤ P (t1|t2)t1 subsumes t2

H ← UpdateHierarchies(t1,t2)If t1 subsumes t2 and P (t2|t1) ≥ 0.75

H ← MergeTerms(t1,t2,H)For all hierarchies h ∈ H

FlattenToReferenceTuples(h)

Although the above approach works well for finding certain attributes and their re-

lationship, one limitation stems from what I call the “general token” effect. Since sub-

sumption is determined by the conditional probabilities of the tokens, when the second

token is frequent (more general) than the first token, the conditional probability will be

low and not yield a subsumption.

An example of this general token effect can be seen with the trim attribute of cars.

For instance, consider the posts in Table 3.2 which show the general token effect for the

trim value of “LE.” These posts show the “LE” trim occurring with a Corolla model, a

Camry model, a Grand AM model, and a Pathfinder. However, in the labeled data (used

for experiments), although the “CAMRY LE” has the highest conditional probability for

47

an ‘LE’ bigram, it is only 49%. This is due to the fact that so many other bigrams share

the LE value as their second token (e.g., Corolla, Pathfinder, etc.). Therefore, since LE

happens across many different posts in many varying bigrams, I call it a “general” token,

and its conditional probability will never be high enough for subsumption. Thus it is

never inserted into the reference set.

Table 3.2: Posts with a general trim token: LE

2001 Nissan Pathfinder LE - $15000Toyota Camry LE 2003 —- 20000 $1525098 Corolla LE 145K, Remote entry w/ alarm, $46001995 Pontiac Grand AM LE (Northern NJ) $700

To compensate for this “general token” peculiarity, I iteratively run the Single Pass

algorithm, where for each iteration after the first, I consider the conditional probability

using a set of the first tokens from bigrams that all share the common second token in

the bigram. Note, however, this set only contains bigrams whose first token is already

a node in a hierarchy, otherwise the algorithm may be counting noise in the conditional

probability. This is the reason the algorithm can only iterate after the first pass. The

algorithm iterates until it no longer generates new attribute values. I call this version of

the approach the “Iterative” algorithm.

As an example, consider again the ‘LE’ trim. The iterative approach considers the

following conditional probability for subsumption, assuming the models of Camry, Corolla

and Pathfinder have already been discovered:

P ({CAMRY ∪ COROLLA ∪ PATHFINDER}|LE)

48

Now, if this conditional probability fits the heuristic for subsumption (or merging),

then I will add LE as a child to the nodes CAMRY, COROLLA and PATHFINDER in

their own respective hierarchies.

The algorithm is now equipped to deal with the “common token” effect. However,

there are still two aspects of this approach that need clarification. First, how many posts

should be supplied for constructing the reference set? There are literally millions of car

classified ads, but it is unclear how many should be used for construction. Intuitively, it

seems there is redundancy (an assumption I use to build the reference sets), so one would

not need to see all of the classified ads. However, can one determine this stopping point

algorithmically? Further, given that in order to overcome the general token issue the

algorithm iteratively generates reference set tuples, some noisy attributes are introduced.

However, manually pruning away this noise takes away from the automatic nature of my

approach.

To deal with both of these issues simultaneously, I introduce a “locking” mechanism

into the algorithm. Since many of the attributes the algorithm discovers are specifications

of more general attributes (such as car models specify makes), there is a point at which

although the algorithm may be discovering new values for the more specific attributes

(car models), it has saturated what it can discover for the parent attribute (car makes).

Further, once it saturates the parent attributes, if the algorithm does introduce further

values, they are likely noise. So, the algorithm should “lock” the parent attribute at the

saturation point, and only allow the discovery new attributes that are below the level of

locked attribute in the hierarchies.

49

Figure 3.3: Locking car attributes

Consider the example shown in Figure 3.3. At iteration t the algorithm has discovered

two hierarchies, one rooted with the car make Ford and one rooted with the car make

Honda. Each of these makes also has a few models associated with them (their children).

The bottom of the figure shows some future time (iteration t + y), at which point the

system decides to lock the make attribute, but not the model and trim. Therefore, the

algorithm can still add car models (as it does with the Taurus model to the Ford make)

and car trims (as with the LX trim for the Civic model). However, since the make

attribute is locked, no more make attributes are allowed, and therefore the hierarchy that

would have been rooted on the make “Brand” with model “New” (which is noise) is not

allowed. This is why it is shown on the right as being crossed out.

In this manner, the locking acts like a pre-pruner of attribute values. Rather than

having to prune away errant attributes after discovering the reference set tuples, the

intent is that the algorithm will lock the attributes at the right time so as to minimize

the number of noisy attributes that may be introduced at later iterations. This works

because noise is often introduced as the algorithm iterates, but not in the beginning. In

50

my example, instead of post-pruning away the hierarchy rooted on “Brand,” the algorithm

instead prevented it from being added by locking the make attribute.

The locking mechanism also provides an algorithmic method to stop processing posts.

Simply, if all attributes (i.e. all levels of the hierarchies) become locked, there is no point

in seeing any more posts, so that would be the appropriate number of posts at which the

algorithm can stop creating reference set tuples. So, this introduces a way to determine

the number of posts needed for constructing the reference set. In practice, a user can

send posts into the algorithm, which in turn constructs and locks, and if attributes remain

unlocked, the machine requests more posts from the user. This would repeat until all

attributes are locked. Of course, given the prevalence of such technologies as RSS feeds,

a user would not even have to supply the algorithm with posts. It could simply request

them itself via RSS until it does not require any more.

Therefore, I leverage the notion that the locking process requests more posts from the

user until it locks all attributes. Essentially, at each request, the machine compares what it

can discover using the current set of given posts to what it can discover using the last set of

given posts (note that the newly requested set supersedes the previous set). For example,

the user starts by giving the algorithm 100 posts, and then the algorithm requests more,

at which point the person supplies 100 more posts. The algorithm compares what it

can discover by using the first 100 posts as compared to what it can discover using the

combined 200 posts. If the algorithm thinks it can’t discover more from the 200 than the

100, then it locks certain attributes (Again, note Posts100 ⊂ Posts200).

To do this comparison for locking the algorithm compares the entropies for generating

a given attribute, based upon the likelihood of a token being labeled as the given attribute.

51

So, in the cars domain it calculates the entropy of a token being labeled a make, model or

a trim (note that these label names are given post-hoc, and the machine simply regards

them as attribute1, attribute2, etc.). For clarity, I calculate the entropy H for the make

as:

H(make) = −∑

x∈tokens

px ∗ log(px)

The entropy of labeling a particular attribute can be interpreted as the uncertainty

of labeling tokens with the given attribute. So, as the algorithm sees more and more

posts, if the entropy does not change from seeing 100 posts to seeing 200 posts, then the

uncertainty in labeling that attribute is steady so there is no need to keep discovering

attribute values for that attribute in more posts. However, I cannot directly compare

the entropies across runs, since the underlying amounts of data are different. So, I use

normalized entropy instead. That is, for attribute X, across N posts, I define:

H(X)Norm =H(X)logN

Although entropy provides a measure of uncertainly for token labels, it does not

provide a sufficient comparison between runs over varying numbers of posts. To provide

an explicit comparison, I use the “maximum redundancy” metric, which uses the entropies

across runs to provide a normalized measure of scaled information. For a given attribute

X, if we are comparing runs across 100 posts and 200 posts, denoted X100 and X200

respectively, the maximum redundancy, RMax is defined as:

RMax =min(H(X100), H(X200))H(X100) +H(X200)

52

Maximum redundancy is a useful stopping criteria. When the maximum redundancy

is zero, the variables are independent, and it can only reach a maximum value, (RMax).

Reaching RMax means that one variable is completely redundant if you have knowledge

about the other variable. So, if the algorithm is given some set of posts, (i.e. my 100 and

200 posts), and it finds the maximum value for RMax for a given attribute between these

posts then it knows it can lock that attribute at 100 posts, since the entropy using the

200 did not yield more information. So, the algorithm locks an attribute when it finds

the maximum value for RMax for that attribute.

Table 3.3 summarizes the above technique for locking the attributes when constructing

a reference set.

Table 3.3: Locking attributesLockingAttributes(Attributes a, Posts pi, Posts pj)# pi are the first set of posts# pj are the second set, such that# pi ⊂ pj

for each a ∈ Attributesif a is not locked

HNorm(ai)← Entropy(pi, a)HNorm(aj) ← Entropy(pj , a)If RMax(HNorm(ai), HNorm(aj)) is maximumAND Parent(a) is locked

Lock(a)if all a ∈ Attributes are locked

return locked

There are two small things to note. First, I add a heuristic that an attribute may

only lock if its parent attribute is already locked, to prevent children from locking before

parents. Second, although the above discussion repeatedly mentions the machine request-

ing more posts from a user, note that the algorithm can easily automatically request its

53

own posts from the sources using technology such as RSS feeds, alleviating any human

involvement beyond supplying the URL to the source.

Therefore, by using the above technique to lock the attributes, even if the algorithm

generates new children attribute values, the parents will be locked, which acts as an

automatic pruning. Further, the algorithm now knows how many posts are required to

automatically construct a reference set from the source of posts.

Table 3.4 ties all of the aspects together (constructing reference set tuples, iterating

for the general tokens, and locking) yielding my algorithm for automatically building

reference set tuples from posts, which I call the “Iterative Locking Algorithm” (ILA).

Table 3.4: ILA method for building reference sets

BuildReferenceSet(Posts, x, y)# x is the number of posts to start with# y is the number of posts to add each iterationlocked ← falsewhile(locked is false)

Posts p ← GetPosts(x, y)ReferenceSet rs ← BuildReferenceSet(p) # BuildReferenceSet as in Table 3.1

while(continue)Updates ← FindGeneralTokens(rs) # Find general tokens using combined probabilities

if |Updates| i== 0 # No updates foundcontinue ← false # Stop iterating: No more general tokens found

elsers ← rs ∪ Updates # Found new general tokens, keep iterating

Attributes a ← GetFoundAttributes(rs)x← yPosts q ← GetPosts(x, y)repeat for q # Need generate attributes for qlocked ← LockingAttributes(a, p, q) # Defined in Table 3.3

Figures 3.4, 3.5, and 3.6 show hierarchies that ILA constructed for different domains

using real-world data. Specifically, Figure 3.4 shows an entity hierarchy constructed from

54

car posts from the classified website Craigslist rooted on “Honda.” Figure 3.5 shows

an entity hierarchy for laptop computers from Craiglist rooted on “HP,” and Figure 3.6

shows a hierarchy for skis, rooted on the “Stockli” ski brand. Note the ski posts come from

eBay. These examples show that across domains many attribute values are discovered

correctly. However, they also demonstrate two common types of errors that occur. First,

there are cases where noisy nodes are included in the hierarchies. For example, consider

the node “Autos” under “Honda” in Figure 3.4 and the node “Fun” under the ski model

“Sinox” in Figure 3.6, which are both noise. Another common error is placing a node in

the wrong level of the tree. For example, in Figure 3.5, the “ZT3000” node is actually

a specific model number for a “Pavilion” laptop, so it should be a child of “Pavilion”

rather than put in its current location. This misplacement generally occurs when people

leave out an implied attribute. That is, for the posts about ZT3000 laptops, the sellers

assumed that the buyers would know it’s a Pavilion, so in their posts they include the

attributes HP and ZT3000, but leave out Pavilion. Despite these errors, as the results

show, my method can use these automatically constructed reference sets for successful

extraction.

Figure 3.4: Hierarchy constructed by ILA from car classified ads

55

Figure 3.5: Hierarchy constructed by ILA from laptop classified ads

Figure 3.6: Hierarchy constructed by ILA from ski auction listings

As described above, once ILA constructs the entity hierarchies, it flattens them into

reference set tuples by starting at the root node of the hierarchy and tracing each path,

assigning each node along the path to an attribute. The flattened reference sets for

Figures 3.4, 3.5, and 3.6 are give in Tables 3.5, 3.6, and 3.7 respectively. Note, the

attribute names in the tables are provided for ease of reading. The system outputs the

attribute names as “Attribute0,” “Attribute1,” etc. since it does not know the schema.

56

Further, the correct tuples in each table are shown in bold. Interestingly, in some cases

the users incorrectly spelled the attribute value, but ILA still produces the correct tuple

using this misspelled value. The “Honda Civc” tuples of Table 3.5 show this behavior.

Lastly, determining the “correctness” of a tuple is not easy. For instance, the “CRX” car

by Honda is known both as its own model, and as a type of Civic. I choose to consider

it its own model.

These tables make some of ILA’s mistakes clearer than the hierarchies. For example,

it is much clearer that certain attributes that were placed in the wrong tree level end

up in the wrong attribute slot. Another interesting mistake that sometimes occurs, and

which is made clear by the tables, is that an attribute can appear in different locations for

different tuples. For instance, in Table 3.6, the value “DV2000” occurs as both a model

and a model number. This is due to the fact that sometimes posters include the term

Pavilion when selling this laptop, and sometimes they do not.

Table 3.5: Reference set constructed by flattening the hierarchy in Figure 3.4

Make Model Trim Trim Spec Make Model Trim Trim SpecHONDA ACCORD HONDA CRVHONDA ACCORD 00 HONDA CRV EXHONDA ACCORD 121K HONDA CRXHONDA ACCORD 500 HONDA CRX SIHONDA ACCORD DX HONDA DXHONDA ACCORD EX HONDA ELEMENTHONDA ACCORD EXL HONDA ELEMENT EXHONDA ACCORD LX HONDA HASHONDA ACCORD SE HONDA MECHANICHONDA AUTOS HONDA ODYSSEYHONDA CIVC HONDA ODYSSEY EXHONDA CIVC DX HONDA ODYSSEY EXLHONDA CIVIC 626 HONDA ODYSSEY MINIVANHONDA CIVIC CRX HONDA ODYSSEY EX 3800HONDA CIVIC CX HONDA ODYSSEY EX MINIVANHONDA CIVIC DX HONDA PASSPORTHONDA CIVIC EX HONDA PASSPORT EXHONDA CIVIC HATCH HONDA PILOTHONDA CIVIC HYBRID HONDA PRELUDEHONDA CIVIC LX HONDA PRELUDE SIHONDA CIVIC SI HONDA S2000HONDA CIVIC CX HATCH

57


Brand Model Model Num.HP DESKJETHP DV2000HP DV5000HP DV9410USHP NC6000HP PAVILIONHP PAVILION 5150HP PAVILION DVHP PAVILION DV2000HP PAVILION DV2500THP PAVILION DV5000HP PAVILION DV6000HP PAVILION DV6500HP PAVILION DV6500THP PAVILION DV9410USHP PAVILION TX1000ZHP PAVILION ZD8000HP PAVILLIONHP PAVILLION DVHP PAVILLION DV2000HP PRESARIOHP PROLIANT ML110HP ZT3000


Brand Model Model Spec. OtherSTOCKLI BCSTOCKLI ROTORSTOCKLI SCOTSTOCKLI SINOXSTOCKLI SINOX FUNSTOCKLI SNAKESTOCKLI SNAKE BCSTOCKLI SPIRIT EDSTOCKLI SPIRIT ED VISTOCKLI SPIRIT SCSTOCKLI STORMRIDER ATSTOCKLI STORMRIDER DPSTOCKLI STORMRIDER SCOTSTOCKLI STORMRIDER XL

58

One useful aspect of constructing the reference set directly from the posts, versus

manually constructing the reference set from a source on the Web has to do with the

nature of using reference sets for extraction. It is not enough to simply construct a

reference set from data on the Web and expect it to be useful across sets of posts. For

example, one might be able to construct a reference set of cars from Edmunds, but their

data only goes back a few decades. What if the cars in the posts are all classic cars?

However, by building the reference set directly from the posts themselves, users gain

assurance that the reference set will be useful. This is evident in my experiments where

I find that sometimes the reference set collected from freely available data on the Web

does not always have enough useful coverage for certain attributes.

As I mention above, there are alternative methods for building term hierarchies (sub-

sumptions) from text. However, these methods are not as well suited as the Sanderson

and Croft method [54] for my problem. First, since my data is ungrammatical, I can-

not use Formal Concept Analysis which relates verbs to nouns in the text to discover

subsumptions [13]. I have plenty of nouns in posts, but almost no verbs. Further, since

my algorithm runs iteratively due to the “general token” problem, I need a method that

runs efficiently. Using the Sanderson and Croft method, my algorithm runs in O(n) time

where n is the number of tokens from the posts, since my process scans the posts to

create the bigrams and calculate the probabilities, and then considers only those with

high enough probabilities. However, methods such as Principal Component Analysis [23]

run in O(n3) with respect to the token by token matrix (which may be sparse), and so

this method is not suitable for a large number of tokens and more than one iteration.

59

3.2 Experiments

My main purpose for discovering reference sets is to use them for reference-set based

information extraction from unstructured, ungrammatical data. Therefore, as my exper-

imental procedure, I first discover the reference set for a given set of posts, and then I

perform information extraction using the discovered reference set. If I have discovered a

useful reference set from the posts, the extraction results reflect that. For the extraction

itself, I use the automatic method of the previous chapter.3

I present a number of experiments using varied data sources to provide insight into

different aspects of my approach. However, note that all results are reported as field-level

extraction results using the standard measures of precision, recall and F1-measure for

each of the data sets.

Table 3.8: Post SetsName Source AttributesCraig’s Cars Craigslist make, model, trimLaptops Craigslist manufacturer, model, model num.Skis eBay brand, model, model spec.

The three sets of posts used in the experiments are outlined in Table 3.5. The first set

of posts is the same Craig’s cars set used in the previous chapter. The second set of posts

are laptops for sale from Craigslist, and the third set are skis from eBay. Again, I use a

variety of source types (classified ads and auction listings) and a variety of domains. For

each domain I labeled 1,000 posts fully, which I use to generate the field level extraction

results shown below. I concentrate on the particular attributes outlined above for each3Note, however, that the ILA method sometimes includes common terms, such as “Free Shipping” in

the built reference set, and so to compensate for this, I use TF/IDF for finding reference-set matches,rather than the Dice similarity, since TF/IDF will discount the commonly occurring terms.

60

domain because those are the common attributes found in online sources that I use to

manually create reference sets for extraction to compare the ILA approach against.

The first set of experiments demonstrate the utility of using my iterative method to

deal with the “general token” effect, versus doing a single pass for building the refer-

ence set. These experiments also show that the locking mechanism of the ILA approach

provides a level of cleaning when using the iterative method, in that by locking cer-

tain attributes generates better results because less noisy tuples are introduced to the

constructed reference set.

Table 3.9 shows the comparative results for the different domains. For each domain,

the number of posts required by the system is shown next to the domain name. Although

this number is determined by the locking mechanism of the ILA algorithm, it is also

used for the iterative and single pass method, because without it, I would not know how

many posts the algorithms should examine. Each method’s results are referred to by

their corresponding name (Single Pass, Iterative, and Iterative-Locking (ILA)), and the

number of reference set tuples generated by each method is shown in parentheses next to

the method’s name.

As Table 3.9 shows, my approach can effectively deal with the general token issue.

This is especially apparent for the more specific attributes, such as the Car trims, the

Laptop model numbers, and the Ski model-specifications where the improvement using

the iterative and locked (ILA) methods are dramatic. Further, by adding the locking

mechanism I can avoid some of the noise introduced by iterating multiple times over the

set of posts. This is most dramatically apparent in the Laptop domain where the brand

61

Table 3.9: Field-level extraction results

Craig’s Cars: 4,400 posts Laptops: 2,400 postsMake Recall Prec. F1-Meas. Make Recall Prec. F1-Meas.Single Pass (227) 79.19 85.30 82.13 Single Pass (148) 49.63 50.77 50.19Iterative (606) 79.31 84.30 81.73 Iterative (531) 51.27 46.22 48.61ILA (580) 78.19 84.52 81.23 ILA (295) 60.42 74.35 66.67Model Recall Prec. F1-Meas. Model Recall Prec. F1-Meas.Single Pass (227) 64.14 85.67 73.36 Single Pass (148) 51.49 61.11 55.89Iterative (606) 64.77 84.62 73.38 Iterative (531) 54.47 49.52 51.87ILA (580) 64.25 82.79 72.35 ILA (295) 61.91 76.18 68.31Trim Recall Prec. F1-Meas. Model Num. Recall Prec. F1-Meas.Single Pass (227) 3.91 66.67 7.38 Single Pass (148) 11.16 97.96 20.04Iterative (606) 23.45 54.10 32.71 Iterative (531) 25.58 77.46 38.46ILA (580) 23.45 52.17 32.35 ILA (295) 27.91 81.08 41.52

Skis: 4,600 postsBrand Recall Prec. F1-Meas.Single Pass (455) 53.69 50.58 52.09Iterative (1392) 60.59 55.03 57.68ILA (1392) 60.84 55.26 57.91Model Recall Prec. F1-Meas.Single Pass (455) 43.24 53.88 47.98Iterative (1392) 51.86 51.25 51.55ILA (1392) 51.33 48.93 50.10Model Spec. Recall Prec. F1-Meas.Single Pass (455) 20.22 78.99 32.19Iterative (1392) 42.37 63.55 50.84ILA (1392) 39.14 56.35 46.29

and model are clearly noisy in the iterative method, but cleaned up by using the locking

mechanism (ILA).

Note that although the locking only seems to dramatically provide pruning for the

Laptop domain, it also serves the very important function of letting us know how many

posts to examine. So, even if it does not provide tremendous pruning (in some cases,

such as the car domain the pruning is not needed), without this locking I would have no

idea how many posts I need to examine in order to build the reference set.

My next set of experiments compare extraction results using the automatically con-

structed reference set (using the ILA algorithm) against both a manually constructed

reference set using the ARX method of the previous chapter, and a supervised machine

learning approach. By comparing to the ARX method using a manually constructed

62

reference set, I can show the benefits of automatically building the reference set for ex-

traction, versus the tedious work of building it manually. Further, this justifies my claim

that in many cases, building a reference set manually is not necessary since competitive

results can be achieved by automatically building the reference set. Also, by comparing

against supervised machine learning approaches I can show that the automatic method

is competitive with expensive supervised methods, and in some cases, outperforms them

due to the unstructured nature of the posts.

For the manually constructed reference sets, I built the reference sets from freely

available data on the Web. For the Craig’s Cars domain, I again use the reference set

from Edmunds, outlined in the previous chapter. For the Laptops domain, I constructed

a reference set of laptops using the online store Overstock.com. Collecting new laptops

for sale provides an interesting reference set because while the set of laptop makers is

static, the models and model numbers of the laptops that are for sale as new might not

cover the used laptops for sale on Craigslist, which are generally older. Lastly, for the

Skis domain I found a list of skis and ski apparel online from the website skis.com.4 From

this list, I parsed out the skis, and increased the reference set’s utility by removing tokens

in the attributes that could cause noisy extractions, such as the common token “Men’s.”

The three manually constructed reference sets are described in Table 3.10.

For the machine learning comparison, I again use the CRF-Orth and CRF-Win imple-

mentations of Conditional Random Fields, described in the previous chapter. Again, it is

important to note that CRF-Win in particular demonstrates how the unstructured nature

of posts makes extraction difficult. I trained these extractors using 10% of the posts, since4http://www.skis.com/affiliates/skis.xls

63

Table 3.10: Manually constructed reference-sets

Name Source Attributes Numberof Tuples

Cars Edmunds and make, model 27,006Super-Lamb Auto trim, year

Laptops Overstock.com manufacturer, model, model num. 279Skis Skis.com brand, model, model spec. 213

this is a small enough amount to make it a fair comparison to automatic methods, but is

also large enough such that the extractor can learn well. Also, since these are supervised

methods, each was run 10 times and I report the averages. Table 3.11 shows the results

comparing the ARX method using the ILA method to build the reference set, to these

supervised methods and the ARX method using a manually constructed reference set

(denoted by its source).

The results comparing the fully automatic method to the other methods are very

encouraging. The fully automatic ILA method is competitive with the other methods

across all of the domains. In fact, the ILA method’s F1-measure either outperforms or is

within 10% of the F1-measure for four out of the nine attributes as compared to both the

CRF-Orth and manually constructed reference-set methods. This is the case for seven out

of the nine attributes when compared against the CRF-Win. So, clearly the automatic

method is competitive with previous appraoches that required a significant amount of

human effort in either labeling and training a supervised method, or else in manually

constructing a reference set to use for extraction.

Further, the fact that the CRF-Win method is often the worst performing method

also points to the difficulty of using structure as an aid when performing extraction on

posts. If the structure were very consistent, one would expect this method to greatly

64

outperform all of the other methods all of the time, since its key feature is the structural

relation of the tokens.

Table 3.11: Comparing field-level results

Cars LaptopsMake Recall Prec. F1-Meas. Make Recall Prec. F1-Meas.ILA (580) 78.19 84.52 81.23 ILA (295) 60.42 74.35 66.67Edmunds (27,006) 92.51 99.52 95.68 Overstock (279) 84.41 95.59 89.65CRF-Win (10%) 80.09 77.38 78.67 CRF-Win (10%) 52.35 72.84 60.40CRF-Orth (10%) 82.03 85.40 83.66 CRF-Orth (10%) 64.59 81.68 71.93Model Recall Prec. F1-Meas. Model Recall Prec. F1-Meas.ILA (580) 64.25 82.79 72.35 ILA (295) 61.91 76.18 68.31Edmunds (27,006) 79.50 91.86 85.23 Overstock (279) 43.19 80.88 56.31CRF-Win (10%) 67.87 69.67 68.72 CRF-Win (10%) 44.51 68.32 53.54CRF-Orth (10%) 71.04 77.81 74.25 CRF-Orth (10%) 58.07 78.29 66.54Trim Recall Prec. F1-Meas. Model Num. Recall Prec. F1-Meas.ILA (580) 23.45 52.17 32.35 ILA (295) 27.91 81.08 41.52Edmunds (27,006) 38.01 63.69 47.61 Overstock (279) 6.05 78.79 11.23CRF-Win (10%) 38.77 39.35 38.75 CRF-Win (10%) 42.15 65.22 50.66CRF-Orth (10%) 47.94 47.90 47.88 CRF-Orth (10%) 53.18 68.58 59.48

SkisBrand Recall Prec. F1-Meas.ILA (1392) 60.84 55.26 57.91Skis.com (213) 83.62 87.05 85.30CRF-Win (10%) 63.84 87.62 73.58CRF-Orth (10%) 74.37 87.46 80.22Model Recall Prec. F1-Meas.ILA (1392) 51.33 48.93 50.10Skis.com (213) 28.12 67.95 39.77CRF-Win (10%) 51.08 74.83 60.15CRF-Orth (10%) 61.36 73.15 66.59Model Spec. Recall Prec. F1-Meas.ILA (1392) 39.14 56.35 46.29Skis.com (213) 18.28 59.44 27.96CRF-Win (10%) 46.98 71.71 56.25CRF-Orth (10%) 59.10 65.84 61.89

The lone attribute where the ILA method is greatly outperformed by the other meth-

ods is the brand attribute for Skis. While there are only 38 distinct brands to extract in

the labeled data, the automatically built reference set contains 158 distinct brands. This

discrepancy occurred for two reasons. First, some of the constructed “brands” were com-

mon words such as “Free” (as in Free Shipping). This causes problems because a post

that says “ORG RETAIL: $925.00! FAST, FREE SHIPPING.” will match the {Free,

Shipping} tuple, creating a false positive. The other major reason that false brands are

65

put into the reference set happens because certain ski models were included without the

appropriate brand (such as the “Mojo” ski which is pushed into the reference set without

the “Head” brand). When this occurs, the correct reference set tuple is built, but the

extracted attribute is given the incorrect label (i.e. the “Mojo” token is labeled as a

brand, rather than a model). This confluence of circumstances led the ILA method to

poor results. However, given that the ILA method is fully automatic, it is important to

note that that this is the only attribute where ILA was greatly outperformed by all of

the other methods.

Table 3.12: Comparing the ILA methodILA vs. CRF-Win

Outperforms Within 10%4/9 7/9

ILA vs. CRF-OrthoOutperforms Within 10%

1/9 4/9ILA vs. ARX

Outperforms Within 10%4/9 4/9

It is also interesting to specifically examine the differences between the automatically

constructed and manually constructed reference sets. The ARX method using manual

reference-sets does a better job for the more general attributes, such as car makes and

models, but for the more specific attributes (laptop model numbers, ski models, and ski

model specifications) it is greatly outperformed by the automatic ILA method. What is

particularly encouraging about this result is that these are the attributes that are very

difficult to include in a manually constructed reference set because they are difficult to

manually discover and it is not clear whether their coverage will hold. For example, it is

66

hard to enumerate the laptop model numbers and ski model specifications because there

are so many, they are so varied, and they are hard to locate in a single place, so it requires

substantial effort to construct such an inclusive reference set. Further, when comparing

the ILA results to the Overstock results for the Laptop models and model numbers, it is

clear that the coverage of the Overstock laptops (which are new) does not cover the posts

(which are older, used laptops), since the results for these attributes using Overstock

have such low recall. Table 3.12 summarizes the comparisons by showing the number of

times the ILA method outperforms the given method and the number of times the ILA

method’s F1-measure is within 10% of the given method.

3.3 Applicability of the ILA Method

This chapter presents an automatic algorithm for building reference sets from posts and

then using them for extraction. However, the construction algorithm makes a few assump-

tions about the data it is processing. For one, the algorithm works well for hierarchical

data. Although this is an enormous set of categories (e.g., all items for sale, descriptive

categories such as geographical and person data, etc.) there are some categories that lack

this characteristic (e.g. personal ads). More importantly, there are particular cases (as I

will show) where the structure of the reference set attributes make it difficult to automat-

ically construct a reference set. In this section I examine these cases and present a simple

algorithm that can determine whether the fully automatic method can be used to build

a reference set for extraction. If the system determines that it cannot use the automatic

method of this chapter, then the user should construct a reference set manually, which

67

he or she can then exploit using the methods in this thesis. In this manner, the special

cases requiring a manually constructed reference set can be determined easily.

3.3.1 Reference-Set Properties

There are certain properties of the underlying entities represented in the posts that allow

my algorithm to construct the reference set well. Specifically, in the above experiments I

show three domains (Cars, Laptops and Skis) where my algorithm performed well, even

in some cases outperforming the state-of-the-art machine learning techniques. Yet, there

are other domains where my technique has trouble building a reference set, and hence

extracting the attributes. However, as stated above, if one is confronted with a domain

that does not conform well to my automatic technique, a user can still tackle the problem

by manually creating a reference set and using the extraction methods presented in this

thesis.

To motivate this idea I present two domains where the automatic method does not

construct a coherent reference set for extraction, but, given a manual reference set for

these domains, the automatic, reference-set based extraction method outperforms the

CRF-Win approach, which is supervised. The first domain, is the “BFT Posts” domain

of the previous chapter. The second domain, called “eBay Comics,” contains 776 eBay

auction listings for Fantastic Four and Incredible Hulk comic books and comic-themed

items. These domains are described in Table 3.13.

Table 3.13: Two domains where manual reference sets outperformName Source Website Attributes RecordsBFT Posts Bidding for Travel www.biddingfortravel.com star rating, area, name 1,125eBay Comics eBay comics: Fantastic Four & www.ebay.com title, issue number, 776

Incredible Hulk description, publisher

68

These are particularly interesting domains because not only do they present prob-

lems for automatically building a reference set, these domains are also very unstructured

and present problems to the CRF-Win method. Yet, as the results of Table 3.14 show,

exploiting a manually constructed reference-set, using the ARX method, yields good re-

sults. For the BFT Posts domain, the reference set is the same 132 hotels described in the

previous chapter as the “Hotels” reference set. For the eBay Comic domain, the reference

set is the 918 tuples from the Comic Book Price Guide, called “Comics” in the previous

chapter. To reiterate, each tuple of the Comics dataset contains a title, issue number,

description (a one line of text such as “1st appearance of Wolverine!”), and publisher.

As before, I ran CRF-Win 10 times, giving it 10% of the data for training, and present

average values for the results using this method.

Table 3.14: Results on the BFT Posts and eBay Comics domainseBay Comics BFT Posts

Title Recall Prec. F1-Meas. Area Recall Prec. F1-Meas.Comics 90.08 90.78 90.43 Hotels 62.23 85.36 71.98CRF-Win 77.69 77.85 77.77 CRF-Win 28.51 39.79 33.07ILA 30.93 38.83 34.43 ILA 20.14 16.94 18.40Issue Num. Recall Prec. F1-Meas. Star Rating Recall Prec. F1-Meas.Comics 46.21 87.21 60.41 Hotels 83.94 99.44 91.03CRF-Win 80.67 80.52 80.59 CRF-Win 93.77 94.67 94.22ILA 8.40 22.55 12.24 ILA 0.00 0.00 0.00Description Recall Prec. F1-Meas. Name Recall Prec. F1-Meas.Comics 24.80 15.90 19.38 Hotels 70.09 77.16 73.46CRF-Win 6.08 14.56 8.78 CRF-Win 39.84 42.95 41.33ILA 0.00 0.00 0.00 ILA 0.47 1.12 0.66Publisher Recall Prec. F1-Meas.Comics 100.00 84.16 91.40CRF-Win 38.42 86.84 52.48ILA 0.00 0.00 0.00

The results validate that in some cases, a manually constructed reference set using the

ARX method is the best option. CRF-Win only outperforms the other methods for two of

the attributes (Hotel’s star rating and Comic’s issue number), and only the issue number

69

is dramatically different. Both the Hotel’s star rating and the Comic’s issue number

have a numeric component, and since a number recognizer is one of the orthographic

features used by CRF-Win, this explains the CRF method’s high performance for those

attributes. However, this is in stark contrast to the other 5/7 attributes where CRF-

Win is drastically outperformed. The smallest difference between CRF-Win and ARX is

roughly 11%, and for 3 of the 5 cases where CRF-Win is outperformed, it is done so by

more than 30%. Clearly the data is unstructured (otherwise one would expect CRF-Win

to do better). Yet, one can sucessfully perform extraction on this data without labeling

training data (as in the CRF-Win) by manually constructing a reference set.

The results also indicate that the automatically constructed reference sets from the

ILA approach did not work for these domains. Clearly the ILA method is not discovering

useful reference set tuples. For some the attributes, such as the Hotel star rating and

Comic description, the algorithm did not discover a single attribute, resulting in no

extractions. Part of the issue for ILA is the small number of posts used for construction

for each domain, but there are other properties of these data sets that make it difficult

to automatically build reference sets from them.

Specifically, there are two factors that make it difficult to automatically construct

reference sets from certain posts. The first is the proportion of junk tokens (i.e. those

that should be ignored) as compared to attribute tokens (those that should be extracted).

If the proportion of junk to attribute tokens is close to even, then it is hard to distinguish

what is junk and what is not when generating attributes for the constructed reference

set. Along these lines, if many of the attributes for a given reference set are multi-token

terms, especially within a single tuple, then it is hard for the ILA algorithm to determine

70

where one attribute ends and the next begins, resulting in an ill defined reference set

tuple. A good example of this is a hotel which may have a multi-token name (“Courtyard

by Marriott”) and a multi-token area (“Rancho Cordova”), back to back within a post,

e.g. “2.5* Courtyard Marriott Rancho Cordova $43 4/14.” Both of these problems are

“boundary issues” where ILA gets confused as to which tokens belong to which attribute.

These boundary issues can be studied by creating a distribution of token characteris-

tics that describes the textual properties of the posts and the relationships between the

junk tokens and the attribute tokens. I define five distinct types of bigrams and define my

distribution as the likelihood of each bigram type occurring in the set of posts. Table 3.15

defines these bigram types along with an example of the type (in bold) within the con-

text of of a post that contains two attributes: a car make (Land Rover) and a car model

(Discovery). Note that Table 3.15 also lists a name for each bigram type which I use to

reference that type. Therefore, by comparing distributions of these five bigram types for

the posts across various domains I can compare the characteristics of each domain and

see if the boundary problems might occur.

Table 3.15: Bigram types that describe domain characteristicsAttri | Attrj Two tokens from . . . brand new Land Rover Discovery for . . . ...(“DIFF ATTR”) different attributesAttri | Attri Two tokens from . . . brand new Land Rover Discovery for . . . ...(“SAME ATTR”) the same attributeAttri | Junk A token . . . brand new Land Rover Discovery for . . . ...(“ATTR JUNK”) followed by junkJunk | Attri A junk token followed . . . brand new Land Rover Discovery for . . . ...(“JUNK ATTR”) by an attributeJunk | Junk Two junk tokens . . . brand new Land Rover Discovery for . . . ...(“JUNK JUNK”)

The key concept of the bigram distributions are the distributions of the junk tokens

and the tokens of the various attributes. However, this requires labeled data since the

71

bigram types distinguish between attribute-value tokens and junk tokens.5 Yet, the ILA

approach is automatic. If I want a mechanism by which the system can determine whether

or not to automatically construct the reference set, then I do not want to burden a user

with labeling large amounts of data to generate the bigram distributions. Note that

deciding to automatically construct or not is a pre-processing step, and as such it is

optional. Therefore, I do not think it is a problem to mix the single-label approach with

the fully automatic ILA method, since one could use ILA, automatically, without using

the approach described here.

So, to generate the labeled data for building the distributions in an efficient and simple

manner, a user takes a single post from a domain, and labels it. Then, the algorithm finds

all of the posts in the set that also contain the labeled tokens, and treats those tokens

as labeled within the new post. By doing this bootstrapping I can generate sufficiently

labeled data, and this is a process easy enough that I expect little burden on a user. One

benefit of this approach is that it is robust. It is only used for estimating distributions,

so I am not overly concerned about false labels, whereas I might be in a machine-learning

context.

As an example, given the post “Honda Accord 2002 - $5000” I would label the make

as Honda, the model as Accord and the year as 2002. Then, I would retrieve the following

posts, with the labeled attributes bootstrapped from my single labeled example:5Also, I make a distinction to define an attribute as only those values contained in a reference set. This

means I do not define prices, dates, etc. as attributes in this context. I do this to define the distributionswith respect to the reference set attributes only, since those are what I want ILA to discover.

72

Wanted Honda Accord Coupe {Make= Honda, Model=Accord}

2002 Honda Accord EX Sedan 4D {Make= Honda, Model=Accord, Year=2002}

2002 Accord for sale {Model=Accord, Year=2–2}

Accord 2002 must see!! {Model=Accord, Year=2002}

. . .

Once the algorithm has bootstrapped the labeled data, it can generate the distribution

using the five bigram types. To test the bootstrapping, and to generate the distributions

used below, I ran the bootstrapping method 20 times and generated the distributions

for each of the five domains described in this chapter (BFT Posts, eBay Comics, Craig’s

Cars, Skis and Laptops). Table 3.3.1 shows the average number of labeled posts returned

by bootstrapping a single post for each domain.6

Table 3.16: Number of bootstrapped postsDomain Avg. # Bootstrapped Posts

BFT Posts 20.95Craig’s Cars 10.90eBay Comics 31.30

Laptops 65.75Skis 16.35

Now that I have a mechanism to simply and efficiently generate the distributions,

I use these distributions to compare the domains. In particular, given that the ILA

technique is unable to discover useful reference sets from the BFT Posts and eBay Comics

domains, I next show that these domains have similar distributions of the bigram types.

Further, these distributions show that the boundary issues are indeed the impediment6I did not allow the case where bootstrapping returns less than 1 post from the set, because that would

mean you did not bootstrap at all, but only have the original labeled post. However, this can be easilyavoided by first clustering the tokens and then simply picking a single post from the largest cluster.

73

to successfully building the reference sets. On the other hand, the distributions for the

Craig’s Cars, Skis and Laptop domains are different distributions from the BFT Posts

and eBay Comics domains (and similar distributions to each other), which is expected

since the automatic method works well in these cases. Figure 3.7 shows the distribution

values for the different bigram types for each of the five domains averaged over the 20

trials. The x-axis of Figure 3.7 gives the percentage of the total distribution for each

bigram type, and the y-axis shows each bigram type referenced by its name as given in

Table 3.15.

Figure 3.7: Average distribution values for each domain using 10 random posts

Figure 3.7 shows that the eBay Comics and BFT Posts domains have similar distri-

butions to each other which are different from the rest of the domains. Further, both

the eBay Comics and BFT Posts domains have similar values for the “SAME ATTR”

bigram type and the “JUNK JUNK” type. Meanwhile, the other domains have very low

values for the “SAME ATTR” bigram type and very high values for the “JUNK JUNK”

74

type. The fact that the BFT Posts and eBay Comics domains have comparative values

for the “SAME ATTR” bigram type and the “JUNK JUNK” type shows that these two

domains have many multi-token attributes from the same tuple, but there are almost as

many junk tokens intermingled in the post which leads to boundary issues.

Specifically, according to Figure 3.7, in the BFT Posts and Comics domains about

25% of the bigrams in a post are from the same attribute. Couple this with the fact

that almost the same number of bigrams represent two junk tokens (“JUNK JUNK” in

the graph) or two tokens from a different attribute (“DIFF ATTR” in the graph), and it

is clear to see that ILA gets confused as to what attribute a token should belong to, or

whether that token is indeed an attribute instead of junk. Contrast this with the Skis,

Laptops, and Craig’s Cars domains, where a large majority of the tokens are junk, and

those that are not are usually from different attributes (though not always), which is a

case well suited for my automatic method. I must make note, however, that the Laptops

and Craig’s Cars domains have the same distribution for different attribute tokens as the

Comics domain, which shows my technique is able to find multi-token attributes. It is

more an issue of mixing the tokens of the many multi-token attributes that creates the

boundary problem. For instance, in the Comics post “FANTASTIC FOUR ANUAL [sic]

#2- DR.DOOM ORIGIN” it is hard to distinguish where one attribute ends and the next

begins, given the many multi-token attributes.

To make the similarities between the Skis/Laptops/Cars and the BFT/Comics distri-

butions more explicit than looking at the graph of Figure 3.7 I quantify the similarities

mathematically. Since the graph compares probability distributions for the different bi-

gram types, a natural way to quantify the comparison of the distributions is to use the

75

Kullback-Leibler (K-L) divergence, which measures the difference between two proba-

bility distributions. If one calls these distributions P and Q, then the Kullback-Leibler

divergence is:

KL(P ||Q) =∑

i

P (i) logP (i)Q(i)

If two distributions are similar, they have a low divergence. So, the BFT Posts and

eBay Comics domains will have a low divergence when compared to each other using the

average distributions. Further, I expect they will have a high divergence when compared

to the other domains. However, K-L divergence is asymmetric (it only compares P to

Q and not Q to P), so to make it symmetric I use a symmetric K-L divergence instead,

which is defined as:

SKL(P ||Q) =KL(P ||Q) +KL(Q||P )

2

Note that from this point forward, when I reference K-L divergence, I mean the symmetric

version.

Table 3.14 shows the K-L divergence for all pairs of domains for the average distribu-

tions presented in Figure 3.7. The results for Table 3.14 are presented in ascending order

of K-L divergence, which means that the pairs at the top of the table are the most simi-

lar. Clearly, from the table, the Laptops, Skis, and Craig’s Cars domains are all similar

to each other, and the BFT Posts and eBay Comics domains are similar to each other.

Further, the members of the Laptops/Skis/Cars group are all dissimilar as compared to

the BFT/Comics domains. These dissimilar pairs are at least three times greater than

the similar values, showing there is a strong differentiation between the domains that are

similar to each other and those that are not.

76

Table 3.17: K-L divergence between domains using average distributionsDomain Domain K-L divergenceLaptops Skis 0.06

Craig’s Cars Skis 0.10 SimilarCraig’s Cars Laptops 0.11

BFT Posts eBay Comics 0.13 Similar

Craig’s Cars eBay Comics 0.49eBay Comics Skis 0.71eBay Comics Laptops 0.78 DissimlarBFT Posts Craig’s Cars 0.83BFT Posts Laptops 1.01BFT Posts Skis 1.14

Up to this point, I have examined the average distributions. I now compare how many

times across the 20 trials the distributions are similar to each other. This demonstrates a

confidence level for each individual run in determining the similarity between the distri-

butions. In this context, if the divergence is below a threshold (empirically chosen as 2 in

this case, since the average KL-divergence for all pairs in all trials is 2.6), then I consider

the distribution pair to match. I define the confidence as the percentage of trials that

match out of all of the trials run.

Table 3.15 shows that indeed, the confidences hold well across individual trials. The

confidence (percentage of matches) for the BFT/Comics pairing is high, as are the confi-

dence measures for the pairs in the class of Laptops/Cars/Skis. However, the percentage

of trials that matches across the BFT/Comics and Laptops/Cars/Skis domains are quite

small.

Now, I have both a simple bootstrapping mechanism to generate the distributions, and

a way to determine whether or not the ILA approach can automatically build the reference

77

Table 3.18: Percent of trials where domains match (under K-L thresh)Domain Domain % MatchesLaptops Skis 1

Skis Craig’s Cars 0.9BFT Posts eBay Comics 0.9

Laptops Craig’s Cars 0.8Laptops BFT Posts 0.45Laptops eBay Comics 0.45

Skis BFT Posts 0.25Skis eBay Comics 0.25

BFT Posts Craig’s Cars 0.2Craig’s Cars eBay Comics 0.2

set, since I have confidence in using a single run to determine the similarity of distributions

to those that will work well for automatic construction (Skis/Laptops/Cars) versus those

that require a manual reference set (BFT/Comics). Therefore, given an arbitrary set of

posts, I should be able to label a single post, generate a distribution and then compare it

to my five known distributions. If the average of the K-L divergence between the given

distribution and the Comics and BFT distributions is less than a threshold (set to 2,

as it is above), then the algorithm should suggest to the user to manually construct the

reference set. If the average of the K-L divergence between the given distribution and

the Skis, Laptops, and Craig’s Cars domains is less than the threshold, the algorithm can

automatically build a reference set and therefore should run the ILA algorithm. Lastly,

if none of the averages are below the threshold, the algorithm should be conservative

and suggest that the distribution is “unknown.” Note that I choose to use the average

because it gives a better similarity to the whole class of known distributions, rather than

a single individual. I call this algorithm “Bootstrap-Compare,” and it is given in Table

3.19.

78

Table 3.19: Method to determine whether to automatically construct a reference set.

Bootstrap-Compare(Posts P)

ps ∈ P ← labeled# Label the single post: ps

LabeldPosts PL = BootStrapLabels(ps, P)Distribution D = GetBigramTypesDistribution(PL)If (Average(KL-Divergence(D, {Comics, Hotels})) < Thresh)

Return “manual”# User should manually create reference set.

Else If(Average KL-Divergence(D, {Cars, Laptops, Skis})) < Thresh)Run ILA# Reference set can be built automatically.

ElseReturn “unknown”# No similar known sets.

I test the Bootstrap-Compare approach using two new domains: digital cameras for

sale on eBay, and bibliographic citations from the Cora data set labeled for extraction.

The digital cameras data set (called “Digicams”) consists of 3,001 posts to eBay about

cameras for sale each with a brand, a model, and a model number. The Cora data set

contains 500 bibliographic citations. The Cora citations are an interesting set since I

can imagine the case where a user just wants to extract (create a reference set of) the

years, conference/journal names (called “booktitles”), and authors to see which authors

publish where and when. Attributes such as the paper titles, conference locations, etc.

can effectively be ignored as junk tokens. Descriptions of both of these domains are given

in Table 3.20.

Table 3.20: Two domains for testing the Bootstrap-Compare methodName Source Website Attributes RecordsDigicams eBay www.ebay.com brand, model, model num. 3,001Cora Cora http://www.cs.umass.edu/ mccallum/ author, booktitle, year 500

data/cora-ie.tar.gz

First, I ran the ILA method on each of these domains, and then generated field-

level extraction results using 100 labeled posts from the Digicams domain, and 500 posts

79

from the Cora domain. The results Table 3.21 show that the ILA method works well on

the Digicams data, but poorly on the Cora domain. Therefore, the Bootstrap-Compare

method should suggest it can use ILA for Digicams, but not for Cora.

Table 3.21: Extraction results using ILA on Digicams and Cora

Digicams CoraBrand Recall Prec. F1-Meas. Author Recall Prec. F1-Meas.

67.39 71.26 69.27 0.00 0.00 0.00Model Recall Prec. F1-Meas. Booktitle Recall Prec. F1-Meas.

54.65 61.04 57.67 0.00 0.00 0.00Model Num. Recall Prec. F1-Meas. Year Recall Prec. F1-Meas.

20.37 26.19 22.92 0.20 2.22 0.37

To test the effectiveness of the Bootstrap-Compare method, I ran it 20 times for each

domain and record whether it suggests it can automatically construct the reference set,

or not. For the Digicams domain, the Bootstrap-Compare algorithm correctly identified

that a reference set can be automatically constructed for 18/20 (90%) of the trials. This

gives us a 90% confidence that Bootstrap-Compare can identify this situation. For the

Cora domain, the algorithm suggested the user manually create a reference set in 20/20

(100%) of the trials. From these results I see the expected behavior of Bootstrap-Compare,

namely, it suggests to automatically construct the reference set for the Digicams domain,

but not for the Cora domain. Although this is an optional step for constructing reference

sets, especially since it requires a single labeled post, it can nonetheless help a user in

deciding whether to use the ILA method for building their reference set or not.

80

Chapter 4

A Machine Learning Approach to Reference-Set Based

Extraction

Although Chapters 2 and 3 present an automatic approach to extraction that covers the

cases for both manually constructed and automatically constructed reference sets, there

is still a special case to consider where an automatic approach to extraction would not

be as suitable. In particular, there are situations where a user requires highly accurate

extraction of attributes. This includes the high-accuracy extraction of attributes that are

not easily represented in a reference set, such as prices or dates. I call these “common at-

tributes.” Common attributes are usually an infinitely large set, but they contain certain

characteristics that can be exploited for extraction. For instance, regular expressions can

identify possible prices.

However, mixing the extraction of reference set attributes and common attributes can

lead to ambiguous extractions. For instance, while a regular expression for prices might

pull out the token “626” from a classified ad for cars, it might also be the case that the

“626” in the post refers to the car model, since the Mazda company makes a car called

the 626.

81

Another issue for high-accurate extraction stems from the fact that the automatic

methods can not explicitly model true positives from false negatives. Since the automatic

method only considers attributes in agreement to deal with possible false positives, it

may leave out certain extractions for attributes because it could not differentiate which

attribute from the reference set is better. Therefore the automatic method may lower its

recall because of the above effect.

To handle both the issue of ambiguities and attribute-in-agreement issue, this chap-

ter presents an extraction algorithm that exploits reference sets using techniques from

machine learning. While the method of this chapter yields high accuracy extractions, in-

cluding “common attribute” extractions, it does so at the cost of manually labeling data

for both matching and extraction. However, by using a reference set to aid disambigua-

tion, the system handles common attribute extraction from posts with high accuracy.

Further, by explicitly modeling matches versus non-matches and correct extractions ver-

sus incorrect extractions, the system can learn how to distinguish true positives from false

negatives, without resorting to using only the attributes-in-agreement from the matches.

This leads to a higher level of recall for the non-common attributes. Note, however, that

the same reference-set based extraction framework holds for the algorithm of this chapter.

The goal is still to match to a reference set and then exploit those matches for extraction.

The key difference in this chapter is that the matching and extraction steps are done using

techniques from machine learning that yield very high accuracy extractions, particularly

for the common attributes.

82

4.1 Aligning Posts to a Reference Set

As in the general reference-set based extraction framework, the algorithm needs to first

decide which members of the reference set best matches the posts. In the context of our

machine learning approach, this matching is known as record linkage [26]. Record linkage

can be broken into two steps: generating candidate matches, called “blocking”; and then

separating the true matches from these candidates in the “matching” step. Blocking is

a necessary step since using a machine learning component for matching involves a more

costly matching decision than in our automatic method which uses vector-based similarity

matching, so the number of candidate matches needs to be minimized to mitigate this

cost.

In our approach, the blocking generates candidate matches based on similarity meth-

ods over certain attributes from the reference set as they compare to the posts. For our

cars example, the algorithm may determine that it can generate candidates by finding

common tokens between the posts and the make attribute of the reference set. That is,

it may find that all posts with “Honda” in them are good candidates for detailed exami-

nation during the matching step with the “Honda” tuples of the reference set. This step

is detailed in Section 4.1.1 and is crucial in limiting the number of candidates matches

we later examine during the matching step. After generating candidates, the algorithm

generates a large set of features between each post and its candidate matches from the

reference set. Using these features, the algorithm employs machine learning methods

to separate the true matches from the false positives generated during blocking. This

matching is detailed in Section 4.1.2.

83

4.1.1 Generating Candidates by Learning

Blocking Schemes for Record Linkage

It is infeasible to compare each post to all of the members of a reference set since matching

using machine learning is more costly than our automatic matching technique. Therefore

a preprocessing step generates candidate matches by comparing all the records between

the sets using fast, approximate methods. This is called blocking because it can be

thought of as partitioning the full cross product of record comparisons into mutually

exclusive blocks [50]. That is, to block on an attribute, first an algorithm sorts or clusters

the data sets by the attribute. Then it applies the comparison method to only a single

member of a block. After blocking, the candidate matches (from the matching block) are

examined in detail to discover true matches.

There are two main goals of blocking. First, blocking should limit the number of

candidate matches, which limits the number of expensive, detailed comparisons needed

during record linkage. Second, blocking should not exclude any true matches from the

set of candidate matches. This means there is a trade-off between finding all matching

records and limiting the size of the candidate matches. So, the overall goal of blocking is

to make the matching step more scalable, by limiting the number of comparisons it must

make, while not hindering its accuracy by passing as many true matches to it as possible.

Most blocking is done using the multi-pass approach [29], which combines the can-

didates generated during independent runs. For example, with cars data, a blocking

method might make one pass over the data blocking on tokens in the car model, while

another run might block using tokens of the make along with common tokens in the

84

trim values. One can view the multi-pass approach as a rule in disjunctive normal form,

where each conjunction in the rule defines each run, and the union of these rules com-

bines the candidates generated during each run. In the context of blocking, I call this

disjunctive rule a “blocking scheme.” Using my care example, the rule might become

({token-match, model} ∧ ({token-match, year}) ∪ ({token-match, make})). The effec-

tiveness of the multi-pass approach hinges upon which methods and attributes are chosen

in the conjunctions.

Note that each conjunction is a set of {method, attribute} pairs, and I do not make

restrictions on which methods can be used. The set of methods could include full string

metrics such as cosine similarity, simple common token matching as outlined above, or

even state-of-the-art n-gram methods as shown in my experiments. The key for methods

is not necessarily choosing the fastest (though I show how to account for the method speed

below), but rather choosing the methods that will generate the smallest set of candidate

matches that still cover the true positives, since it is the matching step that will consume

the most time.

Therefore, a blocking scheme should include enough conjunctions to cover as many

true matches as it can. For example, the first conjunct might not cover all of the true

matches if the datasets being compared do not overlap in all of the years, so the sec-

ond conjunct can cover the rest of the true matches. This is the same as adding more

independent runs to the multi-pass approach.

However, since a blocking scheme includes as many conjunctions as it needs, these

conjunctions should limit the number of candidates they generate. For example, the

second conjunct is going to generate a lot of unnecessary candidates since it will return

85

all records that share the same make. By adding more {method, attribute} pairs to

a conjunction, the blocking method limits the number of candidates it generates. For

example, if I change ({token-match, make}) to ({token-match, make} ∧ {token-match,

trim}) the rule still covers new true matches, but it generates fewer additional candidates.

Therefore effective blocking schemes should learn conjunctions that minimize the false

positives, but learn enough of these conjunctions to cover as many true matches as possi-

ble. These two goals of blocking can be clearly defined by the Reduction Ratio and Pairs

Completeness [24].

The Reduction Ratio (RR) quantifies how well the current blocking scheme minimizes

the number of candidates. Let C be the number of candidate matches and N be the size

of the cross product between both data sets.

RR = 1− C/N

It should be clear that adding more {method,attribute} pairs to a conjunction increases

its RR, as when I changed ({token-match, zip}) to ({token-match, zip} ∧ {token-match,

first name}). Pairs Completeness (PC) measures the coverage of true positives, i.e., how

many of the true matches are in the candidate set versus those in the entire set. If Sm

is the number of true matches in the candidate set, and Nm is the number of matches in

the entire dataset, then:

PC = Sm/Nm

Adding more disjuncts can increase the PC. For example, I added the second conjunc-

tion to the example blocking scheme because the first did not cover all of the matches.

86

The blocking approach in this paper, “Blocking Scheme Learner” (BSL), learns ef-

fective blocking schemes in disjunctive normal form by maximizing the reduction ratio

and pairs completeness. In this way, BSL tries to maximize the two goals of blocking.

Previous work showed BSL aided the scalability of record linkage [45], and in this chapter

I extend that idea by showing that it also can work in the case of matching posts to the

reference set records.

The BSL algorithm uses a modified version of the Sequential Covering Algorithm

(SCA), used to discover disjunctive sets of rules from labeled training data [47]. In my

case, SCA will learn disjunctive sets of conjunctions consisting of {method, attribute}

pairs. Basically, each call to LEARN-ONE-RULE generates a conjunction, and BSL

keeps iterating over this call, covering the true matches left over after each iteration.

This way SCA learns a full blocking scheme. The BSL algorithm is shown in Table 4.1.

Table 4.1: Modified Sequential Covering AlgorithmSEQUENTIAL-COVERING(class, attributes, examples)LearnedRules ← {}Rule ← LEARN-ONE-RULE (class, attributes, examples)While examples left to cover, do

LearnedRules ← LearnedRules ∪RuleExamples ← Examples - {Examples covered by Rule}Rule← LEARN-ONE-RULE (class, attributes, examples)If Rule contains any previously learned rules, remove thesecontained rules.

Return LearnedRules

There are two modifications to the classic SCA algorithm, which are shown in bold.

First, BSL runs until there are no more examples left to cover, rather than stopping at

some threshold. This ensures that the algorithm maximizes the number of true matches

generated as candidates by the final blocking rule (Pairs Completeness). Note that this

might, in turn, yield a large number of candidates, hurting the Reduction Ratio. However,

87

omitting true matches directly affects the accuracy of record linkage, and blocking is a

preprocessing step for record linkage, so it is more important to cover as many true

matches as possible. This way BSL fulfills one of the blocking goals: not eliminating

true matches if possible. Second, if BSL learns a new conjunction (in the LEARN-ONE-

RULE step) and its current blocking scheme has a rule that already contains the newly

learned rule, then BSL removes the rule containing the newly learned rule. This is an

optimization that allows BSL to check rule containment as its progresses, rather than at

the end.

Checking rule containment is possible because BSL can guarantee that it learns less

restrictive rules as it goes. I prove this guarantee as follows, using proof by contradiction.

Assume two attributes A and B, and a method X. Also, assume that BSL previously

learned rules contain the following conjunction, ({X, A}) and BSL currently learned the

rule ({X, A}∧ {X, B}). That is, assume the learned rules contains a rule that is less

specific than the currently learned rule. If this were the case, then there must be at least

one training example covered by ({X, A}∧ {X, B}) that is not covered by ({X, A}),

since SCA dictates that BSL remove all examples covered by ({X, A}) when it is learned.

Clearly, this cannot happen, since any examples covered by the more specific ({X, A}∧

{X, B}) would have been covered by ({X, A}) already and removed, which means BSL

could not have learned the rule ({X, A}∧ {X, B}). Thus, there exists a contradiction.

As I stated before, the two main goals of blocking are to minimize the size of the

candidate set, while not removing any true matches from this set. I have already men-

tioned how BSL maximizes the number of true positives in the candidate set and now

I describe how BSL minimizes the overall size of the candidate set, which yields more

88

scalable record linkage. To minimize the candidate set’s size, BSL learns as restrictive a

conjunction as possible during each call to LEARN-ONE-RULE during the SCA. I define

restrictive as minimizing the number of candidates generated, as long as a certain number

of true matches are still covered. (Without this restriction, BSL could learn conjunctions

that perfectly minimize the number of candidates: they simply return none.)

To do this, the LEARN-ONE-RULE step performs a general-to-specific beam search.

It starts with an empty conjunction and at each step adds the {method, attribute} pair

that yields the smallest set of candidates that still cover at least a set number of true

matches. That is, BSL learns the conjunction that maximizes the Reduction Ratio, while

at the same time covering a minimum value of Pairs Completeness. BSL uses a beam

search to allow for some backtracking, since the search is greedy. However, since the

beam search goes from general-to-specific, it ensures that the final rule is as restrictive

as possible. The full LEARN-ONE-RULE is given in Table 4.2.

The constraint that a conjunction covers a minimum PC ensures that the learned

conjunction does not over-fit to the data. Without this restriction, it would be possible

Table 4.2: Learning a conjunction of {method, attribute} pairsLEARN-ONE-RULE (attributes, examples,min thresh, k)Best-Conjunction ← {}Candidate-conjunctions ← all {method, attribute} pairsWhile Candidate-conjunctions not empty, do

For each ch ∈ Candidate-conjunctionsIf not first iterationch← ch ∪ {method,attribute}

Remove any ch that are duplicates, inconsistent or not max. specificif REDUCTION-RATIO(ch) > REDUCTION-RATIO(Best-Conjunction)and PAIRS-COMPLETENESS(ch) ≥ min thresh

Best-Conjunction ← chCandidate-conjunctions ← best k members of Candidate-conjunctions

return Best-conjunction

89

for LEARN-ONE-RULE to learn a conjunction that returns no candidates, uselessly

producing an optimal RR.

The algorithm’s behavior is well defined for the minimum PC threshold. Consider,

the case where the algorithm learns the most restrictive rule it can with the minimum

coverage. In this case, the parameter ends up partitioning the space of the cross product

of example records by the threshold amount. That is, if we set the threshold amount to

50% of the examples covered, the most restrictive first rule covers 50% of the examples.

The next rule covers 50% of what is remaining, which is 25% of the examples. The next

will cover 12.5% of the examples, etc. In this sense, the parameter is well defined. If

a user sets the threshold high, BSL learns fewer, less restrictive conjunctions, possibly

limiting the RR, although this may increase PC slightly. If the threshold is set lower,

BSL covers more examples, but it needs to learn more conjuncts. These newer conjuncts,

in turn, may be subsumed by later conjuncts, so they will be a waste of time to learn. So,

as long as this parameter is small enough, it should not affect the coverage of the final

blocking scheme, and smaller than that just slows down the learning. I set this parameter

to 50% for my experiments1.

Now I analyze the running time of BSL, and show how BSL can take into account the

running time of different blocking methods, if need be. Assume that I have x (method,

attribute) pairs such as (token, first − name). Now, assume that the beam size is b,

since I use general-to-specific beam-search in the Learn-One-Rule procedure. Also, for

the time being, assume each (method, attribute) pair can generate its blocking candidates

in O(1) time. (I relax this assumption later.) Each time BSL hits the Learn-One-Rule1Setting this parameter lower than 50% had an insignificant effect on my results, and setting it much

higher, to 90%, only increased the PC by a small amount (if at all), while decreasing the RR.

90

phase, it will try all rules in the beam with all of the (attribute, method) pairs not in

the current beam rules. So, in the worst case, this takes O(bx) each time, since for each

(method, attribute) pair in the beam, BSL tries it against all other (method, attribute)

pairs. Now, in the worst case, each learned disjunct would only cover 1 training example,

so the final rule is a disjunction of all pairs x. Therefore, BSL runs the Learn-One-Rule

x times, resulting in a learning time of O(bx2). If BSL has e training examples, the full

training time is O(ebx2), for BSL to learn the blocking scheme.

Now, while I assumed above that each (method, attribute) runs in O(1) time, this

is clearly not the case, since there is a substantial amount of literature on blocking

methods and further the blocking times can vary significantly [5]. I define a function tx(e)

that represents how long it takes for a single (method, attribute) pair in x to generate

the e candidates in the training example. Using this notation, Learn-One-Rule’s time

becomes O(b(xtx(e))) (it runs tx(e) time for each pair in x) and so the full training time

becomes O(eb(xtx(e))2). Clearly the most expensive blocking methodology dominates

such a running time. Once a rule is learned, it is bounded by the time it takes to run the

rule and (method, attribute) pairs involved, so it takes O(xtx(n)), where n is the number

of records for classification.

From a practical standpoint, I can easily modify BSL to account for the time it takes

certain blocking methods to generate their candidates. In the Learn-One-Rule step, I

change the performance metric to reflect both Reduction Ratio and blocking time as a

weighted average. That is, given Wrr as the weight for Reduction Ratio and Wb as the

weight for the blocking time, I modify Learn-One-Rule to maximize the performance of

any disjunct based on this weighted average. Table 4.3 shows the modified version of

91

Learn-One-Rule, and the changes are shown in bold.

Table 4.3: Learning a conjunction of {method, attribute} pairs using weightsLEARN-ONE-RULE (attributes, examples,min thresh, k)Best-Conj ← {}Candidate-conjunctions ← all {method, attribute} pairsWhile Candidate-conjunctions not empty, do

For each ch ∈ Candidate-conjunctionsIf not first iterationch← ch ∪ {method,attribute}

Remove any ch that are duplicates, inconsistent or not max. specificSCORE(ch) = Wrr∗REDUCTION-RATIO(ch)+Wb∗BLOCK-TIME(ch)SCORE(Best-Conj) = Wrr∗REDUCTION-RATIO(Best-conj)+Wb∗BLOCK-TIME(Best-conj)if SCORE(ch) > SCORE(Best-conj)and PAIRS-COMPLETENESS(ch) ≥ min thresh

Best-conj ← chCandidate-conjunctions ← best k members of Candidate-conjunctions

return Best-conj

Note that when Wb is set to 0, the above version of Learn-One-Rule resolves into using

the same version of Learn-One-Rule as used throughout this chapter which only considers

the Reduction Ratio. Since the methods (token and n-gram match) are simple to compute,

requiring more time to build the initial index than to do the candidate generation, one can

safely set Wb to 0. Also, making this trade-off of time versus reduction might not always

be an appropriate decision. Although a method may be fast, if it does not sufficiently

reduce the reduction ratio, then the time it takes the record linkage step might increase

more than the time it would have taken to run the blocking using a method that provides

a larger increase in reduction ratio. Since classification often takes much longer than

candidate generation, the goal should be to minimize candidates (maximize reduction

ratio), which in turn minimizes classification time. Further, the key insight of BSL is

not only that it chooses the blocking method, but more importantly that it chooses the

appropriate attributes to block on. In this sense, BSL is more like a feature selection

92

algorithm than a blocking method. As I show in our experiments, for blocking it is

more important to pick the right attribute combinations, as BSL does, even using simple

methods, than to do blocking using the most sophisticated methods.

Although BSL was originally conceived for blocking in traditional record linkage, I can

easily extend the BSL algorithm to handle the case of matching posts to members of the

reference set. This is a special case because the posts have all the attributes embedded

within them while the reference set data is relational and structured into schema elements.

To handle this special case, rather than matching attribute and method pairs across the

data sources during LEARN-ONE-RULE, BSL instead compares attribute and method

pairs from the relational data to the entire post. This is a small change, showing that

the same algorithm works well even in this special case.

Once BSL learns a good blocking scheme, it can now efficiently generate candidates

from the post set to align to the reference set. This blocking step is essential for map-

ping large amounts of unstructured and ungrammatical data sources to larger and larger

reference sets.

4.1.2 The Matching Step

From the set of candidates generated during blocking one can find the member of the

reference set that best matches the current post. As mentioned throughout this thesis,

reference-set based extraction breaks into two larger steps: first, finding the matches

between the posts and reference set, and second, using those matches to aid extraction.

By using machine learning for this “matching step,” this alignment procedure can be

referred to as record linkage [26]. However, the record linkage problem presented in

93

Figure 4.1: The traditional record linkage problem

Figure 4.2: The problem of matching a post to the reference set

this chapter differs from the “traditional” record linkage problem and is not well studied.

Traditional record linkage matches a record from one data source to a record from another

data source by relating their respective, decomposed attributes. For instance, Figure 4.1

shows how traditional record linkage might match two sets of cars, each with make,

model, trim and year attributes. Note that each attribute is decomposed from the other

94

attributes and the matches are determined by comparing each attribute from one set to

the other set, i.e. comparing the makes, models, etc. Yet, the attributes of the posts are

embedded within a single piece of text and not yet identified. This text is compared to

the reference set, which is already decomposed into attributes and which does not have

the extraneous tokens present in the post. Figure 4.2 depicts this problem. With this

type of matching traditional record linkage approaches do not apply.

Instead, the matching step compares the post to all of the attributes of the reference

set concatenated together. Since the post is compared to a whole record from the reference

set (in the sense that it has all of the attributes), this comparison is at the “record level”

and it approximately reflects how similar all of the embedded attributes of the post are

to all of the attributes of the candidate match. This mimics the idea of traditional record

linkage, that comparing all of the fields determines the similarity at the record level.

Figure 4.3: Two records with equal record level but different field level similarities

95

However, by using only the record level similarity it is possible for two candidate

matches to generate the same record level similarity while differing on individual at-

tributes. If one of these attributes is more discriminative than the other, there needs to

be some way to reflect that. For example, consider Figure 4.3. In the figure, the two

candidates share the same make and model. However, the first candidate shares the year

while the second candidate shares the trim. Since both candidates share the same make

and model, and both have another attribute in common, it is possible that they generate

the same record level comparison. Yet, a trim on car, especially with a rare thing like a

“Hatchback” should be more discriminative than sharing a year, since there are lots of

cars with the same make, model and year, that differ only by the trim. This difference

in individual attributes needs to be reflected.

To discriminate between attributes, the matching step borrows the idea from tra-

ditional record linkage that incorporating the individual comparisons between each at-

tribute from each data source is the best way to determine a match. That is, just the

record level information is not enough to discriminate matches, field level comparisons

must be exploited as well. To do “field level” comparisons the matching step compares

the post to each individual attribute of the reference set.

These record and field level comparisons are represented by a vector of different

similarity functions called RL scores. By incorporating different similarity functions,

RL scores reflects the different types of similarity that exist between text. Hence, for

the record level comparison, the matching step generates the RL scores vector between

the post and all of the attributes concatenated. To generate field level comparisons,

the matching step calculates the RL scores between the post and each of the individual

96

attributes of the reference set. All of these RL scores vectors are then stored in a vec-

tor called VRL. Once populated, VRL represents the record and field level similarities

between a post and a member of the reference set.

In the example reference set from Figure 4.2, the schema has 4 attributes <make,

model, trim, year>. Assuming the current candidate is <“Honda”, “Civic”, “4D LX”,

“1993”>, then the VRL looks like:

VRL=<RL scores(post, “Honda”),

RL scores(post, “Civic”),

RL scores(post, “4D LX”),

RL scores(post, “1993”),

RL scores(post, “Honda Civic 4D LX 1993”)>

Or more generally:

VRL=<RL scores(post, attribute1),

RL scores(post, attribute2),

. . . ,

RL scores(post, attributen),

RL scores(post, attribute1 attribute2 . . . attributen)>

The RL scores vector is meant to include notions of the many ways that exist to define

the similarity between the textual values of the data sources. It might be the case that

one attribute differs from another in a few misplaced, missing or changed letters. This

sort of similarity identifies two attributes that are similar, but misspelled, and is called

“edit distance.” Another type of textual similarity looks at the tokens of the attributes

and defines similarity based upon the number of tokens shared between the attributes.

97

This “token level” similarity is not robust to spelling mistakes, but it puts no emphasis on

the order of the tokens, whereas edit distance requires that the order of the tokens match

in order for the attributes to be similar. Lastly, there are cases where one attribute may

sound like another, even if they are both spelled differently, or one attribute may share a

common root word with another attribute, which implies a “stemmed” similarity. These

last two examples are neither token nor edit distance based similarities.

To capture all these different similarity types, the RL scores vector is built of three

vectors that reflect the each of the different similarity types discussed above. Hence,

RL scores is:

RL scores(post, attribute)=<token scores(post, attribute),

edit scores(post, attribute),

other scores(post, attribute)>

The vector token scores comprises three token level similarity scores. Two similarity

scores included in this vector are based on the Jensen-Shannon distance, which defines

similarities over probability distributions of the tokens. One uses a Dirichlet prior [17]

and the other smooths its token probabilities using a Jelenik-Mercer mixture model [64].

The last metric in the token scores vector is the Jaccard similarity.

With all of the scores included, the token scores vector takes the form:

token scores(post, attribute)=<Jensen-Shannon-Dirichlet(post, attribute),

Jensen-Shannon-JM-Mixture(post, attribute),

Jaccard(post, attribute)>

98

The vector edit scores consists of the edit distance scores which are comparisons be-

tween strings at the character level defined by operations that turn one string into an-

other. For instance, the edit scores vector includes the Levenshtein distance [36], which

returns the minimum number of operations to turn string S into string T, and the Smith-

Waterman distance [56] which is an extension to the Levenshtein distance. The last

score in the vector edit scores is the Jaro-Winkler similarity [63], which is an extension of

the Jaro metric [30] used to find similar proper nouns. While not a strict edit-distance,

because it does not regard operations of transformations, the Jaro-Winkler metric is a

useful determinant of string similarity.

With all of the character level metrics, the edit scores vector is defined as:

edit scores(post, attribute)=<Levenshtein(post, attribute),

Smith-Waterman(post, attribute),

Jaro-Winkler(post, attribute)>

All the similarities in the edit scores and token scores vector are defined in the Sec-

ondString package [17] which was used for the experimental implementation as described

in Section 4.3.

Lastly, the vector other scores captures the two types of similarity that did not fit

into either the token level or edit distance similarity vector. The first is the Soundex

score between the post and the attribute. Soundex uses the phonetics of a token as

a basis for determining the similarity. That is, misspelled words that sound the same

will receive a high Soundex score for similarity. The other similarity is based upon the

Porter stemming algorithm [52], which removes the suffixes from strings so that the root

words can be compared for similarity. This helps alleviate possible errors introduced by

99

the prefix assumption introduced by the Jaro-Winkler metric, since the stems are scored

rather than the prefixes. Including both of these scores, the other scores vector becomes:

other scores(post, attribute)=<Porter-Stemmer(post, attribute),

Soundex(post, attribute)>

Figure 4.4: The full vector of similarity scores used for record linkage

Figure 4.4 shows the full composition of VRL, with all the constituent similarity scores.

Once a VRL is constructed for each of the candidate matches, the matching step

then performs a binary rescoring on each VRL to further help determine the best match

amongst the candidates. This rescoring helps determine the best possible match for the

post by separating out the best candidate as much as possible. Because there might be

a few candidates with similarly close values, and only one of them is a best match, the

rescoring emphasizes the best match by downgrading the close matches so that they have

100

the same element values as the more obvious non-matches, while boosting the difference

in score with the best candidate’s elements.

To rescore the vectors of candidate set C, the rescoring method iterates through the

elements xi of all VRL∈C, and the VRL(s) that contain the maximum value for xi map

this xi to 1, while all of the other VRL(s) map xi to 0. Mathematically, the rescoring

method is:

∀VRLj ∈ C, j = 0... |C|

∀xi ∈ VRLj , i = 0...∣∣∣VRLj

∣∣∣f(xi, VRLj ) =

1, xi = max(∀xt ∈ VRLs , VRLs ∈ C, t = i, s = 0... |C|)

0, otherwise

For example, suppose C contains 2 candidates, VRL1 and VRL2 :

VRL1 = <{.999,...,1.2},...,{0.45,...,0.22}>

VRL2 = <{.888,...,0.0},...,{0.65,...,0.22}>

After rescoring they become:

VRL1 = <{1,...,1},...,{0,...,1}>

VRL2 = <{0,...,0},...,{1,...,1}>

After rescoring, the matching step passes each VRL to a Support Vector Machine

(SVM) [31] trained to label them as matches or non-matches. The best match is the

candidate that the SVM classifies as a match, with the maximally positive score for

the decision function. If more than one candidate share the same maximum score from

the decision function, then they are thrown out as matches. This enforces a strict 1-1

mapping between posts and members of the reference set. However, a 1-n relationship

101

can be captured by relaxing this restriction. To do this the algorithm keeps either the

first candidate with the maximal decision score, or chooses one randomly from the set of

candidates with the maximum decision score.

Although I use SVMs in this paper to differentiate matches from non-matches, the

algorithm is not strictly tied to this method. The main characteristics for my learning

problem are that the feature vectors are sparse (because of the binary rescoring) and the

concepts are dense (since many useful features may be needed and thus none should be

pruned by feature selection). I also tried to use a Naıve Bayes classifier for the matching

task, but it was monumentally overwhelmed by the number of features and the number

of training examples. Yet this is not to say that other methods that can deal with sparse

feature vectors and dense concepts, such as online logistic regression or boosting, could

not be used in place of SVM.

After the match for a post is found, the attributes of the matching reference set

member are added as annotation to the post by including the values of the reference set

attributes with tags that reflect the schema of the reference set. The overall matching

algorithm is shown in Figure 4.5.

4.2 Extracting Data from Posts

As stated, the motivation of the algorithm of this chapter is a high-accuracy extraction

algorithm that can extract both attributes from a reference set, and those that are not

easily represented by a reference set, although this is done at the cost of manually la-

beling training data. To perform extraction, following the generic reference-set based

102

Figure 4.5: Our approach to matching posts to records from a reference set

framework, the matches from the previous matching step are used for extraction. How-

ever, unlike the extraction methods of the previous chapters, this extraction algorithm

uses machine learning so it can specifically model extraction types to help disambiguate

attribute extractions and improve the recall of extraction.

In a broad sense, the extraction algorithm of this chapter has two parts. First the

algorithm labels each token with a possible attribute label or as “junk” to be ignored.

After all the tokens in a post are labeled, it then cleans each of the extracted labels. Figure

4.6 shows the detailed procedure. Each of the steps shown in this figure are described in

detail below.

To begin the extraction process, the post is broken into tokens. Using the post from

Figure 4.6 as an example, the set of tokens becomes, {“93-”, “4dr”, “Honda”,...}. Each

of these tokens is then scored against each attribute of the record from the reference set

that was deemed the match.

103

Figure 4.6: Extraction process for attributes

To score the tokens, the extraction process builds a vector of scores, VIE . Like the VRL

vector of the matching step, VIE is composed of vectors which represent the similarities

between the token and the attributes of the reference set. However, the composition of

VIE is slightly different from VRL. It contains no comparison to the concatenation of all

the attributes, and the vectors that compose VIE are different from those that compose

VRL. Specifically, the vectors that form VIE are called IE scores, and are similar to the

RL scores that compose VRL, except they do not contain the token scores component,

since each IE scores only uses one token from the post at a time.

The RL scores vector:

104

RL scores(post, attribute)=<token scores(post, attribute),

edit scores(post, attribute),

other scores(post, attribute)>

becomes:

IE scores(token, attribute)=<edit scores(token, attribute),

other scores(token, attribute)>

The other main difference between VIE and VRL is that VIE contains a unique vector

that contains user defined functions, such as regular expressions, to capture the “common

attributes” that are not easily represented by reference sets, such as prices or dates.

These attribute types generally exhibit consistent characteristics that allow them to be

extracted, and they are usually infeasible to represent in reference sets. This vector is

called common scores because the types of characteristics of “common attributes” for

extraction are “common” enough between to be easily constructed and used (such as

regular expressions).

Using the post of Figure 4.6, assume the reference set match has the make “Honda,”

the model “Civic” and the year “1993.” This means the matching tuple would be

{“Honda”, “Civic”, “1993”}. This match generates the following VIE for the token

“civc” of the post:

VIE=<common scores(“civic”),

IE scores(“civc”,“Honda”),

IE scores(“civc”,“Civic”),

IE scores(“civc”,“1993”)>

More generally, for a given token, VIE looks like:

105

VIE=<common scores(token),

IE scores(token, attribute1),

IE scores(token, attribute2)

. . . ,

IE scores(token, attributen)>

Each VIE is then passed to a structured SVM [60; 59] trained to give it an attribute

type label, such as make, model, or price. Intuitively, similar attribute types should

have similar VIE vectors. The makes should generally have high scores against the make

attribute of the reference set, and small scores against the other attributes. Further,

structured SVMs are able to infer the extraction labels collectively, which helps in de-

ciding between possible token labels. This makes the use of structured SVMs an ideal

machine learning method for the extraction task, since collective labeling aids in the dis-

ambiguation. Note that since each VIE is not a member of a cluster where the winner

takes all, there is no binary rescoring.

Since there are many irrelevant tokens in the post, the SVM learns that any VIE that

does associate with a learned attribute type should be labeled as “junk”, which can then

be ignored. Without the benefits of a reference set, recognizing junk is difficult because

the characteristics of the text in the posts are unreliable. For example, if extraction

relies solely on capitalization and token location, the junk phrase “Great Deal” might be

annotated as an attribute. Many traditional extraction systems that work in the domain

of ungrammatical and unstructured text, such as addresses and bibliographies, assume

that each token of the text must be classified as something, an assumption that cannot

be made with posts.

106

Nonetheless, it is possible that a junk token will receive an incorrect class label. For

example, if a junk token has enough matching letters, it might be labeled as a trim (since

trims may only be a single letter or two). This leads to noisy tokens within the whole

extracted trim attribute. Therefore, labeling tokens individually gives an approximation

of the data to be extracted.

The extraction approach can overcome the problems of generating noisy, labeled to-

kens by comparing the whole extracted field to its analogue reference set attribute, sim-

ilarly to the process in Chapter 2. After all tokens from a post are processed, whole

attributes are built and compared to the corresponding attributes from the reference set.

This allows removal of the tokens that introduce noise in the extracted attribute.

The removal of noisy tokens from an extracted attribute starts with generating two

baseline scores between the extracted attribute and the reference set attribute. One

is a Jaccard similarity, to reflect the token level similarity between the two attributes.

However, since there are many misspellings and such, an edit-distance based similarity

metric, the Jaro-Winkler metric, is also used. These baselines demonstrate how accurately

the system extracted/classified the tokens in isolation. Then, the algorithm removes a

single token at a time, and rescores the similarity between the reference set attribute and

the updated extracted attribute. If the score of the updated extraction is higher than the

baseline, the removed token becomes a candidate for removal. After trying all attributes,

the candidate for removal that yielded the highest score is removed, and the baseline score

is updated to this new similarity score. If no tokens yield a higher score when removed,

then the cleaning is finished. The algorithm is shown formally in Figure 4.7.

107

Algorithm 4.2.1: CleanAttribute(E,R)

comment: Clean extracted attribute E using reference set attribute R

RemovalCandidates C ← nullJaroWinklerBaseline ← JaroWinkler(E,R)JaccardBaseline ← Jaccard(E,R)for each token t ∈ E

do

Xt ← RemoveToken(t, E)JaroWinklerXt ← JaroWinkler(Xt, R)JaccardXt ← Jaccard(Xt, R)

if

JaroWinklerXt>JaroWinklerBaseline

andJaccardXt>JaccardBaseline

then{C ← C ∪ t

if

{C = nullreturn (E)

else

{E ← RemoveMaxCandidate(C,E)CleanAttribute(E,R)

Figure 4.7: Algorithm to clean an extracted attribute

Note, however, that I do not limit the machine learning component of our extraction

algorithm to SVMs. Instead, I claim that in some cases, reference sets can aid machine

learning extraction, and to test this, in my architecture I can replace the SVM component

with other methods. For example, in my extraction experiments I replace the SVM

extractor with a Conditional Random Field (CRF) [32] extractor that uses the VIE as

features.

Therefore, the whole extraction process takes a token of the text, creates the VIE and

passes this to the machine-learning extractor which generates a label for the token. Then

each field is cleaned and the extracted attribute is saved.

108

4.3 Results

I built the “Phoebus” system to experimentally validate the machine learning approach

of this chapter. Specifically, Phoebus tests the technique’s accuracy in both the record

linkage and the extraction, and incorporates the BSL algorithm for learning and using

blocking schemes. The experimental posts data comes from three domains of posts:

BFT Posts, eBay Comics, and Craig’s Cars, which are described in Chapters 2 and

3. In particular, these three domains have attributes for extraction that are “common

attributes” (such as prices and dates) that are hard to distinguish from other attributes

from the reference set (such as BFT Posts’ numeric star rating, eBay Comics’ numeric

issue number, and Craig’s Cars’ years). The three reference sets used for each domain are

called Hotels, Comics and Cars respectively, and are described in Table 2.4 of Chapter 2.

Unlike the BFT and eBay comics domains, a strict 1-1 relationship between the post

and reference set was not enforced in the Craig’s Cars domain. As described previously,

Phoebus relaxed the 1-1 relationship to form a 1-n relationship between the posts and the

reference set. Sometimes the records do not contain enough attributes to discriminate a

single best reference member. For instance, posts that contain just a model and a year

might match a couple of reference set records that would differ on the trim attribute, but

have the same make, model, and year. Yet, Phoebus can still use this make, model and

year accurately for extraction. So, in this case, as mentioned previously, the algorithm

picks one of the matches. This way, it can exploit the attributes that it can from the

reference set, since it has confidence in those. In some sense, this is similar to the

“attributes in agreement” idea.

109

Since this chapter’s technique uses supervised machine learning, for the experiments

the posts in each domain are split into two folds, one for training and one for testing.

This is usually called two-fold cross validation. However, in many cases two-fold cross

validation results in using 50% of the data for training and 50% for testing. I believe

that this is too much data to have to label manually, especially as data sets become

large. Instead, my experiments focus on using more realistic amounts of training data.

One set of experiments uses 30% of the posts for training and tests on the remaining

70%, and the second set of experiments uses just 10% of the posts to train, testing on

the remaining 90%. I argue that training on small amounts of data, such as 10%, is

an important empirical procedure since real world data sets are large and labeling 50%

of such large data sets is time consuming and unrealistic. In fact, the size of the Cars

domain prevented me from using 30% of the data for training, since the machine learning

algorithms could not scale to the number of training tuples this would generate. So for

the Cars domain I only run experiments training on 10% of the data. All experiments

are performed 10 times, and I report the average results for these 10 trials.

4.3.1 Record Linkage Results

In this subsection I report our record linkage results, broken down into separate discus-

sions of the blocking results and the matching results.

4.3.1.1 Blocking Results

In order for the BSL algorithm to learn a blocking scheme, it must be provided with

methods it can use to compare the attributes. For all domains and experiments I use two

110

common, simple methods. The first, which I call “token,” compares any matching token

between the attributes. The second method, “ngram3,” considers any matching 3-grams

between the attributes.

It is important to note that a comparison between BSL and other blocking methods,

such as the Canopies method [43] and Bigram indexing [3], is slightly misaligned because

the algorithms solve different problems. Methods such as Bigram indexing are techniques

that make the process of each blocking pass on an attribute more efficient. The goal of

BSL, however, is to select which attribute combinations should be used for blocking as a

whole, trying different attribute and method pairs. Nonetheless, I contend that it is more

important to select the right attribute combinations, even using simple methods, than it

is to use more sophisticated methods, but without insight as to which attributes might

be useful. To test this hypothesis, I compare BSL using the token and 3-gram methods

to Bigram indexing over all of the attributes. This is equivalent to forming a disjunction

over all attributes using Bigram indexing as the method. I chose Bigram indexing in

particular because it is designed to perform “fuzzy blocking” which seems necessary in

the case of noisy post data. As stated previously [3], I use a threshold of 0.3 for Bigram

indexing, since that works the best. I also compare BSL to running a disjunction over

all attributes using the simple token method only. In my results, I call this blocking rule

“Disjunction.” This disjunction mirrors the idea of picking the simplest possible blocking

method: generating candidates by finding a common token in at least one attribute.

As stated previously, the two goals of blocking can be quantified by the Reduction

Ratio (RR) and the Pairs Completeness (PC). Table 4.4 shows not only these values but

also how many candidates were generated on average over the entire test set, comparing

111

Table 4.4: Blocking results using the BSL algorithm (amount of data used for trainingshown in parentheses).

RR PC # Cands Time Learn (s) Time Run (s) Time match (s)Hotels (30%)

BSL 81.56 99.79 19,153 69.25 24.05 60.93Disjunction 67.02 99.82 34,262 0 12.49 109.00

Bigrams 61.35 72.77 40,151 0 1.2 127.74Hotels (10%)

BSL 84.47 99.07 20,742 37.67 31.87 65.99Disjunction 66.91 99.82 44,202 0 15.676 140.62

Bigrams 60.71 90.39 52,492 0 1.57 167.00eBay Comics (30%)

BSL 42.97 99.75 284,283 85.59 36.66 834.94Disjunction 37.39 100.00 312,078 0 45.77 916.57

Bigrams 36.72 69.20 315,453 0 102.23 926.48eBay Comics (10%)

BSL 42.97 99.74 365,454 34.26 35.65 1,073.34Disjunction 37.33 100.00 401,541 0 52.183 1,179.32

Bigrams 36.75 88.41 405,283 0 131.34 1,190.31Craig’s Cars (10%)

BSL 88.48 92.23 5,343,424 465.85 805.36 25,114.09Disjunction 87.92 89.90 5,603,146 0 343.22 26.334.79

Bigrams 97.11 4.31 1,805,275 0 996.45 8,484.79

the three different approaches. Table 4.4 also shows how long it took each method to

learn the rule and run the rule. Lastly, the column “Time match” shows how long the

classifier needs to run given the number of candidates generated by the blocking scheme.

Table 4.5 shows a few example blocking schemes that the BSL algorithm generated.

For a comparison of the attributes BSL selected to the attributes picked manually for

different domains where the data is structured the reader is pointed to my previous work

on the topic [45].

The results of Table 4.4 validate the idea that it is more important to pick the cor-

rect attributes to block on (using simple methods) than to use sophisticated methods

without attention to the attributes. Comparing the BSL rule to the Bigram results, the

112

Table 4.5: Some example blocking schemes learned for each of the domains.

BFT Domain (30%)({hotel area,token} ∧ {hotel name,token} ∧ {star rating, token}) ∪ ({hotel name, ngram3})

BFT Domain (10%)({hotel area,token} ∧ {hotel name,token}) ∪ ({hotel name,ngram3})

eBay Comics Domain (30%)({title, token})

eBay Comics Domain (10%)({title, token}) ∪ ({issue number,token} ∧ {publisher,token} ∧ {title,ngram3})

Craig’s Cars Domain (10%)({make,token}) ∪ ({model,ngram3}) ∪ ({year,token} ∧ {make,ngram3})

combination of PC and RR is always better using BSL. Note that although in the Cars

domain Bigram took significantly less time with the classifier due to its large RR, it did

so because it only had a PC of 4%. In this case, Bigrams was not even covering 5% of

the true matches.

Further, the BSL results are better than using the simplest method possible (the

Disjuction), especially in the cases where there are many records to test upon. As the

number of records scales up, it becomes increasingly important to gain a good RR, while

maintaining a good PC value as well. This savings is dramatically demonstrated by the

Cars domain, where BSL outperformed the Disjunction in both PC and RR.

One surprising aspect of these results is how prevalent the token method is within

all the domains. I expect that the ngram method would be used almost exclusively

since there are many spelling mistakes within the posts. However, this is not the case.

I hypothesize that the learning algorithm uses the token methods because they occur

with more regularity across the posts than the common ngrams would since the spelling

mistakes might vary quite differently across the posts. This suggests that there might be

113

more regularity, in terms of what one can learn from the data, across the posts than I

initially surmised.

Another interesting result is the poor reduction ratio of the eBay Comics domain.

This happens because most of the rules contain the disjunct that finds a common token

within the comic title. This rule produces such a poor reduction ratio because the value

for this attribute is the same across almost all reference set records. That is to say, when

there are just a few unique values for the BSL algorithm to use for blocking, the reduction

ratio will be small. In this domain, there are only two values for the comic title attribute,

“Fantastic Four” and “Incredible Hulk” in the Comic reference set. So it makes sense

that if blocking is done using the title attribute only, the reduction is about half, since

blocking on the value “Fantastic Four” just gets rid of the “Incredible Hulk” comics. This

points to an interesting limitation of the BSL algorithm. If there are not many distinct

values for the different attribute and method pairs that BSL can use to learn from, then

this lack of values cripples the performance of the reduction ratio. Intuitively though, this

makes sense, since it is hard to distinguish good candidate matches from bad candidate

matches if they share the same attribute values.

Another result worth mentioning is that in the BFT domain BSL gets a lower RR but

the same PC when it uses less training data. This happens because the BSL algorithm

runs until it has no more examples to cover, so if those last few examples introduce a new

disjunct that produces a lot of candidates, while only covering a few more true positives,

then this would cause the RR to decrease, while keeping the PC at the same high rate.

This is in fact what happens in this case. One way to curb this behavior would be to set

some sort of stopping threshold for BSL, but as I mention previously, maximizing the PC

114

is the most important thing, so I choose not to do this. The goal is for BSL to cover as

many true positives as it can, even if that means losing a bit in the reduction.

In fact, I next test this notion explicitly. I set a threshold in the SCA such that

after 95% of the training examples are covered, the algorithm stops and returns the

learned blocking scheme. This helps to avoid the situation where BSL learns a very

general conjunction, solely to cover the last few remaining training examples. When that

happens, BSL might end up lowering the RR, at the expense of covering just those last

training examples, because the rule learned to cover those last examples is overly general

and returns too many candidate matches.

Table 4.6: A comparison of BSL covering all training examples, and covering 95% of thetraining examples

Domain Record Linkage RR PCF-Measure

BFT DomainNo Thresh (30%) 90.63 81.56 99.79

95% Thresh (30%) 90.63 87.63 97.66eBay Comics Domain

No Thresh (30%) 91.30 42.97 99.7595% Thresh (30%) 91.47 42.97 99.69

Craig’s Cars DomainNo Thresh (10%) 77.04 88.48 92.23

95% Thresh (10%) 67.14 92.67 83.95

Table 4.6 shows that when BSL uses a threshold in the BFT and Craig’s Cars domain

there is a statistically significant drop in Pairs Completeness with a statistically significant

increase in Reduction Ratio.2 This is expected behavior since the threshold causes BSL

to kick out of SCA before it can cover the last few training examples, which in turn allows2Bold means statistically significant using a two-tailed t-test with α set to 0.05

115

BSL to retain a rule with high RR, but lower PC. However, looking at the record linkage

results (shown as Record Linkage F-Measure in Table 4.6), this threshold does in fact have

a large, detrimental effect for the Craig’s Cars domain.3 When BSL uses a threshold, the

candidates not discovered by the rule generated when using the threshold have an effect of

10% on the final F-measure match results.4 Therefore, since the F-measure results differ

by so much, I conclude that it is worthwhile to maximize PC when learning rules with

BSL, even if the RR may decrease. That is to say, even in the presence of noise, which in

turn may lead to overly generic blocking schemes, BSL should try to maximize the true

matches it covers, because avoiding even the most difficult cases to cover may affect the

matching results. As Table 4.6 shows, this is especially true in the Craig’s Cars domain

where matching is much more difficult than in the BFT domain.

Interestingly, in the eBay Comics domain there is not see a statistically significant

difference in the RR and PC. This is because across trials BSL almost always learns

the same rule whether it uses a threshold or not, and this rule covers enough training

examples that the threshold is not hit. Further, there is no statistically significant change

in the F-measure record linkage results for this domain. This is expected since BSL would

generate the same candidate matches, whether it uses the threshold or not, since in both

cases it almost always learns the same blocking rules.

The BSL results are encouraging because they show that the algorithm also works for

blocking when matching unstructured and ungrammatical text to a relational data source.3Please see subsection 4.3.1.2 for a description of the record linkage experiments and results.4Much of this difference is attributed to the non-threshold version of the algorithm learning a final

predicate that includes the make attribute by itself, which the version with a threshold does not learn.Since each make attribute value covers many records, it generates many candidates which results inincreasing PC while reducing RR.

116

This means the algorithm works in this special case too, not just the case of traditional

record linkage where one matches one structured source to another. This means the overall

algorithm of this chapter is more scalable because it uses fewer candidate matches.

4.3.1.2 Matching Results

The matching step approach in this chapter is compared to WHIRL [16]. WHIRL per-

forms record linkage by performing soft-joins using vector-based cosine similarities be-

tween the attributes. Other record linkage systems require decomposed attributes for

matching, which is not the case with the posts. WHIRL serves as the benchmark because

it does not have this requirement. To mirror the alignment task of Phoebus, the exper-

iment supplies WHIRL with two tables: the test set of posts (either 70% or 90% of the

posts) and the reference set with the attributes concatenated to approximate a record

level match. The concatenation is also used because when matching on each individual

attribute, it is not obvious how to combine the matching attributes to construct a whole

matching reference set member.

To perform the record linkage, WHIRL does soft-joins across the tables, which pro-

duces a list of matches, ordered by descending similarity score. For each post with matches

from the join, the reference set member(s) with the highest similarity score(s) is called

its match. In the Craig’s Cars domain the matches are 1-N, so this means that only 1

match from the reference set will be exploited later in the information extraction step.

To mirror this idea, the number of possible matches in a 1-N domain for recall is counted

as the number of posts that have a match in the reference set, rather than the reference

set members themselves that match. Also, this means that we only add a single match

117

to our total number of correct matches for a given post, rather than all of the correct

matches, since only one matters. This is done for both WHIRL and Phoebus, and more

accurately reflects how well each algorithm would perform as the processing step before

our information extraction step.

The record linkage results for both Phoebus and WHIRL are shown in Table 4.7.

Note that the amount of training data for each domain is shown in parentheses. All

results are statistically significant using a two-tailed paired t-test with α=0.05, except

for the precision between WHIRL and Phoebus in the Craig’s Cars domain, and the

precision between Phoebus trained on 10% and 30% of the training data in the eBay

Comics domain.

Phoebus outperforms WHIRL because it uses many similarity types to distinguish

matches. Also, since Phoebus uses both a record level and attribute level similarities, it

is able to distinguish between records that differ in more discriminative attributes. This is

especially apparent in the Craig’s Cars domain. First, these results indicate the difficulty

of matching car posts to the large reference set. This is the largest experimental domain

yet used for this problem, and it is encouraging how well my approach outperforms

the baseline. It is also interesting that the results suggest that both techniques are

equally accurate in terms of precision (in fact, there is no statistically significant difference

between them in this sense) but Phoebus is able to retrieve many more relevant matches.

This means Phoebus can capture more rich features that predict matches than WHIRL’s

cosine similarity alone. I expect this behavior because Phoebus has a notion of both field

and token level similarity, using many different similarity measures. This justifies my use

118

of the many similarity types and field and record level information, since the goal is to

find as many matches as possible.

Table 4.7: Record linkage results

Precision Recall F-measureBFT

Phoebus (30%) 87.70 93.78 90.63Phoebus (10%) 87.85 92.46 90.09

WHIRL 83.53 83.61 83.13eBay ComicsPhoebus (30%) 87.49 95.46 91.30Phoebus (10%) 85.35 93.18 89.09

WHIRL 73.89 81.63 77.57Craig’s CarsPhoebus (10%) 69.98 85.68 77.04

WHIRL 70.43 63.36 66.71

It is also encouraging that using only 10% of the data for labeling, Phoebus is able to

perform almost as well as using 30% of the data for training. Since the amount of data

on the Web is vast, only having to label 10% of the data to get comparative results is

preferable when the cost of labeling data is great. Especially since the clean annotation,

and hence relational data, comes from correctly matching the posts to the reference set,

not having to label much of the data is important if I want this technique to be widely

applicable. In fact, as stated above, I faced this practical issue in the Craig’s Cars domain

where I was unable to use 30% for training since the machine learning method would not

scale to the number of candidates generated by this much training data.

While the matching method performs well and outperforms WHIRL from the results

above, it is not clear whether it is the use of the many string metrics, the inclusion of the

119

attributes and their concatenation, or the SVM itself that provides this advantage. To

test the advantages of each piece, I ran several experiments isolating each of these ideas.

First, I ran Phoebus matching on only the concatenation of the attributes from the

reference set, rather than the concatenation and all the attributes individually. Earlier, I

stated that I use the concatenation to mirror the idea of record level similarity, and I also

use each attribute to mirror field level similarity. It is my hypothesis that in some cases,

a post will match different reference set records with the same record level score (using

the concatenation), but it will do so matching on different attributes. By removing the

individual attributes and leaving only the concatenation of them for matching, I can test

how the concatenation influences the matching in isolation. Table 4.8 shows these results

for the different domains.

Table 4.8: Matching using only the concatenation

Precision Recall F-MeasureBFT

Phoebus (30%) 87.70 93.78 90.63Concatenation Only 88.49 93.19 90.78

WHIRL 83.61 83.53 83.13eBay Comics


WHIRL 73.89 81.63 77.57Craig’s Cars


WHIRL 70.43 63.36 66.71

For the Craig’s Cars and eBay Comics there is an improvement in F-measure, in-

dicating that using the attributes and the concatenation is much better for matching

120

than using the concatenation alone. This supports the notion that the matcher also

needs a method to capture the significance of matching individual attributes since some

attributes are better indicators of matching than others. It is also interesting to note

that for both these domains, WHIRL does a better job than the machine learning using

only the concatenation, even though WHIRL also uses a concatenation of the attributes.

This is because WHIRL uses information-retrieval-style matching to find the best match,

and the machine learning technique tries to learn the characteristics of the best match.

Clearly, it is very difficult to learn what such characteristics are.

In the BFT domain, there is not a statistically significant difference in F-measure using

the concatenation alone. This means that the concatenation is sufficient to determine

the matches, so there is no need for individual fields to play a role. More specifically,

the hotel name and area seem to be the most important attributes for matching and

by including them as part of the concatenation, the concatenation is still distinguishable

enough between all records to determine matches. Since in two of the three domains there

is a huge improvement, and across all domains the algorithm never loses in F-measure,

using both the concatenation and the individual attributes is valid for the matching. Also,

since in two domains the concatenation alone was worse than WHIRL, I conclude that

part of the reason Phoebus can outperform WHIRL is the use of the individual attributes

for matching.

The next experiment tests how important it is to include all of the string metrics in

the feature vector for matching. To test this idea, I compare using all the metrics to using

just one, the Jensen-Shannon distance. I choose the Jensen-Shannon distance because

it outperformed both TF/IDF and even a “soft” TF/IDF (one that accounts for fuzzy

121

Table 4.9: Using all string metrics versus using only the Jensen-Shannon distance


Phoebus (30%) 87.70 93.78 90.63Jensen-Shannon Only 89.65 92.28 90.94

WHIRL 83.61 83.53 83.13eBay Comics


WHIRL 73.89 81.63 77.57Craig’s Cars


WHIRL 70.43 63.36 66.71

token matches) in the task of selecting the right reference sets for a given set of posts in

Chapter 2. These results are shown in Table 4.9.

As Table 4.9 shows, using all the metrics yielded a statistically significant, large im-

provement in F-measure for the eBay Comics and Craig’s Cars domains. This means that

some of the other string metrics, such as the edit distances, capture similarities that the

Jensen-Shannon distance alone does not. Interestingly, in both domains, using Phoebus

with only the Jensen-Shannon distance does not dominate WHIRL’s performance. There-

fore, the results of Table 4.9 and Table 4.8 demonstrate that Phoebus benefits from the

combination of many, varied similarity metrics along with the use of individual attributes

for field level similarities, and both of these aspects contribute to Phoebus outperforming

WHIRL.

In the case of the BFT data, there is not a statistically significant difference in the

matching results, so in this case the other metrics do not provide relevant information

122

for matching. Therefore, all the matches missed by the Jensen-Shannon only method are

also missed when I include all of the metrics. Hence, either these missed matches are very

difficult to discover, or there is not a string metric in my method yet that can capture the

similarity. For example, when the post has a token “DT” and the reference set record it

should match has a hotel area of “Downtown,” then an abbreviation metric could capture

this relationship. However, Phoebus does not include an abbreviation similarity measure.

Since none of the techniques in isolation consistently outperforms WHIRL, I conclude

that Phoebus outperforms WHIRL because it combines multiple string metrics, it uses

both individual attributes and the concatenation, and, as stated in Section 4.1.2, the

SVM classifier is well suited for the record linkage task. These results also justify the

inclusion of many metrics and the individual attributes, along with our use of SVM as

our classifier.

The last matching experiment justifies our binary rescoring mechanism. Table 4.10

shows the results of performing the binary rescoring for record linkage versus not perform-

ing this binary recoring. I hypothesize earlier in this chapter that the binary rescoring

will allow the classifier to more accurately make match decisions because the rescoring

separates out the best candidate as much as possible. Table 4.10 shows this to be the

case, as across all domains when I perform the binary rescoring the algorithm gains a

statistically significant amount in the F-measure. This shows that the record linkage is

more easily able to identify the true matches from the possible candidates when the only

difference in the record linkage algorithm is the use of binary rescoring.

123

Table 4.10: Record linkage results with and without binary rescoring


Phoebus (30%) 87.70 93.78 90.63No Binary Rescoring 75.44 81.82 78.50


eBay ComicsPhoebus (30%) 87.49 95.46 91.30

No Binary Rescoring 84.87 89.91 87.31Phoebus (10%) 85.35 93.18 89.09

No Binary Rescoring 81.52 88.26 84.75Craig’s Cars


4.3.2 Extraction Results

This section presents results that experimentally validate the supervised machine learning

approach to extracting the actual attributes embedded within the post.

As stated in Section 4.2 on Extraction, I also created a version of Phoebus that uses

CRFs, which I call PhoebusCRF. PhoebusCRF uses the same extraction features (VIE)

as Phoebus using the SVM, such as the common score regular expressions and the string

similarity metrics. I include PhoebusCRF to show that supervised extraction in general

can benefit from our reference set matching when used in a reference-set based extraction

framework.

One component of the extraction vector VIE is the vector common scores, which

includes user defined functions, such as regular expressions. Since these are the only

domain specific functions used in the algorithm, the common scores for each domain

must be specified. For the Hotels domain, the common scores includes the functions

124

matchPriceRegex and matchDateRegex. Each of these functions gives a positive score

if a token matches a price or date regular expression, and 0 otherwise. For the Comic

domain, common scores contains the functions matchPriceRegex and matchYearRegex,

which also give positive scores when a token matches the regular expression. In the Cars

domain, common scores uses the function matchPriceRegex (since year is an attribute of

the reference set, we do not use a common score to capture its form).

Table 4.11, Table 4.12, and Table 4.13 compare the field-level extraction results using

Phoebus and PhoebusCRF to the ARX results of Chapter 2 and Chapter 3. The results

for CRF-Win, CRF-Orth, and Amilcare are repeated for comparison. Unlike the matching

results, I only use 10% of the data for training for Phoebus and PhoebusCRF to make

the comparison to CRF-Win, CRF-Orth and Amilcare explicit. The attributes shown in

italics are those represented in the reference set, and those in plain text are “common”

attributes. Note that ARX does not have results for the common attributes since it could

only extract such attributes using regular expressions, since common attributes are not

represented by a reference set.

Phoebus and PhoebusCRF outperform the other systems on almost all attributes

(12 of 16), as shown in Table 4.14. In fact, that CRF-Win performed the best on the

price attribute of the eBay comic domain is misleading since none of the results for that

attribute are statistically significant with respect to each other (there are few extractions

of this attribute, so the results varied wildly across the trials). Further, since both

Amilcare and ARX use reference set data, for every single attribute, the method with the

maximum F-measure benefited from a reference set.

125

Table 4.11: Field level extraction results: BFT domain

BFT PostsRecall Precision F-Measure

Local Area Phoebus 77.80 83.58 80.52PhoebusCRF 80.71 83.38 82.01ARX 62.23 85.36 71.98CRF-Orth 64.51 77.14 70.19CRF-Win 28.51 39.79 33.07Amilcare 64.78 71.59 68.01

Date Phoebus 82.13 83.06 82.59PhoebusCRF 84.39 84.48 84.43ARX . . .CRF-Orth 81.85 81.00 81.42CRF-Win 80.39 79.54 79.96Amilcare 86.18 94.10 89.97

Hotel Name Phoebus 75.59 74.25 74.92PhoebusCRF 81.46 81.69 81.57ARX 70.09 77.16 73.46CRF-Orth 66.90 68.07 67.47CRF-Win 39.84 42.95 41.33Amilcare 58.96 67.44 62.91

Price Phoebus 93.12 98.46 95.72PhoebusCRF 90.34 92.60 91.46ARX . . .CRF-Orth 92.61 92.63 92.62CRF-Win 90.47 90.40 90.43Amilcare 88.04 91.10 89.54

Star Rating Phoebus 96.94 96.90 96.92PhoebusCRF 96.17 96.74 96.45ARX 83.94 99.44 91.03CRF-Orth 94.37 95.17 94.77CRF-Win 93.77 94.67 94.21Amilcare 95.58 97.35 96.46

Phoebus performs especially well in the Craig’s Cars domain, where it is the best

system on all the attributes. One interesting thing to note about this result is that while

the record linkage results are not spectacular for this domain, they are good enough to

yield very highly accurate extraction results. This is because most times when the system

is not picking the best match from the reference set, it is still picking one that is close

enough such that most of the reference set attributes are useful for extraction. This is

126

Table 4.12: Field level extraction results: eBay Comics domain.

eBay Comics PostsRecall Precision F-Measure

Descr. Phoebus 30.16 27.15 28.52PhoebusCRF 15.45 26.83 18.54ARX 24.80 15.90 19.38CRF-Orth 7.07 24.95 10.81CRF-Win 6.08 14.56 8.78Amilcare 8.00 52.55 13.78

Issue Num. Phoebus 80.90 82.17 81.52PhoebusCRF 83.01 84.68 83.84ARX 46.21 87.21 60.41CRF-Orth 81.88 80.34 81.10CRF-Win 80.67 80.52 80.59Amilcare 77.66 89.11 82.98


Publisher Phoebus 99.85 83.89 91.18PhoebusCRF 53.22 87.29 64.26ARX 100.00 84.16 91.40CRF-Orth 41.71 88.39 55.42CRF-Win 38.42 86.84 52.48Amilcare 63.75 90.48 74.75

Title Phoebus 89.37 89.37 89.37PhoebusCRF 90.64 92.13 91.37ARX 90.08 90.78 90.43CRF-Orth 83.58 84.13 83.85CRF-Win 77.69 77.85 77.77Amilcare 89.88 95.65 92.65

Year Phoebus 77.50 97.35 86.28PhoebusCRF 54.63 85.07 66.14ARX . . .CRF-Orth 47.31 78.14 58.19CRF-Win 37.56 66.43 46.00Amilcare 77.05 85.67 81.04

why the trim extraction results are the lowest, because that is often the attribute that

determines a match from a non-match. The record linkage step likely selects a car that is

close, but differs in the trim, so the match is incorrect and the trim will most likely not

127

Table 4.13: Field level extraction results: Craig’s Cars domain.

Craig’s Cars PostsRecall Precision F-Measure

Make Phoebus 98.21 99.93 99.06PhoebusCRF 90.73 96.71 93.36ARX 95.99 100.00 97.95CRF-Orth 82.03 85.40 83.66CRF-Win 80.09 77.38 78.67Amilcare 97.58 91.76 94.57

Model Phoebus 92.61 96.67 94.59PhoebusCRF 84.58 94.10 88.79ARX 83.02 95.01 88.61CRF-Orth 71.04 77.81 74.25CRF-Win 67.87 69.67 68.72Amilcare 78.44 84.31 81.24


Trim Phoebus 63.11 70.15 66.43PhoebusCRF 55.61 64.95 59.28ARX 39.52 66.94 49.70CRF-Orth 47.94 47.90 47.88CRF-Win 38.77 39.35 38.75Amilcare 27.21 53.99 35.94

Year Phoebus 88.48 98.23 93.08PhoebusCRF 85.54 96.44 90.59ARX 76.28 99.80 86.47CRF-Orth 84.99 91.33 88.04CRF-Win 83.03 86.12 84.52Amilcare 86.32 91.92 88.97

Table 4.14: Summary results for extraction showing the number of times each systemhad highest F-Measure for an attribute.

Num. of Max. F-MeasuresDomainPhoebus PhoebusCRF ARX Amilcare CRF-Win CRF-Orth

Total Attributes

BFT 2 2 0 1 0 0 5eBay Comics 2 1 1 1 1 0 6Craig’s Cars 5 0 0 0 0 0 5All 9 3 1 2 1 0 16

128

be extracted correctly, but the rest of the attributes can be extracted using the reference

set member.

Another interesting result is that Phoebus or PhoebusCRF handily outperformed the

CRF based methods on every attribute (except for comic price which was insignificant for

all systems). This lends credibility to my claim that by training the system to extract all

of the attributes, even those in the reference set, it can more accurately extract attributes

not in the reference set because it is training the system to identify what something is

not.

The overall performance of Phoebus validates my supervised machine learning ap-

proach to extraction. By infusing information extraction with the outside knowledge of

reference sets, Phoebus is able to perform well across three different domains, each repre-

sentative of a different type of source of posts: the auction sites, Internet classifieds and

forum/bulletin boards.

129

Chapter 5

Related Work

The work in this thesis touches upon a number of areas, and I present related research

for the various areas and how my work differs from other research in the field.

First, the main thrust of this research is on the topic of extraction, both automatic

and machine-learning based. Although my automatic extraction method touches upon

methods of unsupervised extraction [8; 28; 51], such methods as theirs would not work

on posts where one cannot assume that any structure, and therefore extraction pattern,

will hold for all of the posts. Note that much work on extraction is supervised, and so

my automatic approach differs substantially.

Further, although my supervised approach fits into the context of other methods for

extraction that use supervised machine learning, my method is quite different as well.

Many other methods, such as Conditional Random Fields (CRF), assume at least some

structure in the extracted attributes to do the extraction. As my extraction experiments

show, the CRF-Win and CRF-Orth implemented using the Simple Tagger implementation

of Conditional Random Fields [42] did not perform well. Other extraction approaches,

such as Datamold [7] and CRAM [1], segment whole records (like bibliographies) into

130

attributes, with little structural assumption. In fact, CRAM even uses reference sets to

aid its extraction. However, both systems require that every token of a record receive

a label, which is not possible with posts that are filled with irrelevant, “junk” tokens.

Along the lines of CRAM and Datamold, the work of Bellare and McCallum [4] uses

a reference set to train a CRF to extract data, which is similar to my PhoebusCRF

implementation. However, there are two differences between PhoebusCRF and their work

[4]. First, the work of Bellare and McCallum [4] mentions that reference set records are

matched using simple heuristics, but it is unclear how this is done. In my work, matching

is done explicitly and accurately through record linkage. Second, their work only uses

the records from the reference set to label tokens for training an extraction module, while

PhoebusCRF uses the actual values from the matching reference set record to produce

useful features for extraction and annotation.

Another IE approach similar to mine performs named entity recognition using “Semi-

CRFs” with a dictionary component [15], which functions like a reference set. However,

in their work the dictionaries are defined as lists of single attribute entities, so finding an

entity in the dictionary is a look-up task. My reference sets are relational data, so finding

the match becomes a record linkage task. Further, their work on Semi-CRFs [15] focuses

on the task of labeling segments of tokens with a uniform label, which is especially useful

for named entity recognition. In the case of posts, however, Phoebus needs to relax such

a restriction because in some cases such segments will be interrupted, as the case of a

hotel name with the area in the middle of the hotel name segment. So, unlike their work,

Phoebus makes no assumptions about the structure of posts. Recently, Semi-CRFs have

been extended to use database records in the task of integrating unstructured data with

131

relational databases [41]. This work is similar to mine in that it links unstructured data,

such as paper citations, with relational databases, such as reference sets of authors and

venues. The difference is that I view this as a record linkage task, namely finding the

right reference set tuple to match. In their paper, even though they use matches from

the database to aid extraction, they view the linkage task as an extraction procedure

followed by a matching task. Lastly, I are not the first to consider structured SVMs for

information extraction. Previous work used structured SVMs to perform Named Entity

Recognition [60] but their extraction task does not use reference sets.

Using reference-sets to aid extraction is similar to the work on ontology-based infor-

mation extraction [25]. Later versions of their work even talk about using ontology-based

information extraction as a means to semantically annotate unstructured data such as car

classifieds [21]. However, in contrast to my work, the information extraction is performed

by a keyword-lookup into the ontology along with structural and contextual rules to aid

the labeling. The ontology itself contains keyword misspellings and abbreviations, so that

the look-up can be performed in the presence of noisy data. I argue the ontology-based

extraction approach is less scalable than a record linkage type matching task because

creating and maintaining the ontology requires extensive data engineering in order to en-

compass all possible common spelling mistakes and abbreviations. Further, if new data

is added to the ontology, additional data engineering must be performed. In my work,

I can simply add new tuples to the reference set. Also, my methods can construct its

own reference set, and none of these approaches mentions building the ontology to use for

extraction. Lastly, in contrast to my work, this ontology based work assumes contextual

and structural rules will apply, making an assumption about the data to extract from.

132

Yet another interesting approach to information extraction using ontologies is the

Textpresso system which extracts data from biological text [48]. This system uses a reg-

ular expression based keyword look-up to label tokens in some text based on the ontol-

ogy. Once all tokens are labeled, Textpresso can perform “fact extraction” by extracting

sequences of labeled tokens that fit a particular pattern, such as gene-allele reference

associations. Although this system again uses a reference set for extraction, it differs in

that it does a keyword look-up into the lexicon.

My work on automatically building reference sets is related to the topic of ontology

creation. My approach follows almost all of the current research on this topic in that

I generate some candidate terms and then plug these terms into an hierarchy. The

classic paper of Sanderson and Croft [54] laid much of the groundwork by developing the

subsumption heuristic based on terms. Other papers have applied the same technique to

developing other hierarchies, such as an ontology of Flickr tags [55]. Other techniques find

candidate terms, but build the hierarchies using other methods such as Smoothed Formal

Concept Analysis [13], Principal Component Analysis [23] or even Google to determine

term dependencies [40]. However, there are a number of differences between the above

work and mine (beyond the fact that some of the methods use NLP to determine the

terms, which we cannot do with posts). First is that all of the other systems build up

large concept hierarchies. In my case, I build up a large number of small independent

hierarchies, which I then flatten into a reference set. So, I use hierarchy building in order

to construct a set of relational tuples, which are only an hierarchy in the sense that they

can be represented by trees. My method is more like fleshing out an taxonomy, rather

than building a conceptual ontology. I also present work that suggests how many posts

133

are needed to construct the reference set, in contrast to the other work that just uses

many web pages without this consideration. I also present an algorithm that suggests

whether the user can actually do the construction by comparing bigram-type distributions

to known distributions. These other construction methods do not have this functionality.

The matching step yields annotation for posts and as such it is somewhat related

to the topic of semantic annotation as well. Many researchers have followed this path,

attempting to automatically mark up documents for the Semantic Web [12; 22; 27; 61;

35]. However, these systems rely on lexical information, such as part-of-speech tagging or

shallow Natural Language Processing to do their extraction/annotation (e.g., Amilcare,

Ciravegna, 2001) or specific structure to exploit. According to a recent survey [53],

systems that perform semantic annotation separate into three categories: rule-based,

pattern-based, and wrapper induction methods. However, the rule-based and pattern-

based methods rely on regularity within the text, which is not the case with posts.

Also, beyond exploiting regular structure, the wrapper induction methods use supervised

machine learning instead of unsupervised methods.

The semantic annotation system closest to mine is SemTag [20], which first identifies

tokens of interest in the text, and then labels them using the TAP taxonomy, which is

similar to a reference set. This taxonomy is carefully crafted, which gives it good accuracy

and meaning. However, annotation using ontologies faces the same issues as extraction

using ontologies. Particularly, it is hard to model and add new tuples in an ontology. In

contrast, my reference sets are flexible and I can incorporate any that I can automatically

collect.

134

Further, SemTag focuses on disambiguation which my approach (both automatic and

supervised) avoids. If one looks up the token “Jaguar” it might refer to a car or an animal,

because SemTag disambiguates after labeling. In my case, I perform disambiguation

before the labeling procedure, during the selection of the relevant reference sets, whether

given by the user, selected by the user, or built by the algorithm.

The supervised machine learning approach to matching is similar to record linkage

which is not a new topic [26] and is well studied even now [6]. However, most current

research focuses on matching one set of records to another set of records based on their

decomposed attributes. There is little work on matching data sets where one record is

a single string composed of the other data set’s attributes to match on, as in the case

with posts and reference sets. The WHIRL system [16] allows for record linkage without

decomposed attributes, but as shown in results of Chapter 4 my approach outperforms

WHIRL, since WHIRL relies solely on the vector-based cosine similarity between the

attributes, while my approach exploits a larger set of features to represent both field and

record level similarity. I note with interest the EROCS system [10] where the authors

tackle the problem of linking full text documents with relational databases. The technique

involves filtering out all non-nouns from the text, and then finding the matches in the

database. This is an intriguing approach; interesting future work would involve perfor3g

a similar filtering for larger documents and then applying my machine learning algorithm

to match the remaining nouns to reference sets.

Using the reference set’s attributes as normalized values is similar to the idea of data

cleaning. However, most data cleaning algorithms assume tuple-to-tuple transformations

[34; 11]. That is, some function maps the attributes of one tuple to the attributes of

135

another. This approach would not work on ungrammatical and unstructured data, where

all attributes are embedded within the post, which maps to a set of attributes from the

reference set.

My work on automatically selecting reference sets is similar to resource selection in

distributed information retrieval, sometimes used for the “hidden web.” In resource

selection, different servers are chosen from a set of servers to return documents for a

given query. Craswell, Bailey, and Hawking [18] compare three popular approaches to

resource selection. However, these retrieval techniques execute probe queries to estimate

the resource’s data coverage and its effectiveness at returning relevant documents. Then,

these coverage and effectiveness statistics are used to select and merge the appropriate

resources for a query. This overhead is unnecessary for my task. Because I have full

access to all of our reference sets in our repository, the system already knows the full

data coverage, and does not need to estimate our repository’s effectiveness, because it

always returns all of the sets.

Lastly, there is work related to the BSL algorithm that is proposed to aid the super-

vised machine learning approach to matching by providing a better approach to blocking.

In recent work on learning efficient blocking schemes Bilenko et al., 5 developed a sys-

tem for learning disjunctive normal form blocking schemes. However, they learn their

schemes using a graphical set covering algorithm, while we use a version of the Sequential

Covering Algorithm (SCA). There are also similarities between our BSL algorithm and

work on mining association rules from transaction data [2]. Both algorithms discover

propositional rules. Further, both algorithms use multiple passes over a data set to dis-

cover their rules. However despite these similarities, the techniques really solve different

136

problems. BSL generates a set of candidate matches with a minimal number of false

positives. To do this, BSL learns conjunctions that are maximally specific (eliminating

many false positives) and unions them together as a single disjunctive rule (to cover the

different true positives). Since the conjunctions are maximally specific, BSL uses SCA

underneath, which learns rules in a depth-first, general to specific manner [47]. On the

other hand, the work of mining association rules [2] looks for actual patterns in the data

that represent some internal relationships. There may be many such relationships in the

data that could be discovered, so this approach covers the data in a breadth-first fashion,

selecting the set of rules at each iteration and extending them by appending to each a

new possible item.

137

Chapter 6

Conclusions

This thesis presents an approach to information extraction from unstructured, ungram-

matical text such as internet classifieds, auction titles, and bulletin board postings. My

approach, called “reference-set based extraction” exploits a “reference set” of known en-

tities to aid extraction. By using reference sets for extraction, instead of grammar or

structure, my technique relaxes the assumption that posts require structure or grammar,

which they may not. To exploit a reference set my approach first finds the matches be-

tween the unstructured, ungrammatical text and the reference set. It then uses these

matches as clues to aid extraction. This thesis presents two approaches to reference-set

based extraction. One is a fully automatic method that can even include an automatic

approach to constructing the reference set from the unstructured data itself. The other

method is highly accurate and uses machine learning to solve the task.

To reiterate, the contributions of this thesis are as follows:

• An automatic matching and extraction algorithm that exploits a given reference set

(Chapter 2)

138

• A method that selects the appropriate reference sets from a repository and uses

them for extraction and annotation without training data (Chapter 2)

• An automatic method for constructing reference sets from the posts themselves.

This algorithm also is able to suggest the number of posts required to automatically

construct the reference set sufficiently (Chapter 3)

• An automatic method whereby the machine can decide whether to construct its

own reference set, or require a person to do it manually (Chapter 3)

• A method that uses supervised machine learning to perform high-accuracy extrac-

tion on posts, even in the face of ambiguity. (Chapter 4)

Reference-set based extraction allows for structured query support and data integra-

tion over unstructured, ungrammatical text, which was previously not possible using just

keyword search. For example, keyword search does not support aggregate queries. Nei-

ther is it easy to perform joins with outside sources using keyword search. Therefore, in

the larger context of information integration, reference-set based extraction allows for a

whole new set of sources, “posts,” to be included in the task of data integration which

previously dealt with structured and semi-structured sources such as databases and web-

pages respectively. By combining posts with more traditional structured sources, new

and interesting applications become available. For instance, by linking the cars for sale

on Craigslist with the database of car safety ratings from the National Highway Trans-

portation Safety Administration, a user can search for used cars that conform to desired

safety ratings.

139

Further, information extraction from unstructured, ungrammatical sources can aid

other fields such as information retrieval, knowledge bases, and the Semantic Web. In

information retrieval, Web crawlers can better index pages by classifying the higher level

class of the page. For instance, by using reference-set based extraction on the Craiglist

car ads, the Web crawler could determine that the class of items referred to on the web-

page are cars, since extraction used a set of Cars for the reference set. In turn, if a user

searches for “car classifieds,” then the set of car classifieds could be returned even if it

never mentions the keyword “car” because the reference-set used for extraction makes it

clear that the set of classifieds is about the topic of cars.

Along these lines, the algorithm for automatically constructing reference sets can be

used to flesh out ontologies and taxonomies for knowledge bases. Further, reference-set

based extraction might be a tool that can aid in the ontology alignment problem. For

example, two ontologies of hotels might not have enough information for alignment. It

might be that one set has star ratings and names, all of which are unique within the set,

and another ontology has a name and location, which are all unique within the set, but

the names individually are not unique within the set. In this case, each reference set

could be used to find matches to a set of posts about hotels. Then, using transitivity

across the matches between the ontologies and the posts, it might be possible to discover

how to join the ontologies together in a unified set.

Lastly, one bottleneck for the Semantic Web is marking up the entities on the Web

with the tags defined by the description logics. However, if a reference set is linked

to an RDF description of the attributes, a bot can use reference-set based extraction

140

to annotate the vast amount of posts on the Web, helping to realize the vision of the

Semantic Web.

There are a number of future directions that research on this topic can follow. First, as

stated in Chapter 3, there are certain cases where the automatic method of constructing

a reference set is not appropriate. One of these occurs when textual characteristics of the

posts make it difficult to automatically construct the reference set. One future topic of

research is a more robust and accurate method for automatically constructing reference

sets from data when the data does not fit the criteria for automatic creation. This is a

larger new topic that may involve combining the automatic construction technique in this

thesis with techniques that leverage the entire Web for extracting attributes for entities.

Along these lines, in certain cases it may simply not be possible for an automatic method

to discover a reference set. In this problem, a very weakly supervised approach might be

more appropriate where a user is involved in the reference set creation. However, this

user involvement should be limited, such as only providing seed examples of reference set

tuples, sets of Web pages outside of the posts for use in the automatic construction, or

helping the system prune errant reference set tuples.

Another topic of future research involves increasing the scope of reference-set based

extraction beyond unstructured, ungrammatical text. The use of reference sets should

be helpful for other topics such as Named-Entity Recognition and for linking larger doc-

uments with reference sets. This would allow for query and integration of these larger

documents, which is sometimes called information fusion. Another interesting future

direction is to allow the mining of synonyms and related string transformations, and

141

then figuring out how to use these in the context of reference-set based extraction in an

automatic and scalable fashion.

Beyond the scope of improving extraction, there are numerous lines of future research

that involve using the extracted data, which is output by my reference-set based extraction

approach. One future line of research involves using the extracted data for data mining to

help users make informed decisions about aspects such as the prices of items for sale, the

quality of items for sale, or the general opinions regarding items for sale. For instance,

after discovering a reference set of laptops, it would be useful for the system to gather

as much other data about each laptop as possible, such as the average price and average

review for the computer. It could do this dynamically, based on the newly extracted

attributes. Once this is done a user can make a much more informed decision about

whether they are looking at a good deal or not. Another way to use the data post

extraction is to create sale portals for e-Commerce. For instance, by aggregating all of

the used cars on the Web from all of the sites in one area where aggregate joins can be

made, users can glean many insights such as who has the best inventory, which source

gives the best prices, etc.

In this thesis I describe reference-set based extraction which allows users and comput-

ers to make sense of vast amounts of data that were previously only accessible through

limited keyword search. With so much information to process, hopefully computer sci-

ence research will continue in the direction of making sense of the World Wide Web, the

single most remarkable source of knowledge.

142

Bibliography

[1] Eugene Agichtein and Venkatesh Ganti. Mining reference tables for automatic textsegmentation. In the Proceedings of the 10th ACM Conference on Knowledge Dis-covery and Data Mining, pages 20 – 29. ACM Press, 2004.

[2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rulesbetween sets of items in large databases. In Proceedings of the ACM SIGMODInternational Conference on Management of Data, pages 207–216. ACM Press, 1993.

[3] Rohan Baxter, Peter Christen, and Tim Churches. A comparison of fast blockingmethods for record linkage. In Proceedings of the 9th ACM SIGKDD Workshop onData Cleaning, Record Linkage, and Object Identification, pages 25–27, 2003.

[4] Kedar Bellare and Andrew McCallum. Learning extractors from unlabeled textusing relevant databases. In Proceedings of the AAAI Workshop on InformationIntegration on the Web, pages 10–16, 2007.

[5] Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. Adaptive blocking:Learning to scale up record linkage and clustering. In Proceedings of the 6th IEEEInternational Conference on Data Mining, pages 87–96, 2006.

[6] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learn-able string similarity measures. In Proceedings of the 9th ACM International Con-ference on Knowledge Discovery and Data Mining, pages 39–48. ACM Press, 2003.

[7] Vinayak Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic segmenta-tion of text into structured records. In Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data, pages 175–186. ACM Press, 2001.

[8] Michael J. Cafarella, Doug Downey, Stephen Soderland, and Oren Etzioni. Know-itnow: Fast, scalable information extraction from the web. In Proceedings of HLT-EMNLP, pages 563–570. Association for Computational Linguistics, 2005.

[9] Mary Elaine Califf and Raymond J. Mooney. Relational learning of pattern-matchrules for information extraction. In Proceedings of the 16th National Conference onArtificial Intelligence, pages 328–334, 1999.

[10] Venkatesan T. Chakaravarthy, Himanshu Gupta, Prasan Roy, and Mukesh Mohania.Efficiently linking text documents with relevant structured information. In Pro-ceedings of the International Conference on Very Large Data Bases, pages 667–678.VLDB Endowment, 2006.

143

[11] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robustand efficient fuzzy match for online data cleaning. In Proceedings of ACM SIGMODInternational Conference on Management of Data, pages 313–324. ACM Press, 2003.

[12] Philipp Cimiano, Siegfried Handschuh, and Steffen Staab. Towards the self-annotating web. In Proceedings of the 13th International Conference on World WideWeb, pages 462–471. ACM Press, 2004.

[13] Philipp Cimiano, Andreas Hotho, and Steffen Staab. Learning concept hierarchiesfrom text corpora using formal concept analysis. Journal of Artificial IntelligenceResearch, 24:305–339, 2005.

[14] Fabio Ciravegna. Adaptive information extraction from text by rule induction andgeneralisation. In Proceedings of the 17th International Joint Conference on Artifi-cial Intelligence, pages 1251–1256, 2001.

[15] William Cohen and Sunita Sarawagi. Exploiting dictionaries in named entity extrac-tion: combining semi-markov extraction processes and data integration methods. InProceedings of the 10th ACM International Conference on Knowledge Discovery andData Mining, pages 89–98, Seattle, Washington, August 2004. ACM Press.

[16] William W. Cohen. Data integration using similarity joins and a word-based in-formation representation language. ACM Transactions on Information Systems,18(3):288–321, 2000.

[17] William W. Cohen, Pradeep Ravikumar, and Stephen E. Feinberg. A comparison ofstring metrics for matching names and records. In Proceedings of the ACM SIGKDDWorkshop on Data Cleaning, Record Linkage, and Object Consoliation, pages 13–18,2003.

[18] Nick Craswell, Peter Bailey, and David Hawking. Server selection on the world wideweb. In Proceedings of the Conf. on Digital Libraries, pages 37–46. ACM Press,2000.

[19] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towardsautomatic data extraction from large web sites. In Proceedings of 27th InternationalConference on Very Large Data Bases, pages 109–118. VLDB Endowment., 2001.

[20] S. Dill, N. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan,A. Tomkins, J. A. Tomlin, and J. Y. Zien. Semtag and seeker: Bootstrapping thesemantic web via automated semantic annotation. In Proceedings of WWW, pages178–186. ACM Press, 2003.

[21] Yihong Ding, David W. Embley, and Stephen W. Liddle. Automatic creation andsimplified querying of semantic web content: An approach based on information-extraction ontologies. In Proceedings of the Asian Semantic Web Conference, pages400–414, 2006.

144

[22] Alexiei Dingli, Fabio Ciravegna, and Yorick Wilks. Automatic semantic annotationusing unsupervised information extraction and integration. In Proceedings of theK-CAP Workshop on Knowledge Markup and Semantic Annotation, 2003.

[23] Georges Dupret and Benjamin Piwowarski. Principal components for automaticterm hierarchy building. In SPIRE, pages 37–48, 2006.

[24] Mohamed G. Elfeky, Vassilios S. Verykios, and Ahmed K. Elmagarmid. TAILOR:A record linkage toolbox. In Proceedings of 18th International Conference on DataEngineering, pages 17–28, 2002.

[25] David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle, Yiu-KaiNg, Dallan Quass, and Randy D. Smith. Conceptual-model-based data extractionfrom multiple-record web pages. Data Knowledge Engineering, 31(3):227–251, 1999.

[26] Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of theAmerican Statistical Association, 64:1183–1210, 1969.

[27] Siegfried Handschuh, Steffen Staab, and Fabio Ciravegna. S-cream - semi-automaticcreation of metadata. In Proceedings of the 13th International Conference on Knowl-edge Engineering and Knowledge Management, pages 165–184. Springer Verlag, 2002.

[28] Hany Hassan, Ahmed Hassan, and Ossama Emam. Unsupervised information ex-traction approach using graph mutual reinforcement. In the Conference on EmpiricalMethods in Natural Language Processing (EMNLP), pages 501–508. Association forComputational Linguistics, 2006.

[29] Mauricio A. Hernandez and Salvatore J. Stolfo. Real-world data is dirty: Datacleansing and the merge/purge problem. Data Mining and Knowledge Discovery,2(1):9–37, 1998.

[30] Matthew A. Jaro. Advances in record-linkage methodology as applied to matchingthe 1985 census of tampa, florida. Journal of the American Statistical Association,89:414–420, 1989.

[31] Thorsten Joachims. Advances in Kernel Methods - Support Vector Learning, chapter11: Making large-Scale SVM Learning Practical. MIT-Press, 1999.

[32] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields:Probabilistic models for segmenting and labeling sequence data. In Proceedings ofthe 18th International Conference on Machine Learning, pages 282–289. MorganKaufmann, 2001.

[33] Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding topic words for hier-archical summarization. In SIGIR ’01: Proceedings of the 24th annual internationalACM SIGIR conference on Research and development in information retrieval, pages349–357, New York, NY, USA, 2001. ACM.

145

[34] Mong-Li Lee, Tok Wang Ling, Hongjun Lu, and Yee Teng Ko. Cleansing data formining and warehousing. In Proceedings of the 10th International Conference onDatabase and Expert Systems Applications, pages 751–760. Springer-Verlag, 1999.

[35] Kristina Lerman, Cenk Gazen, Steven Minton, and Craig A. Knoblock. Populat-ing the semantic web. In Proceedings of the AAAI Workshop on Advances in TextExtraction and Mining, 2004.

[36] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions,and reversals. English translation in Soviet Physics Doklady, 10(8):707–710, 1966.

[37] Alon Levy. Logic-based techniques in data integration. In Jack Minker, editor, LogicBased Artificial Intelligence, pages 575–595. Kluwer Academic Publishers, Norwell,MA, USA, 2000.

[38] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying heterogeneousinformation sources using source descriptions. In Proceedings of the 22nd VLDBConference, pages 251–262, Bombay, India, 1996. Morgan Kaufmann Publishers Inc.

[39] J. Lin. Divergence measures based on the shannon entropy. IEEE Transactions onInformation Theory, 37(1):145–151, 1991.

[40] Masoud Makrehchi and Mohamed S. Kamel. Automatic taxonomy extraction usinggoogle and term dependency. In IEEE/WIC/ACM International Conference on WebIntelligence, 2007.

[41] Imran R. Mansuri and Sunita Sarawagi. Integrating unstructured data into relationaldatabases. In Proceedings of the International Conference on Data Engineering,page 29. IEEE Computer Society, 2006.

[42] Andrew McCallum. Mallet: A machine learning for language toolkit.http://mallet.cs.umass.edu, 2002.

[43] Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the6th ACM SIGKDD, pages 169–178, 2000.

[44] Matthew Michelson and Craig A. Knoblock. Semantic annotation of unstructuredand ungrammatical text. In Proceedings of the 19th International Joint Conferenceon Artificial Intelligence, pages 1091–1098, 2005.

[45] Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for recordlinkage. In Proceedings of the 21st National Conference on Artificial Intelligence,2006.

[46] Steven N. Minton, Claude Nanjo, Craig A. Knoblock, Martin Michalowski, andMatthew Michelson. A heterogeneous field matching method for record linkage. InProceedings of the 5th IEEE International Conference on Data Mining (ICDM-05),2005.

146

[47] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[48] Hans-Michael Muller and Eimear E. Kenny Paul W. Sternberg. Textpresso: Anontology-based information retrieval and extraction system for biological literature.PLoS Biology, 2(11), 2004.

[49] Ion Muslea, Steven Minton, and Craig A. Knoblock. Hierarchical wrapper induc-tion for semistructured information sources. Autonomous Agents and Multi-AgentSystems, 4(1/2):93–114, 2001.

[50] H. B. Newcombe. Record linkage: The design of efficient systems for linkingrecords into individual and family histories. American Journal of Human Genet-ics, 19(3):335–359, 1967.

[51] Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lifchits, and Alpa Jain. Orga-nizing and searching the world wide web of facts - step one: the one-million factextraction challenge. In Proceedings of the 21st National Conference on ArtificialIntelligence (AAAI), pages 1400–1405. AAAI Press, 2006.

[52] Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[53] Lawrence Reeve and Hyoil Han. Survey of semantic annotation platforms. In Pro-ceedings of ACM Symposium on Applied Computing, pages 1634–1638. ACM Press,2005.

[54] Mark Sanderson and Bruce Croft. Deriving concept hierarchies from text. In SI-GIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference onResearch and development in information retrieval, pages 206–213, New York, NY,USA, 1999. ACM.

[55] Patrick Schmitz. Inducing ontology from flickr tags. In Workshop on CollaborativeWeb Tagging, 2006.

[56] Temple F. Smith and Michael S. Waterman. Identification of common molecularsubsequences. Journal of Molecular Biology, 147:195–197, 1981.

[57] Stephen Soderland. Learning information extraction rules for semi-structured andfree text. Machine Learning, 34(1-3):233–272, 1999.

[58] Snehal Thakkar, Jose Luis Ambite, and Craig A. Knoblock. Composing, optimizing,and executing plans for bioinformatics web services. The VLDB Journal, SpecialIssue on Data Management, Analysis, and Mining for the Life Sciences, 14(3):330–353, 2005.

[59] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun.Support vector machine learning for interdependent and structured output spaces.In Proceedings of the 21st International Conference on Machine Learning, page 104.ACM Press, 2004.

147

[60] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun.Large margin methods for structured and interdependent output variables. Journalof Machine Learning Research, 6:1453–1484, 2005.

[61] Maria Vargas-Vera, Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt,and Fabio Ciravegna. MnM: Ontology driven semi-automatic and automatic sup-port for semantic markup. In Proceedings of the 13th International Conference onKnowledge Engineering and Management, pages 213–221, 2002.

[62] William E. Winkler. The state of record linkage and current research problems.Technical report, U.S. Census Bureau, 1999.

[63] William E. Winkler and Yves Thibaudeau. An application of the fellegi-sunter modelof record linkage to the 1990 U.S. Decennial Census. Technical report, StatisticalResearch Report Series RR91/09 U.S. Bureau of the Census, 1991.

[64] Chengxiang Zhai and John Lafferty. A study of smoothing methods for languagemodels applied to ad hoc information retrieval. In Proceedings of the 24th ACMSIGIR Conference on Research and Development in Information Retrieval, pages334–342. ACM Press, 2001.

148

A REFERENCE-SET APPROACH TO INFORMATION EXTRACTION...

Documents

Transcript of A REFERENCE-SET APPROACH TO INFORMATION EXTRACTION...