
The Case for Cost-Sensitive and Easy-To-Interpret Models in Industrial Record Linkage

Sheng Chen, Stevens Institute of Technology, Hoboken, NJ 07030, USA, [email protected]
Andrew Borthwick, Intelius Data Research, Bellevue, WA 98004, USA, [email protected]
Vitor R. Carvalho, Intelius Data Research, Bellevue, WA 98004, USA, [email protected]

ABSTRACT

Record Linkage (RL) is the task of identifying two or more records referring to the same entity (e.g., a person, a company, etc.). RL systems have traditionally handled all input record types in the same way. In an industrial setting, however, business imperatives (such as privacy constraints, government regulation, etc.) often force RL systems to operate under extremely strict false positive/negative error rate constraints. For instance, false positive errors can be life threatening when identifying medical records, while false negative errors on criminal records can lead to serious legal issues. In this paper we introduce RL models based on Cost Sensitive Alternating Decision Trees (ADTree), an algorithm that uniquely combines boosting and decision tree algorithms to create shorter and easier-to-interpret linking rules. These models present a two-fold advantage when compared to traditional RL approaches. First, they can be naturally trained to operate at industrial precision/recall operating points. Second, the output rules are short and clear enough that the model can effectively explain its decisions to non-technical users via score aggregation or visualization. Experiments show that the proposed models significantly outperformed other baselines on the desired industrial operating points, and that the improved understanding of the model's decisions led to faster debugging and feature development cycles. We then describe how we deployed the model to a commercial RL system with several billion personal records covering nearly the entire U.S. population as input, and obtained a 6:1 ratio of input records to output profiles, with an estimated 99.6%/86.2% precision/recall trade-off. This system was then deployed in a commercial e-commerce website, as well as to the sub-domain of linking criminal records, obtaining an impressive 99.7%/82.9% precision/recall overall trade-off.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
9th International Workshop on Quality in Databases, August 29, 2011, Seattle, WA
Copyright 2011 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

General Terms
Algorithms, Performance

Keywords
Data Cleaning, Record Linkage, Cost Sensitive

1. INTRODUCTION

Due to the critical importance of databases in today's economy, the quality of the information (or the lack thereof) stored in databases can have significant cost implications for conducting business [15]. For instance, poor quality customer data has been estimated to cost US businesses $611B annually in postage, printing, and staff overhead [14]. When consolidating information from multiple data sources, it is always desirable to create an error-free database by locating and merging duplicate records belonging to the same entity. Duplicate records can have many deleterious effects, such as preventing discoveries of important regularities [7] and erroneously inflating estimates of the number of entities [34]. Unfortunately, this cleaning operation is frequently quite challenging due to the lack of a universal identifier that would uniquely identify each entity.

The study of quickly and accurately identifying duplicates within one or more data sources is generally known as Record Linkage (RL) [18]. Synonyms in the database community include record matching [3], merge-purge [21], duplicate detection [26], and reference reconciliation [13]. RL has been successfully applied to census databases [34], biomedical databases [12], and web applications such as reference disambiguation for the scholarly digital library CiteSeerX [1] and online comparison shopping [6].

The general approach to record linkage is to first estimate the similarity between corresponding fields in order to reduce or eliminate the confusion introduced by typographical errors or abbreviations. A straightforward similarity function could be based on edit distance, such as the Levenshtein distance [24]. After that, a strategy for combining these similarity estimates across the multiple fields of the two records is applied to determine whether the two records are a match or not. The strategy could be rule-based, generally relying on domain knowledge or on generic distance metrics to match records [15]. A common practice, however, is to use Machine Learning (ML) techniques: treat the similarities across multiple fields as a vector of features and "learn" how to map them into a match/unmatch binary decision. ML techniques that have been tried for RL include
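The edit-distance similarity mentioned above can be made concrete with a minimal Levenshtein implementation (a standard dynamic-programming sketch of the cited metric, not code from this system):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]
```

A field-level similarity feature would typically normalize this distance by the field lengths before it is handed to the classifier.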


Figure 1: A cost matrix for cost sensitive learning

Support Vector Machines [7], decision trees [30], maximum entropy [8], and composite ML classifiers tied together by boosting or bagging [36].

Despite its importance for producing accurate estimates of duplicates in databases, insufficient attention has been given to tailoring ML techniques to optimize the performance of industrial RL systems. In this paper, we propose cost sensitive extensions of the Alternating Decision Tree (ADTree) [16] algorithm to address these problems. Cost Sensitive ADTrees (CS-ADTree) are a novel improvement to the ADTree algorithm that is well-suited to business requirements demanding a system with extremely different tolerances for false-positive and false-negative errors. Specifically, our method assigns biased misclassification costs to positive class and negative class examples, which makes it fundamentally different from the ML techniques used by most existing record linkage frameworks.

Because an ADTree outputs a single tree with short, easy-to-read rules, another key advantage of the proposed method is that it can effectively explain its decisions, even to non-technical users, using simple score aggregation and/or tree visualization. Even for very large models with hundreds of features, score aggregation can be straightforwardly applied to perform feature blame assignment, i.e., to consistently calculate the contribution of each feature to the final score of any decision. We show that improved understanding of these models has led to faster debugging and feature development cycles.

This work is organized as follows. Section 2 describes the concept of cost-sensitive learning and discusses its relevance to industrial RL. Section 3 then discusses the design of reliable features for our system. After that, we describe ADTrees and introduce the CS-ADTree algorithm in Section 4, and define our evaluation metrics in Section 5. The experimental performance of the proposed algorithm is presented in Section 6, along with a comparison with alternative ML techniques for record linkage. We discuss our results and cover related work on record linkage in Section 7, and conclude the paper in Section 8.

2. COST-SENSITIVE LEARNING

One common motivation for cost sensitive learning is the scenario of training a classifier on a data set with a significantly unequal distribution among classes. This sort of problem usually involves either relative or absolute imbalances [20]. Absolute imbalances arise in data sets where minority class examples are genuinely scarce and under-represented, whereas relative imbalances characterize data sets in which minority examples are well represented but remain severely outnumbered by majority class examples.

A second motivation for cost sensitive learning is when there are different "business costs" (real-world consequences) for false positive and false negative errors. Cost sensitive learning for the binary classification problem is best illustrated by a cost matrix over data consisting of two classes: positive class P and negative class N. For convenience, in the rest of this paper we refer to examples/instances belonging to the positive and negative classes as positive and negative examples/instances, respectively. In the person record linkage context, a positive example is a pair of records which represent the same person; a negative example is a pair of records which represent two different people.

The cost matrix in Fig. 1 shows the cost of the four possible outcomes of classifying an instance into either positive class P or negative class N. The correct classifications reside on the diagonal of the cost matrix and have a cost of 0, i.e., C(P, P) = C(N, N) = 0. Traditional work on Record Linkage often assigns equal costs to misclassifying a positive instance as negative and misclassifying a negative instance as positive, i.e., C(P, N) = C(N, P). This works perfectly well when the positive class and negative class are of equal interest. Nevertheless, this is rarely true in the real business world. For instance, failure to identify a case of credit fraud incurs a much higher expense than the reverse error.

In an industrial RL setting, most applications are more averse to false positives than false negatives. For instance, the industrial database studied in this paper was driving an application which displayed the history of addresses a person had resided at and jobs they had held, among other things. In this business, a false negative would generally mean that we would fail to list a true address or true job title for a person. However, it is considered far worse to make a false positive error whereby we would report that a person had lived at an address or held a job that actually pertained to someone else. Similar business tradeoffs are described in [28], where a false positive in record linkage on a children's immunization database can lead to a child not being vaccinated against a dangerous disease, whereas a false negative leads to a child receiving a redundant, but harmless, vaccination.

A classifier created for record linkage has the task of classifying a pair of records in a database as a match or unmatch. Academic systems typically target f-measure (the harmonic mean of precision and recall), which weights false positive and false negative errors equally [4]; as discussed previously, however, industrial applications typically consider false positives to be much more expensive than false negatives. Hence industrial systems will frequently seek to maximize recall while ensuring that precision is at least π for some π ∈ [0, 1] [3]. As an example, in one database that we were targeting, there was a requirement of π ≥ 0.985, where π = 0.996 was a typical value. This implies that misclassifying no-match cases (making a false positive error) should be much more expensive than misclassifying match cases (yielding a false negative), which gives rise to this paper's study of applying cost sensitive learning to record linkage problems. To the best of our knowledge, this is the first work applying cost sensitive learning methods to Record Linkage.
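The "maximize recall subject to precision ≥ π" objective can be illustrated with a small threshold-selection sketch. The scores and 0/1 labels below are hypothetical, and the exhaustive scan over candidate thresholds is our own simplification:

```python
def best_threshold(scores, labels, pi):
    """Pick the decision threshold that maximizes recall while keeping
    precision at least pi. labels: 1 for true match, 0 for non-match.
    Returns (threshold, precision, recall), or None if no threshold
    meets the precision floor."""
    total_pos = sum(labels)
    best = None
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(predicted, labels))
        fp = sum(p and not y for p, y in zip(predicted, labels))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= pi and (best is None or recall > best[2]):
            best = (t, precision, recall)
    return best
```

With π = 1.0 and scores [0.9, 0.8, 0.7, 0.6, 0.4] labeled [1, 1, 0, 1, 0], the sketch settles on threshold 0.8: precision 1.0 at the cost of recall 2/3, exactly the precision-first trade-off described above.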

3. FEATURE DESIGN

It is common sense that to build a strong record linkage system, coming up with a proper feature representation is


Figure 2: An example of a pair of person profile records

of critical importance. To accomplish that goal, most existing work concentrates on designing similarity functions that estimate the similarity between corresponding fields of a pair of records. Although this kind of method can efficiently capture some of the semantic similarity between a pair of records despite various levels of distortion in the textual strings, it leaves out some crucial signals. For instance, a match on a rare name should be considered more likely to indicate a duplicate than a match on a common name (e.g., two "Hannah Philomene"s are more likely to be the same person than two "John Smith"s). More generally, our experience is that achieving the highest levels of accuracy requires the design of at least dozens of heterogeneous features, so that the system can attempt to exploit every possible signal that is detectable by the feature designer. Given these requirements, working with a machine-learning algorithm that produces a human-understandable run-time model greatly facilitates feature development and debugging.

An illustration of our feature design starts with the example pair of person profile records shown in Fig. 2. Although the pair differs by only one character across all fields, it can be clearly identified as a non-match by anyone familiar with American naming conventions, since it is apparent that these two individuals are likely father and son. However, this example would be predicted as a match by many record linkage platforms because of the degree of textual similarity. Our first feature is thus name suffix, a categorical feature with three possible values: match if the name suffixes match (i.e., John Smith Jr. vs. John Smith Jr.), differ if the name suffixes differ, and none if the feature does not apply (e.g., if one or both records do not contain a name suffix).
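The name suffix feature just described reduces to a small categorical extractor. This is our own illustration of the three-valued logic in the text; the argument names are hypothetical:

```python
def name_suffix_feature(suffix1, suffix2):
    """Categorical feature over generational name suffixes (Jr., Sr., III, ...):
    'match'  if both records carry the same suffix,
    'differ' if both carry suffixes that disagree,
    'none'   if one or both records lack a suffix."""
    if not suffix1 or not suffix2:
        return "none"
    return "match" if suffix1.lower() == suffix2.lower() else "differ"
```

The "Smith Jr." vs. "Smith Sr." pair of Fig. 2 would yield "differ", letting the model veto the match despite high textual similarity elsewhere.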

The second feature, global name frequency, relates to name frequency: it is a numeric feature characterizing the frequency of the name in the population. Note that if the names do not match, we take the feature value to be positive infinity. This feature is negatively correlated with the record linkage decision, i.e., a large value of global name frequency decreases the likelihood that the two records match.

The third feature is telephone number match (phone match), a categorical feature. US phone numbers can be segmented into three parts. Records matching on different parts, or different combinations of parts, of the phone number should be considered duplicates with varying likelihood. Besides a no match value, where the phone numbers differ in all three parts, Fig. 3 illustrates the other four feature values phone match can take. The advantage of this feature is that, in lieu of determining the cost of variation in the different parts of a phone number either manually [27] or adaptively [7], we push the different match scenarios directly into the ML algorithm as feature values and let the algorithm figure out the appropriate strategy for "ranking" the different match cases.

Figure 3: Feature values of phone match
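A sketch of such a categorical extractor is below. Only the "no match" value is named in the text; the other four category names, and the assumption of 10-digit strings split into area code, prefix, and line number, are ours, standing in for the five values the paper's Figure 3 actually defines:

```python
def phone_match_feature(phone1, phone2):
    """Categorical phone-match feature over 10-digit US numbers.
    Which of the three segments (area code, prefix, line number) agree
    determines the category. Category names below are illustrative
    placeholders for the five values in the paper's Figure 3."""
    p1 = (phone1[:3], phone1[3:6], phone1[6:])  # assumes digits-only strings
    p2 = (phone2[:3], phone2[3:6], phone2[6:])
    matched = [a == b for a, b in zip(p1, p2)]
    if all(matched):
        return "full_match"
    if matched[1] and matched[2]:
        return "prefix_and_line_match"   # e.g., a person who changed area code
    if matched[2]:
        return "line_match"
    if matched[0]:
        return "area_code_match"
    return "no_match"
```

As the text explains, the learner (rather than a hand-tuned cost table) decides how strongly each category should push toward or away from a match.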

A partial list of other features used in this system includes the following:

street address match: returns "true" if street name and house number match.

birthday difference: returns the number of days separating the birthdays.

regional population: returns the population of the region that the two records have in common, or positive infinity if they are not in the same region. (Rationale: two records known to share a New York City address are less likely to match than two records sharing a Topeka, Kansas address.)

name matches: returns the number of matching names. (Rationale: particularly useful for criminal matching, where criminals may match on multiple aliases.)

Table 1: Other Features
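Regional population and global name frequency share a useful convention: positive infinity as a sentinel for "no common value". Decision-tree splits of the form x < θ handle this naturally, since infinity falls on the "no" branch of any finite threshold. A sketch of the convention, with a hypothetical region table (the Seattle figure comes from Table 2; the Topeka value is illustrative):

```python
REGION_POPULATION = {"Seattle, WA": 3_300_000, "Topeka, KS": 127_000}

def regional_population_feature(region1, region2):
    """Population of the shared region, or +inf when the records do not
    share a region. Larger values make a match less likely, and the +inf
    sentinel always answers 'no' to a threshold test x < theta."""
    if region1 == region2 and region1 in REGION_POPULATION:
        return float(REGION_POPULATION[region1])
    return float("inf")
```

This is why the model never needs a separate "regions differ" flag: the infinite value already lands in the least-match-like branch of every population split.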

As can be seen from these feature examples, the construction of a high-precision, industrial-strength record linkage system requires a lengthy list of complex features which are the result of extensive feature engineering. It is thus important that the machine learning algorithm support the feature development process.

4. ALGORITHMS

4.1 Alternating Decision Trees (ADTrees)

The ADTree algorithm [16] combines the decision tree [9] and boosting [17] algorithms in a distinctive way. There are two kinds of nodes in an ADTree. One is the splitter node, which specifies the condition to be tested for an instance. The other is the prediction node, which assigns a real-valued score to instances satisfying the conditions at different points. An ADTree can be split multiple times at any point, i.e., more than one splitter node can be attached to any single prediction node. This differs from a decision tree since (1) generally, a decision tree can only be split


once at each point, and (2) during tree generation the split can only be performed at the bottom of the tree, i.e., at the prediction leaves. Upon determining the class label of an instance, the ADTree sums the scores of the prediction nodes along all paths whose splitter-node conditions are all satisfied by the instance. The sign of this sum is the class label of the instance. Note that a conventional decision tree also decides the class label of an instance by following a path through the tree hierarchy. The difference is that there is just one such path, and only the prediction leaf at the bottom determines the class label of the instance.

Figure 4: A tiny ADTree for person linkage. (The original diagram is reconstructed here as an indented outline; each splitter node lists its "y"/"n" prediction scores.)

root prediction node: 0.140
  1: street_address_match = none?   y: -0.291   n: +2.313
     under the "y" branch:
       2: regional_population < 5.000E19?   y: +0.399   n: -0.818
          under the "y" branch:
            4: global_name_frequency < 45.708?   y: +0.359   n: -0.667
            6: regional_population < 75917.5?    y: +0.589   n: -0.206
  3: birthday_difference < 417.5?   y: +1.325   n: -0.233
  5: name_matches < 3.5?            y: -0.101   n: +1.876

A toy example of a person-linking ADTree, shown in Fig. 4, was built from a database of over 50,000 labeled examples consisting of a mixture of criminal and non-criminal person records. Five of the features described in Section 3 are illustrated in this model. Let us suppose the model is considering the pair of records shown in Table 2.

For this example, we first note that the root prediction node starts us off with a score of 0.140. The relatively small positive score indicates a slight predominance of positive ("match") examples in the training corpus. We then see that Feature 1, street address match, returns "none" (yes: -0.291), so we test Feature 2, which returns 3.3 million, since that is the population of the Seattle Metropolitan Area. Since Feature 2 answered yes (+0.399), we now query Features 4 and 6. Feature 4 returns 1000, the frequency of "Robert Jones", and thus answers no (-0.667), while Feature 6 tests the regional population feature again and decides no (-0.206) this time, since the population is not less than 75,917.5. Feature 3 returns 14, which is less than 417.5, giving a "yes" decision (+1.325). Feature 5 returns 1, since there is only one matching name, so the decision is also "yes" (-0.101).

Summing these values, the class label is determined by calculating 0.140 - 0.291 + 0.399 - 0.667 - 0.206 + 1.325 - 0.101 = 0.599, i.e., a relatively low-confidence match. Note also that we can determine the amount of "blame" to assign to each feature by summing the prediction nodes of each feature. Consequently, we can see that regional population contributes 0.399 - 0.206 = 0.193 to the final score, which is very helpful in feature engineering.

Finally, note the following two benefits of ADTrees. Firstly, if Feature 1 had returned "no" because street address match did not return "none", we would not have tested Features 2, 4, and 6, thus reaping a major gain in run-time performance. Secondly, note how the ADTree seamlessly mixed real-valued features with a boolean feature (street address match), a property which simplifies development and facilitates the incorporation of heterogeneous features in a record linkage model.
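The walkthrough above can be reproduced with a small evaluator over the toy tree of Fig. 4. This is our own re-encoding of the diagram (scores as printed in the figure); it returns the per-feature blame directly, and the decision is the sign of the summed score:

```python
def evaluate_adtree(x):
    """Score the toy person-linking ADTree of Fig. 4 on feature dict x.
    Returns a per-feature 'blame' dict; the final score is its sum."""
    score = {"root": 0.140}
    f1 = x["street_address_match"] == "none"
    score["street_address_match"] = -0.291 if f1 else 2.313
    if f1:  # features 2, 4, 6 hang under feature 1's "yes" branch
        f2 = x["regional_population"] < 5.000e19
        score["regional_population"] = 0.399 if f2 else -0.818
        if f2:  # features 4 and 6 hang under feature 2's "yes" branch
            score["global_name_frequency"] = (
                0.359 if x["global_name_frequency"] < 45.708 else -0.667)
            score["regional_population"] += (
                0.589 if x["regional_population"] < 75917.5 else -0.206)
    score["birthday_difference"] = 1.325 if x["birthday_difference"] < 417.5 else -0.233
    score["name_matches"] = -0.101 if x["name_matches"] < 3.5 else 1.876
    return score

# The Robert Jones pair from Table 2:
pair = {"street_address_match": "none", "regional_population": 3_300_000,
        "global_name_frequency": 1000, "birthday_difference": 14,
        "name_matches": 1}
blame = evaluate_adtree(pair)
total = sum(blame.values())  # 0.599: a low-confidence match
```

Note how the blame dict recovers the 0.193 contribution of regional population (0.399 - 0.206) with no extra machinery: feature blame assignment is just grouping the summands.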

As a second example, consider the name suffix feature described in the previous section. Due to the importance of this feature, the ADTree generally puts it somewhere near the root node. The example in Fig. 9 illustrates this: "name suffix = differ" is a child of the root node and is the eighth feature chosen. If the name suffix differs, we decrease the score (making a match less likely) by 2.89 and don't query any additional features. On the other hand, if the suffixes don't differ, the score is essentially unaffected (adding 0.039 is trivial in the context of the other, much larger values) and we query many other features. Note that one additional feature is whether "name suffix = match", which gets a strong positive value of 1.032 if true.

4.2 Cost Sensitive Alternating Decision Tree

In this section, we formulate a novel implementation of the ADTree equipped with cost sensitive learning, called Cost Sensitive Alternating Decision Tree (CS-ADTree). To the best of our knowledge, this is the first adaptation of ADTree to a cost sensitive learning framework, and also the first attempt at applying a cost sensitive ML technique to the problem of record linkage.

Cost sensitive learning additionally assigns a cost factor c_i ∈ (0, ∞) to each example (x_i, y_i) in the database to quantify the cost of misclassifying x_i into a class label other than y_i. In the standard boosting algorithm, the weight distribution over the data space is revised in an iterative manner to reduce the total error rate. Cost sensitive learning operates in a similar manner, but over a space of examples in which the weight distribution has been updated in a manner biased towards examples with higher costs. According to the "translation theorem" [35], the classifier generated in this way will be conceptually the same as one that explicitly seeks to reduce the accumulated misclassification cost over the data space. We will examine three different methods of biasing the training data to account for the business preference between false negatives and false positives. The baseline weight update rule of ADTree is [16]

    w_i^(t+1) ← w_i^(t) · exp(−r_t(x_i) · y_i)

In the method that we focus on in this paper, AdaC2 [33], the weight update rule for ADTree is modified as

    w_i^(t+1) ← c(i) · w_i^(t) · exp(−r_t(x_i) · y_i)

where c(i) = c+ · I(y_i = +1) + c− · I(y_i = −1). Note that c+ and c− are the misclassification costs associated with positive and negative class examples, respectively, and I(•) = 1 when condition • is met and 0 otherwise.
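The AdaC2 update can be sketched in a few lines (a minimal sketch; the toy weights, rule scores, and cost values below are assumed, not taken from the deployed system):

```python
import math

def adac2_update(weights, labels, rule_scores, c_plus, c_minus):
    """One AdaC2-style weight update: w_i <- c(i) * w_i * exp(-r_t(x_i) * y_i),
    with c(i) = c_plus for positive examples (y = +1) and c_minus for
    negatives (y = -1). A misclassified example (r_t(x_i) and y_i of
    opposite sign) gets an exponential weight boost, and the extra
    factor c(i) biases the learner toward the costlier class."""
    return [
        (c_plus if y == +1 else c_minus) * w * math.exp(-r * y)
        for w, y, r in zip(weights, labels, rule_scores)
    ]
```

With c_minus > c_plus (false positives costlier, as in our setting), a misclassified negative example gains weight faster than a misclassified positive with the same margin, so later rounds concentrate on eliminating false positives.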

The intuition here is straightforward: we want the weights of the examples with higher costs to increase faster than those with lower costs. According to Lemma 1 (omitted


              Record 1         Record 2          Comment
Name          Robert Jones     Robert Jones      Name frequency = 1000
Birthday      3/1/1966         3/15/1966         Birthday difference = 14
Address       121 Walnut St.   234 Chestnut St.
City/State    Seattle, WA      Seattle, WA       Population of Seattle region = c. 3.3 million

Table 2: An example comparison of two records

Algorithm 1: Cost Sensitive Alternating Decision Tree

Inputs:
1: database S = {(x_1, y_1, c_1), ..., (x_n, y_n, c_n)}, where x_i ∈ X, y_i ∈ {−1, +1}, and c(i) = c+ · I(y_i = +1) + c− · I(y_i = −1).
2: weights W = {w_1^(0), ..., w_n^(0)}, where w_i^(0) = 1.  /* uniform distribution initially */
3: D ← {all possible conditions}.
4: N ← number of iterative rounds.
5: η ← smoothing factor.

Procedure:
6: r_0 ← T with score (1/2) · ln[(c+ · W+(T) + η) / (c− · W−(T) + η)]  /* T: precondition that is true for all examples */
7: P_0 ← {r_0}  /* precondition set */
8: for t = 1 to N do
9:   (d_1, d_2) = argmin over d_1 ∈ P_t, d_2 ∈ D of Z, where
       Z = 2 · (sqrt(W+(d_1 ∧ d_2) · W−(d_1 ∧ d_2)) + sqrt(W+(d_1 ∧ ¬d_2) · W−(d_1 ∧ ¬d_2))) + W(¬d_1)
10:  α_1 = (1/2) · ln[(c+ · W+(d_1 ∧ d_2) + η) / (c− · W−(d_1 ∧ d_2) + η)]  /* α_1 is the score associated with d_1 ∧ d_2 */
11:  α_2 = (1/2) · ln[(c+ · W+(d_1 ∧ ¬d_2) + η) / (c− · W−(d_1 ∧ ¬d_2) + η)]  /* α_2 is the score associated with d_1 ∧ ¬d_2 */
12:  r_t ← (d_1 ∧ d_2) ⊕ (d_1 ∧ ¬d_2)  /* r_t(x) is a new splitter node with two associated prediction nodes */
13:  P_(t+1) ← {P_t, r_t}
14:  w_i^(t+1) ← c(i) · w_i^(t) · exp(−r_t(x_i) · y_i)  /* update example weights */
15: end for
16: return classifier for unseen unlabeled instances: H(x) = sgn(Σ_{t=0..N} r_t(x))
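The core quantities of Algorithm 1 (the class-conditional weight sums W+/W−, the split-selection criterion Z of line 9, and the smoothed cost-sensitive scores α of lines 10 and 11) can be sketched directly. This is a simplified illustration over boolean masks, not the production implementation:

```python
import math

def weight_sum(weights, labels, mask, sign=None):
    """W(mask); restricted to one class (W+ or W-) when sign is +1 or -1."""
    return sum(w for w, y, m in zip(weights, labels, mask)
               if m and (sign is None or y == sign))

def z_criterion(weights, labels, pre, cond):
    """Z from line 9 of Algorithm 1, for precondition d1=pre and
    candidate condition d2=cond (both given as per-example booleans)."""
    both = [p and c for p, c in zip(pre, cond)]          # d1 AND d2
    neg = [p and not c for p, c in zip(pre, cond)]       # d1 AND NOT d2
    out = [not p for p in pre]                           # NOT d1
    W = lambda m, s=None: weight_sum(weights, labels, m, s)
    return 2 * (math.sqrt(W(both, +1) * W(both, -1))
                + math.sqrt(W(neg, +1) * W(neg, -1))) + W(out)

def alpha(weights, labels, mask, c_plus, c_minus, eta):
    """Cost-sensitive smoothed prediction score (lines 10-11)."""
    W = lambda s: weight_sum(weights, labels, mask, s)
    return 0.5 * math.log((c_plus * W(+1) + eta) / (c_minus * W(-1) + eta))
```

A condition that cleanly separates the classes drives Z to zero, which is why argmin Z picks it; raising c_minus relative to c_plus pushes all α scores downward, making the tree more reluctant to declare a match.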

for space reasons), the learning focus of the ADTree will be biased towards examples with higher costs. Given the modified weight update rule, the original equation for calculating the scores α_1 and α_2 at each iterative round no longer guarantees the fastest decrease in the error rate. Inspired by the derivation of the optimized α for AdaC2, α_1 and α_2 can be modified to

    α_1 = (1/2) · ln[(c+ · W+(d_1 ∧ d_2)) / (c− · W−(d_1 ∧ d_2))]
    α_2 = (1/2) · ln[(c+ · W+(d_1 ∧ ¬d_2)) / (c− · W−(d_1 ∧ ¬d_2))]

where d_2 is the condition set by splitter node r_t, and d_1 is the condition that an example has to meet in order to reach r_t. The splitter node r_t can also be interpreted as a rule: an example x arriving at r_t is assigned score α_1 if it satisfies both d_1 and d_2, i.e., d_1 ∧ d_2, and α_2 if it satisfies d_1 but not d_2, i.e., d_1 ∧ ¬d_2.

Now that we have modified the weight update rule and the equations for calculating the prediction scores α_1 and α_2, we can formulate CS-ADTree as Algorithm 1.

The cost factor c(i) can also be placed at other spots in the weight update equation. For instance, it can be put inside the exponential term exp(−r_t(x_i) · y_i). When example x_i is misclassified by r_t, we have −r_t(x_i) · y_i > 0; a high cost inside the exponential term thus increases the weight of the example exponentially. This is the idea of AdaC1 [33]. Its weight update rule and prediction score equations can be adapted for ADTree as

    w_i^(t+1) ← w_i^(t) · exp(−c(i) · r_t(x_i) · y_i)

    α_1 = (1/2) · ln[(W(d_1 ∧ d_2) + c+ · W+(d_1 ∧ d_2) − c− · W−(d_1 ∧ d_2)) /
                     (W(d_1 ∧ d_2) − c+ · W+(d_1 ∧ d_2) + c− · W−(d_1 ∧ d_2))]

    α_2 = (1/2) · ln[(W(d_1 ∧ ¬d_2) + c+ · W+(d_1 ∧ ¬d_2) − c− · W−(d_1 ∧ ¬d_2)) /
                     (W(d_1 ∧ ¬d_2) − c+ · W+(d_1 ∧ ¬d_2) + c− · W−(d_1 ∧ ¬d_2))]

The cost factor can also be put both inside and outside the exponential term in the weight update equation, which gives rise to AdaC3 [33]. Its weight update rule and prediction score equations can be adapted for ADTree as

w

t+1i c(i) · wt

i · e�c(i)·rt(xi)yi

↵1 =

12⇥

lnc+·W+(d1\d2)+c�·W�(d1\d2)+c2+·W+(d1\d2)�c2�·W�(d1\d2)

c+·W+(d1\d2)+c�·W�(d1\d2)�c2+·W+(d1\d2)+c2�·W�(d1\d2)

↵2 =

12⇥

lnc+·W+(d1\¬d2)+c�·W�(d1\¬d2)+c2+·W+(d1\¬d2)�c2�·W�(d1\¬d2)

c+·W+(d1\¬d2)+c�·W�(d1\¬d2)�c2+·W+(d1\¬d2)+c2�·W�(d1\¬d2)

Despite the fact that cost sensitive learning can mani-fest itself in di↵erent forms in boosting, we chose to createCS-ADTree by integrating AdaC2 into ADTree’s trainingprocedure since the weight updating rule of AdaC2 weighseach example by its associated cost item directly [33], whichnaturally fits the algorithm into the realm of the transla-tion theorem. On the other hand, AdaC1 and AdaC3 bothhave cost item placed inside exponential term which is lackof direct connection to the weight update. In addition, ourpreliminary empirical studies showed that AdaC2 performedthe best across a number of benchmarks. That being said,we would like to further explore the performance of ADTreewith the cost sensitive learning frameworks of AdaC1 andAdaC3 in future work.
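The three placements of the cost factor are easiest to compare side by side. A minimal sketch (our own function names, not the authors' code), where r is the score r_t(x_i) of the current splitter and y is the label in {−1, +1}:

```python
import math

def adac1_update(w, y, c, r):
    """AdaC1: cost inside the exponential term only."""
    return w * math.exp(-c * r * y)

def adac2_update(w, y, c, r):
    """AdaC2 (used by CS-ADTree): cost outside the exponential,
    reweighing the example directly."""
    return c * w * math.exp(-r * y)

def adac3_update(w, y, c, r):
    """AdaC3: cost both inside and outside the exponential term."""
    return c * w * math.exp(-c * r * y)
```

Only AdaC2 multiplies the weight by c(i) linearly regardless of the margin, which is what makes it compatible with the translation theorem mentioned above.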

5. EVALUATION PARAMETERS

The metrics most often used to evaluate a classifier's performance on the record linkage problem are precision, recall, and their harmonic mean, f-measure: metrics obtained by fixing the classifier's decision threshold at zero. The confusion matrix shown in Fig. 5 helps capture exactly what these metrics stand for. At first glance, a confusion matrix is very similar to a cost matrix; the difference is that each cell of a confusion matrix holds the number of instances satisfying the scenario specified by its row and column indexes. Thus TP (True Positive) and TN (True Negative) are the numbers of correctly classified positive and negative instances, respectively, while FP (False Positive) and FN (False Negative) are the numbers of instances falsely classified into the positive and negative classes, respectively. Considering a matched pair of records as a positive example, and an unmatched pair of records as a negative example,

Page 6: The Case for Cost-Sensitive and Easy-To-Interpret Models in Industrial Record Linkagevitor/papers/vldb2011.pdf · 2011. 10. 30. · Record Linkage (RL) is the task of identifying

Figure 5: A confusion matrix for binary classification

precision, recall, and f-measure for record linkage are defined as

precision = TP / (TP + FP)

recall = TP / (TP + FN)

f-measure = (2 · precision · recall) / (precision + recall)
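These definitions translate directly into code; below is a small helper of our own (the name `prf` is hypothetical) that also guards against empty denominators:

```python
def prf(tp, fp, fn):
    """Precision, recall, and f-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```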

From the definitions, precision measures the accuracy of a classifier's predictions of positive instances, while recall measures the completeness of a classifier's coverage of the real positive instances. There is clearly a trade-off between precision and recall: if all instances are frivolously predicted as positive, recall is maximized at 1, whereas precision drops to the proportion of positive examples in the database, which is typically very low. To this end, f-measure is frequently used to measure the overall performance of a classification algorithm when the relative importance of recall and precision is evenly balanced.

Customization of evaluation metrics. Good performance at a single threshold such as zero may lead to unwanted subjectivity. Recent studies show that the performance of a classifier should be evaluated over a range of thresholds [19]. An appealing property of ADTree is that instead of merely outputting a hard class label, it gives a score value, Σ_{t=0}^{N} r_t(x), estimating the confidence level of its decision. While ADTree generally uses the sign function to classify unlabeled instances, other numerical values can also serve as the threshold. In this case, the classifier of ADTree can be represented as H(x) = sgn( Σ_{t=0}^{N} r_t(x) − d ), where d ≠ 0. Since the business priority for the industrial database on which we were working was to keep precision above a threshold in excess of 99%, precision in a high range was of much more interest to us, and we thus focused on metrics other than f-measure. In this study, we start by setting the threshold d large enough to make precision = 1, and then tune the threshold down to progressively decrease precision to 0.985 in steps of 0.001, determining the recall we are able to achieve at varying high precision levels.

Choice of cost ratio. Some business applications do have strategies for deciding the costs of different examples. For instance, the amount of money involved in a transaction can be used to quantify the cost related to it. Nevertheless, for many other applications, such as record linkage, the only prior knowledge available is the biased interest towards one class over the other. In industrial applications, this is usually expressed as a requirement that the precision be at least π, where in our applications we generally had π ≥ 0.985, and a more typical requirement for π was in excess of 0.995. Note that two parameters influence the precision of the classifier: the cost factor C and the required precision π. We have found a reasonable heuristic to be adjusting the cost factor C so that the threshold d that yields the desired precision π is close to 0; this generally yields close to optimal recall at π. We give an informal theoretical justification for this heuristic in Section 6.1.
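The threshold sweep described above can be sketched as follows. This is our own illustration of the procedure (a ranking-based sweep over scored examples), not the paper's implementation:

```python
def recall_at_precision_levels(scores, labels, levels):
    """Sweep the decision threshold d from high to low and report, for each
    target precision level, the best recall reachable while precision stays
    at or above that level. Labels use +1 for match, -1 for unmatch."""
    ranked = sorted(zip(scores, labels), reverse=True)  # descending by score
    n_pos = sum(1 for _, y in ranked if y > 0)
    result = {}
    for target in levels:
        tp = fp = 0
        best_recall = 0.0
        for _, y in ranked:  # lowering d admits one more example each step
            if y > 0:
                tp += 1
            else:
                fp += 1
            if tp / (tp + fp) >= target:
                best_recall = tp / n_pos
        result[target] = best_recall
    return result
```

Lowering d one example at a time is equivalent to sweeping d continuously, since precision and recall only change at the example scores themselves.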

6. EXPERIMENT

Objectives. In this section, we present experimental results for ADTree and CS-ADTree on comparing records of person profile databases. The objectives are to: 1. demonstrate the effectiveness of ADTree and CS-ADTree at classifying pairs of records as match/unmatch for record linkage; 2. illustrate how the run-time representation of ADTree enables human interpretation of the derived classifier; 3. demonstrate the competitiveness of ADTree/CS-ADTree with alternative ML techniques heavily used by existing record linkage frameworks; and 4. show how CS-ADTree achieves superior performance to ADTree at very high precision requirements.

Implementation. In this study, the ADTree, CS-ADTree, boosted decision tree, and decision tree algorithms are all implemented on the JBoost platform [2]. The SVM algorithm is implemented with SVM-Light [22].

6.1 Performance of ADTree/CS-ADTree

In our initial experiments, we used a database S, with match/unmatch labels assigned by expert internal annotators, containing 20,541 pairs of person profile records characterized by more than 42 features, including those described in Section 3. Since the cost of hiring experts is relatively high, we later switched to Amazon's Mechanical Turk (MT) system, which provides a significantly cheaper and faster way to collect label data from a broad base of non-expert contributors over the web [32]. A drawback of this method is that the database is likely to become noisy, since some workers on MT ("Turkers") are pure spammers and others lack the necessary domain knowledge for the task.

We report average results over a 10-fold cross validation. For these initial experiments, we held the number of boosting iterations T fixed at 200 and used a cost factor of C = 4, based on earlier experiments showing these to be optimal values over the precision range [0.985, 1].

With T and C determined, we studied the record linkage performance of CS-ADTree, ADTree, and CS-ADTree under the cost-sensitive learning frameworks of AdaC1 and AdaC3 on our initial person profile database S. Their average recalls are given in Fig. 6, which clearly shows that CS-ADTree performs the best of all methods under comparison. Average recall can sometimes be misleading when the P-R curves of two methods cross in P-R space [19]; in that case, it would be nonsensical to conclude that one method is better than the other based on the area under the P-R curve, which is conceptually equal to the averaged recall. To that end, we also plot the P-R curves of all methods under comparison in Fig. 7. One can clearly see that CS-ADTree consistently dominates the other methods in terms of the P-R curve, which confirms its superiority.

One perspective on how CS-ADTree is able to achieve this


Figure 6: Comparison of average recalls for ADTree/CS-ADTree

Figure 7: P-R curves of ADTree/CS-ADTree

superior performance can be seen in Fig. 8. CS-ADTree achieves a given level of precision at a much lower threshold than the other three methods. It is particularly instructive to compare CS-ADTree to ADTree: CS-ADTree achieves 0.995 precision at a threshold of roughly 0, while ADTree achieves it at about 4. An intuition for CS-ADTree's superiority is that the model seeks to push positive examples to scores above 0 and negative examples to scores below 0. With CS-ADTree, we are able to put our threshold, d, at about the same point as the model's natural threshold, whereas with ADTree the model is not "aware" that the threshold of interest to us is 4.

To demonstrate how the run-time representation of CS-ADTree

Figure 8: Threshold range of ADTree/CS-ADTree w.r.t. the high precision range

can help a practitioner better understand the record linkage problem of his or her database in a human-interpretable way, we present two partial snapshots of a CS-ADTree generated from our initial labeled database S in Fig. 9 and Fig. 10. In Fig. 9, one can clearly see that the decision node for name_suffix resides right below the root node. When name_suffix = differ, the corresponding path ends immediately and assigns a large negative score to the instance under consideration; otherwise, CS-ADTree goes on to check many other features. This is exactly as predicted by our discussion of the name_suffix feature in Section 3. Note also that this tree can be evaluated very efficiently if the features are lazily computed. If the name suffix differs, there is no need to compute the values for features 14–18, which all have name_suffix ≠ "differ" as a precondition (in fact, 87 out of the 100 features in this sample CS-ADTree have this precondition). More hierarchical levels are involved in the example in Fig. 10. If the value of the feature birthday_difference for a pair of records is relatively small (the birthdays are less than 432.5 days apart), CS-ADTree terminates the corresponding path after examining only birthday_difference. This is intuitive because having nearly matching birthdays is a strong "match" indicator; we don't need further questions to reach a decision. Otherwise, it asks whether birthday_difference is greater than 5×10^19, i.e., infinity, which is how we indicate a null value (a birthday isn't present on one or the other record). In that case, CS-ADTree continues to check a number of other features to determine the match/unmatch status of the pair of records. So in both cases, the derived model is easy to understand and can be seen to be doing something reasonable.

Efficiency. Our experiments also show that ADTree and CS-ADTree are efficient at training and run time. Given 50 candidate features, a training set of 11,309 example pairs, and pre-computed feature vectors, we trained the system on JBoost using CS-ADTree for 200 iterations in 314 seconds. On a major run of the system using a Python implementation and a 100-node model, the model performed 180 record pair comparisons per second per processor, including both the feature computations and the evaluation of the CS-ADTree. All statistics were measured on an Intel Xeon 2.53 GHz processor with 60 GB of RAM.
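The lazy evaluation idea described above can be sketched as follows. The node tuple layout and the feature accessor are our own hypothetical simplification of an ADTree run-time, not the production code:

```python
def evaluate_adtree(record_pair, nodes, feature_funcs):
    """Score a record pair against a list of splitter nodes, computing each
    feature lazily, at most once, and only when the node's precondition holds.
    Each node is (precondition, feature_name, predicate, score_yes, score_no)."""
    cache = {}

    def feature(name):
        if name not in cache:  # lazy: compute each feature at most once
            cache[name] = feature_funcs[name](record_pair)
        return cache[name]

    total = 0.0
    for precondition, name, predicate, score_yes, score_no in nodes:
        if precondition(feature):  # unmet precondition: feature never computed
            total += score_yes if predicate(feature(name)) else score_no
    return total
```

With a guard like `precondition = lambda f: f("name_suffix") != "differ"` on most nodes, a differing suffix short-circuits the bulk of the feature computations, mirroring the 87-of-100 preconditioned features in the sample tree.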

6.2 Performance of ADTree/CS-ADTree with active learning

The initial database S is augmented by picking records from a data pool of size 8 × 10^6, which are then presented to Turkers for labeling. Denoting the pairwise classifier trained on S by H, the records are chosen according to the three active learning approaches listed below:

• The first approach serves as our baseline active learning model, which is simply to randomly choose records r from the data pool such that H(r) > −3. We choose −3 as the threshold for picking records based on the empirical observation that examples in S given a prediction score greater than −3 by H exhibit an appropriate proportion of positive and negative examples; i.e., negative examples shouldn't be excessively outnumbered by positive ones, to keep Turkers from simply clicking "match" all the way through.

• The second approach is based on the randomized parameter approach described in [30]. Instead of always splitting on the feature with minimum loss Z, we enable CS-ADTree

Figure 9: Partial CS-ADTree. If name_suffix = differ, we get a strong negative score and don't query additional features.

Figure 10: Partial CS-ADTree. Note how we assign different scores to different birthday distances and query additional features if birthday_difference is unknown (infinite).

to randomly choose, at each split, one of the 3 features with the minimum losses across all features. A committee of 5 CS-ADTrees trained on S is then created as {H0, H1, H2, H3, H4}; the records r from the data pool with the minimal absolute committee score, i.e., |Σ_{i=0}^{4} H_i(r)|, are picked for class labeling.

• The last approach is straightforward. We choose records r from the data pool with the minimal absolute prediction score |H(r)|, since a small margin around the natural threshold 0 signifies H's uncertainty in classifying the particular instance. This approach resonates with our argument that a model yielding our desired precision π at a threshold close to 0 will have close to optimal recall.
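The margin-based selection in the last approach can be sketched in a few lines; `pick_uncertain` is a hypothetical helper of ours, not the production sampler:

```python
def pick_uncertain(pool, classifier, k):
    """Uncertainty sampling: return the k pool records whose prediction
    score H(r) has the smallest absolute value (smallest margin around 0)."""
    return sorted(pool, key=lambda r: abs(classifier(r)))[:k]
```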

We pick 1000 pairs of records from the data pool using each of the three active learning approaches, yielding 3000 records in total. We also pick another 5000 pairs of records using active learning approach 1 to serve as an independent data set E for evaluating the learning models. The P-R curves of CS-ADTree trained on S plus the records picked by each of the three active learning approaches are given in Fig. 11, which clearly shows that active learning approaches 2 and 3 perform better than the baseline approach.

Expanded by the 3000 pairs of records picked by the active learning approaches, S now becomes noisier, since it is unavoidable that the class labels assigned by Turkers will be partially erroneous. Using ADTree and CS-ADTree to derive classifiers from S, Fig. 12 shows the P-R curves of applying these classifiers to E, which clearly indicates that CS-ADTree exhibits improved performance compared to ADTree for record

Figure 11: P-R curves of the 3 active learning approaches

linkage of a noisy database.

In order to demonstrate the competitiveness of ADTree/CS-ADTree against other popular ML techniques used by record linkage designers, we also apply a decision tree (DT), boosted DT (T = 200), and SVMs to classify the pairs of records in E. The kernel function selected for the SVMs is the Gaussian kernel with γ = 0.1. Our empirical study found that SVMs cannot efficiently handle the original feature representation, i.e., a mixture of categorical and numerical features. We thus apply two strategies to transform the features to suit the SVMs. Categorical features are mapped to discrete integers under both strategies. For numerical features, strategy 1 is uniform bucketing, which evenly divides a feature's data range into 5 buckets and transforms the feature value into


Figure 12: P-R curves of ADTree and CS-ADTree on E

Figure 13: Histogram of f-measures for ADTree, CS-ADTree, and alternative ML techniques on E

the integer index of the bucket it falls into. Strategy 2 is informative bucketing, which makes use of the information provided by the CS-ADTree structure after training: if numerical feature f is split on values p0 < p1 < ... < pn during the training of CS-ADTree, any value v of f is rewritten into the integer i such that pi < v < pi+1. Using 0 as the threshold for ADTree/CS-ADTree, Fig. 13 shows the f-measures of ADTree, CS-ADTree, boosted decision tree, decision tree, and the SVMs under the two bucketing strategies. ADTree and CS-ADTree clearly both perform better than the alternative ML techniques for record linkage on E.
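Informative bucketing can be sketched as follows, assuming the split points collected from the trained CS-ADTree are available as a list (the helper name is ours; for values outside [p0, pn], the convention is extended so that v < p0 maps to bucket 0 and v > pn to bucket n+1):

```python
import bisect

def informative_bucket(value, split_points):
    """Map a numeric feature value to the index of the interval delimited
    by the split thresholds p0 < p1 < ... < pn the tree actually used.
    bisect_right returns the count of split points <= value."""
    return bisect.bisect_right(sorted(split_points), value)
```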

7. RELATED WORK

Since record linkage is a well-studied topic, we direct interested readers to [15] and [18] for comprehensive surveys. Due to the importance of feature representation, similarity function design is at the core of many record linkage studies [23]. The most straightforward measure is the Levenshtein distance [24], which counts the number of insert, remove, and replace operations needed to map string A into string B. Considering the unbalanced cost of applying different operations in practice, [27] modifies the definition of edit distance to explicitly allow cost customization by designers. In recent years, similarity function design has increasingly focused on adaptive methods. [29] and [7] proposed similar stochastic models to learn the cost factors of the different edit distance operations. Rooted in the spirit of fuzzy matching, [11] considers the text string at the tuple level and proposes a probabilistic model to retrieve the K nearest tuples w.r.t. an input tuple received in streamed format. Exploiting the similarity relation hidden under a big umbrella of linked pairs, [5] iteratively extracts useful information from the pairs to progressively refine a set of similarity functions. [10] introduces similarity functions from probabilistic information retrieval and empirically studies their accuracy for record linkage.

However, whatever ingenious methods may be used for similarity functions, it is necessary to integrate all of these field-level similarity judgments into an overall match/no-match decision, and various learning methods have been proposed for this task. [7] proposed stacked SVMs to learn to classify pairs of records as match/unmatch, in which the second layer of SVMs is trained on a vector of similarity values output by the first layer. [25] considers the records in a database as nodes of a graph and applies a clustering approach to divide the graph into an adaptively determined number of subsets in which inconsistencies among paired records are expected to be minimized. [31] instead considers the fields of records as nodes of a graph: matched records excite links connecting corresponding fields, which can be used to facilitate other record comparisons. A well-performing pairwise classifier depends on the representativeness of the record pairs selected for training, which calls for an active learning approach to efficiently pick informative record pairs from a data pool. [30] described several committee-based active learning approaches for record linkage. Considering the efficiency concern of applying an active learning model to a quadratically large data pool, [3] proposed a scalable active learning method integrated with blocking to alleviate this dilemma.

8. CONCLUSION

In this paper, we proposed to study record linkage of databases with ADTree. Considering that an essential problem of record linkage is that the business costs of misclassifying matched and unmatched pairs are extremely biased, we further proposed CS-ADTree, which assigns a higher misclassification cost to matched pairs in the process of training an ADTree. Experiments show that CS-ADTree and ADTree perform extremely well on a clean database, and that CS-ADTree exhibits superior performance on a noisy database compared with alternative ML techniques. We also demonstrated how the run-time representation of ADTree/CS-ADTree can facilitate human understanding of the knowledge learned by the classifier and yield a compact and efficient run-time classifier.

9. REFERENCES[1] CiteSeerX. http://citeseerx.ist.psu.edu/.[2] JBoost Project. http://jboost.sourceforge.net.[3] A. Arasu, M. Gotz, and R. Kaushik. On active

learning of record matching packages. In SIGMOD,pages 783–794, 2010.

[4] J. Artiles, A. Borthwick, a. S. S. J. Gonzalo, andE. Amigo. Weps-3 evaluation campaign: Overview ofthe web people search clustering and attributeextraction tasks. In CLEF-2010.

[5] I. Bhattacharya and L. Getoor. Iterative recordlinkage for cleaning and integration. In Proc. ACMSIGMOD workshop on Research issues in data miningand knowledge discover, pages 11–18, 2004.

[6] M. Bilenko and S. Basu. Adaptive productnormalization: Using online learning for record linkagein comparison shopping. In ICDM, pages 58–65, 2005.

[7] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.

[8] A. Borthwick, M. Buechi, and A. Goldberg. Key concepts in the ChoiceMaker 2 record matching system. In ACM KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.

[9] L. Breiman, J. Friedman, and C. J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, 1984.

[10] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi, and D. Srivastava. Benchmarking declarative approximate selection predicates. In SIGMOD, pages 353–364, 2007.

[11] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, pages 313–324, 2003.

[12] P. Christen and T. Churches. Febrl - Freely Extensible Biomedical Record Linkage. Australian National University, 0.4 edition.

[13] X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, pages 85–96, 2005.

[14] W. Eckerson. Data warehousing special report: Data quality and the bottom line. Application Development Trends, 2002.

[15] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 19(1):1–16, 2007.

[16] Y. Freund and L. Mason. The alternating decision tree algorithm. In ICML, pages 124–133, 1999.

[17] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 55(1):119–139, 1997.

[18] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical report, CSIRO Mathematical and Information Sciences, 2003.

[19] D. J. Hand. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1):103–123, 2009.

[20] H. He and E. A. Garcia. Learning from imbalanced data. IEEE TKDE, 21(9):1263–1284, 2009.

[21] M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. In KDD, pages 9–37, 1998.

[22] T. Joachims. Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.

[23] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. In SIGMOD, pages 802–803, 2006.

[24] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.

[25] A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD, pages 169–178, 2000.

[26] A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23–29, 1997.

[27] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Molecular Biology, 48(3):443–453, 1970.

[28] V. Papadouka, P. Schaeffer, A. Metroka, A. Borthwick, et al. Integrating the New York Citywide Immunization Registry and the Childhood Blood Lead Registry. J. Public Health Management and Practice, 10:S72, 2004.

[29] E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(5):522–532, 1998.

[30] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269–278, 2002.

[31] P. Singla and P. Domingos. Multi-relational record linkage. In Proc. ACM KDD Workshop on Multi-Relational Data Mining, pages 31–48, 2004.

[32] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast - but is it good?: evaluating non-expert annotations for natural language tasks. In EMNLP, pages 254–263, 2008.

[33] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, 2007.

[34] W. E. Winkler. Overview of record linkage and current research directions. Technical report, Bureau of the Census, 2006.

[35] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, pages 435–442, 2003.

[36] H. Zhao and S. Ram. Entity identification for heterogeneous database integration: a multiple classifier system approach and empirical evaluation. Information Systems, 30(2):119–132, 2005.