Mining Reference Tables for Automatic Text Segmentation Eugene Agichtein Columbia University Venkatesh Ganti Microsoft Research

Page 1: Mining Reference Tables for Automatic Text Segmentation

Eugene Agichtein, Columbia University

Venkatesh Ganti, Microsoft Research

Page 2: Scenarios

Importing unformatted strings into a target structured database
– Data warehousing
– Data integration

Requires each string to be segmented into the target relation schema

Input strings are prone to errors (e.g., data warehousing, data exchange)

Page 3: Current Approaches

Rule-based
– Hard to develop, maintain, and deploy comprehensive sets of rules for every domain

Supervised
– E.g., [BSD01]
– Hard to obtain comprehensive datasets needed to train robust models

Page 4: Our Approach

Exploit large reference tables
– Learn domain-specific dictionaries
– Learn structure within attribute values

Challenges
– Order of attribute concatenation in future test input is unknown
– Robustness to errors in test input after training on clean and standardized reference tables

Page 5: Problem Statement

Target schema: R[A1,…,An]

For a given string s (a sequence of tokens)
– segment s into substrings s1,…,sn at token boundaries
– map s1,…,sn to Ai1,…,Ain
– maximize P(Ai1|s1)*…*P(Ain|sn) among all possible segmentations of s

The product combination function handles arbitrary concatenation order of attribute values

P(Ai|x), the probability that a string x belongs to Ai, is estimated by an Attribute Recognition Model ARMi

ARMs are learned from a reference relation r[A1,…,An]
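The objective above can be illustrated with a brute-force search over every split and every attribute order. This is a minimal sketch, not the system's algorithm: the two toy scoring functions (`city_arm`, `street_arm`), their dictionaries, and their probabilities are hypothetical stand-ins for learned ARMs, and the sketch assumes every attribute receives a non-empty substring.

```python
from itertools import combinations, permutations

def best_segmentation(tokens, arms):
    """Maximize P(Ai1|s1)*...*P(Ain|sn) over all splits and attribute orders.

    arms maps an attribute name to a function: list of tokens -> probability.
    """
    n, m = len(arms), len(tokens)
    best_score, best_tuple = 0.0, None
    # Choose n-1 cut positions between tokens: n non-empty contiguous segments.
    for cuts in combinations(range(1, m), n - 1):
        bounds = (0, *cuts, m)
        segments = [tokens[bounds[i]:bounds[i + 1]] for i in range(n)]
        # Try every assignment of attributes to segments (unknown order).
        for order in permutations(arms):
            score = 1.0
            for attr, seg in zip(order, segments):
                score *= arms[attr](seg)
            if score > best_score:
                best_score = score
                best_tuple = {a: " ".join(s) for a, s in zip(order, segments)}
    return best_score, best_tuple

# Hypothetical toy ARMs (illustration only, not learned from a reference table).
CITIES = {"seattle", "redmond"}
STREET_SUFFIXES = {"st", "rd", "ave"}

def city_arm(seg):
    return 0.9 if len(seg) == 1 and seg[0] in CITIES else 0.01

def street_arm(seg):
    return 0.8 if seg[-1] in STREET_SUFFIXES else 0.01

arms = {"Street": street_arm, "City": city_arm}
score, result = best_segmentation("redmond nw 57th st".split(), arms)
print(result)
```

Because the product does not depend on where an attribute appears, the same search recovers the tuple whichever attribute comes first in the input. The exhaustive loop is only for illustration; it grows quickly with the number of tokens and attributes.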

Page 6: Segmentation Architecture

[Figure: segmentation architecture. Pre-processing/training: the reference table, with tokenization and a feature hierarchy, is used to build ARM1, ARM2, …, ARMn. Segmentation: the ARMs are used to turn an input string t1, t2, t3, …, tm into a segmented tuple t1 | t2, t3 | … | tn over attributes A1, A2, …, An.]

Page 7: ARMs

Design goals
– Accurately distinguish an attribute value from other attributes
– Generalize to unobserved/new attribute values
– Robust to input errors
– Able to learn over large reference tables

Page 8: ARM: Instantiation of HMMs

Purpose: Estimate probabilities of token sequences belonging to attributes

ARM: instantiation of HMMs (sequential models)

Acceptance probability: product of emission and transition probabilities

[Figure: example HMM with states START, "Number ending in ‘st’ or ‘th’", "Short word (<= 5 chars)", "st|rd|wy|blvd", and END; transition labels 0.3, 0.4, 1.0, 1.0]
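The acceptance probability described above, a product of emission and transition probabilities, can be sketched for a single state path as follows. The state names (`ordinal`, `suffix`) and all probabilities are hypothetical and are not taken from the slide's figure; a full HMM would also combine or maximize over all paths, which this sketch leaves out.

```python
def path_probability(tokens, path, transitions, emissions):
    """Probability that the HMM emits `tokens` while following the state `path`.

    transitions: dict (from_state, to_state) -> probability
    emissions:   dict state -> dict token -> probability
    """
    prob = 1.0
    prev = "START"
    for state, token in zip(path, tokens):
        prob *= transitions.get((prev, state), 0.0)   # move to the next state
        prob *= emissions[state].get(token, 0.0)      # emit the token there
        prev = state
    return prob * transitions.get((prev, "END"), 0.0)  # finish at END

# Hypothetical two-state model for values such as "57th st".
transitions = {
    ("START", "ordinal"): 0.5,
    ("ordinal", "suffix"): 1.0,
    ("suffix", "END"): 1.0,
}
emissions = {
    "ordinal": {"57th": 0.5, "1st": 0.5},
    "suffix": {"st": 0.25, "rd": 0.25, "wy": 0.25, "blvd": 0.25},
}

p = path_probability(["57th", "st"], ["ordinal", "suffix"], transitions, emissions)
print(p)  # 0.5 * 0.5 * 1.0 * 0.25 * 1.0
```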

Page 9: Instantiating HMMs

Instantiation has to define
– Topology: states & transitions
– Emission & transition probabilities

Current automatic approaches for topology search from among a pre-defined class of topologies are based on cross validation [FC00, BSD01]
– Expensive
– Number of states in the ARM is small to keep the search space tractable

Page 10: Intuition behind ARM Design

Street address examples
– [nw 57th St], [Redmond Woodinville Rd]

Album names
– [The best of eagles], [The fury of aquabats], [Colors Soundtrack]

Large dictionaries (e.g., aquabats, soundtrack, st…) to exploit

Begin and end tokens are very important to distinguish values of an attribute (nw, st, the,…)

Can learn patterns on tokens (e.g., 57th generalizes to *th)

Need robustness to input errors
– [Best of eagles] for [The best of eagles], [nw 57th] for [nw 57th st]

Page 11: Large Number of States

Associate a state per token: each state only emits a single base token
– More accurate transition probabilities

Model sizes for many large reference tables are still within a few megabytes
– Not a problem with current main memory sizes!

Prune the number of states (say, remove low-frequency tokens) to limit the ARM size
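The pruning idea above can be sketched as a frequency cut-off over the tokens of one attribute column. The function name `prune_tokens` and the threshold `min_count` are hypothetical; the slides do not say which cut-off or tokenization is used, so whitespace tokenization and lowercasing are assumptions.

```python
from collections import Counter

def prune_tokens(values, min_count=2):
    """Keep only tokens seen at least `min_count` times in an attribute column."""
    counts = Counter(tok for value in values for tok in value.lower().split())
    return {tok for tok, c in counts.items() if c >= min_count}

street_values = ["nw 57th st", "ne 40th st", "nw 42nd st"]
kept = prune_tokens(street_values, min_count=2)
print(sorted(kept))
```

Tokens that fall below the threshold lose their own state; in this design they could presumably still be matched through more general feature classes.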

Page 12: BMT Topology: Relax Positional Specificity

[Figure: START, BEGIN, MIDDLE, TRAILING, END]

A single state per distinct symbol within a category; the emission probability of a symbol within a category is the same
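A minimal sketch of how reference values might be split into the three BMT categories. The positional rule used here (first token is BEGIN, last token is TRAILING, everything in between is MIDDLE, and a one-token value is BEGIN only) is an assumption made for illustration; the slide does not spell out how short values are handled.

```python
from collections import defaultdict

def bmt_labels(tokens):
    """Label each token position as BEGIN, MIDDLE, or TRAILING (assumed rule)."""
    if not tokens:
        return []
    if len(tokens) == 1:
        return ["BEGIN"]
    return ["BEGIN"] + ["MIDDLE"] * (len(tokens) - 2) + ["TRAILING"]

def build_categories(values):
    """Collect the distinct tokens seen in each category: one state per token."""
    categories = defaultdict(set)
    for value in values:
        tokens = value.lower().split()
        for token, label in zip(tokens, bmt_labels(tokens)):
            categories[label].add(token)
    return categories

cats = build_categories(["nw 57th St", "Redmond Woodinville Rd"])
print({label: sorted(toks) for label, toks in cats.items()})
```

Collapsing all interior positions into one MIDDLE category is what relaxes positional specificity: a token no longer has to appear at exactly the position where it was observed.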

Page 13: Feature Hierarchy: Relax Token Specificity [BSD01]

[Figure: feature hierarchy. The root * splits into the feature classes words [a-z]{1-}, numbers [0-9]{1-}, mixed [a-z0-9]{1-}, and delimiters. Each class is refined by length: [a-z]{1-10}, [a-z]{1-9}, …, [a-z]{1-1}; [0-9]{1-10}, [0-9]{1-9}, …, [0-9]{1-1}; [a-z0-9]{1-10}, …, [a-z0-9]{1-2}. Base tokens sit at the leaves, e.g. ave, apt, st, 5th, 42nd, 40th, 123, 55, 5, #.]
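The hierarchy can be sketched as a function that maps a base token to the chain of increasingly general feature classes above it. This is an illustrative reading of the figure, not the authors' code: it assumes `{1-k}` means "length 1 to k", caps the length-bounded classes at 10, and treats any token that is not purely `[a-z0-9]` as a delimiter.

```python
import re

# Character classes from the figure, most specific first.
CLASSES = [
    ("[a-z]", re.compile(r"[a-z]+")),        # words
    ("[0-9]", re.compile(r"[0-9]+")),        # numbers
    ("[a-z0-9]", re.compile(r"[a-z0-9]+")),  # mixed
]

def generalizations(token, max_len=10):
    """Return the token followed by its feature classes, ending at the root '*'."""
    token = token.lower()
    chain = [token]
    for name, pattern in CLASSES:
        if pattern.fullmatch(token):
            if len(token) <= max_len:
                # Length-bounded classes, tightest bound first.
                chain += [f"{name}{{1-{k}}}" for k in range(len(token), max_len + 1)]
            chain.append(f"{name}{{1-}}")  # unbounded class
            break
    else:
        chain.append("delimiters")  # assumption: anything else is a delimiter
    chain.append("*")
    return chain

print(generalizations("57th"))
print(generalizations("#"))
```

A token unseen in the reference table, such as a new street number, can then still be matched by a state for one of its feature classes.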

Page 14: Example ARM for Address

[Figure: example ARM built from an Address column (40th Rd E 50th Street W 42nd St …). Between START and END, the BEGIN, MIDDLE, and TRAILING categories contain base-token states (40th, 42nd, 50th, Rd, St, Street, W, E) and, in each category, the feature-class states c1-3 and w+.]

Page 15: Robustness Operations: Relax Sequential Specificity

Make ARMs robust to common errors in the input, i.e., maintain high probability of acceptance despite these errors

Common types of errors [HS98]
– Token deletions
– Token insertions
– Missing values

Intuition: Simulate the effects of such erroneous values over each ARM

Page 16: Robustness Operations

[Figure: BEGIN, MIDDLE, TRAILING, END]

Simulating the effect of token insertions: tokens and the corresponding transition probabilities are copied from the BEGIN state to the MIDDLE state
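A literal sketch of the copy step described above, under an assumed representation: each state is a `(category, token)` pair and the model is a dictionary from `(source_state, target_state)` to a probability. The representation and the function name `copy_begin_to_middle` are hypothetical, and the sketch mirrors only the copy itself, not how the authors wire the copied states into the rest of the ARM.

```python
def copy_begin_to_middle(transitions):
    """Return a relaxed transition table where every BEGIN state also exists in MIDDLE.

    transitions: dict ((category, token), (category, token)) -> probability
    """
    relaxed = dict(transitions)  # leave the input table untouched
    for (src, dst), prob in transitions.items():
        category, token = src
        if category == "BEGIN":
            # Same token, same outgoing probability, but now in MIDDLE.
            relaxed.setdefault((("MIDDLE", token), dst), prob)
    return relaxed

transitions = {
    (("BEGIN", "nw"), ("MIDDLE", "57th")): 1.0,
    (("MIDDLE", "57th"), ("TRAILING", "st")): 1.0,
}
relaxed = copy_begin_to_middle(transitions)
print(len(transitions), len(relaxed))
```

The intent, as the slide describes it, is that a value whose usual first token has been pushed to a later position by an inserted token keeps a high acceptance probability.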

Page 17: Transition Probabilities

Transitions B→M, B→T, M→M, and M→T are allowed

Learned from examples in the reference table

Transition probabilities are also weighted by their ability to distinguish an attribute
– A transition “*” → “*”, which is common across many attributes, gets low weight
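A rough sketch of both ideas: transition probabilities estimated from counts over a reference column, and a weight that shrinks when a transition shows up in many attributes. The weighting formula (one over the number of attributes containing the transition) is purely an assumption for illustration, since the slides do not give the actual formula, and the sketch works on raw tokens without the BMT restrictions.

```python
from collections import Counter, defaultdict

def transition_probs(values):
    """Estimate P(next token | token) from consecutive tokens in a column."""
    counts = defaultdict(Counter)
    for value in values:
        tokens = value.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return {
        (a, b): c / sum(nxt.values())
        for a, nxt in counts.items()
        for b, c in nxt.items()
    }

def distinctiveness(transition, probs_by_attribute):
    """Hypothetical weight: 1 / (number of attributes where the transition occurs)."""
    hits = sum(transition in probs for probs in probs_by_attribute.values())
    return 1.0 / hits if hits else 0.0

by_attribute = {
    "Street": transition_probs(["nw 57th st", "nw 40th st"]),
    "Album": transition_probs(["57th st blues"]),
}
print(by_attribute["Street"][("nw", "57th")])
print(distinctiveness(("57th", "st"), by_attribute))
```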

Page 18: Summary of ARM Instantiation

BMT topology

Token hierarchy to generalize observed patterns

Robustness operations on HMMs to address input errors

One state per token in reference table to exploit large dictionaries

Page 19: Attribute Order Determination

If attribute order is known
– Can use dynamic programming algorithm to segment [Rabiner89]

If attribute order is unknown
– Can ask the user to provide attribute order
– Can discover attribute order

Naïve expensive strategy: evaluate all concatenation orders and segmentations for each input string

Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples
– Several datasets on the web satisfy this assumption
– Allows us to efficiently determine the attribute order over a batch of tuples and segment input strings (using dynamic programming)
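The two steps above can be sketched together: a dynamic program that segments one string for a known attribute order, and a batch-level search that picks one order for all strings. The order search here simply scores every permutation over the whole batch, which is a simple stand-in and not the authors' order-determination procedure; the toy ARMs and their probabilities are hypothetical.

```python
from itertools import permutations

def segment(tokens, order, arms):
    """Best split of `tokens` into non-empty segments, one per attribute in `order`."""
    m, n = len(tokens), len(order)
    # best[k][j] = (score, segments) for the first j tokens and first k attributes.
    best = [[(0.0, None)] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (1.0, [])
    for k in range(1, n + 1):
        for j in range(k, m + 1):
            for i in range(k - 1, j):
                prev_score, prev_segments = best[k - 1][i]
                if prev_segments is None:
                    continue
                score = prev_score * arms[order[k - 1]](tokens[i:j])
                if score > best[k][j][0]:
                    best[k][j] = (score, prev_segments + [tokens[i:j]])
    return best[n][m]

def best_order(batch, arms):
    """Pick the single attribute order that scores best over the whole batch."""
    def total(order):
        return sum(segment(s.split(), order, arms)[0] for s in batch)
    return max(permutations(arms), key=total)

# Hypothetical toy ARMs (illustration only).
CITIES = {"seattle", "redmond"}
STREET_SUFFIXES = {"st", "rd", "ave"}
arms = {
    "Street": lambda seg: 0.8 if seg[-1] in STREET_SUFFIXES else 0.01,
    "City": lambda seg: 0.9 if len(seg) == 1 and seg[0] in CITIES else 0.01,
}

batch = ["redmond nw 57th st", "seattle ne 40th st"]
order = best_order(batch, arms)
score, segments = segment(batch[0].split(), order, arms)
print(order, segments)
```

Once the order is fixed for the batch, each string needs only the dynamic program, instead of a search over every concatenation order per string.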

Page 20: Segmentation Algorithm (runtime)

[Figure: runtime pipeline. A batch of input strings is used to learn the attribute value order; each input string is then segmented (dynamic programming algorithm) using the ARMs to produce an output tuple.]

Page 21: Experimental Evaluation

Reference relations from several domains
– Addresses: 1,000,000 tuples [Name, #1, #2, Street Address, City, State, Zip]
– Media: 280,000 tuples [ArtistName, AlbumName, TrackName]
– Bibliography: 100,000 tuples [Title, Author, Journal, Volume, Month, Year]

Compare CRAM (our system) with DataMold [BSD01]

Page 22: Test Datasets

Naturally erroneous datasets: unformatted input strings seen in operational databases
– Media
– Customer addresses

Controlled error injection
– Clean reference table tuples → [Inject errors] → Concatenate to generate input strings
– Evaluate whether a segmentation algorithm recovered the original tuple
– Accuracy measure: % of attribute values correctly recognized
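The accuracy measure above can be sketched as an exact-match comparison between predicted and original tuples. Exact string equality per attribute is an assumption; the slides do not say whether any normalization is applied before comparing.

```python
def attribute_accuracy(predicted, expected):
    """Percentage of attribute values that exactly match the original tuples."""
    total = correct = 0
    for pred, gold in zip(predicted, expected):
        for attr, value in gold.items():
            total += 1
            correct += pred.get(attr) == value
    return 100.0 * correct / total if total else 0.0

expected = [
    {"City": "redmond", "Street": "nw 57th st"},
    {"City": "seattle", "Street": "ne 40th st"},
]
predicted = [
    {"City": "redmond", "Street": "nw 57th st"},
    {"City": "seattle ne", "Street": "40th st"},  # boundary placed one token late
]
accuracy = attribute_accuracy(predicted, expected)
print(accuracy)
```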

Page 23: Overall Accuracy

[Figure: two bar charts, Addresses and DBLP, comparing the accuracy of CRAM and DataMold on the Missing, Insertions, Deletions, Spelling, Reordering, AllErrors, and Clean test sets]

Page 24: Topology & Robustness Operations

[Figure: bar chart for Addresses comparing 1-Pos, BMT, and BMT-robust]

Page 25: Training on Hypothetical Error Models

[Figure: bar chart comparing DataMold, DataMold:Hypothetical, and CRAM on Addresses:AllErrors, Addresses:Clean, DBLP:AllErrors, and DBLP:Clean]

Page 26: Exploiting Dictionaries

[Figure: accuracy vs. reference table size (1.E+02 to 2.E+05) for DBLP, Addresses, and Media]

Page 27: Conclusions

Reference tables leveraged for segmentation

Combining ARMs based on independence allows segmenting input strings with unknown attribute order

ARM models learned over clean reference relations can accurately segment erroneous input strings
– BMT topology
– Robustness operations
– Exploiting large dictionaries

Page 28: Model Sizes & Pruning

[Figure: three charts for Addresses, DBLP, and Media, each with x-axis values 0.1, 0.2, 0.3, 0.5, 1: Accuracy, #States & Transitions, and Model Size in MB]

Page 29: Order Determination Accuracy

Page 30: Topology

[Figure: bar chart for Media comparing 1-Pos, BMT, and 9-Pos on Err and Clean inputs, grouped by 500, 1000, 2000, 5000, and inf]

Page 31: Specificities of HMM Models

Model “specificity” restricts accepted token sequences

Positional specificity
– Number ending in ‘th|st’ can only be the 2nd token in an address value

Token specificity
– Last state only accepts “st, rd, wy, blvd”

Sequential specificity
– “st, rd, wy, blvd” have to follow a number in ‘st|th’

[Figure: example HMM with states START, "Number ending in ‘st’ or ‘th’", "Short word (<= 5 chars)", "st|rd|wy|blvd", and END; transition labels 0.3, 0.4, 1.0, 1.0]

Page 32: Robustness Operations

[Figure: three BEGIN / MIDDLE / TRAILING diagrams illustrating token insertion, token deletion, and missing values; one diagram includes an extra state B’]