Mining Reference Tables for Automatic Text Segmentation Eugene Agichtein Columbia University...

Mining Reference Tables for Automatic Text Segmentation

Eugene Agichtein, Columbia University

Venkatesh Ganti, Microsoft Research

Scenarios

Importing unformatted strings into a target structured database
– Data warehousing
– Data integration

Requires each string to be segmented into the target relation schema

Input strings are prone to errors (e.g., data warehousing, data exchange)

Current Approaches

Rule-based
– Hard to develop, maintain, and deploy comprehensive sets of rules for every domain

Supervised
– E.g., [BSD01]
– Hard to obtain the comprehensive datasets needed to train robust models

Our Approach

Exploit large reference tables
– Learn domain-specific dictionaries
– Learn structure within attribute values

Challenges
– Order of attribute concatenation in future test input is unknown
– Robustness to errors in test input after training on clean and standardized reference tables

Problem Statement

Target schema: R[A1,…,An]

For a given string s (a sequence of tokens)
– segment s into substrings s1,…,sn at token boundaries
– map s1,…,sn to attributes Ai1,…,Ain
– maximize P(Ai1|s1)*…*P(Ain|sn) among all possible segmentations of s

The product combination function handles arbitrary concatenation order of attribute values

P(Ai|x), the probability that a string x belongs to Ai, is estimated by an Attribute Recognition Model ARMi

ARMs are learned from a reference relation r[A1,…,An]
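The objective above can be sketched by brute force: try every way of cutting the token sequence into n non-empty segments and every assignment of attributes to segments, and keep the highest product. The toy ARMs and their probabilities below are hypothetical stand-ins, not models from the talk, and the sketch ignores missing values.

```python
from itertools import combinations, permutations
from math import prod

# Hypothetical toy ARMs: each maps a list of tokens to a made-up probability.
def toy_city(seg):
    return 0.9 if seg == ["seattle"] else 0.01

def toy_zip(seg):
    return 0.9 if len(seg) == 1 and seg[0].isdigit() and len(seg[0]) == 5 else 0.01

def toy_street(seg):
    return 0.8 if seg[-1] in {"st", "rd", "ave"} else 0.01

TOY_ARMS = {"street": toy_street, "city": toy_city, "zip": toy_zip}

def best_segmentation(tokens, arms):
    """Exhaustively maximize P(Ai1|s1)*...*P(Ain|sn) over cuts and attribute orders."""
    names = list(arms)
    n, m = len(names), len(tokens)
    best_score, best_assignment = 0.0, None
    # Choose n-1 cut points among the m-1 token boundaries (segments are non-empty).
    for cuts in combinations(range(1, m), n - 1):
        bounds = (0, *cuts, m)
        segments = [tokens[bounds[i]:bounds[i + 1]] for i in range(n)]
        # The product does not depend on position, so any attribute order is allowed.
        for order in permutations(names):
            score = prod(arms[a](seg) for a, seg in zip(order, segments))
            if score > best_score:
                best_score, best_assignment = score, dict(zip(order, segments))
    return best_score, best_assignment

score, assignment = best_segmentation("seattle nw 57th st 98107".split(), TOY_ARMS)
print(assignment)
```

The enumeration is exponential in the number of attributes, so it only illustrates the objective; it is not meant as a practical algorithm.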

Segmentation Architecture

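[Figure: Segmentation architecture. Pre-processing/training: the reference table, together with the feature hierarchy and tokenization, is used to build ARM1, ARM2, …, ARMn. Segmentation: an input string t1, t2, t3, …, tm is split at token boundaries into a segmented tuple over A1, A2, …, An.]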

ARMs

Design goals
– Accurately distinguish an attribute value from other attributes
– Generalize to unobserved/new attribute values
– Robust to input errors
– Able to learn over large reference tables

ARM: Instantiation of HMMs

Purpose: Estimate probabilities of token sequences belonging to attributes

ARM: instantiation of HMMs (sequential models)

Acceptance probability: product of emission and transition probabilities

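[Figure: Example HMM for street addresses with states START, "Short word (<= 5 chars)", "Number ending in 'st' or 'th'", "st|rd|wy|blvd", and END; transition probabilities shown: 0.3, 0.4, 1.0, 1.0.]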
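As a minimal sketch of "acceptance probability as a product of emission and transition probabilities", the code below scores a token sequence against a small HMM and returns the probability of its best state path. The state names, the indicator-style emissions, and all transition values are illustrative assumptions; they are not the numbers from the figure.

```python
import re

STATES = ["short_word", "ordinal", "suffix"]

# Toy indicator emissions (assumption): 1.0 if the state accepts the token, else 0.0.
EMIT = {
    "short_word": lambda t: float(t.isalpha() and len(t) <= 5),
    "ordinal": lambda t: float(re.fullmatch(r"\d+(st|th)", t) is not None),
    "suffix": lambda t: float(t in {"st", "rd", "wy", "blvd"}),
}

# Made-up transition probabilities; missing pairs are treated as 0.
TRANS = {
    ("START", "short_word"): 0.6,
    ("START", "ordinal"): 0.4,
    ("short_word", "ordinal"): 1.0,
    ("ordinal", "suffix"): 1.0,
    ("suffix", "END"): 1.0,
}

def acceptance_probability(tokens, states=STATES, trans=TRANS, emit=EMIT):
    """Probability of the best START-to-END path that emits the tokens in order."""
    if not tokens:
        return 0.0
    # best[s]: probability of the best path ending in state s after the tokens seen so far
    best = {s: trans.get(("START", s), 0.0) * emit[s](tokens[0]) for s in states}
    for tok in tokens[1:]:
        best = {
            s: max(best[p] * trans.get((p, s), 0.0) for p in states) * emit[s](tok)
            for s in states
        }
    return max(best[s] * trans.get((s, "END"), 0.0) for s in states)

print(acceptance_probability("nw 57th st".split()))
print(acceptance_probability("st 57th".split()))
```

In this toy model a sequence that violates the fixed state order, such as "st 57th", gets probability zero, which is the kind of rigidity the later slides set out to relax.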

Instantiating HMMs

Instantiation has to define
– Topology: states & transitions
– Emission & transition probabilities

Current automatic approaches search for a topology among a pre-defined class of topologies using cross validation [FC00, BSD01]
– Expensive
– Number of states in the ARM is kept small to keep the search space tractable

Intuition behind ARM Design

Street address examples
– [nw 57th St], [Redmond Woodinville Rd]

Album names
– [The best of eagles], [The fury of aquabats], [Colors Soundtrack]

Large dictionaries (e.g., aquabats, soundtrack, st, …) to exploit

Begin and end tokens are very important to distinguish values of an attribute (nw, st, the, …)

Can learn patterns on tokens (e.g., 57th generalizes to *th)

Need robustness to input errors
– [Best of eagles] for [The best of eagles], [nw 57th] for [nw 57th st]

Large Number of States

Associate a state per token: each state emits only a single base token
– More accurate transition probabilities

Model sizes for many large reference tables are still within a few megabytes
– Not a problem with current main memory sizes!

Prune the number of states (say, remove low frequency tokens) to limit the ARM size
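A minimal sketch of frequency-based pruning, assuming one state per distinct token in an attribute column. The function name and the `min_count` threshold are hypothetical; the slides only say that low-frequency tokens can be removed.

```python
from collections import Counter

def token_states(values, min_count=2):
    """Return the tokens that keep their own state after pruning rare tokens."""
    counts = Counter(tok for value in values for tok in value.lower().split())
    return {tok for tok, count in counts.items() if count >= min_count}

addresses = ["nw 57th st", "ne 40th st", "redmond woodinville rd", "nw 42nd st"]
print(sorted(token_states(addresses)))
```

Tokens that lose their own state would still have to be covered some other way, for example by a more general feature class.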

BMT Topology: Relax Positional Specificity

[Figure: BMT topology with START, the BEGIN, MIDDLE, and TRAILING categories, and END.]

A single state per distinct symbol within a category; the emission probability of a symbol within a category is the same
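One way to picture the BMT categories is to bucket the tokens of each reference value by position: first token in BEGIN, last token in TRAILING, everything in between in MIDDLE. The handling of one-token values (BEGIN only) is an assumption of this sketch, not something the slides specify.

```python
def bmt_categories(values):
    """Group the tokens of attribute values into BEGIN / MIDDLE / TRAILING sets."""
    cats = {"BEGIN": set(), "MIDDLE": set(), "TRAILING": set()}
    for value in values:
        tokens = value.lower().split()
        if not tokens:
            continue
        cats["BEGIN"].add(tokens[0])
        if len(tokens) > 1:
            cats["TRAILING"].add(tokens[-1])
        # Everything strictly between the first and last token is positionally "middle".
        cats["MIDDLE"].update(tokens[1:-1])
    return cats

print(bmt_categories(["nw 57th st", "redmond woodinville rd", "40th rd"]))
```

Compared with one state per absolute position, this keeps the informative begin and end tokens separate while letting values of any length share the MIDDLE category.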

Feature Hierarchy: Relax Token Specificity [BSD01]

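[Figure: Feature hierarchy. The root class * splits into words [a-z]{1-}, numbers [0-9]{1-}, mixed [a-z0-9]{1-}, and delimiters. The first three are refined by length-bounded feature classes: [a-z]{1-10}, [a-z]{1-9}, …, [a-z]{1-1}; [0-9]{1-10}, [0-9]{1-9}, …, [0-9]{1-1}; [a-z0-9]{1-10}, …, [a-z0-9]{1-2}. Base tokens such as ave, apt, st, 5th, 42nd, 40th, 123, 55, 5, and # are the leaves.]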
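The hierarchy can be sketched as a function that returns, for a base token, its chain of increasingly general feature classes up to the root. Class labels follow the notation in the figure; sending tokens longer than 10 characters straight to the unbounded class is an assumption of this sketch.

```python
import re

MAX_LEN = 10  # longest length-bounded class shown in the figure

def feature_classes(token):
    """Return the token followed by its generalizations, most specific first."""
    tok = token.lower()
    if re.fullmatch(r"[a-z]+", tok):
        charset = "[a-z]"
    elif re.fullmatch(r"[0-9]+", tok):
        charset = "[0-9]"
    elif re.fullmatch(r"[a-z0-9]+", tok):
        charset = "[a-z0-9]"
    else:
        return [tok, "delimiters", "*"]
    chain = [tok]
    # Length-bounded classes from the tightest fit up to MAX_LEN.
    chain += [f"{charset}{{1-{k}}}" for k in range(len(tok), MAX_LEN + 1)]
    chain += [f"{charset}{{1-}}", "*"]
    return chain

print(feature_classes("57th"))
print(feature_classes("#"))
```

An unseen token such as 61st shares every class above the base level with 57th, which is how a model built on these classes can generalize beyond the dictionary.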

Example ARM for Address

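[Figure: Example ARM for Address, built from an Address column containing tokens such as 40th, Rd, E, 50th, Street, W, 42nd, St. Between START and END, the BEGIN, MIDDLE, and TRAILING categories hold base-token states (W, E, 40th, 42nd, 50th, Rd, St, Street) together with feature-class states (c1-3, w+).]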

Robustness Operations: Relax Sequential Specificity

Make ARMs robust to common errors in the input, i.e., maintain high probability of acceptance despite these errors

Common types of errors [HS98]
– Token deletions
– Token insertions
– Missing values

Intuition: Simulate the effects of such erroneous values over each ARM

Robustness Operations

[Figure: BEGIN, MIDDLE, and TRAILING categories leading to END.]

Simulating the effect of token insertions: tokens and the corresponding transition probabilities are copied from the BEGIN state to the MIDDLE state

Transition Probabilities

Transitions B→M, B→T, M→M, and M→T are allowed

Learned from examples in the reference table

Transition probabilities are also weighted by their ability to distinguish an attribute
– A transition “*” → “*”, which is common across many attributes, gets a low weight
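A minimal sketch of learning transition probabilities from reference values, using plain relative frequencies over states labeled (category, token). The extra weighting by how well a transition distinguishes an attribute is not specified in the slides and is left out here; the helper names are hypothetical.

```python
from collections import Counter, defaultdict

def labeled_path(tokens):
    """START, BEGIN token, MIDDLE tokens, TRAILING token, END (one-token values skip M and T)."""
    if len(tokens) == 1:
        middle, last = [], []
    else:
        middle = [("MIDDLE", t) for t in tokens[1:-1]]
        last = [("TRAILING", tokens[-1])]
    return ["START", ("BEGIN", tokens[0]), *middle, *last, "END"]

def transition_probabilities(values):
    """Maximum-likelihood transition estimates from a column of attribute values."""
    counts = defaultdict(Counter)
    for value in values:
        tokens = value.lower().split()
        if not tokens:
            continue
        path = labeled_path(tokens)
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: c / sum(nxt.values()) for dst, c in nxt.items()}
        for src, nxt in counts.items()
    }

probs = transition_probabilities(["nw 57th st", "nw 40th st", "ne 57th st"])
print(probs[("BEGIN", "nw")])
```

By construction the only category-level transitions that can appear between tokens are B→M, B→T, M→M, and M→T, matching the allowed set.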

Summary of ARM Instantiation

BMT topology

Token hierarchy to generalize observed patterns

Robustness operations on HMMs to address input errors

One state per token in reference table to exploit large dictionaries

Attribute Order Determination

If attribute order is known
– Can use a dynamic programming algorithm to segment [Rabiner89]

If attribute order is unknown
– Can ask the user to provide attribute order
– Can discover attribute order

Naïve, expensive strategy: evaluate all concatenation orders and segmentations for each input string

Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples
– Several datasets on the web satisfy this assumption
– Allows us to efficiently determine the attribute order over a batch of tuples and segment input strings (using dynamic programming)
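The sketch below shows dynamic programming segmentation for a known attribute order, then picks a single order for a whole batch under the consistent-order assumption. The batch step simply tries every permutation and sums per-string scores; that is a simple stand-in chosen for illustration, since the slides do not describe the efficient procedure. The toy ARMs and their probabilities are hypothetical.

```python
from itertools import permutations

# Hypothetical toy ARMs with made-up probabilities (not from the slides).
ARMS = {
    "city": lambda seg: 0.9 if seg == ["seattle"] else 0.01,
    "street": lambda seg: 0.8 if seg[-1] in {"st", "rd", "ave"} else 0.01,
    "zip": lambda seg: 0.9 if len(seg) == 1 and seg[0].isdigit() else 0.01,
}

def segment(tokens, order, arms):
    """Best split of tokens into non-empty segments, one per attribute, in the given order."""
    m = len(tokens)
    # best[j] = (score, segments) for the attributes processed so far covering tokens[:j]
    best = {0: (1.0, [])}
    for attr in order:
        nxt = {}
        for start, (score, segs) in best.items():
            for end in range(start + 1, m + 1):
                cand = score * arms[attr](tokens[start:end])
                if end not in nxt or cand > nxt[end][0]:
                    nxt[end] = (cand, segs + [tokens[start:end]])
        best = nxt
    return best.get(m, (0.0, None))

def best_batch_order(batch, arms):
    """Choose the one attribute order that scores best summed over the whole batch."""
    def batch_score(order):
        return sum(segment(s.split(), order, arms)[0] for s in batch)
    return max(permutations(arms), key=batch_score)

batch = ["seattle nw 57th st 98107", "seattle ne 40th ave 98105"]
order = best_batch_order(batch, ARMS)
print(order, segment(batch[0].split(), order, ARMS))
```

Once the order is fixed, each string costs one O(n·m²) dynamic program for n attributes and m tokens, instead of a search over all orders per string.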

Segmentation Algorithm (runtime)

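[Figure: Segmentation algorithm (runtime). A batch of input strings is used to learn the attribute value order; each input string is then segmented by the dynamic programming algorithm, using the ARMs, into an output tuple.]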

Experimental Evaluation

Reference relations from several domains
– Addresses: 1,000,000 tuples [Name, #1, #2, Street Address, City, State, Zip]
– Media: 280,000 tuples [ArtistName, AlbumName, TrackName]
– Bibliography: 100,000 tuples [Title, Author, Journal, Volume, Month, Year]

Compare CRAM (our system) with DataMold [BSD01]

Test Datasets

Naturally erroneous datasets: unformatted input strings seen in operational databases
– Media
– Customer addresses

Controlled error injection
– Clean reference table tuples → [Inject errors] → Concatenate to generate input strings
– Evaluate whether a segmentation algorithm recovered the original tuple

Accuracy measure: % of attribute values correctly recognized
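A minimal sketch of the controlled-error pipeline and the accuracy measure. The single error type shown (dropping the first token of a value), the helper names, and the dictionary-based tuple format are assumptions made for illustration. The slides do not say how a corrupted value is judged correct, so `accuracy` simply compares against whatever expected values the caller supplies.

```python
def delete_first_token(value):
    """Token-deletion error: drop the first token when more than one is present."""
    tokens = value.split()
    return " ".join(tokens[1:]) if len(tokens) > 1 else value

def make_input(row, order, corrupt=None):
    """Optionally corrupt each attribute value, then concatenate in the given order."""
    values = [corrupt(row[attr]) if corrupt else row[attr] for attr in order]
    return " ".join(values)

def accuracy(predicted_rows, expected_rows):
    """Percentage of attribute values that were recognized exactly."""
    total = correct = 0
    for predicted, expected in zip(predicted_rows, expected_rows):
        for attr, value in expected.items():
            total += 1
            correct += predicted.get(attr) == value
    return 100.0 * correct / total if total else 0.0

row = {"street": "nw 57th st", "city": "seattle"}
print(make_input(row, ["city", "street"], corrupt=delete_first_token))
print(accuracy([{"street": "nw 57th st", "city": "redmond"}], [row]))
```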

Overall Accuracy

[Figure: Overall accuracy (%) of CRAM vs. DataMold in two panels, Addresses and DBLP, for each error type: Missing, Insertions, Deletions, Spelling, Reordering, AllErrors, Clean.]

Topology & Robustness Operations

[Figure: Accuracy (%) on Addresses for the 1-Pos, BMT, and BMT-robust topologies.]

Training on Hypothetical Error Models

[Figure: Accuracy (%) of DataMold, DataMold:Hypothetical, and CRAM on Addresses:AllErrors, Addresses:Clean, DBLP:AllErrors, and DBLP:Clean.]

Exploiting Dictionaries

[Figure: Accuracy (%) vs. reference table size (1.E+02 to 2.E+05) for DBLP, Addresses, and Media.]

Conclusions

Reference tables leveraged for segmentation

Combining ARMs based on independence allows segmenting input strings with unknown attribute order

ARM models learned over clean reference relations can accurately segment erroneous input strings
– BMT topology
– Robustness operations
– Exploiting large dictionaries

Model Sizes & Pruning

[Figure: Three panels for Addresses, DBLP, and Media, each plotted at x-axis values 0.1, 0.2, 0.3, 0.5, and 1: accuracy (%), number of states & transitions, and model size in MB.]

Order Determination Accuracy

Topology

[Figure: Results (%) on Media for the 1-Pos, BMT, and 9-Pos topologies, on erroneous (Err) and clean inputs, at x-axis groups 500, 1000, 2000, 5000, and inf.]

Specificities of HMM Models

Model “specificity” restricts accepted token sequences

Positional specificity
– A number ending in ‘th|st’ can only be the 2nd token in an address value

Token specificity
– The last state only accepts “st, rd, wy, blvd”

Sequential specificity
– “st, rd, wy, blvd” have to follow a number ending in ‘st|th’

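[Figure: Example HMM for street addresses with states START, "Short word (<= 5 chars)", "Number ending in 'st' or 'th'", "st|rd|wy|blvd", and END; transition probabilities shown: 0.3, 0.4, 1.0, 1.0.]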

Robustness Operations

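[Figure: Variants of the START, BEGIN, MIDDLE, TRAILING, END topology (one with an additional state B’) illustrating the three robustness operations: token insertion, token deletion, and missing values.]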
