Integration and representation of unstructured text in relational databases

Sunita Sarawagi

IIT Bombay

Database

HR database

Resumes: skills, experience, references (emails)

Unstructured data

Text resume in an email

Company database: products with features

Product reviews on the web

Customer emails

Citeseer/Google Scholar

Structured records from publishers

Publications from homepages

Personal Databases: bibtex, address book

Extract bibtex entries when I download a paper

Enter missing contacts via web search

Articles

Id   Title              Year   Journal   Canonical
2    Update Semantics   1983   10

Journals

Id   Name                   Canonical
10   ACM TODS
17   AI                     17
16   ACM Trans. Databases

Writes

Article   Author
2         11
2         2
2         3

Authors

Id   Name             Canonical
11   M Y Vardi
2    J. Ullman        4
3    Ron Fagin        3
4    Jeffrey Ullman   4

Probabilistic variant links to canonical entries

Database: imprecise

3 Top-level entities
R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see

Articles

Id   Title                          Year   Journal   Canonical
7    Belief, awareness, reasoning   1988   17

Journals

Id   Name       Canonical
10   ACM TODS
17   AI         17
                10

Writes

Article   Author
2         11
2         2
2         3
7         8
7         9

Authors

Id   Name             Canonical
11   M Y Vardi
2    J. Ullman        4
3    Ron Fagin        3
4    Jeffrey Ullman   4
8    R Fagin          3
9    J Helpern        8

Extraction

Author: R. Fagin   Author: J. Helpern   Title: Belief,..reasoning   Journal: AI   Year: 1988

Integration

Match with existing linked entities while respecting all constraints

Outline

• Statistical models for integration
  • Extraction while fully exploiting existing database: entity match, entity pattern, link/relationship constraints
  • Integrate extracted entities, resolve if entity already in database
• Performance challenges
  • Efficient graphical model inference algorithms
  • Indexing support
• Representing uncertainty of integration in DB
  • Imprecise databases and queries

Extraction using chain CRFs

R. Fagin and J. Helpern, Belief, awareness, reasoning

t   1        2        3       4        5         6        7           8
x   R.       Fagin    and     J.       Helpern   Belief   Awareness   Reasoning
y   Author   Author   Other   Author   Author    Title    Title       Title

One label per word: y1 y2 y3 y4 y5 y6 y7 y8

Features describe the single word, e.g. “Fagin”

Flexible overlapping features
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase?
• is under node X in WordNet
• is in bold font
• is indented
• next two words are “and Associates”
• previous label is “Other”

Difficult to effectively combine features from labeled unstructured data and structured DB
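A minimal sketch of what word-level features for a chain CRF could look like, using a few of the features listed on the slide. The function name `token_features` and the `is_initial` feature are hypothetical and only serve as illustration; a real model would attach a learned weight to each feature.

```python
import re

def token_features(tokens, i, prev_label):
    """Binary features that fire for the single word tokens[i]."""
    word = tokens[i]
    feats = {"word=" + word.lower(): 1, "prev_label=" + prev_label: 1}
    if word[:1].isupper():
        feats["is_capitalized"] = 1
    if word.lower().endswith("ski"):
        feats["ends_in_ski"] = 1
    if re.fullmatch(r"[A-Z]\.", word):
        feats["is_initial"] = 1  # hypothetical feature, not from the slides
    if tokens[i + 1:i + 3] == ["and", "Associates"]:
        feats["next_two=and_Associates"] = 1
    return feats

tokens = ["R.", "Fagin", "and", "J.", "Helpern", "Belief", "Awareness", "Reasoning"]
fagin = token_features(tokens, 1, "Author")
print(sorted(fagin))
```

Every feature here looks at one word and its immediate context, which is why a similarity score between a whole multi-word name and a database entry does not fit naturally into this representation.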

CRFs for Segmentation

l,u   l1=1, u1=2   l2=u2=3   l3=4, u3=5   l4=6, u4=8
x     R. Fagin     and       J. Helpern   Belief Awareness Reasoning
y     Author       Other     Author       Title

Features describe the segment from l to u, e.g. similarity to the author column in the database

Features from database
• Similarity to a dictionary entry: JaroWinkler, TF-IDF
• Similarity to a pattern-level dictionary: regex-based pattern index for database entities
• Entity classifier: a multi-class regression model which gives the likelihood of a segment being a particular entity type
  • Features for the classifier: all standard entity-level extraction features
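A small sketch of one segment-level database feature: the maximum TF-IDF cosine similarity between a candidate segment and the entries of a database column. The class name `TfIdfDictionary`, the tokenizer, and the idf formula `log(1 + n/df)` are assumptions made for this example, not the weighting used in the talk.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().replace(".", " ").replace(",", " ").split()

class TfIdfDictionary:
    """Dictionary of column values with TF-IDF weighted word vectors."""

    def __init__(self, entries):
        self.entries = entries
        docs = [Counter(tokenize(e)) for e in entries]
        n = len(docs)
        df = Counter(w for d in docs for w in d)
        self.idf = {w: math.log(1 + n / df[w]) for w in df}  # assumed idf form
        self.vectors = [self._weigh(d) for d in docs]

    def _weigh(self, counts):
        # Words unknown to the dictionary get weight 0.
        vec = {w: c * self.idf.get(w, 0.0) for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {w: v / norm for w, v in vec.items()} if norm else {}

    def max_similarity(self, segment):
        """Return (best cosine score, best matching entry) for a segment."""
        q = self._weigh(Counter(tokenize(segment)))
        best, best_entry = 0.0, None
        for entry, vec in zip(self.entries, self.vectors):
            score = sum(v * vec.get(w, 0.0) for w, v in q.items())
            if score > best:
                best, best_entry = score, entry
        return best, best_entry

authors = TfIdfDictionary(["M Y Vardi", "J. Ullman", "Ron Fagin", "Jeffrey Ullman"])
print(authors.max_similarity("R Fagin"))
print(authors.max_similarity("Belief Awareness"))
```

The returned score can be used directly as a real-valued feature of the segment (l, u) for the label Author; a segment with no dictionary words scores 0.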

Segmentation models

Segmentation
• Input: sequence x = x1, x2, ..., xn; label set Y
• Output: segmentation s = s1, s2, ..., sp
• sj = (start position, end position, label) = (tj, uj, yj)

Score: F(x,s) combines two kinds of potentials over the segments of s
• Transition potentials: segment starting at i has label y and previous label is y’
• Segment potentials: segment starting at i’, ending at i, and with label y. All positions from i’ to i get the same label.

Probability of a segmentation

Inference: O(nL^2)
• Most likely segmentation
• Marginals around segments
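The search for the most likely segmentation can be sketched as a Viterbi pass over segment end positions. This is a generic semi-Markov Viterbi written for illustration; the function names, the `max_len` cap on segment length, and the toy scoring function are assumptions, and in a trained model the two score functions would be weighted sums of features.

```python
def segment_viterbi(n, labels, max_len, segment_score, transition_score):
    """Return (best score, segments); segments are (start, end, label), 1-based inclusive."""
    # best[i][y]: best score of a segmentation of positions 1..i whose last segment has label y
    best = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            for length in range(1, min(max_len, i) + 1):
                start = i - length + 1
                seg = segment_score(start, i, y)
                if start == 1:
                    if seg > best[i][y]:
                        best[i][y], back[i][y] = seg, (0, None)
                    continue
                for yp in labels:
                    cand = best[start - 1][yp] + transition_score(yp, y) + seg
                    if cand > best[i][y]:
                        best[i][y], back[i][y] = cand, (start - 1, yp)
    # Trace back from the best final label.
    y = max(labels, key=lambda lab: best[n][lab])
    score, i, segments = best[n][y], n, []
    while i > 0:
        j, yp = back[i][y]
        segments.append((j + 1, i, y))
        i, y = j, yp
    segments.reverse()
    return score, segments

# Toy potentials: reward exactly the segments of the running example.
gold = [(1, 2, "Author"), (3, 3, "Other"), (4, 5, "Author"), (6, 8, "Title")]
seg_score = lambda s, e, y: 2.0 * (e - s + 1) if (s, e, y) in gold else -1.0
trans_score = lambda y_prev, y: 0.0
score, segments = segment_viterbi(8, ["Author", "Other", "Title"], 3, seg_score, trans_score)
print(score, segments)
```

Compared with a chain CRF, the extra inner loop over segment lengths is what makes segment-level features possible and what makes inference slower.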





CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI

Only extraction

Author: R. Fagin   Author: J. Helpern   Title: Belief,..reasoning   Journal: AI   Year: 2000

Year mismatch! The articles already in the database:

Id   Title                          Year   Journal
2    Update Semantics               1983   10
7    Belief, awareness, reasoning   1988   17

Combined extraction + integration

Author: R. Fagin   Author: J. Helpern   Title: Belief,..reasoning in AI   Journal: CACM   Year: 2000

Combined extraction + matching

Convert predicted label to be a pair y = (a,r)
• r: id of matching entity
• r=0 means none-of-the-above or a new entry

l,u   l1=1, u1=2   l2=u2=3   l3=4, u3=8
x     CACM. 2000 Fagin Belief Awareness Reasoning In AI
a     Journal   Year   Author   Title
r     0         7      3        7

Constraints exist on ids that can be assigned to two segments

Constrained models

Two kinds of constraints between arbitrary segments
• Foreign key constraint across their canonical-ids
• Cardinality constraint

Training
• Ignore constraints or use max-margin methods that require only MAP estimates

Application:
• Formulate as a constrained integer programming problem (expensive)
• Use general A-star search to find most likely constrained assignment
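A rough sketch of A-star search for a constrained assignment, limited to the cardinality constraint (the foreign-key constraint is left out). The segments are assumed fixed, `options[i]` holds hypothetical log-scores per label for segment i, and the heuristic is the unconstrained best score of the remaining segments; none of these details are taken from the talk.

```python
import heapq
from itertools import count

def astar_assign(options, max_count):
    """options[i]: label -> log-score for segment i; max_count: label -> max occurrences."""
    n = len(options)
    # Optimistic completion: best unconstrained score of segments i..n-1.
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + max(options[i].values())
    tie = count()  # tie-breaker so the heap never compares label tuples
    heap = [(-suffix[0], next(tie), 0.0, ())]
    while heap:
        _, _, score, labels = heapq.heappop(heap)
        i = len(labels)
        if i == n:
            return score, list(labels)  # first complete assignment popped is optimal
        for label, s in options[i].items():
            if labels.count(label) >= max_count.get(label, n):
                continue  # cardinality constraint violated
            g = score + s
            heapq.heappush(heap, (-(g + suffix[i + 1]), next(tie), g, labels + (label,)))
    return None

options = [
    {"Journal": -0.2, "Title": -3.0},   # "CACM."
    {"Year": -0.1},                     # "2000"
    {"Author": -0.2},                   # "Fagin"
    {"Journal": -0.1, "Title": -0.4},   # "Belief Awareness Reasoning In AI"
]
best = astar_assign(options, {"Journal": 1, "Year": 1})
print(best)
```

Without the constraint both the first and last segment would be labeled Journal; with it, the search settles on Journal, Year, Author, Title. Because the heuristic never underestimates the score still achievable, the search can stop at the first complete assignment it pops.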

Effect of database on extraction performance

                L      L+DB   %Δ
PersonalBib
  author        75.7   79.5   4.9
  journal       33.9   50.3   48.6
  title         61.0   70.3   15.1
Address
  city_name     72.4   76.7   6.0
  state_name    13.9   33.2   138.5
  zipcode       91.6   94.3   3.0

L = Only labeled structured data

L + DB: similarity to database entities and other DB features

(Mansuri and Sarawagi ICDE 2006)

Effect of various features

[Figure: F1 (axis 55 to 85) with two panels, Train=5% and Train=10%. Configurations shown: only_L (no DB), +cardinality, +db_similarity, +db_classifier, +db_regex, +db_link (one panel only), -L_entity, -L_context, -L_edge.]

Full integration performance

                L      L+DB   %Δ
PersonalBib
  author        70.8   74.0   4.5
  journal       29.6   45.5   53.6
  title         51.6   65.0   25.9
Address
  city_name     70.1   74.6   6.4
  state_name    9.0    28.3   213.8
  pincode       87.8   90.7   3.3

L = conventional extraction + matching

L + DB = technology presented here

Much higher accuracies possible with more training data

(Mansuri and Sarawagi ICDE 2006)

Inference in segmentation models

R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998

• Surface features (cheap)
• Database lookup features (expensive!)

Authors (Name), reached through an inverted index: S Chakrabarti; Jay Shan; Jackie Chan; Bill Gates; Thorsten; J Kleinberg; J. Gherke; Claire Cardie; Jeffrey Ullman; Ron Fagin; J. Ullman; M Y Vardi

Many large tables

Efficient search for top-k most similar entities
1. Batch up to do better than individual top-k?
2. Find top segmentation without top-k matches for all segments?

Top-k similarity search

Q: query segment

E: an entry in the database D

Goal: get the k Es in D with the highest similarity score

Inverted index over terms t1, t2, t3, ..., tU
• Tidlists: pointers to DB tuples (on disk)
• Bounds on normalized idf values (cached)

Steps
1. Fetch/merge tidlist subsets
2. Point queries

Candidate matches: tuple id with upper and lower score bounds (upper and lower bounds on dictionary match scores)
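A simplified sketch of top-k search over an inverted index with score bounds. The similarity used here (sum of idf weights of shared words) is a stand-in, since the slide's similarity score is not available in the transcript; `build_index` and `top_k` are hypothetical names. Tidlists are merged in decreasing idf order, the partial sums act as lower bounds, and merging stops once no unseen tuple can enter the top k; exact scores of the surviving candidates are then completed by point lookups.

```python
import heapq
import math
from collections import defaultdict

def build_index(entries):
    tidlists = defaultdict(set)  # term -> ids of tuples containing it
    for tid, text in enumerate(entries):
        for tok in set(text.lower().split()):
            tidlists[tok].add(tid)
    n = len(entries)
    idf = {tok: math.log(1 + n / len(tids)) for tok, tids in tidlists.items()}
    return tidlists, idf

def top_k(query, entries, tidlists, idf, k):
    toks = sorted({t for t in query.lower().split() if t in idf}, key=lambda t: -idf[t])
    remaining = sum(idf[t] for t in toks)  # upper bound for any tuple not seen yet
    lower = defaultdict(float)             # lower bound on each candidate's score
    for tok in toks:
        kth = heapq.nlargest(k, lower.values())
        if len(kth) == k and kth[-1] >= remaining:
            break  # unseen tuples cannot beat the current top k
        for tid in tidlists[tok]:
            lower[tid] += idf[tok]
        remaining -= idf[tok]
    # Point queries: complete the exact score of every candidate seen so far.
    qset = set(toks)
    exact = {tid: sum(idf[t] for t in qset & set(entries[tid].lower().split()))
             for tid in lower}
    return heapq.nlargest(k, ((s, entries[tid]) for tid, s in exact.items()))

entries = ["M Y Vardi", "J. Ullman", "Ron Fagin", "Jeffrey Ullman", "S Chakrabarti", "J Kleinberg"]
tidlists, idf = build_index(entries)
print(top_k("R Fagin", entries, tidlists, idf, k=1))
print(top_k("Jeffrey Ullman", entries, tidlists, idf, k=2))
```

With k=1 and the query "Jeffrey Ullman", the tidlist for the frequent word "ullman" is never merged: the rare word "jeffrey" already yields a candidate whose lower bound exceeds what any unseen tuple could score.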

Best segmentation with inexact, bounded features

Normal Viterbi: forward pass over data positions; at each position maintain the best segmentation ending at that position

Modify to: best-first search with selective feature refinement

[Figure: search graph over segment states s(0,0), s(1,1), s(1,2), s(1,3), s(3,3), s(3,4), s(3,5), s(4,4), s(5,5), leading to an end state.]

Suffix upper/lower bound: from a backward Viterbi with bounded features

(Chandel, Nagesh and Sarawagi, ICDE 2006)

Performance results

DBLP authors and titles, 100 citations

(Chandel, Nagesh and Sarawagi, ICDE 2006)

Inference in segmentation models

R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998

Surface features (cheap)? Not quite! Semi-CRFs are 3 to 8 times slower than chain CRFs

Key insight
• Applications have a mix of token-level and segment-level features
• Many features are applicable to several overlapping segments
• Compactly represent the overlap through new forms of potentials
• Redesign inference algorithms to work on compact features
• Cost is independent of the number of segments a feature applies to

(Sarawagi, ICML 2006)

Compact potentials

Four kinds of potentials

Running time and Accuracy

[Figure: training time in seconds against training %, one panel for Address and one for Cora, comparing Sequence-BCEU, SegmentOpt and Segment.]

[Figure: F1 accuracy (axis 78 to 92) on Address, Cora and Articles, comparing Sequence-BCEU and Segment.]

Probabilistic Querying Systems

• Integration systems, while improving, cannot be perfect, particularly for domains like the web
• User supervision of each integration result is impossible

Create uncertainty-aware storage and querying engines

Two enablers:
• Probabilistic database querying engines over generic uncertainty models
• Conditional graphical models produce well-calibrated probabilities

Probabilities in CRFs are well-calibrated

E.g.: a segmentation with probability 0.5 is correct 50% of the time

[Figure: calibration plots for Cora citations and Cora headers; axes are probability of segmentation and probability correct, each plotted against the ideal line.]
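One way to check calibration is to bin predictions by their reported probability and compare each bin's mean probability with the fraction that turned out correct. The sketch below uses made-up predictions, not the Cora results; `calibration_table` is a hypothetical helper.

```python
def calibration_table(predictions, num_bins=5):
    """predictions: (probability, is_correct) pairs.
    Returns rows (bin_low, bin_high, mean_probability, accuracy, count)."""
    bins = [[] for _ in range(num_bins)]
    for p, ok in predictions:
        idx = min(int(p * num_bins), num_bins - 1)  # p = 1.0 falls in the last bin
        bins[idx].append((p, ok))
    rows = []
    for b, items in enumerate(bins):
        if not items:
            continue
        mean_p = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((b / num_bins, (b + 1) / num_bins, mean_p, accuracy, len(items)))
    return rows

# Illustrative data only: 2 segmentations reported at 0.5, 10 reported at 0.9.
preds = [(0.5, True), (0.5, False)] + [(0.9, True)] * 9 + [(0.9, False)]
for row in calibration_table(preds):
    print(row)
```

A well-calibrated model gives rows where the mean probability and the accuracy are close, which corresponds to points lying near the ideal diagonal.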

Uncertainty in integration systems

[Figure: unstructured text passes through the model, which outputs alternative entity sets with probabilities p1, p2, ..., pk into a probabilistic database system. Annotations: "Very uncertain?" leads to additional training data; "Other more compact models?"]

Example queries
• Select conference name of article RJ03?
  IEEE Intl. Conf. On data mining   0.8
  Conf. On data mining              0.2
• Find most cited author?
  D Johnson   16000   0.6
  J Ullman    13000   0.4

Segmentation-per-row model

(Rows: Uncertain; Cols: Exact)

HNO    AREA          CITY          PINCODE   PROB
52     Bandra West   Bombay        400 062   0.1
52-A   Bandra        West Bombay   400 062   0.2
52-A   Bandra West   Bombay        400 062   0.5
52     Bandra        West Bombay   400 062   0.2

Exact but impractical. We can have too many segmentations!

One-row Model

Each column is a multinomial distribution

(Row: Exact; Columns: Independent, Uncertain)

HNO          AREA                CITY                PINCODE
52 (0.3)     Bandra West (0.6)   Bombay (0.6)        400 062 (1.0)
52-A (0.7)   Bandra (0.4)        West Bombay (0.4)

e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 x 0.6 x 0.6 x 1.0 = 0.252

Simple model, closed form solution, poor approximation.
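The one-row computation can be reproduced in a few lines. The dictionary layout and the function name `one_row_probability` are choices made for this sketch; the probabilities are the ones on the slide.

```python
from math import prod

one_row = {
    "HNO": {"52": 0.3, "52-A": 0.7},
    "AREA": {"Bandra West": 0.6, "Bandra": 0.4},
    "CITY": {"Bombay": 0.6, "West Bombay": 0.4},
    "PINCODE": {"400 062": 1.0},
}

def one_row_probability(model, record):
    """Columns are independent, so the record probability is a product."""
    return prod(model[col].get(value, 0.0) for col, value in record.items())

p = one_row_probability(one_row, {
    "HNO": "52-A", "AREA": "Bandra West", "CITY": "Bombay", "PINCODE": "400 062"})
print(round(p, 3))

# "West" cannot belong to both AREA and CITY, yet independence gives this mass:
impossible = one_row_probability(one_row, {
    "HNO": "52-A", "AREA": "Bandra West", "CITY": "West Bombay", "PINCODE": "400 062"})
print(round(impossible, 3))
```

The second call hints at why the approximation is poor: treating columns as independent spends probability on combinations that no single segmentation of the address string can produce.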

Multi-row Model

Segmentation generated by a ‘mixture’ of rows

(Rows: Uncertain; Columns: Independent, Uncertain)

HNO            AREA                CITY                PINCODE         Prob
52 (0.167)     Bandra West (1.0)   Bombay (1.0)        400 062 (1.0)   0.6
52-A (0.833)
52 (0.5)       Bandra (1.0)        West Bombay (1.0)   400 062 (1.0)   0.4
52-A (0.5)

Excellent storage/accuracy tradeoff

Populating probabilities challenging

(Gupta and Sarawagi, VLDB 2006)
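Under the multi-row model, the probability of a record is a weighted sum over rows, each row being a product of independent column distributions. The sketch below uses the two rows from the slide; the data layout and the name `multi_row_probability` are assumptions for illustration.

```python
from math import prod

multi_row = [
    (0.6, {"HNO": {"52": 0.167, "52-A": 0.833}, "AREA": {"Bandra West": 1.0},
           "CITY": {"Bombay": 1.0}, "PINCODE": {"400 062": 1.0}}),
    (0.4, {"HNO": {"52": 0.5, "52-A": 0.5}, "AREA": {"Bandra": 1.0},
           "CITY": {"West Bombay": 1.0}, "PINCODE": {"400 062": 1.0}}),
]

def multi_row_probability(rows, record):
    """Mixture of rows: sum over rows of row weight times product of column probabilities."""
    return sum(weight * prod(cols[c].get(v, 0.0) for c, v in record.items())
               for weight, cols in rows)

a = multi_row_probability(multi_row, {
    "HNO": "52-A", "AREA": "Bandra West", "CITY": "Bombay", "PINCODE": "400 062"})
b = multi_row_probability(multi_row, {
    "HNO": "52", "AREA": "Bandra", "CITY": "West Bombay", "PINCODE": "400 062"})
c = multi_row_probability(multi_row, {
    "HNO": "52-A", "AREA": "Bandra West", "CITY": "West Bombay", "PINCODE": "400 062"})
print(round(a, 4), round(b, 4), c)
```

The first value, about 0.5, is much closer to the 0.5 listed for that segmentation in the segmentation-per-row table than the 0.252 given by the one-row model, and the inconsistent AREA/CITY combination now receives probability 0.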

Populating a multi-row model

Challenge
• Learning parameters of a mixture model to approximate the SemiCRF but without enumerating the instances from the model

Solution
• Find disjoint partitions of string
• Direct operation on marginal probability vectors (efficiently computable for SemiCRFs)
• Each partition a row

Experiments: Need for multi-row

• KL very high at m=1. One-row model clearly inadequate.
• Even a two-row model is sufficient in many cases.
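The KL comparison can be illustrated on the address example from the earlier slides. The four segmentation probabilities below follow the segmentation-per-row table, under the assumption that the two 0.2 rows split the string as AREA "Bandra" and CITY "West Bombay"; this reading matches the one-row marginals but is a reconstruction. The numbers are a toy illustration, not the experiments reported in the talk.

```python
import math

# Exact distribution over (HNO, AREA, CITY); PINCODE is certain and omitted.
exact = {
    ("52", "Bandra West", "Bombay"): 0.1,
    ("52-A", "Bandra", "West Bombay"): 0.2,
    ("52-A", "Bandra West", "Bombay"): 0.5,
    ("52", "Bandra", "West Bombay"): 0.2,
}

hno = {"52": 0.3, "52-A": 0.7}
area = {"Bandra West": 0.6, "Bandra": 0.4}
city = {"Bombay": 0.6, "West Bombay": 0.4}
one_row = {k: hno[k[0]] * area[k[1]] * city[k[2]] for k in exact}

def two_row_probability(k):
    row1 = 0.6 * {"52": 0.167, "52-A": 0.833}[k[0]] * (k[1:] == ("Bandra West", "Bombay"))
    row2 = 0.4 * {"52": 0.5, "52-A": 0.5}[k[0]] * (k[1:] == ("Bandra", "West Bombay"))
    return row1 + row2

two_row = {k: two_row_probability(k) for k in exact}

def kl(p, q):
    """KL divergence sum_k p(k) * log(p(k) / q(k)) over the support of p."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

kl_one, kl_two = kl(exact, one_row), kl(exact, two_row)
print(round(kl_one, 3), round(kl_two, 6))
```

On this example the one-row model is far from the exact distribution while the two-row model is nearly indistinguishable from it, in line with the observation that m=1 is inadequate and m=2 often suffices.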

What next in data integration?

• Lots to be done in building large-scale, viable data integration systems
• Online collective inference
  • Cannot freeze database
  • Cannot batch too many inferences
  • Need theoretically sound, practical alternatives to exact, batch inference
• Queries and mining over imprecise databases
  • Models of imprecision for results of deduplication

Thank you.

Summary

Data integration with statistical models: an exciting research direction + a useful problem

Four take-home messages
• Segmentation models (semi-CRFs) provide a more elegant way to exploit entity features and build integrated models (NIPS 2004, ICDE 2006a)
• A-star search adequate for link and cardinality constraints (ICDE 2006a)
• Recipe for combining two top-k searches so that expensive DB lookup features are refined gradually (ICDE 2006b)
• An efficient segmentation model with succinct representation of overlapping features + message passing over partial potentials (NIPS 2005 workshop)

Software: http://crf.sourceforge.net

Outline

• Problem statement and goals
• Models for data integration
  • Information extraction: state-of-the-art
  • Overview: Conditional Random Fields
  • Our extensions to incorporate database of entity names
  • Entity matching
  • Combined model for extraction and matching
  • Extending to multi-relational data

Labeled data: record pairs with labels 0 (red-edges) 1 (black-edges)

Input features: Various kinds of similarity functions between attributes

Edit distance, Soundex, N-grams on text attributes Jaccard, Jaro-Winkler, Subset match

Classifier: any binary classifier CRF for extensibility

AuthorsVariantsJeffrey Ullman

J. Ullmann Jefry UlmanProf. J. Ullman

Jeffrey Smith

M, Stonebraker

J SmithMike Stonebraker

Michael Stonebraker Pedro Domingos

Domingos, P.?
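A short sketch of how a record pair could be turned into similarity features for a match classifier, using two of the listed functions (character n-gram Jaccard and edit distance). The feature names and the `same_last_token` feature are hypothetical additions for illustration.

```python
def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard overlap of character n-gram sets."""
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 0.0

def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pair_features(x1, x2):
    return {
        "jaccard_3gram": jaccard(x1, x2),
        "edit_distance": edit_distance(x1.lower(), x2.lower()),
        "same_last_token": int(x1.split()[-1].lower() == x2.split()[-1].lower()),  # hypothetical
    }

print(pair_features("Jeffrey Ullman", "J. Ullman"))
print(pair_features("Jeffrey Ullman", "Jeffrey Smith"))
```

Each labeled pair becomes one such feature vector with target 1 (match) or 0 (non-match), and any binary classifier can be trained on them.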

CRFs for predicting matches

Given record pair (x1, x2), predict y=1 or 0

Efficiency
• Training: filter and only include pairs which satisfy conditions like at least one common n-gram
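The filtering step can be sketched with an inverted index from character n-grams to record ids, so that only records sharing at least one n-gram are ever paired. The function name `candidate_pairs` and the choice n=3 are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, n=3):
    """Return pairs of record ids that share at least one character n-gram."""
    index = defaultdict(set)  # n-gram -> ids of records containing it
    for rid, text in enumerate(records):
        t = text.lower()
        for i in range(len(t) - n + 1):
            index[t[i:i + n]].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs

records = ["Jeffrey Ullman", "J. Ullman", "Mike Stonebraker", "Pedro Domingos"]
print(candidate_pairs(records))
```

Of the six possible pairs only one survives here, so the classifier is trained and applied on far fewer pairs than the full cross product.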

Link constraints in multi-relational data

Any pair of segments in previous output needs to satisfy two conditions
• Foreign key constraint across their canonical-ids
• Cardinality constraint

Our solution: Constrained Viterbi (branch and bound search)
• Modified search that retains, with the best path, the labels along the path
• Backtracks when constraints are violated

[Figure: comparison of Normal CRF, Semi-CRF, Constrained Viterbi and Compound-label.]

Entity column names in the database:
• Surface patterns, regular expression. Example: pattern X. [X.] Xx* for author name
• Commonly occurring words: Journal, IEEE for journal name
• Ordering of words: part after “In” is journal name
• Similarity-based features

Labeled data:
• Order of attributes: Title before journal name

Canonical links:

Schema-level: cardinality of attributes

Links between entities: what entity is allowed to go with what.

The final picture.

Summary

• Exploiting existing large databases to bridge with unstructured data: an exciting research problem with many applications
• Conditional graphical models to combine all possible clues for extraction/matching in a simple framework
• Probabilistic: robust to noise, soft predictions

Ongoing work:
• Probabilistic output for imprecise query processing

Available clues..

Entity column names in the database:
• Surface patterns, regular expression. Example: pattern X. [X.] Xx* for author name
• Commonly occurring words: Journal, IEEE for journal name
• Ordering of words: part after “In” is journal name
• TF-IDF similarity with stored entities

Labeled data:
• Order of attributes: Title before journal name

Schema-level: cardinality of attributes

Links between entities: what entity is allowed to go with what.

Adding structure to unstructured data

• Extensive research in web, NLP, machine learning, data mining and database communities.
• Most current research ignores existing structured databases
  • Database just a store at the last step of data integration.

Our goal
• Extend statistical models to exploit database of entities and relationships
• Models: persistent, part of database, stored, indexed, evolving and improving along with data.
