Efficient Approximate Entity Extraction with Edit Distance Constraints Wei Wang 1, Chuan Xiao 1,...

Efficient Approximate Entity Extraction with Edit Distance Constraints

Wei Wang¹, Chuan Xiao¹, Xuemin Lin¹ and Chengqi Zhang²

¹ University of New South Wales and NICTA; ² University of Technology, Sydney


Named Entity Recognition

Dictionary-based NER

Dictionary of Entities

Isaac Newton, Sigmund Freud, English, Austrian, physicist, mathematician, astronomer, philosopher, alchemist, theologian, psychiatrist, economist, historian, sociologist, ...

Documents

1 Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.

2 Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst.


Approximate Entity Extraction

What if the data are not cleaned or standardized, e.g., due to typos or multiple representations?

Example (multiple representations): al qaeda, al qaida, al-qaeda, al-qa’ida

Using similarity measures: token-based measures, e.g., Jaccard

$J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$

x = {al, qaeda}, y = {al, qaida}: J(x, y) = 1/3 ≈ 0.33 (x and y match the same entity!)

If we set the threshold t to 0.33, this works well for entities with several tokens, but {al, qaeda} will also match {al, gore}!
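The token-based comparison above can be reproduced with a short sketch. The function name `jaccard`, the treatment of two empty sets, and reading the threshold 0.33 as exactly 1/3 are assumptions made for illustration; they are not part of the slides.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two token sets."""
    if not x and not y:
        return 1.0  # convention assumed here for two empty sets
    return len(x & y) / len(x | y)


t = 1 / 3  # threshold from the slide (0.33), taken here as exactly 1/3

same = jaccard({"al", "qaeda"}, {"al", "qaida"})   # two spellings of one entity
wrong = jaccard({"al", "qaeda"}, {"al", "gore"})   # unrelated entity

# Both pairs share one token out of three distinct tokens,
# so a threshold that accepts the first pair also accepts the second.
print(same >= t, wrong >= t)
```

The sketch shows why a token-level measure cannot separate the two cases: a one-character typo and a completely different token both count as one mismatching token.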


Using Edit Distance Constraints

Using string-based measures: edit distance

Problem Definition: Given a document R and a dictionary E of entities, the task of approximate entity extraction with edit distance threshold d is to find all substrings in R that are within edit distance d of one of the entities in E.

$\{\, R[i\,..\,j] \mid \exists\, E_k \in E,\ ed(R[i\,..\,j], E_k) \le d \,\}$
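As a baseline for the problem definition, the following is a minimal brute-force sketch, not the method proposed in these slides. The names `edit_distance` and `extract_naive`, the 0-based half-open positions, and the sample document are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def extract_naive(doc: str, entities: list, d: int) -> list:
    """Return every (i, j, entity) with ed(doc[i:j], entity) <= d."""
    results = []
    for e in entities:
        # Substrings whose length differs from |e| by more than d cannot match.
        lo, hi = max(1, len(e) - d), len(e) + d
        for i in range(len(doc)):
            for length in range(lo, hi + 1):
                j = i + length
                if j > len(doc):
                    break
                if edit_distance(doc[i:j], e) <= d:
                    results.append((i, j, e))
    return results


# Hypothetical document containing a variant spelling of the entity.
hits = extract_naive("the al-qaida group", ["al qaeda"], d=2)
print(hits)
```

Every substring is verified against every entity here, which is exactly the cost that filtering and indexing techniques try to avoid.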


Previous Approaches

q-gram based method

count filtering: s and t must share at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) - q + 1 - q*d, because at most q*d q-grams are destroyed

position filtering: positions of common q-grams should be within d

length filtering: | len(s) - len(t) | ≤ d

Steps: (1) index the q-grams of the entities; (2) probe the index with the q-grams of each substring (query) of the document; (3) form candidates; (4) verify the candidates

Example, q = 3: Rhode_Island has the q-grams Rho, hod, ode, de_, e_I, _Is, Isl, sla, lan, and
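The count and length filters can be sketched as follows. This is an illustrative sketch only: the function names are assumptions, position filtering is omitted, and the misspelling `Rhose_Island` is a made-up test string.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_lower_bound(s: str, t: str, q: int, d: int) -> int:
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d."""
    return max(len(s), len(t)) - q + 1 - q * d


def passes_filters(s: str, t: str, q: int, d: int) -> bool:
    """Length filter followed by count filter (no position filter)."""
    if abs(len(s) - len(t)) > d:
        return False
    # Multiset intersection counts each shared q-gram once per occurrence.
    common = sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())
    return common >= count_lower_bound(s, t, q, d)


print(qgrams("Rhode_Island", 3))
print(count_lower_bound("Rhode_Island", "Rhose_Island", 3, 1))
print(passes_filters("Rhode_Island", "Rhose_Island", 3, 1))
```

A pair that passes is only a candidate and still has to be verified with an edit distance computation. When the lower bound is zero or negative, the count filter prunes nothing.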


Drawbacks of q-gram Based Methods

entities are short: we have to use a small q to ensure that the lower bound on matching q-grams is positive

short q-grams result in poor performance: short q-grams are frequent, so the inverted lists are long; the lower bound is low for short entities, so the candidate size is large

The method has to try all queries with lengths from Lmin - d to Lmax + d at every starting position.

Document

1 Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.

Dictionary (Lmin=9, Lmax=43)

1 physicist

2 mathematician

3 Philosophiæ Naturalis Principia Mathematica
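To make the cost of trying every query length concrete, here is a small sketch that enumerates the windows for the dictionary above. The function name `query_windows` and the threshold `d = 2` are assumptions chosen for illustration.

```python
def query_windows(doc: str, lmin: int, lmax: int, d: int):
    """Yield (start, end) for every substring with length in [lmin - d, lmax + d]."""
    lo, hi = max(1, lmin - d), lmax + d
    for start in range(len(doc)):
        for length in range(lo, hi + 1):
            if start + length > len(doc):
                break
            yield start, start + length


dictionary = ["physicist", "mathematician",
              "Philosophiæ Naturalis Principia Mathematica"]
lmin = min(map(len, dictionary))
lmax = max(map(len, dictionary))

d = 2  # hypothetical edit distance threshold
# Number of query lengths tried at each starting position
# (ignoring truncation near the end of the document).
lengths_per_start = (lmax + d) - (lmin - d) + 1
print(lmin, lmax, lengths_per_start)
```

With one very long entity in the dictionary, the range of query lengths is wide even though most entities are short.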


FastSS Algorithm [T. Bocek et al. 2007]

Basic Idea (Neighborhood Generation): generate the variants of each entity and query by enumerating edit operations at every possible position

Steps: (1) for each entity, enumerate all strings obtained by at most d deletions; the resulting strings are called its d-variant family and are inserted into an inverted index; (2) generate the d-variant family of each query, probe the index to form candidates, and then verify them

Example, d = 1: e = qaeda, q = qaida

Ve = {qaeda, aeda, qeda, qada, qaea, qaed}

Vq = {qaida, aida, qida, qada, qaia, qaid}

Problem: the size of the d-variant family of each entity (query) is $O(|s|^d)$; too many variants when entities are long or d is large!
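The d-variant family from the example can be generated with a few lines. This is a sketch of deletion-only neighborhood generation; the function name `variant_family` is an assumption and the code is not taken from the FastSS report.

```python
def variant_family(s: str, d: int) -> set:
    """All strings obtained from s by deleting at most d characters."""
    family = {s}
    frontier = {s}
    for _ in range(d):
        # Delete one more character from every string produced so far.
        frontier = {v[:i] + v[i + 1:] for v in frontier for i in range(len(v))}
        family |= frontier
    return family


ve = variant_family("qaeda", 1)
vq = variant_family("qaida", 1)
shared = ve & vq  # a shared variant makes (e, q) a candidate pair
print(sorted(ve), sorted(vq), shared)
```

A shared variant only yields a candidate, which is then verified. The family size grows quickly with d, which is the problem stated above.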


Partitioning Scheme

How to reduce the number of variants?

immediate solution: divide an entity (query) into several partitions and generate d-variants within each partition only; this is guaranteed not to miss any result

still too many variants? Use the pigeon-hole principle: if we consider shifting and scaling, there exists an entity partition and a query partition such that their edit distance is within 1

so it suffices to generate the 1-variant family for each partition

divide each entity (query) into k = ceil[(d+1)/2] partitions
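The choice of k can be motivated by a short pigeon-hole argument. The notation $d_i$ (the number of edit operations that fall into partition i) is introduced here for illustration and does not appear in the slides; the argument also glosses over how partition boundaries move, which is what shifting and scaling account for.

```latex
\[
k = \left\lceil \frac{d+1}{2} \right\rceil
\quad\Longrightarrow\quad 2k \ge d + 1 > d .
\]
\[
\text{If } d_i \ge 2 \text{ for every } i = 1, \dots, k, \text{ then }
\sum_{i=1}^{k} d_i \;\ge\; 2k \;>\; d ,
\]
\[
\text{which contradicts } \sum_{i=1}^{k} d_i \le d .
\text{ Hence } d_i \le 1 \text{ for at least one partition } i .
\]
```

For example, d = 3 gives k = 2: three edit operations spread over two partitions leave at least one partition with at most one edit.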

Partitioning Scheme

divide each entity (query) into k = ceil[(d+1)/2] partitions

shift within the range of [-d, d]

scale within the range of [-2, 2] (it can be proved that 2 is enough)

shifting and scaling are only needed on entities

special cases:

first partition: always starts from the first character, so we only need to consider scaling within [-2, 2]

last partition: always ends with the last character, so we only need to consider the same amount of shifting and scaling within [-d, d]


Partitioning Scheme - Example

Example, d = 3

e = abcdefgh

q = axxbcdefgyh

Partitioning, k = 2 (partitions are represented in the form <str, partition_id>)

Pe = {<ab,1>; <abc,1>; <abcd,1>; <abcde,1>; <abcdef,1>; <bcdefgh,2>; <cdefgh,2>; <defgh,2>; <efgh,2>; <fgh,2>; <gh,2>; <h,2>}

Pq = {<axxbc,1>; <defgyh,2>}

Generating 1-variants: V{defgh} and V{defgyh} share a common variant ‘defgh’, so this candidate will be identified
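The partition sets in this example can be reproduced with the sketch below. It assumes an even split in which every partition gets floor(n/k) characters and the last one takes the remainder; that assumption is consistent with the example but is not stated on the slides. All function names are illustrative.

```python
import math


def num_partitions(d: int) -> int:
    return math.ceil((d + 1) / 2)


def base_bounds(n: int, k: int) -> list:
    """Assumed even split: floor(n/k) characters each, remainder to the last."""
    size = n // k
    return [(p * size, (p + 1) * size if p < k - 1 else n) for p in range(k)]


def query_partitions(q: str, d: int) -> list:
    k = num_partitions(d)
    return [(q[b:e], p + 1) for p, (b, e) in enumerate(base_bounds(len(q), k))]


def entity_partitions(e: str, d: int) -> list:
    """Shifted and scaled entity partitions as (str, partition_id) pairs."""
    n, k = len(e), num_partitions(d)
    out = []
    for p, (b, end) in enumerate(base_bounds(n, k)):
        first, last = p == 0, p == k - 1
        # The first partition always starts at the first character: no shifting.
        shifts = [0] if first else range(-d, d + 1)
        for sh in shifts:
            start = b + sh
            if last:
                stops = [n]  # the last partition always ends at the last character
            else:
                stops = [end + sh + sc for sc in range(-2, 3)]  # scale by [-2, 2]
            for stop in stops:
                if 0 <= start < stop <= n:
                    out.append((e[start:stop], p + 1))
    return out


def one_variants(s: str) -> set:
    """The string itself plus every single-deletion variant."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


pe = entity_partitions("abcdefgh", 3)
pq = query_partitions("axxbcdefgyh", 3)
common = one_variants("defgh") & one_variants("defgyh")
print(pe, pq, common)
```

Only the entity side is shifted and scaled; the query is split once, so the extra work is paid at indexing time rather than while scanning documents.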


Prefix Pruning

What if a partition is still quite long? There are still many 1-variants. Solution: generate the 1-variant family on a prefix only!

Prefix Pruning: if a partition is longer than a threshold l, we only generate the 1-variant family on its l-prefix.

Example, l = 5: P = abcdefg; generate the 1-variant family on its 5-prefix

P[1 .. 5] = abcde

Vp[1 .. 5] = {abcde, bcde, acde, abde, abce, abcd}

space complexity (number of variants generated): FastSS: $O(|s|^d)$; after partitioning and prefix pruning: $O(l \cdot d^2)$
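Prefix pruning for the example partition can be sketched as follows; the function names are assumptions. Note that the 5-prefix abcde has five single-deletion variants (bcde, acde, abde, abce, abcd) in addition to itself.

```python
def one_variants(s: str) -> set:
    """The string itself plus every single-deletion variant."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def pruned_one_variants(partition: str, l: int) -> set:
    """1-variant family of the l-prefix when the partition is longer than l."""
    prefix = partition[:l] if len(partition) > l else partition
    return one_variants(prefix)


vp = pruned_one_variants("abcdefg", 5)
print(sorted(vp))
print(len(vp), "variants instead of", len(one_variants("abcdefg")))
```

The number of variants per partition is bounded by l + 1 regardless of the partition length, at the cost of lower selectivity because only the prefix is compared.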


NGPP Algorithm

Neighborhood Generation + Partitioning + Prefix Pruning

Balance between variant size and selectivity: different schemes to deal with short and long entities

Index short and long entities

short: for entities shorter than k*l+d, we index the d-variant family on the l-prefix (prefix pruning only)

long: for entities no shorter than k*l, we first divide them into k partitions, and index the 1-variant family on the l-prefix of each partition (partitioning + prefix pruning)

Scan documents: for each starting position, enumerate the query length from Lmin - d to l; generate the query's d-variant family to search for short entities, and its 1-variant family to search for long entities
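The length rules for short and long entities can be written down directly. This sketch only classifies entities and does not build the index. The function name `classify` is an assumption. The two ranges overlap for lengths in [k*l, k*l+d), and the slides do not say how such entities are handled, so the sketch simply reports both labels.

```python
import math


def classify(entity: str, d: int, l: int) -> set:
    """Return the NGPP index type(s) whose length rule the entity satisfies."""
    k = math.ceil((d + 1) / 2)
    kinds = set()
    if len(entity) < k * l + d:
        kinds.add("short")  # d-variant family on the l-prefix
    if len(entity) >= k * l:
        kinds.add("long")   # k partitions, 1-variant family on each l-prefix
    return kinds


# With d = 2 and l = 4: k = 2, so short means length < 10 and long means length >= 8.
print(classify("Providence", 2, 4))
print(classify("capital", 2, 4))
```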


NGPP Example

d = 2, l = 4, so short < 10 and long >= 8

Entities

e1 = ‘Providence’ (long): generate the 1-variant family on pr, pro, prov, provi, provid (first partition) and vidence, idence, dence, ence, nce (last partition)

e2 = ‘capital’ (short): generate the d-variant family

Document: Prowidnce is the kaepital of Rhode Island.

Query substrings Prow, rowi, owid, …: 1-variant match with e1 (Providence)

Query substring kaep, …: d-variant match with e2 (capital)
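The two matches in this example can be checked with a short sketch. It assumes case-insensitive comparison (the slide lists the entity partitions in lower case), and it takes `prov` and `capi` (the 4-prefix of capital) as the entity-side strings, which is one reading of the figure rather than something the slides state. Function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def variant_family(s: str, d: int) -> set:
    """All strings obtained from s by deleting at most d characters."""
    family, frontier = {s}, {s}
    for _ in range(d):
        frontier = {v[:i] + v[i + 1:] for v in frontier for i in range(len(v))}
        family = family | frontier
    return family


d = 2

# Long entity: a 1-variant of the query window "prow" meets a 1-variant of "prov".
long_hit = variant_family("prow", 1) & variant_family("prov", 1)

# Short entity: a d-variant of the query window "kaep" meets a d-variant of "capi".
short_hit = variant_family("kaep", d) & variant_family("capi", d)

# Verification of the candidates against the full strings.
ed_long = edit_distance("prowidnce", "providence")
ed_short = edit_distance("kaepital", "capital")
print(long_hit, short_hit, ed_long, ed_short)
```

Both candidates survive verification because each misspelled word is within edit distance 2 of its entity.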


Experiment Settings

Algorithms: NGPP, FastSS, q-gram based method

Measures: number of variants, candidate size, running time

Datasets

dataset | DICT / DOC (content) | # of records | avg. string length
DBLP | DICT (author) | 108k | 14.5
DBLP | DOC (author, title) | 87k | 104.7
GENE | DICT (gene/protein name) | 381k | 22.4
GENE | DOC (author, title, abstract) | 10k | 870.0
CONLL | DICT (person, location) | 8k | 12.6
CONLL | DOC (news article) | 19k | 819.0


Experiment Results

NGPP vs FastSS (DBLP; d = 2)

algorithm | # of variants | candidate size | running time
FastSS | 7500M | 2.1M | 2643s
NGPP (l = 10) | 150M | 11M | 40s

Experiment Results

NGPP vs q-gram based method (DBLP; d = 1, 2, 3)

[Figure: two plots, Candidate Size and Running Time]

Conclusion

Contributions

an efficient algorithm for approximate entity extraction with edit distance constraints, based on neighborhood generation

two techniques to reduce the number of variants generated, as well as the running time: partitioning and prefix pruning

Future work

approximate multiple pattern matching

other similarity measures, e.g., the function used in DNA/protein sequence alignment


Thank you!

Questions?


Related Work

neighborhood generation approaches

E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345–374, 1994.

T. Bocek, E. Hunt, and B. Stiller. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007.

q-gram based approaches

L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.

C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008.

alternative: use vgrams instead of q-grams

C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.

X. Yang, B. Wang, and C. Li. Cost-based variable length gram selection for string collections to support approximate queries efficiently. In SIGMOD, 2008.
