Applying search technology to data matching consolidation...

24
The MIT Information Quality Industry Symposium, 2007 Applying search technology to data matching consolidation and cleansing matching, consolidation, and cleansing Jeff Fried VP Advanced Solutions [email protected]

Transcript of Applying search technology to data matching consolidation...

Page 1: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007

Applying search technology to data matching consolidation and cleansingmatching, consolidation, and cleansing

Jeff FriedVP Advanced Solutions

[email protected]

Page 2: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007

All Data is DirtyForrt Lauderdale Fort LauderdaleqFort ALuderdale Fort LauderrdaleFort Laderdale Fort LaudewrdaleFort Laderdale Fort LaudewrdaleFort Laudedale FORT LUADERDALEFort Lauderale Fort. LauderdaleFort Lauderdal Fort.LauderdaleFort Lauderdale Ft LauderdaleFORT LAUDERDALE FLA Ft LauderdaleFORT LAUDERDALE FLA. Ft. LauderdaleFort LAuderdale, FT. LAUDERDALEFtLauderdale FT.LAUDERDALE

Page 3: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007

Search technology emphasizes connecting to the User

FAST C t U T I f ti P d t d C iti

USERS

FAST Connects Users To Information, Products and Communities

FAST Platform & SolutionsFrontOffice: Search IS the Portal

Search Connects: Service Delivery Framework

BackOffice: Adaptive Information Warehouse

Information CommunitiesProducts

XML

3

XML

Page 4: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007The Adaptive Information Warehouse

The first business intelligence solution built on search. Redefining how organizations access, interact with and exploit

information

The Adaptive Information Warehouse (AIW)

information.

STR

UC

TUR

ED

DAT

A

RIC

H M

EDIA

UN

STR

UC

TUR

ED

DAT

A

Clean Data

Integrated intuitive business intelligence

Advanced data integration and

Dynamic, real-time structured and

Data Cleansing Radar

business intelligence portal

integration and cleansing

FAST ESP

structured and unstructured data

4

g

Page 5: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007

Customer Records from Corporate CRM Database

Match, merge,

cleanse.

Regional Sales Call Center /

5

g

Customer Service Database

Page 6: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007

Data matching, cleansing, and consolidation

• Biggest problem for all data warehouses– Merge of multiple sources of data– No common keyy

• FAST solution: unique combination of traditional DWH logic (ETL) and fuzzy matchingand fuzzy matching– Full ETL tool for the traditional ETL logic– Use search engine core for fuzzy match

Apply linguistic techniques to data quality issues– Apply linguistic techniques to data quality issues

6

Page 7: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007D

Schibsted built Internet Yellow Pages and White Pages from scratch in 8 weeks outperforming

e isting pro iders on data q alit

Cleansing Example:ST

RU

CTU

RE

DA

TA

existing providers on data quality98 % Cleansing Precision

NST

RU

CTU

RED

DA

TA

MED

IAU

N

Clean Data

RIC

H

►Semantic and PhoneticCleansing Analysis► Content Fusion

7

Page 8: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Example system configuration

Ling. matching

XMLXML

XML

Master database

’ambigous’ database

Editorial Inspection GUI

Cleansing

”Back office”database p

”Front office”

Search Portal

Data Load For Search Search Portal

Page 9: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Data Warehouse of the Year Award 2006

Find the business who called our

RecruitmentThree Years of Financial Data

Sesam creates composite objects

by joining and cleansing data

Recruitment Officer showing growth

and profits

Blended Content:

The Chairman is born in 1942,

lives in Vinterbro d

A blog on the founder of

Boyden International

and runs a competing firm

as well ☺

The CEO, Bjørn Boberg called us.

The cellphone

9

pnumber is

registered to him.

Page 10: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Data Cleansing Scenario One: ”Master” DB exists

One ’ Master’ exists, other datasources are cleansed towards it

Step one:

- The Customer Master is indexed by FAST ESP

Step two:

A datasource is cleansed according to the Master

Result: a corrected version of that datasource

10

Page 11: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Data Cleansing Scenario Two: No ”Master” DB

Two or more databases exist, and a merge is needed

Incident records from Central Region Cleansed Unioned Incident list

Incidents from East Region

Duplication identification and removalDuplication identification and removalApproximate matching to help oversee minor errorsMulti-field compare to ensure secure identificationTunable relative importance of each fieldMore than 20 million names known by the plattformSpell checking and error correction included

11

Private attributes from eiher source are merged to a super-set

Page 12: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Search Is a Matching Service

3 Gram

ABU BUK UKA KAB ABL BLA LAN AN, N, , K KH KHA HAL ALI LID ID D H HU HUS USS SSE SEI EIN IN N O OC OCE CEA EAN AN N B BL BLV LVD

3-Gram

ABUKABLAN, KHALID HUSSEIN OCEAN BLVD 33062 POMPANO BEACH

Score

131 7763 Gram

KAB ABL BLA LAN AN, N, , K KH KHA HAL ALI LID ID D H HU HUS USE SEI EIN IN N O OC OCE CEA EAN AN N B BUL ULE LEV EVA VAR ARD

131 7763-Gram

KABLAN, KHALID HUSEIN OCEAN BULEVARD POMPANO BEACH

12

Search improves Data Quality through de-duplication

Page 13: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Linguistic Capabilities Provide High Data Matching Accuracy

- Score strings on similarity- Lai Nancy, 501 N. Clinton St. ≈ Nancy J. Lai, 501 North Clinton

- Levenshtein edit distance algorithms- Calculates the difference on two strings (names)- John Meyers ≈ John P. Meiers; Britney Spears ≈ Britney Spaers

- Phonetic match- Calculates an integer from the strings, and matches on ’sounds

like’Ch i i K i i Ki k d Ki k å d- Christin ≈ Kristin; Kierkegaard ≈ Kierkegård

- Scope Fields (new in FAST Instream 4.1 and FAST ESP 5.0)- Finds relevancy in a scope or a context – for example ’Person’ vs.Finds relevancy in a scope or a context for example Person vs.

’Address’.

- ’Charset’ normalizationF lti ti l h d bi ti- For multinational chars and name combinations

- André = Andre, Jürgen = Jurgen etc.

Page 14: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007

Linguistic FundamentalsLexicon Base Basic Linguistic Algorithms

Geographical and people’s names

Special terminology lexica Pattern extraction

Stemming / Lemmatization

Part-of-speech Tagging

Language normalization

Vectorization

Subject-specific ontologies

Spellcheck dictionaries

Geographical and people s names Stemming / Lemmatization

Part-of-speech Dictionaries

Synonymy Dictionaries Categori-zation

Entity Extraction

Stop word Spell Lang ID (77

Include all forms (10

languages)

Language-specific Common Words

Inflection DictionariesDrill-down & Refinement

Suggest Synonyms

Find similar

Stop word elimination

Spell checking

Lang ID (77 languages)

14

Applications

Page 15: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Where do Linguistics and other search technologies apply to data quality?

• On textual and/or multimedia data• In combination with other rules based techniques• In combination with other rules-based techniques• In hard-to-match situations

– Match and cleanse with significantly fewer rules– Match and cleanse with much higher accuracy

• Where structuring “blobs” of text is useful– Entity Extractiony– Relationship Extraction

Page 16: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Data Consolidation and ETL Studio

Direct access to RDBMsfor info from some Telco’s Lookup to ESP for

Matching Logic for matching

XML feed from other Telco’s

Ordered hits (by quality)Logic for matchingand cleansing

XML

Cleansed data Back to ESP for

next round Cleansing

or for ’search’

Flat files (CSV or fixed)from the ’laggards’

Ambigous data

(close hits or unidentified)

clean data

’Error’ database for manual inspection, correction, storage/learning

Logic for ETL

16

Master database for persistant storage

Page 17: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007

Key Cleansing Benefits

• Increased accuracy in search

• Improved entity recognition and relationship mappingImproved entity recognition and relationship mapping

• Better analytics and intelligence

• Improvement and reuse of data for all applications

17

Page 18: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Cleansing: Impact on Discovery

18

Page 19: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007The Largest BI Limitations Today

What else is wrong with BI today?If We Build It,

<50% of Them Will Come

“Is everyone using BI who should be using BI?”

Source: Gartner

Page 20: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007BI Built on Search Redefines How Organizations Access, Interact With and Exploit Business IntelligenceExploit Business Intelligence

Drivers of Pervasive BI:ease of use

• Better business decisions with

ease of usedemocratize information

perform deeper analytics

currently just a handful of power usersaccurate, enriched data

• Increased productivity with ad-hoc real-time performance

high quality data

cu e t y just a a d u o po e use s

hoc, real-time performance

• Increased adoption with easy and intuitive access to

break down silos

real-time information

intelligence timely cross-enterprise information

integrate structured and unstructured data

speed-up decision making

suppress noise in data

Page 21: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007BI Built on Search Redefines How Organizations Access, Interact with and Exploit Informationp

1.2.2.

16

18

20

. Que

ries

FASTOracle

6

8

10

12

14

# qu

erie

s

No.

Better Business decisions with

3.

0

2

4

6

1/16 1/8 1/4 1/2 1 2 4 8 16 32

[sec.]

Latency (S) (logarithmic/compressed scale)

accurate data

Increased productivity with ad-hoc, real-time performance

Latency (S) (logarithmic/compressed scale)

Increased Adoption with easy andIncreased Adoption with easy and intuitive access to intelligence

Page 22: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007Analytical Portals built on Search

22

Page 23: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007FAST Vision For Pervasive Business Intelligence

AdAd--hoc, FAST access hoc, FAST access

Exabyte scale information managementExabyte scale information managementStructured/SemiStructured/Semi--structured/Unstructuredstructured/UnstructuredVVV LDB

n

EASY analytics EASY analytics

Data ImprovementData ImprovementData ImprovementData Improvement“Content Capture and Refinement”“Content Capture and Refinement”

Connections across informationConnections across information“Semantic Integration”“Semantic Integration”

Real Time informationReal Time information

Page 24: Applying search technology to data matching consolidation ...mitiq.mit.edu/ERIQ/2007/iq_sym_07/Sessions/Session 3F/Session 3F... · matching consolidation and cleansingmatching, consolidation,

The MIT Information Quality Industry Symposium, 2007

Thank you!Thank you!

find the real value of search

Jeff FriedVP Advanced Solutions

24

[email protected]