Data Cleaning & Integration - Polo Club of Data...

Post on 03-Feb-2018

CSE6242 / CX4242: Data & Visual Analytics
http://poloclub.gatech.edu/cse6242

Data Cleaning & Integration

Duen Horng (Polo) Chau Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Last Time

Big data analytics building blocks

Data collection & simple data storage

• Why SQLite?

• Simplicity: nothing to install/maintain, database in a single file

• Popular: cross-platform, cross-device

• SQL basics (create table, join, create index, etc.)
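
As a quick refresher, the SQL basics above can be sketched with Python's built-in sqlite3 module. The table name, column names, and rows are made up for illustration, and an in-memory database stands in for the single database file.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; passing a filename instead
# (e.g. a hypothetical "example.db") stores the whole database in one file.
conn = sqlite3.connect(":memory:")

# Create a table (illustrative schema, not taken from the lecture).
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO people (id, name, state) VALUES (?, ?, ?)",
    [(1, "Alice", "GA"), (2, "Bob", "NY")],
)

# Create an index so lookups by name do not need a full table scan.
conn.execute("CREATE INDEX idx_people_name ON people (name)")

rows = conn.execute(
    "SELECT id, state FROM people WHERE name = ?", ("Alice",)
).fetchall()
print(rows)
```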

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Data Cleaning

Why can data be dirty?

Examples

• …

How dirty is real data?

Examples

• duplicates

• empty rows

• abbreviations (different kinds)

• differences in scale / inconsistent descriptions / units sometimes included

• typos

• missing values

• trailing spaces

• incomplete cells

• synonyms of the same thing

• skewed distribution (outliers)

• bad formatting / not in relational format (in a format not expected)
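
A few of the problems listed above (trailing spaces, empty rows, abbreviations and synonyms, duplicates) can be handled with a short script. The following is only a minimal sketch in plain Python: the helper name `clean_rows`, the sample rows, and the synonym table are all assumptions made for illustration, and real data usually needs tool support such as the data cleaners covered next.

```python
def clean_rows(rows, synonyms):
    """Return rows that are trimmed, non-empty, normalized, and de-duplicated."""
    cleaned = []
    seen = set()
    for row in rows:
        # Trim leading and trailing spaces in every cell.
        cells = [cell.strip() for cell in row]
        # Drop rows in which every cell is empty.
        if not any(cells):
            continue
        # Map known abbreviations/synonyms to one canonical spelling.
        cells = [synonyms.get(cell.lower(), cell) for cell in cells]
        # Drop exact duplicates (compared after normalization).
        key = tuple(cells)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(cells)
    return cleaned


# Hypothetical dirty input: a trailing space, a synonym, an empty row,
# and an abbreviation written with periods.
raw = [
    ["Smith ", "GA"],
    ["Smith", "Georgia"],
    ["", ""],
    ["Johnson", "N.Y."],
]
synonyms = {"georgia": "GA", "n.y.": "NY"}

result = clean_rows(raw, synonyms)
print(result)
```

Typos, missing values, and outliers are not covered here; they generally need domain knowledge or interactive inspection rather than a fixed rule.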

(Fall’14)

How dirty is real data?

More to read

Big Data's Dirty Problem [Fortune]
http://fortune.com/2014/06/30/big-data-dirty-problem/

A Taxonomy of Dirty Data [Won Kim+] (very detailed, may be slightly outdated)
http://sci2s.ugr.es/docencia/m1/KimTaxonomy03.pdf

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights [New York Times]
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0

Data Cleaners

Watch videos

• Open Refine (previously Google Refine)

• Data Wrangler (research at Stanford)

Write down

• Examples of data dirtiness

• Tool’s features demo-ed (or that you like)

Will collectively summarize similarities and differences afterwards

Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/

How are the tools similar or different?

• …

G = Google Refine
W = Data Wrangler

The videos only show some of the tools’ features. Try them out.

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/

Data Integration

Course Overview

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

What is Data Integration? Why is it Important?

Data Integration

Combining data from different sources to provide the user with a unified view

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges

How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)

Examples of businesses based on data integration

Mashup

More Examples?

• [FREE] Mint: account app, integrates multiple accounts (credit card, bank, etc.), can parse receipts

• Google News

• Crime mapping

• Feedly

• apps that check gas prices, coupons

• Zillow / Trulia / Redfin

• IMDb (movie database)

• Coin: combines multiple credit cards

• eBay

More Examples?

• Palantir Gotham

• Yelp: restaurant reviews, business reviews

• Facebook friend requests: look at your friends’ friends and recommend them as your friends

• Trulia / Zillow (real estate sites)

• Graph Search (Facebook)

• Waze

• Yahoo Pipes

• Google search engine

• Google Transit

• Google Now / Apple Siri
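
The friends-of-friends idea in the Facebook example above can be sketched as a mutual-friend count over a small graph. This is only an illustration of the general idea, not Facebook's actual method; the function name `recommend_friends` and the toy graph are assumptions.

```python
from collections import Counter


def recommend_friends(graph, user):
    """Rank people who are not yet friends of `user` by number of mutual friends."""
    friends = graph.get(user, set())
    counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user and anyone who is already a friend.
            if candidate != user and candidate not in friends:
                counts[candidate] += 1
    # Most mutual friends first; ties are broken alphabetically.
    return sorted(counts, key=lambda c: (-counts[c], c))


# Hypothetical undirected friendship graph (adjacency sets).
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}

suggestions = recommend_friends(graph, "ann")
print(suggestions)
```

Here "dan" is reachable through two of ann's friends and "eve" through one, so "dan" is ranked first.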

How to do data integration?

“Low” Effort Approaches

Use database’s “Join”! (e.g., SQLite)

Google Refine: http://code.google.com/p/google-refine/ (video #3)

id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA
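
The join shown in the tables above can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch: the table names `names` and `states` are assumptions (the slide does not name the tables), and the rows are the ones from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example

# Two source tables that share the "id" column (table names are hypothetical).
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
)
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [(111, "GA"), (222, "NY"), (333, "CA")],
)

# Inner join on the shared id to get the unified (id, name, state) view.
joined = conn.execute(
    "SELECT names.id, names.name, states.state "
    "FROM names JOIN states ON names.id = states.id "
    "ORDER BY names.id"
).fetchall()

for row in joined:
    print(row)
```

An inner join keeps only ids present in both tables; if some ids could be missing from one source, a LEFT JOIN would keep those rows with a NULL in the missing column.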

Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

Freebase (a graph of entities)

“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”
Wikipedia.

So what? What can you do with Freebase?

Hint: Google acquired it in 2010

Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7

http://www.google.com/insidesearch/features/search/knowledge.html

Given a graph of entities, like Freebase, what other cool things can you do?

https://www.facebook.com/about/graphsearch

Facebook’s Graph Search

Integrate your friends’ info with yours

Feldspar: Finding Information by Association

CHI 2008 Polo Chau, Brad Myers, Andrew Faulring

Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E

Summary for data integration

Opportunities

• enable new services (Siri, padmapper)

• enable new ways to discover info

• improve existing services

• reduce redundancy

• new ways to interact with data

• promote knowledge transfer (e.g., between companies)