Data Cleaning & Integration
CSE6242 / CX4242
Jan 14, 2014
Duen Horng (Polo) Chau
Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Last Time
Data collection & simple data storage
• Why SQLite?
  • Simplicity: nothing to install/maintain, database in a single file
  • Popular: cross-platform, cross-device
• SQL basics (create table, join, create index, etc.)

Big data analytics building blocks
Collection
Cleaning
Integration
Visualization
Analysis
Presentation
Dissemination
Data Cleaning
How dirty is real data?
Data Cleaners
Watch videos
• Google Refine
• Data Wrangler (research at Stanford)

Write down
• Examples of data dirtiness
• Features of each tool that were demoed (or that you like)

We will collectively summarize similarities and differences afterwards.

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/
How dirty is real data?
Examples
• no specific schemas / different names for the same thing / numbers and text mixed
• trailing spaces / text not relevant to data
• different units / data out of range (unrealistic) / skewed data distributions
• missing values / missing rows entirely
• file formats
• text may not be where you want it to be (maybe in a different column)
• improper merge of two tables
• duplicates
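A few of the problems listed above (trailing spaces, numbers stored as text, missing values, duplicates) can be handled with a short script. The following is only a minimal sketch in Python: the function name `clean_rows`, the set of strings treated as "missing", and the sample rows are all assumptions made for illustration, not part of the lecture.

```python
def clean_rows(rows):
    """Strip whitespace, unify missing values, parse numbers, drop duplicates."""
    # Assumed markers for "missing"; real data sets use many others.
    missing_markers = {"", "n/a", "na", "null", "-"}
    seen = set()
    cleaned = []
    for row in rows:
        new_row = []
        for cell in row:
            text = cell.strip()                    # trailing/leading spaces
            if text.lower() in missing_markers:    # unify missing values
                new_row.append(None)
                continue
            try:
                # Numbers stored as text, possibly with thousands separators.
                new_row.append(float(text.replace(",", "")))
            except ValueError:
                new_row.append(text)               # keep as plain text
        key = tuple(new_row)
        if key not in seen:                        # drop exact duplicates
            seen.add(key)
            cleaned.append(new_row)
    return cleaned


# Hypothetical dirty rows: (name, state, amount)
raw = [
    ["Smith ", "GA", "1,200"],
    ["Smith", "GA", "1200"],
    ["Johnson", "N/A", " 35"],
]
cleaned = clean_rows(raw)
print(cleaned)
```

After cleaning, the first two rows become identical, so only one of them is kept. Problems such as different units or out-of-range values need domain knowledge and are not covered by this sketch.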
How are they similar?
• mass/batch conversion
• graph/chart visualization
• heuristics (e.g., group in G, selection in W)
• removing redundancy
• tracking changes / history / undo-redo
• table based
• suggestions (what to fix)
• filtering (show less)

G = Google Refine
W = Data Wrangler
How are they different?
• G has a clustering feature
• W has format conversion (1 column spread into multiple)
• W can export actions as scripts
• G supports offline mode (online too?)
• W extracts part of text into a new column
• W can copy and paste
• W allows you to preview changes
• W uses colors to indicate different kinds of changes
• G can show statistics

G = Google Refine
W = Data Wrangler
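To give a feel for what a clustering feature can do, here is a minimal sketch of key-collision clustering: values that reduce to the same normalized key are grouped as likely variants of the same thing. This is an illustration only and is not Google Refine's actual implementation; the function names `fingerprint` and `cluster` and the sample values are assumptions.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a key that ignores case, punctuation, word order, repeated words."""
    text = value.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical column containing different spellings of the same entity
names = ["Georgia Tech", "georgia tech ", "Tech, Georgia", "Stanford"]
clusters = cluster(names)
print(clusters)
```

A tool would typically show each cluster to the user and let them pick one spelling to apply to all members, rather than merging automatically.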
The videos only show some of the tools’ features. Try them out.
Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/
Data Integration
Course Overview
Collection
Cleaning
Integration
Visualization
Analysis
Presentation
Dissemination
What is Data Integration? Why is it Important?
Data Integration
Combining data from different sources to provide the user with a unified view

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges

How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)
Examples of businesses based on data integration
Mashup
More Examples?
• Palantir Gotham
• Yelp: restaurant reviews, business reviews
• Facebook friend requests: look at your friends’ friends and recommend them as your friends
• Trulia / Zillow (real estate sites)
• Graph Search (Facebook)
• Waze
• Yahoo Pipes
• Google search engine
• Google Transit
• Google Now / Apple Siri
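The friend recommendation idea in the list above (look at your friends' friends) can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Facebook actually does it: the function name `recommend_friends`, the ranking by number of mutual friends, and the sample friendship data are all hypothetical.

```python
from collections import Counter


def recommend_friends(friends, user):
    """Rank non-friends by how many mutual friends they share with `user`."""
    direct = friends.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip the user themselves and people who are already friends.
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical friendship lists (each person maps to their set of friends)
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
suggestions = recommend_friends(friends, "ann")
print(suggestions)
```

Here "dan" is reachable through both of ann's friends and "eve" through one, so "dan" is suggested first. The integration step is that each person's friend list is a separate source, and the recommendation only appears once they are combined.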
How to do data integration?
“Low” Effort Approaches
Use a database’s “Join” (e.g., SQLite)

Table 1:
id   name
111  Smith
222  Johnson
333  Obama

Table 2:
id   state
111  GA
222  NY
333  CA

Joined on id:
id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

Google Refine: http://code.google.com/p/google-refine/ (video #3)
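The join on this slide can be tried with Python's built-in `sqlite3` module. This is a minimal sketch: the table names `names` and `states` are made up for illustration, and it assumes the third row has id 333 in both tables so that each id matches exactly one row.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names  (id INTEGER, name  TEXT);
    CREATE TABLE states (id INTEGER, state TEXT);
    INSERT INTO names  VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    INSERT INTO states VALUES (111, 'GA'), (222, 'NY'), (333, 'CA');
""")

# Combine the two tables on their shared id column.
rows = conn.execute("""
    SELECT names.id, names.name, states.state
    FROM names
    JOIN states ON names.id = states.id
    ORDER BY names.id
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Note that a join only works cleanly when the key column is clean. If an id were duplicated or mistyped in one table, the join would silently produce extra or missing rows, which is why cleaning comes before integration.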
Crowd-sourcing Approaches: Freebase
http://wiki.freebase.com/wiki/What_is_Freebase%3F
Freebase (a graph of entities)
“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…” (Wikipedia)
So what? What can you do with Freebase?
(Hint: Google acquired it in 2010)
http://www.google.com/insidesearch/features/search/knowledge.html
Given a graph of entities, like Freebase, what other cool things can you do?
Facebook’s Graph Search
Integrate your friends’ info with yours
Feldspar
Finding Information by Association.
CHI 2008
Polo Chau, Brad Myers, Andrew Faulring
Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E
Summary for data integration
Opportunities
• enable new services (Siri, PadMapper)
• enable new ways to discover info
• improve existing services
• reduce redundancy
• new ways to interact with data
• promote knowledge transfer (e.g., between companies)