MIT Big Data Explorers - presentation by Daniel Burseth
AN END-TO-END DEMONSTRATION OF GENERATING, CLEANING, AND VISUALIZING A “MESSY” DATA SET
Daniel Burseth
Co-president, MIT Big Data Explorers
[email protected]
@dmbnyc
Github: dburseth
WHAT’S THE MOTIVATION? Acronyms abound
Tremendous complexity
Use building blocks not code
CLEAN DATA IS A LUXURY This is easy
EPPM of 10 requires 500 professionals
BUT WHAT ABOUT INFORMATION THAT ISN’T NICELY STRUCTURED AND DOESN’T HAVE AN API?
ANOTHER AREA THAT DOESN’T GET MUCH AIR TIME….
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0
Data preparation and cleansing:
• Missing
• Duplicative
• Conventions (dates, time, geographies)
• Spacing
• Can we measure data cleanliness?
• What’s our Pareto point?
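One rough way to put a number on "data cleanliness" is to count the share of rows with a blank field and the share of rows that repeat an earlier key. The sketch below uses only the Python standard library; the sample rows, the column names, and the choice of these two metrics are assumptions made for illustration, not something taken from the talk.

```python
import csv
import io


def cleanliness_report(rows, key_field):
    """Return the share of rows with a blank field and the share with a repeated key."""
    total = len(rows)
    if total == 0:
        return {"rows": 0, "missing_share": 0.0, "duplicate_share": 0.0}
    # A row counts as "missing" if any of its fields is empty or whitespace.
    missing = sum(
        1 for r in rows if any(not (v or "").strip() for v in r.values())
    )
    seen = set()
    duplicates = 0
    for r in rows:
        key = (r.get(key_field) or "").strip().lower()
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "rows": total,
        "missing_share": missing / total,
        "duplicate_share": duplicates / total,
    }


# Hypothetical tab-separated sample, not real listing data.
SAMPLE = (
    "Title\tPrice\n"
    "$1500 / 2br - Sunny apt (Cambridge)\t1500\n"
    "$1500 / 2br - Sunny apt (Cambridge)\t1500\n"
    "$900 Studio (Allston)\t\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
report = cleanliness_report(rows, "Title")
```

Tracking these shares before and after each cleaning step gives one possible answer to the Pareto question: stop when another pass barely moves them.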
LOG IN TO YOUR AWS INSTANCE AWS -> EC2
Launch instance: ami-c6b61fae (US-EAST)
Instance type m3.medium
Connect
You should see some software on the desktop
AGENDA
Scrape all of Craigslist’s Boston apartment listings using WebHarvy
Examine, clean, and prepare the data set using OpenRefine
Map our data and apply filters using Tableau
……all without writing a single line of code.
DOWNLOAD MY SLIDES AT SHOUTKEY.COM/EFFIGY
WEBHARVY A hyper-intelligent utility to scrape website data.
SysNucleus, makers of USBTrace
Heavy-duty alternatives: Scrapy (scrapy.org), Beautiful Soup
GO TO HTTP://SHOUTKEY.COM/WIRE
1. Start Config
2. Click on Hungry Mother – capture text
3. Click on Hungry Mother – capture URL
4. Click on Kendall Square/MIT – capture text
5. Click on the last review – capture text
CLEAR
6. Mine -> Scrape a list of similar links
7. Click on Hungry Mother
WE’VE NOW DRILLED INTO THE TOP LINK Let’s start collecting information in the first sub-page.
THIS CAPTURED THE FIRST PAGE, BUT WHAT IF WE WANT MORE? Edit Clear
Navigate into a sub-page
Start Config
Set as Next Page Link
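For readers who do want to see what "Set as Next Page Link" amounts to in code, here is a minimal sketch of following a next-page link until it runs out. It uses Python's standard `html.parser` rather than WebHarvy, Scrapy, or Beautiful Soup; the in-memory pages, the `listing` and `next` class names, and the `fetch` callback are all hypothetical stand-ins for a real site.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect listing titles and the next-page link from one page."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.next_url = None
        self._in_listing = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "listing":
            self._in_listing = True
        elif tag == "a" and attrs.get("class") == "next":
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_listing = False

    def handle_data(self, data):
        if self._in_listing and data.strip():
            self.titles.append(data.strip())


# Hypothetical two-page site held in memory (no network access).
PAGES = {
    "/page1": '<a class="listing" href="/a">$1500 2br (Cambridge)</a>'
              '<a class="next" href="/page2">next</a>',
    "/page2": '<a class="listing" href="/b">$900 Studio (Allston)</a>',
}


def scrape_all(start_url, fetch, max_pages=50):
    """Follow next-page links from start_url, collecting every listing title."""
    titles, url, visited = [], start_url, set()
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        parser = ListingParser()
        parser.feed(fetch(url))
        parser.close()
        titles.extend(parser.titles)
        url = parser.next_url  # None on the last page ends the loop
    return titles


all_titles = scrape_all("/page1", PAGES.__getitem__)
```

The `visited` set and `max_pages` cap guard against a next link that loops back on itself.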
OTHER BELLS AND WHISTLES Scheduler
Input keywords
Pause Inject (word of caution: scraping often violates TOS. Potentially not viable for apps, commercial purposes!)
TRY VISITING CRAIGSLIST IN AWS BTW!!
Proxy
Database export
20K ROWS OF MESS!
Download Craigslist Boston from http://shoutkey.com/glorify
Look at our data: open Boston Dirty.csv (20k rows of mess!)
Time to CLEAN: Launch GOOGLE-REFINE.EXE
Within MOZILLA, navigate to http://127.0.0.1:3333/
Create Project -> This Computer -> Browse
Parse by tab
Create Project
REMOVE DUPLICATES
1. First, sort your column.
2. Then, invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears on top of the middle of the data table.
3. Then invoke Edit cells and Blank down on the Title column.
4. Then on that column, invoke menu Facet > Custom facets and Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the left most "all" dropdown menu.
6. Remove the facet.
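The sort, blank-down, and remove-matching-rows sequence boils down to "sort, then drop any row whose value equals the one above it". A minimal Python sketch of that logic follows; the row dictionaries and the `Title` column are hypothetical sample data, and this is only an approximation of what Refine does internally.

```python
def remove_duplicates(rows, column):
    """Sort rows by column, then keep only the first row of each run of equal values."""
    ordered = sorted(rows, key=lambda r: r[column])  # sort + re-order permanently
    kept = []
    previous = object()  # sentinel that never equals a real cell value
    for row in ordered:
        if row[column] == previous:
            # This cell would have been blanked by "Blank down", so drop the row.
            continue
        kept.append(row)
        previous = row[column]
    return kept


# Hypothetical sample rows.
listings = [
    {"Title": "$900 Studio (Allston)", "Price": "900"},
    {"Title": "$1500 2br (Cambridge)", "Price": "1500"},
    {"Title": "$900 Studio (Allston)", "Price": "900"},
]
unique_listings = remove_duplicates(listings, "Title")
```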
DUPLICATE “TITLE”
“TITLE” CONTAINS KEY INFO, LET’S PARSE IT
MORE CHANGES TO “TITLE”
TITLE REMAINS MESSY
Then run the “To Number” transform again
LET’S EXTRACT LOCATION
REMOVE TRAILING PAREN
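The title-parsing slides can be approximated with two regular expressions: one for the dollar amount and one for the parenthesized location at the end of the title. This sketch assumes titles shaped like "$1,500 / 2br - Sunny apt (Cambridge)"; that shape and the function name are illustrative assumptions, since the actual Refine expressions are not given in the transcript.

```python
import re

# A dollar sign followed by digits, optionally with thousands separators.
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*)")
# A parenthesized chunk at the very end of the title; the parens are not captured.
LOCATION_RE = re.compile(r"\(([^()]*)\)\s*$")


def parse_title(title):
    """Return (price, location) from a listing title; either may be None."""
    price_match = PRICE_RE.search(title)
    location_match = LOCATION_RE.search(title)
    price = int(price_match.group(1).replace(",", "")) if price_match else None
    location = location_match.group(1).strip() if location_match else None
    return price, location


parsed = parse_title("$1,500 / 2br - Sunny apt (Cambridge)")
```

Capturing only what sits inside the parentheses handles the trailing-paren cleanup in the same step, and converting the captured digits with `int` plays the role of the "To Number" transform.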
NOW THE FUN PART: CLUSTERING
SWITCH THE METHOD: NEAREST NEIGHBOR
Increment the radius to 7 and make judgment calls along the way.
Change the Distance Function and do the same thing
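To give a feel for what the radius means in nearest-neighbor clustering, here is a simplified sketch: compute the Levenshtein edit distance between values and put a value into the first cluster whose first member is within the radius. This greedy grouping is an assumption made for brevity and is not how OpenRefine implements its nearest-neighbor method; the city names are hypothetical examples.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def cluster(values, radius):
    """Greedily group values whose distance to a cluster's first member is <= radius."""
    clusters = []
    for value in values:
        for group in clusters:
            if levenshtein(value.lower(), group[0].lower()) <= radius:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


groups = cluster(
    ["Cambridge", "Cambrige", "cambridge", "Somerville", "Sommerville", "Allston"],
    radius=2,
)
```

A larger radius merges more aggressively, which is why each proposed cluster still needs a human judgment call before it is accepted.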
TRIM WHITESPACE ON OUR CITY DATA
ADD “,MA” TO OUR CITY DATA
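The two city clean-up slides are simple string operations. A minimal sketch, assuming the state suffix is always ",MA" and that collapsing repeated internal spaces is also wanted (that second part goes slightly beyond a plain trim):

```python
def normalize_city(city, state="MA"):
    """Trim and collapse whitespace, then append ',<state>' unless already present."""
    cleaned = " ".join(city.split())
    if not cleaned:
        return cleaned  # leave blank cells blank
    suffix = "," + state
    if cleaned.upper().endswith(suffix):
        return cleaned
    return cleaned + suffix


city = normalize_city("  Jamaica   Plain ")
```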
LET’S PLOT OUR VALUES Looks like we have SOME really expensive real estate. Data errors????
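One simple way to surface likely data errors is to flag prices that are non-positive or many times the median. The factor of 5 and the sample prices below are arbitrary assumptions chosen for illustration; a real cut-off would come from looking at the plotted distribution.

```python
from statistics import median


def flag_suspect_prices(prices, factor=5.0):
    """Return prices that are <= 0 or more than factor times the median."""
    if not prices:
        return []
    typical = median(prices)
    return [p for p in prices if p <= 0 or p > factor * typical]


# Hypothetical monthly rents, including one obvious outlier and one zero.
suspects = flag_suspect_prices([1500, 1800, 2200, 1950, 250000, 0])
```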
EXPORT OUR DATA AND LEAVE REFINE
Boston Clean.csv
WELCOME TO TABLEAU Load Boston Clean.csv
“Go to Worksheet”
DRAG CITY TO THE BLACK BOX
Great “semantic” example. Tableau understands that this text translates to a lat/long
TABLEAU ALERTS TO UNPLOTTED POINTS Look on the map in the lower right corner
Let’s “Filter Data”
SIZE AND LABEL OUR DATA Under “Measures”, drag “Price” onto size in “Marks”
Change sum(Price) to avg(Price)
Drag Price into Filters, change it to max(Price), and select an “At Most” value
Right-click the filter and choose “Show Quick Filter”
Drag “City” onto “Label”
Menu Map -> Map Options
Click on a node for info and drill down potential
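The Tableau steps above amount to a group-by: average the price per city, and hide any city whose maximum price exceeds the "At Most" limit. A rough Python equivalent is sketched below; the sample rows and the limit are hypothetical, and treating the filter as applying per city is an assumption about how the worksheet is set up.

```python
from collections import defaultdict


def average_price_by_city(rows, max_price):
    """Average price per city, keeping only cities whose max price is <= max_price."""
    by_city = defaultdict(list)
    for city, price in rows:
        by_city[city].append(price)
    return {
        city: sum(prices) / len(prices)
        for city, prices in by_city.items()
        if max(prices) <= max_price
    }


# Hypothetical (city, price) rows.
averages = average_price_by_city(
    [
        ("Cambridge,MA", 2000),
        ("Cambridge,MA", 3000),
        ("Boston,MA", 2500),
        ("Boston,MA", 250000),
    ],
    max_price=10000,
)
```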
VISUALIZATION IS A HUGE TOPIC!
RECAP
1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a real cursory mapping visualization
WHAT’S YOUR BUSINESS IDEA? Please come talk to me
QUESTIONS? THANK YOU!
GITHUB: DMBNYC
[email protected]