Big mountain data competition training: scraping-n-munge

Post on 06-Jul-2015


description

BMDC stand-up lunch presentation


BMDC: Utah Air Quality

@davidbgonzalez

Ziff.io

Boawp.com

https://github.com/davidbgonzalez/bmdcfall2014data


WWWWW

Find: “\<this is what I'm looking for>”

Replace: %s/<leave me blank for the last thing I searched>/<replace with this>/
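The same find-then-replace flow can be sketched outside Vim with Python's `re` module. This is only an illustration: the sample string and the pattern are made up, and Vim's `\<` and `\>` word boundaries are written as `\b` in Python.

```python
import re

# Hypothetical sample line; the readings are invented for illustration.
line = "pm25=12.4;pm10=30.1;pm25=9.8"

# "Find": compile the pattern once, much like searching for it in Vim first.
pattern = re.compile(r"\bpm25\b")

# "Replace": reuse that same pattern. Note that sub() replaces every match,
# which corresponds to adding the g flag to the Vim substitute command.
cleaned = pattern.sub("PM2.5", line)
print(cleaned)
```

Compiling the pattern first mirrors the slide's tip: get the search right, then reuse it for the substitution instead of retyping it.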

Scrape

txt

www

pdf

BIG

sql

BIG

● wc -l # 128438621 # Oh No's

● Work with sample for flow

● head -n 100000

● Compress it

– Avro

– Parquet

● Play with it in HDFS
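The "work with a sample" and "compress it" steps above can be sketched in Python as a rough equivalent of `head -n 100000` piped into a compressor. This is a minimal sketch under assumptions: the function name `sample_head` is hypothetical, and gzip is swapped in for Avro or Parquet because it ships with the standard library.

```python
import gzip
import itertools


def sample_head(src_path, dst_path, n_lines=100_000):
    """Copy the first n_lines of src_path into a gzip-compressed dst_path.

    Returns the number of lines actually written.
    """
    written = 0
    with open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        # islice reads lazily, so the full file is never loaded into memory.
        for row in itertools.islice(src, n_lines):
            dst.write(row)
            written += 1
    return written
```

A call such as `sample_head("big.txt", "sample.txt.gz")` (both file names are placeholders) would give a small file to build the workflow on before running it against all 128 million lines.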

txt

Tools

● BeautifulSoup4 ← python

● Vim + regex

● xlrd

● CLI

– jq

– json2csv

– csvkit

– subsample
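As a rough idea of what the JSON-to-CSV step in that CLI chain accomplishes, here is a minimal sketch using only Python's standard library. It is not how `jq` or `json2csv` are implemented; the function name `json_records_to_csv` and the sample field names are assumptions made up for illustration.

```python
import csv
import io
import json


def json_records_to_csv(json_text, fields):
    """Convert a JSON array of flat objects into CSV text with the given columns."""
    records = json.loads(json_text)
    out = io.StringIO()
    # extrasaction="ignore" drops keys that are not in the requested columns.
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return out.getvalue()


# Hypothetical air-quality records; stations and values are invented.
sample = '[{"station": "A", "pm25": 12.4, "note": "x"}, {"station": "B", "pm25": 9.8}]'
print(json_records_to_csv(sample, ["station", "pm25"]))
```

Picking the columns explicitly plays the role of the field selection that would otherwise be done on the command line before the conversion.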