Big mountain data competition training: scraping-n-munge

BMDC stand-up lunch presentation

Page 1: Big mountain data competition training: scraping-n-munge

BMDC: Utah Air Quality

@davidbgonzalez

Ziff.io

Boawp.com

https://github.com/davidbgonzalez/bmdcfall2014data

Page 2: Big mountain data competition training: scraping-n-munge
Page 3: Big mountain data competition training: scraping-n-munge
Page 4: Big mountain data competition training: scraping-n-munge

??

??

Page 5: Big mountain data competition training: scraping-n-munge

WWWWW

Page 6: Big mountain data competition training: scraping-n-munge
Page 7: Big mountain data competition training: scraping-n-munge

Find: /<this is what I'm looking for>

Replace: :%s/<leave me blank to reuse the last thing I searched>/<replace with this>/
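The same find-then-replace flow can be sketched in Python with the standard `re` module. This is only an illustration: the sample lines and the pattern are made up, not taken from the competition data.

```python
import re

# Hypothetical sample text; station names and values are invented.
text = "station=A pm25=12.1\nstation=B pm25=35.4"

# "Find": compile the pattern once, like running a search in Vim.
pattern = re.compile(r"pm25=")

# "Replace": reuse that same pattern on every line, like :%s//.../
# count=1 changes only the first match per line, as Vim does without the g flag.
cleaned = "\n".join(
    pattern.sub("pm2.5=", line, count=1) for line in text.splitlines()
)
print(cleaned)
```

Compiling the pattern first mirrors the Vim habit of getting the search right before running the substitution.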

Page 8: Big mountain data competition training: scraping-n-munge

Scrape

txt

www

pdf

BIG

sql
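A minimal sketch of the www-to-txt step, assuming the page holds an HTML table. It uses the standard library's `html.parser` as a stand-in so it runs without extra packages; the HTML fragment and the `TableToText` class are hypothetical, and a real scrape would fetch the page first.

```python
from html.parser import HTMLParser


class TableToText(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment; the values are invented.
html = """
<table>
  <tr><th>date</th><th>pm25</th></tr>
  <tr><td>2014-01-01</td><td>12.1</td></tr>
  <tr><td>2014-01-02</td><td>35.4</td></tr>
</table>
"""

parser = TableToText()
parser.feed(html)

# Flatten to tab-separated text, ready for CLI tools.
txt = "\n".join("\t".join(row) for row in parser.rows)
print(txt)
```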

Page 9: Big mountain data competition training: scraping-n-munge

BIG

● wc -l # 128438621 # Oh No's
● Work with a sample for flow
● head -n 100000
● Compress it

– Avro

– Parquet

● Play with it

hdfs

txt
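The count, sample, and compress steps can be sketched in Python. This is a rough illustration under stated assumptions: the file is a small generated one, the function names are hypothetical, and gzip stands in for Avro or Parquet because those need third-party libraries.

```python
import gzip
import itertools
import os
import shutil
import tempfile


def count_lines(path):
    """Rough equivalent of `wc -l`: count lines without loading the file."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def head(path, out_path, n):
    """Rough equivalent of `head -n N`: copy the first n lines to out_path."""
    with open(path, "rb") as src, open(out_path, "wb") as dst:
        dst.writelines(itertools.islice(src, n))


def gzip_file(path, out_path):
    """Compress a file with gzip, streaming so memory use stays flat."""
    with open(path, "rb") as src, gzip.open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Build a small stand-in for the "BIG" file (made-up rows).
workdir = tempfile.mkdtemp()
big = os.path.join(workdir, "big.txt")
with open(big, "w") as f:
    for i in range(1000):
        f.write(f"row {i}\n")

# Sample first, get the flow working, then scale up.
sample = os.path.join(workdir, "sample.txt")
head(big, sample, 100)
gzip_file(sample, sample + ".gz")

total_lines = count_lines(big)
sample_lines = count_lines(sample)
print(total_lines, sample_lines)
```

Working on the sample keeps each iteration fast; the same functions can be pointed at the full file once the flow is settled.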

Page 10: Big mountain data competition training: scraping-n-munge

Tools

● BeautifulSoup4 ← Python
● Vim + regex
● xlrd
● CLI

– jq

– json2csv

– csvkit

– subsample
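For comparison, here is a Python sketch of the kind of JSON-to-CSV munging the CLI tools above are typically used for. It relies only on the standard library; the records and field names are invented for illustration.

```python
import csv
import io
import json

# Hypothetical input: one JSON object per line, with a nested field.
json_lines = """{"station": "A", "reading": {"pm25": 12.1}}
{"station": "B", "reading": {"pm25": 35.4}}
"""

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["station", "pm25"])

for line in json_lines.splitlines():
    record = json.loads(line)
    # Pull out just the fields we want and flatten the nesting.
    writer.writerow([record["station"], record["reading"]["pm25"]])

csv_text = out.getvalue()
print(csv_text)
```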