Python In The Cloud

PyHou MeetUp, Dec 17th 2013. Chris McCafferty, SunGard Consulting Services


Page 1: Python In The Cloud

Python In The Cloud

PyHou MeetUp, Dec 17th 2013
Chris McCafferty, SunGard Consulting Services

Page 2: Python In The Cloud

Overview

• What is the Cloud?
• What is Big Data?
• Big Data Sources
• Python and Amazon Web Services
• Python and Hadoop
• Other Pythonic Cloud providers
• Wrap-up

Page 3: Python In The Cloud

What Is The Cloud

• I want 40 servers and I want them NOW
• I want to store 100 TB of data cheaply and reliably
• We can do this with Cloud technologies

Page 4: Python In The Cloud

What is Big Data

• “Three Vs”
  – Volume
  – Variety
  – Velocity

• Genome: sequencing machines throw off several TB per day. Each.

• Hard drive performance is often the killer bottleneck, both reading and writing

Page 5: Python In The Cloud

What is NOT Big Data

• Anything where the whole data set can be held in memory on a single standard instance

• Data that can be held straightforwardly in a traditional relational database

• Problems where most of the data can be trivially excluded

• There are many challenging problems in the world – but not all need Cloud or Big Data tools to solve them

Page 6: Python In The Cloud

To The Cloud!

• Amazon Web Services is the 800lb gorilla in this space
  – Start here if in doubt
• Other options are Rackspace, Microsoft Azure, (PiCloud/Multyvac?)
• You can also spin up some big iron very cheaply
  – Current AWS big memory spec is cr1.8xlarge
  – 244GB RAM, 32 Xeon-E5 cores, 10 Gigabit network
  – $3.50 per hour

Page 7: Python In The Cloud

Geo Big Data Sources

• NASA SRTM data is on the large side
• NASA recently released a huge set of data directly into the cloud: NEX
  – Earth Sciences data sets
• Made available on Amazon Web Services public datasets
• Available on S3 at:
  – s3://nasanex/NEX-DCP30
  – s3://nasanex/MODIS
  – s3://nasanex/Landsat
• There are many, many geo data sets available now (NOAA Lidar, etc.)

Page 8: Python In The Cloud

Time for some code

• Example - Use S3 browser to look at new NASA NEX data
• Let’s download some with the boto package
• Quickest to do this from an Amazon data centre
• See DemoDownloadNasaNEX.py
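DemoDownloadNasaNEX.py is not reproduced in this transcript. The following is only a minimal sketch of the same idea, assuming boto3 (the successor to the boto package named on the slide) is installed and that the bucket still allows anonymous reads. The helper names `split_s3_url`, `list_public_keys` and `download_key` are hypothetical and are not taken from the original script.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/prefix' into a (bucket, prefix) pair."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URL: %r" % url)
    return parsed.netloc, parsed.path.lstrip("/")


def _anonymous_s3_client():
    # Imported lazily so the URL helper above works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests: assumes the data set is public, so no AWS keys are sent.
    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def list_public_keys(url, max_keys=10):
    """Return up to max_keys object keys found under a public S3 URL."""
    bucket, prefix = split_s3_url(url)
    response = _anonymous_s3_client().list_objects_v2(
        Bucket=bucket, Prefix=prefix, MaxKeys=max_keys
    )
    return [item["Key"] for item in response.get("Contents", [])]


def download_key(url, key, filename):
    """Download one object from the bucket named in url to a local file."""
    bucket, _ = split_s3_url(url)
    _anonymous_s3_client().download_file(bucket, key, filename)
```

Calling `list_public_keys("s3://nasanex/NEX-DCP30")` from an EC2 instance would be the natural starting point, since, as the slide notes, transfers are quickest from inside an Amazon data centre.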

Page 9: Python In The Cloud

Weather & Big Data Sources

• Good public weather and energy data
• It's hard to move data around for free: just try!
• Power grids shed many GB of public data a day
  – Historical data sets form many Terabytes
• Weather data available from NOAA
  – QCLCD: Hourly, daily, and monthly summaries for approximately 1,600 U.S. locations.
  – ASOS data contains sensor data at one-minute intervals. 5 min intervals available too.
    • 900 stations, 3-4MB per station per day, 12 years of data = 11-15TB data set.
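The ASOS size estimate can be checked with a few lines of arithmetic. This sketch assumes the 3-4 MB figure is per station per day, a 365-day year, and decimal units (1 TB = 1,000,000 MB); none of those assumptions is stated on the slide.

```python
STATIONS = 900
YEARS = 12
DAYS_PER_YEAR = 365            # assumption: leap days ignored
MB_PER_TB = 1000.0 * 1000.0    # assumption: decimal units


def dataset_size_tb(mb_per_station_per_day):
    """Total archive size in TB for a given daily volume per station."""
    total_mb = STATIONS * mb_per_station_per_day * DAYS_PER_YEAR * YEARS
    return total_mb / MB_PER_TB


low = dataset_size_tb(3)
high = dataset_size_tb(4)
print("%.1f to %.1f TB" % (low, high))  # prints "11.8 to 15.8 TB"
```

Under these assumptions the result is in line with the 11-15TB range quoted on the slide.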

Page 10: Python In The Cloud

Why go to the cloud

• Cheap: see the AWS pricing page
  – Spot pricing of m1.medium is normally ~1c/hr
• The cloud is increasingly where the (public) data will reside
• Pay as you go, less bureaucracy
• Support for Big Data technologies out of the box
  – Amazon Elastic MapReduce (EMR), which runs on EC2, gives you a Hadoop cluster with minimal setup
• Host a big web server farm or video streaming cluster

Page 11: Python In The Cloud

Python on AWS EC2

• AWS = Amazon Web Services. The Big Cloud
• EC2 = Elastic Compute Cloud
• Let’s spin up an instance and see what we have available
• See this script as one way to upgrade to Python 2.7
• Note absence of high-level packages like NumPy, matplotlib and Pandas
• It would be very useful to have a very high-level Python environment…

Page 12: Python In The Cloud

StarCluster

• Cluster management in AWS, written by a group at MIT
• Convenient package to spin up clusters (Hadoop or other) and copy across files
• Machine images (AMIs) for high-level Python environments (NumPy, matplotlib, Pandas, etc.)
• Not every high-level library is there
  – No sklearn (SciKit-Learn, machine learning)
  – But easier to pip-install with most pre-requisites already there
• Sun Grid Engine: Job Management
• Hadoop
• Boto plugin
• dumbo… and much more

Page 13: Python In The Cloud

Python's Support for AWS

• boto - interface to AWS (Amazon Web Services)
• Hadoop Streaming - use Python in MapReduce tasks
• mrjob - framework that wraps Hadoop Streaming and uses boto
• pydoop - wraps Hadoop Pipes, which is a C++ API into Hadoop MapReduce
• Write Python in User-Defined Functions in Pig, Hive
  – Essentially wraps MapReduce and Hadoop Streaming

Page 14: Python In The Cloud

Boto - Python Interface to AWS

• Support for HDFS
• Upload/download from Amazon S3 and Glacier
• Start/stop EC2 instances
• Manage users through IAM
• Virtually every API available from AWS is supported
• django-storages uses boto to present an S3 storage option
• See http://docs.pythonboto.org/en/latest/
• Make sure you keep your AWS key-pair secure

Page 15: Python In The Cloud

Another Code Example – upload

• Example where we merge many files together and upload to S3
• Merge files to avoid the Small Files Problem
• Note use of retry decorator (exponential backoff)
• See CopyToCloud.py and MergeAndUploadTxOutages.py
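The two scripts are not included in this transcript, so the following is only a sketch of the pattern the slide describes: concatenate many small files into one, then upload with a retry decorator that backs off exponentially. All names here (`retry`, `merge_files`, `upload_with_retry`) and the `upload` callable are hypothetical; in practice `upload` would wrap whatever S3 client call is in use.

```python
import functools
import time


def retry(max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry the wrapped call, doubling the wait after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
        return wrapper
    return decorator


def merge_files(paths, merged_path):
    """Concatenate many small text files into one larger file."""
    with open(merged_path, "w") as out:
        for path in paths:
            with open(path) as src:
                for line in src:
                    # Ensure the last line of each file ends with a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return merged_path


@retry(max_attempts=4, base_delay=1.0)
def upload_with_retry(upload, merged_path):
    # `upload` is a hypothetical callable supplied by the caller,
    # for example a thin wrapper around an S3 upload call.
    return upload(merged_path)
```

Passing `sleep` as a parameter is a design choice made here for testability; it lets the backoff schedule be checked without actually waiting.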

Page 16: Python In The Cloud

What is Hadoop?

• A scalable data and job manager suitable for MapReduce jobs
• Core technologies date from early 2000s at Google
• Retries failed tasks, redundant data, good for commodity hardware
• Rich ecosystem of tools including NoSQL databases, good Python support
• Example, let’s spin up a cluster of 30 machines with StarCluster

Page 17: Python In The Cloud

Hadoop Scales Massively

Page 18: Python In The Cloud

Hadoop Streaming

• Hadoop passes incoming data in rows on stdin
• Any program (including Python) can process the rows and emit to stdout
• Logging and errors go to stderr
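As an illustration of the stdin/stdout contract, here is a minimal word-count style mapper. It is a sketch, not code from the talk: the function name `map_lines` and the choice of a tab-separated `word<TAB>1` output are assumptions made for the example.

```python
import sys


def map_lines(lines, out=sys.stdout, err=sys.stderr):
    """Emit one tab-separated (word, 1) pair per word; log a summary to err."""
    rows = 0
    for line in lines:
        for word in line.split():
            # Results go to stdout, one key/value pair per line.
            out.write("%s\t1\n" % word.lower())
        rows += 1
    # Diagnostics go to stderr so they do not pollute the job output.
    err.write("processed %d rows\n" % rows)
```

To use this as a streaming mapper, the script would end with a call to `map_lines(sys.stdin)`. Taking the streams as parameters means the same function can be exercised locally with in-memory data before it is sent to a cluster.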

Page 19: Python In The Cloud

Hadoop Streaming - Echo

• Useful example that can be used for debugging
• Tells you what Hadoop is actually passing your task
• See echo.py
• Similar example firstten.py peeks at the first ten lines then stops
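Neither echo.py nor firstten.py appears in this transcript. A rough sketch of what such debugging tasks might look like is below; the single `echo` function with an optional `limit` is an assumption made to cover both cases, not the layout of the original scripts.

```python
import sys


def echo(lines, out=sys.stdout, err=sys.stderr, limit=None):
    """Copy rows through unchanged; stop after `limit` rows if one is given."""
    seen = 0
    for line in lines:
        if limit is not None and seen >= limit:
            break
        out.write(line)
        # repr() makes tabs, trailing newlines and odd bytes visible in the log.
        err.write("row %d: %r\n" % (seen, line))
        seen += 1
    return seen
```

With `limit=None` this behaves like an echo task; with `limit=10` it mimics the "first ten lines" variant described on the slide.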

Page 20: Python In The Cloud

Hadoop Parsing Example

• Python's regex support makes it very good for parsing unstructured data
• One of the keys in working with Hadoop and Big Data is getting it into a clean row-based format
• Apply 'schema on read'
• Transmission Data from PJM is updated here every 5 mins: https://edart.pjm.com/reports/linesout.txt
• Needs cleaning up before we can use it for detailed analysis - note multi-line format
• Script split_transmission.py
• Watch out for Hadoop splitting input blocks in the middle of a file
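split_transmission.py and the real layout of linesout.txt are not shown in this transcript, so the sketch below uses an invented layout purely to illustrate the technique: a record starts with a numeric id in the first column, and indented lines that follow belong to the same record. The regex, the field meanings and the function name `to_rows` are all hypothetical.

```python
import re

# Hypothetical layout: a record begins with a numeric id at column 0;
# any following indented, non-blank lines continue that record.
RECORD_START = re.compile(r"^(\d+)\s+(.*)$")


def to_rows(lines):
    """Yield one tab-separated row per multi-line record."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = RECORD_START.match(line)
        if match:
            if current is not None:
                yield "\t".join(current)  # previous record is complete
            current = [match.group(1), match.group(2).strip()]
        elif current is not None and line.strip():
            current.append(line.strip())  # continuation of the open record
        # Anything before the first record (headers, blanks) is skipped.
    if current is not None:
        yield "\t".join(current)
```

This approach assumes each file reaches a single task in one piece. If Hadoop splits the input mid-file, a record can straddle two splits, which is the hazard the last bullet warns about.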

Page 21: Python In The Cloud

Alternatives to AWS

• PiCloud offers open source software enabling you to run large computational clusters
  – Just acquired by Dropbox
  – Pay for what you use: 1 core and 300MB of RAM costs $0.05/hr
  – Doesn't offer many of the things Amazon does (AMIs, SMS) but great for computation or a private cloud
• Disco is MapReduce implemented in Python
  – Started life at Nokia
  – Has its own Distributed Filesystem (like HDFS)
• Or roll your own cluster in-house with pp (parallel python)
• StarCluster or Sun Grid Engine on another vendor or in-house
• Google App Engine…?

Page 22: Python In The Cloud

PiCloud

• Acquired by Dropbox Nov 2013
• Dropbox will probably come out with its own cloud compute offering in 2014
• As of Dec 2013, no new sign-ups
• Existing customers encouraged to migrate to Multyvac
• PiCloud will switch off on Feb 25th 2014
• The underlying PiCloud software is still open source

Page 23: Python In The Cloud

Conclusions

• For cheap compute power and cheap storage, look to the cloud
• Python is well-supported in this space
• Consider being close to your data: in the same cloud
  – Moving data is expensive and slow
• Leverage AWS with tools like boto, StarCluster
• Beware setting up complex environments: installing packages takes time and effort
• Ideally, think Pythonically: use the best tools to get the job done

Page 24: Python In The Cloud

Links

• Good rundown on the Python ecosystem around Hadoop from Jan 2013:
  – http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
• Early vision for PiCloud (YouTube Mar 2012)
  – http://www.youtube.com/watch?v=47NSfuuuMfs
• Disco MapReduce Framework from PyData
  – http://www.youtube.com/watch?v=YuLBsdvCDo8
  – PuTTY tool for Windows
• Some AWS & Python war stories:
  – http://nz.pycon.org/schedule/presentation/12

Page 25: Python In The Cloud

Thank you

• Chris McCafferty
• http://christophermccafferty.com/blog
• Slides will be at: http://christophermccafferty.com/slides