Big Data Analytics London - Data Science in the Cloud
@MargrietGr
Margriet Groenendijk, PhD, Developer Advocate for IBM Cloud Data Services
Big Data Analytics Meetup, London
6 December 2016
Data Science in the Cloud
https://blog.rjmetrics.com/2015/10/05/how-many-data-scientists-are-there/
1781
http://visual.ly/exports-and-imports-scotland
1821
https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png
1960s
http://www.computerhistory.org/collections/catalog/102630767
1960s
http://www.climatecentral.org/news/first-climate-model-video-19007
2016
20th century fractional change of Water Use Efficiency
2016
Toolbox
http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png
Explore Data
Clean Data
Store Data
Spark on a Cluster
The Spark Stack
from Karau et al.: Learning Spark
RDDs: Resilient Distributed Datasets
• Data does not have to fit on a single machine
• Data is separated into partitions
• Creation of RDDs:
  • Load an external dataset
  • Distribute a collection of objects
• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
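The lazy transformation/action split can be mimicked in plain Python with generators. This is only a rough analogy, not Spark itself: the "partitions" are list slices and the "transformation" builds generators that compute nothing until the "action" forces them.

```python
# Rough analogy to RDD laziness using plain Python (not Spark).
data = list(range(10))

# "Partitions": the data is split across two chunks.
partitions = [data[:5], data[5:]]

# "Transformation": build lazy generators; nothing is computed yet.
squared = [(x * x for x in part) for part in partitions]

# "Action": force evaluation and combine results from all partitions.
total = sum(sum(part) for part in squared)
print(total)  # 285
```

In PySpark the same pipeline would read `sc.parallelize(range(10)).map(lambda x: x * x).reduce(lambda a, b: a + b)`, with Spark handling the partitioning and scheduling.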
Run Spark locally in a Python notebook
https://www.continuum.io/downloads
http://spark.apache.org/downloads.html
Create a new kernel to use in a Jupyter notebook
www.slideshare.net/MargrietGroenendijk
Jupyter Notebooks!
• Server-client application to edit and run notebook documents via a web browser
• Cells with:
  • Code
  • Figures and tables
  • Rich text elements
• Different kernels: Python, R, Scala, Spark
In the Cloud:
http://datascience.ibm.com/
Sign up for beta: http://datascience.ibm.com/features#machinelearning
Store Data in the Cloud
Object Store
Relational database
Document store (JSON)
https://github.com/ibm-cds-labs/ibmseti/
SETI
• Mission: To explore, understand and explain the origin and nature of life in the universe
• The Allen Telescope Array:
  • 198 million radio events detected in the last decade
  • 400,000 candidate signals identified
  • 5 TB of data generated in 10 hours
• No modern analysis or machine learning has been performed on this data
• 5 TB of special observations on IBM Object Store
SETI@IBMCloud
https://github.com/ibm-cds-labs/ibmseti/
http://www.seti.org/node/861
Access SETI data from Object Store
Local
On DSX
Weather Data
What will the weather be next weekend?
https://unsplash.com/search/autumn?photo=LSF8WGtQmn8
https://unsplash.com/search/rain?photo=19tQv51x4-A
Find Data https://console.ng.bluemix.net/
Load weather data
Weather forecast for London
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
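A minimal sketch of reading such a daily forecast in a notebook. The JSON structure below is illustrative only; the real Weather Company Data API response differs in detail, so treat the field names (`forecasts`, `dow`, `narrative`) as assumptions based on the shape of typical forecast payloads.

```python
import json

# Illustrative sample of a daily-forecast response (not the exact API schema).
sample = json.loads("""
{
  "forecasts": [
    {"dow": "Saturday", "narrative": "Cloudy with light rain. High 9C."},
    {"dow": "Sunday",   "narrative": "Partly sunny. High 11C."}
  ]
}
""")

# Pull out a readable day-by-day summary, as you might print in a notebook.
for day in sample["forecasts"]:
    print(f"{day['dow']}: {day['narrative']}")
```

In practice the JSON would come from an HTTP call to the forecast endpoint using the service credentials from Bluemix, with the same parsing applied to the response body.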
Visualize Data
Demo!
Weather map - example for UK
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
Python packages:
• matplotlib
• Basemap
• itertools
• urllib
Run as a daily cron job
cloudant
Weather, Twitter and Sentiment
Weather, Twitter and Sentiment
• Where to find the data?
• Where to store the data?
• Where to analyse the data?
• Quick tools to explore
Insights for Twitter
Add sentiment - example
Watson Tone Analyzer
• Emotion
• Language style
• Social propensities
Analyze how you are coming across to others
Workflow
Weather Company Data
crontab -e
0 23 * * * /path/to/file/do_something.sh
python do_something.py
Tweets, Weather, Sentiment
Watson Tone Analyzer
Insights for Twitter
Cloudant NoSQL
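A minimal sketch of what the daily cron job might produce: one JSON document that joins a tweet, a weather reading and a sentiment score, ready to be written to Cloudant. All field names, the account, and the database name here are hypothetical, not the deck's actual schema.

```python
import json

def build_record(tweet_text, temperature_c, sentiment_score):
    """Combine one tweet, a weather reading and a sentiment score
    into a single JSON document (hypothetical schema)."""
    return {
        "tweet": tweet_text,
        "temperature_c": temperature_c,
        "sentiment": sentiment_score,
    }

record = build_record("Lovely sunny day in London!", 11.5, 0.8)
doc = json.dumps(record)
print(doc)

# Cloudant speaks the CouchDB HTTP API, so the cron job could store the
# document with a plain POST (account, database and credentials made up):
#   import requests
#   requests.post("https://ACCOUNT.cloudant.com/tweets_weather",
#                 data=doc, headers={"Content-Type": "application/json"},
#                 auth=("USER", "PASSWORD"))
```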
https://github.com/ibm-cds-labs/pixiedust
PixieDust
PixieDust: an Open Source Library that simplifies and improves Jupyter Python Notebooks
• Package Manager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps
https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/
@DTAIEB55
Install Spark packages or plain jars in your notebook's Python kernel without needing to modify configuration files
Uses the GraphFrame Python APIs
Install GraphFrames Spark Package
One simple API: display()
Call the Options dialog
Panning/Zooming options
Performance statistics
Easily export your data to csv, json, html, etc. locally on your laptop or into a cloud-based service like Cloudant or Object Storage
Scala Bridge
Define a Python variable
Use the Python var in Scala
Define a Scala variable
Use the Scala var in Python
Easily extend PixieDust to create your own visualizations using HTML/CSS/JavaScript
Customized Visualization for GraphFrame Graphs
Real time Twitter sentiment analysis
https://developer.ibm.com/clouddataservices/author/mgroenen/
Thanks!
Slides will be here: http://www.slideshare.net/MargrietGroenendijk
Spark installation
• http://spark.apache.org/downloads.html
• Spark release: 1.6.2
• package type: Pre-built for Hadoop 2.6
• mkdir dev
• cd dev
• tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
• ln -s spark-1.6.2-bin-hadoop2.6 spark
• mkdir dev/notebooks
• mkdir ~/.ipython/kernels/pyspark1.6/
• create file kernel.json
• cd ~/dev/spark/conf
• cp spark-defaults.conf.template spark-defaults.conf
• add to end of spark-defaults.conf: spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*
{
  "display_name": "pySpark (Spark 1.6.2) Python 2",
  "language": "python",
  "argv": [
    "/Users/sparktest/miniconda2/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/sparktest/dev/spark",
    "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
    "SPARK_DRIVER_MEMORY": "10G",
    "SPARK_LOCAL_IP": "127.0.0.1"
  }
}