Data Science in the Cloud

Margriet Groenendijk (@MargrietGr)
Developer Advocate for IBM Cloud Data Services

SW Cloud meetup, Bristol
24 November 2016

About me
• Developer Advocate at IBM Cloud Data Services, UK
  • Data science
  • Python, Spark, R, Cloudant, dashDB
• Research Fellow at University of Exeter, UK
  • Worked with very large observational datasets and the output of global-scale climate models
• PhD at Vrije Universiteit Amsterdam, the Netherlands
  • Explored large observational datasets of carbon uptake by forests

1781

http://visual.ly/exports-and-imports-scotland

1821

https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png

1960s

http://www.computerhistory.org/collections/catalog/102630767

1960s

http://www.climatecentral.org/news/first-climate-model-video-19007

Data Engineers

Data Scientists

Business Analysts

App Developers

Data Science is a Team Effort

Data

Toolbox

http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png

Data Science Workflow
• Discover Data
• Use Data
• Publish Data
• Socialize Data

Data Science Workflow
• Define Question
• Find Data
• Explore Data
• Clean Data
• Visualize and Summarize Data
• Create Predictive Models
• Present Results

Collect Data
• APIs
• Open Data
• Maps
• Web Scraping
• Time Series

Store Data
• Object Store - binary files
• Relational database
• Document store - JSON

Bluemix
https://console.ng.bluemix.net/

Explore Data
• Explore Data
• Clean Data
• Store Data

Spark on a Cluster

The Spark Stack

from Karau et al.: Learning Spark

RDDs: Resilient Distributed Datasets
• Data does not have to fit on a single machine
• Data is separated into partitions
• Creation of RDDs
  • Load an external dataset
  • Distribute a collection of objects
• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
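The RDD ideas listed here can be illustrated without a cluster. The sketch below is a toy, pure-Python stand-in for an RDD: the class name MiniRDD and its internals are hypothetical and are not Spark's API. Data is split into partitions, transformations only record work, and an action triggers the computation. The comments name the PySpark calls that play the same role.

```python
from typing import Callable, Iterable, List


class MiniRDD:
    """Toy stand-in for an RDD: partitioned data plus a lazy chain of steps."""

    def __init__(self, partitions: List[list], steps: tuple = ()):
        self.partitions = partitions      # data is separated into partitions
        self.steps = steps                # transformations recorded, not yet run

    @classmethod
    def parallelize(cls, data: Iterable, num_partitions: int = 2) -> "MiniRDD":
        # Distribute a collection of objects (PySpark: sc.parallelize(data)).
        items = list(data)
        parts = [items[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    def map(self, func: Callable) -> "MiniRDD":
        # Transformation (PySpark: rdd.map): returns a new object, runs nothing.
        return MiniRDD(self.partitions, self.steps + (("map", func),))

    def filter(self, pred: Callable) -> "MiniRDD":
        # Transformation (PySpark: rdd.filter): also lazy.
        return MiniRDD(self.partitions, self.steps + (("filter", pred),))

    def _compute(self, part: list) -> list:
        # Apply the recorded steps, in order, to one partition.
        for kind, func in self.steps:
            if kind == "map":
                part = [func(x) for x in part]
            else:
                part = [x for x in part if func(x)]
        return part

    def collect(self) -> list:
        # Action (PySpark: rdd.collect): runs every step on every partition.
        return [x for part in self.partitions for x in self._compute(part)]

    def count(self) -> int:
        # Action (PySpark: rdd.count): number of elements after all steps.
        return len(self.collect())


calls = []                                 # records when the map function runs


def square(x):
    calls.append(x)
    return x * x


rdd = MiniRDD.parallelize(range(1, 7), num_partitions=3)
evens = rdd.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)           # nothing has been computed yet
result = sorted(evens.collect())           # the action triggers the work
```

Because `square` is only called once `collect()` runs, the example shows why transformations are described as lazy: building the chain is cheap, and the cost is paid when an action asks for a result.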

Run Spark locally in a Python notebook

https://www.continuum.io/downloads

http://spark.apache.org/downloads.html

Create a new kernel to use in a Jupyter notebook
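One way to create such a kernel is to write a Jupyter kernel spec (a `kernel.json` file) that points at the Spark download. The sketch below is an assumption about the setup, not necessarily the steps used in the talk: the helper name, the `local[2]` master and the `/opt/spark` path are hypothetical, and the py4j archive is located by pattern because its version differs between Spark releases.

```python
import glob
import json
import os


def build_pyspark_kernel_spec(spark_home, master="local[2]"):
    """Return a kernel.json dictionary for a local PySpark kernel."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j archive name depends on the Spark release, so search for it.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return {
        "display_name": "PySpark (local)",
        "language": "python",
        # Launch command for the IPython kernel.
        "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            "PYTHONPATH": os.pathsep.join([python_dir] + py4j_zips),
            # pyspark/shell.py creates the SparkContext `sc` at kernel start-up.
            "PYTHONSTARTUP": os.path.join(python_dir, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "--master {} pyspark-shell".format(master),
        },
    }


# Hypothetical install location; replace with the unpacked Spark directory.
spec = build_pyspark_kernel_spec("/opt/spark")
kernel_json = json.dumps(spec, indent=2)
```

Saving `kernel_json` as `kernel.json` in a new folder under one of the kernel directories reported by `jupyter kernelspec list` should make the kernel selectable in the notebook.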

Jupyter Notebooks!
• Server-client application to edit and run notebook documents via a web browser
• Cells with:
  • Code
  • Figures and tables
  • Rich text elements
• Different kernels: Python, R, Scala, Spark

In the Cloud:
http://datascience.ibm.com/

Weather Data

Define Question

What will the weather be next weekend?

https://unsplash.com/search/autumn?photo=LSF8WGtQmn8
https://unsplash.com/search/rain?photo=19tQv51x4-A

Find Data

https://console.ng.bluemix.net/

Explore Data
Python packages
• requests and json
  • API credentials and latitude/longitude of Bristol
  • JSON data returned
• pandas, numpy and datetime
  • Convert JSON to a pandas DataFrame (table with multiple indices)
  • Add time as index
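The conversion from JSON to a time-indexed DataFrame could look roughly like this. The payload is a made-up sample, and the field names (`forecasts`, `time`, `temp`, `rain_chance`) are hypothetical placeholders rather than the real response schema. In the notebook the JSON would come from a `requests` call made with the API credentials and the latitude/longitude of Bristol.

```python
import json

import pandas as pd

# Made-up sample standing in for the JSON text returned by the weather API.
response_text = """
{"forecasts": [
  {"time": "2016-11-26T09:00:00", "temp": 6, "rain_chance": 20},
  {"time": "2016-11-26T12:00:00", "temp": 9, "rain_chance": 10},
  {"time": "2016-11-26T15:00:00", "temp": 8, "rain_chance": 40}
]}
"""

weather = json.loads(response_text)          # JSON text -> Python dict

# One row per forecast record, one column per field.
df = pd.DataFrame(weather["forecasts"])

# Parse the timestamps and use them as the index.
df["time"] = pd.to_datetime(df["time"])
df = df.set_index("time")

warmest = df["temp"].idxmax()                # timestamp of the highest temperature
```

With time as the index, selecting a period or resampling becomes a one-line operation, which is the main reason for setting it.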

Weather forecast for Bristol
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/

Visualize Data
Python packages
• pandas - rolling mean
• matplotlib
• Basemap
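For the rolling mean, pandas provides `Series.rolling(...).mean()`. A small sketch with made-up hourly temperatures follows; the values and the window of 3 observations are illustrative assumptions, and the plotting call is left as a comment so the example runs without matplotlib.

```python
import pandas as pd

# Made-up hourly temperatures (degrees Celsius) for illustration only.
index = pd.date_range("2016-11-26 06:00", periods=6, freq=pd.Timedelta(hours=1))
temps = pd.Series([5.0, 6.0, 10.0, 11.0, 9.0, 7.0], index=index)

# Rolling mean over a window of 3 observations; the first two values are NaN
# because a full window is not yet available.
smooth = temps.rolling(window=3).mean()

# Centred variant: each value is averaged with its neighbours on both sides,
# so the NaN values fall at the two ends instead.
smooth_centred = temps.rolling(window=3, center=True).mean()

# In a notebook: temps.plot(); smooth.plot()
```

A wider window gives a smoother line but hides short-lived changes, so the window size is worth trying out interactively.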

Demo

Weather map

https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/

Python packages
• matplotlib
• Basemap
• itertools
• urllib
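A map needs data for many points rather than one. One plausible use of `itertools` and `urllib` here (an assumption, since the slide only lists the packages) is to build a grid of latitude/longitude pairs and a request URL for each. The endpoint and the parameter names below are hypothetical placeholders, not the real API.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical endpoint; the real service URL and credentials are not shown here.
BASE_URL = "https://example.com/weather/forecast"

# Coarse grid roughly covering the UK (illustrative values).
lats = [50, 52, 54, 56, 58]
lons = [-6, -4, -2, 0, 2]

# Every combination of latitude and longitude: 5 x 5 = 25 grid points.
grid = list(product(lats, lons))

# One request URL per grid point; the query parameter names are assumptions.
urls = [BASE_URL + "?" + urlencode({"lat": lat, "lon": lon}) for lat, lon in grid]
```

Each URL could then be fetched in turn and the returned values placed on the map at the matching grid point.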

Weather, Twitter and Sentiment
• Where to find the data?
• Where to store the data?
• Where to analyse the data?
• Quick tools to explore

Insights for Twitter

Add sentiment - example

Watson Tone Analyzer
• Emotion
• Language style
• Social propensities

Analyze how you are coming across to others

Workflow

Weather Company Data

crontab -e

0 23 * * * /path/to/file/do_something.sh

python do_something.py

Tweets
Weather
Sentiment

Watson Tone Analyzer

Insights for Twitter

Cloudant NoSQL
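The diagram leaves the contents of `do_something.py` open. As a rough sketch, the script started by the nightly cron job (`0 23 * * *` means 23:00 every day) might merge a tweet, the weather at that time and the tone scores into one JSON document before it is written to Cloudant. All field names and sample values below are hypothetical, and the upload itself is left out because it depends on the account credentials.

```python
import json
from datetime import datetime, timezone


def build_document(tweet, weather, tone):
    """Merge one tweet with weather and tone scores into a single record."""
    return {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "tweet": {"id": tweet["id"], "text": tweet["text"]},
        "weather": {"temp": weather["temp"], "description": weather["description"]},
        # Keep the strongest emotion next to the full set of scores.
        "sentiment": {
            "scores": tone,
            "dominant": max(tone, key=tone.get),
        },
    }


# Made-up inputs standing in for the three services.
tweet = {"id": "1", "text": "Lovely crisp morning in Bristol"}
weather = {"temp": 4, "description": "Sunny"}
tone = {"joy": 0.71, "sadness": 0.08, "anger": 0.03}

doc = build_document(tweet, weather, tone)
doc_json = json.dumps(doc)      # JSON text, ready for a document store
```

Keeping the three sources in one document means a later query can relate sentiment to weather without joining separate tables, which suits a document store.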

PixieDust

https://github.com/ibm-cds-labs/pixiedust

Simpler Workflow

PixieDust: an Open Source Library that simplifies and improves Jupyter Python Notebooks

• PackageManager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps

https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/

@DTAIEB55

Install Spark packages or plain jars in your Notebook Python kernel without the need to modify a configuration file

Uses the GraphFrame Python APIs

Install GraphFrames Spark Package

One simple API: display()

Call the Options dialog

Panning/Zooming options

Performance statistics

Easily export your data to csv, json, html, etc. locally on your laptop or into a cloud-based service like Cloudant or Object Storage

Scala Bridge

Define a Python variable

Use the Python var in Scala

Define a Scala variable

Use the Scala var in Python

Easily extend PixieDust to create your own visualizations using HTML/CSS/JavaScript

Customized Visualization for GraphFrame Graphs

Encapsulate your analytics into compelling User Interfaces better suited for Line of Business Users

IBM Watson Data Platform
• Data Science Experience
• Watson Data Platform
• Machine Learning

• Sign up for beta: http://datascience.ibm.com/features#machinelearning

https://developer.ibm.com/clouddataservices/author/mgroenen/

Thanks!

Slides will be here: http://www.slideshare.net/MargrietGroenendijk