Cloud architectures for data science

Margriet Groenendijk, PhD (@MargrietGr), Developer Advocate for IBM Cloud Data Services. O’Reilly Software Architecture Conference, San Francisco, 16 November 2016.

About me

• Developer Advocate at IBM Cloud Data Services, UK
  • Data science
  • Python, Spark, R, Cloudant, dashDB
• Research Fellow at University of Exeter, UK
  • Worked with very large observational datasets and the output of global-scale climate models
• PhD at Vrije Universiteit Amsterdam, the Netherlands
  • Explored large observational datasets of carbon uptake by forests

A Brief History of Data Science

• Computer Science
• Data Technology
• Visualization
• Mathematics
• Statistics

http://www.datascienceassn.org/content/history-data-science

1781: http://visual.ly/exports-and-imports-scotland

1821: https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png

1855: http://visual.ly/diagram-causes-mortality-army-east

1960s: http://www.computerhistory.org/collections/catalog/102630767

1960s: http://www.climatecentral.org/news/first-climate-model-video-19007

2016

How many Data Scientists are there?

https://blog.rjmetrics.com/2015/10/05/how-many-data-scientists-are-there/

https://whatsthebigdata.com/2015/11/08/top-skills-and-backgrounds-of-data-scientists-on-linkedin/

Toolbox

http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png

Data Science is a Team Effort

• Data Engineers
• Data Scientists
• Business Analysts
• App Developers

Data

Data Science Workflow

• Discover Data
• Use Data
• Publish Data
• Socialize Data

Data Science Workflow

• Define Question
• Find Data
• Explore Data
• Clean Data
• Visualize and Summarize Data
• Create Predictive Models
• Present Results

Collect Data

• APIs
• Open Data
• Maps
• Web Scraping
• Time Series

Store Data

• Object Store - binary files
• Relational database
• Document store - JSON
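To make the difference between the last two options concrete, the sketch below stores one weather reading twice: as a row in a relational table and as a JSON document. It uses Python's built-in sqlite3 and json modules purely as stand-ins for a cloud database; the table name, field names and values are invented for illustration and are not from the talk.

```python
import json
import sqlite3

# Hypothetical weather reading; field names and values are illustrative only.
reading = {"city": "San Francisco", "time": "2016-11-16T09:00:00", "temp_c": 14.5}

# Relational store: a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, time TEXT, temp_c REAL)")
conn.execute(
    "INSERT INTO weather (city, time, temp_c) VALUES (:city, :time, :temp_c)",
    reading,
)
row = conn.execute(
    "SELECT city, temp_c FROM weather WHERE temp_c > ?", (10,)
).fetchone()
conn.close()

# Document store: each record is a self-describing JSON document,
# so nested or optional fields need no schema change.
document = dict(reading, tags=["forecast", "hourly"])
serialized = json.dumps(document, sort_keys=True)
restored = json.loads(serialized)
```

The relational version needs the schema up front, while the document version can gain a field such as tags without touching earlier records.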

Explore Data

• Explore Data
• Clean Data
• Store Data

Spark on a Cluster

The Spark Stack

from Karau et al.: Learning Spark

RDDs: Resilient Distributed Datasets

• Data does not have to fit on a single machine
• Data is separated into partitions
• Creation of RDDs
  • Load an external dataset
  • Distribute a collection of objects
• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
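The difference between lazy transformations and actions can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark's API: the class name ToyRDD and its internals are made up for this example. In PySpark the corresponding calls would be sc.parallelize(...), rdd.map(...) and rdd.filter(...) for creation and transformations, and rdd.collect() and rdd.count() for actions.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus a chain of pending steps."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions  # list of lists, one per "machine"
        self.steps = steps            # transformations recorded but not yet run

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        # Distribute a local collection round-robin over the partitions.
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    # Transformations: return a new ToyRDD and do no work yet (lazy).
    def map(self, fn):
        return ToyRDD(self.partitions, self.steps + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.partitions, self.steps + (("filter", fn),))

    def _run(self, partition):
        # Apply every recorded step, in order, to one partition.
        for kind, fn in self.steps:
            if kind == "map":
                partition = [fn(x) for x in partition]
            else:
                partition = [x for x in partition if fn(x)]
        return partition

    # Actions: trigger the computation and return a result to the caller.
    def collect(self):
        return [x for part in self.partitions for x in self._run(part)]

    def count(self):
        return sum(len(self._run(part)) for part in self.partitions)


calls = []

def square(x):
    calls.append(x)  # record when the function is actually executed
    return x * x

numbers = ToyRDD.parallelize(range(6))
evens = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still zero: nothing has run yet
result = sorted(evens.collect())   # the action runs the whole chain
```

Chaining map and filter only records the steps; square is first called when collect() asks for a result.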

Run Spark locally in a Python notebook

• Install Anaconda: https://www.continuum.io/downloads
• Download Spark: http://spark.apache.org/downloads.html
• Create a new kernel to use in a Jupyter notebook
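One common recipe for the kernel step is a kernel spec that starts a regular Python kernel with Spark's Python libraries on its path. The sketch below only builds the kernel.json content. The Spark location /opt/spark, the display name and the helper name pyspark_kernel_spec are assumptions for illustration; the resulting file would go in a Jupyter kernels directory (jupyter kernelspec list shows where those live), and details may differ between Spark versions.

```python
import glob
import json
import os
import sys


def pyspark_kernel_spec(spark_home):
    """Build kernel.json content for a Python kernel that can import pyspark."""
    python_dir = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip under python/lib; its version differs per
    # release, so look it up instead of hard-coding the file name.
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
    return {
        "display_name": "PySpark (local)",  # assumed name shown in the notebook UI
        "language": "python",
        "argv": [sys.executable, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            "PYTHONPATH": os.pathsep.join([python_dir] + py4j_zips),
            # pyspark/shell.py is the script the pyspark shell uses to create `sc`.
            "PYTHONSTARTUP": os.path.join(python_dir, "pyspark", "shell.py"),
        },
    }


spec = pyspark_kernel_spec("/opt/spark")  # assumed install location
kernel_json = json.dumps(spec, indent=2)
```

Writing kernel_json to a file named kernel.json inside its own folder in the kernels directory should make the new kernel selectable from the notebook's kernel menu.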

Jupyter Notebooks!

• Server-client application to edit and run notebook documents via a web browser
• Cells with:
  • Code
  • Figures and tables
  • Rich text elements
• Different kernels: Python, R, Scala, Spark

In the Cloud: http://datascience.ibm.com/

Weather Data

Define Question

What will the weather be next weekend?

https://unsplash.com/search/autumn?photo=LSF8WGtQmn8

https://unsplash.com/search/rain?photo=19tQv51x4-A

Find Data

https://console.ng.bluemix.net/

Explore Data

Python packages
• requests and json
  • API credentials and latitude/longitude of San Francisco
  • JSON data returned
• pandas, numpy and datetime
  • convert JSON to pandas DataFrame (table with multiple indices)
  • add time as index
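The steps on this slide can be sketched with the standard library alone. The example below skips the actual API call and starts from a stand-in JSON string; its structure and field names are invented for illustration and do not reflect the real weather service's response format. In a notebook the text would typically come from something like requests.get(url, auth=(username, password)).text.

```python
import json
from datetime import datetime, timezone

# San Francisco coordinates, as would be passed to a forecast API.
LATITUDE, LONGITUDE = 37.77, -122.42

# Stand-in for the JSON text returned by the API (hypothetical structure).
raw = json.dumps({
    "forecasts": [
        {"time": 1479312000, "temp": 14, "rain_prob": 10},
        {"time": 1479315600, "temp": 15, "rain_prob": 20},
    ]
})


def to_time_indexed(payload):
    """Map each forecast's UTC timestamp to its remaining fields."""
    table = {}
    for entry in json.loads(payload)["forecasts"]:
        fields = dict(entry)
        when = datetime.fromtimestamp(fields.pop("time"), tz=timezone.utc)
        table[when] = fields
    return table


forecast = to_time_indexed(raw)
first_time = min(forecast)
```

With pandas installed, pd.DataFrame.from_dict(forecast, orient="index") would turn this mapping into a DataFrame with time as its index.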

Weather forecast for San Francisco

https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/

Visualize Data

Python packages
• pandas - rolling mean
• matplotlib
• Basemap
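A rolling mean smooths a time series by averaging each value with the ones just before it. The sketch below shows the idea in plain Python. The function name rolling_mean and the temperature values are illustrative assumptions; in pandas the same result, with NaN in place of None, comes from pd.Series(temps).rolling(window=3).mean().

```python
from collections import deque


def rolling_mean(values, window):
    """Mean of the last `window` values; None until the window is full."""
    recent = deque(maxlen=window)
    total = 0.0
    means = []
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # oldest value is about to be pushed out
        recent.append(value)
        total += value
        means.append(total / window if len(recent) == window else None)
    return means


# Hypothetical hourly temperatures in degrees Celsius.
temps = [12.0, 14.0, 13.0, 17.0, 15.0]
smoothed = rolling_mean(temps, window=3)
```

A larger window gives a smoother curve at the cost of more undefined values at the start of the series.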

Weather map - example for UK

https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/

Python packages
• matplotlib
• Basemap
• itertools
• urllib

Run as a daily cron job

Cloudant

Weather, Twitter and Sentiment

• Where to find the data?
• Where to store the data?
• Where to analyse the data?
• Quick tools to explore

Insights for Twitter

Add sentiment - example

• Watson Tone Analyzer
  • Emotion
  • Language style
  • Social propensities

Analyze how you are coming across to others

Simpler Workflow

Weather Company Data

crontab -e

0 23 * * * /path/to/file/do_something.sh

python do_something.py

Tweets, Weather, Sentiment

Watson Tone Analyzer

Insights for Twitter

Cloudant NoSQL
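The crontab entry on this slide has five schedule fields followed by the command to run: minute 0, hour 23, and wildcards for day of month, month and day of week, so the script runs every day at 23:00. The small sketch below splits such a line into its parts. The helper name parse_crontab_line is made up for illustration, and the parser ignores crontab features such as ranges, steps and comments.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_crontab_line(line):
    """Split a crontab entry into its five schedule fields and the command."""
    parts = line.split(None, 5)
    if len(parts) != 6:
        raise ValueError("expected five schedule fields and a command")
    schedule = dict(zip(FIELDS, parts[:5]))
    return schedule, parts[5]


schedule, command = parse_crontab_line("0 23 * * * /path/to/file/do_something.sh")

# All three date fields are wildcards, so the job fires once every day.
runs_daily = all(
    schedule[name] == "*" for name in ("day_of_month", "month", "day_of_week")
)
```

Here schedule holds minute "0" and hour "23", and command is the path of the shell script that in turn calls python do_something.py.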

PixieDust

https://github.com/ibm-cds-labs/pixiedust

Simpler Workflow

PixieDust: an open source library that simplifies and improves Jupyter Python notebooks

• PackageManager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps

https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/

@DTAIEB55

Install Spark packages or plain jars in your notebook's Python kernel without modifying a configuration file

Uses the GraphFrame Python APIs

Install GraphFrames Spark Package

One simple API: display()

Call the Options dialog

Panning/Zooming options

Performance statistics

Easily export your data to csv, json, html, etc. locally on your laptop or into a cloud-based service like Cloudant or Object Storage

Scala Bridge

Define a Python variable

Use the Python var in Scala

Define a Scala variable

Use the Scala var in Python

Easily extend PixieDust to create your own visualizations using HTML/CSS/JavaScript

Customized Visualization for GraphFrame Graphs

Encapsulate your analytics into compelling user interfaces better suited to line-of-business users

https://github.com/ibm-cds-labs/ibmseti/

SETI

SETI@IBMCloud

• Mission: To explore, understand and explain the origin and nature of life in the universe
• Origins: Started in 1959 by two physicists at Cornell
• NASA became interested in 1970 and started working with SETI in 1988; funding was cut in 1993

http://www.seti.org/node/861

SETI@IBMCloud - the Data

• The Allen Telescope Array
  • 198 million radio events detected in the last decade
  • 400,000 candidate signals identified
  • 5 TB data generated in 10 hours
• No modern analysis or machine learning has been performed on this data
• 5 TB of special observations on IBM Object Store

https://github.com/ibm-cds-labs/ibmseti/

Public Spark@SETI

Goal: Amateur scientists/data scientists download and analyze SETI data

• 4 TB of SETI Data stored in Object Storage
• Web API provides Bluemix users access to download SETI data
• Spark using Jupyter Notebook and IBM SETI Python Library

Diagram labels: Object Storage, Web API, Spark, Object Storage, Public Spark@SETI Bluemix Account, My Bluemix Account

IBM Watson Data Platform

• Data Science Experience
• Watson Data Platform
• Machine Learning
• Sign up for beta: http://datascience.ibm.com/features#machinelearning

Data Science in the Cloud

• Flexible and quick to iterate, play and explore data
• APIs
  • Streaming data
  • Cloud databases
  • Watson
• Scaling up - add storage or Spark kernels
• Easy collaboration and presentation
  • Store Data
  • Share your analyses in notebooks
• Some useful packages: pandas, pyspark, requests, matplotlib, cloudant
• Notebooks can be extended! PixieDust

https://developer.ibm.com/clouddataservices/author/mgroenen/

Thanks!

Slides will be here: http://www.slideshare.net/MargrietGroenendijk