Big Data Analytics London - Data Science in the Cloud
@MargrietGr
Margriet Groenendijk, PhD, Developer Advocate for IBM Cloud Data Services
Big Data Analytics Meetup, London
6 December 2016
Data Science in the Cloud
https://blog.rjmetrics.com/2015/10/05/how-many-data-scientists-are-there/
1781
http://visual.ly/exports-and-imports-scotland
1821
https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png
1960s
http://www.computerhistory.org/collections/catalog/102630767
1960s
http://www.climatecentral.org/news/first-climate-model-video-19007
2016
20th century fractional change of Water Use Efficiency
2016
Toolbox
http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png
Explore Data
Clean Data
Store Data
Spark on a Cluster
The Spark Stack
from Karau et al.: Learning Spark
RDDs: Resilient Distributed Datasets
• Data does not have to fit on a single machine
• Data is separated into partitions
• Creation of RDDs:
  • Load an external dataset
  • Distribute a collection of objects
• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
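The lazy transformation/action split can be mimicked in plain Python with generators. This is only a rough analogy, not Spark itself: the "partitions" are list slices and the "transformation" builds generators that compute nothing until the "action" forces them.

```python
# Rough analogy to RDD laziness using plain Python (not Spark).
data = list(range(10))

# "Partitions": the data is split across two chunks.
partitions = [data[:5], data[5:]]

# "Transformation": build lazy generators; nothing is computed yet.
squared = [(x * x for x in part) for part in partitions]

# "Action": force evaluation and combine results from all partitions.
total = sum(sum(part) for part in squared)
print(total)  # 285
```

In PySpark the same pipeline would read `sc.parallelize(range(10)).map(lambda x: x * x).reduce(lambda a, b: a + b)`, with Spark handling the partitioning and scheduling.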
Run Spark locally in a Python notebook
https://www.continuum.io/downloads
http://spark.apache.org/downloads.html
Create a new kernel to use in a Jupyter notebook
www.slideshare.net/MargrietGroenendijk
Jupyter Notebooks!
• Server-client application to edit and run notebook documents via a web browser
• Cells with:
  • Code
  • Figures and tables
  • Rich text elements
• Different kernels: Python, R, Scala, Spark
In the Cloud:
http://datascience.ibm.com/
Sign up for beta: http://datascience.ibm.com/features#machinelearning
Store Data in the Cloud
Object Store
Relational database
Document store (JSON)
https://github.com/ibm-cds-labs/ibmseti/
SETI
• Mission: To explore, understand and explain the origin and nature of life in the universe
• The Allen Telescope Array:
  • 198 million radio events detected in the last decade
  • 400,000 candidate signals identified
  • 5 TB of data generated in 10 hours
• No modern analysis or machine learning has been performed on this data
• 5 TB of special observations on IBM Object Store
SETI@IBMCloud
https://github.com/ibm-cds-labs/ibmseti/
http://www.seti.org/node/861
Access SETI data from Object Store
Local
On DSX
Weather Data
What will the weather be next weekend?
https://unsplash.com/search/autumn?photo=LSF8WGtQmn8
https://unsplash.com/search/rain?photo=19tQv51x4-A
Find Data https://console.ng.bluemix.net/
Load weather data
Weather forecast for London
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
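A minimal sketch of reading such a daily forecast in a notebook. The JSON structure below is illustrative only; the real Weather Company Data API response differs in detail, so treat the field names (`forecasts`, `dow`, `narrative`) as assumptions based on the shape of typical forecast payloads.

```python
import json

# Illustrative sample of a daily-forecast response (not the exact API schema).
sample = json.loads("""
{
  "forecasts": [
    {"dow": "Saturday", "narrative": "Cloudy with light rain. High 9C."},
    {"dow": "Sunday",   "narrative": "Partly sunny. High 11C."}
  ]
}
""")

# Pull out a readable day-by-day summary, as you might print in a notebook.
for day in sample["forecasts"]:
    print(f"{day['dow']}: {day['narrative']}")
```

In practice the JSON would come from an HTTP call to the forecast endpoint using the service credentials from Bluemix, with the same parsing applied to the response body.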
Visualize Data
Demo!
Weather map - example for UK
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
Python packages:
• matplotlib
• Basemap
• itertools
• urllib
Run as a daily cron job
cloudant
Weather, Twitter and Sentiment
Weather, Twitter and Sentiment
• Where to find the data?
• Where to store the data?
• Where to analyse the data?
• Quick tools to explore
Insights for Twitter
Add sentiment - example
Watson Tone Analyzer
• Emotion
• Language style
• Social propensities
Analyze how you are coming across to others
Workflow
Weather Company Data
crontab -e
0 23 * * * /path/to/file/do_something.sh
python do_something.py
Tweets, Weather, Sentiment
Watson Tone Analyzer
Insights for Twitter
Cloudant NoSQL
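A minimal sketch of what the daily cron job might produce: one JSON document that joins a tweet, a weather reading and a sentiment score, ready to be written to Cloudant. All field names, the account, and the database name here are hypothetical, not the deck's actual schema.

```python
import json

def build_record(tweet_text, temperature_c, sentiment_score):
    """Combine one tweet, a weather reading and a sentiment score
    into a single JSON document (hypothetical schema)."""
    return {
        "tweet": tweet_text,
        "temperature_c": temperature_c,
        "sentiment": sentiment_score,
    }

record = build_record("Lovely sunny day in London!", 11.5, 0.8)
doc = json.dumps(record)
print(doc)

# Cloudant speaks the CouchDB HTTP API, so the cron job could store the
# document with a plain POST (account, database and credentials made up):
#   import requests
#   requests.post("https://ACCOUNT.cloudant.com/tweets_weather",
#                 data=doc, headers={"Content-Type": "application/json"},
#                 auth=("USER", "PASSWORD"))
```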
https://github.com/ibm-cds-labs/pixiedust
PixieDust
PixieDust: an Open Source Library that simplifies and improves Jupyter Python Notebooks
• Package Manager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps
https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/
@DTAIEB55
Install Spark packages or plain jars in your notebook's Python kernel without needing to modify configuration files
Uses the GraphFrame Python APIs
Install GraphFrames Spark Package
One simple API: display()
Call the Options dialog
Panning/Zooming options
Performance statistics
Easily export your data to csv, json, html, etc. locally on your laptop or into a cloud-based service like Cloudant or Object Storage
Scala Bridge
Define a Python variable
Use the Python var in Scala
Define a Scala variable
Use the Scala var in Python
Easily extend PixieDust to create your own visualizations using HTML/CSS/JavaScript
Customized Visualization for GraphFrame Graphs
Real time Twitter sentiment analysis
https://developer.ibm.com/clouddataservices/author/mgroenen/
Thanks!
Slides will be here: http://www.slideshare.net/MargrietGroenendijk
Spark installation
• http://spark.apache.org/downloads.html
• Spark release: 1.6.2
• package type: Pre-built for Hadoop 2.6
• mkdir dev
• cd dev
• tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
• ln -s spark-1.6.2-bin-hadoop2.6 spark
• mkdir dev/notebooks
• mkdir ~/.ipython/kernels/pyspark1.6/
• create file kernel.json
• cd ~/dev/spark/conf
• cp spark-defaults.conf.template spark-defaults.conf
• add to end of spark-defaults.conf: spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*
{
  "display_name": "pySpark (Spark 1.6.2) Python 2",
  "language": "python",
  "argv": [
    "/Users/sparktest/miniconda2/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/sparktest/dev/spark",
    "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
    "SPARK_DRIVER_MEMORY": "10G",
    "SPARK_LOCAL_IP": "127.0.0.1"
  }
}