Data Science in the Cloud
docs.occc.ir/occc70/OCCC70_Data_Science_in_the_Cloud.pdf
BigDataWorkGroup
An Introduction to Data Science
Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management and preservation of large collections of information.
Jeffrey Stanton, An Introduction to Data Science, Syracuse University School of Information Studies
A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product.
Hilary Mason, chief scientist at bit.ly
Data wrangling, data jujitsu, data munging
https://www.coursera.org/specializations/data-science
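The "obtain, scrub, explore" steps named in the quote above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the CSV content, the city names and the function names are hypothetical, and the modelling and interpretation steps are left out.

```python
import csv
import io
import statistics

# Hypothetical raw data: a tiny CSV with one empty and one non-numeric reading.
RAW = """city,temp_c
Tehran,31.5
Shiraz,
Tabriz,24.0
Mashhad,n/a
Isfahan,29.0
"""


def obtain(text):
    """Obtain: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: keep only rows whose temperature parses as a number."""
    clean = []
    for row in rows:
        try:
            clean.append({"city": row["city"], "temp_c": float(row["temp_c"])})
        except ValueError:
            continue  # drop empty or non-numeric readings
    return clean


def explore(rows):
    """Explore: summarise the cleaned temperature column."""
    temps = [r["temp_c"] for r in rows]
    return {"n": len(temps), "mean": statistics.mean(temps), "max": max(temps)}


rows = scrub(obtain(RAW))
summary = explore(rows)
print(summary)
```

Most of the code is spent on scrubbing rather than on analysis, which is the point behind terms such as "data wrangling" and "data munging".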
Data Products
Data science is about building data products, not just answering questions.
Data-driven apps: spellcheckers, machine translators
Interactive visualizations: Google flu application, Global Burden of Disease
Online databases: enterprise data warehouses, Sloan Digital Sky Survey
eScience = Data Science
Empirical:
Observe the natural world; replicate the natural world in the laboratory
Theoretical:
Use theory to model the empirical observations
Computational:
Simulate on a computer the problems that cannot be observed in the laboratory and are too complex to analyze with theoretical models. Define initial conditions and run the simulation.
eScience: acquire massive data sets (databases, visualization, scale-out computing, NoSQL, machine learning)
https://vidensportal.deic.dk/what-is-eScience?language=en
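The computational paradigm ("define initial conditions and run the simulation") can be sketched as follows. The physical system (a falling body with linear drag), the parameter values and the function name are assumptions chosen for illustration. The system is deliberately simple enough to have a closed-form solution, so the simulated result can be checked against theory.

```python
import math


def simulate_fall(v0, drag, g=9.81, dt=0.001, t_end=10.0):
    """Explicit Euler integration of dv/dt = g - drag * v, starting from v0."""
    v = v0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        v += (g - drag * v) * dt  # advance the velocity by one time step
    return v


# Initial conditions (hypothetical values): body at rest, drag coefficient 0.5 per second.
v_simulated = simulate_fall(v0=0.0, drag=0.5)

# Closed-form solution of the same equation, used only as a sanity check.
v_theory = (9.81 / 0.5) * (1.0 - math.exp(-0.5 * 10.0))

print(v_simulated, v_theory)
```

For problems without a closed-form solution the theoretical check disappears and the simulation becomes the only way to obtain the answer, which is the situation the slide describes.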
eScience = Data Science
Science is about asking questions
Traditionally: “Query the world”
Data acquisition activities are coupled to a specific hypothesis
eScience: “Download the world”
Data is acquired en masse in support of many hypotheses
The cost of data acquisition has dropped precipitously
The cost of finding, integrating, analyzing and communicating results is the new bottleneck
eScience is about the analysis of data
http://www.slideshare.net/RenuSuren/big-data-analysis-for-page-ranking-using-mapreduce
Who are data scientists?
To be successful, data scientists need an environment that is open, engaging, and fosters collaboration. They need:
Ability to use open source tools they know and love
Enterprise-grade functionality they’ll need for critical data science projects
Community that supports them throughout the whole process
In this seedbed of innovation, data scientists can break down data barriers and develop ideas that change the world.
http://www.ibm.com/analytics/us/en/technology/data-science/
Cloud Providers
Databricks
IBM Cloud Data Services
Google BigQuery
DataScience
Big Data Structures
Databricks
A Gentle Introduction to Apache Spark on Databricks
Workspaces
Notebooks
Dashboard
Jobs
Libraries (different languages)
Tables (Amazon S3)
Clusters (groups of computers)
Apps (third-party applications, e.g. Tableau)
Spark
SparkContext (Apache Spark engine) and SQLContext (DataFrame functionality)
Spark 2.0: SparkSession (single entry point that wraps both)
Data interfaces:
Dataset
DataFrame
RDD (Resilient Distributed Dataset)
IBM Cloud Data Services
Articles + Data sets + Notebooks + Tutorials
Data Source
Data Service
External
Amazon Redshift
Amazon S3
Apache Hive
Cloudera Impala
dashDB
DB2
Hortonworks HDFS
IBM Informix
Microsoft Azure
Microsoft SQL
MySQL
Netezza
Oracle
PostgreSQL
SQL Database
DataScience Cloud
Solutions
Google Cloud Platform for Data Scientists
BigQuery (Google's data warehouse)
BigQuery features
Speed & Scale: BigQuery can scan terabytes in seconds and petabytes in minutes, and can stream in 100,000 rows per second.
Incredible Pricing: scale and pay for storage and compute independently (pay-as-you-go model).
Security and Reliability: automatically encrypts and replicates your data, and you control access to it.
Global Availability: option to store BigQuery data in European locations.
Fully Integrated: SQL, Cloud Dataflow, Spark, Hadoop
Partnership
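To put the streaming figure above in perspective, here is a small back-of-the-envelope calculation. It uses only the 100,000 rows per second rate quoted on the slide and assumes that rate is sustained; the function name and the one-billion-row workload are hypothetical.

```python
STREAM_ROWS_PER_SECOND = 100_000  # streaming rate quoted on the slide


def seconds_to_stream(total_rows, rate=STREAM_ROWS_PER_SECOND):
    """Seconds needed to ingest total_rows at a constant streaming rate."""
    return total_rows / rate


# Rows that could be ingested in one day at the quoted rate.
rows_per_day = STREAM_ROWS_PER_SECOND * 24 * 60 * 60

# Hours needed to stream a hypothetical one-billion-row data set.
hours_for_one_billion = seconds_to_stream(1_000_000_000) / 3600

print(rows_per_day)
print(round(hours_for_one_billion, 2))
```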
Q & A
Telegram.me/BigDataWorkGroup
Sg.sharif.ir