Scientific Computing in the Clouds
Karan Bhatia, Google
May 1, 2017
Investing to meet University and research needs
$29.4 Billion: Google’s trailing 3-year CAPEX investment
1 Billion: end users served by GCP customers
Data Localization
[Map: Google Cloud regions and network. Legend: current regions and number of zones; committed regions for 2017 and number of zones; points of presence; network path.]
Regions labeled on the map: Oregon, Iowa, S Carolina, N Virginia, São Paulo, London, Belgium, Frankfurt, Finland, Mumbai, Singapore, Taiwan, Tokyo (2016), Sydney
https://peering.google.com
https://cloud.google.com/compute/docs/regions-zones/regions-zones
Agenda
Big Compute
Big Data
Programs
Patterns
Big Compute
Proprietary + Confidential
SC16 CMS Demonstrator
Target: generate 1 Billion events in 48 hours during Supercomputing 2016 on Google Cloud via HEPCloud
35% filter efficiency = stage out 380 million events → 150 TB output
Double the size of global CMS computing resources
CMS Higgs Event - credit: CERN https://commons.wikimedia.org/wiki/File:CMS_Higgs-event.jpg
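The targets on this slide imply a sustained rate that is easy to lose sight of. A minimal sketch of that arithmetic follows, using only the figures quoted above and assuming decimal units (1 TB = 10^12 bytes). The staged-out event count is taken as quoted rather than recomputed from the filter efficiency.

```python
# Figures quoted on the slide.
TARGET_EVENTS = 1_000_000_000   # events to generate
WINDOW_HOURS = 48               # duration of the run
STAGED_EVENTS = 380_000_000     # events staged out after filtering
OUTPUT_TB = 150                 # total output volume

# Assumption: decimal units (1 TB = 10**12 bytes).
BYTES_PER_TB = 10**12
window_seconds = WINDOW_HOURS * 3600

# Average generation rate needed to hit the target inside the window.
events_per_second = TARGET_EVENTS / window_seconds

# Average size of one staged-out event.
bytes_per_staged_event = OUTPUT_TB * BYTES_PER_TB / STAGED_EVENTS

# Average stage-out bandwidth if output is written evenly across the window.
output_mb_per_second = OUTPUT_TB * BYTES_PER_TB / window_seconds / 10**6

print(f"{events_per_second:,.0f} events/s generated")
print(f"{bytes_per_staged_event / 1000:,.0f} kB per staged event")
print(f"{output_mb_per_second:,.0f} MB/s average stage-out rate")
```

These are averages over the full 48 hours; ramp-up and preemptions would push the required peak rates higher.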
Cores from Google
MIT Research w/ VMs
Products used: Google Compute Engine, Cloud Storage, Cloud Datastore
220,000 cores on preemptible VMs
2,250 32-core instances, 60 CPU-years of computation in a single afternoon
Answers in hours vs. months
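The "single afternoon" claim can be checked against the instance count. The sketch below assumes perfect utilization of every core and a 365-day year; both are simplifications, and the separate 220,000-core figure on this slide is not used here.

```python
# Figures quoted on the slide.
INSTANCES = 2_250
CORES_PER_INSTANCE = 32
CPU_YEARS = 60

# Assumption: a 365-day year and every core busy for the whole run.
HOURS_PER_YEAR = 365 * 24

total_cores = INSTANCES * CORES_PER_INSTANCE
cpu_hours = CPU_YEARS * HOURS_PER_YEAR
wall_clock_hours = cpu_hours / total_cores

print(f"{total_cores:,} cores")
print(f"{wall_clock_hours:.1f} hours of wall-clock time")
```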
Broad FireCloud: WDL, Cromwell and Google Genomics
WDL: an external DSL used by computational biologists to express the analytical pipelines
Cromwell: a scalable, robust engine for executing WDL against pluggable backends including local, Docker, Grid Engine or …
Google Genomics Pipelines API: co-developed by Broad and Google Genomics, a scalable Docker-as-a-Service with data scheduling
Pipeline definition
{
  "name": "samtools index",
  "description": "Run samtools index to generate a BAM index file",
  "inputParameters": [
    {
      "name": "inputFile",
      "localCopy": {
        "disk": "data",
        "path": "input.bam"
      }
    }
  ],
  "outputParameters": [
    {
      "name": "outputFile",
      "localCopy": {
        "disk": "data",
        "path": "output.bam.bai"
      }
    }
  ],
  "resources": {
    "minimumCpuCores": 1,
    "minimumRamGb": 1,
    "disks": [{
      "name": "data",
      "type": "PERSISTENT_HDD",
      "sizeGb": 200,
      "mountPoint": "/mnt/data"
    }]
  },
  "docker": {
    "imageName": "quay.io/cancercollaboratory/dockstore-tool-samtools-index",
    "cmd": "samtools index /mnt/data/input.bam /mnt/data/output.bam.bai"
  }
}
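The definition ties three things together: each parameter's `localCopy` names a disk and a relative path, the disk declares a mount point, and the Docker command refers to the resulting absolute paths. A minimal sketch of a consistency check is shown below. The helper names `container_paths` and `unreferenced_parameters` are hypothetical, and the dictionary is a trimmed stand-in for the definition with the output file listed under `outputParameters`, which is an assumption about the API's schema.

```python
import posixpath

# Trimmed stand-in for the pipeline definition (only the fields the check needs).
pipeline = {
    "inputParameters": [
        {"name": "inputFile", "localCopy": {"disk": "data", "path": "input.bam"}},
    ],
    "outputParameters": [
        {"name": "outputFile", "localCopy": {"disk": "data", "path": "output.bam.bai"}},
    ],
    "resources": {"disks": [{"name": "data", "mountPoint": "/mnt/data"}]},
    "docker": {"cmd": "samtools index /mnt/data/input.bam /mnt/data/output.bam.bai"},
}


def container_paths(pipeline):
    """Map each parameter name to the path its file has inside the container."""
    mounts = {d["name"]: d["mountPoint"] for d in pipeline["resources"]["disks"]}
    params = pipeline.get("inputParameters", []) + pipeline.get("outputParameters", [])
    paths = {}
    for param in params:
        copy = param["localCopy"]
        if copy["disk"] not in mounts:
            raise ValueError(f"{param['name']} uses undeclared disk {copy['disk']!r}")
        paths[param["name"]] = posixpath.join(mounts[copy["disk"]], copy["path"])
    return paths


def unreferenced_parameters(pipeline):
    """Return parameters whose container path never appears in the Docker command."""
    tokens = pipeline["docker"]["cmd"].split()
    return sorted(n for n, p in container_paths(pipeline).items() if p not in tokens)


print(container_paths(pipeline))
print(unreferenced_parameters(pipeline))
```

An empty list from `unreferenced_parameters` means every declared file is actually used by the command; a typo in either the path or the mount point would show up as a named parameter.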
Create, run, monitor, and kill pipelines
Create
$ gcloud alpha genomics pipelines create --pipeline-json-file samtools_index.json
Created samtools index, id: PIPELINE-ID
Run
$ gcloud alpha genomics pipelines run --pipeline_id PIPELINE-ID \
    --logging gs://YOUR-BUCKET/YOUR-DIRECTORY/logs \
    --inputs inputFile=gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam \
    --outputs outputFile=gs://YOUR-BUCKET/YOUR-DIRECTORY/output/NA12878_chr22.bam.bai
Running: operations/OPERATION-ID
Status
$ gcloud alpha genomics operations describe OPERATION-ID
Kill
$ gcloud alpha genomics operations cancel OPERATION-ID
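When many samples need the same pipeline, the run command is usually assembled by a script rather than typed. The sketch below builds the argument list for one run. The helper `build_run_command` is hypothetical, and the flags are copied from the slide without being checked against any current gcloud release, so treat them as assumptions.

```python
import shlex


def build_run_command(pipeline_id, logging_uri, input_name, input_uri,
                      output_name, output_uri):
    """Return the argv list for the `pipelines run` call shown on the slide."""
    return [
        "gcloud", "alpha", "genomics", "pipelines", "run",
        "--pipeline_id", pipeline_id,
        "--logging", logging_uri,
        "--inputs", f"{input_name}={input_uri}",
        "--outputs", f"{output_name}={output_uri}",
    ]


argv = build_run_command(
    pipeline_id="PIPELINE-ID",
    logging_uri="gs://YOUR-BUCKET/YOUR-DIRECTORY/logs",
    input_name="inputFile",
    input_uri="gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam",
    output_name="outputFile",
    output_uri="gs://YOUR-BUCKET/YOUR-DIRECTORY/output/NA12878_chr22.bam.bai",
)

# Print the command instead of running it; nothing is executed here.
print(shlex.join(argv))
```

Keeping the command as a list avoids shell quoting problems; the same list could be handed to `subprocess.run(argv, check=True)` on a machine where the command is available.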
dsub (Google Genomics Pipelines)
Lessons
● Integration with third-party workload manager vs roll your own vs something in between
○ HTCondor, Slurm, Google Genomics Pipelines, ssh
○ Managed instance groups
● On-premise + hybrid vs on-cloud
● Cost optimizations
○ Preemptible VMs and custom machine types
○ Per-minute billing
● Networking is a key differentiator, public peering + Internet2 member
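The billing-granularity point matters most for batches of short jobs. The sketch below compares billed time under hourly rounding and per-minute rounding. The runtimes are made up for illustration, and the 10-minute minimum charge is an assumption about the billing policy rather than something stated on the slide.

```python
import math


def billed_minutes(runtime_minutes, increment, minimum=0):
    """Round a runtime up to the billing increment, subject to a minimum charge."""
    rounded = math.ceil(runtime_minutes / increment) * increment
    return max(rounded, minimum)


# Hypothetical runtimes, in minutes, for a batch of short jobs.
runtimes = [7, 12, 61, 95, 130]

# Hourly billing: every started hour is charged in full.
per_hour_total = sum(billed_minutes(r, increment=60) for r in runtimes)

# Per-minute billing with an assumed 10-minute minimum per instance.
per_minute_total = sum(billed_minutes(r, increment=1, minimum=10) for r in runtimes)

print(f"hourly rounding:     {per_hour_total} billed minutes")
print(f"per-minute rounding: {per_minute_total} billed minutes")
```

The gap grows as jobs get shorter, which is also when preemptible VMs tend to be most attractive.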
Intel Skylake
● Significant “per core” performance improvements
● Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
○ 2x flops/second
● Accelerated IO with Intel® Omni-Path Architecture (Fabric)
● Integrated Intel® QuickAssist Technology (crypto & compression offload)
● Intel® Resource Director Technology (Intel® RDT) for Efficiency & TCO
Hardware Accelerated
● Available Today: NVIDIA K80 GPU
● Coming Soon: Tensor Processing Unit (TPU)
● Custom ASIC built and optimized for TensorFlow
● Used in production at Google for over 16 months
● 7 years ahead of GPU performance per watt
Data
© 2016 Google
Data Prep (beta)
[Diagram: 1. Ingest Data → Raw Data → 2. Instantly Prepare Data → Clean Data → 3. Analyze Data]
Products shown: Cloud Pub/Sub, Cloud Dataflow, Cloud Dataprep, Google BigQuery, Data Studio, Cloud ML
Cloud Dataprep
Supports Common Data Sources of Any Size
Process diverse datasets - structured and unstructured. Transform data stored in CSV, JSON, or relational table formats. Prepare datasets of any size, megabytes to terabytes, with equal ease.
Instant Data Exploration
Visually explore and interact with data in seconds. Instantly understand data distribution and patterns. You do not need to write code; you can prepare data with a few clicks.
Intelligent Data Cleansing
Cloud Dataprep automatically identifies data anomalies and helps you take corrective action fast. Get data transformation suggestions based on your usage pattern. Standardize, structure, and join datasets easily with a guided approach.
Serverless
Cloud Dataprep is a serverless service, so you do not need to create or manage infrastructure.
Seriously Powerful
Cloud Dataprep is built on top of the powerful Google Cloud Dataflow service. It is auto-scalable and can easily handle processing massive data sets.
Google cloud computing can help universities transform
Teaching faculty in select countries
Teaching university courses
In computer science or related fields
Funding Agency Partnerships
● National Science Foundation
○ BIGDATA
● National Institutes of Health
○ Data Commons
Google Cloud Public Datasets Program
Mission: Facilitate the onboarding of datasets into Google Cloud products
42+ datasets
Themes / Patterns for Scientific Computing
Extending the Cloud APIs
[Architecture diagram: ISB-CGC built on Google Cloud]
Users and access modes:
● PI/Biologist: Web Access
● Computational Research Scientist: Python, R, SQL
● Algorithm Developer: ssh, programmatic access
Interfaces: ISB-CGC GUI, Google GUI, ISB-CGC API, Google
Services: Compute Engine VMs, Cloud Storage, BigQuery, Genomics API, Local Storage
Data: ISB-CGC Hosted Data, Controlled-Access Data, Open-Access Data, User Data
TensorFlow
● World’s most popular ML framework
● Developer friendly yet performance optimized
● Powers over 100 Google services
● Managed infrastructure with Cloud ML
● Tutorials at https://www.tensorflow.org
Linear Regression VS Neural Network
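One way to see the contrast in this slide title is to fit both kinds of model to a target that is not a straight line. The sketch below is an illustration in plain Python rather than TensorFlow, and it is not taken from the talk: it fits an ordinary least-squares line to y = |x| and compares it with a two-unit ReLU network whose weights are picked by hand, not trained.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def relu(v):
    """Rectified linear unit."""
    return max(0.0, v)


def tiny_network(x):
    """One hidden layer of two ReLU units with hand-picked weights (not trained)."""
    return relu(x) + relu(-x)


def mse(predict, xs, ys):
    """Mean squared error of a prediction function over the data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Target: y = |x|, which no single straight line can follow.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [abs(x) for x in xs]

a, b = fit_line(xs, ys)
linear_error = mse(lambda x: a * x + b, xs, ys)
network_error = mse(tiny_network, xs, ys)

print(f"linear regression MSE: {linear_error:.2f}")
print(f"two-unit ReLU network MSE: {network_error:.2f}")
```

The best line is flat and leaves a sizeable error, while the two hidden units reproduce the bend exactly. A trained network would have to discover such weights from data.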
Thank you!