Big data on google cloud

Big Data On Google Cloud Tu Pham - IO extended 2017

Transcript of Big data on google cloud

Page 1: Big data on google cloud

Big Data On Google Cloud

Tu Pham - IO Extended 2017

Page 2: Big data on google cloud
Page 3: Big data on google cloud

CTO @ Dyno, a data-as-a-service company

Technologies: Java, Python, all kinds of databases, and cloud platforms from Google, AWS, and Azure.

Interests: Cloud computing / architecture, technology evolution, distributed systems.

Husband, Father, GDE, Open source contributor.

Tu Pham

Photo: Lars Kruse, Aarhus Universitet

About Dyno: tech marketing & digital agency

Page 4: Big data on google cloud

For the past 17 years, Google has been building out the world’s fastest, most powerful, highest quality cloud infrastructure on the planet.

Images by Connie Zhou

Page 5: Big data on google cloud

Google Cloud Platform is built on the same infrastructure that powers Google.

Images by Connie Zhou

Page 6: Big data on google cloud

Google’s Platform

“[Google's] ability to build, organize, and operate a huge network of servers and fiber-optic cables with an efficiency and speed that rocks physics on its heels.

This is what makes Google Google: its physical network, its thousands of fiber miles, and those many thousands of servers that, in aggregate, add up to the mother of all clouds.”

- Wired

Page 7: Big data on google cloud

77 peering locations

Page 8: Big data on google cloud

Yes, We Can Power That

Web, Mobile, Storage & Database

Big Data, Highly Scalable System, Data Mining

Cloud Platform

Page 9: Big data on google cloud

Google Cloud Platform

Google’s Mission: Organize the world’s information and make it universally accessible and useful.

Page 10: Big data on google cloud


Source: Boston Consulting Group, The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact; IDC, 2015

By 2020, there will be 8 billion connected smartphones, 2X more than today.

And 32 billion connected “IoT” devices, 6X more than today.

Page 11: Big data on google cloud

Exploring the Cloud

IaaS: Infrastructure-as-a-Service

PaaS: Platform-as-a-Service

SaaS: Software-as-a-Service

Google Cloud Platform

Page 12: Big data on google cloud
Page 13: Big data on google cloud

Google Compute Engine

• Flexible Infrastructure
  • Custom VM Size
  • Online Disk Resizing
• Network
  • Internal Network
  • Firewall
  • Load Balancing
  • External IP Address
• Billing
  • Sustained Usage Discounts
  • Preemptible VM
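
The Sustained Usage Discounts bullet can be made concrete with a little arithmetic. The sketch below assumes the tier schedule historically documented for N1 machine types (each successive quarter of the month billed at 100%, 80%, 60%, and 40% of the base rate); the schedule and the function name are assumptions for illustration, not something stated on the slide.

```python
# Assumed tier schedule: each quarter of the month is billed at a
# decreasing fraction of the base rate (historical N1 schedule).
TIERS = [1.0, 0.8, 0.6, 0.4]


def effective_cost_fraction(usage_fraction: float) -> float:
    """Return the billed cost as a fraction of (base rate x full month).

    usage_fraction is the share of the month the VM was running, 0.0 to 1.0.
    """
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage_fraction must be between 0 and 1")
    billed = 0.0
    for index, multiplier in enumerate(TIERS):
        tier_start = index * 0.25
        # Portion of this 25% tier that the VM actually used.
        used_in_tier = min(max(usage_fraction - tier_start, 0.0), 0.25)
        billed += used_in_tier * multiplier
    return billed


if __name__ == "__main__":
    # A VM that runs the whole month is billed roughly 70% of list price.
    print(round(effective_cost_fraction(1.0), 2))
```

Under this assumed schedule, a VM that runs only half the month pays about 0.45 of a full month at list price instead of 0.50, so the discount grows the longer the instance stays up.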

Page 14: Big data on google cloud

App Engine

• Fully Managed Platform
• Popular Programming Language Support
• Flexible and Scalable Application Storage
• Auto-scaling
• Versioning and Traffic Splitting
• Local Developer Tools
• Third-party Frameworks and Extensions
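
To illustrate the idea behind Versioning and Traffic Splitting, here is a toy sketch of weighted routing between two versions. It is not App Engine's implementation or API; `pick_version`, the version names, and the hash-based assignment are all hypothetical, chosen only to show how a fixed share of users can be sent to a new version consistently.

```python
import hashlib
from typing import Dict


def pick_version(user_id: str, split: Dict[str, float]) -> str:
    """Deterministically map a user to a version according to weights."""
    total = sum(split.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a point in [0, total).
    point = int.from_bytes(digest[:4], "big") / 2**32 * total
    cumulative = 0.0
    chosen = ""
    for version, weight in split.items():
        chosen = version
        cumulative += weight
        if point < cumulative:
            break
    return chosen


if __name__ == "__main__":
    # Hypothetical split: 90% of users stay on v1, 10% try v2.
    print(pick_version("user-42", {"v1": 0.9, "v2": 0.1}))
```

Hashing the user ID keeps each user on the same version across requests, which is the property that makes a gradual rollout observable.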

Page 15: Big data on google cloud

Pub/Sub

• Global Presence
• Flexible Delivery Options
  • Pull
  • Push
• Data Reliability
• Flow Control
• Data Security and Protection
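
The difference between the Pull and Push delivery options can be sketched with an in-memory stand-in. `ToyTopic` below is a hypothetical class, not the Cloud Pub/Sub client library; it only models who initiates delivery in each mode.

```python
from collections import deque
from typing import Callable, Deque, List


class ToyTopic:
    """In-memory stand-in for a topic; not the Cloud Pub/Sub client API."""

    def __init__(self) -> None:
        self._pull_queue: Deque[str] = deque()
        self._push_endpoints: List[Callable[[str], None]] = []

    def subscribe_push(self, endpoint: Callable[[str], None]) -> None:
        # Push: the service calls the subscriber as soon as a message arrives.
        self._push_endpoints.append(endpoint)

    def publish(self, message: str) -> None:
        self._pull_queue.append(message)
        for endpoint in self._push_endpoints:
            endpoint(message)

    def pull(self, max_messages: int) -> List[str]:
        # Pull: the subscriber asks for messages when it is ready, which
        # gives it a simple form of flow control.
        batch: List[str] = []
        while self._pull_queue and len(batch) < max_messages:
            batch.append(self._pull_queue.popleft())
        return batch
```

In this model a push subscriber sees every message immediately, while a pull subscriber drains the backlog at its own pace by choosing `max_messages`.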

Page 16: Big data on google cloud

Cloud Dataflow

• Reliable & Consistent Processing
• Unified Programming Model
• Intelligent Work Scheduling
• Auto Scaling
• Monitoring
• Open Source
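
A rough way to picture the Unified Programming Model is that the same counting logic runs over a bounded (batch) input and an unbounded (streaming) input. The sketch below is plain Python, not the Dataflow or Apache Beam SDK, and it windows by element count rather than by event time, which is a simplifying assumption; all function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_events(events: Iterable[str]) -> Counter:
    """The shared business logic: count events by name."""
    return Counter(events)


def run_batch(events: List[str]) -> Counter:
    # Bounded input: process everything in one pass.
    return count_events(events)


def run_streaming(events: Iterable[str], window_size: int) -> Iterator[Counter]:
    # Unbounded input: apply the same logic to fixed-size windows as they fill.
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window: List[str] = []
    for event in events:
        window.append(event)
        if len(window) == window_size:
            yield count_events(window)
            window = []
    if window:
        # Emit the final, partially filled window.
        yield count_events(window)
```

Summing the per-window counters from `run_streaming` gives the same totals as `run_batch` on the same events, which is the point of writing the logic once.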

Page 17: Big data on google cloud

Cloud Storage

• Versioning
• Static Sites
• Resumable Transfers
• Object Change Notifications
• TB Scale

Page 18: Big data on google cloud

Cloud SQL

• Fully Managed
• Ease of Use
• Highly Reliable
• Flexible Charging
• Security, Availability, Durability
• Easy Migration & Data Portability
• Optimized MySQL Versions

Page 19: Big data on google cloud

BigQuery

• Fully Managed Big Data Analytics Service
• Supports SQL
• Fast
• Scalable
• Flexible and Familiar
• Security and Reliability
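
The SQL support and "familiar" claims refer to the kind of aggregate query analysts already write. The sketch below runs such a query with Python's built-in `sqlite3` as a local stand-in; it does not call BigQuery, and the `pageviews` table, its columns, and `top_pages` are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def top_pages(rows: List[Tuple[str, str]], limit: int) -> List[Tuple[str, int, int]]:
    """Return (page, views, distinct visitors) for the most viewed pages."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE pageviews (page TEXT, user_id TEXT)")
        conn.executemany("INSERT INTO pageviews VALUES (?, ?)", rows)
        # A typical analytics aggregate: group, count, rank.
        query = """
            SELECT page,
                   COUNT(*) AS views,
                   COUNT(DISTINCT user_id) AS visitors
            FROM pageviews
            GROUP BY page
            ORDER BY views DESC, page ASC
            LIMIT ?
        """
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()
```

The query text itself is ordinary SQL; the deck's point is that a managed service runs queries of this shape without the user provisioning a cluster.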

Page 20: Big data on google cloud

Dataproc

• Includes
  • Apache Hadoop
  • Apache Pig
  • Apache Hive
  • Apache Spark
• Fast and Scalable Data Processing
• Flexible Virtual Machines
• Resizable Cluster

Page 21: Big data on google cloud

Datalab

• Powerful Data Exploration
• Scalable
• Data Management
• Visualization
• Open Source (Jupyter)

Page 22: Big data on google cloud

Google’s Data Services for everyone

Page 23: Big data on google cloud

A common configuration

[Diagram labels: Events, metrics, etc.; Raw logs, files, assets, Google Analytics data, etc.; Stream; Batch; Cloud Datalab; Draw conclusions; Visualization and BI; Co-workers; Applications and Reports]

Confidential + Proprietary

A serverless big data stack that scales automatically

Page 24: Big data on google cloud

10+ Years of Tackling Big Data Problems

[Timeline with year marks 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015 and three rows: Google Papers, Open Source, Google Cloud Products]

Timeline entries: GFS, MapReduce, BigTable, Dremel, PubSub, FlumeJava, MillWheel, Apache Beam, TensorFlow

Google Cloud Products: BigQuery, Pub/Sub, Dataflow, Bigtable

Page 25: Big data on google cloud


Transform Data into Actions

Exploration & Collaboration

Databases / Storage

Data Preparation & Processing

Analytics

Advanced Analytics & Intelligence

Mobile apps

Sensors and devices

Web apps

Relational

Key-value

Document

SQL

Wide column

Object

Stream processing

Batch processing

Data preparation

Federated query

Data catalog

Data exploration

Data visualization

Developers

Data scientists

Business analysts

Development environment for Machine Learning

Pre-Trained Machine Learning models

Data Ingestion

Messaging

Logs

Page 26: Big data on google cloud


Transform Data into Actions

Data Preparation & Processing

Cloud Dataflow

Cloud Dataproc

Exploration & Collaboration

Google BigQuery

Cloud Datalab

Google Analytics 360

Cloud Dataproc

Mobile apps

Sensors and devices

Web apps

Developers

Data scientists

Business analysts

Data Ingestion

Cloud Pub/Sub

App Engine

Databases/Storage

Cloud SQL

Cloud Bigtable

Cloud Datastore

Cloud Storage

Analytics

Google BigQuery

Google Analytics 360

Cloud Dataproc

Google Drive

Advanced Analytics & Intelligence

Cloud Machine Learning

Translate API

Vision API

Speech API

Page 27: Big data on google cloud

Google Cloud Dataproc

Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.

Page 28: Big data on google cloud

Traditional Spark and Hadoop clusters

Page 29: Big data on google cloud

Google Cloud Dataproc

Page 30: Big data on google cloud

Google Cloud Dataproc - under the hood

Applications on the cluster

Dataproc Jobs

GCP Products

Spark

PySpark

Spark SQL

MapReduce

Pig

Hive

Dataproc Cluster

Spark & Hadoop OSS

Cloud Dataproc Agent

Google Cloud Services

Dataproc Jobs Features Data Outputs
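
Of the job types listed for this slide, MapReduce is the easiest to show in miniature. The sketch below is a single-process word count that follows the map, shuffle, and reduce phases; it is plain Python for illustration, not Hadoop or Dataproc code, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all values that share the same key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    # Reduce: combine each key's values into one result.
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))
```

On a real cluster the map and reduce phases run in parallel across many machines and the shuffle moves data over the network; the structure of the computation is the same.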

Page 31: Big data on google cloud

Easy, fast, cost-effective

Fast: Things take seconds to minutes, not hours or weeks

Easy: Be an expert with your data, not your data infrastructure

Cost-effective: Pay for exactly what you use
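
"Pay for exactly what you use" matters most for short jobs. The sketch below compares fine-grained billing with billing rounded up to whole hours. The hourly rate and the two billing models are hypothetical, chosen only to show the arithmetic; they are not Dataproc's published pricing.

```python
import math


def cost_fine_grained(runtime_minutes: float, hourly_rate: float) -> float:
    """Cost when only the minutes actually used are billed."""
    return runtime_minutes * hourly_rate / 60


def cost_rounded_to_hours(runtime_minutes: float, hourly_rate: float) -> float:
    """Cost when every started hour is billed in full."""
    return math.ceil(runtime_minutes / 60) * hourly_rate


if __name__ == "__main__":
    rate = 6.0  # hypothetical cluster price per hour
    print(cost_fine_grained(10, rate), cost_rounded_to_hours(10, rate))
```

With these assumed numbers, a 10-minute job costs one sixth as much under fine-grained billing, which is why short-lived clusters become practical.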

Page 32: Big data on google cloud

Running Hadoop on Google Cloud

[Comparison chart of four ways to run Hadoop: On Premise; Vendor Hadoop; bdutil (Free OSS Toolkit); Dataproc (Managed Hadoop). Each stack shows the layers Custom Code, Monitoring/Health, Dev Integration, Scaling, Job Submission, GCP Connectivity, Deployment, and Creation, with each layer marked as Google Managed or Customer Managed. One stack lists Manual Scaling in place of Scaling. The chart also carries the label Google Cloud Platform.]

Page 33: Big data on google cloud

Cloud Dataproc - integrated

Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform.

Storage

Operations

Data

Page 34: Big data on google cloud

Where Cloud Dataproc fits into GCP

Google Bigtable (HBase)

Google BigQuery (Analytics, Data warehouse)

Stackdriver Logging (Logging Ops.)

Google Cloud Dataflow (Batch/Stream Processing)

Google Cloud Storage (HCFS/HDFS)

Stackdriver Monitoring (Monitoring)

Page 35: Big data on google cloud


Google BigQuery makes complex data analysis simple

Scales automatically

No setup or administration

Stream up to 100,000 rows per second

Easily integrates with third-party software

Page 36: Big data on google cloud
Page 37: Big data on google cloud

Google BigQuery Performance Example

Running an inefficient regular expression over 100 billion rows in less than 60 seconds

Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query
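
The figure on this slide implies a scan rate that is easy to work out. The sketch below only does the division; since the slide says "less than 60 seconds", the result is a lower bound, and `rows_per_second` is a hypothetical helper.

```python
def rows_per_second(total_rows: int, seconds: int) -> int:
    """Whole rows scanned per second, rounded down."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_rows // seconds


if __name__ == "__main__":
    # 100 billion rows in at most 60 seconds.
    print(rows_per_second(100_000_000_000, 60))  # 1666666666
```

That is more than 1.6 billion rows per second for a query the slide describes as deliberately inefficient.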

Page 38: Big data on google cloud

Google BigQuery

The Power of Google Dremel for everyone

Storage, Compute

Fast Ingest, Query

Terabit Network

Page 39: Big data on google cloud

Before: 1000-core Hadoop Cluster = 2.5 hours

After: Making ad hoc queries with BigQuery < 5 min

● 500+ Games
● Hundreds of Analysts
● Terabytes of Data Daily
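
The before and after figures translate into a speedup factor. The sketch below computes it from the two numbers on the slide; because the BigQuery time is given as "less than 5 minutes", the result is a lower bound. `speedup` is a hypothetical helper.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    if before_minutes <= 0 or after_minutes <= 0:
        raise ValueError("durations must be positive")
    return before_minutes / after_minutes


if __name__ == "__main__":
    hadoop_minutes = 2.5 * 60   # 2.5 hours on the 1000-core Hadoop cluster
    bigquery_minutes = 5        # upper bound quoted for BigQuery
    print(speedup(hadoop_minutes, bigquery_minutes))  # 30.0
```

So the ad hoc queries ran at least 30 times faster in wall-clock terms, on the numbers the slide reports.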

Page 40: Big data on google cloud
Page 41: Big data on google cloud

“Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery, allowing us to react in real time to customer needs and provide better service.”

Gary Sanders, Head of the bank's digital analytics function

https://www.finextra.com/newsarticle/28566/lloyds-partners-google-on-data-analytics

Page 42: Big data on google cloud

Big Data Challenges At Dyno

- Multi-TB data warehouse
- Raw input: more than 100 GB of new raw data per day (structured & unstructured)
- 65 online data sources
- Unlimited offline data sources
- Facing data quality problems every day
- From user information & behavior to user interest & intention
- Managing a high-performance, cost-effective system
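
To put the daily volume in perspective, the sketch below projects it over a year and averages it across the online sources. It uses the slide's lower bound of 100 GB per day and decimal units (1 TB = 1000 GB); both the unit convention and the even split across sources are assumptions.

```python
GB_PER_TB = 1000  # decimal units, an assumption; the slide does not specify


def yearly_volume_tb(gb_per_day: float, days: int = 365) -> float:
    """Raw data accumulated over a year, in terabytes."""
    return gb_per_day * days / GB_PER_TB


def average_gb_per_source(gb_per_day: float, sources: int) -> float:
    """Average daily volume per source, assuming an even split."""
    if sources <= 0:
        raise ValueError("sources must be positive")
    return gb_per_day / sources


if __name__ == "__main__":
    print(yearly_volume_tb(100))                     # 36.5
    print(round(average_gb_per_source(100, 65), 2))  # 1.54
```

At this rate raw input alone exceeds 36 TB per year, before any derived tables are counted.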

Page 43: Big data on google cloud

JOIN THE FLIGHT - WE ARE HIRING

IO Extended 2017

Twitter: @phamptu

Email: [email protected]

Frontend Developer: goo.gl/EY8RvV

Backend Developer: goo.gl/BnmmK6