Big Data on Google Cloud

Post on 21-Jan-2018



Big Data On Google Cloud

Tu Pham - IO Extended 2017

CTO @ Dyno, a data-as-a-service company

Technologies: Java, Python, all kinds of databases, and cloud platforms from Google, AWS, and Azure.

Interests: Cloud computing / architecture, technology evolution, distributed systems.

Husband, Father, GDE, Open source contributor.

Tu Pham

Photo: Lars Kruse, Aarhus Universitet

Introducing Dyno: tech marketing & digital agency

For the past 17 years, Google has been building out the world’s fastest, most powerful, highest quality cloud infrastructure on the planet.

Images by Connie Zhou

Google Cloud Platform is built on the same infrastructure that powers Google.


Google’s Platform

“[Google's] ability to build, organize, and operate a huge network of servers and fiber-optic cables with an efficiency and speed that rocks physics on its heels. This is what makes Google Google: its physical network, its thousands of fiber miles, and those many thousands of servers that, in aggregate, add up to the mother of all clouds.”

- Wired

77 peering locations

Yes, We Can Power That

Web, Mobile, Storage & Database

Big Data, Highly Scalable System, Data Mining

Cloud Platform

Google Cloud Platform

Google’s Mission: Organize the world’s information and make it universally accessible and useful.


Source: Boston Consulting Group, The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact; IDC, 2015

By 2020, there will be 8 billion connected smartphones, 2X more than today.

And 32 billion connected “IoT” devices, 6X more than today.

Exploring the Cloud

IaaS: Infrastructure-as-a-Service

PaaS: Platform-as-a-Service

SaaS: Software-as-a-Service

Google Cloud Platform


Google Compute Engine

• Flexible Infrastructure
  • Custom VM Size
  • Online Disk Resizing
• Network
  • Internal Network
  • Firewall
  • Load Balancing
  • External IP Address
• Billing
  • Sustained Usage Discounts
  • Preemptible VM

App Engine

• Fully Managed Platform
• Popular Programming Language Support
• Flexible and Scalable Application Storage
• Auto-scaling
• Versioning and Traffic Splitting
• Local Developer Tools
• Third-party Frameworks and Extensions
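
To make "Versioning and Traffic Splitting" concrete, here is a minimal sketch of one way a weighted split between two deployed versions could be computed. It is not App Engine's own implementation or API: the function name `pick_version`, the version labels, and the 90/10 weights are all hypothetical, chosen only for illustration.

```python
import hashlib


def pick_version(user_key, split):
    """Map a user key to a version, deterministically, according to weights.

    `split` is a dict of version name -> weight (hypothetical names).
    """
    total = sum(split.values())
    if not split or total <= 0:
        raise ValueError("split must contain at least one positive weight")

    # Hash the key to a stable number in [0, 1) so the same user
    # always lands on the same version.
    digest = hashlib.sha256(user_key.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for version, weight in split.items():
        cumulative += weight / total
        if point < cumulative:
            return version
    # Floating-point rounding can leave the last bucket slightly short.
    return version


# Send roughly 10% of users to the new version (assumed weights).
split = {"v1": 0.9, "v2": 0.1}
counts = {"v1": 0, "v2": 0}
for i in range(10_000):
    counts[pick_version(f"user-{i}", split)] += 1
```

Hashing a stable key rather than drawing a random number keeps each user on one version across requests, which is usually what a gradual rollout needs.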

Pub/Sub

• Global Presence
• Flexible Delivery Options
  • Pull
  • Push
• Data Reliability
• Flow Control
• Data Security and Protection
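
The difference between the pull and push delivery options can be shown with a toy, in-memory model. This sketch does not use the Pub/Sub client library and leaves out acknowledgements and retries; the `Topic` class and its methods are hypothetical names invented for the illustration.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic with a single subscription."""

    def __init__(self):
        self._queue = deque()
        self._push_endpoint = None

    def subscribe_push(self, endpoint):
        """Register a callable that receives each message as it arrives."""
        self._push_endpoint = endpoint

    def publish(self, message):
        if self._push_endpoint is not None:
            # Push: the service calls the subscriber immediately.
            self._push_endpoint(message)
        else:
            # Pull: the message waits until the subscriber asks for it.
            self._queue.append(message)

    def pull(self, max_messages):
        """Return at most `max_messages` waiting messages (simple flow control)."""
        batch = []
        while self._queue and len(batch) < max_messages:
            batch.append(self._queue.popleft())
        return batch


# Pull: the subscriber decides when and how much to receive.
pull_topic = Topic()
for i in range(5):
    pull_topic.publish(f"event-{i}")
first_batch = pull_topic.pull(max_messages=2)

# Push: the subscriber is called as soon as a message is published.
received = []
push_topic = Topic()
push_topic.subscribe_push(received.append)
push_topic.publish("event-a")
```

In this model the `max_messages` argument plays the role of flow control: a slow subscriber can take small batches and leave the rest queued.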

Cloud Dataflow

• Reliable & Consistent Processing
• Unified Programming Model
• Intelligent Work Scheduling
• Auto Scaling
• Monitoring
• Open Source

Cloud Storage

• Versioning
• Static Sites
• Resumable Transfers
• Object Change Notifications
• TB scale
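
The idea behind resumable transfers is that an interrupted upload continues from the last byte the server stored instead of starting over. The sketch below models that with an in-memory session. It is not the Cloud Storage API; `FakeUploadSession`, `upload`, and the simulated failure are hypothetical and exist only to show the control flow.

```python
class FakeUploadSession:
    """In-memory stand-in for a server-side resumable upload session."""

    def __init__(self):
        self.received = bytearray()

    def committed_offset(self):
        """Number of bytes the 'server' has stored so far."""
        return len(self.received)

    def write_chunk(self, offset, chunk):
        if offset != len(self.received):
            raise ValueError("chunk does not start at the committed offset")
        self.received.extend(chunk)


def upload(session, data, chunk_size, fail_after_chunks=None):
    """Send `data` in chunks, starting from what the session already holds."""
    sent = 0
    offset = session.committed_offset()
    while offset < len(data):
        if fail_after_chunks is not None and sent == fail_after_chunks:
            raise ConnectionError("simulated network failure")
        session.write_chunk(offset, data[offset:offset + chunk_size])
        offset = session.committed_offset()
        sent += 1


payload = bytes(range(256)) * 4  # 1024 bytes of sample data
session = FakeUploadSession()

try:
    # The first attempt is cut off after three chunks.
    upload(session, payload, chunk_size=100, fail_after_chunks=3)
except ConnectionError:
    pass
bytes_before_resume = session.committed_offset()

# The second attempt asks for the committed offset and continues from there.
upload(session, payload, chunk_size=100)
```

The key design point is that the client never trusts its own count of bytes sent; it always asks the session for the committed offset before resuming.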

Cloud SQL

• Fully Managed
• Ease of Use
• Highly Reliable
• Flexible Charging
• Security, Availability, Durability
• Easy Migration & Data Portability
• Optimized MySQL Versions

BigQuery

• Fully Managed Big Data Analytics Service
• SQL Support
• Fast
• Scalable
• Flexible and Familiar
• Security and Reliability

Dataproc

• Includes
  • Apache Hadoop
  • Apache Pig
  • Apache Hive
  • Apache Spark
• Fast and Scalable Data Processing
• Flexible Virtual Machines
• Resizable Cluster
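
Since Dataproc bundles Apache Hadoop, it may help to recall the MapReduce model that Hadoop jobs follow. The sketch below runs the three phases (map, shuffle, reduce) on a word count in plain Python on one machine. It does not use Hadoop, Spark, or any Dataproc API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data on google cloud", "big query big table"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

On a real cluster the map and reduce phases run in parallel across many workers and the shuffle moves data over the network; the logic per record stays the same.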

Datalab

• Powerful Data Exploration
• Scalable
• Data Management
• Visualization
• Open Source (Jupyter)

Google’s Data Services for everyone

A common configuration

[Diagram. Inputs: events, metrics, etc. (stream); raw logs, files, assets, Google Analytics data, etc. (batch). Outputs: Cloud Datalab, visualization and BI, applications and reports, used by co-workers to draw conclusions.]

Confidential + Proprietary

A serverless big data stack that scales automatically

10+ Years of Tackling Big Data Problems


Google Papers, Open Source, and Google Cloud Products

[Timeline figure, 2002 to 2015. Items shown: GFS, MapReduce, BigTable, Dremel, PubSub, FlumeJava, MillWheel, Apache Beam, TensorFlow. Google Cloud products shown: BigQuery, Pub/Sub, Dataflow, Bigtable.]


Transform Data into Actions

[Diagram: pipeline stages and capabilities]

Stages: Data Ingestion; Databases / Storage; Data Preparation & Processing; Analytics; Exploration & Collaboration; Advanced Analytics & Intelligence

Sources: Mobile apps, Sensors and devices, Web apps

Data ingestion: Messaging, Logs

Databases and storage: Relational, Key-value, Document, SQL, Wide column, Object

Data preparation and processing: Stream processing, Batch processing, Data preparation

Analytics and exploration: Federated query, Data catalog, Data exploration, Data visualization

Advanced analytics and intelligence: Development environment for Machine Learning, Pre-Trained Machine Learning models

Users: Developers, Data scientists, Business analysts


Transform Data into Actions

Data Preparation & Processing

Cloud Dataflow

Cloud Dataproc

Exploration & Collaboration

Google BigQuery

Cloud Datalab

Google Analytics 360

Cloud Dataproc

Mobile apps

Sensors and devices

Web apps

Developers

Data scientists

Business analysts

Data Ingestion

Cloud Pub/Sub

App Engine

Databases/Storage

Cloud SQL

Cloud Bigtable

Cloud Datastore

Cloud Storage

Analytics

Google BigQuery

Google Analytics 360

Cloud Dataproc

Google Drive

Advanced Analytics & Intelligence

Cloud Machine Learning

Translate API

Vision API

Speech API


Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.

Google Cloud Dataproc

Traditional Spark and Hadoop clusters

Google Cloud Dataproc

Google Cloud Dataproc - under the hood

[Diagram labels]

Dataproc jobs: Spark, PySpark, Spark SQL, MapReduce, Pig, Hive

Dataproc cluster: Spark & Hadoop OSS, Cloud Dataproc Agent

Also shown: applications on the cluster, GCP products, Google Cloud services, features, data, outputs

Easy, fast, cost-effective

Fast: Things take seconds to minutes, not hours or weeks

Easy: Be an expert with your data, not your data infrastructure

Cost-effective: Pay for exactly what you use

Running Hadoop on Google Cloud

[Comparison figure]

Options compared: On Premise, Vendor Hadoop, bdutil (free OSS toolkit), Dataproc (managed Hadoop)

Responsibilities compared for each option: Custom Code, Monitoring/Health, Dev Integration, Scaling (shown as Manual Scaling in one column), Job Submission, GCP Connectivity, Deployment, Creation

Legend: Google Managed, Customer Managed

Cloud Dataproc - integrated

Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform.

Storage

Operations

Data


Where Cloud Dataproc fits into GCP


Google Bigtable (HBase)

Google BigQuery (Analytics, Data warehouse)

Stackdriver Logging (Logging Ops.)

Google Cloud Dataflow (Batch/Stream Processing)

Google Cloud Storage (HCFS/HDFS)

Stackdriver Monitoring (Monitoring)


Scales automatically

No setup or administration

Stream up to 100,000 rows per second

Easily integrates with third-party software

Google BigQuery makes complex data analysis simple


Google BigQuery Performance Example

Running an inefficient regular expression over 100 billion rows in less than 60 seconds

Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query
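
To put that figure in perspective, the implied scan rate can be worked out directly. The sketch uses only the two numbers quoted on the slide; because the slide says "less than 60 seconds", the result is a lower bound and not a measured rate.

```python
ROWS_SCANNED = 100_000_000_000  # 100 billion rows, as quoted on the slide
MAX_SECONDS = 60                # "less than 60 seconds"

# Dividing by the upper bound on time gives a lower bound on throughput.
min_rows_per_second = ROWS_SCANNED / MAX_SECONDS
print(f"at least {min_rows_per_second:,.0f} rows per second")
```

That is roughly 1.67 billion rows per second, each one run through a regular expression.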

Google BigQuery

The Power of Google Dremel for everyone

Storage, Compute

Fast Ingest, Query

Terabit Network

Before: 1000-core Hadoop cluster = 2.5 hours

After: Making ad hoc queries with BigQuery < 5 min
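
The before and after numbers translate into a minimum speedup. This is a small arithmetic sketch using only the two figures above; since the "after" value is an upper bound (under 5 minutes), the ratio is a lower bound.

```python
before_minutes = 2.5 * 60  # 1000-core Hadoop cluster: 2.5 hours
after_minutes = 5          # BigQuery ad hoc query: under 5 minutes

# Ratio of the old run time to the upper bound on the new run time.
min_speedup = before_minutes / after_minutes
print(f"at least {min_speedup:.0f}x faster")
```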

● 500+ Games
● Hundreds of Analysts
● Terabytes of Data Daily

“Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery, allowing us to react in real time to customer needs and provide better service.”

Gary Sanders, Head of the bank's digital analytics function

https://www.finextra.com/newsarticle/28566/lloyds-partners-google-on-data-analytics

Big Data Challenges At Dyno

- Multi-TB data warehouse
- Raw input: > 100 GB of new raw data per day (structured & unstructured)
- 65 online data sources
- Unlimited offline data sources
- Facing data quality problems every day
- From user information & behavior to user interest & intention
- Managing a high-performance, cost-effective system

JOIN THE FLIGHT - WE ARE HIRING

IO Extended 2017

Twitter: @phamptu
Email: tu@dyno.vn

Frontend Developer: goo.gl/EY8RvV
Backend Developer: goo.gl/BnmmK6