Google Cloud Dataflow meets TensorFlow


Page 1: Google Cloud Dataflow meets TensorFlow

Google Cloud Dataflow meets TensorFlow: Dataflow, TensorFlow, Datastore

H.Yoshikawa (@hayatoy), GCPUG Shonan #12, 25 Mar. 2017

Page 2: Google Cloud Dataflow meets TensorFlow

Presenter

@hayatoy

GAE/Py for about 7~8 years(?)

APAC

TensorFlow

Page 3: Google Cloud Dataflow meets TensorFlow

Disclaimer

Page 4: Google Cloud Dataflow meets TensorFlow
Page 5: Google Cloud Dataflow meets TensorFlow

Google Cloud Dataflow

TensorFlow

DatastoreIO

Page 6: Google Cloud Dataflow meets TensorFlow

Google Cloud Dataflow

Page 7: Google Cloud Dataflow meets TensorFlow

Google Cloud Dataflow

Successor to MapReduce (fully managed)

Workers run on GCE instances

Pipelines can be deployed from a Jupyter Notebook

Page 8: Google Cloud Dataflow meets TensorFlow

Google Cloud Dataflow

Python SDK (used throughout this talk)

Page 9: Google Cloud Dataflow meets TensorFlow

Prerequisites

Create a GCP account

Enable billing

Enable the Dataflow API

Page 10: Google Cloud Dataflow meets TensorFlow

Installation

Apache Beam

$ git clone https://github.com/apache/beam.git

$ cd beam/sdks/python/

$ python setup.py sdist

$ cd dist/

$ pip install apache-beam-sdk-*.tar.gz

>> 0.7.0.dev0

google-cloud-dataflow

$ pip install google-cloud-dataflow

>> 0.5.5
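
A quick way to check which SDK build actually gets imported (a minimal sketch, assuming the package exposes __version__ as the Beam sources do):

import apache_beam as beam
print(beam.__version__)  # e.g. 0.7.0.dev0 for the source build above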

Page 11: Google Cloud Dataflow meets TensorFlow

Installation

# gcloud-python
$ pip install gcloud

>> 0.18.3

# Application Default Credentials
$ gcloud beta auth application-default login

Needed so that local runs can read and write gs:// paths.

Page 12: Google Cloud Dataflow meets TensorFlow

Pipeline

Page 13: Google Cloud Dataflow meets TensorFlow

Pipeline

p = beam.Pipeline('DirectRunner')

(p | 'input' >> beam.Create(['Hello', 'World'])

| 'output' >> beam.io.WriteToText('gs://bucket/hello')

)

p.run()

(This pipe-operator style may feel odd to Pythonistas.)

Page 14: Google Cloud Dataflow meets TensorFlow

Pipeline

(p | 'input' >> beam.Create(['Hello', 'World'])

| 'output' >> beam.io.WriteToText('gs://bucket/hello')

)

A Pipeline chains its transforms together with the | operator; see the sketch below.
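
As a rough sketch of what the two operators mean (the explicit apply() form is an assumption about the SDK internals, shown for intuition only):

# '|' applies a transform to a pipeline or PCollection;
# '>>' attaches a step label to the transform
lines = p | 'input' >> beam.Create(['Hello', 'World'])
# roughly equivalent to the explicit call (hypothetical spelling):
# lines = p.apply(beam.Create(['Hello', 'World']), label='input')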

Page 15: Google Cloud Dataflow meets TensorFlow

PCollection

Page 16: Google Cloud Dataflow meets TensorFlow

PCollection

A PCollection is an immutable, distributed collection of elements.

Elements of a PCollection can be key-value pairs.

A PCollection is either bounded (finite) or unbounded (streaming). A small sketch follows.
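
A minimal sketch (hypothetical data) of a bounded, key-value PCollection:

import apache_beam as beam

p = beam.Pipeline('DirectRunner')
kv = p | 'pairs' >> beam.Create([('fruit', 'apple'),
                                 ('fruit', 'banana'),
                                 ('veg', 'carrot')])
# key-value elements enable per-key operations such as GroupByKey
grouped = kv | 'group' >> beam.GroupByKey()
p.run()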

Page 17: Google Cloud Dataflow meets TensorFlow

PCollection

beam.Create(['Hello', 'World'])

Creates an in-memory PCollection with the two plaintext elements 'Hello' and 'World'.

Page 18: Google Cloud Dataflow meets TensorFlow
Page 19: Google Cloud Dataflow meets TensorFlow

TextFile

beam.io.ReadFromText('gs://bucket/input.txt')

Wildcards (*) are OK in the path.
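
A minimal sketch (hypothetical paths) of reading many files at once via a wildcard:

(p | 'read' >> beam.io.ReadFromText('gs://bucket/logs/part-*')
 | 'count lines' >> beam.combiners.Count.Globally()
 | 'write' >> beam.io.WriteToText('gs://bucket/line_count'))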

Page 20: Google Cloud Dataflow meets TensorFlow

BigQuery

Table

beam.io.Read(beam.io.BigQuerySource(
    'clouddataflow-readonly:samples.weather_stations'))

Query

beam.io.Read(beam.io.BigQuerySource(
    query='SELECT year, mean_temp FROM samples.weather_stations'))
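
Rows arrive from BigQuerySource as Python dicts keyed by column name, so an end-to-end sketch looks like this (the output path is hypothetical):

import apache_beam as beam

p = beam.Pipeline('DirectRunner')
(p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
       'clouddataflow-readonly:samples.weather_stations'))
 | 'to csv' >> beam.Map(lambda row: '%s,%s' % (row['year'], row['mean_temp']))
 | 'write' >> beam.io.WriteToText('gs://bucket/weather'))
p.run()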

Page 21: Google Cloud Dataflow meets TensorFlow

Datastore

(For the Python SDK, see the DatastoreIO section later in this talk.)

Page 22: Google Cloud Dataflow meets TensorFlow

Dynamic Work Rebalancing

Page 23: Google Cloud Dataflow meets TensorFlow

Dynamic Work Rebalancing

https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow

Page 24: Google Cloud Dataflow meets TensorFlow

TensorFlow

Page 25: Google Cloud Dataflow meets TensorFlow

Q: Can TensorFlow run on Google Cloud Dataflow?

A: Yes.

Page 26: Google Cloud Dataflow meets TensorFlow

TensorFlow Pipeline

(p | 'generate params' >> beam.Create(params)

| 'train' >> beam.Map(train)

| 'output' >> beam.io.WriteToText('gs://bucket/acc')

)

'generate params' builds the list of hyperparameter sets to try.

'train' trains one TensorFlow model per parameter set.

'output' writes each model's evaluation result to GCS. (A sketch of building params follows.)
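
The slides do not show how params is built; a minimal sketch (grid values are hypothetical, key names match the train function shown later):

import itertools

params = [{'hidden_units': h, 'dropout': d, 'steps': s}
          for h, d, s in itertools.product(
              [[10, 10], [20, 20], [50, 50]],  # hidden layer shapes
              [0.0, 0.2, 0.5],                 # dropout rates
              [1000, 2000])]                   # training steps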

Page 27: Google Cloud Dataflow meets TensorFlow
Page 28: Google Cloud Dataflow meets TensorFlow

The 'train' step:

(p | 'generate params' >> beam.Create(params)
 | 'train' >> beam.Map(train)  # ←
# | 'output' >> beam.io.WriteToText('gs://bucket/acc')
# )

beam.Map applies train to each element of the PCollection and yields a new PCollection of results.

Elements of the PCollection are distributed across Workers (Windowing governs how an unbounded collection would be split up).

Prediction could be done the same way, returning results as a PCollection.

Page 29: Google Cloud Dataflow meets TensorFlow

Inside 'train', ordinary TensorFlow code is OK. Note that the imports live inside the function so each Dataflow worker can resolve them.

def train(param):
    import json
    import uuid
    import tensorflow as tf
    from sklearn import cross_validation

    # load the iris dataset
    iris = tf.contrib.learn.datasets.base.load_iris()
    train_x, test_x, train_y, test_y = cross_validation.train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=0)

    # model per https://www.tensorflow.org/get_started/tflearn
    # (model_id's definition is elided on the slide; a uuid is one assumption)
    model_id = str(uuid.uuid4())
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=param['hidden_units'],
        dropout=param['dropout'],
        n_classes=3,
        model_dir='gs://{BUCKET}/models/%s' % model_id)
    classifier.fit(x=train_x, y=train_y, steps=param['steps'], batch_size=50)

    result = classifier.evaluate(x=test_x, y=test_y)
    ret = {'accuracy': float(result['accuracy']),
           'loss': float(result['loss']),
           'model_id': model_id,
           'param': json.dumps(param)}
    return ret

Page 30: Google Cloud Dataflow meets TensorFlow

Dataflow

Page 31: Google Cloud Dataflow meets TensorFlow
Page 32: Google Cloud Dataflow meets TensorFlow
Page 33: Google Cloud Dataflow meets TensorFlow

Dataflow

Page 34: Google Cloud Dataflow meets TensorFlow

Pipeline

(p | 'generate params' >> beam.Create(params)

| 'train' >> beam.Map(train)

| 'output' >> beam.io.WriteToText('gs://bucket/acc')

)

Page 35: Google Cloud Dataflow meets TensorFlow
Page 36: Google Cloud Dataflow meets TensorFlow
Page 37: Google Cloud Dataflow meets TensorFlow

(p | 'generate params' >> beam.Create(params)
# | 'train' >> beam.Map(train)
# | 'output' >> beam.io.WriteToText('gs://bucket/acc')
)

First run only 'generate params' and inspect the resulting PCollection.

Once the 'train' step looks right, uncomment it and run on Dataflow.

Page 38: Google Cloud Dataflow meets TensorFlow

Auto Scaling, Machine Type

Page 39: Google Cloud Dataflow meets TensorFlow

worker_options.max_num_workers = 10
worker_options.num_workers = 10
worker_options.disk_size_gb = 20
worker_options.machine_type = 'n1-standard-16'

max_num_workers: upper limit on the number of workers.

num_workers: number of workers to start with (the initial count when auto scaling is on).

disk_size_gb: per-worker disk size in GB (default: 250 GB).

machine_type: any GCE machine type (names as used by the REST API).

GPU workers are not available. A sketch of wiring these options up follows.
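
A minimal sketch of where these settings plug in (module paths moved between SDK versions; apache_beam.options.pipeline_options is the current Beam spelling, and the project and bucket names are hypothetical):

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions)

options = PipelineOptions()
gcp = options.view_as(GoogleCloudOptions)
gcp.project = 'your-project-id'
gcp.staging_location = 'gs://bucket/staging'
gcp.temp_location = 'gs://bucket/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

worker_options = options.view_as(WorkerOptions)
worker_options.max_num_workers = 10
worker_options.num_workers = 10
worker_options.disk_size_gb = 20
worker_options.machine_type = 'n1-standard-16'

p = beam.Pipeline(options=options)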

Page 40: Google Cloud Dataflow meets TensorFlow
Page 41: Google Cloud Dataflow meets TensorFlow

Pipeline

Page 42: Google Cloud Dataflow meets TensorFlow
Page 43: Google Cloud Dataflow meets TensorFlow

Branch

Branching could be done inside a single Transform with an if,

but splitting the Pipeline into explicit branches

shows each branch as its own step in the Console graph.

Page 44: Google Cloud Dataflow meets TensorFlow

Branch

Page 45: Google Cloud Dataflow meets TensorFlow

Branch

def split_branch(n, side=0):
    # keep only elements whose parity matches side
    if n % 2 == side:
        yield n

pipe_0 = p | 'param' >> beam.Create(range(100))

branch1 = pipe_0 | 'branch1' >> beam.FlatMap(split_branch, 0)
branch2 = pipe_0 | 'branch2' >> beam.FlatMap(split_branch, 1)
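
Not on the slide, but branches built this way can be merged back into a single PCollection with Flatten:

merged = (branch1, branch2) | 'merge' >> beam.Flatten()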

Page 46: Google Cloud Dataflow meets TensorFlow

Dynamic Work Rebalancing

Struggle with Stragglers

Page 47: Google Cloud Dataflow meets TensorFlow
Page 48: Google Cloud Dataflow meets TensorFlow

2000

20 → 5 /

→ 5 DNN 15

ON/OFF OFF

Page 49: Google Cloud Dataflow meets TensorFlow

DNN

Page 50: Google Cloud Dataflow meets TensorFlow

Page 51: Google Cloud Dataflow meets TensorFlow

Dataflow

Page 52: Google Cloud Dataflow meets TensorFlow

DatastoreIO

Page 53: Google Cloud Dataflow meets TensorFlow

DatastoreIO (Python SDK)

DatastoreIO was added around the time the Python Dataflow SDK went GA.

Entities are handled as Protobuf messages.

← New!

Page 54: Google Cloud Dataflow meets TensorFlow

from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
from google.cloud.proto.datastore.v1 import entity_pb2
from google.cloud.proto.datastore.v1 import query_pb2
from googledatastore import helper as datastore_helper, PropertyFilter
from gcloud.datastore.helpers import entity_from_protobuf

The older DatastoreSource and DatastoreSink

→ replaced by ReadFromDatastore / WriteToDatastore (as of Beam v0.7.0).

Page 55: Google Cloud Dataflow meets TensorFlow

Read from Datastore

Pipeline

(p | 'read from datastore' >> ReadFromDatastore(project=PROJECTID,
                                                query=query)
...

Page 56: Google Cloud Dataflow meets TensorFlow

Query

query = query_pb2.Query()

query.kind.add().name = 'Test'

datastore_helper.set_property_filter(
    query.filter, 'foo',
    PropertyFilter.EQUAL, 'lorem')

The equivalent GQL:

SELECT * FROM Test WHERE foo = 'lorem'

Page 57: Google Cloud Dataflow meets TensorFlow

Convert the Protobuf entity with the Datastore client lib:

def csv_format(entity_pb):
    entity = entity_from_protobuf(entity_pb)
    columns = ['"%s"' % entity[k]
               for k in sorted(entity.keys())]
    return ','.join(columns)

p = beam.Pipeline(options=options)

(p | 'read from datastore' >> ReadFromDatastore(project=PROJECTID,
                                                query=query)
 | 'format entity to csv' >> beam.Map(csv_format)
...

Page 58: Google Cloud Dataflow meets TensorFlow

Write to Datastore

Pipeline

...
 | 'create entity' >> beam.Map(create_entity)
 | 'write to datastore' >> WriteToDatastore(project=PROJECTID))

Page 59: Google Cloud Dataflow meets TensorFlow

Entity

Build the entity as a Protobuf message:

def create_entity(param):
    import uuid  # used for the key name below
    entity = entity_pb2.Entity()
    datastore_helper.add_key_path(entity.key,
                                  'Test',
                                  str(uuid.uuid4()))
    datastore_helper.add_properties(entity,
                                    {"foo": u"hoge",
                                     "bar": u"fuga",
                                     "baz": 42})
    return entity

Page 60: Google Cloud Dataflow meets TensorFlow

Demo

Page 61: Google Cloud Dataflow meets TensorFlow

Questions?

Page 62: Google Cloud Dataflow meets TensorFlow

Thank you!

bit.ly/gcp-dataflow

Qiita

http://qiita.com/hayatoy/items/2eb2bc9223dd6f5c91e0

Medium

https://medium.com/@hayatoy/training-multiple-models-of-tensorflow-using-dataflow-7a5a9efafe53#.yvrblb6r3