Google Cloud Dataflow meets TensorFlow
Google Cloud Dataflow meets TensorFlow
Dataflow, TensorFlow, Datastore
H. Yoshikawa (@hayatoy), GCPUG Shonan #12, 25 Mar. 2017
Presenter
@hayatoy
GAE/Py for about 7-8 years(?), APAC, TensorFlow
Disclaimer
Google Cloud Dataflow
TensorFlow
DatastoreIO
Google Cloud Dataflow
Google Cloud Dataflow
A fully managed data-processing service in the lineage of MapReduce.
Workers run on GCE instances and are provisioned automatically.
Pipelines can even be written and deployed from a Jupyter Notebook.
This talk uses the Python SDK.
Prerequisites
Create GCP account
Enable billing
Enable the Google Dataflow API (these steps can also be scripted; see the gcloud sketch below)
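Apart from creating the account itself, the above can be done from the gcloud CLI; a minimal sketch (project and billing-account ids are placeholders, and command groups have shifted between gcloud releases):

$ gcloud projects create my-dataflow-project
$ gcloud beta billing projects link my-dataflow-project \
    --billing-account=XXXXXX-XXXXXX-XXXXXX
$ gcloud services enable dataflow.googleapis.com --project=my-dataflow-project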
Installation
Apache Beam
$ git clone https://github.com/apache/beam.git
$ cd beam/sdks/python/
$ python setup.py sdist
$ cd dist/
$ pip install apache-beam-sdk-*.tar.gz
>> 0.7.0.dev0
google-cloud-dataflow
$ pip install google-cloud-dataflow
>> 0.5.5
Installation
# gcloud-python
$ pip install gcloud
>> 0.18.3
# Application Default Credentials
$ gcloud beta auth application-default login
Also prepare a GCS bucket (gs://...) for staging and output.
Pipeline
import apache_beam as beam

p = beam.Pipeline('DirectRunner')
(p | 'input' >> beam.Create(['Hello', 'World'])
| 'output' >> beam.io.WriteToText('gs://bucket/hello')
)
p.run()
(This chained style may look unusual to Pythonistas.)
Pipeline
(p | 'input' >> beam.Create(['Hello', 'World'])
| 'output' >> beam.io.WriteToText('gs://bucket/hello')
)
A Pipeline connects its transforms with the | operator; >> just attaches a label to each step.
PCollection
The dataset that flows through a Pipeline: every transform's input and output is a PCollection.
Elements can be plain values or key-value pairs.
A PCollection is either bounded (finite) or unbounded (streaming).
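To make these properties concrete, a minimal sketch on the DirectRunner (names are arbitrary): each transform consumes one PCollection and produces another, and key-value elements enable per-key grouping.

import apache_beam as beam

p = beam.Pipeline('DirectRunner')
nums = p | 'nums' >> beam.Create([1, 2, 3, 4])        # bounded, in-memory PCollection
pairs = nums | 'kv' >> beam.Map(
    lambda n: ('even' if n % 2 == 0 else 'odd', n))   # key-value elements
grouped = pairs | 'group' >> beam.GroupByKey()        # e.g. ('even', [2, 4])
p.run()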
In memory
beam.Create(['Hello', 'World'])
Creates a PCollection from the two in-memory plaintext elements 'Hello' and 'World'.
TextFile
beam.io.ReadFromText('gs://bucket/input.txt')
Wildcards (*) in the path are OK.
BigQuery
Table
beam.io.Read(
beam.io.BigQuerySource(
'clouddataflow-readonly:samples.weather_stations'))
Query
beam.io.Read(
beam.io.BigQuerySource(
query='SELECT year, mean_temp FROM samples.weather_stations'))
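Putting the query source into a small end-to-end pipeline (a sketch; the bucket name is a placeholder, and each row arrives as a dict keyed by column name):

(p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
        query='SELECT year, mean_temp FROM samples.weather_stations'))
   | 'to csv' >> beam.Map(lambda row: '%s,%s' % (row['year'], row['mean_temp']))
   | 'write' >> beam.io.WriteToText('gs://bucket/weather')
)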
Datastore
Support in the Python SDK is brand new; details later in this talk.
Dynamic Work Rebalancing
https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
TensorFlow
Q: Can TensorFlow run on Google Cloud Dataflow?
A: Yes.
Put the TensorFlow training step inside a Pipeline:
(p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train)
| 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
generate params: create the list of hyperparameter sets
train: train one model per parameter set
output: write each model's accuracy to GCS
train
(p | 'generate params' >> beam.Create(params)
   | 'train' >> beam.Map(train)  # ← this step
   | 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
beam.Map applies train to each element of the PCollection, so the parameter sets are distributed across the workers (windowing plays the same role for unbounded input).
Prediction works the same way: map a predict function over a PCollection.
Inside train you can use TensorFlow as usual; Dataflow is fine with it as long as the workers can import the required libraries.
def train(param):
    import json
    import uuid
    import tensorflow as tf
    from sklearn import cross_validation

    # Load the Iris dataset
    iris = tf.contrib.learn.datasets.base.load_iris()
    train_x, test_x, train_y, test_y = cross_validation.train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=0
    )

    # https://www.tensorflow.org/get_started/tflearn
    model_id = str(uuid.uuid4())  # unique id per model (assumed; not shown on the slide)
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=param['hidden_units'],
        dropout=param['dropout'],
        n_classes=3,
        model_dir='gs://{BUCKET}/models/%s' % model_id)
    classifier.fit(x=train_x, y=train_y, steps=param['steps'], batch_size=50)
    result = classifier.evaluate(x=test_x, y=test_y)
    ret = {'accuracy': float(result['accuracy']),
           'loss': float(result['loss']),
           'model_id': model_id,
           'param': json.dumps(param)}
    return ret
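The params fed to 'generate params' are just a list of dicts with the keys train expects. A hypothetical way to build a small grid (the value ranges here are made up for illustration):

import itertools

hidden_units = [[10, 20, 10], [20, 40, 20], [100, 50, 20]]
dropout = [0.2, 0.5, 0.7]
steps = [2000, 10000]

# One dict per combination, matching the keys used inside train()
params = [{'hidden_units': h, 'dropout': d, 'steps': s}
          for h, d, s in itertools.product(hidden_units, dropout, steps)]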
Run it on Dataflow
Pipeline
(p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train)
| 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
(p | 'generate params' >> beam.Create(params)
#  | 'train' >> beam.Map(train)
#  | 'output' >> beam.io.WriteToText('gs://bucket/acc')
# )
First comment out train and output to check the PCollection of parameter sets,
then re-enable train and run the full job on Dataflow.
Auto Scaling, Machine Type
worker_options.max_num_workers = 10
worker_options.num_workers = 10
worker_options.disk_size_gb = 20
worker_options.machine_type = 'n1-standard-16'
max_num_workers: the maximum number of workers
num_workers: the initial number of workers (with auto scaling, the pool grows up to the maximum)
disk_size_gb: disk size per worker in GB (default 250GB)
machine_type: any GCE machine type (see the GCE REST API for the full list)
GPU machine types are not available.
How these worker_options are created is sketched below.
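A minimal sketch of that setup (module paths moved between early SDK releases; this follows the current apache_beam layout, and project/bucket names are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions, WorkerOptions)

options = PipelineOptions()
gcloud_options = options.view_as(GoogleCloudOptions)
gcloud_options.project = 'my-project'                    # placeholder
gcloud_options.job_name = 'tf-on-dataflow'
gcloud_options.staging_location = 'gs://bucket/staging'
gcloud_options.temp_location = 'gs://bucket/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

worker_options = options.view_as(WorkerOptions)
worker_options.max_num_workers = 10
worker_options.machine_type = 'n1-standard-16'

p = beam.Pipeline(options=options)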
Branching a Pipeline
Instead of switching with an if inside a single transform, split the Pipeline itself into branches.
Each branch is visualized in the Dataflow Console graph.
Branch
def split_branch(n, side=0):
    if n % 2 == side:
        yield n

pipe_0 = p | 'param' >> beam.Create(range(100))
branch1 = (pipe_0 | 'branch1' >> beam.FlatMap(split_branch, 0))
branch2 = (pipe_0 | 'branch2' >> beam.FlatMap(split_branch, 1))
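Branches can be merged back together with Beam's standard Flatten transform; a minimal sketch (label and bucket names are arbitrary):

merged = ((branch1, branch2)
          | 'merge' >> beam.Flatten()
          | 'output' >> beam.io.WriteToText('gs://bucket/merged'))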
Dynamic Work Rebalancing
Struggle with Stragglers
Benchmark: 2000 parameter sets (slide figures: 20 → 5 / ..., → 5-layer DNN, 15 ...)
(Results slide: ON/OFF: OFF ...; DNN ...; Dataflow ...)
DatastoreIO
DatastoreIO (Python version)
The Python SDK for Dataflow has gone GA, and DatastoreIO is now available ← New!
Note: it works at the Protobuf level, not with the idiomatic client-library objects.
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
from google.cloud.proto.datastore.v1 import entity_pb2
from google.cloud.proto.datastore.v1 import query_pb2
from googledatastore import helper as datastore_helper, PropertyFilter
from gcloud.datastore.helpers import entity_from_protobuf
The former DatastoreSource / DatastoreSink
→ replaced by ReadFromDatastore / WriteToDatastore (as of Beam v0.7.0)
Read from Datastore
Pipeline
(p | 'read from datastore' >>
ReadFromDatastore(project=PROJECTID,
query=query)
...
Query
query = query_pb2.Query()
query.kind.add().name = 'Test'
datastore_helper.set_property_filter(
query.filter, 'foo',
PropertyFilter.EQUAL, 'lorem'
)
Equivalent GQL:
SELECT * FROM Test WHERE foo = 'lorem'
Convert the Protobuf entity using the Datastore Client lib:
def csv_format(entity_pb):
    entity = entity_from_protobuf(entity_pb)
    columns = ['"%s"' % entity[k]
               for k in sorted(entity.keys())]
    return ','.join(columns)
p = beam.Pipeline(options=options)
(p | 'read from datastore' >>
ReadFromDatastore(project=PROJECTID,
query=query)
| 'format entity to csv' >>
beam.Map(csv_format)
...
Write to Datastore
Pipeline...
| 'create entity' >>
beam.Map(create_entity)
| 'write to datastore' >>
WriteToDatastore(project=PROJECTID))
create_entity builds each Entity at the Protobuf level:
import uuid  # used for the entity key names

def create_entity(param):
    entity = entity_pb2.Entity()
    datastore_helper.add_key_path(entity.key,
                                  'Test',
                                  str(uuid.uuid4()))
    datastore_helper.add_properties(entity,
                                    {"foo": u"hoge",
                                     "bar": u"fuga",
                                     "baz": 42})
    return entity
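End to end, the write side of the Pipeline then looks roughly like this (a sketch; params is any input list, as in the training example):

(p | 'generate params' >> beam.Create(params)
   | 'create entity' >> beam.Map(create_entity)
   | 'write to datastore' >> WriteToDatastore(project=PROJECTID))
p.run()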
Demo
Questions?
Thank you!
bit.ly/gcp-dataflow
Qiita
http://qiita.com/hayatoy/items/2eb2bc9223dd6f5c91e0
Medium
https://medium.com/@hayatoy/training-multiple-models-of-tensorflow-using-dataflow-7a5a9efafe53#.yvrblb6r3