Google Cloud Dataflow meets TensorFlow
Google Cloud Dataflow meets TensorFlow
Dataflow, TensorFlow, Datastore
H. Yoshikawa (@hayatoy), GCPUG Shonan #12, 25 Mar. 2017
Presenter
@hayatoy
GAE/Py for about 7-8 years(?), APAC, TensorFlow
Disclaimer
Google Cloud Dataflow
TensorFlow
DatastoreIO
Google Cloud Dataflow
Google Cloud Dataflow
A fully managed data-processing service in the lineage of MapReduce.
Workers run on GCE instances and are provisioned automatically.
Pipelines can even be written and deployed from a Jupyter Notebook.
This talk uses the Python SDK.
Prerequisites
Create GCP account
Enable billing
Enable the Google Dataflow API (these steps can also be scripted; see the gcloud sketch below)
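Apart from creating the account itself, the above can be done from the gcloud CLI; a minimal sketch (project and billing-account ids are placeholders, and command groups have shifted between gcloud releases):

$ gcloud projects create my-dataflow-project
$ gcloud beta billing projects link my-dataflow-project \
    --billing-account=XXXXXX-XXXXXX-XXXXXX
$ gcloud services enable dataflow.googleapis.com --project=my-dataflow-project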
Installation
Apache Beam
$ git clone https://github.com/apache/beam.git
$ cd beam/sdks/python/
$ python setup.py sdist
$ cd dist/
$ pip install apache-beam-sdk-*.tar.gz
>> 0.7.0.dev0
google-cloud-dataflow
$ pip install google-cloud-dataflow
>> 0.5.5
Installation
# gcloud-python
$ pip install gcloud
>> 0.18.3
# Application Default Credentials
$ gcloud beta auth application-default login
Also prepare a GCS bucket (gs://...) for staging and output.
Pipeline
import apache_beam as beam

p = beam.Pipeline('DirectRunner')
(p | 'input' >> beam.Create(['Hello', 'World'])
| 'output' >> beam.io.WriteToText('gs://bucket/hello')
)
p.run()
(This chained style may look unusual to Pythonistas.)
Pipeline
(p | 'input' >> beam.Create(['Hello', 'World'])
| 'output' >> beam.io.WriteToText('gs://bucket/hello')
)
A Pipeline connects its transforms with the | operator; >> just attaches a label to each step.
PCollection
The dataset that flows through a Pipeline: every transform's input and output is a PCollection.
Elements can be plain values or key-value pairs.
A PCollection is either bounded (finite) or unbounded (streaming).
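To make these properties concrete, a minimal sketch on the DirectRunner (names are arbitrary): each transform consumes one PCollection and produces another, and key-value elements enable per-key grouping.

import apache_beam as beam

p = beam.Pipeline('DirectRunner')
nums = p | 'nums' >> beam.Create([1, 2, 3, 4])        # bounded, in-memory PCollection
pairs = nums | 'kv' >> beam.Map(
    lambda n: ('even' if n % 2 == 0 else 'odd', n))   # key-value elements
grouped = pairs | 'group' >> beam.GroupByKey()        # e.g. ('even', [2, 4])
p.run()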
In memory
beam.Create(['Hello', 'World'])
Creates a PCollection from the two in-memory plaintext elements 'Hello' and 'World'.
TextFile
beam.io.ReadFromText('gs://bucket/input.txt')
Wildcards (*) in the path are OK.
BigQuery
Table
beam.io.Read(
beam.io.BigQuerySource(
'clouddataflow-readonly:samples.weather_stations'))
Query
beam.io.Read(
beam.io.BigQuerySource(
query='SELECT year, mean_temp FROM samples.weather_stations'))
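Putting the query source into a small end-to-end pipeline (a sketch; the bucket name is a placeholder, and each row arrives as a dict keyed by column name):

(p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
        query='SELECT year, mean_temp FROM samples.weather_stations'))
   | 'to csv' >> beam.Map(lambda row: '%s,%s' % (row['year'], row['mean_temp']))
   | 'write' >> beam.io.WriteToText('gs://bucket/weather')
)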
Datastore
Support in the Python SDK is brand new; details later in this talk.
Dynamic Work Rebalancing
https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
TensorFlow
Q: Can TensorFlow run on Google Cloud Dataflow?
A: Yes.
Put the TensorFlow training step inside a Pipeline:
(p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train)
| 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
generate params: create the list of hyperparameter sets
train: train one model per parameter set
output: write each model's accuracy to GCS
train
(p | 'generate params' >> beam.Create(params)
   | 'train' >> beam.Map(train)  # ← this step
   | 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
beam.Map applies train to each element of the PCollection, so the parameter sets are distributed across the workers (windowing plays the same role for unbounded input).
Prediction works the same way: map a predict function over a PCollection.
Inside train you can use TensorFlow as usual; Dataflow is fine with it as long as the workers can import the required libraries.
def train(param):
    import json
    import uuid
    import tensorflow as tf
    from sklearn import cross_validation

    # Load the Iris dataset
    iris = tf.contrib.learn.datasets.base.load_iris()
    train_x, test_x, train_y, test_y = cross_validation.train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=0
    )

    # https://www.tensorflow.org/get_started/tflearn
    model_id = str(uuid.uuid4())  # unique id per model (assumed; not shown on the slide)
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=param['hidden_units'],
        dropout=param['dropout'],
        n_classes=3,
        model_dir='gs://{BUCKET}/models/%s' % model_id)
    classifier.fit(x=train_x, y=train_y, steps=param['steps'], batch_size=50)
    result = classifier.evaluate(x=test_x, y=test_y)
    ret = {'accuracy': float(result['accuracy']),
           'loss': float(result['loss']),
           'model_id': model_id,
           'param': json.dumps(param)}
    return ret
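The params fed to 'generate params' are just a list of dicts with the keys train expects. A hypothetical way to build a small grid (the value ranges here are made up for illustration):

import itertools

hidden_units = [[10, 20, 10], [20, 40, 20], [100, 50, 20]]
dropout = [0.2, 0.5, 0.7]
steps = [2000, 10000]

# One dict per combination, matching the keys used inside train()
params = [{'hidden_units': h, 'dropout': d, 'steps': s}
          for h, d, s in itertools.product(hidden_units, dropout, steps)]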
Run it on Dataflow
Pipeline
(p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train)
| 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
(p | 'generate params' >> beam.Create(params)
#  | 'train' >> beam.Map(train)
#  | 'output' >> beam.io.WriteToText('gs://bucket/acc')
# )
First comment out train and output to check the PCollection of parameter sets,
then re-enable train and run the full job on Dataflow.
Auto Scaling, Machine Type
worker_options.max_num_workers = 10
worker_options.num_workers = 10
worker_options.disk_size_gb = 20
worker_options.machine_type = 'n1-standard-16'
max_num_workers: the maximum number of workers
num_workers: the initial number of workers (with auto scaling, the pool grows up to the maximum)
disk_size_gb: disk size per worker in GB (default 250GB)
machine_type: any GCE machine type (see the GCE REST API for the full list)
GPU machine types are not available.
How these worker_options are created is sketched below.
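A minimal sketch of that setup (module paths moved between early SDK releases; this follows the current apache_beam layout, and project/bucket names are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions, WorkerOptions)

options = PipelineOptions()
gcloud_options = options.view_as(GoogleCloudOptions)
gcloud_options.project = 'my-project'                    # placeholder
gcloud_options.job_name = 'tf-on-dataflow'
gcloud_options.staging_location = 'gs://bucket/staging'
gcloud_options.temp_location = 'gs://bucket/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

worker_options = options.view_as(WorkerOptions)
worker_options.max_num_workers = 10
worker_options.machine_type = 'n1-standard-16'

p = beam.Pipeline(options=options)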
Branching a Pipeline
Instead of switching with an if inside a single transform, split the Pipeline itself into branches.
Each branch is visualized in the Dataflow Console graph.
Branch
def split_branch(n, side=0):
    if n % 2 == side:
        yield n

pipe_0 = p | 'param' >> beam.Create(range(100))
branch1 = (pipe_0 | 'branch1' >> beam.FlatMap(split_branch, 0))
branch2 = (pipe_0 | 'branch2' >> beam.FlatMap(split_branch, 1))
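Branches can be merged back together with Beam's standard Flatten transform; a minimal sketch (label and bucket names are arbitrary):

merged = ((branch1, branch2)
          | 'merge' >> beam.Flatten()
          | 'output' >> beam.io.WriteToText('gs://bucket/merged'))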
Dynamic Work Rebalancing
Struggle with Stragglers
Benchmark: 2000 parameter sets (slide figures: 20 → 5 / ..., → 5-layer DNN, 15 ...)
(Results slide: ON/OFF: OFF ...; DNN ...; Dataflow ...)
DatastoreIO
DatastoreIO (Python version)
The Python SDK for Dataflow has gone GA, and DatastoreIO is now available ← New!
Note: it works at the Protobuf level, not with the idiomatic client-library objects.
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
from google.cloud.proto.datastore.v1 import entity_pb2
from google.cloud.proto.datastore.v1 import query_pb2
from googledatastore import helper as datastore_helper, PropertyFilter
from gcloud.datastore.helpers import entity_from_protobuf
The former DatastoreSource / DatastoreSink
→ replaced by ReadFromDatastore / WriteToDatastore (as of Beam v0.7.0)
Read from Datastore
Pipeline
(p | 'read from datastore' >>
ReadFromDatastore(project=PROJECTID,
query=query)
...
Query
query = query_pb2.Query()
query.kind.add().name = 'Test'
datastore_helper.set_property_filter(
query.filter, 'foo',
PropertyFilter.EQUAL, 'lorem'
)
Equivalent GQL:
SELECT * FROM Test WHERE foo = 'lorem'
Convert the Protobuf entity using the Datastore Client lib:
def csv_format(entity_pb):
    entity = entity_from_protobuf(entity_pb)
    columns = ['"%s"' % entity[k]
               for k in sorted(entity.keys())]
    return ','.join(columns)
p = beam.Pipeline(options=options)
(p | 'read from datastore' >>
ReadFromDatastore(project=PROJECTID,
query=query)
| 'format entity to csv' >>
beam.Map(csv_format)
...
Write to Datastore
Pipeline...
| 'create entity' >>
beam.Map(create_entity)
| 'write to datastore' >>
WriteToDatastore(project=PROJECTID))
create_entity builds each Entity at the Protobuf level:
import uuid  # used for the entity key names

def create_entity(param):
    entity = entity_pb2.Entity()
    datastore_helper.add_key_path(entity.key,
                                  'Test',
                                  str(uuid.uuid4()))
    datastore_helper.add_properties(entity,
                                    {"foo": u"hoge",
                                     "bar": u"fuga",
                                     "baz": 42})
    return entity
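End to end, the write side of the Pipeline then looks roughly like this (a sketch; params is any input list, as in the training example):

(p | 'generate params' >> beam.Create(params)
   | 'create entity' >> beam.Map(create_entity)
   | 'write to datastore' >> WriteToDatastore(project=PROJECTID))
p.run()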
Demo
Questions?
Thank you!
bit.ly/gcp-dataflow
Qiita
http://qiita.com/hayatoy/items/2eb2bc9223dd6f5c91e0
Medium
https://medium.com/@hayatoy/training-multiple-models-of-tensorflow-using-dataflow-7a5a9efafe53#.yvrblb6r3