A Distributed Machine Learning For Giant Hogweed Eradication

Naoto Umemori and Masaru Dobashi
OSS Professional Services, NTT DATA Corporation
May 20, 2019
2019 USENIX Conference on Operational Machine Learning
© 2019 NTT DATA Corporation



Agenda

1. Introduction

2. Our OpML Experience - Giant Hogweed Eradication -

3. Lessons Learned while Adopting ML in SYS


1. Introduction


Our Capabilities

Expert / Professional Team of Open-Source Software for 15+ Years

Hadoop/Spark/Kafka Capabilities
• World’s no.7 in the Big Data era led by Hadoop. (as of 2017)
• Consulting, design, deployment, and operation of Hadoop clusters, plus publications

Experiences
• 10+ years of experience (in Distributed Computing)
• 100+ production cases (Hadoop clusters in the range of 10 - 1200+ nodes)
• Solutions that cover security, application development, cluster construction, etc.
• Customers in a wide range of industries (Automotive, Enterprise, Financial, Telco, etc.)

[Figure: book covers] These books were written by our team members.


2. Our OpML Experience - Giant Hogweed Eradication -


Overview of Giant Hogweed Eradication Project

Automating the detection of highly toxic plants by exploiting image recognition/detection processing on a distributed processing platform

[Figure: 4K images flowing through Data Sources, Data Lake, and Data Analytics]


Challenges of The Project

1. Data Volumes

2. Preparation of Supervised Data

3. Coordinate Calculation at Pinpoint


Challenge.1: Data Volumes

200.0+ Terabyte

Experiment Locations: Denmark, around 3,217 km²

Parameters and Values
• Farmland rate of Denmark*1: 62.01 [%]
• Estimated value of land area of Experiment Location: 1,994.86 -> 2,000 [km²]
• Size of an image taken by Drone: 10 [MiB]
• Aspect ratio of an image: 4 : 3
• Drone aviation altitude: 20 [m]
• FOV (Field of View): 90 [Degree]
• The estimated number of images: 20,000,000
• Estimated data volume: 200.0 [TiB]

*1: https://ecodb.net/country/DK/nature/
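The arithmetic behind the estimate can be re-traced with a short sketch. The input values are the ones listed on the slide; the function and key names are hypothetical. Two points are observations of this sketch rather than statements from the slides: the rounded 2,000 km² and 20,000,000 images together imply roughly 100 m² of ground per image, and the 200.0 figure corresponds to dividing the MiB total by 10^6, while a strict binary conversion gives a somewhat smaller TiB value.

```python
# Inputs as stated on the slide.
EXPERIMENT_AREA_KM2 = 3217      # "Around 3217km2"
FARMLAND_RATE = 0.6201          # 62.01 [%]
ROUNDED_AREA_KM2 = 2000         # 1,994.86 rounded up on the slide
IMAGE_SIZE_MIB = 10             # size of one drone image
ESTIMATED_IMAGES = 20_000_000   # estimated number of images


def estimate_data_volume():
    """Recompute the slide's estimate (hypothetical helper, not the authors' code)."""
    farmland_km2 = EXPERIMENT_AREA_KM2 * FARMLAND_RATE
    total_mib = ESTIMATED_IMAGES * IMAGE_SIZE_MIB
    return {
        "farmland_km2": farmland_km2,
        # Ground area per image implied by the two rounded figures.
        "implied_m2_per_image": ROUNDED_AREA_KM2 * 1_000_000 / ESTIMATED_IMAGES,
        "total_mib": total_mib,
        # Dividing by 10^6 reproduces the slide's 200.0 figure.
        "volume_decimal_prefix": total_mib / 1000**2,
        # Strict binary conversion from MiB to TiB.
        "volume_tib": total_mib / 1024**2,
    }


if __name__ == "__main__":
    for name, value in estimate_data_volume().items():
        print(f"{name}: {value:,.2f}")
```

Either way the order of magnitude is the same: hundreds of terabytes of imagery, which is what motivates distributed processing for inference.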


Our Approach and Solution.1

• Preparation: SAP HANA is used to prepare Supervised Data

• Machine Learning:
  [Training] TensorFlow is running on a single GPU node.
  [Inference] TensorFlow is running on Hadoop/Spark Cluster. (Distributed Processing)

[Figure: System architecture on AWS Cloud (VPC). Upstream: drone with manual take-off/landing, batch input. System stages: Collection, Accumulation, Transforming, Utilization. Components: Data Lake <Amazon S3>, Data Processing <Spark>, DWH <Hadoop>, Data Mart <SAP HANA>, ML (Training) <TensorFlow> on a GPU node, ML (Inference) <TensorFlow> with Data Processing <Spark> as distributed processing. Downstream: Report, Visualization / Analysis / BI, Interactive Analysis, Manual Labeling <SAP HANA XSA>.]


Challenge.2: Preparation of Supervised Data

Preparation of Supervised Data is a hard task. It is more difficult to identify Giant Hogweed from aerial images.

• Lack of supervised data for training is a common problem in ML.
• However, there are cases where it is difficult even to decide what counts as correct data. For instance, identifying Giant Hogweed in a drone aerial image requires the specialized knowledge of biologists.

[Figure: two sample aerial images, Giant Hogweed (Positive Class) and Weed - NOT Giant Hogweed (Negative Class)]

Could you tell the Giant Hogweed from Weed? :(


Our Approach and Solution.2

Introduce the Labeling Application for Data Preparation

• A user-friendly UI that non-engineers can operate intuitively.
• A slightly larger image is prepared, and an expert labels it on the small tiles of a 6x4 split.
• Labeled data is stored as supervised data in the Data Lake.
• Note: There is no mechanism for correcting labels once they are assigned.

[Figure: excerpt of the system architecture showing Data Lake <Amazon S3> (Accumulation), Data Processing (Transforming), Data Mart <SAP HANA> (Utilization), and Manual Labeling <SAP HANA XSA> (Downstream); the slide also carries the text "and itelligence"]


Challenge.3: Coordinate Calculation at Pinpoint

No Coordinate Calculation at Pinpoint, No Giant Hogweed Eradication

• The area covered by a single drone aerial image is quite wide.
• Images taken aerially with a drone carry latitude and longitude information. However, the input to inference processing is a split image (220x220 pix) obtained by dividing an aerial image into a 24x18 grid, and the split image does not have latitude and longitude information.
• In order to eradicate Giant Hogweed, it is necessary to pinpoint the latitude and longitude of its habitat.

[Figure: raw aerial image, 5280 [pix] (23 [m]) wide and 3960 [pix] (17.3 [m]) high, with north (N) marked; center of the image (x, y) = (Longitude, Latitude)]

• Altitude of Drone: 20[m]
• Focal Range: 15[mm]
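The 24x18 split described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `tile_boxes` is hypothetical, and the only inputs taken from the slides are the raw image size (5280x3960 pix) and the tile size (220 pix). Each box is given as (left, upper, right, lower), the same order an image library such as Pillow expects for a crop box.

```python
def tile_boxes(width_px=5280, height_px=3960, tile_px=220):
    """Return (left, upper, right, lower) pixel boxes, row by row.

    Hypothetical helper: divides a raw aerial image into square tiles.
    """
    if width_px % tile_px or height_px % tile_px:
        raise ValueError("image size must be a multiple of the tile size")
    cols = width_px // tile_px   # 24 for the slide's values
    rows = height_px // tile_px  # 18 for the slide's values
    return [
        (c * tile_px, r * tile_px, (c + 1) * tile_px, (r + 1) * tile_px)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    boxes = tile_boxes()
    print(len(boxes), boxes[0], boxes[-1])
```

Because the tiles are plain pixel crops, none of them inherits the position metadata of the raw image, which is exactly the gap this challenge describes.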


Our Approach and Solution.3

Approximated geo coordinate calculation

The spherical surface of the earth can be approximated by a horizontal plane because the length of one side of a split image is short (0.96[m], 220[pix]).

• Calculate the distance and direction from the center coordinates of the raw image (C) to the center coordinates of the split image (E).
• Convert pixel values to physical distances by using Exif.
• Calculate the latitude and longitude of the split image by approximating the earth ellipsoid by a horizontal plane.

[Figure: three panels. (1) Pixel offsets (xpix, ypix) from C to E. (2) Metric offsets (xm, ym) from C to E. (3) Raw image with center C at (longC, latC), split image with center E, and reference points M1 (longM1, latM1), M2 (longM2, latM2), and (longM2, latM1).]
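The three steps above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes the raw image is oriented north-up, takes the ground footprint (23 m by 17.3 m) directly from the previous slide instead of reading Exif, and uses a spherical earth with a mean radius of 6,371 km for the metre-to-degree conversion. The function name and parameters are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean earth radius; an assumption of this sketch


def tile_center_latlon(center_lat, center_lon, col, row,
                       img_w_px=5280, img_h_px=3960,
                       ground_w_m=23.0, ground_h_m=17.3,
                       tile_px=220):
    """Estimate (lat, lon) of the center E of tile (col, row).

    Assumes a north-up image whose center C is at (center_lat, center_lon).
    """
    # Step 1: pixel offset from raw-image center C to tile center E.
    x_pix = (col + 0.5) * tile_px - img_w_px / 2
    y_pix = (row + 0.5) * tile_px - img_h_px / 2

    # Step 2: pixels to metres. Image y grows downward, so north is -y.
    east_m = x_pix * ground_w_m / img_w_px
    north_m = -y_pix * ground_h_m / img_h_px

    # Step 3: flat-plane approximation, metres to degrees.
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    dlon = math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(center_lat)))
    )
    return center_lat + dlat, center_lon + dlon


if __name__ == "__main__":
    # Example center coordinates are illustrative only.
    print(tile_center_latlon(56.0, 10.0, col=0, row=0))
```

A drone heading other than north would require rotating the (east, north) offset by the heading angle before step 3; the slides do not say how the project handled orientation.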


Summary up to here

1. Data Volumes: Distributed processing based on Apache Hadoop/Spark

2. Preparation of Supervised Data: Labeling Application for Data Preparation based on SAP HANA Platform

3. Coordinate Calculation at Pinpoint: Approximated geo coordinate calculation


3. Lessons Learned while Adopting ML in SYS


Context of Our Architecture Design

ML Systems

• Configuration
• Data Collection
• Data Verification
• Machine Resource Management
• Serving Infrastructure
• Monitoring
• Analysis Tools
• Feature Extraction
• Process Management Tools
• ML Code

Scaled ML Systems

• Orchestration
• Data Pipelines
• Data Verification
• System Resource Management
• Distributed-Serving Infrastructure
• System Monitoring
• Analysis Platform
• Feature Extraction
• Job Management Platform
• Distributed-ML Code

From “Hidden Technical Debt in Machine Learning Systems”, D. Sculley et al. (Google), paper at NIPS 2015



Data Pipelines for Scalable ML Systems

Inference: Hadoop/Spark + TensorFlow (Each Spark job natively calls TensorFlow libs.)

Training: (Future work)

[Figure: the system architecture on AWS Cloud (VPC), with upstream omitted, annotated with four pipelines: Preparation pipeline, Training pipeline, Inference pipeline, and Analysis pipeline. Components: Data Lake <Amazon S3>, Data Processing <Spark>, DWH <Hadoop>, Data Mart <SAP HANA>, ML (Training) <TensorFlow>, ML (Inference) <TensorFlow>, Manual Labeling <SAP HANA XSA>, Report, Visualization / Analysis / BI, Interactive Analysis.]


Distributed-ML Code

There are four approaches for Distributed-ML Model Development and Model Operation.

Approach (Dev 2 Op) and Model Workflow (Development -> Operation), with example tools ("ex."):

• 1: ML-friendly to ML-friendly: python / TensorFlow -> AP modification -> Distributed TF / Custom FW
• 2: ML-friendly to SYS-friendly: python / TensorFlow -> Export, Import Model -> TFonSpark / BigDL
• 3: SYS-friendly to ML-friendly: N/A
• 4: SYS-friendly to SYS-friendly: BigDL / DL4J -> Scale out -> BigDL / DL4J

We have chosen approach-2.


LL: Integration of Apache Spark and TF

The complexity of the processing logic changes depending on whether the inference processing is implemented on the Spark API or on the TensorFlow API.

[Figure: Sequence diagram in case of TensorFlow API-based inference processing w/o HDFS command (25). Groups: DWH, Data Processing, ML (Inference). Lifeline labels: HDFS, py4j, Infer AP, Driver, Executor, (Python-daemon), TensorFlow. Steps shown: Get file path of images; Launch JVM; Get model module; Get model parameter; Get images; collect(); Serialize the data to Protocol Buffer; infer()]


[Figure: Sequence diagram in case of TensorFlow API-based inference processing (25->22). Groups: DWH, Data Processing, ML (Inference). Lifeline labels: HDFS, py4j, Infer AP, Driver, Executor, (Python-daemon), TensorFlow. Steps shown: Get image list (hdfs command/jvm); Get model module (hdfs command/jvm), if not exist; Get model parameter (hdfs command/jvm), if not exist; Load model module; Load model parameter; Get image files (hdfs command/jvm); infer(); collect(); Save image list (hdfs command/jvm)]


[Figure: Sequence diagram in case of Apache Spark API-based inference processing (25->22->16). Groups: DWH, Data Processing, ML (Inference). Lifeline labels: HDFS, py4j, Infer AP, Driver, Executor, (Python-daemon), TensorFlow. Steps shown: Distribute model module, model parameter (spark submit); Load sequence files; Load model module; Load model parameter; infer(); Save sequence files]
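The executor-side part of the Spark API-based sequence (load the model, then call infer() over the records) can be illustrated with a library-free sketch. Everything named here is hypothetical: `make_partition_inferer`, `load_model`, and the stand-in model are not from the slides, and a real job would load a TensorFlow model instead. The shape of `infer_partition`, a function from an iterator of records to an iterator of results, is the shape PySpark's `RDD.mapPartitions` accepts, which is why the model can be loaded once per partition rather than once per image.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A "model" here is any callable from image bytes to a score (stand-in type).
Model = Callable[[bytes], float]


def make_partition_inferer(load_model: Callable[[], Model]):
    """Build a per-partition inference function (hypothetical helper)."""

    def infer_partition(
        records: Iterable[Tuple[str, bytes]]
    ) -> Iterator[Tuple[str, float]]:
        # Load model module and parameters once for the whole partition.
        model = load_model()
        for key, image_bytes in records:
            # infer(): one score per (key, image) record.
            yield key, model(image_bytes)

    return infer_partition


if __name__ == "__main__":
    # Stand-in loader: scores an image by its byte length. Not a real model.
    def fake_loader() -> Model:
        return lambda image_bytes: float(len(image_bytes))

    infer = make_partition_inferer(fake_loader)
    print(list(infer([("tile-0", b"abc"), ("tile-1", b"abcdef")])))
```

In a Spark job, a function like `infer_partition` would be passed to `mapPartitions` on an RDD of (key, bytes) pairs read from sequence files, with the results written back as sequence files; those surrounding steps are left out here because the slides do not show the actual code.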


Summary

• We introduced the challenges, approaches, and solutions in our ML project, Giant Hogweed Eradication.

• We shared the lessons we learned from adopting ML in SYS through the project.

Future Work: Through trial and error, establish better practices for the other items in Scaled ML Systems.
