Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to...

Monsanto Company Confidential - Attorney Client Privilege Geospatial Processing @ Monsanto Hadoop Summit 2013 Robert Grailer, Big Data Engineer Erich Hochmuth, Data & Analytics Architecture Lead

description

Monsanto built a geospatial platform on Hadoop and HBase capable of managing over 120 billion polygons. Because of the extreme data volumes and compute complexity, we were forced to migrate our data processing from a more traditional RDBMS to a scale-out Hadoop implementation. Data processing that took over 30 days on 8% of the data now runs in under 12 hours on the entire data set. Very little concrete material exists on how to process spatial data via MapReduce or model it in HBase. We will provide concrete and novel examples for processing and storing spatial data on Hadoop and HBase. As part of the data processing pipeline, we integrated the popular open source geospatial processing library GDAL with MapReduce to convert all geospatial datasets to a common format and projection. We developed a method for splitting and processing images via MapReduce in which the boundaries of splits had to be shared by multiple tasks because of the nature of the computation performed on the data. Bulk writes to HBase were performed by writing HFiles directly. Finally, we developed a novel method for storing geospatial data in HBase that met the needs of our access pattern.

Transcript of Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to...

Page 1: Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to help farmers increase their yield

Monsanto Company Confidential - Attorney Client Privilege

Geospatial Processing @ Monsanto

Hadoop Summit 2013

Robert Grailer, Big Data Engineer

Erich Hochmuth, Data & Analytics Architecture Lead

Page 2

Our Vision: Sustainable Agriculture

A Strong Vision That Guides All We Do

• Producing More – We are committed to increasing yields to meet the growing demand for food, fiber & fuel

• Conserving More – We are committed to reducing the amount of land, water and energy needed to grow our crops

• Improving Lives – We are committed to improving lives around the world

Page 3

Doubling Yields by 2030 - Farming in the Future Will Be Increasingly Information-Driven

• ADVANCED EQUIPMENT
• AVERAGE CORN YIELD – 300 BU/AC
• AUTOMATED WEATHER STATIONS
• FIELD SENSORS PROVIDING INFORMATION
• ADVANCED IMAGERY TECHNOLOGY

Page 4

Integrated Farming Systems – FieldScripts℠ for 2014

• FieldScripts℠ will deliver, by field, a corn hybrid recommendation utilizing variable rate seeding by FieldScripts management zones to increase yield potential and reduce risk

• The science of FieldScripts is based on proprietary algorithms that combine data from the FieldScripts Testing Network and Monsanto generated hybrid response to plant population research

Planting Prescription 2012 (DKC63-84 Brand)

Target Rate (Count) (ksds/ac)    Area
38.00                            24.75 ac
37.00                            22.63 ac
35.00                            16.60 ac
34.00                             8.23 ac
33.00                             6.00 ac
32.00                             2.82 ac

Precision Planting

Page 5

2012 Field Trials Indicate 5-10 bu/a Average Yield Gain

IL Irrigated, Back 80
Treatment               Yield (bu/ac)
Static 34000            196
FieldScripts (35000)    233

Central IL Dry Land, 47-50
Treatment               Yield (bu/ac)
Static 34000            139
FieldScripts (33000)    145

MS Irrigated, 21
Treatment               Yield (bu/ac)
Static 34000            166
FieldScripts (34700)    181

In the United States Alone:

• Corn acres planted in 2013 – 96M
• Price of Corn per bushel – $6.93*
• Advantage of 5–10 Bu/Ac

*Price reflects CBOT price of corn 1/9/2013
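
The three figures on this slide can be combined into a rough dollar range. The sketch below is only arithmetic on the slide's numbers; applying the 5–10 bu/ac gain to every planted acre is an assumption made here for illustration, and the function name is hypothetical.

```python
# Figures taken from the slide. Applying the gain to every planted acre
# is an illustrative assumption, not a claim made by the deck.
ACRES_PLANTED = 96_000_000   # U.S. corn acres planted in 2013
PRICE_CENTS_PER_BU = 693     # $6.93 per bushel (CBOT, 1/9/2013)


def added_value_dollars(gain_bu_per_acre: int) -> int:
    """Gross value of the extra bushels, in whole dollars."""
    # Work in cents so the arithmetic stays in exact integers.
    return ACRES_PLANTED * gain_bu_per_acre * PRICE_CENTS_PER_BU // 100


low = added_value_dollars(5)
high = added_value_dollars(10)
print(f"${low:,} to ${high:,}")
```

Under that assumption the range works out to roughly $3.3 billion to $6.7 billion per year in gross crop value.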

Page 6

Integrated Farming Systems℠ Combine Advanced Seed Genetics, On-farm Agronomic Practices, Software and Hardware Innovations to Drive Yield

• DATABASE BACKBONE: Expansive product by environment testing makes on-farm prescriptions possible
• VARIABLE-RATE FERTILITY: Variable rate N, P & K “Apps” aligned with yield management zones
• PRECISION SEEDING: Planter hardware systems enabling variable rate seeding & row spacing of multiple hybrids in a field by yield management zone
• FERTILITY & DISEASE MANAGEMENT: “Apps” for in-season custom application of supplemental late nitrogen and fungicides
• YIELD MONITOR: Advances in Yield Monitoring to deliver higher resolution data
• BREEDING: Significant increases in data points collected per year to increase annual rate of genetic gain

Page 7

Use Case

• Load thousands of files containing spatial data
• Support diverse range of data types
  – tabular, vector, raster
• Join & link data spatially
• Generate dense grid covering entire US
  – 120 billion polygons
• Generate a set of derived attributes
  – Think moving average
• Make data available for other data products such as FieldScripts

[Diagram: High Level Data Flow. Labels: Public Data, Monsanto Data, Grower Data; Standardize & Link; Algorithms]
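
The "think moving average" hint for derived attributes can be illustrated with a focal mean over a small raster. This is a minimal sketch in plain Python, not the algorithm used in the talk; the function name `focal_mean`, the window radius, and the choice to clip the window at the grid edge are all assumptions.

```python
from typing import List

Grid = List[List[float]]


def focal_mean(grid: Grid, radius: int = 1) -> Grid:
    """Mean of each cell's (2*radius+1)^2 neighbourhood, clipped at the grid edge."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Clip the window so edge cells average only the neighbours that exist.
            r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
            c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
            window = [grid[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            out[r][c] = sum(window) / len(window)
    return out


elevation = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
smoothed = focal_mean(elevation)
```

Each output cell depends on its neighbours, which is why a computation of this kind needs rows from adjacent splits when the raster is processed in parallel.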

Page 8

Version 1 Architecture

• In RDBMS spatial
• PL/SQL
• Multiple patches to DB Engine
• Just 8% of the data!!
  – 35+ days to process
• TBs in indexes
• Tradeoffs
  – Compressed vs. Uncompressed
  – Performance vs. Storage
  – Read vs. Write performance
• Options/recommendations
  – Limit use of in DB spatial functionality
  – Buy more RDBMS

[Chart: Data Processing Time (days). Categories: Soil, Elevation, Spatial Index, Processing]

[Chart: Data Volumes (TBs). Categories: Raw Data, Uncompressed, Compressed, Spatial Index]

Page 9

Version 2 Architecture

• Combination of MapReduce & HBase
• Leverage existing Hadoop cluster
• MapReduce
  – Parallelize everything!
  – Bulk HBase loads
• HBase
  – Spatial data model
  – Custom spatial engine

Page 10

Data Ingestion

• Bulk load 1,000s of files into HDFS
• Standardize data
  – Common usable format
    • Storage vs. Compute
    • Raster format is easily splittable
• Hadoop Streaming integrated with GDAL
• Streaming API Lessons Learned
  – Lack of documentation
  – Counters to track task progress
  – Jobs run as mapred user
  – HDFS Access outside of MR

[Diagram: NFS (Raster Images, Vector Shape Files, Zip Files, Text Data); Hadoop Streaming (Unzip, Convert to Raster, Re-project); HDFS (Raster Files)]

[Chart: Results. Data Ingestion Time (hours), RDBMS vs. Hadoop]
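
The ingestion step (Hadoop Streaming calling GDAL, with counters for progress and HDFS access from outside MapReduce) might look roughly like the mapper below. This is a sketch under stated assumptions, not the code from the talk: the target projection `EPSG:4326`, the GeoTIFF output format, the counter names, and all function names are hypothetical. The `gdalwarp` flags `-t_srs` and `-of`, the `hdfs dfs -get` and `-put` commands, and the `reporter:counter:` stderr convention are standard.

```python
"""Hadoop Streaming mapper sketch: each input line is the HDFS path of one raster file."""
import os
import subprocess
import sys
import tempfile

TARGET_SRS = "EPSG:4326"  # assumed target projection; the slides do not name one


def warp_command(src: str, dst: str) -> list:
    # gdalwarp re-projects a raster: -t_srs sets the target SRS, -of the output format.
    return ["gdalwarp", "-t_srs", TARGET_SRS, "-of", "GTiff", src, dst]


def count(group: str, name: str, amount: int = 1) -> None:
    # Streaming tasks update job counters by writing this line format to stderr.
    sys.stderr.write(f"reporter:counter:{group},{name},{amount}\n")


def process(hdfs_path: str, out_dir: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        local_in = os.path.join(tmp, os.path.basename(hdfs_path))
        local_out = os.path.join(tmp, "warped.tif")
        # GDAL reads local files here, so copy the input out of HDFS first.
        subprocess.check_call(["hdfs", "dfs", "-get", hdfs_path, local_in])
        subprocess.check_call(warp_command(local_in, local_out))
        dest = out_dir.rstrip("/") + "/" + os.path.basename(hdfs_path) + ".tif"
        subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_out, dest])
    count("ingest", "files_converted")
    # Emit "input path <TAB> output path" as the mapper's record.
    print(f"{hdfs_path}\t{dest}")


def main(lines, out_dir: str) -> None:
    for line in lines:
        path = line.strip()
        if path:
            process(path, out_dir)

# When used as the streaming mapper script, end the file with:
#   main(sys.stdin, "/hypothetical/output/dir")
```

Because the task shells out to `hdfs dfs`, the files it writes belong to whichever user the task runs as, which is one way the "jobs run as mapred user" lesson can surface.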

Page 11

Data Processing

• Process raster data
  – Dense matrix
• Generic InputFormat & RecordReader for raster data
• HFiles easily transportable between clusters
• Challenges tuning Jobs
  – IO Sort Factor
  – Split/Task Size

[Diagram: HDFS (Raster Files); Generate Derived Attributes; Generate HFiles; HBase. Additional label: Pre-splittable]

[Chart: Results. Data Processing Time (days), RDBMS vs. Hadoop]
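
The talk description notes that split boundaries had to be shared by multiple tasks. One common way to do this is to give each split a "halo" of extra rows on either side. The sketch below shows that idea for row bands of a raster; it is an assumption about the general technique, not the authors' InputFormat, and the names `RowSplit` and `halo_splits` are hypothetical.

```python
from typing import List, NamedTuple


class RowSplit(NamedTuple):
    read_start: int  # first row the task reads (includes the halo)
    read_end: int    # one past the last row the task reads
    own_start: int   # first row the task writes output for
    own_end: int     # one past the last row the task writes output for


def halo_splits(total_rows: int, rows_per_split: int, halo: int) -> List[RowSplit]:
    """Cut a raster into row bands whose read ranges overlap by `halo` rows."""
    splits = []
    for own_start in range(0, total_rows, rows_per_split):
        own_end = min(own_start + rows_per_split, total_rows)
        splits.append(RowSplit(
            read_start=max(0, own_start - halo),
            read_end=min(total_rows, own_end + halo),
            own_start=own_start,
            own_end=own_end,
        ))
    return splits


splits = halo_splits(total_rows=10, rows_per_split=4, halo=1)
for s in splits:
    print(s)
```

Every output row is owned by exactly one task, while the boundary rows are read by two tasks, so a windowed computation sees the same neighbours it would see on the unsplit raster.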

Page 12

HBASE SCHEMA DESIGN

Page 13

Geospatial in HBase

Need
– Dense data set
– Complex computations
– Scalable & cost efficient
– Bulk analytics & random reads

HBase
– GeoHash most notable example
  • Best suited for sparse data
– Precision of reads
– Alphanumeric key

HBase Considerations
– Key overhead
– Scan vs. Get performance
– Reduce reading unnecessary data

[Figures: Example Field; Complex Data Interactions]

Page 14

Global Coordinate System

[Figure: map with longitude spanning -180 to 180 and latitude spanning -90 to 90]

Page 15

Reference System

[Figure: reference system drawn over the same extent, longitude -180 to 180 and latitude -90 to 90]

Page 16

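Reference System Continued

[Figure: the same extent (longitude -180 to 180, latitude -90 to 90) divided into numbered grid cells. The numbers shown are 1, 2, 3, 4, … 19, 20 on the first row; 21, 22, 23, … on the second row; 190 in the interior; and 381, 382, … 399, 400 on the last row]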
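
The numbering in this figure (1 to 20 on the first row, 21 onward on the second, ending at 400) is consistent with a row-major cell id over a 20 x 20 illustration grid. A minimal sketch of that mapping follows; starting the numbering at the north-west corner and the function name `cell_id` are assumptions, since the slide does not state the origin.

```python
def cell_id(lat: float, lon: float, rows: int = 20, cols: int = 20) -> int:
    """1-based, row-major cell id; row 0 is the northern edge (an assumption)."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinate out of range")
    cell_h = 180.0 / rows  # degrees of latitude per cell
    cell_w = 360.0 / cols  # degrees of longitude per cell
    # Clamp so the exact southern and eastern edges fall in the last row/column.
    row = min(int((90.0 - lat) / cell_h), rows - 1)
    col = min(int((lon + 180.0) / cell_w), cols - 1)
    return row * cols + col + 1


print(cell_id(90.0, -180.0))   # north-west corner
print(cell_id(-90.0, 180.0))   # south-east corner
print(cell_id(4.5, -9.0))      # an interior cell
```

The production grid is far denser than 20 x 20; only the `rows` and `cols` arguments would change, and the id would need a 64-bit integer.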

Page 17

HBase Schema Take 1

Spatial Table
• Key: cell_id long
• Column Family: A
  – Column: Data Holder
    • elevation
    • slope: float
    • aspect: float

• Each spatial dataset is a separate table
• All attributes for a layer that are read together are stored together
  – Attributes packed into a single column as an Avro object
• 1 row per record
• 120 billion rows total!
• 1,000s of Get requests per field
• TBs of key overhead – roughly 56% of the data
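
To see why key overhead can dominate when each row holds only a few floats, it helps to count the bytes HBase stores alongside every cell value. The sketch below follows the classic KeyValue layout (key length, value length, row length, row, family length, family, qualifier, timestamp, type) and ignores optional tags, compression, and block indexes. The one-byte qualifier and the example value size are assumptions; the slides do not give the real sizes.

```python
def keyvalue_overhead(row_len: int, family_len: int, qualifier_len: int) -> int:
    """Bytes stored per cell in addition to the value itself."""
    # 4 key length + 4 value length + 2 row length + 1 family length
    # + 8 timestamp + 1 key type
    fixed = 4 + 4 + 2 + 1 + 8 + 1
    return fixed + row_len + family_len + qualifier_len


def overhead_fraction(value_len: int, row_len: int = 8,
                      family_len: int = 1, qualifier_len: int = 1) -> float:
    """Share of each stored cell that is key material rather than data."""
    overhead = keyvalue_overhead(row_len, family_len, qualifier_len)
    return overhead / (overhead + value_len)


# 8-byte long row key, family "A", hypothetical 1-byte qualifier.
print(keyvalue_overhead(8, 1, 1))
# Hypothetical 24-byte serialized value.
print(round(overhead_fraction(24), 3))
```

With these assumed sizes the key material is about 30 bytes per cell, so a value of a few dozen bytes leaves more than half of the stored bytes as overhead, the same order as the figure quoted on the slide.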

Page 18

Reference System Storage Format

• Data grouped into 100 x 100 super cells

• A super cell of 100 x 100 cells is a single row in HBase

• At most 4 disk reads are required to read all data for one layer for a 150 acre field

• Given a bounding box the super cells and attributed grid cells containing the desired data can easily be computed

• A generic geospatial data service when given a set of layers will read each layer in parallel

• Overhead of key data reduced from 56% to below 0.1%

[Figure: Super Grid Cells; Attributed Grid Cells]

Spatial Table
• Key: super_cell_id long
• Column Family: A
  – Column: Data Holder
    • elevation: array float [ values ]
    • slope: array float [ values ]
    • aspect: array float [ values ]
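
The super cell layout can be sketched as integer arithmetic on grid-cell row and column indices. The code below is an illustration under assumptions: ids are 0-based and row-major, `super_cols` (the number of super cells per grid row) is a hypothetical parameter, and the function names are invented for this example.

```python
from typing import List, Tuple

SUPER = 100  # cells per super-cell side, from the slide


def super_cell_of(cell_row: int, cell_col: int, super_cols: int) -> Tuple[int, int]:
    """Return (super_cell_id, offset of the cell inside the row's value arrays)."""
    s_row, s_col = cell_row // SUPER, cell_col // SUPER
    offset = (cell_row % SUPER) * SUPER + (cell_col % SUPER)
    return s_row * super_cols + s_col, offset


def super_cells_for_bbox(row_min: int, col_min: int, row_max: int, col_max: int,
                         super_cols: int) -> List[int]:
    """Ids of every super cell that intersects an inclusive bounding box of cells."""
    ids = []
    for s_row in range(row_min // SUPER, row_max // SUPER + 1):
        for s_col in range(col_min // SUPER, col_max // SUPER + 1):
            ids.append(s_row * super_cols + s_col)
    return ids


# A field whose bounding box is 80 x 80 cells and straddles super-cell borders.
keys = super_cells_for_bbox(150, 380, 229, 459, super_cols=1000)
print(keys)
print(super_cell_of(150, 380, super_cols=1000))
```

Any bounding box no larger than 100 cells on a side touches at most a 2 x 2 block of super cells, which is in line with the "at most 4 disk reads" bullet when the field fits inside one super cell's extent.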

Page 19

Results

• Significant cost savings in required hardware
• 120 billion unique polygons in total
• 1.5 trillion data points
• Dense grid of the entire U.S.
• Foundational architecture for other spatial data sets
• Fully unit tested implementation

RDBMS
• 4 states only
• 30+ days to load
• 8 months of dev.

Hadoop
• Entire U.S.
• 18 hour load time
• 3 months of dev.
• 100% scalable
• Cloud ready

[Chart: Total Run Time. Total Data Processing Time (days), RDBMS (8% of the data) vs. Hadoop (full data set)]

Page 20

Thank You