Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to...
-
Upload
hadoopsummit -
Category
Technology
-
view
4.208 -
download
3
description
Transcript of Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to...
Monsanto Company Confidential - Attorney Client Privilege
Geospatial Processing @ MonsantoHadoop Summit 2013
Robert Grailer, Big Data EngineerErich Hochmuth, Data & Analytics Architecture Lead
Monsanto Company Confidential - Attorney Client Privilege
Our Vision: Sustainable AgricultureA Strong Vision That Guides All We Do
• Producing More– We are committed to increasing yields to meet
the growing demand for food, fiber & fuel
• Conserving More– We are committed to reducing the amount
of land, water and energy needed to grow our crops
• Improving Lives– We are committed to improving lives around
the world
2
Monsanto Company Confidential - Attorney Client Privilege
ADVANCED EQUIPMENT
AVERAGE CORN YIELD–300 BU/AC
AUTOMATED WEATHER STATIONS
FIELD SENSORS PROVIDING INFORMATION
ADVANCED IMAGERY TECHNOLOGY
Doubling Yields by 2030 - Farming in the Future Will Be Increasingly Information-Driven
3
Monsanto Company Confidential - Attorney Client Privilege 4
Planting Prescription 2012(DKC63-84 Brand)
Target Rate (Count)(ksds/ac)
38.00 (24.75 ac)37.00 (22.63 ac)35.00 (16.60 ac)34.00 ( 8.23 ac)33.00 ( 6.00 ac)32.00 ( 2.82 ac)
Integrated Farming Systems – FieldScriptsSM for 2014• FieldScripts℠ will deliver, by field, a corn hybrid recommendation utilizing variable
rate seeding by FieldScripts management zones to increase yield potential and reduce risk
• The science of FieldScripts is based on proprietary algorithms that combine data from the FieldScripts Testing Network and Monsanto generated hybrid response to plant population research
Precision Planting
Monsanto Company Confidential - Attorney Client Privilege5
IL Irrigated, Back 80Treatment Yield (bu/ac)
Static|34000 196FieldScripts (35000) 233
Central IL Dry Land, 47-50Treatment Yield (bu/ac)
Static|34000 139FieldScripts (33000) 145
MS Irrigated, 21Treatment Yield (bu/ac)
Static|34000 166FieldScripts (34700) 181
2012 Field Trials Indicate 5-10 bu/a Average Yield Gain
In the United States Alone:
Corn acres planted in 2013 – 96M
Price of Corn per bushel – $6.93*
Advantage of 5–10 Bu/Ac
*Price reflects CBOT price of corn 1/9/2013
Monsanto Company Confidential - Attorney Client Privilege6
Integrated Farming SystemsSM Combine Advanced Seed Genetics, On-farm Agronomic Practices, Software and Hardware Innovations to Drive Yield
DATABASE BACKBONEExpansive product by environment testing makes on-farm prescriptions possible
VARIABLE-RATE FERTILITY
Variable rate N, P & K “Apps” aligned with yield management zones
PRECISION SEEDING Planter hardware systems enabling variable rate seeding & row spacing of multiple hybrids in a field by yield management zone
FERTILITY & DISEASE
MANAGEMENT“Apps” for in-season custom application of supplemental late nitrogen and fungicides
YIELD MONITORAdvances in Yield Monitoring to deliver higher resolution data
BREEDINGSignificant increases in data points collected per year to increase annual rate genetic gain
Monsanto Company Confidential - Attorney Client Privilege7
Use Case
Public Data
Monsanto Data
Grower Data
Standardize &
LinkAlgorithms
• Load thousands of files containing spatial data• Support diverse range of data types
— tabular, vector, raster• Join & link data spatially• Generate dense grid covering entire US
— 120 billion polygons• Generate a set of derived attributes
— Think moving average• Make data available for other data products such as Field Scripts
High Level Data Flow
Monsanto Company Confidential - Attorney Client Privilege8
Version 1 Architecture
• In RDBMS spatial• PL/SQL• Multiple patches to DB Engine• Just 8% of the data!!
– 35+ days to process
• TBs in indexes• Tradeoffs
– Compressed vs. Uncompressed– Performance vs. Storage– Read vs. Write performance
• Options/recommendations– Limit use of in DB spatial functionality– Buy more RDBMS
Series105
101520253035
Data Processing Time
SoilElevationSpatial IndexProcessingD
ays
Series10
20406080
100
Data Volumes
Raw DataUncompressedCompressedSpatial IndexTB
s
Monsanto Company Confidential - Attorney Client Privilege9
Version 2 Architecture
• Combination of MapReduce & HBase• Leverage existing Hadoop cluster• MapReduce
– Parallelize everything!– Bulk HBase loads
• HBase– Spatial data model– Custom spatial engine
Monsanto Company Confidential - Attorney Client Privilege10
Data Ingestion
• Bulk load 1,000s of files into HDFS• Standardize data
– Common usable format• Storage vs. Compute• Raster format is easily splitable
• Hadoop Streaming integrated with GDAL• Streaming API Lessons Learned
– Lack of documentation– Counters to track task progress– Jobs run as mapred user– HDFS Access outside of MR
Series10
10203040506070
Data Ingestion Time
RDBMSHadoop
Hou
rs
NFS
• Raster Images• Vector Shape Files• Zip Files• Text Data
•Unzip •Convert to Raster• Re-project
HDFSHadoop
Streaming
• Raster Files
Results
Monsanto Company Confidential - Attorney Client Privilege11
Data Processing
• Process raster data– Dense matrix
• Generic InputFormat & RecordReader for raster data
• HFiles easily transportable between clusters• Challenges tuning Jobs
– IO Sort Factor– Split/Task Size
HDFS HBase
Generate DerivedAttributes
• Raster Files
Results
Pre-splittable
Generate HFiles
Series105
1015202530
Data Processing Time
RDBMSHadoop
Day
s
Monsanto Company Confidential - Attorney Client Privilege12
HBASE SCHEMA DESIGN
Monsanto Company Confidential - Attorney Client Privilege
Geospatial in HBase
Need– Dense data set– Complex computations– Scalable & cost efficient– Bulk analytics & random reads
HBase– GeoHash most notable example
• Best suited for sparse data– Precision of reads– Alphanumeric key
HBase Considerations– Key overhead– Scan vs. Get performance– Reduce reading unnecessary data
Example Field
Complex Data Interactions
Monsanto Company Confidential - Attorney Client Privilege
Global Coordinate SystemLo
ngitu
de
Latitude-180 180
-90
90
Monsanto Company Confidential - Attorney Client Privilege
Reference SystemLo
ngitu
de
Latitude-180 180
-90
90
Monsanto Company Confidential - Attorney Client Privilege
Reference System ContinuedLo
ngitu
de
Latitude
1 2 3 20
21 22 23
19
381 382 400399
190
-180 180
-90
90
4
Monsanto Company Confidential - Attorney Client Privilege17
HBase Schema Take 1Spatial Table• Key: cell_id long • Column Family: A
– Column: Data Holder• elevation• slope: float• aspect: float
• Each spatial dataset is a separate table
• All attributes for a layer that are read together are stored together
‒ Attributes packed into a single column as an Avro object• 1 row per record• 120 billion rows total!• 1,000s of Get requests per field• TBs of key overhead – roughly 56% of the data
Monsanto Company Confidential - Attorney Client Privilege
Reference System Storage Format
• Data grouped into 100 x 100 super cells
• A super cell of 100 x 100 cells is a single row in HBase
• At most 4 disk reads are required to read all data for one layer for a 150 acre field
• Given a bounding box the super cells and attributed grid cells containing the desired data can easily be computed
• A generic geospatial data service when given a set of layers will read each layer in parallel
• Overhead of key data reduced from 56% to below 0.1%
Super Grid Cells
Attributed Grid CellsSpatial Table• Key: super_cell_id long • Column Family: A
– Column: Data Holder• elevation : array float [ values ]• slope: array float [ values ]• aspect: array float [ values ]
Monsanto Company Confidential - Attorney Client Privilege
Results• Significant cost savings in required hardware• 120 billion unique polygons in total• 1.5 trillion data points• Dense grid of the entire U.S.• Foundational architecture for other spatial data sets• Fully unit tested implementation
RDBMS• 4 states only• 30+ days to load• 8 months of dev.
Hadoop• Entire U.S.• 18 hour load time• 3 months of dev.• 100% scalable• Cloud ready
Series10
10
20
30
Total Data Processing Time
RDBMS Hadoop
Day
s8% of the
data
Full data set
Total Run Time
Monsanto Company Confidential - Attorney Client Privilege20
Thank You