Magellan: Geospatial Analytics on Spark by Ram Sriharsha
Page 1
Magellan: Geospatial Analytics on Spark
Ram Sriharsha
Spark and Data Science Architect, Hortonworks
Page 2
What is geospatial context?
• Given a point = (-122.412651, 37.777748), which city is it in?
• Does shape X intersect shape Y?
– Compute the intersection
• Given a sequence of points and a system of roads
– Compute the best path representing the points
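The first question on this slide (which city contains a point?) reduces to a point-in-polygon test. The following is a minimal sketch using ray casting, not code from Magellan. The `sf_box` rectangle is a hypothetical stand-in for a city boundary, chosen only so the slide's example point falls inside it.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) = (longitude, latitude)


def point_in_polygon(pt: Point, ring: List[Point]) -> bool:
    """Ray casting: count the polygon edges crossed by a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# Hypothetical rectangle, illustrative only (not a real city boundary).
sf_box = [(-122.52, 37.70), (-122.35, 37.70), (-122.35, 37.83), (-122.52, 37.83)]

print(point_in_polygon((-122.412651, 37.777748), sf_box))
```

A real boundary would have many more vertices and possibly holes, which is where a geometry library earns its keep.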
Page 3
Geospatial context is useful
• What neighborhoods do people go to on weekends?
• Predict the drop-off neighborhood of a user
• Predict the location where the next pick-up can be expected
• How does the usage pattern change with time?
• Identify crime hotspot neighborhoods
• How do these hotspots evolve with time?
• Predict the likelihood of crime occurring in a given neighborhood
• Predict climate at a fairly granular level
• Climate insurance: do I need to buy insurance for my crops?
• Climate as a factor in crime: join the climate dataset with crimes
Page 4
Geospatial data is pervasive
Page 5
Why geospatial now?
Vast mobile data + geospatial = truly big data problem!
Page 6
Do you think we need one more geospatial library?
Page 7
Ancient data formats
Page 8
Coordinate System Hell!
Mobile data = GPS coordinates
Map coordinate systems optimized for precision
⇒ Transform from one to another
No good transformation libraries
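To make "transform from one to another" concrete, here is a small sketch that converts GPS-style WGS84 longitude/latitude into spherical Web Mercator metres and back. Web Mercator is picked only because its formula is short; it is an assumption for illustration and not one of the precision-oriented map systems the slide alludes to. The function names are hypothetical.

```python
import math
from typing import Tuple

# WGS84 semi-major axis in metres, used as the sphere radius by Web Mercator.
EARTH_RADIUS_M = 6378137.0


def wgs84_to_web_mercator(lon_deg: float, lat_deg: float) -> Tuple[float, float]:
    """Project longitude/latitude in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def web_mercator_to_wgs84(x: float, y: float) -> Tuple[float, float]:
    """Inverse projection: Web Mercator metres back to degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat


x, y = wgs84_to_web_mercator(-122.412651, 37.777748)
print(x, y)
print(web_mercator_to_wgs84(x, y))
```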
Page 9
Venn Diagram of geospatial libraries?
• Simple, intuitive, handles common formats
• Scalable
• Feature rich but still extensible
Page 10
Language integration simplifies exploratory analytics
[Diagram: machine learning pipeline. Labels: HDFS; Parse + Clean Logs; Data Prep; Feature Extractors (Q-Q similarity, Q-A similarity, Ad category mapping, Query category mapping, PolyExp(Q-A) Features, Spatial Context); Train/Test Split (train, test/validation); Model; Convex Solver; Metrics; Ad Server; Feedback. Legend: Data Flow Stage - Batch; Score Model - Real-time.]
Page 11
Not all is lost!
• Local computations w/ ESRI Java API
• Scale out computation w/ Spark
• Python + R support without compromising performance via PySpark, SparkR
• Catalyst + Data Sources + Data Frames = Flexibility + Simplicity + Performance
• Stitch it all together + Allow extension points => Success!
Page 12
Magellan: a complete story for geospatial?
Create geospatial analytics applications faster:
• Use your favorite language (Python/Scala), even R
• Get best in class algorithms for common spatial analytics
• Write less code
• Read data efficiently
• Let the optimizer do the heavy lifting
Page 13
How does it work?
Custom Data Types for Shapes:
• Point, Line, PolyLine, Polygon extend Shape
• Local Computations using ESRI Java API
• No need for Scala -> SQL serialization
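The shape hierarchy named on this slide can be pictured roughly as follows. This is a plain-Python sketch of the idea that every concrete shape extends a common `Shape` type; the fields and the `bounding_box` method are assumptions made for illustration, not Magellan's actual class definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


class Shape(ABC):
    """Common supertype, so operators can accept any shape."""

    @abstractmethod
    def bounding_box(self) -> BBox:
        ...


@dataclass(frozen=True)
class Point(Shape):
    x: float
    y: float

    def bounding_box(self) -> BBox:
        return (self.x, self.y, self.x, self.y)


def _bbox(points: Iterable[Point]) -> BBox:
    """Smallest axis-aligned box covering all the given points."""
    pts = list(points)
    xs = [p.x for p in pts]
    ys = [p.y for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


@dataclass(frozen=True)
class Line(Shape):
    start: Point
    end: Point

    def bounding_box(self) -> BBox:
        return _bbox([self.start, self.end])


@dataclass(frozen=True)
class PolyLine(Shape):
    points: Tuple[Point, ...]

    def bounding_box(self) -> BBox:
        return _bbox(self.points)


@dataclass(frozen=True)
class Polygon(Shape):
    ring: Tuple[Point, ...]  # outer boundary only; holes are left out of this sketch

    def bounding_box(self) -> BBox:
        return _bbox(self.ring)


triangle = Polygon((Point(0.0, 0.0), Point(2.0, 0.0), Point(2.0, 3.0)))
print(triangle.bounding_box())
```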
Expressions for Operators:
• Literals e.g. point(-122.4, 37.6)
• Boolean Expressions e.g. Intersects, Contains
• Binary Expressions e.g. Intersection
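One way to read the three kinds of expression above is as nodes of a small tree that is evaluated once per row. The sketch below is an assumption-laden toy, not Catalyst or Magellan code: shapes are reduced to axis-aligned boxes so that `Within` and `Intersection` stay short, and every class name other than the operator names from the slide is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

# Simplification: every shape is an axis-aligned box (xmin, ymin, xmax, ymax).
Box = Tuple[float, float, float, float]
Row = Dict[str, Any]


class Expr:
    """Base node: evaluate against one row of data."""

    def eval(self, row: Row) -> Any:
        raise NotImplementedError


@dataclass
class Literal(Expr):
    value: Any

    def eval(self, row: Row) -> Any:
        return self.value


@dataclass
class Column(Expr):
    name: str

    def eval(self, row: Row) -> Any:
        return row[self.name]


@dataclass
class Within(Expr):
    """Boolean expression: is the left box fully inside the right box?"""
    left: Expr
    right: Expr

    def eval(self, row: Row) -> bool:
        a, b = self.left.eval(row), self.right.eval(row)
        return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]


@dataclass
class Intersection(Expr):
    """Binary expression: returns the overlapping box, or None if disjoint."""
    left: Expr
    right: Expr

    def eval(self, row: Row) -> Optional[Box]:
        a, b = self.left.eval(row), self.right.eval(row)
        xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
        xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
        return (xmin, ymin, xmax, ymax) if xmin <= xmax and ymin <= ymax else None


def point(x: float, y: float) -> Literal:
    """Shape literal: a point is a zero-area box in this sketch."""
    return Literal((x, y, x, y))


row = {"polygon": (-123.0, 37.0, -122.0, 38.0)}  # hypothetical row
print(Within(point(-122.4, 37.6), Column("polygon")).eval(row))
```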
Custom Data Sources:
• Schema = [point, polyline, polygon, metadata]
• Metadata = Map[String, String]
• GeoJSON and Shapefile implementations
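The schema above suggests that each record fills exactly one of the shape columns and carries its attributes as string metadata. A rough sketch of that mapping for GeoJSON, using only the standard library, might look like this. The function name `geojson_to_rows`, the choice to map LineString to the polyline column, and the sample document are assumptions for illustration, not Magellan's reader.

```python
import json
from typing import Any, Dict, List

# Which schema column each GeoJSON geometry type lands in (assumed mapping).
COLUMN_FOR_TYPE = {"Point": "point", "LineString": "polyline", "Polygon": "polygon"}


def geojson_to_rows(text: str) -> List[Dict[str, Any]]:
    """Turn a GeoJSON FeatureCollection into rows of [point, polyline, polygon, metadata]."""
    rows = []
    for feature in json.loads(text)["features"]:
        geometry = feature["geometry"]
        properties = feature.get("properties") or {}
        row: Dict[str, Any] = {
            "point": None,
            "polyline": None,
            "polygon": None,
            # Metadata = Map[String, String]: stringify every property value.
            "metadata": {key: str(value) for key, value in properties.items()},
        }
        column = COLUMN_FOR_TYPE.get(geometry["type"])
        if column is not None:
            row[column] = geometry["coordinates"]
        rows.append(row)
    return rows


# Hypothetical sample document.
sample = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "geometry": {"type": "Point", "coordinates": [-122.4, 37.6]},
   "properties": {"name": "stop", "id": 7}}
]}
"""
rows = geojson_to_rows(sample)
print(rows[0])
```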
Custom Strategies for Spatial Join:
• Broadcast Cartesian Join
• Geohash Join (in progress)
• Plug into Catalyst as experimental strategies
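The idea behind a broadcast Cartesian join is that the small side (polygons) is copied to every partition of the large side (points), and each point is then tested against every polygon locally. The sketch below imitates that in plain Python with lists standing in for partitions. It is not Magellan's strategy code, and polygons are simplified to rectangles so the predicate stays short; all names and data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), stand-in for a polygon


def within(p: Point, r: Rect) -> bool:
    """Simplified predicate: is the point inside the rectangle?"""
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]


def join_partition(points: Iterable[Point],
                   broadcast_polygons: Dict[str, Rect]) -> Iterator[Tuple[Point, str]]:
    """Runs once per partition: compare every local point with every broadcast polygon."""
    for p in points:
        for name, rect in broadcast_polygons.items():
            if within(p, rect):
                yield (p, name)


def broadcast_cartesian_join(partitions: List[List[Point]],
                             polygons: Dict[str, Rect]) -> List[Tuple[Point, str]]:
    # In Spark the small side would be shipped to each executor once;
    # here every "partition" simply reads the same dict.
    out: List[Tuple[Point, str]] = []
    for part in partitions:
        out.extend(join_partition(part, polygons))
    return out


# Hypothetical data: two neighborhoods and three points in two partitions.
neighborhoods = {"A": (0.0, 0.0, 1.0, 1.0), "B": (1.0, 0.0, 2.0, 1.0)}
partitions = [[(0.5, 0.5), (1.5, 0.2)], [(3.0, 3.0)]]
result = broadcast_cartesian_join(partitions, neighborhoods)
print(result)
```

The cost is one predicate call per (point, polygon) pair, which is why this only suits a small polygon set and why the slide lists an index-based join as follow-up work.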
Page 14
Magellan in a nutshell
• Read Shapefiles/GeoJSON as Data Sources:
– sqlContext.read.format("magellan").load(path)
– sqlContext.read.format("magellan").option("type", "geojson").load(path)
• Spatial Queries using Expressions:
– point(-122.5, 37.6) = Shape Literal
– $"point" within $"polygon" = Boolean Expression
– $"polygon1" intersection $"polygon2" = Binary Expression
• Joins using Catalyst + Spatial Optimizations:
– points.join(polygons).where($"point" within $"polygon")
Page 15
Where are we at?
Magellan 1.0.3 is out on Spark Packages, go give it a try!
• Scala support now; Python support will be functional in 1.0.4 (needs Spark 1.5)
• Github: https://github.com/harsha2010/magellan
• Spark Packages: http://spark-packages.org/package/harsha2010/magellan
• Data Formats: ESRI Shapefile + metadata, GeoJSON
• Operators: Intersects, Contains, Within, Intersection
• Joins: Broadcast
• Blog: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
• Zeppelin Notebook Example: http://bit.ly/1kvwGjC
Page 16
What is next?
Magellan 1.0.4, expected release in December:
• Python support
• MultiPolygon (Polygon Collection), MultiLineString (PolyLine Collection)
• Spark 1.5, 1.6
• Spatial Join Optimization
• Map Matching Algorithms
• More Operators based on requirements
• Support for other common geospatial data formats (WKT, others?)
Page 17
Demo
• Reading Geospatial Formats
• Uber queries