OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

62
Using Cascalog to build an app with City of Palo Alto Open Data Paco Nathan http://liber118.com/pxn/ 1 Sunday, 28 July 13

description

OSCON 2013 talk in Portland about https://github.com/Cascading/CoPA project for CMU, to build a recommender system based on Open Data from City of Palo Alto. This talk examines a "lengthy" (400+ lines) Cascalog app -- which is big for Cascalog, as well as issues involved in commercial use cases for Open Data.

Transcript of OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Page 1: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Using Cascalog to build an app with

City of Palo Alto Open Data

Paco Nathanhttp://liber118.com/pxn/

1Sunday, 28 July 13

Page 2: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

GitHub repo for the open source project: github.com/Cascading/CoPA/wiki

This project began as a Big Data workshop for a graduate seminar at CMU West

Many thanks to:

Stuart Evans CMU Distinguished Service Professor

Jonathan ReichentalCity of Palo Alto CIO

Peter PirnejadCity of Palo Alto Dev Center Director

Diego MayJunar CEO & Co-founder

2Sunday, 28 July 13

Page 3: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading, a workflow abstraction Cascalog ➟ 2.0 Palo Alto case study Open Data insights

3Sunday, 28 July 13

Page 4: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading – origins

API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products.

Wensel was following the Nutch open source project – where Hadoop started.

Observation: would be difficult to find Java developers to write complex Enterprise apps in MapReduce – potential blocker for leveraging new open source technology.

4Sunday, 28 July 13

Page 5: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading – functional programming

Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.

To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:

• leverages JVM and Java-based tools without anyneed to create new languages

•allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters

5Sunday, 28 July 13

Page 6: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading – functional programming

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

Cascalog in Clojure (2010)Scalding in Scala (2012)

github.com/nathanmarz/cascalog/wikigithub.com/twitter/scalding/wiki

Why Adopting the Declarative Programming Practices Will Improve Your Return from TechnologyDan Woods, 2013-04-17 Forbes

forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

6Sunday, 28 July 13

Page 7: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Hadoop Cluster

sourcetap

sourcetap sink

taptraptap

customer profile DBsCustomer

Prefs

logslogs

Logs

DataWorkflow

Cache

Customers

Support

WebApp

Reporting

Analytics Cubes

sinktap

Modeling PMML

Cascading – integrations

• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera

• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.

• serialization: Avro, Thrift, Kryo, JSON, etc.

• topologies: Apache Hadoop, tuple spaces, local mode

7Sunday, 28 July 13

Page 8: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.

• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.

8Sunday, 28 July 13

Page 9: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.

• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.

workflow abstraction addresses: • staffing bottleneck; • system integration; • operational complexity; • test-driven development

9Sunday, 28 July 13

Page 10: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Workflow Abstraction – pattern language

Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

Scrubtoken

DocumentCollection

Tokenize

WordCount

GroupBytoken

Count

Stop WordList

Regextoken

HashJoinLeft

RHS

M

R

Data is represented as flows of tuples. Operations in the flows bring functional programming aspects into Java

A Pattern LanguageChristopher Alexander, et al.amazon.com/dp/0195019199

10Sunday, 28 July 13

Page 11: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Workflow Abstraction – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

in formal terms, flow diagrams leverage a methodology called literate programming

provides intuitive, visual representations for apps –great for cross-team collaboration

Scrubtoken

DocumentCollection

Tokenize

WordCount

GroupBytoken

Count

Stop WordList

Regextoken

HashJoinLeft

RHS

M

R

Literate ProgrammingDon Knuthliterateprogramming.com

11Sunday, 28 July 13

Page 12: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Workflow Abstraction – business process

following the essence of literate programming, Cascading workflows provide statements of business process

this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)

Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)

this is especially apparent in large-scale Cascalog apps:

“Specify what you require, not how to achieve it.”

by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale

12Sunday, 28 July 13

Page 13: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading, a workflow abstraction Cascalog ➟ 2.0 Palo Alto case study Open Data insights

13Sunday, 28 July 13

Page 14: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

14Sunday, 28 July 13

Page 15: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

For the process used with this Open Data app, we chose to use Cascalog

github.com/nathanmarz/cascalog/wiki

by Nathan Marz, Sam Ritchie, et al., 2010

a DSL in Clojure which implements Datalog, backed by Cascading

Some aspects of CS theory:

• Functional Relational Programming

• mitigates Accidental Complexity

• has been compared with Codd 1969

15Sunday, 28 July 13

Page 16: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Accidental Complexity:

Not O(N) complexity, but the costs of software engineering at scale over time

What happens when you build recommenders, then go work on other projects for six months? What does it cost others to maintain your apps?

“Out of the Tar Pit”, Moseley & Marks, 2006goo.gl/SKspn

Cascalog allows for leveraging the same framework, same code base, from ad-hoc queries… to modeling… to unit tests… to checkpoints in production use

This focuses on the process of structuring data:specify what you require, not how it must be achieved

Huge implications for software engineering

16Sunday, 28 July 13

Page 17: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

pros:

• most of the largest use cases for Cascading

• 10:1 reduction in code volume compared to SQL

• Leiningen build: simple, no surprises, in Clojure itself

• test-driven development (TDD) for Big Data

• fault-tolerant workflows which are simple to follow

• machine learning, map-reduce, etc., started in LISP years ago anywho...

cons:

• learning curve, limited number of Clojure developers

• aggregators are the magic, those take effort to learn

17Sunday, 28 July 13

Page 18: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Q:

Who uses Cascalog, other than Twitter?

A:

• Climate Corp

• Factual

• Nokia

• Telefonica

• Harvard School of Public Health

• YieldBot

• uSwitch

• etc.

18Sunday, 28 July 13

Page 19: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

DocumentCollection

WordCount

TokenizeGroupBytoken Count

R

M

1 map 1 reduce18 lines code gist.github.com/3900702

WordCount – conceptual flow diagram

cascading.org/category/impatient

19Sunday, 28 July 13

Page 20: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

WordCount – Cascading app in Java

String docPath = args[ 0 ];String wcPath = args[ 1 ];Properties properties = new Properties();AppProps.setApplicationJarClass( properties, Main.class );HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink tapsTap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token streamFields token = new Fields( "token" );Fields text = new Fields( "text" );RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );// only returns "token"Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );// determine the word countsPipe wcPipe = new Pipe( "wc", docPipe );wcPipe = new GroupBy( wcPipe, token );wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flowFlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap );// write a DOT file and run the flowFlow wcFlow = flowConnector.connect( flowDef );wcFlow.writeDOT( "dot/wc.dot" );wcFlow.complete();

DocumentCollection

WordCount

TokenizeGroupBytoken Count

R

M

20Sunday, 28 July 13

Page 21: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

map

reduceEvery('wc')[Count[decl:'count']]

Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']

GroupBy('wc')[by:['token']]

Each('token')[RegexSplitGenerator[decl:'token'][args:1]]

Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']

[head]

[tail]

[{2}:'token', 'count'][{1}:'token']

[{2}:'doc_id', 'text'][{2}:'doc_id', 'text']

wc[{1}:'token'][{1}:'token']

[{2}:'token', 'count'][{2}:'token', 'count']

[{1}:'token'][{1}:'token']

WordCount – generated flow diagramDocumentCollection

WordCount

TokenizeGroupBytoken Count

R

M

21Sunday, 28 July 13

Page 22: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

(ns impatient.core  (:use [cascalog.api]        [cascalog.more-taps :only (hfs-delimited)])  (:require [clojure.string :as s]            [cascalog.ops :as c])  (:gen-class))

(defmapcatop split [line]  "reads in a line of string and splits it by regex"  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]  (?<- (hfs-delimited out)       [?word ?count]       ((hfs-delimited in :skip-header? true) _ ?line)       (split ?line :> ?word)       (c/count ?count)))

; Paul Lam; github.com/Quantisan/Impatient

WordCount – Cascalog / ClojureDocumentCollection

WordCount

TokenizeGroupBytoken Count

R

M

22Sunday, 28 July 13

Page 23: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading, a workflow abstraction Cascalog ➟ 2.0 Palo Alto case study Open Data insights

23Sunday, 28 July 13

Page 24: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Palo Alto is quite a pleasant place

• temperate weather

• lots of parks, enormous trees

•great coffeehouses

•walkable downtown

•not particularly crowded

On a nice summer day, who wants to be stuck indoors on a phone call?

Instead, take it outside – go for a walk

24Sunday, 28 July 13

Page 25: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

1. Open Data about municipal infrastructure(GIS data: trees, roads, parks)

2. Big Data about where people like to walk(smartphone GPS logs)

3. some curated metadata(which surfaces the value)

⇒4. personalized recommendations:

“Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo.”

Scrubtoken

DocumentCollection

Tokenize

WordCount

GroupBytoken

Count

Stop WordList

Regextoken

HashJoinLeft

RHS

M

R

25Sunday, 28 July 13

Page 26: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates

This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good

paloalto.opendata.junar.com/dashboards/7576/geographic-information/

discovery

26Sunday, 28 July 13

Page 27: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

GIS about trees in Palo Alto:discovery

27Sunday, 28 July 13

Page 28: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Geographic_Information,,,

"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point""Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"

discovery

(unstructured data…)

28Sunday, 28 July 13

Page 29: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

(defn parse-gis [line] "leverages parse-csv for complex CSV format in GIS export" (first (csv/parse-csv line)) )  (defn etl-gis [gis trap] "subquery to parse data sets from the GIS source tap" (<- [?blurb ?misc ?geo ?kind] (gis ?line) (parse-gis ?line :> ?blurb ?misc ?geo ?kind) (:trap (hfs-textline trap)) ))

discovery

(specify what you require, not how to achieve it…

80:20 cost of data prep)

29Sunday, 28 July 13

Page 30: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

discovery

(ad-hoc queries get refined into composable predicates)

Identifier: 474 Tree ID: 412 Tree: 412 site 1 at 115 HAWTHORNE AVTree Site: 1 Street_Name: HAWTHORNE AV Situs Number: 115 Private: -1 Species: Liquidambar styraciflua Source: davey tree Hardscape: None 37.446001565119,-122.167713417554,0.0Point

30Sunday, 28 July 13

Page 31: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

discovery

(curate valuable metadata)

31Sunday, 28 July 13

Page 32: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

(defn get-trees [src trap tree_meta] "subquery to parse/filter the tree data" (<- [?blurb ?tree_id ?situs ?tree_site ?species ?wikipedia ?calflora ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash ] (src ?blurb ?misc ?geo ?kind) (re-matches #"^\s+Private.*Tree ID.*" ?misc) (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species) ((c/comp s/trim s/lower-case) ?raw_species :> ?species) (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height) (avg ?min_height ?max_height :> ?avg_height) (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt) (read-string ?tree_lat :> ?lat) (read-string ?tree_lng :> ?lng) (geohash ?lat ?lng :> ?geohash) (:trap (hfs-textline trap)) ))

discovery

?blurb! ! Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl?tree_id! " 412?situs" " 115?tree_site" 1?species" " liquidambar styraciflua?wikipedia" http://en.wikipedia.org/wiki/Liquidambar_styraciflua?calflora http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598?avg_height"27.5?tree_lat" 37.446001565119?tree_lng" -122.167713417554?tree_alt" 0.0?geohash" " 9q9jh0

32Sunday, 28 July 13

Page 33: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

// run analysis and visualization in Rlibrary(ggplot2)

dat_folder <- '~/src/concur/CoPA/out/tree'data <- read.table(file=paste(dat_folder, "part-00000", sep="/"), sep="\t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8") summary(data)

t <- head(sort(table(data$V5), decreasing=TRUE)trees <- as.data.frame.table(t, n=20))colnames(trees) <- c("species", "count") m <- ggplot(data, aes(x=V8))m <- m + ggtitle("Estimated Tree Height (meters)")m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density() par(mar = c(7, 4, 4, 2) + 0.1)plot(trees, xaxt="n", xlab="")axis(1, labels=FALSE)text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1, labels=trees$species, xpd=TRUE)grid(nx=nrow(trees))

discovery

33Sunday, 28 July 13

Page 34: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

discovery

sweetgum

analysis of the tree data:

34Sunday, 28 July 13

Page 35: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

M

tree

GISexport

Regexparse-gis

src

Scrubspecies

Geohash

Regexparse-tree

tree

TreeMetadata

Join

FailureTraps

Estimateheight

M

discovery

(flow diagram, gis ⇒ tree)

35Sunday, 28 July 13

Page 36: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

(defn get-roads [src trap road_meta] "subquery to parse/filter the road data" (<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo ?min_lat ?min_lng ?min_alt ?geohash ?traffic_count ?traffic_index ?traffic_class ?paving_length ?paving_width ?paving_area ?surface_type ] (src ?blurb ?misc ?geo ?kind) (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc) (parse-road ?misc :> _ ?traffic_count ?traffic_index ?traffic_class ?paving_length ?paving_width ?paving_area ?surface_type ?overlay_year ?bike_lane ?bus_route ?truck_route) (road_meta ?surface_type ?albedo_new ?albedo_worn) (estimate-albedo

?overlay_year ?albedo_new ?albedo_worn :> ?albedo) (bigram ?geo :> ?pt0 ?pt1) (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt) ;; why filter for min? because there are geo duplicates.. (c/min ?lat :> ?min_lat) (c/min ?lng :> ?min_lng) (c/min ?alt :> ?min_alt) (geohash ?min_lat ?min_lng :> ?geohash) (:trap (hfs-textline trap)) ))

discovery

36Sunday, 28 July 13

Page 37: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

?blurb"" " " Hawthorne Avenue from Alma Street to High Street?traffic_count"3110?traffic_class"local residential?surface_type" asphalt concrete?albedo" " " 0.12?min_lat"" " 37.446140860599854"?min_lng "" " -122.1674652295435?min_alt "" " 0.0?geohash"" " 9q9jh0

(another data product)

discovery

37Sunday, 28 July 13

Page 38: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

The road data provides:

• traffic class (arterial, truck route, residential, etc.)

• traffic counts distribution

• surface type (asphalt, cement; age)

This leads to estimators for noise, reflection, etc.

discovery

38Sunday, 28 July 13

Page 39: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

GISexport

Regexparse-gis

src

FailureTraps

M

M

road

RoadMetadata

Join EstimateAlbedo Geohash

road

Regexparse-road

RoadSegments

R

(flow diagram, gis ⇒ road)

discovery

39Sunday, 28 July 13

Page 40: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

GIS data from Palo Alto provides us with geolocation about each item in the export: latitude, longitude, altitude

Geo data is great for managing municipal infrastructure as well as for mobile apps

Predictive modeling in our Open Data example focuses on leveraging geolocation

We use spatial indexing by creating a grid of geohash values, for efficientparallel processing

Cascalog queries collect items with thesame geohash values – using them as keysfor large-scale joins (Hadoop)

modeling

40Sunday, 28 July 13

Page 41: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

9q9jh0

geohash with 6-digit resolution

approximates a 5-block square

centered lat: 37.445, lng: -122.162

modeling

41Sunday, 28 July 13

Page 42: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:

" -122.161776959558,37.4518836690781,0.0 " -122.161390381489,37.4516410983794,0.0 " -122.160786011735,37.4512589903357,0.0 " -122.160531178368,37.4510977281699,0.0

modeling

( lat0, lng0, alt0 )

( lat1, lng1, alt1 )

( lat2, lng2, alt2 )

( lat3, lng3, alt3 )

NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)

42Sunday, 28 July 13

Page 43: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

9q9jh0

X X

X

Filter trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance, as an estimator for shade:

modeling

43Sunday, 28 July 13

Page 44: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

(defn get-shade [trees roads] "subquery to join tree and road estimates, maximize for shade" (<- [?road_name ?geohash ?road_lat ?road_lng

?road_alt ?road_metric ?tree_metric] (roads ?road_name _ _ _

?albedo ?road_lat ?road_lng ?road_alt ?geohash ?traffic_count _ ?traffic_class _ _ _ _)

(road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)

(trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)

(read-string ?avg_height :> ?height) ;; limit to trees which are higher than people (> ?height 2.0) (tree-distance

?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance) ;; limit to trees within a one-block radius (not meters) (<= ?distance 25.0) (/ ?height ?distance :> ?tree_moment) (c/sum ?tree_moment :> ?sum_tree_moment) ;; magic number 200000.0 used to scale tree moment

;; based on median (/ ?sum_tree_moment 200000.0 :> ?tree_metric) ))

modeling

44Sunday, 28 July 13

Page 45: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

M

tree

Join Calculatedistance

shade

Filterheight

Summoment

REstimatetraffic

Rroad

Filterdistance

M M

Filtersum_moment

(flow diagram, shade)

modeling

45Sunday, 28 July 13

Page 46: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

modeling

46Sunday, 28 July 13

Page 47: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

(defn get-gps [gps_logs trap] "subquery to aggregate and rank GPS tracks per user" (<- [?uuid ?geohash ?gps_count ?recent_visit] (gps_logs

?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading ?elapsed ?distance)

(read-string ?gps_lat :> ?lat) (read-string ?gps_lng :> ?lng) (geohash ?lat ?lng :> ?geohash) (c/count :> ?gps_count) (date-num ?date :> ?visit) (c/max ?visit :> ?recent_visit) ))

modeling

?uuid ?geohash ?gps_count ?recent_visitcf660e041e994929b37cc5645209c8ae 9q8yym 7 1972376866448342ac6fd3f5f44c6b97724d618d587cf 9q9htz 4 197237669096932cc09e69bc042f1ad22fc16ee275e21 9q9hv3 3 1972376670935342ac6fd3f5f44c6b97724d618d587cf 9q9hv3 3 1972376691356342ac6fd3f5f44c6b97724d618d587cf 9q9hwn 13 1972376690782342ac6fd3f5f44c6b97724d618d587cf 9q9hwp 58 1972376690965482dc171ef0342b79134d77de0f31c4f 9q9jh0 15 1972376952532b1b4d653f5d9468a8dd18a77edcc5143 9q9jh0 18 1972376945348

47Sunday, 28 July 13

Page 48: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Recommenders often combine multiple signals, via weighted averages, to rank personalized results:

•GPS of person ∩ road segment

• frequency and recency of visit

• traffic class and rate

• road albedo (sunlight reflection)

• tree shade estimator

Adjusting the mix allows for further personalization at the end use

(defn get-reco [tracks shades] "subquery to recommend road segments based on GPS tracks" (<- [?uuid ?road ?geohash ?lat ?lng ?alt ?gps_count ?recent_visit ?road_metric ?tree_metric] (tracks ?uuid ?geohash ?gps_count ?recent_visit) (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric) ))

apps

48Sunday, 28 July 13

Page 49: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

(defproject cascading-copa "0.1.0-SNAPSHOT" :description "City of Palo Alto Open Data recommender in Cascalog" :url "https://github.com/Cascading/CoPA" :license {:name "Apache License, Version 2.0" :url "http://www.apache.org/licenses/LICENSE-2.0" :distribution :repo } :uberjar-name "copa.jar" :aot [copa.core] :main copa.core :source-paths ["src/main/clj"] :dependencies [[org.clojure/clojure "1.4.0"] [cascalog "1.10.0"] [cascalog-more-taps "0.3.1-SNAPSHOT"] [clojure-csv/clojure-csv "1.3.2"] [org.clojars.sunng/geohash "1.0.1"] [org.clojure/clojure-contrib "1.2.0"] [date-clj "1.0.1"] ] :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]} :provided {:dependencies [

[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}}

)

apps

49Sunday, 28 July 13

Page 50: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

‣ addr: 115 HAWTHORNE AVE‣ lat/lng: 37.446, -122.168‣ geohash: 9q9jh0‣ tree: 413 site 2‣ species: Liquidambar styraciflua‣ est. height: 23 m‣ shade metric: 4.363‣ traffic: local residential, light traffic‣ recent visit: 1972376952532‣ a short walk from my train stop ✔

apps

50Sunday, 28 July 13

Page 51: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Could combine this with a variety of data APIs:

• Trulia neighborhood data, housing prices

• Factual local business (FB Places, etc.)

• CommonCrawl open source full web crawl

• Wunderground local weather data

• WalkScore neighborhood data, walkability

• Data.gov US federal open data

• Data.NASA.gov NASA open data

• DBpedia datasets derived from Wikipedia

• GeoWordNet semantic knowledge base

• Geolytics demographics, GIS, etc.

• Foursquare, Yelp, CityGrid, Localeze, YP

• various photo sharing

apps

walkscore.com/CA/Palo_Alto

51Sunday, 28 July 13

Page 52: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Cascading, a workflow abstraction Cascalog ➟ 2.0 Palo Alto case study Open Data insights

52Sunday, 28 July 13

Page 53: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Trends in Public Administration

late 1880s – late 1920s (Woodrow Wilson)as hierarchy, bureaucracy → only for the most educated, elite

late 1920s – late 1930sas a business, relying on “Scientific Method”, gov as a process

late 1930s – late 1940s (Robert Dale)relationships, behavioral-based → policy not separate from politics

late 1940s – 1980syet another form of management → less “command and control”

1980s – 1990s (David Osborne, Ted Gaebler)New Public Management → service efficiency, more private sector

1990s – present (Janet & Robert Denhardt)Digital Age → transparency, citizen-based “debugging”, bankruptcies

The Roles, Actors, and Norms Necessary to Institutionalize Sustainable Collaborative GovernancePeter PirnejadUSC Price School of Policy2013-05-02

53Sunday, 28 July 13

Page 54: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Trends in Public Administration

late 1880s – late 1920s (Woodrow Wilson)as hierarchy, bureaucracy → only for the most educated, elite

late 1920s – late 1930sas a business, relying on “Scientific Method”, gov as a process

late 1930s – late 1940s (Robert Dale)relationships, behavioral-based → policy not separate from politics

late 1940s – 1980syet another form of management → less “command and control”

1980s – 1990s (David Osborne, Ted Gaebler)New Public Management → service efficiency, more private sector

1990s – present (Janet & Robert Denhardt)Digital Age → transparency, citizen-based “debugging”, bankruptcies

The Roles, Actors, and Norms Necessary to Institutionalize Sustainable Collaborative GovernancePeter PirnejadUSC Price School of Policy2013-05-02

Drivers, circa 2013

• governments running out of money, cannot increase staff and services

• better data infra at scale (cloud, OSS, etc.)

• machine learning techniques to monetize

• viable ecosystem for data products, APIs

• mobile devices enabling use cases

54Sunday, 28 July 13

Page 55: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Open Data notes

Successful apps incorporate three components:

•Big Data (consumer interest, personalization)

•Open Data (monetizing public data)

•Curated Metadata

Most of the largest Cascading deployments leverage some Open Data components: Climate Corp, Factual, Nokia, etc.

Notes about Open Data use cases: goo.gl/cd995T

Consider buildingeye.com, aggregate building permits:

• pricing data for home owners looking to remodel

• sales data for contractors

• imagine joining data with building inspection history,for better insights about properties for sale…

55Sunday, 28 July 13

Page 56: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Open Data ecosystem

municipaldepartments

publishingplatforms

aggregators

data productvendors

end use cases

e.g., Palo Alto, San Francisco, etc.

e.g., Junar, Socrata, etc.

e.g., OpenStreetMap, WalkScore, etc.

e.g., Factual, Marinexplore, etc.

e.g., Facebook

Data feeds structured for public private partnerships

56Sunday, 28 July 13

Page 57: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Open Data ecosystem – caveats

municipaldepartments

publishingplatforms

aggregators

data productvendors

end use cases

e.g., Palo Alto, San Francisco, etc.

e.g., Junar, Socrata, etc.

e.g., OpenStreetMap, WalkScore, etc.

e.g., Factual, Marinexplore, etc.

e.g., Facebook

Required Focus

• respond to viable use cases

• not budgeting hackathons

57Sunday, 28 July 13

Page 58: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Open Data ecosystem – caveats

municipaldepartments

publishingplatforms

aggregators

data productvendors

end use cases

e.g., Palo Alto, San Francisco, etc.

e.g., Junar, Socrata, etc.

e.g., OpenStreetMap, WalkScore, etc.

e.g., Factual, Marinexplore, etc.

e.g., Facebook

Required Focus

• surface the metadata

• curate, allowing for joins/aggregation

• not scans as PDFs

58Sunday, 28 July 13

Page 59: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Open Data ecosystem – caveats

municipaldepartments

publishingplatforms

aggregators

data productvendors

end use cases

e.g., Palo Alto, San Francisco, etc.

e.g., Junar, Socrata, etc.

e.g., OpenStreetMap, WalkScore, etc.

e.g., Factual, Marinexplore, etc.

e.g., Facebook

Required Focus

• make APIs consumable by automation

• allow for probabilistic usage

• not OSS licensing for data

59Sunday, 28 July 13

Page 60: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Open Data ecosystem – caveats

municipaldepartments

publishingplatforms

aggregators

data productvendors

end use cases

e.g., Palo Alto, San Francisco, etc.

e.g., Junar, Socrata, etc.

e.g., OpenStreetMap, WalkScore, etc.

e.g., Factual, Marinexplore, etc.

e.g., Facebook

Required Focus

• supply actionable data

• track data provenance carefully

• provide feedback upstream, i.e., cleaned data at source

• focus on core verticals

60Sunday, 28 July 13

Page 61: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

Open Data ecosystem – caveats

municipaldepartments

publishingplatforms

aggregators

data productvendors

end use cases

e.g., Palo Alto, San Francisco, etc.

e.g., Junar, Socrata, etc.

e.g., OpenStreetMap, WalkScore, etc.

e.g., Factual, Marinexplore, etc.

e.g., Facebook

Required Focus

• address consumer needs

• identify community benefits of the data

61Sunday, 28 July 13

Page 62: OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

newsletter for updates: http://liber118.com/pxn/

shop.oreilly.com/product/0636920028536.do

62Sunday, 28 July 13