Data Philly Meetup - Big (Geo) Data

Post on 12-Jan-2015


Data Philly Meetup for 2/19/2013 on geospatial data science with crime data and applications of GeoTrellis to solve challenges related to large data sets.

Transcript of Data Philly Meetup - Big (Geo) Data

Big (Geo) Data Science

Robert Cheetham

cheetham@azavea.com

@rcheetham

Web/Mobile

Geospatial

UI/UX Design

High Performance Computing

R&D

B Corporation

• Projects w/ Social Value

• Summer of Maps

• Pro Bono Program

• Donate share of profits

Research-Driven

• 10% Research Program

• Academic Collaborations

• Open Source

Spatial Temporal Forecasting

with Philadelphia Crime Data

How Phila PD uses Maps

Customized Map Products

Weekly CompStat Meetings

Web Crime Analysis

Complainant

CAD

Verizon

911

911 Operator

Radio

Dispatcher

Police Officer

District

48 Desk

INCT

Daily download

& Geocoding Routines

Incident Report

Completed by Officer District X

District Y

District Z

Maps distributed

Through Intranet,

Printing, CompStat

INCT & PARS – main database sources

over 5,000 incidents daily, over 2 million annually

PARS

The Context

1,500,000 people

7,000 police

1,000 civilian employees

2,000,000 new incidents / year

3 crime analysts

What we did

• Weekly CompStat
• Lots of maps
• Automation of map creation
• Web-based systems

… but what if we could…

Accelerate the cycle

Proactively notify

Automate the process

Prototype

ArcView

VB & MapObjects

MS SQL Server

Crime Incidents

Database

Shapefiles

and

GRIDs

Process Documentation

.ini

file

… but there was a problem …

…it was crap …

… sort of.

We needed ….

1. Better Statistics

2. Notification

3. Simplicity

Crime Analysis – What has happened?
– Mapping (spatial / temporal densities)
– Trending
– Intelligence Dashboard

Early Warning – What is out of the ordinary?
– Statistical & Threshold-based Hunches (data mining)
– Alerting

Risk Forecasting – What is likely to happen next?
– Near Repeat Pattern
– Load Forecasting

Crime Analysis
– Mapping (spatial / temporal densities)
– Trending
– Intelligence Dashboard

Early Warning
– Statistical & Threshold-based Hunches (data mining)
– Alerting

Risk Forecasting
– Near Repeat Pattern
– Load Forecasting

Crime Analysis

Intelligence Dashboard

Crime Analysis

Early Warning

Early Warning

• Geographic Early Warning System
– A system to alert staff of an unusual situation in a particular location
– Ingests data sets to automatically “cook on” and only involves staff when a statistically unusual situation is found

[Diagram: HunchLab (Database, Geostatistical Engine, Alerting System) and Operational Databases]

Early Warning

What is a Hunch?

• A proposed hypothesis, saved into the system, and continually tested for validity

• Incident Attribute Requirements
– Location (x, y)
– Time (timestamp)
– Classification

• Hunch Attributes
– Location (area)
– Time (recent / historic periods)
– Classification

• Analyses
– Statistical Hunch
– Threshold Hunch

Hunch Parameters: Location

• Address & Radius
• Precinct/County/Country
• Custom Drawn Area
• Mass Hunch

Hunch Parameters: Time

• Statistical Hunch
– Recent Past
– Historic Past

Hunch Parameters: Classification

• Category
• Time of Day
• Narrative
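As a rough illustration of how the location, time and classification parameters above could combine, here is a minimal sketch of a threshold hunch and a simple statistical hunch. The incident field names, the Poisson tail test and the 0.05 cutoff are all assumptions made for this example; the slides do not say which statistic HunchLab actually uses.

```python
import math
from datetime import datetime, timedelta

def matches(incident, center, radius, category):
    """True if an incident falls inside the hunch's circle and category."""
    dx = incident["x"] - center[0]
    dy = incident["y"] - center[1]
    return math.hypot(dx, dy) <= radius and incident["category"] == category

def count_in_window(incidents, center, radius, category, start, end):
    """Count matching incidents with start <= time < end."""
    return sum(1 for i in incidents
               if start <= i["time"] < end and matches(i, center, radius, category))

def threshold_hunch(recent_count, threshold):
    """Fire when the recent count reaches a fixed threshold."""
    return recent_count >= threshold

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** j / math.factorial(j) for j in range(k))
    return 1.0 - below

def statistical_hunch(recent_count, recent_days, historic_count, historic_days,
                      alpha=0.05):
    """Fire when the recent count is unlikely given the historic daily rate.

    Assumption: counts are treated as Poisson, which is only an approximation.
    """
    expected = historic_count / historic_days * recent_days
    return poisson_tail(recent_count, expected) < alpha

# Hypothetical data: one burglary near the hunch center, one far away.
t0 = datetime(2013, 2, 1)
incidents = [
    {"x": 0, "y": 0, "time": t0, "category": "burglary"},
    {"x": 50, "y": 0, "time": t0, "category": "burglary"},
]
recent = count_in_window(incidents, (0, 0), 10, "burglary",
                         t0, t0 + timedelta(days=7))
fired = threshold_hunch(recent, threshold=3)
```

Either check returns a boolean that an alerting step could act on, so staff only get involved when a hunch fires.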

Hunch Helper

Email Alert

Hunch Details

Risk Forecasting

Predictive Analytics?

• Prediction vs. Forecasting

Near Repeat Pattern Analysis

Contagious Crime?

• Near repeat pattern analysis
• “If one burglary occurs, how does the risk change nearby?”

What Do We Mean By Near Repeat?

• Repeat victimization
– Incident at the same location at a later time (likely related)

• Near repeat victimization
– Incident at a nearby location at a later time (likely related)

• Incident A (place, time) --> Incident B (place, time)

Near Repeat Pattern Analysis

• The goal:
– Quantify short term risk due to near-repeat victimization
• “If one burglary occurs, how does the risk of burglary for the neighbors change?”

• What we know:
– Incident A (place, time) --> Incident B (place, time)
• Distance between A and B
• Timeframe between A and B

• What we need to know:
– What distances/timeframes are not simply random?

Near Repeat Pattern Analysis

• The process
– Observe the pattern in historic data
– Simulate the pattern in randomized historic data
– Compare the observed pattern to the simulated patterns
– Apply the non-random pattern to new incidents

• An example
– 180 days of burglaries in Division 6 of Philadelphia
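The observe / simulate / compare steps above can be sketched as a Knox-style permutation test: count the incident pairs that are close in both space and time, then shuffle the timestamps to see how often chance alone produces as many. This is one common way to run such a comparison, not necessarily the exact procedure used in the talk; the function names, the single distance/time band and the 999 simulations are assumptions.

```python
import math
import random

def close_pairs(points, times, max_dist, max_days):
    """Count incident pairs that are near in both space and time."""
    n = len(points)
    count = 0
    for a in range(n):
        for b in range(a + 1, n):
            near_space = math.dist(points[a], points[b]) <= max_dist
            near_time = abs(times[a] - times[b]) <= max_days
            if near_space and near_time:
                count += 1
    return count

def near_repeat_test(points, times, max_dist, max_days, n_sim=999, seed=0):
    """Return (observed pair count, Monte Carlo p-value).

    Shuffling the times keeps the spatial pattern and the temporal pattern
    intact but breaks any link between them.
    """
    rng = random.Random(seed)
    observed = close_pairs(points, times, max_dist, max_days)
    shuffled = list(times)
    at_least = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if close_pairs(points, shuffled, max_dist, max_days) >= observed:
            at_least += 1
    p_value = (at_least + 1) / (n_sim + 1)
    return observed, p_value

# Hypothetical incidents: (x, y) in feet and day numbers within the study period.
points = [(0, 0), (1, 0), (100, 100), (101, 100)]
days = [0, 1, 50, 51]
observed, p_value = near_repeat_test(points, days, max_dist=5, max_days=7)
```

A small p-value for a given distance/time band suggests that band is "not simply random"; repeating the test over several bands would give the kind of table a near repeat analysis reports.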


Near Repeat Pattern Analysis

• How can you test your own data?
– Near Repeat Calculator
• http://www.temple.edu/cj/misc/nr/

• Papers
– Near-Repeat Patterns in Philadelphia Shootings (2008)
• One city block & two weeks after one shooting
– 33% increase in likelihood of a second event

Jerry Ratcliffe

Temple University

Contagious Crime?

Workload Forecasting

Improving CompStat

• Workload forecasting
• “Given the time of year, day of week, time of day and general trend, what counts of crimes should I expect?”

What Do We Mean By Load Forecasting?

• Workload forecasting
• Generating aggregate crime counts for a future timeframe using cyclical time series analysis

Measure cyclical patterns

Identify non-cyclical trend

Forecast expected count

+

bit.ly/gorrcrimeforecastingpaper

Load Forecasting

• Measure cyclical patterns
• Take historic incidents (for example: last five years)
• Generate multiplicative seasonal indices
– For each time cycle:
» time of year
» day of week
» time of day
– Count incidents within each time unit (for example: Monday)
– Calculate average per time unit if incidents were evenly distributed
– Divide counts within each time unit by the calculated average to generate multiplicative indices
» Index ~ 1 means at the average
» Index > 1 means above average
» Index < 1 means below average
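The index calculation above can be sketched in a few lines. The function and parameter names are made up for this example, and it assumes the history covers every time unit equally often (for instance, a whole number of weeks for a day-of-week cycle).

```python
from collections import Counter
from datetime import date

def seasonal_indices(incident_dates, unit_of, units):
    """Multiplicative index per time unit.

    unit_of maps an incident's date to its time unit (e.g. weekday number);
    units lists every possible unit so that empty units get an index of 0.
    """
    counts = Counter(unit_of(d) for d in incident_dates)
    # Count each unit would have if incidents were evenly distributed.
    average = len(incident_dates) / len(units)
    return {u: counts[u] / average for u in units}

# Hypothetical day-of-week cycle: date.weekday() gives 0 for Monday.
history = [date(2013, 2, 18), date(2013, 2, 18), date(2013, 2, 19)]
day_of_week_index = seasonal_indices(history, date.weekday, range(7))
```

The same function can be reused for the other cycles by swapping `unit_of`, for example `lambda d: d.month` with `range(1, 13)` for a month-of-year cycle.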


Load Forecasting

• Identify non-cyclical trend
• Take recent daily counts (for example: last year's daily counts)
• Remove cyclical patterns by dividing by indices
• Run a trending function on the new counts
– Simple average
» Last X Days
– Smoothing function
» Exponential smoothing
» Holt’s linear exponential smoothing
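A minimal sketch of the deseasonalize-then-trend step, covering the three trending functions named above. The smoothing constants `alpha` and `beta` and the way Holt's method is initialized (first value as level, first difference as trend) are assumptions; other initializations are also in common use.

```python
def deseasonalize(daily_counts, indices):
    """Divide each daily count by the seasonal index for that day."""
    return [count / index for count, index in zip(daily_counts, indices)]

def simple_average(series, last_x_days):
    """Flat trend: the mean of the last X deseasonalized counts."""
    window = series[-last_x_days:]
    return sum(window) / len(window)

def exponential_smoothing(series, alpha):
    """Flat trend: a weighted mean that favors recent values."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def holt_linear(series, alpha, beta):
    """Holt's linear exponential smoothing; returns (level, trend)."""
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        previous = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous) + (1 - beta) * trend
    return level, trend

# Hypothetical week of counts and their day-of-week indices.
counts = [12, 8, 10, 11, 9, 14, 6]
indices = [1.2, 0.8, 1.0, 1.1, 0.9, 1.4, 0.6]
level, trend = holt_linear(deseasonalize(counts, indices), alpha=0.5, beta=0.5)
```

The first two functions return only a level, which is why their forecasts are always flat; Holt's method also returns a slope.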

Load Forecasting

• Forecast expected count
• Project trend into future timeframe
– Always flat
» Simple average
» Exponential smoothing
– Linear trend
» Holt’s linear exponential smoothing
• Multiply by seasonal indices to reseasonalize the data
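The projection and reseasonalizing steps can be sketched as below. The function name is hypothetical, a flat method is represented by passing a trend of 0, and combining several cycles by multiplying their indices together is an assumption of this example.

```python
def forecast_counts(level, trend, future_indices):
    """Project the deseasonalized trend forward, then reseasonalize.

    level, trend:    output of the trending function (trend = 0 for flat methods)
    future_indices:  one seasonal index per future time step, in order
    """
    forecasts = []
    for step, index in enumerate(future_indices, start=1):
        baseline = level + trend * step      # deseasonalized expectation
        forecasts.append(baseline * index)   # put the cycle back in
    return forecasts

# Hypothetical: 100 incidents/day baseline, flat trend, a busy day then a quiet day.
next_two_days = forecast_counts(100, 0, [1.2, 0.8])
```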


Improving CompStat

How Do We Know It’s Accurate?

• Testing
• Generated forecasting techniques (examples)
– Commonly Used
» Average of last 30 days
» Average of last 365 days
» Last year’s count for the same time period
– Advanced Combinations
» Different cyclical indices (example: day of year vs. month of year)
» Different levels of geographic aggregation for indices
» Different trending functions
• Scoring methodologies (examples)
– Mean absolute percent error (with some enhancements)
– Mean percent error
– Mean squared error
• Run thousands of forecasts through testing framework
• Choose the right technique in the right situation
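The three scoring measures listed above, plus a helper that picks the lowest-scoring technique, might look like the sketch below. Skipping periods with zero actual incidents in the percent errors is an assumption standing in for the unspecified "enhancements", and the sign convention (actual minus forecast) is also a choice made here.

```python
def mean_absolute_percent_error(actual, forecast):
    """Average size of the error as a percent of the actual count."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

def mean_percent_error(actual, forecast):
    """Signed percent error; a value near 0 means little systematic bias."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum((a - f) / a for a, f in pairs) / len(pairs)

def mean_squared_error(actual, forecast):
    """Average squared error; penalizes large misses heavily."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def best_technique(actual, forecasts_by_name, score):
    """Name of the technique with the lowest score on this test set."""
    return min(forecasts_by_name,
               key=lambda name: score(actual, forecasts_by_name[name]))

# Hypothetical test set with two competing techniques.
actual = [10, 20]
candidates = {"average of last 30 days": [8, 25],
              "last year's count": [10, 21]}
winner = best_technique(actual, candidates, mean_squared_error)
```

Running the same comparison separately per crime type or per geography is one way to "choose the right technique in the right situation".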

Ongoing Research

Research Topics

• Risk Forecasting
– Load forecasting enhancements
• Weather and special events
– Combining short and long term risk forecasts (Temple)
• Socioeconomic changes in neighborhoods
– Risk Terrain Modeling (Rutgers)
• Context of crime at the microplace


Research Topics

• Risk Forecasting
– Offender Management
• Prioritize offenders based upon statistical models using past behaviors

• Evaluation
– Automate Randomized Controlled Trials

Data Processing for Big (Geo) Data

A Story

Robert’s Rules of Housing

Close to Center City: somewhat important
Walk to Grocery Store: vital
Nearby Restaurants: very important
Library: nice to have
Near a Park: somewhat important
Biking / walking distance from our work: very important
Biking distance to fencing: somewhat important

Child Care

Local School Rankings

Farmer's Market

Car Share

Public Transit

Your factors might include…

We stand on the shoulders of giants

Not a new idea … Design with Nature

Not a new Idea … Dana Tomlin

Desktop GIS

Weighted Overlay

(layer x 5) + (layer x 2) + (layer x 3) + (layer x 1) = result
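A weighted overlay is a cell-by-cell weighted sum of raster layers. The sketch below uses plain nested lists and the 5, 2, 3, 1 weights from the slide; the layer contents are invented for illustration, and a real implementation would use a raster library rather than Python lists.

```python
def weighted_overlay(layers, weights):
    """Cell-by-cell weighted sum of equally sized raster grids."""
    rows, cols = len(layers[0]), len(layers[0][0])
    result = [[0] * cols for _ in range(rows)]
    for layer, weight in zip(layers, weights):
        for r in range(rows):
            for c in range(cols):
                result[r][c] += weight * layer[r][c]
    return result

# Hypothetical 1 x 2 suitability layers (e.g. near groceries, near a park, ...).
layers = [
    [[1, 0]],
    [[0, 1]],
    [[1, 1]],
    [[2, 0]],
]
score = weighted_overlay(layers, [5, 2, 3, 1])  # [[10, 5]]
```

Because each output cell depends only on the same cell in each input, changing a weight only requires redoing the multiply-and-add, which is what makes the iterative, per-user exploration described next feasible.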

Geography-driven Decisions

Iterative

Individual

Web [and Mobile]

Growing data sets

Summary

Web Challenges

Web is different from the Desktop

Lots of simultaneous users

Stateless environment

HTML+JS+CSS

Users are less skilled

Users are less patient

But wait … there’s a problem

10 – 60 second calculation time

Multiple simultaneous users …

… that are impatient

Data Challenges

Big Data – Social Media

Big Data – Science

Big Data – Citizen Science

Big Data – Cities

Early Prototype

Specific Optimization Goals

New Raster File Structure

Distributed processing

Binary messaging protocol

Optimization: File Format

Limit data type and range

1D arrays are fast to read/write

Tiled

Pyramids

Azavea Raster Grid (ARG)
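To make the file format ideas concrete, here is a minimal sketch of a raster held as one flat, row-major array of a single small data type, with a helper that cuts out a tile. This illustrates the general technique only; the class, its methods and the choice of signed bytes are assumptions and do not describe the actual ARG layout.

```python
from array import array

class ByteRaster:
    """A raster stored as one flat, row-major array of signed bytes."""

    def __init__(self, cols, rows, fill=0):
        self.cols, self.rows = cols, rows
        # One contiguous 1D buffer: fast to read or write in a single call.
        self.cells = array("b", [fill]) * (cols * rows)

    def get(self, col, row):
        return self.cells[row * self.cols + col]

    def set(self, col, row, value):
        # Limited type and range: value must fit in -128..127.
        self.cells[row * self.cols + col] = value

    def tile(self, col0, row0, width, height):
        """Copy a rectangular window into its own ByteRaster."""
        out = ByteRaster(width, height)
        for r in range(height):
            start = (row0 + r) * self.cols + col0
            out.cells[r * width:(r + 1) * width] = self.cells[start:start + width]
        return out

    def to_bytes(self):
        """Raw bytes, one per cell, ready to write to disk."""
        return self.cells.tobytes()

raster = ByteRaster(cols=4, rows=3)
raster.set(2, 1, 7)
window = raster.tile(col0=1, row0=1, width=2, height=2)
```

A pyramid would store the same raster again at successively coarser cell sizes, so a zoomed-out request can read a small array instead of the full-resolution one.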

Optimization: Distributed Processing

Parallelizable - Local Ops and Focal Ops

Support multiple
– Threads
– Cores
– CPUs
– Machines

Considered
– Hadoop
– Amazon Elastic MapReduce
– Beowulf
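Local operations parallelize easily because each output cell depends only on the matching input cells, so a raster can be cut into tiles that are processed independently. The sketch below shows the idea with a thread pool standing in for the threads, cores or machines on the slide; all names are hypothetical, and in CPython a process pool or a cluster would be needed for real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def local_add(tile_a, tile_b):
    """A local operation: each output cell uses only the same cell of each input."""
    return [a + b for a, b in zip(tile_a, tile_b)]

def split(cells, tile_size):
    """Cut a flat cell array into consecutive tiles."""
    return [cells[i:i + tile_size] for i in range(0, len(cells), tile_size)]

def parallel_local_add(cells_a, cells_b, tile_size, workers=4):
    """Add two rasters tile by tile and stitch the results back together."""
    tiles_a, tiles_b = split(cells_a, tile_size), split(cells_b, tile_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(local_add, tiles_a, tiles_b)  # keeps tile order
        return [cell for tile in results for cell in tile]

total = parallel_local_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], tile_size=2)
```

Focal operations need a little more care: each tile also has to see a border of neighboring cells as wide as the kernel reaches.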

Success!!

Reduced from 10-60 seconds to <500 milliseconds

Optimizing one process sub-optimizes others
Complex to configure and maintain
Limited to one operation
No interpolation
No mixing
– cell sizes
– extents
– projections
etc.

Broader set of functionality

Both raster and vector

Scala + Akka

Open source

Faster is Different

Regional/State: 84 ms

National: 84 ms

Large Country: 115 ms

Continental: 271 ms

Planet: 1.2 – 2.0 s

Ongoing R&D

GPUs

Re-wrote a few Map Algebra operations:
– Local
– Neighborhood
– Zonal
– Viewshed
– etc.

15 – 120x
– Large grids
– Large kernels

GPU Results

Vector

Neighborhood/Focal

Spatial Statistics

Integration

New Spatial Operations

Urban Forest Ecosystem Modeling

Crime Analysis, Early Warning and Forecasting

GDAL

GeoServer

PostGIS

R

GeoDa

Open Source Geoprocessing

Many Thanks!

© Photo used with permission from Alphafish, via Flickr.com

Big (Geo) Data Science

[We are hiring]

Robert Cheetham

cheetham@azavea.com

@rcheetham