New Challenges in Data Science: Geospatial Analysis
October 16, 2014 | Company Confidential New Challenges in Data Science: Geospatial Analysis

About Myself

Comparison Of Problems

• Netflix: Given a user’s viewing history, rank order a list of movies by probability of play.

• Lyft: Given a two-sided market of drivers and passengers, how do we dispatch and price so that passengers have a nearby available driver at the lowest possible price while drivers achieve the highest possible hourly earnings?

Comparison Of Problems

One of the things that makes Lyft’s problems more challenging is the addition of geospatial data to the problem!

Types of Challenges @ Netflix

New Types of Challenges @ Lyft

Estimated Time of Arrival (ETA)

• ETAs are used to:

• Dispatch a nearby driver to passenger.

• Show the passenger the ETA to set expectations of pickup time.

• Ideally, ETAs should account for nuances such as:

• Traffic

• Bridges

• Direction of Travel

Lyft & ETA

• Lyft previously used the Google Maps API to serve ETAs.

• Lyft collects GPS data from drivers several times a minute while they are on duty.

• Given that data, can we do better?

Big Data Approach

• Example: Spanish to English Language Translation

• Method 1: Use grammatical rules and dictionary to translate.

• “Yo hablo”: The base verb is “hablar”, which means “to talk”.

• How to decide between “I talk” and “I speak”?

• Method 2: Big Data Approach

• Look at many pairs of documents in both English and Spanish.

• Construct counts of co-occurrences of English-Spanish words

• E.g. 90% of the time “Yo Hablo” -> “I speak”
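The counting idea in Method 2 can be sketched in a few lines. This is a minimal illustration, assuming the documents have already been aligned into phrase pairs; the toy data and the names `aligned_pairs`, `build_cooccurrence`, and `translate` are made up for the example and are not from the talk.

```python
from collections import Counter, defaultdict

# Toy, made-up aligned phrase pairs (Spanish, English), chosen to mirror
# the 90% figure on the slide. A real system would mine these from many
# document pairs.
aligned_pairs = [("yo hablo", "i speak")] * 9 + [("yo hablo", "i talk")]


def build_cooccurrence(pairs):
    """Count how often each source phrase is paired with each target phrase."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    return counts


def translate(counts, source):
    """Return the most frequent target phrase and its share of all pairings."""
    target_counts = counts[source]
    total = sum(target_counts.values())
    best, n = target_counts.most_common(1)[0]
    return best, n / total


counts = build_cooccurrence(aligned_pairs)
print(translate(counts, "yo hablo"))  # ('i speak', 0.9)
```

No grammar rules are involved: the choice between “I talk” and “I speak” falls out of the counts.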

Big Data Approach to ETAs

• Method 1: Map Based Approach

• Snap GPS coordinates to roads on map, and then calculate average speed per road

• Method 2: Big Data Approach

• Break up the map into geo-hashes, and calculate average speed between geo-hash pairs

• ETA = Actual Distance/Speed
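A rough sketch of Method 2 follows, under several assumptions that are not stated in the talk: trips are summarized as (start, end, distance, duration) records, the average speed for a geohash pair is total distance divided by total time, and the geohash precision is 6. The coordinates and trip values are illustrative, not Lyft data.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Standard geohash: alternate longitude/latitude bisection bits,
    starting with longitude, and emit one base-32 character per 5 bits."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars, bits, value, use_lon = [], 0, 0, True
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = value * 2 + 1
            rng[0] = mid
        else:
            value = value * 2
            rng[1] = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def average_speeds(trips, precision=6):
    """Map (start_geohash, end_geohash) -> km per minute.
    Assumption: average speed = total distance / total time for the pair."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for start, end, distance_km, minutes in trips:
        key = (geohash_encode(*start, precision), geohash_encode(*end, precision))
        totals[key][0] += distance_km
        totals[key][1] += minutes
    return {key: dist / mins for key, (dist, mins) in totals.items()}


def eta_minutes(speeds, start, end, distance_km, precision=6):
    """ETA = distance / speed, using the speed observed for this geohash pair."""
    key = (geohash_encode(*start, precision), geohash_encode(*end, precision))
    return distance_km / speeds[key]


# Illustrative (lat, lon) points in the Boston area; hypothetical trips.
start, end = (42.3601, -71.0589), (42.3736, -71.1097)
trips = [(start, end, 6.0, 10.0), (start, end, 6.0, 14.0)]
speeds = average_speeds(trips)
print(eta_minutes(speeds, start, end, 6.0))  # 12.0
```

The lookup raises `KeyError` for a geohash pair with no observed trips; a real system would need a fallback for sparse pairs, which this sketch leaves out.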

Boston

Result: 35% decrease in relative error!

Dispatch

• The dispatch decision depends on:

• ETA of driver to passenger

• Whether dispatching the driver immediately is better than having the driver wait for a closer request
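One way to picture the two factors above is the toy rule below. It is only a sketch: the function `choose_driver` and the `max_eta_minutes` cutoff are hypothetical stand-ins for the "dispatch now or wait" trade-off, not the dispatch logic described in the talk.

```python
def choose_driver(driver_etas, max_eta_minutes=None):
    """Pick the available driver with the smallest ETA to the passenger.

    driver_etas: dict of driver id -> ETA in minutes.
    max_eta_minutes: hypothetical cutoff; if even the best ETA exceeds it,
    return None so the driver can wait for a closer request.
    """
    if not driver_etas:
        return None
    driver, eta = min(driver_etas.items(), key=lambda item: item[1])
    if max_eta_minutes is not None and eta > max_eta_minutes:
        return None
    return driver


print(choose_driver({"driver_a": 7.0, "driver_b": 3.5}))  # driver_b
```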

Dispatch

Dispatch

Prime Time Pricing

• An SF Giants playoff game has just ended.

• Demand >> Supply

• Lyft dynamically increases the price to encourage more Lyft drivers to drive to the areas with highest demand.

• More drivers = More passengers with an available driver.

• We probably shouldn’t have Prime Time in Oakland at that time.

• Problem: Where, when, and how much Prime Time should we have?

Prime Time Pricing

Communicating Prime Time to Drivers

Prime Time: Potential Geospatial Issues

• Thought experiment:

• Break SF into sub-regions (geohashes)

• Count supply by looking at the number of available drivers per geohash.

• Count demand by looking at the number of requests per geohash.
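The thought experiment can be sketched as a simple tally. This assumes each available driver and each request has already been assigned a geohash string; the cell names and the function `supply_and_demand` are illustrative only.

```python
from collections import Counter


def supply_and_demand(driver_geohashes, request_geohashes):
    """Return {geohash: (available_drivers, requests)} for every cell seen."""
    supply = Counter(driver_geohashes)
    demand = Counter(request_geohashes)
    cells = set(supply) | set(demand)
    # Counter returns 0 for cells that appear on only one side.
    return {cell: (supply[cell], demand[cell]) for cell in cells}


# Illustrative geohash cells, one entry per driver or request.
drivers = ["9q8yy", "9q8yy", "9q8yy", "9q8yv"]
requests = ["9q8yy", "9q8yu", "9q8yu"]
print(supply_and_demand(drivers, requests))
```

Counting strictly per cell ignores drivers just across a cell boundary, which is the kind of issue the following slides raise.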

3 Available Drivers

3 Available Drivers?

15 Available Drivers?

Double Counting?

Fraud

• Can look at traditional features

• Country of Origin

• Card Type (Credit, Debit, Prepaid)

• Behavioral attributes

• Links to known Fraudulent users

• Geospatial Data can help!

• Add geospatial data as a feature to machine learning models.

Fraud

• Passenger’s first ride on Lyft

• Uses a new user coupon for $25

• 84-minute ride

• But 0 miles driven!

• If this occurs repeatedly for the same driver, then highly suspicious.
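The pattern above (a long ride with no distance covered, repeated for the same driver) can be turned into a simple rule. The sketch below is an assumption-laden illustration: the record fields and the thresholds `min_minutes`, `max_miles`, and `min_repeats` are hypothetical, and in practice such a signal would more likely feed a machine learning model as a feature than act as a hard rule.

```python
from collections import Counter


def suspicious_drivers(rides, min_minutes=60, max_miles=0.1, min_repeats=2):
    """Return drivers with repeated long rides that covered almost no distance.

    rides: iterable of dicts with 'driver_id', 'minutes', and 'miles' keys.
    All thresholds are made-up defaults for illustration.
    """
    hits = Counter(
        ride["driver_id"]
        for ride in rides
        if ride["minutes"] >= min_minutes and ride["miles"] <= max_miles
    )
    return {driver for driver, n in hits.items() if n >= min_repeats}


# Made-up ride records.
rides = [
    {"driver_id": "d1", "minutes": 84, "miles": 0.0},
    {"driver_id": "d1", "minutes": 90, "miles": 0.0},
    {"driver_id": "d2", "minutes": 84, "miles": 0.0},
    {"driver_id": "d3", "minutes": 25, "miles": 6.2},
]
print(suspicious_drivers(rides))  # {'d1'}
```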

Targeted Advertising

• Lyft buys ads on Facebook to acquire new drivers.

• Should we target Facebook users who live closer to the city?

Targeted Advertising

“The competition … is very much about who can build the algorithms that best solve the basic underlying logistical problem of getting a ride” – Wired Magazine

• These algorithms include:

• More accurate ETAs and dispatch

• Determining the lowest price possible to reach target system metrics

• Preventing Fraud

• Acquiring drivers and passengers at the lowest possible cost per acquisition (CPA)

Summary

• Geospatial data is a rich feature that gives these algorithms helpful context but adds complexity as well.

• Be smart about how you set up your problems to get the biggest gain.

• For the Data Scientist, it also makes the problems much more interesting.

• With the increasing popularity of ridesharing companies, I see this as an increasingly popular area of research.

Questions?