October 16, 2014 | Company Confidential
New Challenges in Data Science: Geospatial Analysis
About Myself
Comparison Of Problems
• Netflix: Given a user’s viewing history, rank order a list of movies by probability of play.
• Lyft: Given a two-sided market of drivers and passengers, how do we dispatch and price so that passengers have a nearby available driver at the lowest possible price while drivers achieve the highest possible hourly earnings?
Comparison Of Problems • One of the things that makes Lyft’s problems more challenging is the addition of geospatial data to the problem!
Types of Challenges @ Netflix
New Types of Challenges @ Lyft
Estimated Time of Arrival (ETA) • ETA’s are used to:
• Dispatch a nearby driver to the passenger.
• Show the passenger the ETA to set expectations of pickup time.
• Ideally, ETA’s should account for nuances such as:
• Traffic
• Bridges
• Direction of Travel
Lyft & ETA • Previously used Google Maps API to serve ETA’s.
• Lyft collects GPS data from on-duty drivers several times a minute.
• Given this data, can we do better?
Big Data Approach • Example: Spanish to English Language Translation
• Method 1: Use grammatical rules and dictionary to translate.
• “Yo hablo”: the base verb is “hablar”, which means “to talk”
• How to decide between “I talk” and “I speak”?
• Method 2: Big Data Approach
• Look at many pairs of documents in both English and Spanish.
• Construct counts of co-occurrences of English-Spanish words
• E.g. 90% of the time “Yo Hablo” -> “I speak”
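A minimal sketch of the counting idea in Method 2, assuming we already have aligned Spanish-English phrase pairs. The pairs below are toy data made up for illustration, and the function names `build_translation_table` and `best_translation` are hypothetical; a real system would work from a large parallel corpus.

```python
from collections import Counter, defaultdict

# Toy aligned phrase pairs (made up for illustration, not real corpus data).
aligned_pairs = [
    ("yo hablo", "i speak"),
    ("yo hablo", "i speak"),
    ("yo hablo", "i talk"),
    ("yo como", "i eat"),
]


def build_translation_table(pairs):
    """Count how often each Spanish phrase co-occurs with each English phrase."""
    table = defaultdict(Counter)
    for spanish, english in pairs:
        table[spanish][english] += 1
    return table


def best_translation(table, spanish):
    """Return (most frequent English phrase, its share of counts), or None."""
    counts = table.get(spanish)
    if not counts:
        return None
    english, n = counts.most_common(1)[0]
    return english, n / sum(counts.values())


table = build_translation_table(aligned_pairs)
result = best_translation(table, "yo hablo")
print(result)
```

With these toy pairs, “i speak” wins for “yo hablo” simply because it was seen more often, with no grammar rules involved.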
Big Data Approach to ETA’s • Method 1: Map Based Approach
• Snap GPS coordinates to roads on map, and then calculate average speed per road
• Method 2: Big Data Approach
• Break up the map into geohashes, and calculate average speed between geohash pairs
• ETA = Actual Distance/Speed
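A rough sketch of Method 2, under several assumptions: trips are reduced to a start point, an end point, a distance, and a duration; "average speed" is taken as total distance over total time per geohash pair; and the geohash precision of 6 is arbitrary. The trip values are made up for illustration, and the helper names (`build_speed_table`, `eta_hours`) are hypothetical, not Lyft's implementation. The encoder follows the standard public geohash scheme.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Standard geohash: interleave longitude/latitude bisection bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, n_bits, use_lon = [], 0, 0, True
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        use_lon = not use_lon
        n_bits += 1
        if n_bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)


def build_speed_table(trips, precision=6):
    """trips: (start_lat, start_lon, end_lat, end_lon, km, hours) tuples."""
    totals = defaultdict(lambda: [0.0, 0.0])  # pair -> [total km, total hours]
    for s_lat, s_lon, e_lat, e_lon, km, hours in trips:
        pair = (geohash_encode(s_lat, s_lon, precision),
                geohash_encode(e_lat, e_lon, precision))
        totals[pair][0] += km
        totals[pair][1] += hours
    return {pair: km / hours for pair, (km, hours) in totals.items() if hours > 0}


def eta_hours(speed_table, origin, dest, distance_km, precision=6):
    """ETA = distance / average speed for the pair; None if the pair is unseen."""
    pair = (geohash_encode(*origin, precision), geohash_encode(*dest, precision))
    speed = speed_table.get(pair)
    return None if speed is None else distance_km / speed


# Made-up historical trips between the same two points (illustration only).
start, end = (42.3601, -71.0589), (42.3736, -71.1097)
trips = [(*start, *end, 5.0, 0.25), (*start, *end, 5.0, 0.15)]
speed_table = build_speed_table(trips)
eta = eta_hours(speed_table, start, end, distance_km=5.0)
```

Pairs with no history return `None` here; a production system would need a fallback (for example, a coarser geohash precision), which this sketch leaves out.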
Boston
Result: 35% decrease in relative error!
Dispatch • Dispatch decision depends on
• Estimated ETA of driver to passenger
• Whether dispatching the driver immediately is better than having the driver wait for a closer request
Prime Time Pricing • An SF Giants playoff game has just ended.
• Demand >> Supply
• Lyft dynamically increases the price to encourage more Lyft drivers to drive to the areas with highest demand.
• More drivers = More passengers with an available driver.
• We probably shouldn’t have Prime Time in Oakland at that time.
• Problem: Where, when, and how much Prime Time should we have?
Prime Time Pricing • Communicating Prime Time to Drivers
Prime Time Potential Geospatial Issues
• Thought experiment:
• Break SF into sub-regions (geohashes)
• Count supply by looking at the number of available drivers per geohash.
• Count demand by looking at the number of requests per geohash.
3 Available Drivers
3 Available Drivers?
15 Available Drivers?
Double Counting?
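The thought experiment above can be sketched with a toy grid. This is an illustration under assumptions: square cells on a flat plane stand in for geohashes, driver positions are made up, and the function names are hypothetical. Counting strictly per cell misses a driver sitting just across a boundary; widening each cell's count to include its neighbors fixes that but counts the same driver in several cells.

```python
from collections import Counter
from math import floor


def cell_of(x_km, y_km, cell_km=1.0):
    """Map a planar position to a square grid cell (a stand-in for a geohash)."""
    return (floor(x_km / cell_km), floor(y_km / cell_km))


def strict_counts(drivers, cell_km=1.0):
    """Supply per cell, counting only drivers inside that cell."""
    return Counter(cell_of(x, y, cell_km) for x, y in drivers)


def supply_with_neighbors(strict, cell):
    """Supply for a cell plus its 8 surrounding cells."""
    cx, cy = cell
    return sum(strict.get((cx + dx, cy + dy), 0)
               for dx in (-1, 0, 1) for dy in (-1, 0, 1))


# Made-up driver positions: one just left of the x = 1 km boundary, two just right.
drivers = [(0.9, 0.5), (1.1, 0.5), (1.2, 0.4)]

strict = strict_counts(drivers)
widened = {cell: supply_with_neighbors(strict, cell) for cell in strict}

print(strict)                 # per-cell supply; boundary neighbors are ignored
print(widened)                # neighbor-inclusive supply per occupied cell
print(sum(widened.values()))  # exceeds len(drivers): drivers are double counted
```

Neither count is obviously right: the strict count understates supply near cell edges, and the widened count inflates total supply across the city.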
Fraud • Can look at traditional features
• Country of Origin
• Card Type (Credit, Debit, Prepaid)
• Behavioral attributes
• Links to known Fraudulent users
• Geospatial Data can help!
• Add geospatial data as a feature to machine learning models.
Fraud • Passenger’s first ride on Lyft
• Uses a new user coupon for $25
• 84 minute ride
• But 0 miles driven!
• If this occurs repeatedly for the same driver, then highly suspicious.
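A small sketch of turning the pattern above into a feature. Everything specific here is an assumption for illustration: the ride record fields, the thresholds (`min_minutes`, `max_miles`, `min_count`), and the sample rides are made up, and in practice the resulting count would feed a machine learning model as one feature rather than act as a hard rule.

```python
from collections import Counter


def is_suspicious(ride, min_minutes=30.0, max_miles=0.1):
    """First ride, paid with a coupon, long duration, but essentially no distance."""
    return (ride["first_ride"]
            and ride["coupon_usd"] > 0
            and ride["minutes"] >= min_minutes
            and ride["miles"] <= max_miles)


def suspicious_counts_by_driver(rides):
    """Number of suspicious rides per driver (usable as a model feature)."""
    return Counter(r["driver_id"] for r in rides if is_suspicious(r))


def flag_drivers(rides, min_count=2):
    """Drivers for whom the pattern repeats at least min_count times."""
    counts = suspicious_counts_by_driver(rides)
    return {driver for driver, n in counts.items() if n >= min_count}


# Made-up ride records (illustration only).
rides = [
    {"driver_id": "d1", "first_ride": True, "coupon_usd": 25, "minutes": 84, "miles": 0.0},
    {"driver_id": "d1", "first_ride": True, "coupon_usd": 25, "minutes": 60, "miles": 0.0},
    {"driver_id": "d2", "first_ride": True, "coupon_usd": 25, "minutes": 20, "miles": 6.5},
    {"driver_id": "d3", "first_ride": True, "coupon_usd": 25, "minutes": 45, "miles": 0.0},
]

flagged = flag_drivers(rides)
print(flagged)
```

A single zero-mile ride (driver `d3` here) is not flagged, matching the point that repetition for the same driver is what makes the pattern highly suspicious.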
Targeted Advertising
• Lyft buys ads on Facebook to acquire new drivers.
• Should we target Facebook users who live closer to the city?
“The competition … is very much about who can build the algorithms that best solve the basic underlying logistical problem of getting a ride” – Wired Magazine
• These algorithms include:
• More accurate ETA’s/Dispatch
• Determining the lowest price possible to reach target system metrics
• Preventing Fraud
• Acquiring drivers and passengers at the lowest possible CPA
Summary • Geospatial data is a rich feature that gives these algorithms helpful context but adds complexity as well.
• Be smart about how you set up your problems to get the biggest gain.
• For the Data Scientist it also makes the problems much more interesting.
• With the increasing popularity of ridesharing companies, I see this as an increasingly popular area of research.
Questions?