Validating big data at scale
Validating Data at Scale
Spenser Skates, CEO at Amplitude
Doing things at scale is noisy
• Code is supposed to run the same way every time, but if you run the same loop a million times on a million different machines, how confident are you that it will always run the same?
Data from phones is noisier
• Running on tens of thousands of different platforms, with hundreds of thousands of different software configurations, on hundreds of millions of phones
• Platforms have the craziest settings
How data can get messed up
• HTTP requests get mangled in transit
• Phone might not get the acknowledgement from the server
• People’s clocks are off
• People are running weird versions of Android
• Memory/disk corruption
• Gamma ray events
You can’t trust data from the client
Problem: Data gets mangled in transit
• Parameters from POST requests get dropped
• Within a parameter, a chunk of the data may not actually reach the server
Solution: Checksumming
• Send a checksum that’s a function of all the fields
• If the checksum is wrong or missing, you know that you haven’t received all the data. Tell the phone the upload wasn’t successful
• The phone will attempt to re-upload the data
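The checksum check above can be sketched in a few lines. This is only an illustration under stated assumptions, not Amplitude's implementation: the function names, the "checksum" parameter name, the JSON serialization, the choice of SHA-256, and the 200/400 status codes are all hypothetical.

```python
import hashlib
import json


def compute_checksum(fields: dict[str, str]) -> str:
    """Hash all fields in a canonical order (hypothetical scheme)."""
    # Sorting keys makes client and server serialize the fields identically.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def handle_upload(params: dict[str, str]) -> int:
    """Return 200 if the upload is intact, 400 so the phone retries."""
    received = params.get("checksum")
    fields = {k: v for k, v in params.items() if k != "checksum"}
    # A missing checksum or a mismatch means some data was lost or altered.
    if received is None or received != compute_checksum(fields):
        return 400
    return 200


# Client side: attach the checksum before sending.
fields = {"event": "app_open", "timestamp": "1400000000000"}
upload = dict(fields, checksum=compute_checksum(fields))

# Simulate a parameter whose value was truncated in transit.
mangled = dict(upload, timestamp="14000000")
print(handle_upload(upload), handle_upload(mangled))
```

The server never tries to repair a damaged upload; it only rejects it and relies on the phone to send the data again.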
Problem: Client sends the same data twice
• How does the phone know that the server has received the data, so that it doesn’t re-upload the same piece of data twice? It gets an acknowledgement back
• How does the server know that the phone has received the acknowledgement? It doesn’t!
• Equivalent to the two generals problem
• About 5% of the requests that the server successfully receives fail to deliver an acknowledgement back to the phone
• That means all counts are inflated by about 5%!
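The "about 5%" figure can be checked with a small calculation. This sketch assumes (it is not stated in the slides) that the phone keeps retrying until it sees an acknowledgement, that every retry reaches the server, and that each acknowledgement is lost independently with the same probability. Under those assumptions the number of copies the server receives per event follows a geometric distribution. The function name is hypothetical.

```python
def expected_copies_per_event(ack_loss_rate: float) -> float:
    """Mean number of times the server receives one event.

    Each received copy is acknowledged with probability (1 - ack_loss_rate),
    and the phone stops retrying after the first acknowledgement it sees,
    so the number of copies is geometric with mean 1 / (1 - ack_loss_rate).
    """
    return 1.0 / (1.0 - ack_loss_rate)


copies = expected_copies_per_event(0.05)
inflation = copies - 1.0  # fraction by which raw counts are overstated
print(f"{copies:.4f} copies per event, {inflation:.1%} inflation")
```

With a 5% acknowledgement loss rate the overcount comes out slightly above 5%, which matches the rough figure quoted above.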
Solution: Deduplication
• Your system must be idempotent at the event level: it must be able to receive an event it has received before without changing its state
• Create a unique key for every event that is sent
• When you see an event, check your list of keys; if the key is already present, discard the event
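The deduplication steps above can be sketched as follows. This is a minimal in-memory illustration, assuming the client generates a random UUID per event; the "insert_id" field name, the EventStore class, and the in-memory set are hypothetical, and a real system would need a persistent, shared key store.

```python
import uuid


class EventStore:
    """Toy event counter that ignores events it has already seen."""

    def __init__(self) -> None:
        self.seen_keys: set[str] = set()
        self.counts: dict[str, int] = {}

    def ingest(self, event: dict) -> bool:
        """Apply the event once; return False if it was a duplicate."""
        key = event["insert_id"]
        if key in self.seen_keys:
            return False  # already processed: state is left unchanged
        self.seen_keys.add(key)
        event_type = event["type"]
        self.counts[event_type] = self.counts.get(event_type, 0) + 1
        return True


def new_event(event_type: str) -> dict:
    """Client side: the key is created once, when the event is logged."""
    return {"insert_id": str(uuid.uuid4()), "type": event_type}


store = EventStore()
event = new_event("app_open")
store.ingest(event)
store.ingest(event)  # re-upload after a lost acknowledgement
print(store.counts)  # the event is counted only once
```

The key has to be assigned on the phone when the event is first logged, not at upload time; otherwise a re-upload would carry a new key and be counted again.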
Problem: Clocks are off
• Phones are often offline, so an analytics SDK needs to cache data locally before uploading, including the time the event occurred
• But people’s clocks are often off, occasionally by years!
• We can’t simply timestamp events with the upload time: 5% of data is uploaded more than 24 hours after the event happened
Solution: Get an estimate of the actual time an event was logged
• Timestamp the upload from the phone
• For each event, compare:
    • The difference between the phone event timestamp and the server upload time
    • The difference between the phone upload timestamp and the server upload time
• For each event timestamp, subtract the difference between the phone’s upload time and the server’s upload time
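The correction above amounts to one subtraction. The sketch below assumes all timestamps are milliseconds since the epoch, that network latency is negligible compared with the clock error, and that the phone's clock was not changed between logging the event and uploading it. The function and parameter names are hypothetical.

```python
def corrected_event_time(
    event_time_phone: int,
    upload_time_phone: int,
    upload_time_server: int,
) -> int:
    """Estimate when an event really happened, in server time."""
    # How far the phone's clock is ahead of (or behind) the server's clock.
    clock_offset = upload_time_phone - upload_time_server
    # Shift the phone's event timestamp onto the server's timeline.
    return event_time_phone - clock_offset


# Example: the phone's clock is one year fast, and the event was logged
# 90 seconds before the upload.
ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000
server_upload = 1_400_000_000_000
phone_upload = server_upload + ONE_YEAR_MS
phone_event = phone_upload - 90_000

estimate = corrected_event_time(phone_event, phone_upload, server_upload)
print(server_upload - estimate)  # 90000: the event is placed 90 s before upload
```

The estimate keeps the interval between event and upload as measured by the phone and anchors it to the server's clock, so a constant clock error cancels out.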
Other Problems
• People are running weird versions of Android
    • MD5 library
• Memory/disk corruption
• Gamma ray events
Clean Data
Questions?
Always happy to talk about analytics problems!
blog.amplitude.com
twitter: @amplitudemobile
MOBILE ANALYTICS FOR DECISION MAKERS