The Big Bad Data
or how to deal with data sources
Przemysław Pastuszka, Kraków, 17.10.2014
What do I want to talk about?
● Quick introduction to Ocado
● First approach to Big Data and why we ended up with Bad Data
● Making things better - Unified data architecture
● Live demo (if there is time left)
Ocado intro
Ocado is the world's largest online-only grocery retailer, reaching over 70% of British households, shipping over 150,000 orders a week or 1.1M items a day.
Customer Fulfilment Center
Shop
How did we start?
[Diagram: initial architecture. Ocado services (Oracle, Greenplum, JMS) export raw data to Google Cloud Storage; an overnight job transforms it into ORC files; Google BigQuery and the compute and user clusters, coordinated by a cluster manager, sit on top.]
Looks good. So what’s the problem?
Missing data
● Various data formats
○ json, csv, uncompressed, gzip, blob, nested…
○ incremental data, “snapshot” data, deltas
○ lots of code to handle all corner cases
● Corrupted data
○ corrupted gzips, empty files, invalid content, unexpected schema changes…
○ failures of overnight jobs
● Data exports delayed
○ DB downtime, network issues, human error, …
○ data available in the BD platform even later
That’s not all...
● Real-time analytics?
○ data comes in batches every night
○ so forget it
● ORC is not a dream solution
○ not a very friendly format
○ the overnight transform is another point of failure
○ data duplication (raw + ORC)
● People think you “own” the data
○ “please, tell me what this data means?”
People get frustrated
● The Big Data team is frustrated
○ we spend lots of time on monitoring and fixing bugs
○ the code becomes complex to handle corner cases
○ confidence in platform stability rapidly goes down
● Analysts are frustrated
○ long latency before data is available for querying
○ data is unreliable
Let’s go back to the drawing board!
It can’t go on like this anymore
What do we need?
● Unified input data format
○ JSON to the rescue (see the event sketch after this list)
● Data goes directly from applications to the BD platform
○ let’s make all external services write to a distributed queue
● Data validation and monitoring
○ all data coming into the system should be well-described
○ validation should happen as early as possible
○ alerts on corrupted / missing data must be raised early
● Data must be ready for querying early
○ let’s push data to BigQuery first
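To make the unified format concrete, here is a minimal sketch of a self-describing JSON event being pushed onto the input stream. The envelope fields, stream name and the boto3 Kinesis client are illustrative assumptions, not the actual Ocado implementation.

# Hypothetical sketch: a self-describing JSON event published to the input stream.
# Field names, stream name and the boto3 Kinesis client are assumptions for illustration.
import json
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

def publish_event(event_type, payload):
    # Every event carries its type and schema version so it can be
    # validated against the Event Registry as early as possible.
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,          # e.g. "order.created"
        "schema_version": 1,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    kinesis.put_record(
        StreamName="input-stream",         # assumed stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["event_id"],
    )
    return event

publish_event("order.created", {"order_id": 123, "items": 42})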
New architecture overview
[Diagram: high-level view of the new architecture — an input stream, the Event Registry, a pool of event processors, data storage, a compute cloud with its cluster manager, and consumer endpoints.]
Loading data
[Diagram: loading data. Raw events arrive on the Kinesis input stream; the Event Processor looks up the event type descriptor (schema + processing instructions) in the Event Registry (BQ) and emits a validated events stream, while invalid events land in an invalid events store (BQ) from which they can be replayed ad hoc; ad-hoc / scheduled exports go to Google Cloud Storage; the data is reachable through the BQ Tableau connector, the BQ Excel connector, the BQ REST API, and the GS REST API / gsutil.]
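A hedged sketch of the validation step shown above: the event processor fetches the event type descriptor (schema plus processing instructions) from the Event Registry, validates each raw event as early as possible, and routes invalid ones to the invalid events store for later replay. The descriptor layout, helper names and the use of the jsonschema library are assumptions for illustration.

# Illustrative sketch of early validation against an event type descriptor.
# Descriptor layout and helper names are assumptions, not Ocado's code.
import jsonschema

# What an entry in the Event Registry might look like:
# a JSON Schema plus processing instructions for downstream sinks.
EVENT_REGISTRY = {
    ("order.created", 1): {
        "schema": {
            "type": "object",
            "required": ["order_id", "items"],
            "properties": {
                "order_id": {"type": "integer"},
                "items": {"type": "integer"},
            },
        },
        "processing": {"bq_table": "events.order_created", "gs_prefix": "orders/"},
    }
}

def process_raw_event(event, validated_sink, invalid_store):
    descriptor = EVENT_REGISTRY.get((event.get("event_type"), event.get("schema_version")))
    if descriptor is None:
        invalid_store.append({"event": event, "error": "unknown event type"})
        return
    try:
        jsonschema.validate(event["payload"], descriptor["schema"])
    except jsonschema.ValidationError as err:
        # Invalid events are kept so they can be replayed ad hoc after a fix.
        invalid_store.append({"event": event, "error": err.message})
        return
    validated_sink.append((descriptor["processing"], event))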
Batch processing
[Diagram: batch processing. A BQ query selects the data, which is exported to Google Cloud Storage and picked up by compute clusters A, B and C in the compute cloud; results are reachable through the BQ Tableau connector, the BQ Excel connector, the BQ REST API, and the GS REST API / gsutil.]
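The batch flow in the diagram (query in BigQuery, export to Cloud Storage, process on a compute cluster) could look roughly like the sketch below, written against the current google-cloud-bigquery Python client. Project, dataset, table and bucket names are placeholders.

# Illustrative sketch: stage a BigQuery query result in Cloud Storage for a compute cluster.
# Names are placeholders; the flow mirrors the diagram, not Ocado's code.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Run the query into a staging table.
staging = bigquery.TableReference.from_string("my_project.staging.orders_last_week")
job_config = bigquery.QueryJobConfig(
    destination=staging,
    write_disposition="WRITE_TRUNCATE",
)
client.query(
    "SELECT * FROM `my_project.events.order_created` "
    "WHERE occurred_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)",
    job_config=job_config,
).result()

# 2. Export the staging table to Cloud Storage, where the compute clusters read it.
client.extract_table(
    staging,
    "gs://my-bucket/exports/orders_last_week-*.json",
    job_config=bigquery.ExtractJobConfig(destination_format="NEWLINE_DELIMITED_JSON"),
).result()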
Real-time processing
[Diagram: real-time processing. Raw events from the Kinesis input stream are validated against the event type descriptor (schema + processing instructions) held in the Event Registry (BQ) and published on an event queue; event processors in clusters A and B consume the queue and write through a BQ Sink and a GS Sink, so that the validated events stream and processed data are ready for consumption by other services via BigQuery, Google Cloud Storage, the BQ Tableau and Excel connectors, the BQ REST API, and the GS REST API / gsutil.]
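A hedged sketch of what an event processor from the real-time path might do: consume validated events from the event queue and fan them out to a BigQuery sink (streaming inserts, so rows are queryable almost immediately) and a Cloud Storage sink. Queue access, table and bucket names are placeholder assumptions.

# Illustrative sketch of a real-time event processor with a BQ sink and a GS sink.
# Queue access, table and bucket names are placeholder assumptions.
import json

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()
bucket = gcs.bucket("my-processed-events")     # assumed bucket

def bq_sink(event, processing):
    # Streaming insert so the row is queryable almost immediately.
    errors = bq.insert_rows_json(processing["bq_table"], [event["payload"]])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

def gs_sink(event, processing):
    # Keep a processed copy in Cloud Storage for other consumers.
    blob = bucket.blob(f"{processing['gs_prefix']}{event['event_id']}.json")
    blob.upload_from_string(json.dumps(event), content_type="application/json")

def run_event_processor(queue):
    # `queue` is any iterable of (processing_instructions, event) pairs
    # produced by the validation step sketched earlier.
    for processing, event in queue:
        bq_sink(event, processing)
        gs_sink(event, processing)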
Questions?