The Big Bad Data
or how to deal with data sources
Przemysław Pastuszka, Kraków, 17.10.2014
What do I want to talk about?
● Quick introduction to Ocado
● First approach to Big Data and why we ended up with Bad Data
● Making things better - Unified data architecture
● Live demo (if there is time left)
Ocado intro
Ocado is the world's largest online-only grocery retailer, reaching over 70% of British households, shipping over 150,000 orders a week or 1.1M items a day.
Customer Fulfilment Center
Shop
How did we start?
[Diagram: initial architecture. Ocado services (Oracle, Greenplum, JMS) export raw data to Google Cloud Storage; an overnight job transforms it into ORC files; Google BigQuery and the compute and user clusters, coordinated by a cluster manager, sit on top.]
Looks good. So what’s the problem?
Missing data
● Various data formats
○ json, csv, uncompressed, gzip, blob, nested…
○ incremental data, “snapshot” data, deltas
○ lots of code to handle all corner cases
● Corrupted data
○ corrupted gzips, empty files, invalid content, unexpected schema changes…
○ failures of overnight jobs
● Data exports delayed
○ DB downtime, network issues, human error, …
○ data available in the BD platform even later
That’s not all...
● Real-time analytics?
○ data comes in batches every night
○ so forget it
● ORC is not a dream solution
○ not a very friendly format
○ the overnight transform is another point of failure
○ data duplication (raw + ORC)
● People think you “own” the data
○ “please, tell me what this data means?”
People get frustrated
● The Big Data team is frustrated
○ we spend lots of time on monitoring and fixing bugs
○ the code becomes complex to handle corner cases
○ confidence in platform stability rapidly goes down
● Analysts are frustrated
○ long latency before data is available for querying
○ data is unreliable
Let’s go back to the drawing board!
It can’t go on like this anymore
What do we need?
● Unified input data format
○ JSON to the rescue (see the event sketch after this list)
● Data goes directly from applications to the BD platform
○ let’s make all external services write to a distributed queue
● Data validation and monitoring
○ all data coming into the system should be well-described
○ validation should happen as early as possible
○ alerts on corrupted / missing data must be raised early
● Data must be ready for querying early
○ let’s push data to BigQuery first
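To make the unified format concrete, here is a minimal sketch of a self-describing JSON event being pushed onto the input stream. The envelope fields, stream name and the boto3 Kinesis client are illustrative assumptions, not the actual Ocado implementation.

# Hypothetical sketch: a self-describing JSON event published to the input stream.
# Field names, stream name and the boto3 Kinesis client are assumptions for illustration.
import json
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

def publish_event(event_type, payload):
    # Every event carries its type and schema version so it can be
    # validated against the Event Registry as early as possible.
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,          # e.g. "order.created"
        "schema_version": 1,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    kinesis.put_record(
        StreamName="input-stream",         # assumed stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["event_id"],
    )
    return event

publish_event("order.created", {"order_id": 123, "items": 42})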
New architecture overview
[Diagram: high-level view of the new architecture — an input stream, the Event Registry, a pool of event processors, data storage, a compute cloud with its cluster manager, and consumer endpoints.]
Loading data
[Diagram: loading data. Raw events arrive on the Kinesis input stream; the Event Processor looks up the event type descriptor (schema + processing instructions) in the Event Registry (BQ) and emits a validated events stream, while invalid events land in an invalid events store (BQ) from which they can be replayed ad hoc; ad-hoc / scheduled exports go to Google Cloud Storage; the data is reachable through the BQ Tableau connector, the BQ Excel connector, the BQ REST API, and the GS REST API / gsutil.]
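A hedged sketch of the validation step shown above: the event processor fetches the event type descriptor (schema plus processing instructions) from the Event Registry, validates each raw event as early as possible, and routes invalid ones to the invalid events store for later replay. The descriptor layout, helper names and the use of the jsonschema library are assumptions for illustration.

# Illustrative sketch of early validation against an event type descriptor.
# Descriptor layout and helper names are assumptions, not Ocado's code.
import jsonschema

# What an entry in the Event Registry might look like:
# a JSON Schema plus processing instructions for downstream sinks.
EVENT_REGISTRY = {
    ("order.created", 1): {
        "schema": {
            "type": "object",
            "required": ["order_id", "items"],
            "properties": {
                "order_id": {"type": "integer"},
                "items": {"type": "integer"},
            },
        },
        "processing": {"bq_table": "events.order_created", "gs_prefix": "orders/"},
    }
}

def process_raw_event(event, validated_sink, invalid_store):
    descriptor = EVENT_REGISTRY.get((event.get("event_type"), event.get("schema_version")))
    if descriptor is None:
        invalid_store.append({"event": event, "error": "unknown event type"})
        return
    try:
        jsonschema.validate(event["payload"], descriptor["schema"])
    except jsonschema.ValidationError as err:
        # Invalid events are kept so they can be replayed ad hoc after a fix.
        invalid_store.append({"event": event, "error": err.message})
        return
    validated_sink.append((descriptor["processing"], event))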
Batch processing
[Diagram: batch processing. A BQ query selects the data, which is exported to Google Cloud Storage and picked up by compute clusters A, B and C in the compute cloud; results are reachable through the BQ Tableau connector, the BQ Excel connector, the BQ REST API, and the GS REST API / gsutil.]
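The batch flow in the diagram (query in BigQuery, export to Cloud Storage, process on a compute cluster) could look roughly like the sketch below, written against the current google-cloud-bigquery Python client. Project, dataset, table and bucket names are placeholders.

# Illustrative sketch: stage a BigQuery query result in Cloud Storage for a compute cluster.
# Names are placeholders; the flow mirrors the diagram, not Ocado's code.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Run the query into a staging table.
staging = bigquery.TableReference.from_string("my_project.staging.orders_last_week")
job_config = bigquery.QueryJobConfig(
    destination=staging,
    write_disposition="WRITE_TRUNCATE",
)
client.query(
    "SELECT * FROM `my_project.events.order_created` "
    "WHERE occurred_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)",
    job_config=job_config,
).result()

# 2. Export the staging table to Cloud Storage, where the compute clusters read it.
client.extract_table(
    staging,
    "gs://my-bucket/exports/orders_last_week-*.json",
    job_config=bigquery.ExtractJobConfig(destination_format="NEWLINE_DELIMITED_JSON"),
).result()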
Real-time processing
[Diagram: real-time processing. Raw events from the Kinesis input stream are validated against the event type descriptor (schema + processing instructions) held in the Event Registry (BQ) and published on an event queue; event processors in clusters A and B consume the queue and write through a BQ Sink and a GS Sink, so that the validated events stream and processed data are ready for consumption by other services via BigQuery, Google Cloud Storage, the BQ Tableau and Excel connectors, the BQ REST API, and the GS REST API / gsutil.]
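A hedged sketch of what an event processor from the real-time path might do: consume validated events from the event queue and fan them out to a BigQuery sink (streaming inserts, so rows are queryable almost immediately) and a Cloud Storage sink. Queue access, table and bucket names are placeholder assumptions.

# Illustrative sketch of a real-time event processor with a BQ sink and a GS sink.
# Queue access, table and bucket names are placeholder assumptions.
import json

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()
bucket = gcs.bucket("my-processed-events")     # assumed bucket

def bq_sink(event, processing):
    # Streaming insert so the row is queryable almost immediately.
    errors = bq.insert_rows_json(processing["bq_table"], [event["payload"]])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

def gs_sink(event, processing):
    # Keep a processed copy in Cloud Storage for other consumers.
    blob = bucket.blob(f"{processing['gs_prefix']}{event['event_id']}.json")
    blob.upload_from_string(json.dumps(event), content_type="application/json")

def run_event_processor(queue):
    # `queue` is any iterable of (processing_instructions, event) pairs
    # produced by the validation step sketched earlier.
    for processing, event in queue:
        bq_sink(event, processing)
        gs_sink(event, processing)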
Questions?