Data Infrastructure for a World of Music
Lars Albertsson, Data Engineer @Spotify
Focus on challenges & needs
Data infrastructure for a world of music
1. Clients generate data
2. ???
3. Make profit
Users create data
Why data?
Reporting to partners, from day 1: record labels, ad buyers, marketing
Analytics: KPIs, ads, business insights (growth, retention, funnels)
Features: recommendations, search, top lists, notifications
Product development: A/B testing
Operations: root cause analysis, latency, planning
Customer support
Legal
Data purpose
Different needs: speed vs quality
Reporting to partners, from day 1: record labels, ad buyers, marketing (daily + monthly)
Data purpose
Most user actions: played songs, playlist modifications, web navigation, UI navigation
Service state changes: user, notifications
Incoming: content, social integration
Data purpose
What data?
26M monthly active users
6M subscribers
55 markets
20M songs, 20K new / day
1.5B playlists
4 data centres
10 TB from users / day
400 GB from services / day
61 TB generated in Hadoop / day
600 Hadoop nodes
6500 MapReduce jobs / day
18 PB in HDFS
Data purpose
Much data?
Data purpose
Data is true
Get raw data
Refine
Make it useful
Data infrastructure
2008:
> for h in all_hosts; do
>   rsync ${h}:/var/log/syslog /incoming/${h}/${date}
> done
> echo '0 * * * * run_all_hourly_jobs.sh' | crontab
Dump to Postgres, make a graph.
Still living with some of this…
Data infrastructure
It all started very basic
Data infrastructure
Collect, crunch, use/display
[Architecture diagram: Gateway and Playlist service emit logs onto the Kafka message bus (cross-site via Kafka@lon); together with service DB dumps they land in HDFS, where MapReduce jobs produce SQL reports and Cassandra-served recommendations.]
Data infrastructure
Fault scenarios
[The same architecture diagram, with fault scenarios marked at every hop.]
Most datasets are produced daily
Consumers want data after morning coffee
[Latency chart: for each line, the bottom level represents a good day.]
Destabilisation is the norm
Delay factors all over the infrastructure - client to display
Producers are not stakeholders
Data infrastructure
Shit happens
Get raw data into HDFS from:
clients, through GWs
GWs
service logs
service databases
Data collection
[Diagram: Gateway and Playlist service logs flow over the Kafka message bus (via Kafka@lon) and, together with the service DB, land in HDFS.]
Sources of truth
MapReduce?
Need to wait for “all” data for a time slot (hour)
What is "all"? Can we get all?
Most consumers want 9x% quickly.
Reruns are complex.
1. Rsync from hosts; get the host list from the hosts DB.
- Rsync fragile, frequent network issues.
- DB info often stale.
- Often waiting for a dead host, or omitting a live one.
2. Push logs over Kafka; wait for hosts according to the hosts DB.
+ Kafka better: application-level cross-site routing.
- Kafka unreliable by design; had to implement end-to-end acking.
3. Use Kafka as in #2; determine active hosts by snooping metrics (sketch below).
+ Reliable(?) host metric.
- End-to-end stability and host enumeration not scalable.
Data collection
Log collection evolution
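A minimal sketch of the completeness decision behind approach #3, with hypothetical helpers (hosts_reporting_metrics and hosts_landed_in_hdfs are illustrative names, not Spotify tooling):

def bucket_complete(hour, hosts_reporting_metrics, hosts_landed_in_hdfs,
                    required_fraction=0.99):
    """Decide whether an hourly log bucket can be crunched.

    hosts_reporting_metrics: hosts demonstrably alive during the hour.
    hosts_landed_in_hdfs: hosts whose logs for the hour reached HDFS.
    """
    expected = set(hosts_reporting_metrics)
    arrived = set(hosts_landed_in_hdfs)
    if not expected:
        return False  # No liveness signal at all: do not crunch.
    # Most consumers want "9x%" quickly; waiting for the last straggler
    # (a dead host, a stale DB entry) is what made approaches #1-#2 slow.
    return len(arrived & expected) >= required_fraction * len(expected)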
Single solution cannot fit all needs: choose reliability or low latency. Both paths are sketched below.
Reliable path, with store-and-forward: service hosts must not store state; synchronous handoff to HA Kafka with a large replay buffer.
Best-effort path similar: no acks, asynchronous handoff.
Message producers know the appropriate semantics. For critical data: handoff failure -> stop serving users.
Measuring loss is essential
Data collection
Log collection future
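A minimal sketch of the two paths, using today's kafka-python client for illustration (the Kafka of the talk era had no acks, hence the home-grown end-to-end acking; broker address and topic names are assumptions):

from kafka import KafkaProducer

# Reliable path: synchronous handoff; the HA cluster must fully acknowledge.
reliable = KafkaProducer(bootstrap_servers='kafka-ha:9092', acks='all')

# Best-effort path: no acks, asynchronous handoff.
best_effort = KafkaProducer(bootstrap_servers='kafka-ha:9092', acks=0)

def log_critical(event):
    """Critical data: a failed handoff should stop serving users."""
    future = reliable.send('critical-events', event)
    future.get(timeout=10)  # Block until the message is replicated.

def log_best_effort(event):
    """Low-latency data: accept (measured!) loss instead of blocking."""
    best_effort.send('best-effort-events', event)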
~1% loss is OK, assuming that it is measured.
A few % time slippage is OK, if unbiased.
Biased slippage is not OK.
Which timestamp to use for bucketing: client, GW, or HDFS?
Some components are HA (Cassandra, ZooKeeper). Most are unreliable. Client devices are very unreliable.
Buffers in “stateless” components cause loss.
Crunching delay is inconvenient. Crunching wrong data is expensive.
Data crunching
Data is false?
Core databases are dumped daily (user x 2, playlist, metadata).
Determinism required - delays inevitable. Slave replication issues are common.
No good solution:
- Sqoop live - non-deterministic
- Postgres commit log replay - not scalable
- Cassandra full dumps - resource heavy
Solution(?) - convert to event processing (sketched below):
- Experimenting with Netflix Aegisthus for Cassandra -> HDFS
- Facebook has MySQL commit log -> event conversion
Data collection
Database dumping
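A minimal sketch of the event-conversion idea: tail a database commit log and republish each change as an event, instead of taking full dumps. read_commit_log() and the topic name are hypothetical placeholders, not Aegisthus or Facebook's actual tool:

import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='kafka-ha:9092', acks='all')

def replicate_changes(read_commit_log):
    """read_commit_log: hypothetical iterator of (table, key, row) changes."""
    for table, key, row in read_commit_log():
        event = {'table': table, 'key': key, 'row': row}
        # Downstream jobs can fold these events into deterministic daily
        # snapshots in HDFS, without hammering the live database.
        producer.send('db-change-events', json.dumps(event).encode('utf-8'))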
We have raw data, sorted by host and hour
We want e.g. active users by country and product over the last month
Data crunching
End goal example - business insights
1. Split by message type, per hour.
2. Combine multiple sources of similar data, per day - a core dataset. (Steps 1-2 are sketched as Luigi tasks below.)
3. Join activity datasets, e.g. tracks played or user activity, with ornament datasets, e.g. track metadata, user demographics.
4a. Make reports for partners, e.g. labels, advertisers.
4b. Aggregate into SQL or add metadata for Hive exploration.
4c. Build indexes (search, top lists), denormalise, and put in Cassandra.
4d. Run machine learning (recommendations) and put in Cassandra.
4e. Make notification decisions and send out.
...
Data crunching
Typical data crunching
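A minimal sketch of steps 1-2 as Luigi tasks (Luigi is Spotify's own workflow tool, listed on the technology stack slide); paths, parameters, and task bodies are illustrative:

import luigi

class SplitByMessageType(luigi.Task):
    """Step 1: split raw hourly logs by message type."""
    hour = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/data/split/endsong/%s' % self.hour)

    def run(self):
        with self.output().open('w') as out:
            pass  # Real job: filter the raw hour for EndSong messages.

class DailyCoreDataset(luigi.Task):
    """Step 2: combine the hourly slices into a daily core dataset."""
    date = luigi.DateParameter()

    def requires(self):
        return [SplitByMessageType(hour='%s-%02d' % (self.date, h))
                for h in range(24)]

    def output(self):
        return luigi.LocalTarget('/data/core/endsong/%s' % self.date)

    def run(self):
        with self.output().open('w') as out:
            pass  # Real job: merge and dedupe the 24 hourly inputs.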
[Diagram: a chain of MapReduce (MR) jobs feeding Cassandra (C*).]
Data crunching
Core dataset example: users
Generate - organic.
Transfer - Kafka.
Process - Python MapReduce. Bad idea: the big data ecosystem is 99% JVM, so moving to Crunch.
Test - in production. Not acceptable; working on it, but no tools are available.
Deploy - CI + Debian packages. Low isolation; looking at containers (Docker).
Monitor - organic.
Cycle time for code-test-debug: 21 days
Data crunching
Data processing platform
Online storage: Cassandra, Postgres
Offline storage: HDFS
Transfer: Kafka, Sqoop
Processing engine: Hadoop MapReduce in YARN
Processing languages: Luigi Python MapReduce, Crunch, Pig
Mining: Hive, Postgres, Qlikview
Real-time processing: Storm (mostly experimental)
Trying out:
Spark - better for iterative algorithms (ML); the future of MapReduce?
Giraph and other graph tools
More stable infrastructure: Docker, Azkaban
Data crunching
Technology stack
def mapper(self, items):
    for item in items:
        if item.type == 'EndSong':
            # Secondary sort key 1: play events sort after their metadata.
            yield (item.track_id, 1, item)
        else:  # Track metadata: secondary sort key 0, arrives first.
            yield (item.track_id, 0, item)

def reducer(self, key, values):
    meta = None
    for item in values:
        if item.type != 'EndSong':
            meta = item  # Remember the track metadata for this key.
        elif meta is not None:
            yield add_meta(meta, item)
Data crunching
Crunching tools - four joins
select * from tracks inner join metadata on tracks.track_id = metadata.track_id;
joined = JOIN tracks BY track_id, metadata BY track_id;
PTable<String, Pair<EndSong, TrackMeta>> joined = Join.innerJoin(endSongTable, metaTable);
Vanilla MapReduce - fragile
SQL / Hive - exploration & display
Pig - deprecated
Crunch - future for processing pipelines
Lots of opportunities in PBs of data. Opportunities to get lost.
Organising data
Mostly organic - frequent discrepancies.
Agile feature dev -> easy schema change. Currently requires a client lib release.
Avro meta format in the backend: good Hadoop integration, but not the best option in the client (example schema below).
Some clients are hard to upgrade, e.g. old phones, hifi, cars.
Utopia (aka Google): client schema change -> automatic Hive/SQL/dashboard/report change.
Data crunching
Schemas
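A minimal example of what an Avro schema for an EndSong event could look like, parsed with fastavro; the field names are assumptions, not Spotify's actual schema:

import fastavro

end_song_schema = fastavro.parse_schema({
    'type': 'record',
    'name': 'EndSong',
    'fields': [
        {'name': 'user_id', 'type': 'string'},
        {'name': 'track_id', 'type': 'string'},
        {'name': 'ms_played', 'type': 'long'},
        {'name': 'timestamp', 'type': 'long'},
        # Schema evolution: a new field needs a default so that old
        # records remain readable after the change.
        {'name': 'client_version', 'type': 'string', 'default': ''},
    ],
})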
Today:
if date < datetime(2012, 10, 17):
    # Use old format
    ...
else:
    ...
Not scalable. Few tools available. HCatalog?
Solution(?): encapsulate each dataset in a library (sketched below). Owners decide compatibility vs reformat strategy. Version the interface. (Twitter)
Data crunching
Data evolution
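A minimal sketch of the encapsulation idea: one versioned reader owned by the dataset's team hides the format cutover from every consumer. The cutover date comes from the slide above; field names are illustrative:

from datetime import date

CUTOVER = date(2012, 10, 17)  # The day the format changed.

def read_endsong_v1(day, raw_records):
    """Versioned interface for the EndSong dataset.

    Always yields new-format records, so no consumer needs its own
    'if day < cutover' branch.
    """
    for rec in raw_records:
        if day < CUTOVER:
            # Old format: translate to the new field names here, once.
            yield {'track_id': rec['trackid'], 'ms_played': rec['msplayed']}
        else:
            yield rec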
Many redundant calculations.
Data discovery: home-grown tool.
Retention policy: save the raw data (S3); be brutal and delete.
Data crunching
What is out there?
Technology is easy to change, humans hard
Our most difficult challenges are cultural
Organising yourself
Failing jobs, dead jobs
Dead data
Data growth
Reruns
Isolation: configuration, memory, disk, Hadoop resources
Technical debt: testing, deployment, monitoring, remediations
Cost
Be stringent with software engineering practices or suffer. Most data organisations suffer.
Data crunching
Staying in control
History: data service department -> core data + platform department -> data platform department.
Self-service spurs data usage.
Data producers and consumers have domain knowledge; data infrastructure engineers do not.
Data producers prioritise online services over offline.
Producing and consuming are closely tied, yet often organisationally separated.
Data crunching
Who owns what?
Dos:
Solve domain-specific or unsolved things
Use stuff from leaders (Kafka)
Monitor aggressively
Have 50+% backend engineers
Focus on what data feature developers need
Separate raw and generated data
Hadoop was a good bet; Spark even better?
Data crunching
Things learnt in the fire
Don’ts:
Choose your own path (Python)
Use ad-hoc formats
Build stuff with a < 3 year horizon
Accumulate debt
Use SQL in data pipelines
Have SPOFs - no excuse anymore
Rely on host configurations
Collect data with pull
Vanilla MapReduce
"Data is special" - no SW practices
Innovation originates at Google (~10^7 data-dedicated machines): MapReduce, GFS, Dapper, Pregel, Flume.
Open source variants by the big dozen (10^5 - 10^6): Yahoo, Netflix, Twitter, LinkedIn, Amazon, Facebook - US only. Hadoop, HDFS, ZooKeeper, Giraph, Crunch, Cassandra.
Improved by serious players (10^3 - 10^4): Spotify, AirBnB, FourSquare, Prezi, King - mostly US.
Used by beginners (10^1 - 10^2).
Big Data innovation
Innovation in Big Data - four tiers
Not much in infrastructure:
Supercomputing legacy: MPI still in use
Berkeley: Spark, Mesos - cooperation with Yahoo and Twitter
Containers: Xen, VMware
Data processing theory: Bloom filters, stream processing (e.g. Count-Min Sketch, sketched below)
Machine learning
Big Data innovation
Innovation from academia
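A minimal Count-Min Sketch, the stream-processing structure named above: per-item counts in fixed memory, with one-sided error (estimates never undercount):

import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        digest = hashlib.md5(('%d:%s' % (row, item)).encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # Minimum across rows: hash collisions can only inflate a count.
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for track_id in ['a', 'b', 'a']:
    cms.add(track_id)
assert cms.estimate('a') >= 2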
Fluid architectures / private clouds:
Large pools of machines
Services and jobs are independent of hosts
Mesos, Curator are scratching at the problem
Google Borg = utopia
LAMP stack for Big Data:
End-to-end developer testing
From client modification to insights SQL change
Running on a developer machine, in an IDE
Scale is not an issue - efficiency & productivity is
Big Data innovation
Innovation is needed, examples