Data Infrastructure for a World of Music
Lars Albertsson, Data Engineer @Spotify
Focus on challenges & needs
Data infrastructure for a world of music
1. Clients generate data
2. ???
3. Make profit
Users create data
Why data?
Reporting to partners, from day 1: record labels, ad buyers, marketing
Analytics: KPIs, ads, business insights (growth, retention, funnels)
Features: recommendations, search, top lists, notifications
Product development: A/B testing
Operations: root cause analysis, latency, planning
Customer support
Legal
Data purpose
Different needs: speed vs quality
Reporting to partners, from day 1: record labels, ad buyers, marketing (daily + monthly)
Data purpose
Most user actions: played songs, playlist modifications, web navigation, UI navigation
Service state changes: user, notifications
Incoming: content, social integration
Data purpose
What data?
26M monthly active users
6M subscribers
55 markets
20M songs, 20K new / day
1.5B playlists
4 data centres
10 TB from users / day
400 GB from services / day
61 TB generated in Hadoop / day
600 Hadoop nodes
6500 MapReduce jobs / day
18 PB in HDFS
Data purpose
Much data?
Data purpose
Data is true
Get raw data
Refine
Make it useful
Data infrastructure
2008:
> for h in all_hosts; do
>   rsync ${h}:/var/log/syslog /incoming/${h}/${date}
> done
> echo '0 * * * * run_all_hourly_jobs.sh' | crontab
Dump to Postgres, make a graph.
Still living with some of this…
Data infrastructure
It all started very basic
Data infrastructure
Collect, crunch, use/display
[Architecture diagram: Gateway and Playlist service emit logs onto the Kafka message bus (cross-site via Kafka@lon); together with service DB dumps they land in HDFS, where MapReduce jobs produce SQL reports and Cassandra-served recommendations.]
Data infrastructure
Fault scenarios
[The same architecture diagram, with fault scenarios marked at every hop.]
Most datasets are produced daily
Consumers want data after morning coffee
[Latency chart: for each line, the bottom level represents a good day.]
Destabilisation is the norm
Delay factors all over the infrastructure - client to display
Producers are not stakeholders
Data infrastructure
Shit happens
Get raw data into HDFS from:
clients, through GWs
GWs
service logs
service databases
Data collection
[Diagram: Gateway and Playlist service logs flow over the Kafka message bus (via Kafka@lon) and, together with the service DB, land in HDFS.]
Sources of truth
MapReduce?
Need to wait for “all” data for a time slot (hour)
What is "all"? Can we get all?
Most consumers want 9x% quickly.
Reruns are complex.
1. Rsync from hosts; get the host list from the hosts DB.
- Rsync fragile, frequent network issues.
- DB info often stale.
- Often waiting for a dead host, or omitting a live one.
2. Push logs over Kafka; wait for hosts according to the hosts DB.
+ Kafka better: application-level cross-site routing.
- Kafka unreliable by design; had to implement end-to-end acking.
3. Use Kafka as in #2; determine active hosts by snooping metrics (sketch below).
+ Reliable(?) host metric.
- End-to-end stability and host enumeration not scalable.
Data collection
Log collection evolution
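A minimal sketch of the completeness decision behind approach #3, with hypothetical helpers (hosts_reporting_metrics and hosts_landed_in_hdfs are illustrative names, not Spotify tooling):

def bucket_complete(hour, hosts_reporting_metrics, hosts_landed_in_hdfs,
                    required_fraction=0.99):
    """Decide whether an hourly log bucket can be crunched.

    hosts_reporting_metrics: hosts demonstrably alive during the hour.
    hosts_landed_in_hdfs: hosts whose logs for the hour reached HDFS.
    """
    expected = set(hosts_reporting_metrics)
    arrived = set(hosts_landed_in_hdfs)
    if not expected:
        return False  # No liveness signal at all: do not crunch.
    # Most consumers want "9x%" quickly; waiting for the last straggler
    # (a dead host, a stale DB entry) is what made approaches #1-#2 slow.
    return len(arrived & expected) >= required_fraction * len(expected)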
Single solution cannot fit all needs: choose reliability or low latency. Both paths are sketched below.
Reliable path, with store-and-forward: service hosts must not store state; synchronous handoff to HA Kafka with a large replay buffer.
Best-effort path similar: no acks, asynchronous handoff.
Message producers know the appropriate semantics. For critical data: handoff failure -> stop serving users.
Measuring loss is essential
Data collection
Log collection future
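A minimal sketch of the two paths, using today's kafka-python client for illustration (the Kafka of the talk era had no acks, hence the home-grown end-to-end acking; broker address and topic names are assumptions):

from kafka import KafkaProducer

# Reliable path: synchronous handoff; the HA cluster must fully acknowledge.
reliable = KafkaProducer(bootstrap_servers='kafka-ha:9092', acks='all')

# Best-effort path: no acks, asynchronous handoff.
best_effort = KafkaProducer(bootstrap_servers='kafka-ha:9092', acks=0)

def log_critical(event):
    """Critical data: a failed handoff should stop serving users."""
    future = reliable.send('critical-events', event)
    future.get(timeout=10)  # Block until the message is replicated.

def log_best_effort(event):
    """Low-latency data: accept (measured!) loss instead of blocking."""
    best_effort.send('best-effort-events', event)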
~1% loss is OK, assuming that it is measured.
A few % time slippage is OK, if unbiased.
Biased slippage is not OK.
Which timestamp to use for bucketing: client, GW, or HDFS?
Some components are HA (Cassandra, ZooKeeper). Most are unreliable. Client devices are very unreliable.
Buffers in “stateless” components cause loss.
Crunching delay is inconvenient. Crunching wrong data is expensive.
Data crunching
Data is false?
Core databases are dumped daily (user x 2, playlist, metadata).
Determinism required - delays inevitable. Slave replication issues are common.
No good solution:
- Sqoop live - non-deterministic
- Postgres commit log replay - not scalable
- Cassandra full dumps - resource heavy
Solution(?) - convert to event processing (sketched below):
- Experimenting with Netflix Aegisthus for Cassandra -> HDFS
- Facebook has MySQL commit log -> event conversion
Data collection
Database dumping
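A minimal sketch of the event-conversion idea: tail a database commit log and republish each change as an event, instead of taking full dumps. read_commit_log() and the topic name are hypothetical placeholders, not Aegisthus or Facebook's actual tool:

import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='kafka-ha:9092', acks='all')

def replicate_changes(read_commit_log):
    """read_commit_log: hypothetical iterator of (table, key, row) changes."""
    for table, key, row in read_commit_log():
        event = {'table': table, 'key': key, 'row': row}
        # Downstream jobs can fold these events into deterministic daily
        # snapshots in HDFS, without hammering the live database.
        producer.send('db-change-events', json.dumps(event).encode('utf-8'))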
We have raw data, sorted by host and hour
We want e.g. active users by country and product over the last month
Data crunching
End goal example - business insights
1. Split by message type, per hour.
2. Combine multiple sources of similar data, per day - a core dataset. (Steps 1-2 are sketched as Luigi tasks below.)
3. Join activity datasets, e.g. tracks played or user activity, with ornament datasets, e.g. track metadata, user demographics.
4a. Make reports for partners, e.g. labels, advertisers.
4b. Aggregate into SQL or add metadata for Hive exploration.
4c. Build indexes (search, top lists), denormalise, and put in Cassandra.
4d. Run machine learning (recommendations) and put in Cassandra.
4e. Make notification decisions and send out.
...
Data crunching
Typical data crunching
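A minimal sketch of steps 1-2 as Luigi tasks (Luigi is Spotify's own workflow tool, listed on the technology stack slide); paths, parameters, and task bodies are illustrative:

import luigi

class SplitByMessageType(luigi.Task):
    """Step 1: split raw hourly logs by message type."""
    hour = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/data/split/endsong/%s' % self.hour)

    def run(self):
        with self.output().open('w') as out:
            pass  # Real job: filter the raw hour for EndSong messages.

class DailyCoreDataset(luigi.Task):
    """Step 2: combine the hourly slices into a daily core dataset."""
    date = luigi.DateParameter()

    def requires(self):
        return [SplitByMessageType(hour='%s-%02d' % (self.date, h))
                for h in range(24)]

    def output(self):
        return luigi.LocalTarget('/data/core/endsong/%s' % self.date)

    def run(self):
        with self.output().open('w') as out:
            pass  # Real job: merge and dedupe the 24 hourly inputs.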
[Diagram: a chain of MapReduce (MR) jobs feeding Cassandra (C*).]
Data crunching
Core dataset example: users
Generate - organic.
Transfer - Kafka.
Process - Python MapReduce. Bad idea: the big data ecosystem is 99% JVM, so moving to Crunch.
Test - in production. Not acceptable; working on it, but no tools are available.
Deploy - CI + Debian packages. Low isolation; looking at containers (Docker).
Monitor - organic.
Cycle time for code-test-debug: 21 days
Data crunching
Data processing platform
Online storage: Cassandra, Postgres
Offline storage: HDFS
Transfer: Kafka, Sqoop
Processing engine: Hadoop MapReduce in YARN
Processing languages: Luigi Python MapReduce, Crunch, Pig
Mining: Hive, Postgres, Qlikview
Real-time processing: Storm (mostly experimental)
Trying out:
Spark - better for iterative algorithms (ML); the future of MapReduce?
Giraph and other graph tools
More stable infrastructure: Docker, Azkaban
Data crunching
Technology stack
def mapper(self, items):
    for item in items:
        if item.type == 'EndSong':
            # Secondary sort key 1: play events sort after their metadata.
            yield (item.track_id, 1, item)
        else:  # Track metadata: secondary sort key 0, arrives first.
            yield (item.track_id, 0, item)

def reducer(self, key, values):
    meta = None
    for item in values:
        if item.type != 'EndSong':
            meta = item  # Remember the track metadata for this key.
        elif meta is not None:
            yield add_meta(meta, item)
Data crunching
Crunching tools - four joins
select * from tracks inner join metadata on tracks.track_id = metadata.track_id;
joined = JOIN tracks BY track_id, metadata BY track_id;
PTable<String, Pair<EndSong, TrackMeta>> joined = Join.innerJoin(endSongTable, metaTable);
Vanilla MapReduce - fragile
SQL / Hive - exploration & display
Pig - deprecated
Crunch - future for processing pipelines
Lots of opportunities in PBs of data. Opportunities to get lost.
Organising data
Mostly organic - frequent discrepancies.
Agile feature dev -> easy schema change. Currently requires a client lib release.
Avro meta format in the backend: good Hadoop integration, but not the best option in the client (example schema below).
Some clients are hard to upgrade, e.g. old phones, hifi, cars.
Utopia (aka Google): client schema change -> automatic Hive/SQL/dashboard/report change.
Data crunching
Schemas
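A minimal example of what an Avro schema for an EndSong event could look like, parsed with fastavro; the field names are assumptions, not Spotify's actual schema:

import fastavro

end_song_schema = fastavro.parse_schema({
    'type': 'record',
    'name': 'EndSong',
    'fields': [
        {'name': 'user_id', 'type': 'string'},
        {'name': 'track_id', 'type': 'string'},
        {'name': 'ms_played', 'type': 'long'},
        {'name': 'timestamp', 'type': 'long'},
        # Schema evolution: a new field needs a default so that old
        # records remain readable after the change.
        {'name': 'client_version', 'type': 'string', 'default': ''},
    ],
})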
Today:
if date < datetime(2012, 10, 17):
    # Use old format
    ...
else:
    ...
Not scalable. Few tools available. HCatalog?
Solution(?): encapsulate each dataset in a library (sketched below). Owners decide compatibility vs reformat strategy. Version the interface. (Twitter)
Data crunching
Data evolution
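A minimal sketch of the encapsulation idea: one versioned reader owned by the dataset's team hides the format cutover from every consumer. The cutover date comes from the slide above; field names are illustrative:

from datetime import date

CUTOVER = date(2012, 10, 17)  # The day the format changed.

def read_endsong_v1(day, raw_records):
    """Versioned interface for the EndSong dataset.

    Always yields new-format records, so no consumer needs its own
    'if day < cutover' branch.
    """
    for rec in raw_records:
        if day < CUTOVER:
            # Old format: translate to the new field names here, once.
            yield {'track_id': rec['trackid'], 'ms_played': rec['msplayed']}
        else:
            yield rec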
Many redundant calculations.
Data discovery: home-grown tool.
Retention policy: save the raw data (S3); be brutal and delete.
Data crunching
What is out there?
Technology is easy to change, humans hard
Our most difficult challenges are cultural
Organising yourself
Failing jobs, dead jobs
Dead data
Data growth
Reruns
Isolation: configuration, memory, disk, Hadoop resources
Technical debt: testing, deployment, monitoring, remediations
Cost
Be stringent with software engineering practices or suffer. Most data organisations suffer.
Data crunching
Staying in control
History: data service department -> core data + platform department -> data platform department.
Self-service spurs data usage.
Data producers and consumers have domain knowledge; data infrastructure engineers do not.
Data producers prioritise online services over offline.
Producing and consuming are closely tied, yet often organisationally separated.
Data crunching
Who owns what?
Dos:
Solve domain-specific or unsolved things
Use stuff from leaders (Kafka)
Monitor aggressively
Have 50+% backend engineers
Focus on what data feature developers need
Separate raw and generated data
Hadoop was a good bet; Spark even better?
Data crunching
Things learnt in the fire
Don’ts:
Choose your own path (Python)
Use ad-hoc formats
Build stuff with a < 3 year horizon
Accumulate debt
Use SQL in data pipelines
Have SPOFs - no excuse anymore
Rely on host configurations
Collect data with pull
Vanilla MapReduce
"Data is special" - no SW practices
Innovation originates at Google (~10^7 data-dedicated machines): MapReduce, GFS, Dapper, Pregel, Flume.
Open source variants by the big dozen (10^5 - 10^6): Yahoo, Netflix, Twitter, LinkedIn, Amazon, Facebook - US only. Hadoop, HDFS, ZooKeeper, Giraph, Crunch, Cassandra.
Improved by serious players (10^3 - 10^4): Spotify, AirBnB, FourSquare, Prezi, King - mostly US.
Used by beginners (10^1 - 10^2).
Big Data innovation
Innovation in Big Data - four tiers
Not much in infrastructure:
Supercomputing legacy: MPI still in use
Berkeley: Spark, Mesos - cooperation with Yahoo and Twitter
Containers: Xen, VMware
Data processing theory: Bloom filters, stream processing (e.g. Count-Min Sketch, sketched below)
Machine learning
Big Data innovation
Innovation from academia
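A minimal Count-Min Sketch, the stream-processing structure named above: per-item counts in fixed memory, with one-sided error (estimates never undercount):

import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        digest = hashlib.md5(('%d:%s' % (row, item)).encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # Minimum across rows: hash collisions can only inflate a count.
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for track_id in ['a', 'b', 'a']:
    cms.add(track_id)
assert cms.estimate('a') >= 2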
Fluid architectures / private clouds:
Large pools of machines
Services and jobs are independent of hosts
Mesos, Curator are scratching at the problem
Google Borg = utopia
LAMP stack for Big Data:
End-to-end developer testing
From client modification to insights SQL change
Running on a developer machine, in an IDE
Scale is not an issue - efficiency & productivity is
Big Data innovation
Innovation is needed, examples