Music streams

Music Streams: Running a social network on an event-based architecture

Who are we?

Stefano Galarraga:

Lead Developer at Crowdmix

~20 years in software engineering, mostly working in middleware, messaging and most recently Big Data

Open source contributor: Scalding, AKKA

https://github.com/galarragas

Michal Dziemianko:

Big Data Engineer and Data Scientist at Crowdmix

Background in AI and Distributed Computing

~10 years in software engineering and research

Tiago Palma:

Big Data Engineer and Data Warehouse Developer at Crowdmix

Data Warehousing and ETL experience, working with traditional MPP systems such as Teradata and with Big Data technologies

Crowdmix: Who are we?

• A social network focused on music
• The model is based on crowds
• People can share different types of content in the crowds they joined

• Music obviously is the most interesting content

• We are aiming for large scale (millions of users) => our system is designed for scalability

• We don’t own any music content but we allow people to share and listen to tracks across different streaming services

Main App Features (1/2)

• Classical social network interaction:
  • Users can build their social graph
  • Started with a follower/followee model, now moving to a friends model
    • No content is shared to followers
    • Friendship enables direct communication
    • Both concepts are leveraged for recommendations and content prioritisation
• Users join crowds
  • P2P communication
  • Unlimited size for a crowd
  • Limited number of crowds to join (just enforced by the backend for performance reasons)

Main App Features (2/2)

• Content:
  • Music (obviously)
    • No music is streamed by CM
    • Tracks from different streaming providers (more about this later)
  • Videos
    • From YouTube
    • Small videos from camera
  • Images and Pictures
• System Generated Content:
  • Recommendations
    • Various forms, from crowd/people suggestions to surfacing content in the home page
  • Charts
  • Notifications

CM Architecture diagram

Tech Stack

• AWS hosted infrastructure
• Docker for services deployment (not for Kafka and Cassandra)
• Mesos for resource management, Marathon for container orchestration
• Kafka 0.8.2 (thinking about 0.9)
• Cassandra to store Materialized Views
• Elasticsearch to index the searchable content
• Spark running on EMR for batch processing
• Stream processing done using:
  • Confluent’s Kafka consumer/producer for the “legacy” Java based microservices
  • AKKA Streams for all new microservices

Event Based System

• Kafka content retention is two weeks (started with 6 months). Topics are stored in S3 using Secor

• QoS: losing some content is acceptable for most use cases. The focus is on response time over accuracy

• CQRS

• Batch Processing
  • Recommendation jobs
  • Data Warehouse/Analytics
  • Track Metadata acquisition and matching (more on this later)
  • Event replay for data migration, evolution and recovery
• Backup/Restore
  • Using the event replay batch

Use Cases

• Charts

• Stream processing

• Music Matching, Home Page

• Lambda processing

• Analytics

• Traditional batch processing

Charts

• Clients generate a TrackListened event whenever a track is listened to for more than 5s
• The TrackListened event contains information about the context of the in-app listen, such as:
  • The comment where the track was embedded
  • The stack where the track was embedded or that the comment referred to
  • A preview from music search
  • Listening from the Chart itself
  • The crowd where the event occurred, if available
• Tracks are resolved by the Music Matching Service (more below)
• Everything is duplicated for videos

Charts

• Chart service listens to the listens topic
  • Fan out based on track sources
  • Increment counters for each known source
  • Fan in to avoid duplicates
• Triggers chart re-computation based on the context (near real time)
• Adds extra information to the chart
  • Some of the factors that contributed to the calculation are surfaced (cheers, listens, shares)
  • Some of the “top” stacks where the track was shared are listed
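The fan-out/fan-in counting can be sketched roughly as follows. This is only one reading of the slide, not Crowdmix's actual service: the `TrackListened` fields, the `ChartCounters` class and the idea of de-duplicating on an event ID are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackListened:
    event_id: str          # assumed unique per event
    canonical_track: str   # hypothetical ID produced by music matching
    sources: tuple         # known sources for the track, e.g. ("spotify", "itunes")
    context: str           # hypothetical label: "comment", "stack", "chart", ...


class ChartCounters:
    def __init__(self):
        self.seen_events = set()
        self.per_source = Counter()  # (track, source) -> listens
        self.per_track = Counter()   # track -> listens

    def on_listen(self, ev):
        """Process one event; returns False if it was a duplicate."""
        if ev.event_id in self.seen_events:
            return False
        self.seen_events.add(ev.event_id)
        # Fan out: one counter increment per known source of the track.
        for source in ev.sources:
            self.per_source[(ev.canonical_track, source)] += 1
        # Fan in: the listen counts once for the track, whatever its sources.
        self.per_track[ev.canonical_track] += 1
        return True

    def chart(self, n=10):
        return self.per_track.most_common(n)
```

Keeping the per-source and per-track counters separate means a track shared from two providers is still charted as a single entry.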

Music Matching (1/2)

• Tracks from different streaming providers: implications
• Playback
  • A user shares a stack with her favourite track. She is a Spotify Premium user and lives in the US
  • The stack is going to be seen by people:
    • Without a Spotify Premium account but with an iTunes subscription
    • Without any streaming service subscription
    • Preferring to see the video associated with the track, if available

“Across different streaming services” ? (2/2)

• Charts
  • “Hello” from Adele is the favourite track in the “Romantic Dudes” crowd
  • We need to count shares/listens/likes of the track even if it is shared from different sources
• User behaviour analysis (e.g. recommendations …)
  • You listened to “Hello” from Adele 15 times yesterday
  • ... but you’re not in the “Romantic Dudes” crowd yet:
    • The system needs to understand your tastes even if you liked/listened to that track on different streaming services
  • You always used your iTunes version, while another user had the same pattern on Spotify
    • You should be connected even if you don’t share the same streaming service

Ok, it is important. But why is it difficult?

• Identify the track
  • Tracks are identified differently across different services
  • There is a “standard” track ID called ISRC but it is not always available
  • Other shared IDs such as the MusicBrainz ID are also only partially supported
  • There are sources (YouTube) not providing content by ISRC
  • Track metadata is not fully consistent either:
    • The same track title might contain the name of the featured artist in one service while the same info is stored in the artist-name field of another service
• Retrieve the ID for the track in the right country in a scalable way
  • Need to search for the track across different sources in a scalable way
  • Handle time constraints
  • Handle missing results or connection problems
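As an illustration of the identification problem, here is a minimal sketch of one possible matching rule: compare ISRCs when both records have one, otherwise fall back to metadata after moving any featured artist out of the title. The record layout (`isrc`, `title`, `artist` keys), the helper names and the "feat." patterns are assumptions, not the matching logic actually used at Crowdmix.

```python
import re

# Matches "(feat. X)", "[ft X]", "featuring X" in a title or artist field.
FEAT = re.compile(
    r"\s*[\(\[]?\s*\b(?:featuring|feat\.?|ft\.?)\s+([^\)\]]+)[\)\]]?",
    re.IGNORECASE,
)


def _split_artists(text):
    return {a.strip().lower() for a in re.split(r",|&", text) if a.strip()}


def normalise(title, artist):
    """Return (title, artists) with featured artists pulled into the artist set."""
    artists = set()

    def pull(text):
        m = FEAT.search(text)
        if m:
            artists.update(_split_artists(m.group(1)))
            text = text[:m.start()] + text[m.end():]
        return text.strip()

    clean_title = pull(title).lower()
    artists.update(_split_artists(pull(artist)))
    return clean_title, frozenset(artists)


def same_track(a, b):
    """a and b are dicts with 'title', 'artist' and an optional 'isrc'."""
    if a.get("isrc") and b.get("isrc"):
        return a["isrc"].upper() == b["isrc"].upper()
    # ISRC missing on at least one side: fall back to normalised metadata.
    return normalise(a["title"], a["artist"]) == normalise(b["title"], b["artist"])
```

A real matcher would also need fuzzy comparison, per-country availability and timeouts on the provider lookups, which this sketch leaves out.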

Is that a common problem?

There are other companies/projects doing the same work:

• Project Rosetta Stone by EchoNest (http://blog.echonest.com/post/66963888889/the-echo-nests-rosetta-stone-unlocking-social), now owned by Spotify and no longer openly available
• Spotify API can provide IDs for other services
• BoP (http://bop.fm/, now shut down) offered a web-based music matching service


And now … a diagram

Data Warehouse / Analytics

Motivations

- Know what the users are doing in the app
- Validate new features (A/B testing)
- Measure user retention (Sticky Factor)
- Calculate our revenue stream
- Some “vanity” metrics:
  - Total number of users
  - Number of new users (Day / Week)
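The slides do not define the Sticky Factor. A common definition is daily active users divided by monthly active users, and the sketch below assumes that definition along with a 30-day window; the function name and the `(user_id, date)` input shape are illustrative.

```python
from datetime import date, timedelta


def sticky_factor(activity, day, window_days=30):
    """activity: list of (user_id, date) pairs. Returns DAU / MAU for `day`."""
    start = day - timedelta(days=window_days - 1)
    dau = {user for user, d in activity if d == day}
    mau = {user for user, d in activity if start <= d <= day}
    return len(dau) / len(mau) if mau else 0.0


activity = [
    ("a", date(2016, 1, 30)),
    ("b", date(2016, 1, 10)),
    ("c", date(2016, 1, 30)),
    ("d", date(2015, 12, 1)),  # outside the 30-day window
]
ratio = sticky_factor(activity, date(2016, 1, 30))
```

In the warehouse the same ratio would more likely be a SQL query over an activity table, since the analysts work in SQL.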

Why not use Mixpanel

No simple way to query the raw data

Security Requirements

Ability to correlate the data sent to Mixpanel directly from the mobile app with the backend data (Kafka)

Grow the number of active users with lower costs

Our Business Analysts were familiar with SQL

Building Data Warehouse from Kafka

- Secor consumes data from Kafka and dumps it into S3 every 30 minutes
- A Spark job reads the data, replays the events and applies business logic transformations
- After the transformations, the data is loaded into Redshift
- Mode Analytics is our BI tool of choice for querying the data in Redshift (SQL or Python) and generating nice D3 reports

Reliable Data Warehouse

- Kafka is the source-of-truth

- Building the Warehouse by replaying the events from the source-of-truth means that the data is highly reliable for reporting, and...

… to find/recover nasty bugs in the app that were not detected by the QA process.
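To make "replaying the events" concrete, here is a small sketch that rebuilds a view (members per crowd) by folding events in timestamp order. The event types `UserJoinedCrowd` and `UserLeftCrowd`, the field names other than `eventId` and `ts`, and the assumption that the dumped data may contain duplicates or arrive out of order are all hypothetical; the real job runs on Spark rather than in plain Python.

```python
def replay(events):
    """Rebuild crowd membership counts from an unordered list of event dicts."""
    members = {}
    seen = set()
    # Order by timestamp (eventId as tie-breaker) so the fold is deterministic.
    for ev in sorted(events, key=lambda e: (e["ts"], e["eventId"])):
        if ev["eventId"] in seen:  # skip duplicates
            continue
        seen.add(ev["eventId"])
        crowd = members.setdefault(ev["crowd"], set())
        if ev["type"] == "UserJoinedCrowd":
            crowd.add(ev["user"])
        elif ev["type"] == "UserLeftCrowd":
            crowd.discard(ev["user"])
    return {name: len(users) for name, users in members.items()}
```

Because the view is derived purely from the events, fixing a bug in the transformation and replaying again yields a corrected warehouse without touching the source data.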

What did we do right?

• Schema based events

• Using AVRO as serialization format

• Common Event Model
  • Enforcing common fields in events (time-based UUID eventId, ts, correlationId)

• Event Schema Registry (using one)

• Replicated/independent data views built from events

• Secor (plus some contribution for compaction)
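A minimal sketch of the common fields, using plain Python dicts instead of AVRO records. The field names `eventId`, `ts` and `correlationId` come from the slide; `type`, `payload`, the millisecond timestamp and the helper names are assumptions made for illustration.

```python
import time
import uuid


def new_event(event_type, payload, correlation_id=None):
    """Wrap a payload in the common envelope shared by all events."""
    return {
        "eventId": str(uuid.uuid1()),   # version 1 UUIDs are time-based
        "ts": int(time.time() * 1000),  # assumed: epoch milliseconds
        # A fresh correlation ID starts a new chain of related events.
        "correlationId": correlation_id or str(uuid.uuid4()),
        "type": event_type,
        "payload": payload,
    }


def follow_up(cause, event_type, payload):
    """Create an event caused by another one, keeping its correlationId."""
    return new_event(event_type, payload, correlation_id=cause["correlationId"])
```

Carrying the correlationId through follow-up events is what lets a single user action be traced across services and, later, across warehouse tables.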

What did we do wrong?

• Schema Registry (implementing one)
  • We didn’t use the Schema Registry from Confluent from the start; we built our own
  • No enforcement of schema compatibility on write
  • Some contortions to support schema download
  • The model is compatible with Schema Registry adoption (some parts are copied too) and we want to move there
• Event Sourcing
  • Not the idea per se, but OUR implementation was wrong
  • Confusing an event based system with event sourcing
    • Events were not replayable directly; batch based processes were needed to build the system view
  • Would migrate to a proper Event Sourcing framework like AKKA Persistence
    • Still keeping the event based infrastructure in Kafka

System Performance

• The system has been designed with the target of supporting one million active users from the beginning and a steep growth in the following months

• Marketing strategy has now changed

• The system has been opened to the public using an invitation model

• It is operating with a few thousand users

The performance results mentioned here come from the performance tests run when we still planned to open to millions of users

System Performance

System Performance: Test Cases

Questions?

Speaker Profiles

Stefano Galarraga is currently working as Lead Developer at Crowdmix. He started his professional career in 1997 and has worked mostly on middleware and message based systems, most recently moving to Big Data. He is a contributor to Twitter’s Scalding and Typesafe’s AKKA projects, plus some of his own that you can find at https://github.com/galarragas

Michal Dziemianko is currently working as Big Data Engineer and Data Scientist at Crowdmix. He has a background in AI and Distributed Computing. He has a PhD in Machine Learning from the University of Edinburgh and has been working for around 10 years in software engineering and research.

Tiago Palma is currently working as Big Data Engineer and Data Warehouse Developer at Crowdmix. He has several years of experience in Data Warehousing and ETL, working with traditional MPP systems such as Teradata and with Big Data technologies. He also has DevOps experience.
