Scalable Media Processing in the Cloud (MED302) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Scalable Media Processing
Phil Cluff, British Broadcasting Corporation
David Sayed, Amazon Web Services
November 13, 2013
Agenda
• Media workflows
• Where AWS fits
• Cloud media processing approaches
• BBC iPlayer in the cloud
Media Workflows
[Diagram: media workflows – source assets (featurettes, interviews, 2D movie, 3D movie, archive materials, stills) feed media workflows that deliver to networks, theatrical, DVD/BD, online, mobile apps, archive, and MSOs]
Where AWS Fits Into Media Processing
[Diagram: where AWS fits – a pipeline of ingest, index, process, package, protect, QC, auth., track, and playback stages, underpinned by media asset management and by analytics and monetization]
Lift and Shift
[Diagram: lift and shift – the on-premises stack (media processing operation, OS, storage) is moved unchanged onto EC2 instances]
The Problem with Lift and Shift
[Diagram: a monolithic media processing operation on a single EC2 instance bundles ingest, the atomic operation, post-processing, export, workflow, and parameters]
Cloud Media Processing Approaches: Phase 2
Phase 1: Lift processing from the premises and shift to the cloud
Phase 2: Refactor and optimize to leverage cloud resources
Refactoring and Optimization Opportunities
• “Deconstruct monolithic media processing operations”
  – Ingest
  – Atomic media processing operation
  – Post-processing
  – Export
  – Workflow
  – Parameters
Refactoring and Optimization Example
[Diagram: Amazon SWF coordinates the workflow through API calls to EC2 worker instances with EBS volumes, which read from a source S3 bucket and write to an output S3 bucket]
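To make the SWF-coordinated pattern concrete, here is a minimal sketch of an activity worker that one of those EC2 instances could run, using the AWS SDK for Java; the domain, task list name, and runTranscode helper are hypothetical.

```java
import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClient;
import com.amazonaws.services.simpleworkflow.model.*;

public class TranscodeActivityWorker {
    public static void main(String[] args) {
        AmazonSimpleWorkflowClient swf = new AmazonSimpleWorkflowClient();

        while (true) {
            // Long-poll SWF for the next activity task on our task list
            ActivityTask task = swf.pollForActivityTask(new PollForActivityTaskRequest()
                    .withDomain("media-processing")                     // hypothetical domain
                    .withTaskList(new TaskList().withName("transcode")));

            if (task.getTaskToken() == null) continue;                  // poll timed out; poll again

            // The task input would carry the source S3 key; run the atomic
            // media processing operation, then report the result to SWF
            String outputKey = runTranscode(task.getInput());

            swf.respondActivityTaskCompleted(new RespondActivityTaskCompletedRequest()
                    .withTaskToken(task.getTaskToken())
                    .withResult(outputKey));
        }
    }

    private static String runTranscode(String sourceKey) {
        return sourceKey + ".out"; // placeholder for the real media operation
    }
}
```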
Cloud Media Processing Approaches
Phase 1: Lift processing from the premises and shift to the cloud
Phase 2: Refactor and optimize to leverage cloud resources
Phase 3: Decomposed, modular cloud-native architecture
Decomposition and Modularization Ideas for Media Processing
• Decouple *everything* that is not part of the atomic media processing operation
• Use managed services where possible for workflow, queues, databases, etc. (a wiring sketch follows this list)
• Manage
  – Capacity
  – Redundancy
  – Latency
  – Security
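One way to read “use managed services” in practice: let SNS fan workflow events out to per-component SQS queues rather than hand-rolling any messaging. The sketch below uses the AWS SDK for Java; the topic and queue names are hypothetical.

```java
import com.amazonaws.services.sns.AmazonSNSClient;
import com.amazonaws.services.sns.model.CreateTopicRequest;
import com.amazonaws.services.sns.util.Topics;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.CreateQueueRequest;

public class DecoupledWorkflowSetup {
    public static void main(String[] args) {
        AmazonSNSClient sns = new AmazonSNSClient();
        AmazonSQSClient sqs = new AmazonSQSClient();

        // One topic publishes a workflow event; each component owns its own
        // queue, so components can be scaled or replaced independently.
        String topicArn = sns.createTopic(
                new CreateTopicRequest("transcode-complete")).getTopicArn();
        String drmQueue = sqs.createQueue(
                new CreateQueueRequest("drm-packager-input")).getQueueUrl();
        String qcQueue = sqs.createQueue(
                new CreateQueueRequest("qc-input")).getQueueUrl();

        // Topics.subscribeQueue wires the SNS->SQS subscription and queue policy
        Topics.subscribeQueue(sns, sqs, topicArn, drmQueue);
        Topics.subscribeQueue(sns, sqs, topicArn, qcQueue);
    }
}
```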
BBC iPlayer in the Cloud
AKA “Video Factory”
Phil Cluff
Principal Software Engineer & Team Lead
BBC Media Services
• The UK’s biggest video & audio on-demand service
  – And it’s free!
• Over 7 million requests every day
  – ~2% of overall consumption of BBC output
• Over 500 unique hours of content every week
  – Available immediately after broadcast, for at least 7 days
• Available on over 1,000 devices, including
  – PC, iOS, Android, Windows Phone, smart TVs, cable boxes…
• Both streaming and download (iOS, Android, PC)
• 20 million app downloads to date
Sources:
BBC iPlayer Performance Pack August 2013
http://www.bbc.co.uk/blogs/internet/posts/Video-Factory
What Is Video Factory?
• Complete in-house rebuild of ingest, transcode, and delivery workflows for BBC iPlayer
• Scalable, message-driven, cloud-based architecture
• The result of one year of development by ~18 engineers
Why Did We Build Video Factory?
• Old system
  – Monolithic
  – Slow
  – Couldn’t cope with spikes
  – Mixed ownership with a third party
• Video Factory
  – Highly scalable, reliable
  – Completely elastic transcode resource
  – Complete ownership
Why Use the Cloud?
• Background of 6 channels, spikes up to 24 channels, 6 days a week
• A perfect pattern for an elastic architecture
[Chart: off-air transcode requests for one week]
Video Factory – Architecture
• Entirely message driven
  – Amazon Simple Queue Service (SQS)
  – Some Amazon Simple Notification Service (SNS)
  – We use lots of classic message patterns
• ~20 small components (a minimal consumer sketch follows this list)
  – Singular responsibility: “Do one thing, and do it well”
• Share libraries if components do things that are alike
  – Control bloat
• Components have contracts of behavior
  – Easy to test
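A minimal sketch of the shape of one single-responsibility component: a loop that long-polls a queue, does its one thing, and deletes the message. The queue name and handle() method are hypothetical (the real components sit on Apache Camel, covered later).

```java
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.*;

public class SingleResponsibilityWorker {
    public static void main(String[] args) {
        AmazonSQSClient sqs = new AmazonSQSClient();
        String queueUrl = sqs.getQueueUrl(
                new GetQueueUrlRequest("transcode-request")).getQueueUrl();

        while (true) {
            // Long poll so idle workers make very few API calls
            ReceiveMessageResult result = sqs.receiveMessage(
                    new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20));

            for (Message m : result.getMessages()) {
                handle(m.getBody()); // the one thing this component does
                // Only delete once the work is done; otherwise the message
                // reappears after its visibility timeout and is retried
                sqs.deleteMessage(new DeleteMessageRequest(queueUrl, m.getReceiptHandle()));
            }
        }
    }

    private static void handle(String body) { /* component-specific work */ }
}
```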
Video Factory – Workflow
[Diagram: Video Factory workflow – 24 SDI broadcast video feeds pass through broadcast encoders and an RTP chunker into an Amazon S3 mezzanine bucket (the Time Addressable Media Store); a playout data feed drives the live ingest logic; a transcode abstraction layer routes work to Amazon Elastic Transcoder and Elemental Cloud, followed by DRM, QC, editorial clipping, and MAM, with distribution renditions written back to Amazon S3. Flows shown: mezzanine video, playout video, transcoded video, metadata, and SMPTE timecode.]
Mezzanine Video Capture
[Diagram: mezzanine capture – 24 SDI broadcast video feeds enter a broadcast-grade encoder, which emits an MPEG2 transport stream (H.264) over RTP multicast (30 MB HD / 10 MB SD). An RTP chunker cuts the stream into transport stream chunks (3 GB HD / 1 GB SD), a chunk uploader writes them to an Amazon S3 mezzanine chunk bucket, and a chunk concatenator, driven by control messages carrying SMPTE timecode, assembles the final mezzanine file in Amazon S3.]
Concatenating Chunks
• Build the file using Amazon S3 multipart requests (see the sketch below)
  – A 10 GB mezzanine file is constructed in under 10 seconds
• The Amazon S3 multipart APIs are very helpful
  – The component only makes REST API calls
• Small instances still give very high performance
• Be careful: Amazon S3 isn’t immediately consistent when dealing with multipart-built files
  – Mitigated with rollback logic in our message-based applications
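A sketch of that approach: concatenation by server-side multipart copy, so no chunk bytes ever pass through the component. Bucket and key names are hypothetical; note that every part except the last must be at least 5 MB, which multi-gigabyte chunks easily satisfy.

```java
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.*;

public class ChunkConcatenator {
    // Concatenate already-uploaded chunk objects into one mezzanine file.
    public static void concatenate(AmazonS3Client s3, String bucket,
                                   List<String> chunkKeys, String destKey) {
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, destKey)).getUploadId();

        List<PartETag> etags = new ArrayList<PartETag>();
        int partNumber = 1;
        for (String chunkKey : chunkKeys) {
            // Each part is copied inside S3 from an existing chunk object,
            // so this is only a REST call, not a data transfer
            CopyPartResult part = s3.copyPart(new CopyPartRequest()
                    .withSourceBucketName(bucket).withSourceKey(chunkKey)
                    .withDestinationBucketName(bucket).withDestinationKey(destKey)
                    .withUploadId(uploadId).withPartNumber(partNumber++));
            etags.add(part.getPartETag());
        }

        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, destKey, uploadId, etags));
    }
}
```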
By Numbers – Mezzanine Capture
• 24 channels: 6 HD, 18 SD
  – 16 TB of mezzanine data every day, per capture
• 200,000 chunks every day
  – And Amazon S3 has never lost one
  – That’s ~2 (UK) billion RTP packets every day… per capture
• Broadcast-grade resiliency
  – Several data centers / 2 copies each
Transcode Abstraction
• Abstract away from a single supplier
  – Avoid vendor lock-in
  – Choose suppliers based on performance, quality, and broadcaster-friendly feature sets
  – BBC: Elemental Cloud (GPU), Amazon Elastic Transcoder, in-house for subtitles
• Smart routing & smart bundling (a router sketch follows this list)
  – Save money on non–time-critical transcodes
  – Save time & money by bundling “like” outputs together
• Hybrid cloud friendly
  – Route a baseline of transcode to local encoders, and spike to the cloud
• Who has the next game changer?
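An illustrative sketch of the abstraction: every supplier sits behind one interface, and a router chooses per request. The routing criteria below are hypothetical stand-ins, not the BBC’s actual policy.

```java
// Each vendor gets a backend behind a common interface
interface TranscodeBackend {
    void submit(TranscodeRequest request);
}

class TranscodeRequest {
    String mezzanineKey;   // S3 key of the source mezzanine
    String profile;        // output rendition profile
    boolean timeCritical;  // off-air catch-up vs. bulk back-catalogue work
    boolean needsSubtitles;
}

class TranscodeRouter {
    private final TranscodeBackend elasticTranscoder, elemental, subtitles;

    TranscodeRouter(TranscodeBackend elasticTranscoder,
                    TranscodeBackend elemental,
                    TranscodeBackend subtitles) {
        this.elasticTranscoder = elasticTranscoder;
        this.elemental = elemental;
        this.subtitles = subtitles;
    }

    // Smart routing: pick a backend on urgency, cost, and feature set
    void route(TranscodeRequest r) {
        if (r.needsSubtitles) {
            subtitles.submit(r);          // in-house subtitle extraction
        } else if (r.timeCritical) {
            elemental.submit(r);          // GPU speed for time-critical work
        } else {
            elasticTranscoder.submit(r);  // cheaper bulk transcodes
        }
    }
}
```

Adding a new supplier (“Backend X” on the next slide) then means writing one more TranscodeBackend implementation and a routing rule, leaving the rest of the workflow untouched.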
Transcode Abstraction
[Diagram: a transcode request reaches the transcode router, which dispatches over SQS to an Amazon Elastic Transcoder backend, an Elemental Cloud backend, or a subtitle extraction backend; the backends drive their services over REST and SQS, reading mezzanine source from Amazon S3 and writing distribution renditions back to Amazon S3]
Transcode Abstraction - Future
[Diagram: the same routing architecture with one more slot – an unknown future “Backend X” can be plugged in behind the router over REST without changing the rest of the workflow]
Example – A Simple Elastic Transcoder Backend
[Diagram: within a single SQS message transaction, the backend gets an XML transcode request from the queue, unmarshals and validates the message, initializes the transcode with a POST to Amazon Elastic Transcoder, waits for the SNS callback over HTTP (a POST via SNS), and emits an XML transcode status message]
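The “initialize transcode” step might look like this sketch with the AWS SDK for Java. The pipeline ID (an Elastic Transcoder pipeline binds the S3 buckets and the SNS topic that delivers the completion callback) and the output naming are hypothetical.

```java
import com.amazonaws.services.elastictranscoder.AmazonElasticTranscoderClient;
import com.amazonaws.services.elastictranscoder.model.*;

public class ElasticTranscoderBackend {
    // Start a transcode job; completion arrives later as an SNS notification
    // on the pipeline's topic, so this component never polls for status.
    public static String startJob(AmazonElasticTranscoderClient ets,
                                  String pipelineId, String sourceKey, String presetId) {
        CreateJobResult result = ets.createJob(new CreateJobRequest()
                .withPipelineId(pipelineId)
                .withInput(new JobInput().withKey(sourceKey))
                .withOutput(new CreateJobOutput()
                        .withKey(sourceKey + ".mp4")   // hypothetical output key
                        .withPresetId(presetId)));
        return result.getJob().getId();
    }
}
```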
Example – Add Error Handling
[Diagram: the same backend with error handling added – messages that fail to unmarshal or validate go to a bad message queue, repeated processing failures go to a dead letter queue, and known failures go to a fail queue]
Example – Add Monitoring Events
[Diagram: the same backend with monitoring added – every step (consuming the message, validating it, initializing the transcode, and handling the SNS callback) emits monitoring events alongside the error queues]
BBC eventing framework
• Key-value pairs pushed into Splunk (a hypothetical helper sketch follows this list)
  – Business-level events, e.g.:
    • Message consumed
    • Transcode started
  – System-level events, e.g.:
    • HTTP call returned status 404
    • Application’s heap size
    • Unhandled exception
• Fixed model for “context” data
  – Identifiable workflows, grouping of events; transactions
  – Saves us a LOT of time diagnosing failures
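The talk doesn’t show the framework’s API, but a key-value event helper with a fixed context model might look like this hypothetical sketch, emitting one Splunk-friendly line per event:

```java
import java.util.Map;

public class EventLogger {
    private final String workflowId, transactionId;

    public EventLogger(String workflowId, String transactionId) {
        this.workflowId = workflowId;       // fixed "context" fields let Splunk
        this.transactionId = transactionId; // group events into transactions
    }

    public void event(String name, Map<String, String> fields) {
        StringBuilder line = new StringBuilder("event=").append(name)
                .append(" workflowId=").append(workflowId)
                .append(" transactionId=").append(transactionId);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            line.append(' ').append(field.getKey()).append('=').append(field.getValue());
        }
        System.out.println(line); // stdout -> log file -> Splunk forwarder
    }
}
```

Usage might be new EventLogger("wf-123", "txn-456").event("transcode.started", …), where the IDs come from the message wrapper’s context.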
Component Development – General Development & Architecture
• Java applications
  – Run inside Apache Tomcat on m1.small EC2 instances
  – Run at least 3 of everything
  – Autoscale on queue depth
• Built on top of the Apache Camel framework (a minimal route sketch follows this list)
  – A platform for building message-driven applications
  – Reliable, well-tested SQS backend
  – Camel route builders use a Java DSL
  – Full of messaging patterns
• Developed with Behavior-Driven Development (BDD) & Test-Driven Development (TDD)
  – Cucumber
• Deployed continuously
  – Many times a day, 5 days a week
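A minimal sketch of a Camel route builder in the Java DSL, assuming an AmazonSQS client registered in the Camel registry as “sqsClient”; the queue names, JAXB package, and TranscodeInitializer bean are hypothetical.

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.converter.jaxb.JaxbDataFormat;

public class TranscodeRequestRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // JAXB turns the XML transcode messages into Java objects and back
        JaxbDataFormat xml = new JaxbDataFormat("com.example.transcode.messages");

        from("aws-sqs://transcode-request?amazonSQSClient=#sqsClient")
            .routeId("transcode-request-consumer")
            .unmarshal(xml)                    // XML -> request object
            .bean(TranscodeInitializer.class)  // start the job, return a status object
            .marshal(xml)                      // status object -> XML
            .to("aws-sqs://transcode-status?amazonSQSClient=#sqsClient");
    }

    public static class TranscodeInitializer {
        // Placeholder for the call that actually initializes the transcode
        public Object initiate(Object request) { return request; }
    }
}
```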
Error Handling Messaging Patterns
• We use several message patterns (a combined Camel sketch follows the three pattern slides)
  – Bad message queue
  – Dead letter queue
  – Fail queue
• Key concept: never lose a message
  – A message is either in-flight, done, or in an error queue somewhere
• All require human intervention for the workflow to continue
  – Not necessarily a bad thing
Message Patterns – Bad Message Queue
• Wrapped in a message wrapper which contains context
• Never retried
• Very rare in production systems
• Implemented as an exception handler on the route builder

The message doesn’t unmarshal to the object it should, OR we could unmarshal the object, but it doesn’t meet our validation rules.
Message Patterns – Dead Letter Queue
• Message is an exact copy of the input message
• Retried several times before being put on the DLQ
• Can be common, even in production systems
• Implemented as a bean in the route builder for SQS

We tried processing the message a number of times, and something we weren’t expecting went wrong each time.
Message Patterns – Fail Queue
• Wrapped in a message wrapper that contains context
• Requires some level of knowledge of the system to be retried
• Often evolve from understanding the causes of DLQ’d messages
• Implemented as an exception handler on the route builder
Something I knew could go wrong went wrong
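Pulling the three patterns together, a hedged Camel sketch: the dead letter channel retries and then parks an exact copy of the input, while the bad message and fail queues are fed by exception handlers on the route builder. Queue names and exception types are hypothetical.

```java
import org.apache.camel.builder.RouteBuilder;

public class ErrorHandlingRoute extends RouteBuilder {
    // Hypothetical failure types
    static class InvalidMessageException extends RuntimeException {}
    static class KnownTranscodeException extends RuntimeException {}

    @Override
    public void configure() throws Exception {
        // Dead letter queue: retry a few times, then park an exact copy of the input
        errorHandler(deadLetterChannel("aws-sqs://transcode-dlq?amazonSQSClient=#sqsClient")
                .maximumRedeliveries(3)
                .useOriginalMessage());

        // Bad message queue: never retried -- the message itself is broken
        onException(InvalidMessageException.class)
                .handled(true)
                .to("aws-sqs://transcode-bad-message?amazonSQSClient=#sqsClient");

        // Fail queue: a failure mode we know about; park it for an operator
        onException(KnownTranscodeException.class)
                .handled(true)
                .to("aws-sqs://transcode-fail?amazonSQSClient=#sqsClient");

        from("aws-sqs://transcode-request?amazonSQSClient=#sqsClient")
                .bean(Worker.class); // hypothetical bean doing the actual work
    }

    public static class Worker {
        public void process(String body) { /* may throw the exceptions above */ }
    }
}
```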