Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Chris Shenton
CTO at V! Studios
NASA WESTPrime
Presentation Overview
● Evolution of webapps: simple to cloud
○ Problems with typical webapps: fault-intolerant, unscalable
○ Plan for failure, then plan to scale
○ Scalability of SQL vs NoSQL databases like DynamoDB
○ Cloud-native application design patterns
● images.nasa.gov architecture
○ Front-end decoupled from API
○ Dataflow of asset from upload to publishing
○ Fault-tolerant cloud network architecture
● DevOps
○ Infrastructure as Code
○ CI/CD
NIEP’s Problem: Users Can’t Find Content
● In surveys, the public says “great images” when they think of NASA
● 60 different collections across Agency
● Uneven content quality and user interface
● No API for reuse and integration of content across apps
● Must be mobile friendly
● Shutterstock.com functionality -- too ambitious?
● Video and audio too
● We believed this functionality was possible for NASA
○ Cloud services for compute, storage, search
○ Modern, responsive web front-end
○ API for front-end and reuse by other applications
Your Father’s WebApps: they’ve got problems [1b]
[Diagram: a single server running both the app and the DB]
#1: Single Point of Failure (SPoF): if the server dies, everything’s toast.
Your Father’s WebApps: they’ve got problems [2a]
[Diagram: the app and the DB split onto two separate servers]
Your Father’s WebApps: they’ve got problems [2b]
#2: Better performance (maybe), but now two SPoFs.
Your Father’s WebApps: they’ve got problems [3a]
[Diagram: a load balancer in front of multiple app servers, each app server paired with its own DB server]
Your Father’s WebApps: they’ve got problems [3b]
#3: Good, we’ve eliminated the SPoFs, but database synchronization and failover are difficult. It’s still not scalable.
Cloud Architecture: Plan for Outage, then Plan to Scale [1]
● Use Elastic Load Balancers
○ redundant
○ fault-tolerant
○ globally distributed
● Use Auto-Scaling Servers (EC2 instances)
○ scale out under load
○ scale in when quiescent to save money
○ pay only for what you eat
● Use Managed Relational Database Service (RDS)
○ automatically performs synchronization
○ automatically performs failover
○ PostgreSQL, MySQL, MariaDB, MS SQL, Oracle, AWS Aurora
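The auto-scaling behavior described above can be sketched with a target-tracking policy. This is a minimal sketch, not the production configuration: the group name "api-asg" and the 50% CPU target are hypothetical, and the actual boto3 call is shown commented out.

```python
def scaling_policy_params(asg_name, target_cpu=50.0):
    """Build the arguments for boto3's autoscaling.put_scaling_policy()."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            # Scale out when average CPU exceeds the target; scale back in
            # when load subsides, so you pay only for what you eat.
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu,
        },
    }

# import boto3
# boto3.client("autoscaling").put_scaling_policy(**scaling_policy_params("api-asg"))
```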
Cloud Architecture: Plan for Outage, then Plan to Scale [2]
[Diagram: Elastic Load Balancer → one EC2 app instance → RDS with primary and standby DBs]
#1: Minimal cost with single EC2 instance. Fault-tolerant, automatically syncing database with fail-over.
Cloud Architecture: Plan for Outage, then Plan to Scale [3]
[Diagram: Elastic Load Balancer → auto-scaled EC2 app instances → RDS with primary and standby DBs]
#2: Auto-scale based on load triggers to handle increased load. Scales down when load subsides to contain cost. You may have load balancers in your datacenter, but you probably can’t add hundreds of servers effortlessly.
Cost for 8 hours on 1 server is same as for 1 hour on 8 servers; for compute-intensive tasks, this gets you home sooner. And you still pay only for what you eat.
SQL vs NoSQL Cloud-Scale Databases, e.g., DynamoDB
● SQL databases require forklift upgrade
○ when storage capacity reached
○ when I/O capacity reached
● NoSQL databases designed for “web-scale”
○ schemaless
○ expect faults, work around them
○ replicate, add partitions/shards as needed
○ do not support SQL features like JOIN
○ require good app design to leverage effectively
● AWS DynamoDB is a Cloud Scale NoSQL DB
○ < 10 millisecond latency at any scale
○ unlimited storage
○ partitions grow as data grows
○ hash key and optional sort key
○ other attributes are schemaless, store anything
○ throughput limited by a knob or API call
DynamoDB and Partitions
Hash Key (id)   Sort Key (year)   Other Data (schemaless)
chris 1994 job=GSFC,state=MD
chris 2001 job=Koansys,title=Founder
chris 2012 job=VStudios,title=CTO,beer=SierraNevada
charles 2013 job=VStudios,title=FullStackEng
moe 1999 job=VStudios,title=Founder
victor 2012 job=VStudios,title=StreamEng
tim 2013 job=VStudios,title=COO,state=VA
earl 2012 job=VStudios,title=CloudEng
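The table above can be read through a toy model of partitioning. This is not DynamoDB’s real algorithm, just an illustration: the hash key alone picks the partition, so all of one person’s rows land together and a query on id="chris" touches a single partition and can range over year.

```python
import hashlib

def partition_for(hash_key, num_partitions):
    """Toy partition router: hash the hash key, mod the partition count."""
    digest = hashlib.md5(hash_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Rows from the slide's table: (hash key, sort key)
rows = [("chris", 1994), ("chris", 2001), ("chris", 2012), ("moe", 1999)]

partitions = {}
for hash_key, sort_key in rows:
    partitions.setdefault(partition_for(hash_key, 4), []).append((hash_key, sort_key))
# Every "chris" row shares one partition; within it, items sort by year.
```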
Cloud-Native Application Architectures
● Servers are like cattle, not pets
● Servers are ephemeral and stateless
● Scale out processes
● Apps persist to DB or object store, not server
● Use queues and workers to process requests
● See Twelve-Factor App (https://12factor.net)
[Diagram: new jobs enter the Job Queue (job1, job2, … job99); auto-scaling workers (worker1, worker2, … workerN) pull jobs and persist results to the DB or object storage]
Decouple work from workers with queues to prevent overload or loss of jobs
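The queue/worker decoupling above can be sketched with the standard library as a stand-in for SQS: producers only touch the queue, never a worker, and the number of workers can be scaled to the backlog.

```python
import queue
import threading

jobs = queue.Queue()   # stands in for the SQS job queue
results = []
results_lock = threading.Lock()

def worker():
    while True:
        job = jobs.get()
        if job is None:            # sentinel: shut this worker down
            jobs.task_done()
            break
        with results_lock:
            results.append(f"done:{job}")
        jobs.task_done()

# "Auto-scaling" workers: start as many as the backlog warrants.
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for n in range(5):                 # producers never talk to workers directly
    jobs.put(f"job{n}")
jobs.join()                        # blocks until every job is processed
for w in workers:                  # one sentinel per worker
    jobs.put(None)
for w in workers:
    w.join()
```

With real SQS a worker would only delete a message after processing succeeds, so a crashed worker’s job reappears on the queue instead of being lost.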
images.nasa.gov: Built on Cloud-Native Services
We use AWS-provided services whenever possible instead of building our own. This allows us to leverage tested, supported, backed-up, scalable services so we can concentrate on building our own application.
● EC2, ELB: autoscaling compute for API, Image Resizer, Pipeline processes
● S3: object storage for incoming media, metadata, published assets
● ElasticTranscoder: video/audio transcoding for smaller versions including mobile
● CloudSearch: managed search service allows search by free text or by fields
● DynamoDB: NoSQL database tracks incoming jobs, published assets, users
● SQS: message queues decouple incoming jobs from pipeline processes
● SNS: notification service indicates when new content uploaded, triggers pipeline
images.nasa.gov Front-End Architecture [1]
● Old school webapps send HTTP requests to servers and get back HTML
● images.nasa.gov separates front-end webapp from back-end API
● Front-end is written in AngularJS
○ a webapp running in the browser
○ not just a web “page”
● Back-end written in Python (Pyramid), returns JSON data to the front-end
● Front-end then renders it per its app
● More interactive, the evolution of “AJAX”
images.nasa.gov Front-End Architecture [2]
1. Browser gets FE app from S3
a. AngularJS code, HTML, CSS
b. renders home page with search box
2. Queries API (ASG)
a. gets results as JSON and renders as page
b. gets images from Assets S3 and renders
3. Connects to API to get details
a. gets details as JSON and renders as page
b. gets image from Assets S3 and renders
[Diagram: request flow between browser, S3, and API]
● query: GET / → S3 (images.nasa.gov, webapp code): <html><body data-ng-app="availFeApp"> … </html>
● results: GET /search?q=cloud → API (images-api.nasa.gov, min 2 instances): {“collection”: {“items”: [{“href”: “https://images-assets.nasa.gov/image/…”}, ...]}}
● detail: GET /image/cloud-free-iceland.jpg; GET /image/…, GET /image/…, ...; GET /metadata/cloud-free-iceland/...; GET /asset/cloud-free-iceland/... → S3 (images-assets.nasa.gov, media and metadata)
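The search exchange can be sketched as a client of the API. The endpoint and the JSON envelope (“collection” → “items” → “href”) are as shown on the slide; the helper below just unwraps that envelope. The live call needs network access, so it is commented out and the demo uses an offline sample in the slide’s shape.

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

SEARCH_URL = "https://images-api.nasa.gov/search"

def asset_hrefs(payload):
    """Pull each result's href out of the API's JSON envelope."""
    return [item["href"] for item in payload["collection"]["items"]]

# Live call (requires network):
# with urlopen(SEARCH_URL + "?" + urlencode({"q": "cloud"})) as resp:
#     print(asset_hrefs(json.load(resp))[:3])

# Offline demo with the envelope shape from the slide:
sample = {"collection": {"items": [
    {"href": "https://images-assets.nasa.gov/image/demo/collection.json"},
]}}
print(asset_hrefs(sample))
```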
Ingest Media Data Flow [1]: Browser Experience
● User selects media (image, audio, video with caption)
● Browser sends media to the private S3 bucket
● Dashboard shows progress, including when more searchable metadata required
Ingest Media Data Flow [2]: Upload
[Diagram: the ingest pipeline. The API ASG and the Pipeline ASG workers (Uploaded, Transcoded, Published) are decoupled by AWS SQS queues (Uploaded, Transcoded, Published, Error). Media lands in the private S3 bucket and is published to the images-assets S3 bucket; the Image Resizer ASG resizes images and AWS ElasticTranscoder transcodes video and audio; AWS CloudSearch holds the search index. All processes write state to the JobStateDB (DynamoDB), which drives the dashboard and answers queries about Incomplete Jobs; when the asset is published and indexed, an entry is recorded in the AssetDB (DynamoDB). Processes write failures to the Error queue for cleanup: trash the index and S3 artifacts, and mark the job bad in the DB.]
● POST to API with optional metadata, captions
● API stores metadata, captions to Private S3
● API returns signed upload URL to browser
● Browser PUTs media directly to Private S3
● S3 sends an SNS notification to the SQS Uploaded queue, triggering the start of the pipeline
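The signed-upload step can be sketched as below: the API signs a PUT URL against the private bucket so the browser uploads directly to S3 and large media never transits the API servers. The bucket name and key layout here are hypothetical, not the production names.

```python
def upload_key(asset_id, filename):
    """Object key for the original media in the private bucket (assumed layout)."""
    return f"incoming/{asset_id}/{filename}"

def presigned_put(s3_client, bucket, key, expires=3600):
    """Return a time-limited URL the browser can PUT the media to.

    s3_client is a boto3 S3 client; the URL embeds the signature, so the
    browser needs no AWS credentials of its own.
    """
    return s3_client.generate_presigned_url(
        "put_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=expires
    )

# import boto3
# url = presigned_put(boto3.client("s3"), "images-private",
#                     upload_key("abc123", "iceland.jpg"))
# The browser then issues: PUT <url> with the file bytes as the body.
```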
Ingest Media Data Flow [3]: Transcode/Resize
● Uploaded Worker gets event from queue
● Transcode/Resize
○ image invokes ImageResizer ASG: JPG
○ audio/video invokes ElasticTranscoder: MP3, MP4
○ multiple smaller formats for download, mobile, preview, thumbnails
○ artifacts stored in Private S3 with original
● Waits for completion
● Creates event in Transcoded queue
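The audio/video branch can be sketched as an ElasticTranscoder job with several outputs from one input. The preset ID, pipeline ID, and key naming below are assumptions for illustration; real preset IDs come from the ElasticTranscoder console or its list_presets() call.

```python
MP4_PRESET = "1351620000001-000010"  # assumed: a system MP4 preset ID

def transcode_job_params(pipeline_id, input_key, output_base):
    """Arguments for a boto3 elastictranscoder.create_job() call."""
    return {
        "PipelineId": pipeline_id,
        "Input": {"Key": input_key},
        "Outputs": [
            # Multiple smaller renditions of one input, per the slide.
            {"Key": f"{output_base}~preview.mp4", "PresetId": MP4_PRESET},
            {"Key": f"{output_base}~mobile.mp4", "PresetId": MP4_PRESET},
        ],
    }

# import boto3
# et = boto3.client("elastictranscoder")
# job = et.create_job(**transcode_job_params("pipeline-id", "incoming/a/v.mov", "a/v"))
# ...then poll until the job completes before queueing the Transcoded event.
```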
Ingest Media Data Flow [4]: Publish
● Transcoded Worker gets event from queue
● If we have valid metadata (and captions)
○ move media and artifacts to images-assets S3
○ at this point, it’s publicly accessible but not yet findable by search
○ create event in Published queue
● Else, mark job as Incomplete
Ingest Media Data Flow [5]: Index
● Published Worker gets event from queue
● Sends metadata to CloudSearch for indexing
○ once indexed, it’s findable by search
● Marks job done in JobDB
● Creates an entry in the AssetDB
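The indexing step can be sketched as a CloudSearch document batch: CloudSearch ingests batches of “add”/“delete” records, and the fields of an “add” record become searchable. The field names below are hypothetical, not the production schema.

```python
import json

def sdf_add(asset_id, fields):
    """One 'add' record in CloudSearch's document batch format."""
    return {"type": "add", "id": asset_id, "fields": fields}

batch = json.dumps([
    sdf_add("cloud-free-iceland",
            {"title": "Cloud-free Iceland", "media_type": "image"}),
])

# import boto3
# domain = boto3.client("cloudsearchdomain", endpoint_url=DOC_ENDPOINT)  # hypothetical endpoint
# domain.upload_documents(documents=batch, contentType="application/json")
# Once this returns, the asset is findable by search.
```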
Ingest Media Data Flow [6]: Errors
● Any jobs that cause errors create events in the Error queue
● Error Worker pulls events from queue
○ removes index from CloudSearch
○ removes media, artifacts from images-assets and private S3
○ marks job as errored in JobStateDB
Regions and Availability Zones and Subnets, oh my!
● AWS has 16 “Regions”
● There are 42 “Availability Zones” across them
● Each AZ is a physically separate datacenter
● AZs in a region have high-speed connections to the others in the same region
● Multi-AZ deployment guards against catastrophic AZ outage
● images.nasa.gov is in us-east-1 across 2 AZs
● Virtual Private Clouds provide isolation
● VPCs can be subnetted
[Map: AWS Regions and the number of AZs in each]
Services Deployed Across AZs, VPC, Subnets
[Diagram: AWS Region us-east-1, with a Public VPC spanning AZs us-east-1b and us-east-1c. Each AZ contains a web subnet (web1, web2) and an app subnet (app1, app2) with no routing from the public internet. The API ELB, Pipeline ELB, and ImageResizer ELB each front an auto-scaling group (API ASG, Pipeline ASG, ImageResizer ASG) with instances in both AZs. Outside the VPC sit the AWS globally managed services: S3 buckets (images-assets, private, images FE), DynamoDB tables (JobDB, AssetDB, UserDB), SQS queues (Uploaded, Transcoded, Published, Error), and CloudSearch.]
Infrastructure as Code
● We do not build or deploy networks, servers or services by hand
● All infrastructure is defined in code
○ resident with our application software
○ WESTPrime Stash code repository
● Troposphere: Python abstraction for AWS CloudFormation
● Generates 3500 lines of JSON CloudFormation
● EC2 machines use hardened AMIs provided by WESTPrime
● Nearly identical Dev, Stage, Prod environments
● Fast updates to existing infrastructure
● Reliable, repeatable, robust
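The Troposphere approach above can be illustrated with a minimal stand-in: describe resources as Python data, emit CloudFormation JSON. The resource and bucket names here are hypothetical, and the real Troposphere equivalent is shown in comments.

```python
import json

def s3_bucket(logical_id, bucket_name):
    """A CloudFormation S3 bucket resource as a Python dict."""
    return {logical_id: {
        "Type": "AWS::S3::Bucket",
        "Properties": {"BucketName": bucket_name},
    }}

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {**s3_bucket("AssetsBucket", "images-assets-demo")},
}
print(json.dumps(template, indent=2))

# With Troposphere itself, the equivalent is roughly:
# from troposphere import Template
# from troposphere.s3 import Bucket
# t = Template()
# t.add_resource(Bucket("AssetsBucket", BucketName="images-assets-demo"))
# print(t.to_json())
```

Because the template is generated code, the same source produces the nearly identical Dev, Stage, and Prod environments the slide mentions.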
Automation: Continuous Integration/Continuous Delivery
● We do not deploy code by hand
● WESTPrime’s Bamboo CI/CD system:
○ watches commits to Stash code repo
○ builds code
○ runs unit and integration tests
○ creates deployment artifacts: tarballs sent to S3
○ restarts EC2 instances to run new code versions
● Immutable deployment:
○ Ansible provisions software, configs
○ instance lifetime: hours to days, not months
● Benefits:
○ faster dev cycle times
○ lower cost
○ repeatable
○ reliable
Benefits of Cloud-Native Architecture, Design, Practices
● Managed cloud services
● Robust multi-AZ infrastructure
● Autoscale minimizes cost, handles surge
● Separate web front-end from API backend
● Infrastructure as Code
● CI/CD build, test, and deploy automation
Commercial Cloud Service Benefits for NASA Developers
● Build things that are impossible in a datacenter or hosting provider.
● Accommodate failure in components without breaking your application.
● Define infrastructure in code to make changes reliable and easy.
● Don’t need to over-provision for worst case.
● Use managed services to save time.
● Experiment with little investment.