
Richard T. B. Ma School of Computing

National University of Singapore

Content Delivery Networks

CS 4226: Internet Architecture

Motivation

Serving web content from one location:
• scalability: the “flash crowd” problem
• reliability
• performance

Key ideas: cache content & serve requests from multiple servers at the network edge
• reduce demand on the site’s infrastructure
• provide faster service to users

Web cache and caching proxy

Replication and load balancing

The middle mile problem

The last-mile problem is largely solved by high levels of global broadband penetration, but that penetration raises a new problem of scale driven by demand.

The first mile (origin connectivity) is easy in terms of performance and reliability.

Traffic gets stuck in the middle.

Inside the Internet

[Figure: Internet topology showing Tier 1 ISPs, Tier 2 ISPs, IXPs, and large content distributors.]

The middle mile problem


Stuck in the middle; potential solutions:
• “big data center” CDNs
• highly distributed CDNs
• how about P2P?

The challenge

The “fat file paradox”: even though bits are transmitted at (nearly) the speed of light,
• the distance between user and server is critical
• latency and throughput are coupled due to TCP

Distance (server to user)    | Network RTT | Packet loss | Throughput                  | 4 GB DVD download time
Local: <100 mi.              | 1.6 ms      | 0.6%        | 44 Mbps (high-quality HDTV) | 12 min.
Regional: 500–1,000 mi.      | 16 ms       | 0.7%        | 4 Mbps (basic HDTV)         | 2.2 hrs.
Cross-continent: ~3,000 mi.  | 48 ms       | 1.0%        | 1 Mbps (TV)                 | 8.2 hrs.
Multi-continent: ~6,000 mi.  | 96 ms       | 1.4%        | 0.4 Mbps (poor)             | 20 hrs.
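To see why latency and loss translate into throughput like this, here is a rough sketch using the well-known Mathis et al. TCP throughput approximation; the MSS value is an assumption, and the results are ballpark figures, not the exact numbers in the table.

```python
# Rough TCP throughput estimate (Mathis et al. approximation):
#   throughput ~= MSS / (RTT * sqrt(loss))
# Illustrative only; assumes a 1460-byte MSS and the RTT/loss values
# from the table above, so the results are ballpark, not exact.
import math

MSS_BITS = 1460 * 8  # typical Ethernet-sized TCP segment

scenarios = [
    ("Local (<100 mi.)",          0.0016, 0.006),
    ("Regional (500-1,000 mi.)",  0.016,  0.007),
    ("Cross-continent (~3K mi.)", 0.048,  0.010),
    ("Multi-continent (~6K mi.)", 0.096,  0.014),
]

for name, rtt_s, loss in scenarios:
    tput_bps = MSS_BITS / (rtt_s * math.sqrt(loss))
    dvd_secs = 4 * 8e9 / tput_bps  # time to move a 4 GB file at that rate
    print(f"{name:28s} ~{tput_bps/1e6:7.1f} Mbps, 4 GB in ~{dvd_secs/3600:.1f} h")
```

The absolute numbers differ from the table, but the trend is the same: for a fixed loss rate, doubling the RTT roughly halves per-connection throughput.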

Major CDNs (by ’15 revenue)

• Akamai: $1.03B revenue, $700M of CDN
• Level 3: $8B revenue, $235M of CDN; tier-1 transit provider
• Amazon: $6B revenue, $1.8B of CDN, but a big % of that is storage; cloud provider
• Limelight: $174M revenue, $120M of CDN
• EdgeCast: $180M revenue, $125M of CDN
• Highwinds: $135M revenue, $95M of CDN
• ChinaCache: $270M revenue, $81M of CDN; also a cloud provider
• Fastly: $60M revenue, $9M of CDN
• Rest of the smaller regional CDNs (MaxCDN, CDN77, etc.): ~$100M combined

Reference

Cheng Huang, Angela Wang, Jin Li, and Keith W. Ross, “Measuring and Evaluating Large-Scale CDNs,” Internet Measurement Conference 2008.

Erik Nygren, Ramesh K. Sitaraman, and Jennifer Sun, “The Akamai Network: A Platform for High-Performance Internet Applications,” ACM SIGOPS Operating Systems Review 44(3), July 2010.

How can we understand a CDN?

We don’t know their internal structures, but we can “infer” them via a measurement approach.

We know that CDNs use a DNS trick. For example:
• the end user types www.youtube.com and resolves the IP address via the local DNS (LDNS) server
• the LDNS queries YouTube’s authoritative DNS
• YouTube uses a CDN if the reply is a CNAME like a1105.b.akamai.net or move.vo.llnwd.net
• the LDNS then queries the CNAME’s authoritative DNS server and gets the IP address of the content server
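A minimal sketch of that DNS trick, assuming the third-party dnspython (2.x) package is installed; the hostname is just an example and any CDN-hosted site would do.

```python
# Sketch: follow the CDN "DNS trick" by hand.
# Assumes the third-party `dnspython` package (pip install dnspython)
# and network access; the hostname is only an example.
import dns.resolver

hostname = "www.youtube.com"  # any CDN-hosted site

# Step 1: ask for a CNAME; a CDN-hosted name often aliases into the
# CDN's own domain (e.g. *.akamai.net or *.llnwd.net).
try:
    cname_answer = dns.resolver.resolve(hostname, "CNAME")
    cname = str(next(iter(cname_answer)).target).rstrip(".")
    print(f"{hostname} is an alias for {cname}")
except dns.resolver.NoAnswer:
    cname = hostname
    print(f"{hostname} returned no CNAME (may not be CDN-hosted)")

# Step 2: resolve the canonical name to the actual content servers.
for rdata in dns.resolver.resolve(cname, "A"):
    print("content server:", rdata.address)
```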

DNS records

DNS: a distributed database storing resource records (RRs)

RR format: (name, value, type, ttl)

• Type=A (Address): name is a hostname; value is its IP address
• Type=NS (Name Server): name is a domain (e.g., foo.com); value is the hostname of the authoritative name server for this domain
• Type=CNAME (Canonical NAME): name is an alias for some “canonical” (real) name, e.g., www.ibm.com is really servereast.backup2.ibm.com; value is the canonical name
• Type=MX (Mail eXchange): value is the name of the mail server associated with name
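A small sketch, again assuming dnspython, that prints records in the (name, value, type, ttl) form above; the queried names are only examples.

```python
# Sketch: print resource records in (name, value, type, ttl) form.
# Assumes the third-party `dnspython` package; queried names are examples.
import dns.resolver

queries = [
    ("example.com", "NS"),     # value = authoritative name server
    ("example.com", "A"),      # value = IP address
    ("www.ibm.com", "CNAME"),  # value = canonical name (may be absent)
    ("example.com", "MX"),     # value = mail server for the domain
]

for name, rtype in queries:
    try:
        answer = dns.resolver.resolve(name, rtype)
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        continue  # no record of this type for this name
    for rdata in answer:
        print((name, rdata.to_text(), rtype, answer.rrset.ttl))
```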

Content Server Assignment

The returned content server will be close to the issuing local DNS (LDNS) server

Measurement Framework

Assumptions:
• the CDN chooses a nearby content server based on the location of the LDNS that originates the query
• the same LDNS might get different content servers for the same query at different times

Method:
1. Determine all the CNAMEs of a CDN
2. Query a large number of LDNSs all over the world, at different times of the day, for all of the CNAMEs found in step 1
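A sketch of step 2, assuming dnspython; the resolver IPs and the CNAME are placeholders for the real lists of open LDNSs and CDN CNAMEs gathered in step 1.

```python
# Sketch of step 2: ask many different LDNSs for the same CDN CNAME and
# record which content-server IPs each one is directed to.
# Assumes `dnspython`; the resolver IPs and the CNAME are placeholders.
import dns.resolver

open_ldns = ["8.8.8.8", "1.1.1.1"]        # stand-ins for ~280K open resolvers
cdn_cname = "a1105.b.akamai.net"          # one CNAME found in step 1

for ldns_ip in open_ldns:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ldns_ip]
    resolver.lifetime = 3.0               # don't wait long on a dead resolver
    try:
        answer = resolver.resolve(cdn_cname, "A")
        ips = sorted(rdata.address for rdata in answer)
        print(f"{ldns_ip} -> {ips}")
    except Exception as exc:              # timeouts, SERVFAIL, refusals, ...
        print(f"{ldns_ip} -> failed ({exc})")
```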

Finding CNAMEs and LDNSs

Find all the CNAMEs of a CDN:
• use over 16 million web hostnames
• a DNS query tells whether a hostname resolves to a CNAME and whether that CNAME belongs to the target CDN
• thousands of CNAMEs found for Akamai and Limelight

Locate a large number of distributed LDNSs:
• need open recursive DNS servers
• use over 7 million unique client IP addresses and over 16 million web hostnames
• reverse DNS lookups and trial DNS queries to test them

Open recursive DNS servers

• many different DNS servers map to the same IP addresses
• 282,700 unique open recursive DNS servers obtained
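A sketch of how a candidate server could be tested for open recursion, again with dnspython; the candidate IPs and probe name are placeholders, and a careful test would use a name that can only be resolved recursively.

```python
# Sketch: test whether a candidate DNS server answers recursive queries
# from strangers (i.e., is an "open recursive" resolver).
# Assumes `dnspython`; candidate IPs and the probe name are placeholders.
# Note: a rigorous test uses a probe name the server is not authoritative for.
import dns.resolver

def is_open_recursive(ip, probe_name="example.com"):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    resolver.lifetime = 3.0
    try:
        resolver.resolve(probe_name, "A")   # needs recursion to succeed
        return True
    except Exception:
        return False                        # timeout, REFUSED, etc.

candidates = ["8.8.8.8", "192.0.2.1"]       # placeholder candidate IPs
for ip in candidates:
    print(ip, "open" if is_open_recursive(ip) else "not open / unreachable")
```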

Measurement Platform

300 PlanetLab nodes, each issuing 3 DNS queries per second; the measurement takes more than one day.

The Akamai Network

Type (a): returns 2 IP addresses, different for different locations; hundreds of IPs sit behind a CNAME; ~11,500 content servers in total.

Type (c): returns only 1 IP address; 20–100 IPs for each CNAME; the guess is that virtualization is used for isolated environments.

Type                  | # of CNAMEs | # of IPs        | Usage
(a) *.akamai.net      | 1964        | ~11,500         | conventional content distribution
(b) *.akadns.net      | 757         | a few per CNAME | load balancing for customers who have their own networks
(c) *.akamaiedge.net  | 539         | ~36,000         | dynamic content distribution / secure service

The Akamai Network

• ~27K content servers; ~6K also run DNS
• 60% in the US; 90% in the top 10 countries
• flat distribution across ISPs: only 15% in the top 7

The Limelight Network

Easier to measure, as Limelight is its own Autonomous System (AS):
• obtain the IP addresses of the AS
• only ~4K servers

Measuring performance

Two metrics:
• availability: how reliable are the CDN servers?
• delay: how fast can content be retrieved?

Performance results are controversial:
• do the metrics sufficiently match overall system performance goals?
• how does each performance metric map to a specific customer’s performance perception?
• both Akamai and Limelight issued statements to “correct” the research results

Availability

• monitor all servers for 2 months, pinging each once every hour
• if a server does not respond for 2 consecutive hours, it is considered “down”
• but does a “down” server necessarily affect availability?
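A sketch of that measurement rule: ping each server hourly and count it as down only after two consecutive missed responses. The probe uses Linux iputils ping flags and placeholder IPs; it is an illustration, not the study’s actual tooling.

```python
# Sketch: hourly ping probe; a server is counted "down" only after it
# fails to respond in 2 consecutive rounds, as in the study's definition.
# Assumes a Linux-style `ping` command; server IPs are placeholders.
import subprocess
import time

servers = ["192.0.2.10", "192.0.2.11"]      # placeholder content-server IPs
missed = {ip: 0 for ip in servers}

def responds(ip):
    # One ICMP echo with a 2-second wait (Linux iputils flags).
    result = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0

while True:
    for ip in servers:
        missed[ip] = 0 if responds(ip) else missed[ip] + 1
        if missed[ip] >= 2:
            print(f"{ip} considered DOWN ({missed[ip]} consecutive misses)")
    time.sleep(3600)                        # probe once per hour
```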

Delay

Possible reasons for the difference:
• the number of content servers?
• the optimality (for delay) of routing?

More detailed delay comparison

Akamai’s statement

Availability cannot be judged from server uptime alone.

Akamai’s CDN has more servers, but that does not necessarily make it harder to maintain.

The use of open resolvers misses many Akamai servers, hence over-estimates delay in Akamai’s case.

Akamaiedge is not a “virtualized network”

Limelight’s statement

Overall performance can’t be represented by just two dimensions (availability & delay)

Server downtime does not necessarily affect availability; Limelight suggested its own way to measure it and claimed availability in the 99.9% range.

The RTT of a single packet cannot represent the delay of retrieving objects; they suggest using different object sizes.

A more authoritative performance study should be based on customer trials.

Akamai vs. Limelight

                      | Akamai             | Limelight
# of servers          | ~27K               | ~4K
# of clusters         | 1158               | 18
95th-percentile delay | ~100 ms            | ~200 ms
average delay         | ~30 ms             | ~80 ms
penetration in ISPs   | high               | low
cost                  | high               | low
complexity            | high               | low
approach              | highly distributed | “big data center”

Facts about Akamai (2014-2015)

A CDN company that evolved from MIT research to invent better ways to deliver Internet content and tackle the “flash crowd” problem.

Earned over US$1B in revenue in 2015, about 25% of the whole CDN market.

Runs on 150,000 servers in 1,200 networks across 92 countries

Internet delivery challenge

• the largest network carries only about 5% of access traffic
• over 650 networks are needed to reach 90%
• a “long tail” distribution of traffic across networks

[Figure: % of access traffic from top networks]

Other challenges

• Peering-point congestion: little economic incentive to invest in the middle mile
• Inefficient routing protocols: how does BGP work?
• Unreliable networks: e.g., de-peering between ISPs
• Inefficient communication protocols
• Scalability
• Application limitations and slow rate of adoption

Delivery network as a virtual network

Works as an overlay: compatible with the existing Internet, transparent to users, adaptive to changes.

The untaken clean-slate approach: adoption problem, development cost.

The Akamai Network at ~2010

A large distributed system consisting of ~60,000 servers in ~1,000 networks across ~70 countries.

Can also be regarded as multiple delivery networks for different types of content: static web content, streaming media, dynamic applications.

Anatomy of Delivery Network

• edge servers: deployed globally in thousands of sites
• mapping system: assigns requests to edge servers using historical data and current system conditions

Anatomy of Delivery Network

• transport system: moves content from origin to edge; may cache data
• communication and control system: disseminates status and control messages, configuration updates

Anatomy of Delivery Network

• data collection and analysis: collects and processes data, e.g., logs; used for monitoring, analytics, billing, …
• management portal: gives customers visibility & fine-grained control; updates edge servers

System Design Principles

Goals:
• scalable and fast data collection & management
• safe, quick & consistent configuration updates
• enterprise visibility & fine-grained control

Assumption: a significant number of failures (machine, rack, cluster, connectivity, or network) is expected to be occurring at all times.

Philosophy: failures are normal and the delivery network must operate seamlessly despite them.

System Design Principles

• Design for reliability: ~100% end-to-end availability; full redundancy and fault-tolerance protocols
• Design for scalability: handle large volumes of traffic, data, control, …
• Limit the necessity for human management: automation is needed to scale and to respond to faults
• Design for performance: improve bottlenecks, response time, cache hit rate, resource utilization, and energy efficiency

Streaming and content delivery

Architectural considerations for cacheable web content and streaming media

Principle: minimize long-haul communication through the middle-mile bottleneck of the Internet; this is feasible with pervasive, distributed architectures where servers sit as “close” to users as possible.

Key question: how distributed does it need to be?

How distributed does it need to be?

Akamai’s approach: deploy server clusters not only in Tier 1 and Tier 2 data centers but also at network edges, in thousands of locations, at the cost of more complexity and expense.

Reasons:
• Internet traffic is highly fragmented, e.g., the top 45 networks only account for half of access traffic
• the distance between servers and users is the bottleneck for video throughput, due to TCP
• P2P is not good for management and control

Video-grade scalability

Content providers’ problem:
• YouTube receives 2 billion views per day
• high rates for video, e.g., 2–40 Mbps for HDTV
• need to scale with user requests
• high capital and operational costs to over-provision so as to absorb on-demand spikes

Akamai’s throughput: 3.45 Tbps in April 2010; ~50–100 Tbps of throughput now needed.

Akamai’s challenges: throughput must be considered along the entire path, and bottlenecks are everywhere:
• origin data centers, peering points, networks’ backhaul capacity, ISPs’ upstream connectivity
• a data center’s egress capacity has little impact on real throughput to end users
• even 50 well-provisioned, well-connected data centers cannot achieve ~100 Tbps
• IP-layer multicast does not work in practice; a CDN needs its own transport system
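A back-of-envelope calculation of that last point; the audience size and per-viewer rate below are assumptions for illustration, not figures from the lecture.

```python
# Back-of-envelope: why ~50 big data centers struggle to deliver ~100 Tbps
# of video with plain unicast. Audience size and per-viewer rate are
# assumptions for illustration, not figures from the lecture.
target_tbps = 100                  # aggregate throughput goal (from slide)
num_dcs     = 50                   # well-provisioned data centers
viewer_mbps = 2                    # low end of the 2-40 Mbps HD range

viewers        = target_tbps * 1e6 / viewer_mbps   # concurrent viewers
egress_per_dc  = target_tbps / num_dcs             # Tbps each DC must push
viewers_per_dc = viewers / num_dcs

print(f"{viewers/1e6:.0f}M concurrent viewers at {viewer_mbps} Mbps")
print(f"each DC needs ~{egress_per_dc:.0f} Tbps egress "
      f"and ~{viewers_per_dc/1e6:.1f}M simultaneous TCP flows")
print("and cross-continent TCP tops out near ~1 Mbps per flow (table above),")
print("so distance, not data-center egress, becomes the binding constraint")
```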

Transport system for content

Tiered content distribution:
• targets “cold” or infrequently accessed content
• an efficient cache strategy with high hit rates
• well-provisioned and highly connected “parent” clusters are utilized
• origin servers are offloaded in the high 90s (percent)
• helpful for flash crowds and large objects
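A toy sketch of the lookup order in tiered distribution (edge, then parent, then origin); the classes and cache policy are illustrative, not Akamai’s implementation.

```python
# Toy sketch of tiered distribution: an edge cluster checks its own cache,
# then a well-connected "parent" cluster, and only then the origin server.
# Classes and the fetch function are illustrative, not Akamai's design.

class Cluster:
    def __init__(self, name):
        self.name = name
        self.cache = {}               # url -> content

    def get(self, url):
        return self.cache.get(url)

    def put(self, url, content):
        self.cache[url] = content

def fetch_from_origin(url):
    return f"<content of {url}>"      # stand-in for an HTTP fetch

def serve(url, edge, parent):
    content = edge.get(url)
    if content is not None:
        return content, "edge hit"
    content = parent.get(url)
    if content is None:               # parent miss: go back to origin
        content = fetch_from_origin(url)
        parent.put(url, content)      # parent absorbs future misses
    edge.put(url, content)
    return content, "filled from parent/origin"

edge, parent = Cluster("edge-sg"), Cluster("parent-us")
print(serve("/videos/lecture07.mp4", edge, parent)[1])   # miss path
print(serve("/videos/lecture07.mp4", edge, parent)[1])   # edge hit
```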

Tiered distribution

Transport system for streaming

An overlay network for live streaming:
• once a stream is captured & encoded, it is sent to a cluster of servers called the entrypoint
• automatic failover among multiple entrypoints
• within an entrypoint cluster, distributed leader election is used to tolerate machine failures
• publish-subscribe (pub-sub) model: the entrypoint publishes its available streams, and each edge server subscribes to the streams that it requires
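A minimal in-process pub-sub sketch of that idea: the entrypoint publishes stream chunks and each edge server receives only the streams it subscribed to. Names and data are made up.

```python
# Minimal pub-sub sketch: the entrypoint publishes stream chunks and each
# edge server receives only the streams it subscribed to.
# In-process toy only; names and data are made up.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # stream id -> callbacks

    def subscribe(self, stream_id, callback):
        self.subscribers[stream_id].append(callback)

    def publish(self, stream_id, chunk):
        for deliver in self.subscribers[stream_id]:
            deliver(stream_id, chunk)

broker = Broker()

# Two edge servers subscribe only to the streams they currently need.
broker.subscribe("news-live", lambda s, c: print(f"edge-sg got {s}: {c}"))
broker.subscribe("sports-live", lambda s, c: print(f"edge-us got {s}: {c}"))

# The entrypoint publishes encoded chunks as they arrive.
broker.publish("news-live", "chunk-0001")
broker.publish("sports-live", "chunk-0001")
```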

Transport system for streaming

Reflectors act as intermediaries between the entrypoints and the edge clusters:
• scaling: enables rapidly replicating a stream to a large number of edge clusters to serve popular events
• quality: provides alternate paths between each entrypoint and edge cluster, enhancing end-to-end quality via path optimization
• can use multiple link-disjoint paths; needs efficient algorithms for path selection
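A toy sketch of choosing among candidate entrypoint → reflector → edge paths from per-hop latency and loss estimates; the metric and the numbers are made up, and a real system would also favor link-disjoint paths and refresh measurements continuously.

```python
# Toy sketch: pick the best entrypoint -> reflector -> edge path using
# per-hop latency and loss estimates. Metrics and numbers are made up.

# per-hop (latency_ms, loss_fraction) for entrypoint->reflector and
# reflector->edge, keyed by reflector id
hops = {
    "reflector-eu": ((40, 0.002), (60, 0.010)),
    "reflector-us": ((90, 0.001), (20, 0.001)),
    "reflector-as": ((30, 0.020), (80, 0.005)),
}

def path_score(hop_pair):
    # Lower is better: total latency, penalized by end-to-end loss.
    (l1, p1), (l2, p2) = hop_pair
    loss = 1 - (1 - p1) * (1 - p2)        # combined loss over two hops
    return (l1 + l2) * (1 + 100 * loss)   # arbitrary weighting for the sketch

best = min(hops, key=lambda r: path_score(hops[r]))
for r, hop_pair in hops.items():
    print(f"{r}: score {path_score(hop_pair):7.1f}")
print("selected path via", best)
```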

Application delivery network

Targets dynamic web applications and non-cacheable content.

Two complementary approaches:
• speed up long-haul communications by using the Akamai platform as a high-performance overlay network, i.e., the transport system
• push application logic from the origin server out to the edge of the Internet

Transport system for app acceleration

Path optimization:
• overcome BGP’s limitations by collecting topology & performance data from the mapping system
• dynamically select potential intermediate nodes for a particular path, or use multiple paths
• ~30–50% performance improvement from the overlay; also used for packet-loss reduction
• example: the Middle East cable cut in 2008

Transport system for app acceleration

Transport protocol optimizations:
• a proprietary transport-layer protocol
• pools of persistent connections to eliminate connection setup and teardown overhead
• optimal TCP window sizing with global knowledge
• intelligent retransmission after packet loss

Application optimizations:
• parse HTML and prefetch embedded content
• content compression reduces the number of round trips
• implement application logic at the edge, e.g., authentication
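A small sketch of the persistent-connection idea using Python’s standard http.client: repeated requests reuse an already-open (keep-alive) connection instead of paying TCP setup each time. The host is a stand-in for a real origin, and this is an illustration, not Akamai’s proprietary transport.

```python
# Sketch: a pool of persistent (keep-alive) HTTP connections to the origin,
# so repeated requests skip TCP connection setup/teardown.
# Standard library only; the host is a stand-in for a real origin.
import http.client

class ConnectionPool:
    def __init__(self, host, size=2):
        self.host = host
        self.idle = [http.client.HTTPSConnection(host, timeout=5)
                     for _ in range(size)]

    def get(self, path):
        if self.idle:
            conn = self.idle.pop()                    # reuse a live connection
        else:
            conn = http.client.HTTPSConnection(self.host, timeout=5)
        try:
            conn.request("GET", path)
            resp = conn.getresponse()
            body = resp.read()            # must drain before reuse
            self.idle.append(conn)        # return the connection to the pool
            return resp.status, len(body)
        except Exception:
            conn.close()                  # broken: don't reuse it
            raise

pool = ConnectionPool("example.com")
print(pool.get("/"))                      # first request: pays TCP/TLS setup
print(pool.get("/"))                      # reuses the same connection
```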

Distributing applications to the edge

EdgeComputing Services of Akamai

E.g., deploy and execute request-driven Java J2EE apps on Akamai’s edge servers

Not all apps can be run entirely on the edge

Some use cases: content aggregation/transformation, static databases, data collection, complex applications.

Platform components

Other platform components

Edge server platform

Mapping system

Communications and control system

Data collection and analysis system

Additional systems and services

Edge server platform

Functionalities controlled by metadata

origin server location and response to failures

cache control and indexing

access control

header alteration (HTTP)

EdgeComputing

performance optimization
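To make the list above concrete, here is a hypothetical per-customer metadata snippet; the structure and field names are invented for illustration and are not Akamai’s actual metadata format.

```python
# Hypothetical per-customer metadata, invented for illustration only
# (not Akamai's real format): it shows the kinds of per-site behavior
# the list above says metadata controls.
edge_metadata = {
    "customer": "example-site",
    "origin": {
        "host": "origin.example.com",
        "on_failure": "serve_stale",        # response to origin failures
    },
    "cache": {
        "default_ttl_seconds": 300,         # cache control
        "index_query_string": False,        # cache key / indexing behavior
    },
    "access_control": {
        "allowed_countries": ["SG", "US"],
    },
    "header_alteration": {
        "add": {"X-Edge": "on"},            # HTTP header rewriting
        "remove": ["Server"],
    },
    "edge_computing": {"enabled": False},
    "performance": {"compress": True, "prefetch_embedded": True},
}
```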

Mapping system

Global traffic director:
• uses historical and real-time data about the health of the Akamai network and the Internet
• objective: create maps that are used to direct traffic on the Akamai network in a reliable, efficient, and high-performance manner
• a fault-tolerant distributed platform: runs in multiple independent sites and leader-elects based on the current health status of each site
• two parts: scoring system + real-time mapping

Mapping system

Scoring system: creates a current picture of Internet topology
• collects/processes data: ping, BGP, traceroute
• frequently monitors latency, loss, and connectivity

Real-time mapping: creates the actual maps used to direct end users’ requests to the best edge servers
• also selects intermediates for tiered distribution and the overlay network
• first step: map to cluster, based on scoring-system information, updated every minute
• second step: map to server, based on content locality, load changes, etc.
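A toy sketch of the two-step mapping: first pick a cluster using the scoring system’s view, then pick a server inside it with a simple hash so the same URL tends to hit the same server (content locality). The scores, clusters, and hash rule are placeholders, not Akamai’s real maps.

```python
# Toy two-step mapping: (1) map the user's request to the best-scoring
# cluster, then (2) map it to a server inside that cluster so the same URL
# tends to land on the same server (content locality). Scores, clusters and
# the hash rule are placeholders, not Akamai's real maps.
import hashlib

# scoring-system view, refreshed about every minute in the real system
cluster_scores = {          # cluster -> (estimated latency ms, load 0..1)
    "sg-edge": (5, 0.70),
    "jp-edge": (40, 0.30),
    "us-edge": (180, 0.20),
}
cluster_servers = {
    "sg-edge": ["203.0.113.1", "203.0.113.2", "203.0.113.3"],
    "jp-edge": ["198.51.100.1", "198.51.100.2"],
    "us-edge": ["192.0.2.1", "192.0.2.2"],
}

def map_to_cluster():
    # Step 1: lowest latency, lightly penalized by load.
    return min(cluster_scores,
               key=lambda c: cluster_scores[c][0] * (1 + cluster_scores[c][1]))

def map_to_server(cluster, url):
    # Step 2: hash the URL so the same content lands on the same server.
    servers = cluster_servers[cluster]
    digest = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

cluster = map_to_cluster()
print(cluster, map_to_server(cluster, "/videos/lecture07.mp4"))
```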

Communications and control system

• Real-time distribution of status and control information: small real-time messages throughout the network; solution: a pub-sub model
• Point-to-point RPC and web services
• Dynamic configuration updates: quorum-based replication … another whole paper
• Key management infrastructure
• Software/machine configuration management

Data collection and analysis system

• Log collection: over 10 million HTTP hits/sec, 100 TB/day; compression, aggregation, pipelining, and filtering; used for reporting and billing
• Real-time data collection and monitoring: a distributed real-time relational database that supports SQL queries … another whole paper
• Analytics and reporting: enables customers to view traffic & performance; uses the log and query systems and, e.g., MapReduce