Characteristics of Streaming Media Stored on the Web Mingzhe Li, Mark Claypool, Robert Kinicki and James Nichols ACM Transactions on Internet Technology (TOIT) Vol. 5, No. 5, November 2005


Page 1: ppt

Characteristics of Streaming Media Stored on the Web

Mingzhe Li, Mark Claypool, Robert Kinicki and James Nichols

ACM Transactions on Internet Technology (TOIT)

Vol. 5, No. 5, November 2005

Page 2: ppt

Introduction (1 of 2)

• Improvements to Internet enable users to stream from Web browsers

– Across national and cultural boundaries

• Web users expect “point and click” to stream

• 2001, RealNetworks says 350,000 hours [1]

• 2002, CAIDA says streaming is significant fraction of traffic

– Going to increase with cellular networks

• Concern drives new protocols, routers, etc. to deal with traffic better

Page 3: ppt

Introduction (2 of 2)

• Much work characterizes streaming applications to better understand them

• Unfortunately, little shows what current streams stored on Web look like

• Previous study in 1997 [19]

– Looked at every video on the Web

– Found Internet could not support streaming

– RealPlayer and Media Player not yet created

• In 1985, papers by Ousterhout et al. [21] studied characteristics of files

– Fundamental in designing new file system

Need study of streaming media stored on the Web to help research today

Page 4: ppt

Investigation (1 of 2)

• What are the most popular streaming media products?

– Previous studies [12] show very different results

– Earlier, prevalence of MPEG, AVI, QuickTime made it difficult for newcomers

• What is the ratio of streaming audio versus streaming video?

– Audio has lower bitrate cap (voice, music) than video

– Can give current bitrate expectations

• Are media durations long-tailed?

– Long-tailed can contribute to self-similarity

– Self-similar traffic difficult to manage

Page 5: ppt

Investigation (2 of 2)

• What are typical streaming media target bitrates?

– Direct impact on network traffic

– Provides insight into frame resolution, frame rates, color depth

• What fractions of streaming codecs are being used?

– Codecs determine compression efficiency

– Knowledge of codec prevalence suggests how fast improvements are incorporated

Page 6: ppt

Focus

• Focus on commercial

– Big 3: Media Player, RealPlayer, QuickTime

• Other studies looked at server side or one client

– This study broader

• Have been p2p studies, but p2p not streamed (mostly)

– Instead downloaded, as in file transfer

• Build specialized crawler, crawl over 17 million URLs from different starting points, and analyze about 30 thousand clips

Page 7: ppt

Teasers

• Volume and relative amount increased since 1997

• Proprietary most prevalent

– RealPlayer 1st, Media Player 2nd

• Most clips short, with long-tailed duration

• Encoded at low-resolution, less than current monitors can handle

• Work useful for:

– Selecting clip workloads

– Generating streaming models

Page 8: ppt

Outline

• Introduction (done)

• Methodology

• Analysis

• Sampling Issues

• Conclusions

Page 9: ppt

Methodology (Mini-Outline)

• Media Crawler

• Starting Pages

• Measurement

Page 10: ppt

Media Crawler

• Modify Larbin Web crawler

• Recursively traverses URLs

– Avoid loops by caching previous

• Identify streaming media based on protocol type

– Ex: mms://, rtsp://

• Also examine HTTP extensions
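The crawler bullets above can be sketched in a few lines. This is a minimal illustration, not the modified Larbin crawler: it walks an in-memory link graph instead of fetching pages, uses an iterative queue rather than recursion, and keeps a set of seen URLs to avoid loops. The slide names only mms:// and rtsp://; the extension list and the example URLs are assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse

# Schemes named on the slide.
STREAMING_SCHEMES = {"mms", "rtsp"}
# Hypothetical extension list; the slide does not say which ones were examined.
MEDIA_EXTENSIONS = (".rm", ".ram", ".asf", ".wmv", ".mov", ".mp3")


def is_streaming_url(url):
    """Return True if the URL looks like streaming media."""
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    if scheme in STREAMING_SCHEMES:
        return True
    if scheme in ("http", "https"):
        return parsed.path.lower().endswith(MEDIA_EXTENSIONS)
    return False


def crawl(start, links, limit):
    """Traverse a link graph (dict: URL -> list of URLs), collecting media URLs."""
    seen = {start}           # cache of URLs already queued, to avoid loops
    queue = deque([start])
    media = []
    visited = 0
    while queue and visited < limit:
        url = queue.popleft()
        visited += 1
        if is_streaming_url(url):
            media.append(url)   # record media URLs, do not expand them
            continue
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return media


# Made-up link graph with a loop between the two pages.
links = {
    "http://a.example/": ["http://b.example/", "rtsp://a.example/clip"],
    "http://b.example/": [
        "http://a.example/",
        "http://b.example/song.mp3",
        "mms://b.example/v",
    ],
}
found = crawl("http://a.example/", links, limit=100)
print(found)
```

The `limit` argument plays the role of the per-starting-point URL budget; a real crawler would replace the `links` dictionary with page fetching and link extraction.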

Page 11: ppt

Starting Pages

• Wanted international and popular

• International – chose 10 most wired countries

– Allow for cross cultural analysis

– If Nielsen gave no additional info, chose domestic newspaper as starting point

• USA – chose 7 popular themes

– Allow for cross-content analysis

• Feb 13, 2003, crawl 1 million from each

– Took 4 to 24 hours, based on RTT

Page 12: ppt

Measurement of Content Characteristics

• Use specialized tools to access each media URL

– Collect: encoding, bitrate, duration, size, …

– Tools built from SDK, use player core

• RealNetworks:

– RealAnalyzer, TestPlay (could not do levels)

• Microsoft Media:

– Media Analyzer, Wmprop (could do levels)

• MPlayer

– Open source (could not do bitrate)

Page 13: ppt

Outline

• Introduction (done)

• Methodology (done)

• Analysis

– Aggregate analysis

– Commercial products: Video, Audio

– Codec

• Sampling Issues

• Conclusions

Page 14: ppt

Aggregate Analysis (1 of 3)

• Remove duplicates, giving about 11 million unique URLs

– About 54,000 were streaming

• In 1997, about 25 million URLs

– About 22,000 were streaming

• Extrapolating to today, about 15 million total

– Increase from 0.09% to 0.47%
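As a rough check of the percentages on this slide, the share of streaming URLs can be recomputed from the rounded counts given above. Because the slide's counts are approximate ("about"), the result for the current crawl comes out near, but not exactly at, the quoted 0.47%, which presumably used exact counts.

```python
def streaming_share(streaming_urls, total_urls):
    """Percentage of URLs that point at streaming media."""
    return 100.0 * streaming_urls / total_urls


share_1997 = streaming_share(22_000, 25_000_000)   # rounded counts from the slide
share_now = streaming_share(54_000, 11_000_000)    # rounded counts from the slide

print(f"1997 share:    {share_1997:.2f}%")   # 0.09%
print(f"current share: {share_now:.2f}%")    # 0.49% with the rounded counts
print(f"growth factor: {share_now / share_1997:.1f}x")
```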

Page 15: ppt

Aggregate Analysis (2 of 3)

Some “heavy hitters”, more so than typical Web servers

Page 16: ppt

Aggregate Analysis (3 of 3)

- Real almost ½ of all streaming content

- In 1997, MPEG, AVI, QuickTime were all, but now only 10% combined

- MP3 is most popular non-proprietary format

Page 17: ppt

Outline

• Introduction (done)

• Methodology (done)

• Analysis

– Aggregate analysis

– Commercial products: Video, Audio

– Codec

• Sampling Issues

• Conclusions

Page 18: ppt

Commercial Product Analysis

• Run custom tools on commercial

• Of original 39,000 only about 29,000 valid

– 50% “cannot find specified file”

– 25% “cannot connect to server”

– 10% “authorization failure”

• Can be from playlist

– But 97% only 1 clip

Page 19: ppt

Live versus Pre-Recorded

- Most pre-recorded

- 98% is pre-recorded, 2% live

Page 20: ppt

Percentage of Audio and Video

- More RealAudio than MP3 Audio

- Proportionally less WSM is audio

- Almost no QuickTime is audio

Page 21: ppt

Duration

- 1997, 90% only 45 seconds or less

- Still, today much shorter than T.V. show or movie

Page 22: ppt

Self-Similar Analysis (1 of 2)

Definitive test: Is tail flat?

Looks flat, but that is not good enough [31]

Page 23: ppt

Self-Similar Analysis (2 of 2)

• Measure curvature of tail (1/16th of distribution, others same)

– Curvature defined as 3-point estimate, take derivative

• Estimate Pareto (long-tailed) slope

– Used aest tool

• Generate 1000 samples from Pareto with the estimated slope

– Each sample has same number of points, n, as the original

– Calculate curvature of each sample tail, then the mean

• Calculate difference (d) between the sample mean and the original

• Count number out of 1000 that differ from the mean by at least d

– 495 (video) and 498 (audio), about ½

• Cannot reject null hypothesis: may be long-tailed
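The steps on this slide amount to a Monte Carlo curvature test, which can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the Pareto slope is passed in as `alpha` rather than estimated (the aest tool is not reproduced), the three-point curvature uses the first, middle, and last points of the tail on a log-log complementary CDF, and the "durations" are synthetic values, not crawl data. All function names are hypothetical.

```python
import math
import random


def pareto_sample(alpha, n, rng):
    """Draw n Pareto(alpha) values (minimum 1) by inverse-transform sampling."""
    return [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def tail_curvature(values, tail_fraction=1 / 16):
    """Three-point curvature of the log-log CCDF over the largest values.

    Assumes positive values without ties among the three chosen points.
    """
    xs = sorted(values)
    n = len(xs)
    k = max(3, int(n * tail_fraction))
    # Point i has (n - i) values at or above it, so the CCDF never reaches zero.
    pts = [(math.log(xs[i]), math.log((n - i) / n)) for i in range(n - k, n)]
    (x0, y0), (x1, y1), (x2, y2) = pts[0], pts[len(pts) // 2], pts[-1]
    slope_left = (y1 - y0) / (x1 - x0)
    slope_right = (y2 - y1) / (x2 - x1)
    # Change in slope per unit of log(x): a second-derivative estimate.
    return (slope_right - slope_left) / ((x2 - x0) / 2.0)


def curvature_test(observed, alpha, trials=1000, seed=0):
    """Fraction of Pareto samples whose tail curvature is at least as far
    from the sample mean as the observed curvature is."""
    rng = random.Random(seed)
    n = len(observed)
    observed_curv = tail_curvature(observed)
    simulated = [tail_curvature(pareto_sample(alpha, n, rng)) for _ in range(trials)]
    mean_curv = sum(simulated) / trials
    d = abs(observed_curv - mean_curv)
    extreme = sum(1 for c in simulated if abs(c - mean_curv) >= d)
    return extreme / trials


# Synthetic stand-in for measured clip durations (an assumption, not real data).
durations = pareto_sample(1.2, 2000, random.Random(42))
p = curvature_test(durations, alpha=1.2, trials=200)
print(f"fraction at least as extreme: {p:.3f}")
```

A fraction near ½, like the 495 and 498 out of 1000 reported on the slide, means the observed tail is unremarkable among Pareto samples, so the long-tailed hypothesis cannot be rejected; a very small fraction would argue against it.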

Page 24: ppt

Outline

• Introduction (done)

• Methodology (done)

• Analysis

– Aggregate analysis

– Commercial products: Video, Audio

– Codec

• Sampling Issues

• Conclusions

Page 25: ppt

Video Encoded Bitrate

In 1997, 1% stream for modem, 50% for broadband, 20% for T1+

- Said, modem could not support streaming

Note, today, broadband still not targeted

Page 26: ppt

Streams Encoded Per Clip

Media Scaling will be difficult!

Note, earlier study [15] found Real at 65%

Audio is one stream

Page 27: ppt

Aspect Ratios

Very uniform, but a few odd-balls

30% above or below

Take product for size (next)

Page 28: ppt

Video Resolution

- Most much smaller than typical monitors (1024 x 768 would be 786,432 pixels)

- Room to grow!
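The "product for size" comparison can be made concrete with a small calculation. The 1024 x 768 monitor figure comes from the slide; the clip sizes below are common resolutions chosen purely for illustration, not values reported by the study.

```python
MONITOR = (1024, 768)  # monitor size quoted on the slide


def pixel_count(width, height):
    """Frame size as the product of width and height."""
    return width * height


def share_of_monitor(width, height, monitor=MONITOR):
    """Fraction of the monitor's pixels that a clip of this size covers."""
    return pixel_count(width, height) / pixel_count(*monitor)


# Hypothetical clip resolutions, for illustration only.
for w, h in [(160, 120), (320, 240), (640, 480)]:
    share = 100.0 * share_of_monitor(w, h)
    print(f"{w} x {h}: {pixel_count(w, h):,} pixels, {share:.1f}% of monitor")
```

Under these assumed sizes, even a 640 x 480 clip fills well under half of a 1024 x 768 screen, which is the sense in which there is room to grow.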

Page 29: ppt

Outline

• Introduction (done)

• Methodology (done)

• Analysis

– Aggregate analysis

– Commercial products: Video, Audio

– Codec

• Sampling Issues

• Conclusions

Page 30: ppt

Audio Encoded Bitrates

- Most for modems, but 10% for broadband

- In 1999, 100% found for modems

- Will likely increase (MP3 128 kbps), but with a cap

Page 31: ppt

Video Codecs

v8 buffers differently than v9

- Newest versions, v9, still not deployed much

- Useful as snapshot in time

Page 32: ppt

Outline

• Introduction (done)

• Methodology (done)

• Analysis (done)

• Sampling Issues

• Conclusions

Page 33: ppt

Sampling Issues

• In 1997, could analyze all on Web

• Today, impractical

– Would take 16 years to crawl and analyze clips

• Is 17 million large “enough” sample?

– Is it possible to obtain same results with fewer starting points?

– Is it possible to obtain same results with fewer than 1 million URLs per starting point?

– How does sampling affect distributions?

– How does choice of starting point affect distribution?

Page 34: ppt

Percentage of Media versus URLs

- Took 200k from each, build set

- Overall, above 400k from each is stable

- ½ million
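One way to picture this stability check is to track the media percentage as more URLs from a crawl are included and see where it stops moving. The sketch below assumes each crawled URL has already been flagged as media or not; the flags here are a made-up pattern, and the function name is hypothetical rather than taken from the study.

```python
def media_share_at_checkpoints(is_media_flags, step):
    """Return (urls_seen, percent_media) after every `step` URLs in crawl order."""
    shares = []
    media_so_far = 0
    for i, flag in enumerate(is_media_flags, start=1):
        media_so_far += 1 if flag else 0
        if i % step == 0:
            shares.append((i, 100.0 * media_so_far / i))
    return shares


# Made-up flags: one media URL in every four, repeated five times.
flags = [False, False, False, True] * 5
checkpoints = media_share_at_checkpoints(flags, step=4)
for seen, percent in checkpoints:
    print(f"{seen:>3} URLs: {percent:.1f}% media")
```

With real crawl data the early checkpoints would fluctuate; the slide's observation is that the percentage settles once enough URLs from each starting point are included.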

Page 35: ppt

Duration of Video for Number of URLs

Can get away with far fewer and have same distribution of durations

Page 36: ppt

Media Type versus Starting Points

9 Starting points sufficient

Page 37: ppt

Duration for Number of Starting Points

Page 38: ppt

Media Type in USA versus International

- International similar

- May be because cross-cultural Web

Page 39: ppt

Duration for USA and Non-USA

Page 40: ppt

Summary

• Many researchers worry about volume increase of Video

• Video characterizations based on old data

• Current data on media stored on Web

• Crawled 17 million URLs, analyzed 30k clips

Page 41: ppt

Conclusions

• Streaming media increased 600% in past 5 years

• Real Media 1st, Microsoft Media 2nd

• Audio and video about equal

• Vast majority pre-recorded (not live)

• Most targets still for modem

• Potential to be large since monitor resolutions much larger than video

Page 42: ppt

Future Work?

Page 43: ppt

Future Work

• Correlate to actual data streamed

• Congestion responsiveness

• P2P

• Future study (now ~5 years old!)