Testing Real-Time IP Video Communication at Skype
2012 © Skype.
Nordic Testing Days 2012
Today
Video on Skype
Video Quality
Objective Quality. End-to-End
Objective Quality. Local Loopback
Measuring Subjective Quality
Driving Product Development
Conclusions
Video on Skype
Antonio Campos Domínguez
Video for the masses
Video, from Latin “I see”
247 M connected users per month
40 M concurrent users
Video on many devices
More than 50% of the calls have video
Skype-to-Skype video
Group video calling
Screen sharing
Video library inside Skype software
Now an independent component
Skype video team
20 developers and 10 testers (and growing) organized in scrum teams (3-6 people)
QE together with development in scrums with a common goal
• Tallinn: Streaming, some video Platforms
• Stockholm: Processing, other Video Platforms
Video releases every 2 months (4 sprints) to all consumers
Critical bugfixes backported to released branches
• Development continues in trunk
Quality?
Skype is a piece of software, and it is tested as such
• You name it: unit testing, code coverage, and all the usual tools
However, video is uniquely hard to get right
• That is what we cover today
Starting from one big question…
How good does our video look?
Being more specific…
Does it work at all?
Is it backwards compatible?
Does interoperability work?
Is video worse in the latest build?
Are audio and video in sync?
What is the compromise between video, audio and battery life?
It’s complicated
Video quality measurement is a complex task because it requires:
1. understanding how human perception (eyes + brain) works
2. translating that knowledge into algorithms and experiments
Subjective vs. objective
Objective: machine-measurable video characteristics
Subjective: what ultimately matters to our end-user experience
Between the two: difficult to automate or arguable
Objective                   | Subjective
Automatable                 | Humans needed
Detailed by feature         | Aggregated full experience
Indirect factors of quality | Real-use quality measure
Objective measures
Resolution: 160x120, 720p, full HD…
Framerate: e.g. 30 fps (cinema: 24 fps)
• The easiest metrics to understand, but not the whole story
• High resolution and framerate can still look ugly
Metrics: frame fidelity, blurriness, blockiness, sharpness
• Harder to specify and measure
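One of the frame-fidelity metrics hinted at here is PSNR (peak signal-to-noise ratio), which compares a decoded frame against the original. A minimal pure-Python sketch, not Skype's actual tooling, with frames modeled as flat lists of 8-bit pixel values:

```python
import math

def psnr(reference, decoded, max_value=255):
    """Peak signal-to-noise ratio between two equally sized 8-bit frames,
    given as flat lists of pixel values. Higher is better."""
    if len(reference) != len(decoded):
        raise ValueError("frames must have the same size")
    # Mean squared error between corresponding pixels
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)

# A frame that is off by one in every pixel has MSE = 1 -> ~48.13 dB
print(round(psnr([100, 120, 140], [101, 121, 141]), 2))  # 48.13
```

In practice a metric like this is computed per frame over a whole call and aggregated, which is what makes it automatable.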
Example: frame quality depends on codec
[Example frames encoded with VC-1, H264 and VP6]
In the end, a compromise
CPU, memory, network: limited resources
One cannot offer simultaneously:
• largest resolution
• fastest framerate
• best quality
Metric spectrum
Fully objective | Rather objective | Rather subjective | Fully subjective
Framerate       | Color            | Sharpness         | Mean opinion scores
Resolution      | Synchronicity    | Smoothness        | Overall experience
Aspect ratio    | Freezes          | Flickering        |
Delay           | Artifacts        |                   |
Our tools – bottom / top approach
ACS: end-to-end testing with a real client on a real network
SkyTest: integration testing; a cross-team tool in Python
CI / NFR: Lua scripts with a local loopback
CI: unit testing in C
Our tools – don’t reinvent the wheel
Testing frameworks:
• QuickBuild (build system)
• TMT (reporting system)
CI system / NFR system / SkyTest / ACS / DTS / Nuclear
All these frameworks are built around backend tools provided by a dedicated CI team.
That includes: databases, log storage, server health monitoring, system updates, hardware maintenance, . . .
Objective Testing: End-to-End Testing
Marios Mpasoukos
e2e video testing
UI-level testing of the video library and its integration
Uses a distributed system for automating the calls and distributing the testables
The system requires:
• Agent controlling the UI or browser
• Client logging
• Optional: custom parsers, reporting
Covers Skype-to-Skype video, screen sharing and group video calling
ACS flow
1. Release testing
Thousands of calls run to verify stability
• Testing new Video Library release
• Testing new features: codec, network configuration keys, etc.
Build performance and crash information:
• Log dump destacking
• Bugsense, Watson, HockeyApp
2. Daily regression testing
Reporting System leverages Distributed System + custom parsers
Sanity Check     | Features         | Quality
Video Start/Stop | HD Video         | Resolution, Framerate
Platform Interop | Screen Sharing   | Response to Network Constraints
Device Switching | Multiparty Video | Response to CPU Constraints
Objective Testing: Local Loopback (CI / NFR)
Pierre Gronlier
CI system
Continuous Integration means that:
• every 10 minutes, a script checks for new commits on the video trunk/ or branches/
• it triggers a list of short tests; each test lasts around 30 seconds
• a report is written to a database and the results are aggregated on a web page
At night, a list of longer tests is executed.
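The polling loop described above can be sketched as follows; the revision bookkeeping is illustrative (no real version-control calls, path names are hypothetical):

```python
POLL_INTERVAL = 600  # seconds: the script wakes up every 10 minutes

def changed_paths(last_seen, current):
    """Return the watched paths (trunk/, branches/...) whose revision
    advanced since the previous poll."""
    return sorted(p for p, rev in current.items()
                  if rev > last_seen.get(p, -1))

def poll_once(last_seen, current, trigger):
    """One CI iteration: fire `trigger` (the short-test run) for every
    path with new commits, then remember the revisions we have seen."""
    for path in changed_paths(last_seen, current):
        trigger(path, current[path])
    last_seen.update(current)

# First poll sees both paths as new; second poll only the branch moved.
runs, seen = [], {}
poll_once(seen, {"trunk": 101, "branches/v5": 90},
          lambda p, r: runs.append((p, r)))
poll_once(seen, {"trunk": 101, "branches/v5": 91},
          lambda p, r: runs.append((p, r)))
print(runs)  # [('branches/v5', 90), ('trunk', 101), ('branches/v5', 91)]
```

The real system would fetch `current` from the repository and hand each triggered path to the build-deploy-run pipeline.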
CI system
Build → Deploy → Run → Analyze → Report
Examples of tests:
- Start call, add participant, stop call
- Start call, add participant, add packet loss, detect recovery
- Start call, disconnect camera
CI system
Make it Visible!
Code coverage
• Run with the CI tests
• Specific tools report code coverage numbers for:
• Statement or line coverage
• Decision or branch coverage
• Condition coverage
• Function coverage
Memory debugger
• Run with the CI tests
• Specific tools report:
• Memory leaks
• Array boundary access errors
• Uninitialized memory read access
• Unallocated memory write access
CI long test
Check:
• Bitrate adaptation
• Framerate
• Jitter
Execute:
• Deploy from QuickBuild
• Change network conditions
Analyze:
• Logs
• Stats
Display:
• Plots
• Alerts
Real time log and stats analysis of a video call in a local loopback.
Tool that allows developers to verify how well a new algorithm performs.
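A check such as "bitrate adaptation" can be phrased as an assertion over the stats stream. The (cap, sent) sample format and the grace-period rule below are illustrative assumptions, not the actual tool's logic:

```python
def bitrate_adapts(samples, settle=3):
    """Check that the send bitrate respects the network cap, allowing
    `settle` samples of grace after each cap change so the encoder has
    time to ramp down. `samples` is a chronological list of
    (cap_kbps, sent_kbps) pairs read from the call stats."""
    grace = 0
    prev_cap = None
    for cap, sent in samples:
        if cap != prev_cap:
            grace = settle          # cap changed: start the grace window
            prev_cap = cap
        if grace > 0:
            grace -= 1              # still settling, don't judge yet
        elif sent > cap:
            return False            # settled, but still over the cap
    return True

# Cap drops from 1000 to 400 kbps; the encoder ramps down within 3 samples
trace = [(1000, 900), (400, 900), (400, 700), (400, 380), (400, 390)]
print(bitrate_adapts(trace))  # True
```

The long-test analyzer would evaluate checks like this continuously and raise an alert (or plot the violation) when one fails.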
CI long test
An 8-hour call!
Non functional requirements (NFR)
Functional vs. Non-Functional
the video works = we see something
vs.
the video has good quality = we enjoy our video call
NFR – Key performance indicators
List of KPIs:
• resolution and frame rate
• bitrate
• dropped frames and freeze durations
• frame-quality
List of usecases:
• for every codec
• for every media protocol version
• 1-to-1 call and Group Video Calling
• software encoding vs. hardware encoding
• for different network conditions
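Crossing the KPIs with the use cases gives a combinatorial test matrix. A sketch using the codec names from the slides; the other dimension values are hypothetical labels:

```python
from itertools import product

CODECS = ["VP6", "H264", "VC-1"]             # codecs named in the slides
PROTOCOLS = ["media-v1", "media-v2"]         # hypothetical protocol versions
CALL_TYPES = ["1-to-1", "group"]
ENCODERS = ["software", "hardware"]
NETWORKS = ["good", "constrained", "lossy"]  # hypothetical network profiles

def nfr_matrix():
    """Every (codec, protocol, call type, encoder, network) combination
    a full NFR run would have to cover."""
    return list(product(CODECS, PROTOCOLS, CALL_TYPES, ENCODERS, NETWORKS))

print(len(nfr_matrix()))  # 3 * 2 * 2 * 2 * 3 = 72 use cases
```

Even with small dimensions the matrix grows fast, which is one reason the runs are fully automated.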
NFR – Pass / fail vs. score
KPI           | Functional (pass / fail)  | Non-functional (0% → 100%)
resolution    | ≠ 0x0                     | max = VGA
framerate     | ≠ 0                       | max = 15 fps
bitrate       | within [20..5000] kbps    | 350 kbps ± 10%
frame quality | frame exists              | PSNR or SSIM values
Everything is automated using stats and feedback values from the Video Library
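The pass/fail vs. score split can be illustrated in code. The [20..5000] kbps sanity window and the 350 kbps ± 10% target come from the table above; the linear falloff outside the window is an illustrative assumption, not Skype's actual scoring curve:

```python
def functional_bitrate(kbps):
    """Functional check: any bitrate in the sane [20, 5000] kbps window passes."""
    return 20 <= kbps <= 5000

def nfr_bitrate_score(kbps, target=350, tolerance=0.10):
    """Non-functional score, 0..100%: full marks inside target ± 10%,
    then a linear falloff with distance from the window (illustrative)."""
    lo, hi = target * (1 - tolerance), target * (1 + tolerance)
    if lo <= kbps <= hi:
        return 100.0
    distance = (lo - kbps) if kbps < lo else (kbps - hi)
    return max(0.0, 100.0 - 100.0 * distance / target)

# 200 kbps: the call works (functional pass) but underperforms (reduced score)
print(functional_bitrate(200))           # True
print(round(nfr_bitrate_score(200), 1))  # 67.1
print(nfr_bitrate_score(350))            # 100.0
```

The same pattern generalizes to the other KPIs: a boolean gate for "does it work" and a percentage for "how well".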
NFR – How to evaluate the best available quality?
The best quality of a call is given by:

    optimal settings = gcf(sender, receiver)

with

    sender   = gcf(max Encoding power, max Network, max Camera)
    receiver = gcf(max Decoding power, max Network, max Screen)

where gcf = greatest common factor.
NFR – How to evaluate the best available quality?
where, with some simplifications,

    Encoding power = f1(CPU power, Power supply mode, Codec perf.)
    Network        = f2(Bandwidth, RTT, Relay/P2P)
    Camera         = f3(Resolution, Framerate)
    Decoding power = f4(CPU power, Power supply mode, Codec perf.)
    Screen         = f5(Resolution)
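Operationally, the "greatest common factor" behaves like a minimum: each side can only do as well as its weakest resource allows, and the call can only do as well as the weaker side. An illustrative sketch that collapses each resource to a single capability score (the numbers are hypothetical, not real measurements):

```python
def side_capability(codec_power, network, io_device):
    """A side (sender or receiver) is limited by its weakest resource:
    encode/decode power, network throughput, or camera/screen."""
    return min(codec_power, network, io_device)

def optimal_settings(sender, receiver):
    """The call runs at the best quality both sides can sustain."""
    return min(sender, receiver)

# Hypothetical scores: the sender is camera-limited (60), the receiver is
# network-limited (50); the call settles on the lower of the two.
sender = side_capability(codec_power=90, network=80, io_device=60)
receiver = side_capability(codec_power=95, network=50, io_device=100)
print(optimal_settings(sender, receiver))  # 50
```

Real adaptation is multidimensional (resolution, framerate and bitrate trade off against each other), but the min-across-constraints shape of the problem is the same.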
NFR – Compare across revision / platform
NFR – Scores
Summary
• CI: unit testing, code coverage and memory leak checks
• CI long tests
• NFR
Subjective Quality
Oksana Dementsova
Compared subjective test strategies
1. Video lab experiments
2. Real world feedback
Video lab experiments
For feature development
E.g. is it worth applying error concealment? And how much?
For feature tuning
E.g. what is a good trade-off between audio quality, temporal and spatial resolution
on mobile devices?
SAMVIQ methodology
SAMVIQ (Subjective Assessment Methodology for Video Quality, ITU-R BT.1788).
• Controlled setup and lighting
• Each scene has 5 videos: Explicit reference, 3 test signals, Hidden reference
• Randomized test signals
• Observer gives scores: 0 to 100 (from bad to excellent)
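The key design of a SAMVIQ trial is that the observer sees one labeled reference plus shuffled, unlabeled versions of the same scene, one of which is a hidden copy of the reference. A sketch of assembling one trial; the clip names and labels are hypothetical:

```python
import random

def samviq_trial(scene, test_signals, seed=None):
    """Build one SAMVIQ presentation for `scene`: the explicit reference
    is shown labeled; the test signals plus a hidden copy of the
    reference are shuffled so the observer cannot tell which is which."""
    rng = random.Random(seed)
    unlabeled = list(test_signals) + [("hidden_reference", scene)]
    rng.shuffle(unlabeled)
    return {"explicit_reference": scene, "unlabeled": unlabeled}

trial = samviq_trial("head_and_shoulders.yuv",
                     [("codec_A", "a.yuv"), ("codec_B", "b.yuv"),
                      ("codec_C", "c.yuv")], seed=42)
print(len(trial["unlabeled"]))  # 4 unlabeled clips: 3 test signals + hidden ref
```

Scoring the hidden reference near 100 is also a built-in sanity check on the observer.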
Differentiation for results analysis
Content:
• Scenes, e.g. static background, head and shoulders, dynamic scene
• Source, e.g. static camera indoors, mobile camera outdoors, etc.
Experience of the subject:
• Non-experts, audio experts, video experts, experts in both audio and video
Lab testing results
What do users think about subjective quality?
• Which codec performs better?
• Does the new feature improve call quality?
Real world feedback
• Mean Opinion Score (MOS): the average user rating on a scale from 1 to 5
• Statistics collected from the call, e.g. version, network type, call stats
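MOS aggregation itself is simple; a sketch using the 1-to-5 scale from the slide:

```python
def mean_opinion_score(ratings):
    """Average of user call ratings on the 1..5 scale."""
    if not ratings:
        raise ValueError("no ratings collected")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return sum(ratings) / len(ratings)

print(mean_opinion_score([5, 4, 4, 3, 5]))  # 4.2
```

The value of MOS comes from slicing it by the collected call statistics (version, network type, platform) rather than from the arithmetic.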
Live testing results
Audio – Wideband audio (SILK) yields longer calls and better call scores
Comparison of subjective tests
Lab tests               | Live feedback
Controlled environment  | Uncontrolled environment
Non-released features   | Only public clients
Fast turnaround         | Slower rollout of new configurations
Isolated features       | Whole call is graded
Abstracted environment  | Real-world measure
Limited number of tests | Millions of observers
Driving Product Development
Marios Mpasoukos
Assessing video KPIs
Functional Requirements
Bitrate, frame rate, resolution
Non-functional Requirements
“Non-functional requirements define the quality and performance expected from the platform references.”
Right proportions and orientation, no video artifacts, frame delay or freezes under the given network conditions; bitrate, framerate and resolution reaching expected values, also after stretching the CPU or network.
Receiving feedback
• From video product managers
• Customer support
• Following the Marketing team’s activities and customer surveys
• Collaborating with UX and UI teams
• Working closely with Data Analytics Team, collecting and evaluating
MOS Scores and User Feedback (from the pop-up window)
Drive the decision making (1)
Not only about answering the question: “Are we releasing the feature or not?”
But it’s also about working closely with…
1. Developers, during the development and release stages
   • Bug fixing and verification
   • Delivering quality features
2. Product Owners
   • Supporting new hardware delivered by OEMs and partners
   • Supporting HD (720p) video calling on Mac and 1080p video calling on Windows
Drive the decision making (2)
and…
3. Marketing
1. Mac desktop – HW requirements for HD calling (720p)
2. Win desktop – HW requirements for HD calling (1080p)
3. iOS stabilization – check the promo ad
Conclusion
Antonio Campos Domínguez
Measuring video quality
Video software’s quality metrics are complex and diverse
• A balance of parameters is hard to achieve
• Testing happens at many levels
Objective measures: full automation supports development
• e2e integration level: end user cases, but noisy
• Lower integration levels: more abstract, but cheap and clear
Applying video quality
Subjective measures: what really matters to users
• Lab experiments: feature tuning, controlled
• Users feedback: real experience, large scale
Video QE is responsible for
• implementing all this, interpreting the results and combining them with other teams’ input
• … to help deliver the video calling experience of tomorrow
Q&A
References
SILK and call duration: http://blogs.skype.com/en/2010/09/the_power_of_silk.html