
Automated Testing of Massively Multi-Player Games

Lessons Learned from The Sims Online

Larry Mellon

Spring 2003

Context: What Is Automated Testing?

Classes Of Testing

• System Stress
• Load
• Random Input
• Feature Regression

These span the range from Developer-facing to QA-facing testing.

Automation Components

• Startup & Control
• Repeatable, Synchronized Test Inputs
• Collection & Analysis

All three wrap around the System Under Test; a sketch follows.
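As a concrete (if much simplified) illustration, here is a minimal Python sketch of the three components wrapped around a system under test; the class and method names are hypothetical, not TSO's actual harness:

import subprocess

class Harness:
    def start(self, cmd):
        # Startup & Control: launch the system under test as a child process.
        self.proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE, text=True)

    def send(self, line):
        # Repeatable, synchronized inputs: scripted commands, one at a time.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()

    def collect(self):
        # Collection & Analysis: gather output for pass/fail evaluation.
        out, _ = self.proc.communicate()
        return out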

What Was Not Automated?

Startup & Control, repeatable synchronized inputs, and results analysis were all automated; validating the Visual Effects was not.

Talk Outline: Automated Testing (60 Minutes)

• Design & Initial Implementation: Architecture, Scripting Tests, Test Client, Initial Results (1/3)
• Fielding: Analysis & Adaptations (1/3)
• Wrap-up & Questions: What worked best, what didn't; Tabula Rasa: MMP / SPG (1/3)

Design Constraints

• Load and Regression testing both demand automation (repeatable, synchronized input) and data management.
• A high churn rate demands strong abstraction between the tests and the test client.

These constraints drove a single, data-driven Test Client:

• One client serves both Regression and Load
• Single API
• Reusable Scripts & Data

Data-Driven Test Client

• Single API, reusable scripts & data
• Configurable logs & metrics
• Regression: key game states, pass/fail ("testing feature correctness")
• Load: responsiveness ("testing system performance")
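A minimal sketch of that idea, assuming a hypothetical CONFIG and plain callables for actions: the same client code serves both test classes, and only data decides which measurements are taken:

import time

CONFIG = {"mode": "load"}  # or "regression": same client, different data

def run_action(action):
    # One API for both test classes; configuration decides what gets logged.
    start = time.monotonic()
    ok = action()
    elapsed = time.monotonic() - start
    if CONFIG["mode"] == "regression":
        # Feature correctness: key game states, pass/fail.
        print("PASS" if ok else "FAIL", action.__name__)
    else:
        # System performance: responsiveness only.
        print(f"{action.__name__} took {elapsed:.3f}s")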

Problem: Testing Accuracy

• Load & Regression: inputs must be
  – Accurate
  – Repeatable
• Churn rate: logic/data in constant motion
  – How to keep the test client accurate?
• Solution: game client becomes test client
  – Exact mimicry
  – Lower maintenance costs

Test Client == Game Client

Both clients share the same client-side game logic behind the Presentation Layer, exchanging the same commands and state. The only difference: the test client drives the logic with Test Control where the game client uses the Game GUI.

Game Client: How Much To Keep?

The game client splits into View and Logic, joined at the Presentation Layer. How much of it belongs in the test client?

What Level To Test At? Mouse Clicks

Driving the full game client through raw mouse clicks:
• Regression: too brittle (pixel shift)
• Load: too bulky

What Level To Test At? Internal Events

Driving the client through internal events, beneath the view:
• Regression: still too brittle (churn rate vs. logic & data)

Gameplay: Semantic Abstractions

The Presentation Layer exposes gameplay verbs: Buy Lot, Enter Lot, Buy Object, Use Object, ... The NullView client and the View client share this layer; roughly ¾ of the client (logic plus presentation layer) runs identically in both, with the view making up the remaining ¼.

Basic gameplay changes less frequently than UI or protocol implementations.
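In code terms, the Presentation Layer amounts to a small semantic verb API that the GUI, the scripts, and the test client all call. A hypothetical Python sketch (names illustrative, not TSO's actual interface):

class PresentationLayer:
    # Semantic gameplay verbs: no pixels, no protocol details.
    def __init__(self, game_logic):
        self.logic = game_logic

    def buy_lot(self, lot_id):
        return self.logic.send("buy_lot", lot_id)

    def enter_lot(self, lot_id):
        return self.logic.send("enter_lot", lot_id)

    def buy_object(self, kind, x, y):
        return self.logic.send("buy_object", kind, x, y)

    def use_object(self, obj, verb):
        return self.logic.send("use_object", obj, verb)

Because the GUI and the test scripts call the same verbs, tests written at this level survive UI and protocol churn.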

Scriptable User Play Sessions

• SimScript
  – Collection: Presentation Layer "primitives"
  – Synchronization: wait_until, remote_command
  – State probes: arbitrary game state (avatar's body skill, lamp on/off, ...)
• Test Scripts: specific, ordered inputs
  – Single-user play sessions
  – Multi-user play sessions

Scriptable User Play Sessions

• Scriptable play sessions: big win
  – Load: tunable based on actual play
  – Regression: constantly repeat hundreds of play sessions, validating correctness
• Gameplay semantics: very stable
  – UI / protocols shifted constantly
  – Gameplay remained (about) the same

SimScript: Abstract User Actions

include_script setup_for_test.txt
enter_lot $alpha_chimp
wait_until game_state inlot
chat I'm an Alpha Chimp, in a Lot.
log_message Testing object purchase.
log_objects
buy_object chair 10 10
log_objects

SimScript: Control & Sync

# Have a remote client use the chair
remote_cmd $monkey_bot use_object chair sit
set_data avatar reading_skill 80
set_data book unlock
use_object book read
wait_until avatar reading_skill 100
set_recording on

Client Implementation

Composable Client

• Event Generators: Scripts, Cheat Console, GUI
• Viewing Systems: Console, Lurker, GUI
• Core: Game Logic + Presentation Layer

Any / all components may be loaded per instance.
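A sketch of per-instance composition, with hypothetical component names: each client instance loads only the event generators and viewing systems its role needs, so a load client can run headless while a tester's client keeps the GUI:

class Component:
    def __init__(self, client):
        self.client = client

class ScriptRunner(Component): pass      # event generator
class CheatConsole(Component): pass      # event generator
class GuiView(Component): pass           # viewing system
class LurkerView(Component): pass        # viewing system

class GameClient:
    # Core logic + presentation layer; components attach around it.
    def __init__(self):
        self.components = []
    def attach(self, component):
        self.components.append(component)

def build_client(component_classes):
    client = GameClient()
    for cls in component_classes:
        client.attach(cls(client))
    return client

nullview_load_client = build_client([ScriptRunner])             # no view at all
tester_client = build_client([ScriptRunner, CheatConsole, GuiView])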

Lesson: View & Logic Entangled

In the original game client, View and Logic offered few clean separation points.

Solution: Refactored for Isolation

The client was refactored so the View sits cleanly above the Logic and Presentation Layer, letting the Logic run with no View attached.

Lesson: NullView Debugging

Without the (legacy) view system attached, tracing was "difficult".

Solution: Embedded Diagnostics

Diagnostics were embedded directly in the client: timeout handlers, ...
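The slide names timeout handlers as one such diagnostic. A minimal sketch of the idea, assuming test steps are plain Python callables: rather than letting a headless NullView client hang silently, each step fails loudly with its name:

import threading

def with_timeout(step_name, fn, seconds=30.0):
    result = {}
    worker = threading.Thread(target=lambda: result.update(value=fn()),
                              daemon=True)
    worker.start()
    worker.join(seconds)
    if worker.is_alive():
        # Embedded diagnostic: name the step where the client stalled.
        raise TimeoutError(f"step '{step_name}' exceeded {seconds}s")
    return result.get("value")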

Talk Outline: Automated Testing (60 Minutes)

• Design & Initial Implementation: Architecture & Design, Test Client, Initial Results (1/3) (current: Initial Results)
• Lessons Learned: Fielding (1/3)
• Wrap-up & Questions (1/3)

Mean Time Between Failure

• Random events: log & execute
• Record client lifetime / RAM
• Worked: just not relevant in early stages of development
  – Most failures / leaks found were not high-priority at that time, when weighed against server crashes

Monkey Tests

• Constant repetition of simple, isolated actions against servers (see the sketch below)
• Very useful:
  – Direct observation of servers while under constant, simple input
  – Server processes "aged" all day
• Examples:
  – Login / Logout
  – Enter House / Leave House
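A monkey test in this sense is just an endless loop around one simple action pair. A hypothetical sketch (the server/session API is assumed, not TSO's):

import time

def login_logout_monkey(server, account, delay=5.0):
    # One simple, isolated action, repeated against the servers all day.
    passes = failures = 0
    while True:
        try:
            session = server.login(account)   # assumed client API
            session.logout()
            passes += 1
        except Exception as err:
            failures += 1
            print(f"login/logout failed ({failures} so far): {err}")
        time.sleep(delay)                     # steady input; keeps servers aging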

QA Test Suite Regression

• High false positive rate & high maintenance
  – New bugs vs. old bugs
  – Shifting game design
  – "Unknown" failures

Not helping in day-to-day work.

Talk Outline: Automated Testing (60 Minutes)

• Design & Initial Implementation (¼)
• Fielding: Analysis & Adaptations (½) (current): Non-Determinism, Maintenance Overhead, Solutions & Results (Monkey / Sniff / Load / Harness)
• Wrap-up & Questions (¼)

Analysis: Testing Isolated Features

Analysis: Critical Path

Failures on the Critical Path block access to much of the game.

Test Case: Can an Avatar Sit in a Chair?

login() → create_avatar() → buy_house() → enter_house() → buy_object() → use_object()

Solution: Monkey Tests

• Primitives placed in Monkey Tests
  – Isolate as much as possible, repeat 400x
  – Report only aggregate results (see the sketch below)
• Create Avatar: 93% pass (375 of 400)
• "Poor Man's" Unit Test
  – Feature-based, not class-based
  – Limited isolation
  – Easy failure analysis / reporting
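Reporting only aggregates is what keeps 400 repetitions readable. A sketch of that reporting shape:

def run_monkey(name, action, repetitions=400):
    # Report aggregate pass rates, not 400 individual results.
    passed = 0
    for _ in range(repetitions):
        try:
            action()
            passed += 1
        except Exception:
            pass  # individual failures only show up in the aggregate
    print(f"{name}: {100.0 * passed / repetitions:.0f}% pass "
          f"({passed} of {repetitions})")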

Talk Outline: Automated Testing (60 Minutes)

• Design & Initial Implementation (1/3)
• Lessons Learned: Fielding (1/3) (current): Non-Determinism, Maintenance Costs, Solution Approaches (Monkey / Sniff / Load / Harness)
• Wrap-up & Questions (1/3)

Analysis: Maintenance Cost

• High defect rate in game code
  – Code coupling: "side effects"
  – Churn rate: frequent changes
• Critical Path: fatal dependencies
• High debugging cost
  – Non-deterministic, distributed logic

Turnaround Time

A defect travels from introduction through Development, Checkin, Smoke, Build, and Regression; days pass before detection, and both the time to fix and the cost of detection grow with that gap.

Tests were too far removed from the introduction of defects.

Critical Path Defects Were Very Costly

The same pipeline, with worse impact: a Critical Path defect blocks other developers for days while it travels from introduction through Development, Checkin, Smoke, Build, and Regression to a fix.

Solution: Sniff Test

Pre-Checkin Regression: don't let broken code into Mainline.

Candidate code must pass the Sniff step (pass/fail, plus diagnostics) before checkin; only code that passes joins the working code in Mainline and proceeds through Smoke and the full Regression. A sketch of such a gate follows.
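Such a gate can be a small wrapper that refuses the checkin unless sync, build, and the sniff suite all succeed. A sketch with placeholder command names:

import subprocess, sys

STEPS = [["sync_source"], ["build_client"], ["run_sniff_suite"]]  # placeholders

def sniff_gate():
    # Candidate code must pass every step before it may reach Mainline.
    for step in STEPS:
        if subprocess.run(step).returncode != 0:
            print(f"SNIFF FAILED at '{' '.join(step)}': checkin refused")
            return 1
    print("Sniff passed: checkin allowed into Mainline")
    return 0

if __name__ == "__main__":
    sys.exit(sniff_gate())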

Solution: Hourly Diagnostics

• SniffTest Stability Checker
  – Emulates a developer
  – Every hour: sync / build / test (see the sketch below)
• Critical Path monkeys ran non-stop
  – Constant "baseline"
• Traffic Generation
  – Keep the pipes full & the servers aging
  – Keep the DB growing
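The stability checker is essentially that same gate on a timer, emulating a developer's sync/build/test cycle every hour. A sketch, again with placeholder commands:

import subprocess, time

def stability_checker(interval_seconds=3600):
    # Emulate a developer: sync, build, test, report, repeat forever.
    while True:
        ok = all(subprocess.run(step).returncode == 0
                 for step in (["sync_source"], ["build_client"],
                              ["run_sniff_suite"]))
        print("Mainline is", "HEALTHY" if ok else "BROKEN")
        time.sleep(interval_seconds)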

Analysis: CONSTANT SHOUTING IS REALLY IRRITATING

• Bugs spawned many, many emails
• Solution: Report Managers (see the sketch below)
  – Aggregate / correlate across tests
  – Filter known defects
  – Translate common failure reports to their root causes
• Solution: Data Managers
  – Information overload: automated workflow tools are mandatory
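A report manager's core jobs here are aggregation and suppression. A minimal sketch, with a hypothetical known-defect table:

KNOWN_DEFECTS = {"db timeout on lot save": "bug #1234 (already filed)"}  # hypothetical

def summarize(failure_lines):
    # Aggregate failures across tests; suppress noise from known defects.
    new, known = {}, {}
    for line in failure_lines:
        bucket = known if line in KNOWN_DEFECTS else new
        bucket[line] = bucket.get(line, 0) + 1
    # One consolidated report instead of one email per failure.
    for line, count in sorted(new.items(), key=lambda kv: -kv[1]):
        print(f"NEW   x{count}  {line}")
    for line, count in known.items():
        print(f"KNOWN x{count}  {line} -> {KNOWN_DEFECTS[line]}")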

ToolKit Usability

• Workflow automation
• Information management
• Developer / Tester "push button" ease of use
• XP flavour: increasingly easy to run tests
  – Must be easier to run the tests than to avoid running them
  – Must solve problems "on the ground now"

Sample Testing Harness Views

Load Testing: Goals

• Expose issues that only occur at scale
• Establish hardware requirements
• Establish that response times are playable at scale
• Emulate user behaviour
  – Use server-side metrics to tune test scripts against observed Beta behaviour
• Run full-scale load tests daily (see the driver sketch below)
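Mechanically, a daily load run fans scripted clients out across driver CPUs. A sketch with placeholder binary and flag names:

import subprocess

def drive_load(n_clients, script="play_session.txt", server="test-cluster"):
    # Fan N scripted NullView clients out from one driver CPU.
    procs = [subprocess.Popen(["nullview_client", "--server", server,
                               "--script", script])
             for _ in range(n_clients)]
    failures = sum(p.wait() != 0 for p in procs)
    print(f"{n_clients - failures}/{n_clients} clients completed the session")

The load control rig runs this on many driver CPUs at once to reach thousands of clients, tuning the scripts against server-side metrics from Beta.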

Load Testing: Data Flow

Banks of test clients run on Test Driver CPUs, coordinated by a Load Control Rig. Game traffic flows into the Server Cluster; client metrics, resource metrics (from system monitors and internal probes), and debugging data flow back to the Load Testing Team.

Load Testing: Lessons Learned

• Very successful
  – "Scale & Break": up to 4,000 clients
• Some conflicting requirements with Regression
  – Continue-on-fail
  – Transaction tracking
  – NullView client a little "chunky"

Current Work

• QA test suite automation
• Workflow tools
• Integrating testing into the design/development process for new features
• Planned work
  – Extend the Esper toolkit for general use
  – Port to other Maxis projects

Talk Outline: Automated Testing (60 Minutes)

• Design & Initial Implementation (1/3)
• Lessons Learned: Fielding (1/3)
• Wrap-up & Questions (1/3) (current): Biggest Wins / Losses, Reuse, Tabula Rasa: MMP & SSP

Biggest Wins

• Presentation Layer abstraction
  – NullView client
  – Scripted play sessions: powerful for regression & load
• Pre-checkin Sniff Test
• Load testing
• Continual usability enhancements
• Team
  – Upper management commitment
  – Focused group of senior developers

Biggest Issues

• Order of testing
  – MTBF / QA test suites should have come last
  – Not relevant that early, when the game was too unstable
  – Find/fix lag: too distant from development
• Changing TSO's development process
  – Tool adoption was slow unless mandated
• Noise
  – Constant flood of test results
  – Number of game defects and testing defects
  – Non-determinism / false positives

Tabula Rasa

How Would I Start The Next Project?

• Pre-Checkin SniffTest: keep Mainline working; there's just no reason to let code break.
• Hourly Stability Checkers / Monkey Tests: a useful baseline for developers, and keeps the servers aging.
• Dedicated Tools Group: easy to use == used; continual usability enhancements adapt the tools to "on the ground" conditions.
• Executive Support: radical shifts in process; mandates were required to change how entire teams operated.
• Load Test, Early & Often: break it before Live.
• Distribute test development & ownership across the full team.

Next Project: Basic Infrastructure

• Control Harness for clients & components
• Reference Client, with Self Test
• Reference Feature, with Regression Engine
• Living Doc

Building Features: NullView First

The same infrastructure (Control Harness, Reference Client, Reference Feature, Regression Engine, Living Doc), with the NullView Client built before any view: features come up headless and testable first.

Build The Tests With The Code

Each feature (e.g. Login) lands in the NullView Client together with its Self Test and Monkey Test, plugged into the Control Harness, Reference Client, Reference Feature, and Regression Engine.

Nothing gets checked in without a working Monkey Test.

Conclusion

• Estimated impact on MMP: High
  – Sniff Test: kept developers working
  – Load Test: identified critical failures pre-launch
  – Presentation Layer: scriptable play sessions
• Cost to implement: Medium
  – Much lower for SSP games

Repeatable, coordinated inputs at scale and pre-checkin regression were very significant schedule accelerators.

Conclusion

Go For It…

Talk Outline: Automated Testing (60 Minutes)

• Design & Initial Implementation (1/3)
• Lessons Learned: Fielding (1/3)
• Wrap-up (1/3)

Questions?