Organically Grown High-Speed Apps


Transcript of Organically Grown High-Speed Apps

Page 1

VOLUME 6 • ISSUE 3 • MARCH 2009 • $8.95 • www.stpcollaborative.com

BEST PRACTICES:

Java Testing

Organically Grown High-Speed Apps page 10

Sow, Grow and Harvest Live Load-Test Data

Automate Web Service Performance Testing

Page 2
Page 3

Contents • VOLUME 6 • ISSUE 3 • MARCH 2009

10 COVER STORY: Cultivate Your Applications For Ultra Fast Performance
To grow the best-performing Web applications, you must nurture them from the start and throughout the SDLC. By Mark Lustig and Aaron Cook

4 • Editorial
If your organization isn't thinking about or already employing a center of excellence to reduce defects and improve quality, you're not keeping up with the IT Joneses.

6 • Contributors
Get to know this month's experts and the best practices they preach.

7 • Out of the Box
News and products for testers.

9 • ST&Pedia
Industry lingo that gets you up to speed.

33 • Best Practices
When the Java Virtual Machine comes into play, garbage time isn't just for basketball players. By Joel Shore

34 • Future Test
The future of testing is in challenges, opportunities and the Internet. By Murtada Elfahal

27 Automate Web Service Testing; Be Ready to Strike
Automation techniques from the real world will pin your competitors to the floor while your team bowls them over with perfect performance. By Sergei Baranov

18 Sow and Grow Live Test Data
Step-by-step guidance for entering the maze, picking the safest path and finding the most effective tests your data makes possible. By Ross Collard

A Publication

Software Test & Performance (ISSN- #1548-3460) is published monthly by Redwood Collaborative Media, 105 Maxess Avenue, Suite 207, Melville, NY, 11747. Periodicals postage paid at Huntington, NY and additional mailing offices. Software Test & Performance isa registered trademark of Redwood Collaborative Media. All contents copyrighted © 2009 Redwood Collaborative Media. All rights reserved. The price of a one year subscription is US $49.95, $69.95 in Canada, $99.95 elsewhere. POSTMASTER: Send changes of addressto Software Test & Performance, 105 Maxess Road, Suite 207, Melville, NY 11747. Software Test & Performance Subscribers Services may be reached at [email protected] or by calling 1-847-763-1958.


Departments

Page 4

How many defects should people be willing to put up with before they say "To heck with this Web site"? I suppose the answer would depend on how important or unique the Web site was, or how critical its function is to the person using it.

The point isn't the number of errors someone gets before they say adios. The point is that your applications should contain zero defects, should produce zero errors and should have zero untested use cases at deployment time.

Can you imagine that? You might if your company were to implement a Center of Excellence.

A survey of large and small companies instituting such centers revealed that a staggering 87 percent reported "improved quality levels that surpassed their initial expectations." That study, called the Market Snapshot Report: Performance Center of Excellence (CoE), was released last month by analyst firm voke.

The study defines a Performance Center of Excellence as "the consolidation of oriented resources, which typically includes the disciplines of testing, engineering, management and modeling. The CoE helps to centralize scarce and highly specialized resources within the performance organization as a whole." It questioned performance experts from companies across the U.S., two-thirds of which were listed in the Fortune 500.

Among the key findings was that companies reported "substantial ROI as measured by their ability to recoup maintenance costs in a 12-month period, due largely to the elimination of production defects."

Establishing a CoE also makes sense for smaller companies. "Just as large enterprises realized massive efficiencies of scale by consolidating operational roles into shared service organizations in the '90s, forward-thinking IT organizations today are achieving similar benefits by implementing Performance Centers of Excellence," says Theresa Lanowitz, founder of voke and author of the study. She added that such organizations also realized benefits "that scaled across their entire organization."

Let's face it, we all know that testers get a bad rap. Test departments have to constantly defend their existence, protect their budget and make do with less time and less respect than development teams are generally afforded. But it seems to me that proposing (and implementing) a CoE has only upside. You increase productivity, efficiency, communication and institutional knowledge through centralization of a department dedicated to application performance, and you reduce costs and increase quality.

"The cost to an organization to build and maintain a production-like environment for performance testing is often prohibitive," the study points out. "However, the consequences of failing to have an accurate performance testing environment may be catastrophic."

Think of it as your company's very own stimulus package.

Become a Center of Excellence


Ed Notes

Edward J. Correia


President: Andrew Muns
Chairman: Ron Muns

105 Maxess Road, Suite 207, Melville, NY 11747
+1-631-393-6051, fax +1-631-393-6057
www.stpcollaborative.com

Cover Illustration by The Design Diva, NY

Editor: Edward J. Correia, [email protected]
Contributing Editors: Joel Shore, Matt Heusser, Chris McMahon
Art Director: LuAnn T. Palazzo, [email protected]
Publisher: Andrew Muns, [email protected]
Associate Publisher: David Karp, [email protected]
Director of Events: Donna Esposito, [email protected]
Director of Marketing and Operations: Kristin Muns, [email protected]
Reprints: Lisa Abelson, [email protected], (516) 379-7097
Subscriptions/Customer Service: [email protected], 847-763-1958
Circulation and List Services: Lisa Fiske, [email protected]

Page 5
Page 6

AARON COOK and MARK LUSTIG once again provide our lead feature. Beginning on page 10, the test-automation dynamic duo describe how to ensure performance across an entire development cycle, beginning with the definition of service level objectives of a dynamic Web application. They also address real-world aspects of performance engineering, the factors that constitute real performance measurement and all aspects of the cloud.

Aaron is the quality assurance practice leader at Collaborative Consulting and has been with the company for nearly five years. Mark is the director of performance engineering and quality assurance at Collaborative.

Contributors

TO CONTACT AN AUTHOR, please send e-mail to [email protected].


In part two of his multipart series on live-data load testing, ROSS COLLARD tackles the issues involved with selecting and capturing the data, then explores how to apply the data in your testing to increase the reliability of predictions. Ross's personal writing style comes alive beginning on page 18, as he taps into his extensive consulting experiences and situations.

A self-proclaimed software quality guru, Ross Collard says he functions best as a trusted senior advisor in information technology. The founder in 1980 of Collard & Company, Ross has been called a Jedi Master of testing and quality by the president of the Association of Software Testing. He has consulted with top-level management from a diverse variety of companies from Anheuser-Busch to Verizon.

Advertiser URL Page

Hewlett-Packard www.hp.com/go/alm 36

Lionbridge www.lionbridge.com/spe 25

Seapine www.seapine.com/testcase 5

Software Test & Performance www.stpcollaborative.com 6

Software Test & Performance Conference www.stpcon.com 2

Test & QA Newsletter www.stpmag.com/tqa 26

Wildbit www.beanstalkapp.com 35

Index to Advertisers

Subscribe Online! www.stpcollaborative.com

Get the cure for those bad software blues. Don't fret about design defects, out-of-tune device drivers, off-key databases or flat response time. Software Test & Performance is your ticket to runtime rock and roll.

This month we're fortunate to have the tutelage of SERGEI BARANOV on the subject of Web service performance-test automation. On page 27 you'll find his methodology for creating test scenarios that reflect the tendencies of real-world environments. To help you apply these strategies, Sergei introduces best practices for organizing and executing automated load tests and suggests how these practices fit into a Web services application's development life cycle.

Sergei Baranov is a principal software engineer at test-tool maker Parasoft Corp. He began his software career in Moscow, where as an electrical engineer from 1995 to 1996 he designed assembly-language debuggers for data-acquisition equipment and PCs. He's been with Parasoft since 2001.

Page 7

Out of the Box

If you're a user of Hewlett-Packard's Quality Center test management platform and have been bamboozled by its clunky or nonexistent integration with JUnit, NUnit and other unit testing frameworks, you might consider an alternative announced this month by SmarteSoft. The test automation tools maker on March 1 unveiled Smarte Quality Manager, which the company claims offers the same capabilities as HP's ubiquitous suite for about a tenth the cost. Shipping since January, the US$990 per-seat/per-year platform is currently at version 2.1.

SmarteQM is a browser-based platform that uses Ajax to combine management of requirements, releases, test cases, coverage, defects, issues and tasks with general project management capabilities in a consistent user interface. According to SmarteSoft CEO Gordon Macgregor, price and interface are among its main competitive strengths. "With the Rational suite, for example, RequisitePro, Doors, ClearQuest, ClearCase, all have to be separately learned and managed."

Another standout feature, Macgregor said, is its customizable user dashboard. "We're not aware of that in [HP's] TestDirector/Quality Center." The platform also is built around an open API, enabling companies to integrate existing third-party, open source or proprietary software. "That's also a big difference from the competition. Open API allows for connecting to all manner of test automation frameworks." Out of the box, SmarteQM integrates with JUnit, NUnit, PyUnit and TestNG automated unit-testing frameworks. It also works with QuickTestPro and Selenium; integration with LoadRunner is planned. SmarteQM also can export bugs to JIRA, Bugzilla and Microsoft TFS.

SmarteLoad Open to Protocols

SmarteSoft also on March 1 released an update to SmarteLoad, its automated load testing tool. New in version 4.5 is the ability to plug in your communication protocol of choice. "Now you can take any Java implementation of a protocol engine and plug it into SmarteLoad and start doing load testing with that protocol. That's unique in the industry," Macgregor claimed.

The plug-in capability also works with proprietary protocols. "You can't just buy a load testing tool off the shelf that supports your custom protocol. This is especially relevant to firms that have proprietary protocols, such as defense and gaming. Let's say you have some protocols for high-performance gaming. You could plug them in with very little effort. We were able to provide [Microsoft] Winsock support in 24 hours, and we don't charge extra for that."

SmarteLoad pricing varies by the number of simulated users, starting at $18,600 for 100 users for the first year, including maintenance and support. SmarteQM 2.1 and SmarteLoad 4.5 are available now.

The 'Smarte' Way To Do Quality Management

Microsoft in February unveiled a pair of new search products, central elements of an updated roadmap for its overall enterprise search strategy.

Set for beta in the second half of this year is FAST Search for SharePoint, a new server that extends the capabilities of Microsoft's FAST ESP product and adds its capabilities to Microsoft's Office SharePoint Portal Server. Interested parties can license some of the capabilities now through ESP for SharePoint, a special product created for this purpose that includes license migration to the new product when it's released.

Also an extension of FAST ESP and going to beta in the second half will be FAST Search for Internet Business, with "new capabilities for content integration and interaction management, helping enable more complete and interactive search experiences," according to a Microsoft news release issued on Feb. 10 from the company's FAST Forward 09 Conference in Las Vegas. Pricing for FAST for SharePoint will reportedly start at around US$25,000 per server.

MS Search Strategy: FAST, FAST, FAST

In SmarteQM's Test Management module, test cases are mapped to one or more requirements that the test is effectively validating, providing the test coverage for the requirement(s). Each test case includes all the steps and individual actions necessary to complete the test, according to the company.

Page 8

Ajax Goes Down Smooth With LiquidTest

A new UI-testing framework released in February is claimed to have been built with Ajax testing in mind. It's called LiquidTest, and according to JadeLiquid Software, it helps developers and testers "find defects as they occur." An Eclipse RCP app, LiquidTest records Firefox and IE browser actions and outputs the results as test cases for Java and C#, JUnit, NUnit and TestNG, as well as Ruby and Groovy, the company says. It supports headless operation through a server component and is also available as an Eclipse plug-in.

JadeLiquid's flagship is WebRenderer, a pioneering standards-based Java rendering engine for Web browsers.

According to a post on TheServerSide.com by JadeLiquid's Anthony Scotney, many automated testing products fall flat when it comes to Ajax. "LiquidTest, however, was architected to support Ajax from day one. We developed LiquidTest around an 'Expectation' model, so sleeps are not required," he wrote, referring to a command sometimes used when developing asynchronous code. The following is a test case he had recorded against finance.google.com that uses the Ajax-based text field:

public void testMethod() {
    browser.load("finance.google.com");
    browser.click("searchbox", 0);
    browser.type("B");
    browser.expectingModificationsTo("id('ac-list')").type("H");
    browser.expectingLoad().click("id('ac-list')/DIV[2]/DIV/SPAN[2]");
    assertEquals("BHP Billiton Limited (ADR)",
        browser.getValue("BHP Billiton Limited (ADR)"));
}

"As you can see, LiquidTest spots the modifications that are happening to the DOM as we type 'BH'," Scotney wrote of the code.

LiquidTest is available in three editions. The Developer Edition is intended to help "integrate functional tests (as unit tests) into a software development process," he wrote. Headless test-case execution also permits regression tests at every step of the build process. If you're also using the Server Edition, you can link with your continuous integration system and automate test execution for functional and acceptance-test coverage.

A Tester Edition is for test and QA teams that might have less technical knowledge than developers. It outputs concise scripts in LiquidTest Script, a Groovy derivative that "is powerful but not syntactically complicated," Scotney wrote, adding that LiquidTests recorded with the Tester Edition can be replayed with the Developer Edition and vice versa, enabling close collaboration between development and test/QA teams.

With A Redesigned Qtronic, Conformiq Comes to the U.S.

Add one to the number of companies established in Finland that came to the U.S. seeking their fortunes. Conformiq, which designs test-design automation solutions, last month opened an office in Saratoga, Calif., and named A.K. Kalekos president and CEO; he will run the North American operations. Also part of the team as CTO will be Antti Huima, formerly the company's managing director and chief architect. Huima was the brains behind Qtronic, the company's flagship automatic model-to-test-case generator.

Qtronic automates the design of functional tests for software and systems. According to the company, Qtronic also generates browsable documentation and executable test scripts in Python, Tcl and other standard formats. The tool also allows testers to design their own output format, for situations when proprietary test execution or management platforms exist.

Conformiq in January released Qtronic 2.0, a major rewrite of the Qtronic architecture. The way the company describes it, the system went "from single monolithic software to client-server architecture."

"The back-end of the test process, the execution of tests, has already been automated in many companies. But the test scripts needed for automated test execution are still designed by hand," said Kalekos. "By using Qtronic to automate the test design phase, our customers dramatically reduce the effort and time required to generate test cases and test scripts."

Among the major changes, published on the company's Web site, is the separation of a single user workspace into a computational server (for generating tests) and an Eclipse-based client for Linux, Solaris and Windows.

The platform also now supports multiple test-design configurations, each with its own coverage criteria and selection of script back-ends. "While generation of test cases is possible without having a script back-end (abstract test case), a user can now configure more than one scripting back-end in a test design configuration for executable test scripts," said the company. Test cases for multiple test design configurations are generated in parallel, making test generation faster by sharing test generation results between multiple test design configurations.

Also new is incremental test-case generation with local test case naming. Generated test cases are stored in persistent storage, and previously generated and stored test cases can be used as input to subsequent test generation runs. It's also now possible to name and rename generated test cases. Version 2.0 improves handling of coverage criteria, with fine-grained control: structural features can be individually selected; coverage criteria can be blocked, marked as a target or as "do not care;" and coverage criteria status is updated in real time and always visible to testers.

Testers can now browse and analyze generated test cases (and model defects) in the user interface, including graphical I/O and execution trace. A simplified plug-in API is now Java compatible, and eases the task of developing new plug-ins.

In February Conformiq received US$4.2 million in venture funding from investors in Europe and the U.S., led by Nexit Ventures and Finnish Industry Investment.

Send product announcements to [email protected]

Page 9

We often hear that before any coding begins, the project owners should specify the number of users for each feature, the number of transactions on the system, and the required system response times. We would like to see this happen. We would also like a pony.

Predicting user behavior is not really possible no matter how much testers wish that it were. But once we acknowledge that, we find that a tester can often add a great deal of value to a situation that has a vague and ambiguous problem like "is the software fast enough?"

Instead of giving you easy answers, we want to make your job valuable without burning you out in the process. So we introduce you to patterns of performance testing.

Here are this month's terms:

BOTTLENECK
It's typical to find that one or more small parts of an application are slowing down performance of the entire application. Identifying bottlenecks is a big part of performance testing.

USER FLOW ANALYSIS
A general map of the usage patterns of an application. An example user flow might show that 100 percent of users go to the Login screen, 50 percent go to the Search screen, and 10 percent use the Checkout screen.

PROFILE
A map of how various parts of the application handle load. Profiling is often useful for identifying bottlenecks.

LOG
Almost all applications have some sort of logging capability, usually a text file or a database row that keeps track of what happened when. Adding "... and how long" to the log is a standard development task. Timing information parsed by a tool or spreadsheet can identify particularly slow transactions, thus pointing to a bottleneck. It's not uncommon to encounter situations in which we're brought in to do performance testing only to find that the slow parts of the system are already identified quite nicely in the system logs.
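As a concrete illustration, here is a minimal sketch (not from the column) of the kind of throwaway utility a tester might use to mine such a log. It assumes an invented comma-separated format of operation name and elapsed milliseconds; a real log would need its own parsing.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Reads a timing log and prints the slowest entries first.
// Assumed line format: "checkout,1840" (operation name, elapsed milliseconds).
public class SlowTransactionReport {

    static class LogEntry {
        final String operation;
        final long millis;
        LogEntry(String operation, long millis) { this.operation = operation; this.millis = millis; }
    }

    public static void main(String[] args) throws Exception {
        List<LogEntry> entries = new ArrayList<LogEntry>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split(",");
            if (fields.length == 2) {
                entries.add(new LogEntry(fields[0].trim(), Long.parseLong(fields[1].trim())));
            }
        }
        in.close();

        // Slowest transactions first: these are the first candidates for tuning.
        Collections.sort(entries, new Comparator<LogEntry>() {
            public int compare(LogEntry a, LogEntry b) {
                return Long.valueOf(b.millis).compareTo(Long.valueOf(a.millis));
            }
        });
        for (LogEntry e : entries.subList(0, Math.min(10, entries.size()))) {
            System.out.println(e.operation + "\t" + e.millis + " ms");
        }
    }
}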

BETA/STAGING SYSTEM
Many companies exercise their software themselves for profiling purposes. One video game company we know of performed its profiling every Tuesday at 10:00 am. Everyone in the company dropped what they were doing, picked up a game controller, and played the company's video game product while the network admin simulated network load and the test manager compiled profile information and interviewed players.

SIMULATION
We usually recommend analyzing data from actual use of the system. When that is not possible, there are tools that will simulate various kinds of situations such as network load, HTTP traffic, and low memory conditions. Excellent commercial and open-source tools exist for simulation.

BACK OF NAPKIN MATH
Refers to the use of logic and mathematics to take known performance behaviors, such as the amount of time between page loads for a typical user or the ratio of reads to updates, and calculate the amount of load to generate to simulate a certain number of users.
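The arithmetic the term describes might look like the following sketch; the user count and think time are invented for illustration.

// Back-of-napkin load estimate; all numbers are made up for illustration.
public class NapkinMath {
    public static void main(String[] args) {
        int concurrentUsers = 2000;        // expected simultaneous users
        double secondsBetweenPages = 30.0; // average think time between page loads
        double pagesPerSecond = concurrentUsers / secondsBetweenPages;
        // About 67 page requests per second is the load a test would need
        // to generate to simulate this population.
        System.out.println("Target load: ~" + Math.round(pagesPerSecond) + " pages/second");
    }
}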

PERFORMANCE AND SCALE
Performance generally refers to how the application behaves under a single user; scale implies how the software behaves when accessed by many users at the same time.

SLASHDOT EFFECT
When software is suddenly overwhelmed with a huge and unforeseen number of users. Its origin is from sites linked to by the popular news site Slashdot. A system might perform perfectly well and meet all specifications under normal conditions, but fail when it meets with unexpected success.

A few relevant techniques for Web performance management:

WHEN PEOPLE CALL A TESTER
Testers generally get called in to do "testing" when a performance or scaling problem already exists. In such cases, you might not need more measures of performance, but simply to fix the problem and retest.

USER FLOW ANALYSIS
Simulating performance involves predicting what the actual customer will do, running with those predictions and evaluating the results. A useful approach is to use real customer data in the beta or production-like environment. To quote Edward Keyes paraphrasing Arthur C. Clarke: "Sufficiently advanced performance monitoring is indistinguishable from testing."

QUICK WINS
If you have a log, import the data into a spreadsheet, sort it by time-to-execute commands, and fix the slowest command first. Better yet, examine how often the commands are called and fix operations that are slow and performed often.

SERVICE LEVEL CAPABILITIES
We have had little success actually pulling out expected user requirements (sometimes called Service Level Agreements or SLAs). We find more success in evaluating the software under test and expressing a service level capability. By understanding what the software is capable of, senior management can determine which markets to sell to and whether investing in more scale is required.

How Fast is Fast Enough?

ST&PediaTranslating the jargon of testing into plain English

Matt Heusser and Chris McMahon are career software developers, testers and bloggers. They're colleagues at Socialtext, where they perform testing and quality assurance for the company's Web-based collaboration software.


Matt Heusser and Chris McMahon

Page 10

Cultivate Your Crop For High-Performance From The Ground Up

Page 11

By Aaron Cook and Mark Lustig

A primary goal for IT organizations is to create an efficient, flexible infrastructure. Organizations struggle with the desire to be more proactive in addressing and resolving issues, but often take a reactive approach. Conventional behavior in IT is to manage discrete silos (e.g., the middleware layer, the database layer, the UNIX server layer, the mainframe layer). To become more proactive and meet business needs across multiple infrastructure layers, the goal must become proactively managing to business goals.

Performance engineering (PE) is not merely the process of ensuring a delivered system meets reasonable performance objectives. Rather, PE emphasizes the "total effectiveness" of the system, and is a discipline that spans the entire software development lifecycle. By incorporating PE practices throughout an application's life, scalability, capacity and the ability to integrate are determined early, when they are still relatively easy and inexpensive to control.

This article provides a detailed description of the activities across the complete software lifecycle, starting with the definition and adherence to service level objectives. This article also addresses the real-world aspects of performance engineering, notably:

• What is realistic real-world performance for today's dynamic web applications?
• What is the real measure of performance?
• What aspects of the cloud need to be considered (first mile, middle mile, last mile)?

The Software Development Life Cycle includes five key areas, beginning with business justification and requirements definition. This is followed by the areas of system design, system development/implementation, testing, and deployment/support. As portrayed in Figure 1 (next page), requirements definition must include service level definition; this includes non-functional requirements of response time, throughput, and key measures of business process performance (e.g., response and execution time thresholds of transaction execution time).

Across the lifecycle, the focus areas of multiple stakeholders are clearly defined. The engineering group concentrates on design and development/implementation. The QA and PE group focuses on testing activities (functional, integration, user acceptance, performance), while Operations focuses on system deployment and support.

Performance engineering activities occur at each stage in the lifecycle, beginning with platform/environment validation. This continues with performance benchmarking, performance regression, and performance integration. Once the system is running in production, proactive production performance monitoring enables visibility into system performance and overall system health.

Service Level Objectives

Avoid the culture of "It's not a problem until users complain."

Business requirements are the primary emphasis of the analysis phase of any system development initiative. However, many initiatives do not track non-functional requirements such as response time, throughput, and scalability. Key performance objectives and internal incentives should ideally define and report against service level compliance. As the primary goal of IT is to service the business, well-defined service level agreements (SLAs) provide a clear set of objectives identifying activities that are most appropriate to monitor, report, and build incentives around.

A key first step toward defining and implementing SLAs is the identification of the key business transactions, key performance indicators (KPIs) and volumetrics. Development and PE teams should begin the discussion of service level agreements and deliver a draft at the end of the discovery phase. For example, these may include the transaction response times, batch processing requirements, and database backup. This also helps determine if a

Aaron Cook and Mark Lustig work for Collaborative Consulting, a business and technology consultancy.

Page 12

performance test or proof-of-concept test is required in order to validate if specific service levels are achievable.

Many organizations rarely, if ever, define service level objectives, and therefore cannot enforce them. Service level agreements should be designed with organization costs and benefits in mind. Setting the agreements too low negatively affects business value. Setting them too high can unnecessarily increase costs. Establishing and agreeing on the appropriate service levels requires IT and the business groups to work together to set realistic, achievable SLAs.
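As a concrete illustration of turning service levels into something enforceable, here is a minimal sketch of recording objectives as data and checking test results against them after each run. The transaction names, thresholds and measured values are illustrative assumptions, not figures from the article.

import java.util.LinkedHashMap;
import java.util.Map;

// Records per-transaction response time objectives and flags any measured
// result that misses its objective. All names and numbers are illustrative.
public class ServiceLevelCheck {
    public static void main(String[] args) {
        // Objective: 90th-percentile response time per business transaction, in ms.
        Map<String, Long> objectiveMillis = new LinkedHashMap<String, Long>();
        objectiveMillis.put("Login", 5000L);
        objectiveMillis.put("Account detail", 8000L);
        objectiveMillis.put("Search", 8000L);

        // 90th-percentile results from the latest performance test run (made up).
        Map<String, Long> measuredMillis = new LinkedHashMap<String, Long>();
        measuredMillis.put("Login", 10200L);
        measuredMillis.put("Account detail", 7400L);
        measuredMillis.put("Search", 6100L);

        for (Map.Entry<String, Long> objective : objectiveMillis.entrySet()) {
            long measured = measuredMillis.get(objective.getKey());
            String verdict = measured <= objective.getValue() ? "PASS" : "FAIL";
            System.out.println(objective.getKey() + ": " + measured + " ms (objective "
                    + objective.getValue() + " ms) " + verdict);
        }
    }
}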

Platform/Environment Validation

Once the service levels are understood, platform/environment validation can occur. This will aid in determining whether a particular technical architecture will support an organization's business plan. It works by employing workload characterization and executing stress, load, and endurance tests against proof-of-concept architecture alternatives.

For example, a highly flexible distributed architecture may include a web server, application server, enterprise service bus, middleware broker, database tier, and mainframe/legacy systems tier. As transactions flow through this architecture, numerous integration points can impact performance. Ensuring successful execution and response time becomes the focus of platform validation. While these efforts may require initial investment and can impact the development timeline, they pale in comparison to the costs associated with retrofitting/reworking a system after development is complete.

In addition, by performing proactive 'pre-deployment' capacity planning activities (i.e., modeling), costs can be empirically considered along with anticipated impacts to the infrastructure environment. These impacts include utilization, response time, bandwidth requirements, and storage requirements, to name a few.

The primary goal of a platform validation is to provide an informed estimate of expected performance, enabling a change/refinement in architecture direction, based on the available factors. Platform validation must consider workload characterizations such as:

• Types of business transactions
• Current and projected business transaction volumes
• Observed/measured performance (e.g., response time, processor and memory utilization, etc.)

Assumptions must be made for values of these factors to support the model's workload characterization.

Performance Benchmarking

Performance benchmarking is used as a testing technique to identify the current system behavior under defined load profiles as configured for your production or targeted environment. This technique can define a known performance starting point for your system under test (SUT) before making modifications or changes to the test environment, including application, network or database configurations, or user profiles.

To identify and measure the specific benchmarks, the performance test team needs to develop a workload characterization model of the SUT's real-world performance expectations. This provides a place to initiate the testing process. The team can modify and tune it as successive test runs provide additional information. After the performance test team defines the workload characterization model, the team needs to define a set of user profiles that determine the application pathways that typical classes of users will follow. These profiles are leveraged and combined with estimates from business and technical groups throughout the organization to define the targeted SUT performance behavior criteria. Profiles may also be used in conjunction with predefined performance SLAs as defined by the various constituent business organizations.
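To make the idea of a workload characterization model and user profiles concrete, here is a minimal sketch of how such a model might be expressed as data before being translated into tool scripts. The profile names, percentages and pathways are illustrative assumptions.

import java.util.Arrays;

// A simple workload characterization: classes of users, the share of the
// simulated population assigned to each, and the pathway each class follows.
public class WorkloadModel {

    static class UserProfile {
        final String name;
        final double shareOfUsers;   // fraction of the virtual-user population
        final String[] pathway;      // ordered pages/transactions the class exercises
        UserProfile(String name, double shareOfUsers, String... pathway) {
            this.name = name;
            this.shareOfUsers = shareOfUsers;
            this.pathway = pathway;
        }
    }

    public static void main(String[] args) {
        int totalVirtualUsers = 1000;
        UserProfile[] profiles = {
            new UserProfile("Browser",  0.50, "Login", "Search", "Logout"),
            new UserProfile("Buyer",    0.30, "Login", "Search", "Checkout", "Logout"),
            new UserProfile("Reporter", 0.20, "Login", "Account detail", "Logout")
        };
        for (UserProfile p : profiles) {
            int users = (int) Math.round(totalVirtualUsers * p.shareOfUsers);
            System.out.println(p.name + ": " + users + " virtual users, pathway "
                    + Arrays.toString(p.pathway));
        }
    }
}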

Once the profiles are developed and the SLAs determined, the performance test team needs to develop the typical test scenarios that will be modeled and executed in a tool such as LoadRunner or OpenSTA. The main requirement of the tool is that it allows the team to assemble the run-time test execution scenarios that it will use to validate the initial benchmarking assumptions.

The next critical piece of performance benchmarking is to identify the quantity and quality of test data required for the performance test runs. This can be determined by answering a few basic questions:

• Are the test scenarios destructive to the test-bed data?
• Can the database be populated in a manner to capture a snapshot of the database before any test run and restored between test runs?

[Figure 1: Plotting Requirements. Diagram mapping the development process (design, configuration/customization, implementation, application testing, product support) to its stakeholders (engineering group; QA/performance engineering group; operations) and to the performance engineering activities at each stage: define service level objectives, platform/environment validation, performance benchmarking, performance regression, performance integration, and production performance monitoring.]

[Figure 2: Cultivating Performance. Overall assessment approach: review infrastructure and architecture (identify risk areas; review configuration settings, topology and sizing; define points of measurement); define business activity profiles and service levels (types and numbers of users; business activities and frequencies); design and build tests (test data generation; create test scripts; user and transaction profiles; infrastructure configuration); iterate testing and tuning.]

Page 13

• Can the test scenarios create the data that they require as part of a set-up script, or does the complexity of the data require that it be created in advance and cleaned up as part of the test scenarios?

One major risk to the test data effort is that any of the test scripts may fail during the course of testing. If using actual test scripts, the test runs and the data might have to be recreated anyway using external tools or utilities.

As soon as these test artifacts have been identified, modeled, and developed, the performance test benchmark can begin with an initial test run, modeling a small subset of the potential user population. This is used to shake out any issues with the test scripts or test data used by the test scripts. This validates the targeted test execution environment including the performance test tool(s), test environment, SUT configuration, and initial test profile configuration parameters. In effect, this is a smoke test of the performance test run-time environment.

Once the PE smoke test executes successfully, it is time to reset the environment and data and run the first of a series of benchmark test scenarios. This first scenario will provide significant information and test results that can be leveraged by the performance test team defining the performance benchmark test suites.

The performance test benchmark is considered complete when the test team has captured results for all of the test scenarios making up the test suite. The results must correspond to a repeatable set of system configuration parameters as well as a test bed of data. Together, these artifacts make up the performance benchmark.

Figure 2 outlines our overall approach used for assessing the performance and scalability of a given system. These activities represent a best practices model for conducting performance and scalability assessments.

Each test iteration attempts to identify a system impediment or prove a particular hypothesis. The testing philosophy is to vary one element and observe and analyze the results. For example, if results of a test are unsatisfactory, the team may choose to tune a particular configuration parameter, and then re-run the test.

Performance Regression

Performance regression testing is a technique used to validate that SUT changes have not impacted the existing SLAs established during the performance benchmarking test phase. Depending on the nature of your SUT, this can be an important measure of continued quality as the system undergoes functional maintenance, defect-specific enhancements, or performance-related updates to specific modules or areas of the application.

Performance regression testing requires the test team to have performed, at a minimum, a series of benchmark tests designed to establish the current system performance behavior. These automated test scripts and scenarios, along with their associated results, will need to be archived for use and comparison to the results generated for the next version of the application or the next version of the hardware environment. One powerful use of performance regression testing is when an application's data center is upgraded to add capacity or moved to a new server. By executing a series of tests using the same data and test parameters, the results can be compared to ensure that nothing during the upgrade/migration was glossed over, missed, or adversely impacted the modified application run-time environment.

The goal for performance regression testing is repeatability. This requires establishing the same database sizing (number of records) during the test run, using the same test scenarios to generate the results, leveraging as much of the same application footprint during the test run, and

[Figure 3: Expected Yield. Source: www.keynote.com]

Page 14

using as similar a hardware configuration as possible during the test run. The challenge arises when these are the specific items being changed. Typically, this occurs most often when introducing a defect fix or new version of the application.

In such cases, the number of items that are different between test runs is easily managed. The real challenge for measuring and validating results arises when the underlying application architecture or development platform changes. During those test cycles, the performance engineers need to work closely with the application developers to ensure that the new tests being executed match closely the preexisting benchmarked test results so that comparisons and contrasts can be identified easily.

The mechanism for executing the performance regressions follows the same model as the initial performance benchmark. The one significant difference is that the work required to identify the test scenarios and create the test data has been performed as part of the performance benchmark exercise. Once the test environment and system are ready for testing, the recommended approach is to run the same smoke test that was used during the initial performance benchmark test. Once the smoke test runs successfully, you can execute the initial benchmark test scenarios and capture the results. Ensure that the SUT is configured the same way, or as similarly as possible, and capture the test run results.

Compare the regression test results to the initial performance test benchmark results. If the results differ significantly, the performance test team should investigate the possible reasons, rerun any tests required, and compare the results again. The goal for the regression tests is to validate that nothing from a performance perspective has changed significantly unless planned. Sometimes, the regression test results differ significantly from the initial benchmark by design. In that case, the regression results have validated a configuration change or a functional system change that the business or end-user community has requested. This is considered a success for this phase of performance testing.
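Here is a minimal sketch of what an automated baseline comparison for a regression run might look like. The transaction names, timings and the 10 percent tolerance are illustrative assumptions, not values from the article.

import java.util.LinkedHashMap;
import java.util.Map;

// Compares current regression results against the archived benchmark and
// flags transactions that have slowed by more than an agreed tolerance.
public class RegressionComparison {
    private static final double TOLERANCE = 0.10; // flag anything more than 10% slower

    public static void main(String[] args) {
        Map<String, Double> baselineMillis = new LinkedHashMap<String, Double>();
        baselineMillis.put("Login", 4200.0);
        baselineMillis.put("Search", 6100.0);

        Map<String, Double> currentMillis = new LinkedHashMap<String, Double>();
        currentMillis.put("Login", 4350.0);   // +3.6 percent: within tolerance
        currentMillis.put("Search", 7800.0);  // +27.9 percent: investigate

        for (Map.Entry<String, Double> entry : baselineMillis.entrySet()) {
            double baseline = entry.getValue();
            double current = currentMillis.get(entry.getKey());
            double change = (current - baseline) / baseline;
            String verdict = change > TOLERANCE ? "INVESTIGATE" : "OK";
            System.out.printf("%-8s baseline=%.0f ms current=%.0f ms change=%+.1f%% %s%n",
                    entry.getKey(), baseline, current, change * 100, verdict);
        }
    }
}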

Performance Integration

Performance integration testing is a technique used to validate SLAs for application components across a suite of SUT modules. To successfully integrate and compare the performance characteristics of multiple application modules, the performance test team must first decompose the SUT into its constituent components and performance-benchmark each one in isolation. This might seem futile for applications using legacy technologies, but this approach can be used to develop a predictive performance characterization model across an entire suite of modules.

For example, in a simplistic transaction, there may be a number of components called via reference that combine into one logical business transaction. For the purpose of illustration, let's call this business transaction "Login." Login may take the form of a UI component used to gather user credentials including user ID, password, and dynamic token (via an RSA-type key fob). These are sent to the application server via an encrypted HTTP request. The application server calls an RSA Web service to validate the token, and an LDAP service to validate the user ID and password combination. Each of these services returns a success value to the application server. The app server then passes on a success token to the calling Web page, authenticating or denying the user access to the application landing page.

While the business considers Login as a single transaction, the underlying application breaks it down into a minimum of three discrete request/response pairs which result in six exchanges of information. If the end user community expects a Login transaction to take less than five seconds, for example, and the application when modeled and tested responds within 10 seconds 90 percent of the time, a performance issue has been identified and needs to be solved.

The performance test team will have mocked up each of the request/response pairs and validated each one individually in order to identify

[Figure 4: Bumper Crop. Source: www.gomez.com]

Page 15

the root cause of the potential performance bottleneck. Without performing this level of testing, the application developer may have limited visibility into component response times when integrated with other components. It is up to the performance test team to help identify and validate them with a combination of performance integration and performance regression testing techniques.
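Here is a minimal sketch of how a test harness might time each request/response pair inside the composite Login transaction described above to locate the slow component. The component names and the measure() helper are hypothetical placeholders, not part of any real tool.

// Times each request/response pair inside a composite "Login" transaction.
// The measure() helper below only simulates a measurement; in a real harness
// it would wrap the actual call and return the elapsed time.
public class LoginBreakdown {
    public static void main(String[] args) {
        long uiToAppServer = measure("Encrypted HTTP request to application server");
        long rsaTokenCheck = measure("RSA web service token validation");
        long ldapCheck     = measure("LDAP user ID/password validation");

        long total = uiToAppServer + rsaTokenCheck + ldapCheck;
        System.out.println("Login total: " + total + " ms");
        // If the total misses the five-second service level, the per-component
        // numbers show which request/response pair to tune first.
    }

    static long measure(String component) {
        long elapsedMillis = (long) (Math.random() * 4000); // simulated timing
        System.out.println(component + ": " + elapsedMillis + " ms");
        return elapsedMillis;
    }
}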

Production Performance Monitoring

To be proactive, companies need to implement controls and measures that enable awareness of potential problems or target the problems themselves. Production performance monitoring ensures that a system can support service levels such as response time, scalability, and performance, but more importantly, enables the business to know in advance when a problem will arise. When difficulties occur, PE, coupled with systems management, can isolate bottlenecks and dramatically reduce time to resolution. Performance monitoring allows proactive troubleshooting of problems when they occur, and developing repairs or "workarounds" to minimize business disruption.

Unfortunately, the nature of distributed systems has made it challenging to build in the monitors and controls needed to isolate bottlenecks and to report on metrics at each step in distributed transaction processing. This problem has been the bane of traditional systems management. However, emerging tools and techniques are beginning to provide end-to-end transactional visibility, measurement, and monitoring.

Tools such as dashboards, performance monitoring databases and root cause analysis relationships allow tracing and correlation of transactions across the distributed system. Dashboard views provide extensive business and system process information, and allow executives to monitor, measure and prepare against forecasted and actual metrics.

'Good' Performance And A Web Application

In an ideal world, response time would be immediate, throughput would be limitless, and execution time would be instantaneous. During service level definition, it is common for the goals set forth by the business to be more in line with the ideal world, as opposed to the real world. The business must define realistic service levels, and the engineering and operations group must validate them. In a Web-based system, discrete service levels must be understood by transaction and by page type. Homepages, for example, are optimized for the fastest and most reliable response time. These typically contain static content and highly optimized and strategically located caching services (e.g., Akamai). One company that measures and reports response times is Keynote Systems (www.keynote.com). Average response time in a recent Keynote Business 40 report (Figure 3, page 13) was 1.82 seconds.

Dynamic transactions traverse multiple architectural tiers, which typically might include a Web server, application server, database server and backend/mainframe server(s). Execution of a dynamic transaction is non-trivial. While more layers and integration points allow for a more flexible system implementation, each integration point adds response and execution time. This overhead may include marshalling/un-marshalling of data, compression/un-compression, and queuing/dequeuing. Independently these activities might take only milliseconds, but collectively can add up to seconds.

Common complex dynamic transactions include account details and search. Figure 4 (previous page) shows the best response times from a recent credit card account detail report generated by Gomez (www.gomez.com). Responses range between 8 and 17 seconds, with an average response time of 14 seconds. Users have become accustomed to

[Figure 5: Site Scouting. Source: www.gomez.com]


Page 16

this length of execution time, and expectations are effectively managed by means of progress bars, messages, animated .gif files or other such methods.

For media outlets, which typically employ content management engines with multiple databases, Gomez tracks search response times (Figure 5, previous page). These range from four seconds to more than 15 seconds, with an average of around 11 seconds.

Reports such as these provide real performance data that you can use to compare with your UIs. In our consulting engagements, we ideally strive for a response time of 1-2 seconds, which is realistic for static web content. However, for today's complex dynamic transactions, a more realistic response time across static and dynamic content should be between three and eight seconds. Techniques including content caching, asynchronous loading and progress bars all aid in effectively managing user expectations and overall user satisfaction.

The Real Measure of Performance

What are we actually measuring when we talk about performance of an application? How do we determine what matters and what doesn't? Does it matter that your end user population can execute 500 transactions per second if only 10 can log on concurrently and the estimates for the user distribution call for 10,000 simultaneous logins? Conversely, does it matter if your application can successfully support 10,000 simultaneous logins if the end users can't execute the most common application functions as defined by your business groups?

Most testers have heard the complaint that "the application is slow." The first question often heard after that is, "What is slow, exactly?" If the user answers with something like "logging into the application," you now have something to go on. The business user has just defined what matters to them, and that is the key to successfully designing a series of performance tests. Of course, this example implies a client/server system with a UI component. While the example does not speak specifically to a batch or import-type system, the same methodology applies.

When trying to define the real measure of performance, the next step is to define a transaction. There are a number of schools of thought. The first school states that a transaction is a single empirical interaction with the SUT. This definition may be helpful when designing your performance integration test suites. The second school states that a transaction is defined as a business process. This can be extremely helpful when describing to the business community what the observed performance characteristics are for the SUT. The challenge is that the business may not have insight into the underlying technical implementation of a "transaction."

What we find in the real world is that a transaction needs to be defined for each performance test project and then adhered to for the duration of the project testing cycle. This means that a discrete transaction may be defined for the performance integration test phase and then used in concert with additional discrete transactions to create a business process transaction. This technique requires that the performance test team combine results and perform a bit of mathematical computation. The technique has worked successfully in a number of performance engagements.

First Mile, Middle Mile, Last Mile

When considering the performance of Web-based systems, there are variables beyond what is controlled, and questions of whose control they are under. Aspects of the Internet cloud, often referred to as the first mile, middle mile, and last mile, become a primary consideration (Figure 6). Root causes of 'cloud bottlenecks' often include high latency, inefficient protocols, and sometimes, network packet loss.

As the majority of applications are dynamic (and hence not able to be cached on proxy servers), the cloud becomes a bottleneck that is difficult to control. The average dynamic Web page contains 20 to 30 components, each requiring an HTTP connection. The maximum round trip time (RTT) can be as much as 60 times the average RTT, based on inefficient routing in the U.S.
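To see why those round trips dominate, here is a back-of-napkin sketch of the network cost of one dynamic page. The component count and RTT are illustrative assumptions, and the sketch ignores connection reuse and parallel downloads.

// Rough estimate of the network time spent on round trips for one dynamic page.
// All numbers are illustrative; real browsers reuse and parallelize connections.
public class RoundTripEstimate {
    public static void main(String[] args) {
        int componentsPerPage = 25;      // a typical dynamic page: 20 to 30 components
        double averageRttMillis = 80.0;  // average round trip time to the data center

        double networkSeconds = componentsPerPage * averageRttMillis / 1000.0;
        System.out.printf("Round-trip time alone: about %.1f seconds%n", networkSeconds);
        // Even before any server-side execution, 25 round trips at 80 ms cost about
        // 2 seconds, which is why edge caching that cuts round trips helps so much.
    }
}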

Optimizing application behavior is typically focused on the distributed infrastructure within our control, including the Web server and application and database servers. The complete user experience encompasses the client user's connection to the data center. For internal users, this is within the control of the development team. For external users, the optimization model

[Figure 6: The Field. Diagram of the path from the data center (first mile) through carrier/NSP networks, peering points and the global Internet (middle mile, measured as round trip time) to ISPs and remote end users (last mile).]


Page 17

is much more complex. To address this challenge, proxy services companies such as Akamai have emerged.

Companies and users buy the last mile from their local Internet Service Provider. Companies like Akamai and Yahoo buy the first mile of access from major corporate ISPs. The middle mile is unpredictable, and is based on dynamic routing logic and rules that are optimized for the entire Internet, as opposed to optimized access for your users to your application.

The challenges for the middle mile are related to the network: delays at the peering points between routers and servers within the middle mile. No one entity is accountable or responsible for this middle-mile challenge.

The latency associated with the cloud's unpredictability can be addressed, in part, with proxy services, which emphasize reduction in Internet latency. By adding more servers at the 'edge', at Tier 1 ISPs and local ISPs, all static content is delivered quickly, and oftentimes, pre-cached dynamic content can also be delivered. This greatly reduces the number of round trips, enhancing performance significantly. In addition, proxy services strive to optimize routing as a whole, with the goal of reducing overall response time.

The typical breakdown of response time is based on the number of round trips in the middle mile. The more dynamic a Web page, the more round trips required. Optimizing cloud variables will optimize overall response time and user experience.

Performance engineering is a proactive discipline. While an investment in PE might be new to your organization, its cost is more than justified by the efficiency gains it will produce. It is clearly more practical and affordable to invest in systems currently in production, enhancing their stability, scalability and customer experience. This almost always costs less than building a new system from scratch, though doing so is clearly the best way to ensure peak performance across the SDLC. Companies need assurances that their systems can support current and future demands, and performance engineering is an affordable way to provide those assurances. By gathering objective, empirical evidence of a system's performance characteristics and behavior, and making proactive recommendations for its maintenance, the PE investment will surely pay for itself.

Performance engineering has matured beyond load testing, tuning and performance opti-

mization.Today, PE must enable business success beyond application delivery into the oper-

ational life cycle, providing the entire enterprise—both business and information technolo-

gy—with proactive achievement of company objectives.

Performance engineering is a proactive discipline. When integrated throughout an initia-

tive—from start to finish—PE provides a level of assurance for an organization, allowing it

to plan systems effectively and ensure business performance and customer satisfaction.

With budgets shrinking, proactive initiatives can be difficult to justify, as their immediate return on investment is not readily visible. Emphasis on the business value and ROI of PE must become the priority. Advantages of PE are well understood, including:

• Cost reduction by maximizing infrastructure based on business need.
• Management of business transactions across a multi-tiered infrastructure.
• Definition and measurement of the quality and service level of mission-critical systems.
• Implementation of SLAs to ensure that requirements are stated clearly and fulfilled.
• Enablement of forecasting and trending.

But where is the ROI of PE as a discipline? Yes, it's part of maximizing the infrastructure, and yes, it's part of systems stability and customer satisfaction, but these can be difficult to quantify. By understanding the costs of an outage, we can objectively validate the ROI of performance engineering, since operational costs 'hide' the true costs of system development. Costs of downtime in production include recovery, lost productivity, incident handling, unintended negative publicity and lost revenue. In an extreme example, a 15-second timeout in an enterprise application might result in calls to an outsourced customer support center, which, over the course of time, could result in unanticipated support costs in the millions of dollars.
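One way to make such hidden costs concrete is a simple back-of-the-envelope model. The sketch below uses entirely hypothetical figures (affected users, call rate, cost per call) purely to show the shape of the calculation; substitute your own operational numbers.

# Hypothetical model of the support cost triggered by a recurring
# 15-second timeout that drives users to an outsourced call center.
affected_users_per_day = 2_000     # assumed users who hit the timeout each day
call_rate = 0.05                   # assumed fraction of those users who call support
cost_per_call = 12.00              # assumed fully loaded cost of one call, in dollars
working_days_per_year = 250

calls_per_year = affected_users_per_day * call_rate * working_days_per_year
support_cost_per_year = calls_per_year * cost_per_call

print(f"extra support calls per year: {calls_per_year:,.0f}")
print(f"unanticipated support cost per year: ${support_cost_per_year:,.0f}")
# With these assumptions: 25,000 calls and $300,000 a year from one defect;
# scale the inputs up and the multi-million-dollar figures above follow.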

An additional illustration of hidden costs that can be objectively measured to support ROI calculations is the cost of designing and developing a system once, versus the cost of making performance modifications to a system after it is in production and has failed to meet service level expectations.

Non-functional business requirements are not always captured thoroughly. Some examples include:
• A multi-tiered application that can scale to meet the expected load with the proper load-balancing scheme and that can fail over properly to meet the service levels for availability and reliability.
• A technical architecture that was engineered to meet the service levels of today and tomorrow (as volume and utilization increase).

As IT organizations struggle to drive down maintenance costs and fund new projects, an average IT organization can easily spend 75 percent of its budget on ongoing operations and maintenance. IT shops are caught in 'firefighting' mode and inevitably dedicate a larger portion of their budgets to maintenance, diverting resources from efforts to deliver new value to the business. Taking a proactive stance will serve to enable reduced operating costs, higher systems availability, and better-performing systems today and in the future. Performance engineering is that proactive capability.



Step-by-Step Guidance for Entering the Maze, Picking the Safest Path and Emerging With the Most Effective Tests Possible

By Ross Collard

(Ross Collard is founder of Collard & Company, a Manhattan-based consulting firm that specializes in software quality.)

This is the second article in a series that began last month with an introduction to live data use in testing, categorization of test projects and the types and sources of live data.

Once you've decided that live data fits your testing efforts, you'll soon be presented with three new questions to answer:

1. What live data should we use?
2. How do we capture and manipulate the data?
3. How do we use it in our testing?
Each of these questions presents its own series of variables, the importance of which will depend on your own situation. Tables 1, 2 and 3 present the commonly encountered issues for each of the three questions.

As you review each issue listed, try making an initial determination of its importance on a scale of critical, important, minor or irrelevant. If you do not know its importance yet, place a "?" by the issue. If you do not understand the brief explanation of the issue, place a "??". In a later article, you will be able to compare your choices to a group of experts' opinions.

Issue 1
To assess the value of live data, we need to know the alternatives.
Everything is relative. Selecting the best type and source of data for a performance test requires awareness of the available alternatives and trade-offs, and the definition of "best" can be highly context-dependent.

Main Alternatives
One alternative to using copies of live data is to devise test scenarios and then script or program automated tests to support these scenarios. Another alternative is to fabricate test data with a data generation tool. A third alternative is to forecast future data by curve fitting and trending. This can be done with live or fabricated data. Other alternatives are hybrids; e.g., an extract of an operational database can be accessed by fabricated transactions coordinated to match the database.

In theory, we do not need live data if we define performance testing as checking system characteristics critical to performance (ability to support a specified number of threads, database connections, scaling by adding extra servers, no resource leaks, etc.). We need only a load which will show performance is in line with specifications.

In practice, I favor a mix of data sources. While the judgment of experienced testers is invaluable, we all have unrecognized assumptions and biases. Even fabricated data that matches the expected usage patterns tends not to uncover problems we assume will never happen.

The most appropriate framework for comparison may not be live data vs. alternative data sources. Live data in black box testing is "black data": we are using unknown data to test unknown software


behavior. The data source alternatives are not the full story. The system vulnerabilities and comparability of test and production environments also are significant to assessing value.

Allocating Resources
What mix of data from different sources, live and otherwise, is most appropriate? If we do not consider all potential sources of data, our perspective, and thus the way we test, may be limited.

Testers benefit by allocating their efforts appropriately among different test approaches and sources, and understanding the alternatives helps improve these decisions. Though allocations often change as a test project progresses, having a realistic sense of the alternatives at project initiation helps us plan.

Scripted Tests
Compared to live data extracts, scripted test cases tend to be more effective because each is focused, aimed at confirming a particular behavior or uncovering a particular problem. But they work only if we know what we are looking for.

Compared to a high-volume approach using an undifferentiated deluge of tests, the total coverage by a compact suite of scripted test cases is likely to be low. However, the coverage of important conditions is high because of the focusing. The cost of crafting and maintaining individual test cases usually is high for each test case.

Data Generators
GIGO (garbage in, garbage out) describes the predominant way that data generators are utilized. The tool output, called fabricated or synthetic test data, often is focused for the wrong reasons. Over-simplifying the problem, unfamiliarity with these tools, tool quirks, knowing the test context only superficially, and lack of imagination are not unusual. All can lead to hidden, unwanted patterns in the fabricated data that might give false readings.

Fabricated data often lacks the richness of reality. Fabricating data is more difficult when data items have referential integrity or data represents recognizable entities; no random character string can fully replace a stock ticker symbol or customer name. Another pitfall with fabricated data is consciously or unconsciously massaging the data so tests always pass. Our job is to try to break it.

Determining Value
How do we assess and compare value? "Value to whom?" is a key question. The framework in the first article in this series identifies characteristics which influence value. The value of a data source can be measured by the problems it finds, reducing the risk and impact of releases to stakeholders.

Value depends on what is in the data (what it represents) and what is being tested. If the data has a low user count, it will not stress connection tables. The same repeated data stream, even if real, likely won't have much effect on testing connections per second or connection scavenging in a stateful device.

Value is related to usefulness and thus is relative to the intended use. Live data is effective in some areas, such as realistic background noise, but less so for others, such as functional testing. On the other hand, fabricated data can be designed to match characteristics desirable for a given situation. The value of the crafted data is high for that purpose, usually higher than live data.

Baselines
A baseline is the "before" snapshot of system performance for before-and-after comparisons, and is built from live data. Baseline test suites can be effective in catching major architectural problems early in prototypes, when redesign is still practical. Fabricated data also has proved useful in uncovering basic problems early, though considerable time can be spent solving problems that would never occur in realistic operating conditions.

Trade-Offs
The goal of a load test, 90 percent of the time, is to "simulate a real-world transaction load, gauge the scalability and capacity of the system, surface and resolve major bottlenecks." While thoughtful augmentation of live data can surface performance anomalies, the largest degradation tends to come from application issues and system software configuration issues. Unless you have no live data at all and must create test data from scratch, the time spent creating your own data to test data boundary conditions, for example, is usually not worth the extra effort.

Unique Benefits of Live Data
The great benefit of live data is undeniable: it is reality-based, often with a seemingly haphazard, messy richness. The data variety, vagaries and juxtapositions are difficult to replicate, even by the most canny tester.

In areas like building usage profiles and demand forecasting, there is no substitute for live data. Capacity planning, for example, depends on demand forecasting, which in turn depends on trends. Live data snapshots are captured over a (relatively glacial) duration, and compared to see the rates of change. The trends are then extrapolated into the future, to help determine the trigger points for adding resources.

Common Blunders
Two common blunders in performance testing are:
1. Creating and then solving problems that do not exist in the real world. These problems originate from bad assumptions, biases or artifacts of the simulation.
2. Failing to determine actual peak demands in terms of volume and mix of the load experienced by the live system, and to correlate this to known periods of inadequate performance.
In either case, mining data from the live system is critical in guiding the scope and implementation of performance testing.

Live Data Limitations
Live data does not always have the variation to adequately test the paths in our applications. We may not always catch the results of an outlier if we use only unenhanced live data. In a new organization there may not be enough live data: the quantity of data available is not enough to effectively test the growth potential (or, for the pessimistic, is not enough pressure to find where the application breaks).

In health care or financial organizations, among others, using live data could expose your company to lawsuits. To remain Sarbanes-Oxley compliant, it may be worthwhile to scramble live data in the test environment while retaining data integrity, and still test with enough data points.
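One common way to scramble live data while retaining data integrity is deterministic masking: the same production value always maps to the same masked value, so relationships between tables still line up. The sketch below is illustrative only; the record layout and field names are invented, and a production-grade scheme would use a keyed (secret) hash and cover every regulated field.

import hashlib

def mask(value, prefix):
    # Deterministic one-way masking: equal inputs give equal outputs,
    # so foreign-key relationships survive the scrambling.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:10]
    return f"{prefix}-{digest}"

patients = [{"patient_id": "P1001", "name": "Jane Roe", "ssn": "123-45-6789"}]
claims = [{"claim_id": "C9001", "patient_id": "P1001", "amount": 85.33}]

masked_patients = [
    {**p,
     "patient_id": mask(p["patient_id"], "PAT"),
     "name": mask(p["name"], "NAME"),
     "ssn": mask(p["ssn"], "SSN")}
    for p in patients
]
masked_claims = [{**c, "patient_id": mask(c["patient_id"], "PAT")} for c in claims]

# The claim still points at the same (now masked) patient, so volumes,
# distributions and joins are preserved while identities are not.
assert masked_claims[0]["patient_id"] == masked_patients[0]["patient_id"]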

Issue 2
The live data chosen for testing does not reveal important behaviors we could encounter in actual operation.
This is a risk, not a certainty. Black box live data may or may not reveal important issues.

TABLE 1: SELECTING THE LIVE DATA TO USE IN TESTING (rate the importance of each issue to you)
1. To assess the value of live data, we need to know the alternatives.
2. The live data chosen for testing does not reveal important behaviors we could encounter in actual operation.
3. Unenhanced, live data has a low probability of uncovering a performance problem.
4. Test data enhancement is a one-time activity, not ongoing, agile and exploratory.
5. The data we want is not available, or not easy to derive from what is available.
6. Background noise is not adequately represented in live data.

TABLE 2: OBTAINING THE LIVE DATA (rate the importance of each issue to you)
7. Live data usually can be monitored and collected only from limited points, or under limited conditions.
8. Tools and methods influence the collection process, in ways not well understood by the testers.
9. The capture and extract processes change the data, regardless of which tools we employ. The data no longer is representative, with subtle differences that go unrecognized.
10. The live data is circumscribed.
11. The live data has aged.
12. The data sample selected for testing is not statistically valid.
13. Important patterns in the data, and their implications, go unrecognized.

Often it makes sense to supplement black box, end-to-end tests with ones focusing on a particular component, subsystem, tier or function. There are probably enough other aspects of the system that aren't quite "real" that the representativeness of the data is just one of many issues.

Testers' Limited Knowledge
Most testers do not know the live data content and its meaning, unless they already are closely familiar with the situation, or have extensive time and motivation to learn about the live data. An extensive learning effort often is required because of the live data's richness, volume and complexity.

Running a volume test with no understanding of the data does not prove anything, whether it passes or fails. Developers will not be thrilled with the extra work if the test team is not capable of determining whether the data created a valid failure scenario.

Confluent events may trigger telltale symptoms of problems in the test lab. But if the confluence and its symptoms are unknown, testers do not know what data values and pattern or sequence to look for. The testers thus cannot check for the pattern's presence in the live test data.

Over-Tuning Test Data
The representativeness of the test data is just one of many compromises. Other aspects of the system often are not sufficiently realistic. For example, we might better add monitoring capabilities to the product than fiddle with the test data.
Beware of becoming too sophisticated and losing track of your data manipulations. For example, while live data can provide coverage for random outlying cases in some situations, it can be a trap to incorporate changes in usage over time.

Problems That Data Refinement Can't Fix
Using live data reveals one important piece of information: whether the application will perform problem-free with that specific data and in a specific test lab. As for live data revealing any other behaviors, that depends on how you use your live data. For example, if you read live data from the beginning of a file for each test run, you expect predictably repeatable response time graphs. If you randomly seed your live data (start at a random point in the file each time), you may not experience repeatable behavior.

Corner cases must be tested. No matter what live data is chosen, it will not necessarily be representative of real-world situations: the proper mix of applications, user actions and data. For example, no network is simple, and no simulated combination of traffic will ever exercise it fully. Production is where the rubber meets the road. Smart testers may choose to stress as much of the device or infrastructure as possible, to see how each device operates and how it affects other devices in the network.

Although no amount of testing is guaranteed to reveal all important behaviors, captured live data can reveal many that occur in live operation. This is true whether or not we investigate and understand these behaviors. Unrecognized behaviors are not detected, but nonetheless are present and possibly will be discovered later. Live data can only trigger an incomplete sample of operational behaviors. Other behaviors in operation are not included, and some are likely to be important. "Important" is in the eye of the beholder.

In summary, live data can reveal important behaviors, and also mislead by throwing false positives and negatives. Much depends on the details, for example, of how we answer questions like these: What period of time is chosen, and why? What are the resonances in a new system being tested? Does the live data reflect any serialization from the logging mechanism? What errors resulted in not logging certain test sessions, or caused issues from thread safety challenges or memory leaking?

Scaling up live data can be difficult if we are interested in finding a volume-related failure point, and the data is closely correlated to the volume.

For brevity, the remaining issues are summarized and may not be specifically called out.

Issue 3
Unenhanced, live data has a low probability of uncovering a performance problem.
Most live data has been captured under "normal" working conditions. The data needs to be seeded with opportunities to fail, based on a risk assessment and a failure model. Running a copy of live data as a test load seems practical but may not produce reliable performance data. Confluent events may trigger telltale symptoms in the test lab, such as a cusp (sudden increase in gradient) in a response time curve. If the confluence and its symptoms are unknown, testers do not know what behaviors to trigger or patterns to look for.

Live data will reveal important behaviors, and will also throw false positives and negatives. "Important" is in the eye of the beholder. You could argue that you might need better testers who understand the data. Running a volume test with no understanding of the data isn't going to prove anything, whether it passes or fails. And a development team will not be thrilled to have to do all the analysis because the test team is not capable of determining if the data has created a valid failure or success scenario.

Issues 4, 5 and 6
Test data enhancements, data unavailability and background noise.
More a craft than a science, test load design (TLD) prepares work loads for use in performance and load testing.

TABLE 3: USING THE LIVE DATA IN TESTING (rate the importance of each issue to you)
14. Running data in a test lab requires replay tools which may interfere with the testing.
15. Capture and replay contexts are not comparable.
16. Even if the same test data is re-run under apparently the same conditions, the system performance differs.
17. The test results from using the live data are not actionable.
18. The test data is not comparably scalable.
19. Live data selected for functional testing may not be as effective if used for performance testing, and vice versa.
20. Coincidental resonances and unintended mismatches occur between the live data, the system and its environment.

A load is a mix of demands, and can be denominated in a variety of units of measure: clicks, transactions, or whatever units fit the situation. TLD is one of the more important responsibilities of performance testers, and many see it as a critical competency.

TLD is situational and heuristic, with four main approaches:

• Develop test scenarios and script test cases. (Typically based on documented or assumed performance requirements.)
• Generate volumes of fabricated data. (Typically using a homebrew or commercial test data generation tool.)
• Copy, massage and enhance live data. (This depends on the availability of live data in a usable form.)
• A combination of the first three.

Performance Requirements
Requirements are more about user satisfaction than metrics targets. Though they may need to be quantified to be measurable, it is more important that the requirements reflect the aspirations of users and stakeholders. If the requirements are not adequate, a common situation, we may have to expand the test project scope to include specifying them. We can capture aspirations by conducting a usability test and a user satisfaction survey.

Using Equivalence and Partitioning to Confirm Bellwethers
We use equivalence to group similar test situations together, and pick a representative member from each group to use as a test case. We want the best representative test case, as within an equivalence class (EC) some are more equal than others. Despite our careful attempts at demarcation, uncertainties mean many ECs are fuzzy sets (i.e., with fuzzy boundaries; membership of the set is not black and white). The costs to develop test cases vary. The best representative test case is the one which reliably covers the largest population of possible test cases at the lowest cost.

For example, let's assume that we create a test case to print a check, to pay a person named John Smith $85.33. The system prints the check correctly within 15 seconds, which is our response time goal. Since the system worked correctly with this test transaction, do we need to investigate its behavior if we request a check for John Smith in the amount of $85.32 or $85.34? Probably not.

If everything else, like background noise, remains the same, most of us are willing to assume that the behavior of the system is essentially the same ("equivalent") under these three different conditions. Similarly, if the first transaction fails to print a check within 15 seconds (to John Smith for $85.33), can we assume that the other two test transactions will fail too, and therefore not bother to process them? Most testers would say yes.

Of course, these are only assumptions, not known facts. It could be, though we don't know this, that John Smith has exactly $85.33 in his bank account. The first transaction for $85.33 works, but a request to print a second check for John Smith for $85.34 or more will not be honored. The response time becomes the duration to deposit sufficient funds in John Smith's account, or infinite if the new deposit does not happen.

What if the system prints one check correctly, but because of a misinterpreted requirement is designed to not print a second check for the same person on the same day? We would not find this performance bug if we assume equivalence and use only one test transaction.

Most of us instinctively use equivalence while we are testing. If one test case results in a certain behavior, whether acceptable or not, we simply assume other equivalent test cases would behave a similar way without running them.
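The selection logic above can be written down as a small sketch: keep one representative per equivalence class and spend the saved effort on the cases where the equivalence assumption is most likely to break. The amounts and risk cases below simply restate the check-printing example and are not drawn from a real system.

# Equivalence-based selection for the check-printing example.
equivalent_amounts = [85.32, 85.33, 85.34]     # assumed to behave identically
representative = equivalent_amounts[1]         # one member stands in for the class

risk_cases = {
    "amount equal to the account balance": 85.33,   # boundary that breaks equivalence
    "amount just over the account balance": 85.34,
    "second check, same payee, same day": representative,
}

tests = [("class representative", representative)] + list(risk_cases.items())
for label, amount in tests:
    print(f"print check to John Smith for ${amount:.2f}  [{label}]")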

Modeling Performance Failures
Test design is based, consciously or otherwise, on a theory of error, also called a failure model or fault model.

In functional testing, an example of an error is returning an incorrect value from a computation. In performance testing, an example is returning the correct value but too late to be useful. Another: not being able to handle more than 1,000 users when the specs say up to 10,000 must be supported concurrently.

When we craft an individual test case, we assign an objective to it: either to confirm an expected behavior that may be desirable or undesirable, or to try to find a bug. In the latter case, we effectively reverse-engineer the test design, starting with a failure, working backwards to the faults, i.e., possible causes of failure, and then to the conditions and stimuli that trigger the failure. Test cases are then designed to exercise the system by trying to exploit the specific vulnerability of interest.

Our test design is driven by our theory of error. Do not worry initially about confusing failures and faults; their relationships can distract, but unless you are a Sufi philosopher the causes and effects fall into place. (Sufis do not believe in cause and effect.) Remember that one failure can be caused by many different faults, and one fault can trigger anywhere from zero to an indefinitely large number of failures. There is no one-to-one relationship.

TLD with Live Data
With live data, TLD has a different flavor than traditional test case design. Instead of building a new test case or modifying an existing one, we seed the situation, i.e., embed it into the test data from live operations.

We can survey the pristine data to identify opportunities to exploit the suspected vulnerability. If none exist, we modify the data or add opportunities to it. If the surveying effort is a hassle, we can skip it and simply enrich the original data.

Agile Feedback in Testing
We may not be aware of test data problems unless we organize a feedback loop. Testing often has a one-way progression, with little feedback about how well the test data worked. Feedback loops can be informal to the point of neglect, not timely and action-oriented, or encumbered with paperwork. Only when live data leads to obviously embarrassing results do we question the accepted approach.

To skeptics, this acceptance of the status quo is sensible: if it is not broken, don't fix it. Perhaps the real problem is overly anxious nit-pickers with too much time on their hands. Perhaps our testing is really not that sensitive to data nuances.

Ask your skeptics: "How credible are your current test results? How confident are you when you go live? Do you risk service outages, or over-engineer to avoid them?" Many have thoughtful insights; more stumble when asked.

Obtaining useful feedback does not have to be cumbersome. We build a prototype test load from live data, run it in trials and examine the outcomes for complications. Here "build" includes "massage and improve." Without feedback there is no learning, and there is pressure to deliver a perfect test load the first time.

I plan at least three iterations of refining the data.


If you have 10 hours to spend on getting the data right, do not spend 8 hours elaborately capturing it before you have something to review. Instead, plan to invest your hours in the pattern 1-2-3-4: expect several trials, have a prototype ready within one hour, and reserve more time at the end for refinement than at the beginning.

Agile Feedback in Live Operations
Actionable, responsive and timely feedback matters more in live operation than in the relative safety of a test lab, because the feedback cycle durations tend to be much shorter. Systems are labeled as unstable (a) if they have, or are suspected of having, unpredictable behavior, and (b) when that behavior happens, our corrective reaction times are too slow to prevent damage.

Put another way, systems with uncertainty and inertia are hard to control. We can't effectively predict their future behavior, analyze data in real time to understand what events are unfolding, nor quickly correct behavior we don't like. By the time we find out that the system performance or robustness is poor, conventional responses like tuning, rewriting and optimizing code, and adding capacity may be inadequate.

Question: What type of live data facilitates or impedes timely feedback?

Preparing Test Loads from Live Data
While the specifics vary, the process of preparing a test load from live data typically includes these six steps. Some or all of the steps typically are repeated in a series of cycles:

Step 1: Determine the characteristics required in the test load, i.e., the mix of demands to be placed on the system under test (SUT). Often these are specified in rather general terms, e.g., "Copy some live data and download it to the test lab."

Step 2: Capture or extract data with the desired characteristics, or re-use existing extracts if compatibility and data aging are not problems.

Step 3: Run the performance or load test using the extract(s). Sometimes the same test run fulfills both functional test and performance/load test objectives.

Step 4: Review the output test results for anomalies. Often the anomalies are not pre-defined nor well understood. A common approach: "If there's a glitch, I will know it when I see it."

Step 5: Provide feedback.
Step 6: Take corrective actions if anomalies are detected.
These steps are iterative.
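These steps lend themselves to a simple iterative harness. The skeleton below is only an outline of the cycle; every helper function is a placeholder for whatever capture, replay and analysis tooling you actually use.

# Outline of the iterative prepare-run-review cycle for live-data test loads.
def prepare_load(characteristics, feedback):
    """Steps 1-2: capture or extract live data matching the desired mix,
    adjusted for whatever the previous cycle taught us."""
    return {"characteristics": characteristics, "notes": feedback}

def run_load_test(load):
    """Step 3: replay the extract against the system under test."""
    return {"anomalies": []}            # placeholder result

def review_results(results):
    """Steps 4-5: flag anomalies and turn them into feedback."""
    return results["anomalies"]

characteristics = {"mix": "copy of last Tuesday's peak hour"}   # often this vague
feedback = None
for cycle in range(3):                  # plan on several refinement passes
    load = prepare_load(characteristics, feedback)
    results = run_load_test(load)
    feedback = review_results(results)  # Step 6: corrective action feeds the next pass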

The Availability of Live Data
Of the six steps above, Step 2 (capturing or extracting the live data) arguably is the critical one. This step is feasible only with comparable prior or parallel experience, and its difficulty decreases as we gain more of that experience.

Obtaining test data for a breakthrough is hardest.
• If a system or feature is radically innovative, there is no precedent to serve as a live data source.
• Something completely new is rare, though, so at least a modicum of comparable history is likely.
Testing a new system tends to be easier.
• If a new system replaces an existing one, it is unlikely that the database structure, transactions and data flows exactly mirror the existing ones.
Testing a new version of an existing system and its infrastructure usually is easiest.
• If live data does not already exist, it can be generated using a prior version of the system being tested.
• Similar data from other parallel situations can be captured and may be usable with little conversion.
• Data capture and extraction entails less work when we regression-test minor changes to existing systems.

Typical Live Data to Collect
Live data is not undifferentiated, though to the untutored it may appear to come in anonymous sets of bits. Not all data is equally good for our purposes. If this claim is true, then what are the desirable characteristics of live data from a performance testing perspective? To answer, I will drill down from the performance goals to the characteristics to monitor or measure, then to the atomic data of interest to us.

Performance Goals
The data we want to gather is based on the testing goals and thus ultimately on the system performance goals. If these goals are not explicit, we can elicit them in requirements interviews. (Caution: eliciting requirements is a major scope increase in testing projects. Do not expand your project without carefully assessing your options.)

Or the performance goals may be outlined in documents like product marketing strategies, user profiles, feature comparisons and analyses of competitive products. Examples of performance goals:

• Our users' work productivity is superior to the comparable productivity of competing organizations or competing systems.

• System response times under normal working conditions generally are within the desired norms (e.g., product catalog searches average 50% faster than our 5 fastest competitors; "generally" implies that in some instances the response times are not superior; none of these inferior response times can be for high-dollar-value transactions for our premium customers; a realistic level of background noise is assumed).

• The number of concurrently active users supported and the throughput are acceptable (e.g., at least 1,000 active users; at least 1 task per user completed by 90% or more of these users in every on-line minute; "user task" needs to be defined).

• Response times under occasional peak loads do not degrade beyond an acceptable threshold (e.g., test peak is set to the maximum expected weekly load of 2,500 users; in this mode, the average degraded response time is no more than 25% slower than the norm).

Goals need to be quantified, for objective comparisons between actual values and the targets. I have not bothered to quantify all the goals above, to highlight how vague goals can be without numeric targets, and because doing so introduces another layer of distracting questions. Equally important, the context, i.e., the specific conditions in which we expect the system to meet the goals, must be spelled out.

Performance Testing Goals
Within the framework of the SUT (system under test) and its performance goals, the testing goals can vary considerably. For example, if the test objective is to verify that capacity forecasting works, versus, let's say, predicting a breakpoint, different though related metrics need to be tracked. In both cases, the metrics are complicated by non-linearity.

Capacity forecasting seeks to predict what additional resources are needed, and when and where they need to be added, to maintain an acceptable level of service, let's say in compliance with an SLA (service level agreement).

Predicting a breakpoint, by contrast, involves testing with increasing load, monitoring how metrics like response time and throughput change with the increasing load, and extrapolating the trends (hopefully not with a straight line) until the response time approaches infinity or the throughput approaches zero, or both.
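As a sketch of the breakpoint case, suppose a ramped test has produced a handful of (load, response time) measurements. One crude way to extrapolate is to fit a saturation-style curve and solve for the load at which the projected response time passes an acceptable ceiling. The measurements, the model and the thresholds below are all made up for illustration.

# Crude breakpoint estimate from ramped-load measurements (illustrative only).
# Model: response_time = a + b / (capacity - load), which blows up near capacity.
measurements = [(50, 0.40), (100, 0.45), (150, 0.55), (200, 0.75), (250, 1.20)]

def predicted(load, a, b, capacity):
    return a + b / (capacity - load)

best = None                      # simple grid search instead of a real curve fit
for capacity in range(260, 600, 5):
    for a in (0.2, 0.3, 0.4):
        for b in (5, 10, 20, 40):
            err = sum((predicted(l, a, b, capacity) - t) ** 2 for l, t in measurements)
            if best is None or err < best[0]:
                best = (err, a, b, capacity)

_, a, b, capacity = best
ceiling = 5.0                    # assumed worst acceptable response time, seconds
breakpoint_load = capacity - b / (ceiling - a)
print(f"estimated saturation: ~{capacity} req/s; "
      f"response time passes {ceiling}s at ~{breakpoint_load:.0f} req/s")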

Testing goals influence the data needs. The relationship can be reversed: the data availability influences the test goals, sometimes inappropriately.

Characteristics to Monitor or Measure
We test to evaluate whether a system's performance goals have been met satisfactorily. Effective goals are expressed in terms of the desired values of performance characteristics: averages to be met, ratios and thresholds not to be exceeded, etc. Observing or calculating the values of the characteristics is vital to this evaluation.

Characteristics of interest can be static (e.g., the rated bandwidth of a network link, which does not change until the infrastructure is reconfigured or reacts to changing demands), but are more likely to be dynamic. The values of many dynamic performance characteristics depend on (a) the loads and (b) the resources deployed. Measuring performance is pointless without knowing the load on the system and the resources utilized at the time of measurement.

Static characteristics include the allocated capacities (unless the system and infrastructure are self-tuning): memory capacity, for each type of storage and at each storage location; processing capacity, e.g., rated speeds; and network capacity (rated bandwidths of links).

Other static characteristics include the on/off availability of pertinent features like load balancing, firewalls, server cluster failover and failback, and topology (i.e., hub vs. spoke architecture).

Dynamic characteristics include:
• Response times, point-to-point or end-to-end delays, wait times and latencies.
• Throughput, e.g., units of work completed per unit of time, such as transactions per second.
• Availability of system features and capabilities to users.
• Number of concurrently active users.
• Error rates, e.g., by type of transaction, by level of severity.
• Resource utilization and spare capacity, queue lengths, number and frequency of buffer overflows.
• Ability to meet service level agreements.
• Business-oriented metrics like $ revenue per transaction, and the cost overheads allocated to users.

Atomic Data to Harvest
If a characteristic or metric is not ready to gather, we may be able to calculate it from more fundamental data, if that data is available. A dependent variable is one which is derived from one or more independent variables. An independent one by definition is atomic. We calculate performance characteristics from atomic data. Whether atomic or not, data of interest would include user, work and event data and counts, and resource utilization stats.

Sometimes the atomic data is not available, but derivatives are. The lowest-level dependent data that we can access effectively becomes our basis for calculation. Examples:

Timings
• Expected cause-and-effect relationships among incidents.
• Duration of an event.
• Elapsed time interval between a pair of incidents.
• Synchronization of devices.
Rates of change
• Number of user log-ons during an interval.
• Number of log-offs in the same interval.

Fighting the Last War
Using live data is like driving an automobile by looking in the rear-vision mirror. The data reflects the past. For example, if the growth rate at a new website exceeds 50 percent a week, a two-week-old copy of live data understates current demands by more than 75 percent.

The growth rate is the rate of change from when the live data was captured to when the test is run. Growth rates are both positive and negative. Negative growth, of course, is a decline.

The volumes of some types of data may grow while others decline, as the mix rotates. If they cancel each other out, the net growth is zero. We cannot work with change in the aggregate unless we are confident the consequences are irrelevant, but must separately consider the change for each main type of work.

The boundary values, e.g., a growth rate of +15%, are not fixed by scientific laws but are approximate.

Over time, you'll accumulate experience and data from your own projects. And when you're confident in the accuracy of your growth rates, replace the approximate values with your own.


AUTOMATE PERFORMANCE TESTS
Bowl Over Competitor Web Sites With Techniques From The Real World

By Sergei Baranov

(Sergei Baranov is principal software engineer for SOA solutions at test tools maker Parasoft Corp.)

The successful development of scalable Web services requires thorough performance testing. The traditional performance testing approach, where one or more load tests are run near the end of the application development cycle, cannot guarantee the appropriate level of performance in a complex, multi-layered, rapidly changing Web services environment. Because of the complexity of Web services applications and an increasing variety of ways they can be used and misused, an effective Web services performance testing solution will have to run a number of tests to cover multiple use case scenarios that the application may encounter. These tests need to run regularly as the application is evolving so that performance problems can be quickly identified and resolved. In order to satisfy these requirements, Web services performance tests have to be automated. Applying a well-designed, consistent approach to performance testing automation throughout the development lifecycle is key to satisfying a Web services application's performance requirements.


This article describes strategies for successful automation of Web services performance testing and provides a methodology for creating test scenarios that reflect tendencies of the real-world environment. To help you apply these strategies, it introduces best practices for organizing and executing automated load tests and suggests how these practices fit into a Web services application's development life cycle.

Choosing a Performance Testing Approach
Performance testing approaches can be generally divided into three categories: the "traditional" or "leave it 'til later" approach, the "test early, test often" approach, and the "test automation" approach. The order in which they are listed is usually the order in which they are implemented in organizations. It is also the order in which they emerged historically.

The "traditional" or "leave it 'til later" approach. Traditionally, comprehensive performance testing is left to the later stages of the application development cycle, with the possible exception of some spontaneous performance evaluations by the development team. Usually, a performance testing team works with a development team only during the testing stage, when both teams work in a "find problem, fix problem" mode.

Such an approach to performance testing has a major flaw: it leaves the question of whether the application meets its performance requirements unanswered for most of the development cycle. Unaware of the application's current performance profile, developers are at risk of making wrong design and architecture decisions that could be too significant to correct at the later stages of application development. The more complex the application, the greater the risk of such design mistakes, and the higher the cost of straightening things out. Significant performance problems discovered close to release time usually result in panic of various degrees of intensity, followed by hiring application performance consultants, last-minute purchase of extra hardware (which has to be shipped overnight, of course), as well as performance analysis software.

The resolution of a performance problem is often a patchwork of fixes to make things run by the deadline. The realization of the problems with the "leave it 'til later" load testing practice led to the emergence of the "test early, test often" slogan.

The "test early, test often" approach. This approach was an intuitive step forward toward resolving significant shortcomings of the "traditional" approach. Its goal is reducing the uncertainty of application performance during all stages of development by catching performance problems before they get rooted too deep into the fabric of the application. This approach promoted starting load testing as early as application prototyping and continuing it through the entire application lifecycle.

However, although this approach promoted early and continuous testing, it did not specify the means of enforcing what it was promoting. Performance testing still remained the process of manually opening a load testing application, running tests, looking at the results and deciding whether the report table entries or peaks and valleys on the performance graphs mean that the test succeeded or failed.

This approach is too subjective to be consistently reliable: its success largely depends on the personal discipline to run load tests consistently, as well as the knowledge and qualification to evaluate performance test results correctly and reliably. Although the "test early, test often" approach is a step forward, it falls short of reaching its logical conclusion: the automation of application performance testing.

The "performance test automation" approach. The performance test automation approach provides the means to enforce regular test execution. It requires that performance tests run automatically as a scheduled task, most commonly as a part of the automated daily application "build-test" process.

In order to take full advantage of automated performance testing, however, regular test execution is not enough. An automated test results evaluation mechanism should be put into action to simplify daily report analysis and to bring consistency to load test results evaluation.

A properly implemented automated performance test solution can bring the following benefits:

• You are constantly aware of the application's performance profile.
• Performance problems get detected soon after they are introduced, due to regular and frequent test execution.
• Test execution and result analysis automation makes test management very efficient.

FIG. 1: STRIKE TEAM. The automated build-test flow: developers check new source code and code changes into the source code repository; the nightly build system pulls the entire application source code, builds the application, runs the functional regression tests and performance regression tests, and passes the results to the reporting system, which analyzes and processes them into test reports.

FIG. 4 (load test scenario category breakdown, referenced later in this article): a Web service load test scenario characterized by type of use (regular use, misuse, malicious use), content type (WSDL requests, service requests), type of load (average load, peak load), emulation mode (virtual user, request per second), and stress and endurance tests.


Because of this efficiency gain, the number of performance tests can be significantly increased. This allows you to:

• Run more use case scenarios to increase test coverage.
• Performance test sub-systems and components of your application in isolation to improve the diagnostic potential of the tests.
• Make test results more consistent through automated test report analysis.

Your performance testing solution also becomes less vulnerable to personnel changes in your organization, since both the performance tests and their success criteria are automated.

Of course, implementing performance test automation has its costs. Use common sense in determining which tests should be automated first, and which come later. In the beginning, you may find that some tests can be too time- or resource-consuming to run regularly. Hopefully, you will return to them as you observe the benefits of performance test automation in practice.

Once you've made a decision to completely or partially automate Web services load testing in your organization, it is time to consider the principles of how your performance test infrastructure will function.

Automating the Build-Test Process
A continuous or periodic daily/nightly build process is common in forward-looking development organizations. If you want to automate your performance tests, implementing such a process is a prerequisite. Figure 1 shows the typical organization of a development environment in terms of how source code, tests, and test results flow through the automated build-test infrastructure.

It makes sense to schedule the automated build and performance test process to run after hours, when the code base is stable and when idling developer or QA machines can be utilized to create high-volume distributed loads. If there were failures in the nightly performance tests, analyzing the logs of your source control repository in the morning will help you isolate the parts of the code that were changed and which likely caused the performance degradation. It is possible that the failure was caused by some hardware configuration changes; for this reason, keeping a hardware maintenance log in the source control repository will help to pinpoint the problem. Periodic endurance tests that take more than 12 hours to complete could be scheduled to run during the weekend.

Automating Performance Test Results Analysis
In a traditional manual performance testing environment, a quality assurance (QA) analyst would open a load test report and examine the data that was collected during the load test run. Based on system requirements knowledge, he or she would determine whether the test succeeded or failed. For instance, if the CPU utilization of the application server was greater than 90 percent on average at a certain hit-per-second rate, the test should be declared as failed.

This type of decision making is not applicable to automated performance testing. The results of each load test run must be analyzed automatically and reduced to a success or failure answer. If this is not done, daily analysis of load test reports would become a time-consuming and tedious task. Eventually, it would either be ignored or become an obstacle to increasing the number of tests, improving coverage, and detecting problems.

To start the process of automating load test report analysis and reducing results to a success/failure answer, it is helpful to break down each load test report analysis into sub-reports called quality of service (QoS) metrics. Each metric analyzes the report from a specific perspective and provides a success or failure answer. A load test succeeds if all its metrics succeed. Consequently, the success of the entire performance test batch depends on the success of every test in the batch:
• The performance test batch succeeds if
• each performance test scenario succeeds, which in turn requires that
• each QoS metric of each scenario succeeds.
It is convenient to use performance test report QoS metrics because they have a direct analogy in the realm of Web services requirements and policies. QoS metrics can be implemented via scripts or tools of the load test application of your preference and can be applied to the report upon the completion of the load test. Another advantage of QoS metrics is that they can be reused. For instance, a metric that checks the load test report for SOAP Fault errors, the average CPU utilization of the server, or the average response time of a Web service can be reused in many load tests.
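A QoS metric of this kind reduces to a small, reusable predicate over the report. The sketch below invents a minimal report structure and thresholds purely for illustration; a real implementation would read whatever report format your load testing tool produces.

# Reusable QoS metrics applied to a load test report (structure is invented).
report = {
    "soap_fault_count": 0,
    "avg_cpu_percent": 72.5,
    "avg_response_ms": {"getQuote": 180, "placeOrder": 650},
}

def no_soap_faults(r):
    return r["soap_fault_count"] == 0

def cpu_below(limit):
    return lambda r: r["avg_cpu_percent"] < limit

def avg_response_below(operation, limit_ms):
    return lambda r: r["avg_response_ms"][operation] < limit_ms

qos_metrics = {
    "no SOAP faults": no_soap_faults,
    "average CPU below 90%": cpu_below(90),
    "placeOrder average response below 500 ms": avg_response_below("placeOrder", 500),
}

results = {name: metric(report) for name, metric in qos_metrics.items()}
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
print("load test", "SUCCEEDED" if all(results.values()) else "FAILED")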


FIG. 2: QUALITY OF SERVICE METRICS


A section of a sample load test report that uses QoS metrics is shown in Figure 2.

Collecting Historical Data
Load test report analysis automation creates a foundation for historical analysis of performance reports. Historical analysis can reveal subtle changes that might be unnoticeable in daily reports and provides insight into the application's performance tendencies. As new functionality is introduced every day, the change in performance may be small from one day to the next, but build up to significant differences over a long period of time. Some performance degradations may not be big enough to trigger a QoS metric to fail, but can be revealed in performance history reports. Figure 3 shows an example of a QoS metric performance history report.

Once you have established an automated testing infrastructure, it is time to start creating load test scenarios that will evaluate the performance of your system.

Creating Performance Test Scenarios: General Guidelines
Performance test scenarios should be created in step with the development of the application functionality to ensure that the application's performance profile is continuously evaluated as new features are added. To satisfy this requirement, the QA team should work in close coordination with the development team over the entire application life cycle. Alternatively, the development team can be made responsible for performance test automation of its own code. The practice of creating a unit test or other appropriate functional test for every feature or bug fix is becoming more and more common in forward-looking software development organizations. The same practice could be successfully applied to performance tests as well.

The best way to build performance tests is to reuse the functional application tests in the load test scenarios. With this approach, the virtual users of the load testing application run complete functional test suites, or parts of functional test suites, based on the virtual user profile role. When creating load test scenarios from functional tests, make sure that the virtual users running functional tests do not share resources that they would not share in the real world (such as TCP sockets, SSL connections, HTTP sessions, SAML tokens, etc.).
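A quick sketch of that last point: if every virtual user shares one client session, the load profile no longer resembles a real population of users. The session class below is a stand-in for whatever HTTP/SOAP client your functional tests actually use.

import threading

class ClientSession:
    """Stand-in for a client session (its own socket, cookies, SSL state, tokens)."""
    def __init__(self, user_id):
        self.user_id = user_id
        self.cookies = {}

def sample_functional_suite(session):
    session.cookies["logged_in"] = session.user_id   # placeholder test steps

# Anti-pattern: a single shared session means all virtual users reuse the same
# cookies, connections and tokens, which real users never would.
# shared_session = ClientSession("everyone")        # do not do this

# Preferred: each virtual user owns its session for the life of the test run.
def virtual_user(user_id, suite):
    session = ClientSession(user_id)   # per-user connection and security state
    suite(session)

threads = [threading.Thread(target=virtual_user, args=(f"vu-{i}", sample_functional_suite))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()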

Following either the traditional or the test early, test often performance testing approach usually results in the creation of a small number of performance tests that are designed to test as much as possible in as few load test runs as possible. Why? The tests are run and analyzed manually, and the fewer load tests there are, the more manageable the testing solution is. The downside of this approach is that load test scenarios which try to test everything in a single run usually generate results that are hard to analyze.

If performance testing is automated, the situation is different: you can create a greater number of tests without the risk of making the entire performance testing solution unmanageable. You can take advantage of this in two ways:

• Extend high-level Web services performance tests with subsystem or even component tests to help isolate performance problems and improve the diagnostic ability of the tests.
• Increase the number of tests to improve performance test coverage.

Improving Diagnostic Ability of Performance Tests
As a rule, more generic tests have greater coverage. However, they are also less adept at identifying the specific place in the system that is responsible for a performance problem. Metaphorically speaking, such tests have greater breadth, but less depth. More isolated tests, on the other hand, provide less coverage, but are better at pointing to the exact location of a problem in the system internals. In other words, because they concentrate on a specific part of the system, they have greater depth but less breadth. An effective set of performance tests would contain both generic high-level (breadth) tests and specific low-level (depth) tests that complement each other in improving the overall diagnostic potential of a performance test batch.

For instance, a high-level performance test that invokes a Web service via its HTTP access point might reveal that the service is responding too slowly. A more isolated performance test on an EJB component or an SQL query that is being invoked as a result of the Web service call would more precisely identify the part of the application stack that is slowing down the service. With the automated performance testing system in place, you can easily increase the number of tests and augment the high-level tests that invoke your Web services via their access points with more isolated, low-level tests that target the performance of the underlying tiers, components, sub-systems, internal Web services, or other resources your application might depend on.

In practice, you don't have to create low-level isolated tests for all components and all tiers to complement the high-level tests. Depending on the available time and resources, you can limit yourself to the most important ones and build up isolated performance tests as problems arise. For example: while investigating a high-level Web service test failure, let's say that a performance problem is discovered in an SQL query. Once the problem is resolved in the source code, secure the fix by adding an SQL query performance test that checks for the regression you just fixed.

FIG. 3: TEAM HANDICAP. An example of a QoS metric performance history report.


This way, your performance test library will grow "organically" in response to arising needs.

Increasing Performance Test Coverage

The usefulness of the performance tests is directly related to how closely they emulate request streams that the Web services application will encounter once it is deployed in the production environment. In a complex Web services environment, it is of the essence to choose a systematic approach in order to achieve adequate performance test coverage. Such an approach should include a wide range of use case scenarios that your application may encounter.

One such approach is to develop load test categories that can describe various sides of the expected stream of requests. Such categories can describe request types, sequences, and intensities with varying degrees of accuracy. An example of such a category breakdown is shown in Figure 4.

FIG. 4: CATEGORY BREAKDOWN (a Web service load test scenario broken down by type of use: regular use, misuse, malicious use; content type: WSDL requests, service requests; type of load: average load, peak load, stress test, endurance test; and emulation mode: virtual user, request per second)

Let's consider these categories in more detail. (The load type category analysis of your Web service can obviously include other categories, as well as extend the ones shown in Figure 4.)

Type of Use

Depending on the type of deployment, your Web services can be exposed to various types of SOAP clients. These clients may produce unexpected, erroneous, and even malicious requests. Your load test scenarios should include profiles that emulate such users. The more your Web service is exposed to the outside world (as opposed to being for internal consumption), the greater the probability of non-regular usage. The misuse and malicious use categories may include invalid SOAP requests as well as valid requests with unusual or unexpected values or request sizes.

For example, if your service uses an array of complex types, examine your WSDL and create load test scenarios that emulate requests with expected, average, and maximum possible element counts, as well as element counts that exceed the allowed maximum:

<xsd:complexType name="IntArray">
  <xsd:sequence>
    <xsd:element name="arg" type="xsd:int" maxOccurs="100"/>
  </xsd:sequence>
</xsd:complexType>
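For a schema like the one above, the boundary element counts can be generated programmatically and fed to the load tool. A minimal, tool-agnostic Java sketch (the namespace and operation name are invented for illustration):

public class BoundaryPayloadSketch {
    // Element counts to exercise: minimal, average, the maxOccurs limit, and one past it.
    static final int[] ELEMENT_COUNTS = {1, 50, 100, 101};

    // Builds a SOAP 1.1 envelope with the requested number of <arg> elements.
    static String buildRequest(int elementCount) {
        StringBuilder body = new StringBuilder();
        body.append("<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\"")
            .append(" xmlns:svc=\"http://example.com/intarray\">") // hypothetical service namespace
            .append("<soapenv:Body><svc:process>");
        for (int i = 0; i < elementCount; i++) {
            body.append("<svc:arg>").append(i).append("</svc:arg>");
        }
        body.append("</svc:process></soapenv:Body></soapenv:Envelope>");
        return body.toString();
    }

    public static void main(String[] args) {
        for (int count : ELEMENT_COUNTS) {
            String request = buildRequest(count);
            System.out.println(count + " elements -> " + request.length() + " bytes of SOAP payload");
            // In a real scenario, each payload would be handed to the load generator's virtual users.
        }
    }
}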

Measure service performance with various sizes of client requests and server responses. If the expected request sizes and their probabilities are known (for example, based on log analysis), then create the request mix accordingly. If such data is unavailable, test with the best-, average-, and worst-case scenarios to cover the full performance spectrum.

Emulation Mode

A Web service may or may not support the notion of a user. More generically, it may be stateful or stateless. Your decision to use either virtual user or request per second emulation mode should be based on this criterion. For example, the load of a stateless search engine exposed as a Web service is best expressed in terms of a number of requests per second, because the notion of a virtual user is not well defined in this case. A counterexample of a stateful Web service is one that supports customer login, such as a ticket reservation service. In this context, it makes more sense to use virtual user emulation mode.

If your service is stateless and you have chosen the request per second approach, make sure that you select a test tool that supports this mode. If a load test tool can sustain only the scheduled number of users, the effective request injection rate may vary substantially based on server response times. Such a tool will not be able to accurately emulate the desired request sequence. If the number of users is constant, the request injection rate will be inversely proportional to the server processing time. It will also be likely to fluctuate, sometimes dramatically, during the test.

When load testing stateful Web services, such as services that support the notion of a user, make sure that you are applying appropriate intensity and concurrency loads. Load intensity can be expressed in request arrival rate; it affects the system resources required to transfer and process client requests, such as CPU and network resources.

Load concurrency, on the other hand, affects the system resources required to keep the data associated with logged-in users or other stateful entities, such as session objects in memory, open connections, or used disk space. A concurrent load of appropriate intensity could expose synchronization errors in your Web service application. You can control the ratio between load intensity and concurrency by changing the virtual user think time in your load test tool.
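To illustrate the difference between the two emulation modes, the following minimal Java sketch (hypothetical endpoint and rate; assuming a Java 11+ runtime) injects requests on a fixed schedule and dispatches them asynchronously, so the arrival rate stays constant even when responses are slow, unlike a fixed pool of virtual users, whose effective rate depends on response time and think time:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RequestPerSecondInjector {
    public static void main(String[] args) throws InterruptedException {
        int requestsPerSecond = 50;                        // target arrival rate
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8080/search?q=test")) // hypothetical stateless service
                .GET().build();

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long periodMicros = 1_000_000L / requestsPerSecond;
        // Fire a request every period regardless of how long earlier responses take.
        scheduler.scheduleAtFixedRate(
                () -> client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                            .thenAccept(response -> { /* record status and latency here */ }),
                0, periodMicros, TimeUnit.MICROSECONDS);

        TimeUnit.SECONDS.sleep(60);                        // test duration
        scheduler.shutdownNow();
    }
}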

Content Type

When load testing Web services, it is easy to overlook the fact that SOAP clients may periodically refresh the WSDL which describes the service, to get updates of the service parameters they are about to invoke. The probability of such updates may vary depending on the circumstances. Some SOAP clients refresh the WSDL every time they make a call. The test team can analyze access logs or make reasonable predictions based on the nature of the service.

If the WSDL access factor (the probability of WSDL access per service invocation) is high and the WSDL size is comparable to the combined average size of request and response, then network utilization will be noticeably higher in this scenario, as compared to the one without the WSDL refresh. If your Web services' WSDLs are generated dynamically, the high WSDL access factor will affect server utilization as well. On the other hand, if your WSDLs are static, you can offload your application server by moving the WSDL files to a separate Web server optimized for serving static pages. Such a move will create increased capacity for processing Web service requests.
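To put rough numbers on that overhead (figures invented purely for illustration): if the combined request and response average 8 KB, the WSDL is 40 KB, and the WSDL access factor is 0.5, then each invocation carries an extra 20 KB of WSDL traffic on average, raising the total network traffic per call from 8 KB to roughly 28 KB, about 3.5 times the base figure.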

Type of Load

To ensure that your Web services application can handle the challenges it will face once it is deployed in production, you test its performance with various load intensities and durations. Performance requirement specifications should include metrics for both expected average and peak loads. After you run average and peak load scenarios, conduct a stress test. A stress test should reveal the Web services application's behavior under extreme circumstances, which would cause your application to start running out of resources, such as database connections or disk space. Your application should not crash under this stress.

It is important to keep in mind that simply pushing the load generator throttle to the floor is not enough to thoroughly stress test a Web services application. Be explicit about what part of the system you are stressing. While some parts of the system may be running out of resources, others may be comfortably underutilized. Ask yourself: When this application is deployed in the production environment, will the resource utilization profile be the same? Can you be sure that the parts of the system which were not stressed during the test will not experience resource starvation in the production environment?

For instance, suppose performance tests on the staging environment revealed that the application bottleneck was the CPU of the database server. However, you know that you have a high-performance database server cluster in production. In this case, it is likely that the production system bottleneck will be somewhere else and the system will respond differently under stress. In such a situation, it would make sense to replace parts of your database access code with code stubs that emulate access to the database. The system bottleneck will shift to some other resource, and the test will better emulate production system behavior.

Applying this code stubbing approach to other parts of the system (as described above) will allow you to shift bottlenecks to the parts of the system that you want to put under stress and thus more thoroughly test your application. Keeping a table of system behavior under stress, as shown in Table 1, will help you approach stress testing in a more systematic manner.
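Such a stub can be a drop-in implementation of the data access interface that returns canned data after a fixed delay. The sketch below is hypothetical (the OrderDao interface and the latency figure are invented for illustration):

import java.util.Collections;
import java.util.List;

// Hypothetical data access interface used by the application under test.
interface OrderDao {
    List<String> findOrdersByCustomer(int customerId);
}

// Stub that emulates database access with canned data and a fixed latency,
// so the bottleneck shifts away from the database tier during a stress test.
class StubOrderDao implements OrderDao {
    private final long emulatedLatencyMillis;

    StubOrderDao(long emulatedLatencyMillis) {
        this.emulatedLatencyMillis = emulatedLatencyMillis;
    }

    @Override
    public List<String> findOrdersByCustomer(int customerId) {
        try {
            Thread.sleep(emulatedLatencyMillis);            // emulate a typical query time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Collections.singletonList("ORDER-0001");     // canned result
    }
}

Wiring the stub in place of the real implementation (for example, through the application's dependency injection configuration) shifts the bottleneck away from the database tier without touching the rest of the stack.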

Performance degradation, even dramatic degradation, is acceptable in this context, but the application should return to normal after the load has been reduced to the average. If the application does not crash under stress, verify that the resources utilized during the stress have been released. A comprehensive performance-testing plan will also include an endurance test that verifies the application's ability to run for hours or days, and could reveal slow resource leaks that are not noticeable during regular tests. Slow memory leaks are among the most common. If they are present in a Java environment, these leaks could lead to a java.lang.OutOfMemoryError and the crash of the application server instance.
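One inexpensive way to catch slow leaks during an endurance run is to sample heap usage periodically and log it for trend analysis. A minimal sketch using the standard MemoryMXBean (the one-minute sampling interval is an arbitrary choice for illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeapTrendSampler {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();
        // Log used and committed heap once a minute; a baseline that keeps climbing
        // across full garbage collections suggests a slow leak worth investigating.
        sampler.scheduleAtFixedRate(() -> {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("%d,%d,%d%n",
                    System.currentTimeMillis(), heap.getUsed(), heap.getCommitted());
        }, 0, 1, TimeUnit.MINUTES);
        // In an endurance test this sampler runs alongside the load for hours or days
        // until the scheduler is shut down.
    }
}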

Creating a Real-World Value Mix

To better verify the robustness of your Web service, you should use your load test tool to generate a wide variety of values inside SOAP requests. This mix can be achieved, for example, by using multiple value data sources (such as spreadsheets or databases), or by having the values of the desired range dynamically generated (scripted) and then passed to virtual users that simulate SOAP clients. By using this approach in load tests of sufficient duration and intensity, you can test your Web service with an extended range and mix of argument values that will augment your functional testing.

Depending on the circumstances, it may be advisable to run the mixed request load test after all known concurrency issues have been resolved. If errors start occurring after the variable request mix has been introduced, inspect error details and create functional tests using the values that caused your Web service to fail during load testing. These newly created functional tests should become part of your functional test suite.
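A scripted value source can be as small as a seeded random generator handed to the virtual users, which keeps the value mix varied yet reproducible. A hypothetical sketch:

import java.util.Random;

public class ValueMixGenerator {
    private final Random random;

    // A fixed seed keeps failing value combinations reproducible across test runs.
    public ValueMixGenerator(long seed) {
        this.random = new Random(seed);
    }

    // Produces an argument value mixing typical, boundary, and oversized cases.
    public String nextCustomerName() {
        int roll = random.nextInt(100);
        if (roll < 80) return "Customer-" + random.nextInt(10_000);   // typical value
        if (roll < 95) return "";                                     // boundary: empty string
        StringBuilder oversized = new StringBuilder();                // oversized value
        for (int i = 0; i < 10_000; i++) oversized.append('x');
        return oversized.toString();
    }

    public static void main(String[] args) {
        ValueMixGenerator generator = new ValueMixGenerator(42L);
        for (int i = 0; i < 5; i++) {
            System.out.println(generator.nextCustomerName().length() + " characters");
        }
    }
}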

By implementing an automated performance testing process in your software development organization, you can reduce the number and severity of potential performance problems in your Web services application and improve its overall quality.

TABLE 1: SCORE SHEET

Test | Resource under stress | Max. resource utilization under stress | System behavior under stress | System behavior after stress load is removed
1. | Application Server CPU | 98% | Response time increased to 3 sec. on average. Timeouts in 10% of requests. | Returned to normal performance - success
2. | App. Server thread pool | 100% - all threads busy | Request timeouts followed by OutOfMemoryError(s) printed in server console. | Up to 50% errors after stress load is removed - failure
3. | App. Server network connections | 100% - running out of sockets | Connection refused in 40% of requests. | Returned to normal performance - success

Best Practices

Garbage Time is Not Just For Basketball
Joel Shore

In sports, "garbage time" is when bench players are sent in for the last few minutes of a blowout, long after the final outcome has become obvious. Java is different: garbage time is essential for success. The problem is that the very design of the language often lulls developers into a sense of false comfort, says Gwyn Fisher, chief technical officer at source code analysis tools maker Klocwork. "You can get smart developers who, all of a sudden, stop thinking like developers," he says. "All of the lessons they've spent years learning, from good programming practices in C++ all the way back to assembler, get thrown out the window because now they're working in a managed environment."

Blame much of it on GC, the Java garbage collector, Fisher says.

The problem, as Fisher sees it, is that Java does such a good job in many areas that its "gotchas" tend to get glossed over. And GC is a gotcha. He says garbage collection is a myth that in reality is "just terrible in many ways" because of this false sense of security. That means the test staff must be extra-vigilant when it comes to understanding what's really going on under the hood. Because the GC looks after memory, programmers tend to assume that anything associated with memory objects being cleaned up by the GC is also going to be managed by the GC. But that simply isn't the case.

As an example, Fisher cites an object that encapsulates a socket, a physical instantiation of a network endpoint. That encapsulating object gets cleaned up by the runtime when it goes out of scope, but the underlying operating system resource, the socket itself, does not get cleaned up because the GC has no idea what it is. The result over time is a growing array of things that are no longer managed by anything you can grasp in Java, because the GC has removed the objects, but which are still held onto by the underlying OS. It's not a big deal for an app that runs for 10 minutes and then has its JVM terminated, but for a Web server app designed to run uninterrupted for months or years, it can become a huge resource and performance drain.
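The practical defense is to release such OS-level resources explicitly instead of waiting for the garbage collector. A minimal sketch (hypothetical host and port) using try-with-resources so the underlying socket is closed deterministically:

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class ExplicitSocketCleanup {
    public static void main(String[] args) {
        // try-with-resources closes the OS socket deterministically when the block exits;
        // relying on the GC to clean up the wrapper object would leave the file descriptor
        // open for an unpredictable amount of time.
        try (Socket socket = new Socket("localhost", 8080)) {   // hypothetical endpoint
            OutputStream out = socket.getOutputStream();
            out.write("ping".getBytes());
            out.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}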

Derrek Seif, a product manager at Quest Software who focuses on performance, is in the same arena, believing that inefficient code often bungles memory allocation and release with results that are positively deadly. But there's more to it, he says. Like all of us, Seif often sees applications that undergo teardown, redesign, and rebuild as business requirements change. They might work perfectly in testing, but slow to a crawl once released into the real world. Customers get unhappy very quickly.

The problem is relying on a reactive methodology to fix these issues rather than being more proactive in upfront design and understanding performance metrics. Easier said than done, says Seif, since that requires development process reengineering. "Testing, in terms of performance, often gets squeezed to the end, due to additional pressures of other aspects of the project," he says. "There's never enough time to make a fix before release, but since it does work, it's usually 'we'll get it out and fix it later.'"

One way to mitigate the performance problem is through automation. By using a profiler's automation capabilities it's possible to perform unit tests and establish baseline performance metrics. From then on, as changes are made, historical data from that separate build is used as a comparator. This simplifies the task of zeroing in on problem areas when performance of subsequent builds becomes degraded. "With this process change, it is allowing this customer to deliver quality applications with higher performance levels than before," Seif says.

Another common mistake Seif sees is that in test-driven development, where unit tests are run for measuring performance, the percentage of code that actually gets run is unknown. "An app may appear to run fine from a performance or functional standpoint, but if you haven't exercised every line of code you can never be completely sure."

Rich Sharples, director of product management for Red Hat's JBoss Application Platforms and Developer Tools division, certainly agrees with extensive testing, but says it can be done smartly. "Running tests and doing performance tuning are big investments. To be effective with your budget, you have to understand what level of investment is right." At one end of the spectrum, fixing a problem on a satellite is a place where you can't overinvest in quality, but a static Web site phone directory or discussion forum with an occasional crash and restart, though undesirable, is not exactly critical.

Modeling the environment is key. "You can't replicate the Web tier of a Fortune 500 e-commerce site; you don't have thousands of servers sitting around," Sharples says. "The only solution is modeling, but that always involves some risk, especially when things scale up." Scaling is non-linear; at some point the capacity of the network may become the bottleneck, but if you didn't model for this you won't know. "Making the wrong assumption will cause problems later on."

Joel Shore is a 20-year industry veteran and has authored numerous books on personal computing. He owns and operates Reference Guide, a technical product reviewing and documentation consultancy in Southboro, Mass.


Future Test

Testing's Future Is In Challenges, Opportunities And the Internet
Murtada Elfahal

Software stakeholders will never stop guessing what the future will look like. Just as farmers monitor the weather during the rainy season, trying to benefit from opportunities and prevent disaster, software departments will continue to monitor the health of their applications.

Without some kind of time machine, the only way to see glimpses of the future is to look at what is happening in the present.

Present behavior also shapes the future of the software industry. One such factor is change imposed by those doing the observing: software stakeholders themselves. When people pay attention to the future, one might say they can end up creating it: driving the change or getting involved when an opportunity shows. They can improve things or prevent transformation, similar perhaps to the fictional accounts of time travel, in which a small change a month ago affects many things today.

The future of the industry is not just defined by big manufacturers. Amateurs also play a role. Notable examples include the Harvard dropout who went on to create the world's biggest software company, or the two young men whose search algorithm revolutionized the Internet. The future is generally defined by those who can find the next great idea, one that might totally change the direction of an industry. When Thomas Edison invented the light bulb, the world didn't immediately replace their gas lamps. But it eventually proved to be among the most important inventions in history. The same applies to the software industry. The next great idea might not be what everyone is looking for at the moment, but once available, it becomes as indispensable as the telephone.

Also affecting the software industry's future are the problems and challenges of the present, including those of our daily lives, which some people define as opportunities. Bill Hetzel, author of "The Complete Guide to Software Testing" (Wiley, 1993), wrote that any line of code is written to solve a problem. Therefore, according to this hypothesis, wherever you find software, there is a problem that needs to be solved. So perhaps the reverse is also true: Wherever you find a problem, there could be software written to solve it. The greater the challenge, the greater the opportunity.

Another factor driving the future of software is the Internet, and the exponential growth of networked and mobile devices to be found there. Among the major challenges is the management of an ever-larger quantity of addresses, users and devices with the finite number of IP addresses available. With this problem, an opportunity exists for some clever software developer to come along and solve it.

If I could pick just a single word to describe the future of the software industry, it would be "change." While most changes can be measured only by comparing them to the past, few could have imagined 20 years ago what might have been possible by interconnecting computers and networks throughout the world. This young industry has grown incredibly fast, and just as quickly has invaded all areas of human life. Just as the Web of today is unrecognizable from that of 20 years ago, we will scarcely recognize it 20 years from now. The challenges and opportunities associated with these changes will be available for those who are ready to benefit from them.

Murtada Elfahal is a test engineer at SAP.
