Avoiding the Destiny of Failure in Large Software Systems
John Cosgrove, PE, CDP, CFC
Cosgrove Computer Systems Inc. (310) 823-9448
Los Angeles ACM
Loyola Marymount University – University Hall
December 7, 2005
Responding to Risk in Software Systems
Copyright 2001-2005 CCS Inc.
Contents
The Problem
Seeking a Solution
Lessons-Learned
Summary
Bibliography
The Problem
Most Software Systems Fail
Planning Revisited
Integrated Risk Management
New World of Regulation
Future of Software Engineering
Most Software Systems Fail
Most SW projects fail – larger projects fail more often
Failure is not inevitable
  – Notable exceptions exist
Poor natural visibility is typical with SW
  – Effective planning & status assessment critical
Risk management is integral to planning
  – Risk assessment must include economics of failure
Source: Humphrey – Crosstalk 2005
Planning Revisited
Plans must involve all the responsible stakeholders
  – Developers, customers, end users, etc.
  – Win-Win or Lose-Lose
Development cycle policy must be explicit
  – Critical drivers must be stated – independent variables
  – Schedule, cost, performance or quality
      Choose one or two – others are dependent variables
      Dependent variables vary!
Planning is never complete
  – Rule: never fail a plan, because the plan changes to match reality first
Source: Boehm
Integrated Risk Management
True risk management is an element of planning
Flows from unknowns identified in planning
Two broad categories
  – Catastrophic or unacceptable risk
      Treat as requiring insurance in some form
  – Conventional risk exposure
      Classical risk mitigation steps
Both demand $$$ quantification of failure
  – Cost of failure drives budgets
New World of Regulation
Sarbanes-Oxley (SOX)
  – Enforces accountability for reporting “correctness”
  – Software projects are investment assets
  – Correctness, control mechanisms, security are auditable
  – Non-compliance penalties include criminal & civil
“If we managed finances in companies the way we manage software—then somebody would go to prison.” -- Armour
Future of Software Engineering
Functional size & complexity increasing rapidly
  – Size increase ~10x every 5 years
  – Scale matters in all engineered systems
      Humphrey’s analogy with transportation system’s speed
“Increasingly software [i.e., computer systems] .. crucial part of the products and services in almost all industries.”
“Most computer systems .. interconnected ..”
“.. more internal and external threats …”
“In .. past, .. assumed a friendly .. environment.”
Source: Humphrey, SEI/CMU 2002
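The "~10x every 5 years" figure can be restated as a yearly rate. The sketch below assumes steady compound growth, which the slide does not state; the function names are illustrative only.

```python
# Assumption: size grows at a constant compound rate.
# The slide gives only the five-year ratio (~10x).
FIVE_YEAR_FACTOR = 10.0
YEARS = 5


def annual_growth_factor(total_factor: float, years: int) -> float:
    """Yearly multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


def projected_size(initial_size: float, years_elapsed: float) -> float:
    """Size after years_elapsed, starting from initial_size (any size unit)."""
    yearly = annual_growth_factor(FIVE_YEAR_FACTOR, YEARS)
    return initial_size * yearly ** years_elapsed


if __name__ == "__main__":
    yearly = annual_growth_factor(FIVE_YEAR_FACTOR, YEARS)
    # Roughly 1.58, i.e. close to 58% growth per year.
    print(f"annual growth factor: {yearly:.2f}")
```

Under that assumption, a system doubles in size in about a year and a half, which is why plans sized for today's system age quickly.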
Seeking a Solution
Significant Differences – Software
Why Software is Valuable
Software Creation
Failure Management
Minimizing Failure Costs
Significant Differences - Software
Requirements are seldom complete – IKIWISI (“I’ll Know It When I See It”)
  – “With software the challenge is to balance the unknowable nature of the requirements with the business need for a firm contractual relationship.” -- Watts Humphrey
“Most engineered systems are defined by comprehensive plans and specifications prior to startup. Few software-intensive systems are.”
Most software projects are challenged or fail completely*
  – Projects over $6M: fewer than 10% succeed; at $1M: ~50%
  – Primary cause – no realistic planning by developers
  – No natural visibility of progress or completion status
* Humphrey, “Why Big Software Projects Fail”
Why Software is Valuable
Value created by the abstraction of productive knowledge
  – Development is a social learning process
Economic value comes from impact on useful activity
  – Efficient automotive ignitions
Value is increased when the knowledge is readily adaptable
  – McDonald’s hamburger franchises also work well in China
Franchises show how preserved abstractions can be valuable
Software engineers are ethically obligated to optimize value
Source: Baetjer
Software Creation
What is a social learning process?
Ignorance -> useful, reproducible knowledge
Orders of Ignorance (OI) – five levels
  – 0th – Useful knowledge, have the answer
  – 1st – Know the question, but not the answer
  – 2nd – Unknown number of unknowns, apply process
  – 3rd – 2-OI but no process to begin
  – 4th – 3-OI plus ignorance of ignorance (meta-ignorance)
Source: Armour, Five Orders of Ignorance, C-ACM 10/00
Failure Management
“..as if the concepts of risk and failure are somehow disconnected.”
“.. purpose of development .. do something not done before.”
90% success means 1 in 10 failure
  – Is the failure tolerable?
      Must make it tolerable (e.g., insurance)?
  – Calculate $ likelihood of failure (e.g., 10% of cost)
Source: Armour, “Management of Risk”, C-ACM 3/05
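The "$ likelihood of failure" arithmetic above can be sketched as probability of failure times the cost of failing. The example assumes the failure cost equals the full project cost, which is one reading of "10% of cost"; the function name and dollar figure are hypothetical.

```python
def expected_failure_cost(success_probability: float, failure_cost: float) -> float:
    """Dollar-weighted likelihood of failure: P(failure) * cost of failure."""
    if not 0.0 <= success_probability <= 1.0:
        raise ValueError("success_probability must be between 0 and 1")
    if failure_cost < 0:
        raise ValueError("failure_cost must be non-negative")
    return (1.0 - success_probability) * failure_cost


if __name__ == "__main__":
    # Hypothetical project: 90% chance of success, $1M lost if it fails.
    # Assumption: the failure cost is the whole project cost.
    exposure = expected_failure_cost(0.90, 1_000_000)
    print(f"expected failure cost: ${exposure:,.0f}")
```

With a 90% success estimate, the expected loss is about one tenth of the failure cost, which is the amount a plan should account for rather than treat as zero.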
Minimizing Failure Costs
Failure costs are never zero
  – Making costs explicit improves planning
Steps to minimize
  – Make all catastrophic risks tolerable
      Rationale behind insurance – life, property, etc.
      Project example – alternate, plan-B solution
  – Quantify risk exposure in terms of failure costs
      Rationale behind testing to avoid costly field retrofits
      Failure cost exposure drives budgets for mitigation
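One way to picture the two steps above is a small risk register: catastrophic risks are set aside for insurance or a plan B, and conventional risks are ranked by dollar exposure. This is a sketch, not a method from the talk. The `Risk` class, the `triage` helper, and the rule that total exposure caps the mitigation budget are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float          # estimated likelihood of the failure, 0..1
    failure_cost: float         # dollars lost if the failure occurs
    catastrophic: bool = False  # unacceptable at any probability

    @property
    def exposure(self) -> float:
        """Risk exposure in dollars: probability times failure cost."""
        return self.probability * self.failure_cost


def triage(risks: Iterable[Risk]) -> Tuple[List[Risk], List[Risk]]:
    """Separate catastrophic risks from conventional ones.

    Catastrophic risks need insurance or a plan B regardless of exposure.
    Conventional risks are returned sorted by exposure, largest first.
    """
    risks = list(risks)
    catastrophic = [r for r in risks if r.catastrophic]
    conventional = sorted(
        (r for r in risks if not r.catastrophic),
        key=lambda r: r.exposure,
        reverse=True,
    )
    return catastrophic, conventional


def mitigation_budget_ceiling(conventional: Iterable[Risk]) -> float:
    """Assumed rule: spending more than total exposure on mitigation is not justified."""
    return sum(r.exposure for r in conventional)
```

Ranking by exposure makes the budgeting argument explicit: testing that costs less than the field-retrofit exposure it removes is easy to defend.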
Lessons-Learned
Air Traffic Control Failure
New FBI Software Unusable
Unsafe Automotive Ignition
Framework for Dependable Designs
Dependable Ignition System Example
Air Traffic Control Failure
– LA regional system failed on 9/14/2004 for 3.5 hours
    Backup system also failed
– Many near misses (potential mid-air collisions) with 800+ aircraft
– Improperly blamed on “human error”
    Fault lay with known “glitch” avoided by manual Ops
    Fault introduced with system re-host a year earlier
    Only 1 of 21 centers has had the fault corrected
– Questions – testing, fault tolerance policy, etc.
    Backup system failed immediately???
New FBI Software Unusable
New anti-terrorism software – Virtual Case File
  .. “further delays in four-year effort..”
“$half-billion Upgrade … will not work ..”
  – “ .. render worthless much of current $170M contract.”
“.. may have outlived its usefulness .. before .. it was .. implemented”
“..officials thought ..get it right the 1st time”.. “That never happens with anybody.”
Source: LA Times, 1/13/05
Unsafe Automotive Ignition
Engine died when accelerating into traffic
  – Intermittent sensor wire
  – Ignition control software failed with open circuit
Hazard analysis missed HW-SW interaction
Incomplete SW system safety requirements
  – Interface failure protection – from hazard analysis
      Deterministic values for common failures -- open, short
  – Control algorithm must be protected
  – Detect failures and substitute “safe” values
Recent examples
  LA Times 5/05 – “Prius..”
Framework for Dependable Designs
Defend engineering process in court*
Set bounds for system – three states
  – Operating -- Envelope for normal operations
  – Non-Operating -- Normal not possible
  – Exception -- Recover to normal after anomaly
      Normal may be degraded-normal
Mishaps occur during state transitions
  – Identifies SW system dependability requirements
  – Suggests mishap mitigation -- HW or SW
* Source: Lawson
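The three-state framing above can be sketched as an explicit state machine in which every transition is checked, since transitions are where mishaps occur. The slide names only the states; the transition table and all identifiers below are assumptions for illustration.

```python
from enum import Enum


class SystemState(Enum):
    OPERATING = "operating"          # inside the envelope for normal operations
    NON_OPERATING = "non-operating"  # normal operation not possible
    EXCEPTION = "exception"          # recovering to normal after an anomaly


# Assumed transition table; the source lists the states but not the transitions.
ALLOWED_TRANSITIONS = {
    SystemState.NON_OPERATING: {SystemState.OPERATING},
    SystemState.OPERATING: {SystemState.EXCEPTION, SystemState.NON_OPERATING},
    SystemState.EXCEPTION: {SystemState.OPERATING, SystemState.NON_OPERATING},
}


class IllegalTransition(Exception):
    """Raised when a requested state change is not in the allowed table."""


def transition(current: SystemState, target: SystemState) -> SystemState:
    """Return the new state, or raise if the change is not explicitly allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target
```

Writing the table down turns each allowed transition into a place to ask what can go wrong, which is one way dependability requirements get identified.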
Dependable Ignition System Example
Automotive ignition -- hazard identified
  – Sensor wiring may fail from constant movement
  – Ignition control failure may cause traffic emergency
Requirement – Recover safely from faulty wiring
Allocation of requirement – “What if”
  – HW – Terminate inputs for predictable open/short values
  – SW – Detect open/short values, use last or known safe value
Requirements identification before design is best
  – More options, usually less costly
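The SW half of the allocation above might look like the following sketch. It assumes the HW termination makes an open wire read near the top of the voltage range and a short read near the bottom. The thresholds, the default safe value, and the `SensorGuard` name are hypothetical and not taken from any real ignition system.

```python
from typing import Optional

# Hypothetical limits, assuming HW termination pulls an open input high
# and a shorted input low so that both faults produce predictable readings.
OPEN_CIRCUIT_MIN_VOLTS = 4.8
SHORT_CIRCUIT_MAX_VOLTS = 0.2
DEFAULT_SAFE_VOLTS = 2.5  # assumed known-safe value for the control algorithm


class SensorGuard:
    """Screens raw sensor readings before they reach the control algorithm."""

    def __init__(self, safe_value: float = DEFAULT_SAFE_VOLTS) -> None:
        self._safe_value = safe_value
        self._last_good: Optional[float] = None

    def filter(self, raw_volts: float) -> float:
        """Return raw_volts if plausible, else the last good or known safe value."""
        is_open = raw_volts >= OPEN_CIRCUIT_MIN_VOLTS
        is_short = raw_volts <= SHORT_CIRCUIT_MAX_VOLTS
        if is_open or is_short:
            # Fault detected: substitute rather than pass the bad value on.
            if self._last_good is not None:
                return self._last_good
            return self._safe_value
        self._last_good = raw_volts
        return raw_volts
```

An intermittent wire then yields a held value instead of an out-of-range input, so the engine keeps running while the fault is present.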
Summary
Most large SW-intensive system developments fail
Public safety and economic security are forcing government & legal systems to recognize the importance of software
Planning and risk management practices are key to any solution
Good systems engineering practices must be adapted to software’s special characteristics
Bibliography - I
Armour, Phillip, The Five Orders of Ignorance, Communications of the ACM, October 2000
Armour, Phillip, Project Portfolios: Organizational Management of Risk, Communications of the ACM, March 2005
Armour, Phillip, Sarbanes-Oxley and Software Projects, Communications of the ACM, June 2005
Baetjer, H., Software as Capital - An Economic Perspective on Software Engineering, IEEE Computer Society Press, 1997
Boehm, Barry, Win-Win Negotiation Tool, Center for Software Engineering-USC, http://sunset.usc.edu
Cosgrove, J., Software Engineering & Law, IEEE Software, May-June 2001
Humphrey, W. S., Managing the Software Process, Addison Wesley, 1990
Humphrey, Watts, The Future of Software Engineering: V, SEI Interactive, Software Engineering Institute, Carnegie Mellon University, Vol. 5, Num. 1, 1Q 2002, http://interactive.sei.cmu.edu/news@sei/columns/watts_new/watts-new-compiled.pdf
Humphrey, Watts, Why Big Software Projects Fail – The 12 Key Questions, CrossTalk Magazine, March 2005, www.stsc.hill.af.mil
Bibliography - II
Lawson, Harold W., An Assessment Methodology for Safety Critical Systems, Lidingo, Sweden, [email protected]
Los Angeles Times, System Failure Snarls Air Traffic in the Southland, 9/15/2004
Los Angeles Times, Human Errors Silenced Airports, 9/16/2004
Los Angeles Times, New FBI Software May Be Unusable, 1/13/2005
Los Angeles Times, Prius Glitches Highlight Problems of Car Computers, 5/18/2005
Lister, T. & DeMarco, T., Both Sides Always Lose: Litigation of Software-Intensive Contracts, CrossTalk, 2/2000, www.stsc.hill.af.mil/Crosstalk/2000/feb/demarco.asp
Parnas, David L., Licensing Software Engineers in Canada, Communications of the ACM, 11/2002
Poore, Jesse H., A Tale of Three Disciplines … and a Revolution, IEEE Computer, 1/2004
Research Triangle Institute, “The Economic Impacts of Inadequate Infrastructure for Software Testing”, www.nist.gov, NIST Planning Report 02-3