Software Design for Reliability - Presentation
8/8/2019 Software Design for Reliability - Presentation
Software Design for Reliability (DfR)
George de la Fuente
(408) 828-1105
www.opsalacarte.com
Mike Silverman
(408) 472-3889
www.opsalacarte.com
Introduction
The Current Situation
Engineering teams must balance cost, schedule, performance and reliability to achieve optimal customer satisfaction.
[Diagram: Cost, Schedule, Performance and Reliability balanced around Customer Satisfaction]
The Need for Design for Reliability
Reliability is no longer a separate activity performed by a distinct group within the organization.
Product reliability goals, concerns and activities are integrated into nearly every function and process of an organization.
The organization must factor reliability into every decision.
Design for Reliability
Design for Reliability is made up of the following 4 key steps:
1. Assess the customer's situation
2. Develop goals
3. Write a program plan
4. Execute the program plan
The focus is on developing reliable products and improving customer satisfaction.
Different Views of Reliability
Product development teams view reliability as the domain to address mechanical, electrical, and software issues.
Customers view reliability as a system-level issue, with minimal concern placed on the distinction into sub-domains.
Since the primary measure of reliability is made by the customer, engineering teams must maintain a balance of both views (system and sub-domain) in order to develop a reliable product.
[Diagram: System Reliability = Mechanical Reliability + Electrical Reliability + SW Reliability]
Reliability vs. Cost
Intuitively, the emphasis on reliability to achieve a reduction in warranty and in-service costs results in some minimal increase in development and manufacturing costs.
Use of the proper tools during the proper life-cycle phase will help to minimize total Life Cycle Cost (LCC).
Reliability vs. Cost, continued
To minimize total Life Cycle Cost (LCC), an organization must do two things:
1. Choose the best tools from all of the tools available and apply these tools at the proper phases of the product life cycle.
2. Properly integrate these tools together to assure that the proper information is fed forwards and backwards at the proper times.
This is Design for Reliability.
Hardware Reliability and Cost Reduction Strategy
A product-driven focus on reliability in order to reduce product hardware and mechanical warranty and in-service costs.
This emphasis results in some increase in development and manufacturing costs.
A good reliability program will strive to optimize the total Life Cycle Cost.
[Chart: cost vs. reliability. Reliability program costs rise with reliability while electrical/mechanical warranty and in-service costs fall; the total cost curve reaches an optimum cost point between them]
HW and SW Reliability and Cost Reduction Strategy
A product-driven focus on reliability in order to reduce product hardware and mechanical warranty and in-service costs.
This emphasis results in some increase in development and manufacturing costs.
A good reliability program will strive to optimize the total Life Cycle Cost.
[Chart: the same cost vs. reliability curves, annotated with the question below]
The software impact on overall warranty costs is minimal at best? Why?
Software Reliability and Cost
Software has no associated manufacturing costs and minimal replacement costs.
Warranty costs and savings are almost entirely allocated to hardware.
Software development and test teams are considered fixed product development costs.
If software reliability does not result in a perceived cost savings, why not focus solely on improving hardware reliability in order to save money?
Sadly, some organizations take just this approach.
But since they sell systems, not just hardware, sometimes customers or market forces cause companies to reconsider this approach.
The primary root cause of embedded system failures is usually software, not hardware, by ratios as high as 10:1.
The benefits of improved software reliability are not in direct cost savings, but rather in:
Increased availability of software staff for additional development due to reduced operational maintenance demands
Improved customer satisfaction resulting from increased system reliability
Software Design for Reliability
Software Quality vs. Software Reliability
Software Quality: the level to which the software characteristics conform to all the specifications.
The ISO 9126 quality model breaks software quality into six factors, each with associated criteria:
Functionality: suitability, accuracy, interoperability, security
Reliability: maturity, fault tolerance, recoverability
Usability: understandability, learnability, operability, attractiveness
Efficiency: time behavior, resource utilization
Maintainability: analysability, changeability, stability, testability
Portability: adaptability, installability, co-existence, replaceability
Software Reliability Definitions
The customer perception of the software's ability to deliver the expected functionality in the target environment without failing.
Examining the key points, a practical rewording of the definition:
Software reliability is a measure of the software failures that are visible to a customer and prevent a system from delivering essential functionality.
Terminology - Defect
A flaw in software requirements, design or source code that produces unintended or incomplete run-time behavior.
Defects of commission:
Incorrect requirements are specified
Requirements are incorrectly translated into a design model
The design is incorrectly translated into source code
The source code logic is flawed
Defects of omission (these are amongst the most difficult classes of defects to detect):
Not all requirements were used in creating a design model
The source code did not implement all the design
The source code has missing or incomplete logic
Defects are static and can be detected and removed without executing the source code.
Defects that cannot trigger software failures are not tracked or measured for reliability purposes.
These are quality defects that affect other aspects of software quality, such as soft maintenance defects and defects in test cases or documentation.
[Diagram: Defect]
Terminology - Fault
The result of triggering a software defect by executing the associated source code.
Faults are NOT customer-visible.
Example: a memory leak, or a packet corruption that requires retransmission by the higher-layer stack.
A fault may be the transitional state that results in a failure.
Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states.
[Diagram: Defect → Fault]
Terminology - Failure
A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed software behavior.
Failures are the visible, run-time symptoms of faults.
Failures MUST be observable by the customer or another operational system.
Not all failures result in system outages.
NOTE: for the remainder of this presentation, the term failure will refer only to failure of essential functionality, unless otherwise stated.
[Diagram: Defect → Fault → Failure]
Summary of Defects and Failures
There are 3 types of run-time defects:
1. Defects that are never executed (so they don't trigger faults)
2. Defects that are executed and trigger faults that do NOT result in failures
3. Defects that are executed and trigger faults that result in failures
Practical software reliability focuses solely on defects that have the potential to cause failures by:
1. Detecting and removing defects that result in failures during development
2. Implementing fault tolerance techniques to prevent faults from producing failures, or mitigating the effects of the resulting failures
[Diagram: the three cases: Defect alone; Defect → Fault; Defect → Fault → Failure]
Software Availability
System outages caused by software can be attributed to:
1. Recoverable software failures
2. Software upgrades
3. Unrecoverable software failures
NOTE: Recoverable software failures are the most frequent software cause of system outages.
For outages due to recoverable software failures, availability is defined as:
A(T) = MTTF / (MTTF + MTTR)
where:
MTTF is Mean Time To [next] Failure
MTTR (Mean Time To [operational] Restoration) is still the duration of the outage, but without the notion of a repair time. Instead, it is the time until the same system is restored to an operational state via a system reboot or some level of software restart.
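The A(T) formula above can be sketched in a few lines of code. This is an illustration only, not from the presentation; the example MTTF and MTTR values are assumptions.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A(T) = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Example (assumed values): one recoverable failure every 30 days
# (MTTF = 720 h), restored by a 4-minute reboot (MTTR = 4/60 h).
print(f"{availability(720, 4 / 60):.6f}")  # close to four nines
```

Note how strongly the short restoration time dominates: even a monthly failure still yields roughly four-nines availability when recovery takes only minutes.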
Software Availability (continued)
A(T) can be increased by either:
Increasing MTTF (i.e., increasing reliability) using software reliability practices
Reducing MTTR (i.e., reducing downtime) using software availability practices
MTTR can be reduced by:
Implementing hardware redundancy (sparingly) to mask the most likely failures
Increasing the speed of failure detection (the key step)
Software and system recovery speeds can be increased by implementing Fast Fail and software restart designs
Modular design practices allow software restarts to occur at the smallest possible scope, e.g., thread or process vs. system or subsystem
Drastic reductions in MTTR are only possible when availability is part of the initial system/software design (like redundancy)
Customers generally perceive enhanced software availability as a software reliability improvement, even if the failure rate remains unchanged.
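As an illustration of restarting at the smallest possible scope, here is a minimal sketch (the names and structure are mine, not from the presentation) of a supervisor that detects a dead worker thread and restarts only that thread, rather than rebooting the whole process:

```python
import threading
import time

threading.excepthook = lambda args: None   # silence the injected fault's traceback

def supervise(make_worker, checks: int = 3, interval: float = 0.05) -> int:
    """Run a worker thread, restarting it (thread scope only) whenever a
    periodic health check finds it dead. Returns the restart count."""
    worker = make_worker()
    worker.start()
    restarts = 0
    for _ in range(checks):
        time.sleep(interval)           # periodic health check
        if not worker.is_alive():      # failure detection
            worker = make_worker()     # recovery at the smallest scope
            worker.start()
            restarts += 1
    return restarts

def flaky_worker():
    raise RuntimeError("injected fault")   # this worker fails immediately

restarts = supervise(lambda: threading.Thread(target=flaky_worker, daemon=True))
print(restarts)
```

A production design would layer watchdog timers and escalation (thread restart, then process restart, then reboot) on top of this basic detect-and-restart loop.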
System Availability Timeframes
Timeframe vs. Mission Downtime (Unavailability Range)

Availability | Availability Class | Downtime per 3-month timeframe | Downtime per 1-year timeframe
90% (1 nine) | (1) Unmanaged | 9.13 days | 36.5 days/year (52,560 mins/year)
99% (2 nines) | (2) Managed (good web servers) | 21.9 hours | 3.65 days/year (5,256 mins/year)
99.9% (3 nines) | (3) Well-managed | 2.19 hours | 8.8 hours/year (525.6 mins/year)
99.99% (4 nines) | (4) Fault Tolerant (better commercial systems) | 13.14 minutes | 52.6 mins/year
99.999% (5 nines) | (5) High-Availability (high-reliability products) | 1.31 minutes | 5.3 mins/year
99.9999% (6 nines) | (6) Very-High-Availability | 7.88 seconds | 31.5 secs/year (2.6 mins/5 years)
99.99999% (7 nines) to 99.9999999% (9 nines) | (7) Ultra-Availability | 0.79 seconds | 3.2 secs/year to 31.5 millisecs/year (15.8 secs/5 years or less)
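The downtime figures in the table follow directly from the availability level; a quick sketch (the function name is mine):

```python
MINS_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_mins(availability: float, period_mins: float = MINS_PER_YEAR) -> float:
    """Expected downtime for a given availability level over a period."""
    return (1.0 - availability) * period_mins

print(downtime_mins(0.99) / 60 / 24)  # 2 nines: about 3.65 days/year
print(downtime_mins(0.999) / 60)      # 3 nines: about 8.76 hours/year
print(downtime_mins(0.99999))         # 5 nines: about 5.26 minutes/year
```

Each added nine divides the downtime budget by ten, which is why the classes at the top of the table are measured in seconds rather than days.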
Software Availability
For each software failure detection mechanism, the table shows a slow and a fast recovery option, each with: a) recovery action, b) recovery speed, c) MTTR, d) mission failure rate.

Detection Mechanism (implementation example; detection speed) | Slow Recovery: action / speed / MTTR / mission failure rate | Fast Recovery: action / speed / MTTR / mission failure rate
1. Manual detection, no S/W mechanism (reboot/restart via user interface; avg detection time 10 min) | System reboot / 4 minutes (average) / 14 minutes / 1 failure/mission | System restart / 100 seconds (average) / 11 mins & 40 secs / 1 failure/mission
2. Slow, automatic detection mechanism (health message protocols w/ retries; 30 seconds) | System reboot / 4 minutes (average) / 4 mins & 30 secs / 3 failures/mission | System restart / 100 seconds (average) / 130 seconds / 7 failures/mission
3. Fast, automatic detection mechanism (watchdog timer reset; 5-10 seconds) | Board reboot / 90 seconds / 100 seconds / 9 failures/mission | Board restart / 30 seconds / 40 seconds / 24 failures/mission
4. Very fast, automatic detection mechanism (thread or driver monitors; up to 1 second) | Board reboot / 90 seconds / 91 seconds / 10 failures/mission | Thread/driver restart / 50 ms (average) / 1.05 seconds / 914 failures/mission
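Two relationships in the table can be checked in code. The first (MTTR = detection time + recovery time) is visible in every row; the second is my own reading, not stated on the slide: the mission failure rate figures are consistent with counting how many recoveries fit in a fixed mission downtime budget of roughly 960 seconds.

```python
import math

def mttr_seconds(detect_s: float, recover_s: float) -> float:
    """MTTR = failure detection time plus recovery time."""
    return detect_s + recover_s

def failures_per_mission(detect_s: float, recover_s: float,
                         budget_s: float = 960.0) -> int:
    """Assumed model: recoveries that fit in a fixed downtime budget."""
    return math.floor(budget_s / mttr_seconds(detect_s, recover_s))

assert mttr_seconds(30, 100) == 130          # health messages + system restart
assert failures_per_mission(30, 100) == 7    # matches the table row
assert failures_per_mission(10, 30) == 24    # watchdog + board restart
assert failures_per_mission(1, 0.05) == 914  # thread monitor + thread restart
```

Under that reading, faster detection and smaller restart scope buy the system roughly an order of magnitude more tolerable failures per mission at each step down the table.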
Reliable Software Characteristics Summary
Operates within the reliability specification that satisfies customer expectations
Measured in terms of failure rate and availability level
The goal is rarely defect-free or ultra-high reliability
Gracefully handles erroneous inputs from users, other systems, and transient hardware faults
Attempts to prevent state or output data corruption from erroneous inputs
Quickly detects, reports and recovers from software and transient hardware faults
Software provides system behavior that is continuously monitoring, self-diagnosing and self-healing
Prevents as many run-time faults as possible from becoming system-level failures
Software Design for Reliability (DfR)
What are the common options for software DfR?
1. Use formal methods
2. Follow an approach consistent with Hardware Reliability Programs
3. Improve reliability through software process control
4. Improve reliability by focusing on traditional software development best practices
Formal Methods
Methodologies that utilize mathematical modeling of a system's requirements and/or design in order to prove correctness
Primarily used for safety-critical systems that require very high degrees of:
Confidence in expected system performance
Quality audit information
Targets of low or zero failure rates
Formal methods are not applicable to every software project
Cannot be used for all aspects of system design (e.g., user interface design)
Do not scale to handle large and complex system development
Hardware Reliability Practices
Software and hardware development practices are still fundamentally different:
Architecture- and design-level modeling tools are not prevalent
Provide limited simulation verification, especially if a real-time operating system is required
2-way code generation is a common feature
Software engineers find it difficult to complete a design without coding
All software faults are from design, not manufacturing or wear-out
Software is not built as an assembly of preexisting components
Off-the-shelf software components do not provide reliability characteristics
Most reused software components are modified and not re-certified before reuse
Developing designs with very small defect densities is the Achilles' heel
No true mechanisms are available for general accelerated software testing
Increasing processor speeds is usually implemented to find timing failures
Hardware Reliability Practices (continued)
Extending software designs after product deployment is commonplace
Software updates are the preferred avenue for product extensions and customizations
Software updates provide fast development turnaround and have little or no manufacturing or distribution costs
Hardware failure analysis techniques (FMEAs and FTAs) are rarely used with software
Software engineers find it difficult to adapt these techniques below the system level
Software Process Control Methodologies
Software process control assumes a correlation between development process maturity and latent defect density in the final software [Jones-1].
If the current process level does not yield the desired software reliability, process audits and stricter process controls are implemented.
Reliability improvement under this type of controlled methodology may increase very slowly.

CMM Level | Inherent Defects/KSLOC | Delivered Defects/KSLOC
1 | 124.8 | 5.8
2 | 62.4 | 3.4
3 | 31.2 | 2.1
4 | 15.6 | 1.09
5 | 7.8 | 0.39
Defect Removal Potential of Best Practices
Defect Removal Technique | Efficiency Range
Design inspections | 45% to 60%
Code inspections | 45% to 60%
Unit testing | 15% to 45%
Regression test | 15% to 30%
Integration test | 25% to 40%
Performance test | 20% to 40%
System testing | 25% to 55%
Acceptance test (1 customer) | 25% to 35%
[Jones-2]
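If removal stages are applied in sequence and each independently removes its fraction of the defects that reach it, the combined efficiency compounds. This is an idealized model of my own (the independence assumption is not claimed by the slides), using mid-range values from the table:

```python
def combined_efficiency(*stage_efficiencies: float) -> float:
    """1 minus the product of each stage's escape fraction (1 - e_i)."""
    remaining = 1.0
    for e in stage_efficiencies:
        remaining *= (1.0 - e)   # fraction of defects that escape this stage
    return 1.0 - remaining

# Assumed pipeline: design inspections (0.55), code inspections (0.55),
# unit testing (0.30), system testing (0.40)
print(round(combined_efficiency(0.55, 0.55, 0.30, 0.40), 2))  # about 0.91
```

Even with each individual stage well below 60%, stacking four stages removes over 90% of defects, which is why the integration of practices on the next slides matters more than any single technique.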
Impact of Integrating Multiple Best Practices
Design Inspections/Reviews | Code Inspections/Reviews | Formal SQA Processes | Formal Testing | Median Defect Efficiency
n | n | n | n | 40%
n | n | Y | n | 45%
n | n | n | Y | 53%
n | Y | n | n | 57%
Y | n | n | n | 60%
n | n | Y | Y | 65%
Y | Y | n | n | 85%
Y | Y | Y | Y | 99%
Most development organizations will see greater improvements in software reliability with investments in design and code reviews than further investments in testing. [Jones-2]
Best in Class Results by Application
Application Type | Best in Class Defect Removal Efficiency
System Software | 94%
Embedded Software | 95%
Commercial Software | 90%
Military Software | 96%
IT Software | 73%
Web Software | 72%
Outsourced Software | 92%
[Jones-2]
Phase Containment of Defects and Failures
Typical Behavior: defects originate in the Requirements, Design and Coding phases but are not discovered until the Testing and Maintenance phases ("Surprise!").
Goal of Phase Containment: defects are discovered in the same phase in which they originate.
[Diagram: defect origin vs. defect discovery across the Reqts, Design, Coding, Testing and Maintenance phases]
Typical Defect Reduction Behavior
[Chart: defects found per system test build, SysBld #1 through #5, on a scale of 50 to 200]
DfR Phase Containment Behavior
[Chart: defects found, on a scale of 50 to 200, across Development (Reqts, Design, Code, Unit Test), System Test (SysBld #1 through #5) and Deployment (Field Failures)]
Goals are to:
(1) Reduce field failures by finding most of the failures in-house
(2) Find the majority of failures during development
Requirements Phase DfR Practices
Software reliability requirements:
Identify important software functionality
Essential, critical, and non-essential
Explicitly define acceptable software failure rates
Specify any behavior that impacts software availability
Acceptable durations for software upgrades, reboots and restarts
Define any operating cycles that apply to the system and the software in order to define opportunities for software restarts or rejuvenation
Ex: maintenance or diagnostic periods, off-line periods, shutdown periods
Design Phase DfR Practices
The software designs should evolve using a multi-tiered approach:
1. System Architecture
Identify all essential system-level functionality that requires software
Identify the role of software in detecting and handling hardware failure modes by performing system-level failure mode analysis
2. High-Level Design (HLD)
Identify modules based on their functional importance and vulnerability to failures
Essential functionality is most frequently executed
Critical functionality is infrequently executed but implements key system operations (e.g., boot/restart, shutdown, backup, etc.)
Vulnerability points that might flag defect clusters (e.g., synchronization points, hardware and module interfaces, initialization and restart code, etc.)
Identify the visibility of and access to major data objects outside of each module
3. Low-Level Design (LLD)
Define the availability behavior of the modules (e.g., restarts, retries, reboots, redundancy, etc.)
Identify vulnerable sections of functionality in detail
Functionality is targeted for fault tolerance techniques
Focus on simple implementations and recovery actions
Design Phase DfR Practices
For software engineers, the highest ROI for defect and failure detection and removal is low-level design (LLD):
LLD defines sufficient module logic and flow control details to allow analysis based on common failure categories and vulnerable portions of the design
Failure handling behavior can be examined in sufficient detail
LLD bridges the gap between traditional design specs and source code
Most design defects that were previously caught during code reviews will now be caught in the LLD review
More likely to correctly fix design defects, since the defect is caught in the design phase
Most design defects found after the design phase are not properly fixed because the scheduling costs are too high
Design defects require returning to the design phase to correct and review the design, and then correcting, re-reviewing and unit testing the code!
LLD can also be reviewed for testability
The goal of defining and validating all system test cases as part of an LLD review is achievable
Code Phase DfR Practices
Code reviews should be carried out in stages to remove the most defects:
Properly focused design reviews coupled with techniques to detect simple coding defects will result in shorter code reviews
Code reviews should focus on implementation, and not design, issues
Language defects can be detected with static and dynamic analysis tools
Maintenance defects are caught with coding-standards pre-reviews
Requiring authors to review their own code significantly reduces simple code defects and possible areas of code complexity
The inspection portion of a review tries to identify missing exception handling points
Software failure analysis will focus on the robustness of the exception handling behavior
Software failure analysis should be performed as a separate code inspection once the code has undergone initial testing
Test Phase DfR Practices
Unit testing can be effectively driven using code coverage techniques
Allows software engineers to define and execute unit testing adequacy requirements in a manner that is meaningful and easily measured
Coverage requirements can vary based on the critical nature of a module
System-level testing should measure reliability and validate as many customer operational profiles as possible
Requires that most of the failure detection be performed prior to system testing
System integration becomes the general functional validation phase
Reliability Measurements and Metrics
Definitions:
Measurements: data collected for tracking or to calculate meta-data (metrics)
Ex: defect counts by phase, defect insertion phase, defect detection phase
Metrics: information derived from measurements (meta-data)
Ex: failure rate, defect removal efficiency, defect density
Reliability measurements and metrics accomplish several goals:
Provide estimates of software reliability prior to customer deployment
Track reliability growth throughout the life cycle of a release
Identify defect clusters based on code sections with frequent fixes
Determine where to focus improvements based on analysis of failure data
SCM and defect tracking tools should be updated to facilitate the automatic tracking of this information:
Allow for data entry in all phases, including development
Distinguish code base updates for critical defect repair vs. any other changes (e.g., enhancements, minor defect repairs, coding standards updates, etc.)
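Defect removal efficiency, one of the metrics named above, derives directly from the per-phase defect counts. A minimal sketch; the illustrative numbers are mine:

```python
def defect_removal_efficiency(found_in_house: int, found_in_field: int) -> float:
    """Fraction of all known defects removed before customer deployment."""
    total = found_in_house + found_in_field
    return found_in_house / total if total else 1.0

# Assumed example: 470 defects removed during development and test,
# 30 reported from the field.
print(f"{defect_removal_efficiency(470, 30):.0%}")  # 94%
```

Tracked per release, this single ratio makes reliability growth (or regression) visible across the product line, which is exactly the automatic tracking the slide asks SCM and defect tools to support.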
Conclusion
Conclusions
Software developers can produce commercial software with high reliability by optimizing their best practices for defect removal.
Most software organizations are not aware that this type of software DfR is a viable option, which is why much of our initial contact is made via the hardware organizations.
Final Point: You don't need more tools, more testing or more processes to develop highly reliable software. The answer lies within the reach of your team's best practices!
References
[Jones-1] Jones, C., "Conflict and Litigation Between Software Clients and Developers," March 4, 1996.
[Jones-2] Jones, C., "Software Quality in 2002: A Survey of the State of the Art," July 23, 2002.