A Case Study In Reliability Analysis Lewis Sykalski.

Page 1: A Case Study In Reliability Analysis

Lewis Sykalski

Page 2: Background (cont.)

• Net Centric Warfare Data Collector
– Approximately 180 KLOC
– Written in Java; heavily uses JDBC and RMI from the J2EE package
– CMMI Level 1
– Utilizes Oracle 9.2 EE OTS DBMS
• Reliability Required: Moderate

Page 3: Background

[Diagram: GLOBAL VISION NETWORK (GVN). Sites shown: Integrated Warfare Development Center (Fort Worth, TX); Light House (Suffolk, VA); LM – Mission Sys (Colorado Springs, CO); LM – Sim & Training (Orlando, FL). Node labels: DC, FUSION, CAOC, WCS, JSAF, JIMM, JTAC, JABE, Other Simulators, Threat Sims, VBMS.]

Page 4: Design Diversity (Part I)

• Part I: Oracle DBMS Design Diversity
– Acquire 20 bug reports each from Oracle 9.2 & Oracle 10.0
– Bugs had to be Date Independent, Easy To Reproduce, & Type Independent
– Results would then be classified by self-evidence & divergence

Page 5: Design Diversity: Results 9.2 Bugs

(S.E = self-evident; X = applies)

| Bug #   | Type             | 9.2 S.E | 10.0 Fails? | 10.0 S.E. | Divergent |
| 2357784 | Internal Error   | X       | NO          | N/A       | X         |
| 2299898 | Performance/Hang | X       | NO          | N/A       | X         |
| 2202561 | Incorrect Results |        | NO          | N/A       |           |
| 2221401 | Incorrect Results |        | NO          | N/A       |           |
| 2739068 | Incorrect Results |        | NO          | N/A       |           |
| 2683540 | Incorrect Results |        | NO          | N/A       |           |
| 2991842 | Incorrect Results |        | NO          | N/A       |           |
| 2200057 | Internal Error   | X       | NO          | N/A       |           |
| 2405258 | Internal Error   | X       | NO          | N/A       |           |
| 2716265 | Internal Error   | X       | NO          | N/A       |           |
| 2054241 | Performance/Hang | X       | NO          | N/A       |           |
| 2485871 | Internal Error   | X       | NO          | N/A       |           |
| 2670497 | Internal Error   | X       | NO          | N/A       |           |
| 2659126 | Internal Error   | X       | NO          | N/A       | X         |
| 2064478 | Internal Error   | X       | NO          | N/A       |           |
| 2624737 | Internal Error   | X       | NO          | N/A       | X         |
| 1918751 | Internal Error   | X       | NO          | N/A       |           |
| 2286290 | Incorrect Results |        | NO          | N/A       | X         |
| 2700474 | Incorrect Results |        | NO          | N/A       |           |
| 2576353 | Internal Error   | X       | NO          | N/A       |           |

Page 6: Design Diversity: Results 10.0 Bugs

(S.E = self-evident; X = applies)

| Bug #   | Type               | 10.0 S.E | 9.2 Fails? | 9.2 S.E | Divergent |
| 5731063 | Internal Error     | X        | NO         | N/A     |           |
| 3664284 | Incorrect Results  |          | NO         | N/A     |           |
| 4582808 | Incorrect Results  |          | NO         | N/A     |           |
| 3895678 | Internal Error     | X        | YES        | X       |           |
| 3893571 | Internal Error     | X        | YES        | X       |           |
| 3903063 | Incorrect Results  |          | YES        |         |           |
| 3912423 | Internal Error     | X        | NO         | N/A     |           |
| 4029857 | Engine Crash       | X        | YES        | X       |           |
| 4156695 | Incorrect Results  |          | YES        |         |           |
| 2929556 | Internal Error     | X        | YES        | X       | X         |
| 3255350 | Performance / Hang | X        | NO         | N/A     |           |
| 3887704 | Internal Error     | X        | NO         | N/A     |           |
| 3405237 | Engine Crash       | X        | YES        | X       |           |
| 3952322 | Feature Unusable   | X        | YES        | X       |           |
| 4033889 | Incorrect Results  |          | NO         | N/A     |           |
| 4060997 | Internal Error     | X        | YES        | X       |           |
| 4134776 | Internal Error     | X        | NO         | N/A     |           |
| 4149779 | Incorrect Results  |          | NO         | N/A     |           |
| 2964132 | Internal Error     | X        | YES        | X       |           |
| 3361118 | Internal Error     | X        | YES        | X       |           |

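Page 7: Design Diversity: More Analysis

(S.E = self-evident; N.S.E = not self-evident)

| Failure Type      | S.E / N.S.E | 9.2 bugs on Oracle 9.2 | 9.2 bugs on Oracle 10.0 | 10.0 bugs on Oracle 10.0 | 10.0 bugs on Oracle 9.2 |
| Total Bug Scripts |             | 20                     | -                       | 20                       | -                       |
| Failure Observed  |             | 20                     | -                       | 20                       | 11                      |
| Performance/Hang  | S.E         | 2                      | 0                       | 1                        | 0                       |
| Internal Error    | S.E         | 11                     | 0                       | 10                       | 6                       |
| Engine Crash      | S.E         | 0                      | 0                       | 2                        | 2                       |
| Incorrect Result  | S.E         | 0                      | 0                       | 0                        | 0                       |
| Incorrect Result  | N.S.E       | 7                      | 0                       | 6                        | 2                       |
| Feature Unusable  | S.E         | 0                      | 0                       | 1                        | 1                       |
| Feature Unusable  | N.S.E       | 0                      | 0                       | 0                        | 0                       |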

Page 8: Design Diversity: Even More Analysis

| Category                                          | Count |
| Total Bug Scripts                                 | 40    |
| Failures                                          | 40    |
| 1 out of 2 failing, S.E                           | 18    |
| 1 out of 2 failing, N.S.E                         | 11    |
| Both DBMS products failing, non-divergent, S.E    | 8     |
| Both DBMS products failing, non-divergent, N.S.E  | 2     |
| Both DBMS products failing, divergent, S.E        | 1     |
| Both DBMS products failing, divergent, N.S.E      | 0     |

Bottom Line:
• Not a statistical sample (not enough time)
• 2/40 = 5% of failures not detected across both products
• Out of the 20 failures for Oracle 10.0, 6 were N.S.E, and 4 of those 6 would be resolved by using a past release in tandem with a future release
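The tally on this slide can be reproduced mechanically from per-bug observations. The following is a minimal sketch, assuming each bug script run is stored as a small record; the `BugResult` type, its field names, and the category strings are illustrative and not part of the original study. The three sample rows are taken from the 10.0 results table.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BugResult:
    """One bug script run against both releases (hypothetical record layout)."""
    bug_id: str
    self_evident: bool  # failure is self-evident (S.E) on the release it was reported for
    other_fails: bool   # the other release also fails on the same script
    divergent: bool     # the two releases fail in different ways


def classify(result: BugResult) -> str:
    """Map one result to a summary category like those on the slide."""
    se = "S.E" if result.self_evident else "N.S.E"
    if not result.other_fails:
        # Assumption: divergence only matters when both releases fail.
        return f"one product fails / {se}"
    kind = "divergent" if result.divergent else "non-divergent"
    return f"both fail / {kind} / {se}"


def tally(results) -> Counter:
    """Count how many results fall into each category."""
    return Counter(classify(r) for r in results)


if __name__ == "__main__":
    sample = [
        BugResult("5731063", self_evident=True, other_fails=False, divergent=False),
        BugResult("3903063", self_evident=False, other_fails=True, divergent=False),
        BugResult("2929556", self_evident=True, other_fails=True, divergent=True),
    ]
    for category, count in sorted(tally(sample).items()):
        print(f"{category}: {count}")
```

The "both fail / non-divergent / N.S.E" category is the one that matters most for design diversity, since neither release reveals the failure.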

Page 9: Reliability Analysis (Part II)

• Part II: CASRE Reliability Analysis of NCW Data Collector
1. Extract the following from failure logs using JavaScript: time of program start, time of program termination, time of thread terminations, and exception or failure messages
2. Parse failures manually into CASRE input format
3. Categorize by severity using the chart on the next slide
4. Compare 2 consecutive events (CALOE08 & MAGTF08) as well as 2 consecutive lifecycle phases within the same event (Integration & Execution)
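Step 1 could look roughly like the sketch below. The slides do not show the log layout or the extraction script, so everything here is an assumption: the line format (`<timestamp> <EVENT> <detail>`), the event names, and the sample lines are hypothetical, and the sketch is written in Python rather than the JavaScript the slide mentions.

```python
import re
from datetime import datetime

# Hypothetical log line layout: "YYYY-MM-DD HH:MM:SS EVENT optional detail".
# The real Data Collector log format is not shown in the slides.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<event>PROGRAM_START|PROGRAM_END|THREAD_END|EXCEPTION)\b ?(?P<detail>.*)$"
)


def extract_events(lines):
    """Return (timestamp, event, detail) tuples for the lines of interest.

    Lines that do not match the assumed layout are skipped.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        events.append((timestamp, match.group("event"), match.group("detail")))
    return events


if __name__ == "__main__":
    sample_log = [
        "2008-01-01 08:00:00 PROGRAM_START",
        "2008-01-01 08:05:00 EXCEPTION java.sql.SQLException: connection lost",
        "2008-01-01 08:06:00 INFO heartbeat",
        "2008-01-01 09:00:00 PROGRAM_END",
    ]
    for event in extract_events(sample_log):
        print(event)
```

The extracted tuples would still need the manual parsing and severity categorization described in steps 2 and 3.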

Page 10: Severity

| Severity Code | Failure Description                                               |
| 9             | Failure Causes Machine to be Rebooted Causing Catastrophic Loss   |
| 8             | Failure Causes Program Abort                                      |
| 7             | Failure Causes Program Thread Abort                               |
| 5             | Failure Causes Record Not to be Written, Thread Continues         |
| 3             | Failure Causes Incorrect Data to be Written, Thread Continues     |
| 1             | Failure is Caught, Handled and Recovers Correctly                 |
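The severity chart maps naturally onto an enumeration plus a small decision function. This is a sketch only: the codes and their meanings come from the chart, while the `Severity` names, the `severity_for` helper, and its keyword flags are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity codes from the chart (member names are illustrative)."""
    MACHINE_REBOOT = 9
    PROGRAM_ABORT = 8
    THREAD_ABORT = 7
    RECORD_NOT_WRITTEN = 5
    INCORRECT_DATA_WRITTEN = 3
    HANDLED = 1


def severity_for(*, rebooted=False, program_aborted=False, thread_aborted=False,
                 record_written=True, data_correct=True) -> Severity:
    """Pick the most severe code that applies to an observed failure."""
    if rebooted:
        return Severity.MACHINE_REBOOT
    if program_aborted:
        return Severity.PROGRAM_ABORT
    if thread_aborted:
        return Severity.THREAD_ABORT
    if not record_written:
        return Severity.RECORD_NOT_WRITTEN
    if not data_correct:
        return Severity.INCORRECT_DATA_WRITTEN
    return Severity.HANDLED


if __name__ == "__main__":
    print(int(severity_for(thread_aborted=True)))
    print(int(severity_for(record_written=False)))
```

Checking conditions from most to least severe means a failure that matches several rows is always assigned the highest applicable code.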

Page 11: Using CASRE

Page 12: Using CASRE (cont.)

Page 13: CASRE Input Format

FAILURE COUNT FORMAT

Columns: Interval Number (int), Number of Errors (float), Interval Length (float), Error Severity (int)

Example:
Hours
1 5.0 40.0 1
1 3.0 40.0 2
1 2.0 40.0 3
2 4.0 40.0 1
2 3.0 40.0 3
3 7.0 40.0 1
4 5.0 40.0 1
5 4.0 40.0 1

TIME BETWEEN FAILURES FORMAT: N/A
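Producing a file in the failure count format shown on this slide is a short formatting exercise. The sketch below assumes a time-unit line followed by one whitespace-separated row per (interval, severity) pair, which matches the slide's example; the function name `format_failure_counts` and the single-space separator are assumptions, not taken from the CASRE documentation.

```python
def format_failure_counts(rows, time_unit="Hours"):
    """Render rows as failure count text.

    Each row is (interval_number, number_of_errors, interval_length, severity).
    The first output line names the time unit, as in the slide's example.
    """
    lines = [time_unit]
    for interval, n_errors, length, severity in rows:
        # Counts and lengths are floats in the slide's column description.
        lines.append(f"{int(interval)} {float(n_errors):.1f} {float(length):.1f} {int(severity)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # First three rows of the example on the slide.
    example = [(1, 5, 40, 1), (1, 3, 40, 2), (1, 2, 40, 3)]
    print(format_failure_counts(example), end="")
```

Writing the returned string to a file gives an input that can then be loaded for trend analysis.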

Page 14: CASRE Failure Counts

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 15: CASRE Time Between Failures

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 16: CASRE Failure Intensity

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 17: CASRE Cumulative Failures

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 18: CASRE Test Interval Length

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 19: Detecting Reliability Trends

• Running Average:
– Not as useful for failure count data (unless test intervals are of equal length)
– Computes the running average of the time between successive failures for time between failures data, or the running average of the number of failures per interval for failure count data.
– For failure count data, a running average that decreases with time (fewer failures per test interval) indicates reliability growth; for time between failures data, growth shows up as an increasing running average.
• Laplace Test:
– Not as useful for failure count data (unless test intervals are of equal length)
– Null hypothesis: failures occur as a homogeneous Poisson process
– If the test statistic decreases with increasing failure number, the null hypothesis can be rejected in favor of reliability growth at an appropriate significance level; if it increases, the null hypothesis can be rejected in favor of reliability decay.
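Both trend indicators are simple to compute for failure count data with equal-length intervals. The sketch below uses the grouped-data form of the Laplace factor commonly given in the software reliability literature, u(k) = (Σ (i−1)·n_i − ((k−1)/2)·Σ n_i) / sqrt(((k²−1)/12)·Σ n_i), with sums over intervals i = 1..k; it is not necessarily what CASRE computes internally, and the function names and sample counts are illustrative.

```python
import math


def running_average(counts):
    """Running average of failures per interval after each interval."""
    averages = []
    total = 0.0
    for k, n in enumerate(counts, start=1):
        total += n
        averages.append(total / k)
    return averages


def laplace_failure_counts(counts):
    """Laplace factor u(k) for k = 2..len(counts), equal-length intervals assumed.

    Negative values suggest decreasing failure intensity (reliability growth),
    positive values suggest increasing failure intensity.
    """
    factors = []
    for k in range(2, len(counts) + 1):
        window = counts[:k]
        total = sum(window)
        if total == 0:
            factors.append(math.nan)  # undefined when no failures were observed
            continue
        # enumerate() starts at 0, which gives the (i - 1) weight for 1-based i.
        weighted = sum(i * n for i, n in enumerate(window))
        numerator = weighted - (k - 1) / 2 * total
        denominator = math.sqrt((k * k - 1) / 12 * total)
        factors.append(numerator / denominator)
    return factors


if __name__ == "__main__":
    counts = [8, 6, 4, 2]  # illustrative failures per interval, not study data
    print(running_average(counts))
    print(laplace_failure_counts(counts))
```

For these illustrative counts the running average falls and the Laplace factor becomes more negative with each interval, which is the pattern the slide associates with reliability growth.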

Page 20: Running Average

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 21: Laplace Test

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 22: CASRE Cum Failure Predictions

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 23: CASRE Prediction Setup

[Screens: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 24: CASRE Reliability Prediction

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 25: CASRE Prequential Likelihood

[Plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 26: CASRE Model-Ranking

[Results: CALOE+MAGTF Execution; MAGTF Integration + Execution]

Page 27: Reliability Models

• Haven’t been able to get these to run yet.
• The instruction manual says many of the built-in models only work with Time Between Failures data.
• Doubt there would be much utility with Failure Count data.

Page 28: Conclusion/Follow-Up

• It would actually be quite easy to integrate auto-generation of Failure Count or Time Between Failures output into my environment
• This would facilitate quick trend analysis
• Reliability trends, not the actual numbers, are what is important
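The auto-generation idea could start from a list of failure times measured from the start of testing. This is a sketch under stated assumptions: failure times are given in hours, intervals have a fixed length, and the function names `failure_counts` and `time_between_failures` are hypothetical rather than part of any existing environment.

```python
import math


def failure_counts(failure_times_h, interval_length_h, total_hours):
    """Bin failure times (hours since test start) into fixed-length intervals."""
    n_intervals = math.ceil(total_hours / interval_length_h)
    counts = [0] * n_intervals
    for t in failure_times_h:
        if t < 0 or t > total_hours:
            raise ValueError(f"failure time {t} is outside the test period")
        # A failure exactly at the end of testing belongs to the last interval.
        index = min(int(t // interval_length_h), n_intervals - 1)
        counts[index] += 1
    return counts


def time_between_failures(failure_times_h):
    """Gaps between successive failures; the first gap is measured from time 0."""
    gaps = []
    previous = 0.0
    for t in sorted(failure_times_h):
        gaps.append(t - previous)
        previous = t
    return gaps


if __name__ == "__main__":
    times = [1.0, 5.0, 39.5, 41.0, 100.0]  # illustrative values, not study data
    print(failure_counts(times, interval_length_h=40.0, total_hours=120.0))
    print(time_between_failures(times))
```

Either output can be fed to a trend check after every run, which fits the point that trends matter more than the absolute numbers.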