# SQAT (Software Quality Assurance and Testing)

Date posted: 01-Dec-2015



Unit-III

5. Software quality metrics methodology:

The software quality metrics methodology is a systematic approach to

establishing quality requirements and identifying, implementing, analyzing, and

validating the process and product of software quality metrics for a software system. It

spans the entire software life cycle and comprises five steps.

These steps shall be applied iteratively because insights gained from applying a

step may show the need for further evaluation of the results of prior steps.

(a) Establish software quality requirements: A list of quality factors is selected, prioritized, and quantified at the outset of system development or system change. These requirements shall be used to guide and control the development of the system and to assess whether the system met the quality requirements specified in the contract.

(b) Identify software quality metrics: The software quality metrics framework is applied in the selection of relevant metrics.

(c) Implement the software quality metrics: Tools are either procured or developed, data is collected, and metrics are applied at each phase of the software life cycle.

(d) Analyze software quality metrics results: The metrics results are analyzed and

reported to help control the development and assess the final product.

(e) Validate the software quality metrics: Predictive metrics results are compared to the direct metrics results to determine whether the predictive metrics accurately measure their associated factors.

The documentation or output produced as a result of these steps is shown in Table 1.

Table 1 - Outputs of metrics methodology steps

| Metrics methodology step | Output |
| --- | --- |
| Establish software quality requirements | Quality requirements |
| Identify software quality metrics | Approved quality metrics framework; metrics set; cost-benefit analysis |
| Implement the software quality metrics | Description of data items; metrics data items; traceability matrix; training plan and schedule |

5.1 Establish software quality requirements:

Quality requirements shall be represented in either of the following forms:

Direct metric value: A numerical target for a factor to be met in the final product. For example, mean time to failure (MTTF) is a direct metric of final system reliability.

Predictive metric value: An intermediate requirement that is an early indicator of final system performance. For example, design or code errors may be early predictors of final system reliability.

5.1.1 Identify a list of possible quality requirements:

Identify quality requirements that may be applicable to the software system. Use

organizational experience, required standards, regulations, or laws to create this list.

Annex A contains sample lists of factors and subfactors. In addition, list other system requirements that may affect the feasibility of the quality requirements. Consider acquisition concerns, such as cost or schedule constraints, warranties, and organizational self-interest. Do not rule out mutually exclusive requirements at this point. Focus on factor/direct metric combinations instead of predictive metrics.

All parties involved in the creation and use of the system shall participate in the quality requirements identification process.

5.1.2 Determine the actual list of quality requirements:

Rate each of the listed quality requirements by importance. Importance is a

function of the system characteristics and the viewpoints of the people involved.

To determine the actual list of quality requirements, follow the two steps below:

(a) Survey all involved parties for their views on the importance of each listed requirement.

(b) Create the actual quality requirements: Resolve the results of the survey into a single list of quality requirements. This shall involve a technical feasibility analysis of the quality requirements. The proposed factors for this list may have cooperative or conflicting relationships. Conflicts between requirements shall be resolved at this point. In addition, if the choice of quality requirements is in conflict with cost, schedule, or system functionality, one or the other shall be altered. Care shall be exercised in choosing the desired list to ensure that the requirements are technically feasible, reasonable, complementary, achievable, and verifiable. All involved parties shall agree to this final list.

5.1.3 Quantify each factor:

For each factor, assign one or more direct metrics to represent the factor, and assign direct metric values to serve as quantitative requirements for that factor. For example, if high efficiency was one of the quality requirements from the previous item, the direct metric actual resource utilization/allocated resource utilization, with a value of 90%, could represent that factor. This direct metric value is used to verify the achievement of the quality requirement. Without it, there is no way to tell whether or not the delivered system meets its quality requirements.

The quantified list of quality requirements and their definitions shall again be approved by all involved parties.
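As an illustration, a quantified requirement like the one above can be checked mechanically once the direct metric value is measured. This is a minimal sketch in Python; the function names are invented, and only the 90% target comes from the text:

```python
# Sketch: verify the quantified "efficiency" requirement, represented by
# the direct metric actual/allocated resource utilization (target <= 90%).

def utilization_ratio(actual: float, allocated: float) -> float:
    """Direct metric value for the efficiency factor."""
    return actual / allocated

def meets_requirement(actual: float, allocated: float,
                      target: float = 0.90) -> bool:
    """True if the delivered system satisfies the quantified requirement."""
    return utilization_ratio(actual, allocated) <= target

print(meets_requirement(72.0, 100.0))  # 0.72 utilization -> True
print(meets_requirement(95.0, 100.0))  # 0.95 utilization -> False
```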

5.2 Identify software quality metrics:


When identifying software quality metrics, apply the software quality metrics

framework, and perform a cost-benefit analysis. Then gain commitment to the metrics

from all involved parties.

5.2.1 Apply the software quality metrics framework:

Create a chart of the quality requirements based on the hierarchical tree structure

found in figure 1. At this point, only the factor level must be complete. Next decompose

each factor into subfactors as indicated in clause 4. The decomposition into subfactors

must continue for as many levels as needed until the subfactor level is complete.


Using the software quality metrics framework, decompose the subfactors into

measurable metrics. For each validated metric on the metric level, assign a target value

and a critical value and range that should be achieved during development. The target

values constitute additional quality requirements for the system.

The framework and the target values for the metrics shall be reviewed and

approved by all involved parties.

To help ensure that metrics are used appropriately, only validated metrics (that is,

either direct metrics or metrics validated with respect to direct metrics) shall be used to

assess current and future product and process quality. Nonvalidated metrics may be

included for future analysis, but shall not be included as part of the system requirements.

Furthermore, the metrics that are used shall be those that are associated with the quality requirements of the software project. Given that the above conditions are satisfied, the selection of specific metrics as candidates for validation, and the selection of specific validated metrics for application, is at the discretion of the user of this standard. Examples of metrics and experiences with their use are given in annex B.


Document each metric using the format shown in table 2.

Table 2 - Metrics set

| Item | Description |
| --- | --- |
| Name | Name given to the metric. |
| Impact | Indication of whether the metric may be used to alter or halt the project. |
| Target value | Numerical value of the metric that is to be achieved in order to meet the quality requirement. |
| Factors | Factors that are related to this metric. |

5.2.2 Perform a cost-benefit analysis:

Perform a cost-benefit analysis by identifying the costs of implementing the

metrics, identifying the benefits of applying the metrics, and then applying the metrics

set.

5.2.2.1 Identify the costs of implementing the metrics:

Identify document(see 2.2) all the costs associated with the metrics in the metrics set.

For each met ric, estimate and document the following impacts and costs.

Metric utilization cost: The costs of collecting data; automating the metric value

calculation (when possible); and analyzing, interpreting, and reporting the results.

Software development cost change: The set of metrics may imply a change in the

organizational structure used to produce the software system.

Special equipment: The hardware or software tools may have to be located, purchased,

adapted, or developed to implement the metrics.

Training: The quality assurance/control organization or the entire development team may need training in the use of the metrics and data collection procedures.

If the introduction of metrics has caused changes in the development process, the development team may need to be educated about the changes.

5.2.2.2 Identify the benefits of applying the metrics:

Identify and document the benefits that are associated with each metric in the metrics set. Some benefits to consider include the following:

- Identify quality goals and increase awareness of the goals in the software organization.
- Provide timely feedback useful in developing higher quality software.
- Increase customer satisfaction by quantifying the quality of the software before it is delivered to the customer.
- Provide a quantitative basis for making decisions about software quality.

5.2.2.3 Adjust the metrics set:

Weigh the benefits, tangible and intangible, against the costs of each metric. If the

costs exceed the benefits of a given metric, alter or delete it from the metrics set. On the

other hand, for metrics that remain, make plans for any necessary changes to the software

development process, organizational structure, tools, and training. In most cases it will

not be feasible to quantify benefits. In these cases judgment shall be exercised in

weighing qualitative benefits against quantitative costs.
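The adjustment rule above can be sketched as a simple filter; the metric names, costs, and benefits below are invented for illustration:

```python
# Illustrative sketch of 5.2.2.3: drop any metric whose estimated cost
# exceeds its estimated benefit (values here are invented).

metrics_set = [
    {"name": "cyclomatic complexity", "cost": 2.0, "benefit": 5.0},
    {"name": "fault density",         "cost": 4.0, "benefit": 6.0},
    {"name": "comment ratio",         "cost": 3.0, "benefit": 1.5},
]

# Keep a metric only when its benefit covers its cost.
adjusted = [m for m in metrics_set if m["benefit"] >= m["cost"]]
print([m["name"] for m in adjusted])  # "comment ratio" is deleted
```

In practice the benefits are often qualitative, so this numeric comparison is only a starting point for the judgment the text calls for.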

5.2.3 Gain commitment to the metrics set:

All involved parties shall review the revised metrics set to which the cost-benefit


analysis has been added. The metrics set shall be formally adopted and supported by this

group.

5.3 Implement the software quality metrics:

To implement the software quality metrics, define the data collection procedures,

use selected software to prototype the measurement process, then collect the data and

compute the metrics values.

5.3.1 Define the data collection procedures:

For each metric in the metric set, determine the data that will be collected and

determine the assumptions that will be made about the data (for example, random sample,

subjective or objective measure). The flow of data shall be shown from point of

collection to evaluation of metrics. Describe or reference when and how tools are to be

used and data storage procedures. Also, identify candidate tools. Select tools for use with

the prototyping process. A traceability matrix shall be established between metrics and

data items.

Identify the organizational entities that will directly participate in data collection

including those responsible for monitoring data collection. Describe the training and

experience required for data collection and the training process for personnel involved.

Describe each data item thoroughly, using the format shown in Table 3.

Table 3 - Description of a data item

| Item | Description |
| --- | --- |
| Name | Name given to the data item. |
| Metrics | Metrics that are associated with the data item. |
| Definition | Straightforward description of the data item. |
| Source | Location of where the data originates. |
| Collector | Entity responsible for collecting the data. |
| Timing | Time(s) in the life cycle at which the data is to be collected. (Some data items are collected more than once.) |
| Procedures | Methodology (for example, automated or manual) used to collect the data. |
| Representation | Manner in which the data is represented, for example, its precision and format. |
| Sample | Method used to select the data to be collected and the percentage of the available data that is to be collected. |
| Alternatives | Methods that may be used to collect the data, other than the chosen method. |

5.3.2 Prototype the measurement process:

Test the data collection and metric computation procedures on selected software that will act as a prototype. If possible, the samples selected should be similar to the project(s) on which the metrics will later be used. An analysis shall be made to determine whether the data is collected uniformly and whether instructions have always been interpreted in the same manner. In particular, data requiring subjective judgments shall be checked to determine if the descriptions and instructions are clear enough to ensure uniform results.

In addition, the cost of the measurement process for the prototype shall be

examined to verify or improve the cost analysis.

5.3.3 Collect the data and compute the metrics values:

Using the formats in table 2 and table 3, collect and store data at the appropriate

time in the life cycle. The data shall be checked for accuracy and proper unit of measure.

Data collection shall be monitored. If a sample of data is used, requirements such

as randomness, minimum size of sample, and homogeneous sampling shall be verified.

If more than one person is collecting the data, it shall be checked for uniformity.

Compute the metrics values from the collected data.
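As a hedged sketch of this step, the following computes one common metric value (defect density) from collected data items, rejecting data with a bad unit of measure; the sample values are invented:

```python
# Sketch of 5.3.3: compute metric values from collected data, with a
# basic accuracy/unit check before computation.

def defect_density(defects: int, kloc: float) -> float:
    """Defects per thousand lines of code."""
    if kloc <= 0:
        raise ValueError("KLOC must be positive (check unit of measure)")
    return defects / kloc

# Collected data items: (defects found, size in KLOC) per component.
samples = [(12, 4.0), (3, 1.5), (7, 3.5)]
values = [defect_density(d, k) for d, k in samples]
print(values)  # [3.0, 2.0, 2.0]
```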

5.4 Analyze the software metrics results:


Analyzing the results of the software metrics includes interpreting the results,

identifying software quality, making software quality predictions, and ensuring

compliance with requirements.

5.4.1 Interpret the results:

The results shall be interpreted and recorded against the broad context of the project as well as for a particular product or process of the project. The differences between the collected metric data and the target values for the metrics shall be analyzed against the quality requirements. Substantive differences shall be investigated.

5.4.2 Identify software quality:

Quality metric values for software components shall be determined and reviewed.

Quality metric values that are outside the anticipated tolerance intervals (low or

unexpected high quality) shall be identified for further study. Unacceptable quality may

be manifested as excessive complexity, inadequate documentation, lack of traceability, or

other undesirable attributes.

The existence of such conditions is an indication that the software may not satisfy quality requirements when it becomes operational. Since many of the direct metrics that are usually of interest cannot be measured during software development (for example, reliability metrics), validated metrics shall be used when direct metrics are not available.

Direct or validated metrics shall be measured for software components and process steps.

The measurements shall be compared with critical values of the metrics.

Software components whose measurements deviate from the critical values shall

be analyzed in detail.

Unexpected high quality metric values shall cause a review of the software

development process as the expected tolerance levels may need to be modified or the

process for identifying quality metric values may need to be improved.
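A minimal sketch of this flagging step, assuming a single critical value per metric and invented component names:

```python
# Sketch of 5.4.2: flag components whose measured quality metric value
# deviates past the critical value (all figures are invented).

CRITICAL = 10.0  # e.g., maximum acceptable cyclomatic complexity

measurements = {"parser": 7.2, "scheduler": 14.5, "logger": 9.9}

# Components beyond the critical value need detailed analysis.
flagged = sorted(name for name, v in measurements.items() if v > CRITICAL)
print(flagged)  # ['scheduler']
```

A fuller version would also flag unexpectedly *good* values, since the text notes those should trigger a review of the tolerance levels.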

5.4.3 Make software quality predictions:


During development, validated metrics shall be used to make predictions of direct metric values. Predicted values of direct metrics shall be compared with target values to determine whether to flag software components for detailed analysis. Predictions shall be

made for software components and process steps. Software components and process steps

whose predicted direct metric values deviate from the target values shall be analyzed in

detail.

Potentially, prediction is very valuable because it estimates the metric of ultimate interest: the direct metric. However, prediction is difficult because it involves using

validated metrics from an earlier phase of the life cycle (for example, development) to

make a prediction about a different but related metric (direct metric) in a much later

phase (for example, operations).

5.4.4 Ensure compliance with requirements:

Direct metrics shall be used to ensure compliance of software products with quality requirements during system and acceptance testing. Direct metrics shall be measured for software components and process steps. These values shall be compared with target values of the direct metrics that represent quality requirements. Software components and process steps whose measurements deviate from the target values are noncompliant.

5.5 Validate the software quality metrics:

For information on the statistical techniques used, see [B114], [B115], [B117], or similar references.

5.5.1 Purpose of metrics validation:

The purpose of metrics validation is to identify both product and process metrics that can predict specified quality factor values, which are quantitative representations of quality requirements. If metrics are to be useful, they shall indicate accurately whether


quality requirements have been achieved or are likely to be achieved in the future. When

it is possible to measure factor values at the desired point in the life cycle, these direct

metrics are used to evaluate software quality. At some points in the life cycle, certain

quality factor values (for example, reliability) are not available. They are obtained after

delivery or late in the project. In these cases, other metrics are used early on in a project

to predict quality factor values.

Although quality subfactors are useful when identifying and establishing factors and metrics, they need not be used in metrics validation, because the focus in validation is on determining whether a statistically significant relationship exists between predictive metric values and factor values.

Quality factors may be affected by multiple variables. A single metric, therefore, may not sufficiently represent any one factor if it ignores these other variables.

5.5.2 Validity criteria:

To be considered valid, a predictive metric shall demonstrate a high degree of

association with the quality factors it represents. This is equivalent to accurately

portraying the quality condition(s) of a product or process. A metric may be valid with

respect to certain validity criteria and invalid with respect to other criteria.

The person in the organization who understands the consequences of the values selected shall designate threshold values for the following:

- V: square of the linear correlation coefficient
- B: rank correlation coefficient
- A: prediction error
- P: success rate

A short numerical example follows the definition of each validity criterion. Detailed examples of the application of metrics validation are contained in the annex.

Correlation: The variation in the quality factor values explained by the variation in the metric values, which is given by the square of the linear correlation coefficient (R²) between the metric and the corresponding factor, shall exceed V; that is, R² > V.

This criterion assesses whether there is a sufficiently strong linear association between a factor and a metric to warrant using the metric as a substitute for the factor when it is infeasible to use the latter. For example, the correlation coefficient between a complexity metric and the factor reliability may be 0.8. The square of this is 0.64: only 64% of the variation in the factor is explained by the variation in the metric. If V has been established as 0.7, the conclusion would be drawn that the metric is invalid.

Tracking: If a metric M is directly related to a quality factor F for a given product or process, then a change in the quality factor value from F_T1 to F_T2 at times T1 and T2 shall be accompanied by a change in metric value from M_T1 to M_T2 in the same direction.

For example, if a complexity metric is claimed to be a measure of reliability, then

it is reasonable to expect a change in the reliability of a software component to be

accompanied by an appropriate change in metric value (for instance, if the product

increases in reliability, the metric value should also change in a direction that indicates

the product has improved).

That is, if mean time to failure (MTTF) is used to measure reliability and is equal to 1000 hours during testing (T1) and 1500 hours during operation (T2), a complexity metric whose value is 8 in T1 and 6 in T2, where 6 indicates less complexity than 8, is said to track reliability for this software component. If this relationship is demonstrated

over a representative sample of software components, the conclusion could be drawn that

the metric can track reliability (that is, indicate changes in product reliability) over the

software life cycle.
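The tracking criterion reduces to a direction check; here is a small sketch using the MTTF/complexity numbers from the example (the function name and the metric-direction parameter are assumptions):

```python
# Sketch of the tracking test: the metric must move in a direction
# consistent with the factor's change between times T1 and T2.

def tracks(f_t1, f_t2, m_t1, m_t2, metric_improves_when="decreases"):
    """True if the metric change agrees with the factor change."""
    factor_improved = f_t2 > f_t1
    if metric_improves_when == "decreases":
        metric_improved = m_t2 < m_t1
    else:
        metric_improved = m_t2 > m_t1
    return factor_improved == metric_improved

# MTTF rose 1000 -> 1500 h while complexity fell 8 -> 6: tracking holds.
print(tracks(1000, 1500, 8, 6))  # True
```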

Consistency: If quality factor values F1, F2, ..., Fn, corresponding to products or processes 1, 2, ..., n, have the relationship F1 > F2 > ... > Fn, then the corresponding metric values shall have the relationship M1 > M2 > ... > Mn. To perform this test, compute the coefficient of rank correlation (r) between paired values (from the same software components) of the factor and the metric; |r| shall exceed B. This criterion assesses whether there is consistency between the ranks of the factor values of a set of software components and the ranks of the metric values for the same set of software components.
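The consistency criterion can be tested with Spearman's rank correlation; this sketch implements the textbook formula (no tie handling) on invented values:

```python
# Sketch of the consistency test: |Spearman rank correlation| between
# factor values and metric values (same components) must exceed B.

def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via the d^2 formula (tie-free data only)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

factor = [9.1, 7.4, 5.0, 3.2]   # F1 > F2 > F3 > F4
metric = [8.8, 8.0, 4.1, 2.0]   # same ordering: perfect rank agreement

B = 0.8
r = spearman(factor, metric)
print(abs(r) > B)  # True
```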

Predictability: If a metric is used at time T1 to predict a quality factor value for a given product or process, it shall predict a related quality factor value F_pT2 with an accuracy of |F_pT2 - F_aT2| / F_aT2 < A, where F_aT2 is the actual value of F at time T2. This criterion assesses whether a metric is capable of predicting a factor value with the required accuracy. For example, if a complexity metric is used during development to predict the reliability of a software component during operation (T2) to be 1200 hours MTTF (F_pT2), and the actual MTTF measured during operation is 1000 hours (F_aT2), then the error in prediction is 200 hours, or 20%.

If the acceptable prediction error (A) is 25%, prediction accuracy is acceptable. If

the ability to predict is demonstrated over a representative sample of software

components, the conclusion could be drawn that the metric can be used as a predictor of

reliability.
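The predictability criterion, computed with the numbers from the example above (the helper name is an assumption):

```python
# Sketch of the predictability test: relative prediction error must
# stay below the threshold A.

def prediction_error(predicted: float, actual: float) -> float:
    """|F_pT2 - F_aT2| / F_aT2."""
    return abs(predicted - actual) / actual

A = 0.25
# Predicted MTTF 1200 h, actual MTTF 1000 h -> 20% error.
err = prediction_error(1200.0, 1000.0)
print(err, err < A)  # 0.2 True
```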

Discriminative power: A metric shall be able to discriminate between high-quality software components (for example, high MTTF) and low-quality software components (for example, low MTTF). For instance, the set of metric values associated with the former should be significantly higher (or lower) than those associated with the latter.

The Mann-Whitney test and the chi-square test for differences in probabilities (contingency tables) can be used for this validation test.

Reliability: A metric shall demonstrate the above correlation, tracking, consistency, predictability, and discriminative power properties for P percent of the applications of the

metric. This criterion is used to ensure that a metric has passed a validity test over a

sufficient number or percentage of applications so that there will be confidence that the

metric can perform its intended function consistently.

5.5.3 Validation procedure:

Metrics validation shall include identifying the quality factors sample, identifying the metrics sample, performing a statistical analysis, and documenting the results.

5.5.3.1 Identify the quality factors sample:

These factor values (for example, measurements of reliability), which represent

the quality requirements of a project, were previously identified (see 5.1.3) and collected

and stored (see 5.3.3). For validation purposes, draw a sample from the metrics database.


5.5.3.2 Identify the metrics sample:

These metrics (for example, design structure) are used to predict or to represent

quality factor values, when the factor values cannot be measured. The metrics were

previously identified (see 5.2.1) and their data collected and stored, and values computed

from the collected data (see 5.3.3). For validation purposes, draw a sample from the same

domain (for example, same software components) of the metrics database as used in

5.5.3.1.

5.5.3.3 Perform a statistical analysis:

The tests described in 5.5.2 shall be performed.

Before a metric is used to evaluate the quality of a product or process, it shall be

validated against the criteria described in 5.5.2. If a metric does not pass all of the

validity tests, it shall only be used according to the criteria prescribed by those tests (for

example, if it only passes the tracking validity test, it shall be used only for tracking the

quality of a product or process).

5.5.3.4 Document the results:

Documenting results shall include the direct metric, predictive metric, validation

criteria, and numerical results, as a minimum.

5.5.4 Additional requirements:

Additional requirements, those being the need for revalidation, confidence in

analysis results, and the stability of environment, are described in the following

subclauses.

5.5.4.1 The need for revalidation:

It is important to revalidate a predictive metric before it is used for another environment or application. As the software engineering process changes, the validity of metrics changes. Cumulative metric validation values may be misleading because a


metric that has been valid for several uses may become invalid. It is wise to compare the

one-time validation of a metric with its validation history to avoid being misled.

The following statements of caution should be noted:

*A validated metric may not necessarily be valid in other environments or future

applications.

*A metric that has been invalidated may be valid in other environments or future

applications.

5.5.4.2 Confidence in analysis results:

Metrics validation is a continuous process. Confidence in metrics is increased over time as metrics are validated on a variety of projects and as the metrics database

and sample size increases. Confidence is not a static, one-time property. If a metric is

valid, confidence will increase with increased use (that is, the correlation coefficient will

be significant at decreasing values of the significance level). Greatest confidence occurs

when metrics have been tentatively validated based on data collected from previous

projects. Even when this is the case, the validation analysis will continue into future

projects as the metrics database and sample size grow.

5.5.4.3 Stability of environment:

Practicable metrics validation shall be undertaken in a stable development

environment (that is, where the design language, implementation language, or program

development tools do not change over the life of the project in which validation is

performed). In addition, there shall be at least one project in which metrics data have

been collected and validated prior to application of the predictive metrics.

This project shall be similar to the one in which the metrics are applied with

respect to software engineering skills, application, size, and software engineering

environment.

SOFTWARE QUALITY INDICATOR:


A software quality indicator (SQI) is a variable whose value can be determined through direct analysis of product or process characteristics and whose evidential relationship to one or more software engineering attributes is undeniable.

A Software Quality Indicator can be calculated to provide an indication of the quality of the system by assessing system characteristics.

Method

Assemble a quality indicator from factors that can be determined automatically with commercial or custom code scanners, such as the following:

- cyclomatic complexity of code (e.g., McCabe's Complexity Measure),
- unused/unreferenced code segments (these should be eliminated over time),
- average number of application calls per module (complexity is directly proportional to the number of calls),
- size of compilation units (reasonably sized units have approximately 20 functions (or paragraphs), or about 2000 lines of code; these guidelines will vary greatly by environment),
- use of structured programming constructs (e.g., elimination of GOTOs and circular procedure calls).

These measures apply to traditional 3GL environments and are more difficult to determine in environments which are using object-oriented languages, 4GLs, or code generators.
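One way to assemble such an indicator is a weighted average of normalized factor scores. The weights, the 0-to-1 score scale, and the factor names below are assumptions for illustration, not part of the method:

```python
# Illustrative composite SQI: combine normalized scanner results
# (1.0 = best) into a single indicator between 0 and 1.

def sqi(scores: dict, weights: dict) -> float:
    """Weighted average of normalized factor scores."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

scores = {
    "complexity": 0.8,         # low cyclomatic complexity
    "dead_code": 1.0,          # no unreferenced segments
    "calls_per_module": 0.6,
    "unit_size": 0.9,
    "structured": 0.7,         # few GOTOs / circular calls
}
weights = {k: 1.0 for k in scores}  # equal weighting as a starting point

print(round(sqi(scores, weights), 2))  # 0.8
```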

Tips and Hints

With existing software, the Software Quality Indicator could also include a measure of the reliability of the code. This can be determined by keeping a record of how many times each module has to be fixed in a given time period.

There are other factors which contribute to the quality of a system, such as:

- procedure re-use,
- clarity of code and documentation,
- consistency in the application of naming conventions,
- adherence to standards,
- consistency between documentation and code,
- the presence of current unit test plans.

These factors are harder to determine automatically. However,

with the introduction of CASE tools and reverse-engineering tools,

and as more of the design and documentation of a system is

maintained in structured repositories, these measures of quality

will be easier to determine, and could be added to the indicator.

Fundamentals in Measurement Theory:

This section discusses the fundamentals of measurement theory. We outline the relationships among theoretical concepts, definitions, and measurement, and describe some basic measures that are used frequently. It is important to distinguish the levels of the conceptualization process, from abstract concepts, to definitions that are used operationally, to actual measurements. Depending on the concept and the operational definition derived from it, different levels of measurement may be applied: nominal scale, ordinal scale, interval scale, and ratio scale.

It is also beneficial to spell out the explicit differences among some basic

measures such as ratio, proportion, percentage, and rate. Significant amounts of wasted

effort and resources can be avoided if these fundamental measurements are well

understood.

We then focus on measurement quality. We discuss the most important issues in


measurement quality, namely, reliability and validity, and their relationships with

measurement errors. We then discuss the role of correlation in observational studies and

the criteria for causality.

3.1 Definition, Operational Definition, and Measurement:

It is undisputed that measurement is crucial to the progress of all sciences. Scientific

progress is made through observations and generalizations based on data and

measurements, the derivation of theories as a result, and in turn the confirmation or

refutation of theories via hypothesis testing based on further empirical data. As an

example, consider the proposition "the more rigorously the front end of the software

development process is executed, the better the quality at the back end."

To confirm or refute this proposition, we first need to define the key concepts. For

example, we define "the software development process" and distinguish the process steps

and activities of the front end from those of the back end. Assume that after the

requirements-gathering process, our development process consists of the following

phases:

Design

Design reviews and inspections

Code

Code inspection

Debug and development tests

Integration of components and modules to form the product

Formal machine testing

Early customer program

Integration is the development phase during which various parts and components are

integrated to form one complete software product. Usually after integration the product is

under formal change control. Specifically, after integration every change of the software

must have a specific reason (e.g., to fix a bug uncovered during testing) and must be

documented and tracked. Therefore, we may want to use integration as the cutoff point:

The design, coding, debugging, and integration phases are classified as the front end of


the development process and the formal machine testing and early customer trials

constitute the back end.

We need to specify the indicator(s) of the definition and to make it (them)

operational. For example, suppose the process documentation says all designs and code

should be inspected. One operational definition of rigorous implementation may be

inspection coverage expressed in terms of the percentage of the estimated lines of code

(LOC) or of the function points (FP) that are actually inspected. Another indicator of

good reviews and inspections could be the scoring of each inspection by the inspectors at

the end of the inspection, based on a set of criteria.

We may want to operationally use a five-point Likert scale to denote the degree of

effectiveness (e.g., 5 = very effective, 4 = effective, 3 = somewhat effective, 2 = not

effective, 1 = poor inspection). There may also be other indicators.

In addition to design, design reviews, code implementation, and code inspections,

development testing is part of our definition of the front end of the development process.

We also need to operationally define "quality at the back end" and decide which measurement indicators to use.

For the sake of simplicity let us use defects found per KLOC (or defects per function

point) during formal machine testing as the indicator of back-end quality. From these

metrics, we can formulate several testable hypotheses such as the following:

For software projects, the higher the percentage of the designs and code that are inspected, the lower the defect rate at the later phase of formal machine testing.

The more effective the design reviews and the code inspections as scored by the

inspection team, the lower the defect rate at the later phase of formal machine

testing.

The more thorough the development testing (in terms of test coverage) before integration, the lower the defect rate at the formal machine testing phase.

With the hypotheses formulated, we can set out to gather data and test the hypotheses.

We also need to determine the unit of analysis for our measurement and data. In this case, it could be at the project level or at the component level of a large project. If we are able to collect a number of data points that form a reasonable sample size (e.g., 35 projects or components), we can perform statistical analysis to test the hypotheses.
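Such an analysis can be sketched as follows. This is an illustrative example only: the per-component coverage and defect-rate numbers are invented, and the Pearson coefficient is computed directly from its definition rather than with a statistics package.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sample: per-component inspection coverage (%) and the
# defect rate (defects/KLOC) later found in formal machine testing.
coverage = [95, 90, 85, 70, 60, 50, 40, 30]
defect_rate = [2.1, 2.5, 3.0, 4.2, 5.0, 6.1, 7.0, 8.3]

r = pearson(coverage, defect_rate)
print(f"correlation between coverage and defect rate: {r:.2f}")
```

A strongly negative coefficient in a sufficiently large sample would support the first hypothesis; a real analysis would also report statistical significance.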

The building blocks of theory are concepts and definitions. In a theoretical

definition a concept is defined in terms of other concepts that are already well

understood. In the deductive logic system, certain concepts would be taken as undefined;

they are the primitives. All other concepts would be defined in terms of the primitive

concepts. For example, the concepts of point and line may be used as undefined and the

concepts of triangle or rectangle can then be defined based on these primitives.

3.2 Level of Measurement:

We have seen the progression from theory to empirical hypothesis, and from theoretically defined concepts to operational definitions.

The process is by no means direct. As the example illustrates, when we

operationalize a definition and derive measurement indicators, we must consider the scale

of measurement. For instance, to measure the quality of software inspection we may use a

five-point scale to score the inspection effectiveness or we may use percentage to indicate

the inspection coverage. For some cases, more than one measurement scale is applicable;

for others, the nature of the concept and the resultant operational definition can be measured only with a certain scale. In this section, we briefly discuss the four levels of measurement: nominal scale, ordinal scale, interval scale, and ratio scale.

Nominal Scale

The simplest operation in science and the lowest level of measurement is classification. In classifying we attempt to sort elements into categories with respect to a

certain attribute. For example, if the attribute of interest is religion, we may classify the

subjects of the study into Catholics, Protestants, Jews, Buddhists, and so on. If we

classify software products by the development process models through which the

products were developed, then we may have categories such as waterfall development

process, spiral development process, iterative development process, object-oriented

programming process, and others.

In a nominal scale, the names of the categories and their sequence bear no

assumptions about relationships among categories. For instance, we place the waterfall

development process in front of spiral development process, but we do not imply that one

is "better than" or "greater than" the other.

Ordinal Scale

Ordinal scale refers to the measurement operations through which the subjects can

be compared in order. For example, we may classify families according to socio-

economic status: upper class, middle class, and lower class. We may classify software

development projects according to the SEI maturity levels or according to a process rigor

scale.

The ordinal measurement scale is at a higher level than the nominal scale in the

measurement hierarchy. Through it we are able not only to group subjects into categories,

but also to order the categories.

An ordinal scale is asymmetric in the sense that if A > B is true then B > A is

false. It has the transitivity property in that if A > B and B > C, then A > C.

Therefore, when we translate order relations into mathematical operations, we

cannot use operations such as addition, subtraction, multiplication, and division. We can

use "greater than" and "less than." However, in real-world applications, for some specific types of ordinal scales (such as the Likert five-point, seven-point, or ten-point scales), the assumption of equal distance is often made and operations such as averaging are applied to these scales. In such cases, we should be aware that the measurement assumption is violated, and use extreme caution when interpreting the results of data analysis.

Interval and Ratio Scales:

An interval scale indicates the exact differences between measurement points.

The mathematical operations of addition and subtraction can be applied to interval scale

data. For instance, assuming products A, B, and C are developed in the same language, if

the defect rate of software product A is 5 defects per KLOC and product B's rate is 3.5

defects per KLOC, then we can say product A's defect level is 1.5 defects per KLOC

higher than product B's defect level. An interval scale of measurement requires a well-

defined unit of measurement that can be agreed on as a common standard and that is

repeatable.

Given a unit of measurement, it is possible to say that the difference between two

scores is 15 units or that one difference is the same as a second. Assuming product C's

defect rate is 2 defects per KLOC, we can thus say the difference in defect rate between

products A and B is the same as that between B and C.

For interval and ratio scales, the measurement can be expressed in both integer and

noninteger data. Integer data are usually given in terms of frequency counts (e.g., the

number of defects customers will encounter for a software product over a specified time

length).

3.3 Some Basic Measures

Regardless of the measurement scale, when the data are gathered we need to analyze them to extract meaningful information. Various measures and statistics are available for summarizing the raw data and for making comparisons across groups. In this section we discuss some basic measures such as ratio, proportion, percentage, and rate, which are frequently used in our daily lives as well as in various activities associated with software development and software quality. These basic measures, while seemingly easy, are often misused. There are also numerous sophisticated statistical techniques and methodologies that can be employed in data analysis. However, such topics are not within the scope of this discussion.

Ratio

A ratio results from dividing one quantity by another. The numerator and denominator are from two distinct populations and are mutually exclusive. For example, in demography, the sex ratio is defined as

(number of males / number of females) x 100

If the ratio is less than 100, there are more females than males; otherwise there are more males than females.

Ratios are also used in software metrics. The most often used, perhaps, is the ratio of the number of people in an independent test organization to the number of those in the development group. The test-to-development head-count ratio could range from 1:1 to 1:10 depending on the management approach to the software development process. For the large-ratio (e.g., 1:10) organizations, the development group usually is responsible for the complete development (including extensive development tests) of the product, and the test group conducts system-level testing in terms of customer environment verifications. For the small-ratio organizations, the independent group takes the major responsibility for testing (after debugging and code integration) and quality assurance.

Proportion

Proportion is different from ratio in that the numerator in a proportion is a part of the denominator:

p = a / (a + b)

Proportion also differs from ratio in that ratio is best used for two groups, whereas proportion is used for multiple categories (or populations) of one group. In other words, the denominator in the preceding formula can be more than just a + b. If

a + b + c + d + e = N

then we have

a/N + b/N + c/N + d/N + e/N = 1

When the numerator and the denominator are integers and represent counts of certain events, then p is also referred to as a relative frequency. For example, the proportion of satisfied customers is the number of satisfied customers divided by the total number of customers in the customer set.

The numerator and the denominator in a proportion need not be integers. They can be frequency counts as well as measurement units on a continuous scale (e.g., height in inches, weight in pounds). When the measurement unit is not integer, proportions are called fractions.
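The defining property above, that the proportions of the categories of one group sum to 1, can be sketched in a few lines. The defect counts here are invented for illustration.

```python
def proportions(counts):
    """Relative frequency of each category: its count divided by the total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Hypothetical defect counts for one project, by category.
defects = {"requirements": 30, "design": 50, "coding": 100, "other": 20}

p = proportions(defects)
print(p)                 # each value is a relative frequency (a proportion)
print(sum(p.values()))   # proportions over one group sum to 1
```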

Percentage

A proportion or a fraction becomes a percentage when it is expressed in terms of per-hundred units (the denominator is normalized to 100). The word percent means per hundred. A proportion p is therefore equal to 100p percent (100p%).

Percentages are frequently used to report results, and as such are frequently misused. First, because percentages represent relative frequencies, it is important that enough contextual information be given, especially the total number of cases, so that readers can interpret the information correctly. Jones (1992) observes that many reports and presentations in the software industry are careless in using percentages and ratios. He cites the example: requirements bugs were 15% of the total, design bugs were 25% of the total, coding bugs were 50% of the total, and other bugs made up 10% of the total. Had the results been stated as follows, it would have been much more informative: the project consists of 8 thousand lines of code (KLOC); during its development a total of 200 defects were detected and removed, giving a defect removal rate of 25 defects per KLOC; of the 200 defects, requirements bugs constituted 15%, design bugs 25%, coding bugs 50%, and other bugs made up 10%.

A second important rule of thumb is that the total number of cases must be sufficiently large to use percentages. Percentages computed from a small total are not stable; they also convey an impression that a large number of cases are involved. Some writers recommend that the minimum number of cases for which percentages should be calculated is 50. We recommend that, depending on the number of categories, the minimum number be 30, the smallest sample size required for parametric statistics. If the number of cases is too small, then absolute numbers, instead of percentages, should be used. For instance: of the total 20 defects for the entire project of 2 KLOC, there were 3 requirements bugs, 5 design bugs, 10 coding bugs, and 2 others.

When results in percentages appear in table format, usually both the percentages and actual numbers are shown when there is only one variable. When there are more than two groups, such as the example in Table 3.1, it is better just to show the percentages and the total number of cases (N) for each group. With percentages and N known, one can always reconstruct the frequency distributions. The total of 100.0% should always be shown so that it is clear how the percentages are computed. In a two-way table, the direction in which the percentages are computed depends on the purpose of the comparison. For instance, the percentages in Table 3.1 are computed vertically (the total of each column is 100.0%), and the purpose is to compare the defect-type profile across projects (e.g., project B proportionally has more requirements defects than project A). In Table 3.2, the percentages are computed horizontally. The purpose here is to compare the distribution of defects across projects for each type of defect. The interpretations of the two tables differ. Therefore, it is important to carefully examine percentage tables to determine exactly how the percentages are calculated.
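The informative reporting style recommended in the Jones example, percentages accompanied by the absolute total N and the defect removal rate, can be sketched as follows. The helper function and its output format are illustrative, not from the original text.

```python
def percentage_report(counts, kloc):
    """Report defect percentages together with the absolute total (N)
    and the defect removal rate, so readers can reconstruct the counts."""
    total = sum(counts.values())
    lines = [f"N = {total} defects in {kloc} KLOC "
             f"({total / kloc:.1f} defects/KLOC)"]
    for category, n in counts.items():
        lines.append(f"{category}: {100 * n / total:.1f}%")
    return "\n".join(lines)

# The 8 KLOC / 200-defect example from the text.
counts = {"requirements": 30, "design": 50, "coding": 100, "other": 20}
print(percentage_report(counts, kloc=8))
```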

Table 3.1. Percentage Distributions


Ratios, proportions, and percentages are static summary measures. They provide a cross-sectional view of the phenomena of interest at a specific time. The concept of rate is associated with the dynamics (change) of the phenomena of interest; generally it can be defined as a measure of change in one quantity (y) per unit of another quantity (x) on which the former (y) depends. Usually the x variable is time. It is important that the time unit always be specified when describing a rate associated with time. For instance, in demography the crude birth rate (CBR) is defined as

CBR = (B / P) x K

where B is the number of live births in a given calendar year, P is the mid-year population, and K is a constant, usually 1,000.

The concept of exposure to risk is also central to the definition of rate, which distinguishes rate from proportion. Simply stated, all elements or subjects in the denominator have to be at risk of becoming or producing the elements or subjects in the numerator. If we take a second look at the crude birth rate formula, we will note that the denominator is the mid-year population, and we know that not the entire population is subject to the risk of giving birth. Therefore, the operational definition does not strictly satisfy the exposure-to-risk criterion, which is why the rate is called "crude."
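The rate formula translates directly into code. The births and population figures below are invented for illustration.

```python
def rate(events, population_at_risk, k=1000):
    """Generic rate: events per k units of the population at risk."""
    return events * k / population_at_risk

# Crude birth rate: live births in a calendar year per 1,000 of the
# mid-year population (hypothetical numbers).
births = 14_000
mid_year_population = 1_000_000
cbr = rate(births, mid_year_population)
print(cbr)   # births per 1,000 population
```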

Six Sigma

The term six sigma represents a stringent level of quality. It is a specific defect rate: 3.4 defective parts per million (ppm). It was made known in the industry by Motorola, Inc., in the late 1980s when Motorola won the first Malcolm Baldrige National Quality Award (MBNQA). Six sigma has become an industry standard as an ultimate quality goal.

Sigma is the Greek symbol for standard deviation. As Figure 3.2 indicates, the areas under the curve of the normal distribution defined by standard deviations are constants in terms of percentages, regardless of the distribution parameters. The area under the curve as defined by plus and minus one standard deviation (sigma) from the mean is 68.26%. The area defined by plus/minus two standard deviations is 95.44%, and so forth. The area defined by plus/minus six sigma is 99.9999998%. The area outside the six sigma area is thus 100% - 99.9999998% = 0.0000002%.

Figure 3.3. Specification Limits, Centered Six Sigma, and Shifted (1.5 Sigma) Six

Sigma


If we take the area within the six sigma limit as the percentage of defect-free parts and the area outside the limit as the percentage of defective parts, we find that six sigma is equal to 2 defectives per billion parts, or 0.002 defective parts per million. The interpretation of defect rate as it relates to the normal distribution will be clearer if we include the specification limits in the discussion, as shown in the top panel of Figure 3.3. Given the specification limits (which were derived from customers' requirements), our purpose is to produce parts or products within the limits. Parts or products outside the specification limits do not conform to requirements.

If we can reduce the variations in the production process so that the six sigma (six standard deviations) variation of the production process is within the specification limits, then we will have a six sigma quality level.

The six sigma value of 0.002 ppm comes from the statistical normal distribution. It assumes that each execution of the production process will produce the exact distribution of parts or products centered with regard to the specification limits. In reality, however, process shifts and drifts always result from variations in process execution. The maximum process shift, as indicated by research (Harry, 1989), is 1.5 sigma. If we account for this 1.5-sigma shift in the production process, we get the value of 3.4 ppm. Such shifting is illustrated in the two lower panels of Figure 3.3. Given fixed specification limits, the distribution of the production process may shift to the left or to the right. When the shift is 1.5 sigma, the area outside the specification limit on one end is 3.4 ppm, and on the other it is nearly zero.

The six sigma definition accounting for the 1.5-sigma shift (3.4 ppm), proposed and used by Motorola (Harry, 1989), has become the industry standard for six sigma-level quality (versus the normal distribution's six sigma of 0.002 ppm). Furthermore, when the production distribution shifts 1.5 sigma, the intersection points of the normal curve and the specification limits become 4.5 sigma at one end and 7.5 sigma at the other. Since for all practical purposes the area outside 7.5 sigma is zero, the 3.4 ppm defect rate corresponds to the area beyond 4.5 sigma at one end of the distribution.
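Both the 0.002 ppm and 3.4 ppm figures can be reproduced from the normal distribution with only the standard library. This is a sketch: the 1.5-sigma shift is modeled by moving the process mean while holding the specification limits fixed, so the nearer limit sits 4.5 sigma from the shifted mean.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ppm_outside(sigma_level, shift=0.0):
    """Defective parts per million outside symmetric specification limits
    at +/- sigma_level, for a process mean shifted by `shift` sigma."""
    upper_tail = 1.0 - phi(sigma_level - shift)
    lower_tail = phi(-sigma_level - shift)
    return (upper_tail + lower_tail) * 1_000_000

print(ppm_outside(6.0))             # centered process: about 0.002 ppm
print(ppm_outside(6.0, shift=1.5))  # 1.5-sigma shift: about 3.4 ppm
```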

3.4 Reliability and Validity

Recall that concepts and definitions have to be operationally defined before measurements can be taken. Assuming operational definitions are derived and measurements are taken, the logical question to ask is, how good are the operational metrics and the measurement data? Do they really accomplish their task (measuring the concept that we want to measure) and do so with good quality? Of the many criteria of measurement quality, reliability and validity are the two most important.

Reliability

Reliability refers to the consistency of a number of measurements taken using the same measurement method on the same subject. If repeated measurements are highly consistent or even identical, then the measurement method or the operational definition has a high degree of reliability. If the variations among repeated measurements are large, then reliability is low. For example, if an operational definition of a body height measurement of children (e.g., between ages 3 and 12) includes specifications of the time of day to take measurements, the specific scale to use, who takes the measurements (e.g., trained pediatric nurses), whether the measurements should be taken barefooted, and so on, it is likely that reliable data will be obtained. If the operational definition is very vague in terms of these considerations, the data reliability may be low. Measurements taken in the early morning may be greater than those taken in the late afternoon because children's bodies tend to be more stretched after a good night's sleep and become somewhat compacted after a tiring day. Other factors that can contribute to the variations of the measurement data include different scales, trained or untrained personnel, with or without shoes on, and so on.


Figure 3.4. An Analogy to Validity and Reliability

From Practice of Social Research (Non-InfoTrac Version), 9th edition, by E. Babbie, © 2001 Thomson Learning. Reprinted with permission of Brooks/Cole, an imprint of the Wadsworth Group, a division of Thomson Learning (fax: 800-730-2215).

Note

Note that there is some tension between validity and reliability. For the data to be reliable, the measurement must be specifically defined. In such an endeavor, the risk of being unable to represent the theoretical concept validly may be high. On the other hand, for the definition to have good validity, it may be quite difficult to define the measurements precisely. For example, the measurement of church attendance may be quite reliable because it is specific and observable. However, it may not be valid to represent the concept of religiousness. On the other hand, to derive valid measurements of religiousness is quite difficult. In the real world of measurements and metrics, it is not uncommon for a certain tradeoff or balance to be made between validity and reliability.

Validity and reliability issues come to the fore when we try to use metrics and measurements to represent abstract theoretical constructs. In traditional quality engineering, where measurements are frequently physical and usually do not involve abstract concepts, the counterparts of validity and reliability are termed accuracy and precision (Juran and Gryna, 1970). Much confusion surrounds these two terms despite their having distinctly different meanings. If we want a much higher degree of precision in measurement (e.g., accuracy up to three digits after the decimal point when measuring height), then our chance of getting all measurements accurate may be reduced. In contrast, if accuracy is required only at the level of the integer inch (less precise), then it is a lot easier to meet the accuracy requirement.

3.5 Measurement Errors

In this section we discuss validity and reliability in the context of measurement error. There are two types of measurement error: systematic and random. Systematic measurement error is associated with validity; random error is associated with reliability. Let us revisit our example about the bathroom weight scale with an offset of 10 lb. Each time a person uses the scale, he will get a measurement that is 10 lb. more than his actual body weight, in addition to the slight variations among measurements. Therefore, the expected value of the measurements from the scale does not equal the true value because of the systematic deviation of 10 lb. In simple formula:

M = T + 10 + e

In a general case:

M = T + s + e

where M is the observed/measured score, T is the true score, s is systematic error, and e is random error.

The presence of s (systematic error) makes the measurement invalid. Now let us assume the measurement is valid and the s term is not in the equation. We have the following:

M = T + e

The equation still states that any observed score is not equal to the true score because of random disturbance, the random error e. These disturbances mean that on one measurement a person's score may be higher than his true score, and on another occasion the measurement may be lower than the true score. However, since the disturbances are random, the positive errors are just as likely to occur as the negative errors, and these errors are expected to cancel each other. In other words, the average of these errors in the long run, or the expected value of e, is zero: E(e) = 0. Furthermore, from statistical theory about random error, we can also assume the following:

The correlation between the true score and the error term is zero.

There is no serial correlation between the true score and the error term.

The correlation between errors on distinct measurements is zero.

From these assumptions, we find that the expected value of the observed scores is equal to the true score:

E(M) = E(T + e) = E(T) + E(e) = T

Under these assumptions, the reliability of a measurement is the ratio of the true-score variance to the observed-score variance, or equivalently one minus the ratio of the error variance to the observed-score variance. Therefore, the reliability of a metric varies between 0 and 1. In general, the larger the error variance relative to the variance of the observed score, the poorer the reliability. If all variance of the observed scores is a result of random errors, then the reliability is zero [1 - (1/1) = 0].
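The variance-ratio definition of reliability above can be sketched numerically. The true scores and errors below are invented, chosen so that the correlation between T and e is zero, as the model assumes.

```python
def variance(xs):
    """Population variance of a sample."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical true scores and random errors (errors average to zero
# and are uncorrelated with the true scores).
true_scores = [10, 20, 30, 40]
errors = [1, -1, -1, 1]
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability: 1 minus the share of observed variance due to error.
reliability = 1 - variance(errors) / variance(observed)
print(round(reliability, 3))   # close to 1: little of the variance is error
```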

3.5.1 Assessing Reliability

Thus far we have discussed the concept and meaning of validity and reliability and their interpretation in the context of measurement errors. Validity is associated with systematic error, and the only way to eliminate systematic error is through better understanding of the concept we try to measure, and through deductive logic and reasoning to derive better definitions. Reliability is associated with random error. To reduce random error, we need good operational definitions and, based on them, good execution of measurement operations and data collection. In this section, we discuss how to assess the reliability of empirical measurements.

There are several ways to assess the reliability of empirical measurements, including the test/retest method, the alternative-form method, the split-halves method, and the internal consistency method (Carmines and Zeller, 1979). Because our purpose is to illustrate how to use our understanding of reliability to interpret software metrics, rather than an in-depth statistical examination of the subject, we take the easiest method, the retest method. The retest method is simply taking a second measurement of the subjects some time after the first measurement is taken, and then computing the correlation between the first and the second measurements. For instance, to evaluate the reliability of a blood pressure machine, we would measure the blood pressures of a group of people and, after everyone has been measured, we would take another set of measurements. The second measurement could be taken one day later at the same time of day, or we could simply take two measurements at one time. Either way, each person will have two scores. For the sake of simplicity, let us confine ourselves to just one measurement, either the systolic or the diastolic score. We then calculate the correlation between the first and second scores, and the correlation coefficient is the reliability of the blood pressure machine. A schematic representation of the test/retest method for estimating reliability is shown in Figure 3.5.

Figure 3.5. Test/Retest Method for Estimating Reliability


The equations for the two tests can be represented as follows:

M1 = T + e1
M2 = T + e2

From the assumptions about the error terms, as we briefly stated before, it can be shown that the correlation between the two sets of scores equals the ratio of the true-score variance to the observed-score variance; this correlation is the reliability measure.

As an example in software metrics, let us assess the reliability of the reported number of defects found at design inspection. Assume that the inspection is formal; that is, an inspection meeting was held and the participants include the design owner, the inspection moderator, and the inspectors. At the meeting, each defect is acknowledged by the whole group and the record keeping is done by the moderator. The test/retest method may involve two record keepers and, at the end of the inspection, each turns in his recorded number of defects. If this method is applied to a series of inspections in a development organization, we will have two reports for each inspection over a sample of inspections. We then calculate the correlation between the two series of reported numbers, and from it we can estimate the reliability of the reported inspection defects.
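A sketch of that computation, with invented defect counts from two hypothetical record keepers over eight inspections:

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# Defects recorded at the same eight inspections by two record keepers.
keeper_1 = [12, 7, 20, 15, 3, 9, 14, 6]
keeper_2 = [11, 8, 19, 15, 4, 9, 13, 7]

# Test/retest estimate of the reliability of reported inspection defects.
reliability = pearson(keeper_1, keeper_2)
print(round(reliability, 3))
```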

3.5.2 Correction for Attenuation

One of the important uses of reliability assessment is to adjust or correct correlations for the attenuation that results from random errors in measurements. Correlation is perhaps one of the most important methods in software engineering and other disciplines for analyzing relationships between metrics. For us to substantiate or refute a hypothesis, we have to gather data for both the independent and the dependent variables and examine the correlation of the data. Let us revisit our hypothesis-testing example at the beginning of this chapter: the more effective the design reviews and the code inspections as scored by the inspection team, the lower the defect rate encountered at the later phase of formal machine testing.

As mentioned, we first need to operationally define the independent variable (inspection effectiveness) and the dependent variable (defect rate during formal machine testing). Then we gather data on a sample of components or projects and calculate the correlation between the independent variable and the dependent variable. However, because of random errors in the data, the resultant correlation often is lower than the true correlation. With estimates of the reliability of the variables of interest, we can adjust the observed correlation to get a more accurate picture of the relationship under consideration. In software development, we have observed that a key reason some theoretically sound hypotheses are not supported by actual project data is that the operational definitions of the metrics are poor and there is too much noise in the data. Given the observed correlation and the reliability estimates of the two variables, the formula for correction for attenuation (Carmines and Zeller, 1979) is as follows:

r(xt, yt) = r(xi, yi) / sqrt(rxx' × ryy')

where

r(xt, yt) is the correlation corrected for attenuation, in other words, the estimated true correlation

r(xi, yi) is the observed correlation, calculated from the observed data

rxx' is the estimated reliability of the X variable

ryy' is the estimated reliability of the Y variable

For example, if the observed correlation between two variables was 0.2 and the reliability estimates were 0.5 and 0.7, respectively, for X and Y, then the correlation corrected for attenuation would be

r(xt, yt) = 0.2 / sqrt(0.5 × 0.7) = 0.34

This means that the correlation between X and Y would be 0.34 if both were measured perfectly without error.
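The worked example translates directly into code:

```python
import math

def correct_for_attenuation(r_observed, rel_x, rel_y):
    """Estimated true correlation, given the observed correlation and
    the reliability estimates of the two variables."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Observed correlation 0.2; reliabilities 0.5 (X) and 0.7 (Y).
r_true = correct_for_attenuation(0.2, rel_x=0.5, rel_y=0.7)
print(round(r_true, 2))   # 0.34
```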

3.6 Be Careful with Correlation

Correlation is probably the most widely used statistical method to assess relationships among observational data (versus experimental data). However, caution must be exercised when using correlation; otherwise, the true relationship under investigation may be disguised or misrepresented. There are several points about correlation that one has to know before using it.

First, although there are special types of nonlinear correlation analysis available in the statistical literature, most of the time when one mentions correlation, it means linear correlation. Indeed, the most well-known Pearson correlation coefficient assumes a linear relationship. Therefore, if a correlation coefficient between two variables is weak, it simply means there is no linear relationship between the two variables. It doesn't mean there is no relationship of any kind.

Let us look at the five types of relationship shown in Figure 3.6. Panel A represents a positive linear relationship and panel B a negative linear relationship. Panel C shows a curvilinear convex relationship, and panel D a concave relationship. In panel E, a cyclical relationship (such as the Fourier series representing frequency waves) is shown. Because correlation assumes linear relationships, when the correlation coefficients (Pearson) for the five relationships are calculated, the results accurately show that panels A and B have significant correlation. However, the correlation coefficients for the other three relationships will be very weak or will show no relationship at all. For this reason, it is highly recommended that when we use correlation we always look at the scattergrams. If the scattergram shows a particular type of nonlinear relationship, then we need to pursue analyses or coefficients other than linear correlation.

Figure 3.6. Five Types of Relationship Between Two Variables


Second, if the data contain noise (due to unreliability in measurement) or if the range of the data points is large, the correlation coefficient (Pearson) will probably show no relationship. In such a situation, we recommend using a rank-order correlation method, such as Spearman's rank-order correlation. The Pearson correlation (the correlation we usually refer to) requires interval scale data, whereas rank-order correlation requires only ordinal data. If there is too much noise in the interval data, the Pearson correlation coefficient thus calculated will be greatly attenuated. As discussed in the last section, if we know the reliability of the variables involved, we can adjust the resultant correlation. However, if we have no knowledge about the reliability of the variables, rank-order correlation will be more likely to detect the underlying relationship. Specifically, if the noise in the data did not affect the original ordering of the data points, then rank-order correlation will be more successful in representing the true relationship. Since both Pearson's correlation and Spearman's rank-order correlation are covered in basic statistics textbooks and are available in most statistical software packages, we need not get into the calculation details here.

Third, the method of linear correlation (least-squares method) is very vulnerable to extreme values. If there are a few extreme outliers in the sample, the correlation coefficient may be seriously affected. For example, Figure 3.7 shows a moderately negative relationship between X and Y. However, because there are three extreme outliers at the northeast coordinates, the correlation coefficient will become positive. This outlier susceptibility reinforces the point that when correlation is used, one should also look at the scatter diagram of the data.
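Both points can be seen in a small sketch (illustrative only; the data sets are invented for the demonstration). Spearman's coefficient is simply the Pearson coefficient computed on the ranks of the data, so it fully recovers a perfectly monotonic but nonlinear relationship that Pearson understates; and a handful of northeast outliers is enough to flip the sign of an otherwise perfectly negative Pearson correlation, in the spirit of Figure 3.7.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Monotonic but highly nonlinear data: ordering of points is preserved exactly.
xs = list(range(1, 11))
ys = [math.exp(x) for x in xs]
print(pearson(xs, ys) < 0.9)              # True: Pearson is attenuated
print(round(spearman(xs, ys), 4))         # 1.0: rank order captures the relationship

# Outlier sensitivity: a clean negative trend plus three northeast outliers.
base_x = list(range(1, 11))
base_y = [20 - x for x in base_x]         # perfect negative relationship
out_x = base_x + [30, 32, 34]             # three extreme outliers
out_y = base_y + [40, 42, 44]
print(round(pearson(base_x, base_y), 2))  # -1.0
print(pearson(out_x, out_y) > 0)          # True: the sign has flipped
```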

3.7 Criteria for Causality

The isolation of cause and effect in controlled experiments is relatively easy. For example, a headache medicine was administered to a sample of subjects who were having headaches. A placebo was administered to another group with headaches (who were statistically not different from the first group). If after a certain time of taking the headache medicine and the placebo, the headaches of the first group were reduced or disappeared, while headaches persisted among the second group, then the curing effect of the headache medicine is clear.

For analysis with observational data, the task is much more difficult. Researchers (e.g., Babbie, 1986) have identified three criteria:

1. The first requirement in a causal relationship between two variables is that the cause precede the effect in time or as shown clearly in logic.

2. The second requirement in a causal relationship is that the two variables be empirically correlated with one another.

3. The third requirement for a causal relationship is that the observed empirical correlation between two variables be not the result of a spurious relationship.

The first and second requirements simply state that in addition to empirical correlation, the relationship has to be examined in terms of sequence of occurrence or deductive logic. Correlation is a statistical tool and could be misused without the guidance of a logic system. For instance, it is possible to correlate the outcome of a Super Bowl (National Football League versus American Football League) to some interesting artifacts such as fashion (length of skirt, popular color, and so forth) and weather. However, logic tells us that coincidence or spurious association cannot substantiate causation.

The third requirement is a difficult one. There are several types of spurious relationships, as Figure 3.8 shows, and sometimes it may be a formidable task to show that the observed correlation is not due to a spurious relationship. For this reason, it is much more difficult to prove causality in observational data than in experimental data. Nonetheless, examining for spurious relationships is necessary for scientific reasoning; as a result, findings from the data will be of higher quality.

Figure 3.8. Spurious Relationships

In Figure 3.8, case A is the typical spurious relationship between X and Y in which X and Y have a common cause Z. Case B is a case of the intervening variable, in which the real cause of Y is an intervening variable Z instead of X. In the strict sense, X is not a direct cause of Y. However, since X causes Z and Z in turn causes Y, one could claim causality if the sequence is not too indirect. Case C is similar to case A. However, instead of X and Y having a common cause as in case A, both X and Y are indicators (operational definitions) of the same concept C. It is logical that there is a correlation between them, but causality should not be inferred.

An example of the spurious causal relationship due to two indicators measuring the same concept is Halstead's (1977) software science formula for program length:

N = n1 log2 n1 + n2 log2 n2

where

N = estimated program length

n1 = number of unique operators

n2 = number of unique operands
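The estimate is straightforward to compute; a minimal Python sketch follows (the vocabulary counts n1 = 10 and n2 = 20 are made-up numbers for illustration, not from the text):

```python
import math

def halstead_length(n1, n2):
    """Halstead's estimated program length: N = n1*log2(n1) + n2*log2(n2)."""
    return n1 * math.log2(n1) + n2 * math.log2(n2)

# Hypothetical vocabulary: 10 unique operators, 20 unique operands.
print(round(halstead_length(10, 20), 1))   # 119.7
```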

Researchers have reported high correlations between actual program length (actual lines of code count) and the predicted length based on the formula, sometimes as high as 0.95 (Fitzsimmons and Love, 1978). However, as Card and Agresti (1987) show, both the formula and actual program length are functions of n1 and n2, so correlation exists by definition. In other words, both the formula and the actual lines of code counts are operational measurements of the concept of program length. One has to conduct an actual n1 and n2 count for the formula to work. However, n1 and n2 counts are not available until the program is complete or almost complete. Therefore, the relationship is not a cause-and-effect relationship and the usefulness of the formula's predictability is limited.