SQAT(software quality assurance and testing)

of 39/39
1 Unit-III 5.1. Establishing software quality requirements : The software quality metrics methodology is a systematic approach to establishing quality requirements and identifying, implementing, analyzing, and validating the process and product of software quality metrics for a software system. It spans the entire software life cycle and comprises five steps. These steps shall be applied iteratively because insights gained from applying a step may show the need for further evaluation of the results of prior steps. (a).Establish software quality requirements: .A list of quality factors is selected, prioritized, and quanti- fied at the outset of system development or system change. These requirements shall be used to guide and control the development of the system, to access the whether the system met the quality requirements specified in the contract. (b). identib software quality requirements: The software quality metrics framework is applied in the selection of relevant metrics. ©implementthesoftwarequalityrequiremnsts: Tools are either procured or developed, data is collected, and metrics are applied at each phase of the software life cycle. (d) Analyze software quality metrics results: The metrics results are analyzed and reported to help control the development and assess the final product. (e) Validatethequalitymetrics. Predictive metrics results are compared to the direct metrics results to determine whether the predictive metrics accurately measure their associated factors. The documentation or output produced as a result of these steps is shown in table I Table I-Outputs of metrics methodology steps Metric Methodology Step output Establish software quality requirements -Quality requirements Identify software quality metrics Approved quality metrics framework - Metrics set
  • date post

  • Category


  • view

  • download


Embed Size (px)


Unit 3

Transcript of SQAT(software quality assurance and testing)

  • 1


    5.1. Establishing software quality requirements:

    The software quality metrics methodology is a systematic approach to

    establishing quality requirements and identifying, implementing, analyzing, and

    validating the process and product of software quality metrics for a software system. It

    spans the entire software life cycle and comprises five steps.

    These steps shall be applied iteratively because insights gained from applying a

    step may show the need for further evaluation of the results of prior steps.

    (a).Establish software quality requirements: .A list of quality factors is selected,

    prioritized, and quanti- fied at the outset of system development or system change. These

    requirements shall be used to guide and control the development of the system, to access

    the whether the system met the quality requirements specified in the contract.

    (b). identib software quality requirements: The software quality metrics framework is

    applied in the selection of relevant metrics.

    implementthesoftwarequalityrequiremnsts: Tools are either procured or developed, data

    is collected, and metrics are applied at each phase of the software life cycle.

    (d) Analyze software quality metrics results: The metrics results are analyzed and

    reported to help control the development and assess the final product.

    (e) Validatethequalitymetrics. Predictive metrics results are compared to the direct

    metrics results to determine whether the predictive metrics accurately measure their

    associated factors.

    The documentation or output produced as a result of these steps is shown in table I

    Table I-Outputs of metrics methodology steps

    Metric Methodology Step output

    Establish software quality requirements -Quality requirements

    Identify software quality metrics Approved quality metrics framework

    - Metrics set

  • 2

    -Cost-benefit analysis

    Implement the software quality metrics - Description of data items

    - Metrieddata item

    -Traceability matrix

    -Training plan and schedule

    5.1 Establish software quality requirements :

    Quality requirements shall be represented in either of the following forms:

    Direct metric value: A numerical target for a factor to be met in the final product.

    For example, mean time to failure (MTTF) is a direct metric of final system reliability.

    This is an intermediate requirement that is an early indicator of final system

    performance. For exam- ple, design or code errors may be early predictors of final system


    5.1 .I Identify a list of possible quality requirements:

    Identify quality requirements that may be applicable to the software system. Use

    organizational experience, required standards, regulations, or laws to create this list.

    Annex A contains sample lists of factors and sub- factors. In addition, list other system

    requirements that may affect the feasibility of the quality requirements. Consider

    acquisition concerns, such as cost or schedule constraints, warranties, and organizational

    self- interest. Do not rule out mutually exclusive requirements at this point. Focus on

    factoddirect metric combi- nations instead of predictive metrics.

    All parties involved in the creation and use of the system shall participate in the

    quality requirements identi- fication process.

    5.1.2 Determine the actual list of quality requirements :

  • 3

    Rate each of the listed quality requirements by importance. Importance is a

    function of the system characteristics and the viewpoints of the people involved.

    To determine the actual list of the possible quality require-ments, follow the two

    steps below:

    Create the actual quality requirements; Resolve the results of the survey into a

    single list of quality requirements. This shall involve a technical feasibility analysis of the

    quality requirements. The proposed factors for this list may have cooperative or

    conflicting relationships. Conflicts between requirements shall be resolved at this point.

    In addition, if the choice of quality requirements is in conflict with cost, schedule, or

    system functionality, one or the other shall be altered. Care shall be exercised in choosing

    the desired list to ensure that the requirements are technically feasible reasonable,

    complementary, achievable, and verifiable. All involved parties shall agree to this final


    5.1.3 Quantify each factor:

    For each factor, assign one or more direct metrics to represent the factor, and

    assign direct metric values to serve as quantitative requirements for that factor. For

    example, if high efficiency was one of the quality requirements from the previous item,

    the direct metric actual resource utilizatiodallocated resource utilization with a value of

    90% could represent that factor. This direct metric value is used to verify the achieve-

    Ment of the quality requirement. Without it, there is no way to tell whether or not the

    delivered system meets its quality requirements.

    The-quantified list of quality requirements and their definitions again shall be

    approved by all involved parties.

    5.2 Identify software quality metrics:

  • 4

    When identifying software quality metrics, apply the software quality metrics

    framework, and perform a cost-benefit analysis. Then gain commitment to the metrics

    from all involved parties.

    5.2.1 Apply the software quality metrics framework:

    Create a chart of the quality requirements based on the hierarchical tree structure

    found in figure 1. At this point, only the factor level must be complete. Next decompose

    each factor into subfactors as indicated in clause 4. The decomposition into subfactors

    must continue for as many levels as needed until the subfactor level is complete.


    Using the software quality metrics framework, decompose the subfactors into

    measurable metrics. For each validated metric on the metric level, assign a target value

    and a critical value and range that should be achieved during development. The target

    values constitute additional quality requirements for the system.

    The framework and the target values for the metrics shall be reviewed and

    approved by all involved parties.

    To help ensure that metrics are used appropriately, only validated metrics (that is,

    either direct metrics or metrics validated with respect to direct metrics) shall be used to

    assess current and future product and process quality. Nonvalidated metrics may be

    included for future analysis, but shall not be included as part of the system requirements.

    Furthermore, the metrics that are used shall be those that are associated with the quality

    requirements of the software project. However, given that the above conditions are


    The selection of specific metrics as candidates for validation and the selection of

    specific vali dated metrics for application is at the discretion of the user of this standard.

    Examples of metrics and experi ences with their use are given in annex B.

  • 5

    Document each metric using the format shown in table 2.

    Table 2-Metrics set

    Item Description

    Name Name given to the metric.

    Impact Indication of whether a metric may be

    Used to alter.

    Target value Numerical value of the metric that is to

    be achieved in order to meet quality.

    Factors Factors that are related to this metric.

    5.2.2 Perform a cost-benefit analysis:

    Perform a cost-benefit analysis by identifying the costs of implementing the

    metrics, identifying the benefits of applying the metrics, and then applying the metrics

    set. Identify the costs of implementing the metrics:

    Identify document(see 2.2) all the costs associated with the metrics in the metrics set.

    For each met ric, estimate and document the following impacts and costs.

    Metric utilization cost: The costs of collecting data; automating the metric value

    calculation (when possible); and analyzing, interpreting, and reporting the results.

    Software development cost change: The set of metrics may imply a change in the

    organizational structure used to produce the software system.

    Special equipment: The hardware or software tools may have to be located, purchased,

    adapted, or developed to implement the metrics.

    Training: The quality assurancekontrol organization or the entire development team may

  • 6

    need train ing in the use of the metrics and data collection procedures.

    If the introduction of metrics has caused changes in the development process, the

    development team may need to be educated about the changes . Identify the benefits of applying the metrics :

    Identify and document the benefits that are associated with each metric in the

    metrics set .

    Some benefits to consider include the following:

    -Identify quality goals and increase awareness of the goals in the software organization.

    - Provide timely feedback useful in developing higher quality software

    - Increase customer satisfaction by quantifying the quality of the software before it is

    delivered to the customer.

    -provide a a quantitative basis for making decisions about software quality Adjust the metrics set :

    Weigh the benefits, tangible and intangible, against the costs of each metric. If the

    costs exceed the benefits of a given metric, alter or delete it from the metrics set. On the

    other hand, for metrics that remain, make plans for any necessary changes to the software

    development process, organizational structure, tools, and training. In most cases it will

    not be feasible to quantify benefits. In these cases judgment shall be exercised in

    weighing qualitative benefits against quantitative costs.

    5.2.3 Gain commitment to the metrics set :

    All involved parties shall review the revised metrics set to which the cost-benefit

  • 7

    analysis has been added. The metrics set shall be formally adopted and supported by this


    5.3 Implement the software quality metrics :

    To implement the software quality metrics, define the data collection procedures,

    use selected software to prototype the measurement process, then collect the data and

    compute the metrics values.

    5.3.1 Define the data collection procedures:

    For each metric in the metric set, determine the data that will be collected and

    determine the assumptions that will be made about the data (for example, random sample,

    subjective or objective measure). The flow of data shall be shown from point of

    collection to evaluation of metrics. Describe or reference when and how tools are to be

    used and data storage procedures. Also, identify candidate tools. Select tools for use with

    the prototyping process. A traceability matrix shall be established between metrics and

    data items.

    Identify the organizational entities that will directly participate in data collection

    including those responsible for monitoring data collection. Describe the training and

    experience required for data collection and the training process for personnel involved.

    Describe each data item thoroughly, using the format shown in table 3

    Table 3-Description of a data item

    Item I Description

    Name Name given to the data item

    Metries Metrics that are associated with the data item.

    Definition Straightforward description of the data item.

    Source Location of where the data originates.

    Collector Entity responsible for collecting the data.

    Timing Time(s) in life cycle at which the data is to be

    collected. (Some dataitems , are collected more than


    Procedures Methodology (for example. automated or manual)

  • 8

    used to collect the data.

    Representation Manner in which the data is represented, for

    example. its precision format.

    Sample Method used to select the data to be collected and

    the percentage of the available datathat is to be


    Alternatives methods that may be used to collected the data other

    than method.

    5.3.2 Prototype the measurement process :

    Test the data collection and metric computation procedures on selected software

    that will act a a prototype. If possible, the samples selected should be similar to the

    project(s) on which the metrics will later be used. analysis shall be made to determine if

    the data is collected uniformly and if instructions have always been interpreted in the

    same manner. In particular, data requiring subjective judgments shall be checked to

    determine if the descriptions and instructions are clear enough to ensure uniform results.

    In addition, the cost of the measurement process for the prototype shall be

    examined to verify or improve the cost analysis.

    5.3.3 Collect the data and compute the metrics values:

    Using the formats in table 2 and table 3, collect and store data at the appropriate

    time in the life cycle. The data shall be checked for accuracy and proper unit of measure.

    Data collection shall be monitored. If a sample of data is used, requirements such

    as randomness, minimum size of sample, and homogeneous sampling shall be verified.

    If more than one person is collecting the data, it shall be checked for uniformity.

    Compute the metrics values from the collected data.

    5.4 Analyze the software metrics results:

  • 9

    Analyzing the results of the software metrics includes interpreting the results,

    identifying software quality, making software quality predictions, and ensuring

    compliance with requirements.

    5.4.1 Interpret the results:

    The results shall be interpreted and recorded against the broad context of the

    project as well as for a particu lar product or process of the project. The differences

    between the collected metric data and the target values for the metrics shall be analyzed

    against the quality requirements. Substantive differences shall be investi gated.

    5.4.2 Identify software quality:

    Quality metric values for software components shall be determined and reviewed.

    Quality metric values that are outside the anticipated tolerance intervals (low or

    unexpected high quality) shall be identified for further study. Unacceptable quality may

    be manifested as excessive complexity, inadequate documentation, lack of traceability, or

    other undesirable attributes.

    The existence of such conditions is an indication that the soft ware may not satisfy

    quality requirements when it becomes operational. Since many of the direct metrics that

    are usually of interest cannot be measured during software development (for example,

    reliability met rics), validated metrics shall be used when direct metrics are not available.

    Direct or validated metrics shall be measured for software components and process steps.

    The measurements shall be compared with critical values of the metrics.

    Software components whose measurements deviate from the critical values shall

    be analyzed in detail.

    Unexpected high quality metric values shall cause a review of the software

    development process as the expected tolerance levels may need to be modified or the

    process for identifying quality metric values may need to be improved.

    5.4.3 Make software quality predictions:

  • 10

    During development validated metrics shall be used to make predictions of direct

    metric values. Predicted values of direct metrics shall be compared with target values to

    determine whether to flag software compo nents for detailed analysis. Predictions shall be

    made for software components and process steps. Software components and process steps

    whose predicted direct metric values deviate from the target values shall be analyzed in


    Potentially, prediction is very valuable because it estimates the metric of ultimate

    interest-the direct metric. However, prediction is difficult because it involves using

    validated metrics from an earlier phase of the life cycle (for example, development) to

    make a prediction about a different but related metric (direct metric) in a much later

    phase (for example, operations).

    5.4.4 Ensure compliance with requirements:

    Direct rnetrics shall be used to ensure compliance of software products with

    quality requirements during system and acceptance testing. Direct metrics shall be

    measured for software components and process steps. These values shall be compared

    with target values of the direct metrics that represent quality require ments. Software

    components and process steps whose measurements deviate from the target values are

    non compliant

    5.5 Validate the software quality metrics:

    For infuriation on the statistical techniques used. See [B 1141. [B 115], (B 11 71,'

    or similar references.

    5.5.1 Purpose of metrics validation:

    The purpose of metrics validation is to identify both product and process metrjcs

    that can predict specified quality factor values, which are quantitative representations of

    quality requirements. If metrics are to be us ful, they shall indicate accurately whether

  • 11

    quality requirements have been achieved or are likely to be achieved in the future. When

    it is possible to measure factor values at the desired point in the life cycle, these direct

    metrics are used to evaluate software quality. At some points in the life cycle, certain

    quality factor values (for example, reliability) are not available. They are obtained after

    delivery or late in the project. In these cases, other metrics are used early on in a project

    to predict quality factor values.

    Although quality subfactors are useful when identifying and establishing factors

    and metrics, they need not be used in metrics validation. because the focus in validation

    is on determining whether a statistically signi icant relationship exists between predictive

    metric values and factor values.

    Quality factors may be affected by multiple variables. A single metric. therefore,

    may not sufficiently repre sent any one factor if it ignores these other variables.

    5.5.2 Validity criteria:

    To be considered valid, a predictive metric shall demonstrate a high degree of

    association with the quality factors it represents. This is equivalent to accurately

    portraying the quality condition(s) of a product or process. A metric may be valid with

    respect to certain validity criteria and invalid with respect to other criteria.

    The person in the organization who understands the consequence of the values selected

    shall designate threshold values for the following:

    V - square of the linear correlation coefficient

    B-rank correlations coefficient

    A-prediction error

    P-success rate

    A short numerical example follows the definition of each validity criterion.

    Detailed examples of the appli cation of metrics validation are contained in annex.

    Correlations: The variation in the quality factor values explained by the variation in the

    metric val ues, which is given by the square of the linear correlation coefficient between

    the metric and the corresponding factor, shall exceed where R2>v

    This criterion assesses whether there is a sufficiently strong linear association between a

  • 12

    factor and a metric to warrant using the metric as a substitute for the factor, when it is

    infeasible to use the latter. For example, the correlation coefficient between a complexity

    metric and the factor reliability may be 0.8. The square of this is 0.64. Only 64% of the

    variation in the factor is explained by the varia tion in the metric. If V has been

    established as 0.7, the conclusion would be drawn that the metric is invalid.

    Trucking: If a metric M is directly related to a quality factor F,for a given productor

    process, then a change in aquality factor value from FT1to FT2 at times T1 and T2, shall

    be accompanied by a change in metric value from MT1 to MT2, which is the same

    direction .

    For example, if a complexity metric is claimed to be a measure of reliability, then

    it is reasonable to expect a change in the reliability of a software component to be

    accompanied by an appropriate change in metric value (for instance, if the product

    increases in reliability, the metric value should also change in a direction that indicates

    the product has improved).

    That is, if Mean Time to Failure(MTTF) is used to measure reliability and is

    equal to 1000 hours during testing (TI) and 1500 hours during operation (T2), a

    complexity metric whose value is 8 in TI and 6 in T2, where 6 less complex than 8, is

    said to track reliability for this software component. If this relationship is demonstrated

    over a representative sample of software components, the conclusion could be drawn that

    the metric can track reliability (that is, indicate changes in product reliability) over the

    software life cycle.

    Consistency:if factor values f1,f2,.fn corresponding to products or processes

    relationship FI > F2> . . ,, Fn, then the corresponding metric values shall have the

    relationship MI M2 > . . . , Mn. To perform this test, compute the coefficient of rank

    correlation(r) between paired values (from the same software components) of the factor

    and the metric;

    Id shall exceed This criterion assesses whether there is consistency between the

    ranks of the factor values of a set of software components and the ranks of the metric

    values for the same set of software components

    Predictability: a metric is used at time T1 to predict a quality factor for a given product

    or processit shall predict a related quality factorFP~2with accurancy of where

  • 13

    FAT2is the actual value Fat atimeT2 This criterion assesses whether a metric is

    capable of predicting a factor value with the required accuracy. For example, if a

    complexity metric is used during development to predict the reliability of a soft ware

    component during operation (T2) to be 1200 hours MTTF (FpT2) and the actual MTTF

    that is measured during operation is 1000 hours FuT2), then the error in prediction is 200

    hours, or 20%.

    If the acceptable prediction error (A) is 25%. prediction accuracy is acceptable. If

    the ability to predict is demonstrated over a representative sample of software

    components, the conclusion could be drawn that the metric can be used as a predictor of


    Discriminative power: A metric shall be able to discriminate between high quality

    software compo nents (for example, high MTTF) and low quality software components

    (for example, low MTTF). For instance, the set of metric values associated with the

    former should be significantly higher (or lower) than those associated with the latter.

    The Mann Whitney Test and the Chi-square test for differences in probabilities

    (Contingency Tables) can be used for this validation test.

    Reliability: A metric shall demonstrate the above correlation. tracking, consistency,

    predictability, and discriminative power properties forP percent of the applications of the

    metric. This criterion is used to ensure that a metric has passed a validity test over a

    sufficient number or percentage of applications so that there will be confidence that the

    metric can perform its intended function consistently.

    5.5.3 Validation procedure:

    Metrics validation shall include identifying the quality factors sample, identifying

    the metrics sample, per forming a statistical analysis, and documenting the results. Identify the quality factors sample:

    These factor values (for example, measurements of reliability), which represent

    the quality requirements of a project, were previously identified (see 5.1.3) and collected

    and stored (see 5.3.3). For validation purposes, draw a sample from the metrics database.

  • 14 Identify the metrics sample:

    These metrics (for example, design structure) are used to predict or to represent

    quality factor values, when the factor values cannot be measured. The metrics were

    previously identified (see 5.2.1) and their data collected and stored, and values computed

    from the collected data (see 5.3.3). For validation purposes, draw a sample from the same

    domain (for example, same software components) of the metrics database as used in Perform a statistical analysis :

    The tests described in 5.5.2 shall be performed.

    Before a metric is used to evaluate the quality of a product or process, it shall be

    validated against the criteria described in 5.5.2. If a metric does not pass all of the

    validity tests, it shall only be used according to the criteria prescribed by those tests (for

    example, if it only passes the tracking validity test, it shall be used only for tracking the

    quality of a product or process). Document the results:

    Documenting results shall include the direct metric, predictive metric, validation

    criteria, and numerical results, as a minimum.

    5.5.4 Additional requirements:

    Additional requirements, those being the need for revalidation, confidence in

    analysis results, and the stability of environment, are described in the following

    subclauses. The need for revalidation:

    It is important to revalidate a predictive metric before it is used for another

    environment or application. AS the software engineering process changes, the validity of

    metrics changes. Cumulative metric validation val ues may be misleading because a

  • 15

    metric that has been valid for several uses may become invalid. It is wise to compare the

    one-time validation of a metric with its validation history to avoid being misled.

    The following statements of caution should be noted:

    *A validated metric may not necessarily be valid in other environments or future


    *A metric that has been invalidated may be valid in other environments or future

    applications. Confidence in analysis results:

    Metrics validation is a continuous process. Confidence in metrics is increased

    over time as metrics are vali dated on a variety of projects and as the metrics database

    and sample size increases. Confidence is not a static, one-time property. If a metric is

    valid, confidence will increase with increased use (that is, the correlation coefficient will

    be significant at decreasing values of the significance level). Greatest confidence occurs

    when metrics have been tentatively validated based on data collected from previous

    projects. Even when this is the case, the validation analysis will continue into future

    prqjects as the metrics database and sample size grow. Stability of environment:

    Practicable metrics validation shall be undertaken in a stable development

    environment (that is, where the design language, implementation language, or program

    development tools do not change over the life of the project in which validation is

    performed). In addition, there shall be at least one project in which metrics data have

    been collected and validated prior to application of the predictive metrics.

    This project shall be similar to the one in which the metrics are applied with

    respect to software engineering skills, application, size, and software engineering



  • 16

    A software quality indicator (SQI) is a variable whose value can be determined

    through direct analysis of product or process characteristics and whose evidential

    relationship to one ormoresoftware engineering attributes is


    A Software Quality Indicator can be calculated to provide an

    indication of the quality of the system by assessing system



    Assemble a quality indicator from factors that can be determined

    automatically with commercial or custom code scanners, such as

    the following:

    cyclomatic complexity of code (e.g., McCabe's

    Complexity Measure),

    unused/unreferenced code segments (these should

    be eliminated over time),

    average number of application calls per module

    (complexity is directly proportional to the number of calls),

    size of compilation units (reasonably sized units

    have approximately 20 functions (or paragraphs), or about

    2000 lines of code; these guidelines will vary greatly by


    use of structured programming constructs (e.g.,

    elimination of GOTOs, and circular procedure calls).

    These measures apply to traditional 3GL environments and are

    more difficult to determine in environments which are using

    object-oriented languages, 4GLs, or code generators.

    Tips and Hints

    With existing software, the Software Quality Indicator could also

    include a measure of the reliability of the code. This can be

  • 17

    determined by keeping a record of how many times each module

    has to be fixed in a given time period.

    There are other factors which contribute to the quality of a system

    such as:

    procedure re-use,

    clarity of code and documentation,

    consistency in the application of naming


    adherence to standards,

    consistency between documentation and code,

    the presence of current unit test plans.

    These factors are harder to determine automatically. However,

    with the introduction of CASE tools and reverse-engineering tools,

    and as more of the design and documentation of a system is

    maintained in structured repositories, these measures of quality

    will be easier to determine, and could be added to the indicator.

    Fundamentals in Measurement Theory:

    To discusses the fundamentals of measurement theory. We outline the

    relationshipsamong theoretical concepts, definitions, and measurement, and describe

    some basic measures that are used frequently. It is important to distinguish the levels of

    the conceptualization proces s, from abstract concepts, to definitions that are used

    operationally, to actual measurements. Depending on the concept and the operational

    definition derived from it, different levels of measurement may be applied: nominal scale,

    ordinal scale, interval scale, and ratio scale.

    It is also beneficial to spell out the explicit differences among some basic

    measures such as ratio, proportion, percentage, and rate. Significant amounts of wasted

    effort and resources can be avoided if these fundamental measurements are well


    We then focus on measurement quality. We discuss the most important issues in

  • 18

    measurement quality, namely, reliability and validity, and their relationships with

    measurement errors. We then discuss the role of correlation in observational studies and

    the criteria for causality.

    3.1 Definition, Operational Definition, and Measurement:

    It is undisputed that measurement is crucial to the progress of all sciences. Scientific

    progress is made through observations and generalizations based on data and

    measurements, the derivation of theories as a result, and in turn the confirmation or

    refutation of theories via hypothesis testing based on further empirical data. As an

    example, consider the proposition "the more rigorously the front end of the software

    development process is executed, the better the quality at the back end."

    To confirm or refute this proposition, we first need to define the key concepts. For

    example, we define "the software development process" and distinguish the process steps

    and activities of the front end from those of the back end. Assume that after the

    requirements-gathering process, our development process consists of the following



    Design reviews and inspections


    Code inspection

    Debug and development tests

    Integration of components and modules to form the product

    Formal machine testing

    Early customer program

    Integration is the development phase during which various parts and components are

    integrated to form one complete software product. Usually after integration the product is

    under formal change control. Specifically, after integration every change of the software

    must have a specific reason(e.g., to fix a bug uncovered during testing) and must be

    documented and tracked. Therefore, we may want to use integration as the cutoff point:

    The design, coding, debugging, and integration phases are classified as the front end of

  • 19

    the development process and the formal machine testing and early customer trials

    constitute the back end.

    We need to specify the indicator(s) of the definition and to make it (them)

    operational. For example, suppose the process documentation says all designs and code

    should be inspected. One operational definition of rigorous implementation may be

    inspection coverage expressed in terms of the percentage of the estimated lines of code

    (LOC) or of the function points (FP) that are actually inspected. Another indicator of

    good reviews and inspections could be the scoring of each inspection by the inspectors at

    the end of the inspection, based on a set of criteria.

    We may want to operationally use a five-point Likert scale to denote the degree of

    effectiveness (e.g., 5 = very effective, 4 = effective, 3 = somewhat effective, 2 = not

    effective, 1 = poor inspection). There may also be other indicators.

    In addition to design, design reviews, code implementation, and code inspections,

    development testing is part of our definition of the front end of the development process.

    We also need to we need to operationally define "quality at the back end" and decide

    which measurement indicators to use.

    For the sake of simplicity let us use defects found per KLOC (or defects per function

    point) during formal machine testing as the indicator of back-end quality. From these

    metrics, we can formulate several testable hypotheses such as the following:

    For software projects, the higher the percentage of the designs and code that

    areinspected, the lower the defect rate at the later phase of formal machine testing.

    The more effective the design reviews and the code inspections as scored by the

    inspection team, the lower the defect rate at the later phase of formal machine


    The more thorough the development testing (in terms of test coverage) before

    integration,the lower the defect rate at the formal machine testing phase.

    With the hypotheses formulated, we can set out to gather data and test the hypotheses.

    We also need to determine the unit of analysis for our measurement and data. In this case,

    it could be at the project level or at the component level of a large project. If we are able

    to collect a number of data points that form a reasonable sample size (e.g., 35 projects or

  • 20

    components), we can perform statistical analysis to test the hypotheses

    The building blocks of theory are concepts and definitions. In a theoretical

    definition a concept is defined in terms of other concepts that are already well

    understood. In the deductive logic system, certain concepts would be taken as undefined;

    they are the primitives. All other concepts would be defined in terms of the primitive

    concepts. For example, the concepts of point and line may be used as undefined and the

    concepts of triangle or rectangle can then be defined based on these primitives.

    3.2 Level of Measurement:

    We have seen that from theory to empirical hypothesis and from theoretically defined

    concepts to operational definitions.

    The process is by no means direct. As the example illustrates, when we

    operationalize a definition and derive measurement indicators, we must consider the scale

    of measurement. For instance, to measure the quality of software inspection we may use a

    five-point scale to score the inspection effectiveness or we may use percentage to indicate

    the inspection coverage. For some cases, more than one measurement scale is applicable;

    for others, the nature of the concept and the resultant operational definition can be

  • 21

    measured only with a certain scale. In this section, we briefly discuss the four levels of

    measurement: nominal scale, ordinal scale, interval scale, and ratio scale.

    Nominal Scale

    The most simple operation in science and the lowest level of measurement is

    classification. In classifying we attempt to sort elements into categories with respect to a

    certain attribute. For example, if the attribute of interest is religion, we may classify the

    subjects of the study into Catholics, Protestants, Jews, Buddhists, and so on. If we

    classify software products by the development process models through which the

    products were developed, then we may have categories such as waterfall development

    process, spiral development process, iterative development process, object-oriented

    programming process, and others.

    In a nominal scale, the names of the categories and their sequence bear no

    assumptions about relationships among categories. For instance, we place the waterfall

    development process in front of spiral development process, but we do not imply that one

    is "better than" or "greater than" the other.

    Ordinal Scale

    Ordinal scale refers to the measurement operations through which the subjects can

    be compared in order. For example, we may classify families according to socio-

    economic status: upper class, middle class, and lower class. We may classify software

    development projects according to the SEI maturity levels or according to a process rigor


    The ordinal measurement scale is at a higher level than the nominal scale in the

    measurement hierarchy. Through it we are able not only to group subjects into categories,

    but also to order the categories.

    An ordinal scale is asymmetric in the sense that if A > B is true then B > A is

    false. It has the transitivity property in that if A > B and B > C, then A > C.

    Therefore, when we translate order relations into mathematical operations, we

    cannot use operations such as addition, subtraction, multiplication, and division. We can

    use "greater than" and "less than." However, in real-world application for some specific

    types of ordinal scales (such as the Likert five-point, seven-point, or ten-point scales), the

  • 22

    assumption of equal distance is often made and operations such as averaging are applied

    to these scales. In such cases, we should be aware that the measurement assumption is

    deviated, and then use extreme caution when interpreting the results of data analysis.

    Interval and Ratio Scales:

    An interval scale indicates the exact differences between measurement points.

    The mathematical operations of addition and subtraction can be applied to interval scale

    data. For instance, assuming products A, B, and C are developed in the same language, if

    the defect rate of software product A is5 defects per KLOC and product B's rate is 3.5

    defects per KLOC, then we can say product A's defect level is 1.5 defects per KLOC

    higher than product B's defect level. An interval scale of measurement requires a well-

    defined unit of measurement that can be agreed on as a common standard and that is


    Given a unit of measurement, it is possible to say that the difference between two

    scores is 15 units or that one difference is the same as a second. Assuming product C's

    defect rate is 2 defects per KLOC, we can thus say the difference in defect rate between

    products A and B is the same as that between B and C.

    For interval and ratio scales, the measurement can be expressed in both integer and

    noninteger data. Integer data are usually given in terms of frequency counts (e.g., the

    number of defects customers will encounter for a software product over a specified time


    3.3 Some Basic Measures

    Regardless of the measurement scale, when the data are gathered we need to

    analyze them toextract meaningful information. Various measures and statistics are

    available for summarizing theraw data and for making comparisons across groups. In this

    section we discuss some basic measures such as ratio, proportion, percentage, and rate,

    which are frequently used in our dailylives as well as in various activities associated with

    software development and software quality. These basic measures, while seemingly easy,

    are often misused. There are also numerous

    Sophisticated statistical techniques and methodologies that can be employed in data


  • 23

    However, such topics are not within the scope of this discussion.


    A ratio results from dividing one quantity by another. The numerator and

    denominator are from twodistinct populations and are mutually exclusive. For example,

    in demography, sex ratio is defined asIf the ratio is less than 100, there are more females

    than males; otherwise there are more malesthan females.

    Ratios are also used in software metrics. The most often used, perhaps, is the ratio

    of number ofpeople in an independent test organization to the number of those in the

    development group. Thetest/development head-count ratio could range from 1:1 to 1:10

    depending on the managementapproach to the software development process. For the

    large-ratio (e.g., 1:10) organizations, thedevelopment group usually is responsible for the

    complete development (including extensivedevelopment tests) of the product, and the test

    group conducts system-level testing in terms ofcustomer environment verifications. For

    the small-ratio organizations, the independent group takesthe major responsibility for

    testing (after debugging and code integration) and quality assurance.


    Proportion is different from ratio in that the numerator in a proportion is a part of the


    Proportion also differs from ratio in that ratio is best used for two groups,

    whereas proportion isused for multiple categories (or populations) of one group. In other

    words, the denominator in thepreceding formula can be more than just a + b. If

    then we have

    When the numerator and the denominator are integers and represent counts of

    certain events,then p is also referred to as a relative frequency. For example, the

    following gives the proportion ofsatisfied customers of the total customer set:

  • 24

    The numerator and the denominator in a proportion need not be integers. They can be

    frequencycounts as well as measurement units on a continuous scale (e.g., height in

    inches, weight in pounds). When the measurement unit is not integer, proportions

    arecalled fractions.


    A proportion or a fraction becomes a percentage when it is expressed in terms of

    per hundred units(the denominator is normalized to 100). The word percent means per

    hundred. A proportion p istherefore equal to 100p percent (100p%).Percentages are

    frequently used to report results, and as such are frequently misused. First,because

    percentages represent relative frequencies, it isimportant that enough

    contextualinformation be given, especially the total number of cases, so that the readers

    can interpret theinformation correctly. Jones (1992) observes that many reports and

    presentations in the softwareindustry are careless in using percentages and ratios. He cites

    the example:Requirements bugs were 15% of the total, design bugs were 25% of the

    total,coding bugs were50% of the total, and other bugs made up 10% of the total.Had the

    results been stated as follows, it would have been much more informative:The project

    consists of 8 thousand linesof code (KLOC). During its development a total of 200defects

    were detected and removed, giving adefect removal rate of 25 defects per KLOC. Of

    the200 defects, requirements bugs constituted 15%, design bugs 25%, coding bugs 50%,

    and otherbugs made up 10%.A second important rule of thumb is that the total number of

    cases must be sufficiently largeenough to use percentages. Percentages computed from a

    small total are not stable;

    they alsoconvey an impression that a large number of cases are involved. Some

    writers recommend that theminimum number of cases for which percentages should be

    calculated is 50. We recommend that,depending on the number of categories, the

    minimum number be 30, the smallest sample sizerequired for parametric statistics. If the

    number of cases is too small, then absolute numbers,instead of percentages, should be

    used. For instance,Of the total 20 defects for the entire project of 2 KLOC, there were 3

    requirements bugs, 5 designbugs, 10 coding bugs, and 2 others.When results in

  • 25

    percentages appear in table format, usually both the percentages and actualnumbers are

    shown when there is only one variable. When there are more than two groups, suchas the

    example in Table 3.1, it is better just to show the percentages and the total number of

    cases (N) for each group. With percentages and N known, one

    canalwaysreconstructthefrequency distributions. The total of 100.0% should always be

    shown so that it is clear how thepercentages are computed. In a two-way table, the

    direction in which the percentages arecomputed depends on the purpose of the

    comparison. For instance, the percentages inTable3.1arecomputedvertically (the total of

    each column is 100.0%), and the purpose is to compare thedefect-type profile across

    projects (e.g., project B proportionally has more requirements defectsthan project A).In

    Table 3.2, the percentages are computed horizontally. The purpose here is to compare

    thedistribution of defects across projects for each type of defect. The inter-pretations of

    the two tablesdiffer.

    Therefore, it is important to carefully examine percentage tables to determine exactly


    the percentages are calculated.

    Table 3.1. Percentage Distributions

  • 26

    Ratios, proportions, and percentages are static summary measures. They provide a

    cross-sectionalview of the phenomena of interest at a specific time. The concept of rate is

    associated with thedynamics (change) of the phenomena of interest; generally it can be

    defined as a measure ofchange in one quantity (y) per unit of another quantity (x) on

    which the former (y) depends. Usuallythe x variable is time. It is important that the time

    unit always be specified when describing a rateassociated with time. For instance, in

    demography the crude birth rate (CBR) is defined as:where B is the number of live births

    in a given calendar year, P is the mid-year population, and K isa constant, usually 1,000.

    The concept of exposure to risk is also central to the definition of rate, which

    distinguishes rate fromproportion. Simply stated, all elements or subjects in the

    denominator have to be at risk ofbecoming or producing the elements or subjects in the

    numerator. If we take a second look at thecrude birth rate formula, we will note that the

    denominator is mid-year population and we knowthat not the entire population is subject

    to the risk of giving birth. Therefore, the operational


    where B is the number of live births in a given calendar year, P is the mid-year

    population, and K isa constant, usually 1,000.The concept of exposure to risk is also

    central to the definition of rate, which distinguishes rate fromproportion. Simply stated,

    all elements or subjects in the denominator have to be at risk ofbecoming or producing

    the elements or subjects in the numerator.

  • 27

    If we take a second look at thecrude birth rate formula, we will note that the denominator

    is mid-year population and we knowthat not the entire population is subject to the risk of

    giving birth. Therefore, the operational

    Six Sigma

    The term six sigma represents a stringent level of quality. It is a specific defect

    rate: 3.4 defectiveparts per million (ppm). It was made known in the industry by

    Motorola, Inc., in the late 1980swhen Motorola won the first Malcolm Baldrige National

    Quality Award (MBNQA). Six sigma hasbecome an industry standard as an ultimate

    quality goal.

    Sigma ( Figure 3.2 indicates, the areas

    under thecurve of normal distribution defined by standard deviations are constants in

    terms of percentages,regardless of the distribution parameters. The area under the curve

    as defined by plus and minusone standard deviation (sigma) from the mean is 68.26%.

    The area defined by plus/minus twostandard deviations is 95.44%, and so forth. The area

    defined by plus/minus six sigma is99.9999998%. The area outside the six sigma area is

    thus 100% -99.9999998% = 0.0000002%.

    Figure 3.3. Specification Limits, Centered Six Sigma, and Shifted (1.5 Sigma) Six



    If we take the area within the six sigma limit as the percentage of defect-free parts

    and the areaoutside the limit as the percentage of defective parts, we find that six sigma is

    equal to 2 defectivesper billion parts or 0.002 defective parts per million. The

    interpretation of defect rate as it relates tothe normal distribution will be clearer if we

    include the specification limits in the discussion, asshown in the top panel of Figure 3.3.

    Given the specification limits (which were derived fromcustomers' requirements), our

  • 28

    purpose is to produce parts or products within the limits. Parts orproducts outside the

    specification limits do not conform to requirements.

    If we can reduce thevariations in the production process so that the six sigma

    (standard deviations) variation of theproduction process is within the specification limits,

    then we will havesixsigma quality level.

    The six sigma value of 0.002 ppm is from the statistical normal distribution. It

    assumes that eachexecution of the production process will produce the exact distribution

    of parts or productscentered with regard to the specification limits. In reality, however,

    process shifts and drifts alwaysresult from variations in process execution. The maximum

    process shifts as indicated by research(Harry, 1989) is 1.5 sigma. If we account for this

    1.5-sigma shift in the production process, we willget the value of 3.4 ppm. Such shifting

    is illustrated in the two lower panels of Figure 3.3. Givenfixed specification limits, the

    distribution of the production process may shift to the left or to theright. When the shift is

    1.5 sigma, the area outside the specification limit on one end is 3.4 ppm,and on the other

    it is nearly zero.

    The six sigma definition accounting for the 1.5-sigma shift (3.4 ppm) proposed

    and used byMotorola (Harry, 1989) has become the industry standard in terms of six

  • 29

    sigmalevel quality (versusthe normal distribution's six sigma of 0.002 ppm).

    Furthermore, when the production distributionshifts 1.5 sigma, the intersection points of

    the normal curve and the specification limits become 4.5sigma at one end and 7.5 sigma

    at the other. Since for all practical purposes, the area outside 7.5

    3.4 Reliability and Validity

    Recall that concepts and definitions have to be operationally defined before

    measurements can betaken. Assuming operational definitions are derived and

    measurements are taken, the logicalquestion to ask is, how good are the operational

    metrics and the measurement data? Do theyreally accomplish their taskmeasuring the

    concept that we want to measure and doing so withgood quality? Of the many criteria of

    measurement quality, reliability and validity are the two mostimportant.


    refers to the consistency of a number of measurements taken using the

    sameMeasurement method on the same subject. If repeated measurements are highly

    consistent oreven identical, then the measurement method or the operational definition

    has a high degree of

    Reliability. If the variations among repeated measurements are large, then reliability is

    low. Forexample, if an operational definition of a body height measurement of children

    (e.g., between ages3 and 12) includes specifications of the time of the day to take

    measurements, the specific scale touse, who takes the measurements (e.g., trained

    pediatric nurses), whether the measurementsshould be taken barefooted, and so on, it is

    likely that reliable data will be obtained. If theoperational definition is very vague in

    terms of these considerations, the data reliability may be low.Measurements taken in the

    early morning may be greater than those taken in the late afternoonbecause children's

    bodies tend to be more stretched after a good night's sleep and becomesomewhat

    compacted after a tiring day. Other factors that can contribute to the variations of

    themeasurement data include different scales, trained or untrained personnel, with or

    without shoeson, and so on.

  • 30

    Figure 3.4. An Analogy to Validity and Reliability

    From Practice of Social Research (Non-Info Trac Version), 9th edition, by E. Babbie.


    Thomson Learning. Reprinted with permission of Brooks/Cole, an imprint of the


    Group, a division of Thomson Learning (fax: 800-730-2215).


    Note that there is some tension between validity and reliability. For the data to be

    reliable, themeasurement must be specifically defined. In such an endeavor, the risk of

    being unable torepresent the theoretical concept validly may be high. On the other hand,

    for the definition to havegood validity, it may be quite difficult to define the

    measurements precisely. For example, themeasurement of church attendance may be

    quite reliable because it is specific and observable.However, it may not be valid to

    represent the concept of religiousness. On the other hand, toderive valid measurements of

    religiousness is quite difficult. In the real world of measurements andmetrics, it is not

    uncommon for a certain tradeoff or balance to be made between validity and


    Validity and reliability issues come to the fore when we try to use metrics and

    measurements to

    represent abstract theoretical constructs. In traditional quality engi-neering where


    are frequently physical and usually do not involve abstract concepts, the counterparts of


    and reliability are termed accuracy and precision (Juran and Gryna, 1970). Much

    confusion surroundsthese two terms despite their having distinctly different meanings. If

  • 31

    we want a much higherdegree of precision in measurement (e.g., accuracy up to three

    digits after the decimal point whenmeasuring height), then our chance of getting all

    measurements accurate may be reduced. Incontrast, if accuracy is required only at the

    level of integerinch (less precise), then it is a lot easier

    to meet the accuracy requirement.

    3.5 Measurement Errors

    In this section we discuss validity and reliability in the context of measurement

    error. There are twotypes of measurement error: systematic and random.

    Systematic measurement error is associatedwith validity; random error is

    associated with reliability. Let us revisit our example about thebathroom weight scale

    with an offset of 10 lb. Each time a person uses the scale, he will get ameasurement that

    is 10 lb. more than his actual body weight, in addition to the slight variationsamong

    measurements. Therefore, the expected value of the measurements from the scale does

    notequal the true value because of the systematic deviation of 10 lb. In simple formula:

    In a general case:

    where M is the observed/measured score, T is the true score, s is systematic error, and e is



    The presence of s (systematic error) makes the measurement invalid. Now let us

    assume themeasurement is valid and the s term is not in the equation. We have the

    following:The equation still states that any observed score is not equal to the true score

    because of randomdisturbancethe random error e. These disturbances mean that on one

    measurement, a person'sscore may be higher than his true score and on another occasion

    the measurement may be lowerthan the true score. However, since the disturbances are

    random, it means that the positive errorsare just as likely to occur as the negative errors

    and these errors are expected to cancel eachother. In other words, the average of these

  • 32

    errors in the long run, or the expected value of e, iszero: E(e) = 0. Furthermore, from

    statistical theory about random error, we can also assume the


    The correlation between the true score and the error term is zero.

    There is no serial correlation between the true score and the error term.

    The correlation between errors on distinct measurements is zero.

    From these assumptions, we find that the expected value of the observed scores is equal

    to the

    true score:

    Therefore, the reliability of a metric varies between 0 and 1. In general, the larger the

    errorvariance relative to the variance of the observed score, the poorer the reliability. If

    all variance ofthe observed scores is a result of random errors, then the reliability is zero

    [1 (1/1) = 0].

    3.5.1 Assessing Reliability

    Thus far we have discussed the concept and meaning of validity and reliability

    and their

    interpretation in the context of measurement errors. Validity is associated with systematic

    error andthe only way to eliminate systematic error is through better understanding of the

    concept we try tomeasure, and through deductive logic and reasoning to derive better

    definitions. Reliability isassociated with random error. To reduce random error, we need

    good operational definitions, andbased on them, good execution of measurement

    operations and data collection. In this section, wediscuss how to assess the reliability of

    empirical measurements.There are several ways to assess the reliability of empirical

    measurements including the test/retest

    method, the alternative-form method, the split-halves method, and the internal

    consistency method(Carmines and Zeller, 1979). Because our purpose is to illustrate how

    to use our understanding ofreliability to interpret software metrics rather than

    indepthstatisticalexamination of the subject,we take the easiest method, the retest method.

    The retest method is simply taking a secondmeasurement of the subjects some time after

  • 33

    the first measurement is taken and then computingthe correlation between the first and

    the second measurements. For instance, to evaluate thereliability of a blood pressure

    machine, we would measure the blood pressures of a group of peopleand, after everyone

    has been measured, we would take another set of measurements. The secondmeasurement

    could be taken one day later at the same time of day, or we could simply take

    twomeasurements at one time. Either way, each person will have two scores. For the sake

    of simplicity,let us confine ourselves to just one measurement, either the systolic or the

    diastolic score. We thencalculate the correlation between the first and second score and

    the correlation coefficient is thereliability of the blood pressure machine. A schematic

    representation of the test/retest method forestimating reliability is shown in Figure 3.5.

    Figure 3.5. Test/Retest Method for Estimating Reliability


    The equations for the two tests can be represented as follows:

    From the assumptions about the error terms, as we briefly stated before, it can be shown


    m is the reliability measure.

    As an example in software metrics, let us assess the reliability of the reported number of


    found at design inspection. Assume that the inspection is formal; that is, an inspection

    meeting washeld and the participants include the design owner, the inspection moderator,

    and the inspectors.At the meeting, each defect is acknowledged by the whole group and

  • 34

    the record keeping is done bythe moderator. The test/retest method may involve two

    record keepers and, at the end of theinspection, each turns in his recorded number of

    defects. If this method is applied to a series ofinspections in a development organization,

    we will have two reports for each inspection over asample of inspections. We then

    calculate the correlation between the two series of reportednumbers and we can estimate

    the reliability of the reported inspection defects.

    3.5.2 Correction for Attenuation

    One of the important uses of reliability assessment is to adjust or correct

    correlations for

    unreliability that result from random errors in measurements. Correlation is perhaps one

    of the

    most important methods in software engineering and other disciplines for analyzing


    between metrics. For us to substantiate or refute a hypothesis,

    we have to gather data for boththe independent and the dependent variables and

    examine the correlation of the data. Let usrevisit our hypothesis testing example at the

    beginning of this chapter: The more effective thedesign reviews and the code inspections

    as scored by the inspection team, the lower the defectrate encountered at the later phase

    of formal machine testing.As mentioned, we first need to operationally define the

    independent variable (inspection

    effectiveness) and the dependent variable (defect rate during formal machine testing).

    Then we

    gather data on a sample of components or projects and calculate the correlation between

    theindependent variable and dependent variable. However, because of random errors in

    the data, theresultant correlation often is lower than the true correlation. With knowledge

    about the estimate ofthe reliability of the variables of interest, we can adjust the observed

    correlation to get a moreaccurate picture of the relationship under consideration. In

    software development, we observedthat a key reason for some theoretically sound

    hypotheses not being supported by actual projectdata is that the operational definitions of

    the metrics are poor and there are too many noises in thedata.Given the observed

  • 35

    correlation and the reliability estimates of the two variables, the formula forcorrection for

    attenuation (Carmines and Zeller, 1979) is as follows:


    xt yt) is the correlation corrected for attenuation, in other words, the estimated true


    xi yi) is the observed correlation, calculated from the observed data

    xx' is the estimated reliability of the X variable

    yy' is the estimated reliability of the Y variable |

    For example, if the observed correlation between two variables was 0.2 and the reliability

    estimates were 0.5 and 0.7, respectively, for X and Y, then the correlation corrected for


    would be


    This means that the correlation between X and Y would be 0.34 if both were measured


    without error.

    3.6 Be Careful with Correlation

    Correlation is probably the most widely used statistical method to assess

    relationships amongobservational data (versus experimental data). However, caution

    must be exercised when usingcorrelation; otherwise, the true relationship under

    investigation may be disguised ormisrepresented. There are several points about

    correlation that one has to know before using it.

    First, although there are special types of nonlinear correlation analysis available in


    literature, most of the time when one mentions correlation, it means linear correlation.

    Indeed, themost well-known Pearson correlation coefficient assumes a linear relationship.

    Therefore, if a

  • 36

    correlation coefficient between two variables is weak, it simply means there is no linear

    relationshipbetween the two variables. It doesn't mean there is no relationship of any


    Let us look at the five types of relationship shown in Figure 3.6. Panel A

    represents a positive linearrelationship and panel B a negative linear relationship. Panel C

    shows a curvilinear convexrelationship, and panel D a concave relationship. In panel E, a

    cyclical relationship (such as theFourier series representing frequency waves) is shown.

    Because correlation assumes linear

    relationships, when the correlation coefficients (Pearson) for the five relationships are


    the results accurately show that panels A and B have significant correlation. However,

    thecorrelation coefficients for the other three relationships will be very weak or will show

    norelationship at all. For this reason, it is highly recommended that when we use

    correlation wealways look at the scattergrams. If the scattergram shows a particular type

    of nonlinearrelationship, then we need to pursue analyses or coefficients other than linear


    Figure 3.6. Five Types of Relationship Between Two Variables


    Second, if the data contain noise (due to unreliability in measurement) or if the range of

    the data

    points is large, the correlation coefficient (Pearson) will probably show no relationship.

    In such a

    situation, we recommend using the rank-order correlation method, such as Spearman's


  • 37

    correlation. The Pearson correlation (the correlation we usually refer to) requires interval

    scaledata, whereas rank-order correlation requires only ordinal data. If there is too much

    noise in theinterval data, the Pearson correlation coefficient thus calculated will be

    greatly attenuated. Asdiscussed in the last section, if we know the reliability of the

    variables involved, we can adjust theresultant correlation. However, if we have no

    knowledge about the reliability of the variables,

    rank-order correlation will be more likely to detect the underlying relationship.

    Specifically, if thenoises of the data did not affect the original ordering of the data points,

    then rank-order correlationwill be more successful in representing the true relationship.

    Since both Pearson's correlation andSpearman's rank-order correlation are covered in

    basic statistics textbooks and are available inmost statistical software packages, we need

    not get into the calculation details here.Third, the method of linear correlation (least-

    squares method) is very vulnerable to extreme values.If there are a few extreme outliers

    in the sample, the correlation coefficient may be seriouslyaffected. For example, Figure

    3.7 shows a moderately negative relationship between X and Y.However, because there

    are three extreme outliers at the northeast coordinates, the correlationcoefficient will

    become positive. This outlier susceptibility reinforces the point that when correlationis

    used, one should also look at the scatter diagram of the data.

    3.7 Criteria for Causality

    The isolation of cause and effect in controlled experiments is relatively easy. For

    example, aheadache medicine was administered to a sample of subjects who were having

    headaches. Aplacebo was administered to another group with headaches (who were

    statistically not differentfrom the first group). If after a certain time of taking the

    headache medicine and the placebo, theheadaches of the first group were reduced or

    disappeared, while headaches persisted among thesecond group, then the curing effect of

    the headache medicine is clear.

    For analysis with observational data, the task is much more difficult. Researchers (e.g.,


    1986) have identified three criteria:

    1. The first requirement in a causal relationship between two variables is that the cause

    precede the effect in time or as shown clearly in logic.

  • 38

    2. The second requirement in a causal relationship is that the two variables be empirically

    correlated with one another.

    3. The third requirement for a causal relationship is that the observed empirical


    between two variables be not the result of a spurious relationship.

    The first and second requirements simply state that in addition to empirical

    correlation, therelationship has to be examined in terms of sequence of occurrence or

    deductive logic. Correlationis a statistical tool and could be misused without the guidance

    of a logic system. For instance, it ispossible to correlate the outcome of a Super Bowl

    (National Football League versus AmericanFootball League) to some interesting artifacts

    such as fashion (length of skirt, popular color, and soforth) and weather. However, logic

    tells us that coincidence or spurious association cannot\substantiate causation.The third

    requirement is a difficult one. There are several types of spurious relationships,

    as Figure3.8 shows, and sometimes it may be a formidable task to show that the

    observed correlation is notdue to a spurious relationship. For this reason, it is much more

    difficult to prove causality inobservational data than in experimental data. Nonetheless,

    examining for spurious relationships isnecessary for scientific reasoning; as a result,

    findings from the data will be of higher quality.

    Figure 3.8. Spurious Relationships

    In Figure 3.8, case A is the typical spurious relationship between X and Y in which X and

    Y have a

    common cause Z. Case B is a case of the intervening variable, in which the real cause of

    Y is an

    intervening variable Z instead of X. In the strict sense, X is not a direct cause of Y.

    However, since X

  • 39

    causes Z and Z in turn causes Y, one could claim causality if the sequence is not too

    indirect. Case C

    is similar to case A. However, instead of X and Y having a common cause as in case B,

    both X and Yare indicators (operational definitions) of the same concept C. It is logical

    that there is a correlation

    between them, but causality should not be inferred.

    An example of the spurious causal relationship due to two indicators measuring the same


    is Halstead's (1977) software science formula for program length:


    N = estimated program length

    n1 = number of unique operators

    n2 = number of unique operands

    Researchers have reported high correlations between actual program length (actual lines

    of code

    count) and the predicted length based on the formula, sometimes as high as 0.95

    (Fitzsimmons andLove, 1978). However, as Card and Agresti (1987) show, both the

    formula and actual programlength are functions of n1 and n2, so correlation exists by

    definition. In other words, both theformula and the actual lines of code counts are

    operational measurements of the concept ofprogram length. One has to conduct an actual

    n1 and n2 count for the formula to work.However, n1and n2 counts are not available until

    the program is complete or almost complete. Therefore, therelationship is not a cause-

    and-effect relationship and the usefulness of the formula's predictabilityis limited.