Fundamentals of Measurement Theory
Measurement
Measurement is crucial to the progress of all sciences.
Scientific progress is made through observations and generalizations based on data and measurements.
The confirmation or refutation of theories via hypothesis testing depends on empirical data and measurement.
Example Hypothesis
“The more rigorously the front end of the software development process is executed, the better the quality at the back end.”
To confirm or refute this proposition we need to:
Define the key concepts, e.g., “the software development process”.
Distinguish the process steps and activities of the front end from those of the back end.
Sample Development Process (After Requirements Gathering)
Design
Design review and inspections
Code
Code inspections
Debug and development tests
Integration of components and modules to form the product
Formal machine testing
Early customer programs
Front-End and Back-End Steps
Assume everything through debugging and development tests is the front end.
The back end is everything from integration onward.
Definition of Rigorous Implementation
“Total adherence to the process: Whatever is described in the process documentation that needs to be executed is executed.”
We need to specify the indicators of the definition and make them operational. E.g., if the process requires that all designs and code are inspected, an operational definition of rigorous implementation may be inspection coverage, expressed as the percentage of lines of code actually inspected.
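As a rough sketch, the operational indicator above can be computed as follows; the function name and the example counts are illustrative, not taken from any real project.

```python
def inspection_coverage(inspected_loc, total_loc):
    """Operational indicator: percentage of lines of code actually inspected."""
    if total_loc <= 0:
        raise ValueError("total_loc must be positive")
    return 100.0 * inspected_loc / total_loc

# Illustrative: 8,500 of 10,000 lines inspected gives 85% coverage
print(inspection_coverage(8_500, 10_000))  # 85.0
```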
How Would We Operationally Define “Rigorous Testing”?
Possibly using measurement indicators such as:
The percent coverage in terms of instructions executed
The defect rate expressed in terms of the number of defects removed per thousand lines of source code
How Would We Operationally Define “Back-End Quality”?
Possibly in terms of “the number of defects found per KLOC during formal machine testing”
Possible Testable Hypotheses
For software projects, the higher the percentage of the designs and code that are inspected, the lower the defect rate at the later phase of formal machine testing.
The more effective the design reviews and the code inspections as scored by the inspection team, the lower the defect rate at the later phase of formal machine testing.
Possible Testable Hypotheses (Cont’d)
The more thorough the development testing (in terms of test coverage) before integration, the lower the defect rate at the formal machine testing phase.
What Are Additional Questions We Need to Ask?
Are the indicators valid?
Are the data reliable?
Are there other variables we need to control when we conduct the analysis for hypothesis testing?
Abstraction Hierarchy
Abstract world: Theory, Concept, Proposition, Definition.
Empirical world: Hypothesis, Operational Definition, Data Analysis, Measurements in the real world.
Levels of Measurement
The four levels of measurement:
1. Nominal Scale
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale
The scales are hierarchical.
One should always try to devise metrics that can take advantage of the highest level of measurement allowed by the nature of the concept and its definition.
Nominal Scale
Separating elements into categories with respect to a certain attribute
The categories must be jointly exhaustive
The categories must be mutually exclusive
Ordinal Scale
Refers to the measurement operations through which the subjects can be compared in order.
Not only can we group elements into categories, but we can order the categories.
The scale offers no information about the magnitude of the differences between the elements.
Interval Scale
Indicates the exact differences between the measurement points.
Requires a well-defined unit of measurement that can be agreed on as a common standard and that is repeatable.
Ratio Scale
When an absolute or nonarbitrary zero point can be located on an interval scale, it becomes a ratio scale.
It is the highest level of measurement and all mathematical operations can be applied to it.
Almost all interval measurement scales are also ratio scales.
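To illustrate why the hierarchy matters, the sketch below treats defect severity as an ordinal scale (the 1-3 codes are invented): order comparisons and the median are meaningful, but arithmetic on the codes assumes interval or ratio properties the scale does not have.

```python
from statistics import median

# Illustrative ordinal codes: 1 = low, 2 = medium, 3 = high severity
severities = [1, 3, 2, 3, 1, 2, 3]

# Order comparisons are meaningful on an ordinal scale
print(max(severities))     # 3

# The median depends only on order, so it is also meaningful
print(median(severities))  # 2

# By contrast, mean(severities), or a ratio such as "3 is three times
# as severe as 1", would require an interval or ratio scale.
```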
Some Basic Measures
Ratio
Proportion
Percentage
Rate
Six Sigma
Ratio
A ratio results from dividing one quantity by another.
The numerator and denominator are from two distinct populations and are mutually exclusive.
E.g., Number of males / Number of females x 100%
Proportion
In a proportion the numerator is a part of the denominator:
E.g., p = a / (a+b)
While a ratio is best used for two groups, a proportion is used for multiple categories of one group.
Percentage
A proportion becomes a percentage when it is expressed in terms of per hundred units (the denominator is normalized to 100).
Percentages can be misleading because they do not make the sample size clear.
When reporting percentages, the sample size should be at least 30.
It’s best to show both percentages and actual numbers or sample sizes.
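The distinction between proportion and percentage, and the advice to report sample sizes alongside percentages, can be sketched as follows (the defect counts are invented for illustration):

```python
def proportion(part, whole):
    """p = a / (a + b): the numerator is part of the denominator."""
    return part / whole

def percentage(part, whole):
    """A proportion expressed per hundred units."""
    return 100.0 * proportion(part, whole)

# Invented defect counts for one project, grouped by defect type
defects_by_type = {"logic": 24, "interface": 12, "data": 4}
n = sum(defects_by_type.values())  # n = 40, above the minimum of 30

# Show both the percentage and the sample size to avoid misleading readers
for kind, count in defects_by_type.items():
    print(f"{kind}: {percentage(count, n):.1f}% (n={n})")
```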
Example: Percentage Distributions of Defect Type by Project
Example: Percentage Distributions of Defects across Project by Defect Type
Rate
Ratios, proportions, and percentages are static summary measures.
Rate is associated with the dynamics of the phenomena of interest.
Generally it is a measure of change in one quantity (y) per unit of another quantity (x); usually x is time.
E.g., Crude birth rate = (B/P) x K, where B is the number of live births in a given calendar year, P is the mid-year population, and K is a constant, usually 1,000
Rate – Exposure to Risk
All elements in the denominator have to be at risk of becoming or producing the elements in the numerator.
A better measurement would be the general fertility rate in which the denominator is the number of women of childbearing age.
Risk Exposure with Respect to Quality
Exposure to risk is defined as opportunities for error (OFE).
The numerator is the number of defects of interest. Therefore:
Defect Rate = (Number of Defects / OFE) x K
In software, defect rate is usually defined as the number of defects per thousand source lines of code (KLOC).
This is a crude measure. WHY?
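The defect-rate formula translates directly into code; the counts below are illustrative.

```python
def defect_rate(defects, ofe, k=1000):
    """Defect rate = (number of defects / OFE) x K.
    With OFE in lines of code and K = 1000, this is defects per KLOC."""
    return defects * k / ofe

# Illustrative: 120 defects found in a 60,000-line product
print(defect_rate(120, 60_000))  # 2.0 defects per KLOC
```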
Six Sigma
Six sigma represents a stringent level of quality (3.4 defective parts per million).
It was made known by Motorola when it won the first Malcolm Baldrige National Quality Award.
It has become an industry standard as an ultimate quality goal.
Sigma (σ) is the symbol for standard deviation.
Six Sigma (Cont’d)
In a normal distribution, the area under the curve between plus and minus one standard deviation is 68.26%.
The area defined by plus/minus two standard deviations is 95.44%.
The area defined by plus/minus six standard deviations is 99.9999998%
Areas Under the Normal Curve
Shifted Six Sigma
The six sigma value of 0.002 ppm comes from the statistical normal distribution.
It assumes each execution of the production process will produce the exact distribution of parts or products centered with regard to the specification limits.
However, process shifts and drifts always result from variations in process execution.
According to research, the maximum process shift is 1.5 sigma.
Accounting for the shift gives a six sigma value of 3.4 ppm.
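Both figures can be reproduced from the normal distribution with the standard library; this sketch assumes defectives fall outside the plus/minus six sigma specification limits.

```python
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal cumulative distribution function

# Centered process: tails beyond +/- 6 sigma on both sides
centered_ppm = 2 * (1 - phi(6.0)) * 1_000_000
print(f"centered: {centered_ppm:.3f} ppm")  # ~0.002 ppm

# Mean shifted 1.5 sigma toward one limit: the near limit is 4.5 sigma
# away, the far limit 7.5 sigma away
shifted_ppm = ((1 - phi(4.5)) + phi(-7.5)) * 1_000_000
print(f"shifted:  {shifted_ppm:.1f} ppm")   # ~3.4 ppm
```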
Specification Limits, Centered Six Sigma, and Shifted Six Sigma
Six Sigma and Software Development
In software, six sigma in terms of defect level is defined as 3.4 defects per million lines of code of the software product over its lifetime.
Unfortunately, the operational definitions differ across organizations.
Some do not distinguish lines of code by language type.
Reliability and Validity
Concepts and definitions have to be operationally defined before measurements can be taken.
The logical questions to ask are:
How good are the operational metrics and the measurement data?
Do they really accomplish their task – measuring the concept we want to measure and doing so with good quality?
Reliability
It is the consistency of a number of measurements taken using the same measurement method on the same subject (precision)
If repeated measurements are highly consistent, or even identical, then the measurement method or operational definition has a high degree of reliability.
Reliability (Cont’d)
Reliability can be expressed in terms of the size of the standard deviations of the repeated measurements.
When variables are compared, the ratio of the standard deviation to the mean (the index of variation or IV) is used.
IV = Standard Deviation / Mean
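A minimal sketch of the index of variation; the repeated measurements are invented.

```python
from statistics import mean, stdev

def index_of_variation(measurements):
    """IV = standard deviation / mean, over repeated measurements
    of the same subject with the same measurement method."""
    return stdev(measurements) / mean(measurements)

# Invented repeated measurements of one subject
repeated = [102.0, 98.0, 101.0, 99.0, 100.0]
print(round(index_of_variation(repeated), 4))  # 0.0158
```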
Validity
Validity refers to whether the measurement or metric really measures what we intend to measure.
When the measurement does not involve a higher level of abstraction, validity simply means accuracy.
For an abstract concept it is difficult to recognize whether a certain metric is valid or invalid in measuring it.
Types of Validity
Construct validity – the validity of the operational measurement or metric representing the theoretical construct.
Criterion-related validity – predictive validity, e.g., the relationship between test scores and actual performance.
Content validity – the degree to which a measure covers the range of meanings included in the concept.
Tension Between Reliability and Validity
For data to be reliable, the measurement must be specifically defined.
This may make it more difficult to represent the theoretical concept in a valid way.
Measurement Errors
There are two types of measurement error:
Systematic
Random
Systematic errors are associated with validity.
Random errors are associated with reliability.
Systematic Errors
If the measurements do not equal the true value because of a systematic deviation (e.g., a scale being off by ten pounds), the error is a systematic error.
Measurement = True Value + Systematic Error + Random Variations, or
M = T + s + e
The presence of a systematic error makes the measurement invalid.
Random Errors
If we eliminate systematic errors we have: M = T + e
That is, the measured value is different from the true value because of some random disturbance.
Since the disturbances are random, positive errors are just as likely as negative errors.
Therefore, the expected value of e is zero, i.e., E(e) = 0
Random Errors (Cont’d)
From statistical theory about random error we can assume the following:
The correlation between the true score and the error term is zero.
There is no serial correlation between the true score and the error term.
The correlation between errors on distinct measurements is zero.
Random Errors (Cont’d)
From these assumptions it follows:
E(M) = E(T) + E(e)
     = E(T) + 0
     = E(T)
     = T
The smaller the variations in the error term, the more reliable the measurements.
Random Errors (Cont’d)
M = T + e
var(M) = var(T) + var(e)
where var is variance
Reliability = pm = var(T) / var(M)
            = [var(M) - var(e)] / var(M)
            = 1 - [var(e) / var(M)]
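A small simulation makes the decomposition concrete: with invented parameters (true scores with variance 100, errors with variance 25), the estimated reliability should come out near var(T) / (var(T) + var(e)) = 0.8.

```python
import random
from statistics import variance

random.seed(42)  # fixed seed so the run is repeatable
n = 10_000

true_scores = [random.gauss(50, 10) for _ in range(n)]   # T, var = 100
errors = [random.gauss(0, 5) for _ in range(n)]          # e, E(e) = 0, var = 25
measured = [t + e for t, e in zip(true_scores, errors)]  # M = T + e

reliability = variance(true_scores) / variance(measured)
print(round(reliability, 2))  # close to the theoretical 0.8
```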
Assessing Reliability
Several ways of assessing reliability exist:
Test/retest method
Alternate-form method
Split-halves method
Internal consistency method
Using the test/retest method we would have:
M1 = T + e1
M2 = T + e2
and as before
pm = pm1m2 = var(T) / var(M)
Correction for Attenuation
One of the important uses of reliability assessment is to adjust correlations.
Given the observed correlation and the reliability estimates of two variables the formula for correction for attenuation is as follows:
Correction for Attenuation (Cont’d)
p(xtyt) = p(xiyi) / √(pxx' pyy')
where
p(xtyt) is the correlation corrected for attenuation, in other words, the estimated true correlation
p(xiyi) is the observed correlation, calculated from the observed data
pxx’ is the estimated reliability of the X variable
pyy’ is the estimated reliability of the Y variable
Correction for Attenuation Example
If the observed correlation between two variables was 0.2 and the reliability estimates were 0.5 and 0.7 respectively, for X and Y, then the correlation corrected for attenuation would be:
p(xtyt) = 0.2 / √(0.5 x 0.7) = 0.34
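The correction formula and the worked example translate directly into code:

```python
from math import sqrt

def correct_for_attenuation(observed_r, reliability_x, reliability_y):
    """Estimated true correlation: the observed correlation divided by
    the square root of the product of the two reliability estimates."""
    return observed_r / sqrt(reliability_x * reliability_y)

# The example above: observed r = 0.2, reliabilities 0.5 (X) and 0.7 (Y)
print(round(correct_for_attenuation(0.2, 0.5, 0.7), 2))  # 0.34
```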
Correlation
Correlation is probably the most widely used statistical method to assess relationships among observational data.
Important points about correlation:
It usually means linear correlation (the well-known Pearson correlation coefficient assumes a linear relationship). Other possibilities include convex, concave, or cyclical.
Correlation (Cont’d)
Important points (cont’d):
If the data contain noise (due to unreliability in measurement) or if the range of data points is large, the correlation coefficient will probably show no relationship. Using rank-order correlation might be a solution.
The method of linear correlation (least-squares method) is very vulnerable to extreme values. One should look at the scatter diagram of the data.
Correlation (Cont’d)
Important points (cont’d):
Although significant correlation demonstrates that an association exists between two variables, it does not automatically imply a cause-and-effect relationship.
Criteria for Causality
The determination of cause-and-effect with observational data is a difficult task.
Three criteria are as follows:
1. The first requirement in a causal relationship between two variables is that the cause precede the effect in time or as shown clearly in logic.
2. The second requirement in a causal relationship is that the two variables be empirically correlated with one another.
Criteria for Causality (Cont’d)
Three criteria (cont’d):
3. The third requirement in a causal relationship is that the observed empirical correlation between two variables not be the result of a spurious relationship.
Spurious Relationships
Example of a Spurious Causal Relationship
Consider Halstead’s software science formula for program length:
N = n1 x log2 n1 + n2 x log2 n2
where
N = estimated program length
n1 = number of unique operators
n2 = number of unique operands
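The formula is easy to evaluate directly; the vocabulary counts in this sketch are invented, not measured from a real program.

```python
from math import log2

def halstead_length(n1, n2):
    """Estimated program length: N = n1*log2(n1) + n2*log2(n2)."""
    return n1 * log2(n1) + n2 * log2(n2)

# Invented counts: 10 unique operators, 20 unique operands
print(round(halstead_length(10, 20), 2))  # 119.66
```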
Example of a Spurious Causal Relationship (Cont’d)
Researchers have reported high correlations between actual program length (in terms of lines of code) and the predicted length based on the formula.
However, this is not surprising since both are really functions of n1 and n2.
Correlation exists because they are both operational definitions of the same concept.
Summary
Measurement is related to the concept of the entity of interest and the operational definition of that concept.
Depending on the operational definition, different levels of measurement can be applied: nominal scale, ordinal scale, interval scale, and ratio scale.
The measurement scales are hierarchical: each scale possesses all the properties of the lower level scales.
Summary (Cont’d)
Basic measures such as ratio, proportion, percentage, and rate all have specific purposes – care must be taken to avoid misuse.
The concept of six sigma represents a strict level of quality and includes the notions of process-variation reduction and product-design improvement.
In industry (shifted) six sigma is different from the statistical definition.
Summary (Cont’d)
Because of differences in operational definitions, six sigma cannot be used for comparison across companies.
Validity and reliability are the two most important criteria for measurement quality.
Validity refers to whether the metric really measures what it is intended to.
Reliability refers to the consistency of measurements of the metric and measurement method.
Summary (Cont’d)
Validity is associated with systematic measurement errors.
Reliability is associated with random measurement errors.
Unreliability of measurements leads to an attenuation of correlation between two variables.
When the measurement reliabilities of the variables are known, correction for attenuation can be made.
Summary (Cont’d)
Correlation is widely used with observational data.
Correlation alone cannot show causality.
Causality depends on three criteria being met:
1. Cause precedes effect in time or logically
2. Significant correlation exists
3. The observed correlation is not spurious
Summary (Cont’d)
Measurement is the key to making software development a true engineering discipline.
To improve the practice of software measurement, it is important to understand the fundamentals of measurement theory.