Reliability and validity


Page 1: Reliability and validity

What is test validity and test validation?

Tests themselves are not valid or invalid. Instead, we validate the use of a test score.

Page 2: Reliability and validity

• Tests are pervasive in our world. Tests can take the form of written responses to a series of questions, such as the paper-and-pencil SAT Reasoning Test™, or of judgments by experts about behavior, such as those for gymnastic trials or for a work performance appraisal. The form of test results also varies, from pass/fail, to holistic judgments, to a complex series of numbers meant to convey minute differences in behavior.

Page 3: Reliability and validity

• Regardless of the form a test takes, its most important aspect is how the results are used and the way those results impact individual persons and society as a whole. Tests used for admission to schools or programs or for educational diagnosis not only affect individuals, but also assign value to the content being tested. A test that is perfectly appropriate and useful in one situation may be inappropriate or insufficient in another. For example, a test that may be sufficient for use in educational diagnosis may be completely insufficient for use in determining graduation from high school.

Page 4: Reliability and validity

• Test validity, or the validation of a test, explicitly means validating the use of a test in a specific context, such as college admission or placement into a course. Therefore, when determining the validity of a test, it is important to study the test results in the setting in which they are used. In the previous example, in order to use the same test for educational diagnosis as for high school graduation, each use would need to be validated separately, even though the same test is used for both purposes.

Page 5: Reliability and validity

Validity is a matter of degree, not all or none.

• Samuel Messick, a renowned psychometrician, defines validity as "...an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationale support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment." Messick points out that validity is a matter of degree, not absolutely valid or absolutely invalid. He also notes that validity evidence continues to accumulate over time, either enhancing or contradicting previous findings.

Page 6: Reliability and validity

Tests sample behavior; they don't measure it directly.

• Most, but not all, tests are designed to measure skills, abilities, or traits that are not directly observable. For example, scores on the SAT Reasoning Test measure developed critical reading, writing, and mathematical ability. The score an examinee obtains on the SAT Reasoning Test is not a direct measure of critical reading ability in the way that degrees centigrade are a direct measure of the heat of an object. The amount of an examinee's developed critical reading ability must be inferred from the examinee's SAT Reasoning Test critical reading score.

Page 7: Reliability and validity

• The process of using a test score as a sample of behavior in order to draw conclusions about a larger domain of behaviors is characteristic of most educational and psychological tests. Responsible test developers and publishers must be able to demonstrate that it is possible to use the sample of behaviors measured by a test to make valid inferences about an examinee's ability to perform tasks that represent the larger domain of interest.

Page 8: Reliability and validity

Reliability is not enough; a test must also be valid for its use.

• If test scores are to be used to make accurate inferences about an examinee's ability, they must be both reliable and valid. Reliability is a prerequisite for validity and refers to the ability of a test to measure a particular trait or skill consistently. However, tests can be highly reliable and still not be valid for a particular purpose. Crocker and Algina (1986, page 217) demonstrate the difference between reliability and validity with the following analogy.

Page 9: Reliability and validity

• Consider the analogy of a car's fuel gauge which systematically registers one-quarter higher than the actual level of fuel in the gas tank. If repeated readings are taken under the same conditions, the gauge will yield consistent (reliable) measurements, but the inference about the amount of fuel in the tank is faulty.

Page 10: Reliability and validity

This analogy makes it clear that determining the reliability of a test is an important first step, but not the defining step, in determining the validity of a test.

Page 11: Reliability and validity

Reliability

Is the degree to which an assessment tool produces stable and consistent results.

Page 12: Reliability and validity

Test reliability

• Researchers use four methods to check the reliability of a test: the test-retest method, alternate forms, internal consistency, and inter-scorer reliability. Not all of these methods are used for all tests. Each method provides research evidence that the responses are consistent under certain circumstances.

Page 13: Reliability and validity

Test-retest reliability

• Is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals.

• Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.

Page 14: Reliability and validity

• Example: Test-retest reliability of the WISC-IV was evaluated using data from 243 children. The WISC-IV was administered two separate times, with a mean test-retest interval of 32 days. The average corrected Full Scale IQ stability coefficient was .93.

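As a minimal illustration (not part of the original slides), the Python sketch below estimates a test-retest coefficient as the Pearson correlation between two administrations of the same test. The scores are made up for illustration and are not the WISC-IV data cited above.

import numpy as np

# Hypothetical scores from the same examinees on two occasions.
first_administration = np.array([104, 98, 110, 87, 121, 95, 102, 115])
second_administration = np.array([106, 96, 112, 90, 118, 97, 100, 117])

# The test-retest reliability estimate is the correlation between the two administrations.
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability estimate: {r:.2f}")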

Page 15: Reliability and validity

Correlation

• The term correlation simply refers to the degree to which two or more sets of data show a tendency to vary together. The strength of a positive (increase, increase) correlation coefficient can vary from .00 to 1.00.

Page 16: Reliability and validity

Determining the Correlation Coefficient

• Step 1:     List scores. Find the mean (M) and Standard Deviation (SD) for each set of scores.

Participant #    1st Score    2nd Score
 1                   4            6
 2                   6            7
 3                   2            3
 4                   2            4
 5                   4            2
 6                   4            7
 7                   9            9
 8                   3            5
 9                   3            6
10                   3            5

                 M1 = 4       M2 = 5.4
                 SD1 = 2      SD2 = 1.96
                 n1 = 10      n2 = 10

Page 17: Reliability and validity

• Step 2: Find the z score for each test score using the formula:

z = (X - M) / SD

•     Z scores are a type of standard score. The z score is useful when attempting to compare items from distributions with different means and standard deviations. The z score for a test score indicates how far and in what direction that test score is from its distribution's mean, expressed in units of its distribution's standard deviation. The z scores will have a mean of zero and a standard deviation of one.

Page 18: Reliability and validity

Participant #    1st Test z Score    2nd Test z Score
 1                    0                   .31
 2                    1                   .82
 3                   -1                 -1.22
 4                   -1                  -.71
 5                    0                 -1.73
 6                    0                   .82
 7                    2.5                1.84
 8                   -0.5                -.20
 9                   -0.5                 .31
10                   -0.5                -.20

Page 19: Reliability and validity

• Step 3: Multiply each pair of z scores, sum the products, and apply the defining formula for the Pearson r:

r = Σ ZxZy / N

Page 20: Reliability and validity

Participant #    ZxZy
 1               0
 2                .82
 3               1.22
 4                .71
 5               0
 6               0
 7               4.60
 8                .10
 9               -.16
10                .10

Σ ZxZy = 7.39

r = (Σ ZxZy) / N = 7.39 / 10 = 0.74*

*This answer is the correlation coefficient.
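The following Python sketch (not part of the original slides) reproduces the three steps above on the same sample data: compute z scores for each set of scores, multiply the paired z scores, and average the products to obtain the Pearson r.

first_scores = [4, 6, 2, 2, 4, 4, 9, 3, 3, 3]
second_scores = [6, 7, 3, 4, 2, 7, 9, 5, 6, 5]

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    # Population standard deviation, matching SD1 = 2 and SD2 = 1.96 above.
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def z_scores(xs):
    # Step 2: z = (X - M) / SD for each score.
    m, s = mean(xs), sd(xs)
    return [(x - m) / s for x in xs]

zx = z_scores(first_scores)
zy = z_scores(second_scores)

# Step 3: average of the paired z-score products.
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)
print(f"Pearson r = {r:.2f}")  # approximately 0.74, as on the slide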

Page 21: Reliability and validity

About Pearson r

• The Pearson r is the most commonly used measure of correlation, sometimes called the Pearson product-moment correlation. It is simply the average of the products of the paired z scores, and it measures the strength of the linear relationship between two characteristics. A positive (increase, increase) correlation coefficient can range from 0.00 to 1.00; the closer to 1.00, the stronger the relationship.

Page 22: Reliability and validity

Alternate Forms

• This type of reliability uses a second form of a test consisting of similar items, but not the same items. Researchers administer this second "parallel" form after having already administered the first form. This yields a reliability coefficient that reflects error due to different times and items and allows researchers to control for test form. By administering form A to one group and form B to another group, and then form B to the first group and form A to the second group at the next administration, researchers obtain a coefficient of stability and equivalence. This is the correlation between scores on the two forms, and it takes into account error due to both different times and different forms.

Page 23: Reliability and validity

• Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
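A minimal sketch of the random split described in this example, using hypothetical item labels rather than real critical-thinking items:

import random

# Hypothetical pool of 40 critical-thinking items.
item_pool = [f"item_{i:02d}" for i in range(1, 41)]

random.shuffle(item_pool)          # put the items in random order
form_a = sorted(item_pool[:20])    # first half becomes parallel Form A
form_b = sorted(item_pool[20:])    # second half becomes parallel Form B

print("Form A:", form_a)
print("Form B:", form_b)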

Page 24: Reliability and validity

Inter-rater reliability

• Is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret responses the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct* or skill being assessed.

• Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.

*A construct is a property that is offered to explain some aspect of human behavior, such as mechanical ability, intelligence, or introversion.

Page 25: Reliability and validity

Internal consistency reliability

• Is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.

Page 26: Reliability and validity

A. Split-half reliability is a subtype of internal consistency reliability. The process of obtaining split-half reliability begins by splitting in half all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two "sets" of items. The entire test is administered to a group of individuals, the total score for each "set" is computed, and finally the split-half reliability is obtained by determining the correlation between the two sets of total scores.

Page 27: Reliability and validity

• The test is administered once and divided into halves that are scored separately; the scores on one half are then compared with the scores on the other half to estimate reliability.

Page 28: Reliability and validity

Why use Split-Half?

• Split-half reliability is a useful measure when it is impractical or undesirable to assess reliability with two tests or with two test administrations (because of limited time or money) (Cohen & Swerdlik, 2001).

Page 29: Reliability and validity

How do I use Split-Half?

• 1st: Divide the test into halves. The most commonly used way to do this is to assign odd-numbered items to one half of the test and even-numbered items to the other; this is called odd-even reliability.

• 2nd: Find the correlation of scores between the two halves by using the Pearson r formula.

• 3rd: Adjust the correlation using the Spearman-Brown formula, which estimates the reliability of the full-length test. Longer tests tend to be more reliable, so this correction is needed for a test that has been shortened, as it effectively is in split-half reliability (Kaplan & Saccuzzo, 2001).

Page 30: Reliability and validity

• Spearman-Brown formula:

r_total = 2r / (1 + r)

where r = the estimated correlation between the two halves (the Pearson r) (Kaplan & Saccuzzo, 2001).

Page 31: Reliability and validity

• This method does not require two administrations of the same test or an alternate form. In the split-halves method, it is not enough to simply compute the correlation between the halves, because that correlation only estimates the reliability of each half of the test. It is necessary to apply a statistical correction to estimate the reliability of the whole test. This correction is known as the Spearman-Brown prophecy formula.

Page 32: Reliability and validity

• If the correlation between the halves is .75, the reliability for the total test is:

r_total = 2r / (1 + r) = 2(.75) / (1 + .75) = 1.5 / 1.75 = .857
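The following sketch (not part of the original slides) combines the three steps above, using a small, made-up matrix of item scores: split the items odd-even, correlate the two half-test totals with the Pearson r, and apply the Spearman-Brown correction.

import numpy as np

# Made-up item scores: rows = examinees, columns = items (1 = correct, 0 = incorrect).
item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
])

# 1st: total score on the odd-numbered items and on the even-numbered items.
odd_totals = item_scores[:, 0::2].sum(axis=1)
even_totals = item_scores[:, 1::2].sum(axis=1)

# 2nd: Pearson r between the two half-test scores.
r_half = np.corrcoef(odd_totals, even_totals)[0, 1]

# 3rd: Spearman-Brown correction for the full-length test: r_total = 2r / (1 + r).
r_total = 2 * r_half / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, Spearman-Brown corrected = {r_total:.2f}")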

Page 33: Reliability and validity

B. Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of these inter-item correlations.
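As a minimal illustration (with made-up item scores, not from the slides), the sketch below computes the correlation between every pair of items and averages the results:

import numpy as np

# Made-up responses: rows = examinees, columns = items probing the same construct.
responses = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 4, 3],
    [1, 2, 2, 1],
])

# Correlation matrix between items (columns).
item_corr = np.corrcoef(responses, rowvar=False)

# Average the correlations for each distinct pair of items
# (upper triangle, excluding the diagonal of 1.0s).
pair_corrs = item_corr[np.triu_indices_from(item_corr, k=1)]
print(f"Average inter-item correlation: {pair_corrs.mean():.2f}")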

Page 34: Reliability and validity

Inter-scorer reliability

• Measures the degree of agreement between persons scoring a subjective test (like an essay exam) or rating an individual. In the latter case, this type of reliability is most often used when scorers have to observe and rate the actions of participants in a study. It reveals how well the scorers agreed when rating the same set of things. Other names for this type of reliability are inter-rater reliability and inter-observer reliability.

Page 35: Reliability and validity

• Kappa coefficients indicate the extent of agreement between the raters after removing the part of their agreement that is attributable to chance. Because chance agreement is removed, kappa values are typically lower than simple percentages of agreement.

Page 36: Reliability and validity

• Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first mention of a kappa-like statistic is attributed to Galton (1892), see Smeeton (1985).

• The equation for κ is:

κ = (Pr(a) - Pr(e)) / (1 - Pr(e))

• where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters other than what would be expected by chance, then κ ≤ 0.

Page 37: Reliability and validity

• Suppose that you were analyzing data related to people applying for a grant. Each grant proposal was read by two people and each reader either said "Yes" or "No" to the proposal. Suppose the data were as follows, where rows are reader A and columns are reader B:

                 Reader B
               Yes     No
Reader A  Yes   20      5
          No    10     15

Page 38: Reliability and validity

• Note that there were 20 proposals that were granted by both reader A and reader B, and 15 proposals that were rejected by both readers. Thus, the observed percentage agreement is Pr(a)=(20+15)/50 = 0.70.

• To calculate Pr(e) (the probability of random agreement) we note that:

• Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time.

• Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time.

Page 39: Reliability and validity

• Therefore, the probability that both of them would say "Yes" randomly is 0.50 * 0.60 = 0.30, and the probability that both of them would say "No" is 0.50 * 0.40 = 0.20. Thus the overall probability of random agreement is Pr(e) = 0.3 + 0.2 = 0.5.

• So now, applying our formula for Cohen's kappa, we get κ = (0.70 - 0.50) / (1 - 0.50) = 0.40.
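A minimal sketch (not part of the original slides) that reproduces the grant-reader example:

# Counts from the 2x2 table above: Reader A x Reader B.
yes_yes, yes_no, no_yes, no_no = 20, 5, 10, 15
n = yes_yes + yes_no + no_yes + no_no  # 50 proposals

# Observed agreement: both readers say Yes or both say No.
pr_a = (yes_yes + no_no) / n                      # 0.70

# Chance agreement from each reader's marginal "Yes"/"No" rates.
a_yes = (yes_yes + yes_no) / n                    # Reader A says Yes 50% of the time
b_yes = (yes_yes + no_yes) / n                    # Reader B says Yes 60% of the time
pr_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # 0.30 + 0.20 = 0.50

kappa = (pr_a - pr_e) / (1 - pr_e)
print(f"Cohen's kappa = {kappa:.2f}")             # 0.40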

Page 40: Reliability and validity

Validity

Refers to how well a test measures what it is purported to measure.

Page 41: Reliability and validity

Why is it necessary?

• While reliability is necessary, it alone is not sufficient. In addition to being reliable, a test must be valid for the purpose for which it is used.

Page 42: Reliability and validity

Types of Validity

Page 43: Reliability and validity

Face Validity

• Ascertains that the measure appears to be assessing the intended construct under study. Stakeholders can easily assess face validity. Although this is not a very "scientific" type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged from the task.

Page 44: Reliability and validity

• Ex: If a measure of art appreciation is created, all of the items should be related to the different components and types of art. If the questions are about historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation.

Page 45: Reliability and validity

Construct Validity

• Is used to ensure that the measure is actually measuring what it is intended to measure, and not other variables. Using a panel of "experts" familiar with the construct is one way this type of validity can be assessed. The experts can examine the items and decide what each specific item is intended to measure. Students can be involved in this process to obtain their feedback.

Page 46: Reliability and validity

• Ex: A women’s studies program may design a cumulative assessment of learning throughout the major. If the questions are written with complicated wording and phrasing, the test may inadvertently become a test of reading comprehension rather than a test of women’s studies. It is important that the measure actually assesses the intended construct rather than extraneous factors.

Page 47: Reliability and validity

Criterion-Related Validity

• Is used to predict future or current performance – it correlates test results with another criterion of interest.

Page 48: Reliability and validity

• Ex: Suppose a physics program designed a measure to assess cumulative student learning throughout the major. The new measure could be correlated with a standardized measure of ability in the discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment tool.

Page 49: Reliability and validity

Formative Validity

• When applied to outcomes assessment it is used to assess how well a measure is able to provide information to help improve the program under study.

Page 50: Reliability and validity

• Ex: When designing a rubric for history, one could assess students’ knowledge across the discipline. If the measure can show that students are lacking knowledge in a certain area, for instance the Civil Rights Movement, then that assessment tool is providing meaningful information that can be used to improve the course or program requirements.

Page 51: Reliability and validity

Sampling Validity

• Ensures that the measure covers the broad range of areas within the concept under study. Not everything can be covered, so items need to be sampled from all of the domains. This may need to be completed using a panel of “experts” to ensure that the content area is adequately sampled. Additionally, a panel can help limit “expert bias” (i.e. a test reflecting what an individual personally feels are the most important or relevant areas.)

Page 52: Reliability and validity

• Ex: When designing an assessment of learning in the theatre department, it would not be sufficient to only cover issues related to acting. Other areas of theatre, such as lighting, sound, and the functions of stage managers, should all be included. The assessment should reflect the content area in its entirety.

Page 53: Reliability and validity

Some ways to improve validity

• Make sure your goals and objectives are clearly defined and operationalized. Expectations of students should be written down.

• Match your assessment measure to your goals and objectives. Additionally, have the test reviewed by faculty at other schools to obtain feedback from an outside party who is less invested in the instrument.

• Get students involved; have the students look over the assessment for troublesome wording, or other difficulties.

• If possible, compare your measure with other measures, or data that may be available.

Page 54: Reliability and validity

How do we use simple test analysis in our school?