
25

CONSTRUCTING DATA

KADRIYE ERCIKAN

University of British Columbia

WOLFF-MICHAEL ROTH

University of Victoria

We focus this chapter on the construction of data in educational research, where we understand the term “data” to mean those mathematical or textual elements that researchers use in support of their claims. Articulating issues in the construction of data that are valid for educational research in general is not an easy task because there are very different traditions or research paradigms. Ordinarily, these traditions are differentiated by the adjectives “quantitative” and “qualitative.” We do not find these terms to be useful, however, because all phenomena are characterized by the mutual constitution of quantitative and qualitative elements (Hegel, 1969). Accordingly, all so-called quantitative research requires qualitative processes (e.g., making judgments, distinguishing categories), and all so-called qualitative studies involve quantitative processes (e.g., counting and descriptive statistics [sums, averages, percentages]; terms such as “more,” “less,” and “increasing”). Other difficulties that exist in discussing data construction in general pertain to the different and sometimes incompatible discourses that exist in the alternative traditions. Our first task, therefore, is to articulate a language that is subsequently used to discuss issues in two types of research, distinguished by the adjectives “high inference” and “low inference.”

BASIC FRAMEWORK

In this chapter, we describe fundamental issues concerning the construction of data in high-inference and low-inference research methods. We denote as high inference those studies in which researchers are interested in generalizing findings beyond the context of the research. For example, researchers may be interested in examining gender differences in mathematics learning and performance. In this research, researchers often already have their research questions formulated at the onset of the research, such as whether there are gender differences in mathematics learning, and are interested in testing hypotheses and generalizing findings beyond samples and contexts used in their research. In high-inference research, researchers are interested in examining gender differences in mathematics in general rather than gender differences in a particular classroom or context.

AUTHORS’ NOTE: The co-authors contributed equally to the chapter, and the author names are in alphabetical order.

We denote as low inference those studies in which results (claims) are not extended beyond the sample or situation. Research questions are often developed during the process of research. The degree to which the results generalize to situations other than those represented in the data sources must be tested empirically in other research that involves these other situations. In low-inference research, researchers may examine mathematics learning in one or two classrooms in an effort to understand potential factors that lead to differential mathematics learning or motivation for mathematics learning. As we discuss and demonstrate in this chapter, quantitative and qualitative are properties of data rather than types of research or types of inferences that researchers are interested in making. Research activities may differ in terms of their purposes, such as high or low inference; however, both types of research may very well involve qualitative data as well as quantitative data.

In this chapter, constructing data is viewed as an integral component of the evidence-gathering process in research. The evidence-gathering process may be aimed at answering a set of research questions and testing hypotheses, as in the case of high-inference research. It may also be part of an exploration to refine and identify research questions, as in the case of low-inference research. Our emphasis is not on the critical differences between the two approaches themselves; instead, we highlight the differences in purposes of the two types of research and how these differences may influence the data construction process. There are, however, distinctions in the two approaches regarding the terminologies used, what constitutes data sources, how data are derived from these data sources, and who constitutes participants in research.

Here we understand data sources as contexts, people, methods, tools, and educational outcomes that we may want to explore so as to understand or make decisions in educational research. In mathematics gender differences research, data sources may include different groupings of students in mathematics instruction/learning, computer programs, demonstrations of mathematical performance on different mathematical tasks, and teacher–student and student–student interactions. Research is based on representations of these contexts, people, and educational outcomes, including responses to surveys or tests as well as observations, videotapes, and audiotapes. From these representations, researchers construct the data proper through an interpretation model. The interpretation model could include scoring rubrics in a mathematics test or coding protocols for videotapes of student–student interaction in a classroom. Different interpretation models would be used in constructing data from students’ responses to test questions, depending on the research questions. For example, if the students were asked to write a report on a science experiment, then different interpretation models would be used in constructing data for assessing their scientific literacy, mathematical competency, or English-language competency.

The distinction between data source and data also allows us to understand that the same data sources may be used in both low-inference and high-inference research and may lead to different data. For example, the same videotapes may be used to construct data for assessing mathematical knowing and understanding and for revealing patterns in student–student interaction during peer tutoring. Interpretations may result in data that assign scores to students’ responses or may result in counts of some behavior of interest within and across videotapes. These could be ordinal data (e.g., 1, 2, 3, . . . , where higher numbers indicate higher degrees of competency) or categorical data (e.g., types of problem-solving strategies, types of interactions).
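As an illustration of this point, the following sketch (in Python) shows how one set of coded videotape events could yield either categorical frequency data or ordinal competency ratings, depending on the interpretation model applied. The event codes and the scoring rule are invented for illustration and are not drawn from any particular study.

from collections import Counter

# Hypothetical coded events from one videotaped peer-tutoring session.
# Each tuple: (student, interaction_type, solution_quality on a 0-3 scale).
coded_events = [
    ("S1", "explains_strategy", 3),
    ("S2", "asks_question", 1),
    ("S1", "corrects_peer", 2),
    ("S2", "explains_strategy", 2),
    ("S1", "asks_question", 1),
]

# Interpretation model A: categorical data (counts of interaction types).
interaction_counts = Counter(event[1] for event in coded_events)

# Interpretation model B: ordinal data (highest solution quality per student,
# where higher numbers indicate higher degrees of competency).
competency_rating = {}
for student, _, quality in coded_events:
    competency_rating[student] = max(quality, competency_rating.get(student, 0))

print(interaction_counts)   # frequencies of interaction types
print(competency_rating)    # e.g., {'S1': 3, 'S2': 2}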

In low-inference research, data are used to substantiate claims (e.g., how female and male students interact with each other in a mathematics learning context) and to make inferences about the contexts and people (e.g., beliefs about mathematical competence, cognitive structures, intentions for further mathematical study or careers that involve mathematics); the inferences do not go beyond the contexts, people, and educational outcomes present in the data sources. In high-inference research, the data are used to make inferences that go beyond the actual contexts or people on which the data are based. For example, group differences such as those between fourth-grade Canadian boys and girls who participate in a study are to be generalized to all Canadian fourth-grade boys and girls. Therefore, the question of the degree to which data sources are representative of the population about which researchers want to make some general claims is a central issue in high-inference research. To be representative, the sample (group of participants in the study) should include the same variations that would be expected in the population (i.e., the group to which results are generalized), including age, region, culture, language background, and curricular exposure. In addition, the mathematics tests used for assessing mathematics competency should be consistent with the mathematical competency that the researchers are interested in examining and should include all aspects of this mathematical competency. For example, if the researchers are interested in examining how well students can apply mathematical concepts to scientific problem solving, then the test should include a sufficient number of questions to assess this component of mathematical competency as well as all other components of mathematical competency that the researchers are interested in examining.

Low-inference research and high-inference research do not distinguish themselves in the use or nonuse of quantitative data (e.g., numbers, counts). For example, some high-inference research, such as phenomenological studies that target understanding the structure of experience, might not use numbers at all; however, they might use terms that distinguish degrees of something such as degrees of understanding (Heidegger, 1977), degrees of sharpness in visual perception (Merleau-Ponty, 1945), or degrees of temporal distance (Husserl, 1991). What is the common essence of some human experience that might express itself in similarities or differences? This common essence may be derived based on sound studies involving only one person (i.e., N = 1) and consistently confirmed in other studies if properly performed (Varela, 2001). Similarly, some low-inference research counts the number of incidences of some phenomenon or provides frequencies for particular behaviors. For example, a researcher might report the frequencies of different categories of responses in a class of 20 or 30 students.

In the sections that follow, we present some research scenarios from our own individual research programs that involve both high- and low-inference research, and we highlight the similarities and differences in alternative approaches to research. These examples of research are presented and discussed in an effort to exemplify identification and definition of data sources, construction of data, challenges in research that are related to data, and the necessity of different modes of constructing data to address research questions.

CONSTRUCTING DATA IN HIGH-INFERENCE RESEARCH

Most educational research requires data about student competencies, motivation, thinking skills, and the like. These are constructs, that is, entities that cannot be observed directly by the researchers. The researchers need to create data about the students’ status regarding these constructs using indirect methods. Therefore, most educational data construction involves assessment of nonobservable psychological constructs that are thought to underlie the actually observable behaviors. Our understandings of some psychological construct—how it develops, how it is related to other constructs, and how it might be related to certain behavior, actions, and performance—are at the core of selecting or developing the right measurement instrument. Measurement instruments constitute primary tools for the collection of data sources. How well the measurement instruments match research purposes and the properties of such instruments are critical aspects of the validity of inferences from high-inference research. At least two typical research scenarios can be identified pertaining to measurement instruments that highlight different types of measurement issues involved in research. We refer to these as research scenarios on (a) the development of measurement instruments and (b) the identification of measurement instruments. These scenarios give us the opportunity to elaborate on measure development issues in general, the match between measures and research purposes, and limitations in research due to measurement issues that are at the core of all research scenarios. The next two subsections describe and discuss these research scenarios.

Development of Measurement Instruments

In this first scenario, the researchers have done a thorough review of the literature and identified an important research question and key constructs relevant to their research question. The main data construction issue is the development of measures that will provide adequate measurement of the construct in which the researchers are interested. For example, in developing tests to assess mathematical competency of seventh-graders in the nation, such as those used in the School Achievement Indicators Program (SAIP, Canada’s national survey of achievement), the tasks and associated assessed skills that the researchers choose to use in the assessment of mathematical competency skills need to be consistent with the definition of mathematical competency that the researchers have in mind or the way in which they are defined by an identified national curriculum. In addition, to be generalizable, these tasks need to span the range of all possible components of the mathematical competency that the researchers are interested in assessing. The tasks need to be presented in such a way that the language of the test items does not interfere with students’ ability to demonstrate their mathematical competency. Many of these tasks should be available to give students numerous chances to demonstrate their competency, to provide multiple observations of this competency, and to rule out the possibility that students accidentally succeeded or failed on a task.

In developing measurement instruments, the researchers are required to make many choices. The first set of choices is related to the development and selection of tasks. The researchers can develop tasks that may be in a multiple-choice format, in a short-answer response format, or performance tasks that involve hands-on problem solving. The researchers may want to use as many extensive performance tasks as possible; however, responding to these types of performance tasks can take students a long time and, therefore, may result in an unreasonably long testing time. In addition, scoring of student responses from such tasks is highly labor intensive; therefore, the use of performance tasks may result in very high costs. On the other hand, using multiple-choice questions may limit the types of competencies the researchers can assess, such as mathematical communication, and may have negative effects on communicating to the educational community what the important mathematical competency components are.

In addition to the format of the tasks, the researchers need to decide on the types of contexts in which the tasks can be and need to be presented. Whether these contexts, such as the division of pizza slices in a fractions problem, enhance the mathematics assessment, whether they introduce an artificial context that is not necessary, and whether they can introduce bias against some ethnic or gender groups must be determined. How many tasks can be included given the testing time constraints? What are the funding constraints that may limit the number of performance tasks that can be used in the assessment? How many and which tasks should be selected so that all components of mathematical competency are included in the assessment? Mathematical communication is an important component of mathematical competency, so it must be ascertained to what extent the tasks that assess mathematical communication are also assessing English-language competency and to what extent the test will be biased against those students whose primary language is not the language of instruction. These are only some of the questions that the researchers need to consider when developing and selecting tasks for the test.

Students’ responses to the tasks then become the “raw data” or data sources from which evidence (data) is developed in support of the claims regarding students’ competencies. The researchers need to use an interpretation model to convert the data sources into evidence in support of the existence or lack of a certain degree of competency. The interpretation model is a set of rules that determine what aspects of student responses are relevant to the target inferences and how different characteristics of responses may indicate differing degrees of competency. For example, in a mathematics test, the interpretation model includes scoring rubrics that describe what aspects of students’ responses are relevant to successful completion of a mathematics problem-solving test item and how various responses, with different degrees of completion and mathematical accuracy, may indicate different levels of competency. A student’s successful completion of tasks or provision of correct responses is treated as evidence supporting a certain degree of competency that is required to complete the task, whereas a student’s unsuccessful completion of tasks or provision of incorrect responses is interpreted as a lack of the required level of competency.

The interpretation model can be applied to produce a set of scores, higher levels of which indicate higher degrees of competencies. Psychometric models with different degrees of complexity, such as classical test theory models, item response theory models, and latent class models, are then used to summarize this evidence. The result is a set of scores that may be one-dimensional, may be multidimensional, or may indicate class memberships. Assessment design and development issues that are at the core of such data construction efforts can be found elsewhere (Ercikan, 2005; National Research Council, 2001).
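To make the contrast concrete, the sketch below sets a classical test theory summary (the observed total score) beside the item response probability of a two-parameter logistic (2PL) item response theory model. The item scores and item parameters are made up for illustration; the chapter does not prescribe a particular model.

import math

# Hypothetical scored responses for one student (1 = correct, 0 = incorrect).
item_scores = [1, 0, 1, 1, 0, 1]

# Classical test theory summary: the observed total score.
total_score = sum(item_scores)

# Two-parameter logistic (2PL) model: probability that a student with
# ability theta answers an item with discrimination a and difficulty b correctly.
def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Made-up parameters for a single item and a student of average ability.
theta, a, b = 0.0, 1.2, -0.5
print(total_score)                        # e.g., 4 out of 6 items correct
print(round(p_correct(theta, a, b), 3))   # model-based probability of a correct response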

Identification of Measurement Instruments

The second scenario is similar to the first one except that the researchers choose to use a measure that has already been developed. Using an existing measure, instead of developing a new measure tailored to the purposes of the research, can have implications for the validity of research findings, and making such a choice requires investigation of the appropriateness of the selected measure for the research purposes. In the measure identification scenario, the researchers are typically constrained by availability and familiarity of measures, their conceptualization of how the construct should be measured, costs associated with measurement and data collection, and availability of the desired research participants (subjects) and contexts relevant to the study. If the researchers are selecting a measure from an existing set of possible measures, the establishment of validity of interpretations for one population, such as adults, is not sufficient to expect or assume similar levels of appropriateness of interpretations for other populations, such as children. The researchers are often aware of the degree of inappropriateness of using with one age group a set of measures developed and validated for another age group. Unfortunately, the same degree of awareness does not exist for appropriateness of measures for different cultural and language groups or for people from different countries (Hambleton, 2004).

The widespread use of translated or adapted versions of tests in multicultural research or international assessments demonstrates the need for, and interest in, using the same measure in multiple languages and cultures. However, differences due to culture and language are often not taken into account when scores from these measures are used. Extensive research now demonstrates that measures in fact cannot be assumed to measure similar constructs when administered in different languages (Allalouf, Hambleton, & Sireci, 1999; Ercikan, 1998, 2002, 2003; Ercikan & Koh, 2005; Gierl & Khaliq, 2001; Sireci, Fitzgerald, & Xing, 1998). In this scenario, the validity of the research and the validity of interpretations are critically tied to the appropriateness of the measures selected for the purposes of research. Therefore, the researchers are required to provide evidence of such appropriateness.

PSYCHOMETRIC PROPERTIES OF MEASUREMENT INSTRUMENTS

The impact of different measure properties and the appropriateness of measures are at the core of validity of interpretations of all educational research. In both of the research scenarios just presented, a set of measures is at the core of the data construction process, and the primary research regarding data construction is the degree to which these measures are appropriate. This type of research requires data to examine properties of measures and their appropriateness for different uses. The following subsections discuss and exemplify the various types of research questions that psychometric research may involve and the range of data construction efforts that may be needed to explore these questions.

Examples of High-Inference Research

In this subsection, we articulate typical features of high-inference research and exemplify them in a case study assessing the equivalence of English and French versions of a school achievement test. The research described and discussed in this subsection explored the properties of tests that may have significant implications for data obtained from these tests.1 The general research question was as follows: Can the English and French versions of tests be considered equivalent, or do the tests provide biased information about student competencies? The impetus for this research question was the fact that in Canada most tests are administered in both English and French, the two official languages of the country, with little or no evidence of the equivalence of the two versions of the tests. The comparability of constructs assessed by the two language versions of the tests is critical to all decisions and research projects that use results from these assessments. The purpose of this research was to examine the degree of differences that can be expected between the two language versions of these tests and to identify sources of differences. Data from the SAIP were used to evaluate the comparability of test items and of assessment results from the English and French versions of tests in three content areas: language arts, mathematics, and science. The data sources consisted of students’ responses to test questions and identification of the language in which their test was administered.

There are many challenges in deciding which measures to use and the degree to which data from these measures will provide relevant information to address research questions adequately. In this research, the first challenge was the degree to which the types of differences that can be identified in the SAIP can be generalized to other tests. First, these tests are developed by professional content area experts in language arts, mathematics, and science and are adapted by professional bilingual translators. Given this, can the differences observed in this research be considered the minimum level of differences that might be observed in other tests for which professionals do not develop and translate the tests? The second challenge is whether the qualities of the data from the SAIP are adequate for making meaningful and valid inferences. For example, do the tests provide accurate measures of competencies in each of the three content areas? If separate measures of competencies in the two languages are not accurate—in other words, if they have high standard errors of measurement—then statistical comparisons of equivalence will not be accurate.

Comparisons of equivalence often use statistical estimates of overall competency in the substantive area. If this estimate of competency is not accurate, then investigations of equivalence are jeopardized. Another quality of assessment data is the degree to which each sample is representative of the language group as a whole. In the SAIP research, the tests tended to be long; therefore, the scores had good measurement accuracies. The two language samples were representative samples of 13- and 16-year-olds from each language group. Therefore, the two major challenges in data construction were met in this research.

Many other challenges were not met during the first phase of the study. One of the first challenges in high-inference research is the degree to which the findings can be attributed to the hypothesized sources of differences. From basic knowledge of research design, we know that we cannot infer causal relationships among variables unless we account for all possible alternative explanations, often through experimental designs. In construct comparability research, to what extent can we infer that the differences we identify are due to translation errors, differences in the two language versions of tests, or different language conceptualizations of constructs? Is it possible for the language differences to be confounded by other factors such as cultural differences, curricular differences, and education resource differences between the two groups? The statistical identification of differences does not provide answers to these questions. The next subsection describes additional data construction efforts to examine the sources of differences between the two language versions of tests and elaborates on the need for multiple sources and types of data to address these research questions.

Multiple Modes of Constructing Data to Address Research Questions

Four different approaches (Figure 25.1) were taken to examine the sources of differences identified in this psychometric research: judgmental review, replication, think-aloud protocols, and experimental design. Each of these approaches provided different types of evidence in support of or against the idea that performance differences were due to language differences between the two language versions of tests. All of the statistical comparisons were conducted for both the 13- and 16-year-olds.

This replication approach was expected to provide further evidence of a language differences interpretation if the differences were replicated for both age groups. Replication studies are a common way of verifying and validating findings in research. In our research, the assessments were administered to the two age groups not with replication purposes in mind but rather for the convenience of using the same assessment for two groups as well as for the examination of growth in learning from one age group to another. Because our study was based on assessment data that had already been collected as a national survey of achievement, we decided to take advantage of this feature of the data.

The evidence regarding language differences between the two language versions of tests cannot be determined without a review of the test items by bilingual experts. The judgmental review approach provided such evidence. We did not consider any alternative approaches to the judgmental reviews because this is the only way of examining linguistic comparability of test items, with possible variations in how the judgmental reviews are actually conducted. Even when language differences are identified by judgmental reviews, it is not certain whether these differences lead to performance differences between the two groups or whether these are the sources of differences identified by the statistical methods.

The think-aloud protocols approach allowed us to examine students’ thinking processes as the students responded to the test questions and to examine whether the language used in the test items may have affected these thought processes. There are several methods for examining students’ thinking processes, including examining student concept maps and interviewing students after they complete test questions. These methods, among other possible methods, may reveal information about how students interpret test questions and use information from test questions. Given the purposes of our research, the think-aloud protocols approach had some advantages over these methods. It allowed us to document the thinking processes as the students took the tests, ask students questions that targeted their understanding of the test questions, and examine whether any aspects of the test questions helped or hindered students’ ability to answer the questions. All of these approaches led to further insights about sources of differences yet lacked support for causal interpretations.

Figure 25.1 Four approaches to identifying sources of DIF: judgmental reviews, replications, think-aloud protocols, and experimental design.

The experimental design approach was used to test a set of hypotheses, developed using findings from the other approaches, regarding what language aspects of test items may be the sources of differential item functioning (DIF). The experimental design approach is the only way of establishing causal relationships between linguistic differences and performance. Therefore, an alternative approach was not considered for exploring such a relationship. In these four approaches, four different types of data sources were assembled to identify DIF of the English and French versions of the “same” test. The four approaches are described in the following subsections.

Judgmental Review. In the absence of experimental design, research is an evidence-gathering effort to support or contradict interpretations. Therefore, once a tentative interpretation is derived from data at hand, additional sources of evidence to support that interpretation, or other possible explanations of that interpretation, are the focus of further explorations. Following the statistical analyses, we asked bilingual experts to conduct a judgmental review of test items in an effort to identify possible sources of statistically identified DIF and possible adaptation problems. There are many variations of judgmental reviews that could have been used. We could have varied the number of reviewers, the expertise of the reviewers (e.g., their proficiency levels in the two languages), whether they review all of the test items or only a subset, whether they review the test items independently or jointly, and/or how their judgments are summarized and used. We chose to use four bilingual French–English translators, who completed a blind review of all the items to identify potential sources of statistically identified DIF. The translators were fluent in both languages and had extensive teaching experience. The adaptation review process required not only the identification of differences in the two language versions but also judgments about the extent to which the differences were expected to lead to performance differences between the two language groups. Therefore, experience in teaching and familiarity with student thinking processes were also considered to be important characteristics of translators.

During this research phase, the data sources were the English- and French-language versions of test items. The interpretation model involved a comparison of meaning, structure, expression, format, and level of information provided to examinees in the two language versions of test items. Based on these comparisons, the reviewers created data that identified the level of comparability between the two language versions. The level of comparability ranged between 0 and 3. We used a rating of 0 or 1 to indicate minimal or no difference in meaning between the two versions, a rating of 2 to indicate differences in meaning between the two versions but not necessarily leading to differences in performance between the two groups, and a rating of 3 to indicate differences in meaning between the two versions that are expected to lead to differences in performance between the two groups.
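A minimal sketch of how such reviewer ratings might be aggregated across reviewers follows. The rating scale matches the one described above, but the ratings themselves, the use of the maximum rating per item, and the flagging threshold are illustrative assumptions rather than the procedure actually used in the study.

# Hypothetical ratings (0-3) from four bilingual reviewers for three items.
reviewer_ratings = {
    "item_01": [0, 1, 0, 1],
    "item_02": [2, 2, 1, 2],
    "item_03": [3, 2, 3, 3],
}

for item, ratings in reviewer_ratings.items():
    # One possible aggregation rule: take the most severe (maximum) rating.
    overall = max(ratings)
    # Ratings of 2 or 3 indicate differences in meaning between the two versions.
    flagged = overall >= 2
    print(item, overall, "adaptation-related difference" if flagged else "no meaningful difference")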

This judgmental review phase of the study helped to identify those items that clearly had adaptation problems and those that had differential meaning, content, or format in the two languages. In the SAIP study, 38% to 100% of the DIF items in the three content areas were identified as having adaptation-related differences. The associations between these differences and the statistically identified differences cannot be established as causal; therefore, this additional step in examining comparability of constructs between the two language versions of tests gets the researchers one step closer to identifying sources of differences but not with certainty. Even with this uncertainty, the judgmental review phase is necessary to identify sources of DIF. Without this step, it would not be possible to tell whether there were language-related differences in the two language versions of the test items, even though we did not know with certainty whether these differences were the sources of the psychometric differences identified.

Replication. In construct comparability research, one way of finding evidence to support the interpretation that DIF arises from adaptation differences is by examining whether similar levels of psychometric differences were identified between the two language versions of tests for other groups of students who took the tests; in other words, replication of DIF findings is sought in several samples. We implemented the replication by comparing the findings for two age groups (13- and 16-year-olds) receiving the same sets of test items. In our construct comparability research, all of the reading items showing DIF were identified as having adaptation-related differences. These could be differences in the difficulty levels of vocabulary used in the two languages, differences in the clarity of the test questions, or changes in the meaning of the test questions, among many other types of differences. The results of the replication study are summarized in Table 25.1. In the reading comparisons, of the 4 DIF items identified as having adaptation-related differences for the 13-year-olds, 3 were also identified as showing DIF for the 16-year-olds. In mathematics, of the 17 DIF items identified as having adaptation-related differences for the 13-year-olds, 9 were replicated for both age groups. In science, 28 of the DIF items were interpreted as having adaptation-related differences for the 13-year-olds. For the 16-year-olds, 22 of the DIF items were interpreted as having adaptation-related differences, and 17 of these were the same as the items identified as DIF for the 13-year-olds. The age replication component of this research provided further support for most of the DIF items interpreted as being due to adaptation differences.
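The chapter does not detail the statistical procedure used to flag DIF items. One widely used option is the Mantel-Haenszel procedure, sketched below with fabricated counts; each stratum pools examinees with the same total score, and a common odds ratio far from 1 suggests that the item functions differently for the two language groups.

import math

# Fabricated counts per total-score stratum for one item:
# (ref_correct, ref_incorrect, focal_correct, focal_incorrect),
# where "ref" is the reference group and "focal" the comparison group.
strata = [
    (40, 20, 30, 30),
    (55, 15, 45, 25),
    (70, 10, 60, 20),
]

num = 0.0  # sum over strata of ref_correct * focal_incorrect / stratum_total
den = 0.0  # sum over strata of ref_incorrect * focal_correct / stratum_total
for a, b, c, d in strata:
    t = a + b + c + d
    num += a * d / t
    den += b * c / t

alpha_mh = num / den                    # common odds ratio across strata
mh_d_dif = -2.35 * math.log(alpha_mh)   # ETS delta scale; values near 0 indicate negligible DIF

print(round(alpha_mh, 3), round(mh_d_dif, 3))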

Think-Aloud Protocols. A more direct way of determining whether adaptation differences caused DIF is to actually look at how examinees interpret test questions, how they use information from test items, and whether there are differences in these respects between the two language groups. The next phase in this research used the think-aloud protocols to examine student thinking processes during test taking and examined whether these processes were similar for the two language groups (Ercikan et al., 2004). The think-aloud protocols were defined as structured interview protocols that encourage examinees to think aloud, talk about their interpretations of test questions, and articulate solution strategies they use as well as difficulties they are having as they respond to test questions. In this research, participants consisted of two groups of 13-year-olds: a sample of 36 English-speaking students and a sample of 12 French-speaking students from schools in a large urban center in British Columbia.

Table 25.1 Replication of DIF Items Across Age Groups and Judgmental Review Ratings

Content Area              Judgmental Review Rating   13-Year-Olds   16-Year-Olds   Common Across 13- and 16-Year-Olds
Reading (22 items)        0–1                        0              0              0
                          2                          3              3              2
                          3                          1              4              1
Mathematics (125 items)   0–1                        30             25             17
                          2                          11             9              5
                          3                          6              6              4
Science (144 items)       0–1                        24             27             18
                          2                          17             12             10
                          3                          11             10             7

NOTE: Judgmental review rating 0–1: no or minimal difference in meaning between the two versions; 2: clear differences in meaning between the two versions that might not necessarily lead to differences in performance between two groups; 3: clear differences in meaning between the two versions that are expected to lead to differences in performance between two groups.

The think-aloud protocols consisted of a set of questions that test administrators posed to participants on completion of each mathematics or science item. The questions were intended to tap four themes: (a) participants’ understanding of the intent of each mathematics/science item, (b) the steps that participants took to answer the item, (c) the reasons for selecting the answer that they chose, and (d) the aspects of the item that facilitated and hindered the problem-solving process. For each item, an interviewer instructed participants in a videotaped session to read the question and then verbalize their thought processes as they attempted to answer the item. If any information was not evident from the spontaneous thinking out loud, then the interviewer explicitly asked participants to provide further information after they had completed the question. Once a full understanding of participants’ thought process was obtained, participants were instructed to proceed to the next item and so on until the protocols were completed. Even though the primary goal of the study was to determine whether data from the think-aloud protocols supported the hypothesized source of DIF, students were not prompted to confirm or deny certain possible characteristics of items. For example, if difficulty of vocabulary was the hypothesized source of DIF, students were not asked whether they found a specific word hard to understand or whether they were familiar with it. Instead, students were prompted to respond to the same set of questions on the think-aloud protocols independent of the hypothesized source of DIF. This was done to minimize any kind of bias that might be created by the researchers administering the think-aloud protocols. In fact, the researchers were not aware of the hypothesized sources of DIF during the data collection (Ercikan et al., 2004).

The think-aloud protocols data provided support for language differences as sources of DIF for 7 of the 20 items, 6 of which were hypothesized to have language differences as a source of DIF. For the remaining items, the think-aloud protocols did not provide supporting evidence for the hypotheses. This was not necessarily because these hypotheses were not reasonable; instead, it might have been because our think-aloud protocols did not induce the kind of responses from students that would support these hypotheses or because of the limitations in our sample of students.

The construction of the data source during the think-aloud phase of the study was a very time-consuming effort. It required approximately 1 hour of interviewing and videotaping for each participant (a total of nearly 50 hours). Recordings of the interviews were transcribed by hired individuals and were reviewed for accuracy by members of the research team. The transcriptions took approximately 5 hours for each interview for a total of more than 200 hours of transcription time.

The transcriptions of interviews constituted the data source in this research. The resulting text file was then used to extract data relevant to our research questions using an interpretation model, that is, by coding different qualities of student think-aloud transcriptions. The focus of the coding was to determine the following information for each item in the protocols: (a) whether or not students answered the question correctly, (b) students’ understanding of the meaning of the question, (c) whether or not students found the question to be difficult to answer, (d) what aspects of the question were useful for solving the problem for students, and (e) what aspects of the question students found to be confusing or difficult to understand. Sample responses from the interviews are presented in Figure 25.2. Persons who spoke both French and English coded the data. Organization of the data for analysis was facilitated by the use of the NVIVO 2.0 qualitative data analysis computer program (QSR International, 2003).
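The kind of data that such coding can yield is sketched below. The coded segments, code labels, and the comparison by language group are invented for illustration and do not reproduce the study’s actual coding scheme or results.

from collections import defaultdict

# Hypothetical coded think-aloud segments: (language_group, item, code).
# The codes loosely follow the five coding foci described above.
coded_segments = [
    ("English", "item_07", "confusing_aspect"),
    ("French",  "item_07", "confusing_aspect"),
    ("French",  "item_07", "found_difficult"),
    ("English", "item_12", "helpful_aspect"),
    ("French",  "item_12", "understood_meaning"),
]

# Count each code by item and language group.
counts = defaultdict(int)
for group, item, code in coded_segments:
    counts[(item, code, group)] += 1

# Compare, per item and code, how the two language groups were coded.
for item in ("item_07", "item_12"):
    for code in ("confusing_aspect", "found_difficult"):
        e = counts[(item, code, "English")]
        f = counts[(item, code, "French")]
        print(item, code, "English:", e, "French:", f)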

The think-aloud protocols approach has proven to be useful in identifying sources of DIF not as a preferred method but rather more as a complementary method to other methods such as judgmental reviews and statistical methods. Some of the evidence obtained from using this approach, in support of the hypothesized source of DIF, could not have been obtained using either judgmental reviews or statistical analyses.

Student’s Understanding of the Meaning of the Question

I47: So what was this question actually asking you to do?
P47: Um, just find out how much it would cost for a 10-hour repair.

I32: What was this question asking you to do?
P32: Find a pattern in the cost.

I3: What do you think this question was asking you in your own words?
P3: How much the clock hand moved, or how much time is in between 3:25 and 3:45.

I10: So what were you supposed to do with this question in your own words?
P10: Figure out how many cities were under 25 degrees Celsius.

I50: So what was this question actually asking you to do?
P50: How much it would cost to purchase T-shirts for a variety show.

What Aspects of the Question Were Useful for Solving the Problem?

I66: And were there any hints in the question that helped you to solve it?
P66: The word “marine”.

I53: Was there anything in there that helped you?
P53: Well, just, um, “to get the same results” is in the question, that made it easier.

I36: And what helped you to figure out that that was the final answer?
P36: Uh, the chart helped.

I37: How did it help?
P37: It shows what days he worked, when he worked, and how long.

I112: What helped you to figure out the answer?
P112: The picture.

What Aspects of the Question Did the Student Find Confusing or Difficult to Understand?

I28: So are there any words in there that you don’t know?
P28: Well the “line-ear” equation.

I83: What made it difficult?
P83: Describe the oxygen cycle in nature. Use a labeled diagram if you wish. I haven’t done this yet in classes or anything, so I’m not really sure what the cycle is.

I48: What made it difficult?
P48: Um, well, ’cause we had to calculate the, add every additional 400 kilometers. And we had to figure out that you have to divide it by the 2,300 kilometers that he had left, and it took a while to figure that out.

I74: What made it hard?
P74: Well, I don’t really know what cattails are, so I just looked at the diagram and I eliminated the least related ones, then I guessed between A and B.

I113: What is it about this question that is difficult to understand?
P113: Well, it says there are many variations among individuals of “variate”.

Figure 25.2 Sample responses from the interviews.

Experimental Method. The three approaches just described—replication, judgmental review, and think-aloud protocols—together provided insights about what the sources of differences may be. Yet these insights are tentative until they are tested formally. The fourth approach in identifying sources of DIF was an experimental study to test a formal set of hypotheses. These hypotheses were developed based on the replications, judgmental reviews, and think-aloud results and are related to whether different vocabularies, item formats, and language styles used in the test items affected student performance. In constructing data for this phase, different qualities of test items that were suspected of being the sources of differences between the two comparison groups were manipulated, and two versions of test items were administered to randomly equivalent groups of students.
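As a rough illustration of the underlying logic of random assignment to item versions, the sketch below forms two randomly equivalent groups from a hypothetical class and compares their mean scores. The student pool, the fabricated scores, and the simple mean comparison are assumptions for the sake of the example, not the study’s actual design or analysis.

import random
import statistics

random.seed(1)

# Hypothetical pool of 40 students.
students = [f"student_{i:02d}" for i in range(1, 41)]

# Random assignment of students to the two item versions produces
# randomly equivalent groups.
random.shuffle(students)
group_a, group_b = students[:20], students[20:]

# Fabricated item scores (0-4): group_a receives the original item wording,
# group_b the manipulated wording.
scores_a = [random.randint(1, 4) for _ in group_a]
scores_b = [random.randint(0, 3) for _ in group_b]

mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
combined_sd = statistics.stdev(scores_a + scores_b)  # rough scaling only

# A simple standardized difference between the two versions; a formal
# analysis would use an appropriate statistical test.
effect = (mean_a - mean_b) / combined_sd
print(round(mean_a, 2), round(mean_b, 2), round(effect, 2))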

DATA SOURCES AND DATA IN LOW-INFERENCE RESEARCH

In this section, the relationship between data source and data is exemplified for low-inference research, and possibilities arising from an existing data source are articulated by drawing on materials collected during a study in a split sixth- and seventh-grade classroom taught by Wolff-Michael Roth. Data are what researchers use in support of the claims made about a situation under investigation. Thus, the set of videotapes shot in this classroom constitutes a source from which many different kinds of data can be constructed.

Construction of Data Sources

The study was designed to investigate knowing and learning when a science classroom is conceived of as a design studio and where, because of student–student interaction, one could investigate the relationship between individual learning and collective learning. For most of the time, students designed machines, encompassing the entire process from envisioning machines to completing a prototype model. The research team included the teacher (Roth), two graduate research assistants (gRA1 and gRA2), and a nonstudent research assistant (RA).

All lessons of the 4-month curriculum were recorded continuously using two cameras (operated by gRA1 and RA). During whole-class activities, the second camera served as a backup and was used to ascertain whether student utterances were recorded as completely as possible even when a student spoke quietly. In addition, two audiotape recorders were used to capture (a) students when they presented their designs in whole-class sessions, (b) the teacher’s interactions with students during small-group work, and (c) interviews conducted in the setting by gRA2 while students worked independently on their design projects. Although it is never possible to “capture everything,” capturing as much detail as possible in the course of a design experiment (Brown, 1992) allows researchers to identify salient factors that mediate learning at the level of the individual, small groups, and the whole class. In subsequent investigations of this type, therefore, we operated three cameras and made use of additional observers, allowing us to record half of the students during the learning process (e.g., Roth & Duit, 2003). In addition, we recorded the teacher continuously, implemented a massive interview schedule paralleling the classroom research by interviewing nearly half of the 26 students five times for 1 hour, and administered five instruments assessing knowledge, views, and attitudes. Roth could have used standardized assessments of scientific understanding. Because Roth was more concerned with ecological validity, however, he had rejected them already in his grant application, opting instead for the construction of test items and test formats that would allow him to assess in ways where the students themselves had the sense that they had exhibited all of their understandings.

In videotaping, we decided to interfere as little as possible with ongoing activities. Students were not asked to remain at one place or to reduce the noise they made, although this would have improved video and sound quality. We felt, however, that making such changes would have been counterproductive to observing the knowing and learning as they occur in a setting conceived of as a design studio (Roth, 1998a). Ethnographic observations by the two gRAs were documented in field notes and as photographs. Although our ethnographic observations are generally unstructured, in the current case we had decided to confirm and disconfirm qualitative hypotheses from an earlier investigation about how knowledge comes to be shared in a classroom community (Roth, 1998a). We used the term “confirmatory ethnography” for this part of the work. After each lesson, team members debriefed the teacher (Roth), and these debriefings were also documented as observational field notes. Based on these observational field notes and on our experience during transcription, we prepared theoretical field notes that also became part of the data corpus. These theoretical field notes drove further collection of data sources.

The entire curriculum development effort, all curricular materials, and the artifacts used during teaching became part of the data sources. All curriculum planning meetings and interviews were recorded. Further ethnographic fieldwork was conducted in class during students’ other courses, and informal interviews were completed with the students’ teachers.

Prior to and at the end of the unit, we tested students in a number of ways. First, students prepared a semantic map of all the ideas they associated with simple machines. Second, students responded to questions about three instances that illustrated the application of levers, pulleys, and inclined planes. Third, the pretest phase included interviewing 13 (of 26) students about their ideas on simple machines, requesting elaborations of their written answers, and observing their qualitative and quantitative responses to problems. The students were selected to represent the class in terms of gender and grade level and had to be willing and able to articulate themselves in conversations with unfamiliar people. Fourth, the posttest was designed in the same way except that we invited pairs of students to talk about their answers on the test and about three practical problems. Fifth, all students participated in these debriefing conversations. For the three practical situations, students had available the necessary artifacts to model solutions to the following questions: “How would you use a pulley to decrease the effort?,” “How would you use two logs to get a car out of a ditch?,” and “How would you set up a ramp to bring a heavy load to a higher ground?” With many student groups, this led to situations that resembled conversations among students rather than interviews. By debriefing students in pairs, we hoped to address in part the problematic issue of ecological validity. During the lessons, emphasis was placed on material and discursive practices embedded in a social matrix, so we attempted to increase ecological validity by reproducing this social situation to some extent through paired interviews for the posttest. Some students agreed that debriefing in groups better reflected their learning in this class; for example, one student remarked, “It’s much better with a partner. We worked on most stuff together, and although you sometimes argue, it’s easier with two.” All written work became part of the corpus, as did the videotaped debriefings and their transcripts. We explicitly avoided using standardized examinations because these generally do not test cognition of specific domains extensively and in-depth. Furthermore, the outcomes of such tests are frequently used for political purposes in the spirit of “accountability” (e.g., Hodgkinson, 1995) and may have detrimental effects on classroom processes and learning (e.g., Darling-Hammond, 1994; Rodriguez, 1997).

This particular research project followed on the heels of two similar design experiments in the same school. We had developed a good sense of the kind of data sources that were appropriate for each grade level. This was not the case when we started. For example, in a fourth-grade classroom, it turned out that the students in general, and the girls in particular, were rather timid and hesitated when expressing themselves. We had to make a choice about how to get students to express their knowledge and understanding prior to the beginning of our intervention. Therefore, we had to discard our plans to interview students individually. Although group interviewing has its disadvantages given that one cannot probe every student in-depth, we opted to try this technique with groups of five to seven students.

In our seventh-grade study, we ended up with an extensive database involving considerable costs, both financial (three RAs) and temporal (four times a week, approximately 2–3 hours in the school and 2 hours to commute to the research site, for 4 months). The costs were incurred in part because of concerns with making the database sufficiently extensive that any emerging hypotheses could be tested even after completion, in part for triangulation purposes (Lincoln & Guba, 1985), in part because access to suitable sites to conduct such larger studies is often limited, and in part because the curriculum development and teaching placed additional burdens on a researcher accomplishing his regular professorial duties.

Preparing for Data Construction

Low-inference research is conducted to find out how people (here students and teachers) make sense in and of their lifeworlds, how their everyday ways of acting are patterned (the structures of their practices), and why they do what they do (the grounds for their actions). In new kinds of situations not (or seldom) studied before, this requires researchers to collect materials from which the sensemaking of research participants can be inferred. What is relevant or interesting emerges from a dialectic tension between the materials at hand and the researchers' interest.

The entire team transcribed videotapes and audiotapes to make conversations available, as text for analysis and as feedback to direct further curriculum design and planning, as quickly as possible. Therefore, we chose to prepare them in a "quick and dirty way" rather than spending too much time in trying to make them suitable for publication purposes. These first transcriptions contained speaker names, spoken text, and some transcriber commentary (Figure 25.3). As is typical for design experiments (Brown, 1992), observations and interpretations during the process of building the data corpus directed subsequent design of teaching materials, social configurations, physical arrangements, temporal organization of activities, and time allotments.


WMR: there was a question? (2)
Shamir: but, when we were pulling on that other one, we were pulling, we were just pulling the banister string
WMR: no, you were pulling here, this pulley (open end)
Shamir: yea, but it was attached to the banister, if we pulled really hard then the (.)
Don: i know, i //know]
Shamir: if we] pulled the hardest, then the banister would fly
Daniel: we were pulling where the pulley is ((WMR walks to the block and tackle))

Figure 25.3 This example of a first transcription contains rough estimates of pauses (indicated by single parentheses, e.g., "(2)" or "(.)" [= less than 0.2 second]), overlaps (indicated by slashes and brackets, e.g., "//know]" was overlapped by "if we]"), and comments (enclosed by double parentheses "((" and "))"). WMR is the teacher.
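Where transcripts of this kind pile up, a small script can help tally how often such markers occur per speaker. The sketch below is only illustrative; the line format, regular expressions, and function name are our own assumptions rather than part of the project's toolchain.

```python
# Minimal sketch (our assumption, not the project's tooling): count the
# first-level transcript markers described in Figure 25.3 per speaker.
import re

LINE = re.compile(r"^(?P<speaker>\w+):\s*(?P<utterance>.*)$")

def summarize(transcript_lines):
    """Count timed/micro pauses, overlap onsets, and transcriber comments."""
    counts = {}
    for raw in transcript_lines:
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip lines that are not speaker turns
        speaker, utterance = m.group("speaker"), m.group("utterance")
        c = counts.setdefault(speaker, {"pauses": 0, "overlaps": 0, "comments": 0})
        c["pauses"] += len(re.findall(r"\((?:\d+(?:\.\d+)?|\.)\)", utterance))  # "(2)", "(3.4)", "(.)"
        c["overlaps"] += utterance.count("//")                                  # "//know]"
        c["comments"] += len(re.findall(r"\(\(.*?\)\)", utterance))             # "((...))"
    return counts

example = [
    "WMR: there was a question? (2)",
    "Don: i know, i //know]",
    "Daniel: we were pulling where the pulley is ((WMR walks to the block and tackle))",
]
print(summarize(example))
```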

To identify relevant and interesting issues, formal analyses were conducted throughout the research process. Formal data analyses were conducted in sessions with two to four members of the research team, according to precepts of interaction analysis, on the basis of videotaped data sources (Jordan & Henderson, 1995). The videotapes were played, stopping and replaying them as often as needed and whenever a team member felt that something remarkable had happened. This event was then discussed until the participants felt that nothing more could be said. In this way, a brief event (1–5 minutes) may lead to a 2-hour research session. (At the same time, we also ascertained the accuracy of the transcript as to the utterances, overlaps, emphases, etc.) When the researchers deemed this event to be interesting, all data sources were then searched to see whether it was similar to other situations and, therefore, represented a class of events. These analysis sessions were taped and recorded in field notes, and a flip chart was used to allow a permanent record of notes and drawings to be made during the meetings. Tapes, field notes, and flip charts were added to the existing data sources. In our experience, what different members of a team see in the data sources and how what they see is patterned may be quite different initially, but as the team works together, both what they see and how they see it become increasingly similar (on this point, see also Schoenfeld, 1992). This can be used to an advantage. Initial differences lead to the identification of different patterns, whereas the increasing similarity in perceiving particular events can then be used to individually analyze data sources and thereby increase a team's efficiency.
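Searching an extensive corpus for events similar to one flagged as interesting presupposes some catalog of episodes. The following sketch is our own construction, not the team's actual tooling; the Episode fields, tags, and sample entries are invented for illustration.

```python
# A hypothetical episode catalog for locating events similar to one flagged
# as "interesting" during an analysis session (our construction).
from dataclasses import dataclass, field

@dataclass
class Episode:
    tape: str                                # which videotape the episode comes from
    start: str                               # timestamp, e.g., "00:14:32"
    end: str
    tags: set = field(default_factory=set)   # analytic labels added in sessions
    note: str = ""                           # brief description from session notes

catalog = [
    Episode("lesson-07", "00:14:32", "00:16:05", {"artifact", "whole-class"},
            "teacher enters the stage during a presentation"),
    Episode("lesson-09", "00:03:10", "00:05:44", {"artifact", "small-group"},
            "students negotiate a pulley diagram"),
]

def find(catalog, required_tags):
    """Return episodes carrying all of the required analytic tags."""
    return [e for e in catalog if required_tags <= e.tags]

for e in find(catalog, {"artifact", "whole-class"}):
    print(e.tape, e.start, e.note)
```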


In such meetings, we frequently identified some events as "interesting." We then transformed transcripts to include information that was relevant to the construction of data. For example, the transcript from the lesson excerpted earlier included drawings of the diagrams that the teacher and students had made on the chalkboard while discussing the outcomes of an earlier tug-of-war in which a pulley system had been used, leading to the fact that the teacher had won the competition against 20 students (Figure 25.4).

EVOLVING INTERESTS AND CONSTRUCTION OF DATA

What researchers need as data to support a claim depends on the research question, which itself is a function of the current status of the field or discipline. In the following subsections, several examples are used to show how different interests and questions led to the construction of different types of data from the raw and transformed materials. The videotapes and first transcripts constituted the sources for the construction of data.

Example 1

The researchers knew from the literature on workplace studies that artifacts and representational tools (e.g., whiteboard, chalkboard, computer screen) used in meetings mediated the interactions and content of conversations. In education and learning sciences, there had been no studies of how the content and form of classroom discourse were influenced by different combinations of artifacts (e.g., overhead transparencies, physical models), group size, and physical arrangements in and of the classroom. This tentative focus led to the identification of different interactional spaces, participant roles, and levels of participation in classroom conversations and, concomitantly, to different discursive forms and content. Three hunches (hypotheses) emerged to become more salient than others during the initial (collective) study of the videotapes. First, the artifacts appeared to have important functions in maintaining and sequencing conversations. Second, depending on the situation and the role of participants, the artifacts seemed to serve as resources for students' sensemaking. Third, each of the different activity structures of the curriculum supported different dimensions of participation in conversations. Why these three were more salient and more interesting than the others is a question that we cannot answer with certainty, but greater salience of data and research foci involves a dialectical process (Roth, 2005). In the current situation, this dialectical process likely involved the relationships among the existing literature in workplace studies, the absence of similar studies in education, and the researchers' existing predilection for group processes so that the new interest in the study of the interaction of physical arrangements, group size, and nature of representational artifacts emerged.

[Chalkboard diagrams, labeled A and B, not reproduced]
MR: and like this? (3.4) is that what you mean?
Jenni: Yeah
Sham: yeah
MR: how is that different from this //one?]

Figure 25.4 This example of a second-level transcription contains the length of pauses measured to 0.1-second accuracy ("(3.4)"), overlaps ("//one?]"), a gloss of the action ("[ends drawing]"), and the diagrams currently on the chalkboard.

In support of any of these hunches, developed during the analyses of individual episodes, two kinds of data were needed. First, maps that showed how individuals were positioned with respect to one another and the focal artifact were required. Using a drawing program, we generated a map of the classroom, onto which we laid a new transparency sheet for each lesson, and felt markers were used to track student positions throughout each lesson. Second, transcripts that showed how the interactions unfolded in the situations described by the different maps were needed. Given the extensive nature of the database described, the researchers had considerable sources for testing the hunches. For example, Figure 25.5 shows the teacher's movement in the course of one whole-class discussion concerning an artifact that three students had constructed. The gray-shaded area shows the extent of the "stage," and the dark rectangle shows the artifact. Closer analysis led to the hypothesis that individuals positioned in the gray-shaded area dominate the conversation. To substantiate or refute this hypothesis, three types of transcripts were selected, distinguished by different positions the teacher had with respect to the presenting students (Roth, McGinn, Woszczyna, & Boutonné, 1999). Thus, when the teacher was in the back of the classroom (Positions 1, 4, and 8), his influence on the conversation was as negligible or as important as that of any other student called on by the presenting students, who also chaired the session (Figure 25.5). The teacher's influence on the conversation was greater when the teacher was "in the wings" (Positions 3, 6, and 10). His impact on the conversation was as dominant as that of the presenting students when he "entered the stage" defined by the gray-shaded area (Positions 2, 5, 7, 9, and 11).
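Once turns have been coded for speaker and for the teacher's location, a hunch of this kind can be checked against the whole corpus by tabulating turn shares per zone. The sketch below is illustrative only; the three-zone coding, the column names, and the toy data are our assumptions, not the original analysis.

```python
# Illustrative tabulation of turn shares by teacher zone (assumed coding).
import pandas as pd

# Each row: one conversational turn, the speaker category, and where the
# teacher stood ("stage", "wings", "back"), coded from the classroom maps.
turns = pd.DataFrame({
    "speaker": ["teacher", "presenter", "student", "teacher", "presenter",
                "teacher", "student", "presenter", "teacher", "student"],
    "teacher_zone": ["stage", "stage", "back", "back", "wings",
                     "wings", "stage", "back", "stage", "wings"],
})

# Share of turns taken by each speaker category within each teacher zone.
share = (turns.groupby("teacher_zone")["speaker"]
              .value_counts(normalize=True)
              .rename("turn_share"))
print(share)
```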

[Classroom map tracing the teacher's Positions 1–11 relative to the presentation area; scale bar: 2 m]

Figure 25.5 During a typical whole-class conversation about a student-designed artifact, the teacher had positioned himself in three distinct locations: directly next to the presenting students, to the side ("in the wings"), and in the back of the classroom. Depending on his location with respect to the focal area, the teacher influenced conversations to different degrees.

From these analyses emerged new interesting hypotheses. If participation in the conversation changes as a function of the teacher's distance from the gray-shaded area, would the same be the case for students other than the presenters? This new question required the researchers to search their entire database for all of those situations where students other than the presenters—or, in small-group work, students other than those in the group—entered the stage. These episodes could then be analyzed in terms of the contributions or impacts of the respective student(s) on the ongoing conversation. Here again, what was to become data was defined by the ongoing analyses, leading to a search through the database (i.e., the source) to identify what had been framed as the data relevant to the current question.

Example 2

A second set of issues concerning the nature of the data evolved from questions about achievement. The educational community in general and researchers of cognition in particular are concerned with questions such as the following: "What did students achieve as part of an innovative curriculum?" and "How did identifiable groups of students (in this class) achieve with respect to each other?" These are low-inference questions so long as the analyses are conducted for this classroom and are not generalized toward sixth- and seventh-grade students in general, but they are high-inference questions if the results from this class are generalized to a larger population of students doing the same curriculum. Although achievement might not have been the primary interest of the researchers, who were more concerned with levels of student participation and the interaction of processes at the individual and collective levels, the collective interests of their discipline were also addressed in their project. Past research on national and international (e.g., Third International Mathematics and Science Study [TIMSS]) achievement levels frequently identified grade and gender differences in science achievement; older students generally are higher achievers than younger students, and male students generally outscore female students in science. To assess whether there were differences in achievement across gender and age levels in their study, the researchers needed to search their database again, but this time for a different type of data. They were aware that any statistical calculations intended to test these general trends would have to be interpreted cautiously because of the small sample size.

The researcher had administered written tests and set up various tasks in which students orally responded to a variety of practical tasks. The raw data would consist of how each student did on the different tasks, coded as either 1 (correct) or 0 (incorrect). The relevant data sources were the written responses on the posttest and the transcripts of the videotaped interview sessions. Cautioning readers to keep the low power of the statistical tests in mind, we reported the results of a 2 (boys or girls) × 2 (sixth grade or seventh grade) multivariate analysis of variance (MANOVA), with students' written and oral posttests as dependent variables. No statistically significant main effects for gender (Wilks's lambda = .998, p = .98) or grade (Wilks's lambda = .954, p = .62) and no significant interaction (Wilks's lambda = .826, p = .15) were found (Roth et al., 1999). If the classroom was taken to constitute a sample from a population of sixth- and seventh-grade students doing this special design-centered curriculum, then these results would imply that the frequently occurring gender and age differences do not exist in the current classroom after instruction. Alternatively, if there are true differences in the population, then the statistical power in this study may have been too low to detect them.
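For readers who want to see the shape of such an analysis, a 2 × 2 MANOVA of this kind could be run today with statsmodels roughly as follows; the data frame, its column names, and the simulated scores are invented for illustration and are not the study's data.

```python
# Illustrative 2 (gender) x 2 (grade) MANOVA on simulated posttest scores.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n = 26
df = pd.DataFrame({
    "gender": rng.choice(["girl", "boy"], size=n),
    "grade": rng.choice(["6", "7"], size=n),
    "written": rng.normal(70, 10, size=n),   # written posttest score (invented)
    "oral": rng.normal(65, 12, size=n),      # oral posttest score (invented)
})

# Two dependent variables; two crossed factors and their interaction.
mv = MANOVA.from_formula("written + oral ~ gender * grade", data=df)
print(mv.mv_test())  # reports Wilks' lambda (among other statistics) per effect
```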

The data in support of this tentative claim are the results of having used a particular statistical model; the inference is high because it makes claims about the curriculum in general. On the other hand, one could have simply reported the posttest means by gender and grade level, in which case the level of inference would have been low, but then so would have been the generalizability. That is, these statistics were used to make inferences about populations, about boys and girls in general, rather than being descriptive, in which case one would have simply reported the means for boys and girls and the different grade levels (i.e., descriptive statistics).

While assembling the raw data into a data table, the researchers noted something else that they had not considered before. Five of nine students identified by the school as cognitively or socially handicapped achieved in the top 30%. Although the research team did not pursue a possible hunch of the differential effect of the curriculum on students with different prior knowledge or achievement, this result, based on a simple tabulation of test scores by type of student, could have provided an interesting lead for future research. But another issue emerged for the researchers that was even more interesting given their predilection for, and knowledge of, situated cognition.

As the primary researcher repeatedly worked through the entire data set, he was struck by the variations in the responses that individual students appeared to give to structurally identical test items. This was interesting, especially in the context of the efforts of international testing consortia (e.g., TIMSS, Programme for International Student Assessment [PISA]). His hunch was that the results achieved in the research project could enlighten the educational community about the problematic nature of assessment formats in assessing knowing and learning not only in different language communities (as in the first section of this chapter) but also within more homogeneous groups. Having students respond to questions in a variety of circumstances, both before and after the unit, allowed an investigation of how the testing format influenced the students' answers, and therefore the inferences about their understanding and knowledge. If one assumed that knowledge is a property of the individual, such as in the high-inference examples used in the earlier part of the chapter, then this research points to the multidimensionality of knowledge even within the same domain. Alternatively, one may choose a different unit of analysis such as person-in-setting, as proposed by the educational psychologist Richard E. Snow, known for his work on aptitude–treatment interactions. In this case, multiple items or tests given to a student provide a sample of the student, whereas the same test given to many students provides a sample of the test (e.g., Corno et al., 2002). Supporting a claim about how different testing formats mediated student responses required a different kind of data from the kind used for achievement comparisons.

One piece of data was a table in which the response patterns of 13 students, interviewed during the pretest, were mapped in two conditions: an equal-arm balance that had continuously numbered distance markers on one side and was unmarked on the other side (Figure 25.6).

The data show unequivocally that (a) no individual student reused a strategy when the equal-arm balance was turned around to display numbered equidistant markers representing distances and (b) the sample changed in its entirety from using one set of strategies to another set (Roth, 1998b). There was no overlap in the two sets of strategies; if there was, it would have shown up as entries in the main diagonal. In the context of other studies, the researcher and his various collaborators came to the conclusion that interviews (Welzel & Roth, 1998) and better test instruments (McGinn & Roth, 1998) are not enough to arrive at suitable assessments of knowledge and understanding. These interviews alone did not allow us to take the analysis further, but subsequently additional data sources allowed us to show that students perceive the balances in different ways (i.e., they had what cognitive scientists call different "domain ontologies"), leading to different actions as well.

In this example, the analysis was qualitative in that it constructed categories of strategies on the basis of all responses to questions presented in two formats. The analysis was also quantitative in that it counted how many instances existed in each category (e.g., three students referenced locations on the marked lever but used trial-and-error procedures on the unmarked lever). The low-inferential nature of the study was not changed by the fact that the number of students using the same pair of responses was entered into a matrix; the counting itself did not make this a quantitative study because no inferences were made as to the frequencies with which students of this kind would respond on these tasks. That is, this is an example of low-inference research enacting qualitative and quantitative analyses.
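The kind of counting described here amounts to a cross-tabulation of strategy pairs. A minimal sketch, with invented strategy assignments and the category labels taken from Figure 25.6, might look like this; an empty main diagonal would indicate that no strategy was reused across the marked and unmarked conditions.

```python
# Illustrative cross-tabulation of strategy pairs (invented assignments).
import pandas as pd

students = pd.DataFrame({
    "marked": ["using formula", "referencing locations", "referencing locations",
               "matching weight/distance", "measuring"],
    "unmarked": ["estimating", "using trial and error", "using trial and error",
                 "guessing", "estimating"],
})

table = pd.crosstab(students["marked"], students["unmarked"])
print(table)

# Overlap between the two strategy sets would appear on the main diagonal,
# that is, the same category used on both the marked and the unmarked lever.
shared = set(table.index) & set(table.columns)
reused = sum(table.loc[s, s] for s in shared)
print("strategies reused across conditions:", reused)
```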

Example 3

Several years later, the researcher became interested in the question of how scientific representations were used in scientific communities to mediate face-to-face interactions. He returned to the original videotapes to look at episodes where students and the teacher were engaged in conversations over and about a variety of representations. Contrasting the standard assumptions in science education at the time, according to which conceptual knowledge is expressed in language, the researcher developed the following hunch: To account for the content and process of interactions in science classrooms—that is, the scientific conceptions that are brought to bear on a debate—the fundamental interdependence of gestures, perceptions, and speech had to be accounted for. This required data that provided readers with gestures and perceptually salient chalkboard diagrams in addition to the words normally reported by educators (Figure 25.7). This claim, suitably supported by the data, undermined a common practice in science education, that is, to concentrate on words alone when identifying and testing student conceptions (Roth, 1996).

[Figure 25.6 matrix: rows = strategies on the marked lever, columns = strategies on the unmarked lever; strategy categories: 1. using formula, 2. matching weight, distance, 3. crunching numbers, 4. referencing locations, 5. measuring, 6. estimating, 7. using trial and error, 8. guessing; cell frequencies not reproduced here]

Figure 25.6 The figure shows the frequencies for strategy pairs used on marked and unmarked levers. The data support the claim that (a) structure of artifact changes the strategies that individual students use to solve problems and (b) an entirely different set of strategies was used.

Figure 25.7 This example of a third-level transcription constituted the data used in an article making an argument that in understanding "conceptual" talk and thinking in science classrooms organized as linguistic communities, one must account for the fundamental interdependence of "hands, eyes, and signs."

Audio:
1.1. Shaun: You can have the banister, if that, if that pulley there, the pulley there, if that was on our side then, ahm.
1.2. WMR: This was, this [1] was on your side, because the class was pulling here [1], and I was pulling here [2].
1.3. Shaun: No, but if that, switch it around=
1.4. Sharon: =You were B.
1.5. Jon: //If you were B]
Video: [frames of the chalkboard diagram with positions labeled A and B; the bracketed numbers [1] and [2] mark the locations indicated during the talk]

This initial research on the reported variations of performance across test format and the role of gestures and perceptions gave rise to a deepening interest in how the human body mediates knowing and learning. Do gestures and other bodily movements accompany and mediate scientific and mathematical cognition? Questions of this kind lead to claims that require data very different from the data presented so far. For example, in an article published in the Journal of Pragmatics (Roth, 2000), evidence (data) was presented for three major claims. First, in the absence of scientifically appropriate discourse, students' gestures already pick out, describe, and explain scientific phenomena. Second, during the initial appearance of scientific discourse, deictic and iconic gestures precede the associated utterances. Third, as students' familiarity with a domain increases, scientific talk takes on greater importance and gestures begin to coincide with the talk. This required data that showed how gestures and discourse were coordinated in time and also how words and actions, which must be brought about by the body, constitute and make scientific concepts available to others in real time. For example, Figure 25.8 constitutes a piece of data of the type presented in this article and others on the topic. It shows how the student, while uttering "You can pull on here," moved his arm upward. Here he not only said something but also activated his muscles to move the arm upward, and the video also showed a slight backward movement of the upper body. That is, in this brief episode, the relevant cognition also involved a physical action and a bodily response to the change in equilibrium as the stretched left arm was raised upward along the diagram. Readers can also see how the hand moved up to the line that supports the pulley, the place where the "you" can pull, and which words are associated with a particular position of the hand. Without active perception, it would have been unlikely that the hand had been exactly on the diagram where it was supposed to be.
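Claims about gestures preceding or coinciding with speech ultimately rest on timing comparisons between coded gesture onsets and the onsets of the associated words. The following is a minimal sketch of such a comparison under our own assumptions about how the onsets were coded; the numbers are invented.

```python
# Illustrative computation of gesture-speech lag from coded onset times,
# in seconds from the start of the clip (invented values).
coded_pairs = [
    # (gesture_onset, word_onset) for utterances where both were coded
    (12.4, 13.1),
    (27.0, 27.9),
    (41.6, 41.7),
]

lags = [word - gesture for gesture, word in coded_pairs]
mean_lag = sum(lags) / len(lags)

# Positive lags mean the gesture preceded the associated word, consistent
# with the claim that gestures come first early in learning a domain.
print(f"mean gesture-to-word lag: {mean_lag:.2f} s")
print("gesture preceded speech in",
      sum(lag > 0 for lag in lags), "of", len(lags), "coded pairs")
```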

This research began an extensive investigation of gestures, on the one hand, and of the literature in psycholinguistics, psychology, anthropology, and applied linguistics (in education, there was virtually no research concerning the role of gestures in content learning), on the other. This research provided further evidence of the fact that students were using gestures before they could describe and explain phenomena in words and that when the words first emerged they tended to lag behind the corresponding gestures (Roth, 2003).

"You can pull on here" [utterance aligned with video frames of the gesture]

Figure 25.8 This example of a fourth-level transcription constituted the data for an article on linguistics, arguing that (a) as students' familiarity with a domain increases, scientific talk takes on greater importance and gestures begin to coincide with the talk, and that (b) gestures involve bodily actions that constitute the nonverbal aspects of scientific concepts. The arrows mark the phonemes where the images coincide with speech.

SIMILARITIES AND DIFFERENCES IN HIGH- AND LOW-INFERENCE RESEARCH

Research Question and Constructing Data

In high-inference research, a set of explicit research questions is the starting place for identifying data sources and constructing data. Examples might include "How do two mathematics instruction practices differ in their effectiveness?" and "Are there gender differences in the way females and males use computers in their mathematical learning activities?" In the high-inference research described earlier, the research started with a well-defined research question concerning the degree of comparability between English and French versions of the SAIP and the sources of differences. In low-inference research, on the other hand, data sources relevant to the general research question may be the starting place; therefore, the first step in low-inference research is the identification of data sources. Thus, the videotapes collected in the sixth- and seventh-grade classroom permitted the generation of hypotheses about the role of gestures in the construction of scientific knowledge about simple machines and the forces operating in them. Because people make their sense of what is going on available to one another during interactions (Schegloff, 1996), videotapes are an ideal data source and starting point for low-inference research.

Common in both types of research is the need for different modes of constructing data. In the high-inference research examples discussed in this chapter, the research had four different approaches, each of which required different kinds of data and different modes of data construction. The first was the data constructed through large-scale survey testing of a nationally representative sample of eighth-grade students using the SAIP. The second required a set of judgments from reviews of English and French versions of test items by bilingual experts. The third required data about student cognitive processes during test taking. The data sources were students' responses during the think-aloud process, and the data were the extraction of the relevant components of what students said in relation to the research questions. The fourth approach, similar to the first one, involved data about student performance on different versions of test questions. These different phases and data construction efforts were planned at the beginning of the research project.

In low-inference research, research questions do exist before researchers begin assembling the data sources, but they are framed more broadly and frequently in process terms. Examples might include "How do students make sense of tests?," "How do students make collaborative design activity work?," and "How does the language in whole-class interactions change when teachers change locations within the classroom?" Researchers then seek to identify patterns within the set of data sources they assemble and normally make claims that do not go beyond the particular set, although in some instances the set may be very large. (For example, a conversation analyst might look at more than 500 telephone interactions during 9-1-1 emergency calls.) The fundamental assumption underlying low-inference research is that each case expresses the concrete possibilities of acting and understanding in a particular culture; that is, any patterns identified are concrete realizations of general possibilities. Because low-inference researchers are frequently interested in how participants understand their situation, they may change their research questions while doing fieldwork so that the questions reflect participants' understandings.

Similar temporal shifts exist in the determination of the nature of the data and the interpretive model. In high-inference research, the nature of the data and the nature of the interpretive model are determined at the very beginning of the research. In low-inference research, the nature of the data and the nature of the interpretive model arise during the research process. In ethnomethodology, for example, researchers not only describe the ways in which people make sense of and act in their everyday situations but also must use the same (ethno)methods as their research participants for interpreting the data.

Research Process

In descriptions of research, details are important in low-inference research because researchers believe that all of these details may contribute to or interact with the research process. In high-inference research, on the other hand, the research is expected to be conducted in a predetermined uniform manner that guarantees the comparability of results across situations or, in other words, guarantees the generalizability of results. For example, if a set of tests is being administered, test administration procedures are expected to be uniform across classrooms or schools. Variations across different settings are almost guaranteed due to factors such as interactions between test administrators and students, students' needs, and weather. The research design in high-inference research treats such variations as random. Even though some statistical models exist for accounting for such variation, the error due to this random variation in most cases is not accounted for in the formal statistical models used for summarizing data or for testing hypotheses.
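One family of such models treats the administration setting as a random effect so that setting-to-setting variation is separated from the residual error. A minimal sketch, with invented variable names and simulated data, might look like this:

```python
# Illustrative random-intercept model: a variance component for the
# administration site absorbs setting-to-setting variation (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_sites, n_per_site = 20, 30
site = np.repeat(np.arange(n_sites), n_per_site)
site_effect = rng.normal(0, 3, n_sites)[site]            # site-level variation
score = 70 + site_effect + rng.normal(0, 8, n_sites * n_per_site)

df = pd.DataFrame({"score": score, "site": site})
model = smf.mixedlm("score ~ 1", data=df, groups=df["site"]).fit()
print(model.summary())   # variance component for site vs. residual variance
```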

In low-inference research, efforts are made to maintain a "natural" mode of operation. In high-inference research, to determine the effects, there is an explicit intervention that is intended to change the natural course of things. Yet details of the research process are just as important in high-inference research in making meaningful interpretations. For example, during the last phase of the high-inference research described earlier, students will be split into randomly equivalent groups and will be administered different versions of a test. The equivalence of the two random groups is critical to the validity of interpretations from this phase of the study. Therefore, the description of the randomization process, details of the test administrations, and characteristics of the two groups of students are just as important in high-inference research as they would be in low-inference research.
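A minimal sketch of forming and documenting randomly equivalent groups, assuming a pretest score is available as a baseline covariate (the column names, seed, and data are invented), might look like this:

```python
# Illustrative random assignment with a documented seed and a baseline check.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)          # record the seed for the audit trail
students = pd.DataFrame({"id": range(200),
                         "pretest": rng.normal(50, 10, 200)})

# Random split into two groups that will receive different test versions.
shuffled = students.sample(frac=1, random_state=42).reset_index(drop=True)
group_a, group_b = shuffled.iloc[:100], shuffled.iloc[100:]

# Report baseline characteristics so readers can judge group equivalence.
t, p = stats.ttest_ind(group_a["pretest"], group_b["pretest"])
print(group_a["pretest"].mean(), group_b["pretest"].mean(), round(p, 3))
```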

Constructing Data

The example from low-inference research showed that there are interactions between participants/test takers and the instrument used to measure knowledge; space limitations prohibit showing that similar interviewer–interviewee interactions have been shown to exist even under the most rigorous schedules for high-inference research (Suchman & Jordan, 1990). Acknowledgment of the interaction between participants and test instruments is a general feature of most low-inference research. In typical high-inference research, the measure/test is distinct from the participant/test taker, even though interactions between the two can be considered and modeled. Thus, even in the analyses of interviews that do not draw on statistical inferences, the participants' responses sometimes are taken to be independent of the interviewer questions and interview context. In both types of research, when such interactions are not taken into account, the inferences made by researchers may be inappropriate.

Nearly all psychometric models used in high-inference research require independent responses from participants. This assumption of independence creates constraints on what types of data are constructed and how they are constructed. For example, having participants work in pairs or groups generates data that violate the independence assumption. Low-inference research, on the other hand, generally is interested in how people make sense. Therefore, observing participants in their natural settings generates data that reveal the kind of information they make available to one another in problematic situations. In the first situation, the think-aloud protocols are used to elicit data to make inferences about individual problem-solving capabilities. In the second situation, researchers will typically ask participants to work in pairs because participants inherently make available to one another any problems experienced at the moment. In both situations, researchers collect protocols and transcripts, but these are used with different underlying assumptions (independence vs. ecological validity). The choice is (consciously or not) mediated by researchers' presuppositions about the appropriate unit of analysis. If knowledge is presupposed to be an attribute of the person, then the first situation will be chosen irrespective of the level of inference. If, on the other hand, knowledge is presupposed to be an attribute of person-in-setting transactions and always to be made available as needed by co-participants to one another, then the second situation will be chosen.

The examples from our research show that hypotheses can be found in both types of research; however, the hypotheses are found in different stages of the research and have different functions. In typical high-inference research, hypotheses are logical derivations from the research questions, and they determine what the relevant data sources are, what interpretation models should be used in constructing data, and thus the nature of the data at the beginning of research. In the examples discussed under high-inference research, we see that even though some of the hypotheses are indeed determined at the beginning of research, hypotheses for different phases of the research are based on findings from the previous phase. In low-inference research, hypotheses are usually found at the end of the research process, where researchers may articulate in which way their findings bear on situations other than the one they researched. This form of research is conducted because theories about phenomena that allow hypotheses to be generated do not yet exist.

CONCLUSIONS

As can be seen clearly from our examples and discussions, there are more similarities than differences between low-inference and high-inference research. Researchers needing to decide on an approach to research should make choices that best fit their research questions and the objectives of their research. For example, in policy-oriented research targeted to making decisions for groups, high-inference research will tend to provide the needed generalizability, whereas in research targeted to informing decisions about individuals, low-inference research would be preferable because of its attention to the particulars of participants and their situations. We also highlighted in this chapter that both high-inference and low-inference research may use qualitative (i.e., categorical) data as well as quantitative (i.e., numbers) data. Therefore, qualitative and quantitative terminologies do not provide useful distinctions in understanding research processes and requirements in general and the construction of data in particular.

High-inference research is interested in identifying patterns that describe groups or classes of participants. For example, a researcher might be interested in the covariation of IQ scores with achievement test scores. In low-inference research, the deviations from the group norm might be the most important component of the data (Holzkamp, 1991). For example, there is a correlation between IQ scores and achievement scores (Reschly & Grimes, 1992). High-inference research identifies such correlations, whereas low-inference research may focus on the reasons why an individual high-IQ student does very poorly on an achievement test specifically and in school more generally.
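To make the contrast concrete, the sketch below first computes the group-level correlation (the high-inference interest) and then flags individual students whose achievement falls far below what their IQ score would predict (the kind of deviation a low-inference study might pursue); the scores are simulated for illustration.

```python
# Illustrative contrast: group-level correlation vs. individual deviations.
import numpy as np

rng = np.random.default_rng(7)
iq = rng.normal(100, 15, 150)
achievement = 0.6 * (iq - 100) + rng.normal(0, 10, 150) + 50

r = np.corrcoef(iq, achievement)[0, 1]            # group-level pattern
slope, intercept = np.polyfit(iq, achievement, 1)
residuals = achievement - (slope * iq + intercept)

# Students whose achievement is far below what their IQ would predict:
# the cases a low-inference study might single out for closer analysis.
outliers = np.where((iq > 115) & (residuals < -2 * residuals.std()))[0]
print(f"r = {r:.2f}; flagged cases: {outliers.tolist()}")
```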

Both types of research involve interpretation models in constructing data. These models provide different ways of extracting data from data sources; therefore, subjectivity is involved in both approaches. For low-inference research, where many different interpretations of data can be found, it might be necessary to better specify the interpretation models used. Similarly, in high-inference research, the researcher needs to be aware of the possibility of multiple interpretation models and outline the reasons for choosing one model over another.

High-inference research is very prescriptive about how results are interpreted. For example, there are explicit criteria for what constitutes a significant difference beyond reasonable doubt, and associations between different research variables might not be interpreted as causal unless an experimental design had been used. Such requirements for low-inference research are not articulated very clearly in most studies of this kind, yet to make appropriate interpretations, similar rigor needs to be enacted. One might, for example, expect low-inference research to be very explicit about the interpretive processes and assumptions that are used to make claims about patterns and to instantiate an audit trail that allows others to retrace the emergence and changing nature of the patterns.

It is important to highlight here that researchers may choose methods and approaches that are most familiar to them, methods in which they have had training, or types of research for which they have sufficient resources. This leads many researchers to employ only one technique of data construction, that is, "monomaniacs of log–linear modeling, of discourse analysis, of participant observation, of open-ended or in-depth interviewing, or of ethnographic description" (Bourdieu, 1992, p. 226). This is unfortunate because choosing data construction or analysis methods on the basis of a method, rather than according to the question at hand, has the potential of jeopardizing what research can uncover. Therefore, we have argued in this chapter that researchers should choose the research method that best addresses the research questions. The types of research questions asked will determine the types of inferences needed. All types of research questions—such as "What is happening?," "Is there a systematic effect?," and "Why or how is it happening?"—require different forms of inquiry with differing levels of inference. As the National Research Council's committee on research methods identified, the types of research questions asked in an area depend on the developments in that area (Shavelson & Towne, 2002). The three types of research questions just listed correspond to different stages of development in a particular area and correspond to different levels of inference, from low- to high-inference research. Therefore, the most appropriate approach is not always the one that leads to the highest level of inference; rather, it is the approach that addresses the research question the best.

NOTE

1. A more detailed and extensive description of this research can be found elsewhere (Ercikan, Gierl, McCreith, Puhan, & Koh, 2004).

REFERENCES

Allalouf, A., Hambleton, R., & Sireci, S. (1999). Identifying the causes of translation DIF on verbal items. Journal of Educational Measurement, 36, 185–198.

Bourdieu, P. (1992). The practice of reflexive sociology (the Paris workshop). In P. Bourdieu & L. J. D. Wacquant (Eds.), An invitation to reflexive sociology (pp. 216–260). Chicago: University of Chicago Press.

Brown, A. L. (1992). Design experiments: Theoretical and methodological challenges in creating complex interventions in classroom settings. Journal of the Learning Sciences, 2, 141–178.

Corno, L., Cronbach, L. J., Kupermintz, H., Lohman, D. F., Mandinach, E. B., Porteus, A. W., & Talbert, J. E., for the Stanford Aptitude Seminar. (2002). Remaking the concept of aptitude: Extending the legacy of Richard E. Snow. Mahwah, NJ: Lawrence Erlbaum.

Darling-Hammond, L. (1994). Performance-based assessment and educational equity. Harvard Educational Review, 64, 5–30.

Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational Research, 29, 543–553.

Ercikan, K. (2002). Disentangling sources of differential item functioning in multi-language assessments. International Journal of Testing, 2, 199–215.

Ercikan, K. (2003). Are the English and French versions of the Third International Mathematics and Science Study administered in Canada comparable? Effects of adaptations. International Journal of Educational Policy, Research, and Practice, 4, 55–76.

Ercikan, K. (2005). Developments in assessment of student learning. In P. Winne & P. Alexander (Eds.), Handbook of educational psychology (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.

Ercikan, K., Gierl, M. J., McCreith, T., Puhan, G., & Koh, K. (2004). Comparability of bilingual versions of assessments: Sources of incomparability of English and French versions of Canada's national achievement tests. Applied Measurement in Education, 17, 301–321.

Ercikan, K., & Koh, K. (2005). Construct comparability of the English and French versions of TIMSS. International Journal of Testing, 5, 23–35.

Ercikan, K., Law, D., Arim, R., Domene, J. F., Lacroix, S., & Gagnon, F. (2004, April). Identifying sources of DIF using think-aloud protocols: Comparing thought processes of examinees taking tests in English versus in French. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego.

Gierl, M., & Khaliq, S. (2001). Identifying sources of differential item and bundle functioning on translated achievement tests: A confirmatory analysis. Journal of Educational Measurement, 38, 164–187.

Hambleton, R. K. (2004). Issues, designs, and technical guidelines for adapting tests into multiple languages and cultures. In R. K. Hambleton, P. F. Merenda, & C. Spielberger (Eds.), Adapting educational and psychological tests for cultural assessment (pp. 3–39). Mahwah, NJ: Lawrence Erlbaum.

Hegel, G. W. F. (1969). Science of logic (A. V. Miller, Trans.). New York: George Allen & Unwin.

Heidegger, M. (1977). Sein und Zeit. Tübingen, Germany: Max Niemeyer.

Hodgkinson, D. (1995). Accountability in education in British Columbia. Canadian Journal of Education, 20, 18–26.


Holzkamp, K. (1991). Experience of self and scientific objectivity. In C. W. Tolman & W. Maiers (Eds.), Critical psychology: Contributions to an historical science of the subject (pp. 65–80). Cambridge, UK: Cambridge University Press.

Husserl, E. (1991). On the phenomenology of the consciousness of internal time 1893–1917 (J. B. Brough, Trans.). Dordrecht, Netherlands: Kluwer.

Jordan, B., & Henderson, A. (1995). Interaction analysis: Foundations and practice. Journal of the Learning Sciences, 4, 39–103.

Lincoln, Y. S., & Guba, E. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.

McGinn, M. K., & Roth, W-M. (1998). Assessing students' understandings about levers: Better test instruments are not enough. International Journal of Science Education, 20, 813–832.

Merleau-Ponty, M. (1945). Phénoménologie de la perception. Paris: Gallimard.

National Research Council. (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.

QSR International. (2003). NVIVO (Version 2.0) [computer software]. Thousand Oaks, CA: Scolari–Sage Publications Software.

Reschly, D. J., & Grimes, J. P. (1992). State department and university cooperation: Evaluation of continuing education in consultation and curriculum-based assessment. School Psychology Review, 20, 522–529.

Rodriguez, A. J. (1997). The dangerous discourse of invisibility: A critique of the National Research Council's National Science Education Standards. Journal of Research in Science Teaching, 3, 19–37.

Roth, W-M. (1996). Thinking with hands, eyes, and signs: Multimodal science talk in a Grade 6/7 unit on simple machines. Interactive Learning Environments, 4, 170–187.

Roth, W-M. (1998a). Designing communities. Dordrecht, Netherlands: Kluwer Academic.

Roth, W-M. (1998b). Situated cognition and assessment of competence in science. Evaluation and Program Planning, 21, 155–169.

Roth, W-M. (2000). From gesture to scientific language. Journal of Pragmatics, 32, 1683–1714.

Roth, W-M. (2003). Gesture–speech phenomena, learning, and development. Educational Psychologist, 38, 249–263.

Roth, W-M. (2005). Doing qualitative research: Praxis of method. Rotterdam, Netherlands: SENSE Publications.

Roth, W-M., & Duit, R. (2003). Emergence, flexibility, and stabilization of language in a physics classroom. Journal of Research in Science Teaching, 40, 869–897.

Roth, W-M., McGinn, M. K., Woszczyna, C., & Boutonné, S. (1999). Differential participation during science conversations: The interaction of focal artifacts, social configuration, and physical arrangements. Journal of the Learning Sciences, 8, 293–347.

Schegloff, E. A. (1996). Confirming allusions: Toward an empirical account of action. American Journal of Sociology, 102, 161–216.

Schoenfeld, A. (1992). On paradigms and methods: What do you do when the ones you know don't do what you want them to? Issues in the analysis of data in the form of videotapes. Journal of the Learning Sciences, 2, 179–214.

Shavelson, R. J., & Towne, L. (Eds.). (2002). Scientific research in education. Washington, DC: National Academy Press.

Sireci, G. S., Fitzgerald, C., & Xing, D. (1998). Adapting credentialing examinations for international uses (Laboratory of Psychometric and Evaluative Research, Report No. 329). Amherst: University of Massachusetts, School of Education.

Suchman, L. A., & Jordan, B. (1990). Interactional troubles in face-to-face survey interviews. Journal of the American Statistical Association, 85, 232–244.

Varela, F. J. (2001). Consciousness: The inside view. Trends in Cognitive Sciences, 5, 318–319.

Welzel, M., & Roth, W-M. (1998). Do interviews really assess students' knowledge? International Journal of Science Education, 20, 25–44.
