ICPW2007.deWaard

Science Beyond the Facts: A Pragmatic Structure for Research Articles Anita de Waard Elsevier Labs, Disruptive Technologies Utrecht University

Page 1: ICPW2007.deWaard

Science Beyond the Facts: A Pragmatic Structure for Research Articles

Anita de Waard

Elsevier Labs, Disruptive Technologies

Utrecht University

Page 2: ICPW2007.deWaard

Introduction

Page 3: ICPW2007.deWaard

- There was too much scientific information

(43,848 Papers on p53)

- And it was all written in stories....

[demo Papers]

Once Upon a Time....

Page 4: ICPW2007.deWaard

Research Goal

- Find a structure for research articles that allows computer-aided access to knowledge elements

- Start with Research Articles in Cell Biology

- Expand to other genres/domains?

- How do we extract this structure?

- How do we use this structure?

Page 5: ICPW2007.deWaard

Pragmatic

1. Colloquial: practical, vs. theoretical

2. Linguistic: ‘meaning of linguistic messages in their context of use’ (per/il/locutionary goals)

3. Pragmatic web: ‘quality of goal-oriented discourse in communities’

English 306A; Harris 4

Functions

Ideational function: What does “The cat is on the mat” mean as an expression in the system of English?

How? Denotation, truth conditions, event schemata, semantic roles, …

Interpersonal function: What does “The cat is on the mat” mean to hearer X, when said by speaker Y, in context Z?

How? Speech acts, conversational maxims, face principles, deixis, …

English 306A; Harris 6

Meaning

Semantics

- Propositions

- Truth/falsity

- Context-free

- Language-in-vitro

Pragmatics

- Utterances

- Appropriateness

- Context-dependent

- Language-in-vivo

Page 6: ICPW2007.deWaard

Method

Page 7: ICPW2007.deWaard

Genre + Discourse Studies

- Science is written in text, as a story

- Text is created by humans to persuade other humans (peers) that claims are facts

- To tell the computer how we encode our knowledge, we need to understand:

=> How do humans tell stories?

=> How do stories make sense?

Page 8: ICPW2007.deWaard

Work on corpus

- Corpus of 14 coherent (citing, cited) articles in Cell Biology, based around (Voorhoeve, 2006)

- Hand-modeled ASCII text; created XML

- Manual (by me + small user validation)

Page 9: ICPW2007.deWaard

(Preliminary) Results

Page 10: ICPW2007.deWaard

1st Attempt: Classical rhetoric

Parts of a classical oration (Aristotle, Quintilian) and the corresponding sections in Cell and in the APA Style Guide:

Introduction (Aristotle: prooimion; Quintilian: exordium)
The introduction of a speech, where one announces the subject and purpose of the discourse, and where one usually employs the persuasive appeal of ethos in order to establish credibility with the audience.
Cell: Introduction. APA Style Guide: Introduction.

Statement of Facts (Aristotle: prothesis; Quintilian: narratio)
The second part of a classical oration, following the introduction or exordium. The speaker here provides a narrative account of what has happened and generally explains the nature of the case. Quintilian adds that the narratio is followed by the propositio, a kind of summary of the issues or a statement of the charge.
Cell: Introduction. APA Style Guide: Introduction.

Summary (Quintilian: propositio)
Coming between the narratio and the partitio of a classical oration, the propositio provides a brief summary of what one is about to speak on, or concisely puts forth the charges or accusation.
Cell: Abstract. APA Style Guide: Abstract.

Division/outline (Quintilian: partitio)
Following the statement of facts, or narratio, comes the partitio or divisio. In this section of the oration, the speaker outlines what will follow, in accordance with what's been stated as the status, or point at issue in the case. Quintilian suggests the partitio is blended with the propositio and also assists memory.
Cell: Table of Contents. APA Style Guide: Article Outline.

Proof (Aristotle: pistis; Quintilian: confirmatio)
Following the division / outline or partitio comes the main body of the speech where one offers logical arguments as proof. The appeal to logos is emphasized here.
Cell: Results. APA Style Guide: Methods, Results.

Refutation (Quintilian: refutatio)
Following the confirmatio or section on proof in a classical oration, comes the refutation. As the name connotes, this section of a speech was devoted to answering the counterarguments of one's opponent.
Cell: Discussion. APA Style Guide: Discussion.

Aristotle: epilogos; Quintilian: peroratio
Following the refutatio and concluding the classical oration, the peroratio conventionally employed appeals through pathos, and often included a summing up (see the figures of summary, below).
Cell: Discussion. APA Style Guide: Discussion.

Page 11: ICPW2007.deWaard

2nd Attempt: Story Grammar

Story: The Story of Goldilocks and the Three Bears
Paper: The AXH Domain of Ataxin-1 Mediates Neurodegeneration through Its Interaction with Gfi-1/Senseless Proteins

Each entry gives the story grammar element and the corresponding paper element, followed by the matching text from the story and from the paper.

Time (Setting) / Background
  Story: Once upon a time
  Paper: The mechanisms mediating SCA1 pathogenesis are still not fully understood, but some general principles have emerged.

Characters / Objects of study
  Story: a little girl named Goldilocks
  Paper: the Drosophila Atx-1 homolog (dAtx-1) which lacks a polyQ tract,

Location / Experimental setup
  Story: She went for a walk in the forest. Pretty soon, she came upon a house.
  Paper: studied and compared in vivo effects and interactions to those of the human protein

Goal (Theme) / Research goal
  Story: She knocked and, when no one answered,
  Paper: Gain insight into how Atx-1's function contributes to SCA1 pathogenesis. How these interactions might contribute to the disease process and how they might cause toxicity in only a subset of neurons in SCA1 is not fully understood.

Attempt / Hypothesis
  Story: she walked right in.
  Paper: Atx-1 may play a role in the regulation of gene expression

Name (Episode 1) / Name
  Story: At the table in the kitchen, there were three bowls of porridge.
  Paper: dAtx-1 and hAtx-1 Induce Similar Phenotypes When Overexpressed in Flies

Subgoal / Subgoal
  Story: Goldilocks was hungry.
  Paper: test the function of the AXH domain

Attempt / Method
  Story: She tasted the porridge from the first bowl.
  Paper: overexpressed dAtx-1 in flies using the GAL4/UAS system (Brand and Perrimon, 1993) and compared its effects to those of hAtx-1.

Outcome / Results
  Story: This porridge is too hot! she exclaimed.
  Paper: Overexpression of dAtx-1 by Rhodopsin1(Rh1)-GAL4, which drives expression in the differentiated R1-R6 photoreceptor cells (Mollereau et al., 2000 and O'Tousa et al., 1985), results in neurodegeneration in the eye, as does overexpression of hAtx-1[82Q]. Although at 2 days after eclosion, overexpression of either Atx-1 does not show obvious morphological changes in the photoreceptor cells

(no story grammar element) / Data
  Story: So, she tasted the porridge from the second bowl.
  Paper: (data not shown),

Outcome / Results
  Story: This porridge is too cold, she said
  Paper: both genotypes show many large holes and loss of cell integrity at 28 days

(no story grammar element) / Data
  Story: So, she tasted the last bowl of porridge.
  Paper: (Figures 1B-1D).

Outcome / Results
  Story: Ahhh, this porridge is just right, she said happily and
  Paper: Overexpression of dAtx-1 using the GMR-GAL4 driver also induces eye abnormalities. The external structures of the eyes that overexpress dAtx-1 show disorganized ommatidia and loss of interommatidial bristles

(no story grammar element) / Data
  Story: she ate it all up.
  Paper: (Figure 1F),

Page 12: ICPW2007.deWaard

3rd Attempt: Discourse Segments

- “A text is made up of Discourse Segments and the relations between them” - Grosz and Sidner, Mann and Thompson, Marcu, Swales

- Discourse Segment Purpose: element that has a consistent rhetorical/pragmatic goal.

- Define for Biological Research Article

Page 13: ICPW2007.deWaard

Discourse Segments In Biology

<Goal>To examine miRNA expression from the miR-Vec system, </Goal><Method> a miR-24 minigene-containing virus was transduced into human cells. Expression was determined using an RNase protection assay (RPA) with a probe designed to identify both precursor and mature miR-24 (Figure 1B). </Method><Result>Figure 1C shows that cells transduced with miR-Vec-24 clearly express high levels of mature miR-24, whereas little expression was detected in control-transduced cells. </Result>
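Because the segments on this slide are marked up as XML elements, they can be read back with a standard XML parser. The following is only a minimal Python sketch: the wrapping `<Paragraph>` root element and the function name `list_segments` are assumptions made for illustration and are not part of the corpus markup described in the talk.

```python
import xml.etree.ElementTree as ET

# Tagged passage copied from the slide above.
TAGGED = (
    "<Goal>To examine miRNA expression from the miR-Vec system, </Goal>"
    "<Method> a miR-24 minigene-containing virus was transduced into human cells. "
    "Expression was determined using an RNase protection assay (RPA) with a probe "
    "designed to identify both precursor and mature miR-24 (Figure 1B). </Method>"
    "<Result>Figure 1C shows that cells transduced with miR-Vec-24 clearly express "
    "high levels of mature miR-24, whereas little expression was detected in "
    "control-transduced cells. </Result>"
)


def list_segments(tagged_text):
    """Return (segment type, text) pairs in document order."""
    # The slide shows sibling elements without a root, so wrap them in a
    # hypothetical <Paragraph> element to obtain well-formed XML.
    root = ET.fromstring("<Paragraph>" + tagged_text + "</Paragraph>")
    return [(child.tag, (child.text or "").strip()) for child in root]


for segment_type, text in list_segments(TAGGED):
    print(segment_type + ": " + text)
```

Keeping the segment type in the element name means that a query such as "all Method segments" reduces to a filter on the first item of each pair.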

Page 14: ICPW2007.deWaard

Page 15: ICPW2007.deWaard

Segments vs. Sections

Segment type  Introduction  Method  Results  Discussion  Total
Fact          63            0       104      37          204
Problem       20            0       10       15          45
Goal          2             0       72       6           80
Method        2             all     129      6           137
Result        10            0       230      44          284
Implication   14            0       100      36          150
Hypothesis    10            0       33       26          69
Total         121           0       678      170         969

Page 16: ICPW2007.deWaard

Segment Tense

Verb form        Fact         Problem      Goal         Method       Result       Implication  Hypothesis   Total
Present active   72 (46%)     27 (60%)     15 (23%)     7 (7%)       37 (16%)     69 (51%)     38 (55%)     265
Present passive  5 (3%)       2 (4%)       2 (3%)       1 (1%)       1 (0%)       11 (8%)      1 (1%)       23
Past active      18 (11%)     5 (11%)      11 (17%)     48 (47%)     122 (54%)    16 (12%)     8 (12%)      228
Past passive     25 (16%)     2 (4%)       1 (2%)       17 (17%)     21 (9%)      1 (1%)       5 (7%)       72
Future           2 (1%)       3 (7%)       0 (0%)       0 (0%)       1 (0%)       0 (0%)       0 (0%)       6
Imperfect: "to"  13 (8%)      2 (4%)       32 (50%)     2 (2%)       20 (9%)      14 (10%)     7 (10%)      90
Gerund ("ing")   22 (14%)     4 (9%)       3 (5%)       28 (27%)     23 (10%)     24 (18%)     10 (14%)     114
Total            157 (100%)   45 (100%)    64 (100%)    103 (100%)   225 (100%)   135 (100%)   69 (100%)    798
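The percentages in the tense table are column shares: each count divided by the total for that segment type. As a rough illustration, the Python sketch below recomputes the most frequent verb form per segment type from the counts on the slide. The names `TENSE_COUNTS` and `dominant_form` are hypothetical and chosen for this example only.

```python
SEGMENT_TYPES = ["Fact", "Problem", "Goal", "Method", "Result",
                 "Implication", "Hypothesis"]

# Counts copied from the slide: one row per verb form, one column per segment type.
TENSE_COUNTS = {
    "Present active":  [72, 27, 15, 7, 37, 69, 38],
    "Present passive": [5, 2, 2, 1, 1, 11, 1],
    "Past active":     [18, 5, 11, 48, 122, 16, 8],
    "Past passive":    [25, 2, 1, 17, 21, 1, 5],
    "Future":          [2, 3, 0, 0, 1, 0, 0],
    'Imperfect: "to"': [13, 2, 32, 2, 20, 14, 7],
    'Gerund ("ing")':  [22, 4, 3, 28, 23, 24, 10],
}


def dominant_form(segment_type):
    """Return the most frequent verb form for a segment type and its column share."""
    col = SEGMENT_TYPES.index(segment_type)
    column_total = sum(row[col] for row in TENSE_COUNTS.values())
    form = max(TENSE_COUNTS, key=lambda name: TENSE_COUNTS[name][col])
    return form, TENSE_COUNTS[form][col] / column_total


for segment_type in SEGMENT_TYPES:
    form, share = dominant_form(segment_type)
    print(f"{segment_type}: {form} ({share:.0%})")
```

Read this way, the counts show present active as the most frequent form for Fact, Problem, Implication and Hypothesis segments, past active for Method and Result segments, and the "to" form for Goal segments.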

Page 17: ICPW2007.deWaard

Segment order

From \ To    Fact  Hypothesis  Problem  Goal  Method  Result  Implication  End  Total
Start        18    3           1        8     2       2       4            0    38
Fact         83    22          13       17    9       31      12           1    188
Hypothesis   20    5           3        7     6       2       6            3    52
Problem      9     7           7        2     3       5       3            3    39
Goal         7     0           2        4     46      6       0            0    65
Method       13    2           3        10    25      54      3            0    110
Result       23    9           4        6     16      85      78           6    227
Implication  13    6           4        12    11      30      12           25   113
Total        186   54          37       61    118     215     118          38   827
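The segment order counts can be read as a transition table: each row gives how often a segment of one type is followed by a segment of each other type. A minimal Python sketch, assuming that reading, turns a row into relative frequencies. The names `COUNTS`, `transition_probabilities` and `most_likely_next` are illustrative, and row totals are recomputed from the cells rather than taken from the slide.

```python
TARGETS = ["Fact", "Hypothesis", "Problem", "Goal", "Method",
           "Result", "Implication", "End"]

# Cell counts copied from the slide: key = preceding segment, list = following segment.
COUNTS = {
    "Start":       [18, 3, 1, 8, 2, 2, 4, 0],
    "Fact":        [83, 22, 13, 17, 9, 31, 12, 1],
    "Hypothesis":  [20, 5, 3, 7, 6, 2, 6, 3],
    "Problem":     [9, 7, 7, 2, 3, 5, 3, 3],
    "Goal":        [7, 0, 2, 4, 46, 6, 0, 0],
    "Method":      [13, 2, 3, 10, 25, 54, 3, 0],
    "Result":      [23, 9, 4, 6, 16, 85, 78, 6],
    "Implication": [13, 6, 4, 12, 11, 30, 12, 25],
}


def transition_probabilities(source):
    """Relative frequency of each following segment type after `source`."""
    row = COUNTS[source]
    total = sum(row)
    return {target: count / total for target, count in zip(TARGETS, row)}


def most_likely_next(source):
    """Segment type that most often follows `source` in the counts."""
    probs = transition_probabilities(source)
    return max(probs, key=probs.get)


for source in COUNTS:
    nxt = most_likely_next(source)
    share = transition_probabilities(source)[nxt]
    print(f"{source} -> {nxt} ({share:.0%})")
```

On these counts a Goal segment is most often followed by a Method segment, and a Method segment by a Result segment.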

Page 18: ICPW2007.deWaard

Discourse: A Fact(ory)

[Diagram. Sections: introduction, results, discussion. Segments: fact, problem, hypothesis, goal, method, result, implication. Connecting phrases: "to", "we", "resulting in", "suggests that". Realms: hypothetical realm (might, would); realm of activity (to test, to see); realm of models (present); realm of experience (past). Other labels: incongruity or ignorance; shared view; own view.]

Page 19: ICPW2007.deWaard

Links (Under Construction)

To references:

- From/to segment type makes difference: methods link, fact link, agree/disagree link

- Not clear where to link into: is claim truly in referred document? How to locate?

To figures/tables:

- Usually main proof in results (methods) segments: need to allow multi-media elements in system!

Discourse relations:

- Many taxonomies: RST, Hovy, Sanders, ClaiMaker

- Identify textual coherence/argumentation...

Page 20: ICPW2007.deWaard

Coherence Markers

Columns: Fact, Problem, Goal, Method, Results, Implication, Hypothesis

Fact: in animals however (3x) | to, we examined (2x) | we fused, we utilised | in contrast, we found (5x), though, on average, under our conditions | our data suggest, we propose that, consistent with | suggesting that (2x)

Problem: we fused in this paper

Goal: we isolated we showed

Method: we found (2x), while, as seen | but suggests we predicted

Results: in addition, in contrast | we utilised, we used | interestingly (2x), since (3x), also (2x), while (2x), second (2x), third (2x), finally (2x), subsequent, thereafter, in our study | (strongly) suggests/suggesting that (8x), implicating (2x), consistent with (2x), demonstrating that (3x) | we propose, suggesting that

Implication: to verify, to confirm | we replaced, we fused, we tried | however, first (2x), interestingly (2x), consistent with, in our analysis, strikingly, neither | also in theory

Hypothesis: in animals, in support of this, indeed | to test (2x) however, our results provide evidence that

Page 21: ICPW2007.deWaard

Preliminary Hypotheses

1. 'To' infinitive appears as marker of Goal moves [+]

2. Sequential connectives appear within same segment type [-]

3. 'though', 'however', 'therefore' - causal connectives occur at all -> Problem and -> Hypothesis boundaries [0]

4. 'suggests' occurs at Results -> Implication/Hypothesis boundary [+/0]

5. 'we found' / 'we observed' / 'we showed' -> Result boundary [+/0]

6. 'we + other verb' occurs at -> Method boundary [0]

7. Contrast/correspondence in Fact <-> Result <-> Implication moves [+!]
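Hypotheses 1, 4, 5 and 6 name surface markers that could in principle drive a first-pass segment tagger. The Python sketch below is a toy illustration of that idea only, not the method used for the corpus, which was annotated by hand. The rule order, the regular expressions and the function name `guess_segment_type` are all assumptions, and the sketch ignores the mixed ratings the slide gives these hypotheses.

```python
import re

# Ordered rules: the first matching pattern wins. More specific "we" phrases are
# tested before the generic "we + verb" rule so that they are not shadowed by it.
RULES = [
    (re.compile(r"\bwe (found|observed|showed)\b", re.I), "Result"),   # hypothesis 5
    (re.compile(r"\bsuggest(s|ing)?\b", re.I), "Implication"),         # hypothesis 4
    (re.compile(r"^\s*to [a-z]+\b", re.I), "Goal"),                    # hypothesis 1
    (re.compile(r"\bwe [a-z]+\b", re.I), "Method"),                    # hypothesis 6
]


def guess_segment_type(clause):
    """Return a guessed segment type for a clause, or None if no marker matches."""
    for pattern, label in RULES:
        if pattern.search(clause):
            return label
    return None


examples = [
    "To examine miRNA expression from the miR-Vec system,",
    "we fused the domain to a reporter",
    "we found that expression was reduced",
    "our data suggest that the domain is required",
    "Expression was determined using an RNase protection assay",
]
for clause in examples:
    print(guess_segment_type(clause), "<-", clause)
```

A clause without any of these markers is left unlabelled, which matches the observation that markers are only hypothesised to occur at some segment boundaries.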

Page 22: ICPW2007.deWaard

Discussion

Page 23: ICPW2007.deWaard

Research Goals fulfilled?

- Allow computer-aided access to knowledge: yes, but:
  > need to verify that the segment types cover this genre
  > need to finalize a structure of relations

- Other genres/domains?
  > investigate more than cell biology

- How do we extract this structure?
  > collaborative attempts to identify segment markers/relationships - next step

- How do we use this structure? [ DEMO ]
  > possible collaborations with sensemaking systems?

Page 24: ICPW2007.deWaard

Preliminary Conclusions

- Science is created in text

- Goal of text is to convince peers that claims (backed by data) belong to fact canon

- Text convinces humans through rhetorical/narrative discourse structure

- Text creates meaning in the human mind

- Discourse parsing could allow access to knowledge structure

- More work needed: collaborations?

Page 26: ICPW2007.deWaard

Appendix

Page 27: ICPW2007.deWaard

Related work

Bio-informatics Style Guides Shum et al. Harmsze Swales RST Teufel Collier et al.

Sections x x x

Moves x x x x!

Entities x x

Embedding x x

Discourse relations x x x

Argumentational relations x x

* Need complete model for multidocument collection – markup of content elements and relationships

* Unique role as a publisher: can apply/mandate at the source!

Page 28: ICPW2007.deWaard

From \ To    Fact  Problem  Goal  Method  Result  Implication  Hypothesis  End  Total
Start        18    1        8     2       2       4            3           0    38
Fact         83    13       17    9       31      12           22          1    188
Problem      9     7        2     3       5       3            7           3    39
Goal         7     2        4     46      6       0            0           0    65
Method       13    3        10    25      54      3            2           0    110
Result       23    4        6     16      85      78           9           6    227
Implication  13    4        12    11      30      12           6           25   113
Hypothesis   20    3        7     6       2       6            5           3    52
Total        186   37       61    118     215     118          54          38   827

Selfs: 221

Model: 399

% in Model: 65.84%
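The slide does not state how the percentage was derived. One reading that reproduces the figure is that "Selfs" counts transitions from a segment type to itself (the diagonal of the table above) and that the model's 399 transitions are taken as a share of the remaining, non-self transitions. The Python sketch below checks that arithmetic. The variable names are illustrative, and which transitions count as "in the model" is not specified on the slide, so 399 is simply copied from it.

```python
# Diagonal cells of the table above: a segment followed by one of the same type.
SELF_TRANSITIONS = {
    "Fact": 83, "Problem": 7, "Goal": 4, "Method": 25,
    "Result": 85, "Implication": 12, "Hypothesis": 5,
}
TOTAL_TRANSITIONS = 827   # grand total as stated on the slide
MODEL_TRANSITIONS = 399   # "Model" figure as stated on the slide

selfs = sum(SELF_TRANSITIONS.values())
non_self = TOTAL_TRANSITIONS - selfs
pct_in_model = 100 * MODEL_TRANSITIONS / non_self

print(f"Selfs: {selfs}")
print(f"Non-self transitions: {non_self}")
print(f"% in Model: {pct_in_model:.2f}%")
```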

Page 29: ICPW2007.deWaard

Nr Section Introduction Results Discussion

A1 Agami, Results 4

A2 Agami, Discussion ½ 2 ½

A3 Agami, Introduction 3

S1 Serrano, Results 2

S2 Serrano, Discussion 1 1

S3 Serrano, Introduction 2

V1 Voorhoeve, Results 2

V2 Voorhoeve, Discussion 3

V3 Voorhoeve, Introduction 1 2

Results Clause assignment test (8 tests handed in, avg. 38 clauses each):

114 Clauses 51 No Disagreement

13 Fact/Result 11 Fact/Problem

10 Method/Result 7 Result/Implication

4 Goal/Method 3 Problem/Goal 2 Goal/Result

2 Problem/Interpretation 2 Fact/Interpretation

1 Problem/Result

Comments on classification:

• Incomplete sentences are unclear, hard to classify

• Add ‘Hypothesis’ category, e.g. clauses 8, 33, 74a, 77, 78b.

• Other possible categories: Assumption, Observation, “Given that...”

Clause Classification Test

Page 30: ICPW2007.deWaard

References

• Austin, J.L. How to Do Things with Words, J.O. Urmson, ed. Oxford: Clarendon Press, 1962.

• Bazerman, Charles. Shaping Written Knowledge: The Genre and Activity of the Experimental Article in Science. Madison, Wisconsin: Univ. of Wisconsin Press, 1988.

• Bex, F.J., Prakken, H., Reed, C. & Walton, D.N. Towards a formal account of reasoning about evidence: argumentation schemes and generalisations. Artificial Intelligence and Law 11 (2003), 125-165.

• Buckingham Shum, Simon J., Uren, V., et al. Modelling Naturalistic Argumentation in Research Literatures: Representation and Interaction Design Issues. Tech Report kmi-04-28, December 2004.

• Harmsze, Frédérique. A modular structure for scientific articles in an electronic environment. PhD Thesis, February 9, 2000.

• Hovy, E. Automated discourse generation using discourse structure relations. Artificial Intelligence 63(1-2), 1993, 341-386.

• Kircz, Joost G. Modularity: the next form of scientific information presentation? Journal of Documentation, vol. 54, no. 2, March 1998, pp. 210-235.

• Kuhn, Thomas, The Structure of Scientific Revolutions (Chicago: University of Chicago Press, 1962)

• Latour, B., Science in Action, How to Follow Scientists and Engineers through Society, (Cambridge, Ma.: Harvard University Press, 1987)

• Latour, Bruno, Steve Woolgar, Jonas Salk, Laboratory Life: The Construction of Scientific Facts, Princeton University Press, 1986