How Scientists Read, How Computers Read,
and What We Should Do
Anita de Waard
Disruptive Technologies Director
Elsevier Labs
(= not what it says in the abstract!)
Outline
1. How do scientists read?
2. How do computers read?
3. What should we do?
How we read
• Letter < syllable < word < clause < sentence < discourse:
This is how linguistics is structured. But it is not how we understand text!
Scientists read:
• Why do scientists read?
– They want to ingest knowledge:
– read, integrate with their current knowledge
• What do scientists read?
– Things that are ‘interesting’:
– Pertinent (within their ‘shell of interest’)
– Possibly or probably true
– Novel, but in agreement with what we know
What is this paper about? NOUN PHRASES
human breast cancer
noninvasive MCF7-Ras
antisense oligonucleotides
high-grade malignancy
cell viability
retroviral vector
miR-31
cloned
transiently expressed miRNA sponges
Is it pertinent? -> Possibly…
Is it true? -> ?
Is it new, but in agreement with what I know? -> ?
What is this paper about? TRIPLES
miR-31 PREVENT acquisition of aggressive traits
miR-31 INHIBIT noninvasive MCF7-Ras cells
miR-31 ENHANCE invasion
cell viability AFFECT inhibitor
miR-31 expression DEPRIVE metastatic cells
Is it pertinent? -> Possibly…
Is it true? -> ?
Is it new, but in agreement with what I know? -> ?
What is this paper about? METADISCOURSE
The preceding observations demonstrated that X expression deprives Y cells of attributes associated with Z. We next asked whether X also prevents the acquisition of A traits by B cells. To do so, we transiently inhibited X in C cells with either D or E. Both approaches inhibited X function by >4.5-fold (Figure S7A). Suppression of X enhanced invasion by 20-fold and motility by 5-fold, but F was unaffected by either inhibitor (Figure 3A; Figure S7B). The E sponge reduced X function by 2.5-fold, but did not affect the activity of other known Js (Figures S8A and S8B). Collectively, these data indicated that sustained X activity is necessary to prevent the acquisition of Z traits by both K and untransformed B cells.
Is it pertinent? -> Need content
Is it true? -> Sounds likely! I know this stuff!
Is it new, but in agreement with what I know? -> Need content
What is this paper about? CLAIMS AND EVIDENCE
Claim:
• Sustained miR-31 activity is necessary to prevent the acquisition of aggressive traits by both tumor cells and untransformed breast epithelial cells.
Evidence: Method:
• We transiently inhibited miR-31 in noninvasive MCF7-Ras cells with either antisense oligonucleotides or miRNA sponges.
Evidence: Result:
• Both approaches inhibited miR-31 function by >4.5-fold (Figure S7A).
• Suppression of miR-31 enhanced invasion by 20-fold and motility by 5-fold, but cell viability was unaffected by either inhibitor (Figure 3A; Figure S7B).
• The miR-31 sponge reduced miR-31 function by 2.5-fold, but did not affect the activity of other known antimetastatic miRNAs (Figures S8A and S8B).
Is it pertinent? -> Probably
Is it true? -> Sounds likely!
Is it new, but in agreement with what I know? -> Check/know
What is this paper about? DATA
Is it pertinent? -> Need content
Is it true? -> Need methods
Is it new, but in agreement with what I know? -> Check/know
What is this paper about? METADATA
Is it pertinent? -> Possibly
Is it true? -> Probably!
Is it new, but in agreement with what I know? -> Need background
How scientists read:
Representation Pertinence Truth Fit with knowledge
Noun phrases: x
Triples: x
Metadiscourse: x
Claims and evidence: x x x
Data: x x x
Metadata: x
Text mining
Data-centric science
Publishing
Outline
1. How do scientists read?
2. How do computers read?
3. What should we do?
Noun Phrases: some issues
• Problem 1: disambiguating terms (© GoPubMed):
– Hnrpa1 = Tis = Fli-2 = nuclear ribonucleoprotein A1 = helix destabilizing protein = single-strand binding protein = hnRNP core protein A1 = HDP-1 = topoisomerase-inhibitor suppressed.
– Cellulose 1,4-beta-cellobiosidase = exoglucanase
– COLD ≠ C.O.L.D. ≠ cold (runny nose) ≠ cold (low T)
• Problem 2: disambiguating entities (© M. Martone):
– 95 antibodies were (manually!) identified in 8 articles
– 52 did not contain enough information to determine the antibody used
– Some provided details in other papers
– Failed to give species, clonality, vendor, or catalog number
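The term problem above can be sketched as a lookup from surface forms to candidate concepts. This is a minimal illustration, not how GoPubMed works: the synonym groups are copied from the slide, while the choice of the first name as canonical label and the names `SYNONYM_GROUPS`, `AMBIGUOUS`, and `candidates` are assumptions made for the example.

```python
from collections import defaultdict

# Synonym groups copied from the slide. Using the first name of each group as
# the canonical label is an arbitrary assumption, not a database identifier.
SYNONYM_GROUPS = [
    ["Hnrpa1", "Tis", "Fli-2", "nuclear ribonucleoprotein A1",
     "helix destabilizing protein", "single-strand binding protein",
     "hnRNP core protein A1", "HDP-1", "topoisomerase-inhibitor suppressed"],
    ["Cellulose 1,4-beta-cellobiosidase", "exoglucanase"],
]

# Surface forms that look alike but name different things.
AMBIGUOUS = {
    "COLD": ["COLD"],
    "C.O.L.D.": ["C.O.L.D."],
    "cold": ["cold (runny nose)", "cold (low T)"],
}


def build_index(groups, ambiguous):
    """Map every surface form to the list of concepts it may refer to."""
    index = defaultdict(list)
    for group in groups:
        canonical = group[0]
        for name in group:
            index[name].append(canonical)
    for surface, senses in ambiguous.items():
        index[surface].extend(senses)
    return dict(index)


INDEX = build_index(SYNONYM_GROUPS, AMBIGUOUS)


def candidates(term):
    """Return all candidate concepts for a term (exact, case-sensitive match)."""
    return INDEX.get(term, [])
```

Many names collapse to one concept (`candidates("HDP-1")`), while one name can fan out to several (`candidates("cold")`); the second case is where context is needed to decide.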
Noun Phrases: some progress
• Despite these difficulties, noun phrase recall/precision is quite high, e.g. I2B2 2011 [1], [2], others: 90%-98%
• Many tools, see [3] for a list; e.g. GoPubMed:
Triples: some issues:
• Contingent on good NP & VP detection
• Hard to parse text! E.g. a commercial tool gave:
– Triple: insulin / maintaining / glucose homeostasis
Sentence: When insulin secretion cannot be increased adequately (type I diabetes defect) to overcome insulin resistance in maintaining glucose homeostasis, hyperglycemia and glucose intolerance ensues.
– Triple: insulin / may be involved / glucose homeostasis
Sentence: Because PANDER is expressed by pancreatic beta-cells and in response to glucose in a similar way to those of insulin, PANDER may be involved in glucose homeostasis.
Triples: some progress:
Biological Expression Language [4]:
We provide evidence that these miRNAs are potential novel oncogenes participating in the development of human testicular germ cell tumors by numbing the p53 pathway, thus allowing tumorigenic growth in the presence of wild-type p53.
Increased abundance of miR-372 decreases activity of TP53
r(MIR:miR-372) -| tscript(p(HUGO:Trp53))
Context: cancer
SET Disease = "Cancer"
Activity of TP53 decreases cell growth
tscript(p(HUGO:Trp53)) -| bp(GO:"Cell Growth")
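Once statements are in this form, a computer can chain them. The sketch below splits the two flat statements from the slide on the `-|` arrow (read as "decreases", following the slide's own gloss) and follows subject-to-object links. It is not a BEL parser: nested statements are not handled, and `Triple`, `parse_flat_statement`, and `downstream` are names made up for this example.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def parse_flat_statement(statement: str) -> Triple:
    """Split a flat 'subject -| object' statement into a Triple.

    Only the '-|' arrow shown on the slide is handled; nested statements
    are out of scope for this sketch.
    """
    subject, arrow, obj = statement.partition("-|")
    if not arrow:
        raise ValueError(f"no '-|' relation found in: {statement!r}")
    return Triple(subject.strip(), "decreases", obj.strip())


STATEMENTS = [
    'r(MIR:miR-372) -| tscript(p(HUGO:Trp53))',
    'tscript(p(HUGO:Trp53)) -| bp(GO:"Cell Growth")',
]
TRIPLES = [parse_flat_statement(s) for s in STATEMENTS]


def downstream(start: str, triples: List[Triple]) -> List[Triple]:
    """Follow subject -> object links from a starting term, stopping on cycles."""
    by_subject = {t.subject: t for t in triples}
    chain, seen, current = [], set(), start
    while current in by_subject and current not in seen:
        seen.add(current)
        step = by_subject[current]
        chain.append(step)
        current = step.obj
    return chain
```

Calling `downstream("r(MIR:miR-372)", TRIPLES)` walks from the miRNA through TP53 activity to cell growth, which is the kind of path a pathway visualization would draw.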
Use biological pathway visualizations as a user interface for knowledge discovery.
Author-created triples: MSR ActiveText
Metadiscourse: why it matters:
• Voorhoeve et al., 2006: “These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor suppressor LATS2.”
• Kloosterman and Plasterk, 2006: “In a genetic screen, miR-372 and miR-373 were found to allow proliferation of primary human cells that express oncogenic RAS and active p53, possibly by inhibiting the tumor suppressor LATS2 (Voorhoeve et al., 2006).”
• Okada et al., 2011: “Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the expression of Lats2, thereby allowing tumorigenic growth in the presence of p53 (Voorhoeve et al., 2006).”
“[Y]ou can transform .. fiction into fact just by adding or subtracting references”, Bruno Latour [5]
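The drift from hedged claim to stated fact in the three quotes above can be made visible with a very small cue search. This is only a sketch: the cue list in `HEDGE_RE` is an illustrative assumption, not a validated hedging lexicon, and real epistemic analysis (see [6]-[10]) needs far more than keyword matching.

```python
import re

# Illustrative hedging cues only; this list is an assumption for the example.
HEDGE_RE = re.compile(r"\b(possibly|probably|perhaps|may|might|could)\b", re.IGNORECASE)

# The three sentences quoted on the slide.
SENTENCES = {
    "Voorhoeve et al., 2006": (
        "These miRNAs neutralize p53-mediated CDK inhibition, possibly through "
        "direct inhibition of the expression of the tumor suppressor LATS2."
    ),
    "Kloosterman and Plasterk, 2006": (
        "In a genetic screen, miR-372 and miR-373 were found to allow proliferation "
        "of primary human cells that express oncogenic RAS and active p53, possibly "
        "by inhibiting the tumor suppressor LATS2 (Voorhoeve et al., 2006)."
    ),
    "Okada et al., 2011": (
        "Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the expression "
        "of Lats2, thereby allowing tumorigenic growth in the presence of p53 "
        "(Voorhoeve et al., 2006)."
    ),
}


def find_hedges(sentence):
    """Return the hedging cues found in a sentence, lowercased, in order."""
    return [m.group(0).lower() for m in HEDGE_RE.finditer(sentence)]


HEDGES_BY_SOURCE = {source: find_hedges(text) for source, text in SENTENCES.items()}
```

The original paper and the first citing paper both keep "possibly"; the 2011 citation carries no cue at all, although it still cites the same source.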
Adding metadiscourse to triples:
Biological statement with BEL/epistemic markup
Statement:
These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor-suppressor LATS2.
BEL representation:
r(MIR:miR-372) -|(tscript(p(HUGO:Trp53)) -| kin(p(PFH:"CDK Family")))
Increased abundance of miR-372 decreases abundance of LATS2
r(MIR:miR-372) -| r(HUGO:LATS2)
Epistemic evaluation:
Value = Possible
Source = Unknown
Basis = Unknown
Biological statement with MedScan/epistemic markup
Statement:
Furthermore, we present evidence that the secretion of nesfatin-1 into the culture media was dramatically increased during the differentiation of 3T3-L1 preadipocytes into adipocytes (P < 0.001) and after treatments with TNF-alpha, IL-6, insulin, and dexamethasone (P < 0.01).
MedScan Analysis:
IL-6 NUCB2 (nesfatin-1)
Relation: MolTransport
Effect: Positive
CellType: Adipocytes
Cell Line: 3T3-L1
Epistemic evaluation:
Value = Probable
Source = Author
Basis = Data
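One way to carry the epistemic evaluation along with a formal statement is to store the three fields next to it. The sketch below assumes a flat record; the class name `MarkedStatement`, the relation strings, and the pairing of "Possible" with the LATS2 relation are illustrative readings of the two slides, not part of BEL or MedScan.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MarkedStatement:
    relation: str  # formal representation (notation here is illustrative)
    value: str     # how certain the claim is, e.g. "Possible", "Probable"
    source: str    # who makes the claim, e.g. "Author", "Unknown"
    basis: str     # what the claim rests on, e.g. "Data", "Unknown"


STATEMENTS = [
    # Assumed pairing: the "possibly" in the sentence hedges the LATS2 relation.
    MarkedStatement("r(MIR:miR-372) -| r(HUGO:LATS2)", "Possible", "Unknown", "Unknown"),
    MarkedStatement(
        "IL-6, NUCB2 (nesfatin-1); Relation: MolTransport; Effect: Positive",
        "Probable", "Author", "Data",
    ),
]


def data_backed(items: List[MarkedStatement]) -> List[MarkedStatement]:
    """Keep only statements whose stated basis is data."""
    return [s for s in items if s.basis == "Data"]
```

With the markup in place, a knowledge base can separate "the authors observed this in their data" from "someone suggested this might be so", instead of treating both triples as equally true.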
Claims and Evidence, some examples: Data2Semantics [11]
• Linking clinical guidelines to evidence in a linked data form
• Goal: improve speed of integration of research > practice
• Issue: evidence is not even correct within guideline?
• Studies have demonstrated inconsistent results regarding the use of such markers of inflammation as C-reactive protein (CRP), interleukins-6 (IL-6) and -8, and procalcitonin (PCT) in neutropenic patients with cancer [55–57].
• [55]: PCT and IL-6 are more reliable markers than CRP for predicting bacteremia in patients with febrile neutropenia
• [56]: In conclusion, daily measurement of PCT or IL-6 could help identify neutropenic patients with a stable course when the fever lasts >3 d. …, it would reduce adverse events and treatment costs.
• [57]: Our study supports the value of PCT as a reliable tool to predict clinical outcome in febrile neutropenia.
Claims and Evidence, example: Drug Interaction Knowledgebase [12]
• Extracting adverse drug interactions (ADIs) from literature and creating a linked data node for them
• Goal: improve speed and coverage of ADIs and allow improved access for patients and doctors
• Issue: how to identify evidence?
– Claim: R-citalopram_is_not_substrate_of_cyp2c19
– Evidence: At 10uM R- or S-CT, ketoconazole reduced reaction velocity to 55-60% of control, quinidine to 80%, and omeprazole to 80-85% of control (Fig. 6)
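A claim linked to its evidence can be sketched as a small data structure. The claim label and evidence text below are taken from the slide; the classes `Claim` and `Evidence`, the "Method"/"Result" kinds (borrowed from the claims-and-evidence slide earlier in this talk), and the rule in `is_supported` are assumptions for illustration, not the DIKB schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    kind: str  # "Method" or "Result"
    text: str  # the sentence or passage offered as evidence


@dataclass
class Claim:
    label: str
    evidence: List[Evidence] = field(default_factory=list)

    def is_supported(self) -> bool:
        """Assumed rule: a claim counts as supported if it has at least one result."""
        return any(e.kind == "Result" for e in self.evidence)


dikb_claim = Claim(
    label="R-citalopram_is_not_substrate_of_cyp2c19",
    evidence=[
        Evidence(
            kind="Result",
            text=(
                "At 10uM R- or S-CT, ketoconazole reduced reaction velocity to "
                "55-60% of control, quinidine to 80%, and omeprazole to 80-85% "
                "of control (Fig. 6)"
            ),
        )
    ],
)

# Hypothetical claim with nothing attached, to show the contrast.
bare_claim = Claim(label="hypothetical_claim_without_evidence")
```

The hard part named on the slide is not storing this link but finding it: deciding which sentence in a paper is the evidence for which claim.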
Using what is known about interactions in fly & yeast: predict new interactions with a human protein
Data, e.g. Web Science 2.0: Mark Wilkinson (SADI, Madrid)
Wilkinson: doing science ON the web:
These are different Web services!
...selected at run-time based on the same model
Data
• All this evidence is based on data
• Increasingly: science is distributed between
– Groups creating data
– Groups using data – creating tools
– Groups using tools on data – ideas
• All of these groups need to communicate!
In summary:
1. How do scientists read?
2. How do computers read?
3. What should we do?
How we read vs. computers:
Level:                People read:               Computers read:
Noun phrases          Know topic                 Pretty well
Triples               Know topic                 Pretty well
Metadiscourse         Trust method               Not very well
Claims and evidence   Understand and trust       Not very well
Data                  Trust - and new science!   Can enable!
Is this the future of publishing? [17]
1. Research: Each item in the system has metadata (including provenance) and relations to other data items added to it.
2. Workflow: All data items created in the lab are added to a (lab-owned) workflow system.
3. Authoring: A paper is written in an authoring tool which can pull data with provenance from the workflow tool in the appropriate representation into the document.
4. Editing and review: Once the co-authors agree, the paper is ‘exposed’ to the editors, who in turn expose it to reviewers. Reports are stored in the authoring/editing system, the paper gets updated, until it is validated.
5. Publishing and distribution: When a paper is published, a collection of validated information is exposed to the world. It remains connected to its related data item, and its heritage can be traced.
6. User applications: distributed applications run on this ‘exposed data’ universe.
[Diagram labels: metadata; Review, Edit, Revise; Publisher runs service (‘app’)]
[Diagram sample text: Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-]
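The idea that a published item stays connected to its data, so that "its heritage can be traced", can be sketched as items that each point to what they were derived from. Everything here is hypothetical: the `DataItem` fields, the item ids, and the `heritage` function are invented for illustration and do not describe an existing workflow system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DataItem:
    item_id: str
    kind: str                           # e.g. "raw data", "figure", "paper"
    created_by: str                     # provenance: who or what produced it
    derived_from: Optional[str] = None  # id of the item this one was built from


def heritage(item_id: str, store: Dict[str, DataItem]) -> List[DataItem]:
    """Walk the derived_from links back to the original item, stopping on cycles."""
    chain, seen = [], set()
    current: Optional[str] = item_id
    while current is not None and current not in seen:
        seen.add(current)
        item = store[current]
        chain.append(item)
        current = item.derived_from
    return chain


# Hypothetical lab workflow: raw measurements -> figure -> paper.
STORE = {
    item.item_id: item
    for item in [
        DataItem("raw-1", "raw data", "lab instrument"),
        DataItem("fig-2", "figure", "analysis script", derived_from="raw-1"),
        DataItem("paper-1", "paper", "authoring tool", derived_from="fig-2"),
    ]
}
```

Here `heritage("paper-1", STORE)` returns the paper, the figure, and the raw data in that order, which is what "click on fig 2 to see underlying data" would need behind the scenes.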
What should we do?
• Experiment! All over the place. Scientists get it!
• Support scientists working on these (e.g. text miners, web science evangelists, data repositories, etc.) – great return for your investment!
• Join forums where interactions happen between scientists, publishers, libraries, etc., e.g. Force11.org:
– Collective, sponsored by Sloan, aimed at enabling/supporting this discussion
– Planning workshop, innovative projects for 2013
– Please join us at http://force11.org!
References
[1] J Am Med Inform Assoc. 2010 September; 17(5): 514–518. http://dx.doi.org/10.1136/jamia.2010.003947
[2] Quanzhi Li, Yi-Fang Brook Wu (2006). Identifying important concepts from medical documents. Journal of Biomedical Informatics 39 (2006) 668–679.
[3] Useful list of resources in bioinformatics: http://www.bioinformatics.ca/
[4] Biological Expression Language: http://www.openbel.org
[5] Latour, B. and Woolgar, S. (1979). Laboratory Life: the Social Construction of Scientific Facts. Sage Publications.
[6] Light M, Qiu XY, Srinivasan P. (2004). The language of bioscience: facts, speculations, and statements in between. BioLINK 2004: Linking Biological Literature, Ontologies and Databases 2004:17-24.
[7] Wilbur WJ, Rzhetsky A, Shatkay H (2006). New directions in biomedical text annotations: definitions, guidelines and corpus construction. BMC Bioinformatics 2006, 7:356.
[8] Thompson P., Venturi G., McNaught J, Montemagni S, Ananiadou S. (2008). Categorising modality in biomedical texts. Proc. LREC 2008 Wkshp Building and Evaluating Resources for Biomedical Text Mining 2008.
[9] Kim, S-M., Hovy, E.H. (2004). Determining the Sentiment of Opinions. Proceedings of the COLING conference, Geneva, 2004.
[10] de Waard, A. and Pander Maat, H. (2012). Epistemic Modality and Knowledge Attribution in Scientific Discourse: A Taxonomy of Types and Overview of Features. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 47–55, Jeju, Republic of Korea, 12 July 2012.
[11] Data2Semantics project: http://www.data2semantics.org/
[12] Boyce R, Collins C, Horn J, Kalet I. (2009). Computing with evidence Part I: A drug-mechanism evidence taxonomy oriented toward confidence assignment. J Biomed Inform. 2009 Dec;42(6):979-89. Epub 2009 May 10. See also http://dbmi-icode-01.dbmi.pitt.edu/dikb-evidence/front-page.html
[13] Sándor, Àgnes and de Waard, Anita (2012). Identifying Claimed Knowledge Updates in Biomedical Research Articles. Workshop on Detecting Structure in Scholarly Discourse, ACL 2012.
[14] Blake, C. (2010). Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles. Journal of Biomedical Informatics, 43(2):173-189.
[15] See e.g. http://ucsdbiolit.codeplex.com/ and http://research.microsoft.com/en-us/projects/ontology/ for MS Word ontology add-ins.
[16] de Waard, A. and Schneider, J. (2012). Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA). Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine workshop, ISWC 2012.
[17] de Waard, A. (2010). The Future of the Journal? Integrating research data with scientific discourse. LOGOS: The Journal of the World Book Community, Volume 21, Numbers 1-2, 2010, pp. 7-11(5). Also published in Nature Precedings, http://precedings.nature.com/documents/4742/version/1