Ismb2009

From Proteins to Hypotheses: Some Experiments in Semantic Enrichment

Anita de Waard

Principal Researcher, Disruptive Technologies, Elsevier Labs, Amsterdam

Utrecht Institute of Linguistics, Utrecht University, The Netherlands

[email protected]


Entities and Relationships:

- FEBS Structured Digital Abstracts

- OKKAM Entity Repository

Hypotheses and Evidence:

- Discourse Analysis of Biology Articles

- Hypotheses, Evidence and Relationships

Collaborations:

- Elsevier Grand Challenge

- Future of Research Communications


Entities and Relationships

- With the FEBS Letters Editorial Office in Heidelberg and the MINT database in Rome

- SDA [Gerstein et al.]: ‘machine-readable XML summary of pertinent facts’

- For each PPI paper: proteins, protein-protein interactions and methods, provided by the author in XLS

- Started April 9, 2008: links in ScienceDirect to MINT and UniProt

- Now 117 articles with SDA: http://www.febsletters.org/content/sda

- Being used as Gold Standard for the Biocreative II.5 Challenge: http://www.biocreative.org/tasks/biocreative-ii5/biocreative-ii5-evaluation/

FEBS Letters Structured Digital Abstracts


D-07-02772

Testis-specific poly(A) polymerase (TPAP) is a cytoplasmic poly(A) polymerase that is highly expressed in round spermatids. We identified germ cell-specific gene 1 (GSG1) as a TPAP interaction partner protein using yeast two-hybrid and coimmunoprecipitation assays. Subcellular fractionation analysis showed that GSG1 is exclusively localized in the endoplasmic reticulum (ER) of mouse testis where TPAP is also present. In NIH3T3 cells cotransfected with TPAP and GSG1, both proteins colocalize in the ER. Moreover, expression of GSG1 stimulates TPAP targeting to the ER, suggesting that interactions between the two proteins lead to the redistribution of TPAP from the cytosol to the ER.

MINT-6168263: Gsg1 (uniprotkb:Q8R1W2), TPAP (uniprotkb:Q9WVP6) and Calmegin (uniprotkb:P52194) colocalize (MI:0403) by cosedimentation (MI:0027)

MINT-6168204, MINT-6168178: Gsg1 (uniprotkb:Q8R1W2) and TPAP (uniprotkb:Q9WVP6) colocalize (MI:0403) by fluorescence microscopy (MI:0416)

MINT-6167930: Gsg1 (uniprotkb:Q8R1W2) physically interacts (MI:0218) with TPAP (uniprotkb:Q9WVP6) by two-hybrid (MI:0018)
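The SDA statements above follow a regular pattern (MINT identifiers, participants with UniProt accessions, an interaction type and a detection method, each with an MI term), so they can be turned into structured records. The following is only a minimal sketch, assuming the sentence shape seen in these three examples; the function name `parse_sda` and the regular expressions are illustrative and are not part of the FEBS or MINT tooling.

```python
import re

# Assumed patterns, derived only from the three example statements above.
MINT_ID = re.compile(r"MINT-\d+")
PARTICIPANT = re.compile(r"([\w-]+) \(uniprotkb:(\w+)\)")
INTERACTION = re.compile(r"([a-z][a-z ]*) \((MI:\d{4})\)")
METHOD = re.compile(r"by ([a-z][a-z -]*) \((MI:\d{4})\)")


def parse_sda(line):
    """Split one SDA statement into identifiers, participants and MI terms."""
    # Everything before the first colon holds the MINT identifier(s).
    ids_part, _, statement = line.partition(":")
    interaction = INTERACTION.search(statement)
    method = METHOD.search(statement)
    return {
        "mint_ids": MINT_ID.findall(ids_part),
        # List of (name, UniProt accession) pairs, in order of appearance.
        "participants": PARTICIPANT.findall(statement),
        # (label, MI term) for the interaction type, or None if not found.
        "interaction": interaction.groups() if interaction else None,
        # (label, MI term) for the detection method, or None if not found.
        "method": method.groups() if method else None,
    }


if __name__ == "__main__":
    example = ("MINT-6167930:Gsg1 (uniprotkb:Q8R1W2) physically interacts "
               "(MI:0218) with TPAP (uniprotkb:Q9WVP6) by two-hybrid (MI:0018)")
    print(parse_sda(example))
```

Statements that deviate from this shape (for example, a method label containing capitals) would need broader patterns; the sketch does not attempt to cover them.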


OKKAM Project

FP7 - EU funded project, 13 partners, 2008 - 2010

- Main goal: create entity identifier repository

- Create entity-centric architecture:

- create new entities + synonyms where needed

- offer interface/link where not needed

- contain resolver, matcher, repository

- E.g. http://sig.ma - semantic mashup of entity properties
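The resolver/matcher/repository split listed above can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not the OKKAM implementation: the class `EntityRepository`, its method names and the `ent:N` identifier scheme are all hypothetical, and matching is reduced to case-insensitive label lookup.

```python
import itertools


class EntityRepository:
    """Toy entity store: one identifier per entity, many synonyms (hypothetical API)."""

    def __init__(self, prefix="ent"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._labels = {}  # identifier -> set of known labels
        self._index = {}   # normalised label -> identifier

    @staticmethod
    def _norm(label):
        # Assumed normalisation: trim whitespace and ignore case.
        return label.strip().lower()

    def match(self, label):
        """Matcher: return the identifier for a known label, else None."""
        return self._index.get(self._norm(label))

    def resolve(self, identifier):
        """Resolver: return all labels recorded for an identifier."""
        return sorted(self._labels[identifier])

    def register(self, label, synonyms=()):
        """Reuse an existing identifier if any label is known, else mint a new one."""
        names = [label, *synonyms]
        identifier = next((self.match(n) for n in names if self.match(n)), None)
        if identifier is None:
            identifier = f"{self._prefix}:{next(self._counter)}"
            self._labels[identifier] = set()
        for name in names:
            self._labels[identifier].add(name)
            self._index[self._norm(name)] = identifier
        return identifier


if __name__ == "__main__":
    repo = EntityRepository()
    tpap = repo.register("TPAP", ["Testis-specific poly(A) polymerase"])
    print(tpap, repo.resolve(tpap))
    print(repo.match("tpap"), repo.match("p53"))
```

The design choice worth noting is that `register` only mints an identifier when no label matches, which mirrors the "create new entities where needed, link where not needed" goal; a real matcher would need far more than string equality.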

http://sig.ma/search?q=p53



OKKAM Entity-centric authoring tool

Current:

- Build MS Word plugin to find proteins in FEBS articles (i.e., semi-automating FEBS SDA!)

- Now: testing web server interface to Whatizit

- Next: link to Biocreative metaserver

- Add: links to other entities via OKKAM server

- Prototype now being tested with FEBS Office

Future:

- Create unique Author ID, straddling proprietary (Scopus) and other (ResearcherID, OpenID) systems

- Allow query and resolution of authors/papers within Word

OKKAM Entity Editor in MS Word

OKKAM plugin vs. UCSD group plugin

UCSD plugin:

- Works! Can be downloaded, open source, etc.

- Offline: loads ontologies locally

- Identifies classes of entities (e.g., ‘inhibitor’)

- Works with Word 2007

- Stores annotations as XML metadata; metadata is maintained in the Word info pane

Okkam plugin:

- Prototype, starting to test with FEBS authors

- Online: calls out to web services (soon...)

- Identifies specific entities (e.g., a specific protein)

- Works with other versions of Word (on a PC!)

- Stores annotations as Word comments; can generate a report of entities linked to identifiers

Work well together!

Hypotheses and Evidence

Issue with Biocreative challenge:

From one of the documents in the training set:

- In Xenopus oocyte maturation, cytoplasmic polyadenylation mediated by cytoplasmic polyadenylation element binding protein (CPEB) induces the translation of maternal mRNA [5].

- In mouse testis, another novel member of the CPEB protein family (CPEB2) and a homolog of xGLD-2 (mGLD-2) have been identified [7] and [8]

versus:

- TPAP was present in GSG1 immunoprecipitates (Fig. 2B). The in vivo data suggest that TPAP–GSG1 interactions occur in mammalian cells.

Issue 1: what is experimentally verified content?

Issue 2: what is new?

Discourse analysis of biological text

- How can we identify the line of argumentation in biology text?

- Linguistics says: parse into clauses (anything with a verb)

- Identify Discourse Segment Purpose

- Identify verb tense, type, markers for each segment type
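The idea of assigning a purpose to each discourse segment from surface markers can be sketched with a few cue patterns. This is a rough illustration only: the segment labels are borrowed from the slides, but the cue list, the function `classify_segment` and the rule order are assumptions made here, and verb tense is not analysed at all.

```python
import re

# (segment type, cue pattern) pairs, checked in order.
# All cues are illustrative guesses, not a validated marker inventory.
CUES = [
    ("Goal", re.compile(r"^to (investigate|test|determine|examine)\b", re.I)),
    ("Implication", re.compile(
        r"\b(these|the) (results|data) (show|suggest|point to|indicate)\b"
        r"|\bsuggest(s|ing)? that\b", re.I)),
    ("Result", re.compile(r"\(Fig(ure|ures|s)?\.? ?\d", re.I)),
    ("Fact", re.compile(r"\[\d+\]|\(\w[\w .-]* et al\.,? \d{4}\)")),
]


def classify_segment(text):
    """Return the first segment type whose cue pattern occurs in the text."""
    for label, pattern in CUES:
        if pattern.search(text):
            return label
    return "Unclassified"


if __name__ == "__main__":
    for segment in [
        "To investigate the possibility that miR-372 and miR-373 "
        "suppress the expression of LATS2, we...",
        "TPAP was present in GSG1 immunoprecipitates (Fig. 2B).",
        "induces the translation of maternal mRNA [5].",
    ]:
        print(classify_segment(segment), "|", segment)
```

Because the rules fire in a fixed order, a cited claim that also contains "suggests that" is labelled as an implication rather than a fact, which is one reason marker lists alone are unlikely to be sufficient.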

3 Realms of Science:

Experimental realm

Data realm

Conceptual realm


(1) Oncogene-induced senescence is characterized by the appearance of cells with a flat morphology that express senescence associated (SA)-β-Galactosidase.

(2a) Indeed,

(2b) control RASV12-arrested cells showed relatively high abundance of flat cells expressing SA-β-Galactosidase

(2c) (Figures 2G and 2H).

(3a) Consistent with the cell growth assay,

(3b) very few cells showed senescent morphology when transduced with either miR-Vec-371&2, miR-Vec-373, or control p53kd.

(4a) Altogether, these data show that

(4b) transduction with either miR-Vec-371&2 or miR-Vec-373 prevents RASV12-induced growth arrest in primary human cells.

(Figures)

Hypotheses ‘erode’ into facts:

Concepts: KnownFact, KnownFact

Hypothesis: To investigate the possibility that miR-372 and miR-373 suppress the expression of LATS2, we...

Experiment 1: Goal, Method, Data, Result

Experiment 2: Goal, Method, Data, Result

Implication: Therefore, these results point to LATS2 as a mediator of the miR-372 and miR-373 effects on cell proliferation and tumorigenicity,

Fact (Raver-Shapira et al., JMolCell 2007): two miRNAs, miRNA-372 and-373, function as potential novel oncogenes in testicular germ cell tumors by inhibition of LATS2 expression, which suggests that Lats2 is an important tumor suppressor (Voorhoeve et al., 2006).

Fact (Yabuta, JBioChem 2007): miR-372 and miR-373 target the Lats2 tumor suppressor (Voorhoeve et al., 2006)

Hypothesis + Evidence

Identification of hypotheses in papers: SWAN Alzheimer KB

HypER Working Group:

- Goal: Align and expand existing efforts on detection and analysis of Hypotheses, Evidence & Relationships

- Partners:

- Harvard/MGH: SWAN, ARF

- Open University: Cohere

- Oxford University: CiTO, eLearning/Rhetoric

- DERI: SALT, aTags

- University of Trento: LiquidPub

- Xerox Research: XIP hypothesis identifier

- U Tilburg: ML for Science

- Elsevier, UUtrecht: Discourse analysis of biology


Hypothesis 22: Intramembranous Aβ dimer may be toxic.

Derived from: POSTAT_CONTRIBUTION(This essay explores the possibility that a fraction of these Abeta peptides never leave the

HypER Activities: http://hyper.wik.is

Current activities:

- Aligning discourse ontologies: joint task with W3C HCLSSig

- Aligning architectures to exchange hypotheses + evidence

- Format for a rhetorical conference paper (SALT + abcde)

- Parser comparison of hypothesis identification tools

- With NIF+SWAN/SCF: Structured Rhetorical Abstracts for Neuroscience publications

Further interests:

- Better structure of evidence: MyExperiment, KeFeD, ...

- Granularity of annotation/access: entity, hypothesis, discussion?

- Where/how annotate: http://saakm2009.semanticauthoring.org/


Collaborations

Scope: Tools and processes to:

- Improve the process of creating, reviewing and editing scientific content

- Interpret, visualize or connect science knowledge

- Provide tools/ideas for measuring the impact of these improvements.

June 2008: 71 Submissions from 15 countries.

August 2008: 10 semi-finalist teams, with access to:

- 500,000 full text articles

- Plus EMTREE, EmBase, Scopus

- Created tool/demo

- Presented to the Judges

- Wrote a paper (accepted for JWeb Semantics)

April 2009: Judges selected 4 Finalist teams.

And the winners are:


Figure 1: A summary generated when moving the mouse over the link “Discovery’s” (mouse pointer omitted).

tive Summariser (IBES), complements generic summaries in providing additional information about a particular aspect of a page. [3] Generic summaries themselves are easy to generate due to rules enforced by the Wikipedia style-guide, which dictates that all titles be noun phrases describing an entity, thus serving as a short generic summary. Furthermore, the first sentence of the article should contain the title in subject position, which tends to create sentences that define the main entity of the article.

For the elaborative summarisation scenario described, we are interested in exploring ways in which the reading context can be leveraged to produce the elaborative summary. One method explored in this paper attempts to map the content of the linked document into the semantic space of the reading context, as defined in vector-space. We use Singular Value Decomposition (SVD), the underlying method behind Latent Semantic Analysis (Deerwester et al., 1990), as a means of identifying latent topics in the reading context, against which we compare the linked document. We present our system and the results from our preliminary investigation in the remainder of this paper.

2 Related Work

Using link text for summarisation has been explored previously by Amitay and Paris (2000). They identified situations when it was possible to generate summaries of web-pages by recycling human-authored descriptions of links from anchor text. In our work, we use the anchor text as the reading context to provide an elaborative summary for the linked document.

Our work is similar in domain to that of the 2007 CLEF WiQA shared task. [4] However, in contrast to our application scenario, the end goal of the shared task focuses on suggesting editing updates for a particular document and not on elaborating on the user’s reading context.

A related task was explored at the Document Understanding Conference (DUC) in 2007. [5] Here the goal was to find new information with respect to a previously seen set of documents. This is similar to the elaborative goal of our summary in the sense that one could answer the question: “What else can I say about topic X (that hasn’t already been mentioned in the reading context)”. However, whereas DUC focused on unlinked news wire text, we explore a different genre of text.

3 Algorithm

Our approach is designed to select justification sentences and expand upon them by finding elaborative material. The first stage identifies those sentences in the linked document that support the semantic content of the anchor text. We call those sentences justification material. The second stage finds material that is supplementary yet relevant for the user. In this paper, we report on the first of these tasks, though ultimately both are required for elaborative summaries.

To locate justification material, we implemented two known summarisation techniques. The first compares word overlap between the anchor text and the linked document. The second approach attempts to discover a semantic space, as defined by the reading context. The linked document is then mapped into this semantic space. These are referred to as the Simple Link method and the SVD method, where

[3] http://www.ict.csiro.au/staff/stephen.wan/ibes/

[4] http://ilps.science.uva.nl/WiQA/

[5] http://duc.nist.gov/guidelines/2007.html
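The word-overlap comparison between anchor text and linked document described in the excerpt above can be sketched in a few lines. This is a minimal illustration of the general technique, not the excerpt's implementation: the tokenizer, the small stopword list and the function names `content_words` and `rank_by_overlap` are assumptions, and the sample sentences are made up for the demonstration.

```python
import re

# Tiny illustrative stopword list (an assumption, not taken from the excerpt).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to",
             "is", "was", "for", "by", "with"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def rank_by_overlap(anchor_text, sentences):
    """Rank sentences by how many content words they share with the anchor text.

    Ties keep the original sentence order.
    """
    anchor = content_words(anchor_text)
    scored = [(len(anchor & content_words(s)), i, s)
              for i, s in enumerate(sentences)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(score, sentence) for score, _, sentence in scored]


if __name__ == "__main__":
    linked_document = [
        "The two proteins colocalize.",
        "We detected a protein interaction by two-hybrid.",
        "Protein levels were unchanged.",
    ]
    for score, sentence in rank_by_overlap("protein interaction", linked_document):
        print(score, sentence)
```

Exact token matching means "proteins" does not count as overlapping with "protein"; stemming or the SVD-based mapping mentioned in the excerpt would be ways to soften that, but neither is attempted here.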

FoRC: The Future of Research Communication

Improve ways to:

- create, review, edit scientific content

- interpret, visualize, connect scientific knowledge

- measure the impact of these improvements

- present new paradigms in publishing

- serve global academic knowledge communities

- communicate between collaboratories

- share research data

- support complex knowledge ecosystems.

Harvard, Fall 2010


New ways of communicating science require new ways of communicating!

Conference registrants form a community:

- Website is their meeting place, continues as long as users want to

- The physical conference is a synchronous manifestation of the community

- Others can participate through a virtual environment at another time/place

Acknowledgments

- Funding: FP7-Cordis, NWO Casimir Project

- UUtrecht: Leen Breure, Henk Pander Maat, Ted Sanders

- ISI: Gully Burns, Ed Hovy

- Elsevier: David Marques, Darin McBeath, Stefano Bocconi, Adriaan Klinkenberg, Emilie Marcus

- Challenge judges and participants: EMBL Team, CMU Team, David Shotton, Alfonso Valencia

HypER:

- Harvard/MGH: Tim Clark, Elizabeth Wu

- DERI: Siegfried Handschuh, Tudor Groza, Matthias Samwald, Paul Buitelaar

- Xerox Research: Agnes Sandor

- Open University: Simon Buckingham Shum, David Shotton, Jack Park, Clara Mancini

- Oxford University: Annamaria Carusi, David Shotton

- Tilburg U: Piroska Lendvai
