Page 1: Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators DPB.

Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators

DPB

• What does a curator do?

• What do we ALL (researchers and curators) want from the papers we read?

• What problems do we encounter when reading papers?
  • Identifying items
  • Choosing annotations

• How can we work together to improve these processes?

• Why does this matter to YOU?

Discussion Plan

It depends on the type of curator!

Functional genomics curator / Metabolic pathway curator:

• Help to maintain the TAIR and Plant Metabolic Network / AraCyc websites

• Answer questions from users

• Give presentations and workshops at conferences and universities

• Interact with curators at other institutions to develop better curation practices and tools

What does a curator do?

• Read LOTS of papers

What do we all want from papers?

It depends on the type of paper!

• I focus on papers that describe:

  • genes/proteins (TAIR and PMN)
  • metabolic pathways (PMN)

• We all want the important information!

• Curators also want to be able to capture that information and display it for users on the TAIR and AraCyc/PMN websites.

What do we all want from papers?

• What gene / protein are they talking about?

• AGI locus code (TAIR / PMN)
  • At2g46990

• Gene symbol and FULL names (TAIR / PMN)
  • BSK3 = Brassinosteroid (BR)-signaling kinase 3
  • GGT2 = Glutamate:Glyoxylate aminotransferase 2

• Gene model (TAIR)
  • At2g46990.1

What do we all want from papers?
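
The identifiers on this slide have a regular shape, so a submission form or script could check them before a curator ever sees the paper. The sketch below is only an illustration, not a TAIR tool: the pattern (chromosome 1-5, C, or M, then five digits and an optional gene model suffix) and the names `AgiId` and `parse_agi` are assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: "At", chromosome (1-5, C = chloroplast, M = mitochondrion),
# "g", five digits, and an optional ".N" gene model suffix.
AGI_PATTERN = re.compile(r"^AT([1-5CM])G(\d{5})(?:\.(\d+))?$", re.IGNORECASE)


class AgiId(NamedTuple):
    locus: str            # normalized locus code, e.g. "At2g46990"
    model: Optional[int]  # gene model number, or None if not given


def parse_agi(text: str) -> Optional[AgiId]:
    """Return the locus and gene model, or None if text is not an AGI code."""
    match = AGI_PATTERN.match(text.strip())
    if match is None:
        return None
    chromosome, number, model = match.groups()
    locus = f"At{chromosome.upper()}g{number}"
    return AgiId(locus, int(model) if model is not None else None)


for candidate in ["At2g46990", "AT2G46990.1", "BSK3"]:
    print(candidate, "->", parse_agi(candidate))
```

A gene symbol such as BSK3 is rejected here on purpose: a symbol alone does not identify a locus.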

• What does this gene do?

• Molecular Function GO terms (TAIR)
  • has “protein kinase activity” - GO:0004672
  • functions in “histone binding” - GO:0042393
  • has “L-glutamine transmembrane transporter activity” - GO:0015186

• Phenotype description (TAIR)
  • “The ppc4-2 mutant has reduced PEP carboxylase activity”

• Reactions catalyzed (PMN)
  • indole-3-acetonitrile + 2 H2O = ammonia + indole-3-acetate (IAA)

• Information for gene summaries (TAIR)

• Information for enzyme summaries (PMN)

What do we all want from papers?

• Where is this protein found?

• Cellular Component GO terms (TAIR)
  • located in “nucleolus” - GO:0005730
  • located in “TOC complex” - GO:0010006

• Cellular Ontology (PMN)
  • chloroplast

• Information for gene summaries (TAIR)

• Information for enzyme summaries (PMN)

What do we all want from papers?

• When and where is this gene / protein expressed?

• Plant Structure PO terms (TAIR)
  • expressed in “anther” - PO:0009066

• Plant Growth Stages PO terms (TAIR)
  • expressed during “expanded cotyledon stage” - PO:0001078

• Information for gene summaries (TAIR)

• Information for enzyme summaries (PMN)

What do we all want from papers?

• What biological processes does this protein participate in?

• Biological Process GO terms (TAIR)
  • involved in “petal development” - GO:0048441
  • involved in “L-glutamate import” - GO:0051938
  • involved in “brassinosteroid biosynthetic process” - GO:0016132

• Metabolic Pathways (PMN)
  • put enzyme in “alanine degradation” pathway

• Phenotype descriptions
  • “The phot1-4 mutant shows reduced responses to blue light”

• Information for gene summaries (TAIR)

• Information for enzyme summaries (PMN)

What do we all want from papers?
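
GO and PO term IDs, as written on these slides, share one shape: a prefix, a colon, and seven digits. The sketch below normalizes spacing and case and rejects anything else. The function name `normalize_term_id` is hypothetical, and the check is purely syntactic: it does not confirm that a term exists in either ontology.

```python
import re
from typing import Optional

# Prefix (GO or PO), a colon with optional stray whitespace, then seven digits.
ONTOLOGY_ID = re.compile(r"^(GO|PO)\s*:\s*(\d{7})$", re.IGNORECASE)


def normalize_term_id(raw: str) -> Optional[str]:
    """Return the ID as 'GO:nnnnnnn' or 'PO:nnnnnnn', or None if malformed."""
    match = ONTOLOGY_ID.match(raw.strip())
    if match is None:
        return None
    prefix, digits = match.groups()
    return f"{prefix.upper()}:{digits}"


for raw in ["GO: 0004672", "po:0009066", "GO:4672"]:
    print(repr(raw), "->", normalize_term_id(raw))
```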

• What mutant(s) did they describe? (TAIR)

• Mutant ID
  • SALK_nnnnnn
  • SAIL_21_A07

• Mutant name and unique symbol
  • rte1-2 (reversion-to-ethylene-sensitivity 1-2)

• Ecotype

• Zygosity (e.g. heterozygous, homozygous)

• Phenotype description

What do we all want from papers?

• What experiments did they do?

• Assay conditions and reagents
  • Help curators
    • make GO and PO annotations (TAIR)
    • identify enzymatic reactions (PMN)
      • specific substrates, e.g. L-glutamate
      • necessary co-factors, e.g. Mg2+
    • capture pH and temperature optimums (PMN)

• We don’t capture:
  • PCR primers
  • good antibody sources
  • etc.
  • . . . but you are welcome to submit this information using “Comments”

What do we all want from papers?

Have you ever read a paper that’s missing important information?

How did that make you feel?

Did it interfere with your ability to do your work?

What do we all want from papers?

A lot of important information . . .

Gene identity

Gene function

Gene expression patterns

and much more!

Challenges : Identifying Objects

• Case 1: Paper describes a gene or genes using a symbol
  • Authors never provide AGI code, sequence information, or other unique ID

• Different genes can have the same symbols in TAIR
  • ASA:
    • Attenuated shade avoidance?
    • Anthranilate Synthase Alpha Subunit?
  • ARF1:
    • Auxin Response Factor 1?
    • ADP-Ribosylation Factor 1?

• Not all symbols are in TAIR

• Authors describe a new mutant or name a new gene family and never give IDs

• Impossible for us to annotate / Impossible for you to do related experiments
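
The symbol problem in Case 1 can be pictured as a one-to-many lookup. The sketch below indexes (symbol, full name, locus) records and reports every symbol that points to more than one locus. The locus values are placeholders (`LOCUS_A` and so on), not real AGI codes, and the function names are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (symbol, full name, locus)


def index_symbols(records: Iterable[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (full name, locus) pairs under each upper-cased symbol."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for symbol, full_name, locus in records:
        index[symbol.upper()].append((full_name, locus))
    return dict(index)


def ambiguous_symbols(index: Dict[str, List[Tuple[str, str]]]) -> List[str]:
    """Return symbols that map to more than one distinct locus."""
    return sorted(
        symbol
        for symbol, hits in index.items()
        if len({locus for _, locus in hits}) > 1
    )


# Full names come from the slide; locus values are placeholders, not AGI codes.
RECORDS: List[Record] = [
    ("ASA", "Attenuated Shade Avoidance", "LOCUS_A"),
    ("ASA", "Anthranilate Synthase Alpha Subunit", "LOCUS_B"),
    ("ARF1", "Auxin Response Factor 1", "LOCUS_C"),
    ("ARF1", "ADP-Ribosylation Factor 1", "LOCUS_D"),
    ("GGT2", "Glutamate:Glyoxylate aminotransferase 2", "LOCUS_E"),
]

print(ambiguous_symbols(index_symbols(RECORDS)))
```

A paper that gives only "ARF1" leaves the reader at the ambiguous entry; a paper that also gives the locus code resolves it immediately.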

Challenges : Identifying Objects

• Case 2: Paper does not specify gene model when appropriate

a. “The T-DNA insertion is in the third exon of TPK1” Which “third exon?”

b. “We expressed TPK1 in E. coli and saw activity” Which “TPK1?”

c. “A TPK1:GFP fusion protein localizes to the nucleus” Which “TPK1?”

Challenges : Identifying Objects

• Case 3: Not enough information is given about a mutant

• “The phyb mutant had a longer hypocotyl than the wild type plant”

• 30 alleles / germplasms associated with phyB in TAIR

Which phyb?

What ecotype?

Challenges : Identifying Objects

Case 4: Not enough information is given about enzymatic reactions

• Diagram in paper shows: arogenate → tyrosine

• “In vitro, AR dehydrogenase catalyzed the formation of tyrosine from arogenate”

D- or L-form of amino acid?

What oxidizing agent is involved? What other substrates or products are involved?

• “We detected the formation of arabidiol”

What is the chemical structure of “arabidiol?”

Opportunities : Identifying Objects

• You can help each other and curators to identify all the important items in the manuscripts you write or review

• AGI locus code for all genes in paper (At2g46990)

• Gene model information when relevant (At2g46990.1)

• Specific mutant names (abc1-7), IDs (SALK_nnnnnn) and ecotype

• Complete and balanced biochemical reactions

• Chemical structures or chemical database IDs for compounds

• But, for curators, identifying objects is only one of the challenges . . .

• You are the next generation of:
  • Authors
  • Reviewers
  • Journal Editors
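
As one illustration of what "complete and balanced biochemical reactions" means in practice, the sketch below counts atoms on each side of a reaction. It uses the nitrilase reaction quoted earlier in the talk (indole-3-acetonitrile + 2 H2O = ammonia + indole-3-acetate). The molecular formulas are not on the slides: they are supplied here for the neutral forms of each compound and should be treated as an assumption. The parser handles only simple formulas, with no parentheses or charges.

```python
import re
from collections import Counter
from typing import List, Tuple

Side = List[Tuple[int, str]]  # (stoichiometric coefficient, molecular formula)

# One element symbol followed by an optional count, e.g. "C10" or "N".
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def atoms(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C10H9NO2'."""
    counts: Counter = Counter()
    for element, number in ELEMENT.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side: Side) -> Counter:
    """Sum atom counts over one side of a reaction."""
    total: Counter = Counter()
    for coefficient, formula in side:
        for element, number in atoms(formula).items():
            total[element] += coefficient * number
    return total


def is_balanced(left: Side, right: Side) -> bool:
    return side_total(left) == side_total(right)


# Assumed neutral-form formulas (not given in the talk):
# indole-3-acetonitrile C10H8N2, water H2O, ammonia NH3, indole-3-acetic acid C10H9NO2
LEFT: Side = [(1, "C10H8N2"), (2, "H2O")]
RIGHT: Side = [(1, "NH3"), (1, "C10H9NO2")]

print("balanced:", is_balanced(LEFT, RIGHT))
```

Dropping one water from the left side makes the check fail, which is the kind of gap a reviewer can catch before publication.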

Challenges : Choosing annotations

• Curators have to make decisions . . .

• When should we make annotations?

• What specific annotations should we make?

• You should be concerned about how we “choose” annotations

• You are data providers
  • We’re capturing the data from your papers

• How would you like to see it presented?

• You are data users
  • You use our annotations of individual genes
  • You analyze your microarray data using our GO and PO annotations
  • You view your transcript and metabolomic data using the OMICs viewer

• How would you like to see it presented?

Challenges : Choosing annotations – YOU make the call!

• When and what should we annotate using GO terms?

Challenges : Choosing annotations – YOU make the call!

• Case 1: When is something “involved in” a biological process?
  • Molecular Function and Cellular Component annotations – pretty clear
  • Biological Process can be pretty ambiguous!

• Glycine metabolic process

• 6 mutants are uncovered that have altered levels of glycine

• lgl1-1, lgl2-1, lgl3-1 make “Less GLycine” than wild-type plants

• mgl1-1, mgl2-1, mgl3-1 make “More GLycine” than wild-type plants

• Annotate all 6 genes: involved in “glycine metabolic process”
  • Use evidence code: IMP = inferred from mutant phenotype
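
The "annotate everything" option can be written out as data, which makes it easier to see how much weight the evidence code carries. This is a sketch only: the `Annotation` record and `annotate_from_mutants` are hypothetical names and do not reflect TAIR's actual schema. The gene symbol is derived from the allele name by dropping the allele number and upper-casing, which is an assumption that fits the made-up lgl/mgl alleles on the slide.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Annotation:
    gene: str      # gene symbol, e.g. "LGL1"
    relation: str  # e.g. "involved in"
    term: str      # ontology term name
    evidence: str  # evidence code, e.g. "IMP"


def annotate_from_mutants(alleles: Iterable[str], term: str) -> List[Annotation]:
    """Create one IMP annotation per mutant allele (e.g. 'lgl1-1' -> LGL1)."""
    return [
        Annotation(allele.rsplit("-", 1)[0].upper(), "involved in", term, "IMP")
        for allele in alleles
    ]


ALLELES = ["lgl1-1", "lgl2-1", "lgl3-1", "mgl1-1", "mgl2-1", "mgl3-1"]

for annotation in annotate_from_mutants(ALLELES, "glycine metabolic process"):
    print(annotation)
```

All six records look identical apart from the gene name, so a user who ignores the IMP code cannot tell an enzyme from a distant regulator.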

Challenges : Choosing annotations – YOU make the call!

• Which genes are “involved in” – glycine metabolic process?

[Diagram: regulatory cascade; each gene is marked with a “?”]

• LGL1 = threonine aldolase?

• LGL2 = transcription factor: up-regulates enzyme

• LGL3 = tyrosine kinase: turns on TF

• MGL1 = F-box protein (E3 ligase subunit): degrades kinase

• MGL2 = phosphatase: promotes E3 ligase activity

• MGL3 = nucleoporin: allows phosphatase to enter nucleus

• Where do we stop?

• Should we change old annotations? (***Evidence code is important – be aware of IMP!)

• What belongs in a GO annotation versus a phenotype description?

Challenges : Choosing annotations – YOU make the call!

• Case 2: How do we deal with over-expressers? RNAi? etc.?

• What biological process is XYZ1 involved in?

• 35S:XYZ1
  • more petals than wild type plants

• xyz1 KO mutants
  • normal number of petals

• Is XYZ1 involved in “petal development?”

• XYZ1 is “only expressed in roots”

• XYZ1 is “expressed at very low levels in flowers”

• XYZ1 – no expression data mentioned

• What if XYZ1 is part of a large gene family?

• What if XYZ1 is unique (not related to other genes)?

Challenges : Choosing annotations – YOU make the call!

• Case 3: When is it “enough” to make an annotation?

• JKL is expressed in “rosette leaves”

• “RT-PCR analyses show expression of JKL in rosette leaves”

• “JKL is expressed at low levels in rosette leaves”

• “JKL expression is barely detectable in rosette leaves”

• GHI has enzymatic activity with the following substrates in vitro:

• IAA + isoleucine -> IAA-Ile (90%)

• IAA + leucine -> IAA-Leu (50%)

• IAA + histidine -> IAA-His (20%)

• IAA + cysteine -> IAA-Cys (5%)

• IAA + proline -> IAA-Pro (1%)

• Which Molecular Functions do we annotate with GO in TAIR?

• Which reactions do we add to AraCyc?

• What if the reactions are characterized in vivo?
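
One way to make the GHI question concrete is to treat it as a cutoff on the reported percentages. The sketch below does only that. Two things are assumptions: the percentages are read as relative activities, and the cutoff values are arbitrary. The talk does not propose a threshold, which is exactly the open question.

```python
from typing import Dict, List

# Percentages from the GHI example, read here as relative activity (assumption).
RELATIVE_ACTIVITY: Dict[str, int] = {
    "isoleucine": 90,
    "leucine": 50,
    "histidine": 20,
    "cysteine": 5,
    "proline": 1,
}


def substrates_to_annotate(activity: Dict[str, int], min_percent: int) -> List[str]:
    """Return substrates whose activity meets the (arbitrary) cutoff."""
    return [name for name, percent in activity.items() if percent >= min_percent]


# Arbitrary example cutoffs; none of them comes from TAIR or PMN policy.
for cutoff in (50, 20, 1):
    print(f">= {cutoff}%:", substrates_to_annotate(RELATIVE_ACTIVITY, cutoff))
```

Moving the cutoff from 50 to 1 changes the result from two reactions to five, so the choice of threshold largely determines what users will see in the database.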

Challenges : Choosing annotations – YOU make the call!

Case 4: Figures without text support

• Which genes are “expressed in” these tissues?

Challenges : Choosing annotations – YOU make the call!

Case 4: Figures without text support

• “The expression of 11 genes was detected in leaves.”

Challenges : Choosing annotations – YOU make the call!

Case 5: Which term is “most” appropriate?

GRI (Grim Reaper) is involved in the regulation of extracellular ROS-induced cell death

• “gri plants show increased ROS-induced cell death and reduced seed content.”

• “The seed content in siliques was reduced in gri and GRI overexpressors compared with Col-0 and vector control.”

Wrzaczek et al. 2009

• involved in “fruit development”

Are the siliques shorter?

Are there empty spaces in normal siliques?

• involved in “seed development”

Opportunities : Choosing annotations – YOU make the call!

You can be the annotators of the future!

• informally : e-mail us or drop by and say hello!

• use TAIR or PMN submission forms

• during journal publication process

  • Plant Physiology (now)
  • more journals in the future!

Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators

• We all read papers

• We all want to extract important and useful information from papers

• We all want reliable annotations in our databases

• Challenges:

• Sometimes it is difficult to find the information we need in papers

• Sometimes it is hard to judge how to curate data in papers

• Opportunities:

• Authors, reviewers, and editors can make sure that papers have adequate information

• Curators can help researchers to directly submit annotations to TAIR or the PMN

• Curators and researchers can communicate about the curation process
  • You know what we want
  • We know what you want!
  • We all work together to advance scientific research!

Thank you!

Current Curators:

- Tanya Berardini (lead curator – functional annotation)

- David Swarbreck (lead curator – structural annotation)

- Peifen Zhang (Director and lead curator – metabolism)

- A. S. Karthikeyan (curator)

- Philippe Lamesch (curator)

- Donghui Li (curator)

- Rajkumar Sasidharan (curator)

Recent Past Contributors:

- Debbie Alexander (curator)

- Christophe Tissier (curator)

- Hartmut Foerster (curator)

NSF

Tech Team Members:

- Bob Muller (Manager)
- Larry Ploetz (Sys. Administrator)
- Raymond Chetty
- Anjo Chi
- Vanessa Kirkup
- Cynthia Lee
- Tom Meyer
- Shanker Singh
- Chris Wilks

Metabolic Pathway Software:

- Peter Karp and SRI group

TAIR, AraCyc, and the PMN

- Eva Huala (Director and Co-PI)
- Sue Rhee (PI and Co-PI)