1
iProLINK: An integrated protein resource for literature mining and literature-based curation
1. Bibliography mapping
- UniProt mapped citations
2. Annotation extraction
- annotation tagged literature
3. Protein named entity recognition
- dictionary, name tagged literature
4. Protein ontology development
- PIRSF-based ontology
2
Objective: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function
Literature-Based Curation: Extract Reliable Information from Literature
- Function, domains/sites, developmental stages, catalytic activity, binding and modified residues, regulation, pathways, tissue specificity, subcellular location, ...
- Ensure high-quality, accurate and up-to-date experimental data for each protein.
- A major bottleneck!
Ontologies/Controlled Vocabularies: For Information Integration and Knowledge Management
- UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.
4
iProLINK http://pir.georgetown.edu/iprolink/
Testing and Benchmarking Dataset
• RLIMS-P text mining tool
• Protein dictionaries
• Name tagging guideline
• Protein ontology
5
Protein Phosphorylation Annotation Extraction
- Manual tagging assisted with computational extraction (RLIMS-P)
- Training sets of positive and negative samples
- Evidence attribution
3 objects:
- <AGENT> Enzyme (kinase catalyzing the phosphorylation), e.g., MAP kinase
- <THEME> Substrate (protein being phosphorylated), e.g., cPLA2
- <SITE> P-Site (amino acid residue being phosphorylated), e.g., Ser505
[Diagram: in phosphorylation, the Enzyme adds a P-group to the P-site of the Substrate, giving phosphorylated-cPLA2 (Ser-P)]
6
RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation
Input: Abstracts, Full-Length Texts
1. Preprocessing
- Sentence extraction
- Part of speech tagging
2. Entity Recognition
- Acronym detection
- Term recognition
3. Phrase Detection
- Noun and verb group detection
- Other syntactic structure detection
4. Semantic Type Classification
5. Relation Identification
- Nominal level relation
- Verbal level relation
6. Post-Processing
Output: Extracted Annotations, Tagged Abstracts
Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
Example: ATR/FRP-1 also phosphorylated p53 in Ser 15
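As a rough illustration of how a pattern of this shape might be applied, the sketch below approximates Pattern 1 with a regular expression in Python. RLIMS-P works on detected noun and verb groups rather than raw strings, so the regex, the name `PATTERN_1`, and the helper `extract_phosphorylation` are assumptions made for illustration, not the tool's implementation.

```python
import re

# Toy approximation of:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
# Assumption: AGENT and THEME are single tokens; the real system uses
# noun/verb group detection and would handle multi-word names.
PATTERN_1 = re.compile(
    r"(?P<agent>[\w/-]+)\s+"          # AGENT: one token, e.g. a kinase name
    r"(?:\w+\s+)?"                    # optional adverb such as "also"
    r"phosphorylate[sd]?\s+"          # active form of "phosphorylate"
    r"(?P<theme>[\w/-]+)"             # THEME: one token, the substrate
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"  # optional SITE
)


def extract_phosphorylation(sentence):
    """Return AGENT/THEME/SITE for the first match, or None if no match."""
    match = PATTERN_1.search(sentence)
    if match is None:
        return None
    return {
        "AGENT": match.group("agent"),
        "THEME": match.group("theme"),
        "SITE": match.group("site"),  # None when the site phrase is absent
    }


print(extract_phosphorylation("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
# {'AGENT': 'ATR/FRP-1', 'THEME': 'p53', 'SITE': 'Ser 15'}
```

The single-token restriction is the main simplification: a multi-word agent such as "MAP kinase" would need the phrase detection step that the pipeline performs before relation identification.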
http://pir.georgetown.edu/iprolink/
download
7
Benchmarking of RLIMS-P
- UniProtKB site feature annotation
- Proteomics Mass Spec. data analysis: protein identification
High recall for paper retrieval and high precision for information extraction
Bioinformatics. 2005 Jun 1;21(11):2759-65
8
Online RLIMS-P http://pir.georgetown.edu/iprolink/rlimsp/
(version 1.0)
• Search interface
• Summary table with top hit of all sites
• All sites and tagged text evidence
9
iProClass source databases:
- NCBI: Entrez Gene, RefSeq, GenPept
- UniProt: UniProtKB, UniRef90/50, PIR-PSD
- Genome: FlyBase, WormBase, MGD, SGD, RGD
- Other: HUGO, EC, OMIM
Processing steps:
- Name Extraction (Raw Thesaurus)
- Name Filtering (Highly Ambiguous Terms, Nonsensical Terms)
- Semantic Typing (UMLS)
Result: BioThesaurus (UniProtKB Entries: Protein/Gene Names & Synonyms)
BioThesaurus http://pir.georgetown.edu/iprolink/biothesaurus/
Applications:
• Biological entity tagging
• Name mapping
• Database annotation
• Literature mining
• Gateway to other resources
BioThesaurus v1.0 (May, 2005), m = million
# UniProtKB entry 1.86m
# Source DB record 6.6m
# Gene/protein names/terms 3.6m
10
BioThesaurus Report
Synonyms for Metalloproteinase inhibitor 3
Gene/Protein Name Mapping
1. Search Synonyms
2. Resolve Name Ambiguity (example: TMP3)
3. Underlying ID Mapping
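The three mapping steps can be pictured with a small Python sketch. It is only a toy under stated assumptions: the records, the placeholder IDs `ENTRY_A` and `ENTRY_B`, the name "Hypothetical protein B", and all function names are invented for illustration and do not reflect BioThesaurus data or its interface.

```python
from collections import defaultdict

# Hypothetical (entry ID, name) records; the IDs are placeholders,
# not real database accessions.
RECORDS = [
    ("ENTRY_A", "Metalloproteinase inhibitor 3"),
    ("ENTRY_A", "TIMP-3"),
    ("ENTRY_A", "TMP3"),
    ("ENTRY_B", "Hypothetical protein B"),
    ("ENTRY_B", "TMP3"),
]

name_to_entries = defaultdict(set)   # normalized name -> entry IDs
entry_to_names = defaultdict(set)    # entry ID -> names as written

for entry_id, name in RECORDS:
    name_to_entries[name.lower()].add(entry_id)
    entry_to_names[entry_id].add(name)


def entries_for(name):
    """Step 3: underlying ID mapping for a name (case-insensitive)."""
    return set(name_to_entries.get(name.lower(), set()))


def synonyms(name):
    """Step 1: every other name attached to the entries that carry this name."""
    found = set()
    for entry_id in entries_for(name):
        found |= entry_to_names[entry_id]
    return {n for n in found if n.lower() != name.lower()}


def is_ambiguous(name):
    """Step 2: a name is ambiguous when it maps to more than one entry."""
    return len(entries_for(name)) > 1


print(sorted(synonyms("TIMP-3")))
print(is_ambiguous("TMP3"), sorted(entries_for("TMP3")))
```

Keeping both directions of the index makes synonym lookup and ambiguity detection cheap, at the cost of storing each name twice.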
11
Protein Name Tagging
- Tagging guideline versions 1.0 and 2.0
- Generation of domain expert-tagged corpora
- Inter-coder reliability: upper bound of machine tagging
- Dictionary pre-tagging (BioThesaurus for pre-tagging)
- F-measure: 0.412 (0.372 Precision, 0.462 Recall)
- Advantages: helps standardize tagging and its extent, reduces the fatigue problem, and improves inter-coder reliability.
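The reported figures are consistent with the balanced F-measure (harmonic mean of precision and recall). The slide does not state which F variant was used, so treating it as F1 is an assumption; the short Python check below uses a hypothetical helper `f_measure`.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Dictionary pre-tagging figures from the slide.
f1 = f_measure(0.372, 0.462)
print(round(f1, 3))  # 0.412
```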
12
PIRSF-Based Protein Ontology
- PIRSF family hierarchy based on evolutionary relationships
- Standardized PIRSF family names as hierarchical protein ontology
- DAG Network structure for PIRSF family classification system
PIRSF in DAG View
13
PIRSF to GO Mapping
- Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
- 68% of the PIRSF families and subfamilies map to GO leaf nodes
- 2329 PIRSFs have shared GO leaf nodes
Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
DynGO viewer (Hongfang Liu, University of Maryland)
- Superimpose GO and PIRSF hierarchies
- Bidirectional display (GO- or PIRSF-centric views)
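To make the "leaf node" and "shared leaf node" counts concrete, here is a minimal Python sketch on a toy DAG. The term names, family IDs, and function names are all hypothetical placeholders; the sketch shows one plausible way such statistics could be computed and is not the method used for the PIRSF to GO mapping.

```python
# Hypothetical toy DAG: term -> set of parent terms (a term may have
# several parents, which is what makes this a DAG rather than a tree).
PARENTS = {
    "root": set(),
    "mid": {"root"},
    "leaf_a": {"mid"},
    "leaf_b": {"mid", "root"},
}

# Hypothetical family -> mapped term assignments.
FAMILY_TO_TERM = {
    "FAM_1": "leaf_a",
    "FAM_2": "mid",
    "FAM_3": "leaf_a",
    "FAM_4": "leaf_b",
}


def leaf_terms(parents):
    """Terms that never appear as a parent of another term."""
    used_as_parent = set()
    for term_parents in parents.values():
        used_as_parent |= term_parents
    return set(parents) - used_as_parent


def leaf_mapping_stats(family_to_term, parents):
    """Return (fraction of families on leaves, leaves shared by >1 family)."""
    leaves = leaf_terms(parents)
    by_leaf = {}
    for family, term in family_to_term.items():
        if term in leaves:
            by_leaf.setdefault(term, []).append(family)
    mapped = sum(len(families) for families in by_leaf.values())
    fraction = mapped / len(family_to_term)
    shared = {t: sorted(f) for t, f in by_leaf.items() if len(f) > 1}
    return fraction, shared


fraction, shared = leaf_mapping_stats(FAMILY_TO_TERM, PARENTS)
print(fraction, shared)
```

In this toy data three of four families sit on leaves, and `leaf_a` is shared by two families, which is the kind of case where a family-level ontology can add detail below a GO leaf.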
14
Protein Ontology Can Complement GO
Expanding a Node: Identification of GO subtrees that can be expanded when GO concepts are too broad
- IGFBP subfamilies
- High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP
GO-centric view
15
Exploration of Gene and Protein Ontology
PIRSF-centric view
Estrogen receptor alpha (PIRSF50001)
Systematic links between three GO sub-ontologies, e.g., linking molecular function and biological process:
- Molecular function: Estrogen receptor binding
- Biological process: Estrogen receptor signaling pathway
16
Summary
- The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
- RLIMS-P is a text-mining tool for extracting protein phosphorylation information from PubMed literature.
- BioThesaurus can be used for name mapping to resolve name synonym and ambiguity issues.
- The PIRSF-based protein ontology can complement other biological ontologies such as GO.
17
Acknowledgements
Research Projects
- NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
- NSF: SEIII (Entity Tagging)
- NSF: ITR (Ontology)
Collaborators
- I. Mani, Georgetown University Department of Linguistics: protein name recognition and protein name ontology.
- H. Liu, University of Maryland Department of Information Systems: protein name recognition and text mining.
- Vijay K. Shanker, University of Delaware Department of Computer and Information Science: text mining of protein phosphorylation features.