MER: a Minimal Named-Entity Recognition Tagger and Annotation Server
Francisco M. Couto, Luis F. Campos, and Andre Lamurias
LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
BioCreative V.5 Workshop, April 26-27, 2017
Why Minimal?
• TIPS (Technical interoperability and performance of annotation servers)
– it’s cool, we have to participate somehow
• But we have limited computational resources
• Idea: Go Minimal
– Minimize the number of tools and steps to perform Named‐Entity Recognition (NER)
What is Minimal?
• Flexibility
– Simple input
• Autonomy
– Minimal set of components and software dependencies
• Efficiency
– Low execution time
How Minimal?
• Only requires a lexicon as input
– a text file
• Only two components:
1. process the lexicon (offline)
2. produce the annotations (on-the-fly)
• GNU Bash shell script
– Using high-performance grep and awk tools
– Portability: any Unix-like operating system
Input
• lexicon text file
α-maltose
nicotinic acid
nicotinic acid D-ribonucleotide
nicotinic acid-adenine dinucleotide phosphate
Pre‐Processing
== one-word (...word1.txt)
α.maltose
== two-word (...word2.txt)
nicotinic acid
== more-words (...words.txt)
nicotinic acid d.ribonucleotide
nicotinic acid.adenine dinucleotide phosphate
== first-two-words (...words2.txt)
nicotinic acid
nicotinic acid.adenine
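The pre-processing step can be approximated with standard Unix tools. The sketch below is an illustration under assumptions, not the MER source: the function names `normalize_lexicon` and `split_by_word_count` are hypothetical, and the output file prefix is supplied by the caller because the slide elides the real one. It lowercases each term, replaces punctuation with "." (which matches any single character in a regular expression), and distributes the terms into the four files by word count.

```shell
# Hypothetical sketch, not the MER source: the function names and the
# caller-supplied output prefix are assumptions.

# Lowercase each term and replace punctuation with ".", so that position
# matches any single character when the term is used as a regex.
normalize_lexicon() {
  tr '[:upper:]' '[:lower:]' | sed 's/[[:punct:]]/./g'
}

# Read normalized terms on stdin and write up to four files named
# PREFIX followed by word1.txt, word2.txt, words.txt, and words2.txt.
split_by_word_count() {  # usage: split_by_word_count PREFIX < terms
  awk -v p="$1" '
    NF == 1 { print > (p "word1.txt") }
    NF == 2 { print > (p "word2.txt") }
    NF >= 3 {
      print > (p "words.txt")
      # Keep each distinct pair of leading words once.
      first2 = $1 " " $2
      if (!seen[first2]++) print first2 > (p "words2.txt")
    }
  '
}
```

A possible invocation is `normalize_lexicon < lexicon.txt | split_by_word_count lexicon_`, where both file names are placeholders. How non-ASCII letters such as α are classified depends on the locale in which `sed` runs.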
Recognition
• Common Solution
– Apply grep directly to the input text
– Execution time is proportional to the size of the lexicon
• Inverted Solution
– Input text as patterns matched against the lexicon
– More than 100 times faster
• TIPS chemical lexicon
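The inverted solution can be sketched as follows. This is a minimal illustration under assumptions, not the MER implementation: `word_ngrams` and `inverted_match` are hypothetical names, and the lexicon file is assumed to hold one lowercased term per line with punctuation already replaced by ".". The words of the input text become the grep patterns, and the lexicon file becomes the data that grep scans.

```shell
# Hypothetical sketch of the inverted lookup, not the MER source.

# Print every run of N adjacent words of TEXT, lowercased and with
# punctuation replaced by "." (a regex wildcard), one run per line.
word_ngrams() {  # usage: word_ngrams N TEXT
  printf '%s\n' "$2" |
    tr '[:upper:]' '[:lower:]' |
    sed 's/[[:punct:]]/./g' |
    awk -v n="$1" '{
      for (i = 1; i + n - 1 <= NF; i++) {
        s = $i
        for (j = 1; j < n; j++) s = s " " $(i + j)
        print s
      }
    }'
}

# Inverted lookup: the n-grams of the text are the grep patterns (read from
# stdin via "-f -") and the lexicon file is the data being scanned.
# -x keeps only lexicon lines that match a pattern as a whole line.
inverted_match() {  # usage: inverted_match N TEXT LEXICON_FILE
  word_ngrams "$1" "$2" | sort -u | grep -x -f - "$3"
}
```

With this arrangement the number of patterns grows with the length of the input text rather than with the number of lexicon terms, which is one plausible reading of why the slide reports a large speed-up for short texts against a large lexicon.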
Input text as patterns
Output
./get_entities.sh 'α-maltose and nicotinic acid D-ribonucleotide was found, but not nicotinic acid' lexicon
0 9 α-maltose
14 28 nicotinic acid
65 79 nicotinic acid
14 45 nicotinic acid D-ribonucleotide
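Each output row gives a start offset, an end offset, and the matched text. The sketch below shows one way such offsets could be computed with awk once a term is known to occur in the text; `term_offsets` is a hypothetical helper, not part of MER. It assumes 0-based start offsets and exclusive end offsets, which agrees with the rows shown above.

```shell
# Hypothetical helper, not the MER source.
# Print "start<TAB>end<TAB>match" for every case-insensitive occurrence of
# TERM in TEXT; start is 0-based and end is exclusive.
term_offsets() {  # usage: term_offsets TEXT TERM
  MER_TEXT="$1" MER_TERM="$2" awk 'BEGIN {
    text = ENVIRON["MER_TEXT"]
    term = tolower(ENVIRON["MER_TERM"])
    n = length(term)
    if (n == 0) exit
    rest = tolower(text)
    pos = 0    # 0-based offset in text at which rest begins
    while ((i = index(rest, term)) > 0) {
      start = pos + i - 1
      # Print the match as it appears in the original text.
      printf "%d\t%d\t%s\n", start, start + n, substr(text, start + 1, n)
      pos += i                     # continue one character past this start
      rest = substr(rest, i + 1)
    }
  }'
}
```

The arguments are passed through the environment rather than `-v` so that awk does not interpret backslashes in the text. Whether offsets count characters or bytes for non-ASCII text such as α depends on the awk implementation and the locale.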
ANNOTATION SERVER
Input: Lexicons
• Cell line and cell type:
– Cellosaurus
• Chemical:
– HMDB, ChEBI and ChEMBL
• Disease:
– Human Disease Ontology
• miRNA:
– miRBase
• Protein:
– Protein Ontology
• Subcellular structure:
– cellular component aspect of Gene Ontology
• Tissue and organ:
– tissue and organ subsets of UBERON
https://github.com/lasigeBioTM/MER/raw/biocreative2017/data/TIPS_MER_lexicons_Jan2017.zip
Lexicon Size
• more than 1M terms composed of more than 2M words and more than 25M characters
Input: text
• jq
– a command-line JSON processor
– to parse the requests
• cURL
– to download each document
• Parsers
– PubMed, Patents, PMC
https://github.com/lasigeBioTM/MER/tree/biocreative2017/external_services
• NO CACHE
Output
• Added some more columns to MER output
– BeCalm TSV format
• The score
– score = 1 - 1/ln(nc)
– nc = number of characters of the recognized term
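The score formula can be evaluated directly with awk, whose `log` function is the natural logarithm. The sketch below is an illustration only: `term_score` is a hypothetical name, and printing "undefined" for terms shorter than two characters is an assumption, since ln(1) = 0 would cause a division by zero and the slides do not say how that case is handled.

```shell
# Hypothetical helper, not the MER source.
# Score of a recognized term: 1 - 1/ln(nc), nc = number of characters.
term_score() {  # usage: term_score TERM
  MER_TERM="$1" awk 'BEGIN {
    nc = length(ENVIRON["MER_TERM"])
    # Assumption: ln(1) = 0, so one-character (or empty) terms get no score.
    if (nc < 2) { print "undefined"; exit }
    printf "%.6f\n", 1 - 1 / log(nc)
  }'
}
```

For example, `term_score 'nicotinic acid'` (14 characters) gives about 0.62. Longer terms score higher, and the value approaches 1 only slowly because of the logarithm.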
Infrastructure
• Three Virtual Machines (VMs)
– Each had 8 GB of RAM and 4 CPUs @ 1.7 GHz
– CentOS Linux release 7.3.1611 (Core)
• One VM (primary) processes the requests, distributes the jobs, and executes MER.
• The other two VMs (secondary) just execute MER.
• NGINX as HTTP server running CGI scripts
– high performance
• Task Spooler to manage and distribute jobs
Results
• April 21, 2017
• less than 3 seconds on average
Web Tool
http://labs.fc.ul.pt/mer/
RESTful Web service
Conclusions
• MER: a minimal NER tagger
– Flexible: extensible to any lexicon
– Autonomous: only requires a GNU Bash shell
– Efficient: high-performance capacity of grep
• Annotation Server
– developed in-house
– minimal software dependencies
– open-source
• Future: entity linking functionality in MER
Acknowledgments
• Portuguese National Distributed Computing Infrastructure (http://www.incd.pt)
• Links
– https://github.com/lasigeBioTM/MER
– http://labs.fc.ul.pt/mer/