GlobalNames - Canadensys - Shorthouse

Summary slides for the AntCat workshop, August 24-26, San Francisco, CA

Page 1: GlobalNames - Canadensys - Shorthouse

Global Names Recognition and Discovery (GNRD)

• High-throughput, queue-based « skin » on multiple processes of scientific name-finding engines
– NetiNeti: Python, machine-learning-based
– TaxonFinder: Perl, dictionary-based

• Inputs: any file, URL, free-form text
– Uses Docsplit gem (Tesseract OCR as needed)
– Can send gzip request

• Outputs: JSON/XML
– Scientific names & their character offsets
– OCR text
– Resolved names
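
As a rough illustration of what the character offsets in the output make possible, the Python sketch below marks each found name in the source text. The field names (`scientificName`, `offsetStart`, `offsetEnd`) and the end-exclusive offset convention are assumptions made for this example, not a confirmed GNRD schema.

```python
def mark_names(text, names, open_mark="[", close_mark="]"):
    """Wrap each found name in `text` using its character offsets.

    `names` is a list of dicts with the hypothetical keys `offsetStart`
    and `offsetEnd` (end-exclusive). These keys are assumed for
    illustration only.
    """
    out = []
    cursor = 0
    for name in sorted(names, key=lambda n: n["offsetStart"]):
        start, end = name["offsetStart"], name["offsetEnd"]
        if start < cursor:
            continue  # skip a hit that overlaps the previous one
        out.append(text[cursor:start])
        out.append(open_mark + text[start:end] + close_mark)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


sample_text = "Workers of Camponotus claripes were collected."
sample_names = [
    {"scientificName": "Camponotus claripes", "offsetStart": 11, "offsetEnd": 30}
]
marked = mark_names(sample_text, sample_names)
```

Because the offsets point into the original (or OCR) text, the same approach could feed highlighting or linking without searching the text again.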

Page 3: GlobalNames - Canadensys - Shorthouse

GNRD Clients & Applications

Page 4: GlobalNames - Canadensys - Shorthouse

• 15,000 OCR’d articles, 1868-2002
• All with DOIs
• 158,000 unique scientific names
• 92,000 vernaculars
• 20,000 entities

Page 6: GlobalNames - Canadensys - Shorthouse

No Consistency in Search APIs

http://eol.org/api/search/1.0.json?q=Ursus

{
  "totalResults": 152,
  "startIndex": 1,
  "itemsPerPage": 30,
  "results": [
    {
      "id": 14349,
      "title": "Ursus",
      "link": "http://eol.org/14349?action=overview&controller=taxa",
      "content": "Ursus Linnaeus, 1758; Ursus; Ursus (genus); Ursus (genus) Linnaeus, 1758; Ursus Arctos Bruinosus"
    },
    { ... }
  ],
  "first": "http://eol.org/api/search/Ursus.json?page=1",
  "self": "http://eol.org/api/search/Ursus.json?page=1",
  "next": "http://eol.org/api/search/Ursus.json?page=2",
  "last": "http://eol.org/api/search/Ursus.json?page=6"
}

http://api.gbif.org/name_usage/search?q=Ursus

{
  offset: 0,
  limit: 20,
  endOfRecords: false,
  count: 77,
  results: [
    {
      datasetTitle: "English Wikipedia Species Pages",
      parent: "Ursidae",
      kingdom: "Animalia",
      phylum: "Chordata",
      clazz: "Mammalia",
      order: "Carnivora",
      family: "Ursidae",
      genus: "Ursus",
      scientificName: "Ursus",
      canonicalName: "Ursus",
      authorship: "",
      nameType: "WELLFORMED",
      rank: "GENUS",
      …
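
To make the inconsistency concrete, here is a minimal Python sketch of the glue code a client ends up writing to page through both services. It uses only the paging keys visible in the two responses; the function name `normalize_paging`, the common output shape, and the reading of `startIndex` as a 1-based item index are assumptions for illustration.

```python
def normalize_paging(response):
    """Map either paging style onto one common shape.

    EOL-style keys: totalResults, startIndex (assumed 1-based), itemsPerPage.
    GBIF-style keys: count, offset (0-based), limit.
    """
    if "totalResults" in response:
        total = response["totalResults"]
        offset = response["startIndex"] - 1
        limit = response["itemsPerPage"]
    elif "count" in response:
        total = response["count"]
        offset = response["offset"]
        limit = response["limit"]
    else:
        raise ValueError("unrecognised paging keys")
    return {
        "total": total,
        "offset": offset,
        "limit": limit,
        "has_more": offset + limit < total,
        "results": response.get("results", []),
    }


# Paging values copied from the two example responses; results omitted.
eol_like = {"totalResults": 152, "startIndex": 1, "itemsPerPage": 30, "results": []}
gbif_like = {"offset": 0, "limit": 20, "endOfRecords": False, "count": 77, "results": []}

eol_page = normalize_paging(eol_like)
gbif_page = normalize_paging(gbif_like)
```

Every additional provider with its own key names would need another branch, which is the cost the slide is pointing at.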

Page 7: GlobalNames - Canadensys - Shorthouse

Use Darwin Core Terms

Page 8: GlobalNames - Canadensys - Shorthouse

OpenURL

• Created in the late 1990s by a Flemish librarian
• e.g. v0.1: http://resolver.example.edu/cgi?genre=book&isbn=0836218310&title=The+Far+Side+Gallery+3

• But no specification for response structure!!!
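
The request side is simply key-value pairs in a query string, which the Python sketch below rebuilds for the example above using the standard library. The helper name `build_openurl` is hypothetical; only the resolver address and the three keys come from the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_openurl(resolver, **fields):
    """Append citation fields to a resolver base URL as a query string."""
    return resolver + "?" + urlencode(fields)


url = build_openurl(
    "http://resolver.example.edu/cgi",
    genre="book",
    isbn="0836218310",
    title="The Far Side Gallery 3",
)

# Reading the request back is equally mechanical. It is the response
# that has no agreed structure.
fields = parse_qs(urlsplit(url).query)
```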

Page 9: GlobalNames - Canadensys - Shorthouse

bibJSON

{
  "title": "Open Bibliography for Science, Technology and Medicine",
  "author": [
    {"name": "Richard Jones"},
    {"name": "Mark MacGillivray"},
    {"name": "Peter Murray-Rust"},
    {"name": "Jim Pitman"},
    {"name": "Peter Sefton"},
    {"name": "Ben O'Steen"},
    {"name": "William Waites"}
  ],
  "type": "article",
  "year": "2011",
  "journal": {"name": "Journal of Cheminformatics"},
  "link": [{"url": "http://www.jcheminf.com/content/3/1/47"}],
  "identifier": [{"type": "doi", "id": "10.1186/1758-2946-3-47"}]
}
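
With a predictable structure like this, a client needs very little code to pull out what it wants. The Python sketch below reads the DOI and author names from a trimmed copy of the record above; the helper name `first_identifier` is hypothetical.

```python
# Trimmed copy of the bibJSON record shown above.
record = {
    "title": "Open Bibliography for Science, Technology and Medicine",
    "author": [{"name": "Richard Jones"}, {"name": "Mark MacGillivray"}],
    "type": "article",
    "year": "2011",
    "identifier": [{"type": "doi", "id": "10.1186/1758-2946-3-47"}],
}


def first_identifier(rec, id_type):
    """Return the first identifier of the given type, or None if absent."""
    for ident in rec.get("identifier", []):
        if ident.get("type") == id_type:
            return ident.get("id")
    return None


doi = first_identifier(record, "doi")
authors = [a["name"] for a in record.get("author", [])]
```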

Page 10: GlobalNames - Canadensys - Shorthouse

Recommendation

• Use DwC terms as query params for find or ‘q’ for search

• Use DwC terms as keys in JSON responses

http://www.antweb.org/description.do?name=claripes%20orbiculatopunctatus&genus=camponotus&rank=species&project=worldants

http://www.antweb.org/description.do?specificEpithet=claripes&infraspecificEpithet=orbiculatopunctatus&genus=camponotus&taxonRank=species&project=worldants
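
The rewrite from the first URL to the second can be sketched in Python as below. The mapping (`name` split into `specificEpithet` and `infraspecificEpithet`, `rank` renamed to `taxonRank`) is inferred from the two URLs above, and the function name `to_dwc_query` is hypothetical. The Darwin Core form is the recommendation here, not necessarily an endpoint AntWeb supports.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_dwc_query(url):
    """Rewrite ad hoc taxon query parameters as Darwin Core terms."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    dwc = {}

    # "name" holds one or two epithets separated by a space.
    epithets = params.pop("name", "").split()
    if epithets:
        dwc["specificEpithet"] = epithets[0]
    if len(epithets) > 1:
        dwc["infraspecificEpithet"] = epithets[1]

    if "genus" in params:
        dwc["genus"] = params.pop("genus")
    if "rank" in params:
        dwc["taxonRank"] = params.pop("rank")

    # Pass remaining parameters (e.g. project) through unchanged.
    dwc.update(params)
    return urlunsplit(parts._replace(query=urlencode(dwc)))


legacy = (
    "http://www.antweb.org/description.do"
    "?name=claripes%20orbiculatopunctatus&genus=camponotus"
    "&rank=species&project=worldants"
)
dwc_url = to_dwc_query(legacy)
```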

Page 11: GlobalNames - Canadensys - Shorthouse

Canadensys: Vascular Plants of Canada (VASCAN)

Luc Brouillet, Peter Desmet, et al.

Page 12: GlobalNames - Canadensys - Shorthouse

http://data.canadensys.net/vascan

Page 15: GlobalNames - Canadensys - Shorthouse

http://data.canadensys.net/vascan/name/Carex%20abbreviata

Page 16: GlobalNames - Canadensys - Shorthouse

http://data.canadensys.net/vascan/taxon/26512

Page 18: GlobalNames - Canadensys - Shorthouse

http://doi.org/10.3897/phytokeys.25.3100

Page 20: GlobalNames - Canadensys - Shorthouse

http://creativecommons.org/publicdomain/zero/1.0/

Page 21: GlobalNames - Canadensys - Shorthouse

Suggestions for AntCat

• Run literature through GNRD

• Simplify web presence with concentration on search as the entry point
– Index all available content
– Present « pages » as declaration of relationships

• Use Darwin Core terms in « find » and « search » services

• Make DwC-A, CC-0 waiver, data paper & publish to GBIF, make accessible to GN