Article Semanticizer – Stitching Data Mining Services Into a Standalone Search Appliance David P....
Article Semanticizer – Stitching Data Mining Services Into a Standalone
Search Appliance
David P. Shorthouse, Université de Montréal / Canadensys
Dmitry Mozzherin, Marine Biological Laboratory / Global Names
@dpsSpiders, @dimus
Biota of Canada
http://biologicalsurvey.ca
We want to find & then organize data from printed materials but search is
exasperatingly limited
15,000 OCR'd articles & their scanned images (9 GB)
Key Players
Global Names
http://gnrd.globalnames.org
http://resolver.globalnames.org
Named Entity Extraction
people, companies, organizations, cities, geographic features
elasticsearch
http://canent.shorthouse.net
https://github.com/dshorthouse/article_semanticizer
Search Characteristics
• Tokenizers: path hierarchy
• Filters: edge n-gram, pattern replace (abbreviated genera), stemmer (English), elision (French)
• Analyzers: lowercase, ASCII folding, autocomplete
• Full text
• Thanks to: Christian Gendreau (Canadensys)
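As a rough illustration of how the pieces listed above could fit together, the sketch below builds an elasticsearch analysis configuration as a plain Python dictionary. It is not taken from the project: every filter and analyzer name (`autocomplete_edge`, `genus_abbreviation`, `full_text`, and so on), the gram sizes, the regular expression, and the list of French articles are assumptions chosen for the example. The two helper functions only mimic, in pure Python, what the edge n-gram and pattern-replace filters would emit, so the idea can be checked without a running cluster.

```python
import json
import re

# Hypothetical analysis settings; all names and parameter values are illustrative.
INDEX_SETTINGS = {
    "analysis": {
        "tokenizer": {
            # Splits "Animalia/Arthropoda/Araneae" into its ancestor paths.
            "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"},
        },
        "filter": {
            "autocomplete_edge": {"type": "edge_ngram", "min_gram": 2, "max_gram": 20},
            # Assumed use of pattern replace: index "Pardosa moesta" as "P. moesta".
            "genus_abbreviation": {
                "type": "pattern_replace",
                "pattern": "^([A-Z])[a-z]+\\s+",
                "replacement": "$1. ",
            },
            "english_stemmer": {"type": "stemmer", "language": "english"},
            "french_elision": {"type": "elision", "articles": ["l", "d", "qu", "n", "s", "j"]},
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding", "autocomplete_edge"],
            },
            "full_text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "french_elision", "asciifolding", "english_stemmer"],
            },
            "abbreviated_name": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["genus_abbreviation", "lowercase"],
            },
            "classification_path": {"type": "custom", "tokenizer": "path_tokenizer"},
        },
    }
}


def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 20) -> list[str]:
    """Mimic an edge n-gram filter: return the leading prefixes of a token."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def abbreviate_genus(name: str) -> str:
    """Mimic the assumed pattern-replace rule: shorten the genus to its initial."""
    return re.sub(r"^([A-Z])[a-z]+\s+", r"\1. ", name)


if __name__ == "__main__":
    # The settings are plain JSON, ready to send as an index-creation request body.
    print(json.dumps({"settings": INDEX_SETTINGS}, indent=2)[:200])
    print(edge_ngrams("pardosa", 2, 4))
    print(abbreviate_genus("Pardosa moesta"))
```

Indexing prefixes at write time keeps autocomplete queries cheap, and storing an abbreviated form alongside the full name lets a search for "P. moesta" match articles that only spell out the genus once.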
Possible Next Steps
• Generalize the design to best support content types (e.g. specimen labels)
• Better recognition of other entities, text blocks
• Scientific name plugin for elasticsearch (hackathon?)
• Share with Journal Map and Mining Biodiversity
• Engage scientific societies, journals