Text Mining Biodiversity 20160127

Text Mining Biodiversity S. Ananiadou E. Milios W. Ulate


MiBio: Mining Biodiversity


Shortcuts for fast forward of VLC videos: http://www.shortcutworld.com/en/win/VLC-Media-Player.html
Before starting, go to display settings and make the projector screen the main screen, so that videos pop up there and not on the laptop screen.

Partners

4/14/2016 Mining Biodiversity

- BHL is the data source
- IMLS is the funding agency
- Missouri Botanical Garden is the partner for the US
- Smithsonian Libraries is a contractor (not sure if we should include it)

Outline
1. Introduction (Sophia)
2. Creating a Term Inventory of Biodiversity (Sophia)
3. Interactive Visualization of Inventory (Sophia)
4. Creating a Text Mining Infrastructure for Biodiversity (Sophia)
5. Interactive Clustering of Search Engine results (Evangelos)
6. OCR Error correction (Evangelos)
7. Social media platform (William; Anatoliy's video has voice, so it is self-explanatory)
8. Impact (William)

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global biodiversity commons. The BHL consortium works with the international taxonomic community, rights holders, and other interested parties to ensure that this biodiversity heritage is made available to a global audience through open access principles. In partnership with the Internet Archive and through local digitization efforts, the BHL has digitized more than 48 million pages of taxonomic literature, representing over 100,000 titles and over 170,000 volumes.

What do we want to do?


http://miningbiodiversity.org
Help transform BHL into a next-generation social digital library through a multi-disciplinary approach that includes:
- Text Mining
- Machine learning
- History of Science
- Environmental History & Studies
- Library and Information Science
- Social Media

MiBIO will integrate TM tools within an interoperable platform to provide a semantic search system for the BHL, enhanced through clustering and visualisation capabilities. MiBIO will also provide a social media environment, which will enable BHL users to discuss, link and share digital artifacts posted to social media sites linked to the BHL search portal. The outcome will be the transformation of the BHL from a Digital Library (DL) into a Social Digital Library (SDL). This will be achieved through the enrichment of its historical digital archives with semantic metadata generated by TM. Furthermore, by leveraging existing social media sites and providing facilities for their integration with the BHL, we will engage a community of users to exploit the BHL as a forum for the exchange of ideas.
In a nutshell, we have incorporated into BHL three elements, as part of the Mining Biodiversity project: Visualisation, Social Media and Semantic Metadata.

Creating the Term Inventory: why we need it
A species name can usually be expressed in multiple ways, e.g., using scientific names or vernacular names:
- Balaena mysticetus: Bowhead whale, bowhead
- Spizella passerina: Chipping sparrows

Identify synonymous terms in biodiversity text

Why? To go beyond keyword-based search!


Such variants degrade the performance of a keyword-based search engine and cause difficulties for non-expert users (users who are not familiar with scientific names). To alleviate the problem of searching across variants, we have compiled a terminological inventory containing semantic variants of biodiversity terms (e.g., for mammals, birds and plants) by using distributional semantic methods:
- Learn the representation vector of each term
- Calculate the cosine similarity between two terms
- Extract the top-20 synonym candidates
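The last two steps in these notes (cosine similarity, then top-20 extraction) can be sketched as follows. This is only a minimal illustration, not the project's code: the function names and the 3-dimensional toy vectors are invented here, and real term vectors would be learned from the BHL corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_candidates(query, vectors, k=20):
    """Rank every other term by cosine similarity to `query` and keep the best k."""
    scored = [
        (term, cosine_similarity(vectors[query], vec))
        for term, vec in vectors.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, invented for illustration only.
toy_vectors = {
    "balaena mysticetus": [0.9, 0.1, 0.0],
    "bowhead whale": [0.8, 0.2, 0.1],
    "spizella passerina": [0.0, 0.1, 0.9],
}
candidates = top_k_candidates("balaena mysticetus", toy_vectors, k=2)
```

With these toy vectors, the vernacular name of the whale ranks above the unrelated bird name.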

Search Results Using Vernacular Names

Vernacular name of Balaena mysticetus
Different results!!

And here is the search result when we use a common name for the previous term; it consists of only one document related to the bowhead whale. Evidently, the search engine returns a different result from the previous one.

Keyword-based Search: Ambiguity

Boxwood
- historic place in Alabama?
- North American term for plants in the Buxaceae family?

Box
- container?
- the name for boxwood in other English-speaking countries?

Another problem with keyword-based search, as mentioned above, is ambiguity. If one searches for Boxwood, a keyword-based system wouldn't know whether the user was referring to a place in Alabama or to the North American term for plants in the Buxaceae family. It will just return all documents pertaining to both. Nor will it know whether a query for Box pertains to the same plant family (this is how other English-speaking countries refer to it) or to a container.

Methods: Distributional Semantics
Determine the meaning of terms and phrases by looking at the context and the meaning of individual words

Context vector of "bowhead whale":
Context word   Value
balaena        43.99
mysticetus     39.99
alaska         25.06
seals          23.92
distribution   20.84
ringed         19.86
catch          19.52
quota          17.91
murray         5.62

[Figure: the same context words and values, shown in two groups (labels: bowhead, whale)]

We then implemented two distributional semantic models. The first one is a count-based model that determines the

For example, within a 7-word window, this is the context vector of bowhead whale -- SA rubbish frequency
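To make the idea of a window-based context vector concrete, here is a minimal sketch that counts the words occurring near each occurrence of a term. It rests on assumptions: the function name and the sample sentence are invented, the window is read as 7 tokens on each side, and raw counts are used, whereas the values on the slide are weighted in a way these notes do not specify.

```python
from collections import Counter

def context_counts(tokens, term, window=7):
    """Count words within `window` tokens on either side of each occurrence of `term`."""
    term_tokens = term.split()
    n = len(term_tokens)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term_tokens:
            left = tokens[max(0, i - window):i]    # up to `window` tokens before the term
            right = tokens[i + n:i + n + window]   # up to `window` tokens after the term
            counts.update(left + right)
    return counts

# Sample sentence invented for illustration only.
text = "the bowhead whale balaena mysticetus is hunted in alaska under a catch quota"
vector = context_counts(text.split(), "bowhead whale", window=7)
```

Here `balaena` and `alaska` fall inside the window and are counted, while `catch` and `quota` lie beyond it and are not.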

Distributional semantics methods
Terms most similar to balaena mysticetus:
Term                  Similarity
balaena glacialis     0.7896
bowhead whale         0.7392
bowhead               0.7074
bowhead whales        0.6999
eubalaena glacialis   0.6905
minke whale           0.6864
humpback whale        0.6490
sperm whale           0.6440
finback whale         0.6322
sei whale             0.6287
eubalaena japonica    0.6065
bryde's whale         0.6052
humpback whales       0.6000
finback whales        0.5998

In this manner, for each name, we generate a list of names ranked by similarity. For balaena mysticetus, for example, we obtained the list shown on this slide. Determine the meaning of a term by considering all lexical units occurring within an N-word window.

Experiments
- Training data: all English texts from the BHL (about 26 million pages, 49 GB)
- Evaluation data: synonymous terms from the Catalogue of Life (CoL)
- Select 500 scientific names and their synonyms from the CoL

Category   Class      #terms in CoL   #terms in BHL   #average synonyms in CoL
Birds      Aves       1140            818             2.28
Mammals    Mammalia   1131            726             2.26
Plants     Plantae    1141            826             2.28

Results at top-20
Category   Pre@20                       Rec@20
Birds      [value lost in extraction]   63%
Mammals    62.12%                       53.84%
Plants     56.17%                       21.43%

We have conducted our experiments on the Biodiversity Heritage Library (BHL) corpus. The corpus size is about 49 GB.

We created a gold-standard dataset of synonymous terms based on the Catalogue of Life (CoL). For each scientific name, we extracted the corresponding common names and synonyms. We then randomly picked 500 species whose class is Aves. As a result, we obtained about 1,100 bird names (both vernacular and scientific), of which about 800 occur in the BHL corpus. According to the CoL, the average number of synonyms for each scientific name is about 2. We followed the same process for mammal and plant names.

The precision and recall scores at top-20 are given on this slide. Among the three categories, performance is best for bird names. The lower performance for plant names can be explained by the fact that, unlike for mammals and birds, most synonyms of plant names are themselves scientific names, which are more difficult to detect than vernacular names.

3. Interactive visualization of term inventory


Shift+Arrow Right/Arrow Left: jump 3 seconds forward/backward
Alt+Arrow Right/Arrow Left: jump 10 seconds forward/backward
Ctrl+Arrow Right/Arrow Left: jump 1 minute forward/backward

- Frequency of species names can be visually explored, or queried through a search interface
- Clicking on a species name acts as a query to retrieve its top-20 semantically related species
-- Their semantic relatedness scores can be inspected
-- A blue color denotes that the species name appears as a synonym in the CoL
- Interactive visualizations were constructed for mammals, plants and birds

[and in case somebody asks:]
- Images, which were crawled from external open sources, may help visually assess species' relatedness based on their visible features.


Term Inventory Visualization Video

- Species names are shown in bubbles
- Larger bubbles denote species more frequently mentioned in the biodiversity literature
- Upon interaction, (semantically) related species can be inspected
- Color opacity indicates degree of relatedness
- Blue color indicates that species also appear as synonyms in CoL
- Images are retrieved from open data collections (e.g. Wikipedia)

4. Creating a text mining infrastructure for biodiversity

- Web-based, graphical TM workbench
- Straightforward integration of tools into modular, extensible, reconfigurable and reusable workflows

http://argo.nactem.ac.uk

Source: LEGO DUPLO

- Web-based application: no installation; access with a web browser
- Multi-user system: remote collaborative annotation
- Supports the Unstructured Information Management Architecture (UIMA), cloud and high-performance computing

Annotation Workflow for Biodiversity

- Pre-processing
- Dictionary lookup
- Machine learning-based recognition
- Relation extraction
- Saving

This is the workflow that we put together using Argo. Without going too much into detail, I will just point out the general types of processing it tries to do: pre-processing (sentence splitting, tokenisation and part-of-speech tagging), matching against dictionaries or controlled vocabularies such as the ENVO and PATO ontologies, machine learning-based recognition of entities, extraction of relations based on the results of dependency parsing, and serialisation of the generated annotations.
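As a rough illustration of the first two stages (pre-processing, then dictionary matching), here is a small standalone sketch. It is not Argo or UIMA code: every function name is hypothetical, the splitter and tokeniser are deliberately naive, and the three vocabulary entries are invented stand-ins rather than actual ENVO or PATO terms.

```python
import re

def sentence_split(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenise(sentence):
    """Lowercased alphanumeric tokens; punctuation is discarded."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def dictionary_lookup(tokens, vocabulary, max_len=3):
    """Greedy longest-match lookup of token n-grams in `vocabulary` (phrase -> label)."""
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary:
                matches.append((phrase, vocabulary[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this token
    return matches

# Hypothetical vocabulary, invented for illustration only.
toy_vocabulary = {"rain forest": "habitat", "forest": "habitat", "red": "quality"}
text = "The bird nests in the rain forest. Its crest is red."
annotations = [dictionary_lookup(tokenise(s), toy_vocabulary) for s in sentence_split(text)]
```

Longest-match lookup is used so that "rain forest" is annotated once as a phrase rather than also matching the shorter entry "forest".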

Annotation Workflows Video

5. Interactive clustering of search engine results
- Goal: to cluster BHL search engine results
- Input dataset: output of an OR query based on the following terms: Kangaroo, Lion, Rabbit, Shark
- Only titles of books or articles are considered in clustering
- Interactive clustering based on the key terms of the titles
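The slide does not say which clustering algorithm is used. As a stand-in, the sketch below groups titles by the overlap of their key terms using simple single-pass (leader) clustering with Jaccard similarity. The function names, stopword list, threshold and sample titles are all assumptions made for illustration.

```python
def key_terms(title, stopwords=frozenset({"a", "an", "and", "of", "the", "in", "on"})):
    """Lowercased words of a title, minus a small illustrative stopword list."""
    return {w for w in title.lower().split() if w not in stopwords}

def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def leader_cluster(titles, threshold=0.2):
    """Single pass: join the first cluster whose leader is similar enough, else start a new one."""
    clusters = []  # list of (leader_terms, member_titles)
    for title in titles:
        terms = key_terms(title)
        for leader_terms, members in clusters:
            if jaccard(terms, leader_terms) >= threshold:
                members.append(title)
                break
        else:
            clusters.append((terms, [title]))
    return [members for _, members in clusters]

# Titles invented for illustration only.
titles = [
    "The kangaroo in Australia",
    "Notes on the kangaroo",
    "Shark fisheries",
    "The shark and its habits",
]
clusters = leader_cluster(titles)
```

In an interactive setting, a user-adjustable value such as `threshold` is one possible place for feedback to enter, though the actual interaction model is shown only in the video.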

Interactive Clustering Video

6. OCR error correction
Correct errors in natural language texts
- Spelling errors (e.g. teh => the)
- Grammar errors (e.g. this are => this is)
Outline

OCR error correction
Input
- Document
- Component selection (select components to use for processing)
Correction candidates
- A list of candidates with confidence for each error
Component structure
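A list of correction candidates with a confidence for each error could look like the sketch below. The real components and their confidence model are not described on the slide, so this uses the similarity ratio from Python's standard `difflib.SequenceMatcher` as a stand-in score; the function name, lexicon and cut-off values are assumptions.

```python
from difflib import SequenceMatcher

def correction_candidates(word, lexicon, max_candidates=3, min_confidence=0.6):
    """Rank lexicon entries by string similarity to `word`; the ratio is a stand-in confidence."""
    scored = [(entry, SequenceMatcher(None, word, entry).ratio()) for entry in lexicon]
    scored = [(entry, score) for entry, score in scored if score >= min_confidence]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_candidates]

# Tiny lexicon, invented for illustration only.
lexicon = ["the", "then", "whale", "while"]
candidates = correction_candidates("teh", lexicon)
```

Each returned pair is a candidate replacement and its score, so a user interface can show the ranked list and let the user pick.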

OCR error correction video


7. Social media platform

Making Biodiversity Digital Objects More Social and Shareable

Follow us on Twitter: @SMLabTO

My Tweeps app (mytweeps.com)
Helping BHL (and other organizations) get daily insights about their Twitter followers (or Tweeps) and what they are interested in. We call it a "reverse" Twitter because, instead of seeing tweets from people whom you follow, the app shows you tweets from people who follow you.


We also partnered with Altmetric to better understand who shares BHL content across various social media platforms, and why.


My Tweeps video

8. Impact

Enhanced Searching of BHL Content

- Faceted search
- Automatically generated questions
- Time-sensitive search

Enhanced Document Viewing

- Page in PDF/image format
- OCR-corrected text with colour-coded annotations

The Team
NaCTeM, Ryerson, Dalhousie, Missouri Botanical Garden, Smithsonian Libraries (contract)

- NaCTeM: Riza Theresa Batista-Navarro, Sophia Ananiadou, Georgios Kontonatsios
- Dalhousie: Axel Soto, Aminul Islam, Evangelos Milios, Abdul Mohd, Hamid
- Missouri Botanical Garden team: Mike Lichtenberg, Trish Rose-Sandler & William Ulate
- Smithsonian Libraries staff (contract): Grace Costantino & Jen Hammock

Thanks to the sponsors: