Natural Language Processing for LODLAM

Presented at IGeLU 2014 by Corey A Harper, 2014-09-16

A brief intro to machine learning & data science for Libraries

Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014

Context

Narrative

Story telling

The Library's story,

and the Archives' story,

but also…

Users’ stories

Scholars' stories

Adding context through recombinant metadata

Scholars' & Users' Stories – Tim Sherratt (@wragge)

Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/

Library Authority Data

“Include links to other URIs, so that they can discover more things.”

Short of providing and linking to URIs, this *is* authority data.

This is what our authority files are for.

Linked data is about context

authorities provide context

and yet our controlled vocabs

are nearly gone

because the interfaces to them were broken

The Death of Browse

• Next-Gen Discovery Systems don't make use of Authority Control

• “Browse” was/is broken as a UI Design

• Rich data in Authorities, disconnected from narrative, context, search

• Richer “Authority” type data outside libraries...

• “Next Gen Next Gen Discovery…”

Fuzzy Wuzzy – SeatGeek

Fuzzy Wuzzy – Awesome Library from SeatGeek

https://github.com/seatgeek/fuzzywuzzy

http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
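
Fuzzy string matching scores how alike two strings are, typically on a 0 to 100 scale. The sketch below is not FuzzyWuzzy itself: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, and the helper names `ratio` and `token_sort_ratio` are chosen here to echo comparable FuzzyWuzzy scorers. Scores are not guaranteed to match what the library returns.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale (100 = identical)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Compare strings after sorting their words, so word order is ignored."""
    def normalize(s: str) -> str:
        # Treat commas as separators, then sort the lowercase tokens.
        return " ".join(sorted(s.lower().replace(",", " ").split()))
    return ratio(normalize(a), normalize(b))


if __name__ == "__main__":
    # An inverted heading and a natural-order name: same words, different order.
    print(ratio("Harper, Corey A", "Corey A Harper"))
    print(token_sort_ratio("Harper, Corey A", "Corey A Harper"))
```

Sorting tokens before comparing is what lets an inverted, catalog-style name line up with the same name in natural order.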

Slide courtesy of Doug Oard, Univ. of Maryland

Tools - Natural Language Processing

• DBpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki

• Zemanta: http://www.zemanta.com/?wpst=1

• Open Calais: http://www.opencalais.com/

• Open Refine: http://openrefine.org/

• DataTXT: https://dandelion.eu/products/datatxt/

• AlchemyAPI: http://www.alchemyapi.com/

• FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy

Where does this lead?

We need new interfaces

new tools

for a new kind of cataloger

for knowledge organization experts

Linked Jazz Back End

Primo PNX and Authorities

• Indexing Cross References

• New Browse Functionality

• Authority Control from Aleph / Alma

  • What about non-MARC, or non-Aleph Data?

• Matching Strings to Authorities
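
Indexing cross references and matching strings to authorities can be sketched together: index every variant form under its authorized heading, try an exact normalized lookup, and fall back to a fuzzy match. This is a minimal sketch under stated assumptions, not how Primo, Aleph, or Alma do it. The `AUTHORITIES` sample, the `cutoff` value, and all function names are hypothetical, and the fuzzy step uses `difflib.get_close_matches` from the standard library.

```python
from difflib import get_close_matches
import unicodedata

# Illustrative sample only: authorized heading -> cross references.
AUTHORITIES = {
    "Twain, Mark, 1835-1910": ["Clemens, Samuel Langhorne, 1835-1910", "Mark Twain"],
    "Armstrong, Louis, 1901-1971": ["Satchmo", "Louis Armstrong"],
}


def normalize(s: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    s = "".join(c if c.isalnum() or c.isspace() else " " for c in s.lower())
    return " ".join(s.split())


def build_index(authorities):
    """Map every normalized heading and cross reference to its heading."""
    index = {}
    for heading, variants in authorities.items():
        for form in [heading, *variants]:
            index[normalize(form)] = heading
    return index


def reconcile(name, index, cutoff=0.85):
    """Return (authorized heading, match type), or (None, 'unmatched')."""
    key = normalize(name)
    if key in index:
        return index[key], "exact"
    # Assumed threshold: anything below `cutoff` is left for human review.
    close = get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return index[close[0]], "fuzzy"
    return None, "unmatched"


if __name__ == "__main__":
    idx = build_index(AUTHORITIES)
    for raw in ["mark twain", "Clemens, Samuel Langhorn, 1835-1910", "Coltrane, John"]:
        print(raw, "->", reconcile(raw, idx))
```

Returning the match type alongside the heading keeps fuzzy matches distinguishable from exact ones, so they can be queued for review rather than accepted automatically.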

Enter Open Refine

http://freeyourmetadata.org/

Match strings to vocabularies…

Like LCNAF…

Or Wikipedia

Automated Authority Control?

Open Refine RDF Skeleton

Proposed System Architecture

Hydra Modeling & Architecture

• Approaches to Provenance

  • Prov-O

  • Named Graphs

  • Named Datastreams

• “n” nyucore “records”

  • Same properties defined for each

  • Keep data sources separate

  • Merge for display in Blacklight & export to Primo

Separate Metadata Datastreams

• source_metadata, enrich_metadata

  • Reload one or both without affecting other or native metadata

• native_metadata

  • Edited only through Hydra UI

  • Partitioned from external sources
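
The partitioning idea can be illustrated with plain Python dictionaries. The sketch below is not Hydra or Fedora code. Only the three stream names come from the slide; the `Record` class, its methods, and the sample values are hypothetical. It shows one stream being reloaded without touching the others, and a merged view that records which stream supplied each value.

```python
from collections import defaultdict

# Stream names come from the slide; everything else here is invented.
STREAMS = ("source_metadata", "enrich_metadata", "native_metadata")


class Record:
    """One object with a separate property dict per metadata stream."""

    def __init__(self):
        self.streams = {name: {} for name in STREAMS}

    def load(self, stream, properties):
        """Replace one stream wholesale; the other streams are untouched."""
        if stream not in self.streams:
            raise KeyError(f"unknown stream: {stream}")
        self.streams[stream] = {k: list(v) for k, v in properties.items()}

    def merged(self):
        """Union of values per property, each tagged with its source streams."""
        out = defaultdict(dict)
        for stream in STREAMS:
            for prop, values in self.streams[stream].items():
                for value in values:
                    out[prop].setdefault(value, []).append(stream)
        return dict(out)


if __name__ == "__main__":
    rec = Record()
    rec.load("source_metadata", {"subject": ["Jazz"]})
    rec.load("enrich_metadata", {"subject": ["Jazz", "Jazz musicians"]})
    rec.load("native_metadata", {"note": ["Edited by hand"]})
    print(rec.merged())
    # Reloading the enrichment stream leaves source and native data alone.
    rec.load("enrich_metadata", {"subject": ["Blues"]})
    print(rec.merged())
```

Because every value carries its stream of origin, a display layer can merge everything while still being able to say where each statement came from. The sketch does not enforce the rule that native metadata is edited only through the UI.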

Metadata Provenance

Fedora Datastreams

Blacklight User Interface

Where does this lead?

We need new interfaces

new tools

for a new kind of cataloger

for knowledge organization experts

A Role for Ex Libris

• Alma &/or Primo

  • Named Entity Recognition

  • Vocabulary Reconciliation

  • Provenance Management

• Primo Central

  • Named Entity Recognition on Full Text

  • Auto Classification

A bit louder...

we need new interfaces

we need enterprise tools

integrated into our metadata management systems

for a new kind of cataloger

for knowledge organization experts

Simplified Workflow Proposal

More Tools – At Programming Level

• Open NLP: https://opennlp.apache.org/

• Stanford NLP Software: http://nlp.stanford.edu/software/index.shtml

• Python Tools

  • scikit-learn, pandas, NLTK, SciPy, NumPy

  • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience

  • http://pandas.pydata.org/

  • http://www.nltk.org/

More Data Science-ey Tools

http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html

Data Science Techniques

• Feature Extraction / Feature Engineering

• Predictive Modeling

• Probabilistic Classification – Large Multi-Class Problems

• Text Analytics

  • Vectorization

  • Bags & Sets of Words

  • TF/IDF

  • N-Grams

  • Sparse Matrices
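
Several of the text analytics terms above fit in one short sketch: tokenizing, counting a bag of words, adding n-grams, and storing each document as a sparse row that holds only non-zero counts. This is a standard-library illustration with hypothetical function names and a deliberately naive tokenizer. Libraries such as scikit-learn provide production versions of the same steps.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and split on runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(docs, max_n=2):
    """Return (vocabulary, rows); each row is a sparse {column: count} dict."""
    counted = []
    for doc in docs:
        tokens = tokenize(doc)
        features = Counter()  # the document's bag of words and n-grams
        for n in range(1, max_n + 1):
            features.update(ngrams(tokens, n))
        counted.append(features)
    # One column per distinct term across the corpus, in sorted order.
    vocabulary = {term: col for col, term in
                  enumerate(sorted(set().union(*counted)))}
    rows = [{vocabulary[t]: c for t, c in f.items()} for f in counted]
    return vocabulary, rows


if __name__ == "__main__":
    vocab, rows = vectorize(["linked data", "linked open data"])
    print(vocab)
    print(rows)
```

Each row stores only the columns that occur in that document, which is the point of a sparse matrix: with a vocabulary of tens of thousands of terms, nearly every cell in a dense matrix would be zero.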

Simple Example – Predict Yelp Star Ratings

Fitting a Model – Naïve Bayes
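
The code shown on the original slide is not in this transcript, so the following is a stand-in rather than the presenter's example: a minimal multinomial Naïve Bayes classifier written with the standard library only. The class name, the add-one smoothing choice, and the four toy reviews are all invented for illustration. Real work would more likely use a library such as scikit-learn on an actual review dataset.

```python
from collections import Counter, defaultdict
import math


class NaiveBayes:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, doc):
        """Unnormalized log P(class) + sum of log P(word | class)."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in doc.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Invented toy reviews with made-up star ratings, for illustration only.
    reviews = ["great food great service", "terrible food rude service",
               "great coffee", "rude staff terrible coffee"]
    stars = [5, 1, 5, 1]
    model = NaiveBayes().fit(reviews, stars)
    print(model.predict("great service"))
    print(model.predict("rude and terrible"))
```

Working in log space avoids multiplying many small probabilities together, and add-one smoothing keeps a word that never appeared with a class from zeroing out that class's score.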

Data Science Venn Diagram

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

$$1 + \ln\left(\frac{\text{Total Document Count}}{\text{Documents Containing Term}}\right)$$

http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323
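
The expression on this slide, one plus the natural log of the total document count over the number of documents containing a term, is an inverse document frequency weight. A minimal sketch of computing it, assuming documents are already tokenized into lists of words; the function names and the raw-count term frequency are choices made here, not taken from the slide.

```python
import math


def idf(term, documents):
    """1 + ln(total document count / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term not found in any document: {term!r}")
    return 1 + math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    """Raw term count in one document, weighted by the term's IDF."""
    return document.count(term) * idf(term, documents)


if __name__ == "__main__":
    # Tiny invented corpus of pre-tokenized documents.
    corpus = [["jazz", "archive"], ["jazz", "library"]]
    print(idf("jazz", corpus))     # term appears in every document
    print(idf("archive", corpus))  # term appears in half the documents
```

A term found in every document gets the minimum weight of 1, and the weight grows as the term becomes rarer, which is what makes distinctive words stand out in a TF/IDF vector.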

Where can we go from here?

• NER is just the beginning

• Feature Engineering

• Hiring Statisticians

• Clustering & Classification

• Vocabulary Pruning and Engineering

  • Manageable 10-20k Class Text Classification Problems

  • Domain Specific

• Ex Libris’ Activity in this space

Thanks!

[email protected]

212.998.2479

@chrpr