Leveraging Semantic Fingerprinting for Building Author Networks


description

Presented at the 10th annual Data Harmony Users Group meeting on Wednesday, February 12, 2014 by Bob Kasenchak of Access Innovations, Inc. With the rise of ORCID and other universal databases of researchers and institutions, it is increasingly crucial for publishers to sort out their own data containing named entities. This talk details Access Innovations' approach to author disambiguation, which includes a taxonomy-based solution in addition to algorithmic processes. The presentation includes a case study.

Transcript of Leveraging Semantic Fingerprinting for Building Author Networks

Page 1: Leveraging Semantic Fingerprinting for Building Author Networks

LEVERAGING SEMANTIC FINGERPRINTS FOR BUILDING AUTHOR NETWORKS

Bob Kasenchak, Production Coordinator

DHUG 2014

Page 2: Leveraging Semantic Fingerprinting for Building Author Networks

NAMED ENTITY DISAMBIGUATION

Most publishers (and many other organizations) need to disambiguate lists of:

Persons
  Authors
  Editors
  Members
  Employees
Institutions
  Colleges, Companies, Laboratories, Organizations

Copyright 2014 Access Innovations, Inc.

Page 3: Leveraging Semantic Fingerprinting for Building Author Networks

BUT WHY DISAMBIGUATE?

Facilitate content discovery
  Browse by Author or Institution name
Resolve member, author, marketing lists
Link out to other organizations (e.g., ORCID)
Demonstrate value to stakeholders
  e.g., College libraries less apt to cancel subscriptions if they are shown how many of their professors are published in your content
Market research and analysis

Page 4: Leveraging Semantic Fingerprinting for Building Author Networks

TWO DISAMBIGUATION PROCESSES

Matching algorithms
  String matching
  Fuzzy matching
Leveraging other data associated with each entity to increase matching probability and reduce false matches, such as:
  Country
  Date
  Co-authors
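
The two processes above can be sketched together: a name comparison (exact, then fuzzy) whose score is nudged upward by supporting data. This is a minimal illustration using Python's standard `difflib`, not the matching algorithm described in the talk. The function names, the dictionary layout, the bonus weights, and the co-author "Smith, J." are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and replace punctuation with spaces so trivial variants compare equal."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def name_similarity(a: str, b: str) -> float:
    """Return 1.0 for an exact match after normalization, else a fuzzy ratio in [0, 1]."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return 1.0
    return SequenceMatcher(None, na, nb).ratio()


def match_score(candidate: dict, record: dict) -> float:
    """Add illustrative bonuses for agreeing country, nearby date, and shared co-authors."""
    score = name_similarity(candidate["name"], record["name"])
    if candidate.get("country") and candidate.get("country") == record.get("country"):
        score += 0.10  # hypothetical weight
    if candidate.get("year") and record.get("year"):
        if abs(candidate["year"] - record["year"]) <= 10:
            score += 0.05  # hypothetical weight
    if set(candidate.get("coauthors", [])) & set(record.get("coauthors", [])):
        score += 0.20  # hypothetical weight
    return score


record = {"name": "Carlson, Thomas A.", "country": "US", "coauthors": ["Smith, J."]}
with_evidence = {"name": "Carlson, T. A.", "country": "US", "coauthors": ["Smith, J."]}
name_only = {"name": "Carlson, T. A."}
print(match_score(with_evidence, record) > match_score(name_only, record))  # True
```

The same name variant scores higher when country and co-author agree, which is the sense in which associated data can raise matching probability and reduce false matches.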

Page 5: Leveraging Semantic Fingerprinting for Building Author Networks

TWO-PHASE WORKFLOW

Initial set of raw data is used to create an authority file
Questionable names are subject to human review
Authority file is subject to constant review and cleanup
Entities are extracted from new content and compared to the authority file
Anomalies are reviewed and matched to existing records or added as new entities

Page 6: Leveraging Semantic Fingerprinting for Building Author Networks

INSTITUTION DISAMBIGUATION

Having a clean Institution authority file allows for better processing of persons
The work is easier and more clear-cut
Develop standards and practices, but be prepared to change or add to them as new data comes to light
Forcing data into a bad paradigm isn’t helpful
The data should inform your standards and practices

Page 7: Leveraging Semantic Fingerprinting for Building Author Networks

INSTITUTION DISAMBIGUATION FLOW

Page 8: Leveraging Semantic Fingerprinting for Building Author Networks

QUALITY OF RAW DATA MATTERS

Well-formed source data?
Structured or unstructured?
Legacy content?
  Often not as well structured
  Or auto-tagged, so can be unreliable
Parsed using punctuation etc. as delimiters
Common abbreviations and stopwords
Also, leverage country information if available
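
As a rough sketch of the parsing step described above, the following splits a raw affiliation string on commas, expands common abbreviations, and drops stopwords to build a comparison key. The abbreviation table, stopword list, and function names are assumptions for illustration only; a real system would maintain much larger lists.

```python
# Hypothetical lookup tables for illustration; production lists would be much larger.
ABBREVIATIONS = {
    "univ": "university",
    "inst": "institute",
    "dept": "department",
    "aerosp": "aerospace",
    "mech": "mechanical",
    "eng": "engineering",
    "linguist": "linguistics",
}
STOPWORDS = {"of", "the", "and", "for"}


def parse_affiliation(raw: str) -> list[str]:
    """Split a raw affiliation on commas and expand known abbreviations."""
    segments = []
    for part in raw.split(","):
        words = [w.strip(".").lower() for w in part.split()]
        expanded = " ".join(ABBREVIATIONS.get(w, w) for w in words if w)
        if expanded:
            segments.append(expanded)
    return segments


def match_key(segment: str) -> str:
    """Drop stopwords so minor wording differences do not block a match."""
    return " ".join(w for w in segment.split() if w not in STOPWORDS)


print(parse_affiliation("Ohio State Univ., Dept. of Linguist."))
# ['ohio state university', 'department of linguistics']
print(match_key("department of linguistics"))
# department linguistics
```

Treating the first segment as the institution name is a simplification; legacy or auto-tagged data will not always follow that order.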

Page 9: Leveraging Semantic Fingerprinting for Building Author Networks

INSTITUTIONS: RAW DATA

Ohio Aerosp. Inst., Cleveland, OH 44142

Ohio Aerospace Institute (OAI)

Ohio Dominican University

Ohio Institute of Technology

Ohio Northern University

Ohio State

Ohio State Univ., Columbus, OH

Ohio State Univ., Columbus, OH 43210

Ohio State Univ., Columbus, OH 43210‐1298

Ohio State Univ., Dept. of Linguist.

Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, [email protected]

Ohio University

Page 10: Leveraging Semantic Fingerprinting for Building Author Networks

INSTITUTION DISAMBIGUATION FLOW

Page 11: Leveraging Semantic Fingerprinting for Building Author Networks

HUMAN EDITORIAL REVIEW

Two kinds of human intervention are used:

QC of automated matches for accuracy
  Culls out errors
  Gather data to iteratively adjust matching algorithms
Reviewing non-matched entities
  Match by hand to existing authority file
  Create new listings for new entities

Page 12: Leveraging Semantic Fingerprinting for Building Author Networks

EDITORIAL REVIEW INTERFACE

Institutions to be reviewed

Authority file lookup

Search results

Page 13: Leveraging Semantic Fingerprinting for Building Author Networks

AUTHORS (AND OTHER PERSONS)

Persons are trickier than institutions!

Variants
Nicknames
Middle name, initial, or nothing
Initials
Suffixes and Prefixes
Similar last names
Name changes
Transliterations

Page 14: Leveraging Semantic Fingerprinting for Building Author Networks

NAMES: RAW DATA

Carlson, N.

Carlson, Neil N.

Carlson, P.

Carlson, R. L.

Carlson, R. M. K.

Carlson, R. W.

Carlson, Roy

Carlson, Roy F.

Carlson, T. A.

Carlson, Thomas

Carlson, Thomas A.

Carlson, Thomas J.

Carlson, W. G.

Carlson, William

Carlson, William V.

Which, if any, are the same person?
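
A first pass at that question can only say which pairs are possible matches, not which are actual ones. The sketch below treats an initial as compatible with any given name starting with that letter, and treats a missing middle name as inconclusive. The function names and the compatibility rules are assumptions for illustration; nicknames, name changes, and transliterations are not handled.

```python
def parse_name(raw: str) -> tuple[str, list[str]]:
    """Split 'Surname, Given M.' into a lowercase surname and given-name tokens."""
    surname, _, given = raw.partition(",")
    tokens = [t.strip(".").lower() for t in given.replace(".", ". ").split()]
    return surname.strip().lower(), [t for t in tokens if t]


def tokens_compatible(a: str, b: str) -> bool:
    """An initial matches any name with the same first letter; full names must be equal."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def could_be_same_person(name_a: str, name_b: str) -> bool:
    """Return True when nothing in the two name strings rules out a match."""
    surname_a, given_a = parse_name(name_a)
    surname_b, given_b = parse_name(name_b)
    if surname_a != surname_b:
        return False
    # zip stops at the shorter list, so a missing middle name is not a conflict.
    return all(tokens_compatible(x, y) for x, y in zip(given_a, given_b))


print(could_be_same_person("Carlson, T. A.", "Carlson, Thomas A."))      # True
print(could_be_same_person("Carlson, Thomas A.", "Carlson, Thomas J."))  # False
print(could_be_same_person("Carlson, Thomas", "Carlson, Thomas J."))     # True
```

Note that "Carlson, Thomas" is compatible with both "Thomas A." and "Thomas J.", so string comparison alone cannot settle the question; other data about each person is needed.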

Page 15: Leveraging Semantic Fingerprinting for Building Author Networks

PERSON NAME DISAMBIGUATION FLOW

Page 16: Leveraging Semantic Fingerprinting for Building Author Networks

RESOLVER; SEMANTIC FINGERPRINTS

Page 17: Leveraging Semantic Fingerprinting for Building Author Networks

RESOLVER; SEMANTIC FINGERPRINTS

Page 18: Leveraging Semantic Fingerprinting for Building Author Networks

AUTHOR NAME AUTHORITY FILE

Each author record is linked to other associated data:

Every DOI (or other document #)
Every co-author
Every institution
Dates of publication
Subject terms from thesaurus used to index content associated with each person

Each of these is used in the disambiguation algorithm to weight the potential matches of similar names
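
One way to picture such a record, and how its linked data might weight a candidate match, is sketched below. The class, field names, weights, and sample values (document IDs, "Smith, J.", "Lee, K.", subject terms) are hypothetical and chosen only for illustration; the talk does not specify the actual schema or weighting.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One authority-file entry plus the data linked to it."""
    name: str
    document_ids: set = field(default_factory=set)   # DOIs or other document numbers
    coauthors: set = field(default_factory=set)
    institutions: set = field(default_factory=set)
    years: set = field(default_factory=set)
    subject_terms: set = field(default_factory=set)  # thesaurus terms from indexing

    def add_paper(self, doc_id, coauthors, institution, year, terms):
        """Attach one paper's metadata to this author."""
        self.document_ids.add(doc_id)
        self.coauthors.update(coauthors)
        self.institutions.add(institution)
        self.years.add(year)
        self.subject_terms.update(terms)


# Hypothetical weights; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {"coauthors": 0.35, "institutions": 0.25, "subject_terms": 0.30, "years": 0.10}


def jaccard(a: set, b: set) -> float:
    """Share of items the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def years_plausible(a: set, b: set, window: int = 10) -> float:
    """1.0 if any two publication years fall within the window, else 0.0."""
    if not a or not b:
        return 0.0
    return 1.0 if min(abs(x - y) for x in a for y in b) <= window else 0.0


def evidence_score(incoming: AuthorRecord, known: AuthorRecord) -> float:
    """Weighted agreement between an incoming name's data and a known record."""
    return (
        WEIGHTS["coauthors"] * jaccard(incoming.coauthors, known.coauthors)
        + WEIGHTS["institutions"] * jaccard(incoming.institutions, known.institutions)
        + WEIGHTS["subject_terms"] * jaccard(incoming.subject_terms, known.subject_terms)
        + WEIGHTS["years"] * years_plausible(incoming.years, known.years)
    )


known = AuthorRecord("Carlson, Thomas A.")
known.add_paper("doc-1", {"Smith, J."}, "Ohio State University", 2010, {"Acoustics"})
likely = AuthorRecord("Carlson, T. A.")
likely.add_paper("doc-2", {"Smith, J."}, "Ohio State University", 2012, {"Acoustics"})
unlikely = AuthorRecord("Carlson, T. A.")
unlikely.add_paper("doc-3", {"Lee, K."}, "Ohio University", 1975, {"Botany"})
print(evidence_score(likely, known) > evidence_score(unlikely, known))  # True
```

Both incoming names are identical strings, so only the linked data separates them here.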

Page 19: Leveraging Semantic Fingerprinting for Building Author Networks

LEVERAGING THESAURUS TERMS

The indexing from every paper by each known author comprises a weighted subject “fingerprint”

Potential matching names from incoming content are associated with the indexing from each paper

Subject terms are compared to potential matches to increase certainty weighting
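
A subject fingerprint of this kind can be approximated as a term-frequency vector, with cosine similarity as the comparison. This is a sketch under that assumption; the talk does not state which weighting or similarity measure is used, and the subject terms below are invented for the example.

```python
import math
from collections import Counter


def build_fingerprint(papers_terms: list[list[str]]) -> Counter:
    """Count how many of an author's papers were indexed with each thesaurus term."""
    fingerprint = Counter()
    for terms in papers_terms:
        fingerprint.update(set(terms))  # count each term once per paper
    return fingerprint


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-weight vectors, in [0, 1]."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Known author: three papers, mostly acoustics (hypothetical indexing terms).
known = build_fingerprint([
    ["Acoustics", "Signal processing"],
    ["Acoustics", "Ultrasonics"],
    ["Acoustics"],
])
# Two incoming papers carrying the same author name string.
paper_a = build_fingerprint([["Acoustics", "Ultrasonics"]])
paper_b = build_fingerprint([["Botany", "Plant genetics"]])

print(cosine_similarity(known, paper_a) > cosine_similarity(known, paper_b))  # True
print(cosine_similarity(known, paper_b))  # 0.0
```

A high similarity raises confidence that the incoming name belongs to the known author; a similarity of zero does not prove the opposite, since researchers do change fields.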

Page 20: Leveraging Semantic Fingerprinting for Building Author Networks

PERSON NAME DISAMBIGUATION FLOW

Page 21: Leveraging Semantic Fingerprinting for Building Author Networks

EDITORIAL REVIEW INTERFACE

Authors to be reviewed

Authority file lookup

Search results

Page 22: Leveraging Semantic Fingerprinting for Building Author Networks

ITERATIVE PROCESSES

Every batch of new content adds more data for the matching algorithms to use

The authority files should be reviewed by editors for QC to keep the files clean

Editors can suggest tweaks to the algorithm based on the results that are being sent to them for review and QC of the authority files:
  Too many obvious matches being kicked out; or
  Bad automatic matches being added to authority files
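
The feedback loop described above could be supported by a simple report over editors' decisions. The sketch below assumes a score-threshold design in which high scores are matched automatically and mid-range scores go to review; the thresholds, the record layout, and the 20% tolerance are all hypothetical, not taken from the talk.

```python
def triage(score: float, auto_threshold: float, review_threshold: float) -> str:
    """Route a candidate match by score: accept, send to an editor, or treat as new."""
    if score >= auto_threshold:
        return "auto_match"
    if score >= review_threshold:
        return "review"
    return "new_entity"


def suggest_adjustment(reviewed: list[dict], tolerance: float = 0.2) -> str:
    """Summarize editor decisions into a threshold suggestion.

    Each item looks like {"route": "auto_match" or "review", "editor_says_match": bool}.
    """
    sent_to_review = [r for r in reviewed if r["route"] == "review"]
    auto_matched = [r for r in reviewed if r["route"] == "auto_match"]

    # Share of reviewed items that editors confirmed as matches.
    confirmed = sum(r["editor_says_match"] for r in sent_to_review)
    obvious_rate = confirmed / len(sent_to_review) if sent_to_review else 0.0

    # Share of automatic matches that editors rejected during QC.
    rejected = sum(not r["editor_says_match"] for r in auto_matched)
    bad_rate = rejected / len(auto_matched) if auto_matched else 0.0

    if bad_rate > tolerance:
        return "raise auto-match threshold"
    if obvious_rate > tolerance:
        return "lower auto-match threshold"
    return "keep thresholds"


decisions = [{"route": "review", "editor_says_match": True} for _ in range(5)]
print(triage(0.95, auto_threshold=0.9, review_threshold=0.6))  # auto_match
print(suggest_adjustment(decisions))  # lower auto-match threshold
```

Bad automatic matches are checked first because they silently corrupt the authority file, whereas over-cautious routing only costs editor time.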

Page 23: Leveraging Semantic Fingerprinting for Building Author Networks

CONTENT-AWARE PROCESSES

Every dataset is different, so the named entity disambiguation processes and algorithms should be modified to suit

More “adjustable” than “one-size-fits-all”

Basic processes can be customized to suit different datasets and client needs

Leveraging thesaurus/subject terms from indexing is a huge addition to the disambiguation algorithms

Page 24: Leveraging Semantic Fingerprinting for Building Author Networks

NAMED ENTITY DISAMBIGUATION PROCESSES AND PROCEDURES

Bob Kasenchak, Project Coordinator

November 20, 2013

Thank You – Any Questions?