Back to meaning

Information structuring in the PEER project

PEER Publishing and the Ecology of European Research 1 www.peerproject.eu

Foudil Bretel (1), Patrice Lopez (1, 2), Maud Medves (1, 2), Alain Monteil (1), Laurent Romary (1, 2)

(1) INRIA, (2) Humboldt Univ. Berlin

Sorting out the chaos?

• Vision: channelling heterogeneous (publisher’s) data into one single (meaningful) format

• Material: PDF with metadata – what can the TEI do with it?

• Articulating a pivot/reference format seen as a strict customization of the TEI

• Exploring the possibility of automatic metadata extraction from PDFs

Why is it so difficult?

• Great heterogeneity of format within publishers
– Metadata (and full-text)
– Proprietary, ScholarOne, NLM 2.0, NLM 3.0, …

• Various issues
– Affiliations
– Publication date information
– ISO 3166 codes (countries)
– Bibliographical references
– Proprietary metadata fields

The information chaos

• Article title
– article-title/title | ArticleTitle | article-title | ce:title | art_title | article_title | nihms-submit/title | ArticleTitle/Title | ChapterTitle

• Journal title
– j-title | JournalTitle | full_journal_title | jrn_title | journal-title

• ISSN (print)
– JournalPrintISSN | issn[@issn_type='print'] | issn[@pub-type='ppub'] | PrintISSN | issn-paper

• First page of a paper
– spn | FirstPage | ArticleFirstPage | fpage | first-page
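
One way to picture what a pivot format has to absorb: each logical field is reachable through a different path in each publisher's XML. The sketch below is a minimal illustration only, not the project's converter. The names FIELD_PATHS and extract_field are hypothetical, the path lists simply reuse some of the element names above, the sample documents are invented, and the namespaced ce:title is left out because it would need a namespace map.

```python
import xml.etree.ElementTree as ET

# Candidate paths per logical field, most specific first.
# Hypothetical table built from the element names listed on the slide.
FIELD_PATHS = {
    "article_title": [".//article-title/title", ".//ArticleTitle/Title",
                      ".//article-title", ".//ArticleTitle",
                      ".//art_title", ".//article_title", ".//ChapterTitle"],
    "journal_title": [".//j-title", ".//JournalTitle", ".//full_journal_title",
                      ".//jrn_title", ".//journal-title"],
    "issn_print": [".//JournalPrintISSN", ".//issn[@issn_type='print']",
                   ".//issn[@pub-type='ppub']", ".//PrintISSN", ".//issn-paper"],
    "first_page": [".//spn", ".//FirstPage", ".//ArticleFirstPage",
                   ".//fpage", ".//first-page"],
}


def extract_field(root, field):
    """Return the text of the first candidate path that matches, else None."""
    for path in FIELD_PATHS[field]:
        element = root.find(path)
        if element is not None:
            # itertext() also collects text nested in inline markup.
            return "".join(element.itertext()).strip()
    return None


# Two invented inputs, one NLM-like and one Springer-like.
nlm = ET.fromstring(
    "<article><front><journal-meta><journal-title>A Journal</journal-title>"
    "<issn pub-type='ppub'>1234-5678</issn></journal-meta>"
    "<article-meta><article-title>A sample title</article-title>"
    "<fpage>1</fpage></article-meta></front></article>"
)
springer = ET.fromstring(
    "<Article><ArticleInfo><ArticleTitle Language='En'>Another title</ArticleTitle>"
    "<ArticleFirstPage>1</ArticleFirstPage></ArticleInfo></Article>"
)

for doc in (nlm, springer):
    print({field: extract_field(doc, field) for field in FIELD_PATHS})
```

A lookup table like this only covers flat fields; structured information such as affiliations needs a real transformation per source format.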

Sorting this out

• Defining a coherent infrastructure to facilitate
– The long-term management of scholarly content in research institutions
– Smooth interaction between publishers and research institutions

• Better understanding of what each of us can provide

• On-going experimental setting: the EU PEER project

The PEER project

• Initiated by the European Commission (DG INFSO)

• Objective: study the impact of systematically archiving stage-two outputs in “institutional repositories” (cf. Romary & Armbruster 2010)
– on journals and business models
– on the wider ecology of scientific research

• Consortium
– STM, European Science Foundation (ESF), Goettingen State and University Library (UGOE), Max Planck Gesellschaft (MPG), INRIA

Content submission - publishers

[Figure: workflow diagram. Labels: Eligible Journals / Articles; Publishers; PEER Depot; Authors; Select; 100 % Metadata, 50 % Manuscripts; Publishers Transfer; 50 % Manuscripts; Publishers Deposit; Publishers Inform]

Content submission – to repositories & LTP archive

[Figure: workflow diagram. Labels: PEER Depot; Transfer; Authors Deposit; Publishers Deposit; Long-Term Preservation; LTP Depot (e-Depot, KB); Publicly Available PEER Repositories: UGOE, HAL, ULD, TDC, MPG, SSOAR, KTU]

What has been done

• Publishers involved in the project
– BMJ Publishing Group (proprietary format)
– Cambridge University Press (NLM2.2)
– EDP Sciences (NLM3.0)
– Elsevier (proprietary format)
– IOP Publishing (NLM3.0)
– Nature Publishing Group (proprietary format)
– Oxford University Press (ScholarOne)
– Portland Press (NLM2.0)
– Sage Publications (proprietary format)
– Springer (proprietary format)
– Taylor & Francis Group (ScholarOne)
– Wiley-Blackwell (ScholarOne)

The PEER deposit workflow

[Figure: workflow diagram. Labels: Publishers; PEER Depot; Repositories; HAL; SUB-Göt; MPS; Preservation; KB]

TEI as a pivot format for interchange

• General strategy: no information should be lost
– Nearly everything in sourceDesc
– + Keywords, Summary, Copyright

• Strict author description
– Deep encoding of names
– Deep encoding of affiliations (Web of Science - 3-level)
– Deep encoding of addresses – getting the country right

• Precise publishing information
– Pagination, DOIs, volume, issue, journal name(s)
– Yes, biblStruct is cool!

Example

Source (Springer proprietary format)

<Author AffiliationIDS="Aff1" CorrespondingAffiliationID="Aff1">
  <AuthorName DisplayOrder="Western">
    <GivenName>Hucheng</GivenName>
    <FamilyName>Qi</FamilyName>
  </AuthorName>
  <Contact>
    <Email>[email protected]</Email>
  </Contact>
</Author>
….
<Affiliation ID="Aff1">
  <OrgName>Durisol, A division of Armtec Limited Partnership</OrgName>
  <OrgAddress>
    <Street>51 Arthur Street South</Street>
    <Postcode>N0K 1N0</Postcode>
    <City>Mitchell</City>
    <State>ON</State>
    <Country>Canada</Country>
  </OrgAddress>
</Affiliation>

PEER format (TEI)

<author>
  <persName>
    <forename type="first">Hucheng</forename>
    <surname>Qi</surname>
  </persName>
  <email>[email protected]</email>
  <affiliation>
    <orgName type="institution">Durisol, A division of Armtec Limited Partnership</orgName>
    <address>
      <street>51 Arthur Street South</street>
      <postCode>N0K 1N0</postCode>
      <settlement>Mitchell</settlement>
      <region>ON</region>
      <country key="CA">CANADA</country>
    </address>
  </affiliation>
</author>
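
The mapping shown in this example can be sketched in a few lines of Python. This is an illustrative sketch, not the stylesheet used in the project: the function name springer_author_to_tei, the COUNTRY_KEYS table (a tiny hand-written subset of ISO 3166-1 alpha-2 codes) and the assumption that AffiliationIDS holds space-separated identifiers are all hypothetical, and the e-mail address is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately tiny country table (ISO 3166-1 alpha-2).
COUNTRY_KEYS = {"canada": "CA", "france": "FR", "germany": "DE"}

# Source address element -> TEI address element.
ADDRESS_MAP = [("Street", "street"), ("Postcode", "postCode"),
               ("City", "settlement"), ("State", "region")]


def springer_author_to_tei(author, affiliations):
    """Build a TEI <author> from a source <Author> and a dict of
    affiliation ID -> source <Affiliation> element."""
    tei = ET.Element("author")
    pers = ET.SubElement(tei, "persName")
    name = author.find("AuthorName")
    ET.SubElement(pers, "forename", {"type": "first"}).text = name.findtext("GivenName")
    ET.SubElement(pers, "surname").text = name.findtext("FamilyName")

    email = author.findtext("Contact/Email")
    if email:
        ET.SubElement(tei, "email").text = email

    # Assumption: several affiliation IDs would be separated by spaces.
    for aff_id in author.get("AffiliationIDS", "").split():
        source = affiliations.get(aff_id)
        if source is None:
            continue
        aff = ET.SubElement(tei, "affiliation")
        ET.SubElement(aff, "orgName", {"type": "institution"}).text = source.findtext("OrgName")
        src_addr = source.find("OrgAddress")
        if src_addr is None:
            continue
        addr = ET.SubElement(aff, "address")
        for src_tag, tei_tag in ADDRESS_MAP:
            value = src_addr.findtext(src_tag)
            if value:
                ET.SubElement(addr, tei_tag).text = value
        country = src_addr.findtext("Country")
        if country:
            element = ET.SubElement(addr, "country")
            key = COUNTRY_KEYS.get(country.strip().lower())
            if key:  # "getting the country right": attach the code when known
                element.set("key", key)
            element.text = country.upper()
    return tei


doc = ET.fromstring("""
<Root>
  <Author AffiliationIDS="Aff1" CorrespondingAffiliationID="Aff1">
    <AuthorName DisplayOrder="Western">
      <GivenName>Hucheng</GivenName><FamilyName>Qi</FamilyName>
    </AuthorName>
    <Contact><Email>author@example.org</Email></Contact>
  </Author>
  <Affiliation ID="Aff1">
    <OrgName>Durisol, A division of Armtec Limited Partnership</OrgName>
    <OrgAddress>
      <Street>51 Arthur Street South</Street><Postcode>N0K 1N0</Postcode>
      <City>Mitchell</City><State>ON</State><Country>Canada</Country>
    </OrgAddress>
  </Affiliation>
</Root>
""")
affiliations = {a.get("ID"): a for a in doc.iter("Affiliation")}
tei_author = springer_author_to_tei(doc.find("Author"), affiliations)
print(ET.tostring(tei_author, encoding="unicode"))
```

The affiliation is copied inside each author rather than referenced by ID, which matches the "deep encoding" shown in the TEI output above.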

Example

Source (Springer proprietary format)

<ArticleID>351</ArticleID>
<ArticleDOI>10.1007/s00107-009-0351-z</ArticleDOI>
<ArticleSequenceNumber>0</ArticleSequenceNumber>
<ArticleTitle Language="En">The investigation of basic processes of rapidly hardening wood-cement-water mixture with CO<Subscript>2</Subscript></ArticleTitle>
<ArticleTitle Language="De">Untersuchung der Vorgänge bei der schnellen Härtung einer Holz-Zement-Wasser-Mischung mit CO<Subscript>2</Subscript></ArticleTitle>
<ArticleCategory>Originals Originalarbeiten </ArticleCategory>
<ArticleFirstPage>1</ArticleFirstPage>
<ArticleLastPage>7</ArticleLastPage>
<ArticleHistory>
  <RegistrationDate><Year>2009</Year><Month>05</Month><Day>14</Day></RegistrationDate>
  <Received><Year>2008</Year><Month>12</Month><Day>9</Day></Received>
  <OnlineDate><Year>2009</Year><Month>5</Month><Day>30</Day></OnlineDate>
</ArticleHistory>
<ArticleCopyright>
  <CopyrightHolderName>Springer-Verlag</CopyrightHolderName>
  <CopyrightYear>2009</CopyrightYear>
</ArticleCopyright>
<ArticleContext>
  <JournalID>107</JournalID>
</ArticleContext>
</ArticleInfo>

PEER format (TEI)

<sourceDesc>
  <biblStruct>
    <analytic>
      …
      <title level="a" type="main" xml:lang="en">The investigation of basic processes of rapidly hardening wood-cement-water mixture with CO<hi rend="subscript">2</hi></title>
      <title level="a" type="main" xml:lang="de">Untersuchung der Vorgänge bei der schnellen Härtung einer Holz-Zement-Wasser-Mischung mit CO<hi rend="subscript">2</hi></title>
    </analytic>
    <monogr>
      <imprint>
        <date when="2009-05-30"/>
        <biblScope type="fpage">1</biblScope>
        <biblScope type="lpage">7</biblScope>
      </imprint>
    </monogr>
    <idno type="DOI">10.1007/s00107-009-0351-z</idno>
    <idno type="publisherID">s00107-009-0351-z</idno>
    <idno type="articleID">351</idno>
  </biblStruct>
</sourceDesc>
</fileDesc>
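
In this example the OnlineDate of the source (Year 2009, Month 5, Day 30) ends up as a single normalised attribute, when="2009-05-30". A minimal sketch of that normalisation follows; the function name history_date_to_tei is hypothetical, and the code assumes that Year, Month and Day are all present.

```python
import xml.etree.ElementTree as ET
from datetime import date


def history_date_to_tei(element):
    """Turn <X><Year/><Month/><Day/></X> into a TEI <date when="YYYY-MM-DD"/>.

    Assumes all three parts are present; date() rejects impossible dates.
    """
    year, month, day = (int(element.findtext(tag)) for tag in ("Year", "Month", "Day"))
    tei_date = ET.Element("date")
    tei_date.set("when", date(year, month, day).isoformat())
    return tei_date


online = ET.fromstring(
    "<OnlineDate><Year>2009</Year><Month>5</Month><Day>30</Day></OnlineDate>"
)
received = ET.fromstring(
    "<Received><Year>2008</Year><Month>12</Month><Day>9</Day></Received>"
)
print(history_date_to_tei(online).get("when"))
print(history_date_to_tei(received).get("when"))
```

Going through a real date type instead of string concatenation pads single-digit months and days and catches malformed input early.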

… And when no metadata is available

GROBID

• GeneRation Of BIbliographic Data

• A text mining tool for extracting bibliographical metadata at large

• Input:

– Technical and scientific domains

– Scholarly documents, technical manuals and patents

– Raw text or text with layout information (PDF)

• Machine learning approach

Metadata extraction from front page

Metadata extraction from front page

• Extraction of bibliographical information from article header

• Fields: title, authors, date, abstract, location, affiliation, book title, journal title, email, publication number, web, degree, keywords, etc.

• As features, exploitation of

– position information (begin/end of line, in the doc.)

– lexical information (vocabulary, large gazetteers)

– layout information (font size, font style, etc.)

• Conditional Random Fields (CRF) (Peng & McCallum 2004)

• Current training corpus: 1 350 global examples + 200 affiliation/address blocks + 500 author sequences, etc.
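
The three feature families listed above (position, lexical, layout) can be illustrated with a small per-token feature function of the kind usually fed to a CRF. This is a sketch under stated assumptions, not GROBID's actual feature set: the Token fields, the feature names and the toy gazetteer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set, Union


@dataclass
class Token:
    text: str
    line_index: int    # index of the line the token sits on
    pos_in_line: int   # index of the token within its line
    line_length: int   # number of tokens on that line
    font_size: float
    bold: bool


def token_features(tok: Token, n_lines: int, body_font_size: float,
                   gazetteer: Set[str]) -> Dict[str, Union[str, bool, float]]:
    """Hypothetical feature dictionary for one token."""
    word = tok.text
    return {
        # lexical information
        "lower": word.lower(),
        "capitalized": word[:1].isupper(),
        "all_digits": word.isdigit(),
        "in_gazetteer": word.lower() in gazetteer,
        # position information
        "line_start": tok.pos_in_line == 0,
        "line_end": tok.pos_in_line == tok.line_length - 1,
        "doc_position": round(tok.line_index / max(n_lines - 1, 1), 1),
        # layout information
        "larger_font": tok.font_size > body_font_size,
        "bold": tok.bold,
    }


places = {"berlin", "paris"}  # toy gazetteer
token = Token("Berlin", line_index=3, pos_in_line=2, line_length=3,
              font_size=9.0, bold=False)
features = token_features(token, n_lines=11, body_font_size=10.0, gazetteer=places)
print(features)
```

A sequence labeller then sees one such dictionary per token, in reading order, together with the expected field label during training.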

Layout & Block Analysis: XY-Cut algorithm
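
As a reminder of what a recursive XY-cut does, here is a minimal sketch working on bounding boxes (x0, y0, x1, y1). It is a textbook-style illustration, not the implementation used in GROBID: the function names, the min_gap threshold and the sample page are assumptions. The boxes are projected on one axis, split wherever an empty gap of at least min_gap appears, and each part is processed again until no cut is possible.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _split(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty gaps along one axis (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])    # wide enough valley: cut here
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Recursively cut a page into blocks of boxes."""
    if not boxes:
        return []
    if len(boxes) == 1:
        return [boxes]
    for axis in (1, 0):  # try horizontal cuts (gaps along y) first, then vertical
        groups = _split(boxes, axis, min_gap)
        if len(groups) > 1:
            blocks: List[List[Box]] = []
            for group in groups:
                blocks.extend(xy_cut(group, min_gap))
            return blocks
    return [boxes]  # no gap on either axis: this is a leaf block


# Invented page: a title line above two columns of two lines each.
title = (0, 0, 200, 20)
left = [(0, 40, 90, 50), (0, 52, 90, 62)]
right = [(110, 40, 200, 50), (110, 52, 200, 62)]
blocks = xy_cut([title] + left + right, min_gap=10.0)
print(len(blocks), blocks)
```

The choice of min_gap decides whether line spacing inside a paragraph counts as a cut, so in practice it has to be tuned to the document's layout.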

Metadata extraction from header

Metadata extraction from header

Metadata consolidation

• Exploitation of external bibliographical databases for correcting/completing results based on extraction results

• Crossref: The full bibliographical record can be obtained based on:

– DOI

– Journal title, volume, first page

– Title + author first name ➞ frequent!

• Other databases: xISSN, xISBN, Amazon Web Service

• Real time: online requests take between 0.8 and 1.5 seconds
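
The consolidation logic can be sketched as two small steps: pick the most specific lookup key the extracted fields allow (in the order given for Crossref above), then merge the record returned by the database into the extracted one. The sketch makes no network call and does not reproduce any real service API; the function names, the field names and the rule that external values win over extracted ones are assumptions, and the external record below is a hand-written stand-in for a database response.

```python
from typing import Dict, Optional, Tuple

Record = Dict[str, str]

# Lookup strategies, most reliable first (order taken from the slide).
STRATEGIES = [
    ("doi", ["doi"]),
    ("journal+volume+first_page", ["journal_title", "volume", "first_page"]),
    ("title+author", ["title", "author_first_name"]),
]


def lookup_key(record: Record) -> Optional[Tuple[str, Record]]:
    """Return (strategy name, query fields) for the first usable strategy."""
    for name, fields in STRATEGIES:
        if all(record.get(field) for field in fields):
            return name, {field: record[field] for field in fields}
    return None


def consolidate(extracted: Record, external: Record) -> Record:
    """Correct/complete the extracted record with the external one.

    Assumption: non-empty external values take precedence.
    """
    merged = dict(extracted)
    merged.update({key: value for key, value in external.items() if value})
    return merged


extracted = {
    "title": "The investigation of basic processes of rapidly hardening "
             "wood-cement-water mixture with CO2",
    "author_first_name": "Hucheng",
    "first_page": "",  # not found by the extractor
}
strategy = lookup_key(extracted)

# Stand-in for the record an external database would return.
external = {"doi": "10.1007/s00107-009-0351-z", "first_page": "1"}
final_record = consolidate(extracted, external)
print(strategy[0], final_record)
```

Once a DOI has been recovered this way, later lookups for the same document can use the first and most reliable strategy.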

Accuracy overview: corpus CORA

Features      Accuracy  Precision  Recall  F1
Token         99.65     97.37      94.19   95.75
Title         99.70     98.24      95.48   96.84
Author        99.38     90.27      96.36   93.21
Date          99.86     97.53      81.07   87.29
Affiliation   99.52     98.25      93.26   95.69
Abstract      98.95     99.64      98.81   99.22

Field: 94.7
Instance: 74.91
Instance after consolidation: 82.20 (+9.7%)
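
For readers who want to relate the columns, F1 is the harmonic mean of precision and recall, and the +9.7% appears to be the relative gain in instance accuracy after consolidation. The short sketch below recomputes the Token row and that gain from the figures on the slide; the function names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100


token_f1 = round(f1(97.37, 94.19), 2)                     # Token row
consolidation_gain = round(relative_gain(74.91, 82.20), 1)  # Instance rows
print(token_f1, consolidation_gain)
```

With the slide's figures this gives 95.75 for the Token row and 9.7 for the gain.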

[Figure: processing pipeline diagram. Labels: Collection; Pre-processing (text segmentation, feature generation); Document segmentation; Token + features; + catalog; + catalog + expected result; train; CRF models (Header, Authors, Affiliations); Extraction from header]

[Figure: processing pipeline diagram. Labels: Document; Document segmentation; Segmented document; Term candidates + features; terms + labels; post-processing, consolidation; Final biblio. record; Collection; Pre-processing (text segmentation, feature generation); Token + features; + catalog; + catalog + expected result; train / classify; CRF models (Header, Authors, Affiliations); Extraction from header]

Why GROBID?

• Cataloguing: mass digitization

• User needs:

– self-archiving of scholarly papers by authors in open archives

– metadata not easily available

• Extraction of additional metadata: references, keywords, etc. for enriching/correcting existing ones

– improvement in search & retrieval

• Easier document access from citation strings

• Playground for experimenting with CRF models for text mining

Lessons

• Reusable infrastructure for various types of academic-publisher relation (e.g. Gold OA agreements)

• biblStruct is cool
– Cf. Michael’s talk: deeply structured

• Standardization in the publishing world is still an open issue… diplomatically put

• The TEI has a role to play in the publishing world
– Coherence between publication material and other sources
– E.g. central role of attribution/authorship/affiliation

• Stylesheets to be made available in OxGarage

A TEI customization for scholarly publishing

• A family of formats based on the TEI customization facilities
– Core editing customization (to be further extended – minimal tool support)
– Reference customization family for archiving
– Can be extended to specific domains: Maths, physics, SVG graphics, etc.
– Precise representation of bibliographic information
– Specific support through associated tools:
  • XSLT stylesheets (HTML, PDF, TEI2NLM)
  • PDF 2 TEI facility (GROBID)
  • Open Office 2 TEI facilities (maintained at Oxford)
  • MSWord 2 TEI facilities (TEI project with ISO)
  • AccessTEI
