GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

GROTOAP2: The methodology of creating a large ground truth dataset of scientific articles. Dominika Tkaczyk, Paweł Szostek and Łukasz Bolikowski, Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw. 3rd International Workshop on Mining Scientific Publications, 12 September 2014. D. Tkaczyk et al. (ICM UW), CERMINE, WOSP, 12 September 2014.

description

An article "GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles" presented during WOSP 2014

Transcript of GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Page 1: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

GROTOAP2: The methodology of creating a large ground truth dataset of scientific articles

Dominika Tkaczyk, Paweł Szostek and Łukasz Bolikowski

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw

3rd International Workshop on Mining Scientific Publications, 12 September 2014

Page 2: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Background

CERMINE extracts:

document’s metadata,

bibliographic references,

structured full text.

CERMINE needs a training set for its zone classifiers!

[Figure: CERMINE workflow diagram. Stages shown: PDF input, basic structure extraction, metadata extraction, references extraction, text extraction. Each extraction stage outputs XML, and the combined result is a JATS document with front, body and back parts.]

Page 3: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Requirements

A good dataset for document region classification should be:

large,

diverse,

preserving document text,

and the way text is displayed,

with fine-grained labels,

open.


Page 4: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

GROTOAP

GROTOAP dataset:

113 documents

1,031 pages

20,121 zones

20 zone labels

12 publishers

created by automatic tools + manual correction of every document = non-scalable

∼100% accurate


Page 5: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

GROTOAP vs. GROTOAP2

GROTOAP dataset:

113 documents

1,031 pages

20,121 zones

20 zone labels

12 publishers

created by automatic tools + manual correction of every document = non-scalable

∼100% accurate

GROTOAP2 dataset:

13,210 documents

119,334 pages

1,640,973 zones

22 zone labels

208 publishers

created by automatic tools + manually developed correction rules = scalable

∼93% accurate


Page 6: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

The content

GROTOAP2 is composed of:

13,210 ground-truth files in XML format storing the content of scientific publications from PubMed Central,

a list of URLs to the corresponding PDF files,

a bash script for downloading the PDF files from the PMC repository.


Page 7: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

The model

The document’s model in GROTOAP2 contains:

geometric hierarchical structure: pages, zones, lines, words and characters,

the text content of all the objects,

the dimensions and positions,

the reading order,

zone labels.


Page 8: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Zone labels

front: type, title, author, title author, editor, affiliation, abstract, keywords, bib info, dates, correspondence, glossary, copyright

body: body content, figure, table, equation

back: references, acknowledgment, conflict statement

other: page number, unknown

[Figure: bar chart with one bar per zone label; axis: % of documents (0 to 100). Labels in the order shown: BIB_INFO, BODY_CONTENT, REFERENCES, AFFILIATION, PAGE_NUMBER, ABSTRACT, AUTHOR, DATES, TITLE, COPYRIGHT, ACKNOWLEDGMENT, UNKNOWN, FIGURE, CORRESPONDENCE, CONFLICT_STATEMENT, TABLE, TYPE, KEYWORDS, EDITOR, TITLE_AUTHOR, GLOSSARY, EQUATION.]

Page 9: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

TrueViz format

<Document>
  <Page>
    <PageID Value="0"/> <PageNext Value="1"/>
    <Zone>
      <ZoneID Value="0"/> <ZoneNext Value="1"/>
      <ZoneCorners>
        <Vertex x="55.4" y="34.3"/> <Vertex x="250.5" y="58.3"/>
      </ZoneCorners>
      <Classification>
        <Category Value="TITLE"/> <Type Value=""/>
      </Classification>
      <Line>
        <LineID Value="0"/> <LineNext Value="1"/>
        <LineCorners>
          <Vertex x="55.4" y="34.3"/> <Vertex x="250.5" y="58.3"/>
        </LineCorners>
        <Word>
          <WordID Value="0"/> <WordNext Value="1"/>
          <WordCorners>
            <Vertex x="55.4" y="34.3"/> <Vertex x="115.3" y="58.3"/>
          </WordCorners>
          <Character>
            <CharacterID Value="0"/> <CharacterNext Value="1"/>
            <CharacterCorners>
              <Vertex x="55.4" y="34.3"/> <Vertex x="74.1" y="58.3"/>
            </CharacterCorners>
            <GT_Text Value="B"/>
          </Character>
          [...]
        </Word>
        [...]
      </Line>
      [...]
    </Zone>
    [...]
  </Page>
  [...]
</Document>
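As a rough illustration of how such a file might be consumed, the sketch below reads zone labels, bounding boxes and text with Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a hypothetical, well-formed fragment modelled on the slide with the `[...]` elisions removed, and `read_zones` is an illustrative helper name, not part of GROTOAP2 or CERMINE.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the slide; real files contain many zones.
SAMPLE = """
<Document>
  <Page>
    <PageID Value="0"/><PageNext Value="1"/>
    <Zone>
      <ZoneID Value="0"/><ZoneNext Value="1"/>
      <ZoneCorners><Vertex x="55.4" y="34.3"/><Vertex x="250.5" y="58.3"/></ZoneCorners>
      <Classification><Category Value="TITLE"/><Type Value=""/></Classification>
      <Line>
        <LineID Value="0"/><LineNext Value="1"/>
        <LineCorners><Vertex x="55.4" y="34.3"/><Vertex x="250.5" y="58.3"/></LineCorners>
        <Word>
          <WordID Value="0"/><WordNext Value="1"/>
          <WordCorners><Vertex x="55.4" y="34.3"/><Vertex x="115.3" y="58.3"/></WordCorners>
          <Character>
            <CharacterID Value="0"/><CharacterNext Value="1"/>
            <CharacterCorners><Vertex x="55.4" y="34.3"/><Vertex x="74.1" y="58.3"/></CharacterCorners>
            <GT_Text Value="B"/>
          </Character>
        </Word>
      </Line>
    </Zone>
  </Page>
</Document>
"""


def read_zones(xml_text):
    """Return one (label, corners, text) tuple per Zone element."""
    root = ET.fromstring(xml_text)
    zones = []
    for zone in root.iter("Zone"):
        label = zone.find("Classification/Category").get("Value")
        corners = [(float(v.get("x")), float(v.get("y")))
                   for v in zone.find("ZoneCorners").findall("Vertex")]
        # Rebuild each word from its characters, then join words with spaces.
        words = ["".join(ch.get("Value") for ch in word.iter("GT_Text"))
                 for word in zone.iter("Word")]
        zones.append((label, corners, " ".join(words)))
    return zones


zones = read_zones(SAMPLE)
```

Joining words with single spaces is a simplification; line breaks and reading order would need the `*Next` identifiers.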


Page 20: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

The method

[Figure: overview of the method. Stages shown: PubMed Central (PDF and NLM files), CERMINE tools, zone text matching, rules.]


Page 22: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Structure extraction

CERMINE tools were used to:

extract individual characters and their bounding boxes from PDF files,

group individual characters into words, lines and zones,

compute the reading order of all the elements.
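To make the grouping step concrete, here is a deliberately naive sketch that merges character boxes on one text line into words whenever the horizontal gap between neighbours is small. It is not CERMINE's segmentation algorithm; the `Box` class, the `group_into_words` helper and the `max_gap` value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A piece of text with its bounding box (hypothetical representation)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def merge(boxes):
    """Concatenate the texts and take the union of the bounding boxes."""
    return Box("".join(b.text for b in boxes),
               min(b.x0 for b in boxes), min(b.y0 for b in boxes),
               max(b.x1 for b in boxes), max(b.y1 for b in boxes))


def group_into_words(chars, max_gap=2.0):
    """Group the characters of a single line into words by horizontal gaps."""
    words, current = [], []
    for ch in sorted(chars, key=lambda c: c.x0):
        # A gap wider than max_gap (an assumed threshold) starts a new word.
        if current and ch.x0 - current[-1].x1 > max_gap:
            words.append(merge(current))
            current = []
        current.append(ch)
    if current:
        words.append(merge(current))
    return words


line = [Box("H", 0, 0, 5, 10), Box("i", 5.5, 0, 7, 10),
        Box("y", 12, 0, 17, 10), Box("o", 17.5, 0, 22, 10)]
words = group_into_words(line)
```

The same idea could be repeated vertically to merge words into lines and lines into zones, with thresholds scaled to the font size.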


Page 24: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Zone text matching

Labels were assigned to zones:

the text content of zones was matched with the corresponding NLM files,

the Smith-Waterman sequence alignment algorithm was used to measure string similarity,

the label was chosen by selecting the string with the highest similarity score above a threshold,

an additional attempt was made to assign a label to every "unknown" zone based on the labels of the neighbouring zones.
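The matching step can be sketched as follows. The Smith-Waterman recurrence itself is standard, but the scoring values, the word-level tokenisation, the normalisation to the range 0 to 1 and the threshold of 0.7 are assumptions chosen for this example, since the slides do not state them. `choose_label` and `similarity` are hypothetical helper names.

```python
MATCH, MISMATCH, GAP = 2, -1, -1  # assumed scoring scheme


def smith_waterman(a, b):
    """Best local alignment score between two sequences (strings or lists)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(zone_text, candidate):
    """Word-level alignment score divided by the best score the shorter text could reach."""
    a, b = zone_text.lower().split(), candidate.lower().split()
    if not a or not b:
        return 0.0
    return smith_waterman(a, b) / (MATCH * min(len(a), len(b)))


def choose_label(zone_text, nlm_strings, threshold=0.7):
    """Pick the label whose NLM string is most similar, or "unknown" below the threshold."""
    best_label, best_score = "unknown", 0.0
    for label, text in nlm_strings.items():
        score = similarity(zone_text, text)
        if score >= threshold and score > best_score:
            best_label, best_score = label, score
    return best_label


nlm = {
    "title": "A large ground truth dataset of scientific articles",
    "abstract": "In this paper we present a method for building a dataset",
}
label = choose_label("a large ground truth dataset of scientific articles", nlm)
```

Zones left as "unknown" by this step are the ones the neighbour-based second pass would revisit.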


Page 25: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Document filtering

43% of all processed documents have at least 90% of zones labelled.

[Figure: histogram. x-axis: percentage of labelled zones (0 to 100); y-axis: fraction of documents in bin (0.00 to 0.07).]

Page 26: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Distribution similarity

Publisher distribution similarity of two datasets A and B can be calculated as:

$$\mathrm{sim}(A, B) = \sum_{p \in P} \min\bigl(d_A(p), d_B(p)\bigr)$$

where $P$ is the set of all publishers in $A \cup B$, and $d_A(p)$ and $d_B(p)$ are the percentage shares of publisher $p$ in sets $A$ and $B$, respectively.

Some examples:

sim({60% X, 40% Y}, {60% X, 40% Y}) = 1.0

sim({60% X, 40% Y}, {40% X, 60% Y}) = 0.8

sim(entire processed set, selected set) = 0.78

sim({30% X, 70% Y}, {100% Z}) = 0.0
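The measure is a sum of per-publisher minima, which a few lines of Python can reproduce. In this sketch each distribution is a dictionary mapping a publisher name to its share expressed as a fraction; the function name `publisher_similarity` is made up for the example.

```python
def publisher_similarity(dist_a, dist_b):
    """Sum, over all publishers in either set, of the smaller of the two shares."""
    publishers = set(dist_a) | set(dist_b)
    # A publisher missing from one set has share 0 there, so it contributes nothing.
    return sum(min(dist_a.get(p, 0.0), dist_b.get(p, 0.0)) for p in publishers)


same = publisher_similarity({"X": 0.6, "Y": 0.4}, {"X": 0.6, "Y": 0.4})
swapped = publisher_similarity({"X": 0.6, "Y": 0.4}, {"X": 0.4, "Y": 0.6})
disjoint = publisher_similarity({"X": 0.3, "Y": 0.7}, {"Z": 1.0})
```

The three calls correspond to the synthetic examples on the slide; the value for the real processed and selected sets depends on data not reproduced here.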


Page 28: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Rules

a zone containing both the title and the authors → title author

page numbers from the range 1–n → page number

figure captions → figure

table captions → table

small zones lying in the close neighbourhood of table zones → table

zones that occur on every page, or on every odd/even page, and are placed close to the top or bottom of the page → bib info
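One way the last rule could be expressed in code is sketched below. The zone representation (dictionaries with `page`, `text`, `y_top`, `y_bottom` and `label` keys), the exact-text comparison, the 10% margin and the assumption that y grows downwards from the top of the page are all choices made for this illustration, not details taken from the GROTOAP2 rules.

```python
from collections import defaultdict


def relabel_repeated_margin_zones(zones, num_pages, page_height, margin=0.1):
    """Relabel zones repeated near the page edge on every (odd/even) page as bib_info."""
    all_pages = set(range(num_pages))
    even_pages = {p for p in all_pages if p % 2 == 0}
    odd_pages = all_pages - even_pages

    def near_edge(zone):
        # Assumes y = 0 at the top of the page and grows downwards.
        return (zone["y_bottom"] <= margin * page_height
                or zone["y_top"] >= (1 - margin) * page_height)

    pages_by_text = defaultdict(set)
    for zone in zones:
        if near_edge(zone):
            pages_by_text[zone["text"]].add(zone["page"])

    # Require at least two occurrences so short documents do not match trivially.
    repeated = {text for text, pages in pages_by_text.items()
                if len(pages) >= 2 and pages in (all_pages, even_pages, odd_pages)}

    return [dict(zone, label="bib_info")
            if near_edge(zone) and zone["text"] in repeated else dict(zone)
            for zone in zones]


zones = []
for page in range(4):
    zones.append({"page": page, "text": "Journal of Examples 2014",
                  "y_top": 10, "y_bottom": 25, "label": "unknown"})
    zones.append({"page": page, "text": f"Body text {page}",
                  "y_top": 300, "y_bottom": 500, "label": "body_content"})
for page in (0, 2):
    zones.append({"page": page, "text": "Open Access",
                  "y_top": 760, "y_bottom": 780, "label": "unknown"})
zones.append({"page": 1, "text": "Lone footer",
              "y_top": 760, "y_bottom": 780, "label": "unknown"})

relabelled = relabel_repeated_margin_zones(zones, num_pages=4, page_height=800)
```

Real running headers often differ slightly between pages, so a practical version would likely compare texts with a similarity measure rather than exact equality.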


Page 30: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

The evaluation

manual evaluation — using a small random sample of documents

indirect evaluation — evaluating the performance of CERMINE trained on GROTOAP2


Page 31: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Manual evaluation

                    without rules             with rules
                    prec.   recall  F-score   prec.   recall  F-score
abstract            0.93    0.96    0.94      0.98    0.98    0.98
acknowledgement     0.98    0.67    0.80      1.0     0.90    0.95
affiliation         0.77    0.90    0.83      0.95    0.95    0.95
author              0.85    0.95    0.90      1.0     0.98    0.99
bib info            0.95    0.45    0.62      0.96    0.94    0.95
body content        0.65    0.98    0.79      0.88    0.99    0.93
conflict statement  0.63    0.24    0.35      0.82    0.89    0.85
copyright           0.71    0.94    0.81      0.93    0.78    0.85
correspondence      1.0     0.72    0.84      1.0     0.97    0.99
dates               0.28    1.0     0.44      0.94    1.0     0.97
editor              -       0       -         1.0     1.0     1.0
equation            -       -       -         -       -       -
figure              0.99    0.36    0.53      0.99    0.46    0.63
glossary            1.0     1.0     1.0       1.0     1.0     1.0
keywords            0.94    0.94    0.94      1.0     0.94    0.97
page number         0.99    0.53    0.69      0.98    0.97    0.98
references          0.91    0.95    0.93      0.99    0.95    0.97
table               0.98    0.83    0.90      0.98    0.96    0.97
title               0.51    1.0     0.67      1.0     1.0     1.0
title author        -       0       -         1.0     1.0     1.0
type                0.76    0.46    0.57      0.89    0.47    0.62
unknown             0.22    0.46    0.30      0.62    0.94    0.75
average             0.79    0.68    0.73      0.95    0.91    0.92
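For readers who want to reproduce this kind of table on their own sample, the sketch below computes per-label precision, recall and F-score from two parallel lists of zone labels. It uses the usual definitions, with the F-score as the harmonic mean of precision and recall; undefined values are returned as `None`, which is an assumption about how the "-" entries arise. The function name and the toy labels are invented for the example.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Map each label to (precision, recall, F-score); None marks an undefined value."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was assigned to a zone that is not p
            fn[g] += 1  # a zone of type g was missed
    scores = {}
    for label in set(gold) | set(predicted):
        assigned = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        prec = tp[label] / assigned if assigned else None
        rec = tp[label] / actual if actual else None
        if prec is None or rec is None or prec + rec == 0:
            f_score = None
        else:
            f_score = 2 * prec * rec / (prec + rec)
        scores[label] = (prec, rec, f_score)
    return scores


gold = ["title", "title", "author", "abstract", "editor"]
predicted = ["title", "author", "author", "abstract", "unknown"]
scores = per_label_scores(gold, predicted)
```

In this toy sample the editor zone is never predicted, so its precision and F-score are undefined while its recall is 0, the same pattern as the editor row in the "without rules" columns.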


Page 32: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

CERMINE-based evaluation

              precision  recall   F-score
title         93.05%     88.40%   90.67%
author        94.38%     90.01%   92.14%
affiliation   84.20%     78.03%   81.00%
abstract      85.24%     83.67%   84.45%
keywords      87.98%     65.30%   74.96%
journal name  71.88%     63.40%   67.38%
volume        96.28%     93.20%   94.72%
issue         49.12%     55.67%   52.19%
pages         47.41%     45.79%   46.59%
year          99.79%     97.80%   98.29%
DOI           96.12%     85.34%   90.41%
average       82.22%     76.96%   79.34%


Page 33: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

CERMINE-based evaluation

           GROTOAP   GROTOAP2        GROTOAP2
                     without rules   with rules
Precision  77.13%    81.88%          82.22%
Recall     55.99%    70.94%          76.96%
F-score    62.41%    75.38%          79.34%


Page 34: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Future work

enriching the ground truth files with the names of the fonts,

assigning more specific body labels, e.g. section titles,

generating a dataset of parsed bibliographic references in a similar way.


Page 35: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Links

GROTOAP2: http://cermine.ceon.pl/grotoap2/

CERMINE web service: http://cermine.ceon.pl


Page 36: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Thank you

Thank you! Questions?

Dominika [email protected]

© 2014 Dominika Tkaczyk. This document is distributed under the Creative Commons Attribution 3.0 license.

The complete text of the license can be seen here: http://creativecommons.org/licenses/by/3.0/
