GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles
Dominika Tkaczyk, Paweł Szostek and Łukasz Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
3rd International Workshop on Mining Scientific Publications, 12 September 2014
D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 1 / 23
Background
CERMINE extracts:
document’s metadata,
bibliographic references,
structured full text.
CERMINE needs a training set for its zone classifiers!
[Figure: CERMINE workflow. A PDF file goes through basic structure extraction, followed by metadata extraction, references extraction and text extraction; each step produces XML, and the results are combined into a JATS document with front, body and back sections.]
Requirements
A good dataset for document region classification should be:
large,
diverse,
preserving document text,
and the way text is displayed,
with fine-grained labels,
open.
GROTOAP
GROTOAP dataset:
113 documents
1,031 pages
20,121 zones
20 zone labels
12 publishers
created by automatic tools + manual correction of every document = non-scalable
∼100% accurate
GROTOAP vs. GROTOAP2
GROTOAP dataset:
113 documents
1,031 pages
20,121 zones
20 zone labels
12 publishers
created by automatic tools + manual correction of every document = non-scalable
∼100% accurate
GROTOAP2 dataset:
13,210 documents
119,334 pages
1,640,973 zones
22 zone labels
208 publishers
created by automatic tools + manually developed correction rules = scalable
∼93% accurate
The content
GROTOAP2 is composed of:
13,210 ground-truth files in XML format storing the content of scientific publications from PubMed Central,
a list of URLs to corresponding PDF files,
a bash script for downloading PDF files from PMC repository.
The model
The document’s model in GROTOAP2 contains:
geometric hierarchical structure: pages, zones, lines, words and characters,
the text content of all the objects,
the dimensions and positions,
the reading order,
zone labels.
Zone labels
front: type, title, author, title author, editor, affiliation, abstract, keywords, bib info, dates, correspondence, glossary, copyright
body: body content, figure, table, equation
back: references, acknowledgment, conflict statement
other: page number, unknown
[Figure: bar chart with one bar per zone label; axis: % of documents (0 to 100). Labels in the order shown: BIB_INFO, BODY_CONTENT, REFERENCES, AFFILIATION, PAGE_NUMBER, ABSTRACT, AUTHOR, DATES, TITLE, COPYRIGHT, ACKNOWLEDGMENT, UNKNOWN, FIGURE, CORRESPONDENCE, CONFLICT_STATEMENT, TABLE, TYPE, KEYWORDS, EDITOR, TITLE_AUTHOR, GLOSSARY, EQUATION.]
TrueViz format
<Document>
  <Page>
    <PageID Value="0"/><PageNext Value="1"/>
    <Zone>
      <ZoneID Value="0"/><ZoneNext Value="1"/>
      <ZoneCorners><Vertex x="55.4" y="34.3"/><Vertex x="250.5" y="58.3"/></ZoneCorners>
      <Classification><Category Value="TITLE"/><Type Value=""/></Classification>
      <Line>
        <LineID Value="0"/><LineNext Value="1"/>
        <LineCorners><Vertex x="55.4" y="34.3"/><Vertex x="250.5" y="58.3"/></LineCorners>
        <Word>
          <WordID Value="0"/><WordNext Value="1"/>
          <WordCorners><Vertex x="55.4" y="34.3"/><Vertex x="115.3" y="58.3"/></WordCorners>
          <Character>
            <CharacterID Value="0"/><CharacterNext Value="1"/>
            <CharacterCorners><Vertex x="55.4" y="34.3"/><Vertex x="74.1" y="58.3"/></CharacterCorners>
            <GT_Text Value="B"/>
          </Character>[...]</Word>[...]
      </Line>[...]</Zone>[...]
  </Page>[...]
</Document>
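As a rough illustration of how such a file might be consumed, the sketch below reads zone labels, bounding boxes and text from a TrueViz fragment using Python's standard library. The element names follow the example above; the `SAMPLE` string and the `read_zones` helper are made up for this illustration, and real GROTOAP2 files may contain details not shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened TrueViz fragment modelled on the slide's example.
SAMPLE = """<Document><Page><PageID Value="0"/>
<Zone><ZoneID Value="0"/>
<ZoneCorners><Vertex x="55.4" y="34.3"/><Vertex x="250.5" y="58.3"/></ZoneCorners>
<Classification><Category Value="TITLE"/><Type Value=""/></Classification>
<Line><Word>
<Character><GT_Text Value="B"/></Character>
<Character><GT_Text Value="y"/></Character>
</Word></Line>
</Zone></Page></Document>"""


def read_zones(xml_text):
    """Yield (label, corners, text) for every zone in a TrueViz document."""
    root = ET.fromstring(xml_text)
    for zone in root.iter("Zone"):
        label = zone.find("Classification/Category").get("Value")
        # Two opposite corners of the zone's bounding box.
        corners = [(float(v.get("x")), float(v.get("y")))
                   for v in zone.find("ZoneCorners").iter("Vertex")]
        lines = []
        for line in zone.iter("Line"):
            # A word is the concatenation of its characters' GT_Text values.
            words = ["".join(c.get("Value") for c in word.iter("GT_Text"))
                     for word in line.iter("Word")]
            lines.append(" ".join(words))
        yield label, corners, "\n".join(lines)


zones = list(read_zones(SAMPLE))
print(zones)
```

Joining words with spaces and lines with newlines is a simplification chosen for this sketch; the format itself stores reading order through the ID and Next elements.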
The method
[Diagram: processing pipeline. PubMed Central (PDF and NLM files) → CERMINE tools → zone text matching → rules.]
Structure extraction
CERMINE tools were used to:
extract individual characters and their bounding boxes from PDF files,
group individual characters into words, lines and zones,
compute the reading order of all the elements.
Zone text matching
Labels were assigned to zones:
the text content of zones was matched with corresponding NLM files,
the Smith-Waterman sequence alignment algorithm was used to measure string similarity,
the label was chosen by selecting a string with the highest similarity score above a threshold,
an additional attempt was made to assign a label to every "unknown" zone based on the labels of the neighbouring zones.
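The matching step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the scoring parameters (`match`, `mismatch`, `gap`), the normalisation by zone length, the threshold of 0.7 and the helper names `similarity` and `pick_label` are all assumptions made for this example, and the second pass over neighbouring zones is not shown.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best Smith-Waterman local alignment score of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(zone_text, candidate, match=2):
    """Assumed normalisation: share of the zone text aligned to the candidate."""
    if not zone_text:
        return 0.0
    return smith_waterman(zone_text, candidate, match=match) / (match * len(zone_text))


def pick_label(zone_text, nlm_fields, threshold=0.7):
    """Pick the label of the most similar NLM string, or UNKNOWN below threshold."""
    if not nlm_fields:
        return "UNKNOWN"
    scores = {label: similarity(zone_text, text) for label, text in nlm_fields.items()}
    label = max(scores, key=scores.get)
    return label if scores[label] > threshold else "UNKNOWN"


# Hypothetical strings standing in for fields taken from an NLM file.
fields = {
    "TITLE": "A study of zone classification",
    "ABSTRACT": "We describe a dataset of scientific articles.",
}
print(pick_label("A study of zone classification", fields))
print(pick_label("Figure 3", fields))
```

Local alignment suits this task because a zone usually covers only part of a longer NLM field, so a global edit distance would penalise the unmatched remainder.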
Document filtering
43% of all processed documents have at least 90% of zones labelled.
[Figure: histogram of documents by percentage of labelled zones. x-axis: percentage of labelled zones (0 to 100); y-axis: fraction of documents in bin (0.00 to 0.07).]
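A filter of this kind could look like the sketch below. It assumes that documents are kept when at least 90% of their zones received a label other than UNKNOWN; the slide reports the 90% figure but does not spell out the exact cut-off rule, and the function names and sample data are invented for this example.

```python
def labelled_fraction(labels):
    """Share of zones whose label is anything other than UNKNOWN."""
    if not labels:
        return 0.0
    return sum(label != "UNKNOWN" for label in labels) / len(labels)


def select_documents(documents, threshold=0.9):
    """Return ids of documents whose labelled fraction reaches the threshold."""
    return [doc_id for doc_id, labels in documents.items()
            if labelled_fraction(labels) >= threshold]


# Hypothetical documents: id -> list of zone labels after text matching.
documents = {
    "doc-a": ["BODY_CONTENT"] * 9 + ["UNKNOWN"],       # 90% labelled
    "doc-b": ["BODY_CONTENT"] * 8 + ["UNKNOWN"] * 2,   # 80% labelled
    "doc-c": [],                                       # no zones at all
}
selected = select_documents(documents)
print(selected)
```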
Distribution similarity
Publisher distribution similarity of two datasets A and B can be calculated as:
$$\mathrm{sim}(A, B) = \sum_{p \in P} \min\bigl(d_A(p), d_B(p)\bigr)$$
where $P$ is the set of all publishers in $A \cup B$, and $d_A(p)$ and $d_B(p)$ are the percentage shares of a given publisher $p$ in sets A and B, respectively.
Some examples:
sim({60% X, 40% Y}, {60% X, 40% Y}) = 1.0
sim({60% X, 40% Y}, {40% X, 60% Y}) = 0.8
sim(entire processed set, selected set) = 0.78
sim({30% X, 70% Y}, {100% Z}) = 0.0
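The measure is simple enough to check in a few lines. The sketch below represents each dataset as a mapping from publisher to its share expressed as a fraction (an assumption about representation; the function name is also made up) and reproduces the three synthetic examples above.

```python
import math


def distribution_similarity(d_a, d_b):
    """Sum, over all publishers in either set, of the smaller of the two shares."""
    publishers = set(d_a) | set(d_b)
    return sum(min(d_a.get(p, 0.0), d_b.get(p, 0.0)) for p in publishers)


same = distribution_similarity({"X": 0.6, "Y": 0.4}, {"X": 0.6, "Y": 0.4})
swapped = distribution_similarity({"X": 0.6, "Y": 0.4}, {"X": 0.4, "Y": 0.6})
disjoint = distribution_similarity({"X": 0.3, "Y": 0.7}, {"Z": 1.0})
print(same, swapped, disjoint)
```

A publisher missing from one dataset contributes a share of zero, which is why disjoint publisher sets score 0.0.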
Rules
a zone containing both title and authors → title author
page numbers from the range 1–n → page number
figure captions → figure
table captions → table
small zones lying in the close neighbourhood of table zones → table
zones that occur on every page or every odd/even page and are placed close to the top or bottom of the page → bib info
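Two of these rules can be sketched as below. The slides give only the one-line descriptions, so the concrete conditions here (a zone consisting solely of an ASCII number between 1 and the page count; a zone containing the full title and at least one author name) are assumptions, as are the `Zone` class and the `apply_rules` function.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    text: str
    label: str


def apply_rules(zones, n_pages, title, authors):
    """Relabel zones in place using two simplified correction rules."""
    for zone in zones:
        text = zone.text.strip()
        # Rule: a bare number from the range 1..n is a page number.
        if text.isascii() and text.isdigit() and 1 <= int(text) <= n_pages:
            zone.label = "PAGE_NUMBER"
        # Rule: a zone holding both the title and an author name.
        elif title and title in text and any(a in text for a in authors):
            zone.label = "TITLE_AUTHOR"
    return zones


# Hypothetical zones from a ten-page document.
zones = [
    Zone("3", "UNKNOWN"),
    Zone("Deep Zones John Smith", "TITLE"),
    Zone("2014", "UNKNOWN"),
    Zone("Some paragraph.", "BODY_CONTENT"),
]
apply_rules(zones, n_pages=10, title="Deep Zones", authors=["John Smith"])
print([z.label for z in zones])
```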
The evaluation
manual evaluation: using a small random sample of documents
indirect evaluation: evaluating the performance of CERMINE trained on GROTOAP2
Manual evaluation
                    without rules           with rules
                    prec.   recall  F-score prec.   recall  F-score
abstract            0.93    0.96    0.94    0.98    0.98    0.98
acknowledgement     0.98    0.67    0.80    1.0     0.90    0.95
affiliation         0.77    0.90    0.83    0.95    0.95    0.95
author              0.85    0.95    0.90    1.0     0.98    0.99
bib info            0.95    0.45    0.62    0.96    0.94    0.95
body content        0.65    0.98    0.79    0.88    0.99    0.93
conflict statement  0.63    0.24    0.35    0.82    0.89    0.85
copyright           0.71    0.94    0.81    0.93    0.78    0.85
correspondence      1.0     0.72    0.84    1.0     0.97    0.99
dates               0.28    1.0     0.44    0.94    1.0     0.97
editor              -       0       -       1.0     1.0     1.0
equation            -       -       -       -       -       -
figure              0.99    0.36    0.53    0.99    0.46    0.63
glossary            1.0     1.0     1.0     1.0     1.0     1.0
keywords            0.94    0.94    0.94    1.0     0.94    0.97
page number         0.99    0.53    0.69    0.98    0.97    0.98
references          0.91    0.95    0.93    0.99    0.95    0.97
table               0.98    0.83    0.90    0.98    0.96    0.97
title               0.51    1.0     0.67    1.0     1.0     1.0
title author        -       0       -       1.0     1.0     1.0
type                0.76    0.46    0.57    0.89    0.47    0.62
unknown             0.22    0.46    0.30    0.62    0.94    0.75
average             0.79    0.68    0.73    0.95    0.91    0.92
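For readers who want to check individual rows, the F-score is the harmonic mean of precision and recall. The sketch below recomputes two rows of the "without rules" columns; because the table shows precision and recall already rounded to two decimals, some other rows may differ from a recomputed value by 0.01.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


abstract_f = round(f_score(0.93, 0.96), 2)  # "abstract" row, without rules
dates_f = round(f_score(0.28, 1.0), 2)      # "dates" row, without rules
print(abstract_f, dates_f)
```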
CERMINE-based evaluation
              precision  recall     F-score
title         93.05%     88.40%     90.67%
author        94.38%     90.01%     92.14%
affiliation   84.20%     78.03%     81.00%
abstract      85.24%     83.67%     84.45%
keywords      87.98%     65.30%     74.96%
journal name  71.88%     63.40%     67.38%
volume        96.28%     93.20%     94.72%
issue         49.12%     55.67%     52.19%
pages         47.41%     45.79%     46.59%
year          99.79%     97.80%     98.29%
DOI           96.12%     85.34%     90.41%
average       82.22%     76.96%     79.34%
CERMINE-based evaluation
           GROTOAP   GROTOAP2        GROTOAP2
                     without rules   with rules
Precision  77.13%    81.88%          82.22%
Recall     55.99%    70.94%          76.96%
F-score    62.41%    75.38%          79.34%
Future work
enriching the ground truth files with the names of the fonts,
assigning more specific body labels, e.g. section titles,
generating a dataset of parsed bibliographic references in a similar way.
Links
GROTOAP2: http://cermine.ceon.pl/grotoap2/
CERMINE web service: http://cermine.ceon.pl
Thank you
Thank you! Questions?
Dominika Tkaczyk [email protected]
© 2014 Dominika Tkaczyk. This document is distributed under the Creative Commons Attribution 3.0 license.
The complete text of the license can be seen here: http://creativecommons.org/licenses/by/3.0/