Analysing Structured Scholarly Data Embedded in Web Pages
Pracheta Sahoo, Ujwal Gadiraju, Ran Yu, Sriparna Saha and Stefan Dietze
WWW 2016
April 11th, 2016, Montreal, Canada
OVERVIEW
❏ INTRODUCTION
❏ MOTIVATION
❏ RESEARCH QUESTIONS
❏ ANALYSES
❏ CONCLUSIONS
❏ FUTURE WORK
INTRODUCTION (1/3)
The Web: nearly 46 trillion Web pages indexed by Google
VS
Linked Data: approx. 1000 datasets & 100 billion statements
● different order of magnitude w.r.t. scale & dynamics
Are there other semantics (structured facts) on the Web?
INTRODUCTION (2/3)
● Web pages embed structured data (microdata, microformats and RDFa)
○ Interpretation of web documents (search & retrieval)
● Increase in prevalence of embedded markup (2014 Google study of 12 bn pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al. [ISWC’14])
○ Markup from Common Crawl (2.2 bn pages)
○ 17 billion RDF quads
○ Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)
Other semantics (structured facts) on the Web!
INTRODUCTION (3/3)
Characteristics of Markup Data
MOTIVATION
● Embedded markup ⇒ sparsely linked, large % of coreferences, redundant statements
● Uptake and reuse of embedded markup is hindered by the lack of dynamics, scale
● Lack of understanding of the adoption of markup for scholarly resource metadata
WHAT WE BRING TO THE TABLE ...
● Study of scholarly data extracted from embedded annotations (Web Data Commons)
● Shape & characteristics of entity descriptions
● Level of adoption of terms & types, distributions across TLDs, PLDs, data publishers
RESEARCH QUESTIONS
RQ1 What are frequently used terms & types for scholarly data?
RQ2 How are statements about bibliographic data distributed across the web? Who are the key providers of bibliographic markup?
RQ3 What are the frequent errors that can be observed?
DATASET
● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities of type s:ScholarlyArticle or co-occurring on the same document with any s:ScholarlyArticle instance
○ 6,793,764 quads
○ 1,184,623 entities
○ 83 distinct classes
○ 429 distinct predicates
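The subset selection described above can be sketched as a page-level filter over N-Quads, where the fourth element of each quad is the URL of the page the markup was extracted from. This is a minimal sketch, not the authors' pipeline: the regular expression is a simplified N-Quads pattern, the `http://schema.org/` namespace is assumed, and the names `parse_quad` and `scholarly_subset` are hypothetical.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumption: types use the plain http://schema.org/ namespace.
SCHOLARLY_ARTICLE = "http://schema.org/ScholarlyArticle"

# Simplified N-Quads pattern (illustrative, not a full parser):
# subject, predicate, object, graph (the page URL), terminated by " ."
QUAD_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'                    # subject: IRI or blank node
    r'<([^>]*)>\s+'                              # predicate IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"'          # object: IRI, blank node or literal
    r'(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s+'       # optional language tag or datatype
    r'<([^>]*)>\s*\.\s*$'                        # graph IRI
)


def parse_quad(line):
    """Return (subject, predicate, object, page) or None if the line does not match."""
    match = QUAD_RE.match(line)
    return match.groups() if match else None


def scholarly_subset(lines):
    """Keep every quad from pages that contain at least one s:ScholarlyArticle instance."""
    quads = [q for q in map(parse_quad, lines) if q is not None]
    # Pages on which some entity is typed as s:ScholarlyArticle.
    pages = {
        page
        for _subj, pred, obj, page in quads
        if pred == RDF_TYPE and obj == f"<{SCHOLARLY_ARTICLE}>"
    }
    # Everything on those pages: the articles and all co-occurring entities.
    return [q for q in quads if q[3] in pages]
```

The sketch holds all quads in memory for brevity; on a corpus of this size, two streaming passes (first collect the pages, then filter) would be the more realistic design.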
DATASET - Considerations
● s:ScholarlyArticle is the only type which explicitly refers to scholarly articles
● We focus on schema.org, the most widely used schema
● Types considered ⇒ s:ScholarlyArticle, s:Person and s:Organization
○ 280,616 instances (s:ScholarlyArticle)
○ 847,417 instances (s:Person)
○ 3,798 instances (s:Organization)
SCHOLARLY TYPES & PREDICATES (1/2)
Cumulative distribution of predicates over instances across extracted types (ranges marked in the figure: 1 to 14, 1 to 9, 1 to 4)
SCHOLARLY TYPES & PREDICATES (2/2)
Top-10 Predicates for s:ScholarlyArticle
DOMAINS & DOCUMENTS (1/5)
Distribution of Entities & Statements across PLDs
DOMAINS & DOCUMENTS (2/5)
Top-10 PLDs (ranked by no. of entities)
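A ranking of this kind can be approximated by counting distinct entities per pay-level domain. The sketch below is illustrative only: it assumes quads are available as `(subject, predicate, object, page_url)` tuples, and `naive_pld` is a hypothetical heuristic that keeps the last two host labels, which is wrong for suffixes such as `co.uk`; a faithful implementation would consult the Public Suffix List.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def naive_pld(page_url):
    """Rough pay-level domain: last two labels of the host (assumption, see lead-in)."""
    host = urlsplit(page_url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def top_plds(quads, k=10):
    """Return the k PLDs with the most distinct entities, as (pld, count) pairs."""
    entities = defaultdict(set)
    for subject, _pred, _obj, page in quads:
        # Blank-node labels are only unique within a page, so key by (page, subject).
        entities[naive_pld(page)].add((page, subject))
    counts = Counter({pld: len(ents) for pld, ents in entities.items()})
    return counts.most_common(k)
```

Counting statements instead of entities only requires replacing the set with a running counter per PLD.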
DOMAINS & DOCUMENTS (3/5)
Distribution of Entities & Statements across TLDs
DOMAINS & DOCUMENTS (4/5)
Distribution of Entities & Statements across HTML Documents
DOMAINS & DOCUMENTS (5/5)
Top-10 Documents Ranked According to Embedded Entities
TOPICS & PUBLICATION TYPES (1/4)
Distribution of Scholarly Articles across Publishers
TOPICS & PUBLICATION TYPES (2/4)
Top-10 Publishers and corresponding no. of Publications
TOPICS & PUBLICATION TYPES (3/4)
Top-10 Publication Types (genres) across WDC
TOPICS & PUBLICATION TYPES (4/4)
Top-10 Article Titles (ranked by frequency of occurrence)
FREQUENT ERRORS - Schema Violations
Top-10 Misused Predicates
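One plausible reading of a "misused predicate" is a predicate attached to an instance whose type does not define it. The sketch below counts such cases under that assumption; the slides do not specify the exact criterion used. The `ALLOWED` table is a hypothetical, deliberately tiny excerpt of schema.org, and `misused_predicates` is an illustrative name. A real check would load the full vocabulary, including inherited properties.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
S = "http://schema.org/"

# Hypothetical excerpt for illustration; NOT the complete property list of the type.
ALLOWED = {
    S + "ScholarlyArticle": {S + "name", S + "author", S + "datePublished", S + "publisher"},
}


def misused_predicates(quads, allowed=ALLOWED, k=10):
    """Count (type, predicate) pairs where the predicate is not allowed for the type.

    quads: iterable of (subject, predicate, object, page_url) tuples, with IRI
    objects written as '<...>'.
    """
    quads = list(quads)
    # First pass: collect the declared types of each entity, keyed by (page, subject).
    types = {}
    for subj, pred, obj, page in quads:
        if pred == RDF_TYPE:
            types.setdefault((page, subj), set()).add(obj.strip("<>"))
    # Second pass: flag predicates outside the allowed set of a known type.
    counts = Counter()
    for subj, pred, _obj, page in quads:
        if pred == RDF_TYPE:
            continue
        for rdf_type in types.get((page, subj), ()):
            if rdf_type in allowed and pred not in allowed[rdf_type]:
                counts[(rdf_type, pred)] += 1
    return counts.most_common(k)
```

Entities without a declared type, or with a type missing from the table, are skipped rather than flagged, which keeps the count conservative.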
CONCLUSIONS (1/2)
● First study on the coverage & characteristics of bibliographic metadata embedded in web pages.
● Early adopters ⇒ publishers, libraries, other providers of bibliographic data.
● Usage of terms & types ⇒ distribution across providers, domains and topics follows a power law; a few providers & documents contribute the majority of the data.
● Top-k genres & publishers indicate a bias towards French and English data providers.
● Article titles, PLDs & publishers ⇒ bias towards Computer Science and Life Sciences.
● In this study we only consider entities tagged explicitly as "scholarlyArticle"; a deeper analysis considering more types (article, book, etc.) and other creative works can shed light on the true scale and potential of embedded markup data.
CONCLUSIONS (2/2)
FUTURE WORK
● Targeted crawl of typical providers of scholarly data (publishers, academic orgs., libraries, etc.)
● Consider implicitly typed bibliographic or creative work as scholarly data
LIMITATIONS
● Our study is limited to schema.org & the types s:ScholarlyArticle, s:Person and s:Organization.
● We consider only explicitly linked scholarly works.