Analysing Structured Scholarly Data Embedded in Web Pages

Pracheta Sahoo, Ujwal Gadiraju, Ran Yu, Sriparna Saha and Stefan Dietze

WWW 2016

April 11th, 2016, Montreal, Canada

OVERVIEW

❏ INTRODUCTION

❏ MOTIVATION

❏ RESEARCH QUESTIONS

❏ ANALYSES

❏ CONCLUSIONS

❏ FUTURE WORK

INTRODUCTION (1/3)

The Web: nearly 46 trillion Web pages indexed by Google

VS

Linked Data: approx. 1000 datasets & 100 billion statements

● different order of magnitude w.r.t. scale & dynamics

Are there other semantics (structured facts) on the Web?

INTRODUCTION (2/3)

● Web pages embed structured data (microdata, microformats and RDFa)

○ Interpretation of web documents (search & retrieval)

● Increase in prevalence of embedded markup (2014 Google study of 12 bn pages estimates an adoption of 26%)

● “Web Data Commons” (Meusel et al. [ISWC’14])

○ Markup from Common Crawl (2.2 bn pages)

○ 17 billion RDF quads

○ Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)

Other semantics (structured facts) on the Web!

INTRODUCTION (3/3)

Characteristics of Markup Data

MOTIVATION

● Embedded markup ⇒ sparsely linked, large % of coreferences, redundant statements

● Uptake and reuse of embedded markup is hindered by the lack of dynamics, scale

● Lack of understanding of the adoption of markup for scholarly resource metadata

WHAT WE BRING TO THE TABLE ...

● Study of scholarly data extracted from embedded annotations (Web Data Commons)

● Shape & characteristics of entity descriptions

● Level of adoption of terms & types, distributions across TLDs, PLDs, data publishers

RESEARCH QUESTIONS

RQ1 What are frequently used terms & types for scholarly data?

RQ2 How are statements about bibliographic data distributed across the web? Who are the key providers of bibliographic markup?

RQ3 What are the frequent errors that can be observed?

DATASET

● Web Data Commons (WDC) 2014 dataset

● Subset ⇒ all statements describing entities of type s:ScholarlyArticle or co-occurring on the same document with any s:ScholarlyArticle instance

○ 6,793,764 quads

○ 1,184,623 entities

○ 83 distinct classes

○ 429 distinct predicates
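The subset selection above can be sketched in a few lines. This is a minimal illustration, not the authors' extraction pipeline. It assumes quads are already parsed into (subject, predicate, object, graph) tuples, that the graph component holds the URL of the source page, and that the schema.org type is spelled with the `http://schema.org/` prefix. The function name `scholarly_subset` and the sample data are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHOLARLY_ARTICLE = "http://schema.org/ScholarlyArticle"  # assumed IRI spelling


def scholarly_subset(quads):
    """Keep every quad from documents holding at least one s:ScholarlyArticle."""
    quads = list(quads)
    # Pass 1: collect documents (graph URLs) that type some entity as ScholarlyArticle.
    docs = {g for s, p, o, g in quads if p == RDF_TYPE and o == SCHOLARLY_ARTICLE}
    # Pass 2: keep all statements found on those documents, including
    # co-occurring entities such as persons or organizations.
    return [q for q in quads if q[3] in docs]


# Hypothetical sample: one page with a scholarly article, one without.
sample = [
    ("_:a1", RDF_TYPE, SCHOLARLY_ARTICLE, "http://example.org/paper1"),
    ("_:a1", "http://schema.org/author", "_:p1", "http://example.org/paper1"),
    ("_:p1", RDF_TYPE, "http://schema.org/Person", "http://example.org/paper1"),
    ("_:r1", RDF_TYPE, "http://schema.org/Recipe", "http://example.org/cooking"),
]
subset = scholarly_subset(sample)
```

The two-pass design keeps co-occurring entities (here the s:Person node) without having to follow links between them.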

DATASET - Considerations

● s:ScholarlyArticle is the only type which explicitly refers to scholarly articles

● We focus on schema.org, the most widely used schema

● Types considered ⇒ s:ScholarlyArticle, s:Person and s:Organization

○ 280,616 instances (s:ScholarlyArticle)

○ 847,417 instances (s:Person)

○ 3,798 instances (s:Organization)

SCHOLARLY TYPES & PREDICATES (1/2)

Cumulative dist. of predicates over instances across extracted types

Figure labels: 1 to 14, 1 to 9, 1 to 4

SCHOLARLY TYPES & PREDICATES (2/2)

Top-10 Predicates for s:ScholarlyArticle

DOMAINS & DOCUMENTS (1/5)

Distribution of Entities & Statements across PLDs


Top-10 PLDs (ranked by no. of entities)


Distribution of Entities & Statements across TLDs


Distribution of Entities & Statements across HTML Documents


Top-10 Documents Ranked According to Embedded Entities
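Distributions of this kind can be tallied from (entity, document URL) pairs. The sketch below is only illustrative. The helper names are hypothetical, and the PLD is approximated by the last two host labels, which is wrong for suffixes such as `co.uk`. A faithful version would need the Public Suffix List.

```python
from collections import Counter
from urllib.parse import urlsplit


def tld(url):
    """Top-level domain: the last label of the host name."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def naive_pld(url):
    """Rough pay-level domain: last two host labels (an assumption, see above)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def entities_per_domain(entity_docs, key=naive_pld):
    """Count distinct (entity, document) pairs per PLD or TLD."""
    return Counter(key(url) for _, url in set(entity_docs))


# Hypothetical sample data.
sample_docs = [
    ("_:a", "http://www.example.org/p1"),
    ("_:b", "http://lib.example.org/p2"),
    ("_:c", "http://example.fr/x"),
]
by_pld = entities_per_domain(sample_docs)
by_tld = entities_per_domain(sample_docs, key=tld)
```

Calling `by_pld.most_common(10)` would then give a top-10 ranking of the kind shown on these slides.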

TOPICS & PUBLICATION TYPES (1/4)

Distribution of Scholarly Articles across Publishers


Top-10 Publishers and corresponding no. of Publications


Top-10 Publication Types (genres) across WDC


Top-10 Article Titles (ranked by frequency of occurrence)

FREQUENT ERRORS - Schema Violations

Top-10 Misused Predicates

CONCLUSIONS (1/2)

● First study on coverage & characteristics of bibliographic metadata embedded in web pages.

● Early adopters ⇒ publishers, libraries, other providers of bibliographic data.

● Usage of terms, types ⇒ dist. across providers, domains and topics follows a power law; few providers & documents contributing to majority of data.

● Top-k genres & publishers indicate a bias towards French, English data providers.

● Article titles, PLDs & publishers ⇒ bias towards Computer Science and Life Sciences.

● In this study we only consider entities explicitly typed as s:ScholarlyArticle. A deeper analysis considering more types (article, book, etc.) and other creative works can shed light on the true scale and potential of embedded markup data.
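The concentration noted above (few providers contributing the majority of data) can be quantified as the share of entities held by the top-k providers. This is a minimal sketch with made-up counts, not figures from the study. The function name `top_k_share` is hypothetical.

```python
def top_k_share(counts, k):
    """Fraction of all entities contributed by the k largest providers."""
    values = sorted(counts, reverse=True)
    total = sum(values)
    return sum(values[:k]) / total if total else 0.0


# Illustrative entity counts per provider (invented for this example).
provider_counts = [50, 30, 10, 5, 5]
share_top2 = top_k_share(provider_counts, 2)
```

A share close to 1 for small k indicates the heavy-tailed, power-law-like shape described in the conclusions.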

CONCLUSIONS (2/2)

FUTURE WORK

● Targeted crawl of typical providers of scholarly data (publishers, academic orgs., libraries, etc.)

● Consider implicitly typed bibliographic or creative work as scholarly data

Contact Details:

[email protected]

www.L3S.de

LIMITATIONS

● Our study is limited to schema.org & the types of s:ScholarlyArticle, s:Person, s:Organization.

● We consider only explicitly linked scholarly works.