Logical Structure Analysis of Scientific Publications in Mathematics

Logical Structure Analysis ofScientific Publications in Mathematics

Valery Solovyev, Nikita Zhiltsov

Kazan (Volga Region) Federal University, Russia

1 / 44

http://ksu.ru

http://cll.niimm.ksu.ru

http://wims.vestforsk.no

Overview

É LOD Cloud has been growing at 200-300%per year since 2007∗

É Prevalent domains: government (43%),geographic (22%) and life sciences (9%)

É However, it lacks data sets related toacademic mathematics

∗C.Bizer et al. State of the Web of Data.LDOW WWW’11

2 / 44

http://ksu.ru


1 Background

2 Proposed Semantic Model

3 Analysis Methods

4 Experiments and Evaluation

5 Prototype

3 / 44

http://ksu.ru


Mathematical Scholarly PapersEssential features

É Well-structured documentsÉ The presence of mathematical formulaeÉ Peculiar vocabulary (“mathematical

vernacular”)

4 / 44

http://ksu.ru


Research Objectives

Current studyÉ Specification of the document logical structureÉ Methods for extracting structural elements

Long-term goalsÉ A large corpus of semantically annotated papersÉ Semantic search of mathematical papers

5 / 44

http://ksu.ru


Modelling the Structure of ScientificPublicationsABCDE format

É LaTeX-based format to represent the narrativestructure of proceedings and workshop contributions

É Sections:� Annotations (Dublin Core metadata)� Background (e.g. description of research positioning)� Contribution (description of the presented work)� Discussion (e.g. comparison with other work)� Entities (citations)

6 / 44

http://ksu.ru


Modelling the Structure of ScientificPublicationsSALT

É LaTeX-based authoring tool for generatingsemantically annotated PDF documents

É Three ontologies:� SALT Document Ontology� SALT Annotation Ontology� SALT Rhetorical Ontology

7 / 44

http://ksu.ru


SALT Layers

8 / 44

http://ksu.ru


Mathematical KnowledgeRepresentation

É Languages for formalized mathematics� Mizar� Coq� Isabelle

É Semiformal math languages� HELM ontology� MathLang� OMDoc format (+ OMDoc ontology, sTeX)

É Presentation/authoring formats� PDF� LATEX

9 / 44

http://ksu.ru



10 / 44

http://ksu.ru


Trade-off Candidates

É arXMLiv format� XHTML+MathML� Marked up theorem-like elements, sections,

equations� Automatic conversion for LaTeX documents with

styles of available bindings (LaTeXML)� 60% of arXiv.org were converted into the format

É Present work� Follow the slides ⇒

11 / 44

http://ksu.ru


1 Background


3 Analysis Methods


5 Prototype

12 / 44

http://ksu.ru



13 / 44

http://ksu.ru


Proposed Semantic Model

É It is an ontology that captures the structural layoutof mathematical scholarly papers (as in the LaTeXmarkup)

É The segment represents the finest level ofgranularity and has the properties:

� starting and ending positions� the text or math contents� functional role

É Select most frequent segments from samplecollections of genuine papers

É Consider synonyms as one concept (e.g. conjectureand hypothesis)

14 / 44

http://ksu.ru


Proposed Semantic Model (cont.)

É Select basic semantic relations between segmentsfrom the prior-art models

É Integration with SALT Document Ontology classes:� Publication� Section� Figure� Table

15 / 44

http://ksu.ru


Ontology Elementshttp://cll.niimm.ksu.ru/ontologies/mocassin#

16 / 44

http://ksu.ru


http://cll.niimm.ksu.ru/ontologies/mocassin#

1 Background


3 Analysis Methods


5 Prototype

17 / 44

http://ksu.ru


Logical Structure Analysis

É The ontology specifies a controlled vocabulary tosemantic analysis

É Two analysis tasks:� recognizing the types of document segments� recognizing the semantic relations between them

18 / 44

http://ksu.ru


Example

19 / 44

http://ksu.ru


Example (cont.)

20 / 44

http://ksu.ru


Example (cont.)

21 / 44

http://ksu.ru


Recognizing the Types of DocumentSegments

We exploit the LaTeX markup extensively

1 Elicit a LaTeX environment2 Associate it with a string that may beeither the environment name

or the environment title (if available)

3 Filter out standard formatting environments (e.g.center, align, itemize)

4 Compute string similarity between a string andcanonical names of ontology concepts

5 Check if the found most similar concept isappropriate using a predefined threshold

22 / 44

http://ksu.ru


Recognizing Navigational Relations

The dependsOn and refersTo relations are navigational

AssumptionNavigational relations are induced by referentialsentences

ExamplesÉ “By applying Lemma 1, we obtain ...” (dependsOn)É “Theorem 2 provides an explicit algorithm ...”

(refersTo)

23 / 44

http://ksu.ru


Recognizing Navigational RelationsSupervised method

1 Given a segment S; split its text into sentences,tokenize and do POS tagging

2 Referential sentences are ones that contain the \refcommand entries

3 For each sentence:� find mentioned segments; each of them makes a pair

with S (type feature)� for each pair, compute relative positions of segments

normalized by the document size (distance feature)� build a boolean vector for its verbs (verb feature)

24 / 44

http://ksu.ru


Recognizing Navigational Relations(cont.)Supervised method

Example training instance

t1 t2 d1 d2 add ... apply ... relation

proof lemma 0.09 0.27 0 ... 1 ... dependsOn

É Train a learning model using these features and alabeled example set

É Apply the model to classify new induced relations

25 / 44

http://ksu.ru


Recognizing Restricted Relations

The hasConsequence, exemplifies and proves relationsare restricted

AssumptionRestricted relations occur between consecutivesegments

26 / 44

http://ksu.ru


Recognizing Restricted Relations (cont.)Baseline method

According to the ontology, restricted relations involveinstances of three types, separately: Corollary, Exampleand Proof

1 Seek a segment of one of these types

2 Find its segments-predecessors

3 Filter out segments of inappropriate types

4 Return the closest predecessor

27 / 44

http://ksu.ru


1 Background


3 Analysis Methods


5 Prototype

28 / 44

http://ksu.ru


Experimental SetupCollectionsÉ 1355 papers of the “Izvestiya Vysshikh Uchebnykh

Zavedenii. Matematika” journalÉ A sample of 1031 papers from arXiv.org

ImplementationAn open source Java library built upon:É LaTeX-to-XML convertersÉ GATE frameworkÉ WekaÉ Jena

See http://code.google.com/p/mocassin

29 / 44

http://ksu.ru


http://code.google.com/p/mocassin

Segment Recognition EvaluationÉ Evaluation on the arXiv sample onlyÉ Q-gram string matching algorithm was usedÉ The threshold value was optimized w.r.t. F1-score

Type # of F1-scoretrue instances

Axiom 5 1.000Claim 114 0.987Conjecture 152 0.987Corollary 1715 0.995Definition 1838 1.000Example 771 0.999Lemma 4061 0.998Proof 4943 0.997Proposition 3052 0.999Remark 2114 1.000Theorem 4670 0.991other 671 0.892

30 / 44

http://ksu.ru


Ontology Coverage Evaluation

É Evaluation on the both entire collections (“Izvestiya”and arXiv)

É Equations are most ubiquitous segments (52% and69%, respectively)

É The ontology covers types of 91.9% and 91.6% ofsegments (with SALT Section class – 99.5% and99.6%)

31 / 44

http://ksu.ru


Distribution of Segment Types

0%

5%

10%

15%

20%

25%

30%T

heo

rem

Pro

of

Lem

ma

Rem

ark

Coro

llar

y

Def

init

ion

Pro

posi

tion

Ex

ample

oth

ers

Cla

im

Co

nje

cture

Per

centa

ge

of

seg

men

t o

ccu

rren

ces

Izvestiya

arXiv

32 / 44

http://ksu.ru


Evaluation of Navigational RelationRecognition

É A paper contains 51.4 (Izvestiya) and 53.9 (arXiv)referential sentences on the average

É 243 referential sentences were randomly selectedand manually annotated

É 95% were true navigational relationsÉ A decision tree learner (C4.5) was trainedÉ The results were from 10-fold cross validation

Features Accuracy F1-score F1-scorerefersTo dependsOn

type 0.663 0.566 0.752type+distance 0.658 0.663 0.704type+verb 0.704 0.653 0.770type + distance + verb 0.741 0.744 0.772

33 / 44

http://ksu.ru


A Cloud of Frequent Verbs

34 / 44

http://ksu.ru


Evaluation of Restricted RelationRecognition

É Evaluation on the arXiv sample onlyÉ 10% of the documents which contain certain

segments were randomly selectedÉ For each such a segment, corresponding relations

were annotated manuallyÉ Known issues: imported corollaries and examples

for arbitrary text fragments

Relation # of instances F1-scorehasConsequence 178 0.687exemplifies 62 0.613proves 216 0.954

35 / 44

http://ksu.ru


Conclusion on Evaluation

É The ontology covers the largest part of the logicalstructure and appears to be feasible for automaticextraction methods

É The task of segment type recognition has beenaccomplished

É The method for recognizing navigational relationsestablishes ground truth, however, a large-scaleevaluation and learning model selection are required

É The baseline method for recognizing restrictedrelations must be improved by leveraging additionalinformation (discussed in the paper!)

36 / 44

http://ksu.ru


1 Background


3 Analysis Methods


5 Prototype

37 / 44

http://ksu.ru


Prototype

A prototype:É demonstrates our ongoing research on

semantic search of mathematical papersÉ incorporates the logical structure analysis

methodsÉ is integrated with arXiv APIÉ enables enhanced search for arXiv papers

and visualization of their logical structureÉ publishes the semantic index as Linked

Data via SPARQL endpoint

38 / 44

http://ksu.ru


Search Interfacehttp://cll.niimm.ksu.ru/mocassin

39 / 44

http://ksu.ru


http://cll.niimm.ksu.ru/mocassin

Formulating a Queryhttp://cll.niimm.ksu.ru/mocassin

40 / 44

http://ksu.ru



Search Resultshttp://cll.niimm.ksu.ru/mocassin

41 / 44

http://ksu.ru



Preview a Search Resulthttp://cll.niimm.ksu.ru/mocassin

42 / 44

http://ksu.ru



Summary

É The proposed approach aims to analyze thestructure of mathematical scholarly papersin an automatic way

É Our ontology provides a controlledvocabulary for analysis

É The methods elicit document segments interms of the ontology

É The extracted semantic graph can be usedfor:

� discovering important document parts� semantic search of theoretical results

43 / 44

http://ksu.ru


Thanks for your attention!Questions?

44 / 44

http://ksu.ru


Logical Structure Analysis of Scientific Publications in Mathematics

Technology

Transcript of Logical Structure Analysis of Scientific Publications in Mathematics