Service-Oriented Architecture for automatic markup of documents
A use case for legal documents.
Francisco Adolfo Cifuentes-Silva
Library of Congress of Chile - BCN
2014-08-19
“Digital law libraries at the crossroads: Innovative solutions to complex challenges.”
Outline
Project context
Strategic decisions
- SOA
- Linked Open Data
- Akoma-Ntoso
Automatic markup
- Named Entity Recognizer
- URI assignment
- Structural Markup
- Akoma-Ntoso translator
Results and discussion
Conclusions
Future work
Acknowledgements
Project context
It was born in response to two problems:
1. Obtaining all the parliamentary interventions within the legislative process (Congress sessions and related documents)
2. Knowing the evolution of a law and the discussion around it, from the moment it is introduced as a bill until it is published as law
Francisco Adolfo Cifuentes-Silva - Library of Congress of Chile 2
And in an automated way!
How: two sibling projects:
Parliamentary Labor project (PL):
Obtain all the parliamentary interventions within the legislative process (Congress sessions and related documents)
History of the Law project (HL):
Trace the evolution of a law and the discussion around it, from the moment it is introduced as a bill until it is published as law
They are “sibling projects” because both can be built by processing the same documents:
• Session dailies
• Debate reports
• Reports
• Amendments
• Bills
• etc.
Congress and legal resources
Chilean Congress
- Senate
- Chamber of Deputies
Legal resources production
- Session dailies
- Debate reports
- Bills, etc.
Workflow
Business Processes
- Each type of document has its own process flow
- BCN implements a Workflow Management System for PL & HL
Tools
Support tools
- Automatic XML Marker
- Web XML Editor
- An XSD is the basis of the support tools
XML Storage
- SVN server for XML documents
- Allows us to manage all XML versions
- REST access: HTTP GET, PUT
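As an illustration of the REST access listed for the XML storage, the following minimal sketch builds the two kinds of HTTP request (GET to read a document, PUT to store a new version). The base URL and document path are hypothetical, since the slides do not give the real storage address, and the requests are only constructed, not sent.

```python
from urllib.request import Request

# Hypothetical repository root; the real storage URL is not given in the slides.
BASE_URL = "http://svn.example.org/xml"


def build_get(doc_path: str) -> Request:
    """Build an HTTP GET request that would read the latest version of a document."""
    return Request(f"{BASE_URL}/{doc_path}", method="GET")


def build_put(doc_path: str, xml_text: str) -> Request:
    """Build an HTTP PUT request that would store a new version of a document."""
    return Request(
        f"{BASE_URL}/{doc_path}",
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="PUT",
    )


# Example: prepare both requests for one (hypothetical) session document.
read_request = build_get("sessions/2014-08-19.xml")
write_request = build_put("sessions/2014-08-19.xml", "<doc/>")
```

Passing either object to `urllib.request.urlopen` would send it. Whether a plain PUT creates a new revision depends on how the SVN server is configured, which is an assumption here.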
Information Extraction
New information is extracted from the enriched XML in two formats:
- Linked Open Data
- Relational data (fact table)
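The two output formats can be pictured with a small sketch that reads one enriched XML fragment and emits both an N-Triples line (Linked Open Data) and a tuple suitable for a relational fact table. The element names, attributes, URIs and the `mentions` predicate are all hypothetical; the slides do not show the real markup or vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical enriched fragment: a speech whose text references a bill by URI.
ENRICHED = """<speech by="http://example.org/person/1" date="2014-08-19">
  <p>Debate on <ref href="http://example.org/bill/42">bill 42</ref>.</p>
</speech>"""

# Hypothetical predicate, not taken from any published ontology.
MENTIONS = "http://example.org/vocab/mentions"


def extract(xml_text):
    """Return (triples, rows): N-Triples lines and fact-table tuples."""
    root = ET.fromstring(xml_text)
    speaker = root.get("by")
    date = root.get("date")
    triples, rows = [], []
    for ref in root.iter("ref"):
        target = ref.get("href")
        # Linked Open Data view: one triple per referenced resource.
        triples.append(f"<{speaker}> <{MENTIONS}> <{target}> .")
        # Relational view: one row per (speaker, resource, date) fact.
        rows.append((speaker, target, date))
    return triples, rows


triples, rows = extract(ENRICHED)
```

The same pass over the document feeds both outputs, which is one plausible reason to extract from the enriched XML rather than from the plain text.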
Workflow overview (diagram): Congress and legal resources, Workflow, Tools, XML Storage, Information extraction, Linked Open Data
New data is used for a new process
Strategic decisions
Service Oriented Architecture
Our focus:
- HTTP is the base
- REST Web Services
- W3C Web Standards
SOA components (diagram): Workflow Management System, Automatic Markup, XML Editor, RDF Triplestore, SVN XML, Mediator, Web Services
Linked Open Data (LOD)
BCN has published LOD since 2011:
- Dataset of legal norms
- Dataset of legislative documents
- Datasets and ontologies about:
– People
– Geographic places
– Organizations
– Others, such as roles, bills and congress structure
Please visit http://datos.bcn.cl
Linked Open Data
For automatic markup we are using:
• URIs for legal documents
• URIs for metadata
• URIs for named entities:
– URIs for people
– URIs for organizations
– URIs for roles
– URIs for events
– URIs for locations
– … URIs for all
The definition of an XML Schema
We need an XML Schema to mark up documents and, eventually, to interchange them, so we have two main choices:
• Own XML Schema = low interoperability and reusability, high cost
• Standard XML Schema = high interoperability and reusability, low cost
Standard XML Schema = high interoperability and reusability, low cost
OK, but why Akoma-Ntoso?
Akoma-Ntoso
- XML Schema for legal documents, designed and supported by “great minds” in the OASIS group
- Supports many types of documents (session dailies, bills, debate reports, amendments, among others)
- There is a growing set of tools for working with it, such as Web XML editors and office editor tools, for example:
– LegisProWeb
– Bungeni
– Lime Editor
Automatic markup in XML
Automatic XML Marker (diagram): Plain text, Named entities recognition, URI assignment, Structural markup, Akoma-Ntoso translation, AKN XML
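The stages of the Automatic XML Marker can be read as a chain in which each stage consumes the output of the previous one. The sketch below illustrates only that chaining; every stage body is a toy stand-in (the real stages are separate web services whose interfaces the slides do not show), and the entity tag, URI and wrapper elements are hypothetical.

```python
from functools import reduce


def recognize_entities(doc: str) -> str:
    # Stand-in for the NER service: tags one hard-coded name.
    return doc.replace("Senate", "<entity>Senate</entity>")


def assign_uris(doc: str) -> str:
    # Stand-in for the URI assignment service: attaches a made-up URI to tagged entities.
    return doc.replace("<entity>", '<entity uri="http://example.org/org/senate">')


def structural_markup(doc: str) -> str:
    # Stand-in for structure detection: wraps the whole text in one section.
    return f"<section>{doc}</section>"


def translate_to_akn(doc: str) -> str:
    # Stand-in for the translator: wraps the result in a root element.
    return f"<akomaNtoso>{doc}</akomaNtoso>"


STAGES = [recognize_entities, assign_uris, structural_markup, translate_to_akn]


def mark_up(plain_text: str) -> str:
    """Run the text through every stage, in order."""
    return reduce(lambda doc, stage: stage(doc), STAGES, plain_text)


result = mark_up("The Senate met.")
```

Because each stage only depends on the previous stage's output, any one of them can be replaced or retrained without touching the others, which is the property a service-oriented design aims for.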
Named Entity Recognizer (NER)
- We need to identify entities in the text
- We are using a Spanish-adapted version of Stanford NER, which uses a CRF classifier
- The classifier was trained on large documents, achieving over 80% effectiveness in entity recognition
Web service, written in Java and based on Stanford NER
URI assignment
- Once the NER has found all entities, we need to assign each one its URI
- The tool that does this is called “The Mediator”; it has been developed in collaboration with the Weso Research Group of the University of Oviedo.
Mediator output in XML
Web service, written in Java and based on Apache Lucene
Mediator features
- Connected to a SPARQL endpoint
- It allows context information to be set for each work session (e.g. date, chamber, type of document being marked up)
- Using the context information, it applies a set of heuristics for each entity type to identify the correct URI for each entity
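One way to picture a context-based heuristic is the sketch below, which resolves an ambiguous surname by keeping only the candidates whose chamber and term of office match the session context. The data model, the people, the URIs and the rule itself are assumptions made for illustration; the slides do not describe the Mediator's actual heuristics, and in the real tool the candidates would presumably come from its index or SPARQL endpoint rather than a hard-coded list.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Candidate:
    uri: str
    label: str
    chamber: str
    term_start: date
    term_end: date


def pick_uri(mention: str, candidates: List[Candidate],
             session_date: date, chamber: str) -> Optional[str]:
    """Return the URI of the single candidate consistent with the context, else None."""
    matches = [
        c for c in candidates
        if mention.lower() in c.label.lower()
        and c.chamber == chamber
        and c.term_start <= session_date <= c.term_end
    ]
    # Only answer when the context leaves exactly one candidate.
    return matches[0].uri if len(matches) == 1 else None


# Hypothetical candidates sharing a surname.
CANDIDATES = [
    Candidate("http://example.org/person/1", "Juan Pérez", "senate",
              date(2006, 3, 11), date(2014, 3, 10)),
    Candidate("http://example.org/person/2", "Ana Pérez", "deputies",
              date(2010, 3, 11), date(2018, 3, 10)),
]

chosen = pick_uri("Pérez", CANDIDATES, date(2012, 5, 2), "senate")
```

Returning `None` when zero or several candidates remain leaves the decision to a human editor instead of guessing, which fits a workflow that includes manual review.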
Structural markup
- The problem is to detect structural sections
- Combination of methods:
– Regular expressions
– Algorithms for detecting sequences
– Rules and algorithms
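A minimal sketch of how a regular expression and a sequence check can work together: the pattern finds lines that look like article headings, and the sequence check keeps only those that continue the numbering 1, 2, 3, and so on, which filters out in-text references such as "Article 5 of the previous law". The heading pattern and the sample lines are assumptions; the slides do not show the rules actually used, and real documents would need a Spanish pattern.

```python
import re

# Assumed heading shape: "Article <number>" at the start of a line.
HEADING = re.compile(r"^Article (\d+)\b", re.IGNORECASE)


def find_articles(lines):
    """Return (line_index, number) for headings that continue the sequence 1, 2, 3, ..."""
    found = []
    expected = 1
    for index, line in enumerate(lines):
        match = HEADING.match(line.strip())
        # Accept the line only if its number is the next one in the sequence.
        if match and int(match.group(1)) == expected:
            found.append((index, expected))
            expected += 1
    return found


SAMPLE = [
    "Article 1. Scope",
    "As stated in",
    "Article 5 of the previous law, this rule applies.",
    "Article 2. Definitions",
]

headings = find_articles(SAMPLE)
```

The regular expression alone would accept the third line; the sequence check is what rejects it.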
- The combination of methods depends on each document type
- Finally, the object representation of the document (similar to a DOM) is converted to ad-hoc XML
Web service, written in Java
Akoma-Ntoso translator
- We need AKN documents for editing, enrichment and extraction
- AKN is a complex schema
- The best solution was to build a web service that converts ad-hoc XML to AKN
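The translation step can be sketched as a function that reads a simple ad-hoc document and rebuilds it with Akoma-Ntoso element names. The ad-hoc vocabulary (`session`, `intervention`, `speaker`) is invented for this example, the namespace is that of the later OASIS AKN 3.0 release and is used here as an assumption, and the output is deliberately incomplete (no `meta` block) and has not been validated against the schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the OASIS AKN 3.0 release, assumed for illustration.
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"


def akn(tag: str) -> str:
    """Qualify a tag name with the AKN namespace in ElementTree notation."""
    return f"{{{AKN_NS}}}{tag}"


def to_akn(adhoc_xml: str) -> str:
    """Convert a hypothetical ad-hoc <session> document to an AKN-style debate."""
    source = ET.fromstring(adhoc_xml)
    ET.register_namespace("", AKN_NS)  # serialize with a default namespace
    root = ET.Element(akn("akomaNtoso"))
    debate = ET.SubElement(root, akn("debate"))
    body = ET.SubElement(debate, akn("debateBody"))
    section = ET.SubElement(body, akn("debateSection"), {"name": "debate"})
    for item in source.iter("intervention"):
        # Each ad-hoc intervention becomes a speech attributed to its speaker.
        speech = ET.SubElement(section, akn("speech"), {"by": item.get("speaker", "")})
        paragraph = ET.SubElement(speech, akn("p"))
        paragraph.text = (item.text or "").strip()
    return ET.tostring(root, encoding="unicode")


akn_xml = to_akn('<session><intervention speaker="#p1"> Hello </intervention></session>')
```

Keeping the mapping in one place means the upstream stages can stay schema-agnostic; targeting another output schema would only require a different translator.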
Web service, written in Java
Results and discussion
Positive impact on the work: the time needed for XML markup is dramatically reduced compared to manual labeling of documents, which reduces the time and cost of product generation
Time for completing a History of the Law in distinct scenarios
Conclusions
SOA has allowed us to improve each component separately, with a positive impact on the final result (e.g. datasets, NER training, heuristics)
It is possible to integrate additional XML Schemas as output
The automatic markup of XML documents, followed by manual enrichment of metadata, provides an excellent source for data extraction
Our SOA-based solution allows us to integrate exceptions and new cases into the markup easily
Future Work
Alfonso Pérez, Director of the BCN, has established the concept of the “Semantic Library” as one of the main objectives of the BCN in the institutional strategic plan.
This new concept implies applying the automatic markup approach to all BCN areas, developing new markup schemas and facing possible new challenges in identifying document sections and semantic content.
Acknowledgements
• Library of Congress of Chile team
• Developers team
– Ricardo Muñoz
– Claudio Devia
– Eridan Otto
– David Vilches
– Me
Thanks for your attention!
If you need more details, you can contact me:
fcifuentes <at> bcn <dot> cl
twitter.com/fcifuentes
www.slideshare.net/francisco.cifuentes
www.linkedin.com/in/fcifuentes