MASWS: Natural Language and the Semantic Web
Kate Byrne
School of Informatics
14 February 2011
1 / 27
Populating the Semantic Web
  why and how?
  text to RDF
Organising the Triples
  schema design
  open issues and problems
References
2 / 27
Populating the Semantic Web why and how?
The Semantic Web needs data
• Semantic Web first proposed in 1999
• Why hasn’t it taken off the way WWW did?
[Diagram: a vicious circle — “not enough data” → “queries aren’t useful” → “I’ll delay adding my data”]
3 / 27
Populating the Semantic Web why and how?
How to get data
How do we populate the Semantic Web?
[Diagram: create RDF from unstructured and semi-structured text; convert structured data]
4 / 27
Populating the Semantic Web why and how?
Semi-structured content is everywhere
5 / 27
Populating the Semantic Web why and how?
Why convert unstructured text?
• Extracting RDF triples from free text is not easy
• So why is it worth doing?
  • there was knowledge before the WWW (!)
  • high-quality material, often historical
  • big digitisation push in the 1990s (and again in 2011?) – but still challenging to search text effectively
  • natural language is our preferred way of communicating
• Can the web learn to talk in natural language...?
6 / 27
Populating the Semantic Web why and how?
7 / 27
Populating the Semantic Web txt2rdf
Natural Language Processing
• Extraction of semantic content from natural language text
• Systems now available – OpenCalais, Powerset, AlchemyAPI
• To explain the process: part of my PhD work
Think of RDF triples as declarative sentences: RDF nodes = nouns; RDF arcs = verbs
“Kate likes chocolate.”
:kate :likes :chocolate .
prefix : <http://www.ltg.ed.ac.uk/>
8 / 27
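The “nodes = nouns, arcs = verbs” idea above can be sketched in a few lines. This is a toy illustration only (not the lecture’s actual pipeline): it naively splits a subject–verb–object sentence into a triple; the prefix URI is the one from the slide.

```python
# Toy sketch: treat a simple subject-verb-object sentence as an RDF
# triple, with the nouns as nodes and the verb as the arc.
PREFIX = "http://www.ltg.ed.ac.uk/"  # prefix from the slide's example

def sentence_to_triple(sentence: str) -> tuple:
    """Naively split 'Kate likes chocolate.' into (:kate, :likes, :chocolate)."""
    subj, verb, obj = sentence.strip().rstrip(".").split()
    return tuple(":" + w.lower() for w in (subj, verb, obj))

def to_turtle(triple: tuple) -> str:
    """Render one triple as a minimal Turtle snippet."""
    s, p, o = triple
    return f"@prefix : <{PREFIX}> .\n{s} {p} {o} ."

triple = sentence_to_triple("Kate likes chocolate.")
print(to_turtle(triple))
```

Real text is of course nothing like this tidy: the rest of the lecture is about what it takes when sentences are long, noisy and historical.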
Populating the Semantic Web txt2rdf
Converting text to RDF
[Diagram: free text → “fancy NLP processing and RDFisation” → RDF triples]
9 / 27
Populating the Semantic Web txt2rdf
Example – RCAHMS text
site 20
Evidence of a quartz knapping site was found within the confines of the stone circle, and in conjunction with several structures within the inner ring, strongly suggests a domestic site. Besides the quartz implements and corresponding waste, several other artifacts of local origin occurred, including a split pebble axe of greenstone with Shetland Early Bronze Age affinities. B Beveridge, 1972.
Field survey and excavation, as a response to continual wind and marine erosion, was carried out at the Sands of Breckon between 1982 and 1983. HP50NW 11.00 was recorded as stone settings surrounded by occupational debris (Site 22). Excavation revealed midden deposits of an early Iron Age date and a surface scatter of artefacts of mixed dates. The stone settings were tentatively interpreted as the basal stones of long cists. Historic Scotland Archive Project (SW) 2002.
10 / 27
Populating the Semantic Web txt2rdf
The txt2rdf process in a nutshell
• First find the important nouns, or “named entities”:
Example
Mr. Peter Moar found a small hoard of 5 polished stone knives on the same patch in May 1946.
• Then look for relationships between them:
Mr. Peter Moar found a small hoard of 5 polished stone knives on the same patch in May 1946.
[On the slide, the entities and the relations between them are highlighted in the repeated sentence]
11 / 27
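The first step (“find the important nouns”) can be sketched with pattern- and gazetteer-based entity spotting. The real system uses a trained NER model; the patterns below are purely illustrative, hand-written for this one example sentence.

```python
import re

# Illustrative-only NE patterns: a title+name shape, month-year dates,
# and a one-entry artefact "gazetteer". A trained NER model replaces
# all of this in the actual pipeline.
PATTERNS = {
    "PERSNAME": r"Mr\.\s[A-Z][a-z]+\s[A-Z][a-z]+",
    "DATE": (r"(?:January|February|March|April|May|June|July|August"
             r"|September|October|November|December)\s\d{4}"),
    "ARTEFACT": r"polished stone knives",  # gazetteer entry, not a general rule
}

def find_entities(text: str):
    """Return sorted (entity_text, NE_class) pairs found in the text."""
    found = []
    for ne_class, pat in PATTERNS.items():
        for m in re.finditer(pat, text):
            found.append((m.group(), ne_class))
    return sorted(found)

text = ("Mr. Peter Moar found a small hoard of 5 polished stone knives "
        "on the same patch in May 1946.")
print(find_entities(text))
```

The second step, relation extraction, then looks at pairs of these entities, as the pipeline slide shows.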
Populating the Semantic Web txt2rdf
My txt2rdf pipeline
[Diagram — pipeline: Text documents → Pre-processing (sentence and para split; tokenise; POS tag; multi-word tokens and features) → Named Entity Recognition (trained NER model → list of NEs and classes) → Relation Extraction (set of NE pairs and features; trained RE model → list of relations and classes; remove unwanted relations; attach site ids; generate triples) → RDF translation → Graph of triples]
12 / 27
Populating the Semantic Web txt2rdf
Dealing with EVENTs
Mr. Peter Moar found a small hoard of 5 polished stone knives on the
same patch in May 1946.
• FIND event is n-ary relation:
eventRel(eventAgent, eventAgentRole, eventDate, eventPatient, eventPlace)
find123(“Mr. Peter Moar”,,“May 1946”,“polished stone knives”,“Hill of Shurton”)
Split event relation into RDF triples
:find123 :hasAgent :peterMoar .
:find123 :hasDate :may1946 .
:find123 :hasPatient :polishedStoneKnives .
:find123 :hasLocation :hillOfShurton .
13 / 27
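The split of the n-ary FIND event into binary triples can be sketched directly from the slide’s slot list. A minimal version, with empty slots (here the agent role) dropped rather than emitted as empty triples:

```python
# One property per slot of the n-ary event relation, in the slide's order.
SLOT_PROPERTIES = [":hasAgent", ":hasAgentRole", ":hasDate",
                   ":hasPatient", ":hasLocation"]

def event_to_triples(event_id: str, slots: list) -> list:
    """Map each filled slot of an n-ary event to one (s, p, o) triple."""
    return [(event_id, prop, value)
            for prop, value in zip(SLOT_PROPERTIES, slots)
            if value]  # skip unfilled slots, e.g. the missing agent role

triples = event_to_triples(
    ":find123",
    [":peterMoar", "", ":may1946", ":polishedStoneKnives", ":hillOfShurton"],
)
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

This is the standard reification-by-event-node pattern for n-ary relations in RDF: the event gets its own URI, and each argument becomes one binary property of it.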
Populating the Semantic Web txt2rdf
Archaeological events
• Example was from RCAHMS data: http://canmore.rcahms.gov.uk/en/site/1102/details/hill+of+shurton/
• Extracting events allows structured searches
• Question: What do we do with NEs that are not in relations?
For more information: Byrne and Klein (2009), Automatic Extraction of Archaeological Events from Text.
14 / 27
Populating the Semantic Web txt2rdf
Available NLP systems
• OpenCalais: http://viewer.opencalais.com/
• Powerset: now part of Microsoft’s Bing
• AlchemyAPI: http://www.alchemyapi.com/api/demo.html
• Trained on lots of (English) text, mostly news articles
• Many general-purpose NLP tools available online, eg
  • NLTK (http://www.nltk.org/)
  • Edinburgh LTG software (http://www.ltg.ed.ac.uk/software)
  • NaCTeM (http://nactem.ac.uk/)
15 / 27
Organising the Triples schema design
Populating the Semantic Web
  why and how?
  text to RDF
Organising the Triples
  schema design
  open issues and problems
References
16 / 27
Organising the Triples schema design
Do we have a data schema?
How do we populate the Semantic Web?
[Diagram: create RDF from unstructured and semi-structured text; convert structured data]
16 / 27
Organising the Triples schema design
A schema for free text
• So far we’ve looked at instance data (ABox statements)
• Where is the framework to slot instances into?
• Named Entity categories: PERSON, PLACE, DATE, etc
• NE categories become RDFS classes:
Mr. Peter Moar found a small hoard of 5 polished stone knives on the same patch in May 1946.
Entities are typed by their NE class
:peterMoar rdf:type :person .
:found rdf:type :event .
:polishedStoneKnives rdf:type :artefact .
:may1946 rdf:type :date .
17 / 27
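Emitting those typing statements is mechanical once the NER step has assigned classes. A minimal sketch, using the entity-to-class mapping from the slide’s example:

```python
# NE classes become RDFS classes; each recognised entity then gets one
# rdf:type triple. This mapping mirrors the slide's example output.
NE_CLASS = {
    ":peterMoar": ":person",
    ":found": ":event",
    ":polishedStoneKnives": ":artefact",
    ":may1946": ":date",
}

def type_triples(ne_class: dict) -> list:
    """Emit one (entity, rdf:type, class) triple per recognised entity."""
    return [(ent, "rdf:type", cls) for ent, cls in ne_class.items()]

for s, p, o in type_triples(NE_CLASS):
    print(f"{s} {p} {o} .")
```

Note that this answers the TBox question only partially: the classes exist, but how they relate to each other is the hierarchy-design problem of the next slide.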
Organising the Triples schema design
RDFS class hierarchy
• NE categories usually a flat list, not hierarchical
• My set: ORG, PERSNAME, ROLE, SITETYPE, ARTEFACT, PLACE, SITENAME, ADDRESS, PERIOD, DATE, EVENT
• Contrast with structured data
• Hand-designed hierarchy (with rdfs:subClassOf ):
http://www.ltg.ed.ac.uk/tether/
[Diagram: class hierarchy with top-level classes agent (org, person, role), time (date, period), loc (place, sitename, address, grid), classn (sitetype, objtype), siteid, and event (survey, excavation, find, visit, description, creation, alteration)]
18 / 27
Organising the Triples schema design
Property labels
• Can we extract RDF predicate set automatically from text?
• Eg clustering commonly occurring verbs
• But – schema design only has to be done once
• Property set and class hierarchy are related
• My set: hasEvent, hasAgent, hasAgentRole, hasPeriod, hasPatient, hasLocation, hasClassn, hasObject, partOf
• Plus “standard” ones like owl:sameAs, rdfs:seeAlso, skos:broader
• Flat list (no rdfs:subPropertyOf )
19 / 27
Organising the Triples schema design
Existing schemas and vocabularies
• Lots of vocabularies to choose from
  • http://esw.w3.org/topic/VocabularyMarket
• RDFS and OWL are fundamental
• SKOS – translate existing domain thesauri
  • http://www.w3.org/2004/02/skos/
• Reuse where possible; invent local scheme where not
• Incentive to reuse is strong
20 / 27
Organising the Triples open issues and problems
Issues around schema design
• Schema granularity
  • more categories to choose from ⇒ NLP less accurate
  • so class hierarchy quite coarse and flat
• Schema discovery
  • tension between simplicity and query power
  • how will software agents “understand” your schema?
21 / 27
Organising the Triples open issues and problems
Issues around NLP accuracy
• Disambiguation of extracted NEs (Mr Peter Moar, P Moar, etc)
• NLP is not 100% accurate!
  • how many false statements can the Semantic Web stand?
  • “the computer said it, so it must be true”
• Even given canonical URI :peterMoar for our Peter Moar...
• ...can we distinguish him from others with the same name?
22 / 27
Organising the Triples open issues and problems
Grounding local data
• Grounding: tie unique canonical URIs to authoritative reference
• Is http://www.ltg.ed.ac.uk/tether/Loc/Place#edinburgh the same as http://www.geonames.org/2650225/edinburgh.html?
• Grounding during NLP processing impractical
• Use owl:sameAs? – redundant triples, more complex queries
• Who maintains the authoritative lists of URIs?
• Specialist thesauri, eg archaeological classification terms like “Stone setting”
23 / 27
Organising the Triples open issues and problems
Grounding specialist terminology
[Diagram: RDF graph grounding the classification "stone setting" — nodes include siteid:site20, sitename:sands+of+breckon, address:breckon, event:excavation20w158 (rdf:type event:excavation), date:1982, address:hp50nw+11.01+hp+5304+0519, and sitetype:stone+settings20w179 with rdf:type sitetype:stone+setting, which carries rdfs:label "stone setting" and skos:scopeNote "An arrangement of two or more standing stones"; thesaurus links are skos:broader sitetype:religious+ritual+and+funerary, skos:related sitetype:standing+stone / sitetype:stone+circle / sitetype:stone+row, and rdfs:subClassOf sitetype:. Properties used: :hasLocation, :hasEvent, :hasPeriod, :hasClassn.]
Aim: ground place names, people, organisations, etc.
24 / 27
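The grounding step in the diagram can be sketched as a label lookup against a SKOS-style thesaurus (plain Python; the entry's label, scope note, broader and related terms are taken from the slide, while the dict structure and the `ground` helper are illustrative assumptions):

```python
# Hedged sketch: grounding a free-text site-type mention against a
# SKOS-style thesaurus entry by matching its rdfs:label.

THESAURUS = {
    "sitetype:stone+setting": {
        "label": "stone setting",
        "scopeNote": "An arrangement of two or more standing stones",
        "broader": ["sitetype:religious+ritual+and+funerary"],
        "related": ["sitetype:standing+stone",
                    "sitetype:stone+circle",
                    "sitetype:stone+row"],
    },
}

def ground(mention):
    """Return the thesaurus URI whose label matches the mention, if any."""
    wanted = mention.strip().lower().rstrip("s")  # crude singularisation
    for uri, entry in THESAURUS.items():
        if entry["label"].rstrip("s") == wanted:
            return uri
    return None

print(ground("Stone settings"))  # sitetype:stone+setting
```

In practice the match would be fuzzier than this (stemming, spelling variants), which is part of why grounding during NLP processing is hard.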
Organising the Triples open issues and problems
Linked Data
25 / 27
Organising the Triples open issues and problems
Grounding turns data into Linked Data
Grounding local URIs against "authority" nodes is the next big challenge!
26 / 27
References
References to follow up
• NLP software for Semantic Web applications
• OpenCalais: http://opencalais.com/
• AlchemyAPI: http://www.alchemyapi.com/api/demo.html
• VocabularyMarket: http://esw.w3.org/topic/VocabularyMarket
• Case study
• Byrne and Klein (2009) Automatic Extraction of Archaeological Events from Text. CAA2009, Williamsburg, VA.
• General references for NLP
• Natural Language Processing with Python, Bird, Klein and Loper (http://www.nltk.org/book)
• Speech and Language Processing, Jurafsky and Martin (http://www.cs.colorado.edu/~martin/slp.html)
27 / 27