Sitemap4rdf(v2 boris)

sitemap4rdf: generate Sitemap files from a SPARQL endpoint
http://www.deri.ie/
Boris Villazón-Terrazas and Richard Cyganiak (DERI)
Facultad de Informática, Universidad Politécnica de Madrid
Campus de Montegancedo sn, 28660 Boadilla del Monte, Madrid
http://www.oeg-upm.net
Phone: 34.91.3366605, Fax: 34.91.3524819

description

sitemap4rdf, a tool to generate Sitemap files from a SPARQL endpoint

Transcript of Sitemap4rdf(v2 boris)

Page 1: Sitemap4rdf(v2 boris)

sitemap4rdf
generate Sitemap files from a SPARQL endpoint

http://www.deri.ie/

Boris Villazón-Terrazas and Richard Cyganiak (DERI)
Facultad de Informática, Universidad Politécnica de Madrid
Campus de Montegancedo sn, 28660 Boadilla del Monte, Madrid
http://www.oeg-upm.net

Phone: 34.91.3366605, Fax: 34.91.3524819

Page 2: Sitemap4rdf(v2 boris)

ToC

• Publishing Linked Data from a triple store
• Search engines
• The Sitemap protocol
• sitemap4rdf
• Summary
• Future work

Page 3: Sitemap4rdf(v2 boris)

Linked Data frontends for triple stores

Source: Pubby website, http://www4.wiwiss.fu-berlin.de/pubby/


Page 5: Sitemap4rdf(v2 boris)

Sindice: the best RDF search engine


Page 6: Sitemap4rdf(v2 boris)

Sindice: the best RDF search engine

• 120M+ documents
• Continuously updating since 2006
• Search API
• RDF/XML, Turtle, RDFa, microformats

Page 8: Sitemap4rdf(v2 boris)

Sitemap Protocol

• Used by web crawlers
• Efficiently find all your content & discover what has been updated
• http://sitemaps.org/

A sitemap file contains information regarding one or more URLs on your Web site. The information that is stored there helps search engines better spider your website.

Page 9: Sitemap4rdf(v2 boris)

Sitemap Protocol: Simple example

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://yoursite/</loc>
  </url>
  <url>
    <loc>http://yoursite/products/53546</loc>
  </url>
  <url>
    <loc>http://yoursite/products/98421</loc>
  </url>
  <url>
    <loc>http://yoursite/products/41003</loc>
  </url>
</urlset>
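To illustrate how a crawler might read such a file, here is a minimal Python sketch that extracts every <loc> value from a urlset document. The function name list_locations and the shortened EXAMPLE document are illustrative assumptions, not part of the slides or of sitemap4rdf.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Sitemap protocol (taken from the example above).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_locations(sitemap_xml):
    """Return the text of every <loc> found under <urlset>/<url>."""
    # Parse from bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Shortened copy of the slide's example, used only for illustration.
EXAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://yoursite/</loc></url>
  <url><loc>http://yoursite/products/53546</loc></url>
</urlset>"""

locations = list_locations(EXAMPLE)
```

The lookup has to be namespace-aware: searching for a plain "url/loc" path would find nothing, because every element lives in the sitemaps.org namespace.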

Page 10: Sitemap4rdf(v2 boris)

Sitemap Protocol: Optional parts

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://yoursite/</loc>
    <lastmod>2010-06-24</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>
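A sitemap with the optional elements could be produced along the following lines. This is a sketch only: build_sitemap and the dictionary keys are hypothetical names chosen for illustration, and nothing here is claimed to match how sitemap4rdf writes its output.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize entries to sitemap XML (bytes).

    Each entry is a dict with a required 'loc' key and optional
    'lastmod' and 'changefreq' keys (an assumed input shape).
    """
    # Emit the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = entry["loc"]
        # Optional parts are written only when a value is supplied.
        for optional in ("lastmod", "changefreq"):
            if entry.get(optional):
                child = ET.SubElement(url, f"{{{SITEMAP_NS}}}{optional}")
                child.text = entry[optional]
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


sitemap_bytes = build_sitemap([
    {"loc": "http://yoursite/", "lastmod": "2010-06-24", "changefreq": "daily"},
    {"loc": "http://yoursite/products/53546"},
])
```

Building the tree with ElementTree rather than string concatenation means characters such as & in a URL are escaped automatically.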

Page 11: Sitemap4rdf(v2 boris)

Sitemap Protocol: Huge sitemaps

• Gzip-compress your sitemap
• Limit: 50k URLs or 10MB
• split into multiple sitemap files
• add a sitemap index file
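The splitting step can be sketched as follows: URLs are written in chunks of at most 50,000 per gzip-compressed file, and a sitemap index file lists the parts. The file names (sitemap1.xml.gz, sitemap_index.xml), the function names, and the base_url parameter are assumptions made for this example. The sketch enforces only the URL-count limit and does not check the file-size limit.

```python
import gzip
from pathlib import Path
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-file URL limit quoted on the slide
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_sitemaps(urls, out_dir, base_url, max_urls=MAX_URLS):
    """Write gzip-compressed sitemap parts plus an index; return part names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = []
    for number, chunk in enumerate(chunked(list(urls), max_urls), start=1):
        body = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in chunk)
        xml = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{NS}">\n{body}</urlset>\n'
        )
        name = f"sitemap{number}.xml.gz"  # hypothetical naming scheme
        with gzip.open(out_dir / name, "wt", encoding="utf-8") as handle:
            handle.write(xml)
        names.append(name)
    # The index file points at each part by its public URL.
    entries = "".join(
        f"  <sitemap><loc>{escape(base_url + n)}</loc></sitemap>\n" for n in names
    )
    index = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{NS}">\n{entries}</sitemapindex>\n'
    )
    (out_dir / "sitemap_index.xml").write_text(index, encoding="utf-8")
    return names
```

For example, write_sitemaps(urls, "out", "http://yoursite/") with 120,000 URLs would write three parts and one index under the assumed naming scheme.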

Page 12: Sitemap4rdf(v2 boris)

Sitemap Protocol: Discovery

• Publish the sitemap file

• Add a line to http://yoursite/robots.txt

• Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

Sitemap: http://yoursite/sitemap.xml
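On the consuming side, discovery amounts to scanning robots.txt for Sitemap lines. The sketch below shows one way a crawler might do that; sitemap_urls and the sample robots.txt content are illustrative assumptions.

```python
def sitemap_urls(robots_txt):
    """Return the URLs named in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, which start with '#'.
        line = line.split("#", 1)[0].strip()
        # Split at the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Sample content, invented for this example.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Sitemap: http://yoursite/sitemap.xml
"""

found = sitemap_urls(ROBOTS_TXT)
```

Python's standard library offers the same lookup through urllib.robotparser.RobotFileParser.site_maps() (Python 3.8 and later), which may be preferable to hand-rolled parsing.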


Page 14: Sitemap4rdf(v2 boris)

sitemap4rdf

• Simple command line tool
• Sends a SPARQL query to list all URIs
• Generates sitemap
• Run sitemap4rdf specifying the SPARQL endpoint and the prefix of the URLs to include in the Sitemap

sitemap4rdf http://yoursite/sparql http://yoursite/resource/

Example:

sitemap4rdf http://geo.linkeddata.es/sparql http://geo.linkeddata.es/
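The slides do not show the query that sitemap4rdf sends, so the following Python sketch is only one plausible way to list URIs under a prefix. It assumes a SPARQL 1.1 endpoint (for STRSTARTS) that accepts GET requests and returns the standard SPARQL JSON results format. All function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_query(prefix, limit=1000, offset=0):
    """Build a paged query for subject URIs starting with `prefix`.

    Hypothetical query: not necessarily what sitemap4rdf itself sends.
    """
    # Escape backslashes and quotes so the prefix is a valid SPARQL string.
    safe_prefix = prefix.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?s WHERE { "
        "?s ?p ?o . "
        f'FILTER(isIRI(?s) && STRSTARTS(STR(?s), "{safe_prefix}")) '
        "} "
        f"ORDER BY ?s LIMIT {int(limit)} OFFSET {int(offset)}"
    )


def extract_uris(results):
    """Pull the ?s URI values out of a SPARQL JSON results document."""
    return [
        binding["s"]["value"]
        for binding in results["results"]["bindings"]
        if binding.get("s", {}).get("type") == "uri"
    ]


def list_uris(endpoint, prefix, limit=1000, offset=0):
    """Send the query over HTTP GET and return the matching URIs."""
    query = build_query(prefix, limit, offset)
    url = endpoint + "?" + urlencode({"query": query})
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(request) as response:
        return extract_uris(json.load(response))
```

A call such as list_uris("http://geo.linkeddata.es/sparql", "http://geo.linkeddata.es/") would mirror the command-line example above. Paging with LIMIT and OFFSET keeps each response small, since large stores may cap the number of rows returned per query.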

Page 15: Sitemap4rdf(v2 boris)

Submit the sitemap location - Sindice

• http://sindice.com/main/submit


Page 16: Sitemap4rdf(v2 boris)

Submit the sitemap location - Google

• https://www.google.com/webmasters/tools/


Page 18: Sitemap4rdf(v2 boris)

Summary

• Sitemap protocol informs search engines about available pages
• Supported by Sindice!

• sitemap4rdf generates Sitemap files by listing URIs in a SPARQL endpoint
• Open source, Java
• http://lab.linkeddata.deri.ie/2010/sitemap4rdf/
• http://mccarthy.dia.fi.upm.es/sitemap4rdf/
• http://www.oeg-upm.net/index.php/en/downloads/122-sitemap4rdf

Page 20: Sitemap4rdf(v2 boris)

Future Work

• Integrate sitemap4rdf with Pubby

• Generate voiD file automatically from a SPARQL endpoint

• Generate an entry in CKAN (registry of open knowledge packages) automatically through CKAN-API

• http://ckan.net/package/geolinkeddata

• Interact with prefix.cc (service for remembering and looking up URI prefixes) through its API
• geoes: <http://geo.linkeddata.es/ontology>

Page 21: Sitemap4rdf(v2 boris)

Future Work

• Support the semantic sitemap extension (once it is compatible with Google)
• http://sw.deri.org/2007/07/sitemapextension/
