Page 1:

First they have to find it: Getting Government Data Discovered and Used

Adapted from: John S. Erickson, Ph.D.
Tetherless World Constellation
Rensselaer Polytechnic Institute
Troy, New York, USA

Twitter: @olyerickson #TWCRPI

<Panel: The Art & Science of Data Visualization>

Page 2:

Open Government Data Around the World

Starting with efforts in the US and UK, governments around the world have recognized the need to publish their critical data

Percent of total collection (from 1M+ datasets)

Page 3:

Diverse Approaches to Open Gov't Data

Government data initiatives have taken many forms

GovData portals are widely varied in how they help users discover and use relevant datasets

Percent of total catalogs (from 192 catalogs)

Page 4:

Federated Discovery of Government Data

Stakeholders have seen the need for federated discovery across catalogs, especially from within major search engines including Bing, Google, Yahoo! and Yandex

Page 5:

Government Data in the linked open data cloud

http://linkeddata.org/

Government data currently makes up over half of the cloud by size (~17B triples), with tens of thousands of links to other data (within and without)

Page 6:

Linked Data is Not Enough...

• Publishing open government data as Linked Data is not enough

• For OGD to be useful, datasets must be published using metadata, markup standards and presentation that aid discovery and use

Page 8:

Dataset Metadata for Discovery and Use

Recent work at TWC RPI demonstrates the value of applying emerging standards for uniformly describing government datasets and catalogs

Page 9:

International Open Government Dataset Search

TWC's IOGDS application is an aggregated catalog of more than 1M datasets from over 192 dataset catalogs from governments at every level around the world

See: http://logd.tw.rpi.edu

Page 10:

International Open Government Dataset Search

Anticipates the W3C DCAT RDF vocabulary

Demonstrates what a comprehensive federated catalog based on DCAT and an aggregation API might look like

Page 11:

International Open Government Dataset Search

IOGDS is a multi-year effort based on downloading, scraping or accessing APIs, converting metadata to a proto-DCAT model, and publishing via endpoint and download

[Diagram: IOGDS Workflow. Labels: Catalogs (Web), API, Download, per-site scraper code, ad hoc code, CSV, csv2rdf4lod automation, IOGDS]

See: http://logd.tw.rpi.edu
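
The conversion step in the IOGDS workflow (harvested metadata as CSV, then RDF in a DCAT-like model) can be illustrated with a minimal sketch. This is not the csv2rdf4lod tool itself. The CSV column names (`id`, `title`, `description`, `homepage`), the `example.org` base URI and the function names are assumptions made for illustration; only the DCAT and Dublin Core terms are real vocabulary.

```python
import csv
import io

DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(text):
    """Escape a string as an N-Triples plain literal."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


def row_to_triples(row, base_uri):
    """Turn one harvested CSV row into DCAT-style N-Triples lines.

    Assumes row['id'] is already safe to append to a URI.
    """
    subject = f"<{base_uri}{row['id']}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{DCAT}Dataset> .",
        f"{subject} <{DCTERMS}title> {literal(row['title'])} .",
    ]
    # Optional columns are emitted only when the catalog supplied a value.
    if row.get("description"):
        triples.append(
            f"{subject} <{DCTERMS}description> {literal(row['description'])} .")
    if row.get("homepage"):
        triples.append(f"{subject} <{DCAT}landingPage> <{row['homepage']}> .")
    return triples


def csv_to_ntriples(csv_text, base_uri):
    """Convert a whole CSV document (hypothetical column layout) to N-Triples."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.extend(row_to_triples(row, base_uri))
    return "\n".join(lines)


# Hypothetical harvested record with no description.
SAMPLE_CSV = 'id,title,description,homepage\n42,"Air quality",,http://example.org/air\n'

if __name__ == "__main__":
    print(csv_to_ntriples(SAMPLE_CSV, "http://example.org/dataset/"))
```

Keeping the per-catalog scraping separate from a uniform CSV-to-RDF step means each new catalog only needs a scraper that emits the agreed columns.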

Page 12:

Schema.org: Semantic Markup for Discovery

TWC RPI has published dataset listings based on IOGDS using emerging microdata standards, especially the schema.org model endorsed by Bing, Google, Yahoo!, Yandex...

Page 13:

Schema.org datasets extension

• TWC RPI's schema.org dataset extension will enable government dataset catalogs to more easily be parsed and indexed by the major search engines...

• ...which will help users find relevant datasets!

• TWC's dataset extension entered public discussion June 2012
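
As a rough illustration of what such markup looks like to a search engine crawler, the sketch below renders one dataset listing as HTML with microdata. The property names follow the `Dataset` and `DataCatalog` types as later published on schema.org and may differ from the June 2012 draft; the function name and the sample values are hypothetical.

```python
from html import escape


def dataset_microdata(name, description, url, catalog_name=None):
    """Render one dataset listing as HTML annotated with schema.org microdata."""
    parts = [
        '<div itemscope itemtype="http://schema.org/Dataset">',
        f'  <a itemprop="url" href="{escape(url, quote=True)}">'
        f'<span itemprop="name">{escape(name)}</span></a>',
        f'  <p itemprop="description">{escape(description)}</p>',
    ]
    if catalog_name:
        # Nested item: the catalog this dataset is listed in.
        parts += [
            '  <div itemprop="includedInDataCatalog" itemscope '
            'itemtype="http://schema.org/DataCatalog">',
            f'    <span itemprop="name">{escape(catalog_name)}</span>',
            '  </div>',
        ]
    parts.append('</div>')
    return "\n".join(parts)


if __name__ == "__main__":
    print(dataset_microdata("Air & Water", "Daily readings",
                            "http://example.org/d/1", "Example Catalog"))
```

The visible page is unchanged for human readers; only the `itemscope`, `itemtype` and `itemprop` attributes are added for crawlers.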

Page 14:

Schema.org datasets extension

The schema.org datasets extension enables relevant datasets to be more easily discovered by a range of stakeholders including researchers, data journalists, bloggers and developers

Page 15:

Schema.org datasets extension

“...we've reviewed the current datasets schema proposal in draft, and we are comfortable with the current state of things...

“...At this point, if the group would solidify on the dataset proposal, then Data.gov would support and use it.”

---Chris Musialek

Page 16:

CKAN Data Catalog Scheme & Protocol

API-based catalog federation is also possible

CKAN announced a DCAT-based query/federation API

It enables OAI-PMH-like harvesting and more
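
A harvesting loop of this kind might be sketched as follows. The slides give no details of the DCAT-based API, so this sketch swaps in CKAN's general-purpose `package_search` action with its `rows` and `start` paging parameters. The function names and the injectable `fetch` argument are illustrative choices, and the incremental (date-based) harvesting that OAI-PMH offers is not covered.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode its JSON body (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def harvest(base_url, page_size=100, fetch=fetch_json):
    """Yield every dataset record from a CKAN catalog, one page at a time.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    start = 0
    while True:
        query = urlencode({"rows": page_size, "start": start})
        payload = fetch(f"{base_url}/api/3/action/package_search?{query}")
        results = payload["result"]["results"]
        if not results:
            break
        yield from results
        start += len(results)
        # Stop once the reported total has been reached.
        if start >= payload["result"]["count"]:
            break
```

An aggregator would call `harvest()` once per catalog and pass each record to its own metadata conversion step.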

Page 17:

Dataset extension to schema.org

Page 18:

Demo / links

http://www.w3.org/wiki/WebSchemas/Datasets

http://www.w3.org/wiki/WebSchemas/SchemaDotOrgProposals

Good introduction (longer, with more context):

http://www.slideshare.net/joshsh/semantic-markup-using-schemaorg

Page 19:

Examples of current schema.org results

http://schema-creator.org/event.php

http://schema-creator.org/product.php

Page 20:

To do…

Get Google, Bing, Yahoo, … to crawl these pages

It might look like this: http://www.google.com/publicdata/directory

Page 21:

From Jim Hendler:

Google is now building custom search engines that will pull down schema.org

Dan Brickley is working on one from the Dataset schema, not yet public 

There's also an open govt data search – not much in it, but looks nice – it's at http://www.google.com/publicdata/directory

Page 22:

Retrieve all the logd datasets:

PREFIX dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#>
PREFIX conv: <http://purl.org/twc/vocab/conversion/>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?dataset ?catalog ?catalog_id ?title ?desc ?country ?homepage ?agency_id ?contributor_id
WHERE {
    ?dataset a conv:CatalogedDataset .
    ?dataset void:inDataset ?catalog .
    ?catalog dcterms:identifier ?catalog_id .
    ?dataset <http://purl.org/dc/terms/title> ?title .
    ?dataset dcterms:description ?desc .
    OPTIONAL {
        ?dataset dgtwc:catalog_country ?country .
    }
    OPTIONAL {
        ?dataset <http://xmlns.com/foaf/0.1/homepage> ?homepage .
    }
    OPTIONAL {
        ?dataset dgtwc:agency ?agency .
        ?agency dcterms:identifier ?agency_id .
    }
    OPTIONAL {
        ?dataset <http://purl.org/dc/terms/contributor> ?contributor .
        ?contributor dcterms:identifier ?contributor_id .
    }
    #?dataset dgtwc:catalog_country <http://dbpedia.org/resource/United_States> .
}

Courtesy: Josh Shinavier (RPI/TWC)
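
When a query like the one above is sent to a SPARQL endpoint with the `Accept: application/sparql-results+json` header, the answer arrives in the W3C SPARQL JSON results format. The sketch below, with a hypothetical function name and made-up sample values, flattens such a response into plain rows. Variables bound only inside an `OPTIONAL` block (such as `?homepage`) may be missing from a binding and are returned as `None`.

```python
import json


def bindings_to_rows(results_json):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound OPTIONAL variables are simply absent from a binding.
        rows.append({var: binding[var]["value"] if var in binding else None
                     for var in variables})
    return rows


# Made-up response with two of the query's variables, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["title", "homepage"]},
    "results": {"bindings": [
        {"title": {"type": "literal", "value": "Air quality"},
         "homepage": {"type": "uri", "value": "http://example.org/air"}},
        {"title": {"type": "literal", "value": "Road works"}},
    ]},
})

if __name__ == "__main__":
    for row in bindings_to_rows(SAMPLE_RESPONSE):
        print(row)
```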

Page 23:

A large number of datasets:

http://logd.tw.rpi.edu/schemaorg_dataset_extension

http://www.google.com/webmasters/tools/richsnippets?url=http://logd.tw.rpi.edu/schemaorg_dataset_extension&view=

Page 24:

http://logd.tw.rpi.edu/page/international_dataset_catalog_search

Page 25:

Latest from Josh:

Datasets-as-Linked-Data demo.  The RDFa in the pages is not only correct w.r.t. schema.org but is also presented in such a way that an RDFa-aware Linked Data crawler can hop from datasets to catalogs, back again, into DBpedia, etc. while gathering the RDFa as linked RDF.

Since we now have Datasets-ish RDFa markup in the main IOGDS dataset pages (i.e. the pages which the URIs of the datasets redirect to), we're pretty close to a completely integrated demo.  

What remains: (1) the current markup has some problems.  We need to fix those; (2) we need markup for catalogs as well as datasets…
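
To make the "hop from datasets to catalogs" idea concrete, here is a deliberately simplified sketch of how a crawler might pick out the links to follow from a page's RDFa. It is not a conforming RDFa processor: it ignores `vocab`, prefixes and nesting, and only collects elements that carry `property` or `rel` together with `resource` or `href`. The class name, function name and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class RDFaLinkExtractor(HTMLParser):
    """Collect (property, target) pairs from RDFa-annotated link elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        predicate = attributes.get("property") or attributes.get("rel")
        target = attributes.get("resource") or attributes.get("href")
        # Literal-valued properties (no target URI) are not crawlable hops.
        if predicate and target:
            self.links.append((predicate, target))


def next_hops(html_text):
    """Return the (property, URI) pairs a crawler could follow from a page."""
    parser = RDFaLinkExtractor()
    parser.feed(html_text)
    return parser.links


# Hypothetical dataset page fragment.
SAMPLE_PAGE = (
    '<div vocab="http://schema.org/" typeof="Dataset">'
    '<span property="name">Air quality</span>'
    '<a property="includedInDataCatalog" href="http://example.org/catalog/1">'
    'Catalog</a></div>'
)

if __name__ == "__main__":
    print(next_hops(SAMPLE_PAGE))
```

Note that ordinary links such as `rel="stylesheet"` would also be picked up, so a real crawler would filter on the properties it cares about.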

Page 26:

Needed for (1) and (2):

To fix (1), we need to make changes to the LODSPeaKr templates that automatically generate those pages, to make them compliant with the model Josh developed.

To fix (2), we'll work with Alvaro (Graves) to create LODSPeaKr-based automation to generate catalog pages in an efficient way.

(2) presents more of a challenge than (1), since the IOGDS implementation of dataset detail pages is already mostly correct.

Still need Dan B. to assist with getting them found…

Page 27:

What we need:

Willingness to adopt the dataset schema extension – we need lots of datasets to start showing up

We (TWC) will be pushing out some tools, more demos and how-tos, very soon

Wanna play? http://wiki.esipfed.org/index.php/DatasetSchema