Data Repositories & Linked Data ARD Prasad DRTC Indian Statistical Institute [email protected].



Looking Back

Open Access to Information (OAI)

• A fairly successful movement, which resulted in:
  • Open Access Repositories (> 2000)
  • Open Access Journals (> 5000)
• Partially bridging the digital divide in the Social, Physical, and Natural Sciences and the Humanities

Nature of Publications

Many publications use data, but the article itself may not include the complete data used:

• For lack of space
• The author might have overlooked the data
• The author deliberately did not present the data, so that others cannot verify it

For Example

Some suspect that Sigmund Freud's data describes fictitious persons, not just fictitious names

If data is available ...

• Others may draw different conclusions, contradicting those of the author
• Others may deal with other facets of the data
• Data transparency supplements the objectivity and self-corrective character of science

If “Case history of patients” is openly available, it will contribute significantly to medical research

Digital Divide

• Social Sciences do not require laboratory infrastructure
• However, Physical and Natural Sciences do require expensive infrastructure
• If experimental data is available to scientists who do not have that infrastructure, it will significantly reduce the digital divide in the Physical and Natural Sciences

ODA is a step toward transparency and quality in science

For Example

• Human Genome data
• Data from accelerator labs (CERN)
• Recent controversy about a particle moving faster than light
• Not surprisingly, astronomy data was openly available even before the OA movement

Features of Open Data Repositories

• Metadata: specify who is the owner, creator, etc.
• Licence: license the data to waive your rights and facilitate bulk download of open data
• Technology tools: automate data extraction, preferably on the cloud
• Ontology: index the data
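
The metadata and licence features listed above can be pictured as a small machine-readable record attached to each deposited dataset. This is only a minimal sketch: the `DatasetRecord` class, its field names, and all values are hypothetical and do not follow any particular repository's schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DatasetRecord:
    """Hypothetical descriptive metadata for one deposited dataset."""
    title: str
    creator: str
    owner: str
    license: str      # short name of the open data licence chosen by the depositor
    keywords: list    # index terms, e.g. drawn from an ontology


def to_json(record: DatasetRecord) -> str:
    """Serialize the record so that tools can harvest or index it."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# All values below are illustrative, not real data.
record = DatasetRecord(
    title="Example rainfall measurements",
    creator="A. Researcher",
    owner="Example Institute",
    license="PDDL",
    keywords=["climate", "rainfall"],
)
print(to_json(record))
```

Keeping the licence in the record itself lets a bulk-download tool check reuse rights without a human reading each dataset's documentation.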

Licences

Creative Commons licences (apart from CCZero), GPL, BSD, etc. are NOT quite appropriate as open data licences

Open Data Licences

• Open Data Commons Public Domain Dedication and Licence (PDDL)
  • Dedicate to the Public Domain (all rights waived)
• Open Data Commons Attribution License
  • Attribution for data(bases)
• Open Data Commons Open Database License (ODbL)
  • Attribution-ShareAlike for data(bases)
• Creative Commons CCZero
  • Dedicate to the Public Domain (all rights waived)

Amazon Web Services (AWS)

Public Data Sets on AWS

• Annotated Human Genome Data provided by Ensembl
  • The Ensembl project produces genome databases for human as well as almost 50 other species, and makes this information freely available.
• Various US Census databases from the US Census Bureau
  • Demographic data
  • US Censuses
  • Summary information about business and industry
  • Economic household profile data
• UniGene provided by the National Center for Biotechnology Information

Astronomy

Sloan Digital Sky Survey DR6 Subset

Biology

• Influenza Virus (including updated Swine Flu sequences)

• Ensembl Annotated Human Genome Data - for MySQL

• GenBank

Chemistry

• PubChem Library
  • A data set of information on the biological activities of small molecules.
• 3D Version of the PubChem Library
• UGI Virtual Conformer Library
  • 500,000 molecules for virtual screening.

Climate

Daily Global Weather Measurements, 1929-2009

Economics

• Federal Reserve Economic Data
• Transportation Databases
• Labor Statistics Databases
• US Census
• Business and Industry Summary Data

Digital Curation

• Collecting verifiable digital assets
• Providing digital asset search and retrieval
• Certification of the trustworthiness and integrity of the collection content
• Semantic and ontological continuity and comparability of the collection content
• Use of open standards (formats) for long-term preservation and future-proofing by migration of data
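
One common way to support the integrity point above is a fixity check: record a checksum when a file is deposited and recompute it later. The sketch below assumes SHA-256 and in-memory bytes. The function names and sample data are illustrative, and it does not describe any specific repository's procedure.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, recorded_digest: str) -> bool:
    """True if the data still matches the digest recorded at deposit time."""
    return sha256_of(data) == recorded_digest


# Illustrative dataset content; a real check would read the stored file.
original = b"year,rainfall_mm\n2009,812\n"
recorded = sha256_of(original)             # stored alongside the dataset

print(verify(original, recorded))          # True: content unchanged
print(verify(original + b"x", recorded))   # False: content was altered
```

A checksum only detects change. Migrating data to a new format still requires recording a fresh digest for the migrated copy.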

Technology

• Data repositories are much larger than OA repositories
• Cloud computing is a good solution (as used by AWS)
• Semantic Web & Linked Data (linking data through various methods)

Resource Description in terms of Metadata and Ontology

• RDF: Resource Description Framework
• SKOS: Simple Knowledge Organization System
• OWL: Web Ontology Language
• SPARQL: SPARQL Protocol and RDF Query Language

RDF Example

Title: Dil-E-Naddan
Artist: Talat Mahamed
Artist: Suraiya
Company: HMV
Country: India
Price: Rs. 100
Year: 1955

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.hmv.com/cd/Dil-E-Naddan">
    <cd:artist>Talat Mahamed</cd:artist>
    <cd:artist>Suraiya</cd:artist>
    <cd:country>India</cd:country>
    <cd:company>HMV</cd:company>
    <cd:price>Rs. 100</cd:price>
    <cd:year>1955</cd:year>
  </rdf:Description>
</rdf:RDF>
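
The RDF/XML description above can be read as a set of subject-predicate-object triples. As a rough sketch, the following uses only Python's standard XML parser to list them. It handles just this simple `rdf:Description` pattern and is not a general RDF parser, and the `triples` helper is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Same record as in the slide, written out as well-formed XML.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.hmv.com/cd/Dil-E-Naddan">
    <cd:artist>Talat Mahamed</cd:artist>
    <cd:artist>Suraiya</cd:artist>
    <cd:country>India</cd:country>
    <cd:company>HMV</cd:company>
    <cd:price>Rs. 100</cd:price>
    <cd:year>1955</cd:year>
  </rdf:Description>
</rdf:RDF>"""


def triples(rdf_xml: str) -> list:
    """Return (subject, predicate, object) tuples for each property element."""
    root = ET.fromstring(rdf_xml)
    result = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree writes tags as "{namespace}local"; join them into one URI.
            predicate = prop.tag[1:].replace("}", "")
            result.append((subject, predicate, prop.text))
    return result


for s, p, o in triples(RDF_XML):
    print(s, p, o)
```

Each child element becomes one triple, so the two `cd:artist` elements yield two separate statements about the same resource.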

SKOS Example

• prefLabel - The preferred term
• altLabel - These are the See references which point to this record
• narrower - Contains the related narrower term
• broader - Contains a sub-element for the authority type which contains the related broader term
• related - Contains a related term which is at the same level in the hierarchy
• scopeNote - Note information
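
To make the SKOS properties above concrete, here is a minimal sketch of a concept record and a lookup that resolves "See" references (altLabel) to the preferred term. The `Concept` class, the `build_index` helper, and the two sample concepts are illustrative assumptions, not part of the SKOS specification or of any real vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus entry, with fields named after the SKOS properties."""
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)
    scope_note: str = ""


def build_index(concepts: list) -> dict:
    """Map every preferred and alternative label (lower-cased) to its concept."""
    index = {}
    for concept in concepts:
        index[concept.pref_label.lower()] = concept
        for alt in concept.alt_labels:
            index[alt.lower()] = concept
    return index


# Hypothetical two-concept vocabulary.
concepts = [
    Concept("Linked data", alt_labels=["Linked open data"], broader=["Semantic Web"]),
    Concept("Semantic Web", narrower=["Linked data"]),
]
index = build_index(concepts)

hit = index["linked open data"]     # search by a non-preferred term
print(hit.pref_label)               # Linked data
print(hit.broader)                  # ['Semantic Web']
```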

DBpedia Data Set

• Multi-domain ontology derived from Wikipedia
• 3.77 million “things” (entities - Entitypedia)
• 400 million “facts”
• Uses YAGO (Yet Another Great Ontology)
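
Data sets such as DBpedia are typically queried with SPARQL over HTTP. The sketch below only builds the request URL and does not contact the network. The chosen resource (Kolkata) is illustrative, `build_request_url` is a hypothetical helper, and the `format` parameter is an assumption about what the endpoint's server accepts rather than part of the SPARQL protocol.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"   # public DBpedia SPARQL endpoint

# Ask for the English label of one resource (illustrative choice).
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Kolkata> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request URL for a SPARQL endpoint."""
    params = {
        "query": query,                                # standard SPARQL protocol parameter
        "format": "application/sparql-results+json",   # assumed server-specific parameter
    }
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching this URL (for example with `urllib.request.urlopen`) should return the matching bindings, subject to the endpoint being reachable.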

Entitypedia

• Multilingual controlled vocabulary
• Entity matching
• Data quality and type checking
• Entity type specific services
• Semantic or faceted search and navigation on entities
• Summarization of entities and concepts

DRTC Projects

• Living Knowledge (EC funded project)
• ITPAR: India-Trento Program for Advanced Research (work on Entitypedia)
• CHAIN-REDS (EC funded project): Coordination and Harmonization of Advanced e-Infrastructures - Research & Education Data Sets

Thank You