Open Government Data Tutorial at CLEI 2013. Part 5 Semantic Issues

description

This four-hour tutorial on Open Government Data was given at the Conferencia Latinoamericana en Informática (CLEI 2013), http://clei2013.org.ve/. It is divided into 5 parts:
1 - Introduction: http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-1-introduction
2 - Issues: https://www.slideshare.net/jpane/02-issues-v4slideshare
3 - Real Experience: http://www.slideshare.net/jpane/open-government-data-tutorial-03-real-experience
4 - Applications: http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-4-applications
5 - Semantic Issues: http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-5-semantic-issues
This is part 5 - Semantic Issues.

Transcript of Open Government Data Tutorial at CLEI 2013. Part 5 Semantic Issues

2. Outline
- Overview
- Issues of opening data
- Entity centric Semantic layer
- Importing pipeline
- Importing tool
Juan Pane, Lorenzino Vaccari, 08/10/2013

3. Linked Open Data
- Available
- Structured
- Open formats
- Dereferenceable
- Linked
"The best data is open data" vs. "All data must be perfect"

4. Lack of explicit semantics
The real meaning of the data was kept in the developer's mind when creating the data.
http://goo.gl/npEHKr (Thanks to Moaz Reyad)

5. Lack of explicit semantics
Can lead to things like:
http://goo.gl/npEHKr (Thanks to Moaz Reyad)

6. Semantic heterogeneity
Difference in the meaning of local data.

7. Issues when Opening Trentino Data
- Each department has authority over only some part of the data.
- Dataset originally created for internal use only.
- Dataset created for a specific need.
- Dataset created with custom format:
  - For structure (some exceptions)
  - For data
- Lack of reuse -> duplication.
- Lack of programmers.
- We cannot TELL them what/how to do (always).
- Data changes.

8. [Figure: the levels Available, Structured, Open formats, Dereferenceable, Linked, with the labels "Data Catalog" and "Entity Centric Semantic Layer"]

9. Entity centric: Added value
- Aggregated data
- Accurate data, manually curated
- Unique identifiers, distributed perspectives
- Re-think identifiers
- Semantified values
[Figure: entities E1 and E2 with the attributes name (Juan Pane; Ignacio P. F.), nationality (italian), born in (Paraguay), lives in (Trento), date of birth (1980), affiliation (Univ. Trento; PF-UNA)]

10. Entities
- Real world: something that has a distinct, separate existence, although it need not be a material (physical) existence. It has a set of properties, which evolve over time. Example: [not preserved in the transcript]
- Mental: a personal (local) model, created and maintained by a person, that references and describes a real world entity.
- Digital: captures the semantics of real world entities, as provided by people.

11. Entity Centric Semantic Layer
Addresses the integration problems due to semantic heterogeneity:
- Different formats
- Different identifiers
- Implicit semantics
- Homonyms, synonyms, aliases
- Partial knowledge
- Knowledge evolution
http://www.webfoundation.org/2011/11/5-staropen-data-initiatives/

12. Entity-based Integration
- Focus on entities as first class citizens.
- Entities are objects so important in our everyday life that we refer to them by name.
- Each entity has its own metadata (e.g. name, latitude, longitude, ...).
- Each entity is in relation with many other entities (e.g. Einstein was born in Ulm, his affiliation was Charles University, Ulm is a city in Germany).
- There are relatively few commonsense entity types (person, ..., event).
- There are many domain specific entities (bus stops, cycling paths, ...).
- All components have explicit semantics: schema, entities, attributes, values.

13. Importing pipeline, Macro Steps
  1. Domain analysis: study the needed entity types, adapt the knowledge base accordingly. First time bootstrapping.
  2. Import entities: semi-automatic tool.
- Domain experts are expensive.
- Human attention is a scarce resource.
- Incremental enrichment and aggregation of entities.

14. Open Data Peculiarities
- All data comes from a CKAN repository (DCAT).
- Process one data file at a time.
- Each data file can be represented as a table.
- Each row in the table represents a (partial) entity.
- The format of the values might not be enforced in the data files.
- Not all data is relevant.

15. [Figure: the levels Available, Structured, Open formats, Dereferenceable, Linked, with the labels "Data Catalog", "Entity centric" and "Importing tool"]

16. Importing tool process

17. 1. Source Selection
Import one data file at a time.

18. 2. Schema Matching
Select a target type of entity -> correspondences between the input columns and the output attributes.
Input file: LocalitaTuristica
  nome           | provincia           | descrizione                                   | lat long     | funivie
  Andalo (1047)  | Provincia di Trento | Sorge su un'ampia sella prativa al centro...  | 654463712857 | 3
  Canazei (1450) | Trento Prov.        | Situato all'estremità settentrionale della... | 511504147444 | 2
Target attributes: Nome, Provincia, Quota, Coordinate, Descrizione, popolazione

19. 3. Data Validation
Applies format and structure validation, plus any automatic transformations needed to bring the input data into the expected format.

20. 4. Semantic Enrichment (1/2)
Entity disambiguation: transform text references into links to existing entities.

21. 4. Semantic Enrichment (2/2)
Natural Language Processing: extract concepts and entity references from free text.

22. 5. Reconciliation
Run Identity Management Algorithms to identify each row as a new or existing entity.
  Result           | Action
  Match            | Use ID
  No Match         | New ID
  Multiple Matches | Ignore Row

23. 6. Exporting
At this point:
- We know what to export.
- All values for target attributes conform to the expected format.
- All text has been semantified (NLP).
- All textual references to entities are converted to links.
- Each row has an identifier.
[Figure labels: v0, i, i+1]

24. 7. Publishing
Put the semantified entities back into CKAN so that the entities can be Open Data and can be found in the same catalog as the original data.
- Developers can find the data files of the cleaned, aggregated entities,
- but can also interact with the entities via the Entitypedia APIs.
8. Visualization
Search and Navigation

25. Semantic Layer: Services
Tool for aiding the semantification of the datasets in the catalog, based on:
- Schema matching services
- Identity Management services
- Entity Matching services
- Global Unique Identifier services
- Semantic search and indexing services
- Natural Language Processing
- Entity store

26. Our Goal
[Figure labels: TN, UK, ES, BE]

27. http://www.shabra.com/wp-content/uploads/2011/03/lets-work-together.jpg

28. Gracias! Grazie! Merci! Thanks! Kiitos! Dank u! Gràcies! Gratias! Danke!
We thank in particular CLEI 2013, Autonomous Province of Trento, TrentoRise association, Universidad Nacional de Asuncion, and University of Trento.
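
Note on the importing tool (slides 18, 19 and 22): the sketch below illustrates three of the steps, schema matching, data validation and reconciliation, on the Andalo row from slide 18. It is not the authors' tool. The names COLUMN_TO_ATTRIBUTE, match_schema, validate and reconcile, the store IDs, and the action labels are all hypothetical. The "name (altitude)" pattern is an assumption read off the sample values, and the case-insensitive name comparison only stands in for real Identity Management algorithms.

```python
import re
from uuid import uuid4

# Hypothetical correspondences between input columns and target attributes.
COLUMN_TO_ATTRIBUTE = {
    "nome": "Nome",
    "provincia": "Provincia",
    "descrizione": "Descrizione",
}

# Assumed value format: a name followed by an altitude in parentheses,
# e.g. "Andalo (1047)".
NAME_WITH_QUOTA = re.compile(r"^(?P<nome>.+?)\s*\((?P<quota>\d+)\)$")


def match_schema(row):
    """Schema matching: rename mapped columns, drop columns with no target."""
    return {
        COLUMN_TO_ATTRIBUTE[column]: value
        for column, value in row.items()
        if column in COLUMN_TO_ATTRIBUTE
    }


def validate(entity):
    """Data validation: split 'Nome (Quota)' into two typed attributes."""
    cleaned = dict(entity)
    found = NAME_WITH_QUOTA.match(cleaned["Nome"].strip())
    if found:
        cleaned["Nome"] = found.group("nome")
        cleaned["Quota"] = int(found.group("quota"))
    return cleaned


def reconcile(entity, store):
    """Reconciliation: map the number of matches to an action.

    Matching is a plain case-insensitive name comparison, used here only
    as a placeholder for identity management.
    """
    wanted = entity["Nome"].casefold()
    matches = [eid for eid, known in store.items()
               if known["Nome"].casefold() == wanted]
    if len(matches) == 1:
        return "use_id", matches[0]      # Match -> Use ID
    if not matches:
        return "new_id", str(uuid4())    # No Match -> New ID
    return "ignore_row", None            # Multiple Matches -> Ignore Row


# Usage on the first sample row of slide 18 (store content is made up).
store = {"e1": {"Nome": "Andalo"}}
row = {"nome": "Andalo (1047)", "provincia": "Provincia di Trento"}
entity = validate(match_schema(row))
print(entity)                    # {'Nome': 'Andalo', 'Provincia': 'Provincia di Trento', 'Quota': 1047}
print(reconcile(entity, store))  # ('use_id', 'e1')
```

Keeping each step as a separate function mirrors the slide sequence: a row that fails one step can be handed to a person for review without rerunning the others, which fits the semi-automatic nature of the tool.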