Historical Data Integration based on Collective Intelligence
WHD Colloquium, March 27, 2012
Vladimir Zadorozhny
Graduate Information Science and Technology Program
School of Information Sciences
University of Pittsburgh
NADM Group
Challenge
Diverse, Heterogeneous, Semi-structured Data Sources
WHD Data Integration Infrastructure
Consolidated Structured Information
Web of Data?
• Linked Data: using the Web to create typed links between data from different sources
• Linked Data uses RDF (Resource Description Framework) to make typed statements (triples)
• Expected result: a Web of Data extending the Web with a global data space connecting diverse domains (people, companies, publications, etc.)
• In general, the Web of Data has the potential (still questionable) to support loose data coupling, which may facilitate more efficient data utilization
While WHD can use Linked Data and related Web mashup technologies to some extent, it would be premature to rely on the Linked Data infrastructure.
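To make the idea of typed statements concrete, here is a minimal sketch in plain Python with no RDF library. Every URI and predicate name under example.org is a made-up placeholder, not part of WHD or of any published vocabulary.

```python
# Each RDF-style statement is a (subject, predicate, object) triple.
# All URIs below are hypothetical placeholders under example.org.
EX = "http://example.org/"

triples = [
    (EX + "source/s1", EX + "vocab/reportsOn", EX + "place/Senegal"),
    (EX + "source/s2", EX + "vocab/reportsOn", EX + "place/Liberia"),
    (EX + "place/Liberia", EX + "vocab/locatedIn", EX + "region/WestAfrica"),
    (EX + "place/Senegal", EX + "vocab/locatedIn", EX + "region/WestAfrica"),
]


def objects(subject, predicate):
    """Return all objects linked from `subject` by the typed link `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def sources_for_region(region):
    """Follow two typed links: source -> place -> region."""
    places = {s for s, p, o in triples
              if p == EX + "vocab/locatedIn" and o == region}
    return sorted(s for s, p, o in triples
                  if p == EX + "vocab/reportsOn" and o in places)


print(sources_for_region(EX + "region/WestAfrica"))
```

The point of the sketch is only that the link type (the predicate) is itself data, so sources from different domains can be connected without a shared schema.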
Dataverse Network?
• An open source application to publish, share, reference, extract and analyze research data that facilitates making data available to others
• "Dataverse owners can upload any file type and format (excel, txt, pdf, doc, etc.), and the files will be stored and made available in the original format" (http://thedata.org/files/dataversehandout.pdf)
• Information consumers must further integrate data sources themselves to perform analysis across multiple "dataverses".
While WHD aims to be a part of the Dataverse Network, it would not encourage users to contribute data in ANY format. Instead, users integrate their data into the WHD repository while submitting it. To summarize, the WHD infrastructure crowdsources the data integration task, not just the data contribution task.
General WHD Architecture
[Architecture diagram; component labels:]
• Information Providers
• Heterogeneous historical data sources
• Wrapper Generation, Wrapper Registration
• Wrappers
• Data Submission System
• Structured homogeneous historical data
• Internal Data Reliability Assessment
• External Data Reliability Assessment
• Annotated historical data
• Data Fusion
• Fused historical data
• Information Consumers
According to the 2006 revision of the World Population Prospects, the total population in the region of Liberia in 1950 was 824,000. The average population growth per year for the following ten years was 2.5 percent. For Ivory Coast, those numbers are 2,505,000 and 3.6, respectively.
Extendable Target Schema (relational is not mandatory): Source | Location | From | To | Population
Data Source: s1 (xl)
Data Source: s2 (doc)
Source | Location | From | To | Population
s2 | Liberia | 01/01/1950 | 12/31/1950 | 824,000
s2 | Liberia | 01/01/1960 | 12/31/1960 | 1,052,000
s2 | Ivory Coast | 01/01/1950 | 12/31/1950 | 2,505,000
s2 | Ivory Coast | 01/01/1960 | 12/31/1960 | 3,692,000
Materialize Data
Keep Data Remotely
select * from Population
s1 | Mauritania | 01/01/1950 | 12/31/1950 | 692,000
s1 | Mauritania | 01/01/1960 | 12/31/1960 | 892,000
s1 | Senegal | 01/01/1950 | 12/31/1950 | 2,543,000
s1 | Senegal | 01/01/1960 | 12/31/1960 | 3,277,000
Simple Scenario
Mapping: Territories -> Location, Population -> Population, Data Aggregation -> Total, Year -> From, To
Wrapper
Mapping: region -> Location, Population -> Population, Data Aggregation -> Total, Year -> From, To
Wrapper
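The wrapper mappings in this scenario can be sketched roughly as follows, using Python and an in-memory SQLite table. This is an illustration under stated assumptions, not the WHD implementation: the raw record layout, the `wrap` helper and the `LOCATION_FIELD` lookup are hypothetical, the pairing of "Territories" with s1 and "region" with s2 is a guess from the slide, and the "Data Aggregation -> Total" mapping is omitted.

```python
import sqlite3

# Hypothetical raw records, shaped after the two sources on the slide.
S1_ROWS = [
    {"Territories": "Mauritania", "Year": 1950, "Population": "692,000"},
    {"Territories": "Senegal", "Year": 1950, "Population": "2,543,000"},
]
S2_ROWS = [
    {"region": "Liberia", "Year": 1950, "Population": "824,000"},
]

# Assumed: which source-specific field maps to the target Location column.
LOCATION_FIELD = {"s1": "Territories", "s2": "region"}


def wrap(source, row):
    """Translate one source record into a target-schema tuple."""
    year = row["Year"]
    return (
        source,
        row[LOCATION_FIELD[source]],              # Territories/region -> Location
        f"01/01/{year}",                          # Year -> From
        f"12/31/{year}",                          # Year -> To
        int(row["Population"].replace(",", "")),  # Population -> Population
    )


conn = sqlite3.connect(":memory:")
# "From" and "To" are SQL keywords, so they are quoted as identifiers.
conn.execute(
    'CREATE TABLE Population '
    '(Source TEXT, Location TEXT, "From" TEXT, "To" TEXT, Population INTEGER)'
)
rows = [wrap("s1", r) for r in S1_ROWS] + [wrap("s2", r) for r in S2_ROWS]
conn.executemany("INSERT INTO Population VALUES (?, ?, ?, ?, ?)", rows)

for record in conn.execute("SELECT * FROM Population"):
    print(record)
```

This corresponds to the "Materialize Data" option, where wrapped rows are copied into the repository. Under "Keep Data Remotely", the same `wrap` step would presumably run at query time instead.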
WHD Infrastructure
Data Curation
Data Collection
Data Utilization
Big Picture: continuously growing infrastructure (à la Wikipedia)
WHD Prototype
• Group of graduate IS students: special project in the Advanced Data Management class (INFSCI2711)
• Content Management → Pligg (open source content management system; Apache, PHP, and MySQL based)
• Data Integration Engine → Pentaho Kettle (open source data integration engine; Java-based GUI and command line tools; XML-based data transformation file)
• Data providers download the wrapper-generating software, configure wrappers on their workstations (using preconfigured templates), and register the wrappers on the WHD server
Data Source
Data Transformation
Transformed Data
XML Wrapper
Data Reliability Assessment and Data Fusion
• Systems based on crowdsourcing require mechanisms to ensure data quality.
• The WHD infrastructure will support efficient data curation strategies based on advanced data reliability assessment and data fusion methods.
• As the system continuously receives new historical reports, WHD estimates the reliability of this data, and the estimate evolves as new evidence arrives.
• WHD uses a measure of the inconsistency caused by a report to assess its internal reliability.
• WHD also allows users to submit subjective feedback on the reliability of data, which is used to assess external reliability.
• WHD uses subjective logic to combine the internal and external reliability assessments.
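As a rough illustration of how subjective logic can combine two reliability assessments, the sketch below represents each assessment as an opinion (belief, disbelief, uncertainty) and merges them with the cumulative fusion (consensus) operator from the subjective logic literature. The slides do not say which operator WHD uses, so the choice of operator, the `Opinion` class, and the example numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """A binomial subjective-logic opinion; belief + disbelief + uncertainty = 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5  # prior probability used when evidence is missing

    def expected(self):
        """Projected probability: belief plus the base-rate share of uncertainty."""
        return self.belief + self.base_rate * self.uncertainty


def cumulative_fusion(a, b):
    """Fuse two opinions with the cumulative (consensus) operator.

    Assumes both opinions share the same base rate and that at least
    one of them has non-zero uncertainty.
    """
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0:
        raise ValueError("both opinions are dogmatic (zero uncertainty)")
    return Opinion(
        belief=(a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        disbelief=(a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        uncertainty=(a.uncertainty * b.uncertainty) / k,
        base_rate=a.base_rate,
    )


# Hypothetical example values, not taken from WHD.
internal = Opinion(belief=0.6, disbelief=0.2, uncertainty=0.2)  # from inconsistency measure
external = Opinion(belief=0.5, disbelief=0.1, uncertainty=0.4)  # from user feedback
fused = cumulative_fusion(internal, external)
print(fused, fused.expected())
```

With this operator the fused uncertainty is lower than that of either input, which matches the intuition that reliability estimates firm up as more evidence accumulates.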
Historical Data: Redundancy
Timeline: 1900, 1910, 1920, 1930
Temporal Overlaps
t1 | source_ref1 | Measles | NYC | 10/10/1900 | 10/10/1920 | 700
t2 | source_ref2 | Measles | NYC | 10/20/1910 | 10/30/1930 | 300
Measles reports: 700, 300
Total number of Measles cases in New York City from 1900 to 1930: 700 + 300 = 1000 ??? Temporal overlap between t1 and t2
Naming Overlaps
t5 | source_ref1 | Yellow fever | NY | 10/10/1900 | 10/10/1920 | 700
t6 | source_ref2 | Hepatitis | NY | 10/10/1900 | 10/10/1920 | 700
t7 | source_ref4 | Hepatitis B | NY | 10/20/1910 | 10/30/1930 | 300
Total number of Hepatitis cases in New York State from 1920 to 1930: 700 + 700 + 300 = 1700 ??? Naming overlap between t5, t6 and t7
Spatial Overlaps
t3 | source_ref1 | Smallpox | NY | 10/20/1900 | 10/20/1920 | 500
t4 | source_ref1 | Smallpox | NYC | 10/30/1920 | 10/30/1930 | 600
Smallpox reports: 500 (NY), 600 (NYC)
Total number of Smallpox cases in New York State from 1900 to 1930: 500 + 600 = 1100 ??? Spatial overlap between t3 and t4
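The temporal case can be checked mechanically. The sketch below is a minimal Python illustration, not WHD code: the tuple layout follows the rows on the slide, and the helper names `parse` and `temporal_overlap` are hypothetical. It flags t1 and t2 as overlapping, which is why simply adding 700 and 300 is questionable. Spatial and naming overlaps would additionally need a place hierarchy and a disease taxonomy, which are not sketched here.

```python
from datetime import date, datetime


def parse(text):
    """Parse an MM/DD/YYYY string, as used in the slide's tuples."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# (id, source, disease, location, from, to, count), as on the slide.
t1 = ("t1", "source_ref1", "Measles", "NYC", parse("10/10/1900"), parse("10/10/1920"), 700)
t2 = ("t2", "source_ref2", "Measles", "NYC", parse("10/20/1910"), parse("10/30/1930"), 300)
t3 = ("t3", "source_ref1", "Smallpox", "NY", parse("10/20/1900"), parse("10/20/1920"), 500)
t4 = ("t4", "source_ref1", "Smallpox", "NYC", parse("10/30/1920"), parse("10/30/1930"), 600)


def temporal_overlap(a, b):
    """Return the shared (start, end) period of two reports, or None."""
    start = max(a[4], b[4])
    end = min(a[5], b[5])
    return (start, end) if start <= end else None


naive_total = t1[6] + t2[6]
shared = temporal_overlap(t1, t2)
print("naive total:", naive_total)
print("t1/t2 shared period:", shared)
print("t3/t4 shared period:", temporal_overlap(t3, t4))
```

Note that t3 and t4 do not overlap in time at all; their redundancy comes only from NYC being part of NY, so a purely temporal check would miss it.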
Historical Data: Inconsistency
[Figure: timeline of measles reports in NYC, rows R1 and R2, showing counts 200, 500, 300, 400 and 700; the reports are redundant and inconsistent]
Information Consumer Toolset: Data Visualization Dashboard
ICTS: Map Exhibits and Timeline Widgets
ICTS: Motion Chart Animation
Conclusion
• We explore a novel approach to reliable, large-scale historical data integration based on collective intelligence
• We implement this approach in the WHD infrastructure for consolidating heterogeneous historical data
• Major challenge: how to engage a large community of researchers to share their data and collectively resolve data heterogeneities in a continuously growing, large-scale distributed historical repository?
– contributions from CHAI members (only a small fraction of Wikipedia users contributes information to ensure its growth)
– as the infrastructure evolves, users may become interested in "embedding" their data in a larger context to perform global analysis and to use WHD tools
– open development platform (extendable data transformation library and toolsets)
Acknowledgements
Graduate IS Students (WHD system development team):
Andrew Barnett (team leader), Andrew Entin, Thomas Junker, Jidapa Kraisangka, Han Liao, Eric Miller, Ye Peng, Evan Pulgino, Henry Quattrone, Mark Swartz, Miao Tan, Liu Yuchen, Lihong Zhang
Doctoral Students:
Ying-Feng Hsu, Julian Lee