How can Big Data contribute to the Open Data process? DANE Big Data Event, Bogotá, October 2013
How can Big Data contribute to the Open Data process?
DANE Big Data Event, Bogotá, October 2013
DANE, Big Data, October 2013 2
1. What is “Big Data”?
2. Big Data sources, challenges & opportunities
3. Big Data and official statistics
4. Current Big Data initiatives
5. What is “Open Data”?
6. Current Open Data initiatives
7. Incorporating Big Data into official statistics Open Data programmes - the OECD perspective
Presentation contents
BIG DATA
Big data are data sources that can generally be described as: “high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making.”
Gartner
• Big data is characterized as data sets of increasing volume, velocity and variety
• Big data is often largely unstructured, meaning that it has no pre-defined data model and/or does not fit well into conventional relational databases
• The private sector may take advantage of the Big data era and produce more and more statistics that attempt to beat official statistics on timeliness and relevance
What is “Big Data”?
What is “Big Data”? The data deluge
New wealth of digital data - 90% of the world’s digital data has been created in just the last two years and is doubling every 20 months.
Big data comes in a number of forms:
• “Data Exhaust” - collected passively from devices (phones, credit cards, web searches etc) acting as sensors of human behaviour
• Online information (blogs, tweets, news articles...) - sensors of human sentiments
• Physical sensors (pollution, light emission etc) - remote sensors of human activity
• Citizen reporting - information actively produced via phone surveys, hotlines etc
What is “Big Data”? The data deluge
• Administrative (electronic medical records, hospital visits, insurance records, bank records, food banks, etc.)
• Commercial/Transactional (credit card transactions, on-line transactions, etc.)
• Sensors (satellite imaging, road sensors, climate sensors, etc.)
• Tracking devices (mobile telephones, GPS, etc.)
• Behavioural (online searches, online page view, etc.)
• Opinion (comments on social media, etc.)
Big Data - Sources
• Legislative - with respect to the access and use of data.
• Privacy - managing public trust and acceptance of data re-use and its link to other sources.
• Financial - potential costs of sourcing data vs. benefits.
• Management - policies and directives about the management and protection of the data.
• Methodological - data quality and suitability of statistical methods.
• Technological - issues related to information technology.
Big Data - Challenges
• Collecting data in real time or near real time maximizes the potential of the data
• Big data has potential as an input for official statistics, either for use on its own or in combination with more traditional data sources such as sample surveys and administrative registers
• Big data has the potential to produce more relevant and timely statistics than traditional sources of official statistics
• By incorporating relevant Big data sources into their official statistics process, NSOs are best positioned to measure their accuracy
Big Data - Opportunities
1. NSOs as brokers of Big Data?
2. NSOs to provide a “Quality Stamp”?
3. Combining Big data with official statistics
4. Replacing official statistics by Big data
5. Filling new data gaps, i.e. developing new 'Big data - based' measurements to address emerging phenomena (not known in advance or for which traditional approaches are not feasible)
6. Visualization methods
7. Text mining
8. High Performance Computing.
Big Data Opportunities - areas for experimentation
“What does Big Data mean for official statistics?”
Big Data and Official Statistics
Statistical organisations are encouraged to formally address Big data issues in their annual and multi-annual work programmes by:
• undertaking research and pilot projects in selected areas
• allocating appropriate resources for that purpose.
Big Data & Official Statistics
• Collaboration between NSOs and private data source owners is of critical importance. It touches upon sensitive issues such as privacy, trust and corporate competitiveness, as well as the legislative framework of the NSOs.
• Using Big data requires statisticians with a different mind-set and new skills. Processing more and more data for official statistics requires statistically aware people with an analytical mind-set, an affinity for IT and a determination to extract valuable ‘knowledge’ from data: “Data scientists” (Quote stats from US)
• NSOs should develop the necessary internal analytical capability through specialised training.
Big Data & Official Statistics
Example: Twitter. Using an algorithm that could tell actual sickness apart from everyday use of the common word ‘sick’, researchers were able to plot and predict when people in a certain area were at risk of picking up a flu bug.
Big Data – Examples
Real Estate
• Collect real-time real estate data, incl. location, product characteristics and price information
• Gather data by extracting information from real estate site ads in major agglomerations
• Collection mechanism: Search engine with semantic capability collects and structures data
• Data analysis with existing statistical tools to produce aggregated indicators
• Enrich the GOV metropolitan database with the compiled indicators
Traffic
• Collect a sample of data tracking movements of mobile users over a territory
• Compile from that data and create transportation performance indicators (especially: reliability)
• Including evolution over time
• Collection mechanism: Sample of #100K users across several countries who downloaded an App on mobile access quality
• Data analysis with existing statistical tools to produce aggregated indicators
Big Data – Examples (OECD)
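As an illustration of the "aggregated indicators" step in the real estate example, here is a minimal Python sketch. The record layout (`city`, `price`, `area_m2`), the sample values and the choice of median price per square metre as the indicator are all assumptions made for illustration; the slides do not say which indicators were compiled.

```python
from collections import defaultdict
from statistics import median

# Hypothetical ad records, as a collection tool might structure them.
# All values are made up for illustration.
ads = [
    {"city": "Paris", "price": 450000, "area_m2": 50},
    {"city": "Paris", "price": 600000, "area_m2": 60},
    {"city": "Paris", "price": 330000, "area_m2": 30},
    {"city": "Lyon", "price": 200000, "area_m2": 50},
    {"city": "Lyon", "price": 270000, "area_m2": 60},
]


def median_price_per_m2(records):
    """Return {city: median price per square metre} for the given ads."""
    by_city = defaultdict(list)
    for ad in records:
        by_city[ad["city"]].append(ad["price"] / ad["area_m2"])
    return {city: median(values) for city, values in by_city.items()}


indicators = median_price_per_m2(ads)
print(indicators)
```

A median is used here rather than a mean because listing prices tend to be skewed by a few very expensive ads; other aggregates would fit the same structure.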
Political tension
• Produce indicators on political tension for the African Economic Outlook
• Collect a sample of qualitative data (articles, …) qualifying political tension in African countries, based on keywords: strike, demonstration, kidnapping, …
• Text mining on countries or topics could raise interest as part of the data collection process
Employment
• Supplement survey data on employment with job offerings and applications collected from the Internet
• Build indicators by analysing legal documents (labour codes, labour legislation, court judgements, …)
Internet
• Quality of Internet network infrastructure and security
• Ranking of languages most used on the Internet
• Study the effectiveness of current intellectual property protection laws
Big Data – Examples (OECD)
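The keyword-based collection described for the political tension indicator can be sketched in a few lines of Python. This is only a rough illustration: the three keywords come from the slide, but the sample articles, the country labels and the simple "count keyword hits per country" scoring are assumptions, not the method actually used for the African Economic Outlook.

```python
import re
from collections import Counter

# Keywords taken from the slide; the optional trailing "s" catches plurals.
KEYWORDS = ("strike", "demonstration", "kidnapping")
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")s?\b", re.IGNORECASE)

# Made-up (country, article text) pairs for illustration only.
articles = [
    ("Country A", "Teachers announce a strike; a demonstration is planned."),
    ("Country A", "Markets were calm this week."),
    ("Country B", "Kidnapping reported near the border after strikes and demonstrations."),
]


def tension_counts(docs):
    """Count keyword occurrences per country across all articles."""
    counts = Counter()
    for country, text in docs:
        counts[country] += len(PATTERN.findall(text))
    return counts


counts = tension_counts(articles)
print(dict(counts))
```

Raw counts like these would normally be normalised, for example by the number of articles collected per country, before being compared across countries.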
• Objectives
– Collect a sample of data tracking movements of mobile users over a territory
– Compile from that data and create transportation performance indicators (especially: reliability)
– Including evolution over time
• Proof of concept envisaged
– Solution provider identified (Sensorly, a start-up specialised in mobile data)
– Privacy should not be an issue since Sensorly collects data based on an opt-in mechanism and does not rely on mobile operators' data.
– Collection mechanism: Sample of #100K users across several countries who downloaded an App on mobile access quality
– Data analysis with existing statistical tools to produce aggregated indicators
Big Data – Examples (OECD)
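The slides do not define the reliability indicator. One measure commonly used in transport statistics is the buffer index: the extra time, relative to the average travel time, needed to arrive on time in 95% of trips. The Python sketch below computes it under that assumption; the travel times are made up, and `percentile` is a simple nearest-rank helper written for this example.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def buffer_index(travel_times):
    """(95th percentile travel time - mean travel time) / mean travel time."""
    avg = mean(travel_times)
    return (percentile(travel_times, 95) - avg) / avg


# Hypothetical travel times in minutes for one route, one per observed trip.
trips_min = [20, 22, 21, 25, 23, 40, 22, 21, 24, 22]
print(round(buffer_index(trips_min), 3))
```

Computing the index per route and per month would give the "evolution over time" mentioned in the objectives.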
Big Data – Examples (Health data)
Any questions?
OPEN DATA
From Wikipedia:
• Open data is the idea that data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.
What is “Open Data”?
From the Open Data Handbook:
Open data is as defined by the Open Definition: open data is data that can be freely used, reused and redistributed by anyone.
What is “Open Data”?
From OECD Open Data Project:
Definition of ‘Open’ from the 2011 OECD Publishing Review:
To make OECD data machine-readable, retrievable, indexable and re-usable
What is “Open Data”?
1. Completeness: Datasets released by the government should be as complete as possible, reflecting the entirety of what is recorded about a particular subject. Metadata that defines and explains the raw data should be included as well, along with formulas and explanations for how derived data was calculated.
2. Primacy: Datasets released should be primary source data. This includes the original information collected, details on how the data was collected and the original source documents recording the collection of the data.
3. Timeliness: Datasets should be available to the public in a timely fashion. Whenever feasible, information should be released as quickly as it is gathered and collected.
4. Ease of Physical and Electronic Access: Datasets should be as accessible as possible. There should be no barriers such as completing forms or submitting requests or systems that require browser-oriented technologies (e.g., Flash, Javascript, cookies or Java applets).
5. Machine readability: Information should be stored in widely-used file formats that easily lend themselves to machine processing. These files should be accompanied by documentation related to the format and how to use it in relation to the data.
Open Data: Ten Principles for Opening Up Government Information (Sunlight Foundation)
6. Non-discrimination: Barriers to use of data can include registration or membership requirements. Any person should be able to access the data at any time without having to identify him/herself or provide any justification for doing so.
7. Use of Commonly Owned Standards: Data should be stored in freely available formats that can be accessed without the need for a software license, making the data available to a wider pool of potential users.
8. Licensing: Maximal openness means making data available without restrictions on use as part of the public domain.
9. Permanence: Information should be available online in archives in perpetuity. Data should remain online, with appropriate version-tracking and archiving over time.
10. Usage Costs: Data should be available free of charge.
Open Data: Ten Principles for Opening Up Government Information (Sunlight Foundation)
Examples of open data initiatives
Open Data examples – Data.Gov.uk
Introducing the OECD DELTA Programme…
Incorporating Big Data into official statistics Open Data programmes - the OECD perspective
DELTA Programme – Making OECD data Open, Accessible, Free
Accessible: Find, Understand, Use
Open: Machine-readable, Indexable, Re-usable
Free: Available without charge
• To make OECD data machine-readable, retrievable, indexable and re-usable.
• To increase the dissemination and impact of OECD data via open data services for OECD statistical data
• To encourage re-use of OECD data, and re-use by OECD of external innovation, via open innovation processes and communities.
The Open Data project - goals
Data content
• All datasets within the OECD.Stat data warehouse with the standardised structural format and content necessary for machine-to-machine “Open” access.
The Open Data project - scope
i) SDMX/JSON
JavaScript Object Notation (JSON) is a text-based open standard designed for human-readable data interchange. It is a widely used open data format on web sites today. JSON has a number of advantages, including:
• Simplicity - a simple and ‘lightweight’ format with a smaller grammar that can map directly onto the data structures used in today’s programming languages.
• Interoperability - has the same interoperability potential as XML.
• Openness - has the same open capabilities as XML.
• Readability - is much easier for humans to read than XML. It is easier to write and is easier for machines to read and write.
Open data web services – Data Formats
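To make the "maps directly onto data structures" point concrete, here is a minimal Python sketch that parses a small JSON document into native dictionaries and lists and then writes the same observations as CSV. The payload layout, the dataset name and the values are simplified inventions for illustration; this is not the actual SDMX-JSON message structure, which is considerably richer.

```python
import csv
import io
import json

# Hypothetical, simplified payload (NOT the real SDMX-JSON schema).
payload = """
{
  "dataset": "EXAMPLE",
  "observations": [
    {"country": "COL", "year": 2012, "value": 4.0},
    {"country": "CHL", "year": 2012, "value": 5.5}
  ]
}
"""

# json.loads turns JSON objects into dicts and JSON arrays into lists.
data = json.loads(payload)
rows = data["observations"]

# Re-export the same rows as CSV, another output format named on the slides.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["country", "year", "value"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

No schema or mapping layer is needed between the parsed document and the program's own data structures, which is the simplicity advantage the slide describes.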
ii) Excel/CSV
Excel and CSV are already widely used exchange standards, so including them as output formats was a fairly obvious decision.
iii) Open Data (OData)
OData is an open protocol for sharing data.
Open data web services – Data Formats
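As a rough illustration of what sharing data over OData looks like from the client side, the Python sketch below builds a query URL using two standard OData system query options, `$filter` and `$top`. The service root `https://example.org/odata`, the `Observations` entity set and the `Country` field are hypothetical placeholders, not the OECD service.

```python
from urllib.parse import quote, urlencode

# Hypothetical service root and entity set, for illustration only.
SERVICE_ROOT = "https://example.org/odata"
ENTITY_SET = "Observations"


def build_query_url(filter_expr, top):
    """Build an OData-style query URL with $filter and $top options."""
    params = {"$filter": filter_expr, "$top": str(top)}
    # Keep "$" and "'" readable; encode spaces as %20 rather than "+".
    query = urlencode(params, safe="$'", quote_via=quote)
    return f"{SERVICE_ROOT}/{ENTITY_SET}?{query}"


url = build_query_url("Country eq 'COL'", 5)
print(url)
```

Because the query is expressed entirely in the URL, any HTTP client can request the data without a dedicated library.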
iv) Future formats could include Google Data (a REST-inspired technology), Google Dataset Publishing Language (DSPL) or Google KML, a geospatial file format.
Open data web services – Data Formats
Any questions?