HLG Big Data project and Sandbox

description

Presentation at IAOS 2014 Conference - Da Nang (Vietnam)

Transcript of HLG Big Data project and Sandbox

  • 1. HLG Big Data project and Sandbox. Carlo Vaccari (Istat), IAOS, October 2014

  • 2. This material is distributed under the Creative Commons "Attribution - NonCommercial - Share Alike - 3.0" licence, available at http://creativecommons.org/licenses/by-nc-sa/3.0/

  • 3. International High Level Group to coordinate groups working on statistical standards: UNECE, OECD, Eurostat, National Statistical Organizations.

  • 4. Big Data Project
      • May 2013: task team with the aim of defining a project to be presented to the international statistical community
      • Three main objectives:
          • To identify the main possibilities and the main strategic and methodological issues that Big Data poses for official statistics
          • To analyze the feasibility of efficient production of official statistics using Big Data sources, and the possibility to replicate these approaches across different national contexts
          • To facilitate the sharing across organizations of knowledge, expertise, tools and methods for the production of statistics using Big Data sources

  • 5. Big Data Project
      • Project presented to HLG and CES
      • Task teams composed of people from 13 organisations
      • The project is composed of four task teams:
          • Partnership Task Team
          • Privacy Task Team
          • Quality Task Team
          • Sandbox Task Team

  • 6. Partnership Task Team
      • Providers and sources of data - challenges: access to data, managing privacy and confidentiality
          • Government (administrative records)
          • Private (commercial records)
          • Social media and other Internet sites
      • Design - research design and development
          • Academia
          • Private and/or public research institutes
          • NGOs
          • International organizations

  • 7. Partnership Task Team
      • Technology - tools, data and infrastructure for data processing, data mining, real-time analytics, storage, computing, and data visualization
          • Private sector (technology providers, IT companies)
          • Data providers themselves
      • Analysis - NSOs can provide standards and methodology whereas others provide analytical capacity and modeling
          • Academia
          • Private and/or public research institutes
          • NGOs
          • International organizations

  • 8. Privacy Task Team
      • Overview of existing tools for risk management in view of privacy issues
      • Risks to privacy - privacy software
      • Data access strategies (onsite, remote access, microdata)
      • Overview of database privacy technologies
      • Evaluation of different privacy approaches
      • Big Data characteristics and their implications for data privacy
      • Data access strategies for Big Data
      • Computer Science and Statistical Disclosure approaches
      • Disclosure risk assessment for Big Data

  • 9. Privacy Task Team
      • Information Integration and Governance (DB monitoring, security, transport security)
      • Statistical Disclosure Limitation (SDL)
      • Preserving confidentiality
      • Balance between data utility and disclosure risk
      • SDL methods:
          • Data masking
          • Traditional approaches: aggregation, obfuscation, perturbations, data swapping
          • Modern approaches: sampling and simulation
      • Managing potential risk to reputation: ethical practices, controls, communication, dialog with public
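
Two of the traditional SDL approaches named on the Privacy Task Team slide, data swapping and perturbation, can be illustrated with a minimal Python sketch. The slides give no implementation, so the record layout (a list of dicts), the function names and the noise scale are all assumptions made for illustration only.

```python
import random

def swap_values(records, key, seed=0):
    """Data swapping: shuffle one attribute across records.

    The set of values (and so the marginal distribution) is preserved,
    but the link between a value and its original record is broken.
    """
    rng = random.Random(seed)
    values = [r[key] for r in records]
    rng.shuffle(values)
    return [{**r, key: v} for r, v in zip(records, values)]

def perturb_values(records, key, scale, seed=0):
    """Perturbation: add zero-mean Gaussian noise to a numeric attribute."""
    rng = random.Random(seed)
    return [{**r, key: r[key] + rng.gauss(0.0, scale)} for r in records]

# Hypothetical microdata: five records with an income attribute.
records = [{"id": i, "income": 1000 * i} for i in range(1, 6)]
swapped = swap_values(records, "income")
perturbed = perturb_values(records, "income", scale=50.0)
```

Both functions return new records and leave the input untouched. Larger swaps or larger noise lower disclosure risk at the cost of data utility, which is the balance the slide refers to.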

  • 10. Quality Task Team
      • Input quality framework with indicators:
          • Source: data-source, reliability, privacy, availability, costs, procedures, ...
          • Metadata: representativeness, usability, completeness, id, ...
          • Data: collection, coverage, complexity, efficiency, integrability
      • Output quality framework with indicators:
          • Metadata: clarity, accessibility, completeness, comprehensiveness
          • Data: relevance, accuracy, timeliness, accessibility, coherence, predictivity, selectivity
      • Process quality with indicators:
          • Cleaning: unambiguous, objectivity, granularity, reliability
          • Transformations: compliance, categorization, precision
          • Linking: completeness, selectivity, accuracy, id, time_related
          • Aggregation: quantity, confidentiality, integration, validity, accuracy

  • 11. Sandbox
      • Sandbox: web-accessible environment where researchers coming from different institutions explore tools and methods needed for statistical production and the feasibility of producing Big Data-derived statistics
      • List of tools chosen: Hadoop, Hortonworks, Pentaho, RHadoop
      • Open list ...

  • 12. Sandbox
      • Sandbox hosted at the Irish Centre for High-End Computing (ICHEC), which will assist the task team in testing and evaluating Hadoop work-flows and associated data analysis application software
      • The mission of ICHEC is to provide High-Performance Computing (HPC) resources, support, education and training for researchers

  • 13. Sandbox configuration
      • The hardware on which the sandbox system is based is a High Performance Computing Linux cluster hosted in the National University of Ireland (Galway), composed of 30 nodes, each of which has two quad-core processors, 48GB of RAM and a 1TB local disk
      • Each node is connected to two networks: one for accessing the shared Lustre filesystem and one Gigabit Ethernet network for management
      • A 20TB shared filesystem is available to all nodes

  • 14. Sandbox in 2014
      • Virtual Sprint (March 2014): first document
      • Workshop in Rome (April 2014)
      • Training in Rome (May 2014)
      • Sandbox installation and verification
      • Workshop in Heerlen (September 2014)
      • Testing scenarios for Big Data usage in official statistics:
          • use as auxiliary information to improve an existing survey
          • replacing all or part of an existing survey with Big Data
          • producing a predefined statistical output either with or without supplementation of survey data
          • producing a statistical output guided by findings from the data

  • 15. Sandbox partners
      • Software:
          • Hortonworks: granted a free enterprise support subscription for the duration of the project
          • Pentaho: free trial of enterprise platform
      • Data:
          • Mobile data from Orange
          • Smart meter data from Irish power agency
          • Smart meter data from Canadian power agency

  • 16. Sandbox experiments
      • Organized in task teams, one for each source:
          • Consumer Price Index
          • Mobile phone data
          • Smart meters
          • Traffic loops
          • Social Data
          • Web scraping
          • Job vacancies

  • 17. Experiment: Consumer Price Index
      • Sources:
          • Web scraping from ONS (UK supermarkets)
          • Synthetic scanner data from Istat
      • Test performance of big data technologies applied to the computation of a simplified consumer price index, based on synthetic data sets modeling scanner data
      • A first version of the price generator was tested successfully, generating a sample CSV file with 11 billion rows that was uploaded to the sandbox
      • Comparison between Hadoop, NoSQL and RDBMS
      • Visual analysis of data through the Pentaho suite
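
The Consumer Price Index experiment computes a simplified index from scanner-like data. The slides do not say which index formula was used, so the sketch below assumes a Jevons-type elementary index (geometric mean of price relatives) purely as an illustration. The row layout `(period, product, price, quantity)` and the function names are hypothetical, and quantities are assumed to be positive.

```python
import math
from collections import defaultdict

def average_prices(rows):
    """Return {(period, product): unit value}, i.e. turnover / quantity sold."""
    turnover = defaultdict(float)
    quantity = defaultdict(float)
    for period, product, price, qty in rows:
        turnover[(period, product)] += price * qty
        quantity[(period, product)] += qty
    return {key: turnover[key] / quantity[key] for key in turnover}

def jevons_index(rows, base, current):
    """Geometric mean of price relatives for products sold in both periods (base = 100)."""
    avg = average_prices(rows)
    in_base = {p for (t, p) in avg if t == base}
    in_current = {p for (t, p) in avg if t == current}
    matched = in_base & in_current
    if not matched:
        raise ValueError("no products matched between the two periods")
    log_sum = sum(math.log(avg[(current, p)] / avg[(base, p)]) for p in matched)
    return 100.0 * math.exp(log_sum / len(matched))

# Hypothetical scanner rows: (period, product, price, quantity)
rows = [
    ("2014-01", "milk", 1.00, 10),
    ("2014-01", "bread", 2.00, 5),
    ("2014-02", "milk", 1.10, 10),
    ("2014-02", "bread", 2.20, 5),
]
index = jevons_index(rows, "2014-01", "2014-02")
```

With every price up by 10% between the two periods, the index comes out at about 110. On data of the size mentioned in the slides, the per-product averages would be computed as a distributed group-by rather than in memory.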

  • 18. Experiment: Mobile phone data
      • Four datasets from the provider Orange for Ivory Coast:
          • calls and duration for each pair of cells for each hour
          • calls coming from 500k phones with time and cell
          • calls coming from 500k randomly sampled individuals
          • communication sub-graphs for 5k users
      • Experiments:
          • Classification of callers: workers, students, business, not LF, ...
          • Classification of zones (cells): industrial, residential, school/university, farmers, high/low traffic
          • Temporal distribution of calls (day/week/season)

  • 19. Experiment: Mobile phone data
      • Parallel experiment on Slovenian and Orange data: exchange of methods, tools, findings
      • Searching for other datasets from other providers

  • 20. Experiment: Smart meters
      • Datasets:
          • Smart meter data from Ireland (household level, linked with 2 surveys)
          • Synthetic smart meter data from Canada (household level, covering several years, time-stamped hourly electricity consumption linked with hourly weather data and hourly price data, matched with quarterly survey data)
      • Experiment: RHadoop code for visualizing synthetic Canadian smart meter data, providing time elapsed for the following:
          • Hourly Consumption (kWh) v Hourly Temperature (C) for all data
          • Hourly Consumption (kWh) v Hourly Price (c) for all data

  • 21. Experiment: Traffic loops
      • In the Netherlands, 20,000 traffic loops, counting the number of vehicles each minute, are located on approximately 3,000 km of motorway. All this data is collected by a central agency, the NDW (National data warehouse for traffic). Data was loaded for one year for the area of South Limburg, which has about 800 of these traffic loops
      • Experiment:
          • Find out how to deal with multiple files in Hadoop
          • See how the traffic develops during a year
      • Deliverables:
          • Code for aggregating the data in Hive and RHadoop
          • A graphical representation of the development of the traffic on these roads and in this region
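
The traffic loops deliverable is aggregation code in Hive and RHadoop, which is not reproduced in the slides. As a rough single-machine illustration of the same idea, the Python sketch below sums per-minute vehicle counts into daily totals per loop. The record layout `(loop_id, timestamp, vehicles)` and the timestamp format are assumptions, not the NDW file format.

```python
from collections import defaultdict
from datetime import datetime

def daily_totals(records):
    """Sum per-minute vehicle counts into {(loop_id, 'YYYY-MM-DD'): vehicles}."""
    totals = defaultdict(int)
    for loop_id, timestamp, vehicles in records:
        # Parsing validates the timestamp before the day is used as a grouping key.
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date().isoformat()
        totals[(loop_id, day)] += vehicles
    return dict(totals)

# Hypothetical per-minute counts for two loops.
records = [
    ("L1", "2014-03-01 08:00", 12),
    ("L1", "2014-03-01 08:01", 15),
    ("L1", "2014-03-02 08:00", 7),
    ("L2", "2014-03-01 08:00", 3),
]
totals = daily_totals(records)
```

The grouping key `(loop_id, day)` plays the role a `GROUP BY` over loop and date would play in a Hive query. Plotting the daily totals over the year would give the kind of graphical representation listed as the second deliverable.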

  • 22. Experiment: Traffic loops (no text on this slide)

  • 23. Experiment: Social Data
      • Set of tweets generated in Mexico from January to July 2014:
          • Sentiment analysis techniques to obtain indicators of subjective wellbeing (compare with stats)
          • Use geo-tagged tweets for analysing people's movement
          • State of origin of tourists visiting "Magic Towns" in Mexico

  • 24. Experiment: Social Data
      • Next steps:
          • Geo-located tweets experiments on:
              • Working patterns / commuting from morning to night
              • Weekends / Holidays / Seasonal movements
              • South-North mobility / Commerce at the northern border
          • Work on emoticons and media acronyms analysis:
              • Develop a small emoticons dictionary / review research papers
              • Count the emoticons in the available tweets, and how many tweets have emoticons, to get an idea of their representativity
          • Review of algorithms: work with some MapReduce adaptations, Spark, Scala

  • 25. Experiment: Job vacancies
      • The Job-vacancies team works on (historical) job vacancies data, scraped from various sites on the web. Goals:
          • to identify possible data sources, both free and commercial, and their APIs, and illustrate potential use cases
          • to scrape job vacancies data from the biggest national websites (possibly international also)
          • to test scraping tools (Irobotsoft and Kimonolabs)
          • to test the statistical process of data manipulation
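
The emoticon counts planned in the Social Data next steps (how many emoticons occur, and how many tweets contain at least one) could be sketched in Python as follows. The small dictionary and its polarity labels are placeholders, not the dictionary the team developed. Tokenizing on whitespace is a simplifying assumption that misses emoticons glued to words.

```python
from collections import Counter

# Hypothetical mini-dictionary: emoticon -> polarity label.
EMOTICONS = {
    ":)": "positive",
    ":-)": "positive",
    ":D": "positive",
    ":(": "negative",
    ":-(": "negative",
}

def emoticon_stats(tweets):
    """Return (emoticon counts, share of tweets containing at least one emoticon)."""
    counts = Counter()
    tweets_with_emoticon = 0
    for text in tweets:
        found = [token for token in text.split() if token in EMOTICONS]
        counts.update(found)
        if found:
            tweets_with_emoticon += 1
    share = tweets_with_emoticon / len(tweets) if tweets else 0.0
    return counts, share

# Hypothetical tweets.
tweets = [
    "buen dia :)",
    "sin luz otra vez :(",
    "nada que decir",
    "feliz :) :D",
]
counts, share = emoticon_stats(tweets)
```

The share of tweets carrying an emoticon is one simple way to gauge how representative an emoticon-based indicator would be of the whole tweet set.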

  • 26. Experiment: Web scraping
      • 8,600 Italian websites, indicated by the 19,000 enterprises responding to the 2013 ICT survey, have been scraped and the acquired texts have been processed
      • The scraping and processing work took about 33 hours on a virtual server in Italy. The goal of this activity is to reproduce the software configuration used and rerun the process in a more powerful environment in order to measure the time taken
      • Experiment:
          • Configure a Nutch job runnable in the Sandbox environment
          • Execute the scraping job in order to produce the scraped data in HDFS
          • Compare the performance of the sandbox with the performance of a single server

  • 27. State of the Project
      • All teams are running experiments and have defined objectives for final deliverables (preliminary results due by the end of November, final results by the end of the year)
      • Outline of final deliverables defined in September meetings
      • Developed training material, available to all participants and to the public in future
      • Effective cooperation and exchange of ideas: all participants requested more time for developing other experiments and look forward to extending the project

  • 28. Lessons Learned
      • International cooperation can multiply the ideas
      • Data acquisition can be a long process (e.g. five months to get Orange mobile data)
          • group suggested other possible approaches for the future
          • need political/legal sponsorship
      • Setup of the environment required time
          • difficult to achieve a "stable" configuration
      • Training should operate on different skills: IT, statistical and algorithms. Need for people open to learning new tools, techniques, methods...

  • 29. Thank you for your attention!