Big Data ESSNet
WP1 - Web Scraping for Job Vacancy
Statistics
Job Vacancy Pilot: Overview
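Comparison of official estimates (survey) and web data:
• Frequency: monthly (official estimates) vs daily, possibly real-time? (web data)
• Breakdowns considered: industry sector, enterprise size, job type / skills, sub-national, national totals
• Web data: more frequent, more timely, more granular, potentially cheaper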
Job Vacancy Pilot: Participants
• United Kingdom (lead)
• Germany (core)
• Sweden (core)
• Slovenia (core)
• Italy (observer)
• Greece (observer)
Broad Approach
• Understand the landscape of web-based job vacancy
data in each country
• Focus first on job portals (investigate web scraping of
enterprise websites depending on WP2 progress)
• Try to replicate existing outputs, then investigate
opportunities to produce new types of output.
• Develop specific approaches that are appropriate to
the circumstances in each country
• Develop common approaches where possible
Task 1. Data Access
Sub-task: Status
• Inventory and assessment of job portals: Template developed (Germany). Deliverable due July 2016
• Legal aspects of web scraping: In progress
• Review concepts and standards: Template developed
• Explore access to third party sources: CEDEFOP LMI pilot data; Government Employment Agency data (Sweden, Slovenia)
• Obtain existing job vacancy survey data (for coverage assessment): In progress
• Web scraping enterprise websites: SGA2??
Task 2: Data Handling
Sub-task: Status
• Small scale web scraping experiments: Experiments by UK, Slovenia, Greece (using Import.io); use of APIs by UK
• Evaluate third party data: Evaluation of government employment agency data (Sweden and Slovenia); CEDEFOP LMI data (priority over next few months)
• Design larger scale web scraping experiments: Awaiting Sandbox access
• Evaluate, adopt and enhance methods for de-duplication, data cleaning and formatting, and classifying data: Good understanding of these issues gained from CEDEFOP (training visit in May)
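As an illustration of the de-duplication sub-task, the sketch below drops adverts that share the same normalised title, employer and location. It is a minimal example under stated assumptions, not the method used by CEDEFOP or by the pilot: the `JobAd` fields, the choice of key and the sample adverts are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class JobAd:
    # Hypothetical minimal record for one scraped advert
    portal: str
    title: str
    employer: str
    location: str


def normalise(text: str) -> str:
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def dedup_key(ad: JobAd) -> tuple:
    # Assumption: two adverts with the same title, employer and location
    # describe the same vacancy, whichever portal they came from.
    return (normalise(ad.title), normalise(ad.employer), normalise(ad.location))


def deduplicate(ads: list) -> list:
    """Keep the first advert seen for each normalised key."""
    seen = set()
    unique = []
    for ad in ads:
        key = dedup_key(ad)
        if key not in seen:
            seen.add(key)
            unique.append(ad)
    return unique


# Made-up adverts, for illustration only
ads = [
    JobAd("portal_a", "Data Analyst", "Acme Ltd.", "Leeds"),
    JobAd("portal_b", "data  analyst", "ACME Ltd", "leeds"),
    JobAd("portal_a", "Nurse", "City Hospital", "Leeds"),
]
unique_ads = deduplicate(ads)
```

An exact-match key like this is deliberately conservative: it misses duplicates whose titles are worded differently, so in practice it would probably need to be combined with fuzzier matching.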
Task 3: Methodology for Output Production
Sub-task: Status
• Combine 3rd party sources with targeted web scraping: Started
• Link with job vacancy survey data and evaluate coverage: Started
• Experimental estimates replicating existing outputs: Not there yet
• Further iterations??
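One simple way to read "evaluate coverage" is as the ratio of web-sourced vacancy counts to survey estimates within each breakdown cell. The sketch below assumes that reading; the function name, the sector labels and all figures are invented for illustration and are not results from the pilot.

```python
def coverage_ratios(survey: dict, web: dict) -> dict:
    """Web vacancy count divided by the survey estimate, per sector in the survey."""
    ratios = {}
    for sector, survey_count in survey.items():
        if survey_count > 0:
            # Sectors missing from the web data count as zero coverage
            ratios[sector] = web.get(sector, 0.0) / survey_count
    return ratios


# Made-up figures, for illustration only
survey_estimates = {"Manufacturing": 200.0, "Health": 400.0, "Construction": 100.0}
web_counts = {"Manufacturing": 150.0, "Health": 100.0}

ratios = coverage_ratios(survey_estimates, web_counts)
```

A ratio well below 1 suggests the web sources under-cover that sector; a ratio above 1 may point to duplicates or to adverts that do not correspond to a vacancy as the survey defines it.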
Task 4: Future Perspectives
• Web scraping job vacancies from enterprise
websites (WP2)
• Extending approaches to other member
states
• New statistical products
Deliverables
• 1.1 Inventory and qualitative assessment of
job portals: July 2016
• 1.2 Interim feasibility report: November 2016
• 1.3 Final technical report: July 2017
Milestones
• 1.4 Progress and technical report of 1st internal
work package meeting (7-8 Apr, Wiesbaden)
• 1.5 Progress and technical report of 2nd internal
work package meeting (7-9 Nov, Rome)
• 1.6 Simulated production systems deployed (Jan
2017 … start earlier?)
• 1.7 Simulated production systems
decommissioned (July 2017)
Many potential approaches to data collection
1. Web scraping Job Portals
2. Job Portal APIs
3. Web scraping Enterprise Websites
4. Commercial Suppliers
5. Public Sector Agencies
Job Portals
• Types
- Portals (original content)
- Job search engines (crawl job portals)
- Generalist vs Specialist, Regional
• Country differences:
- Germany, 1600 job portals
- Slovenia, 40 job portals. 2 largest cover 95% of all
online jobs.
Job Portals – Evaluation Criteria
What
1. Position
2. Occupation
3. Education
4. Type of job (temporary or permanent, full-time or part-time)
When
5. Date of advertised vacancy
6. Date of application deadline
7. Date to fill a vacancy
Where
8. Location of job
Who
9. Direct employer or agency
10. Economic activity of employer (NACE)
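The ten criteria above could be recorded per portal as a simple checklist. The sketch below is one hypothetical encoding: the short variable labels, the scoring rule (share of criteria available in each group) and the example portal are assumptions, not part of the assessment template developed for the pilot.

```python
# The four groups and ten criteria listed above, with shortened labels
CRITERIA = {
    "What": ["position", "occupation", "education", "type of job"],
    "When": ["date advertised", "application deadline", "date to fill"],
    "Where": ["location"],
    "Who": ["direct employer or agency", "economic activity (NACE)"],
}


def score_portal(available: set) -> dict:
    """Share of criteria in each group that a portal publishes."""
    return {
        group: sum(item in available for item in items) / len(items)
        for group, items in CRITERIA.items()
    }


# Hypothetical portal that publishes four of the ten variables
example_portal = {"position", "occupation", "location", "date advertised"}
scores = score_portal(example_portal)
```

Scoring by group rather than with a single total keeps it visible when a portal is strong on "What" but publishes nothing about "Who", which matters for breakdowns by economic activity.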
Third party data: CEDEFOP
• Web scraping pilot in 2015
• Primarily interested in skills / job titles
• 5 EU countries (UK, Ireland, Germany, Italy,
Czech Rep)
• 4-5 portals per country
• Fully documented processes (e.g. robots, de-duplication, classification)
• Can we use this data as the basis for job
vacancy statistics?
• Agreement in place to access the pilot system
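If "robots" in the list above refers to respecting each portal's robots.txt rules, a scraper can check a URL before fetching it. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `stats-bot` user agent and the `portal.example` URLs are hypothetical, and the rules are inlined so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


can_scrape_listing = allowed(ROBOTS_TXT, "stats-bot", "https://portal.example/jobs/123")
can_scrape_private = allowed(ROBOTS_TXT, "stats-bot", "https://portal.example/private/cv")
```

Against a live site, the rules would normally be loaded with `set_url()` and `read()` instead of being passed in as text. Passing a robots.txt check does not settle the legal aspects of web scraping, which Task 1 treats separately.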
Figure: SCB official JV estimates vs Arbetsförmedlingen (National Employment Agency) data. Series: Public sector (SCB), Private sector (SCB), Total (SCB), AF data.
Figure: UK Job Vacancy Estimates, ONS vs Adzuna. Vacancy estimates (thousands).
Figure: UK Job Vacancy Estimates, ONS vs Adzuna. Vacancy estimates (thousands) and monthly % change.
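The monthly % change plotted for the two series is presumably the usual period-on-period change, 100 × (current − previous) / previous. A minimal sketch under that assumption follows; the figures are made up and are not ONS or Adzuna values.

```python
def monthly_pct_change(series: list) -> list:
    """Percentage change between each pair of consecutive monthly values."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(series, series[1:])]


# Made-up vacancy estimates in thousands, for illustration only
vacancies_thousands = [500.0, 550.0, 495.0]
changes = monthly_pct_change(vacancies_thousands)
```

Comparing changes rather than levels lets a web series and a survey series be set side by side even when their absolute levels differ.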
Results from UNECE pilot (Slovenia)
Current Priorities
• Complete inventory and evaluation of job
portals
• Evaluate existing data sources (including
comparisons with survey data)
• Further develop methods for obtaining and
processing additional data (web scraping,
APIs)
• Prepare for a “virtual sprint” (28-29 July)