Big Data ESSNet - Europa · Task 1. Data Access Sub-task Status Inventory and assessment of job...

Big Data ESSNet WP1 - Web Scraping for Job Vacancy Statistics


Page 1: Big Data ESSNet - Europa · Task 1. Data Access Sub-task Status Inventory and assessment of job portals Template developed (Germany). - Deliverable due July 2016 Legal aspects of

Big Data ESSNet

WP1 - Web Scraping for Job Vacancy Statistics

Page 2:

Job Vacancy Pilot: Overview

Comparison of Official Estimates (Survey) with Web data

Frequency: Monthly (survey) vs Daily (real-time?) (web data)

Dimensions compared: Industry Sector, Enterprise Size, Job type / skills, Sub-national, National Totals

• More frequent

• More timely

• More granular

• Potentially cheaper

Page 3:

Job Vacancy Pilot: Participants

• United Kingdom (lead)

• Germany (core)

• Sweden (core)

• Slovenia (core)

• Italy (observer)

• Greece (observer)

Page 4:

Broad Approach

• Understand the landscape of web-based job vacancy data in each country

• Focus first on job portals (investigate web scraping of enterprise websites depending on WP2 progress)

• Try to replicate existing outputs, then investigate opportunities to produce new types of output

• Develop specific approaches that are appropriate to the circumstances in each country

• Develop common approaches where possible

Page 5:

Task 1. Data Access

Sub-task | Status
Inventory and assessment of job portals | Template developed (Germany); deliverable due July 2016
Legal aspects of web scraping | In progress
Review concepts and standards | Template developed
Explore access to third party sources | CEDEFOP LMI pilot data; Government Employment Agency data (Sweden, Slovenia)
Obtain existing job vacancy survey data (for coverage assessment) | In progress
Web scraping enterprise websites | SGA2??

Page 6:

Task 2: Data Handling

Sub-task | Status
Small scale web scraping experiments | Experiments by UK, Slovenia, Greece (using Import.io); use of APIs by UK
Evaluate third party data | Evaluation of Govt employment agency data (Sweden and Slovenia); CEDEFOP LMI data (priority over next few months)
Design larger scale web scraping experiments | Awaiting Sandbox access
Evaluate, adopt and enhance methods for de-duplication, data cleaning and formatting, and classifying data | Good understanding of these issues from CEDEFOP (training visit in May)
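The de-duplication and data cleaning sub-tasks above can be illustrated with a minimal sketch. It is not the method used by CEDEFOP or by any participant: the `Advert` fields, the normalisation rules and the sample adverts are all assumptions chosen for illustration. The idea is simply that the same vacancy posted on two portals should collapse to one record once trivial formatting differences are removed.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Advert:
    """Hypothetical minimal advert record; real scraped data has more fields."""
    title: str
    employer: str
    location: str
    portal: str


def normalise(text: str) -> str:
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def dedup_key(ad: Advert) -> tuple[str, str, str]:
    """Key that ignores the portal, so cross-portal copies compare equal."""
    return (normalise(ad.title), normalise(ad.employer), normalise(ad.location))


def deduplicate(ads: list[Advert]) -> list[Advert]:
    """Keep the first advert seen for each normalised key."""
    seen: set[tuple[str, str, str]] = set()
    unique: list[Advert] = []
    for ad in ads:
        key = dedup_key(ad)
        if key not in seen:
            seen.add(key)
            unique.append(ad)
    return unique


# Invented sample data: the first two adverts describe the same job.
ads = [
    Advert("Data Analyst", "Acme Ltd.", "London", "portal_a"),
    Advert("data analyst ", "ACME Ltd", "london", "portal_b"),
    Advert("Nurse", "City Hospital", "Leeds", "portal_a"),
]
unique_ads = deduplicate(ads)
```

Exact matching on a normalised key is the simplest option; it will miss near-duplicates with reworded titles, which is where fuzzy matching or text similarity would be needed.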

Page 7:

Task 3: Methodology for Output Production

Sub-task | Status
Combine 3rd party sources with targeted web scraping | Started
Link with job vacancy survey data and evaluate coverage | Started
Experimental estimates replicating existing outputs | Not there yet
Further iterations?? |
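As a rough illustration of what "evaluate coverage" could mean in practice, the sketch below compares web advert counts with survey vacancy counts per group. The grouping by industry, the function name `coverage_by_group` and all figures are invented for the example; the slides do not specify how coverage is to be measured.

```python
def coverage_by_group(survey: dict[str, int], web: dict[str, int]) -> dict[str, float]:
    """Web advert count as a share of survey vacancies, per group.

    Groups with zero survey vacancies are skipped to avoid division by zero.
    """
    return {
        group: web.get(group, 0) / vacancies
        for group, vacancies in survey.items()
        if vacancies > 0
    }


# Invented figures, for illustration only.
survey_vacancies = {"manufacturing": 200, "retail": 100, "health": 50}
web_adverts = {"manufacturing": 50, "retail": 120}

coverage = coverage_by_group(survey_vacancies, web_adverts)
for group, share in coverage.items():
    print(f"{group}: {share:.0%}")
```

A ratio above 100% is possible in this sketch; it would suggest remaining duplicates or a difference in concept between an online advert and a surveyed vacancy rather than genuine over-coverage.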

Page 8:

Task 4: Future Perspectives

• Web scraping job vacancies from enterprise websites (WP2)

• Extending approaches to other member states

• New statistical products

Page 9:

Deliverables

• 1.1 Inventory and qualitative assessment of job portals: July 2016

• 1.2 Interim feasibility report: November 2016

• 1.3 Final technical report: July 2017

Page 10:

Milestones

• 1.4 Progress and technical report of 1st internal work package meeting (7-8 Apr, Wiesbaden)

• 1.5 Progress and technical report of 2nd internal work package meeting (7-9 Nov, Rome)

• 1.6 Simulated production systems deployed (Jan 2017 … start earlier?)

• 1.7 Simulated production systems decommissioned (July 2017)

Page 11:

Many potential approaches to data collection

1. Web scraping Job Portals

2. Job Portal APIs

3. Web scraping Enterprise Websites

4. Commercial Suppliers

5. Public Sector Agencies

Page 12:

Job Portals

• Types

- Portals (original content)

- Job search engines (crawl job portals)

- Generalist vs Specialist, Regional

• Country differences:

- Germany: 1600 job portals

- Slovenia: 40 job portals; the 2 largest cover 95% of all online jobs

Page 13:

Job Portals – Evaluation Criteria

What

1. Position

2. Occupation

3. Education

4. Type of job (temporary or permanent, full-time or part-time)

When

5. Date of advertised vacancy

6. Date of application deadline

7. Date to fill a vacancy

Where

8. Location of job

Who

9. Direct employer or agency

10. Economic activity of employer (NACE)
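One way to apply these criteria to a portal is to check how often each of the ten items is actually filled in a sample of its adverts. The sketch below assumes a hypothetical `VacancyRecord` with one optional field per criterion; the field names and the sample records are invented, and the assessment template itself may score portals differently.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class VacancyRecord:
    """One optional field per evaluation criterion (names are assumptions)."""
    position: Optional[str] = None              # 1. Position
    occupation: Optional[str] = None            # 2. Occupation
    education: Optional[str] = None             # 3. Education
    job_type: Optional[str] = None              # 4. Type of job
    date_advertised: Optional[str] = None       # 5. Date of advertised vacancy
    application_deadline: Optional[str] = None  # 6. Date of application deadline
    date_to_fill: Optional[str] = None          # 7. Date to fill a vacancy
    location: Optional[str] = None              # 8. Location of job
    employer_or_agency: Optional[str] = None    # 9. Direct employer or agency
    nace_activity: Optional[str] = None         # 10. Economic activity (NACE)


def completeness(records: list[VacancyRecord]) -> dict[str, float]:
    """Share of records in which each criterion is present (not None)."""
    if not records:
        raise ValueError("need at least one record")
    names = [f.name for f in fields(VacancyRecord)]
    return {
        name: sum(getattr(r, name) is not None for r in records) / len(records)
        for name in names
    }


# Invented sample from a single hypothetical portal.
sample = [
    VacancyRecord(position="Welder", location="Maribor", date_advertised="2016-05-02"),
    VacancyRecord(position="Nurse", location="Ljubljana"),
]
shares = completeness(sample)
```

In this sample `shares["position"]` is 1.0 while `shares["nace_activity"]` is 0.0, which is the kind of gap the criteria are meant to expose.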

Page 14:

Third party data: CEDEFOP

• Web scraping pilot in 2015

• Primarily interested in skills / job titles

• 5 EU countries (UK, Ireland, Germany, Italy, Czech Rep)

• 4-5 portals per country

• Fully documented processes (e.g. robots, de-duplication, classification)

• Can we use this data as the basis for job vacancy statistics?

• Agreement in place to access the pilot system
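The slide does not say what "robots" covers. If it is read as including respect for a portal's robots.txt rules, which is an assumption here, a scraper could check permissions with Python's standard `urllib.robotparser` before fetching a page. The robots.txt content, the user agent name and the URLs below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real scraper would download the portal's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-nsi-bot"  # hypothetical user agent name

# Listing pages are allowed by the rules above, the /private/ area is not.
allowed = parser.can_fetch(USER_AGENT, "https://jobs.example.org/vacancies/123")
blocked = parser.can_fetch(USER_AGENT, "https://jobs.example.org/private/admin")

# Seconds to wait between requests, if the file specifies a delay.
delay = parser.crawl_delay(USER_AGENT)
```

Checking robots.txt addresses only the technical etiquette of scraping; it does not settle the legal questions tracked under Task 1.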

Page 15:

SCB official JV estimates vs Arbetsförmedlingen (National Employment Agency)

[Figure. Series shown: Public sector (SCB), Private sector (SCB), Total (SCB), AF data]

Page 16:

UK Job Vacancy Estimates: ONS vs Adzuna

[Figure. Axis: vacancy estimates (thousands)]

Page 17:

UK Job Vacancy Estimates: ONS vs Adzuna

[Figure. Axes: vacancy estimates (thousands); monthly % change]
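For reference, the monthly % change plotted here is presumably the usual period-on-period change, 100 × (current − previous) / previous. The short sketch below computes it for a series of monthly levels; the figures are invented and are not the ONS or Adzuna values.

```python
def monthly_pct_change(levels: list[float]) -> list[float]:
    """Percentage change from each month to the next.

    Returns one value fewer than the input, since the first month has
    no predecessor to compare with.
    """
    return [
        100.0 * (current - previous) / previous
        for previous, current in zip(levels, levels[1:])
    ]


# Invented monthly vacancy levels in thousands.
levels = [500.0, 510.0, 459.0]
changes = monthly_pct_change(levels)  # [2.0, -10.0]
```

Comparing changes rather than levels is useful when two sources differ in coverage, because a source that captures only part of the market can still track the same movements.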

Page 18:

Results from UNECE pilot (Slovenia)

Page 19:

Current Priorities

• Complete inventory and evaluation of job portals

• Evaluate existing data sources (including comparisons with survey data)

• Further develop methods for obtaining and processing additional data (web scraping, APIs)

• Prepare for a "virtual sprint" (28-29 July)