Frantisek (Fero)
Big data team
Web scraping job vacancies
(ESSnet on Big Data - Work package 1)
Outline
● Sample based scraping
● Full-size scraping
● Company names matching
● Comparisons
● To-do
● Automated scraping framework
● Dashboard
● Scraping company websites
● DEMO!
Proof of concept
● P.O.C. on random sample of 50 (large) companies
○ Survey vs. Company websites vs. Job portals
Company  Survey  CW    Indeed  ...
Tesco    2345    1351  1525    ...
HSBC     321     243   210     ...
...      ...     ...   ...     ...
Useful quick insights
● Which portal is better?
● “Boots” problem
● Gap: survey - online
Pros and cons
✚
+ Quick and simple scrapers
+ Entries already linked (matched)
+ Lightweight (less risk), at least for a small sample
━
- Sample bias
- Effort to increase the sample
Full-size scraping
● Directory
● “Proper” spiders
○ Careerjet
○ CV-library
○ Universal Jobmatch
● T&Cs / robots.txt
✚
+ Lots of data
+ Not influenced by sample
━
- More “risky” scraping
- Need to match
Matching company names
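Two source names that should both resolve to “Milton Keynes council”:
“Milton Keynes Borough Council”
“MILTON KEYNES COUNCIL INCL EDUCATION EXCL SCHOOLS WITH EXTERNAL PAYROLL PROVIDERS”
[Diagram: company name and JV count from each source (Survey, Careerjet); the JV counts shown are 25 and 34]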
● Casing, stop-words, (TF-)IDF scores, INCL/EXCL
● 434 entries matched (3.7%)
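The matching steps listed above (casing, stop-words, IDF scores, INCL/EXCL) can be sketched roughly as follows. This is a minimal illustration, not the project's actual matcher: the stop-word list, the rule of cutting the name at the first INCL/EXCL, the smoothed IDF formula, and the helper names (`tokenize`, `best_match`, ...) are all assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the slides do not say which words were dropped.
STOP_WORDS = {"the", "of", "and", "with", "ltd", "limited", "plc"}


def tokenize(name):
    """Lower-case the name, cut it at the first INCL/EXCL, drop stop-words."""
    lowered = name.lower()
    head = re.split(r"\b(?:incl|excl)\b", lowered)[0]
    return [t for t in re.findall(r"[a-z0-9]+", head) if t not in STOP_WORDS]


def idf_weights(token_lists):
    """Smoothed inverse document frequency for every token seen."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector as a dict token -> weight."""
    return {t: count * idf.get(t, 0.0) for t, count in Counter(tokens).items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, candidates):
    """Return (similarity, candidate) for the most similar candidate name."""
    candidate_tokens = [tokenize(c) for c in candidates]
    query_tokens = tokenize(query)
    idf = idf_weights(candidate_tokens + [query_tokens])
    query_vec = vectorize(query_tokens, idf)
    scored = [
        (cosine(query_vec, vectorize(tokens, idf)), candidate)
        for tokens, candidate in zip(candidate_tokens, candidates)
    ]
    return max(scored)


# Illustrative survey-side names; only the first one comes from the slides.
survey_names = [
    "MILTON KEYNES COUNCIL INCL EDUCATION EXCL SCHOOLS WITH EXTERNAL PAYROLL PROVIDERS",
    "TESCO STORES LTD",
    "HSBC BANK PLC",
]
score, matched = best_match("Milton Keynes Borough Council", survey_names)
```

In practice a similarity threshold would decide whether the best candidate counts as a match at all; the right value is something to tune against hand-checked pairs.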
Scraping company websites
● One website, one spider - no problem
● 50 websites
Specific spider code
● Name and rep. unit
● URL
● Extraction
○ XPath
○ Regex pattern
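One way to read the list above is that each company website needs only a small configuration record (name, rep. unit, URL, XPath, regex) on top of shared spider code. The sketch below illustrates that idea under stated assumptions: `CompanySpiderConfig`, `extract_count`, the example company, and the sample page are all hypothetical, and the standard-library `xml.etree.ElementTree` stands in for the HTML-tolerant selectors (e.g. scrapy's) that a real spider would need, since it only parses well-formed markup.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class CompanySpiderConfig:
    """Everything that is specific to one company website (hypothetical layout)."""
    name: str      # company name
    rep_unit: str  # rep. unit identifier; the format here is an assumption
    url: str       # page that shows the vacancies
    xpath: str     # where on the page the vacancy count appears
    pattern: str   # regex with one group capturing the count


def extract_count(config, page_source):
    """Apply the XPath, then the regex, and return the count (or None)."""
    root = ET.fromstring(page_source)  # requires well-formed markup
    node = root.find(config.xpath)
    if node is None:
        return None
    text = "".join(node.itertext())
    match = re.search(config.pattern, text)
    return int(match.group(1)) if match else None


# Made-up company and page, for illustration only.
EXAMPLE = CompanySpiderConfig(
    name="Example Ltd",
    rep_unit="0000",
    url="https://example.com/careers",
    xpath=".//div[@class='results']/span",
    pattern=r"(\d+)\s+vacanc",
)
SAMPLE_PAGE = (
    "<html><body><div class='results'>"
    "<span>We have 17 vacancies</span>"
    "</div></body></html>"
)
count = extract_count(EXAMPLE, SAMPLE_PAGE)
```

Keeping the per-site part this small is what would make adding a new company website a matter of filling in a record rather than writing a new spider.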
Scraping company websites
● Type of access to the relevant HTML
○ Simple HTTP. E.g. Caring homes
○ Selenium. E.g. Care UK
● Obtaining count
○ Direct count. E.g. Caring homes
○ Counting vacancies. E.g. University of Portsmouth
● Pagination
○ Not necessary. E.g. Caring homes
○ Necessary. E.g. Somerset county
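The two ways of obtaining a count, with and without pagination, can be sketched as below. This is an illustration only: the function names, the regex patterns, and the in-memory `FAKE_SITE` are assumptions, and `fetch` is a placeholder for whichever access type a site needs (a simple HTTP request or a Selenium-driven browser).

```python
import re


def direct_count(html, pattern=r"(\d+)\s+vacanc"):
    """Direct count: the page states the number of vacancies itself."""
    match = re.search(pattern, html)
    return int(match.group(1)) if match else None


def count_listed_vacancies(fetch, first_url, item_pattern, next_pattern, max_pages=100):
    """Counting vacancies: count listing items, following 'next' links if any."""
    total = 0
    url = first_url
    visited = set()
    # Stop on the last page, on a loop back to a seen page, or at max_pages.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        html = fetch(url)
        total += len(re.findall(item_pattern, html))
        next_link = re.search(next_pattern, html)
        url = next_link.group(1) if next_link else None
    return total


# In-memory stand-in for a paginated careers site (made up for illustration).
FAKE_SITE = {
    "/jobs?page=1": '<li class="job">A</li><li class="job">B</li>'
                    '<a rel="next" href="/jobs?page=2">Next</a>',
    "/jobs?page=2": '<li class="job">C</li>',
}

total = count_listed_vacancies(
    fetch=FAKE_SITE.__getitem__,
    first_url="/jobs?page=1",
    item_pattern=r'<li class="job">',
    next_pattern=r'<a rel="next" href="([^"]+)"',
)
stated = direct_count("Showing 42 vacancies")
```

A site without pagination is simply the case where the `next_pattern` never matches, so the loop ends after the first page.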
DEMO!
Project architecture
Python project
● Spiders (scrapy, xpath, regex, beautiful soup)
○ Sample-based (sample of 50 companies)
   PSB (portal sample-based): Indeed, Careerjet, Brick7, ...
   CW (comp. websites) (selenium)
○ Full-size
   PFS (portal full-size): Careerjet, CV-library, UJM
● Emailer (mailjet)
● Tests, scripts, notebooks... (nose, bash, jupyter, ...)
Emails from scraping
Deploying project
Python project
● Spiders (scrapy, xpath, regex, beautiful soup)
○ Sample-based (sample of 50 companies)
   PSB (portal sample-based): Indeed, Careerjet, Brick7, ...
   CW (comp. websites) (selenium)
○ Full-size
   PFS (portal full-size): Careerjet, CV-library, UJM
● Emailer (mailjet)
● Deploy (bash)
● Tests, scripts, notebooks... (nose, bash, jupyter, ...)
Google cloud
● “Managing” instance: run scraping (cron job, 24h)
● Mongo DB instance: turn on/off, store data
Technologies used
Visualise the data
Python project
● Spiders (scrapy, xpath, regex, beautiful soup)
○ Sample-based (sample of 50 companies)
   PSB (portal sample-based): Indeed, Careerjet, Brick7, ...
   CW (comp. websites) (selenium)
○ Full-size
   PFS (portal full-size): Careerjet, CV-library, UJM
● Emailer (mailjet)
● Deploy (bash)
● Dashboard (flask, bokeh, js): visualise
● Tests, scripts, notebooks... (nose, bash, jupyter, ...)
Google cloud
● “Managing” instance: run scraping (cron job, 24h)
● Mongo DB instance: turn on/off, store data
Dashboard
DEMO!
Scatter plot with best fit
Bland-Altman plot with KDE
BA-plots side by side
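For reference, the quantities behind a Bland-Altman plot can be computed as in the sketch below: per-company means and differences of two sources, the mean difference (bias), and the usual limits of agreement at bias ± 1.96 standard deviations. The function name and the example counts are made up; the plotting itself and the KDE overlay are left out.

```python
import statistics


def bland_altman(a, b):
    """Return the points and summary lines of a Bland-Altman plot.

    a, b: paired vacancy counts for the same companies from two sources.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired series with at least two entries")
    means = [(x + y) / 2 for x, y in zip(a, b)]  # x-axis of the plot
    diffs = [x - y for x, y in zip(a, b)]        # y-axis of the plot
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)                 # sample standard deviation
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
    }


# Hypothetical counts for three companies from two sources.
result = bland_altman([10, 20, 30], [8, 22, 27])
```

A bias far from zero suggests one source systematically reports more vacancies than the other, while wide limits suggest the two disagree company by company.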
Krippendorff’s alpha
● inter-rater agreement
● Range [-1, 1]
○ 1 = perfect agreement
○ 0 = absence of reliability
○ -1 = systematic disagreement
K.A. = 0.755
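As a rough illustration of how such a value is obtained, the sketch below computes Krippendorff's alpha as 1 minus the ratio of observed to expected disagreement, using the interval (squared difference) metric. The slides do not say which metric or implementation was used, so the metric, the function name, and the toy data are assumptions.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric.

    units: one sequence of numeric values per unit (e.g. per company), each
    holding the values the different sources report for that unit. Units
    with fewer than two values cannot be paired and are ignored.
    """
    units = [list(u) for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one unit with two or more values")
    # Observed disagreement: squared differences within each unit.
    observed = sum(
        sum((x - y) ** 2 for x, y in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled values.
    expected = sum((x - y) ** 2 for x, y in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all values are identical")
    return 1 - observed / expected


perfect = krippendorff_alpha_interval([(1, 1), (2, 2), (3, 3)])
swapped = krippendorff_alpha_interval([(1, 2), (2, 1)])
```

The pooled sum is quadratic in the number of values, which is fine for a sample of 50 companies but would need a smarter formulation for full-size data.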
Comparing portals
Nowcasting
[Timeline: survey data every 1 month, scraped data every 1 day; earlier periods are marked “train”, the latest period “predict”]
● In total
● Per industry
● Per company
Nowcasting survey entry
● Possible model inputs
○ Scraped values
○ Previous survey values
○ Company parameters
○ Industry (dummy 0-1 coding)
○ Outlying factor
○ …
● Possibly lots of training data
○ 6k entries in survey
○ monthly
For company X at time t
● Inputs
○ Portal 1, Portal 2, …, Portal n
○ Comp. website
○ Survey(t-1), Survey(t-2), …, Survey(t-k)
○ Industry 1, Industry 2, …, Industry m
○ Employee size
○ Outlying factor
● Model: Regression (neural network?)
● Output: Survey(t)
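To make the inputs above concrete, the sketch below assembles one feature vector for a company at time t, including the dummy 0-1 coding of industry. It is only an illustration: the function name, the argument layout, and all numbers are hypothetical, and the choice of regression model is left open, as on the slide.

```python
def build_features(scraped, previous_survey, industry, industries,
                   employee_size, outlying_factor):
    """Build the model input for one company at time t.

    scraped:          latest counts from each portal and the company website
    previous_survey:  Survey(t-1), ..., Survey(t-k)
    industry:         the company's industry label
    industries:       fixed, ordered list of all industry labels
    """
    # Dummy 0-1 coding: exactly one position is 1.0 for a known industry.
    industry_dummies = [1.0 if industry == name else 0.0 for name in industries]
    return [
        *(float(v) for v in scraped),
        *(float(v) for v in previous_survey),
        *industry_dummies,
        float(employee_size),
        float(outlying_factor),
    ]


# Made-up example: two scraped sources, two previous survey values.
features = build_features(
    scraped=[120, 95],
    previous_survey=[110, 105],
    industry="retail",
    industries=["finance", "retail"],
    employee_size=5000,
    outlying_factor=0.0,
)
```

Each monthly survey entry would give one such vector paired with the observed Survey(t) as the training target.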
Scale up and expand!
● Why not?
○ New FS spider ⟶ 1 - 3 days
○ New SB spider ⟶ 1 - 3 days + sample
○ New CW spider ⟶ 10 minutes
● Sample ⟶ 100
● Improve matching
● Data from partners
The deadly triangle