Cedefop DataLAB
Access to data

In order to get data from Athena or Hive we need to have two things:
A Query
A Connection
First, let's import the pyathenajdbc and pandas libraries:
In [1]:
from pyathenajdbc import connect
import pandas as pd
With these libraries being imported, we can create our connection and run our queries:
In [2]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)
# query we want to run
query_1 = """
    SELECT *
    FROM cedefop_presentation.ft_document_essnet
    ORDER BY RAND()
    LIMIT 1000;
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)
# closing connection
conn.close()
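As an aside, the connect / query / close pattern above can also be wrapped in contextlib.closing so the connection is released even if the query fails. A minimal sketch, using the standard-library sqlite3 as a stand-in for the Athena connection (the table and values here are made up for illustration):

```python
import sqlite3
from contextlib import closing

import pandas as pd

# sqlite3 stands in here for the Athena connection; the pattern is the same.
# closing() guarantees conn.close() runs even if the query raises an error.
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute("CREATE TABLE ads (general_id INTEGER)")
    conn.executemany("INSERT INTO ads VALUES (?)", [(1,), (2,), (3,)])
    df = pd.read_sql("SELECT * FROM ads", conn)

print(len(df))  # 3
```

The same `with closing(connect(...)) as conn:` shape works for the pyathenajdbc connection used throughout this notebook.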
Understanding our data
There are simple methods and attributes in the Pandas library which allow us to get to know our data:

shape
head
tail
sample
describe
info
General info
info( ) is a useful method which gives us various information about the data we've just imported:

column names
the data type of each column
the memory usage of the data
the number of non-null values in each column
In [3]:
documents.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 49 columns):
general_id             1000 non-null object
grab_date              1000 non-null int64
year_grab_date         1000 non-null int64
month_grab_date        1000 non-null int64
day_grab_date          1000 non-null int64
expire_date            1000 non-null int64
year_expire_date       1000 non-null int64
month_expire_date      1000 non-null int64
day_expire_date        1000 non-null int64
lang                   1000 non-null object
idesco_level_4         1000 non-null object
esco_level_4           1000 non-null object
idesco_level_3         1000 non-null object
esco_level_3           1000 non-null object
idesco_level_2         1000 non-null object
esco_level_2           1000 non-null object
idesco_level_1         1000 non-null object
esco_level_1           1000 non-null object
idcity                 1000 non-null object
city                   1000 non-null object
idprovince             1000 non-null object
province               1000 non-null object
idregion               1000 non-null object
region                 1000 non-null object
idmacro_region         1000 non-null object
macro_region           1000 non-null object
idcountry              1000 non-null object
country                1000 non-null object
idcontract             1000 non-null object
contract               1000 non-null object
ideducational_level    1000 non-null object
educational_level      1000 non-null object
idsector               1000 non-null object
sector                 1000 non-null object
idmacro_sector         1000 non-null object
macro_sector           1000 non-null object
idcategory_sector      1000 non-null object
category_sector        1000 non-null object
idsalary               1000 non-null object
salary                 1000 non-null object
idworking_hours        1000 non-null object
working_hours          1000 non-null object
idexperience           1000 non-null object
experience             1000 non-null object
source_category        1000 non-null object
sourcecountry          1000 non-null object
source                 1000 non-null object
site                   1000 non-null object
companyname            1000 non-null object
dtypes: int64(8), object(41)
memory usage: 382.9+ KB
Shape of data
To quickly get the number of rows and columns of your data table you can use the shape attribute:
In [4]:

documents.shape

Out[4]:

(1000, 49)

If you are just interested in the number of records, you can use the len( ) function instead:

In [5]:

len(documents)

Out[5]:

1000

Data preview

To take a sneak peek at the data you can use head, tail or sample:

Getting the first n rows of the table:

In [6]:

documents.head(10)
# try using .head( ) without specifying the number of rows
Out[6]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
0 89833230 17758 2018 8 15 17878
1 148565316 17872 2018 12 7 17932
2 168716345 17914 2019 1 18 18034
3 168990410 17915 2019 1 19 18035
4 86938457 17742 2018 7 30 17784
5 208673305 17959 2019 3 4 18079
6 166831480 17908 2019 1 12 18028
7 119787491 17825 2018 10 21 17945
8 247840311 17976 2019 3 21 18096
9 79270616 17723 2018 7 11 17843
10 rows × 49 columns

Getting the last n rows of the table:

In [7]:

documents.tail(3)
Out[7]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
997 177733340 17937 2019 2 10 18057
998 145506827 17871 2018 12 6 17886
999 113561477 17807 2018 10 3 17927
3 rows × 49 columns

Getting a random sample of n rows from the table:

In [8]:

documents.sample(4)
# try using .sample( ) without specifying the sample size

Out[8]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
353 178790867 17923 2019 1 27 18043
745 172033162 17923 2019 1 27 18043
650 78356125 17725 2018 7 13 17845
391 171295029 17921 2019 1 25 18041
4 rows × 49 columns

Descriptive statistics

Using the describe( ) method you'll get a table with simple statistics for both numerical and categorical features:

In [9]:

documents.describe(include='all').round(1)
Out[9]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
count 1000 1000.0 1000.0 1000.0 1000.0 1000.0
unique 1000 NaN NaN NaN NaN NaN
top 166663080 NaN NaN NaN NaN NaN
freq 1 NaN NaN NaN NaN NaN
mean NaN 17847.8 2018.3 7.2 15.4 17953.6
std NaN 72.2 0.5 4.0 8.7 76.5
min NaN 17714.0 2018.0 1.0 1.0 17748.0
25% NaN 17787.0 2018.0 3.0 8.0 17887.0
50% NaN 17852.0 2018.0 8.0 15.0 17960.0
75% NaN 17907.0 2019.0 11.0 23.0 18020.0
max NaN 17986.0 2019.0 12.0 31.0 18106.0
11 rows × 49 columns

In case you want statistics just for the numeric columns, use documents.describe( ).round(1) instead.

Data pre-processing

Almost always, before passing to the processing phase, we need to perform several levels of pre-processing. One of the most important pre-processing tasks is deduplication:

Deduplication

For various reasons we may have duplicated rows in our data. Sometimes these rows are completely identical and sometimes, as in our example, duplicated rows represent the same job announcement coming from different sources. In this case, in order to identify and eliminate these records, we should use the general_id field:

In [10]:

documents = documents.drop_duplicates(subset=['general_id'])
The reason we used subset=['general_id'] is that in this example we're not looking for exactly identical rows: any two records with the same general_id are considered duplicates.
Since we only fetched 1000 random records from Athena, there are no duplicated records in this subset. That's why, if we count the records of the deduplicated table, we still have 1000 records:
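To see the effect of subset= in isolation, here is a toy example on made-up data (not the DataLab table): two rows share a general_id but come from different sources, so only deduplication on that column removes one of them.

```python
import pandas as pd

# hypothetical mini-table: two rows share general_id 'A' but differ in 'source'
ads = pd.DataFrame({
    'general_id': ['A', 'A', 'B'],
    'source':     ['site1', 'site2', 'site1'],
})

# without subset= nothing is dropped (the rows are not fully identical);
# with subset=['general_id'] the second 'A' row is removed
deduped = ads.drop_duplicates(subset=['general_id'])
print(len(ads.drop_duplicates()), len(deduped))  # 3 2
```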
In [11]:
len(documents)
Out[11]:

1000

Data aggregation and manipulation

Filtering data

Filtering data for a specific country:

In [16]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_1 = """
    SELECT COUNT(DISTINCT GENERAL_ID) as num_job_vacancy
    FROM cedefop_presentation.ft_document_essnet
    WHERE country = 'UNITED KINGDOM'
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

In [17]:

print(documents)

   num_job_vacancy
0         13303636
In [70]:

# query we want to run
query_1 = """
    SELECT COUNTRY, COUNT(DISTINCT GENERAL_ID) as num_job_vacancy
    FROM cedefop_presentation.ft_document_essnet
    GROUP BY COUNTRY
    ORDER BY num_job_vacancy desc
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents.head(20)

Out[70]:

COUNTRY num_job_vacancy
0 DEUTSCHLAND 15679793
1 UNITED KINGDOM 13303636
2 FRANCE 11959196
3 NEDERLAND 3483875
4 ITALIA 2371035
5 ESPAÑA 1832400
6 BELGIQUE-BELGIË 1802390
7 POLSKA 1444290
8 ÖSTERREICH 1413784
9 SVERIGE 1063498
10 IRELAND 570060
11 ČESKÁ REPUBLIKA 541947
12 LUXEMBOURG 77559

Multiple filtering on year and month:

In [20]:

# query we want to run
query_1 = """
    SELECT *
    FROM cedefop_presentation.ft_document_essnet
    WHERE country = 'UNITED KINGDOM'
      and year_grab_date = 2018
      and month_grab_date = 12
    limit 100
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents.head()
Out[20]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
0 146726257 17871 2018 12 6 17991
1 152031314 17876 2018 12 11 17996
2 152222010 17876 2018 12 11 17996
3 152155722 17878 2018 12 13 17998
4 152152423 17878 2018 12 13 17998
5 rows × 49 columns

Filtering on a specific occupation:

In [24]:

# query we want to run
query_1 = """
    SELECT esco_level_4, count(distinct general_id) as num_ojv
    FROM "AwsDataCatalog".cedefop_presentation.ft_document_essnet
    WHERE country = 'UNITED KINGDOM'
      and year_grab_date = 2018
      and esco_level_4 = 'Software developers'
    GROUP BY esco_level_4
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents

Out[24]:

esco_level_4 num_ojv
0 Software developers 402245

Group by

Top 15 occupations by country in 2018:

In [95]:

# query we want to run
query_1 = """
    SELECT *
    FROM (
        SELECT idcountry, esco_level_4, num_ojv,
               rank() over (partition by idcountry order by num_ojv desc) as rank
        FROM (
            SELECT idcountry, esco_level_4, count(distinct general_id) as num_ojv
            FROM cedefop_presentation.ft_document_essnet
            WHERE year_grab_date = 2018
            GROUP BY esco_level_4, idcountry
            ORDER BY num_ojv DESC
        ) t1
    ) t2
    where rank <= 15
"""

# reading data using connection and query
top_15_occ_df = pd.read_sql(query_1, conn)
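As an aside, the rank() over (partition by ... order by ...) pattern used in this query can be reproduced in pandas with groupby + rank. A small sketch on made-up counts (not the DataLab data):

```python
import pandas as pd

# hypothetical counts per (country, occupation)
df = pd.DataFrame({
    'idcountry':  ['IT', 'IT', 'IT', 'DE', 'DE'],
    'occupation': ['a', 'b', 'c', 'x', 'y'],
    'num_ojv':    [10, 30, 20, 5, 15],
})

# equivalent of: rank() over (partition by idcountry order by num_ojv desc)
df['rank'] = df.groupby('idcountry')['num_ojv'].rank(method='min', ascending=False)

# keep the top 2 occupations per country
top2 = df[df['rank'] <= 2]
print(sorted(top2['occupation']))  # ['b', 'c', 'x', 'y']
```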
In [96]:

# reindexing the table
# (note: set_index returns a new frame; without assignment top_15_occ_df is unchanged)
top_15_occ_df.set_index(["idcountry","esco_level_4"])

# renaming the columns
top_15_occ_df.columns = ['idcountry', 'occupation', 'count', 'rank']

# rearranging the columns
top_15_occ_df = top_15_occ_df[['idcountry', 'occupation', 'count', 'rank']]

# sorting the table
top_15_occ_df = top_15_occ_df.sort_values(['count'], ascending=False)
Let's take a look at the result:
In [97]:
top_15_occ_df[top_15_occ_df['idcountry']=='IT'].head(15)
Out[97]:

idcountry occupation count rank
90 IT Freight handlers 72314 1
91 IT Shop sales assistants 70410 2
92 IT Software developers 49289 3
93 IT Cleaners and helpers in offices, hotels and ot... 48368 4
94 IT Manufacturing labourers not elsewhere classified 41240 5
95 IT Administrative and executive secretaries 38120 6
96 IT Draughtspersons 32772 7
97 IT Commercial sales representatives 29679 8
98 IT Assemblers not elsewhere classified 28718 9
99 IT Metal working machine tool setters and operators 27659 10
100 IT Systems analysts 26714 11
101 IT Accounting and bookkeeping clerks 25259 12
102 IT Advertising and marketing professionals 24595 13
103 IT Electrical mechanics and fitters 24527 14
104 IT Retail and wholesale trade managers 22844 15

Top 5 sectors in 2019, by country:

In [106]:

# query we want to run
query_1 = """
    SELECT *
    FROM (
        SELECT idcountry, macro_sector, num_ojv,
               rank() over (partition by idcountry order by num_ojv desc) as rank
        FROM (
            SELECT idcountry, macro_sector, count(distinct general_id) as num_ojv
            FROM cedefop_presentation.ft_document_essnet
            WHERE year_grab_date = 2019
            GROUP BY macro_sector, idcountry
            ORDER BY num_ojv DESC
        ) t1
    ) t2
    where rank <= 15
"""

# reading data using connection and query
top_15_sect_df = pd.read_sql(query_1, conn)
In [107]:

# reindexing the table
top_15_sect_df.set_index(["idcountry","macro_sector"])

# renaming the columns
top_15_sect_df.columns = ['idcountry', 'macro_sector', 'count', 'rank']

# rearranging the columns
top_15_sect_df = top_15_sect_df[['idcountry', 'macro_sector', 'count', 'rank']]

# sorting the table
top_15_sect_df = top_15_sect_df.sort_values(['count'], ascending=False)
In [108]:
top_15_sect_df[top_15_sect_df['idcountry']=='UK'].head(5)
Out[108]:

idcountry macro_sector count rank
180 UK Administrative and support service activities 767198 1
181 UK Professional, scientific and technical activit... 710293 2
182 UK Human health and social work activities 458694 3
183 UK Information and communication 317472 4
184 UK Other service activities 215169 5

Top 5 occupations by country and sector:

In [146]:

# query we want to run
query_1 = """
    SELECT *
    FROM (
        SELECT idcountry, idmacro_sector, macro_sector, esco_level_4, num_ojv,
               rank() over (partition by idcountry, macro_sector, idmacro_sector order by num_ojv desc) as rank
        FROM (
            SELECT idcountry, idmacro_sector, macro_sector, esco_level_4, count(distinct general_id) as num_ojv
            FROM cedefop_presentation.ft_document_essnet
            WHERE year_grab_date = 2019
            GROUP BY idmacro_sector, macro_sector, esco_level_4, idcountry
            ORDER BY num_ojv DESC
        ) t1
    ) t2
    where rank <= 5
"""

# reading data using connection and query
top15_occ_by_count_sector = pd.read_sql(query_1, conn)
In [147]:

# reindexing the table
top15_occ_by_count_sector.set_index(["idcountry","idmacro_sector","macro_sector",'esco_level_4'])

# renaming the columns
top15_occ_by_count_sector.columns = ['idcountry', 'idmacro_sector', 'macro_sector', 'occupation', 'count', 'rank']

# rearranging the columns
top15_occ_by_count_sector = top15_occ_by_count_sector[['idcountry', 'idmacro_sector', 'macro_sector', 'occupation', 'count', 'rank']]

# sorting the table
top15_occ_by_count_sector = top15_occ_by_count_sector.sort_values(['count'], ascending=False)
In [148]:
top15_occ_by_count_sector.head(10)
Out[148]:

idcountry idmacro_sector macro_sector occupation count rank
460 DE G Wholesale and retail trade; repair of motor ve... Shop sales assistants 95170 1
1274 UK J Information and communication Software developers 78750 1
1393 UK Q Human health and social work activities Nursing professionals 72297 1
365 DE N Administrative and support service activities Administrative and executive secretaries 60448 1
385 DE M Professional, scientific and technical activit... Systems analysts 55686 1
375 DE J Information and communication Software developers 49373 1
380 DE C Manufacturing Manufacturing labourers not elsewhere classified 47292 1
370 DE Q Human health and social work activities Health care assistants 46932 1
386 DE M Professional, scientific and technical activit... Engineering professionals not elsewhere classi... 46488 2
366 DE N Administrative and support service activities Manufacturing labourers not elsewhere classified 45413 2
In [150]:
top15_occ_by_count_sector[(top15_occ_by_count_sector['idcountry'] == 'DE') & (top15_occ_by_count_sector['idmacro_sector'] == 'J')].head(5)
Out[150]:

idcountry idmacro_sector macro_sector occupation count rank
375 DE J Information and communication Software developers 49373 1
376 DE J Information and communication Systems analysts 27968 2
377 DE J Information and communication Systems administrators 10222 3
378 DE J Information and communication Engineering professionals not elsewhere classi... 6989 4
379 DE J Information and communication Advertising and marketing professionals 6616 5

Data Visualization

Let's make a simple plot which shows the number of announcements per month. To do so, first we should group our data by year and month and then count the records:

In [158]:

# query we want to run
query_1 = """
    SELECT idcountry, year_grab_date, month_grab_date, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    GROUP BY idcountry, year_grab_date, month_grab_date
"""

# reading data using connection and query
date_groupped = pd.read_sql(query_1, conn)
In [161]:

date_groupped.reset_index()
date_groupped.head()

Out[161]:

idcountry year_grab_date month_grab_date num_ojv
0 IT 2018 7 230650
1 BE 2019 3 192837
2 BE 2018 12 180383
3 ES 2018 7 170325
4 BE 2018 8 192713

Now that we have our aggregated data, we can start plotting. The Python community offers a wide range of visualization packages, but here we stick with matplotlib, a classic choice!

In [160]:

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

In [165]:

# getting the number of records per month
counts = date_groupped.groupby(['year_grab_date','month_grab_date']).sum()
counts

Out[165]:

num_ojv
year_grab_date month_grab_date
2018 7 5142680
8 5271987
9 6237170
10 6619957
11 8683441
12 6446523
2019 1 8013169
2 4566919
3 4561617
In [166]:

# creating a monthly date range and setting it as the index of our data
# (pd.date_range replaces the older DatetimeIndex(start=..., end=...) constructor)
counts.index = pd.date_range(start='2018-07-01', end='2019-03-31', freq='MS')

# setting the size of the plot
fig, ax = plt.subplots(figsize=(15, 7))

# plot the data (blue lines)
plt.plot(counts.index, counts)

# plot the data (black dots)
plt.scatter(counts.index, counts, c='k', zorder=10)

# setting the x ticks of the plot as the index of the data (dates)
plt.xticks(counts.index)

# setting the X and Y axis labels
plt.xlabel('Date')
plt.ylabel('# Announcements')

# changing the format of the date ticks
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

# setting the title of the plot
plt.title('Total Monthly Number of Announcements', fontsize=20)

# drawing the plot
plt.show()
Now let's repeat what we have just done, using a subset of the data filtered to one country:
In [174]:

filter_country = 'DE'
data_by_country = date_groupped[date_groupped.idcountry == filter_country]
data_by_country
Out[174]:

idcountry year_grab_date month_grab_date num_ojv
27 DE 2019 1 2280502
29 DE 2018 9 1487587
40 DE 2018 7 1345705
42 DE 2018 12 1782813
55 DE 2019 2 1525130
72 DE 2019 3 1487849
83 DE 2018 11 2373502
95 DE 2018 8 1535226
102 DE 2018 10 1861479

The other steps are identical to what we did for the previous plot:
In [175]:

# sort by year and month first so the values line up with the monthly date index
counts = data_by_country.sort_values(['year_grab_date', 'month_grab_date']).num_ojv
counts.index = pd.date_range(start='2018-07-01', end='2019-03-31', freq='MS')

fig, ax = plt.subplots(figsize=(15, 7))
plt.plot(counts.index, counts)
plt.scatter(counts.index, counts, c='k', zorder=10)
plt.xticks(counts.index)
plt.xlabel('Date')
plt.ylabel('# Announcements')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
plt.title(f'Monthly Number of Announcements - {filter_country}', fontsize=20)
plt.show()
OK, let's try another type of visualization: the pie chart.
We want to add two filters, on city and occupation, and plot a pie chart of the distribution of contract types.
Note: ISCO/ESCO code 25 -> some ICT occupations
In [180]:
# query we want to run
query_1 = """
    SELECT contract, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    WHERE city = 'Milano' and idesco_level_2 = '25'
    GROUP BY contract
"""
# reading data using connection and query
filtered = pd.read_sql(query_1, conn)
In [181]:
# the query already returns the distinct count for each contract type,
# so we set contract as the index and plot the counts
pie_data = filtered.set_index('contract')

# Notice that we're not directly using Matplotlib as we did for the previous plots.
# Pandas actually uses Matplotlib under the hood, so for simple plots like this one
# you can use the integrated visualizations of pandas without explicitly calling
# Matplotlib functions
pie_data.plot.pie(y='num_ojv', figsize=(8, 8))
plt.show()
Repeating the previous plot, this time for working hours:
In [182]:
# query we want to run
query_1 = """
    SELECT working_hours, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    WHERE city = 'Milano' and idesco_level_2 = '25'
    GROUP BY working_hours
"""

# reading data using connection and query
filtered = pd.read_sql(query_1, conn)

# as before, the query already returns the count per working_hours value
pie_data = filtered.set_index('working_hours')
pie_data.plot.pie(y='num_ojv', figsize=(8, 8))
plt.show()
In [183]:
conn.close()
Case-Study : Source Country and Destination Country
Create a pivot table using sourcecountry and country as index and columns, with the percentage of country records for each sourcecountry
Remove non-significant values from the pivot (in this case, those below 5%)
Sort both rows and columns of the pivot table (descending)
Import skill data from the DataLab and perform the following actions on it:
head
info
count null values
get the most frequent skill by country
PIVOT TABLE
In [199]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)
# query we want to run
query_1 = """
    SELECT idcountry, sourcecountry, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    WHERE sourcecountry in ('IT', 'UK', 'IE', 'CZ', 'FR', 'DE', 'ES', 'AT', 'PL', 'BE', 'NL', 'SE', 'LU')
    GROUP BY sourcecountry, idcountry
"""
# reading data using connection and query
documents = pd.read_sql(query_1, conn)
conn.close()
In [200]:
documents.head()
Out[200]:

idcountry sourcecountry num_ojv
0 NL SE 1778
1 CZ BE 60
2 FR UK 9429
3 PL IE 48
4 UK SE 2832

In [201]:

# grouping and calculating percentages
coutry_groupped = documents.groupby(['idcountry', 'sourcecountry'])\
    .sum().groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))\
    .reset_index()

# making the pivot table, filling blank cells with zero and rounding to one decimal
pivot_data = coutry_groupped.pivot(index='idcountry', columns='sourcecountry', values='num_ojv').fillna(0).round(1)
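The groupby(level=0) normalisation in the cell above can be a bit opaque; the same per-group percentage can also be sketched with transform. Toy numbers here, not the real counts:

```python
import pandas as pd

# hypothetical counts: 80 of the 100 'IT' ads come from Italian sources
df = pd.DataFrame({
    'idcountry':     ['IT', 'IT', 'DE'],
    'sourcecountry': ['IT', 'UK', 'DE'],
    'num_ojv':       [80, 20, 50],
})

totals = df.groupby(['idcountry', 'sourcecountry'])['num_ojv'].sum()

# divide each cell by the total of its idcountry (level 0 of the index)
pct = 100 * totals / totals.groupby(level=0).transform('sum')
print(pct.loc[('IT', 'IT')], pct.loc[('DE', 'DE')])  # 80.0 100.0
```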
In [202]:
pivot_data
Out[202]:

sourcecountry AT BE CZ DE ES FR IE IT LU NL PL SE UK
idcountry
AT 87.5 0.0 0.0 11.9 0.0 0.2 0.0 0.0 0.0 0.1 0.2 0.0 0.0
BE 0.1 94.5 0.0 1.2 0.1 1.5 0.3 0.1 0.0 1.4 0.4 0.0 0.2
CZ 0.0 0.0 99.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0
DE 2.2 0.0 0.0 97.3 0.0 0.2 0.0 0.0 0.0 0.0 0.2 0.1 0.0
ES 0.0 0.0 0.0 0.2 97.8 1.6 0.0 0.1 0.0 0.1 0.0 0.1 0.1
FR 0.0 0.4 0.0 0.3 0.1 98.8 0.1 0.1 0.0 0.1 0.0 0.0 0.1
IE 0.1 0.1 10.3 0.6 0.3 0.1 85.2 0.1 0.0 0.4 0.1 0.0 2.7
IT 0.2 0.2 0.0 0.3 0.2 0.3 0.0 98.5 0.0 0.1 0.1 0.1 0.0
LU 0.1 5.9 0.0 2.0 0.0 3.0 0.0 0.0 88.4 0.3 0.0 0.0 0.2
NL 0.1 0.4 0.0 1.4 0.1 0.0 0.0 0.0 0.0 97.4 0.3 0.1 0.2
PL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.8 0.1 0.0
SE 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.1 0.0 0.0 0.0 99.5 0.1
UK 0.1 0.0 0.0 0.6 0.0 0.4 0.2 0.0 0.0 0.3 0.0 0.0 98.4

DATA CLEANING

In [203]:

# We need to import numpy first
import numpy as np
In [204]:

# With this line of code we replace the values less than 5% with 0
# .apply --> applies a function to the datatable
# lambda x: do(x) --> a simple and fast way to write a function
# np.where(condition, something, something_else) --> similar to the =IF() function in Excel
pivot_data.apply(lambda x: np.where(x < 5, 0, x))
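To see np.where in isolation, here is a minimal example on one hypothetical row of percentages (made-up values):

```python
import numpy as np
import pandas as pd

# one hypothetical row of the pivot table
row = pd.Series([87.5, 0.2, 11.9, 4.9])

# keep values >= 5, zero out the rest (like =IF(x<5, 0, x) in Excel)
cleaned = np.where(row < 5, 0, row)
print(cleaned)
```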
Out[204]:

sourcecountry AT BE CZ DE ES FR IE IT LU NL PL SE UK
idcountry
AT 87.5 0.0 0.0 11.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
BE 0.0 94.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
CZ 0.0 0.0 99.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DE 0.0 0.0 0.0 97.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ES 0.0 0.0 0.0 0.0 97.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
FR 0.0 0.0 0.0 0.0 0.0 98.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0
IE 0.0 0.0 10.3 0.0 0.0 0.0 85.2 0.0 0.0 0.0 0.0 0.0 0.0
IT 0.0 0.0 0.0 0.0 0.0 0.0 0.0 98.5 0.0 0.0 0.0 0.0 0.0
LU 0.0 5.9 0.0 0.0 0.0 0.0 0.0 0.0 88.4 0.0 0.0 0.0 0.0
NL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 97.4 0.0 0.0 0.0
PL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.8 0.0 0.0
SE 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.5 0.0
UK 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 98.4

The skills

Importing skill data from Athena:
In [208]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_2 = """
    SELECT *
    FROM "AwsDataCatalog".cedefop_presentation.ft_skill_analysis_essnet
    ORDER BY RAND()
    LIMIT 1000;
"""

# reading data using connection and query
skills = pd.read_sql(query_2, conn)

# closing connection
conn.close()
In [209]:
skills.head()
Out[209]:
general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
0 163539137 17905 2019 1 9 18025
1 175883759 17935 2019 2 8 17995
2 166123483 17908 2019 1 12 18028
3 136630915 17852 2018 11 17 17972
4 82106708 17731 2018 7 19 17851
5 rows × 51 columns
In [210]:
skills.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 51 columns):
general_id             1000 non-null object
grab_date              1000 non-null int64
year_grab_date         1000 non-null int64
month_grab_date        1000 non-null int64
day_grab_date          1000 non-null int64
expire_date            1000 non-null int64
year_expire_date       1000 non-null int64
month_expire_date      1000 non-null int64
day_expire_date        1000 non-null int64
lang                   1000 non-null object
idesco_level_4         1000 non-null object
esco_level_4           1000 non-null object
idesco_level_3         1000 non-null object
esco_level_3           1000 non-null object
idesco_level_2         1000 non-null object
esco_level_2           1000 non-null object
idesco_level_1         1000 non-null object
esco_level_1           1000 non-null object
idescoskill_level_3    1000 non-null object
escoskill_level_3      1000 non-null object
idcity                 1000 non-null object
city                   1000 non-null object
idprovince             1000 non-null object
province               1000 non-null object
idregion               1000 non-null object
region                 1000 non-null object
idmacro_region         1000 non-null object
macro_region           1000 non-null object
idcountry              1000 non-null object
country                1000 non-null object
idcontract             1000 non-null object
contract               1000 non-null object
ideducational_level    1000 non-null object
educational_level      1000 non-null object
idsector               1000 non-null object
sector                 1000 non-null object
idmacro_sector         1000 non-null object
macro_sector           1000 non-null object
idcategory_sector      1000 non-null object
category_sector        1000 non-null object
idsalary               1000 non-null object
salary                 1000 non-null object
idworking_hours        1000 non-null object
working_hours          1000 non-null object
idexperience           1000 non-null object
experience             1000 non-null object
source_category        1000 non-null object
sourcecountry          1000 non-null object
source                 1000 non-null object
site                   1000 non-null object
companyname            1000 non-null object
dtypes: int64(8), object(43)
memory usage: 398.5+ KB

Null Values

To get the number of null cells in each column, we first use the .isnull( ) method, which returns True for each null cell and False for a non-null cell. We then sum these values to get the total number of null values per column:

In [211]:

skills.isnull().sum()

Out[211]:

general_id 0
grab_date 0
year_grab_date 0
month_grab_date 0
day_grab_date 0
expire_date 0
year_expire_date 0
month_expire_date 0
day_expire_date 0
lang 0
idesco_level_4 0
esco_level_4 0
idesco_level_3 0
esco_level_3 0
idesco_level_2 0
esco_level_2 0
idesco_level_1 0
esco_level_1 0
idescoskill_level_3 0
escoskill_level_3 0
idcity 0
city 0
idprovince 0
province 0
idregion 0
region 0
idmacro_region 0
macro_region 0
idcountry 0
country 0
idcontract 0
contract 0
ideducational_level 0
educational_level 0
idsector 0
sector 0
idmacro_sector 0
macro_sector 0
idcategory_sector 0
category_sector 0
idsalary 0
salary 0
idworking_hours 0
working_hours 0
idexperience 0
experience 0
source_category 0
sourcecountry 0
source 0
site 0
companyname 0
dtype: int64
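Our sample happens to contain no nulls, so every count above is zero. On a toy frame with genuinely missing values (hypothetical data), the same isnull().sum() pattern produces non-zero counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'salary': [1000, np.nan, 1200],
    'city':   ['Milano', 'Roma', None],
})

# isnull() marks missing cells True; summing counts them per column
nulls = df.isnull().sum()
print(nulls['salary'], nulls['city'])  # 1 1
```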
Finding top skills by country:
In [212]:
vals = []
countries = []
sk = []
for country in skills.country.unique():
    sag = skills[skills.country == country]['escoskill_level_3']
    # value_counts() sorts by frequency, so position 0 is the most frequent skill
    vals.append(sag.value_counts().iloc[0])
    sk.append(sag.value_counts().index[0])
    countries.append(country)
In [213]:

res = pd.DataFrame([countries, sk, vals]).T
res.columns = ['country', 'skill', 'count']
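As an aside, the loop above can be written more idiomatically with groupby and idxmax. A sketch on a hypothetical mini version of the skills table:

```python
import pandas as pd

skills_mini = pd.DataFrame({
    'country': ['IT', 'IT', 'IT', 'DE', 'DE'],
    'escoskill_level_3': ['communication', 'communication', 'SQL', 'Java', 'Java'],
})

# most frequent skill per country: value_counts() tallies the skills,
# idxmax() returns the label with the highest count
top_skill = (skills_mini.groupby('country')['escoskill_level_3']
                        .agg(lambda s: s.value_counts().idxmax()))
print(top_skill['IT'], top_skill['DE'])  # communication Java
```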
In [214]:
res
Out[214]:

country skill count
0 NEDERLAND proactivity 3
1 DEUTSCHLAND adapt to change 14
2 UNITED KINGDOM adapt to change 17
3 ESPAÑA ICT networking hardware 1
4 ÖSTERREICH adapt to change 4
5 ITALIA communication 3
6 BELGIQUE-BELGIË create solutions to problems 2
7 SVERIGE communicate with customers 3
8 FRANCE adapt to change 11
9 POLSKA manage time 2
10 ČESKÁ REPUBLIKA engineering processes 1
11 IRELAND communication 1

Case-Study : Digital Occupations
Using the provided list of Eurostat digital occupations, create a subset of the skills data which contains only these occupations
For each digital occupation, calculate the mixture of skills in percentage terms
Focus on programming languages
In [215]:
prof_digital = ['1330', '2511', '2512', '2513', '2514', '2519', '2521','2522', '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522']
In [218]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_1 = """
    SELECT *
    FROM (
        SELECT esco_level_4, escoskill_level_3, num_ojv,
               rank() over (partition by esco_level_4 order by num_ojv desc) as rank
        FROM (
            SELECT esco_level_4, escoskill_level_3, count(distinct general_id) as num_ojv
            FROM cedefop_presentation.ft_skill_analysis_essnet
            WHERE idesco_level_4 IN ('1330', '2511', '2512', '2513', '2514', '2519', '2521', '2522', '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522')
            GROUP BY esco_level_4, escoskill_level_3
            ORDER BY num_ojv DESC
        ) t1
    ) t2
    where rank <= 5
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)
conn.close()
In [220]:
documents[documents['esco_level_4']=='Software developers'].head(5)
Out[220]:

esco_level_4 escoskill_level_3 num_ojv rank
40 Software developers adapt to change 1279015 1
41 Software developers project management 1137579 2
42 Software developers computer programming 1101046 3
43 Software developers English 999963 4
44 Software developers teamwork principles 935304 5

PROGRAMMING LANGUAGES BY LOCATION

In [50]:

langs = ['SQL', 'Java', 'C#', 'Python', 'PHP', 'matlab', 'SAS language', 'C++']

In [221]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_1 = """
    SELECT idcountry, escoskill_level_3, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_skill_analysis_essnet
    WHERE escoskill_level_3 IN ('SQL', 'Java', 'C#', 'Python', 'PHP', 'matlab', 'SAS language', 'C++')
    GROUP BY idcountry, escoskill_level_3
    ORDER BY num_ojv DESC
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

conn.close()
In [225]:
# sorting the language counts, highest first
lang_g = documents.reset_index()\
    .sort_values(['num_ojv'], ascending=False)
In [226]:
lang_g[lang_g['idcountry']=='UK'].head(20)
Out[226]:

index idcountry escoskill_level_3 num_ojv
0 0 UK SQL 450804
3 3 UK Java 225294
4 4 UK Python 171930
8 8 UK C# 115636
9 9 UK PHP 112480
12 12 UK C++ 84014
42 42 UK SAS language 13227
56 56 UK matlab 7254

Compare with DE...

In [227]:

lang_g[lang_g['idcountry']=='DE'].head(20)

Out[227]:

index idcountry escoskill_level_3 num_ojv
1 1 DE SQL 241867
2 2 DE Java 233301
7 7 DE C++ 124366
10 10 DE PHP 106713
14 14 DE Python 82810
24 24 DE C# 39666
26 26 DE matlab 29322
51 51 DE SAS language 9235
... or pivoting the data
In [230]:
lang_g.pivot(index='escoskill_level_3',columns='idcountry', values='num_ojv').fillna(0)
Out[230]:

idcountry AT BE CZ DE ES FR IE
escoskill_level_3
C# 4168.0 4794.0 1172.0 39666.0 17889.0 21808.0 14227.0 13341.0
C++ 11446.0 2774.0 2090.0 124366.0 15214.0 41115.0 2644.0 13002.0
Java 22522.0 13014.0 5013.0 233301.0 47641.0 124376.0 13060.0 40730.0
PHP 6735.0 3934.0 1405.0 106713.0 20267.0 58000.0 3333.0 14702.0
Python 5699.0 4589.0 1894.0 82810.0 17914.0 50950.0 7813.0 7910.0
SAS language 323.0 708.0 0.0 9235.0 683.0 103818.0 358.0 880.0
SQL 27988.0 23465.0 8221.0 241867.0 51685.0 126214.0 22828.0 52186.0
matlab 2095.0 250.0 0.0 29322.0 259.0 5387.0 0.0 1171.0

Case-Study : Education vs Experience

Creating a bubble chart for Education and Experience, with the number of records as the size of the bubbles:
In [252]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_1 = """
    SELECT ideducational_level, educational_level, idexperience, experience, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    WHERE idesco_level_4 IN ('1330', '2511', '2512', '2513', '2514', '2519', '2521', '2522', '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522')
    GROUP BY ideducational_level, educational_level, idexperience, experience
    ORDER BY ideducational_level ASC, idexperience ASC
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)
conn.close()
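For readers more comfortable with pandas than SQL, the aggregation in the query above (`COUNT(DISTINCT general_id)` grouped by education and experience) can be reproduced with `groupby` and `nunique`. This is a sketch on a toy frame with made-up values, not the real `ft_document_essnet` table:

```python
import pandas as pd

# toy postings frame shaped like the query's input columns (values are made up)
postings = pd.DataFrame({
    'general_id': ['a', 'a', 'b', 'c', 'd'],
    'ideducational_level': ['6', '6', '6', '7', '7'],
    'educational_level': ['Bachelor or equivalent'] * 3 + ['Master or equivalent'] * 2,
    'idexperience': ['1', '1', '1', '2', '2'],
    'experience': ['No experience'] * 3 + ['Up to 1 year'] * 2,
})

# pandas equivalent of COUNT(DISTINCT general_id) ... GROUP BY ...
agg = (postings
       .groupby(['ideducational_level', 'educational_level',
                 'idexperience', 'experience'])['general_id']
       .nunique()
       .reset_index(name='num_ojv'))
print(agg)
```

Note that `nunique()` counts distinct `general_id` values, so the duplicated posting `'a'` is counted once, just as `COUNT(DISTINCT ...)` would.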
In [253]:
# the query already aggregated the data for the desired columns;
# reset the index so it becomes an explicit column
edu_exp = documents.reset_index()
In [254]:
edu_exp
Out[254]:
    index ideducational_level                      educational_level idexperience          experience  num_ojv
0       0                                                                                                33289
1       1                                                                       1       No experience     5324
2       2                                                                       2        Up to 1 year    13588
3       3                                                                       3   From 1 to 2 years     4210
4       4                                                                       4   From 2 to 4 years     5046
5       5                                                                       5   From 4 to 6 years      939
6       6                                                                       6   From 6 to 8 years      224
7       7                                                                       7  From 8 to 10 years      111
8       8                                                                       8       Over 10 years     7605
9       9                   1                      Primary education                                      8167
10     10                   1                      Primary education           1       No experience      1468
11     11                   1                      Primary education           2        Up to 1 year      6732
12     12                   1                      Primary education           3   From 1 to 2 years      1900
13     13                   1                      Primary education           4   From 2 to 4 years      2569
14     14                   1                      Primary education           5   From 4 to 6 years       393
15     15                   1                      Primary education           6   From 6 to 8 years        73
16     16                   1                      Primary education           7  From 8 to 10 years        26
17     17                   1                      Primary education           8       Over 10 years      2033
18     18                   2              Lower secondary education                                    151209
19     19                   2              Lower secondary education           1       No experience      3935
20     20                   2              Lower secondary education           2        Up to 1 year     98522
21     21                   2              Lower secondary education           3   From 1 to 2 years     28474
22     22                   2              Lower secondary education           4   From 2 to 4 years     19069
23     23                   2              Lower secondary education           5   From 4 to 6 years      4330
24     24                   2              Lower secondary education           6   From 6 to 8 years      1525
25     25                   2              Lower secondary education           7  From 8 to 10 years       684
26     26                   2              Lower secondary education           8       Over 10 years     44730
27     27                   3  Post-secondary non-tertiary education                                    254480
28     28                   3  Post-secondary non-tertiary education           1       No experience     10386
29     29                   3  Post-secondary non-tertiary education           2        Up to 1 year    257372
...   ...                 ...                                    ...         ...                 ...       ...
51     51                   5         Short-cycle tertiary education           6   From 6 to 8 years      7097
52     52                   5         Short-cycle tertiary education           7  From 8 to 10 years      2220
53     53                   5         Short-cycle tertiary education           8       Over 10 years    154879
54     54                   6                 Bachelor or equivalent                                    333980
55     55                   6                 Bachelor or equivalent           1       No experience     20851
56     56                   6                 Bachelor or equivalent           2        Up to 1 year    257164
57     57                   6                 Bachelor or equivalent           3   From 1 to 2 years     52426
58     58                   6                 Bachelor or equivalent           4   From 2 to 4 years     84984
59     59                   6                 Bachelor or equivalent           5   From 4 to 6 years     16845
60     60                   6                 Bachelor or equivalent           6   From 6 to 8 years      7468
61     61                   6                 Bachelor or equivalent           7  From 8 to 10 years      2467
62     62                   6                 Bachelor or equivalent           8       Over 10 years    116087
63     63                   7                   Master or equivalent                                    172997
64     64                   7                   Master or equivalent           1       No experience     10289
65     65                   7                   Master or equivalent           2        Up to 1 year    129748
66     66                   7                   Master or equivalent           3   From 1 to 2 years     21435
67     67                   7                   Master or equivalent           4   From 2 to 4 years     43954
68     68                   7                   Master or equivalent           5   From 4 to 6 years     12525
69     69                   7                   Master or equivalent           6   From 6 to 8 years      2131
70     70                   7                   Master or equivalent           7  From 8 to 10 years       778
71     71                   7                   Master or equivalent           8       Over 10 years     35596
72     72                   8                 Doctoral or equivalent                                     20955
73     73                   8                 Doctoral or equivalent           1       No experience       740
74     74                   8                 Doctoral or equivalent           2        Up to 1 year     17649
75     75                   8                 Doctoral or equivalent           3   From 1 to 2 years      3008
76     76                   8                 Doctoral or equivalent           4   From 2 to 4 years      4720
77     77                   8                 Doctoral or equivalent           5   From 4 to 6 years      1391
78     78                   8                 Doctoral or equivalent           6   From 6 to 8 years       245
79     79                   8                 Doctoral or equivalent           7  From 8 to 10 years        58
80     80                   8                 Doctoral or equivalent           8       Over 10 years      4505
81 rows × 6 columns
There are some missing values which we should remove before plotting. In this case the missing data are indicated by empty strings (""):
In [255]:
# numpy provides np.nan, which in Python represents missing data
import numpy as np

# replacing "" with np.nan
edu_exp.replace('', np.nan, inplace=True)
# removing rows with "any" missing value
edu_exp.dropna(inplace=True)
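As a self-contained illustration of this two-step cleanup, here it is applied to a toy frame (the three rows use a few of the counts from the table above but are otherwise illustrative):

```python
import numpy as np
import pandas as pd

# toy frame where empty strings stand in for missing values
df = pd.DataFrame({
    'educational_level': ['Primary education', '', 'Master or equivalent'],
    'experience': ['No experience', 'Up to 1 year', ''],
    'num_ojv': [1468, 13588, 172997],
})

# '' -> NaN so pandas recognises the cells as missing, then drop
# every row that has a missing value in any column
df = df.replace('', np.nan).dropna()
print(df)
```

Only the first row survives, because it is the only one with no empty string in either text column.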
Now our table is ready for plotting:
In [256]:
edu_exp
Out[256]:
    index ideducational_level                      educational_level idexperience          experience  num_ojv
10     10                   1                      Primary education           1       No experience      1468
11     11                   1                      Primary education           2        Up to 1 year      6732
12     12                   1                      Primary education           3   From 1 to 2 years      1900
13     13                   1                      Primary education           4   From 2 to 4 years      2569
14     14                   1                      Primary education           5   From 4 to 6 years       393
15     15                   1                      Primary education           6   From 6 to 8 years        73
16     16                   1                      Primary education           7  From 8 to 10 years        26
17     17                   1                      Primary education           8       Over 10 years      2033
19     19                   2              Lower secondary education           1       No experience      3935
20     20                   2              Lower secondary education           2        Up to 1 year     98522
21     21                   2              Lower secondary education           3   From 1 to 2 years     28474
22     22                   2              Lower secondary education           4   From 2 to 4 years     19069
23     23                   2              Lower secondary education           5   From 4 to 6 years      4330
24     24                   2              Lower secondary education           6   From 6 to 8 years      1525
25     25                   2              Lower secondary education           7  From 8 to 10 years       684
26     26                   2              Lower secondary education           8       Over 10 years     44730
28     28                   3  Post-secondary non-tertiary education           1       No experience     10386
29     29                   3  Post-secondary non-tertiary education           2        Up to 1 year    257372
30     30                   3  Post-secondary non-tertiary education           3   From 1 to 2 years     34109
31     31                   3  Post-secondary non-tertiary education           4   From 2 to 4 years     53213
32     32                   3  Post-secondary non-tertiary education           5   From 4 to 6 years      8818
33     33                   3  Post-secondary non-tertiary education           6   From 6 to 8 years      2342
34     34                   3  Post-secondary non-tertiary education           7  From 8 to 10 years      1146
35     35                   3  Post-secondary non-tertiary education           8       Over 10 years    112518
37     37                   4              Upper secondary education           1       No experience     21002
38     38                   4              Upper secondary education           2        Up to 1 year    220378
39     39                   4              Upper secondary education           3   From 1 to 2 years     37237
40     40                   4              Upper secondary education           4   From 2 to 4 years     48012
41     41                   4              Upper secondary education           5   From 4 to 6 years      8003
42     42                   4              Upper secondary education           6   From 6 to 8 years      2696
...   ...                 ...                                    ...         ...                 ...       ...
48     48                   5         Short-cycle tertiary education           3   From 1 to 2 years    118960
49     49                   5         Short-cycle tertiary education           4   From 2 to 4 years    151542
50     50                   5         Short-cycle tertiary education           5   From 4 to 6 years     31885
51     51                   5         Short-cycle tertiary education           6   From 6 to 8 years      7097
52     52                   5         Short-cycle tertiary education           7  From 8 to 10 years      2220
53     53                   5         Short-cycle tertiary education           8       Over 10 years    154879
55     55                   6                 Bachelor or equivalent           1       No experience     20851
56     56                   6                 Bachelor or equivalent           2        Up to 1 year    257164
57     57                   6                 Bachelor or equivalent           3   From 1 to 2 years     52426
58     58                   6                 Bachelor or equivalent           4   From 2 to 4 years     84984
59     59                   6                 Bachelor or equivalent           5   From 4 to 6 years     16845
60     60                   6                 Bachelor or equivalent           6   From 6 to 8 years      7468
61     61                   6                 Bachelor or equivalent           7  From 8 to 10 years      2467
62     62                   6                 Bachelor or equivalent           8       Over 10 years    116087
64     64                   7                   Master or equivalent           1       No experience     10289
65     65                   7                   Master or equivalent           2        Up to 1 year    129748
66     66                   7                   Master or equivalent           3   From 1 to 2 years     21435
67     67                   7                   Master or equivalent           4   From 2 to 4 years     43954
68     68                   7                   Master or equivalent           5   From 4 to 6 years     12525
69     69                   7                   Master or equivalent           6   From 6 to 8 years      2131
70     70                   7                   Master or equivalent           7  From 8 to 10 years       778
71     71                   7                   Master or equivalent           8       Over 10 years     35596
73     73                   8                 Doctoral or equivalent           1       No experience       740
74     74                   8                 Doctoral or equivalent           2        Up to 1 year     17649
75     75                   8                 Doctoral or equivalent           3   From 1 to 2 years      3008
76     76                   8                 Doctoral or equivalent           4   From 2 to 4 years      4720
77     77                   8                 Doctoral or equivalent           5   From 4 to 6 years      1391
78     78                   8                 Doctoral or equivalent           6   From 6 to 8 years       245
79     79                   8                 Doctoral or equivalent           7  From 8 to 10 years        58
80     80                   8                 Doctoral or equivalent           8       Over 10 years      4505
64 rows × 6 columns
In [257]:
# matplotlib is needed for plotting
import matplotlib.pyplot as plt

# initial variables
plt.rcParams['figure.figsize'] = (20, 8)
fig = plt.figure()

# plotting bubbles
plt.scatter(edu_exp.educational_level, edu_exp.experience,
            s=edu_exp.num_ojv / 100.0, alpha=0.5)

# rotating x ticks
fig.autofmt_xdate(rotation=90)

# adding a title to the plot
plt.title('Education vs. Experience - Digital Occupations', fontsize=20)

# plotting!
plt.show()
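A self-contained version of the same bubble chart, runnable outside the notebook, might look like the sketch below. It uses a two-row toy frame (two of the Bachelor/Master counts from the table above; the pairing of rows is illustrative) and the non-interactive Agg backend so no display is needed:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# toy education/experience counts (illustrative pairing of values)
edu_exp = pd.DataFrame({
    'educational_level': ['Bachelor or equivalent', 'Master or equivalent'],
    'experience': ['No experience', 'Up to 1 year'],
    'num_ojv': [20851, 129748],
})

fig = plt.figure(figsize=(10, 4))

# bubble area is driven by the record count, scaled down to stay readable
bubbles = plt.scatter(edu_exp.educational_level, edu_exp.experience,
                      s=edu_exp.num_ojv / 100.0, alpha=0.5)

fig.autofmt_xdate(rotation=90)
plt.title('Education vs. Experience - Digital Occupations')

# save instead of plt.show(), since Agg cannot open a window
fig.savefig('bubbles.png')
```

Dividing `num_ojv` by 100 is a plotting convenience: `scatter`'s `s` argument is the marker area in points squared, so raw counts in the hundreds of thousands would swamp the axes.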