Cedefop DataLAB
Access to data

In order to get data from Athena or Hive we need to have two things:
A Query
A Connection
First, let's import the pyathenajdbc and pandas libraries:
In [1]:
from pyathenajdbc import connect
import pandas as pd
With these libraries being imported, we can create our connection and run our queries:
In [2]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)
# query we want to run
query_1 = """
    SELECT *
    FROM cedefop_presentation.ft_document_essnet
    ORDER BY RAND()
    LIMIT 1000;
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)
# closing connection
conn.close()
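As an aside, the connect / query / close pattern above can also be wrapped in contextlib.closing so the connection is released even if the query fails. A minimal sketch, using the standard-library sqlite3 as a stand-in for the Athena connection (the table and values here are made up for illustration):

```python
import sqlite3
from contextlib import closing

import pandas as pd

# sqlite3 stands in here for the Athena connection; the pattern is the same.
# closing() guarantees conn.close() runs even if the query raises an error.
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute("CREATE TABLE ads (general_id INTEGER)")
    conn.executemany("INSERT INTO ads VALUES (?)", [(1,), (2,), (3,)])
    df = pd.read_sql("SELECT * FROM ads", conn)

print(len(df))  # 3
```

The same `with closing(connect(...)) as conn:` shape works for the pyathenajdbc connection used throughout this notebook.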
Understanding our data
There are simple methods and attributes in the Pandas library which allow us to get to know our data:

shape
head
tail
sample
describe
info
General info
info( ) is a useful method which gives us various information about the data we've just imported:

column names
the data type of each column
the memory usage of the data
the number of non-null values in each column
In [3]:
documents.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 49 columns):
general_id             1000 non-null object
grab_date              1000 non-null int64
year_grab_date         1000 non-null int64
month_grab_date        1000 non-null int64
day_grab_date          1000 non-null int64
expire_date            1000 non-null int64
year_expire_date       1000 non-null int64
month_expire_date      1000 non-null int64
day_expire_date        1000 non-null int64
lang                   1000 non-null object
idesco_level_4         1000 non-null object
esco_level_4           1000 non-null object
idesco_level_3         1000 non-null object
esco_level_3           1000 non-null object
idesco_level_2         1000 non-null object
esco_level_2           1000 non-null object
idesco_level_1         1000 non-null object
esco_level_1           1000 non-null object
idcity                 1000 non-null object
city                   1000 non-null object
idprovince             1000 non-null object
province               1000 non-null object
idregion               1000 non-null object
region                 1000 non-null object
idmacro_region         1000 non-null object
macro_region           1000 non-null object
idcountry              1000 non-null object
country                1000 non-null object
idcontract             1000 non-null object
contract               1000 non-null object
ideducational_level    1000 non-null object
educational_level      1000 non-null object
idsector               1000 non-null object
sector                 1000 non-null object
idmacro_sector         1000 non-null object
macro_sector           1000 non-null object
idcategory_sector      1000 non-null object
category_sector        1000 non-null object
idsalary               1000 non-null object
salary                 1000 non-null object
idworking_hours        1000 non-null object
working_hours          1000 non-null object
idexperience           1000 non-null object
experience             1000 non-null object
source_category        1000 non-null object
sourcecountry          1000 non-null object
source                 1000 non-null object
site                   1000 non-null object
companyname            1000 non-null object
dtypes: int64(8), object(41)
memory usage: 382.9+ KB
Shape of data
To quickly get the number of rows and columns of your data table you can use the shape attribute:
In [4]:

documents.shape

Out[4]:

(1000, 49)

If you are just interested in the number of records, you can use the len( ) function instead:

In [5]:

len(documents)

Out[5]:

1000

Data preview

To take a sneak peek at the data you can use head, tail or sample:

Getting the first n rows of the table:

In [6]:

documents.head(10)
# try using .head( ) without specifying the number of rows
Out[6]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
0 89833230 17758 2018 8 15 17878
1 148565316 17872 2018 12 7 17932
2 168716345 17914 2019 1 18 18034
3 168990410 17915 2019 1 19 18035
4 86938457 17742 2018 7 30 17784
5 208673305 17959 2019 3 4 18079
6 166831480 17908 2019 1 12 18028
7 119787491 17825 2018 10 21 17945
8 247840311 17976 2019 3 21 18096
9 79270616 17723 2018 7 11 17843
10 rows × 49 columns

Getting the last n rows of the table:

In [7]:

documents.tail(3)
Out[7]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
997 177733340 17937 2019 2 10 18057
998 145506827 17871 2018 12 6 17886
999 113561477 17807 2018 10 3 17927
3 rows × 49 columns

Getting a random sample of n rows from the table:

In [8]:

documents.sample(4)
# try using .sample( ) without specifying the sample size

Out[8]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
353 178790867 17923 2019 1 27 18043
745 172033162 17923 2019 1 27 18043
650 78356125 17725 2018 7 13 17845
391 171295029 17921 2019 1 25 18041
4 rows × 49 columns

Descriptive statistics

Using the describe( ) method you'll get a table with simple statistics for both numerical and categorical features:

In [9]:

documents.describe(include='all').round(1)
Out[9]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
count 1000 1000.0 1000.0 1000.0 1000.0 1000.0
unique 1000 NaN NaN NaN NaN NaN
top 166663080 NaN NaN NaN NaN NaN
freq 1 NaN NaN NaN NaN NaN
mean NaN 17847.8 2018.3 7.2 15.4 17953.6
std NaN 72.2 0.5 4.0 8.7 76.5
min NaN 17714.0 2018.0 1.0 1.0 17748.0
25% NaN 17787.0 2018.0 3.0 8.0 17887.0
50% NaN 17852.0 2018.0 8.0 15.0 17960.0
75% NaN 17907.0 2019.0 11.0 23.0 18020.0
max NaN 17986.0 2019.0 12.0 31.0 18106.0
11 rows × 49 columns

In case you want statistics just for the numeric columns, use documents.describe( ).round(1) instead.

Data pre-processing

Almost always, before passing to the processing phase, we need to perform several levels of pre-processing. One of the most important pre-processing tasks is deduplication:

Deduplication

For various reasons we may have duplicated rows in our data. Sometimes these rows are completely identical and sometimes, as in our example, duplicated rows represent the same job announcement coming from different sources. In this case, in order to identify and eliminate these records, we should use the general_id field:

In [10]:

documents = documents.drop_duplicates(subset=['general_id'])
The reason we used subset=['general_id'] is that in this example we're not looking for exactly identical rows: any two records with the same general_id are considered duplicates.
Since we only fetched 1000 random records from Athena, there are no duplicated records in this subset. That's why, if we count the records of the deduplicated table, we still have 1000 records:
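To see the effect of subset= in isolation, here is a toy example on made-up data (not the DataLab table): two rows share a general_id but come from different sources, so only deduplication on that column removes one of them.

```python
import pandas as pd

# hypothetical mini-table: two rows share general_id 'A' but differ in 'source'
ads = pd.DataFrame({
    'general_id': ['A', 'A', 'B'],
    'source':     ['site1', 'site2', 'site1'],
})

# without subset= nothing is dropped (the rows are not fully identical);
# with subset=['general_id'] the second 'A' row is removed
deduped = ads.drop_duplicates(subset=['general_id'])
print(len(ads.drop_duplicates()), len(deduped))  # 3 2
```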
In [11]:
len(documents)
Out[11]:

1000

Data aggregation and manipulation

Filtering data

Filtering data for a specific country:

In [16]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_1 = """
    SELECT COUNT(DISTINCT GENERAL_ID) as num_job_vacancy
    FROM cedefop_presentation.ft_document_essnet
    WHERE country = 'UNITED KINGDOM'
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

In [17]:

print(documents)

   num_job_vacancy
0         13303636
In [70]:

# query we want to run
query_1 = """
    SELECT COUNTRY, COUNT(DISTINCT GENERAL_ID) as num_job_vacancy
    FROM cedefop_presentation.ft_document_essnet
    GROUP BY COUNTRY
    ORDER BY num_job_vacancy desc
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents.head(20)

Out[70]:

COUNTRY num_job_vacancy
0 DEUTSCHLAND 15679793
1 UNITED KINGDOM 13303636
2 FRANCE 11959196
3 NEDERLAND 3483875
4 ITALIA 2371035
5 ESPAÑA 1832400
6 BELGIQUE-BELGIË 1802390
7 POLSKA 1444290
8 ÖSTERREICH 1413784
9 SVERIGE 1063498
10 IRELAND 570060
11 ČESKÁ REPUBLIKA 541947
12 LUXEMBOURG 77559

Multiple filtering on year and month:

In [20]:

# query we want to run
query_1 = """
    SELECT *
    FROM cedefop_presentation.ft_document_essnet
    WHERE country = 'UNITED KINGDOM'
      and year_grab_date = 2018
      and month_grab_date = 12
    limit 100
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents.head()
Out[20]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
0 146726257 17871 2018 12 6 17991
1 152031314 17876 2018 12 11 17996
2 152222010 17876 2018 12 11 17996
3 152155722 17878 2018 12 13 17998
4 152152423 17878 2018 12 13 17998
5 rows × 49 columns

Filtering on a specific occupation:

In [24]:

# query we want to run
query_1 = """
    SELECT esco_level_4, count(distinct general_id) as num_ojv
    FROM "AwsDataCatalog".cedefop_presentation.ft_document_essnet
    WHERE country = 'UNITED KINGDOM'
      and year_grab_date = 2018
      and esco_level_4 = 'Software developers'
    GROUP BY esco_level_4
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents

Out[24]:

esco_level_4 num_ojv
0 Software developers 402245

Group by

Top 15 occupations by country in 2018:

In [95]:

# query we want to run
query_1 = """
    SELECT *
    FROM (
        SELECT idcountry, esco_level_4, num_ojv,
               rank() over (partition by idcountry order by num_ojv desc) as rank
        FROM (
            SELECT idcountry, esco_level_4, count(distinct general_id) as num_ojv
            FROM cedefop_presentation.ft_document_essnet
            WHERE year_grab_date = 2018
            GROUP BY esco_level_4, idcountry
            ORDER BY num_ojv DESC
        ) t1
    ) t2
    where rank <= 15
"""

# reading data using connection and query
top_15_occ_df = pd.read_sql(query_1, conn)
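As an aside, the rank() over (partition by ... order by ...) pattern used in this query can be reproduced in pandas with groupby + rank. A small sketch on made-up counts (not the DataLab data):

```python
import pandas as pd

# hypothetical counts per (country, occupation)
df = pd.DataFrame({
    'idcountry':  ['IT', 'IT', 'IT', 'DE', 'DE'],
    'occupation': ['a', 'b', 'c', 'x', 'y'],
    'num_ojv':    [10, 30, 20, 5, 15],
})

# equivalent of: rank() over (partition by idcountry order by num_ojv desc)
df['rank'] = df.groupby('idcountry')['num_ojv'].rank(method='min', ascending=False)

# keep the top 2 occupations per country
top2 = df[df['rank'] <= 2]
print(sorted(top2['occupation']))  # ['b', 'c', 'x', 'y']
```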
In [96]:

# reindexing the table
# (note: set_index returns a new frame; without assignment top_15_occ_df is unchanged)
top_15_occ_df.set_index(["idcountry","esco_level_4"])

# renaming the columns
top_15_occ_df.columns = ['idcountry', 'occupation', 'count', 'rank']

# rearranging the columns
top_15_occ_df = top_15_occ_df[['idcountry', 'occupation', 'count', 'rank']]

# sorting the table
top_15_occ_df = top_15_occ_df.sort_values(['count'], ascending=False)
Let's take a look at the result:
In [97]:
top_15_occ_df[top_15_occ_df['idcountry']=='IT'].head(15)
Out[97]:

idcountry occupation count rank
90 IT Freight handlers 72314 1
91 IT Shop sales assistants 70410 2
92 IT Software developers 49289 3
93 IT Cleaners and helpers in offices, hotels and ot... 48368 4
94 IT Manufacturing labourers not elsewhere classified 41240 5
95 IT Administrative and executive secretaries 38120 6
96 IT Draughtspersons 32772 7
97 IT Commercial sales representatives 29679 8
98 IT Assemblers not elsewhere classified 28718 9
99 IT Metal working machine tool setters and operators 27659 10
100 IT Systems analysts 26714 11
101 IT Accounting and bookkeeping clerks 25259 12
102 IT Advertising and marketing professionals 24595 13
103 IT Electrical mechanics and fitters 24527 14
104 IT Retail and wholesale trade managers 22844 15

Top 5 sectors in 2019, by country:

In [106]:

# query we want to run
query_1 = """
    SELECT *
    FROM (
        SELECT idcountry, macro_sector, num_ojv,
               rank() over (partition by idcountry order by num_ojv desc) as rank
        FROM (
            SELECT idcountry, macro_sector, count(distinct general_id) as num_ojv
            FROM cedefop_presentation.ft_document_essnet
            WHERE year_grab_date = 2019
            GROUP BY macro_sector, idcountry
            ORDER BY num_ojv DESC
        ) t1
    ) t2
    where rank <= 15
"""

# reading data using connection and query
top_15_sect_df = pd.read_sql(query_1, conn)
In [107]:

# reindexing the table
top_15_sect_df.set_index(["idcountry","macro_sector"])

# renaming the columns
top_15_sect_df.columns = ['idcountry', 'macro_sector', 'count', 'rank']

# rearranging the columns
top_15_sect_df = top_15_sect_df[['idcountry', 'macro_sector', 'count', 'rank']]

# sorting the table
top_15_sect_df = top_15_sect_df.sort_values(['count'], ascending=False)
In [108]:
top_15_sect_df[top_15_sect_df['idcountry']=='UK'].head(5)
Out[108]:

idcountry macro_sector count rank
180 UK Administrative and support service activities 767198 1
181 UK Professional, scientific and technical activit... 710293 2
182 UK Human health and social work activities 458694 3
183 UK Information and communication 317472 4
184 UK Other service activities 215169 5

Top 5 occupations by country and sector:

In [146]:

# query we want to run
query_1 = """
    SELECT *
    FROM (
        SELECT idcountry, idmacro_sector, macro_sector, esco_level_4, num_ojv,
               rank() over (partition by idcountry, macro_sector, idmacro_sector order by num_ojv desc) as rank
        FROM (
            SELECT idcountry, idmacro_sector, macro_sector, esco_level_4, count(distinct general_id) as num_ojv
            FROM cedefop_presentation.ft_document_essnet
            WHERE year_grab_date = 2019
            GROUP BY idmacro_sector, macro_sector, esco_level_4, idcountry
            ORDER BY num_ojv DESC
        ) t1
    ) t2
    where rank <= 5
"""

# reading data using connection and query
top15_occ_by_count_sector = pd.read_sql(query_1, conn)
In [147]:

# reindexing the table
top15_occ_by_count_sector.set_index(["idcountry","idmacro_sector","macro_sector",'esco_level_4'])

# renaming the columns
top15_occ_by_count_sector.columns = ['idcountry', 'idmacro_sector', 'macro_sector', 'occupation', 'count', 'rank']

# rearranging the columns
top15_occ_by_count_sector = top15_occ_by_count_sector[['idcountry', 'idmacro_sector', 'macro_sector', 'occupation', 'count', 'rank']]

# sorting the table
top15_occ_by_count_sector = top15_occ_by_count_sector.sort_values(['count'], ascending=False)
In [148]:
top15_occ_by_count_sector.head(10)
Out[148]:

idcountry idmacro_sector macro_sector occupation count rank
460 DE G Wholesale and retail trade; repair of motor ve... Shop sales assistants 95170 1
1274 UK J Information and communication Software developers 78750 1
1393 UK Q Human health and social work activities Nursing professionals 72297 1
365 DE N Administrative and support service activities Administrative and executive secretaries 60448 1
385 DE M Professional, scientific and technical activit... Systems analysts 55686 1
375 DE J Information and communication Software developers 49373 1
380 DE C Manufacturing Manufacturing labourers not elsewhere classified 47292 1
370 DE Q Human health and social work activities Health care assistants 46932 1
386 DE M Professional, scientific and technical activit... Engineering professionals not elsewhere classi... 46488 2
366 DE N Administrative and support service activities Manufacturing labourers not elsewhere classified 45413 2
In [150]:
top15_occ_by_count_sector[(top15_occ_by_count_sector['idcountry'] == 'DE') & (top15_occ_by_count_sector['idmacro_sector'] == 'J')].head(5)
Out[150]:

idcountry idmacro_sector macro_sector occupation count rank
375 DE J Information and communication Software developers 49373 1
376 DE J Information and communication Systems analysts 27968 2
377 DE J Information and communication Systems administrators 10222 3
378 DE J Information and communication Engineering professionals not elsewhere classi... 6989 4
379 DE J Information and communication Advertising and marketing professionals 6616 5

Data Visualization

Let's make a simple plot which shows the number of announcements per month. To do so, first we should group our data by year and month and then count the records:

In [158]:

# query we want to run
query_1 = """
    SELECT idcountry, year_grab_date, month_grab_date, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    GROUP BY idcountry, year_grab_date, month_grab_date
"""

# reading data using connection and query
date_groupped = pd.read_sql(query_1, conn)
In [161]:

date_groupped.reset_index()
date_groupped.head()

Out[161]:

idcountry year_grab_date month_grab_date num_ojv
0 IT 2018 7 230650
1 BE 2019 3 192837
2 BE 2018 12 180383
3 ES 2018 7 170325
4 BE 2018 8 192713

Now that we have our aggregated data, we can start plotting. The Python community offers a wide range of visualization packages, but here we stick with matplotlib, a classic choice!

In [160]:

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

In [165]:

# getting the number of records per month
counts = date_groupped.groupby(['year_grab_date','month_grab_date']).sum()
counts

Out[165]:

num_ojv
year_grab_date month_grab_date
2018 7 5142680
8 5271987
9 6237170
10 6619957
11 8683441
12 6446523
2019 1 8013169
2 4566919
3 4561617
In [166]:

# creating a monthly date range and setting it as the index of our data
# (pd.date_range replaces the older DatetimeIndex(start=..., end=...) constructor)
counts.index = pd.date_range(start='2018-07-01', end='2019-03-31', freq='MS')

# setting the size of the plot
fig, ax = plt.subplots(figsize=(15, 7))

# plot the data (blue lines)
plt.plot(counts.index, counts)

# plot the data (black dots)
plt.scatter(counts.index, counts, c='k', zorder=10)

# setting the x ticks of the plot as the index of the data (dates)
plt.xticks(counts.index)

# setting the X and Y axis labels
plt.xlabel('Date')
plt.ylabel('# Announcements')

# changing the format of the date ticks
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

# setting the title of the plot
plt.title('Total Monthly Number of Announcements', fontsize=20)

# drawing the plot
plt.show()
Now let's repeat what we have just done, using a subset of the data filtered to one country:
In [174]:

filter_country = 'DE'
data_by_country = date_groupped[date_groupped.idcountry == filter_country]
data_by_country
Out[174]:

idcountry year_grab_date month_grab_date num_ojv
27 DE 2019 1 2280502
29 DE 2018 9 1487587
40 DE 2018 7 1345705
42 DE 2018 12 1782813
55 DE 2019 2 1525130
72 DE 2019 3 1487849
83 DE 2018 11 2373502
95 DE 2018 8 1535226
102 DE 2018 10 1861479

The other steps are identical to what we did for the previous plot:
In [175]:

# sort by year and month first so the values line up with the monthly date index
counts = data_by_country.sort_values(['year_grab_date', 'month_grab_date']).num_ojv
counts.index = pd.date_range(start='2018-07-01', end='2019-03-31', freq='MS')

fig, ax = plt.subplots(figsize=(15, 7))
plt.plot(counts.index, counts)
plt.scatter(counts.index, counts, c='k', zorder=10)
plt.xticks(counts.index)
plt.xlabel('Date')
plt.ylabel('# Announcements')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
plt.title(f'Monthly Number of Announcements - {filter_country}', fontsize=20)
plt.show()
OK, let's try another type of visualization: the pie chart.
We want to add two filters, on city and occupation, and plot a pie chart of the distribution of contract types.
Note: ISCO/ESCO code 25 -> some ICT occupations
In [180]:
# query we want to run
query_1 = """
    SELECT contract, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    WHERE city = 'Milano' and idesco_level_2 = '25'
    GROUP BY contract
"""
# reading data using connection and query
filtered = pd.read_sql(query_1, conn)
In [181]:
# the query already returns the distinct count for each contract type,
# so we set contract as the index and plot the counts
pie_data = filtered.set_index('contract')

# Notice that we're not directly using Matplotlib as we did for the previous plots.
# Pandas actually uses Matplotlib under the hood, so for simple plots like this one
# you can use the integrated visualizations of pandas without explicitly calling
# Matplotlib functions
pie_data.plot.pie(y='num_ojv', figsize=(8, 8))
plt.show()
Repeating the previous plot, this time for working hours:
In [182]:
# query we want to run
query_1 = """
    SELECT working_hours, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    WHERE city = 'Milano' and idesco_level_2 = '25'
    GROUP BY working_hours
"""

# reading data using connection and query
filtered = pd.read_sql(query_1, conn)

# as before, the query already returns the count per working_hours value
pie_data = filtered.set_index('working_hours')
pie_data.plot.pie(y='num_ojv', figsize=(8, 8))
plt.show()
In [183]:
conn.close()
Case-Study : Source Country and Destination Country
Create a pivot table using sourcecountry and country as index and columns, with the percentage of country records for each sourcecountry
Remove non-significant values from the pivot (in this case, those below 5%)
Sort both rows and columns of the pivot table (descending)
Import skill data from the DataLab and perform the following actions on it:
head
info
count null values
get the most frequent skill by country
PIVOT TABLE
In [199]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)
# query we want to run
query_1 = """
    SELECT idcountry, sourcecountry, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    WHERE sourcecountry in ('IT', 'UK', 'IE', 'CZ', 'FR', 'DE', 'ES', 'AT', 'PL', 'BE', 'NL', 'SE', 'LU')
    GROUP BY sourcecountry, idcountry
"""
# reading data using connection and query
documents = pd.read_sql(query_1, conn)
conn.close()
In [200]:
documents.head()
Out[200]:

idcountry sourcecountry num_ojv
0 NL SE 1778
1 CZ BE 60
2 FR UK 9429
3 PL IE 48
4 UK SE 2832

In [201]:

# grouping and calculating percentages
coutry_groupped = documents.groupby(['idcountry', 'sourcecountry'])\
    .sum().groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))\
    .reset_index()

# making the pivot table, filling blank cells with zero and rounding to one decimal
pivot_data = coutry_groupped.pivot(index='idcountry', columns='sourcecountry', values='num_ojv').fillna(0).round(1)
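The groupby(level=0) normalisation in the cell above can be a bit opaque; the same per-group percentage can also be sketched with transform. Toy numbers here, not the real counts:

```python
import pandas as pd

# hypothetical counts: 80 of the 100 'IT' ads come from Italian sources
df = pd.DataFrame({
    'idcountry':     ['IT', 'IT', 'DE'],
    'sourcecountry': ['IT', 'UK', 'DE'],
    'num_ojv':       [80, 20, 50],
})

totals = df.groupby(['idcountry', 'sourcecountry'])['num_ojv'].sum()

# divide each cell by the total of its idcountry (level 0 of the index)
pct = 100 * totals / totals.groupby(level=0).transform('sum')
print(pct.loc[('IT', 'IT')], pct.loc[('DE', 'DE')])  # 80.0 100.0
```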
In [202]:
pivot_data
Out[202]:

sourcecountry AT BE CZ DE ES FR IE IT LU NL PL SE UK
idcountry
AT 87.5 0.0 0.0 11.9 0.0 0.2 0.0 0.0 0.0 0.1 0.2 0.0 0.0
BE 0.1 94.5 0.0 1.2 0.1 1.5 0.3 0.1 0.0 1.4 0.4 0.0 0.2
CZ 0.0 0.0 99.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0
DE 2.2 0.0 0.0 97.3 0.0 0.2 0.0 0.0 0.0 0.0 0.2 0.1 0.0
ES 0.0 0.0 0.0 0.2 97.8 1.6 0.0 0.1 0.0 0.1 0.0 0.1 0.1
FR 0.0 0.4 0.0 0.3 0.1 98.8 0.1 0.1 0.0 0.1 0.0 0.0 0.1
IE 0.1 0.1 10.3 0.6 0.3 0.1 85.2 0.1 0.0 0.4 0.1 0.0 2.7
IT 0.2 0.2 0.0 0.3 0.2 0.3 0.0 98.5 0.0 0.1 0.1 0.1 0.0
LU 0.1 5.9 0.0 2.0 0.0 3.0 0.0 0.0 88.4 0.3 0.0 0.0 0.2
NL 0.1 0.4 0.0 1.4 0.1 0.0 0.0 0.0 0.0 97.4 0.3 0.1 0.2
PL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.8 0.1 0.0
SE 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.1 0.0 0.0 0.0 99.5 0.1
UK 0.1 0.0 0.0 0.6 0.0 0.4 0.2 0.0 0.0 0.3 0.0 0.0 98.4

DATA CLEANING

In [203]:

# We need to import numpy first
import numpy as np
In [204]:

# With this line of code we replace the values less than 5% with 0
# .apply --> applies a function to the datatable
# lambda x: do(x) --> a simple and fast way to write a function
# np.where(condition, something, something_else) --> similar to the =IF() function in Excel
pivot_data.apply(lambda x: np.where(x < 5, 0, x))
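To see np.where in isolation, here is a minimal example on one hypothetical row of percentages (made-up values):

```python
import numpy as np
import pandas as pd

# one hypothetical row of the pivot table
row = pd.Series([87.5, 0.2, 11.9, 4.9])

# keep values >= 5, zero out the rest (like =IF(x<5, 0, x) in Excel)
cleaned = np.where(row < 5, 0, row)
print(cleaned)
```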
Out[204]:

sourcecountry AT BE CZ DE ES FR IE IT LU NL PL SE UK
idcountry
AT 87.5 0.0 0.0 11.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
BE 0.0 94.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
CZ 0.0 0.0 99.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DE 0.0 0.0 0.0 97.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ES 0.0 0.0 0.0 0.0 97.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
FR 0.0 0.0 0.0 0.0 0.0 98.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0
IE 0.0 0.0 10.3 0.0 0.0 0.0 85.2 0.0 0.0 0.0 0.0 0.0 0.0
IT 0.0 0.0 0.0 0.0 0.0 0.0 0.0 98.5 0.0 0.0 0.0 0.0 0.0
LU 0.0 5.9 0.0 0.0 0.0 0.0 0.0 0.0 88.4 0.0 0.0 0.0 0.0
NL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 97.4 0.0 0.0 0.0
PL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.8 0.0 0.0
SE 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.5 0.0
UK 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 98.4

The skills

Importing skill data from Athena:
In [208]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_2 = """
    SELECT *
    FROM "AwsDataCatalog".cedefop_presentation.ft_skill_analysis_essnet
    ORDER BY RAND()
    LIMIT 1000;
"""

# reading data using connection and query
skills = pd.read_sql(query_2, conn)

# closing connection
conn.close()
In [209]:
skills.head()
Out[209]:
general_id grab_date year_grab_date month_grab_date day_grab_date expire_date
0 163539137 17905 2019 1 9 18025
1 175883759 17935 2019 2 8 17995
2 166123483 17908 2019 1 12 18028
3 136630915 17852 2018 11 17 17972
4 82106708 17731 2018 7 19 17851
5 rows × 51 columns
In [210]:
skills.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 51 columns):
general_id             1000 non-null object
grab_date              1000 non-null int64
year_grab_date         1000 non-null int64
month_grab_date        1000 non-null int64
day_grab_date          1000 non-null int64
expire_date            1000 non-null int64
year_expire_date       1000 non-null int64
month_expire_date      1000 non-null int64
day_expire_date        1000 non-null int64
lang                   1000 non-null object
idesco_level_4         1000 non-null object
esco_level_4           1000 non-null object
idesco_level_3         1000 non-null object
esco_level_3           1000 non-null object
idesco_level_2         1000 non-null object
esco_level_2           1000 non-null object
idesco_level_1         1000 non-null object
esco_level_1           1000 non-null object
idescoskill_level_3    1000 non-null object
escoskill_level_3      1000 non-null object
idcity                 1000 non-null object
city                   1000 non-null object
idprovince             1000 non-null object
province               1000 non-null object
idregion               1000 non-null object
region                 1000 non-null object
idmacro_region         1000 non-null object
macro_region           1000 non-null object
idcountry              1000 non-null object
country                1000 non-null object
idcontract             1000 non-null object
contract               1000 non-null object
ideducational_level    1000 non-null object
educational_level      1000 non-null object
idsector               1000 non-null object
sector                 1000 non-null object
idmacro_sector         1000 non-null object
macro_sector           1000 non-null object
idcategory_sector      1000 non-null object
category_sector        1000 non-null object
idsalary               1000 non-null object
salary                 1000 non-null object
idworking_hours        1000 non-null object
working_hours          1000 non-null object
idexperience           1000 non-null object
experience             1000 non-null object
source_category        1000 non-null object
sourcecountry          1000 non-null object
source                 1000 non-null object
site                   1000 non-null object
companyname            1000 non-null object
dtypes: int64(8), object(43)
memory usage: 398.5+ KB

Null Values

To get the number of null cells in each column, we first use the .isnull( ) method, which returns True for each null cell and False for a non-null cell. We then sum these values to get the total number of null values per column:

In [211]:

skills.isnull().sum()

Out[211]:

general_id 0
grab_date 0
year_grab_date 0
month_grab_date 0
day_grab_date 0
expire_date 0
year_expire_date 0
month_expire_date 0
day_expire_date 0
lang 0
idesco_level_4 0
esco_level_4 0
idesco_level_3 0
esco_level_3 0
idesco_level_2 0
esco_level_2 0
idesco_level_1 0
esco_level_1 0
idescoskill_level_3 0
escoskill_level_3 0
idcity 0
city 0
idprovince 0
province 0
idregion 0
region 0
idmacro_region 0
macro_region 0
idcountry 0
country 0
idcontract 0
contract 0
ideducational_level 0
educational_level 0
idsector 0
sector 0
idmacro_sector 0
macro_sector 0
idcategory_sector 0
category_sector 0
idsalary 0
salary 0
idworking_hours 0
working_hours 0
idexperience 0
experience 0
source_category 0
sourcecountry 0
source 0
site 0
companyname 0
dtype: int64
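Our sample happens to contain no nulls, so every count above is zero. On a toy frame with genuinely missing values (hypothetical data), the same isnull().sum() pattern produces non-zero counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'salary': [1000, np.nan, 1200],
    'city':   ['Milano', 'Roma', None],
})

# isnull() marks missing cells True; summing counts them per column
nulls = df.isnull().sum()
print(nulls['salary'], nulls['city'])  # 1 1
```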
Finding top skills by country:
In [212]:
vals = []
countries = []
sk = []
for country in skills.country.unique():
    sag = skills[skills.country == country]['escoskill_level_3']
    # value_counts() sorts by frequency, so position 0 is the most frequent skill
    vals.append(sag.value_counts().iloc[0])
    sk.append(sag.value_counts().index[0])
    countries.append(country)
In [213]:

res = pd.DataFrame([countries, sk, vals]).T
res.columns = ['country', 'skill', 'count']
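As an aside, the loop above can be written more idiomatically with groupby and idxmax. A sketch on a hypothetical mini version of the skills table:

```python
import pandas as pd

skills_mini = pd.DataFrame({
    'country': ['IT', 'IT', 'IT', 'DE', 'DE'],
    'escoskill_level_3': ['communication', 'communication', 'SQL', 'Java', 'Java'],
})

# most frequent skill per country: value_counts() tallies the skills,
# idxmax() returns the label with the highest count
top_skill = (skills_mini.groupby('country')['escoskill_level_3']
                        .agg(lambda s: s.value_counts().idxmax()))
print(top_skill['IT'], top_skill['DE'])  # communication Java
```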
In [214]:
res
Out[214]:

country skill count
0 NEDERLAND proactivity 3
1 DEUTSCHLAND adapt to change 14
2 UNITED KINGDOM adapt to change 17
3 ESPAÑA ICT networking hardware 1
4 ÖSTERREICH adapt to change 4
5 ITALIA communication 3
6 BELGIQUE-BELGIË create solutions to problems 2
7 SVERIGE communicate with customers 3
8 FRANCE adapt to change 11
9 POLSKA manage time 2
10 ČESKÁ REPUBLIKA engineering processes 1
11 IRELAND communication 1

Case-Study : Digital Occupations
Using the provided list of Eurostat digital occupations, create a subset of the skills data which contains only these occupations
For each digital occupation, calculate the mixture of skills in percentage terms
Focus on programming languages
In [215]:
prof_digital = ['1330', '2511', '2512', '2513', '2514', '2519', '2521','2522', '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522']
In [218]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_1 = """
    SELECT *
    FROM (
        SELECT esco_level_4, escoskill_level_3, num_ojv,
               rank() over (partition by esco_level_4 order by num_ojv desc) as rank
        FROM (
            SELECT esco_level_4, escoskill_level_3, count(distinct general_id) as num_ojv
            FROM cedefop_presentation.ft_skill_analysis_essnet
            WHERE idesco_level_4 IN ('1330', '2511', '2512', '2513', '2514', '2519', '2521', '2522', '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522')
            GROUP BY esco_level_4, escoskill_level_3
            ORDER BY num_ojv DESC
        ) t1
    ) t2
    where rank <= 5
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)
conn.close()
In [220]:
documents[documents['esco_level_4']=='Software developers'].head(5)
Out[220]:

esco_level_4 escoskill_level_3 num_ojv rank
40 Software developers adapt to change 1279015 1
41 Software developers project management 1137579 2
42 Software developers computer programming 1101046 3
43 Software developers English 999963 4
44 Software developers teamwork principles 935304 5

PROGRAMMING LANGUAGES BY LOCATION

In [50]:

langs = ['SQL', 'Java', 'C#', 'Python', 'PHP', 'matlab', 'SAS language', 'C++']

In [221]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_1 = """
    SELECT idcountry, escoskill_level_3, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_skill_analysis_essnet
    WHERE escoskill_level_3 IN ('SQL', 'Java', 'C#', 'Python', 'PHP', 'matlab', 'SAS language', 'C++')
    GROUP BY idcountry, escoskill_level_3
    ORDER BY num_ojv DESC
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

conn.close()
In [225]:
# sorting the language counts, highest first
lang_g = documents.reset_index()\
    .sort_values(['num_ojv'], ascending=False)
In [226]:
lang_g[lang_g['idcountry']=='UK'].head(20)
Out[226]:

index idcountry escoskill_level_3 num_ojv
0 0 UK SQL 450804
3 3 UK Java 225294
4 4 UK Python 171930
8 8 UK C# 115636
9 9 UK PHP 112480
12 12 UK C++ 84014
42 42 UK SAS language 13227
56 56 UK matlab 7254

Compare with DE...

In [227]:

lang_g[lang_g['idcountry']=='DE'].head(20)

Out[227]:

index idcountry escoskill_level_3 num_ojv
1 1 DE SQL 241867
2 2 DE Java 233301
7 7 DE C++ 124366
10 10 DE PHP 106713
14 14 DE Python 82810
24 24 DE C# 39666
26 26 DE matlab 29322
51 51 DE SAS language 9235
... or pivoting the data
In [230]:
lang_g.pivot(index='escoskill_level_3',columns='idcountry', values='num_ojv').fillna(0)
Out[230]:

idcountry AT BE CZ DE ES FR IE
escoskill_level_3
C# 4168.0 4794.0 1172.0 39666.0 17889.0 21808.0 14227.0 13341.0
C++ 11446.0 2774.0 2090.0 124366.0 15214.0 41115.0 2644.0 13002.0
Java 22522.0 13014.0 5013.0 233301.0 47641.0 124376.0 13060.0 40730.0
PHP 6735.0 3934.0 1405.0 106713.0 20267.0 58000.0 3333.0 14702.0
Python 5699.0 4589.0 1894.0 82810.0 17914.0 50950.0 7813.0 7910.0
SAS language 323.0 708.0 0.0 9235.0 683.0 103818.0 358.0 880.0
SQL 27988.0 23465.0 8221.0 241867.0 51685.0 126214.0 22828.0 52186.0
matlab 2095.0 250.0 0.0 29322.0 259.0 5387.0 0.0 1171.0

Case-Study : Education vs Experience

Creating a bubble chart for Education and Experience, with the number of records as the size of the bubbles:
In [252]:
# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/'
)

# query we want to run
query_1 = """
    SELECT ideducational_level, educational_level, idexperience, experience, count(distinct general_id) as num_ojv
    FROM cedefop_presentation.ft_document_essnet
    WHERE idesco_level_4 IN ('1330', '2511', '2512', '2513', '2514', '2519', '2521', '2522', '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522')
    GROUP BY ideducational_level, educational_level, idexperience, experience
    ORDER BY ideducational_level ASC, idexperience ASC
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)
conn.close()
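For readers more comfortable with pandas than SQL, the aggregation in the query above (`COUNT(DISTINCT general_id)` grouped by education and experience) can be reproduced with `groupby` and `nunique`. This is a sketch on a toy frame with made-up values, not the real `ft_document_essnet` table:

```python
import pandas as pd

# toy postings frame shaped like the query's input columns (values are made up)
postings = pd.DataFrame({
    'general_id': ['a', 'a', 'b', 'c', 'd'],
    'ideducational_level': ['6', '6', '6', '7', '7'],
    'educational_level': ['Bachelor or equivalent'] * 3 + ['Master or equivalent'] * 2,
    'idexperience': ['1', '1', '1', '2', '2'],
    'experience': ['No experience'] * 3 + ['Up to 1 year'] * 2,
})

# pandas equivalent of COUNT(DISTINCT general_id) ... GROUP BY ...
agg = (postings
       .groupby(['ideducational_level', 'educational_level',
                 'idexperience', 'experience'])['general_id']
       .nunique()
       .reset_index(name='num_ojv'))
print(agg)
```

Note that `nunique()` counts distinct `general_id` values, so the duplicated posting `'a'` is counted once, just as `COUNT(DISTINCT ...)` would.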
In [253]:
# the query already aggregated the data for the desired columns;
# reset the index so it becomes an explicit column
edu_exp = documents.reset_index()
In [254]:
edu_exp
Out[254]:
    index ideducational_level                      educational_level idexperience          experience  num_ojv
0       0                                                                                                33289
1       1                                                                       1       No experience     5324
2       2                                                                       2        Up to 1 year    13588
3       3                                                                       3   From 1 to 2 years     4210
4       4                                                                       4   From 2 to 4 years     5046
5       5                                                                       5   From 4 to 6 years      939
6       6                                                                       6   From 6 to 8 years      224
7       7                                                                       7  From 8 to 10 years      111
8       8                                                                       8       Over 10 years     7605
9       9                   1                      Primary education                                      8167
10     10                   1                      Primary education           1       No experience      1468
11     11                   1                      Primary education           2        Up to 1 year      6732
12     12                   1                      Primary education           3   From 1 to 2 years      1900
13     13                   1                      Primary education           4   From 2 to 4 years      2569
14     14                   1                      Primary education           5   From 4 to 6 years       393
15     15                   1                      Primary education           6   From 6 to 8 years        73
16     16                   1                      Primary education           7  From 8 to 10 years        26
17     17                   1                      Primary education           8       Over 10 years      2033
18     18                   2              Lower secondary education                                    151209
19     19                   2              Lower secondary education           1       No experience      3935
20     20                   2              Lower secondary education           2        Up to 1 year     98522
21     21                   2              Lower secondary education           3   From 1 to 2 years     28474
22     22                   2              Lower secondary education           4   From 2 to 4 years     19069
23     23                   2              Lower secondary education           5   From 4 to 6 years      4330
24     24                   2              Lower secondary education           6   From 6 to 8 years      1525
25     25                   2              Lower secondary education           7  From 8 to 10 years       684
26     26                   2              Lower secondary education           8       Over 10 years     44730
27     27                   3  Post-secondary non-tertiary education                                    254480
28     28                   3  Post-secondary non-tertiary education           1       No experience     10386
29     29                   3  Post-secondary non-tertiary education           2        Up to 1 year    257372
...   ...                 ...                                    ...         ...                 ...       ...
51     51                   5         Short-cycle tertiary education           6   From 6 to 8 years      7097
52     52                   5         Short-cycle tertiary education           7  From 8 to 10 years      2220
53     53                   5         Short-cycle tertiary education           8       Over 10 years    154879
54     54                   6                 Bachelor or equivalent                                    333980
55     55                   6                 Bachelor or equivalent           1       No experience     20851
56     56                   6                 Bachelor or equivalent           2        Up to 1 year    257164
57     57                   6                 Bachelor or equivalent           3   From 1 to 2 years     52426
58     58                   6                 Bachelor or equivalent           4   From 2 to 4 years     84984
59     59                   6                 Bachelor or equivalent           5   From 4 to 6 years     16845
60     60                   6                 Bachelor or equivalent           6   From 6 to 8 years      7468
61     61                   6                 Bachelor or equivalent           7  From 8 to 10 years      2467
62     62                   6                 Bachelor or equivalent           8       Over 10 years    116087
63     63                   7                   Master or equivalent                                    172997
64     64                   7                   Master or equivalent           1       No experience     10289
65     65                   7                   Master or equivalent           2        Up to 1 year    129748
66     66                   7                   Master or equivalent           3   From 1 to 2 years     21435
67     67                   7                   Master or equivalent           4   From 2 to 4 years     43954
68     68                   7                   Master or equivalent           5   From 4 to 6 years     12525
69     69                   7                   Master or equivalent           6   From 6 to 8 years      2131
70     70                   7                   Master or equivalent           7  From 8 to 10 years       778
71     71                   7                   Master or equivalent           8       Over 10 years     35596
72     72                   8                 Doctoral or equivalent                                     20955
73     73                   8                 Doctoral or equivalent           1       No experience       740
74     74                   8                 Doctoral or equivalent           2        Up to 1 year     17649
75     75                   8                 Doctoral or equivalent           3   From 1 to 2 years      3008
76     76                   8                 Doctoral or equivalent           4   From 2 to 4 years      4720
77     77                   8                 Doctoral or equivalent           5   From 4 to 6 years      1391
78     78                   8                 Doctoral or equivalent           6   From 6 to 8 years       245
79     79                   8                 Doctoral or equivalent           7  From 8 to 10 years        58
80     80                   8                 Doctoral or equivalent           8       Over 10 years      4505
81 rows × 6 columns
There are some missing values which we should remove before plotting. In this case the missing data are indicated by empty strings (""):
In [255]:
# numpy provides np.nan, which in Python represents missing data
import numpy as np

# replacing "" with np.nan
edu_exp.replace('', np.nan, inplace=True)
# removing rows with "any" missing value
edu_exp.dropna(inplace=True)
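As a self-contained illustration of this two-step cleanup, here it is applied to a toy frame (the three rows use a few of the counts from the table above but are otherwise illustrative):

```python
import numpy as np
import pandas as pd

# toy frame where empty strings stand in for missing values
df = pd.DataFrame({
    'educational_level': ['Primary education', '', 'Master or equivalent'],
    'experience': ['No experience', 'Up to 1 year', ''],
    'num_ojv': [1468, 13588, 172997],
})

# '' -> NaN so pandas recognises the cells as missing, then drop
# every row that has a missing value in any column
df = df.replace('', np.nan).dropna()
print(df)
```

Only the first row survives, because it is the only one with no empty string in either text column.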
Now our table is ready for plotting:
In [256]:
edu_exp
Out[256]:
    index ideducational_level                      educational_level idexperience          experience  num_ojv
10     10                   1                      Primary education           1       No experience      1468
11     11                   1                      Primary education           2        Up to 1 year      6732
12     12                   1                      Primary education           3   From 1 to 2 years      1900
13     13                   1                      Primary education           4   From 2 to 4 years      2569
14     14                   1                      Primary education           5   From 4 to 6 years       393
15     15                   1                      Primary education           6   From 6 to 8 years        73
16     16                   1                      Primary education           7  From 8 to 10 years        26
17     17                   1                      Primary education           8       Over 10 years      2033
19     19                   2              Lower secondary education           1       No experience      3935
20     20                   2              Lower secondary education           2        Up to 1 year     98522
21     21                   2              Lower secondary education           3   From 1 to 2 years     28474
22     22                   2              Lower secondary education           4   From 2 to 4 years     19069
23     23                   2              Lower secondary education           5   From 4 to 6 years      4330
24     24                   2              Lower secondary education           6   From 6 to 8 years      1525
25     25                   2              Lower secondary education           7  From 8 to 10 years       684
26     26                   2              Lower secondary education           8       Over 10 years     44730
28     28                   3  Post-secondary non-tertiary education           1       No experience     10386
29     29                   3  Post-secondary non-tertiary education           2        Up to 1 year    257372
30     30                   3  Post-secondary non-tertiary education           3   From 1 to 2 years     34109
31     31                   3  Post-secondary non-tertiary education           4   From 2 to 4 years     53213
32     32                   3  Post-secondary non-tertiary education           5   From 4 to 6 years      8818
33     33                   3  Post-secondary non-tertiary education           6   From 6 to 8 years      2342
34     34                   3  Post-secondary non-tertiary education           7  From 8 to 10 years      1146
35     35                   3  Post-secondary non-tertiary education           8       Over 10 years    112518
37     37                   4              Upper secondary education           1       No experience     21002
38     38                   4              Upper secondary education           2        Up to 1 year    220378
39     39                   4              Upper secondary education           3   From 1 to 2 years     37237
40     40                   4              Upper secondary education           4   From 2 to 4 years     48012
41     41                   4              Upper secondary education           5   From 4 to 6 years      8003
42     42                   4              Upper secondary education           6   From 6 to 8 years      2696
...   ...                 ...                                    ...         ...                 ...       ...
48     48                   5         Short-cycle tertiary education           3   From 1 to 2 years    118960
49     49                   5         Short-cycle tertiary education           4   From 2 to 4 years    151542
50     50                   5         Short-cycle tertiary education           5   From 4 to 6 years     31885
51     51                   5         Short-cycle tertiary education           6   From 6 to 8 years      7097
52     52                   5         Short-cycle tertiary education           7  From 8 to 10 years      2220
53     53                   5         Short-cycle tertiary education           8       Over 10 years    154879
55     55                   6                 Bachelor or equivalent           1       No experience     20851
56     56                   6                 Bachelor or equivalent           2        Up to 1 year    257164
57     57                   6                 Bachelor or equivalent           3   From 1 to 2 years     52426
58     58                   6                 Bachelor or equivalent           4   From 2 to 4 years     84984
59     59                   6                 Bachelor or equivalent           5   From 4 to 6 years     16845
60     60                   6                 Bachelor or equivalent           6   From 6 to 8 years      7468
61     61                   6                 Bachelor or equivalent           7  From 8 to 10 years      2467
62     62                   6                 Bachelor or equivalent           8       Over 10 years    116087
64     64                   7                   Master or equivalent           1       No experience     10289
65     65                   7                   Master or equivalent           2        Up to 1 year    129748
66     66                   7                   Master or equivalent           3   From 1 to 2 years     21435
67     67                   7                   Master or equivalent           4   From 2 to 4 years     43954
68     68                   7                   Master or equivalent           5   From 4 to 6 years     12525
69     69                   7                   Master or equivalent           6   From 6 to 8 years      2131
70     70                   7                   Master or equivalent           7  From 8 to 10 years       778
71     71                   7                   Master or equivalent           8       Over 10 years     35596
73     73                   8                 Doctoral or equivalent           1       No experience       740
74     74                   8                 Doctoral or equivalent           2        Up to 1 year     17649
75     75                   8                 Doctoral or equivalent           3   From 1 to 2 years      3008
76     76                   8                 Doctoral or equivalent           4   From 2 to 4 years      4720
77     77                   8                 Doctoral or equivalent           5   From 4 to 6 years      1391
78     78                   8                 Doctoral or equivalent           6   From 6 to 8 years       245
79     79                   8                 Doctoral or equivalent           7  From 8 to 10 years        58
80     80                   8                 Doctoral or equivalent           8       Over 10 years      4505
64 rows × 6 columns
In [257]:
# matplotlib is needed for plotting
import matplotlib.pyplot as plt

# initial variables
plt.rcParams['figure.figsize'] = (20, 8)
fig = plt.figure()

# plotting bubbles
plt.scatter(edu_exp.educational_level, edu_exp.experience,
            s=edu_exp.num_ojv / 100.0, alpha=0.5)

# rotating x ticks
fig.autofmt_xdate(rotation=90)

# adding a title to the plot
plt.title('Education vs. Experience - Digital Occupations', fontsize=20)

# plotting!
plt.show()
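A self-contained version of the same bubble chart, runnable outside the notebook, might look like the sketch below. It uses a two-row toy frame (two of the Bachelor/Master counts from the table above; the pairing of rows is illustrative) and the non-interactive Agg backend so no display is needed:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# toy education/experience counts (illustrative pairing of values)
edu_exp = pd.DataFrame({
    'educational_level': ['Bachelor or equivalent', 'Master or equivalent'],
    'experience': ['No experience', 'Up to 1 year'],
    'num_ojv': [20851, 129748],
})

fig = plt.figure(figsize=(10, 4))

# bubble area is driven by the record count, scaled down to stay readable
bubbles = plt.scatter(edu_exp.educational_level, edu_exp.experience,
                      s=edu_exp.num_ojv / 100.0, alpha=0.5)

fig.autofmt_xdate(rotation=90)
plt.title('Education vs. Experience - Digital Occupations')

# save instead of plt.show(), since Agg cannot open a window
fig.savefig('bubbles.png')
```

Dividing `num_ojv` by 100 is a plotting convenience: `scatter`'s `s` argument is the marker area in points squared, so raw counts in the hundreds of thousands would swamp the axes.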