Introduction to data cleaning with spreadsheets

An Introduction to data cleaning with spreadsheets, by Anders Pedersen (@anpe), School of Data


Presented at School of Data training conducted in collaboration with the Open Data PH Taskforce in the Philippines, May 2014.

Transcript of Introduction to data cleaning with spreadsheets

Page 1: Introduction to data cleaning with spreadsheets

An Introduction to data cleaning with spreadsheets

Anders Pedersen, @anpe

School of Data

Page 2: Introduction to data cleaning with spreadsheets

Spreadsheets: The beginning of each and every data story

• Which were the top growth sectors in this quarter?

• How did crime in the capital region in 2013 compare with 2012?

• Is there a housing bubble waiting around the corner?

Page 3: Introduction to data cleaning with spreadsheets

It is time for journalists themselves to tame this beast called spreadsheets!

Page 4: Introduction to data cleaning with spreadsheets

Spreadsheets: Excel or Google Docs

Page 5: Introduction to data cleaning with spreadsheets

Some basic terminology

• Data is organized in rows and columns (rows go across the page, columns go top down)

• Each field holding data is called a cell

• Rows are numbered

• Columns are referred to by letters

• Each cell has a column and a row, which together give it a specific code (e.g. A1 is the top-left cell)
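The cell codes described above can be taken apart mechanically. The following is a minimal Python sketch, not part of the original slides; the helper names `parse_cell_ref` and `column_index` are made up for illustration.

```python
import re


def parse_cell_ref(ref):
    """Split a cell code such as 'H2' into (column_letters, row_number)."""
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    letters, digits = match.groups()
    return letters.upper(), int(digits)


def column_index(letters):
    """Convert column letters to a 1-based position: A is 1, Z is 26, AA is 27."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


column, row = parse_cell_ref("A1")  # the top-left cell
```

Here `column` holds the letter part and `row` the number part, mirroring how a spreadsheet addresses a cell by column first and row second.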

Page 6: Introduction to data cleaning with spreadsheets

Some key features to explore today

• Sorting and filtering

• Basic formulas

• Pivot tables

Tricky bits:

- Don’t include summary rows (such as totals) in the data you feed to a pivot table

- A pivot table does not always notice when you change your data, so refresh or rebuild it after editing
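To see why summary rows are a problem, here is a small Python sketch with made-up numbers (the department names and amounts are hypothetical). A "Total" row left in the data is counted a second time when everything is added up.

```python
# Hypothetical data: two real rows plus a summary row that was left in.
rows = [
    ("Health", 100),
    ("Education", 150),
    ("Total", 250),  # summary row, not a real budget line
]

# Adding everything up counts the summary row as if it were data.
naive_total = sum(amount for _, amount in rows)

# Dropping the summary row first gives the real total.
clean_rows = [(name, amount) for name, amount in rows if name != "Total"]
correct_total = sum(amount for _, amount in clean_rows)
```

The naive total comes out at twice the real one, which is the kind of error a pivot table would silently reproduce.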

Page 7: Introduction to data cleaning with spreadsheets

Data sources for exercise

• Education: Secondary school enrollment for 2012 from Data.gov.ph http://data.gov.ph/catalogue/dataset/sy-2012-enrollment-data-secondary

Page 8: Introduction to data cleaning with spreadsheets

Sorting - finding the best and the worst

• The 10 best paid sectors

• The 10 oldest cities

• The 10 poorest countries

• …

• If Excel is a toolbox for journalists, sorting is the hammer!

Page 9: Introduction to data cleaning with spreadsheets

How to sort

• 1) Mark (select) all your data

• 2) In the Data tab, go to Sort range

Page 10: Introduction to data cleaning with spreadsheets

Sorting...

• 3) Check the "Data has header row" check box

• 4) Select the column you want to sort by
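The same sorting steps can be sketched outside the spreadsheet. This Python example uses made-up sector figures (not from the exercise dataset) and shows why the header row has to be set aside before sorting, which is what the "Data has header row" check box does.

```python
# Hypothetical table: the first row is the header, the rest is data.
rows = [
    ["sector", "average_pay"],
    ["Mining", 410],
    ["Retail", 250],
    ["Finance", 620],
]

# Keep the header out of the sort, like ticking "Data has header row".
header, data = rows[0], rows[1:]

# Sort by the chosen column (index 1), highest value first.
ranked = sorted(data, key=lambda row: row[1], reverse=True)

sorted_rows = [header] + ranked
top_two = ranked[:2]  # a "top N" list is just the start of the sorted data
```

Sorting in ascending order instead (dropping `reverse=True`) would put the lowest values first, which is how a "10 poorest" list would be built.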

Page 11: Introduction to data cleaning with spreadsheets

Filtering - getting a better sense of your data

• 1) Turn on filtering via the Data tab (Data → Filter)

Page 12: Introduction to data cleaning with spreadsheets

Filtering...

• 2) Filter options now appear at the top

Page 13: Introduction to data cleaning with spreadsheets

Filtering...

• 3) Now click on the blue triangular arrow

Page 14: Introduction to data cleaning with spreadsheets

Filtering...

• 4) Select the section you wish to filter

Page 15: Introduction to data cleaning with spreadsheets

Filtering...

• 5) A green arrow will now appear on top of the column
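As a rough illustration of what the filter does, the following Python sketch lists the values a column contains (the options the filter menu offers) and keeps only the rows matching one of them. The records are invented for the example and are not taken from the enrollment dataset.

```python
# Hypothetical records; the column names and values are made up.
records = [
    {"region": "Region A", "school": "School 1", "enrollment": 1200},
    {"region": "Region B", "school": "School 2", "enrollment": 800},
    {"region": "Region A", "school": "School 3", "enrollment": 950},
]

# The filter menu shows each distinct value in the column once.
options = sorted({record["region"] for record in records})

# Choosing one value hides every row that does not match it.
selected = "Region A"
filtered = [record for record in records if record["region"] == selected]
```

The original `records` list is left untouched, just as filtering in a spreadsheet hides rows rather than deleting them.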

Page 16: Introduction to data cleaning with spreadsheets

Moving forward!

• Sorting and filtering - check!

• Basic formulas

• Pivot tables

Page 17: Introduction to data cleaning with spreadsheets

Basic formulas

• Let us now try to sum up some of the values in the dataset…

• What it is good for: doing your own analysis, and checking whether calculations by your colleagues are right

Page 18: Introduction to data cleaning with spreadsheets

Basic formulas

• Go to column H: In the second row (cell H2), type “=sum(f2:g2)” (or simply “=f2+g2”)

Page 19: Introduction to data cleaning with spreadsheets

Basic formulas

• We now have a sum

• Now try calculating the average of the same cells instead: “=average(f2:g2)”
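For readers who want to double-check a spreadsheet result by other means, the two formulas above correspond to the following Python sketch. The values standing in for cells F2 and G2 are made up.

```python
from statistics import mean

# Hypothetical values standing in for cells F2 and G2.
f2 = 120
g2 = 80

# Equivalent of summing F2 and G2.
h2_sum = sum([f2, g2])

# Equivalent of =average(f2:g2).
h2_average = mean([f2, g2])
```

Recomputing a figure this way is one simple check on whether a colleague's calculation is right.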

Page 20: Introduction to data cleaning with spreadsheets

Basic formulas

• You can also copy your calculations across cells
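Copying a formula down a column means applying the same calculation to every row, each time with that row's own values. A minimal Python sketch of the idea, with made-up numbers in columns F and G:

```python
# Hypothetical rows; F and G stand in for the two value columns.
table = [
    {"F": 120, "G": 80},
    {"F": 95, "G": 105},
    {"F": 60, "G": 40},
]

# Filling the formula down column H: same calculation, row by row.
for row in table:
    row["H"] = row["F"] + row["G"]

column_h = [row["H"] for row in table]
```

This matches how a spreadsheet adjusts the row numbers in a copied formula (F2 and G2 become F3 and G3, and so on).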

Page 21: Introduction to data cleaning with spreadsheets

Now only Pivot tables to go

• Sorting and filtering - check!

• Basic formulas - check!

• Pivot tables

Page 22: Introduction to data cleaning with spreadsheets

Pivot tables

• Finding stories inside datasets

• Particularly well suited to organised datasets with clear categories and sub-categories

Page 23: Introduction to data cleaning with spreadsheets

Pivot tables

• Mark (select) the full area of the dataset

• Go to Data → Pivot table report

Page 24: Introduction to data cleaning with spreadsheets

Pivot tables

• Pivot tables allow you to work on rows, columns, values and filters

• We start by dropping a column header into Rows

• Then we drop one of our value columns into Values

Page 25: Introduction to data cleaning with spreadsheets

Pivot tables

• We now have a nice summary of the budget for each department
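Under the hood, a pivot table with one field in Rows and one in Values groups the data by the row field and adds up the value field within each group. The Python sketch below illustrates this with invented budget lines; the field names `department` and `budget` are assumptions, not the columns of the actual dataset.

```python
from collections import defaultdict

# Hypothetical budget lines.
budget_lines = [
    {"department": "Health", "item": "Hospitals", "budget": 100},
    {"department": "Health", "item": "Clinics", "budget": 40},
    {"department": "Education", "item": "Schools", "budget": 150},
    {"department": "Education", "item": "Books", "budget": 30},
]

# "Rows" field: department. "Values" field: budget, summed per group.
pivot = defaultdict(int)
for line in budget_lines:
    pivot[line["department"]] += line["budget"]

summary = dict(pivot)
```

Each key in `summary` corresponds to one row of the pivot table, and each value to the summed budget shown next to it.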

Page 26: Introduction to data cleaning with spreadsheets

Filtering pivot tables

• We can now go ahead and filter the pivot table

• Add the column you wish to filter by

Page 27: Introduction to data cleaning with spreadsheets

Filtering pivot tables

• Then select one or more categories within the column you wish to keep
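A pivot table filter can be pictured as discarding the rows outside the chosen categories before the grouping happens. This Python sketch assumes a hypothetical `year` column used as the filter; all figures are made up.

```python
from collections import defaultdict

# Hypothetical budget lines with a year column to filter on.
budget_lines = [
    {"department": "Health", "year": 2013, "budget": 120},
    {"department": "Health", "year": 2014, "budget": 140},
    {"department": "Education", "year": 2013, "budget": 160},
    {"department": "Education", "year": 2014, "budget": 180},
]

# The categories ticked in the filter: only these rows are kept.
keep_years = {2014}

filtered_pivot = defaultdict(int)
for line in budget_lines:
    if line["year"] in keep_years:
        filtered_pivot[line["department"]] += line["budget"]

summary_2014 = dict(filtered_pivot)
```

Adding 2013 to `keep_years` would bring those rows back into the totals, just like ticking a second category in the filter.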

Page 28: Introduction to data cleaning with spreadsheets

Pivot tables

• Finally, we can add several value columns to the pivot table
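With several value columns, each group simply carries one running total per column. A short Python sketch, assuming hypothetical columns `budget_2013` and `budget_2014` and invented amounts:

```python
# Hypothetical budget lines with two value columns.
budget_lines = [
    {"department": "Health", "budget_2013": 90, "budget_2014": 100},
    {"department": "Health", "budget_2013": 30, "budget_2014": 40},
    {"department": "Education", "budget_2013": 160, "budget_2014": 180},
]

value_columns = ["budget_2013", "budget_2014"]

# One row per department, one summed total per value column.
pivot = {}
for line in budget_lines:
    totals = pivot.setdefault(
        line["department"], {column: 0 for column in value_columns}
    )
    for column in value_columns:
        totals[column] += line[column]
```

Placing the two totals side by side is what makes a year-on-year comparison per department possible.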

Page 29: Introduction to data cleaning with spreadsheets

Exercises

• Find the sectors of the national budget that grew the most in percentage terms

• Identify the budget lines that had the biggest absolute increase in the budget

• Generate a pivot table based on the national budget comparing 2014 and 2013 in specific sectors
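The first two exercises rest on two different calculations, which can rank the same sectors differently. The Python sketch below shows both on invented figures; these numbers are not from the national budget and the sector names are placeholders.

```python
# Hypothetical (2013, 2014) budgets per sector.
budgets = {
    "Health": (100, 130),
    "Education": (200, 250),
    "Defense": (50, 70),
}

# Absolute increase: new minus old.
absolute_increase = {
    sector: new - old for sector, (old, new) in budgets.items()
}

# Percentage growth: the increase relative to the old value.
percent_growth = {
    sector: 100 * (new - old) / old for sector, (old, new) in budgets.items()
}

biggest_absolute = max(absolute_increase, key=absolute_increase.get)
fastest_growing = max(percent_growth, key=percent_growth.get)
```

In this made-up example the sector with the largest absolute increase is not the one with the highest percentage growth, which is why the exercises ask for both.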