Cleaning and sorting data

If you are working with data, it almost certainly came from people who did not make it for you. That means it is not perfect for your needs. Here are some easy ways to wrangle it with tools you mostly have already.

Transcript of Cleaning and sorting data

Page 1: Cleaning and sorting data

Cleaning and sorting data

YOU CAN DO A LOT WITH TOOLS YOU ALREADY HAVE

Page 2: Cleaning and sorting data

A word about data

Data is information that comes from people. Typically, they did not make it for you.

That means it is not perfect for your needs.

This is true whether they are people who built a sophisticated system to generate the data, or just someone who sent you some lists as an email attachment.

Now you have to deal with it. Often the humbler the data, the bigger the headache.

Knowing some easy ways to wrangle the simple stuff will give you a foundation for doing fancier work with bigger, more sophisticated data sets, if you are so inclined.

But also, it will help you get the job done today, and sometimes that’s good enough.

Page 3: Cleaning and sorting data

You can do a lot with what you’ve got

• Word (yes, Microsoft Word)
• Excel
• The command line
• A simple text editor

Page 4: Cleaning and sorting data

First, you don’t have to cut and paste to sort

Here’s a list of info about states that you wish were in some kind of order…

Page 5: Cleaning and sorting data

Open it in Word. Select your text, then click the “A to Z” sort button in the Paragraph group. Sorting alphabetically by paragraph is the default, so just hit OK.

Page 6: Cleaning and sorting data

You can also sort by “fields” if your lines have recurring separators.
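If you would rather script it, the same two sorts can be sketched in Python. This is only an illustration: the sample lines, the pipe separator, and the helper name sort_by_field are all made up for the example.

```python
# Hypothetical sample lines; the field separator is assumed to be a pipe.
lines = [
    "Illinois|Springfield|1818",
    "Alabama|Montgomery|1819",
    "Wisconsin|Madison|1848",
]


def sort_by_field(rows, field_index, sep="|"):
    """Return rows sorted by the given zero-based field."""
    return sorted(rows, key=lambda row: row.split(sep)[field_index])


# Plain alphabetical sort, like Word's default sort by paragraph.
alphabetical = sorted(lines)

# Sort by the second field, like Word's sort by "fields".
by_capital = sort_by_field(lines, 1)
```

Both calls return new lists and leave the original list untouched, so you can check the result before you save anything.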

Page 7: Cleaning and sorting data

Or find a pattern you can turn into a separator: change the “ – ” to a pipe (“|”) or a tab.
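The same find-and-replace can be sketched in Python. The sample line and the function name to_delimited are assumptions for illustration; the pattern being replaced is a space, an en dash, and a space.

```python
def to_delimited(line, old=" \u2013 ", new="|"):
    """Swap a recurring pattern (space, en dash, space) for a real separator."""
    return line.replace(old, new)


# Hypothetical sample line with " – " between the parts.
sample = "Illinois \u2013 Springfield \u2013 1818"

piped = to_delimited(sample)             # pipe-separated
tabbed = to_delimited(sample, new="\t")  # tab-separated
```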

Page 8: Cleaning and sorting data

Word lets you find (or insert) paragraph marks and tabs

This means you can turn a nasty old text file into a spreadsheet in two jiffies.

Page 9: Cleaning and sorting data

Change returns (^p) to tabs (^t), then change double tabs (^t^t) back to ^p. Copy and paste into Excel!
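Here is a rough Python version of that two-step replace. It assumes the text file has one field per line and a blank line between records, which is what makes the double-tab step work; the sample text and the function name are invented for the example.

```python
def records_to_rows(text):
    """Turn one-field-per-line records into one tab-separated row per record."""
    # Step 1: every return becomes a tab (like replacing ^p with ^t).
    tabbed = text.strip().replace("\n", "\t")
    # Step 2: a double tab marks the blank line between records,
    # so it goes back to a return (like replacing ^t^t with ^p).
    return tabbed.replace("\t\t", "\n")


# Hypothetical sample: two records separated by a blank line.
sample = "Name A\nCity A\nPhone A\n\nName B\nCity B\nPhone B\n"
rows = records_to_rows(sample)
```

The result can be saved as a .txt or .tsv file and opened in Excel, or pasted straight in.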

Page 10: Cleaning and sorting data

Ah, Excel – so nice, so clean…

But wait – I needed the state and zip code in separate columns…

Page 11: Cleaning and sorting data

Select your column, then pick "Text to Columns" in the Data Tools menu, Delimited. Check the character to split on (space in this example). Voila!

Actually you have to move the phone numbers to the right first. And yes, you’ll have to fix East Moline.
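For comparison, a small Python sketch of the same split. It assumes each cell looks like "City ST 12345" (the sample values are hypothetical) and counts spaces from the right, so a two-word city such as East Moline stays in one piece.

```python
def split_city_state_zip(cell):
    """Split 'City ST 12345' into (city, state, zip), working from the right.

    Raises ValueError if the cell has fewer than three space-separated parts.
    """
    city, state, zip_code = cell.strip().rsplit(" ", 2)
    return city, state, zip_code


# Hypothetical sample cells.
cells = ["Chicago IL 60601", "East Moline IL 61244"]
split_rows = [split_city_state_zip(cell) for cell in cells]
```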

Page 12: Cleaning and sorting data

Sometimes you need to combine columns – for example, first name and last name, or genus and species. (I’ve moved the amounts to the right to give us space.)

Page 13: Cleaning and sorting data

Select the cell where the first result should go and type “=CONCATENATE(_," ",_)”, filling the blanks with the cells the info is coming from (here, B2 and C2) and putting whatever should go between them (here, a space) inside the quotes.

Page 14: Cleaning and sorting data

When you see the correct result, select the cell and copy it to all the cells below.
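The same idea in Python, as a minimal sketch. The sample rows stand in for columns B and C and are made up; combine is a hypothetical helper name.

```python
def combine(first, second, sep=" "):
    """Join two cell values with a separator, much like CONCATENATE in Excel."""
    return f"{first}{sep}{second}"


# Hypothetical sample rows standing in for columns B and C.
rows = [("Quercus", "alba"), ("Acer", "rubrum")]

# Apply the same formula to every row, like copying the cell down.
combined = [combine(b, c) for b, c in rows]
```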

Page 15: Cleaning and sorting data

And guess what else is still useful – the command prompt!

Yes, there are better things out there. But it may still be the fastest way to compare two files, especially if you haven’t installed the other things.

Page 16: Cleaning and sorting data

Plain old “fc” (file compare) will list the differences between two files. Just put them in the same folder and type:

fc filename1 filename2

Seriously, this is so darn handy!
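If you are not on Windows, or want the comparison inside a script, Python's standard difflib module can do a similar job. This is a sketch, not a replacement for fc: the output format is different (a unified diff), and the function name compare_files is just for illustration.

```python
import difflib
from pathlib import Path


def compare_files(path_a, path_b):
    """Return the differences between two text files as unified-diff lines."""
    a = Path(path_a).read_text(encoding="utf-8").splitlines()
    b = Path(path_b).read_text(encoding="utf-8").splitlines()
    return list(
        difflib.unified_diff(
            a, b, fromfile=str(path_a), tofile=str(path_b), lineterm=""
        )
    )
```

Lines starting with "-" appear only in the first file and lines starting with "+" appear only in the second. An empty list means the files match line for line.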

Page 17: Cleaning and sorting data

Sometimes the problem is inconsistent filenaming

Maybe you got some data where each record is in a separate file, or pics from different cameras. A tool I use all the time to deal with this is “Bulk Rename Utility”.

Page 18: Cleaning and sorting data

Like most tools, Bulk Rename Utility can do lots of fancy stuff. But a simple “Replace” will quickly standardize most inconsistent filenames for you.
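A simple “Replace” on filenames can also be sketched in a few lines of Python. This is not how Bulk Rename Utility works internally, just the same idea; the function name and the dry_run switch are assumptions for the example, and it does not check whether the new name already exists.

```python
from pathlib import Path


def bulk_replace(folder, old, new, dry_run=True):
    """Replace `old` with `new` in every filename in one folder.

    Returns a list of (old_name, new_name) pairs. With dry_run=True
    nothing is renamed, so you can preview the changes first.
    """
    changes = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and old in path.name:
            target = path.with_name(path.name.replace(old, new))
            changes.append((path.name, target.name))
            if not dry_run:
                path.rename(target)
    return changes
```

Run it once with dry_run=True to see what would change, then again with dry_run=False when the preview looks right.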

Page 19: Cleaning and sorting data

An underrated problem with data is finding the bits you need

Fortunately, some free text editors like “Notepad++” will search across all the files in a folder and all its subfolders – even your whole C: drive.

Page 20: Cleaning and sorting data

Can’t face opening file after file to try to find the data you’re looking for? Well, you don’t need to. Besides, as a human, you might miss some of it.

Page 21: Cleaning and sorting data

Open Notepad++ (or a similar text editor) and select “Find in Files” …

Page 22: Cleaning and sorting data

Tell it what to look for and where to start.

To use or save the results, select and copy to a file.
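For the curious, here is roughly what a “Find in Files” search does, sketched in Python. It assumes the files are text and simply skips anything it cannot read; the function name find_in_files is made up for the example.

```python
from pathlib import Path


def find_in_files(root, needle):
    """Search every file under `root` (including subfolders) for `needle`.

    Returns a list of (file path, line number, line text) for each match.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="ignore" lets us skim files that are not clean UTF-8.
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number, line.strip()))
    return hits
```

Each hit tells you which file and which line to open, so the results can be written to a file and worked through later.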

Page 23: Cleaning and sorting data

Download these tools for free

Bulk Rename Utility: http://www.bulkrenameutility.co.uk/download.php

Notepad++: http://download.cnet.com/notepad/3000-2352_4-10327521.html

Page 24: Cleaning and sorting data

Some other data tools you already have

Actually these are some of the best ones…

Page 25: Cleaning and sorting data

Because let’s say you have a nice clean data set from Socrata.

Maybe it’s County procurement data. Or whatever… You still have to make sense of it.

Page 26: Cleaning and sorting data

Thinking about the data

• Is it complete? (right number of records)
• Is it consistent? (records entered the same way)
• Are there typos or variant punctuation? Stray spaces?
• Are there values that don’t seem to make sense?
• Does it jibe with what you expected to be there?
• For what purpose, or under what mandate, was it compiled? This can affect the meaning of terms.
• What do the values actually mean?
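A few of these questions (record count, stray spaces, variant spellings) can be checked mechanically. The sketch below is one possible way to do it in Python; the sample data, the column name "county", and the function name quick_checks are all hypothetical. The questions about purpose and meaning still need a human.

```python
import csv
import io
from collections import Counter


def quick_checks(rows, column, expected_count=None):
    """Run a few basic sanity checks on one column of a list of dict rows."""
    values = [row[column] for row in rows]
    return {
        # Is it complete? Compare against the number you were told to expect.
        "record_count": len(rows),
        "count_matches": expected_count is None or len(rows) == expected_count,
        # Stray spaces at the start or end of a value.
        "stray_spaces": [v for v in values if v != v.strip()],
        # Empty cells.
        "blank": sum(1 for v in values if not v.strip()),
        # Is it consistent? Group values that differ only by case or spacing.
        "variants": Counter(v.strip().lower() for v in values),
    }


# Hypothetical sample data standing in for a downloaded CSV file.
sample = io.StringIO("county,amount\nCook,100\n cook ,250\nCOOK,75\nLake,\n")
rows = list(csv.DictReader(sample))
report = quick_checks(rows, "county", expected_count=4)
```

A value that shows up under several spellings in "variants" is a good candidate for a find-and-replace before you start counting or sorting.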

Page 27: Cleaning and sorting data

Getting to the bottom of it

• How is this data generated, actually?
• What staff are responsible for it?
• If it’s automated, what triggers an entry?
• If there are “multiple choice” values, what is the selection based on?
• Is anyone checking it?
• How often is it updated?
• What do these codes / terms / values actually mean?

Page 28: Cleaning and sorting data

Some more things to look at

There are of course plenty of more-sophisticated ways to clean and test the potential ok-ness of data. Many of them are way beyond me. But they are based on this kind of thinking. Here is some more of it at its best.

• Some thoughts from the IRE blog - http://ire.org/blog/ire-news/2013/10/25/ten-irrefutable-and-nonnegotiable-rules-responsibl/

• Some thoughts from Drew Skau, visualization architect at Visual.ly - http://blog.visual.ly/cleaning-data-sets/

• The School of Data Handbook - http://schoolofdata.org/handbook/

Page 29: Cleaning and sorting data

Data used in my examples

• State publications - http://www.library.illinois.edu/doc/researchtools/guides/state/statelist.html

• Community health centers - http://getcoveredillinois.gov/

• Scott Walker campaign contributors - http://boycottwalker.bsharp.org/walker-by-contributor.html

• Photo files - downloads from personal mobile devices