Diagnosing dirty data_ire2013

Diagnosing dirty data - IRE 2013 (including cat photos)

Page 1: Diagnosing dirty data_ire2013

Diagnosing Dirty Data

Jaimi Dowdell, IRE/NICAR
Jennifer LaFleur, ProPublica

Get your data's history

• Know the source of the data

• Know how it's used

• Know what all the fields mean

• Know what other stories have been done with it

What is dirty data?

• Missing records

• Incorrect information

• Duplicate information

• No standardization

Take your data's temperature

• How many records should you have?

• Double-check totals or counts. Check for studies or summary reports.

• Check for duplicates. Make sure they are real duplicates. Is it possible that there are hidden duplicates?

• Consistency-check all fields. Are all city/county names spelled the same? Are all codes found within documentation?
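
The checks on this slide can be sketched in a few lines of Python. This is only a minimal sketch using the standard library; the sample rows, the field names (`id`, `city`, `code`), the expected count and the set of valid codes are all hypothetical stand-ins for what your own data and documentation would supply.

```python
from collections import Counter

# Hypothetical sample rows; in practice these might come from csv.DictReader.
rows = [
    {"id": "1", "city": "St. Louis", "code": "A"},
    {"id": "2", "city": "ST LOUIS", "code": "B"},
    {"id": "2", "city": "ST LOUIS", "code": "B"},     # exact duplicate
    {"id": "3", "city": "Saint Louis", "code": "Z"},  # code not in documentation
]

EXPECTED_COUNT = 4             # assumed: the count the agency says it sent
VALID_CODES = {"A", "B", "C"}  # assumed: codes listed in the record layout

# 1. How many records should you have?
count_matches = len(rows) == EXPECTED_COUNT

# 2. Exact duplicates: rows with identical values in every field.
row_counts = Counter(tuple(sorted(r.items())) for r in rows)
exact_duplicates = [dict(key) for key, n in row_counts.items() if n > 1]

# 3. Consistency: list every spelling of a field with its frequency.
city_spellings = Counter(r["city"] for r in rows)

# 4. Codes that do not appear in the documentation.
unknown_codes = sorted({r["code"] for r in rows} - VALID_CODES)

print(count_matches)
print(exact_duplicates)
print(city_spellings.most_common())
print(unknown_codes)
```

The frequency table in step 3 is also one way to start looking for hidden duplicates: three spellings of one city can keep otherwise identical records from matching.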

Internal consistency checks

• Is there more money going to sub-contractors than went to the prime contractor?

• Are there more teachers than students?

• How about other important fields?

• Check the range of fields. (For example, check for DOBs that would make people too old or too young.)

• Check for missing data or blank fields. Are they real values, or did something happen with an import or append query?
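
As a rough illustration of the checks above, the sketch below flags schools with more teachers than students, blank dates of birth, and ages outside a plausible range. The records, field names, reference date and age limits are all assumptions made for the example, not values from any real dataset.

```python
from datetime import date

# Hypothetical school records; field names are illustrative only.
schools = [
    {"name": "Adams", "teachers": 40, "students": 600},
    {"name": "Baker", "teachers": 300, "students": 25},  # suspicious
]

# Hypothetical person records; dates of birth are ISO strings, "" is blank.
people = [
    {"name": "A", "dob": "1975-06-01"},
    {"name": "B", "dob": "1875-06-01"},  # would make this person far too old
    {"name": "C", "dob": ""},            # blank field
]

AS_OF = date(2013, 6, 1)   # assumed reference date for computing ages
MIN_AGE, MAX_AGE = 0, 110  # assumed plausible range


def age_on(dob, as_of):
    """Whole years between dob and as_of."""
    years = as_of.year - dob.year
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


# Are there more teachers than students?
more_teachers = [s["name"] for s in schools if s["teachers"] > s["students"]]

# Blank fields: report them separately instead of treating them as values.
blank_dob = [p["name"] for p in people if not p["dob"].strip()]

# Range check on the remaining dates of birth.
out_of_range = []
for p in people:
    if p["dob"].strip():
        age = age_on(date.fromisoformat(p["dob"]), AS_OF)
        if not MIN_AGE <= age <= MAX_AGE:
            out_of_range.append((p["name"], age))

print(more_teachers, blank_dob, out_of_range)
```

A flagged record is a question to ask, not an error by definition; the blanks in particular may be real or may come from a bad import.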

External Checks

• Compare to reports

• Data reported to other agencies

• On-the-ground reporting

• Verification from sources
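
Comparing to reports can be partly automated. The sketch below counts records per year and lists the years where the data disagrees with totals typed in from a published summary report. Both the records and the reported totals here are invented for illustration.

```python
from collections import Counter

# Hypothetical records, and hypothetical totals copied from a summary report.
records = [{"year": 2011}, {"year": 2011}, {"year": 2012}]
reported_totals = {2011: 2, 2012: 3}

computed = Counter(r["year"] for r in records)

# Years where the data and the report disagree: year -> (computed, reported)
mismatches = {
    year: (computed.get(year, 0), reported_totals.get(year, 0))
    for year in sorted(set(computed) | set(reported_totals))
    if computed.get(year, 0) != reported_totals.get(year, 0)
}

print(mismatches)
```

A mismatch does not say which side is wrong; it only tells you where to start asking the agency questions.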

Steps for cleaning data

• Assess the problem

• Identify your goal

• Find the right tool for the job

• Set aside time (double what you think)

• Make a backup copy

• Make a backup copy

• Never alter the original data. Make new columns so you can compare and show your work.

• Create an audit trail.

• Spot check as you go.
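
One possible way to follow the last three steps in code: leave the original column alone, write the cleaned value to a new column, and log every change. This is a sketch under stated assumptions; the inline data, the `city_clean` column name and the `CITY_FIXES` lookup are hypothetical.

```python
import csv
import io

# Hypothetical raw data; in practice, read from a backup copy of the file.
raw = io.StringIO("id,city\n1, st. louis \n2,ST LOUIS\n3,Kansas City\n")

# Hypothetical lookup of known variants -> standard spelling.
CITY_FIXES = {"ST. LOUIS": "St. Louis", "ST LOUIS": "St. Louis"}

rows = list(csv.DictReader(raw))
audit_trail = []

for row in rows:
    original = row["city"]
    key = original.strip().upper()
    cleaned = CITY_FIXES.get(key, original.strip())
    # Write to a NEW column; the original column is left untouched.
    row["city_clean"] = cleaned
    if cleaned != original:
        audit_trail.append(
            {"id": row["id"], "field": "city", "old": original, "new": cleaned}
        )

for change in audit_trail:
    print(change)
```

Keeping `city` and `city_clean` side by side makes spot checks easy, and the audit trail shows your work if anyone asks how a value was changed.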

Tips for success

• Keep a data notebook

• Duplicate your work

• Duplicate your work

• Bounce your results off folks who really know the data

• Set up some standards for your work/newsroom

Choose the right tool

• You don't need to be fancy, just get the job done

• Work with what you're comfortable with

• Don't forget the power of Excel

• Text editors can be lifesavers

• Many tools exist: OpenRefine, programming, etc.

• Get training as needed

Focus is important

So get plenty of food and rest

Get a data buddy

Common ailments

Dates that aren't dates
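
One way this ailment might be treated in Python is to try a short list of known formats and set aside anything that fits none of them. The format list is an assumption (including the US month-first reading of `03/01/2013`) and should be adjusted to what actually appears in your data.

```python
from datetime import datetime

# Assumed formats; extend the list to match what appears in your data.
FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%Y%m%d", "%B %d, %Y"]


def parse_date(text):
    """Return a date if text matches a known format, otherwise None."""
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


samples = ["03/01/2013", "2013-03-01", "20130301", "March 1, 2013",
           "13/45/2013", "unknown"]
parsed = [parse_date(s) for s in samples]

# Values that are not dates: review these by hand instead of guessing.
unparsed = [s for s, d in zip(samples, parsed) if d is None]

print(parsed)
print(unparsed)
```

Storing the parsed result in a new field, and keeping the original text, lets you check the leftovers without losing anything.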

Names, names, names...
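
Names often arrive in mixed layouts such as "Last, First" and "First Last". The function below is a deliberately naive sketch of putting them into one comparable form; `normalize_name` is a hypothetical helper, and it does not handle suffixes, middle names or multi-word surnames.

```python
def normalize_name(raw):
    """Return (last, first) in upper case from 'Last, First' or 'First Last'.

    Naive on purpose: suffixes (Jr., III), middle names and multi-word
    surnames need rules of their own.
    """
    # Drop periods, collapse runs of whitespace, and upper-case everything.
    cleaned = " ".join(raw.replace(".", "").split()).upper()
    if "," in cleaned:
        last, _, rest = cleaned.partition(",")
        parts = rest.split()
        first = parts[0] if parts else ""
        return last.strip(), first
    parts = cleaned.split()
    if len(parts) < 2:
        return cleaned, ""
    return parts[-1], parts[0]


for name in ["Smith, John Q.", "  john   smith ", "SMITH,JOHN", "Cher"]:
    print(repr(name), "->", normalize_name(name))
```

Write the result to new columns and treat matches as leads to verify, since two different people can share a normalized name.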

Location matters

Leading and trailing spaces
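
Stray spaces are invisible in most viewers but keep values from matching. A minimal sketch, with invented sample values and a hypothetical `tidy` helper, of how they might be found and removed:

```python
# Hypothetical values that look alike on screen but do not match each other.
values = [
    "Springfield",
    " Springfield",        # leading space
    "Springfield ",        # trailing space
    "Springfield\u00a0",   # trailing non-breaking space
    "Spring  field",       # doubled inner space (a different problem)
]


def tidy(text):
    """Strip outer whitespace and collapse inner runs to a single space."""
    return " ".join(text.split())


# repr() makes the hidden characters visible when you print the list.
flagged = [repr(v) for v in values if v != tidy(v)]

distinct_before = len(set(values))
distinct_after = len({tidy(v) for v in values})

print(flagged)
print(distinct_before, distinct_after)
```

Note that tidying fixes the spacing only: "Spring field" still differs from "Springfield" and needs a separate decision.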

"Pretty" reports

Inoperable data: Pain management

• Explain caveats

• Choose your wording carefully

• Know when to leave out records

• Be transparent

• Know what questions can and can't be answered with this dataset

• Know when to get more information

Continue learning about dirty data: Sat. 3:40 p.m., Conference Room 11

BYOD (Bring your own data): Sat. 4:50 p.m., Conference Room 11

Get your hands dirty

[email protected] (@j_la28)
[email protected] (@jaimidowdell)

Questions?