Diagnosing dirty data_ire2013
Diagnosing Dirty Data
Jaimi Dowdell, IRE/NICAR
Jennifer LaFleur, ProPublica
Get your data's history
• Know the source of the data
• Know how it's used
• Know what all the fields mean
• Know what other stories have been done with it
What is dirty data?
• Missing records
• Incorrect information
• Duplicate information
• No standardization
Take your data's temperature
• How many records should you have?
• Double-check totals or counts. Check for studies or summary reports.
• Check for duplicates. Make sure they are real duplicates. Is it possible that there are hidden duplicates?
• Consistency-check all fields. Are all city/county names spelled the same? Are all codes found within documentation?
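The checks on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The sample rows, the field names (name, city) and the expected count are made up for the example and are not from any real dataset.

```python
from collections import Counter

# Hypothetical sample rows; in practice these would come from csv.DictReader.
rows = [
    {"name": "Smith, John", "city": "San Antonio"},
    {"name": "Smith, John", "city": "San Antonio"},   # exact duplicate
    {"name": "SMITH, JOHN ", "city": "san antonio"},  # hidden duplicate
    {"name": "Lopez, Ana", "city": "SanAntonio"},     # spelling variant
]
EXPECTED_COUNT = 4  # assumed figure, e.g. from the agency's documentation


def normalize(value):
    """Collapse case and whitespace so near-identical values compare equal."""
    return " ".join(value.lower().split())


# 1. How many records should you have?
count_matches = len(rows) == EXPECTED_COUNT

# 2. Exact duplicates: rows identical in every field.
exact = Counter(tuple(sorted(r.items())) for r in rows)
exact_dupes = {key: n for key, n in exact.items() if n > 1}

# 3. Hidden duplicates: rows identical only after normalizing.
hidden = Counter((normalize(r["name"]), normalize(r["city"])) for r in rows)
hidden_dupes = {key: n for key, n in hidden.items() if n > 1}

# 4. Consistency check: list every distinct spelling of a field.
city_spellings = Counter(r["city"] for r in rows)

print("count matches:", count_matches)
print("exact duplicate groups:", len(exact_dupes))
print("hidden duplicate groups:", hidden_dupes)
print("city spellings:", dict(city_spellings))
```

A match on the normalized key is only a candidate. Someone still has to decide whether the rows are real duplicates or two different people.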
Internal consistency checks
• Is there more money going to sub-contractors than went to the prime contractor?
• Are there more teachers than students?
• How about other important fields?
• Check the range of fields. (For example, check for DOBs that would make people too old or too young.)
• Check for missing data or blank fields. Are they real values, or did something happen with an import or append query?
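As a rough sketch, the internal checks above might look like this in Python. The records, field names, reference date and age cutoff are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical records; field names are illustrative only.
schools = [
    {"school": "A", "teachers": 40, "students": 600},
    {"school": "B", "teachers": 75, "students": 50},     # more teachers than students
    {"school": "C", "teachers": None, "students": 320},  # blank field
]
people = [
    {"id": 1, "dob": date(1975, 6, 1)},
    {"id": 2, "dob": date(1875, 6, 1)},  # too old
    {"id": 3, "dob": date(2030, 1, 1)},  # not born yet
]
AS_OF = date(2013, 6, 20)  # assumed reference date for the dataset
MAX_AGE = 110              # assumed plausibility cutoff


def age_on(dob, as_of):
    """Whole years between dob and as_of (negative if dob is in the future)."""
    before_birthday = (as_of.month, as_of.day) < (dob.month, dob.day)
    return as_of.year - dob.year - before_birthday


# Rows with a blank in any field; these need a look before any comparison.
missing = [s["school"] for s in schools
           if any(v is None or v == "" for v in s.values())]

# Relationship check: are there more teachers than students?
more_teachers = [s["school"] for s in schools
                 if s["school"] not in missing and s["teachers"] > s["students"]]

# Range check: birth dates that make people too old or too young.
bad_dobs = [p["id"] for p in people
            if not 0 <= age_on(p["dob"], AS_OF) <= MAX_AGE]

print("missing values:", missing)
print("more teachers than students:", more_teachers)
print("implausible DOBs:", bad_dobs)
```

Flagged rows are questions for the agency, not automatic errors. A blank may be a real value or a casualty of an import.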
External Checks
• Compare to reports
• Data reported to other agencies
• On the ground reporting
• Verification from sources
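Comparing to reports can be partly automated. The sketch below assumes you have totals computed from the data and totals copied from a published report. The numbers, the yearly grouping and the 1 percent tolerance are hypothetical.

```python
# Hypothetical totals by year: ours come from the data, theirs from a report.
computed = {"2010": 1520, "2011": 1610, "2012": 1200}
published = {"2010": 1520, "2011": 1608, "2012": 1695}
TOLERANCE = 0.01  # assumed: flag differences larger than 1 percent


def compare_totals(computed, published, tolerance):
    """Return {key: (ours, theirs)} for totals that are missing or too far apart."""
    flagged = {}
    for key in sorted(set(computed) | set(published)):
        ours = computed.get(key)
        theirs = published.get(key)
        if ours is None or theirs is None:
            flagged[key] = (ours, theirs)  # present on one side only
            continue
        denominator = abs(theirs) or 1     # avoid dividing by zero
        if abs(ours - theirs) / denominator > tolerance:
            flagged[key] = (ours, theirs)
    return flagged


flagged = compare_totals(computed, published, TOLERANCE)
print(flagged)
```

A large gap, like the 2012 figure here, is a reason to call the agency and ask whether records are missing.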
Steps for cleaning data
• Assess the problem
• Identify your goal
• Find the right tool for the job
• Set aside time (double what you think)
• Make a backup copy
• Make a backup copy
• Never alter the original data. Make new columns so you can compare and show your work.
• Create an audit trail.
• Spot check as you go.
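One way to follow the "new columns" and "audit trail" steps is sketched below in Python. The raw text, the column names and the cleaning rule are assumptions for illustration. The data stays in memory here so the example is self-contained.

```python
import csv
import io

# Hypothetical raw export; the column name "city" is illustrative.
RAW = "id,city\n1, San Antonio \n2,SAN ANTONIO\n3,Austin\n"


def clean_city(value):
    """Trim and collapse spaces, then title-case.

    Title-casing is a naive rule; it would mangle names like "McAllen".
    """
    return " ".join(value.split()).title()


audit_log = []     # one entry per changed value: (id, field, before, after)
cleaned_rows = []
for row in csv.DictReader(io.StringIO(RAW)):
    original = row["city"]
    row["city_clean"] = clean_city(original)  # new column; original untouched
    if row["city_clean"] != original:
        audit_log.append((row["id"], "city", original, row["city_clean"]))
    cleaned_rows.append(row)

# Write both columns side by side so the work can be compared and shown.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "city", "city_clean"],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(cleaned_rows)

print(out.getvalue())
for entry in audit_log:
    print(entry)
```

With a real file, copying it first (for example with shutil.copy2) covers the backup step, and the audit log can be saved next to the cleaned output.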
Tips for success
• Keep a data notebook
• Duplicate your work
• Duplicate your work
• Bounce your results off folks who really know the data
• Set up some standards for your work/newsroom
Choose the right tool
• You don't need to be fancy, just get the job done
• Work with what you're comfortable with
• Don't forget the power of Excel
• Text editors can be lifesavers
• Many tools exist: OpenRefine, programming, etc.
• Get training as needed
Focus is important
So get plenty of food and rest
Get a data buddy
Common ailments
Dates that aren't dates
Names, names, names...
Location matters
Leading and trailing spaces
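Three of these ailments (dates, names and stray spaces) can be treated with a short Python sketch like the one below. The list of date formats, the "Last, First" name layout and the sample values are assumptions. Real data will need its own rules.

```python
from datetime import datetime

# Assumed list of formats seen in the data; extend it as new ones turn up.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%m-%d-%y")


def parse_date(text):
    """Return a date, or None when the text matches no known format."""
    cleaned = text.strip()  # leading and trailing spaces break parsing
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def split_name(text):
    """Split 'Last, First' into (last, first); return None for other shapes."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    return None


# Hypothetical sample values.
date_samples = [" 06/20/2013", "2013-06-20 ", "June 20th", "02/30/2013"]
name_samples = [" Smith ,John", "John Smith"]

parsed_dates = [parse_date(s) for s in date_samples]
parsed_names = [split_name(s) for s in name_samples]

print(parsed_dates)
print(parsed_names)
```

Values that come back as None are not thrown away. They go on a list for manual review.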
"Pretty" reports
Inoperable data: Pain management
• Explain caveats
• Choose your wording carefully
• Know when to leave out records
• Be transparent
• Know what questions can and can't be answered with this dataset
• Know when to get more information
Continue learning about dirty data: Sat. 3:40 p.m., Conference Room 11
BYOD (Bring your own data): Sat. 4:50 p.m., Conference Room 11
Get your hands dirty
Jennifer LaFleur (@j_la28)
Jaimi Dowdell (@jaimidowdell)
Questions?