Good Data Gone Bad, Bad Data Gone Worse - · PDF file Good Data Gone Bad, Bad Data Gone Worse...
date post
19-Jul-2020Category
Documents
view
2download
0
Embed Size (px)
Transcript of Good Data Gone Bad, Bad Data Gone Worse - · PDF file Good Data Gone Bad, Bad Data Gone Worse...
Good Data Gone Bad, Bad Data Gone Worse
Renee Phillips PostgresOpen 2019 1
This is me. 2
Sakeeb Sabaaka Creative commons 2.0 license 3
https://creativecommons.org/licenses/by/2.0/legalcode
@DataRenee
4
This is a talk about how good data goes bad, and how bad data gets worse
5
First, what is data?
● Pieces of information, with a format
● Facts used for making decisions ● Information stored in and/or used
by a computer 6
Data is: A representation of some aspect of the
world
7
Next, what is good data?
Fit for its intended uses in operations, planning, decision making
8
9
Why do we want good data? ● Planning ● Operations ● Looking smart ● Saving money ● Decision making ● Completing transactions
10
Finally, what is bad data?
Not fit for its intended uses in operations, planning, decision making
11
12
What can we assess to check if the data might work for the intended purpose?
13
The Six Primary Dimensions for Data Quality Assessment
1. Completeness 2. Uniqueness 3. Timeliness 4. Validity 5. Accuracy 6. Consistency
PDF from UK International Data Management Association 14
https://www.whitepapers.em360tech.com/wp-content/files_mf/1407250286DAMAUKDQDimensionsWhitePaperR37.pdf
Guidelines for quality assurance in health and health care research
1. Acquisition 2. Entry 3. Cleaning 4. Storage 5. Analysis
PDF from Amsterdam Centre for Health and Healthcare Research
15
http://www.myravanzwieten.nl/pdf/pub_artikelen_hoofdstukken/Guidelines_AmCOGG_2002.pdf
Dan Moyle creatvice commons 2.0 license
11 Things to Look At
16
https://www.flickr.com/photos/danmoyle/11715566974/in/photolist-iRgkU3-858Xjp-5AW5f7-a7S3Vw-29u6umN-FXAqnc-5TTqDx-bERCRK-mua5zD-rkXCAc-5brfMz-ciep6L-pFj6d5-2ax3Esk-qLhr2C-HbHRJ6-26YN6PP-2aTBkvy-kLY9Uj-8kDhX7-UPbkjM-FuaL3P-8ngXq2-a67Pt-Q8D5CE-6JwNoQ-dvJh52-gWRmcB-x1wTLj-QH83VX-tuJTE-cVFAUY-coiicW-j5X64T-hMxW1F-8e66Jz-tHSNz-WqkzBm-dCrPfc-7EnxSU-23z4Xq-7EiJyi-jR9gc-bTLt2X-oqVjUs-e73HNB-BkAk5-9FZmBh-7PYXr7-x2gpV https://creativecommons.org/licenses/by/2.0/legalcode
Assessing Data Quality
Data Attributes
1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity
Data Actions
1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis
17
Acquisition/ Entry Cleaning Storage Analysis
Accuracy x
Completeness x x
Conformance x x
Consistency x
Timeliness x x
Uniqueness x x
Validity x 18
Assessing Data Quality
Data Attributes
1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity
Data Actions
1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis
19
Accuracy
Have we stored the correct value?
20
Accuracy at Entry Signing up for airline rewards program, I entered my date of birth. Super easy.
21
But Wait
My birthday was not in the month they return...
22
Ohhhh The dreaded off by
one error.
23
Just to be sure This really isn’t user error. August 31 happens in every year...
24
How does this happen? ● Bad UX/UI
25
What can we do? ● Test the inputs when we design database ● Talk to the Front End team
26
https://www.bitboost.com/pawsense/ 27
How does this happen? ● Errors in entry ● Unauthorized entry
28
What can we do? ● Knowledge Elicitation (talk to domain experts) ● Set database constraints ● Check anticipated vs actual values ● Maintain security
29
Assessing Data Quality
Data Attributes
1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity
Data Actions
1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis
30
Completeness
Are there gaps between expected data and the data we have?
31
32
33
How does this happen? ● Incorrect sampling ● Incomplete understanding of the business
problem ● Not enough data available
34
What can we do? ● Ask more questions of domain experts ● Find additional data sets
35
Assessing Data Quality
Data Attributes
1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity
Data Actions
1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis
36
patricia m creative commons 2.0 license
Sometimes data is discarded for
relevance or size
37
https://www.flickr.com/photos/taffeta/10924191575/in/photolist-hDkkyB-apjt8E-aPcrH-6wPyeH-eGLGuj-po5w7-eJa6a-7535HA-6kd2J7-62fL9M-4xGZh9-6Y7CXh-ZXc3dy-8LkgNN-5hwAS3-vWqAy-4xHe2U-4pvtz-gD4SsQ-d2WfQJ-hhYMq9-iRunU-6hnXEw-rsDDfZ-6QzykP-4DL4hY-SKjuwC-zJcWk-chGY8S-LKWqT7-pnVC6-3cSfuy-6JoFui-5HoiTL-7g5grL-4piHxe-SZ6Qav-4xCVDp-97RMHp-SZ6PnZ-6eX23d-eR2abe-8KBRZG-2AWTts-eJ2Yee-d66xSL-r8Et6J-2AWTtj-m75nS-Z95Ati https://creativecommons.org/licenses/by-sa/2.0/legalcode
patricia m creative commons 2.0 license Conner McCall creative commons 2.0 license 38
https://www.flickr.com/photos/taffeta/10924191575/in/photolist-hDkkyB-apjt8E-aPcrH-6wPyeH-eGLGuj-po5w7-eJa6a-7535HA-6kd2J7-62fL9M-4xGZh9-6Y7CXh-ZXc3dy-8LkgNN-5hwAS3-vWqAy-4xHe2U-4pvtz-gD4SsQ-d2WfQJ-hhYMq9-iRunU-6hnXEw-rsDDfZ-6QzykP-4DL4hY-SKjuwC-zJcWk-chGY8S-LKWqT7-pnVC6-3cSfuy-6JoFui-5HoiTL-7g5grL-4piHxe-SZ6Qav-4xCVDp-97RMHp-SZ6PnZ-6eX23d-eR2abe-8KBRZG-2AWTts-eJ2Yee-d66xSL-r8Et6J-2AWTtj-m75nS-Z95Ati https://creativecommons.org/licenses/by-sa/2.0/legalcode https://www.flickr.com/photos/connerm/3959189828/in/photolist-bPmRsg-ahEmTu-72RTcA-bywu5G-dKEFDi-cSTf2J-51pAj-5GtJqT-U9Sb3M-r5xUag-U6tePE-SVpYTt-FeEcBU-9ScHkv-cK6q6-qDgbeK-GUfRy-GhnQye-fDvrRK-7pcc8K-aCKKGE-dMAjni-xb3qC-M7PrZo-fgs194-2bTjN7D-ZFjmey-SVpUBT-SVpWMV-L6zQwK-et3dSa-qpJwSA-dXmGYa-aepA2n-5TZMtk-o8zPHX-nTbSBG-5ZkVdY-9FjKzq-pPDQAz-33Jugy-avQ7R-nC6GQW-pPd7DJ-zn7Djv-2a1sAoT-t6hNVV-Stu4Ev-rJ3dL8-8QiHWB https://creativecommons.org/licenses/by-nc/2.0/legalcode
How does this happen? ● Dataset is too large ● Dataset contains unnecessary columns
39
What can we do? ● Import selectively ● Screen data carefully ● Trim and filter as appropriate
40
Assessing Data Quality
Data Attributes
1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity
Data Actions
1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis
41
smcgee creative commons 2.0 license
Storage
42
https://www.flickr.com/photos/smcgee/5483445378/in/photolist-9my6gG-kmMTWX-4Rv4Zk-b6YbCe-5xQ1Q7-5WE2KT-Utwk3q-fJU1bR-sutPPj-48im9C-JPQ2Z-7HFr3y-dZnNkv-5SqmWJ-7yhdPp-vqS2s-anKiAE-7TBicZ-oSVzWc-b1CoqB-bUMs9J-fpY3us-h7awW-ea574N-od7YKG-4oc76t-9azPRC-cMGLi1-pwKicA-3LkV2b-7xasj7-assirk-ea57a1-66DsCQ-csN6cY-6zbmhf-e3uXn2-aJ14gk-c7xHx9-LdEWM-iWC5y3-9bWMXc-Pz1be-7xasjo-cqBTB7-68jEV-hi5Hi9-TuwUY-qGNvRX-ceA1x https://creativecommons.org/licenses/by-nc/2.0/
United States Department of Agriculture License Creative Commons 2.0
Choose the right size storage for your database.
43
https://www.flickr.com/photos/usdagov/6502471563/in/photolist-9ND6qW-9NA8FD-aUAWWZ-oWZbsV-aUASfF-aUAVog-YJdEAY-znF2s-znFfW-Zs9isH-2d6W91o-aUATTk-nqaej9-nau5Se-nrGrgL-nrUdBV-nqaeju-nrYw1B-naucYu-44VQKe-ntKicH-nau6CK-ns1XnJ-nau7cF-BcLoAj-JHKtnv-JHF7Eh-Ke5NgN-JHF7LQ-naubXd-npW5Gs-ntKhX4-nrGqJd-nrGonQ-nauabq-ntKjCt https://creativecommons.org/licenses/by/2.0/legalcode
How does this happen? ● Storage size chosen incorrectly or not updated ● Storage location or equipment chosen poorly ● Column, table, or database dropped in error
44
What can we do? ● Be realistic about data needs, assess frequently ● Have backups ● Trust your users, apply alerts and process to
prevent loss 45
Assessing Data Quality
Data Attributes
1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity
Data Actions
1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis
46
Dear Rich Bastard,
47
https://www.snopes.com/fact-check/dear-rich-bastard/
Dear Rich Bastard,
or maybe try
\pset null '¯\\_(ツ)_/¯'
48
https://www.snopes.com/fact-check/dear-rich-bastard/
How does this happen? ● Null is a black hole of data problems
49
What can we do? ● Document the code ● Be careful with null (Go see Lætitia’s talk about