Good Data Gone Bad, Bad Data Gone Worse - · PDF file Good Data Gone Bad, Bad Data Gone Worse...

Click here to load reader

  • date post

    19-Jul-2020
  • Category

    Documents

  • view

    2
  • download

    0

Embed Size (px)

Transcript of Good Data Gone Bad, Bad Data Gone Worse - · PDF file Good Data Gone Bad, Bad Data Gone Worse...

  • Good Data Gone Bad, Bad Data Gone Worse

    Renee Phillips PostgresOpen 2019 1

  • This is me. 2

  • Sakeeb Sabaaka Creative commons 2.0 license 3

    https://creativecommons.org/licenses/by/2.0/legalcode

  • @DataRenee

    4

  • This is a talk about how good data goes bad, and how bad data gets worse

    5

  • First, what is data?

    ● Pieces of information, with a format

    ● Facts used for making decisions ● Information stored in and/or used

    by a computer 6

  • Data is: A representation of some aspect of the

    world

    7

  • Next, what is good data?

    Fit for its intended uses in operations, planning, decision making

    8

  • 9

  • Why do we want good data? ● Planning ● Operations ● Looking smart ● Saving money ● Decision making ● Completing transactions

    10

  • Finally, what is bad data?

    Not fit for its intended uses in operations, planning, decision making

    11

  • 12

  • What can we assess to check if the data might work for the intended purpose?

    13

  • The Six Primary Dimensions for Data Quality Assessment

    1. Completeness 2. Uniqueness 3. Timeliness 4. Validity 5. Accuracy 6. Consistency

    PDF from UK International Data Management Association 14

    https://www.whitepapers.em360tech.com/wp-content/files_mf/1407250286DAMAUKDQDimensionsWhitePaperR37.pdf

  • Guidelines for quality assurance in health and health care research

    1. Acquisition 2. Entry 3. Cleaning 4. Storage 5. Analysis

    PDF from Amsterdam Centre for Health and Healthcare Research

    15

    http://www.myravanzwieten.nl/pdf/pub_artikelen_hoofdstukken/Guidelines_AmCOGG_2002.pdf

  • Dan Moyle creatvice commons 2.0 license

    11 Things to Look At

    16

    https://www.flickr.com/photos/danmoyle/11715566974/in/photolist-iRgkU3-858Xjp-5AW5f7-a7S3Vw-29u6umN-FXAqnc-5TTqDx-bERCRK-mua5zD-rkXCAc-5brfMz-ciep6L-pFj6d5-2ax3Esk-qLhr2C-HbHRJ6-26YN6PP-2aTBkvy-kLY9Uj-8kDhX7-UPbkjM-FuaL3P-8ngXq2-a67Pt-Q8D5CE-6JwNoQ-dvJh52-gWRmcB-x1wTLj-QH83VX-tuJTE-cVFAUY-coiicW-j5X64T-hMxW1F-8e66Jz-tHSNz-WqkzBm-dCrPfc-7EnxSU-23z4Xq-7EiJyi-jR9gc-bTLt2X-oqVjUs-e73HNB-BkAk5-9FZmBh-7PYXr7-x2gpV https://creativecommons.org/licenses/by/2.0/legalcode

  • Assessing Data Quality

    Data Attributes

    1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity

    Data Actions

    1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis

    17

  • Acquisition/ Entry Cleaning Storage Analysis

    Accuracy x

    Completeness x x

    Conformance x x

    Consistency x

    Timeliness x x

    Uniqueness x x

    Validity x 18

  • Assessing Data Quality

    Data Attributes

    1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity

    Data Actions

    1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis

    19

  • Accuracy

    Have we stored the correct value?

    20

  • Accuracy at Entry Signing up for airline rewards program, I entered my date of birth. Super easy.

    21

  • But Wait

    My birthday was not in the month they return...

    22

  • Ohhhh The dreaded off by

    one error.

    23

  • Just to be sure This really isn’t user error. August 31 happens in every year...

    24

  • How does this happen? ● Bad UX/UI

    25

  • What can we do? ● Test the inputs when we design database ● Talk to the Front End team

    26

  • https://www.bitboost.com/pawsense/ 27

  • How does this happen? ● Errors in entry ● Unauthorized entry

    28

  • What can we do? ● Knowledge Elicitation (talk to domain experts) ● Set database constraints ● Check anticipated vs actual values ● Maintain security

    29

  • Assessing Data Quality

    Data Attributes

    1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity

    Data Actions

    1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis

    30

  • Completeness

    Are there gaps between expected data and the data we have?

    31

  • 32

  • 33

  • How does this happen? ● Incorrect sampling ● Incomplete understanding of the business

    problem ● Not enough data available

    34

  • What can we do? ● Ask more questions of domain experts ● Find additional data sets

    35

  • Assessing Data Quality

    Data Attributes

    1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity

    Data Actions

    1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis

    36

  • patricia m creative commons 2.0 license

    Sometimes data is discarded for

    relevance or size

    37

    https://www.flickr.com/photos/taffeta/10924191575/in/photolist-hDkkyB-apjt8E-aPcrH-6wPyeH-eGLGuj-po5w7-eJa6a-7535HA-6kd2J7-62fL9M-4xGZh9-6Y7CXh-ZXc3dy-8LkgNN-5hwAS3-vWqAy-4xHe2U-4pvtz-gD4SsQ-d2WfQJ-hhYMq9-iRunU-6hnXEw-rsDDfZ-6QzykP-4DL4hY-SKjuwC-zJcWk-chGY8S-LKWqT7-pnVC6-3cSfuy-6JoFui-5HoiTL-7g5grL-4piHxe-SZ6Qav-4xCVDp-97RMHp-SZ6PnZ-6eX23d-eR2abe-8KBRZG-2AWTts-eJ2Yee-d66xSL-r8Et6J-2AWTtj-m75nS-Z95Ati https://creativecommons.org/licenses/by-sa/2.0/legalcode

  • patricia m creative commons 2.0 license Conner McCall creative commons 2.0 license 38

    https://www.flickr.com/photos/taffeta/10924191575/in/photolist-hDkkyB-apjt8E-aPcrH-6wPyeH-eGLGuj-po5w7-eJa6a-7535HA-6kd2J7-62fL9M-4xGZh9-6Y7CXh-ZXc3dy-8LkgNN-5hwAS3-vWqAy-4xHe2U-4pvtz-gD4SsQ-d2WfQJ-hhYMq9-iRunU-6hnXEw-rsDDfZ-6QzykP-4DL4hY-SKjuwC-zJcWk-chGY8S-LKWqT7-pnVC6-3cSfuy-6JoFui-5HoiTL-7g5grL-4piHxe-SZ6Qav-4xCVDp-97RMHp-SZ6PnZ-6eX23d-eR2abe-8KBRZG-2AWTts-eJ2Yee-d66xSL-r8Et6J-2AWTtj-m75nS-Z95Ati https://creativecommons.org/licenses/by-sa/2.0/legalcode https://www.flickr.com/photos/connerm/3959189828/in/photolist-bPmRsg-ahEmTu-72RTcA-bywu5G-dKEFDi-cSTf2J-51pAj-5GtJqT-U9Sb3M-r5xUag-U6tePE-SVpYTt-FeEcBU-9ScHkv-cK6q6-qDgbeK-GUfRy-GhnQye-fDvrRK-7pcc8K-aCKKGE-dMAjni-xb3qC-M7PrZo-fgs194-2bTjN7D-ZFjmey-SVpUBT-SVpWMV-L6zQwK-et3dSa-qpJwSA-dXmGYa-aepA2n-5TZMtk-o8zPHX-nTbSBG-5ZkVdY-9FjKzq-pPDQAz-33Jugy-avQ7R-nC6GQW-pPd7DJ-zn7Djv-2a1sAoT-t6hNVV-Stu4Ev-rJ3dL8-8QiHWB https://creativecommons.org/licenses/by-nc/2.0/legalcode

  • How does this happen? ● Dataset is too large ● Dataset contains unnecessary columns

    39

  • What can we do? ● Import selectively ● Screen data carefully ● Trim and filter as appropriate

    40

  • Assessing Data Quality

    Data Attributes

    1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity

    Data Actions

    1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis

    41

  • smcgee creative commons 2.0 license

    Storage

    42

    https://www.flickr.com/photos/smcgee/5483445378/in/photolist-9my6gG-kmMTWX-4Rv4Zk-b6YbCe-5xQ1Q7-5WE2KT-Utwk3q-fJU1bR-sutPPj-48im9C-JPQ2Z-7HFr3y-dZnNkv-5SqmWJ-7yhdPp-vqS2s-anKiAE-7TBicZ-oSVzWc-b1CoqB-bUMs9J-fpY3us-h7awW-ea574N-od7YKG-4oc76t-9azPRC-cMGLi1-pwKicA-3LkV2b-7xasj7-assirk-ea57a1-66DsCQ-csN6cY-6zbmhf-e3uXn2-aJ14gk-c7xHx9-LdEWM-iWC5y3-9bWMXc-Pz1be-7xasjo-cqBTB7-68jEV-hi5Hi9-TuwUY-qGNvRX-ceA1x https://creativecommons.org/licenses/by-nc/2.0/

  • United States Department of Agriculture License Creative Commons 2.0

    Choose the right size storage for your database.

    43

    https://www.flickr.com/photos/usdagov/6502471563/in/photolist-9ND6qW-9NA8FD-aUAWWZ-oWZbsV-aUASfF-aUAVog-YJdEAY-znF2s-znFfW-Zs9isH-2d6W91o-aUATTk-nqaej9-nau5Se-nrGrgL-nrUdBV-nqaeju-nrYw1B-naucYu-44VQKe-ntKicH-nau6CK-ns1XnJ-nau7cF-BcLoAj-JHKtnv-JHF7Eh-Ke5NgN-JHF7LQ-naubXd-npW5Gs-ntKhX4-nrGqJd-nrGonQ-nauabq-ntKjCt https://creativecommons.org/licenses/by/2.0/legalcode

  • How does this happen? ● Storage size chosen incorrectly or not updated ● Storage location or equipment chosen poorly ● Column, table, or database dropped in error

    44

  • What can we do? ● Be realistic about data needs, assess frequently ● Have backups ● Trust your users, apply alerts and process to

    prevent loss 45

  • Assessing Data Quality

    Data Attributes

    1. Accuracy 2. Completeness 3. Conformance 4. Consistency 5. Timeliness 6. Uniqueness 7. Validity

    Data Actions

    1. Acquisition/Entry 2. Cleaning 3. Storage 4. Analysis

    46

  • Dear Rich Bastard,

    47

    https://www.snopes.com/fact-check/dear-rich-bastard/

  • Dear Rich Bastard,

    or maybe try

    \pset null '¯\\_(ツ)_/¯'

    48

    https://www.snopes.com/fact-check/dear-rich-bastard/

  • How does this happen? ● Null is a black hole of data problems

    49

  • What can we do? ● Document the code ● Be careful with null (Go see Lætitia’s talk about