Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone...

130
Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1

Transcript of Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone...

Page 1: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Good Data Gone Bad, Bad Data Gone Worse

Renee Phillips PostgresOpen 20191

Page 2: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

This is me.2

Page 3: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Sakeeb Sabaaka Creative commons 2.0 license3

Page 4: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

@DataRenee

4

Page 5: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

This is a talk about how good data goes bad, and how bad data gets worse

5

Page 6: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

First, what is data?

● Pieces of information, with a format

● Facts used for making decisions● Information stored in and/or used

by a computer6

Page 7: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Data is: A representation of some aspect of the

world

7

Page 8: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Next, what is good data?

Fit for its intended uses in operations, planning, decision making

8

Page 9: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

9

Page 10: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Why do we want good data? ● Planning● Operations ● Looking smart● Saving money● Decision making● Completing transactions

10

Page 11: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Finally, what is bad data?

Not fit for its intended uses in operations, planning, decision making

11

Page 12: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

12

Page 13: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we assess to check if the data might work for the intended purpose?

13

Page 14: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

The Six Primary Dimensions for Data Quality Assessment

1. Completeness2. Uniqueness 3. Timeliness4. Validity 5. Accuracy 6. Consistency

PDF from UK International Data Management Association14

Page 15: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Guidelines for quality assurance in health and health care research

1. Acquisition2. Entry3. Cleaning4. Storage5. Analysis

PDF from Amsterdam Centre for Health and Healthcare Research

15

Page 17: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

17

Page 18: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Acquisition/Entry Cleaning Storage Analysis

Accuracy x

Completeness x x

Conformance x x

Consistency x

Timeliness x x

Uniqueness x x

Validity x18

Page 19: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

19

Page 20: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Accuracy

Have we stored the correct value?

20

Page 21: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Accuracy at EntrySigning up for airline rewards program, I entered my date of birth. Super easy.

21

Page 22: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

But Wait

My birthday was not in the month they return...

22

Page 23: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

OhhhhThe dreaded off by

one error.

23

Page 24: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Just to be sureThis really isn’t user error. August 31 happens in every year...

24

Page 25: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Bad UX/UI

25

Page 26: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Test the inputs when we design database● Talk to the Front End team

26

Page 27: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

https://www.bitboost.com/pawsense/27

Page 28: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Errors in entry● Unauthorized entry

28

Page 29: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Knowledge Elicitation (talk to domain experts)● Set database constraints● Check anticipated vs actual values● Maintain security

29

Page 30: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

30

Page 31: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Completeness

Are there gaps between expected data and the data we have?

31

Page 32: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

32

Page 33: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

33

Page 34: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Incorrect sampling● Incomplete understanding of the business

problem● Not enough data available

34

Page 35: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Ask more questions of domain experts● Find additional data sets

35

Page 36: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

36

Page 39: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Dataset is too large● Dataset contains unnecessary columns

39

Page 40: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Import selectively● Screen data carefully● Trim and filter as appropriate

40

Page 41: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

41

Page 44: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Storage size chosen incorrectly or not updated● Storage location or equipment chosen poorly● Column, table, or database dropped in error

44

Page 45: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Be realistic about data needs, assess frequently● Have backups● Trust your users, apply alerts and process to

prevent loss45

Page 46: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

46

Page 48: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Dear Rich Bastard,

or maybe try

\pset null '¯\\_(ツ)_/¯'

48

Page 49: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Null is a black hole of data problems

49

Page 50: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Document the code● Be careful with null (Go see Lætitia’s talk about

Null Unknown)● Make NULL appear as something more

noticeable 50

Page 51: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

51

Page 52: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Conformance

Is the data in a format that is expected and acceptable?

52

Page 54: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Errors in entry● Limitations in data collection/ availability● Transforming from one data type to another

54

Page 55: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Test the inputs when we design database● Have multiple entries of subset of data, check for

consistency● Search out additional datasets

55

Page 56: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

56

Page 58: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

-2, -1, 1, 22BCE, 1BCE, 1CE,

2CE 58

Page 59: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Some databases have a year 0● Money is a challenge

59

Page 60: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Beware of calendar challenges● Don’t use the money type● Know what jurisdictions recognize leap seconds ● Daylight Savings● Offsets 60

Page 62: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Improper machine calibration● Improper machine reading

62

Page 63: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Set database constraints● Ensure machine generated data is tested● Ensure data collectors are well trained

63

Page 64: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

64

Page 65: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

NOTNULL

65

Page 66: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

NOTNULLPeople get creative

66

Page 67: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

NULL67

Page 68: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

NULLA black hole of data quality issues

Tony Hoar feels bad about it

68

Page 69: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Data entry is forced to provide a response● Radio button or check boxes provided are not

sufficient to capture respondent needs

69

Page 70: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Consult domain experts● Provide free form “other” option and/or “unknown”

option● Practice good knowledge elicitation

70

Page 71: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

71

Page 72: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

72

Page 73: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Text fields● Integer fields

73

Page 74: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Use BOOLEAN data type● Cast BOOLEAN to TEXT or INT if you must

74

Page 75: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Consistency

Does the database only change data in expected ways?

Are there conflicts between data?

75

Page 77: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

77

Page 78: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Selecting datasets for different areas of the

database● Databases started with inconsistency

78

Page 79: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Coalesce appropriately● Clean data with a documented, reproducible

method● Set database constraints to check consistency

79

Page 80: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

80

Page 81: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Changing granularity may make analysis unreliable or impossible

81

Page 82: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

82

Page 83: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

83

Page 84: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Data models are changed ● Database system is changed● Data is transferred inappropriately

84

Page 85: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Exercise extreme caution when designing first

data model● Speak with domain experts● Use new column instead of changing existing

column

85

Page 86: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

86

Page 87: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

87

Page 88: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Improper database constraints ● Database system is changed● Data is entered inappropriately● Data not available in emergency

88

Page 89: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Have access to multiple datasets for emergencies● Speak with domain experts to plan permission● Be clear in analysis when the change happened

and what that does to the analysis● Be able to correct entries after initial input● Use concurrency control in PostgreSQL

○ https://www.postgresql.org/docs/current/mvcc.html

89

Page 90: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Timeliness

Is there more recent data that is appropriate to the task?

Is the data accessible quickly enough?

90

Page 91: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

91

Page 93: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● A newly discovered dataset is outdated● A newly created dataset is not imported● User is not aware of the age of data

93

Page 94: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Check provenance of data● Actively search for additional sources● Combine datasets where appropriate● Identify if more data is really needed

94

Page 95: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

95

Page 97: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Infrastructure limitations

97

Page 98: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Ensure enough storage on collector● Offline first design

98

Page 99: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

99

Page 101: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Data set is large● Data security prevents access● Data is stale

101

Page 102: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

102

Page 103: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Build or select better indexes● Use partitioning● Store less data● Evaluate roles and users● Update materialized views in transactions 103

Page 104: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

104

Page 105: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Uniqueness

Are there duplicates in the dataset?

105

Page 107: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

# SELECT DISTINCT fruit FROM fruits ORDER BY fruit; fruit ------------ apple banana banananana grape loom naranja orangle(7 rows)

107

Page 109: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Assumptions about domain● Entry errors● Naming things● Duplicate entries

109

Page 110: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Consult domain experts● Check entry● Set primary key● Clean based on information not instinct

110

Page 111: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Validity

Are the format, syntax, and type correct?

Does the data have the potential to be accurate?

111

Page 112: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Assessing Data Quality

Data Attributes

1. Accuracy2. Completeness3. Conformance4. Consistency5. Timeliness6. Uniqueness7. Validity

Data Actions

1. Acquisition/Entry2. Cleaning3. Storage4. Analysis

112

Page 113: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

patient | birth | temperature ---------+---------+------ Susan | 5/12/84 | 101.4 Meg | 1/12/90 | 98.6 Julie | 1/12/90 | 97.2 Fiona | 3/31/65 | 970 Sally | 4/3/01 | 111111

113

Page 114: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

000861 == 861

114

Page 115: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

How does this happen?● Importing from various sources● Improper data type selection● Improper format changes to data● NOTNULL● Data Entry errors 115

Page 116: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

What can we do?● Select correct data type● Set value constraint on column● Figure out a solution for NULL?

116

Page 117: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Acquisition/Entry Cleaning Storage Analysis

Accuracy x

Completeness x x

Conformance x x

Consistency x

Timeliness x x

Uniqueness x x

Validity x117

Page 118: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

118

Page 122: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

{ "sender": "[email protected]", "sent-at": "2019-07-05 09:13:57", "to": "[email protected]", "body": "Dear Mr Smith - \n\rYour recent purchase of the My Little Pony Commemorative Coin Set Model FIM2000 ...",} 122

Page 123: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

{ "patient": "r236576", "called-at": "2019-07-05 09:13:57", "note-type": "billing", "body": "pt complains of abdominal pain, states she might have eaten too many tacos at John Smith’s party",}

123

Page 125: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

No wait. These breaches.

125

Page 129: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Cleaning Takes Time● Granularity● Casting types● Trimming/Filtering● Spelling● Duplicate removal● Detect errors

129

Page 130: Good Data Gone Bad, Bad Data Gone Worse - postgresql.us€¦ · Good Data Gone Bad, Bad Data Gone Worse Renee Phillips PostgresOpen 2019 1. This is me. 2. Sakeeb Sabaaka Creative

Toward A Taxonomy of Bad Data● Missing● Factually incorrect/inaccurate● Imprecise● Internally inconsistent● Corrupted● Deleted● Stolen ● Incomplete● Out of context● And more! 130