Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi...

96
Data Cleaning February 6, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter Thanks to C. Binning for some stolen slides. :) 1

Transcript of Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi...

Page 1: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Data CleaningFebruary 6, 2020

Data Science CSCI 1951A Brown University

Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter

Thanks to C. Binning for some stolen slides. :)1

Page 2: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Announcements• Assignment 1: down! Assignment 2: up!

• Projects:

• Let me know by today at 10:20 if you want to be N != 4

• Being thinking about your project data…the first deliverable is not just a “ceremonial” checkpoint

• “Will we know how to do X by in time?” —> maybe/probably/probably not but you should do it regardless!

2

Page 3: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Today

• 45 minutes—let’s just see how far we get….

• Problems with dirty data

• Cleaning and string matching heuristics

• Monday: bash commands (come with a command line…if you don’t know what that means, ask me)

3

Page 4: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

ID Name Street City State Zip Hours

1 N Aldroubi 123 University Ave Providence RI 98106 42

2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30

3 Nam Do 345 Broadway PVD Rhode Island 98101 19

4 N Dellworth 245 Third Street Pawtucket NULL 98052 299

5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19

6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41

7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL

4

Page 5: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Problems?

ID Name Street City State Zip Hours

1 N Aldroubi 123 University Ave Providence RI 98106 42

2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30

3 Nam Do 345 Broadway PVD Rhode Island 98101 19

4 N Dellworth 245 Third Street Pawtucket NULL 98052 299

5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19

6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41

7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL

5

Page 6: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

ID Name Street City State Zip Hours

1 N Aldroubi 123 University Ave Providence RI 98106 42

2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30

3 Nam Do 345 Broadway PVD Rhode Island 98101 19

4 N Dellworth 245 Third Street Pawtucket NULL 98052 299

5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19

6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41

7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL

Problems? Inconsistent Representations

6

Page 7: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

ID Name Street City State Zip Hours

1 N Aldroubi 123 University Ave Providence RI 98106 42

2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30

3 Nam Do 345 Broadway PVD Rhode Island 98101 19

4 N Dellworth 245 Third Street Pawtucket NULL 98052 299

5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19

6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41

7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL

Problems? Inconsistent Representations

Missing Values7

Page 8: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

ID Name Street City State Zip Hours

1 N Aldroubi 123 University Ave Providence RI 98106 42

2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30

3 Nam Do 345 Broadway PVD Rhode Island 98101 19

4 N Dellworth 245 Third Street Pawtucket NULL 98052 299

5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19

6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41

7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL

Problems? Inconsistent Representations

Missing ValuesTypos8

Page 9: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

ID Name Street City State Zip Hours

1 N Aldroubi 123 University Ave Providence RI 98106 42

2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30

3 Nam Do 345 Broadway PVD Rhode Island 98101 19

4 N Dellworth 245 Third Street Pawtucket NULL 98052 299

5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19

6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41

7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL

Problems? Inconsistent Representations

Missing Values

Duplicates

Typos9

Page 10: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

ID Name Street City State Zip Hours

1 N Aldroubi 123 University Ave Providence RI 98106 42

2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30

3 Nam Do 345 Broadway PVD Rhode Island 98101 19

4 N Dellworth 245 Third Street Pawtucket NULL 98052 299

5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19

6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41

7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL

Problems? Inconsistent Representations

Missing Values

Duplicates

Maybe Duplicates?Typos

10

Page 11: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Data is dirty on its own

• Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions)

• Data doesn’t “age well” (inflation, restricting)

• Any combination of the above

11

Page 12: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Data is dirty on its own

• Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions)

• Data doesn’t “age well” (inflation, restricting)

• Any combination of the above

12

Page 13: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Data is dirty on its own

• Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions)

• Data doesn’t “age well” (inflation, restricting)

• Any combination of the above

13

Page 14: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Data is dirty on its own

• Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions)

• Data doesn’t “age well” (inflation, redistricting)

• Any combination of the above

14

Page 15: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Data is dirty on its own

• Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions)

• Data doesn’t “age well” (inflation, redistricting)

• Any combination of the above

15

Page 16: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

16

Page 17: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

17

Page 18: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

18

Page 19: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

19

Page 20: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

20

Page 21: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

21

Page 22: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

22

Page 23: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

23

Page 24: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Dirty Data…• Parsing input data (e.g., separator issues)

• Naming conventions: NYC vs New York

• Formatting issues – esp. dates

• Missing values and required fields (e.g., always use 0)

• Different representations (2 vs Two)

• Fields too long (get truncated)

• Primary key violations (from data merging)

• Redundant Records (from data merging)

24

Page 25: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Questions!

25

Page 26: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

Clicker Lightening Round!

ID Name City State Hours1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

26

Page 27: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

Clicker Lightening Round!💩

💩

💩

💩

ID Name City State Hours1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

27

Page 28: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

Clicker Lightening Round!

ID Name City State Hours1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

🌟

🌟🌟 🌟

🌟🌟

💩

💩

💩

💩

28

Page 29: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

Clicker Lightening Round!

How will the dirty data affect the results of this query? (a) Too high (b) Too low (c) UnaffectedID Name City State Hours

1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

🌟

🌟🌟 🌟

🌟🌟

💩

💩

💩

💩

29

Page 30: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Lightening Round!

How many TAs are there?

SELECT COUNT(*) FROM TAS

How will the dirty data affect the results of this query? (a) Too high (b) Too low (c) UnaffectedID Name City State Hours

1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

🌟

🌟🌟 🌟

🌟🌟

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

💩

💩

💩

💩

30

Page 31: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Lightening Round!

How many TAs are there?

SELECT COUNT(*) FROM TAS

Duplicates -> Double Counting

🌟🌟

How will the dirty data affect the results of this query? (a) Too high (b) Too low (c) UnaffectedID Name City State Hours

1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

🌟

🌟🌟 🌟

🌟🌟

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

💩

💩

💩

💩

31

Page 32: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Lightening Round!

How many TAs have worked zero hours?SELECT COUNT(*) FROM TAS WHERE Hours = 0

🌟

How will the dirty data affect the results of this query? (a) Too high (b) Too low (c) UnaffectedID Name City State Hours

1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

🌟

🌟🌟 🌟

🌟🌟

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

💩

💩

💩

💩

32

Page 33: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Lightening Round!

How many TAs have worked zero hours?SELECT COUNT(*) FROM TAS WHERE Hours = 0

How will the dirty data affect the results of this query? (a) Too high (b) Too low (c) Unaffected

NULLS aren’t included in the where clause

ID Name City State Hours1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

🌟

🌟🌟 🌟

🌟🌟

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

💩

💩

💩

💩

33

Page 34: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Lightening Round!

🌟

How will the dirty data affect the results of this query? (a) Too high (b) Too low (c) Unaffected

How many hours do my commuter TAs work?SELECT SUM(Hours) FROM TAS WHERE City != “Providence”

ID Name City State Hours1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

🌟

🌟🌟 🌟

🌟🌟

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

💩

💩

💩

💩

34

Page 35: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Lightening Round!

How will the dirty data affect the results of this query? (a) Too high (b) Too low (c) Unaffected

How many hours do my commuter TAs work?SELECT SUM(Hours) FROM TAS WHERE City != “Providence”

Inconsistent names, typos,

and duplicates…

ID Name City State Hours1 Nazem Aldroubi Providence Rhode Island 422 Natalie Delworth Pawtucket Rhode Island 303 Nam Do Providence Rhode Island 387 Minna Kimura-T Warwick Rhode Island 0

🌟

🌟🌟 🌟

🌟🌟

TASID Name City State Hours1 N Aldroubi Providence RI 422 Natalie Delworth Pawtucket RI 303 Nam Do PVD Rhode Island 194 N Dellworth Pawtucket NULL 3005 Do Nam Providence Rhode Island 196 Nazem Aldroubi PVD Rhode Island 427 Minna Kimura-T Warwick RI NULL

💩

💩

💩

💩

35

Page 36: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

36

Page 37: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

37

Page 38: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

38

Page 39: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

39

Page 40: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

40

Page 41: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

41

Page 42: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

42

Page 43: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

43

Page 44: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

44

Page 45: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

45

Page 46: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

What’s to be done?• Look at your data!

• Maybe set (sensible) defaults

• Maybe remove outliers

• Look at your data

• Maybe machine learn some of the things

• Look at your data

• When you issue a query, don’t take the answer as gospel. Instead…wait for it…look at your data!

46

Page 47: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Look at your data

47

Page 48: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Look at your dataSELECT City, COUNT(*) as pop FROM PEOPLE GROUP BY Zip_Code ORDER BY pop

48

Page 49: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Look at your dataSELECT City, COUNT(*) as pop FROM PEOPLE GROUP BY Zip_Code ORDER BY pop

City Count(*)Schenectady 2,500New York City 2,200Los Angeles 1,900

Dallas 1,40049

Page 50: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Look at your dataSELECT City, COUNT(*) as pop FROM PEOPLE GROUP BY Zip_Code ORDER BY pop

City Count(*)Schenectady 2,500New York City 2,200Los Angeles 1,900

Dallas 1,400

?!?!

50

Page 51: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Look at your dataSELECT City, COUNT(*) as pop FROM PEOPLE GROUP BY Zip_Code ORDER BY pop

City Count(*)Schenectady 2,500New York City 2,200Los Angeles 1,900

Dallas 1,400

?!?!City Count(*)

12345 2,50010001 2,200090001 1,90075001 1,400

TYPICAL PROBLEMS: DATA ENTRY

Why are so many of our customers in Schenectady, NY?

51

Page 52: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Set Defaults/Remove Outliers

0

12.5

25

37.5

50

Hours Worked

0 10 20 30 40 50 60 70 80 90 100

52

Page 53: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

0

12.5

25

37.5

50

Hours Worked

0 10 20 30 40 50 60 70 80 90 100

Set Defaults/Remove Outliers

?!?!

53

Page 54: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

0

12.5

25

37.5

50

Hours Worked

0 10 20 30 40 50 60 70 80 90 100

Set Defaults/Remove Outliers

Assume 0?

54

Page 55: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

0

12.5

25

37.5

50

Hours Worked

0 10 20 30 40 50 60 70 80 90 100

Set Defaults/Remove Outliers

Assume 40?

55

Page 56: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

0

12.5

25

37.5

50

Hours Worked

0 10 20 30 40 50 60 70 80 90 100

Set Defaults/Remove Outliers

Delete?

56

Page 57: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

The discovery of the Antarctic "ozone hole" by British Antarctic Survey scientists Farman,

Gardiner and Shanklin…came as a shock to the scientific community…[The data] were initially rejected as unreasonable by data

quality control algorithms (they were filtered out as errors since the values were unexpectedly

low); the ozone hole was detected only in satellite data when the raw data was

reprocessed following evidence of ozone depletion in in situ observations. When the

software was rerun without the flags, the ozone hole was seen as far back as 1976.

https://en.wikipedia.org/wiki/Ozone_depletion#Antarctic_ozone_hole

Set Defaults/Remove Outliers

57

Page 58: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

The discovery of the Antarctic "ozone hole" by British Antarctic Survey scientists Farman,

Gardiner and Shanklin…came as a shock to the scientific community…[The data] were initially rejected as unreasonable by data

quality control algorithms (they were filtered out as errors since the values were unexpectedly

low); the ozone hole was detected only in satellite data when the raw data was

reprocessed following evidence of ozone depletion in in situ observations. When the

software was rerun without the flags, the ozone hole was seen as far back as 1976.

https://en.wikipedia.org/wiki/Ozone_depletion#Antarctic_ozone_hole

Set Defaults/Remove Outliers

58

Page 59: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

The discovery of the Antarctic "ozone hole" by British Antarctic Survey scientists Farman,

Gardiner and Shanklin…came as a shock to the scientific community…[The data] were initially rejected as unreasonable by data

quality control algorithms (they were filtered out as errors since the values were unexpectedly

low); the ozone hole was detected only in satellite data when the raw data was

reprocessed following evidence of ozone depletion in in situ observations. When the

software was rerun without the flags, the ozone hole was seen as far back as 1976.

https://en.wikipedia.org/wiki/Ozone_depletion#Antarctic_ozone_hole

Set Defaults/Remove Outliers

Always always always! Look at

the data!59

Page 60: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

60

Page 61: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

https://en.wikipedia.org/wiki/Levenshtein_distance

Minimal number of edits (inserts, deletes, substitutions) needed to transform A into B.

61

Page 62: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

https://en.wikipedia.org/wiki/Levenshtein_distance

Minimal number of edits (inserts, deletes, substitutions) needed to transform A into B.

62

Page 63: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

115th Waterman St., Providence, RI 110th Waterman St., Providence, RI

EditDistance = 1

63

Page 64: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

Waterman Street, Providence, RI Waterman St, Providence, RI

EditDistance = 4

64

Page 65: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

Problems?

65

Page 66: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

148th Ave NE, Redmond, WA 148th Ave NE, Redmond, WA

66

Page 67: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

148th Ave NE, Redmond, WA 148th Ave NE, Redmond, WA

Edit Distance = o

67

Page 68: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

148th Ave NE, Redmond, WA 148th Ave NE, Redmond, WA

Edit Distance = o

148th Ave NE, Redmond, WA NE 148th Ave, Redmond, WA

68

Page 69: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Edit Distance

148th Ave NE, Redmond, WA 148th Ave NE, Redmond, WA

Edit Distance = o

148th Ave NE, Redmond, WA NE 148th Ave, Redmond, WA

Edit Distance = 4 69

Page 70: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index70

Page 71: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index

148th Ave NE, Redmond, WA 140th Ave NE, Redmond, WA

71

Page 72: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index

148th Ave NE, Redmond, WA 140th Ave NE, Redmond, WA

72

Page 73: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index

148th Ave NE, Redmond, WA 140th Ave NE, Redmond, WA

Jaccard = 4 / 6 = .67

73

Page 74: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index

148th Ave NE, Redmond, WA NE 148th Ave, Redmond, WA

Jaccard = ???

74

Page 75: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index

148th Ave NE, Redmond, WA NE 148th Ave, Redmond, WA

Jaccard = 1

75

Page 76: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Question!

76

Page 77: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Question!

What’s the Jaccard Similarity? (a) 3/8 (b) 4/11 (c) 4/7

iPad Two 16GB WiFi White iPad 2nd generation 16GB WiFi White

77

Page 78: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Question!

What’s the Jaccard Similarity? (a) 3/8 (b) 4/11 (c) 4/7

iPad Two 16GB WiFi White iPad 2nd generation 16GB WiFi White

#(iPad, 16GB, Wifi, White) #(iPad, Two, 2nd, generation, 16GB, Wifi, White)

78

Page 79: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index

Michigan State University Michigan State Univ.

Michigan State University Ohio State University

79

Page 80: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index

Michigan State University Michigan State Univ.

Jaccard = 0.5

Michigan State University Ohio State University

Jaccard = 0.5

80

Page 81: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

https://en.wikipedia.org/wiki/Jaccard_index

Michigan State University

Michigan State Univ.

Jaccard = 0.5

Michigan State University

Ohio State University

Jaccard = 0.25

31 1

String Similarity: (Weighted) Jaccard Similarity

81

Page 82: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: (Weighted) Jaccard Similarity

https://en.wikipedia.org/wiki/Jaccard_index

Michigan State University

Michigan State Univ.

Jaccard = 0.5

Michigan State University

University of Michigan

Jaccard = 0.5

31 1

82

Page 83: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Cosine Similarity

Senator Washington announced party primary chairman

GOP 1002 41 502 700 400 3Republican 800 35 521 698 423 10

83

Page 84: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Cosine Similarity

θ

GOP

Republican

84

Page 85: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Question!

85

Page 86: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Question!

Which metric would (likely) consider the above words more similar?

(a) Jaccard (b) Cosine

Brown Brown Uni.

86

Page 87: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Question!

Which metric would (likely) consider the above words more similar?

(a) Jaccard (b) Cosine

Brown Brown Uni.

87

Page 88: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Question!

Which metric would (likely) consider the above words more similar?

(a) Jaccard (b) Cosine

Motown Detroit

88

Page 89: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

Clicker Question!

Which metric would (likely) consider the above words more similar?

(a) Jaccard (b) Cosine

Motown Detroit

89

Page 90: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Machine Learn It!!!!

90

Page 91: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Machine Learn It!!!!RECORD MATCHING PROBLEMSRecord Matching Problem

Id Name Street City State P-Code Age

1 J Smith 123 University Ave Seattle Washington 98106 42

2 Mary Jones 245 3rd St Redmond WA 98052-1234 30

3 Bob Wilson 345 Broadway Seattle Washington 98101 19

4 M Jones 245 Third Street Redmond NULL 98052 299

5 Robert Wilson 345 Broadway St Seattle WA 98101 19

6 James Smith 123 Univ Ave Seatle WA NULL 41

7 J Widom 123 University Ave Palo Alto CA 94305 NULL

… … … … … … …

Customer

12/02/2009 42

1.0 0.57 0.91 0.0 1.0 1.0 𝑊𝑡𝐽𝑎𝑐𝑐𝑎𝑟𝑑 =  

CSE 544: Data Cleaning

91

Page 92: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Machine Learn It!!!!

COMBINING SIMILARITY FUNCTIONSCombining Similarity Functions

12/02/2009 44

Jacc(Name)

Jacc(Street)

Edit(City)

Edit(State)

Edit(PostalCode)

Equality(Age)

Record Pair

Vector of similarity scores

Fn Match/Non-Match

Binary Classification Features

CSE 544: Data Cleaning

Idea: Weighted sum of per attribute similarity + threshold?

92

Page 93: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Machine Learn It!!!!

LEARNING-BASED APPROACHLearning-based approach

12/02/2009 45

Bob Wilson 345 Broadway Seattle Washington 98101 19

Robert Wilson 345 Broadway St Seattle WA 98101 19 Match

Mary Jones 245 3rd St Redmond WA 98052-1234 30

Robert Wilson 345 Broadway St Seattle WA 98101 19 Non-Match

B Wilson 123 Broadway Boise Idaho 83712 19

Robert Wilson 345 Broadway St Seattle WA 98101 19 Non-Match

Mary Jones 245 3rd St Redmond WA 98052-1234 30

M Jones 245 Third Street Redmond NULL 98052 299 Match

CSE 544: Data Cleaning

93

Page 94: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Machine Learn It!!!!LEARNING BASED APPROACHLearning-based approach

12/02/2009 46

Jaccard(Name)

Jacc

ard(

Stre

et)

+

+ +

+ +

+

+

+ +

+ +

+

+

+ +

+

+ + + 1.0

0.0 1.0

CSE 544: Data Cleaning

94

Page 95: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

String Similarity: Machine Learn It!!!!LEARNING BASED APPROACHLearning-based approach

12/02/2009 46

Jaccard(Name)

Jacc

ard(

Stre

et)

+

+ +

+ +

+

+

+ +

+ +

+

+

+ +

+

+ + + 1.0

0.0 1.0

CSE 544: Data Cleaning

95

Page 96: Data Cleaning - Brown University...Clicker Questions! 25 TAS ID Name City State Hours 1 N Aldroubi Providence RI 42 2 Natalie Delworth Pawtucket RI 30 3 Nam Do PVD Rhode Island 19

And now….a word from your HTAs

(Meanwhile: I HAVE TO GO I’M GONNA MISS MY TRAIN EMAIL ME YOUR QUESTIONS HAVE

A GOOD WEEKEND BYEEEEE)

96