03 Cleaning

Hadley Wickham

Data cleaning

Stat405

Monday, 31 August 2009

1. Intro to data cleaning

2. Missing values

3. Subsetting

4. Modifying

5. Short cuts

Clean data is:

Columnar (rectangular, observations in rows, variables in columns)

Consistent

Concise

Complete

Correct

Can’t restore correct values without original data but can remove clearly incorrect values

Options:

Remove entire row

Mark incorrect value as missing

What is a missing value?

In R, written as NA. Has special behaviour:

NA + 3 = ?

NA > 2 = ?

mean(c(2, 7, 10, NA)) = ?

NA == NA ?

Use is.na() to see if a value is NA

Many functions have na.rm argument
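The questions above can be checked at the console. A minimal sketch in base R; the vector name `v` is introduced here only for illustration:

```r
# Arithmetic and comparisons involving NA give NA: the answer is unknown
NA + 3          # NA
NA > 2          # NA
NA == NA        # NA, so == cannot be used to test for missingness

v <- c(2, 7, 10, NA)    # illustrative vector, same values as the slide
mean(v)                 # NA
mean(v, na.rm = TRUE)   # mean of 2, 7 and 10 (NA dropped first)

is.na(v)        # FALSE FALSE FALSE TRUE
sum(is.na(v))   # 1 missing value
```

The `NA == NA` result is the reason `is.na()` exists: two unknown values may or may not be equal, so R declines to say.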

Your turn

Look at histograms and scatterplots of x, y, z from the diamonds dataset

Which values are clearly incorrect? Which values might we be able to correct? (Remember measurements are in millimetres, 1 inch ≈ 25 mm)

qplot(x, data = diamonds, binwidth = 0.1)
qplot(y, data = diamonds, binwidth = 0.1)
qplot(z, data = diamonds, binwidth = 0.1)
qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)
qplot(y, z, data = diamonds)

Modifying data

To modify, must first know how to extract, or subset. Many different methods available in R. We’ll start with most explicit then learn some shortcuts.

Basic structure:
df$varname
df[row index, column index]

Remember str(diamonds) ?

That hints at how to extract individual variables:

diamonds$carat

diamonds$price

positive integers: select specified
negative integers: omit specified
characters: extract named items
nothing: include everything
logicals: select TRUE, omit FALSE
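The five index types can be tried on a small vector before moving to a data frame. A minimal sketch; the named vector `v` is made up for illustration and is not part of the course data:

```r
# Illustrative named vector (not from the slides)
v <- c(a = 10, b = 20, c = 30, d = 40)

v[c(1, 3)]                       # positive integers: elements 1 and 3
v[-c(1, 3)]                      # negative integers: everything except 1 and 3
v[c("b", "d")]                   # characters: elements named b and d
v[]                              # nothing: every element
v[c(TRUE, FALSE, FALSE, TRUE)]   # logicals: keep where TRUE (a and d)
```

The same rules apply to each of the two positions in `df[row index, column index]`.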

Challenge

There is an equivalency between logical (boolean) and numerical (set) indexing.

How do you change a logical index to a numeric index? And vice versa?

What are the equivalents of the boolean operations for numerical indices?

# Nothing
str(diamonds[, ])

# Positive integers & nothing
diamonds[1:6, ] # same as head(diamonds)
diamonds[, 1:4] # watch out!

# Positive integers * 2
diamonds[1:10, 1:4]
diamonds$carat[1:100]

# Negative integers
diamonds[-(1:53900), -1]

# Character vector
diamonds[, c("depth", "table")]
diamonds[1:100, "carat"]

[ + logical vectors

# The most complicated to understand, but
# the most powerful. Lets you extract a
# subset defined by some characteristic of
# the data
x_big <- diamonds$x > 10

head(x_big)
tail(x_big)
sum(x_big)

diamonds$x[x_big]
diamonds[x_big, ]

Useful functions for logical vectors

table(zeros)
sum(zeros)
mean(zeros)

TRUE = 1; FALSE = 0
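Because TRUE counts as 1 and FALSE as 0, `sum()` counts the TRUEs and `mean()` gives their proportion. A small sketch with made-up measurements (the vector `x` here is illustrative, not the diamonds column):

```r
# Illustrative measurements; 0 stands in for an impossible value
x <- c(3.9, 0, 4.2, 0, 5.1)
zeros <- x == 0

table(zeros)   # counts of FALSE and TRUE
sum(zeros)     # 2: number of TRUEs
mean(zeros)    # 0.4: proportion of TRUEs
```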

x_big <- diamonds$x > 10
diamonds[x_big, ]
diamonds[x_big, "x"]
diamonds[x_big, c("x", "y", "z")]

small <- diamonds[diamonds$carat < 1, ]
lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ]

# Comparison functions:
# < > <= >= != == %in%

# Boolean operators
small <- diamonds$carat < 1 & diamonds$price > 500
lowqual <- diamonds$color == "D" | diamonds$cut == "Fair"

And: a & b
Or: a | b
Not: !b
Xor: xor(a, b)
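All four operators work element by element. A minimal sketch on two short logical vectors that cover every combination of TRUE and FALSE:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b       # TRUE FALSE FALSE FALSE
a | b       # TRUE TRUE TRUE FALSE
!b          # FALSE TRUE FALSE TRUE
xor(a, b)   # FALSE TRUE TRUE FALSE
```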

Saving results

# Prints to screen

diamonds[diamonds$x > 10, ]

# Saves to new data frame

big <- diamonds[diamonds$x > 10, ]

# Overwrites existing data frame. Dangerous!

diamonds <- diamonds[diamonds$x < 10,]

diamonds <- diamonds[1, 1]
diamonds

# Uh oh!

rm(diamonds)
str(diamonds)

# Phew!
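Why does rm() rescue us? Assigning to the name of a packaged dataset creates a copy in your workspace that hides the original; removing the copy uncovers the original again. A sketch of the same sequence using the built-in `mtcars` dataset as a stand-in, so it runs without ggplot2:

```r
# mtcars ships with base R (datasets package); used here as a stand-in
mtcars <- mtcars[1, 1]   # workspace copy now hides the packaged dataset
mtcars                   # a single number: 21

rm(mtcars)               # deletes the workspace copy only
str(mtcars)              # the original 32-row data frame is visible again
```

This safety net only exists for data that lives in a package. Data you loaded from a file yourself is gone once overwritten, so saving to a new name is the safer habit.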

Your turn

Extract diamonds with equal x & y.

Extract diamonds with incorrect/unusual x, y, or z values.

equal <- diamonds[diamonds$x == diamonds$y, ]

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0
zeros <- x_zero | y_zero | z_zero

bad <- y_big | z_big | zeros
dbad <- diamonds[bad, ]

Aside: strategy

The biggest mistake I see new programmers make is trying to do too much at once.

Break the problem into pieces and solve the smallest piece first. Then check each piece before moving on to the next.

Making new variables

diamonds$pricepc <- diamonds$price / diamonds$carat

diamonds$volume <- diamonds$x * diamonds$y * diamonds$z

qplot(pricepc, carat, data = diamonds)

qplot(carat, volume, data = diamonds)

Modifying values

Combination of subsetting and making new variables:

diamonds$x[x_zero] <- NA

diamonds$z[z_big] <- diamonds$z[z_big] / 10

These modify the data in place. Be careful!
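The pattern is easier to follow on a tiny example. A minimal sketch with a made-up three-row data frame (`df` and its values are illustrative, not the real diamonds data):

```r
# Toy data frame with made-up values (not the real diamonds data)
df <- data.frame(x = c(4.1, 0, 5.3), z = c(2.5, 2.4, 31.8))

x_zero <- df$x == 0    # FALSE TRUE FALSE
z_big  <- df$z > 6     # FALSE FALSE TRUE

df$x[x_zero] <- NA                 # impossible zero becomes missing
df$z[z_big] <- df$z[z_big] / 10    # 31.8 becomes 3.18

df
```

Note that the logical indices are computed before anything is changed. Recomputing `df$z > 6` after the fix would give a different answer.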

diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
qplot(carat, volume, data = diamonds)

# Fix problems & replot
diamonds$x[x_zero] <- NA
diamonds$y[y_zero] <- NA
diamonds$z[z_zero] <- NA
diamonds$y[y_big] <- diamonds$y[y_big] / 10
diamonds$z[z_big] <- diamonds$z[z_big] / 10

diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
qplot(carat, volume, data = diamonds)

Your turn

Fix the incorrect values and replot scatterplots of x, y, and z. Are all the unusual values gone?

Correct any other strange values.

Hint: If qplot(a, b) is a straight line, qplot(a, a / b) will be a flat line. Makes selecting strange values much easier!
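The hint works because a constant ratio turns "far from the line" into "far from one number", which is easy to test. A small numeric sketch with made-up values; the thresholds are chosen by eye for this toy data, and either `a / b` or `b / a` would do:

```r
# Made-up values: b is roughly 2 * a, except for the fourth point
a <- c(1, 2, 3, 4, 5)
b <- c(2.0, 4.1, 5.9, 20.0, 10.0)

ratio <- b / a                          # near 2 everywhere except the odd point
strange <- ratio < 1.5 | ratio > 2.5    # thresholds picked by eye for this toy data
which(strange)                          # 4
```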

qplot(carat, volume, data = diamonds)
qplot(carat, volume / carat, data = diamonds)

weird_density <- (diamonds$volume / diamonds$carat) < 140 | (diamonds$volume / diamonds$carat) > 180
weird_density <- weird_density & !is.na(weird_density)

diamonds[weird_density, c("x", "y", "z", "volume")] <- NA
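The `& !is.na(weird_density)` step matters because volume is already NA for some rows, so the comparison is NA there too, and an NA in a logical index does not behave like FALSE. A minimal sketch on a made-up vector (`ratio` and `weird` are illustrative names):

```r
# Made-up ratios; the second is missing
ratio <- c(160, NA, 120, 170)

weird <- ratio < 140 | ratio > 180   # FALSE NA TRUE FALSE
ratio[weird]                         # NA 120: the NA index yields an NA element

weird <- weird & !is.na(weird)       # FALSE FALSE TRUE FALSE
ratio[weird]                         # 120
```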

Short cuts

You’ve been typing diamonds many many times. There are three shortcuts: with, subset and transform.

These save typing, but may be a little harder to understand, and will not work in some situations.

Useful tools, but don’t forget the basics.

weird_density <- (diamonds$volume / diamonds$carat) < 140 | (diamonds$volume / diamonds$carat) > 180
weird_density <- with(diamonds, (volume / carat) < 140 | (volume / carat) > 180)

diamonds[diamonds$carat < 1, ]
subset(diamonds, carat < 1)

equal <- diamonds[diamonds$x == diamonds$y, ]
equal <- subset(diamonds, x == y)

diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
diamonds$pricepc <- diamonds$price / diamonds$carat

diamonds <- transform(diamonds, volume = x * y * z, pricepc = price / carat)
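To see that the shortcuts give the same answers as the explicit forms, here is a minimal sketch on a made-up three-row data frame (`df` and its values are illustrative, not the real diamonds data):

```r
# Toy data frame with made-up values (not the real diamonds data)
df <- data.frame(carat = c(0.5, 1.2, 0.9), price = c(1500, 6000, 3600),
                 x = c(5, 7, 6), y = c(5, 7, 6.1), z = c(3, 4, 3.8))

# with(): evaluate an expression using the columns as variables
ppc1 <- df$price / df$carat
ppc2 <- with(df, price / carat)
identical(ppc1, ppc2)              # TRUE

# subset(): keep the rows where the condition is TRUE
small1 <- df[df$carat < 1, ]
small2 <- subset(df, carat < 1)
nrow(small1)                       # 2
nrow(small2)                       # 2

# transform(): add or change columns in one call
df2 <- transform(df, volume = x * y * z, pricepc = price / carat)
names(df2)
```

One difference worth knowing: subset() drops rows where the condition is NA, whereas `[` with a logical index returns a row of NAs for them.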

Your turn

Try to convert your previous statements to use with, subset and transform. Which ones convert easily? Which are hard?

When is the shortcut actually a longcut?

Next time

Learning how to use LaTeX: a scientific publishing program.

If you’re using a laptop, please install LaTeX from the links on the course webpage.

a & b        intersect(c, d)
a | b        union(c, d)
!b           setdiff(U, d)
xor(a, b)    union(setdiff(c, d), setdiff(d, c))

c = which(a)
d = which(b)

U = seq_along(a)
a = U %in% c
b = U %in% d
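These equivalences can be checked directly. A minimal sketch; `idx_a` and `idx_b` play the roles of c and d above, renamed here (an assumption for readability) to avoid confusion with the c() function:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# Logical to numeric
idx_a <- which(a)        # 1 2
idx_b <- which(b)        # 1 3
U <- seq_along(a)        # 1 2 3 4

# Numeric back to logical
identical(U %in% idx_a, a)                                  # TRUE

# Boolean operations and their set equivalents
identical(which(a & b), intersect(idx_a, idx_b))            # TRUE
identical(which(a | b), sort(union(idx_a, idx_b)))          # TRUE
identical(which(!b), setdiff(U, idx_b))                     # TRUE
identical(which(xor(a, b)),
          sort(union(setdiff(idx_a, idx_b),
                     setdiff(idx_b, idx_a))))               # TRUE
```

The sort() calls are there because set functions do not promise sorted output, while which() always returns positions in increasing order.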
