Introduction into R for historians (part 4: data manipulation)


Data manipulation in R

Richard L. Zijdeman

May 29, 2015


1 Recap

2 Data manipulation

3 data.table package

4 Basic statistical techniques

What we’ve seen so far

functions to read in data: read.csv(), read.xlsx()

objects
    assignment: <-
    characteristics, e.g.: str(), summary(), head(), tail()

calculus: mean(), min(), max()

plotting: plot(), ggplot()
    paint by ‘layer’

Before we go on. . .

Structure your R script
    Filename, Date, Purpose, Author, Last change
    Use comments to tell what you are doing
        reading in data
        changing variables (why did you do it?)

Create a working directory, with subdirs

    + documents
    + data
        - source
        - derived
    + analysis
    + figures

Set a working directory
    setwd(), getwd()
    use relative paths to save things
    “./” = current directory
    “./../” = one folder up

Read J. Scott Long’s “Workflow”
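A minimal sketch of such a script header and folder layout. The file name, dates, author, and purpose in the header are placeholders, and the project is created under tempdir() only so that the example runs anywhere; in a real project the root would be your own project folder.

```r
# Filename:    01_read_data.R   (hypothetical name)
# Date:        YYYY-MM-DD
# Purpose:     read in the marriage data and prepare variables
# Author:      your name
# Last change: YYYY-MM-DD

# Project root: tempdir() is an assumption so the sketch runs anywhere
root <- file.path(tempdir(), "project")

# Folder layout from the slide: documents, data (source, derived), analysis, figures
subdirs <- c("documents", "data/source", "data/derived", "analysis", "figures")
for (d in subdirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Work from the analysis folder and refer to the data with a relative path
old_wd <- setwd(file.path(root, "analysis"))
getwd()
found <- file.exists("./../data/derived")   # "./../" = one folder up
found
setwd(old_wd)                               # restore the previous working directory
```

Keeping source and derived data in separate folders makes it harder to overwrite the original files by accident.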

Data manipulation

Assignment and Indexing

First, we’ll read in the HSN marriages again

hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 10000)

Change case of text

tolower(), toupper()

tolower("CaN we pleASe jUSt have LOWER cases?")

## [1] "can we please just have lower cases?"

names(hmar) <- tolower(names(hmar))
names(hmar)

##  [1] "id_marriage"   "idnr"          "m_loc"         "m_year"
##  [5] "sex_hsnrp"     "age_groom"     "occ_groom"     "civilst_groom"
##  [9] "sign_groom"    "b_loc_groom"   "l_loc_groom"   "age_bride"
## [13] "occ_bride"     "civilst_bride" "sign_bride"    "b_loc_bride"
## [17] "l_loc_bride"   "a_f_groom"     "occ_f_groom"   "sign_f_groom"
## [21] "a_m_groom"     "occ_m_groom"   "sign_m_groom"  "a_f_bride"
## [25] "occ_f_bride"   "sign_f_bride"  "a_m_bride"     "occ_m_bride"
## [29] "sign_m_bride"

Indexing

There were way too many names to print on a slide... How many names are there actually?

Use the length() command to find out:

length(names(hmar))

## [1] 29

So let’s print just the first two:

names(hmar)[1:2]

## [1] "id_marriage" "idnr"

The technique using square brackets is called indexing

Any idea how we would show the last two names?

x <- length(names(hmar))
names(hmar)[(x-1):x]

## [1] "occ_m_bride"  "sign_m_bride"

Using concatenate, c(), we can also extract several names at once

names(hmar)[c(1, 3, 5)]

## [1] "id_marriage" "m_loc" "sex_hsnrp"

We can also apply indexing to a data.frame:

hmar[1:2, 1:3]

##   id_marriage idnr              m_loc
## 1           1 1001 Abcoude-Baambrugge
## 2           2 1005              Baarn

# shows the first 2 rows and first 3 columns
# so, in general: data.frame[rows, columns]
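The same indexing ideas on a small made-up data.frame, so they can be tried without the HSN file. The object toy and its values are purely illustrative.

```r
# Toy data.frame; the values are made up for illustration
toy <- data.frame(id    = 1:5,
                  year  = c(1849, 1851, 1864, 1840, 1843),
                  place = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

n <- length(names(toy))              # number of columns
last_two <- names(toy)[(n - 1):n]    # the last two names
last_two

first_rows <- toy[1:2, c(1, 3)]      # rows 1 to 2, columns 1 and 3
first_rows

toy[toy$year > 1845, "year"]         # rows chosen by a condition, one column
```

Note the parentheses in (n - 1):n. Without them, n - 1:n is read as n minus the whole sequence 1:n.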

head() and tail()

So actually, you should now be able to reproduce head() and tail() with indexing

# head()
hmar[1:6, ]

# tail(): the last six rows run from y-5 to y
y <- nrow(hmar)
hmar[(y-5):y, ]
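A quick way to check that indexing and head()/tail() select the same rows, sketched on made-up data. The object toy is an assumption for illustration only.

```r
toy <- data.frame(id = 1:20, value = (1:20)^2)   # made-up data

y <- nrow(toy)
first_six <- toy[1:6, ]
last_six  <- toy[(y - 5):y, ]    # y-5, ..., y is six rows

# Compare with the built-in functions (default is six rows)
same_head <- isTRUE(all.equal(first_six, head(toy)))
same_tail <- isTRUE(all.equal(last_six, tail(toy)))
c(same_head, same_tail)
```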

data.table package

Developed by Matt Dowle
Website: https://github.com/Rdatatable/data.table/wiki

Why data.table?
    fast subsetting on large files
    more consistent ‘grammar’
    less typing

install.packages("data.table")

library(data.table)

Class: data.table

For data.table functions to work, we need to convert the data.frame to class data.table

is.data.table(hmar)

## [1] FALSE

hmar.dt <- data.table(hmar)
is.data.table(hmar.dt)

## [1] TRUE

is.data.frame(hmar.dt)

## [1] TRUE

Friends with benefits

Data.frame and data.table are like ‘friends with benefits’

all.equal(hmar, hmar.dt)

## [1] "Attributes: < Names: 2 string mismatches >"
## [2] "Attributes: < Length mismatch: comparison on first 2 components >"
## [3] "Attributes: < Component 1: Modes: character, externalptr >"
## [4] "Attributes: < Component 1: target is character, current is externalptr >"
## [5] "Attributes: < Component 2: Modes: numeric, character >"
## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
## [7] "Attributes: < Component 2: target is numeric, current is character >"

# so we have all the benefits of a data.frame
# ... and additional benefits of data.table
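Every mismatch that all.equal() reports concerns attributes, not the data. One of those attributes is the class: a data.table carries both classes, which is why data.frame functions keep working. A small sketch on made-up data (df and dt are illustrative names):

```r
library(data.table)

df <- data.frame(id = 1:3, year = c(1849, 1851, 1864))   # made-up data
dt <- data.table(df)

class(df)   # "data.frame"
class(dt)   # "data.table" "data.frame"

# Functions written for data.frames accept a data.table as well
nrow(dt)
inherits(dt, "data.frame")
```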

NB: the next series of commands will only work for data.tables

Sort with setkey

Often we want to sort our data. We can do so with setkey(), or with setkeyv() when the column names are passed as character strings

hmar.dt[1:6, m_year]

## [1] 1849 1851 1864 1840 1843 1858

# note for data.frame hmar it would be:
# hmar[1:6, "m_year"]
setkeyv(hmar.dt, "m_year")
hmar.dt[1:6, m_year]

## [1] 1831 1831 1833 1833 1834 1834

identical(hmar.dt, hmar)

## [1] FALSE

Multiple keys

It is also possible to sort on multiple keys

setkeyv(hmar.dt, c("id_marriage", "idnr"))
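A minimal sketch of what setting a key does, on made-up data. The table dt and its values are assumptions for illustration; setkeyv(), setkey(), and key() are data.table functions.

```r
library(data.table)

dt <- data.table(id     = c(3, 1, 2, 4),
                 m_year = c(1864, 1849, 1849, 1840))   # made-up data

setkeyv(dt, "m_year")      # sorts dt in place; no assignment needed
key(dt)
dt$m_year

setkey(dt, m_year, id)     # same idea with unquoted column names, two keys
key(dt)
dt
```

With two keys the rows are sorted on the first key, and ties are broken by the second.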

Subsetting

groom.sig <- hmar.dt[age_groom > 30, ]
dim(groom.sig)

## [1] 2493 29

groom.sig <- hmar.dt[sign_groom == "h", ]
dim(groom.sig)

## [1] 9590 29

groom.sig <- hmar.dt[sign_groom == "h" &
                     age_groom > 30, ]
dim(groom.sig)

## [1] 2358 29

groom.sig <- hmar.dt[m_year != 1840,
                     list(id_marriage, idnr)]
dim(groom.sig)

## [1] 9985 2
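The general pattern is data.table[rows, columns]: a condition selects rows, and list() selects columns. A small sketch on made-up data; the value "h" follows the slides, while "x" is a hypothetical code for any other value.

```r
library(data.table)

dt <- data.table(id         = 1:6,
                 age_groom  = c(24, 35, 41, 28, 33, 22),
                 sign_groom = c("h", "h", "x", "h", "h", "x"))   # made-up data

older        <- dt[age_groom > 30]                       # rows by one condition
signed       <- dt[sign_groom == "h"]                    # rows by another condition
older_signed <- dt[sign_groom == "h" & age_groom > 30,   # both conditions ...
                   list(id, age_groom)]                  # ... and two columns only

nrow(older)
nrow(signed)
dim(older_signed)
```

Column names can be used directly inside the brackets, so there is no need to repeat dt$ in front of each variable.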

Creating new variables

Let’s create a variable for the mean age at marriage of grooms

hmar.dt[, mean.gage := mean(age_groom)]

summary(hmar.dt$age_groom)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -2.00   24.00   26.00   28.38   30.00   79.00

summary(hmar.dt$mean.gage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   28.38   28.38   28.38   28.38   28.38   28.38

Another example (from yesterday)

Dummy variable for equal municipality of birth

hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]

summary(hmar.dt$eq_b_loc)

##    Mode   FALSE    TRUE    NA's
## logical    6957    3043       0

Creating variables by group

As we saw, a var with mean age wasn’t really interesting

average age of grooms at marriage by civil status

hmar.dt[, gage.mean.civ := mean(age_groom),
        by = civilst_groom]

table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)

##
##     27.2427939112599 40.8829787234043 42.9548286604361   53
##   1             9263                0                0    0
##   2                0                0              642    0
##   3                0               94                0    0
##   6                0                0                0    1
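The difference between an overall mean and a mean by group, sketched on made-up data. The table dt and its values are illustrative; the column names follow the slides.

```r
library(data.table)

dt <- data.table(civilst_groom = c(1, 1, 2, 2, 1),
                 age_groom     = c(24, 26, 40, 44, 28))   # made-up data

# Without 'by': one overall mean, repeated on every row
dt[, mean.gage := mean(age_groom)]

# With 'by': one mean per civil status, repeated within each group
dt[, gage.mean.civ := mean(age_groom), by = civilst_groom]

dt
```

The := operator adds the column to dt itself, so the result does not need to be assigned to a new object.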

Summary subsets of the data

So far, added vars to original data.frame
    can be redundant though

Think of context, say municipalities
    archival material on characteristics, e.g.:
        population
        steam power

You can also make context characteristics by aggregation

mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]

summary(mc)

##  b_loc_groom              V1
##  Length:1184        Min.   :-2.00
##  Class :character   1st Qu.:26.00
##  Mode  :character   Median :28.17
##                     Mean   :29.36
##                     3rd Qu.:31.00
##                     Max.   :69.00

We can improve by naming the variable directly, and adding more variables

mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
                      mean_bage = mean(age_bride)),
               by = b_loc_groom]

summary(mc2)

##  b_loc_groom          mean_gage       mean_bage
##  Length:1184        Min.   :-2.00   Min.   :-2.00
##  Class :character   1st Qu.:26.00   1st Qu.:23.80
##  Mode  :character   Median :28.17   Median :25.88
##                     Mean   :29.36   Mean   :26.53
##                     3rd Qu.:31.00   3rd Qu.:28.00
##                     Max.   :69.00   Max.   :64.00
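Unlike :=, which keeps every row, aggregating with list() and by returns a new table with one row per group. A sketch on made-up data (dt, agg, and the values are illustrative):

```r
library(data.table)

dt <- data.table(b_loc_groom = c("A", "A", "B", "B", "B"),
                 age_groom   = c(24, 30, 26, 28, 36),
                 age_bride   = c(22, 26, 25, 27, 29))   # made-up data

# One row per place of birth, with named summary columns
agg <- dt[, list(mean_gage = mean(age_groom),
                 mean_bage = mean(age_bride)),
          by = b_loc_groom]
agg
nrow(dt)    # the original table is unchanged
```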

One more. . . counts

Yesterday, we talked about the problem of overlapping points. We used geom_jitter to solve it.

Now let’s do it properly:

mc3 <- hmar.dt[, list(frequency = .N),
               by = list(m_year, age_bride)]

# notice the .N ... N is often used for nr. of obs
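A small sketch of what .N returns, on made-up data: the number of rows in each combination of the grouping variables. The table dt and its values are illustrative.

```r
library(data.table)

dt <- data.table(m_year    = c(1850, 1850, 1850, 1851, 1851),
                 age_bride = c(24, 24, 26, 24, 24))   # made-up data

# One row per (year, age) combination, with the number of marriages in it
counts <- dt[, list(frequency = .N), by = list(m_year, age_bride)]
counts

sum(counts$frequency) == nrow(dt)   # the counts add up to the number of rows
```

Points that would overlap in a scatter plot become a single row here, with the overlap stored in frequency.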

library(ggplot2)

Using colour

ggplot(mc3, aes(x = m_year, y = age_bride)) +
    geom_point(aes(colour = frequency),
               size = 10, shape = 18) +
    theme_bw()

[Figure: age_bride against m_year (x axis from 1850 to 2000), points coloured by frequency]

Using size

ggplot(mc3, aes(x = m_year, y = age_bride)) +
    geom_point(aes(size = frequency),
               colour = "blue", shape = 18) +
    theme_bw()

[Figure: age_bride against m_year (x axis from 1850 to 2000), point size scaled by frequency]

Box and whisker plot

Distribution of data
Median: 50% of the cases above and below
Box: 1st and 3rd quartile
Interquartile range (IQR): Q3 - Q1
Outliers (Tukey, 1977):
    x < Q1 - 1.5*IQR
    x > Q3 + 1.5*IQR
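The outlier rule can be worked through by hand on a small made-up vector of ages. This sketch uses quantile() for Q1 and Q3; boxplot() itself derives the box from the hinges of fivenum(), which can differ slightly from these quartiles.

```r
ages <- c(18, 21, 22, 23, 24, 25, 26, 27, 30, 64)   # made-up ages

q1  <- quantile(ages, 0.25, names = FALSE)   # 1st quartile
q3  <- quantile(ages, 0.75, names = FALSE)   # 3rd quartile
iqr <- q3 - q1                               # same value as IQR(ages)

lower <- q1 - 1.5 * iqr   # anything below this is an outlier
upper <- q3 + 1.5 * iqr   # anything above this is an outlier

outliers <- ages[ages < lower | ages > upper]
outliers
```

Here Q1 = 22.25 and Q3 = 26.75, so the IQR is 4.5 and the limits are 15.5 and 33.5. Only the age of 64 falls outside them.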

boxplot(hmar.dt$age_bride,
        ylab = "Age")

hmar.dt[, sign.bride.cln := sign_bride == "h"]
hmar.dt[age_bride < 14, age_bride := NA]
# NB: no missing values here, but mind this when recoding!

boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
        names = c("not signed", "signed"),
        col = c("red", "green"))

[Figure: box plots of age_bride for the groups "not signed" and "signed"]
