Topic 2: Distributions and basic descriptive statistics ... · Web viewExperiment in which...

Topic 2: Distributions and basic descriptive statistics for quantitative variables

Jeroen Claes

Contents1. How to load and view data from a CSV file2. Working with data.frames3. Measures of central tendency4. Normal distribution5. Dispersion measures6. What to do when your data is not distributed normally

1. How to load data from a CSV file

1. How to load data from a CSV file (1/2)In this class, we will be working with data by Balota et al. (2007), provided by Levshina (2015):

• Data in CSV (comma-separated values)• Experiment in which subjects had to decide whether or not a word is an existing word.• Columns: Word, Word length, Word frequency, and Mean reaction time

1. How to load data from a CSV file (2/2)To load data, we use the package readr and we load the data straight from the course website. This works the same for local data sets.

library(readr)dataSet <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class2_balota_et_al_2007.csv")

1. How to create a CSV fileIn Excel:

• Save as ‘CSV’ (Comma-separated values)• Make sure you choose UTF-8 encoding (on Excel 2016 for Mac: CSV UTF-8 Encoding)• If your Excel is set to Belgian locale, the separator will not be the comma ,, but rather

the semicolon ;.• In that case, you have to use the read_csv2 function to load your data.

2. Working with data.frames

2.1 Structure of data.frames (1/3)Now let’s check out how the object dataSet looks like. Recall that we can use str() for this purpose.

str(dataSet)

2.1 Structure of data.frames (2/3)We can also get a nice table-like view of it

View(dataSet)

2.1 Structure of data.frames (3/3)With dim we compute the number of dimensions (columns and rows), with nrow the number of rows, and with ncol the number of columns.

dim(dataSet)

## [1] 100 4

nrow(dataSet)

## [1] 100

ncol(dataSet)

## [1] 4

2.2 headPrint the first N lines from the data.frame

#first 6 lines head(dataSet)

## # A tibble: 6 x 4## Length Freq Mean_RT Word## <int> <int> <dbl> <chr>## 1 8 131 819.19 marveled## 2 10 82 977.63 persuaders## 3 7 0 908.22 midmost## 4 6 592 766.30 crutch## 5 12 2 1125.42 resuspension## 6 12 9 948.33 efflorescent

2.3 tailPrint the last N lines from the data.frame

#last 10 linestail(dataSet, 10)

## # A tibble: 10 x 4## Length Freq Mean_RT Word## <int> <int> <dbl> <chr>## 1 6 757 641.06 club's## 2 6 775 775.59 minced## 3 5 14860 602.09 drain## 4 7 31679 576.58 justice## 5 10 181 792.93 numerology## 6 6 158 876.56 powwow## 7 8 429 710.63 backdoor## 8 11 93 875.27 maladjusted## 9 6 2434 627.11 Elaine## 10 11 162 1458.75 diacritical

2.4 glimpseFor data with many columns, RStudio will omit the columns that do not fit the console window. There’s a function glimpse in the dplyr package that prints the first couple of lines of your data frame as a list.

library(dplyr)glimpse(dataSet)

## Observations: 100## Variables: 4## $ Length <int> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8,...## $ Freq <int> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523,...## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.6...## $ Word <chr> "marveled", "persuaders", "midmost", "crutch", "resusp...

2.5 summaryThe function summary prints a general overview of the values (minimum, maximum, median,mean,…) in a data.frame.

summary(dataSet)

## Length Freq Mean_RT Word

## Min. : 3.00 Min. : 0.0 Min. : 564.2 Length:100

## 1st Qu.: 6.00 1st Qu.: 53.5 1st Qu.: 713.1 Class :character ## Median : 8.00 Median : 310.5 Median : 784.9 Mode :character ## Mean : 8.23 Mean : 3350.3 Mean : 808.3

## 3rd Qu.:10.00 3rd Qu.: 2103.2 3rd Qu.: 905.2

## Max. :15.00 Max. :75075.0 Max. :1458.8

2.6 Dollar-sign operator: extracting a single columnTo extract a column (a single vector) from the data.frame, we use the dollar sign operator e.g., dataSet$Word

#first 6 lines of just the column Wordhead(dataSet$Word)

## [1] "marveled" "persuaders" "midmost" "crutch" ## [5] "resuspension" "efflorescent"

#last 10 linestail(dataSet$Word, 10)

## [1] "club's" "minced" "drain" "justice" "numerology" ## [6] "powwow" "backdoor" "maladjusted" "Elaine" "diacritical"

2.7 Extracting multiple columnsTo extract multiple columns from a data.frame, we simply put a character vector of the column names we want to extract between square brackets, preceded by a comma:

#first 6 lines of Word and Freqhead(dataSet[, c("Word", "Freq")])

## # A tibble: 6 x 2## Word Freq## <chr> <int>## 1 marveled 131## 2 persuaders 82## 3 midmost 0## 4 crutch 592## 5 resuspension 2## 6 efflorescent 9

2.8 Subsetting by rowsTo extract rows from a data.frame we use a numeric vector of row numbers between square brackets, followed by a comma

dataSet[c(1, 2, 5), ]

## # A tibble: 3 x 4## Length Freq Mean_RT Word## <int> <int> <dbl> <chr>## 1 8 131 819.19 marveled## 2 10 82 977.63 persuaders## 3 12 2 1125.42 resuspension

2.9 Subsetting: the system• You will have noticed that the general pattern for subsetting dataframes is the

following:– dataframeName[rows, columns]

2.10 Conditional selection by row (1/2)• Instead of writing the row numbers ourselves, we can also ask R to compute the

indices of rows that fulfill some condition.• The following statement will return the first six rows of all rows for which Freq >

100head(dataSet[dataSet$Freq > 100, ])

## # A tibble: 6 x 4## Length Freq Mean_RT Word## <int> <int> <dbl> <chr>## 1 8 131 819.19 marveled## 2 6 592 766.30 crutch## 3 3 14013 641.67 row## 4 5 290 654.12 jumpy## 5 6 3264 583.81 enjoys## 6 6 3523 667.18 screws

2.10 Conditional selection by row (2/2)• Other useful conditions are:

– is.na() (e.g., dataSet[is.na(dataSet$Freq), ]) - “is empty value”– !is.na() - “is NOT is empty value”– < - “smaller than”– <= - “smaller than or equal to”– > - “larger than”– >= - “larger than or equal to”– %in% (e.g., dataSet[dataSet$Freq %in% c(1,2,4)]) - “one of list”

– %in% with NOT operator (e.g., dataSet[!dataSet$Freq %in% c(1,2,4)]) - “not one of list”

2.11 Adding columns• Adding columns is just as easy as selecting columns• We simply type the dataframe name, followed by the dollar-sign operator and a name

for the column. Then we can assign whatever value we want to the new columndataSet$LogFreq <- log(dataSet$Freq)glimpse(dataSet)

## Observations: 100## Variables: 5## $ Length <int> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8,...## $ Freq <int> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523,...## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.6...## $ Word <chr> "marveled", "persuaders", "midmost", "crutch", "resusp...## $ LogFreq <dbl> 4.8751973, 4.4067192, -Inf, 6.3835066, 0.6931472, 2.19...

2.12 Excercises• Go to http://www.jeroenclaes.be/statistics_for_linguistics/class2.html• Perform the exercises under data.frames

3. Distributions and measures of central tendency

3. Distributions and measures of central tendency (1/2)• In statistics a set of scores/values/numbers is called a distribution.• The values in the columns Length, Freq, and Mean_RT each constitute a distribution.• The typical values of a distribution are called its central tendency• Before we can establish relationships between variables, we have to understand their

distributions first!

3. Distributions and measures of central tendency (2/2)• We can describe distributions by calculating the following statistics:

– Mean (average)– Median– Quantiles/quartiles– Mode– Minimum– Maximum

http://www.jeroenclaes.be/statistics_for_linguistics/class2.html

Each of these statistics will allow us to say something about the typical values (central tendency) of the distribution.

3.1 Mean• The mean of a distribution equals to the sum of the values of a distribution divided by

the number of values in the distribution.• Summarizes the typical values of the distribution in a single number#Mean word lengthmean(dataSet$Length, na.rm=TRUE)

## [1] 8.23

3.2 Median• The median of a distribution equals to the value that marks the middle of the

distribution.• If a distribution has 100 values and we rank them from small to large, the median will

be the 50th value.median(dataSet$Length, na.rm=TRUE)

## [1] 8

3.3 MedianWhen combined with the mean, the median provides a good idea of the central tendency of a distribution:

• The mean is prone to shift upwards or downwards if there is a single atypical value (called outlier in statistics)

• The median remains identical#Comparemedian(c(1, 2, 3, 4, 5, 6, 7, 8000, 190, 1000000)) # mean = 100821.8

## [1] 5.5

median(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) # mean = 5.5

## [1] 5.5

3.4 Quantiles• A related concept are quantiles (median = 0.5 quantile)• 0.25 quantile is also called the first quartile, median is second quartile etc.• Quantiles express that a percentage of the values of the distribution is lower than a

particular value• Useful to study the spread of the values in the distribution# One quantilequantile(dataSet$Length, 0.25)

## 25% ## 6

# Multiple quantilesquantile(dataSet$Length, c(0.25, 0.5, 0.75))

## 25% 50% 75% ## 6 8 10

3.5 Mode• Most frequent value in a distribution• Not widely used# table() counts the number of times a value occurs in a vectortab <- table(dataSet$Length)# We sort the table by its values, in decreasing ordersorted <- sort(tab, decreasing = TRUE)# We extract the first value, which is the highest.mode <- sorted[1]# The number 8 is the mode. It occurs 16 timemode

## 8 ## 16

# In one line:sort(table(dataSet$Length), decreasing = TRUE)[1]

## 8 ## 16

3.6 Minimum• Minimum: lowest valuemin(dataSet$Length, na.rm=TRUE)

## [1] 3

3.7 Maximum• Maximum: highest valuemax(dataSet$Length, na.rm=TRUE)

## [1] 15

3.8 Calculating most of these at onceBy calling summary on a single column, you can calculate many of the statistics discussed here with one command

summary(dataSet$Length)

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 3.00 6.00 8.00 8.23 10.00 15.00

3.9 Describing a distribution• Mean• Median

If the mean is lower than the median, the distribution is negatively skewed. If the mean is higher than the median, it is positively skewed.

3.10 Excercises• Go to http://www.jeroenclaes.be/statistics_for_linguistics/class2.html• Perform the excercises under 2.Measures of central tendency

4. The normal distribution• Statisticians have identified different types of distributions• The most important is the normal or Gaussian distribution• Most statistical tests assume that your data approaches this theoretical distribution:

– Important to check whether your data is normally distributed– Otherwise statistical tests may produce unreliable results!

• Most data drawn from sufficiently large samples will be normally distributed

4.1 Characteristics (1/2)• The mean and the median are not very different (easy test)• The largest amount of values occur around the mean, the lowest amount of values

occur near the minimum and maximum• The distribution is symmetrical: there are about as many values that are smaller than

the mean than there are values that are larger than the mean

4.1 Characteristics (2/2)• This pattern shows up as a bell-shaped curve in density plots


4.2 Checking for normality• Plotting (preferred)• Shapiro-Wilk test

4.2.1 Plotting

4.2.1.1 Plotting in ggplot• Let’s take a moment to learn the basic syntax of ggplot,a highly structured (almost

language-like) plotting package• This command sets up the basis of ggplot:

– It defines the data we are working with (dataSet)– It defines the aesthetics (aes()) that are inherited by the other plotting

functions that follow behind the + signs:• Our x axis (the horizontal axis) will plot the Length variable• Our y axis (the vertical axis) will plot the Mean_RT variable• If we don’t specify a y axis, this will have to be calculated by a plotting

function (stat="bin" or stat="density")library(ggplot2)ggplot(dataSet, aes(x=Length, y=Mean_RT))

4.2.1.1 Plotting in ggplot• After the basic setup, we can add multiple plot layers by adding plotting functions

separated by a + sign:– geom_line() for line graphs– geom_bar() for bar charts– geom_vline() for vertical lines– geom_abline() for horizontal lines– geom_point() for points (scatterplots)– stat_summary() to summarise the y values in your data in function of x

before plotting (here we take the mean of all Mean_RT values that correspond to the same Length value)

library(ggplot2)ggplot(dataSet, aes(x=Length, y=Mean_RT)) + stat_summary(fun.y = "mean", geom = "line")

4.2.1.1 Plotting in ggplotlibrary(ggplot2)ggplot(dataSet, aes(x=Length, y=Mean_RT)) + stat_summary(fun.y = "mean", geom = "line")

4.2.1.2 Density plots (1/2)• Suppose we want to know if our variable Length approaches a normal distribution:

– We draw up a density plot, an ordered plot that visualizes the number of occurrences of each value in our distribution

– We add a red vertical line at the mean– We look for the characteristic bell curve and symmetrical pattern– Observe that the geom_vline has its own aesthetic: xintercept, which

defines the point where it crosses/intercepts the x axis (the x - intercept). We set this aesthetic to the mean of our Lengthvariable

library(ggplot2)ggplot(dataSet, aes(x=Length)) + geom_line(stat="density") + geom_vline(aes(xintercept=mean(dataSet$Length)), color="red")

4.2.1.2 Density plots (2/2)library(ggplot2)ggplot(dataSet, aes(x=Length)) + geom_line(stat="density") + geom_vline(aes(xintercept=mean(dataSet$Length)), color="red")

4.2.1.3 Histograms (1/2)• Suppose we want to know if our variable Length approaches a normal distribution:

– We draw up a histogram, an ordered plot that visualizes the number of occurrences of each value in our distribution

– We add a red vertical line at the mean– We look for the characteristic bell curve and symmetrical pattern

library(ggplot2)ggplot(dataSet, aes(x=Length)) + geom_histogram(bins=11) + geom_vline(aes(xintercept=mean(dataSet$Length)), color="red")

4.2.1.3 Histograms (2/2)

4.2.1.4 QQ-plots (1/2)• Another way to identify if your data are normally distributed are QQ-plots• If the dots follow the line closely, the data are normally distributedqqnorm(dataSet$Length)qqline(dataSet$Length)

4.2.1.4 QQ-plots (2/2)qqnorm(dataSet$Length)qqline(dataSet$Length)

4.2.2 Shapiro-Wilk test of normality• If p < 0.05 (don’t worry about what that means for now), the data is not normally

distributed• Plotting is preferred (especially QQ-plots), because this test depends on sample sizeshapiro.test(dataSet$Length)

## ## Shapiro-Wilk normality test## ## data: dataSet$Length## W = 0.97756, p-value = 0.08559

4.3 Excercises• Go to http://www.jeroenclaes.be/statistics_for_linguistics/class2.html• Do the excercises under 3. The normal distribution


5. Measures of dispersion

5. Measures of dispersion• If you data are normally distributed, the mean and the median are good

representatives of the typical values of a distribution• Our analysis of the Freq and Mean_RTvariables have shown that this is not the case for

non-normal distributions• Measures of dispersion express how much variation there is between values of a

distribution. They include:– Range– Variance– Standard deviation– Interquartile Range– Median absolute deviation

• A boxplot allows you to display dispersion graphically

5.1 RangeThe range is the difference between the maximum and the minimum value of a distribution

max(dataSet$Length) - min(dataSet$Length)

## [1] 12

5.2 Variance• Sum of the squared deviations from the mean, divided by the sample size minus one.• The larger the variance, the larger the differences between the individual values and

the mean• Not on the same scale as the data: squared (the sum of the deviations from the mean

would be zero without squaring it)• If the distribution is non-normal, this will lead us to overestimate the variation

between the mean and an average score# Same as: # sum((dataSet$Length-mean(dataSet$Length))^2 )/(length(dataSet$Length)-1)var(dataSet$Length)

## [1] 6.259697

5.3 Standard deviation• Then standard deviation is the square root of the variance• By taking the square root of the variance (which are the squared deviations from the

mean) we get back to the original units of our data• Expresses the average deviation from the mean

• If the distribution is non-normal, the standard deviation will lead us to overestimate the variation between the mean and an average score

sd(dataSet$Length)

## [1] 2.501939

5.4 Interquartile Range• The interquartile range expresses the difference between the first quartile (the 25%

lowest values) and the third quartile (the 75% lowest values)• More robust for non-normal distributions than the variance and standard deviationsIQR(dataSet$Length)

## [1] 4

5.5 Median absolute deviation• The median of the deviations between the values in the distribution and the median• Based on the median, so it is not sensitive to outliersmad(dataSet$Length)

## [1] 2.9652

5.6 Displaying dispersion graphically (1/2)• A box-and-whisker plot displays the mean and the min and max individual values• If there are outliers, the lines will be longer on one end of the graphlibrary(ggplot2)ggplot(dataSet, aes(x=1, y=Length)) + geom_boxplot()

5.6 Displaying dispersion graphically (2/2)library(ggplot2)ggplot(dataSet, aes(x=1, y=Length)) + geom_boxplot()

5.7 How to report it all?When you describe a distribution, you report:

• Mean• Median• AND:

– If the data are normally distributed, you report the standard deviation– If the data are not normally distributed, you report IQR or mad scores

• Ideally you also provide a box-and-whisker plot

5.8 Excercises• Go to http://www.jeroenclaes.be/statistics_for_linguistics/class2.html• Do the excercises under 4. Measures of dispersion

6. What to do when your data is not distributed normally

6. What to do when your data is not distributed normally• We have seen that Freq and Mean_RT are not normally distributed.


• This does not mean that we cannot analyse them, but we will have to be careful. Most tests do not behave well when the data is not normally distributed or contains outliers (e.g., standard deviation and variance)

• In general when you encounter a non-normal distribution you have two options:– Remove outliers– Apply a transformation

6.1 Remove outliers (1/6)• If the deviation from the normal distribution is caused by a few atypical values, you

may exclude these from your analysis• Let’s consider Mean_RTsummary(dataSet$Mean_RT)

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 564.2 713.1 784.9 808.3 905.2 1458.8

6.1 Remove outliers (2/6)• The summary shows that the interquartile range is 192.1• The difference between the 3rd quartile and the maximum value is 553.6.• This suggest that there may be a few exceptionally large values in the right tail of the

distribution

6.1 Remove outliers (3/6)• Subjects appear to have had much difficulty with recognizing the top-4 words

diacritical, dessertspoon, acquisitiveness, and resuspension• Maybe these were not the best test items?

Length Freq Mean_RT Word LogFreq

11 162 1459 diacritical 5.088

12 11 1314 dessertspoon 2.398

15 4 1217 acquisitiveness 1.386

12 2 1125 resuspension 0.6931

11 185 1089 denominated 5.22

11 46 1055 euphemistic 3.829

6.1 Remove outliers (4/6)• It is usually a good idea to select outliers based on z-scores• z-scores express how far the value is removed from the mean in standard deviation

units• Positive z-scores are values that are larger than the mean, negative z-scores are values

that are smaller

#(dataSet$Mean_RT - mean(dataSet$Mean_RT))/sd(dataSet$Mean_RT)dataSet$zscores <- scale(dataSet$Mean_RT)summary(dataSet$zscores)

## V1 ## Min. :-1.5931 ## 1st Qu.:-0.6210 ## Median :-0.1522 ## Mean : 0.0000 ## 3rd Qu.: 0.6328 ## Max. : 4.2459

6.1 Remove outliers (5/6)• With this score we remove all values that are more than two standard deviations

removed from the mean (absolute value of z >= 2) (Urdan, 2010: 18)• We need the absolute values (we have to ignore their +/- sign), so we wrap the scores

with absdataSet <- dataSet[abs(dataSet$zscores) <= 2, ]

6.1 Remove outliers (6/6)• To check if our problem is solved, we can plot our variableggplot(dataSet, aes(x=Mean_RT)) + geom_line(stat="density") + geom_vline(aes(xintercept=mean(dataSet$Mean_RT)), color="red")

6.2 Apply a transformation (1/6)• The variable Freq cannot be brought to a normal distribution by removing a few

outliers• It has the typical power-law distribution of word frequencies (Zipf’s law):

– A few words occur very often– The rest of the words have a very low frequency

• Distribution with a very broad range• Can only be corrected with a transformation

6.2 Apply a transformation (2/6)• Popular transformations are:

– Logarithm:• Expresses to which power we have to raise the mathematical constant e

= 2.718281828459 (the base) to obtain the input number– log(1000) = 6.907755, because 2.718281828459 ^

6.907755 = 1000• Other bases are:

– 10 (log10(1000) = 3; 10 ^ 3 = 1000)– 2 (log2(1000) = 9.965784; 2 ^ 9.965784 = 1000)

• Fixes power-law distributions• Exponentiate the base with the logarithm to undo

6.2 Apply a transformation (3/6)• Popular transformations are:

– Logarithm (fixes power-law distributions)– Square root (fixes positively skewed distributions)– Square transformation (fixes negatively skewed distributions)– Reciprocal transformation (1/(vector + 1)), for distributions that look like a

j• Best practice is to check which transformation works best for your data

6.2 Apply a transformation (4/6)• Let’s apply the logarithm to the Freq variable• The log of 0 is -Inf, which would render our data useless.• We add 1 to the values of Freq and then we apply the transformation (the log of 1 is 0)

6.2 Apply a transformation (5/6)dataSet$transformedFreq <- log(1+dataSet$Freq)qqnorm(dataSet$transformedFreq)qqline(dataSet$transformedFreq)

6.2 Apply a transformation (6/6)• There’s always a trade-off to be made:

– If you need to have your distribution to be normal for some test, then you can consider a transformation

– Transforming your data makes it harder to interpret the values. E.g., log(dataSet$Freq) is less intuitively understandable than dataSet$Freq

– We will revisit transformations in later classes

Questions??

References• Balota, D.A., Yap, M.J., & Cortese, M.J., et al. (2007). The English Lexicon Project.

Behavior Research Methods, 39(3), 445–459. DOI: 10.3758/BF03193014. Data taken from Levshina (2015).

• Levshina, N. (2015). How to do Linguistics with R: Data exploration and statistical analysis. Amsterdam/Philadelphia, PA: John Benjamins.

Topic 2: Distributions and basic descriptive statistics ... · Web viewExperiment in which...

Documents

Transcript of Topic 2: Distributions and basic descriptive statistics ... · Web viewExperiment in which...