CA200 Dublin City University (based on the book by Prof. Jane M. Horgan) 1.
CA200
(based on the book by Prof. Jane M. Horgan)
3. Basics of R – cont.
   • Summarising Statistical Data
   • Graphical Displays
4. Basic distributions with R

Basics
    6+7*3/2            #general expression
    [1] 16.5
    x <- 1:4           #integers are assigned to the vector x
    x                  #print x
    [1] 1 2 3 4
    x2 <- x**2         #square the elements, or x2 <- x^2
    x2
    [1]  1  4  9 16
    X <- 10            #case sensitive!
    prod1 <- X*x
    prod1
    [1] 10 20 30 40

Getting Help
• click the Help button on the toolbar
• help()
• help.start()
• demo()
• ?read.table
• help.search("data.entry")
• apropos("boxplot") returns names such as "boxplot", "boxplot.default", "boxplot.stats"

Statistics: Measures of Central Tendency
Typical or central points:
• Mean: Sum of all values divided by the number of cases
• Median: Middle value. 50% of data below and 50% above
• Mode: Most commonly occurring value, value with the highest frequency
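
The three measures above can be tried out directly in R. The sketch below uses the downtime values from Example 1.1. Base R provides mean() and median(), but its mode() function reports an object's storage mode rather than the most frequent value, so the helper stat_mode is a hypothetical name introduced here for illustration only.

```r
# Downtime data from Example 1.1
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

# stat_mode is a hypothetical helper, not part of base R:
# it returns every value that attains the highest frequency.
stat_mode <- function(v) {
  counts <- table(v)                       # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])
}

mean(downtime)       # sum of values divided by the number of cases
median(downtime)     # middle value of the sorted data
stat_mode(downtime)  # most frequently occurring value(s)
```

If several values tie for the highest frequency, stat_mode returns all of them, which is one reasonable convention among several.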

Statistics: Measures of Dispersion
Spread or variation in the data
• Standard Deviation (σ): The square root of the average squared deviation from the mean
  - measures how the data values differ from the mean
  - a small standard deviation implies most values are near the average
  - a large standard deviation indicates that values are widely spread above and below the average

Statistics: Measures of Dispersion
Spread or variation in the data
• Range: Lowest and highest value
• Quartiles: Divide the data into quarters. The 2nd quartile is the median
• Interquartile Range: Distance between the 1st and 3rd quartiles; it spans the middle 50% of the data
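
As a quick illustration of these measures, the following sketch applies the corresponding base R functions to the downtime values from Example 1.1. It assumes R's default quantile method, which is also what summary() uses.

```r
# Downtime data from Example 1.1
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

range(downtime)         # lowest and highest value
diff(range(downtime))   # width of the range as a single number
quantile(downtime)      # minimum, 1st quartile, median, 3rd quartile, maximum
IQR(downtime)           # 3rd quartile minus 1st quartile
```

Other definitions of the quartiles exist (see the type argument of quantile), so values computed by hand may differ slightly for small samples.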

Data Entry
• Entering data from the screen to a vector
• Example 1.1:
    downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29, 30, 30, 30, 33, 36, 44, 45, 47, 51)
    mean(downtime)
    [1] 25.04348
    median(downtime)
    [1] 25
    range(downtime)
    [1] 0 51
    sd(downtime)
    [1] 14.27164

Data Entry – cont.
• Entering data from a file to a data frame
• Example 1.2: Examination results: results.txt
    gender arch1 prog1 arch2 prog2
    m         99    98    83    94
    m         NA    NA    86    77
    m         97    97    92    93
    m         99    97    95    96
    m         89    92    86    94
    m         91    97    91    97
    m        100    88    96    85
    f         86    82    89    87
    and so on

Data Entry – cont.
• NA indicates a missing value.
• No mark for arch1 and prog1 in the second record.
• Reading the file:
    results <- read.table("C:\\results.txt", header = T)   # download the file to desired location
    results$arch1[5]
    [1] 89
• Alternatively, attach(results) allows you to access the columns without the prefix results$:
    attach(results)
    names(results)
    arch1[5]
    [1] 89
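
A possible alternative to attach(), shown here only as a sketch: with() evaluates an expression inside a data frame without changing the search path, so column names cannot silently clash with other variables. The data frame below holds just the first five records listed in Example 1.2.

```r
# First five records of results.txt as listed in Example 1.2
results <- data.frame(
  gender = c("m", "m", "m", "m", "m"),
  arch1  = c(99, NA, 97, 99, 89),
  prog1  = c(98, NA, 97, 97, 92),
  arch2  = c(83, 86, 92, 95, 86),
  prog2  = c(94, 77, 93, 96, 94)
)

results$arch1[5]          # explicit prefix
with(results, arch1[5])   # no prefix, and nothing is attached
```

Both expressions return the fifth arch1 mark; the choice between them is a matter of style.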

Data Entry – Missing values
    mean(arch1)
    [1] NA               #no result because some marks are missing
• Use na.rm = T or na.rm = TRUE (not available, remove) to drop the missing values first:
    mean(arch1, na.rm = T)
    [1] 83.33333
    mean(prog1, na.rm = T)
    [1] 84.25
    mean(arch2, na.rm = T)
    mean(prog2, na.rm = T)
• For all subjects at once, use colMeans on the numeric columns. In current versions of R, mean(results, na.rm = T) no longer works on a data frame and returns NA with a warning.
    colMeans(results[2:5], na.rm = T)
       arch1    prog1    arch2    prog2
    94.42857 93.00000 89.75000 90.37500
• Note: this last output corresponds to only the eight records listed in Example 1.2, not to the full file, which is why it differs from mean(arch1, na.rm = T) above.
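
The handling of missing values can be checked on the eight records listed in Example 1.2. This is a minimal sketch that rebuilds those rows by hand rather than reading results.txt, so the means refer to these eight students only.

```r
# The eight records of results.txt listed in Example 1.2
results <- data.frame(
  gender = c("m", "m", "m", "m", "m", "m", "m", "f"),
  arch1  = c(99, NA, 97, 99, 89, 91, 100, 86),
  prog1  = c(98, NA, 97, 97, 92, 97, 88, 82),
  arch2  = c(83, 86, 92, 95, 86, 91, 96, 89),
  prog2  = c(94, 77, 93, 96, 94, 97, 85, 87)
)

is.na(results$arch1)        # TRUE where a mark is missing
sum(is.na(results$arch1))   # number of missing arch1 marks
mean(results$arch1)         # NA, because one mark is missing
subject_means <- colMeans(results[2:5], na.rm = TRUE)  # per-subject means, NAs dropped
subject_means
```

Selecting columns 2 to 5 leaves out gender, which is not numeric and would otherwise make colMeans fail.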

Data Entry – cont.
• Use "read.table" if data in the text file are separated by spaces
• Use "read.csv" when data are separated by commas
• Use "read.csv2" when data are separated by semicolons
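
To see the three functions side by side, the sketch below parses the same three records in each layout. It relies on the text argument of read.table (passed through by read.csv and read.csv2) so that no file is needed; the variable names are illustrative only.

```r
# The same three records in three layouts; text = lets the read.* functions
# parse a character string instead of a file.
space_sep <- "gender arch1 prog1
m 99 98
m NA NA
f 86 82"
comma_sep <- "gender,arch1,prog1
m,99,98
m,NA,NA
f,86,82"
semi_sep <- "gender;arch1;prog1
m;99;98
m;NA;NA
f;86;82"

from_spaces <- read.table(text = space_sep, header = TRUE)
from_commas <- read.csv(text = comma_sep)    # header = TRUE is the default
from_semis  <- read.csv2(text = semi_sep)    # also expects a decimal comma

all.equal(from_spaces, from_commas)          # compares the two data frames
```

With a real file, the first argument would be the file path instead of text.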

Data Entry – cont.
Entering data into a spreadsheet:
    newdata <- data.frame()    #creates an empty data frame called newdata
    fix(newdata)               #opens it in a spreadsheet-style editor so that data can be added

Summary Statistics
Example 1.1: Downtime:
    summary(downtime)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       0.00   16.00   25.00   25.04   31.50   51.00
Example 1.2: Examination Results:
    summary(results)
     gender     arch1             prog1           arch2           prog2
     f: 4   Min.   :  3.00   Min.   :65.00   Min.   :56.00   Min.   :63.00
     m:22   1st Qu.: 79.25   1st Qu.:80.75   1st Qu.:77.75   1st Qu.:77.50
            Median : 89.00   Median :82.50   Median :85.50   Median :84.00
            Mean   : 83.33   Mean   :84.25   Mean   :81.15   Mean   :83.85
            3rd Qu.: 96.00   3rd Qu.:90.25   3rd Qu.:91.00   3rd Qu.:92.50
            Max.   :100.00   Max.   :98.00   Max.   :96.00   Max.   :97.00
            NA's   :  2.00   NA's   : 2.00

Summary Statistics – cont.
Example 1.2: Examination Results:
For a separate analysis use:
    mean(results$arch1, na.rm=T)     # hint: use attach(results)
    [1] 83.33333
    summary(arch1, na.rm=T)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       3.00   79.25   89.00   83.33   96.00  100.00    2.00

Programming in R
• Example 1.3: Write a program to calculate the mean of downtime
Formula for the mean: mean = (sum of all values) / (number of values)
    x <- sum(downtime)          # sum of elements in downtime
    n <- length(downtime)       # number of elements in the vector
    mean_downtime <- x/n
or
    mean_downtime <- sum(downtime) / length(downtime)

Programming in R – cont.
• Example 1.4: Write a program to calculate the standard deviation of downtime
    #hint - use sqrt function
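
One possible solution to Example 1.4 is sketched below; it is not necessarily the intended one. It assumes the divisor n - 1, which is what R's built-in sd() uses and which reproduces the value 14.27164 from Example 1.1. Dividing by n instead gives the literal "average squared deviation" version, which is slightly smaller.

```r
# Downtime data from Example 1.1
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

n <- length(downtime)                           # number of elements
mean_downtime <- sum(downtime) / n              # mean, as in Example 1.3
squared_dev <- (downtime - mean_downtime)^2     # squared deviations from the mean

sd_downtime   <- sqrt(sum(squared_dev) / (n - 1))  # divisor n - 1, as in sd()
sd_population <- sqrt(sum(squared_dev) / n)        # divisor n

sd_downtime
sd(downtime)   # built-in function, for comparison
```

The subtraction downtime - mean_downtime works element by element, so no loop is needed.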

Graphical displays – Boxplots
• Boxplot – a graphical summary based on the median, quartiles and extreme values
    boxplot(downtime)
• box represents the interquartile range, which contains the middle 50% of cases
• line across the box represents the median
• whiskers are lines that extend from the box to the lowest and highest values that are not extreme
• extreme values are cases more than 1.5 box lengths beyond the ends of the box
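
The 1.5 box length rule can be worked through numerically. The sketch below computes the limits by hand for the downtime data and then calls boxplot.stats, which returns the numbers that boxplot() draws. The names lower_fence and upper_fence are illustrative. Note that boxplot() works from hinges, which can differ slightly from quantile() results on other data sets.

```r
# Downtime data from Example 1.1
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

q <- quantile(downtime, c(0.25, 0.75))      # ends of the box
box_length <- IQR(downtime)                 # length of the box
lower_fence <- q[[1]] - 1.5 * box_length    # anything below is extreme
upper_fence <- q[[2]] + 1.5 * box_length    # anything above is extreme

downtime[downtime < lower_fence | downtime > upper_fence]  # extreme values, if any

bs <- boxplot.stats(downtime)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points that would be drawn individually as outliers
```

For these data no value falls outside the limits, so the whiskers reach the minimum and maximum.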

Graphical displays – Boxplots – cont.
• To improve the graphical display use labels:
    boxplot(downtime, xlab = "downtime", ylab = "minutes")

Graphical displays – Multiple Boxplots
• Multiple boxplots on the same axes, by adding extra arguments to the boxplot function:
    boxplot(results$arch1, results$arch2, xlab = "Architecture, Semesters 1 and 2")
• Conclusions:
  – marks are lower in sem2
  – range of marks is narrower in sem2
• Note the outliers in sem1! These are atypical values lying more than 1.5 box lengths beyond the ends of the box.

Graphical displays – Multiple Boxplots – cont.
• Displays values per gender:
    boxplot(arch1~gender, xlab = "gender", ylab = "Marks(%)", main = "Architecture Semester 1")
• Note the effect of using main = "Architecture Semester 1"

Par
Display plots using the par function
    par(mfrow = c(2,2))     #outputs are displayed in 2x2 array
    boxplot(arch1~gender, main = "Architecture Semester 1")
    boxplot(arch2~gender, main = "Architecture Semester 2")
    boxplot(prog1~gender, main = "Programming Semester 1")
    boxplot(prog2~gender, main = "Programming Semester 2")
To undo matrix type:
    par(mfrow = c(1,1))     #restores graphics to the full screen

Par – cont.
Conclusions:
- female students are doing less well in programming for sem1
- median for female students for prog. sem1 is lower than for male students
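
A comparison of medians like the one above can also be made numerically. The sketch below uses tapply to compute the median per gender. The marks are hypothetical values invented for illustration and are not the course data set.

```r
# Hypothetical marks, invented for illustration; NOT the course data set.
marks <- data.frame(
  gender = c("f", "f", "f", "m", "m", "m", "m"),
  prog1  = c(70, 78, NA, 82, 88, 91, 75)
)

# Apply median() to prog1 separately within each gender, ignoring NAs
medians <- tapply(marks$prog1, marks$gender, median, na.rm = TRUE)
medians
```

The same call with mean or sd in place of median gives other per-group summaries.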

Histograms
• A histogram is a graphical display of frequencies in the categories of a variable
    hist(arch1, breaks = 5, xlab = "Marks(%)", ylab = "Number of students", main = "Architecture Semester 1")
• Note: A histogram with five breaks of equal width counts the observations that fall within the categories or "bins". R treats breaks = 5 as a suggestion and may choose a nearby number of bins.
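
To inspect the bins that hist actually chooses, the computation can be run without drawing anything. This sketch uses the downtime data from Example 1.1, since the full examination results are not reproduced in these notes.

```r
# Downtime data from Example 1.1
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

h <- hist(downtime, breaks = 5, plot = FALSE)  # compute the bins, draw nothing
h$breaks   # bin boundaries actually used
h$counts   # number of observations falling in each bin
```

Comparing length(h$counts) with the requested number of breaks shows whether R followed the suggestion exactly.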

Histograms
    hist(arch2, xlab = "Marks(%)", ylab = "Number of students", main = "Architecture Semester 2")
• Note: A histogram with default breaks

Using par with histograms
• The par function can be used to show all the subjects in one diagram
    par(mfrow = c(2,2))
    hist(arch1, xlab = "Architecture", main = " Semester 1", ylim = c(0, 35))
    hist(arch2, xlab = "Architecture", main = " Semester 2", ylim = c(0, 35))
    hist(prog1, xlab = "Programming", main = " ", ylim = c(0, 35))
    hist(prog2, xlab = "Programming", main = " ", ylim = c(0, 35))
Note: ylim = c(0, 35) ensures that the y-axis has the same scale for all four plots!

Stem and leaf
• Stem and leaf – more modern way of displaying data! Like a histogram, the diagram gives the frequencies of categories, but it also shows the actual values in each category
• Stem usually depicts the 10s and the leaves depict units.
    stem(downtime, scale = 2)

      The decimal point is 1 digit(s) to the right of the |

      0 | 012
      1 | 2248
      2 | 1134589
      3 | 00036
      4 | 457
      5 | 1

Stem and leaf – cont.
    stem(prog1, scale = 2)

      The decimal point is 1 digit(s) to the right of the |

      6 | 5
      7 | 12
      7 | 66
      8 | 01112223
      8 | 5788
      9 | 012
      9 | 7778

Note: e.g. there are many students with marks between 80% and 85%

Scatter Plots
• To investigate the relationship between variables:
    plot(prog1, prog2, xlab = "Programming, Semester 1", ylab = "Programming, Semester 2")
• Note:
  - one variable increases with the other!
  - students doing well in prog1 tend to do well in prog2!

Pairs
• If more than two variables are involved:
    courses <- results[2:5]
    pairs(courses)          #scatter plots for all possible pairs
or
    pairs(results[2:5])

Pairs – cont.
[Figure: matrix of scatter plots produced by pairs(courses)]

Graphical display vs. Summary Statistics
• Importance of graphical display to provide insight into the data!
• Anscombe (1973), four data sets
• Each data set consists of two variables on which there are 11 observations

Graphical display vs. Summary Statistics
      Data Set 1     Data Set 2     Data Set 3     Data Set 4
      x1    y1       x2    y2       x3    y3       x4    y4
      10    8.04     10    9.14     10    7.46      8    6.58
       8    6.95      8    8.14      8    6.77      8    5.76
      13    7.58     13    8.74     13   12.74      8    7.71
       9    8.81      9    8.77      9    7.11      8    8.84
      11    8.33     11    9.26     11    7.81      8    8.47
      14    9.96     14    8.10     14    8.84      8    7.04
       6    7.24      6    6.13      6    6.08      8    5.25
       4    4.26      4    3.10      4    5.39     19   12.50
      12   10.84     12    9.13     12    8.15      8    5.56
       7    4.82      7    7.26      7    6.42      8    7.91
       5    5.68      5    4.74      5    5.73      8    6.89

First read the data into separate vectors:
    x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
    x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y2 <- c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)
    x3 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
    x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
    y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)

For convenience, group the data into frames:
    dataset1 <- data.frame(x1,y1)
    dataset2 <- data.frame(x2,y2)
    dataset3 <- data.frame(x3,y3)
    dataset4 <- data.frame(x4,y4)

• It is usual to obtain summary statistics:
1. Calculate the mean:
    colMeans(dataset1)
          x1       y1
    9.000000 7.500909
    colMeans(data.frame(x1,x2,x3,x4))
    x1 x2 x3 x4
     9  9  9  9
    colMeans(data.frame(y1,y2,y3,y4))
          y1       y2       y3       y4
    7.500909 7.500909 7.500000 7.500909
2. Calculate the standard deviation:
    sapply(data.frame(x1,x2,x3,x4), sd)
          x1       x2       x3       x4
    3.316625 3.316625 3.316625 3.316625
    sapply(data.frame(y1,y2,y3,y4), sd)
          y1       y2       y3       y4
    2.031568 2.031657 2.030424 2.030579
Everything seems the same!
Note: current versions of R no longer accept a data frame in mean() or sd(), hence colMeans() and sapply(..., sd) here.
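
The same comparison can be made without typing the vectors by hand, because Anscombe's four data sets ship with R as the data frame anscombe (columns x1 to x4 and y1 to y4). The following is a sketch; the layout of the summary table is just one convenient choice.

```r
# anscombe is a data frame in R's datasets package: columns x1..x4, y1..y4
data(anscombe)

means <- colMeans(anscombe)      # mean of every column
sds   <- sapply(anscombe, sd)    # standard deviation of every column

# One row per statistic, one column per variable
round(rbind(mean = means, sd = sds), 3)
```

Laying the statistics out in one table makes it easy to see how closely the four data sets agree.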

• But when we plot:
    par(mfrow = c(2, 2))
    plot(x1, y1, xlim = c(0, 20), ylim = c(0, 13))
    plot(x2, y2, xlim = c(0, 20), ylim = c(0, 13))
    plot(x3, y3, xlim = c(0, 20), ylim = c(0, 13))
    plot(x4, y4, xlim = c(0, 20), ylim = c(0, 13))

Everything seems different!
Graphical displays are the core of getting insight/feel for the data!
Note:
1. Data set 1 is linear with some scatter
2. Data set 2 is quadratic
3. Data set 3 has an outlier. Without it the data would be linear
4. Data set 4 contains x values which are all equal except for one outlier. If it were removed, the data would lie on a vertical line.