Introduction, Sections 1.1 and 1.3, The R Statistical …people.stat.sc.edu/jiana/STAT 205/Lecture...
Transcript of Introduction, Sections 1.1 and 1.3, The R Statistical …people.stat.sc.edu/jiana/STAT 205/Lecture...
Introduction, Sections 1.1 and 1.3,The R Statistical Package
Instructor: Chendi Jiang
Department of StatisticsUniversity of South Carolina
Elementary Statistics for the Biological and Life Sciences
(Stat 205)
1 / 37
Grading, homework, exams
Stat 205:Elementary Statistics for the Biological and Life SciencesCourse website: https://blackboard.sc.edu.Grade is: 35% for homework and quizzes, 20% each forexams I, II, and 25% for final.Homework: there will be weekly homework graded over thecourse except the exam weeks.Exams are fixed time and location (see course website).Do not ask to turn in homework late; late or missedhomework will receive zero score. Lowest homework scorewill be dropped from your grade.
2 / 37
Homework, exams and course webpage, office hours
About 7-10 homework and random quizzes you will turn infor credit (35% of your grade).Two midterm exams (20% each) and one Final exam(25%) will account for 65% of the grade percentage. Allexams are comprehensive.Strong, positive correlation between attendance and yourfinal grade.All resources for the course will be posted in coursewebsite: https://blackboard.sc.edu. Course website will beupdated frequently.
Both the instructor and teaching assistant will hold officehours, see syllabus for details.
3 / 37
Topics we’ll cover
Graphical displays and summary statisticsProbability, random variables (normal and binomial)Confidence intervals for µ and pTwo-sample testing and CI2× 2 tables: relative risk & odds ratiosAnalysis of varianceLinear regressionLogistic regression, survival analysis, ROC curvesUse of the statistical package R to analyze real data
4 / 37
R computing & graphics package
R is a powerful, free statistical computing and graphicspackage.Popular with many researchers due to contributedpackages: R functions to do specialized, advanced, &often complex statistical analysis.R can also do many important, routine calculations,analysis, and provide common graphical displays used inthis course.Installed in several of the computing labs across campus,e.g. Sloan 108 & 109, Gambrell 003.You can download it and install it from CRAN:http://cran.r-project.org/
5 / 37
6 / 37
Get Started
• Installationgoogle R→ The R project for Statistical ComputingR64 bits (large datasets) vs. R32 bits
Getting help with RAt the command prompt, type, for example ?read.table orhelp(read.table)At the command prompt, type, for example,help.search(”read”) or apropos(”read”).Within R, use the menu bar: Help: R help.Quititng R. q()How to save work space
7 / 37
R Resources
John Verzani’s SimpleR notesR Reference CardCRAN (Document/Manuals)
Note: To run some of the example in John Verzani’s notes, run:> install.packages(”UsingR”)> library(UsingR)
8 / 37
More on R
R allows you to do all analysis covered in this course, andbeyond.There are some tutorials, both installed in R and on theweb. Under Help choose Manuals (in PDF) andchoose An introduction to R. This is a longerintroduction to R.For homework, I will give you a skeleton set of commandsto get the basic job done with no frills.R’s error messages can be cryptic and therefore R is notas “user friendly” as some other packages such as Minitab.However it is free and is being used by hundreds ofthousands of people.
9 / 37
The Comprehensive R Archive Network
Here is where you download R.
10 / 37
Installing R
From http://cran.r-project.org/, under Download and Install Rclick on your platform (Linux, MacOS X, or Windows).
for Windows click on Download R for Windows.
Click Save File and when it’s done downloading run the executable by clicking
on it – alternatively you can choose to Run Program directly after downloadingfrom the web.
The installation program will ask you a series of questions; choose the defaults.(e.g. English language, the suggested installation folder, the checked selectedcomponents to install, not to customize startup options, shortcut in the StartMenu, and additional tasks).
When it’s done, click on the new R desktop icon. Click on the console. This iswhere you will type commands to R.
11 / 37
The R interface
Initially, there is only the console window open. If you makeplots, other windows will open too.
12 / 37
Some code to try
Note that the # sign is a “comment” – R ignores anything after#.
# generate some random normal datadata=rnorm(100)# look at a histogram and a boxplothist(data)boxplot(data)# compute the sample mean, median, variance, standard deviationmean(data)median(data)var(data)sd(data)# if you have a question about a command, preface it with ??hist
13 / 37
Motivation: why analyze data?
Clinical trials/drug development compare existingtreatments with new methods to cure disease.Agriculture enhance crop yields, improve pest resistance.
Ecology study how ecosystems develop/respond toenvironmental impacts.Lab studies learn more about biological tissue/cellular
activity.
14 / 37
1.1 Statistics and the life sciences
Statistics is the science ofcollecting,summarizing,analyzing,interpreting
data.Goal: to understand the underlying biological phenomenathat generate the data.Statistics separates signal from noise.Are there associations or relationships among variables inthe data?
15 / 37
Numerical Reasoning: Example1
The following are profits (in millions $) per month from Jan. toDec. 20, 20.4, 20.8, 20.9, 21, 21.2, 21.2, 21.2, 21.4, 21.6, 21.8,22.Suppose you want to show your boss how much the profits aregrowing over the year.
How might you graph the data?
16 / 37
17 / 37
Numerical Reasoning: Example1
Suppose the data are exactly the same as before, but the datarepresent expenses.Now you want to convince your boss that the expenses are notgrowing over the year.
How might you graph the data?
18 / 37
19 / 37
Example 1.1.2: liver tumors in mice
Is there a risk difference between germ environment(germ-free vs. E. coli) and whether liver tumors develop?Is the association perfect?Statistics can help answer whether there’s a difference andfurther quantify the effect of germ exposure (Chapter 10).
20 / 37
Example 1.1.4: MAO and schizophrenia
Monoamine oxidase (MAO) enzyme thought to regulatebehavior.Blood from n = 42 schizophrenia patients collected, stratified bydiagnosis (I, II, III).Are there differences in MAO enzyme and disease diagnosis?
21 / 37
Example 1.1.4: MAO and schizophrenia
What happens to MAO as severity of diagnosis increases? Is therelationship perfect?These are side-by-side dotplots, described in Sec. 2.2. Formalapproach in Chapter 11.
22 / 37
1.2 Types of Evidence: Two main types of datacollection
Observational study : In an observational study theresearcher systematically collects data from subjects, butonly as an observer and not as someone who ismanipulating conditions.Experiment : A research method where the researcherimpose the conditions and manipulates one variable at atime in order to prove causality.
23 / 37
Observational studiesExample 1.2.2 Sexual orientation and AC
Study enrolled 30 homosexual men, 30 heterosexual men,and 30 heterosexual womenThen observe endpoints: midsagittal area of the anteriorcommissure (AC)
24 / 37
Observational studies
25 / 37
ExperimentExample 1.2.4 Toxicity in Dogs
Before being tested on humans, toxicity of a certain drug wastested on 8 female and 8 male dogs. Females were randomlyallocated into two groups of 4 dogs - one taking 8mg/kg andone taking 25 mg/kg of the drug; similarly for the male dogs.The alkaline phosphate level was measured after receiving thedrug.
26 / 37
Example 1.2.4 Toxicity in Dogs
27 / 37
Other topics and terms
Blinding: An experiment is said to be double-blind if boththe subject and the response measurer do not know whichtreatment the subject is getting.Control Groups: Usually in the life sciences we comparetwo or more groups to one another. Often, one of thegroups is a control group.
28 / 37
1.3 Random sampling
The population is all the subjects/animals/specimens/etc.of interest.Since we can’t measure the entire population (usually) wetake a small sample of size n and use the data collected toinfer about the population.
29 / 37
Simple Random Sampling
A simple random sample of n items is a data set where
every population element has an equal chance of selectionevery population element is chosen independently of everyother element
30 / 37
Cluster Sampling
Divide the population (sampling frame) into clusters, samplewhole clusters.
31 / 37
Stratified Random Sampling
First, divide the population (sampling frame) into strata, thensample a specified number of individuals from each strata.
32 / 37
MAO data: R code
# read data from web, take 1st & 2nd columns as mao and group indicators, plotstuff=read.table("http://people.stat.sc.edu/Chakrabp/mao.txt",header=FALSE)stuffmao=stuff[,1]maogroup=stuff[,2]groupplot(group,mao)# you can also read data from a file on your computer (text, Excel, etc.)
33 / 37
MAO data: output
>stuff=read.table("http://people.stat.sc.edu/Chakrabp/mao.txt",header=FALSE)> stuff
V1 V21 6.8 12 4.1 13 7.3 14 14.2 15 18.8 16 9.9 17 7.4 18 11.9 19 5.2 110 7.8 111 7.8 112 8.7 113 12.7 114 14.5 115 10.7 116 8.4 117 9.7 118 10.6 119 7.8 220 4.4 221 11.4 222 3.1 223 4.3 224 10.1 225 1.5 2
34 / 37
MAO data: output continued
26 7.4 227 5.2 228 10.0 229 3.7 230 5.5 231 8.5 232 7.7 233 6.8 234 3.1 235 6.4 336 10.8 337 1.1 338 2.9 339 4.5 340 5.8 341 9.4 342 6.8 3> mao=stuff[,1]> mao[1] 6.8 4.1 7.3 14.2 18.8 9.9 7.4 11.9 5.2 7.8 7.8 8.7 12.7 14.5 10.7 8.4 9.7
[18] 10.6 7.8 4.4 11.4 3.1 4.3 10.1 1.5 7.4 5.2 10.0 3.7 5.5 8.5 7.7 6.8 3.1[35] 6.4 10.8 1.1 2.9 4.5 5.8 9.4 6.8> group=stuff[,2]> group[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
> plot(group,mao)
35 / 37
Plot of MAO data from R
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
1.0 1.5 2.0 2.5 3.0
510
15
group
moa
You can right click on an R plot to save it to the clipboard as a metafile or bitmap.
These can be saved into Microsoft applications such as Word. You can also left click on
the plot then under Save choose Save as and save the plot, e.g. PDF.
36 / 37
Plot of MAO data from R
R window after cutting and pasting the commands a few slidesago.
37 / 37