Carolin Kosiol
Institute of Population Genetics
Vetmeduni Vienna
Special Statistics in Biomedicine (Spezielle Statistik in der Biomedizin)
Winter semester 2014/15
Basic Statistical Concepts
Aims of the course
An intuitive understanding of the fundamental
concepts in probability and statistics
Computational approaches using R
Practical understanding of linear models and related
concepts
The statistical model and frameworks that allow us
to identify specific genetic differences responsible
for differences in organisms that we can measure
You will be able to analyze a large data set for this
particular problem, e.g., a Genome-Wide Association
Study (GWAS)
Dates
26.11.2014 9:00-12:00 13:00-16:00
27.11.2014 9:00-12:00 13:00-16:00
28.11.2014 9:00-13:00 (homework assignment)
10.12.2014 9:00-12:00 13:00-15:00
11.12.2014 9:00-12:00 13:00-16:00
18.12.2014 13:00-14:00 (exam date)
Grading
Participation during lectures 5%
Homework assignment 35%
Written exam 60%
Website Resource
i122server.vuwien.ac.at/pop/Kosiol_website/SpezStatistik2014/
login: student
Password: statistics
The information will be updated throughout the
course.
We will post slides for the computer labs and code
We will post all homeworks, exams, solutions, etc.
Book recommendations
Statistics: An Introduction using R
- Michael Crawley
- http://www3.imperial.ac.uk/naturalsciences/research/statisticsusingr
- Wiley and Sons
- ISBN-10: 0470022981
- ISBN-13: 978-0470022986
- 1 copy @ library
• ground floor, call number 52079; BL-05.00
Book recommendations
Applied Statistical Genetics with R
- Andrea S. Foulkes
- Springer 2009
- ISBN: 978-0-387-89554-3
(quite technical)
Genome-Wide Disease Association
Analysis: A Primer
- Bruce Rannala
- Preview chapter: http://rannala.org/books/CUPChap2.pdf
(similar choice of topics, free)
Aims of Statistics
• Development and application of methods to
– collect
– summarize
– analyze
– interpret
data.
Problem Structure
• Data result from some underlying parameters +
random fluctuation
• Statistical methods allow us to learn about the
underlying parameters
• Statistical model: Specify how the data are
influenced by parameters and randomness
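The idea "data = underlying parameter + random fluctuation" can be sketched in R as a small simulation. This is only an illustration: the values of mu, sigma and n are made up, and the normal distribution for the fluctuation is an assumption, not something the slides prescribe.

```r
# Minimal sketch: observed data = underlying parameter + random fluctuation.
set.seed(1)                      # make the simulation reproducible

mu    <- 120                     # assumed "true" parameter (hypothetical value)
sigma <- 10                      # assumed size of the random fluctuation
n     <- 50                      # number of observations

noise <- rnorm(n, mean = 0, sd = sigma)   # random fluctuation
y     <- mu + noise                       # the data we would actually observe

mu_hat <- mean(y)                # estimate of the parameter from the data
print(mu_hat)                    # close to mu, but not exactly equal
```

In a real analysis only y is observed; the statistical method (here simply the mean) is what lets us learn about mu.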
Situations where Statistics helps …
• Agriculture: which variety and fertilizer produces
highest yield?
• Medicine: which therapy for a certain disease is the
most effective?
• Nutrition: does green tea reduce the risk of certain
types of cancer?
• Statistical Genetics: find out about mutations that
convey resistance or susceptibility to a disease.
Statistical Genetics
We know that aspects of an organism (measurable
attributes and states such as disease) are influenced by
the genome (the entire DNA sequence) of an individual
This means that differences in genomes (genotypes) can
produce differences in a phenotype:
• Genotype - any quantifiable genomic difference among
individuals, e.g. Single Nucleotide Polymorphisms (SNPs).
• Phenotype - any measurable aspect of an organism
(that is not the genotype!).
Examples?
An Illustration
Example: People are different...
We know that environment plays a role in these differences ...and
for many, differences in the genome play a role
For any two people, there are millions of differences in their DNA, a
subset of which are responsible for producing differences in a given
measurable aspect.
An Illustration (cont.)
The problem: for any two people, there can be millions of
differences in their genomes...
How do we figure out which differences are involved in
producing differences and which ones are not?
This course is concerned with how we do this.
Note that the problem (and methodology) applies to any
measurable difference, for any type of organism!!
Why do we want to know this?
We can target the genomic differences responsible for genetic
diseases with gene therapy
We can manipulate the genomes of agricultural crops to
produce disease-resistant strains
We can explain why a disease has a particular frequency
in a population, why we see a particular set of differences
These differences provide a foundation for understanding
how pathways, developmental processes, physiological
processes work
The list goes on...
Statistical Genomics
Traditionally, determining the impact of genome
differences on phenotypes was the province of the field of
“Genetics”
Given this dependence on genomes, it is no surprise that
modern genetic fields now incorporate genomics: the
study of an organism’s entire genome (Wikipedia
definition)
However, one can study genetics without genomics (i.e.
without direct information concerning DNA), and the
merging of genetics and genomics is quite recent
History of Statistical Genetics
In sum: during the last decade, the greater availability of DNA
sequence data has completely changed our ability to make
connections between genome differences and phenotypes
Quantitative Genomics
In this course, we will use statistical modeling to say
something about biology, specifically the relationships
between genotype (DNA) and phenotype
Quantitative genomics is a field concerned with the
modeling of the relationship between genomes and
phenotypes and using these models to discover and
predict
We will use frameworks from the fields of probability and
statistics for this purpose
Advances in sequencing
Sources of random fluctuation
• Random sampling
• Measurement error
• Sequencing errors
• Genetic drift
• Other sources (depending on application)
Genomics & Statistics
A non-technical definition of probability: a mathematical
framework for modelling under uncertainty
Such a system is particularly useful for modelling
systems where we don’t know and/or cannot measure
critical information for explaining the patterns we observe
This is exactly the case we have in quantitative
genomics when connecting differences in a genome to
differences in phenotypes
Genomics & Statistics
We are interested in using a probability model to identify
relationships between genomes and phenotypes using
DNA sequences and phenotype measurements.
For this purpose, we will use the framework of statistics,
which we can (non-technically) define as a system for
interpreting data for the purposes of prediction and
decision making given uncertainty.
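As a rough sketch of what "relating genotype to phenotype under uncertainty" can look like in R, the following simulates one SNP and one quantitative phenotype and fits a linear model. Everything here is hypothetical: the allele frequency, the effect size beta, the noise level and the additive coding of the genotype (0, 1 or 2 copies of allele "A") are assumptions chosen for illustration only.

```r
# Minimal sketch with simulated data: one SNP, one quantitative phenotype.
set.seed(10)

n        <- 200                               # number of individuals (hypothetical)
genotype <- rbinom(n, size = 2, prob = 0.3)   # copies of allele "A": 0, 1 or 2
beta     <- 0.5                               # assumed effect per "A" allele

# Phenotype = baseline + genetic effect + random fluctuation
phenotype <- 10 + beta * genotype + rnorm(n, mean = 0, sd = 1)

# Linear model: does the phenotype change with the number of "A" alleles?
fit <- lm(phenotype ~ genotype)
print(coef(summary(fit)))    # estimate, standard error, t value, p-value
```

The row for genotype in the coefficient table gives an estimate of the per-allele effect together with a measure of its uncertainty.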
Basic Statistical Concepts
Population and Sample
Random Samples
Controlled experiment versus observational studies
Types of Data
Descriptive Statistics vs. Statistical Inference
Population and Sample
Population: the set of all individuals or units from which
observations are taken
Population can be real or hypothetical, choice of
population depends on aim of investigation
Sample: Subset of the population that is actually
investigated
Different samples lead to different conclusions: statistical
methods take uncertainty due to sampling into account
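A small simulation can make the sampling uncertainty visible. The sketch below draws several samples from one made-up population and compares their means; the population (normally distributed values with mean 170 and standard deviation 8) and the sample size are assumptions for illustration.

```r
# Minimal sketch: different samples from the same population give different results.
set.seed(42)

# Hypothetical population of 10000 values
population <- rnorm(10000, mean = 170, sd = 8)

# Draw 5 independent samples of size 25 and compute the mean of each
sample_means <- replicate(5, mean(sample(population, size = 25)))

print(round(sample_means, 1))   # the five sample means differ from each other
print(mean(population))         # the population mean they all try to estimate
```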
Random Samples
Simple random sampling: each member of the population has the same chance of being included in the sample.
Other types of random sampling: the chance of being included is well defined, but need not be equal.
Random sampling helps ensure that the sample is representative of the population.
Ideal, but often difficult to achieve.
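In R, both kinds of random sampling can be sketched with sample(). The population of 100 numbered individuals and the selection weights below are hypothetical.

```r
# Minimal sketch of random sampling with sample().
set.seed(7)

ids <- 1:100                      # hypothetical population: individuals 1..100

# Simple random sample: every individual has the same chance of being drawn
# (sampling without replacement is the default)
srs <- sample(ids, size = 10)

# Sampling with unequal but well-defined selection weights:
# individuals 1-50 get three times the weight of individuals 51-100
w <- ifelse(ids <= 50, 3, 1)
weighted_sample <- sample(ids, size = 10, prob = w)

print(srs)
print(weighted_sample)
```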
Observational Study versus Experiment
Controlled experiment: the investigator chooses the values of
the explanatory variables of interest.
Observational study: Both explanatory variables and
responses are observed.
Important consequences when investigating cause-effect
relationships.
Response vs. Explanatory variables
Example 1
Example 2
Example 3
Example 4
Types of Data
• Qualitative variables: gender, species, hair color, …
• Quantitative variables
– Discrete: things you count
(number of people successfully treated, number of “A”
alleles)
– Continuous: things you measure
(temperature, blood pressure)
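One possible way to represent these data types in R is sketched below: a factor for a qualitative variable, an integer vector for a count and a numeric vector for a measurement. The variable names and values are made up for illustration.

```r
# Minimal sketch: the three data types as R vectors (all values are made up).
species <- factor(c("mouse", "rat", "mouse", "dog"))  # qualitative
n_A     <- c(0L, 1L, 2L, 1L)                          # discrete: number of "A" alleles
temp    <- c(36.6, 37.1, 38.4, 36.9)                  # continuous: temperature

d <- data.frame(species, n_A, temp)

str(d)                    # shows the type of each column
print(table(d$species))   # counts per category suit qualitative data
print(summary(d$temp))    # numerical summaries suit quantitative data
```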
Descriptive Statistics vs. Statistical Inference
Descriptive Statistics: Summarize Data.
Statistical Inference: Find out about properties of the
population using sample(s), separating random fluctuation
from the underlying parameters as well as possible.
Quantities used to summarize data (-> summary statistics)
are often also used for statistical inference.
Descriptive Statistics
EDA - Exploratory Data Analysis & Summary Statistics
-> Get a feel for your data!
Always plot your data before using formal tools of
analysis.
EDA is the quickest way to see what the data say.
It often reveals interesting features that were not
expected,
and it helps prevent inappropriate analyses and unfounded
conclusions.
Plots also have a central role in checking up on the
assumptions made by formal methods.
Measures of Location (Central Tendency)
Measures of Location (Cont.)
Order statistics
Quantiles
Measures of spread
Numerical EDA with R
arithmetic mean: mean()
median: median()
weighted mean: weighted.mean(x,w)
variance: var()
standard deviation: sd()
minimum: min()
maximum: max()
quantiles: quantile(x, probs=seq(0, 1, 0.25))
range: range(); width of the range: diff(range(x))
interquartile range: IQR()
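The functions listed above can be tried on a small vector. The measurements x and the weights w below are made up for illustration.

```r
# Minimal sketch: numerical EDA on a made-up data vector.
x <- c(2, 4, 4, 5, 7, 9, 11)    # hypothetical measurements
w <- c(1, 1, 1, 1, 1, 1, 2)     # hypothetical weights (last value counts double)

print(mean(x))                  # arithmetic mean
print(median(x))                # median
print(weighted.mean(x, w))      # weighted mean
print(var(x))                   # variance
print(sd(x))                    # standard deviation
print(min(x))                   # minimum
print(max(x))                   # maximum
print(quantile(x, probs = seq(0, 1, 0.25)))   # minimum, quartiles, maximum
print(range(x))                 # minimum and maximum together
print(diff(range(x)))           # width of the range
print(IQR(x))                   # interquartile range
```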
Graphical EDA with R
boxplot
bar chart / pie chart
histogram
density plots
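Each of these plots is available in base R graphics. The sketch below uses simulated data; the variable names bp and group and all values are hypothetical.

```r
# Minimal sketch: graphical EDA with base R on simulated data.
set.seed(3)

bp    <- rnorm(100, mean = 120, sd = 12)                # hypothetical blood pressure values
group <- sample(c("A", "B", "C"), 100, replace = TRUE)  # hypothetical categories

# Boxplot of a quantitative variable, split by a qualitative one
boxplot(bp ~ group, main = "Boxplot", ylab = "blood pressure")

# Bar chart and pie chart of category counts
counts <- table(group)
barplot(counts, main = "Bar chart")
pie(counts, main = "Pie chart")

# Histogram and density plot of a quantitative variable
hist(bp, main = "Histogram", xlab = "blood pressure")
plot(density(bp), main = "Density plot")
```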
We continue with the R practical after the
break!