
Carolin Kosiol

Institute of Population Genetics

Vetmeduni Vienna


Spezielle Statistik in der Biomedizin (Special Statistics in Biomedicine)

Winter semester (WS) 2014/15

Basic Statistical Concepts


Aims of the course

An intuitive understanding of the fundamental concepts in probability and statistics

Computational approaches using R

Practical understanding of linear models and related concepts

The statistical models and frameworks that allow us to identify the specific genetic differences responsible for measurable differences between organisms

You will be able to analyze a large data set for this particular problem, e.g., a Genome-Wide Association Study (GWAS)

Dates

26.11.2014 9:00-12:00 13:00-16:00

27.11.2014 9:00-12:00 13:00-16:00

28.11.2014 9:00-13:00 (homework assignment)

10.12.2014 9:00-12:00 13:00-15:00

11.12.2014 9:00-12:00 13:00-16:00

18.12.2014 13:00-14:00 (exam date)

Grading

Participation during lectures 5%

Homework assignment 35%

Written exam 60%

Website Resource

i122server.vuwien.ac.at/pop/Kosiol_website/SpezStatistik2014/

login: student

Password: statistics

The information will be updated throughout the course.

We will post slides and code for the computer labs

We will post all homework assignments, exams, solutions, etc.

Book recommendations

Statistics: An Introduction using R

- Michael Crawley

- http://www3.imperial.ac.uk/naturalsciences/research/statisticsusingr

- Wiley and Sons

- ISBN-10: 0470022981

- ISBN-13: 978-0470022986

- 1 copy @ library

• ground floor, shelf mark 52079; BL-05.00

Book recommendations (cont.)

Applied Statistical Genetics with R

- Andrea S. Foulkes

- Springer 2009

- ISBN: 978-0-387-89554-3

(quite technical)

Genome-Wide Disease Association Analysis: A Primer

- Bruce Rannala

- Preview chapter: http://rannala.org/books/CUPChap2.pdf

(similar choice of topics, free)




Aims of Statistics

• Development and application of methods to

– collect

– summarize

– analyze

– interpret

data.

Problem Structure

• Data result from some underlying parameters + random fluctuation

• Statistical methods allow us to learn about the underlying parameters

• Statistical model: specifies how the data are influenced by parameters and randomness
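The idea of "parameters + random fluctuation" can be sketched in R. This is a minimal illustration, not part of the original slides: the parameter value, the noise level and the normal distribution are all assumptions chosen for the example.

```r
# Hypothetical example: data generated as
# "underlying parameter + random fluctuation".
set.seed(1)                 # make the random draws reproducible
true_mean <- 170            # assumed underlying parameter (e.g. mean height in cm)
noise_sd  <- 8              # assumed size of the random fluctuation
n         <- 200            # number of observations

# Statistical model: y_i = true_mean + e_i, with e_i ~ Normal(0, noise_sd^2)
y <- true_mean + rnorm(n, mean = 0, sd = noise_sd)

# Statistical method: estimate the unknown parameter from the data
estimate  <- mean(y)
std_error <- sd(y) / sqrt(n)   # uncertainty of the estimate due to randomness

cat("estimate:", round(estimate, 2), " standard error:", round(std_error, 2), "\n")
```

In a real analysis only `y` is observed. `true_mean` is known here only because the data are simulated, which makes it possible to check how close the estimate comes.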

Situations where Statistics helps …

• Agriculture: which variety and fertilizer produce the highest yield?

• Medicine: which therapy is the most effective for a certain disease?

• Nutrition: does green tea reduce the risk of certain types of cancer?

• Statistical Genetics: find mutations that confer resistance or susceptibility to a disease.

Statistical Genetics

We know that aspects of an organism (measurable attributes and states such as disease) are influenced by the genome (the entire DNA sequence) of an individual

This means that differences in genomes (genotypes) can produce differences in a phenotype:

• Genotype - any quantifiable genomic difference among individuals, e.g. Single Nucleotide Polymorphisms (SNPs).

• Phenotype - any measurable aspect of an organism (that is not the genotype!).

Examples?

An Illustration

Example: People are different...

We know that the environment plays a role in these differences... and for many of them, differences in the genome play a role as well

For any two people, there are millions of differences in their DNA, a subset of which are responsible for producing differences in a given measurable aspect.

An Illustration (cont.)

The problem: for any two people, there can be millions of differences between their genomes...

How do we figure out which genomic differences are involved in producing phenotypic differences and which ones are not?

This course is concerned with how we do this.

Note that the problem (and methodology) applies to any measurable difference, for any type of organism!

Why do we want to know this?

We can target genomic differences responsible for genetic diseases with gene therapy

We can manipulate the genomes of agricultural crops to produce disease-resistant strains

We can explain why a disease has a particular frequency in a population and why we see a particular set of differences

These differences provide a foundation for understanding how pathways, developmental processes, and physiological processes work

The list goes on...

Statistical Genomics

Traditionally, determining the impact of genome differences on phenotypes was the province of the fields of “Genetics”

Given this dependence on genomes, it is no surprise that modern genetic fields now incorporate genomics: the study of an organism’s entire genome (Wikipedia definition)

However, one can study genetics without genomics (i.e. without direct information concerning DNA), and the merging of genetics and genomics is quite recent

History of Statistical Genetics

In sum: during the last decade, the greater availability of DNA

sequence data has completely changed our ability to make

connections between genome differences and phenotypes

Quantitative Genomics

In this course, we will use statistical modeling to say

something about biology, specifically the relationships

between genotype (DNA) and phenotype

Quantitative genomics is a field concerned with the

modeling of the relationship between genomes and

phenotypes and using these models to discover and

predict

We will use frameworks from the fields of probability and

statistics for this purpose

Advances in sequencing

Sources of random fluctuation

• Random sampling

• Measurement error

• Sequencing errors

• Genetic drift

• Other sources (depending on application)
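As one illustration of such a source, genetic drift can be simulated as repeated binomial resampling of allele copies (a Wright-Fisher-style sketch). The population size, starting frequency and number of generations below are assumed values, not figures from the lecture.

```r
# Sketch of genetic drift: Wright-Fisher-style binomial resampling of alleles.
# Population size, starting frequency and number of generations are assumed.
set.seed(42)
n_individuals <- 100               # diploid individuals -> 2 * n allele copies
n_generations <- 50
freq <- numeric(n_generations + 1)
freq[1] <- 0.5                     # starting frequency of allele "A"

for (g in seq_len(n_generations)) {
  # Each of the 2N allele copies in the next generation is drawn at random
  copies_A <- rbinom(1, size = 2 * n_individuals, prob = freq[g])
  freq[g + 1] <- copies_A / (2 * n_individuals)
}

round(freq[c(1, 11, 26, 51)], 3)   # frequency at generations 0, 10, 25, 50
```

No selection acts in this sketch, so every change in the allele frequency is random fluctuation. Rerunning with a different seed gives a different trajectory.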

Genomics & Statistics

A non-technical definition of probability: a mathematical

framework for modelling under uncertainty

Such a system is particularly useful for modelling

systems where we don’t know and/or cannot measure

critical information for explaining the patterns we observe

This is exactly the case we have in quantitative

genomics when connecting differences in a genome to

differences in phenotypes

Genomics & Statistics (cont.)

We are interested in using a probability model to identify

relationships between genomes and phenotypes using

DNA sequences and phenotype measurements.

For this purpose, we will use the framework of statistics,

which we can (non-technically) define as a system for

interpreting data for the purposes of prediction and

decision making given uncertainty.

Basic Statistical Concepts

Population and Sample

Random Samples

Controlled experiment versus observational studies

Types of Data

Descriptive Statistics vs. Statistical Inference

Population and Sample

Population: the entire set of individuals or units from which observations are taken

The population can be real or hypothetical; the choice of population depends on the aim of the investigation

Sample: the subset of the population that is actually investigated

Different samples can lead to different conclusions: statistical methods take the uncertainty due to sampling into account
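A short R sketch of how different samples from the same population give different answers. The "population" is simulated with assumed values (10,000 blood pressure readings with mean 120 and standard deviation 15), purely for illustration.

```r
set.seed(7)
# Hypothetical population: 10,000 blood pressure values (simulated, not real data)
population <- rnorm(10000, mean = 120, sd = 15)

# Draw 5 independent samples of size 30 and compute the mean of each
sample_means <- replicate(5, mean(sample(population, size = 30)))

round(mean(population), 1)   # population mean (normally unknown)
round(sample_means, 1)       # each sample gives a slightly different answer
```

The spread of `sample_means` around the population mean is the sampling uncertainty that statistical methods have to account for.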

Random Samples

Simple random sampling: each member of the population has the same chance of being included in the sample.

Other types of random sampling: the chance of being included is well defined but may differ between members.

Random sampling guards against selection bias, so that the sample represents the population.

Ideal, but often difficult to achieve.
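Both kinds of random sampling can be sketched with R's `sample()` function. The sampling frame of 1,000 IDs and the split into a "common" and a "rare" stratum are assumptions made up for this example.

```r
set.seed(123)
# Hypothetical sampling frame: IDs of 1,000 population members
population_ids <- 1:1000

# Simple random sample: every member has the same inclusion chance (50/1000)
srs <- sample(population_ids, size = 50, replace = FALSE)

# Stratified random sample: inclusion chance differs by stratum but is known
# (25/800 in the "common" stratum, 25/200 in the "rare" stratum).
# The two strata (first 800 vs. last 200 IDs) are assumed for illustration.
stratum <- ifelse(population_ids <= 800, "common", "rare")
strat <- c(sample(population_ids[stratum == "common"], 25),
           sample(population_ids[stratum == "rare"],   25))

table(stratum[strat])   # 25 from each stratum by design
```

Stratified sampling is only one example of a design with well-defined but unequal inclusion chances.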

Observational Study versus Experiment

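Controlled experiment: the investigator chooses the values of the explanatory variables of interest.

Observational study: both explanatory variables and responses are merely observed.

This has important consequences when investigating cause-effect relationships: in an observational study, an association may be caused by a confounding variable rather than by the explanatory variable itself.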
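The consequence for cause-effect conclusions can be illustrated with a small simulation. Everything here is an assumption for the sake of the example: age acts as a hidden confounder, the treatment has no true effect, and older people are more likely to take the treatment in the observational setting.

```r
set.seed(2014)
n   <- 5000
age <- rnorm(n, mean = 50, sd = 10)          # hidden confounder (assumed)

# Response depends on age only; the treatment has NO true effect here
response <- function(age) 100 + 0.8 * age + rnorm(length(age), sd = 5)

# Observational study: older people are more likely to take the treatment
treated_obs <- rbinom(n, 1, prob = plogis((age - 50) / 5))
y_obs <- response(age)
diff_obs <- mean(y_obs[treated_obs == 1]) - mean(y_obs[treated_obs == 0])

# Controlled experiment: the investigator assigns treatment at random
treated_exp <- rbinom(n, 1, prob = 0.5)
y_exp <- response(age)
diff_exp <- mean(y_exp[treated_exp == 1]) - mean(y_exp[treated_exp == 0])

round(c(observational = diff_obs, experiment = diff_exp), 2)
```

In the observational comparison the treated group differs clearly from the untreated group, although the treatment does nothing. With random assignment the difference is close to zero.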

Response vs. Explanatory variables

Example 1


Example 2


Example 3


Example 4:

Types of Data

• Qualitative variables: gender, species, hair color, …

• Quantitative variables

– Discrete: things you count (number of people successfully treated, number of “A” alleles)

– Continuous: things you measure (temperature, blood pressure)
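In R these types of data are typically stored as factors, integers and numeric values. A minimal sketch with made-up records (the column names and values are hypothetical):

```r
# Hypothetical records for five individuals (made-up values)
patients <- data.frame(
  species     = factor(c("dog", "cat", "dog", "horse", "cat")),  # qualitative
  a_alleles   = c(0L, 1L, 2L, 1L, 0L),                           # discrete: counts of "A" alleles
  temperature = c(38.6, 38.9, 39.1, 37.8, 38.4)                  # continuous: measured in deg C
)

str(patients)                 # shows factor / int / num columns
table(patients$species)       # counts per category for a qualitative variable
table(patients$a_alleles)     # counts per value for a discrete variable
mean(patients$temperature)    # averaging is meaningful for a continuous variable
```

The type of a variable determines which summaries and plots make sense for it.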

Descriptive Statistics vs. Statistical Inference

Descriptive Statistics: summarize data.

Statistical Inference: find out about properties of the population using sample(s); separate random fluctuation from the underlying parameters as well as possible.

Quantities used to summarize data (-> summary statistics) are often also used for statistical inference.

Descriptive Statistics

EDA - Exploratory Data Analysis & Summary Statistics

-> Get a feel for your data!

Always plot your data before using formal tools of

analysis.

EDA is the quickest way to see what the data say. It often reveals interesting features that were not expected and helps prevent inappropriate analyses and unfounded conclusions.

Plots also have a central role in checking up on the

assumptions made by formal methods.

Measures of Location (Central Tendency)

Measures of Location (Cont.)

Order statistics

Quantiles

Measures of spread

Numerical EDA with R

arithmetic mean: mean()

median: median()

weighted mean: weighted.mean(x,w)

variance: var()

standard deviation: sd()

minimum: min()

maximum: max()

quantiles: quantile(x, probs=seq(0, 1, 0.25))

range: range(); diff(range())

interquartile range: IQR()
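A small worked example of the functions listed above. The vector `x` and the weights `w` are made-up values chosen so that the results are easy to check by hand.

```r
x <- c(2, 4, 4, 5, 7, 9, 11)     # made-up data
w <- c(1, 1, 1, 1, 1, 1, 2)      # made-up weights: last value counts double

mean(x)                 # 6
median(x)               # 5
weighted.mean(x, w)     # 6.625
var(x)                  # 10   (sample variance, divides by n - 1)
sd(x)                   # 3.162278
min(x)                  # 2
max(x)                  # 11
quantile(x, probs = seq(0, 1, 0.25))   # 0%: 2, 25%: 4, 50%: 5, 75%: 8, 100%: 11
range(x)                # 2 11
diff(range(x))          # 9
IQR(x)                  # 4   (75% quantile minus 25% quantile)
```

`quantile()` uses R's default interpolation method here (type 7). Other definitions of sample quantiles can give slightly different values.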

Graphical EDA with R

boxplot

bar chart / pie chart

histogram

density plots
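The plot types above map onto base R functions as sketched below. The data are simulated for illustration only, and the variable names are hypothetical.

```r
set.seed(10)
# Simulated data for illustration only (not from the lecture)
blood_pressure <- rnorm(100, mean = 120, sd = 15)                     # continuous
genotype <- factor(sample(c("AA", "AG", "GG"), 100, replace = TRUE))  # qualitative

op <- par(mfrow = c(2, 2))                   # 2 x 2 grid of plots
boxplot(blood_pressure ~ genotype,
        xlab = "genotype", ylab = "blood pressure", main = "boxplot")
genotype_counts <- table(genotype)
barplot(genotype_counts, main = "bar chart") # pie(genotype_counts) gives a pie chart
h <- hist(blood_pressure, main = "histogram")
plot(density(blood_pressure), main = "density plot")
par(op)                                      # restore previous plot settings
```

Box plots, histograms and density plots suit quantitative variables. Bar and pie charts suit counts of a qualitative variable.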

We continue with the R practical after the break!