Ana Jerončić, PhD Department for Research in Biomedicine and Health.

Post on 26-Dec-2015


Data analysis: 1. Describing data

Ana Jerončić, PhD

Department for Research in Biomedicine and Health

Contact

E-mail: ana.jeroncic@mefst.hr

Location: main building, 5th floor, room 512

Phone: 557-862

Contents of the 2nd week

1. Describing data - Central tendency and variability

2. Estimation - Accuracy, precision, standard error, confidence intervals

3. Hypothesis testing - Test statistics, P-value, choice of a statistical test

4. Interpretation of data - Causality and association, odds ratio, risk, correlation, linear regression

5. Sources of error - Type 1 and type 2 errors, power, bias, confounding

Learning outcome

Critical appraisal of scientific papers

NOT! Implementation of data analysis

Why?

To identify the best available treatment

To prevent “medical zombies”

To perform your own research

Data analysis: Describing data

1. How the data should be organized prior to data analysis

2. Data types

3. Graphical & tabular techniques for description, summary statistics

   Qualitative Data

   Quantitative Data

Height measurements among 1st year medical students

157 204 184 186 197 155 169
150 193 205 150 161 169 147
167 159 187 173 146 179 201
159 147 144 204 184 192 165
146 169 198 164 182 165 173
147 166 167 180 169 174 201
146 151 203 171 186 179 152
189 204 189 200 202 147 181
145 161 173 155 203 190 164
141 163 179 195 155 197 151
197 141 146 202 149 197 203
172 143 151 200 197 192 192
160 173 187 172 177 179 188

What is the unit of measurement? How many observations per subject?

Variables and their values, Units of observation

Entity     Height (cm)   Weight (kg)   Age (years)   Sex (category)
Person 1   176           70            33            Male
Person 2   171           60            38            Female
Person 3   182           75            62            Male
*          *             *             *             *
*          *             *             *             *

Rows are OBSERVATIONS; columns are VARIABLES.
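One way to hold data organized like this is sketched below in Python: each observation (row) becomes one record and each variable (column) becomes one field. The `Subject` class and its field names are illustrative assumptions, not part of the lecture; the values are the three persons from the table.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    """One unit of observation: a single row of the data table."""
    entity: str
    height_cm: int
    weight_kg: int
    age_years: int
    sex: str


# One record per observation; every record carries the same variables.
subjects = [
    Subject("Person 1", 176, 70, 33, "Male"),
    Subject("Person 2", 171, 60, 38, "Female"),
    Subject("Person 3", 182, 75, 62, "Male"),
]

# A variable is one column: the same field read across all observations.
heights = [s.height_cm for s in subjects]
print(heights)
```

Keeping one row per subject makes it easy to check that each subject contributes exactly one observation per variable.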

Measurement/Observation

Variable               Features of variables                      Example                Descriptive statistics   Informativeness level
Categorical, Nominal   Unordered/unarranged categories            Gender, urbanization   Number, proportion       Low
Ordinal                Ordered/arranged categories                Grades, scales         Median                   Medium
Numerical              Arranged categories with equal intervals   Height, weight         Mean or median           High
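The descriptive statistics suited to each data type can be sketched with Python's standard library. The `sex` and `grades` values are made up for illustration; the heights are those of the three persons listed earlier.

```python
from collections import Counter
from statistics import mean, median

sex = ["Male", "Female", "Male", "Male"]   # nominal (hypothetical values)
grades = [2, 5, 3, 4, 3]                   # ordinal (hypothetical values)
heights_cm = [176, 171, 182]               # numerical

# Nominal: only counts and proportions are meaningful.
sex_counts = Counter(sex)
sex_proportions = {k: v / len(sex) for k, v in sex_counts.items()}

# Ordinal: values can be ranked, so the median is meaningful.
grade_median = median(grades)

# Numerical: equal intervals make the mean meaningful as well.
height_mean = mean(heights_cm)
height_median = median(heights_cm)

print(sex_counts, sex_proportions)
print(grade_median, height_mean, height_median)
```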

Types of data (variables)

Categorical (qualitative): nominal, ordinal

Numerical (quantitative)

Types of data

Type of data?

Height, Grades, Age in years, Weight, Insulin concentration, Blood glucose

Type of data?

How many cigarettes do you smoke a day?

1-5 6-10 11-15 16-20 21 and more

Type of data?

Have you ever had a heart attack? Yes No

Do you suffer from hypertension? Yes No ?

Type of data?

Gender: Male Female

Type of data?

Marital status: married divorced widowed single lives alone ?

Type of data?

Education: elementary school high school two-year college four-year college ?

Type of data?

Likert scale 

Claim: Violence among the youth is becoming an increasing problem in Croatia.

1 = I agree completely
2 = I agree
3 = Undecided
4 = I disagree
5 = I strongly disagree

Type of data?

Visual analogue scale

E.g. the level of pain the examinee experiences

[Scale: a line running from “I don’t feel pain” to “I feel intolerable pain”]

Variables – Transformation of variables

Numerical: distance is meaningful

Ordinal: attributes can be ordered

Nominal: attributes are only named; weakest
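A numerical variable can be recoded into ordered categories, but the exact values are lost on the way. The sketch below recodes a daily cigarette count into the answer categories of the smoking question shown earlier; the function name and the sample counts are assumptions made for illustration.

```python
def cigarettes_category(per_day: int) -> str:
    """Recode a daily cigarette count (numerical) into an ordered category."""
    if per_day < 1:
        raise ValueError("the questionnaire categories start at 1 cigarette")
    if per_day <= 5:
        return "1-5"
    if per_day <= 10:
        return "6-10"
    if per_day <= 15:
        return "11-15"
    if per_day <= 20:
        return "16-20"
    return "21 and more"


counts = [3, 12, 20, 40]  # hypothetical numerical answers
ordinal = [cigarettes_category(c) for c in counts]
print(ordinal)
```

The transformation only works in this direction: from the category "11-15" the original count can no longer be recovered.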

Watch out for… literature inconsistency in data type classification

Person No.   Height [cm]
Person 1     148
Person 2     142
Person 3     154
Person 4     153
Person 5     160
Person 6     177
Person 7     204
Person 8     192
Person 9     191
Person 10    203
Person 11    197
Person 12    202
Person 13    177

Observe and Describe

Organized data are the input for graphical & tabular data representations


QUALITATIVE DATA

Tabular Techniques for Qualitative Variable(s) – YPEL5 example – Contingency Table

In one study, researchers investigated the genotype of the YPEL5 gene in a population sample from Split. They got the following results for 10 examinees:

Individual   YPEL5 Genotype
1            AA
2            BB
3            BB
4            BB
5            AB
6            AB
7            BB
8            AA
9            AB
10           BB

Genotype   Frequency   Relative Frequency (proportion)   Relative Frequency [%] (percentage)
AA         2           0.2                               20%
AB         3           0.3                               30%
BB         5           0.5                               50%
Total      10          1.00                              100%

Table: Frequency Distribution of YPEL5 genotypes

Graphical Techniques for Qualitative Variable(s) – YPEL5 example – Bar Chart

Bar Charts are often used to display frequencies…

[Bar chart: categories’ names (YPEL5 genotype: BB, AA, AB) on the x-axis; frequency, as counts or percentages, on the y-axis (0 to 5)]

Is there an association between the medicine taken and the length of cold?

[Table not recovered; the surviving percentages are (84%), (16%), (100%) and (19%), (81%), (100%)]

Graphical & Tabular Techniques

The only allowable calculation is counting the frequency of each category.

We can summarize the data in a table that lists the categories and their counts; this is called a frequency distribution.

A relative frequency distribution lists the categories and the proportion with which each occurs.
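As a minimal sketch, the frequency and relative frequency distributions for the YPEL5 example can be computed by counting. The variable names are illustrative; the genotypes are those of the 10 examinees.

```python
from collections import Counter

# Genotypes of the 10 examinees from the YPEL5 example.
genotypes = ["AA", "BB", "BB", "BB", "AB", "AB", "BB", "AA", "AB", "BB"]

# Frequency distribution: category -> count.
frequency = Counter(genotypes)
total = sum(frequency.values())

# Relative frequency distribution: category -> proportion.
relative_frequency = {g: n / total for g, n in frequency.items()}

for g in sorted(frequency):
    print(f"{g}  {frequency[g]}  {relative_frequency[g]:.1f}  {relative_frequency[g]:.0%}")
```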

Graphical Techniques for Qualitative Variable(s) – Bar Chart – Pareto chart

Nominal data have no order. However, it is sometimes useful to arrange the outcomes from the most frequently occurring to the least frequently occurring. This bar chart representation is called a “Pareto chart”.

[Pareto chart: categories’ names on the x-axis, counts on the y-axis]

A chart with relative frequencies is more informative.

[Pareto chart: categories’ names on the x-axis, percentages on the y-axis]
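The ordering used for such a chart amounts to sorting the categories by count, largest first. A small sketch using the YPEL5 genotype counts; the text bars are only a stand-in for a real chart.

```python
# Genotype counts from the YPEL5 frequency table.
frequency = {"AA": 2, "AB": 3, "BB": 5}

# Sort categories by count, largest first; this is the bar order of the chart.
pareto_order = sorted(frequency.items(), key=lambda item: item[1], reverse=True)

total = sum(frequency.values())
for genotype, count in pareto_order:
    bar = "#" * count  # crude text bar, one mark per examinee
    print(f"{genotype:<3}{bar:<6}{count / total:.0%}")
```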

Graphical Techniques for Qualitative Variable(s) – YPEL5 example – Pie Chart

Pie Charts show relative frequencies…

[Pie chart: BB 50%, AA 20%, AB 30%]

Watch out for… Authors can use percentages to hide the true size of the data. To say that 50% of a sample has a certain condition when there are only four people in the sample is clearly not providing the same level of information as 50% of a sample based on 400 people.

So, percentages should be used as an additional help for the reader rather than replacing the actual data.
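One simple habit that follows from this is to report the count and the sample size next to every percentage. A minimal sketch; the helper name `describe` is an assumption.

```python
def describe(cases: int, sample_size: int) -> str:
    """Format a proportion as 'cases/sample_size (percent)' so the sample size stays visible."""
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    return f"{cases}/{sample_size} ({cases / sample_size:.0%})"


print(describe(2, 4))      # 2/4 (50%)
print(describe(200, 400))  # 200/400 (50%)
```

Both lines show 50%, but only the full form tells the reader how much data stands behind it.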

The chart that changed medicine

QUANTITATIVE DATA

Graphical Technique for Quantitative Data

Height measurements among 1st year medical students

Individual   Height (cm)
1            186
2            144
3            175
4            199
5            149
6            157
7            150
8            176
9            179
10           165
11           151
12           164
13           167
14           175
15           191
16           163
17           187
18           176
19           184
20           191
21           172
22           151
23           179

Frequency distribution for quantitative data:

Building a Histogram


Category limits [cm]   Freq.   Relative Freq.   Percent Relative Freq.
>140; <=150            3       0.13             13%
150-160                3       0.13             13%
160-170                4       0.17             17%
170-180                7       0.30             30%
180-190                5       0.22             22%
190-200                1       0.04             4%
Total                  23      1.00             100%

[Histogram: Frequency distribution of height; x-axis: Height [cm], class midpoints 145 to 195; y-axis: Percent Relative Frequency, 0% to 35%]
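The counting step behind a histogram can be sketched as follows, assuming every class is closed on the right, as the first row of the table (>140; <=150) suggests. The function name and parameters are illustrative. Note that for the 23 heights as listed, this count gives 3 and 3 for the two highest classes, whereas the table shows 5 and 1; the transcript does not say which is correct.

```python
import math

# Heights (cm) of the 23 students listed above.
heights = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
           167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]


def class_counts(values, start, width, n_classes):
    """Count values per class; each class is right-closed: (lower, upper]."""
    counts = [0] * n_classes
    for v in values:
        if not start < v <= start + width * n_classes:
            raise ValueError(f"{v} falls outside the class limits")
        # ceil(...) - 1 puts a value equal to an upper limit into the lower class.
        index = math.ceil((v - start) / width) - 1
        counts[index] += 1
    return counts


counts = class_counts(heights, start=140, width=10, n_classes=6)
relative = [c / len(heights) for c in counts]

for i, (c, r) in enumerate(zip(counts, relative)):
    lower = 140 + 10 * i
    print(f">{lower}; <={lower + 10}  {c}  {r:.2f}  {r:.0%}")
```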

Graphical Techniques for Quantitative Data

There are several graphical methods that are used when the data are quantitative (numeric).

The most important of these graphical methods is the histogram.

The histogram is not only a powerful graphical technique used to summarize interval data, but it is also used to help explain probabilities.

Histogram simulations http://www.shodor.org/interactivate/activities/Histogram/

Summary: Graphs/Tables for Describing Data

Qualitative
   Frequency Distribution – tabular summary of data
   Bar Chart
   Pie Chart

Quantitative
   Frequency Distribution – tabular summary of data
   Histogram
   Line Chart (Time-Series Plot)
   Stem and Leaf Display

Relationship between two variables


To compare two variables we use:

   Scatter plot/diagram (quantitative)

   Cross table (qualitative)

Scatter plot – for two quantitative variables

Scatter plot, showing the strong association between enzyme activity at pH 5.5 and the 5α-reductase 2-specific mRNA expression, as expressed on the basis of β-actin (n = 30; rs = 0.81; 95% confidence interval, 0.64–0.91; P < 0.0001).
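The rs in this caption is Spearman's rank correlation coefficient. As a rough sketch, it can be computed as the Pearson correlation of the ranks of the two variables. The data below are invented for illustration and are not the study's measurements; the confidence interval and P-value are not covered.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rs: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measurements: y rises with x, but not linearly.
activity = [1.0, 2.0, 3.0, 4.0, 5.0]
expression = [2.1, 3.9, 9.0, 16.5, 24.0]
rs = spearman(activity, expression)
print(rs)
```

Because only the ranks enter the calculation, any strictly increasing relationship gives rs = 1, whether or not it is linear.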

Patterns of Scatter Diagrams…

Linearity and Direction are two concepts we are interested in

[Three scatter diagrams: Positive Linear Relationship; Negative Linear Relationship; Weak or Non-Linear Relationship]

Scatter plot

Analysis of expression level from microarray data

Squamous cell carcinoma tumor and perilesional tissue display distinctly different scatter plots from normal tissue. Expression levels for gene subset 1 in patient 1

Cross Table - for two qualitative variables

Used to compare two qualitative variables

If the first variable has r categories and the second variable has c categories, then we have an r×c cross table.
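Building an r×c cross table comes down to counting each combination of categories. In the sketch below the individual records are hypothetical; they were chosen so that the counts match the YPEL5 and Disease X example (r = 3 genotypes, c = 2 disease states).

```python
from collections import Counter

# Hypothetical (genotype, disease status) pairs, one per examinee.
records = [
    ("AA", "YES"), ("AA", "YES"),
    ("AB", "YES"), ("AB", "NO"), ("AB", "NO"), ("AB", "NO"),
    ("BB", "NO"), ("BB", "NO"), ("BB", "NO"), ("BB", "NO"),
]

rows = ["AA", "AB", "BB"]  # r = 3 categories of the first variable
cols = ["YES", "NO"]       # c = 2 categories of the second variable

cell = Counter(records)    # count of each (row, column) combination

table = [[cell[(r, c)] for c in cols] for r in rows]
row_totals = [sum(line) for line in table]
col_totals = [sum(col) for col in zip(*table)]

for name, line, total in zip(rows, table, row_totals):
    print(name, *line, total)
print("TOTAL", *col_totals, sum(row_totals))
```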

Association of two qualitative variables

YPEL5 Genotype   Disease X: YES   NO   TOTAL
AA               2                0    2
AB               1                3    4
BB               0                4    4
TOTAL            3                7    10

[Bar chart: frequency (0 to 4) of Disease and Healthy subjects for each genotype: AA, AB, BB]

Based on data presented do you think that YPEL5 could be associated with disease X?

Questions?

Room 512 (5th floor)

E-mail: ajeronci@mefst.hr

The projected clinical cost for breast cancer detection program in 2011-12, broken down by service category.

The results of measuring the height among med. students

Individual   Height (cm)
1            186
2            144
3            175
4            199
5            149
6            157
7            150
8            176
9            179
10           165
11           151
12           164
13           167
14           175
15           191
16           163
17           187
18           176
19           184
20           191
21           172
22           151
23           179

Histogram – common mistake!

[Bar chart: one bar per subject (1 to 23) on the x-axis, Height [cm] (0 to 250) on the y-axis]

Same data – with reshuffled subjects

[Bar chart: the same bars with the subjects in a different order]
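Why one bar per subject is not a histogram can be sketched in a few lines: the per-subject chart changes whenever the subjects are reordered, while the frequency distribution behind a true histogram stays the same. The 10 cm right-closed classes and the helper name are assumptions; reversing the list stands in for any reshuffling.

```python
from collections import Counter

# Heights (cm) of the 23 students, in the order they were recorded.
heights = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
           167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]


def class_frequencies(values, width=10):
    """Map each class lower limit to its count; classes are (lower, lower + width].

    Assumes whole-number heights, so (v - 1) // width finds the class.
    """
    return Counter(((v - 1) // width) * width for v in values)


# Any reordering of the subjects would do; reversing is the simplest one.
reordered = list(reversed(heights))

# The bar-per-subject chart depends on the order of the subjects ...
print(reordered != heights)
# ... while the frequency distribution, and so the histogram, does not.
print(class_frequencies(reordered) == class_frequencies(heights))
```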