1 Descriptive Studies

8/8/2019 1 Descriptive Studies


Descriptive Studies


Statistical methods fall into two broad areas:

Descriptive statistics

Inferential statistics


Descriptive statistics

Descriptive statistics merely describe, organize, or summarize data; they refer only to the actual data available.

Examples include the mean blood pressure of a group of patients and the success rate of a surgical procedure.


Inferential statistics

Inferential statistics involve making inferences that go beyond the actual data.

They usually involve inductive reasoning (i.e., generalizing to a population after having observed only a sample).

Examples include the mean blood pressure of all Americans and the expected success rate of a surgical procedure in patients who have not yet undergone the operation.


POPULATIONS, SAMPLES, AND ELEMENTS


A population is the universe about which an investigator wishes to draw conclusions; it need not consist of people but may be a population of measurements.

Strictly speaking, if an investigator wants to draw conclusions about the blood pressure of Americans, the population consists of the blood pressure measurements, not the Americans themselves.


A sample is a subset of the population: the part that is actually being observed or studied.

Because researchers can rarely study whole populations, inferential statistics are almost always needed to draw conclusions about a population when only a sample has actually been studied.

A single observation, such as one person's blood pressure, is an element, denoted by X.

The number of elements in a population is denoted by N, and the number of elements in a sample by n.

A population therefore consists of all the elements from X1 to XN, and a sample consists of n of these N elements.


Most samples used in biomedical research are probability samples: samples in which the researcher can specify the probability of any one element in the population being included.

For example, if someone picks a sample of 4 playing cards at random from a pack of 52 cards, the probability that any one card will be included is 4/52.

Probability samples permit the use of inferential statistics, whereas non-probability samples allow only descriptive statistics to be used.

There are four basic kinds of probability samples:

Simple random samples

Stratified random samples

Cluster samples

Systematic samples


Simple random samples

The simple random sample is the simplest kind of probability sample.

It is drawn in such a way that every element in the population has an equal probability of being included, as in the playing card example above.

A random sample is defined by the method of drawing the sample, not by the outcome.

If four hearts were picked out of the pack of cards, this does not in itself mean that the sample is not random.
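The playing card example can be sketched in Python. This is a minimal illustration, assuming the standard library's `random.sample` as the drawing mechanism; the deck encoding, the variable names, and the seed are illustrative choices, not part of the original example.

```python
import random
from fractions import Fraction

# Build a 52-card pack: 13 ranks in each of 4 suits (illustrative encoding).
suits = ["hearts", "diamonds", "clubs", "spades"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = [(rank, suit) for suit in suits for rank in ranks]

N = len(deck)  # number of elements in the population
n = 4          # number of elements in the sample

# In a simple random sample every card has the same inclusion probability, n / N.
inclusion_probability = Fraction(n, N)

# random.sample draws without replacement and treats every card alike.
rng = random.Random(2019)  # seeded only so that the run is repeatable
hand = rng.sample(deck, n)

print(inclusion_probability)  # 1/13, which is 4/52 in lowest terms
print(hand)
```

Whatever four cards are printed, the sample is random because of how it was drawn; a hand of four hearts would be unrepresentative but still random.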


A sample is representative if it closely resembles the population from which it is drawn.

All types of random samples tend to be representative, but they cannot guarantee representativeness.

Nonrepresentative samples can cause serious problems. (Four hearts are clearly not representative of all the cards in a pack.)


A sample or a result demonstrates bias if it consistently errs in a particular direction.

For example, in drawing a sample of 10 from a population consisting of 500 white people and 500 black people, a sampling method that consistently produces more than 5 white people would be biased.

Biased samples are therefore unrepresentative, and true randomization is proof against bias.


Stratified random samples

In a stratified random sample, the population is first divided into relatively internally homogeneous groups, or strata, from which random samples are then drawn.

This stratification results in greater representativeness.

For example, instead of drawing one sample of 10 people from a total population consisting of 500 white and 500 black people, one random sample of 5 could be taken from each ethnic group (or stratum) separately, thus guaranteeing the racial representativeness of the overall sample.
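The stratified draw described here can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_sample`, the tuple encoding of people, and the seed are hypothetical, and each stratum is sampled with the standard library's `random.sample`.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of `per_stratum` elements from each stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

# Hypothetical population: 500 people in each of two strata.
strata = {
    "white": [("white", i) for i in range(500)],
    "black": [("black", i) for i in range(500)],
}

rng = random.Random(0)  # seeded only so that the run is repeatable
sample = stratified_sample(strata, per_stratum=5, rng=rng)

# Count how many sampled people come from each stratum.
counts = {name: sum(1 for group, _ in sample if group == name) for name in strata}
print(counts)  # {'white': 5, 'black': 5}
```

The 5/5 split is guaranteed by the design rather than by chance, which is the point of stratifying.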


Cluster samples

Cluster samples may be used when it is too expensive or laborious to draw a simple random or stratified random sample.

For example, in a survey of 100 medical students in the United States, an investigator might start by selecting a random set of groups or "clusters" (such as a random set of 10 U.S. medical schools) and then interviewing all the students in those 10 schools.

This method is much more economical and practical than trying to take a random sample of 100 directly from the population of all U.S. medical students.


Systematic samples

These involve selecting elements in a systematic way, such as every fifth patient admitted to a hospital or every third baby born in a given area.

This type of sampling usually provides the equivalent of a simple random sample without actually using randomization.
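Selecting every fifth admission amounts to stepping through an ordered list at a fixed interval. A minimal sketch, assuming the admissions are held in a Python list; the function name and the patient numbers are hypothetical.

```python
def systematic_sample(elements, step, start=0):
    """Return every `step`-th element, beginning at index `start`."""
    return elements[start::step]

admissions = list(range(1, 21))  # hypothetical patient numbers 1 to 20, in order of admission

# start=4 picks the fifth patient first, then every fifth patient after that.
every_fifth = systematic_sample(admissions, step=5, start=4)
print(every_fifth)  # [5, 10, 15, 20]
```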


Sampling problems are common in clinical research.

For example, if a researcher advertises in a newspaper to recruit people suffering from a particular problem (whether it is acne, diabetes, or depression), the people who respond form a self-selected sample, which is probably not representative of the population of all people with this problem.

Similarly, if a dermatologist reports on the results of a new treatment for acne that he has been using with his patients, the sample may not be representative of all people with acne, as it is likely that only people with more severe acne (or with good insurance coverage!) seek treatment from a dermatologist.


In any case, his practice is probably limited to people in a particular geographic, climatic, and possibly ethnic area.

In this case, although his study may be valid as far as his patients are concerned (this is called internal validity), it may not be valid to generalize his findings to people with acne in general (so the study may lack external validity).


PROBABILITY

The probability of an event is denoted by p.

Probabilities are usually expressed as decimal fractions rather than percentages, and must lie between zero (zero probability) and one (absolute certainty).

The probability of an event cannot be negative.

The probability of an event can also be expressed as the ratio of the number of favorable outcomes to the number of possible outcomes, provided all outcomes are equally likely.


For example, if a fair coin were tossed an infinite number of times, heads would appear on 50% of the tosses; therefore, the probability of heads, or p (heads), is 0.50.

If a random sample of 10 people were drawn an infinite number of times from a population of 100 people, each person would be included in the sample 10% of the time; therefore, p (being included in any one sample) is 0.10.

The probability of an event not occurring is equal to one minus the probability that it will occur; this is denoted by q.

In the above example, the probability of any one person not being included in any one sample, q, is therefore (1 - p) = (1 - 0.10) = 0.90.
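The arithmetic for p and q in the sampling example can be checked in a few lines. This sketch uses exact fractions from the standard library to avoid rounding; the variable names are illustrative.

```python
from fractions import Fraction

N = 100  # population size
n = 10   # sample size

p = Fraction(n, N)  # probability that a given person is included in any one sample
q = 1 - p           # probability that the same person is not included

print(float(p), float(q))  # 0.1 0.9
```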


There are three main methods of calculating probability:

The ADDITION rule

The MULTIPLICATION rule

The BINOMIAL DISTRIBUTION


Addition rule

The addition rule of probability states that the probability of any one of several particular events occurring is equal to the sum of their individual probabilities, provided the events are mutually exclusive (i.e., they cannot both happen).

Because the probability of picking a heart from a deck of cards is 0.25, and the probability of picking a diamond is also 0.25, this rule states that the probability of picking a card that is either a diamond or a heart is 0.25 + 0.25 = 0.50. Because no card can be both a heart and a diamond, these events meet the requirement of mutual exclusiveness.
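The heart-or-diamond calculation can be reproduced directly. A minimal sketch, assuming a standard 52-card deck with 13 cards per suit; the variable names are illustrative.

```python
from fractions import Fraction

cards_per_suit = 13
deck_size = 52

p_heart = Fraction(cards_per_suit, deck_size)    # 0.25
p_diamond = Fraction(cards_per_suit, deck_size)  # 0.25

# Addition rule: the events are mutually exclusive, so their probabilities add.
p_heart_or_diamond = p_heart + p_diamond

print(float(p_heart_or_diamond))  # 0.5
```

The sum is valid only because no card is both a heart and a diamond; for overlapping events, simple addition would count the overlap twice.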


Multiplication rule

The multiplication rule of probability states that the probability of two or more statistically independent events all occurring is equal to the product of their individual probabilities.

If the lifetime probability of a person developing cancer is 0.25, and the lifetime probability of developing schizophrenia is 0.01, the lifetime probability that a person might have both cancer and schizophrenia is 0.25 × 0.01 = 0.0025, provided that the two illnesses are independent (in other words, that having one illness neither increases nor decreases the risk of having the other).
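The cancer and schizophrenia example works out as follows. This sketch takes the two lifetime probabilities from the text and assumes, as the text does, that the illnesses are independent; the variable names are illustrative.

```python
from fractions import Fraction

p_cancer = Fraction(25, 100)         # lifetime probability of cancer, 0.25
p_schizophrenia = Fraction(1, 100)   # lifetime probability of schizophrenia, 0.01

# Multiplication rule: valid only if the two events are statistically independent.
p_both = p_cancer * p_schizophrenia

print(float(p_both))  # 0.0025
```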


Binomial Distribution

The probability that a specific combination of mutually exclusive, independent events will occur can be determined by the use of the binomial distribution.

A binomial distribution is one in which there are only two possibilities, such as yes/no, male/female, or healthy/sick.

If an experiment has exactly two outcomes, one of which is generally termed success, the binomial distribution gives the probability of obtaining an exact number of successes in a series of independent trials.


A typical use of the binomial distribution is in genetic counseling.

Inheritance of a disorder such as phenylketonuria follows a binomial distribution: there are two possible events, inheriting the disease and not inheriting the disease, and the possibilities are independent (if a child in a family inherits the disorder, this does not affect the chance of another child inheriting it).


A physician could therefore use the binomial distribution to inform a couple who are carriers of the disease how probable it is that some specific combination of events might occur, such as the probability that, if they are to have two children, neither will inherit the disease.
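The two-children question can be worked through with the binomial formula, P(k) = C(n, k) p^k (1 - p)^(n - k). The sketch assumes a per-child probability of 0.25, which is the usual figure for an autosomal recessive disorder such as phenylketonuria when both parents are carriers; that value and the function name `binomial_pmf` are assumptions added here, not taken from the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_affected = 0.25  # assumed per-child probability (autosomal recessive, both parents carriers)
children = 2

p_neither = binomial_pmf(0, children, p_affected)
p_exactly_one = binomial_pmf(1, children, p_affected)
p_both = binomial_pmf(2, children, p_affected)

print(p_neither)      # 0.5625
print(p_exactly_one)  # 0.375
print(p_both)         # 0.0625
```

Under these assumptions the couple has a little better than even odds that neither of two children inherits the disease, and the three probabilities sum to one.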


Types of Data

The choice of an appropriate statistical technique depends upon the type of data in question.

Data fall on one of four scales of measurement:

Nominal

Ordinal

Interval

Ratio


Nominal scale data

Nominal scale data are divided into qualitative categories or groups such as male/female, urban/rural, or red/green.

There is no implication of order or ratio.

Nominal data that fall into only two groups are called dichotomous data.


Ordinal scale data

Ordinal scale data can be placed in a meaningful order (e.g., the ranking of students).

However, there is no information about the size of the interval; no conclusion can be drawn about whether the difference between the first and second students is the same as that between the second and third.


Interval scale data

Interval scale data are like ordinal data in that they can be placed in a meaningful order.

In addition, they have meaningful intervals between items, which are usually measured quantities (e.g., the Celsius temperature scale).

However, because interval scales do not have an absolute zero, ratios of scores are not meaningful (e.g., 100 °C is not twice as hot as 50 °C).
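One way to see why the Celsius ratio is misleading is to convert both temperatures to kelvins, a scale that does have an absolute zero. The conversion constant 273.15 is standard; the comparison itself is an illustration added here, not part of the original slides.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to kelvins (absolute scale)."""
    return celsius + 273.15

ratio_on_celsius_scale = 100 / 50
ratio_on_kelvin_scale = celsius_to_kelvin(100) / celsius_to_kelvin(50)

print(ratio_on_celsius_scale)           # 2.0
print(round(ratio_on_kelvin_scale, 3))  # 1.155
```

The apparent factor of two is an artifact of where the Celsius zero happens to sit; measured from absolute zero, the ratio is only about 1.155.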


Ratio scale data

A ratio scale has the same properties as an interval scale; however, because there is an absolute zero, meaningful ratios exist.

Most biomedical variables form a ratio scale: weights in pounds, time in seconds or days, blood pressure in mm Hg, and pulse rate in beats per minute are all ratio data.

A pulse rate of zero indicates an absolute lack of pulse. Therefore, it is correct to say that a pulse rate of 120 BPM is twice that of 60 BPM.


Discrete variables

Discrete variables can take only certain values and nothing in between.

For example, the number of patients in a hospital census may be 200 or 201, but it cannot be in between these two; the number of syringes used in a clinic on any given day may increase or decrease only by units of one.


Continuous variables

Continuous variables may take any value (typically between certain limits).

Most biomedical variables are continuous (e.g., a patient's weight, height, age, and blood pressure).

However, the process of measuring or reporting continuous variables reduces them to discrete variables.

Blood pressure may be reported to the nearest whole millimeter of mercury, weight to the nearest pound, and age to the nearest year.



FREQUENCY DISTRIBUTIONS


A set of unorganized data is difficult to digest and understand.

Consider a study of the serum cholesterol levels of a sample of 200 men: a list of the 200 levels would be of little value in itself.

A simple first way of organizing the data is to list all the possible values between the highest and the lowest in order, recording the frequency (f) with which each score occurs.

This forms a frequency distribution.

If the highest serum cholesterol level were 260 mg/dl and the lowest were 161 mg/dl, the frequency distribution would list each value from 161 to 260 mg/dl alongside its frequency.


Grouped frequency distributions

Data can be made more manageable by creating a grouped frequency distribution.

Individual scores are grouped (between 5 and 20 groups are usually appropriate).

Each group of scores encompasses an equal class interval.

In this example there are 10 groups with a class interval of 10 (161 to 170, 171 to 180, and so on).


Interval   Frequency (f)   Relative f (% rel f)   Cumulative f (% cum f)
251-260          5                  2.5                    100.0
241-250         13                  6.5                     97.5
231-240         19                  9.5                     91.0
221-230         18                  9.0                     81.5
211-220         38                 19.0                     72.5
201-210         72                 36.0                     53.5
191-200         14                  7.0                     17.5
181-190         12                  6.0                     10.5
171-180          5                  2.5                      4.5
161-170          4                  2.0                      2.0


Relative frequency distributions

A grouped frequency distribution can be transformed into a relative frequency distribution, which shows the percentage of all the elements that fall within each class interval.

The relative frequency of elements in any given class interval is found by dividing f, the frequency (or number of elements) in that class interval, by n (the sample size, which in this case is 200).

Multiplying the result by 100 converts it into a percentage.

Thus, this distribution shows, for example, that 19% of this sample had serum cholesterol levels between 211 and 220 mg/dl.
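The relative frequency calculation (f divided by n, times 100) can be reproduced from the grouped frequencies in the serum cholesterol table. A minimal sketch; the dictionary layout and variable names are illustrative.

```python
# Grouped frequencies (f) from the serum cholesterol table, lowest interval first.
frequencies = {
    "161-170": 4, "171-180": 5, "181-190": 12, "191-200": 14, "201-210": 72,
    "211-220": 38, "221-230": 18, "231-240": 19, "241-250": 13, "251-260": 5,
}

n = sum(frequencies.values())  # sample size

# Relative frequency of each class interval, as a percentage of n.
relative = {interval: 100 * f / n for interval, f in frequencies.items()}

print(n)                    # 200
print(relative["211-220"])  # 19.0
```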




Cumulative frequency distributions

A cumulative frequency distribution is also expressed as a percentage; it shows the percentage of elements lying within and below each class interval.

Although a group may be called the 211-220 group, it actually includes the range of scores from 210.5 up to and including 220.5, so these figures are the exact lower and upper limits of the group.

The relative frequency column shows that 2% of the distribution lies in the 161-170 group and 2.5% lies in the 171-180 group; therefore, a total of 4.5% of the distribution lies at or below a score of 180.5, as shown by the cumulative frequency column.


A further 6% of the distribution lies in the 181-190 group; therefore, a total of (2 + 2.5 + 6) = 10.5% lies at or below a score of 190.5.

A man with a serum cholesterol level of 190 mg/dl can be told that roughly 10% of this sample had lower levels than his, and approximately 90% had scores above his.

The cumulative frequency of the highest group (251-260) must be 100, showing that 100% of the distribution lies at or below a score of 260.5.
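The running totals behind the cumulative frequency column can be computed with `itertools.accumulate`. This sketch reuses the grouped frequencies from the serum cholesterol table; the list layout and variable names are illustrative.

```python
from itertools import accumulate

# Class intervals and their frequencies, lowest interval first.
intervals = ["161-170", "171-180", "181-190", "191-200", "201-210",
             "211-220", "221-230", "231-240", "241-250", "251-260"]
frequencies = [4, 5, 12, 14, 72, 38, 18, 19, 13, 5]

n = sum(frequencies)

# Running total of observations lying within or below each interval.
cumulative_counts = list(accumulate(frequencies))

# The same running totals expressed as a percentage of the sample.
cumulative_percent = [100 * count / n for count in cumulative_counts]

for interval, percent in zip(intervals, cumulative_percent):
    print(interval, percent)
```

The third value is 10.5 (the share at or below 190.5) and the last is 100.0, matching the cumulative column of the table.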




Presentation of Statistical Data


Statistical data, once collected, must be arranged purposively in order to bring out the important points clearly and strikingly.

Therefore, the manner in which statistical data are presented is of utmost importance.

There are several methods of presenting data: tables, charts, diagrams, graphs, pictures, and special curves.


Methods of presenting data

Tables

Diagrams

Bar charts

Histogram

Frequency polygon

Pie charts

Pictogram


Bar charts

To display nominal scale data, a bar graph is typically used. For example, if a group of 100 men had a mean serum cholesterol value of 212 mg/dl, and a group of 100 women had a mean value of 185 mg/dl, the means of these two groups could be presented as a bar graph.

Bar graphs are identical to frequency histograms, except that each rectangle on the graph is clearly separated from the others by a space, showing that the data form separate categories (such as male and female) rather than continuous groups.



Bar chart



Histogram


Frequency polygon

For ratio or interval scale data, a frequency distribution may be drawn as a frequency polygon, in which the midpoints of the class intervals are joined by straight lines.
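The points that a frequency polygon joins can be derived from the grouped serum cholesterol data: each point pairs the midpoint of a class interval with its frequency. This sketch stops short of drawing the polygon, and the variable names are illustrative.

```python
# Class intervals (lower, upper) of width 10, from 161-170 up to 251-260.
intervals = [(lower, lower + 9) for lower in range(161, 252, 10)]
frequencies = [4, 5, 12, 14, 72, 38, 18, 19, 13, 5]

# Each polygon vertex is (midpoint of the class interval, frequency).
points = [((lower + upper) / 2, f) for (lower, upper), f in zip(intervals, frequencies)]

print(points[0])   # (165.5, 4)
print(points[-1])  # (255.5, 5)
```

Passing the x and y values of `points` to any plotting tool and joining them with straight lines gives the polygon.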


A cumulative frequency distribution can also be presented graphically as a polygon.

Cumulative frequency polygons typically form a characteristic S-shaped curve known as an ogive.


Pie chart

Instead of comparing the lengths of bars, the areas of segments of a circle are compared.

The area of each segment depends upon its angle.

It is often necessary to indicate the percentages in the segments, as it may not be easy to compare the areas of segments.



Pie chart


Pictogram

Pictograms are a popular method of presenting data to the layman.

Small pictures or symbols are used to present the data.

For example, a picture of a doctor can be used to represent the population per physician.

Fractions of the picture can be used to represent numbers smaller than the value of a whole symbol.


Centiles and other quantiles

The cumulative frequency polygon and the cumulative frequency distribution both illustrate the concept of centile (or percentile) rank, which states the percentage of observations that fall below any particular score.

In the case of a grouped frequency distribution, centile ranks state the percentage of observations that fall within or below any given class interval.

Centile ranks provide a way of giving information about one individual score in relation to all the other scores in a distribution.




For example, the cumulative frequency column of the table above shows that 91% of the observations fall below 240.5 mg/dl, which therefore represents the 91st centile (which can be written as C91).

A man with a serum cholesterol level of 240 mg/dl lies at the 91st centile; about 9% of the scores in the sample are higher than his.
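Reading a centile rank off a grouped distribution amounts to finding the class interval that contains the score and returning its cumulative percentage. A minimal sketch, assuming the exact upper limits (170.5, 180.5, and so on) and the cumulative percentages from the serum cholesterol table; the function name `centile_rank` is hypothetical.

```python
# Exact upper limit of each class interval, lowest first: 170.5, 180.5, ..., 260.5.
upper_limits = [170.5 + 10 * i for i in range(10)]
cumulative_percent = [2.0, 4.5, 10.5, 17.5, 53.5, 72.5, 81.5, 91.0, 97.5, 100.0]
LOWEST_LIMIT = 160.5  # exact lower limit of the 161-170 group

def centile_rank(score):
    """Return the cumulative percentage of the class interval that contains `score`."""
    if not LOWEST_LIMIT < score <= upper_limits[-1]:
        raise ValueError("score lies outside the tabulated range")
    for limit, percent in zip(upper_limits, cumulative_percent):
        if score <= limit:
            return percent

print(centile_rank(240))  # 91.0
print(centile_rank(190))  # 10.5
```

Because the data are grouped, every score within one class interval receives the same rank; a finer answer would need the ungrouped values.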


Centile ranks are widely used in reporting scores on educational tests.

They are one member of a family of values called quantiles, which divide distributions into a number of equal parts.

Centiles divide a distribution into 100 equal parts.

Other quantiles include quartiles, which divide a distribution into 4 equal parts, and deciles, which divide a distribution into 10 equal parts.
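For ungrouped data, Python's standard library can compute the cut points directly with `statistics.quantiles` (available from Python 3.8). The cholesterol values below are hypothetical, made up only to show that dividing a distribution into k equal parts takes k - 1 cut points.

```python
import statistics

# Hypothetical serum cholesterol levels (mg/dl), for illustration only.
levels = [165, 178, 186, 192, 198, 203, 205, 207, 209, 212,
          215, 218, 224, 229, 236, 244, 258]

quartiles = statistics.quantiles(levels, n=4)    # 3 cut points -> 4 equal parts
deciles = statistics.quantiles(levels, n=10)     # 9 cut points -> 10 equal parts
centiles = statistics.quantiles(levels, n=100)   # 99 cut points -> 100 equal parts

print(len(quartiles), len(deciles), len(centiles))  # 3 9 99
```

With so few observations, the outer centiles are extrapolated and carry little meaning; centile ranks are most useful for large samples.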

