6.Descriptive Stats

8/2/2019 6.Descriptive Stats


Descriptive Statistics

Purpose of descriptive statistics

Frequency distributions

Measures of central tendency

Measures of dispersion



Gary Geisler Simmons College LIS 403 Spring, 2004

Statistics as a Tool for LIS Research

Importance of statistics in research

Summarize observations to provide answers to research questions and hypotheses

Make general conclusions based on specific study observations

Objectively evaluate reliability of study conclusions





Main purposes of statistics in research

Describe central point in a set of data/observations

Describe how broad, diversified, or variable the data in a set is

Indicate whether specific features of a set of data are related, and how closely they are related

Indicate probability of features of data being influenced by factors other than simply chance





Two main types or branches of statistics

Descriptive statistics

Characterizing or summarizing data set

Presenting data in charts and tables to clarify characteristics

No inference, just describing a particular group of observations

Inferential statistics

Using sample data to make generalizations (inferences) or estimates about a population

Statements made in terms of probability





Descriptive and inferential statistics not mutually exclusive

Overlap in what can be called descriptive and what can be called inferential

Intent is important:

Group of observations intended to describe an event: descriptive

Group of observations collected from a sample and intended to predict what a larger population is like: inferential





Choosing statistical methods

Type of data collected largely determines choice of statistical analysis techniques

Decisions about how and what type of data is collected will determine the specific statistical tests that can be performed to analyze the data

Data collected should determine statistical tests used, not the other way around

But consideration of how you want to analyze data should be done as part of research design to ensure study can produce the type of conclusions you want to make




Descriptive Statistics

Commonly used in LIS research

Cannot test causal relationships

Primary strength is describing and summarizing data:

Describing data in terms of frequency distributions

Describing most typical value in data set - measures of central tendency

Describing variability of data - measures of dispersion




Frequency Distributions

Describing data in terms of frequency distributions

Counts of totals by value or category for each measured variable

Can be presented as absolute totals, cumulative totals, percentages, grouped totals

Often a first step in statistical analysis of data

Usually presented in tables or charts (histogram, bar graph, etc.)

[Figure: bar chart of books checked out (vertical axis, 0 to 80) by age group (0-10, 11-20, 21-40, 41-60, 61+)]
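A frequency distribution of this kind can be tabulated in a few lines. The sketch below is in Python and uses made-up data: the `observations` list and the `frequency_table` helper are hypothetical and are not taken from the chart above. It reports absolute totals, cumulative totals, and percentages for each age group.

```python
from collections import Counter

# Hypothetical observations: the age group of each patron who checked out a book.
observations = ["11-20", "21-40", "21-40", "0-10", "41-60",
                "21-40", "61+", "11-20", "21-40", "41-60"]
groups = ["0-10", "11-20", "21-40", "41-60", "61+"]  # display order


def frequency_table(values, groups):
    """Return rows of (group, absolute count, cumulative count, percentage)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for group in groups:
        n = counts.get(group, 0)
        cumulative += n
        rows.append((group, n, cumulative, 100 * n / total))
    return rows


table = frequency_table(observations, groups)
for group, n, cumulative, percent in table:
    print(f"{group:>6}  {n:3d}  {cumulative:3d}  {percent:5.1f}%")
```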




Measures of Central Tendency

Describing most typical value in data set - measures of central tendency

Mean is often referred to as average, though average can be any of these measures of central tendency:

Mean (arithmetic average)

Median

Mode





Mean

Most popular statistic for summarizing data

Can be used for interval or ratio data

Based on all observations of the data set

Arithmetic average of a set of observations

Example: mean of 5, 10, and 30 is 15, since 45/3 = 15

Mean of a set of numbers can be a number not in set

Example: mean of 1, 2, 3, and 4 is 2.5, since 10/4 = 2.5
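The arithmetic can be checked with a short Python sketch. The helper name `arithmetic_mean` is an illustrative assumption; the standard library's `statistics.mean` does the same job.

```python
from statistics import mean


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_a = arithmetic_mean([5, 10, 30])   # 45 / 3
mean_b = arithmetic_mean([1, 2, 3, 4])  # 10 / 4, a value that is not in the set

print(mean_a, mean_b)
print(mean([1, 2, 3, 4]))  # the library function agrees with the hand-written one
```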





Median

Value that is above the lower one-half and below the upper one-half of the values -- middle value of set of observations when they have been arranged in order

Can be used for ordinal, interval or ratio data

Most central measure of a distribution

Every data set has a median that is unique

Calculated differently for sets with an odd number of observations than for sets with an even number of observations

Example: median of the five observations 1, 3, 15, 16, and 17 = 15

Example: median of the six observations 1, 2, 3, 5, 8, and 9 = 4 (midway between the two middle values, 3 and 5)
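The odd and even cases can be sketched in Python as follows. The function name `middle_value` is an illustrative assumption; `statistics.median` in the standard library applies the same rule.

```python
from statistics import median


def middle_value(values):
    """Median: middle value if the count is odd, mean of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = middle_value([1, 3, 15, 16, 17])    # third of five sorted values
even_case = middle_value([1, 2, 3, 5, 8, 9])   # midway between 3 and 5

print(odd_case, even_case)
print(median([1, 2, 3, 5, 8, 9]))  # library result for comparison
```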





Mode

Can be used for any type of data

Most frequently occurring value among a set of observations

Examples:

Mode of the observations 1, 2, 2, 3, 4, 5 = 2

Set of observations 1, 2, 3, 4, 5 has no mode

Set of observations 1, 2, 3, 3, 4, 5, 5 has no single mode, but can be considered to have two modes, or is bi-modal
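A minimal Python sketch of these three cases follows. The helper `modes` is hypothetical, and it adopts the convention used in the examples: a set in which no value repeats is treated as having no mode. (The standard library's `statistics.multimode` would instead return every value in that case.)

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or an empty list if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 2, 3, 4, 5]))     # single mode
print(modes([1, 2, 3, 4, 5]))        # no mode
print(modes([1, 2, 3, 3, 4, 5, 5]))  # bi-modal
```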





Advantages of mean

Always exists

Is unique

Can always be calculated by a simple formula

Disadvantages of mean

Mean value for a data set is not necessarily one of the values of the data set

Sensitive to extreme scores, either high or low

Easily distorted by extremely large or extremely small values among the set of observations

Example: mean of 1, 2, and 1,000,000 is 333,334.33





Advantages of median

Not affected by extreme scores

Useful way of describing sets of observations that are skewed by including extremely large or small values

Disadvantages of median

Median is not necessarily one of the values of the data set

Defined differently for odd and even numbers of observations
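The contrast between the two measures shows up clearly on the set 1, 2, and 1,000,000. The Python sketch below uses only the standard library; the variable names are illustrative.

```python
from statistics import mean, median

with_outlier = [1, 2, 1_000_000]

outlier_mean = mean(with_outlier)      # pulled far toward the extreme value
outlier_median = median(with_outlier)  # stays at the middle observation

print(round(outlier_mean, 2), outlier_median)
```

One extreme observation moves the mean by several orders of magnitude, while the median is unchanged from what it would be for 1, 2, 3.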





Advantages of mode

Can be used with any scale of measurement

If set of observations has a mode, the mode usefully characterizes the set

For example, a large set of observations noting the result (sum) of rolling two dice will tend to have a mode of 7

Disadvantages of mode

Many sets of observations lack a mode because no observed value occurs more than once

Other sets of observations may have several different most frequent values

Doesn't characterize set beyond most frequently occurring value
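The two-dice example can be checked by counting, assuming two fair six-sided dice and that the observed result is their sum. This Python sketch enumerates all 36 equally likely outcomes rather than simulating rolls.

```python
from collections import Counter
from itertools import product

# Count how many of the 36 (die 1, die 2) combinations produce each sum.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))
most_common_sum, ways = sums.most_common(1)[0]

print(most_common_sum, ways)
```

A sum of 7 can be produced in more ways than any other sum, which is why it is the expected mode over many rolls; a small sample can still have a different mode.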




Calculating mean

Age   Frequency
13    3
14    4
15    6
16    8
17    4
18    3
19    3




Calculating mean

Age   Frequency   Age x Frequency
13    3           13 x 3 = 39
14    4           14 x 4 = 56
15    6           15 x 6 = 90
16    8           16 x 8 = 128
17    4           17 x 4 = 68
18    3           18 x 3 = 54
19    3           19 x 3 = 57

Sum of X = 492

N = 31

Mean = 492/31 = 15.87
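The same calculation can be written as a short Python sketch. The dictionary name `age_frequency` is an illustrative assumption; its contents are the ages and frequencies from the table.

```python
# Age -> number of observations at that age.
age_frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

sum_of_x = sum(age * freq for age, freq in age_frequency.items())  # Sum of X
n = sum(age_frequency.values())                                    # N
mean_age = sum_of_x / n

print(sum_of_x, n, round(mean_age, 2))
```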




Calculating mode

Age   Frequency
13    3
14    4
15    6
16    8   (highest frequency)
17    4
18    3
19    3

Mode = 16




Calculating median

Non-grouped data

Age   Frequency   Positions in sorted order
13    3           1 - 3
14    4           4 - 7
15    6           8 - 13
16    8           14 - 21
17    4           22 - 25
18    3           26 - 28
19    3           29 - 31

N = 31 so midpoint is 16th value

Median = 16
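Locating the midpoint through cumulative frequencies can be sketched in Python as below. The variable names are illustrative, and the sketch assumes an odd number of observations, as in this example, so that a single middle position exists.

```python
from itertools import accumulate

age_frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

ages = sorted(age_frequency)
# Last sorted position occupied by each age: 3, 7, 13, 21, ...
cumulative = list(accumulate(age_frequency[age] for age in ages))
n = cumulative[-1]
midpoint = (n + 1) // 2  # position of the middle value when n is odd

# The median is the first age whose cumulative count reaches the midpoint.
median_age = next(age for age, last in zip(ages, cumulative) if last >= midpoint)

print(n, midpoint, median_age)
```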




Calculating median

Grouped data:

Each value is somewhere within each age range

Values are assumed to be equally distributed within range

Age   Frequency   Positions in sorted order
13    3           1 - 3
14    4           4 - 7
15    6           8 - 13
16    8           14 - 21
17    4           22 - 25
18    3           26 - 28
19    3           29 - 31

N = 31 so midpoint is 16th value

Assumed values within the age 16 range (positions 14 - 21):

Position   Value
14         16.06
15         16.19
16         16.31
17         16.44
18         16.56
19         16.69
20         16.81
21         16.94

Median = 16.31
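The grouped-data median can be sketched in Python under the assumption stated above: an age of 16 means a value somewhere in the range from 16 up to (but not including) 17, and the observations in a range are spread evenly across it, each sitting at the middle of its own equal slice. The function name `grouped_median` and its `width` parameter are hypothetical.

```python
def grouped_median(frequencies, width=1.0):
    """Interpolated median for data grouped into ranges of equal width.

    `frequencies` maps the lower bound of each range to its count.
    """
    n = sum(frequencies.values())
    target = (n + 1) / 2  # position of the median in sorted order
    seen = 0              # observations in the ranges below the current one
    for lower in sorted(frequencies):
        count = frequencies[lower]
        if count and seen + count >= target:
            position_in_range = target - seen
            # Each observation sits at the centre of its 1/count slice of the range.
            return lower + (position_in_range - 0.5) / count * width
        seen += count
    raise ValueError("no observations")


age_frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}
median_age = grouped_median(age_frequency)

print(round(median_age, 2))
```

Here the 16th value is the third of the eight observations in the age 16 range, so it is placed at 16 + 2.5/8.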




Mean = 15.87

Mode = 16

Median = 16.31





Normal distribution

Normal curve, bell-shaped curve, Gaussian distribution

Many types of data are normally distributed in a population

Histogram of data approximates a bell-shaped, symmetrical curve

Concentration of scores in the middle, with fewer and fewer scores as you approach extremes

Example: heights of people in a population are approximately normally distributed





Skewness

Not all sets of data will exhibit properties of a normal distribution

Some data sets are asymmetrical around a central point

Majority of scores are closer to one extreme or the other: skewed distribution

In a skewed distribution, the mean does not equal the median




Measures of Central Tendency

Positively skewed distribution, tail goes to the right - median is less than the mean

Example: Annual income of population

Negatively skewed distribution, tail goes to the left - mean is less than the median
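A small Python sketch illustrates how the tail pulls the mean away from the median. The `incomes` values are made up for illustration (in thousands) and are not data from the slides; `mirrored` simply flips the same values so that the tail points left.

```python
from statistics import mean, median

# Hypothetical annual incomes: a few very high values form a tail to the right.
incomes = [22, 25, 28, 30, 33, 36, 40, 45, 120, 400]

# Mirror image of the same data: the tail now goes to the left.
mirrored = [max(incomes) + min(incomes) - x for x in incomes]

print(median(incomes), mean(incomes))    # median is below the mean
print(median(mirrored), mean(mirrored))  # mean is below the median
```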





Special case of skewness: J-Curve

Extreme skewness

Proposed by Allport to describe conforming behavior in groups of people

Large majority of scores fall at end representing socially acceptable behavior, small minority represent deviation from norm

Example: amount of time drivers who park in No Parking zone stay there

[Figure: J-curve bar chart; horizontal axis categories < 5, 5 to 10, 10 to 15, 15 to 20, 20 to 25, > 25; vertical axis 0 to 100]





Determining when a distribution is skewed too much to be considered normal

General rule of thumb: values beyond 2 standard errors of skewness (ses) are probably significantly skewed

$\text{ses} = \sqrt{6/N}$, or use ses statistic from software (SPSS, for example) output

Example: if sample size = 30 and skewness statistic is .9814:

$\text{ses} = \sqrt{6/30} = \sqrt{.20} = .4472$

$2\,\text{ses} = .4472 \times 2 = .8944$

skewness statistic of .9814 is beyond 2 ses, so is significantly skewed

Other factors (histograms, normal probability plots, type of test to be used) should influence decision, depending on exact circumstances of analysis
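The rule of thumb can be expressed as a short Python sketch. The function names are hypothetical, and the square root of 6/N is the approximation used in this example, not the exact ses that statistical software reports.

```python
import math


def standard_error_of_skewness(n):
    """Approximate ses as the square root of 6 / N."""
    return math.sqrt(6 / n)


def is_significantly_skewed(skewness, n):
    """Rule of thumb: skewness beyond 2 ses (in either direction) is flagged."""
    return abs(skewness) > 2 * standard_error_of_skewness(n)


ses = standard_error_of_skewness(30)
flagged = is_significantly_skewed(0.9814, 30)

print(round(ses, 4), round(2 * ses, 4), flagged)
```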





Kurtosis - amount of peakedness or flatness of the distribution

Mesokurtic - normal

Leptokurtic - peaked, many scores around middle

Platykurtic - flat, many scores dispersed from middle

Non-normal kurtosis determined by similar process to skewness

Non-normal kurtosis only a concern with some statistical tests





Selecting appropriate measure of central tendency

Interactive selection at Selecting Statistics by William M.K. Trochim: http://trochim.human.cornell.edu/selstat/ssstart.htm

Rules below can be bent, depending on situation

Distribution and data type                     Measure
Unimodal, ratio or interval data, skewed       median
Unimodal, ratio or interval data, not skewed   mean
Unimodal, ordinal data                         median
Unimodal, nominal data                         mode
Bi-modal or multi-modal distribution           mode




Measures of Dispersion

Variability is a fundamental characteristic of most data sets, but is not addressed by measures of central tendency

Measures of central tendency are not enough to accurately describe a data set

Also need to be able to describe the variability or dispersion of the data

Dispersion: scatteredness or fluctuation of scores around average score

Several types of measures of dispersion

Range

Standard deviation

Variance





Interquartile range

Simplified version: ignore the top and bottom 25% after sorting

Difference between the remaining largest and smallest numbers is interquartile range

Addresses the problem of outliers

Other methods of calculating interquartile range are slightly more complicated but take into account more data
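The simplified version can be sketched in Python as below. The function name `simplified_iqr` is hypothetical, the input is assumed to be non-empty, and when the number of observations is not divisible by four the number dropped from each end is rounded down, which is one possible reading of "ignore the top and bottom 25%".

```python
def simplified_iqr(values):
    """Drop the lowest and highest quarter of the sorted values, then take the spread."""
    ordered = sorted(values)
    cut = len(ordered) // 4  # observations ignored at each end (rounded down)
    middle = ordered[cut:len(ordered) - cut]
    return middle[-1] - middle[0]


plain = [1, 2, 3, 4, 5, 6, 7, 8]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 800]

print(simplified_iqr(plain), simplified_iqr(with_outlier))
print(max(with_outlier) - min(with_outlier))  # the full range, for comparison
```

The outlier changes the full range drastically but leaves this measure untouched. For the more elaborate methods, Python's `statistics.quantiles` offers interpolated quartiles.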





Standard deviation

Measures the variability or the degree of dispersion of the data set

Square root of the average squared deviations from the mean

Roughly speaking, standard deviation is the average distance between the individual observations and the center of the set of observations





Calculating standard deviation

1. Subtract the sample/population mean from each observation and square the result

2. Add squared distances

3. Divide sum by n - 1 (sample) or N (population) to get the adjusted mean of squared distances

4. Take square root of mean squared distances

SD of sample: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

SD of population: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
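The four steps can be followed literally in a Python sketch. The function name `standard_deviation` and its `sample` flag are illustrative assumptions; the standard library's `statistics.stdev` (sample) and `statistics.pstdev` (population) are used only to cross-check the result.

```python
import math
from statistics import pstdev, stdev


def standard_deviation(values, sample=True):
    """Standard deviation, dividing by n - 1 for a sample or N for a population."""
    n = len(values)
    center = sum(values) / n
    squared = [(x - center) ** 2 for x in values]  # step 1: subtract mean, square
    total = sum(squared)                           # step 2: add squared distances
    divisor = n - 1 if sample else n               # step 3: divide by n - 1 or N
    return math.sqrt(total / divisor)              # step 4: take the square root


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
s = standard_deviation(data, sample=True)
sigma = standard_deviation(data, sample=False)

print(round(s, 2), round(sigma, 2))
print(round(stdev(data), 2), round(pstdev(data), 2))  # library cross-check
```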





Variance

Square of standard deviation

Not used for descriptive statistics, but is important for specific inferential statistics tests

Variance of sample: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

Variance of population: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$





Advantages of range as measure of dispersion

Very simple to calculate

Provides a meaningful characteristic of a set of observations (total spread of the observations)

Disadvantages of range as measure of dispersion

Extreme values distort range

Only measures the total spread; tells us nothing about the pattern of data distribution

Examples:

Data set 1, 2, 3, 4, 5, 6, 7, 8, 9 has a range of 8

Data set 1, 9, 9, 9, 9, 9, 9, 9, 9 also has range of 8, though clearly less scattered



Measures of Dispersion

Advantages of standard deviation as measure of dispersion

Can always be calculated

Meaningful characteristic of a set of observations; takes every observation into account to express the scatteredness of observations

Examples:

Set of observations 1, 2, 3, 4, 5, 6, 7, 8, 9 has a standard deviation s = 2.74

Set of observations 1, 9, 9, 9, 9, 9, 9, 9, 9 has a standard deviation s = 2.67

Range doesn't distinguish difference in scatteredness of sets, but standard deviation does

Disadvantage of standard deviation as measure of dispersion is that it is more complicated to calculate -- though not for computers
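Both measures can be computed for the two example sets in a few lines of Python. The helper `value_range` and the variable names are illustrative; `statistics.stdev` computes the sample standard deviation (dividing by n - 1).

```python
from statistics import stdev

set_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
set_b = [1, 9, 9, 9, 9, 9, 9, 9, 9]


def value_range(values):
    """Total spread: largest observation minus smallest."""
    return max(values) - min(values)


range_a, range_b = value_range(set_a), value_range(set_b)
sd_a, sd_b = stdev(set_a), stdev(set_b)

print(range_a, range_b)                  # identical ranges
print(round(sd_a, 2), round(sd_b, 2))    # different standard deviations
```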
