Percentiles and Deciles

52
Deciles and Percentiles Deciles: If data is ordered and divided into 10 parts, then cut points are called Deciles Percentiles: If data is ordered and divided into 100 parts, then cut points are called Percentiles. 25th percentile is the Q1, 50th percentile is the Median (Q2) and the 75th percentile of the data is Q3. Suppose PC= ((n+1)/100)p, where n=number of observations and p is the desired percentile. If PC is an integer than pth percentile of a data set is the (PC)th observation of the ordered set of that data. Otherwise let PI be the integer part of PC and f be the fractional part of PC. Then pth percentile= OI + (OII -OI)x`f where OI is the (PI)th observation of the ordered set of data and OII is the (PI +1)th observation of the ordered set of data. For example, Consider the following ordered set of data: 3, 5, 7, 8, 9, 11, 13, 15. PC= (9/100)p For 25 th percentile, PC=2.25 (not an integer), then 25 th percentile = 5 + (7-5)x.25= 5.5

Transcript of Percentiles and Deciles

Page 1: Percentiles and Deciles

Deciles and Percentiles Deciles If data is ordered and divided into 10 parts then cut

points are called Deciles Percentiles If data is ordered and divided into 100 parts then cut

points are called Percentiles 25th percentile is the Q1 50th percentile is the Median (Q2) and the 75th percentile of the data is Q3

Suppose PC= ((n+1)100)p where n=number of observations and p is the desired percentile If PC is an integer than pth percentile of a data set is the (PC)th observation of the ordered set of that data Otherwise let PI be the integer part of PC and f be the fractional part of PC Then pth percentile= OI + (OII -OI)x`f where OI is the (PI)th observation of the ordered set of data and OII is the (PI +1)th observation of the ordered set of data For example Consider the following ordered set of data 3 5 7 8 9 11 13 15 PC= (9100)p For 25 th percentile PC=225 (not an integer) then 25th percentile = 5 + (7-5)x25= 55

Coefficient of Variation

Coefficient of Variation The standard deviation of data divided by itrsquos mean It is usually expressed in percent Coefficient of Variation= 100times

Five Number Summary Five Number Summary The five number summary of a

distribution consists of the smallest (Minimum) observation the first quartile (Q1) the median(Q2) the third quartile and the largest (Maximum) observation written in order from smallest to largest

Box Plot A box plot is a graph of the five number summary The central box spans the quartiles A line within the box marks the median Lines extending above and below the box mark the smallest and the largest observations (ie the range) Outlying samples may be additionally plotted outside the range

Boxplot

Distribution of Age in Month

0

20

40

60

80

100

120

140

160

1

q1

min

median

max

q3

Side by Side Boxplot

60

80

100

120

140

Side by Side boxplots of ages of three treatment groups

Trt 3Trt 2Trt 1

Choosing a Summary The five number summary is usually better than the mean

and standard deviation for describing a skewed distribution or a distribution with extreme outliers The mean and standard deviation are reasonable for symmetric distributions that are free of outliers

In real life we canrsquot always expect symmetry of the data Itrsquos a common practice to include number of observations (n) mean median standard deviation and range as common for data summarization purpose We can include other summary statistics like Q1 Q3 Coefficient of variation if it is considered to be important for describing data

Shape of Data

Shape of data is measured by Skewness Kurtosis

Skewness Measures of asymmetry of data

Positive or right skewed Longer right tail Negative or left skewed Longer left tail

23

1

2

1

3

21

)(

)(Skewness

Then nsobservatio be Let

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 2: Percentiles and Deciles

Coefficient of Variation

Coefficient of Variation The standard deviation of data divided by itrsquos mean It is usually expressed in percent Coefficient of Variation= 100times

Five Number Summary Five Number Summary The five number summary of a

distribution consists of the smallest (Minimum) observation the first quartile (Q1) the median(Q2) the third quartile and the largest (Maximum) observation written in order from smallest to largest

Box Plot A box plot is a graph of the five number summary The central box spans the quartiles A line within the box marks the median Lines extending above and below the box mark the smallest and the largest observations (ie the range) Outlying samples may be additionally plotted outside the range

Boxplot

Distribution of Age in Month

0

20

40

60

80

100

120

140

160

1

q1

min

median

max

q3

Side by Side Boxplot

60

80

100

120

140

Side by Side boxplots of ages of three treatment groups

Trt 3Trt 2Trt 1

Choosing a Summary The five number summary is usually better than the mean

and standard deviation for describing a skewed distribution or a distribution with extreme outliers The mean and standard deviation are reasonable for symmetric distributions that are free of outliers

In real life we canrsquot always expect symmetry of the data Itrsquos a common practice to include number of observations (n) mean median standard deviation and range as common for data summarization purpose We can include other summary statistics like Q1 Q3 Coefficient of variation if it is considered to be important for describing data

Shape of Data

Shape of data is measured by Skewness Kurtosis

Skewness Measures of asymmetry of data

Positive or right skewed Longer right tail Negative or left skewed Longer left tail

23

1

2

1

3

21

)(

)(Skewness

Then nsobservatio be Let

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 3: Percentiles and Deciles

Five Number Summary Five Number Summary The five number summary of a

distribution consists of the smallest (Minimum) observation the first quartile (Q1) the median(Q2) the third quartile and the largest (Maximum) observation written in order from smallest to largest

Box Plot A box plot is a graph of the five number summary The central box spans the quartiles A line within the box marks the median Lines extending above and below the box mark the smallest and the largest observations (ie the range) Outlying samples may be additionally plotted outside the range

Boxplot

Distribution of Age in Month

0

20

40

60

80

100

120

140

160

1

q1

min

median

max

q3

Side by Side Boxplot

60

80

100

120

140

Side by Side boxplots of ages of three treatment groups

Trt 3Trt 2Trt 1

Choosing a Summary The five number summary is usually better than the mean

and standard deviation for describing a skewed distribution or a distribution with extreme outliers The mean and standard deviation are reasonable for symmetric distributions that are free of outliers

In real life we canrsquot always expect symmetry of the data Itrsquos a common practice to include number of observations (n) mean median standard deviation and range as common for data summarization purpose We can include other summary statistics like Q1 Q3 Coefficient of variation if it is considered to be important for describing data

Shape of Data

Shape of data is measured by Skewness Kurtosis

Skewness Measures of asymmetry of data

Positive or right skewed Longer right tail Negative or left skewed Longer left tail

23

1

2

1

3

21

)(

)(Skewness

Then nsobservatio be Let

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 4: Percentiles and Deciles

Boxplot

Distribution of Age in Month

0

20

40

60

80

100

120

140

160

1

q1

min

median

max

q3

Side by Side Boxplot

60

80

100

120

140

Side by Side boxplots of ages of three treatment groups

Trt 3Trt 2Trt 1

Choosing a Summary The five number summary is usually better than the mean

and standard deviation for describing a skewed distribution or a distribution with extreme outliers The mean and standard deviation are reasonable for symmetric distributions that are free of outliers

In real life we canrsquot always expect symmetry of the data Itrsquos a common practice to include number of observations (n) mean median standard deviation and range as common for data summarization purpose We can include other summary statistics like Q1 Q3 Coefficient of variation if it is considered to be important for describing data

Shape of Data

Shape of data is measured by Skewness Kurtosis

Skewness Measures of asymmetry of data

Positive or right skewed Longer right tail Negative or left skewed Longer left tail

23

1

2

1

3

21

)(

)(Skewness

Then nsobservatio be Let

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 5: Percentiles and Deciles

Side by Side Boxplot

60

80

100

120

140

Side by Side boxplots of ages of three treatment groups

Trt 3Trt 2Trt 1

Choosing a Summary The five number summary is usually better than the mean

and standard deviation for describing a skewed distribution or a distribution with extreme outliers The mean and standard deviation are reasonable for symmetric distributions that are free of outliers

In real life we canrsquot always expect symmetry of the data Itrsquos a common practice to include number of observations (n) mean median standard deviation and range as common for data summarization purpose We can include other summary statistics like Q1 Q3 Coefficient of variation if it is considered to be important for describing data

Shape of Data

Shape of data is measured by Skewness Kurtosis

Skewness Measures of asymmetry of data

Positive or right skewed Longer right tail Negative or left skewed Longer left tail

23

1

2

1

3

21

)(

)(Skewness

Then nsobservatio be Let

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 6: Percentiles and Deciles

Choosing a Summary The five number summary is usually better than the mean

and standard deviation for describing a skewed distribution or a distribution with extreme outliers The mean and standard deviation are reasonable for symmetric distributions that are free of outliers

In real life we canrsquot always expect symmetry of the data Itrsquos a common practice to include number of observations (n) mean median standard deviation and range as common for data summarization purpose We can include other summary statistics like Q1 Q3 Coefficient of variation if it is considered to be important for describing data

Shape of Data

Shape of data is measured by Skewness Kurtosis

Skewness Measures of asymmetry of data

Positive or right skewed Longer right tail Negative or left skewed Longer left tail

23

1

2

1

3

21

)(

)(Skewness

Then nsobservatio be Let

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 7: Percentiles and Deciles

Shape of Data

Shape of data is measured by Skewness Kurtosis

Skewness Measures of asymmetry of data

Positive or right skewed Longer right tail Negative or left skewed Longer left tail

23

1

2

1

3

21

)(

)(Skewness

Then nsobservatio be Let

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 8: Percentiles and Deciles

Skewness Measures of asymmetry of data

Positive or right skewed Longer right tail Negative or left skewed Longer left tail

23

1

2

1

3

21

)(

)(Skewness

Then nsobservatio be Let

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 9: Percentiles and Deciles

Kurtosis Formula

3

)(

)(Kurtosis

Then nsobservatio be Let

2

1

2

1

4

21

minus

⎟⎠

⎞⎜⎝

⎛minus

minus=

sum

sum

=

=

n

ii

n

ii

n

xx

xxn

nxxx

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 10: Percentiles and Deciles

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution A standard normal distribution (blue line micro = 0 σ = 1) has kurtosis = 0 A distribution like that illustrated with the red curve has kurtosis gt 0 with a lower peak relative to its tails

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 11: Percentiles and Deciles

Summary of the Variable lsquoAgersquo in the given data set

Mean 9041666667

Standard Error 3902649518

Median 84

Mode 84

Standard Deviation 3022979318

Sample Variance 9138403955

Kurtosis -1183899591

Skewness 0389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Histogram of Age

Age in Month

Number of Subjects

40 60 80 100 120 140 160

0

2

4

6

8

10

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 12: Percentiles and Deciles

Summary of the Variable lsquoAgersquo in the given data set

60

80

100

120

140

Boxplot of Age in Month

Age(month)

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 13: Percentiles and Deciles

Brief concept of Statistical Softwares There are many softwares to perform statistical

analysis and visualization of data Some of them are SAS (System for Statistical Analysis) S-plus R Matlab Minitab BMDP Stata SPSS StatXact Statistica LISREL JMP GLIM HIL MS Excel etc We will discuss MS Excel and SPSS in brief

Some useful websites for more information of statistical softwares-

httpwwwgalaxygmuedupapersastr1htmlhttpourworldcompuservecomhomepages

Rainer_WuerlaenderstatsofthtmarchivhttpwwwR-projectorg

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 14: Percentiles and Deciles

Microsoft Excel A Spreadsheet Application It features calculation graphing

tools pivot tables and a macro programming language called VBA (Visual Basic for Applications)

There are many versions of MS-Excel Excel XP Excel 2003 Excel 2007 are capable of performing a number of statistical analyses

Starting MS Excel Double click on the Microsoft Excel icon on the desktop or Click on Start --gt Programs --gt Microsoft Excel

Worksheet Consists of a multiple grid of cells with numbered rows down the page and alphabetically-tilted columns across the page Each cell is referenced by its coordinates For example A3 is used to refer to the cell in column A and row 3 B10B20 is used to refer to the range of cells in column B and rows 10 through 20

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 15: Percentiles and Deciles

Microsoft Excel

Creating Formulas 1 Click the cell that you want to enter the formula 2 Type = (an equal sign) 3 Click the Function Button 4 Select the formula you want and step through the on-screen instructions

xf

Opening a document File Open (From a existing workbook) Change the directory area or drive to look for file in other locations

Creating a new workbook FileNewBlank Document

Saving a File FileSave

Selecting more than one cell Click on a cell eg A1) then hold the Shift key and click on another (eg D4) to select cells between and A1 and D4 or Click on a cell and drag the mouse across the desired range

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 16: Percentiles and Deciles

Microsoft Excel Entering Date and Time Dates are stored as

MMDDYYYY No need to enter in that format For example Excel will recognize jan 9 or jan-9 as 192007 and jan 9 1999 as 191999 To enter todayrsquos date press Ctrl and together Use a or p to indicate am or pm For example 830 p is interpreted as 830 pm To enter current time press Ctrl and together

Copy and Paste all cells in a Sheet Ctrl+A for selecting Ctrl +C for copying and Ctrl+V for Pasting

Sorting Data Sort Sort By hellip Descriptive Statistics and other Statistical

methods ToolsData Analysis Statistical method If Data Analysis is not available then click on Tools Add-Ins and then select Analysis ToolPack and Analysis toolPack-Vba

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 17: Percentiles and Deciles

Microsoft Excel

Statistical and Mathematical Function Start with lsquo=lsquo sign and then select function from function wizard xf

Inserting a Chart Click on Chart Wizard (or InsertChart) select chart give Input data range Update the Chart options and Select output range Worksheet

Importing Data in Excel File open FileType Click on File Choose Option ( DelimitedFixed Width) Choose Options (Tab Semicolon Comma Space Other) Finish

Limitations Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extremecases

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 18: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS)

A general purpose statistical package SPSS is widely used in the social sciences particularly in sociology and psychology

SPSS can import data from almost any type of file to generate tabulated reports plots of distributions and trends descriptive statistics and complex statistical analyzes

Starting SPSS Double Click on SPSS on desktop or ProgramSPSS

Opening a SPSS file FileOpen

Data EditorVarious pull-down menus appear at the top of the Data Editor window These pull-down menus are at the heart of using SPSSWIN The Data Editor menu items (with some of the uses of the menu) are

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 19: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS)

FILE used to open and save data files

EDIT used to copy and paste data values used to find data in a file insert variables and cases OPTIONS allows the user to set general preferences as well as the setup for the Navigator Charts etc

VIEW user can change toolbars value labels can be seen in cells instead of data values

DATA select sort or weight cases merge files

MENUS AND TOOLBARS

TRANSFORM Compute new variables recode variables etc

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 20: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS) MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts etc

UTILITIES add comments to accompany data file (and other advanced features)

ADD-ons these are features not currently installed (advanced statistical procedures)

WINDOW switch between data syntax and navigator windows

HELP to access SPSSWIN Help information

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 21: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS)

Navigator (Output) Menus

When statistical procedures are run or charts are created the output will appear in the Navigator window The Navigator window contains many of the pull-down menus found in the Data Editor window Some of the important menus in the Navigator window include

INSERT used to insert page breaks titles charts etc

FORMAT for changing the alignment of a particular portion of the output

MENUS AND TOOLBARS

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 22: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS)

bull Formatting Toolbar

When a table has been created by a statistical procedure the user can edit the table to create a desired look or adddelete information Beginning with version 140 the user has a choice of editing the table in the Output or opening it in a separate Pivot Table (DEFINE) window Various pulldown menus are activated when the user double clicks on the table These include

EDIT undo and redo a pivot select a table or table body (eg to change the font)

INSERT used to insert titles captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 23: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS)

bull Additional menusCHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window

bull Show or hide a toolbar

Click on VIEW TOOLBARS rArr rArr 1048635to show it to hide it

bull Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to its new location

bull Customize a toolbar

Click on VIEW TOOLBARS CUSTOMIZErArr rArr

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 24: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheetData from an Excel spreadsheet can be imported into SPSSWIN as follows1 In SPSSWIN click on FILE OPEN DATA The OPEN DATA FILE Dialog rArr rArrBox will appear2 Locate the file of interest Use the Look In pull-down list to identify the folder containing the Excel file of interest3 From the FILE TYPE pull down menu select EXCEL (xls)

4 Click on the file name of interest and click on OPEN or simply double-click on the file name

5 Keep the box checked that reads Read variable names from the first row of data This presumes that the first row of the Excel data file contains variable names in the first row [If the data resided in a different worksheet in the Excel file this would need to be entered]

6 Click on OK The Excel data file will now appear in the SPSSWIN Data Editor

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 25: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS)

Importing data from an EXCEL spreadsheet

7 The former EXCEL spreadsheet can now be saved as an SPSS file (FILE rArrSAVE AS) and is ready to be used in analyses Typically you would label variable and values and define missing values

Importing an Access tableSPSSWIN does not offer a direct import for Access tables Therefore we must follow these steps1 Open the Access file2 Open the data table3 Save the data as an Excel file4 Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN

Importing Text Files into SPSSWINText data points typically are separated (or ldquodelimitedrdquo) by tabs or commas Sometimes they can be of fixed format

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 26: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS) Importing tab-delimited data

In SPSSWIN click on FILE rArr OPEN rArr DATA Look in the appropriate location for the text file Then select ldquoTextrdquo from ldquoFiles of typerdquo Click on the file name and then click on ldquoOpenrdquo You will see the Text Import Wizard ndash step 1 of 6 dialog boxYou will now have an SPSS data file containing the former tab-delimited data You simply need to add variable and value labels and define missing values

Exporting Data to Excel click on FILE rArr SAVE AS Click on the File Name for the file to be

exported For the ldquoSave as Typerdquo select from the pull-down menu Excel (xls) You will notice the checkbox for ldquowrite variable names to spreadsheetrdquo Leave this checked as you will want the variable names to be in the first row of each column in the Excel spreadsheet Finally click on Save

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 27: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS) Running the FREQUENCIES procedure

1 Open the data file (from the menus click on FILE rArr OPEN rArr DATA) of interest2 From the menus click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr FREQUENCIES3 The FREQUENCIES Dialog Box will appear In the left-hand box will be a listing (source variable list) of all the variables that have been defined in the data file The first step is identifying the variable(s) for which you want to run a frequency analysis Click on a variable name(s) Then click the [ gt ] pushbutton The variable name(s) will now appear in the VARIABLE[S] box (selected variable list) Repeat these steps for each variable of interest

4 If all that is being requested is a frequency table showing count percentages (raw adjusted and cumulative) then click on OK

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 28: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS) Requesting STATISTICS

Descriptive and summary STATISTICS can be requested for numeric variables To request Statistics

1 From the FREQUENCIES Dialog Box click on the STATISTICS pushbutton

2 This will bring up the FREQUENCIES STATISTICS Dialog Box

3 The STATISTICS Dialog Box offers the user a variety of choices DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE rArr DESCRIPTIVE STATISTICS rArr DESCRIPTIVES) The procedure offers many of the same statistics as the FREQUENCIES procedure but without generating frequency analysis tables

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 29: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS) Requesting CHARTS

One can request a chart (graph) to be created for a variable or variables included in a FREQUENCIES procedure

1 In the FREQUENCIES Dialog box click on CHARTS2 The FREQUENCIES CHARTS Dialog box will appear Choose the intended chart (eg Bar diagram Pie chart histogram

Pasting charts into Word1 Click on the chart2 Click on the pulldown menu EDIT rArr COPY OBJECTS3 Go to the Word document in which the chart is to be embedded Click on EDIT rArr PASTE SPECIAL4 Select Formatted Text (RTF) and then click on OK5 Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible click once on the graph)

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 30: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS) BASIC STATISTICAL PROCEDURES CROSSTABS

1 From the ANALYZE pull-down menu click on DESCRIPTIVE STATISTICS rArr CROSSTABS

2 The CROSSTABS Dialog Box will then open

3 From the variable selection box on the left click on a variable you wish to designate as the Row variable The values (codes) for the Row variable make up the rows of the crosstabs table Click on the arrow (gt) button for Row(s) Next click on a different variable you wish to designate as the Column variable The values (codes) for the Column variable make up the columns of the crosstabstable Click on the arrow (gt) button for Column(s)

4 You can specify more than one variable in the Row(s) andor Column(s) A cross table will be generated for each combination of Row and Column variables

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 31: Percentiles and Deciles

Statistics Packagefor the Social Science (SPSS) Limitations SPSS users have less control over

data manipulation and statistical output than other statistical packages such as SAS Stata etc

SPSS is a good first statistical package to perform quantitative research in social science because it is easy to use and because it can be a good starting point to learn more advanced statistical packages

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 32: Percentiles and Deciles

Normal Distribution

A density curve describes the overall pattern of a distribution The total area under the curve is always 1

A distribution is normal if its density curve is symmetric single-symmetric single-peakedpeaked and bell-shaped

Mean Median and mode are same for a normal distribution

A normal distribution can be described if we know their mean and standard deviation The probability density function of a normal variable with mean micro and standard deviation σ can be expressed as

Normality and independence of the data are two very important assumptions for most statistical methods

ltproppropltminus=minusminus

xexfx

2

1)(

22

)(2

σ

πσ

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 33: Percentiles and Deciles

Normal Distribution

-10 -5 0 5 10 15

00

01

02

03

04

05

A Normal Density Curve

x

f(x)

micro

σ2σ

Total area under the curve is 1

If we know micro and σ we know every thing about the normal distribution

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 34: Percentiles and Deciles

Normal DistributionThe 68-95-997 Rule

In the normal distribution with mean micro and standard deviation σ

68 of the observations fall within σ of the mean micro

95 of the observations fall within 2σ of the mean micro

997 of the observations fall within 3σ of the mean micro

-20 -10 0 10 20

000

002

004

006

008

Normal Density Plot

x

f(x)

3σ 2σσσ

2σ3σ

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 35: Percentiles and Deciles

Normal Density Plot

-2 -1 0 1 2

01

02

03

04

Normal density function

x

f(x)

A sample of 100 observations from a normal distribution with mean 0 and standard deviation 1

68

95

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 36: Percentiles and Deciles

Normal DistributionStandardizing and z-ScoresStandardizing and z-Scores

If x is an observation from a distribution that has mean micro and standard deviation σ the standardized value of x is

A standardized value is often called a z-score If x is normal distribution with mean micro and standard deviation σ then z is a standard normal variable with mean 0 and standard deviation 1

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 37: Percentiles and Deciles

Normal DistributionLet x1 x2 hellip xn be n random variables each with mean micro and standard deviation σ then sum of all of them sumxi be also a normal with mean nmicro and standard deviation σradicn The distribution of mean is also a normal with mean micro and standard deviation σradicn

The standardized score of the mean is

The mean of this standardized random variable is 0 and standard deviation is 1

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 38: Percentiles and Deciles

Assessing the normality of databull Most statistical methods assume that data are from a normal

population So itrsquos important to test the normality of the data

bull Normal quantile plots If the points on a normal quantile plot lie close to diagonal line the

plot indicates that the data are normal Otherwise it indicates departure from normality Points far away from the overall pattern indicates outliers Minor wiggles can be overlooked We will see normal quantile plots in next two slides

bull Shapiro-Wilk W statistics Kolmogorov-Smirnov (K-S) tests etc are being used for testing normality of the data

bull To perform a K-S Test for Normality in SPSS Analyzegt Nonparametric Tests gt 1 Sample K-S Choose OK after selecting variable (s)

bull To perform Shapiro-Wilk test of normality in SAS use procedure lsquoUnivariatersquo

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 39: Percentiles and Deciles

Normal quantile plot

q-q plot 100 sample observations from a normal distribution with mean 0 and standard deviation 1

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 40: Percentiles and Deciles

Normal quantile plot

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 41: Percentiles and Deciles

Population and Sample Population The entire collection of individuals or

measurements about which information is desired eg Average height of 5-year old children in USA

Sample A subset of the population selected for study Primary objective is to create a subset of population whose center spread and shape are as close as that of population There are many methods of sampling Random sampling stratified sampling systematic sampling cluster sampling multistage sampling area sampling qoata sampling etc

Random Sample A simple random sample of size n from a population is a subset of n elements from that population where the subset is chosen in such a way that every possible unit of population has the same chance of being selected

Example Consider a population of 5 numbers (1 2 3 4 5) How many random sample (without replacement) of size 2 can we draw from this population (12) (13) (1 4) (1 5) (2 3) (2 4) (2 5) (34) (35) (45)

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 42: Percentiles and Deciles

Population and Sample Why do we need randomness in sampling

It reduces the possibility of subjective and other biases

Mean and variance of a random sample is an unbiased estimate of the population mean and variance respectively

Population mean of the five numbers in previous slide is 3 Averages of 10 samples of sizes 2 are 15 2 25 3 25 3 35 35 4 45 Mean of this 10 averages (15 +2 + 25 + 3 + 25 + 3+ 35+ 35+ 4+ 45)10 =3 which is the same as the population mean

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 43: Percentiles and Deciles

Parameter and Statistic Parameter Any statistical characteristic of a

population Population mean population median population standard deviation are examples of parameters

Statistic Any statistical characteristic of a sample Sample mean sample median sample standard deviation are some examples of statistics

Statistical Issue Describing population through census or making inference from sample by estimating the value of the parameter using statistic

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 44: Percentiles and Deciles

Census and Inference Census Complete enumeration of population units Statistical Inference We sample the population (in a

manner to ensure that the sample correctly represents the population) and then take measurements on our sample and infer (or generalize) back to the populationExample We may want to know the average height of all adults (over 18 years old) in the US Our population is then all adults over 18 years of age If we were to census we would measure every adult and then compute the average By using statistics we can take a random sample of adults over 18 years of age measure their average height and then infer that the average height of the total population is ``close to the average height of our sample

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 45: Percentiles and Deciles

Univariate Bivariate and Multivariate Data

Depending on how many variables we are measuring on the individuals or objects in our sample we will have one of the three following types of data sets Univariate Measurements made on only one

variable per observation Bivariate Measurements made on two variables

per observation Multivariate Measurements made on more than

two variables per observation

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 46: Percentiles and Deciles

Examining Relationship Response Variable Measures the outcome of the

study treatment or experimental manipulation Explanatory Variable Explains or influences changes

in a response variable This is also known as an independent variable or prediction variable

Scatter plot Shows the relationship between two quantitative variables measured on the same individuals We look for the overall pattern and striking deviations from that pattern Overall pattern of a scatter plot by the form direction and strength of the relationship

Positive relation Association in the same direction Negative relation Association in the opposite direction

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 47: Percentiles and Deciles

Examining Relationship Form Linear relationship Curve linear

relationship Cluster etc Linear Relationship Points of the scatter plot

show a straight-line pattern Strength of the Relationship is determined by

how close the points in the scatter plot lie to a simple form such as line

Correlation measures the strength between two variablesWe will learn more about the relationship of variables later

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 48: Percentiles and Deciles

Proportion Proportion In many cases it is appropriate to summarize a

group of independent observations by the number of observations in the group that represent one of two outcomes

Consider a variable X with two outcomes 1 and 0 for happening and not happening of some events correspondingly Let p be the probability that the event happens then p=Prob(X=1)

Suppose we want to estimate of the proportion of the Patients coming to duPont having some particular disease To estimate this proportion (population) we need to take a sample of size n and examine if the patient is bearing that particular disease Then the estimated proportion is

n

Xp ==

size Sampledisease Particular with thatPatients of Numberˆ

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 49: Percentiles and Deciles

p For large n the sampling distribution of is approximately normal with mean P (Population Proportion) and the standard deviation

If probability of happening one event is p then probability of not happening of the same event is 1-p and total probability is 1

What is the difference between proportion and a sample mean If X takes two values 0 or 1 and p is the proportion of happening an event i e p=prob(x=1) then proportion is the same as sample mean

)1(

n

pp minus

Proportion

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 50: Percentiles and Deciles

Binomial Distribution Let us consider an experiment with two outcomes

success (s) and failure (F) for each subject and the experiment was done for n subjects The sequence of S and F can be arranged as follows-

SSFSFFFSSFShelliphellipF where there are x success out of n trial Then the

probability distribution of x can written as

prob(F)1 and prob(s) where

10 )1()(

=minus=

=minus⎟⎟⎠

⎞⎜⎜⎝

⎛= minus

pp

nxppxn

xf xnx

The mean and variance of x are np and np(1-p)

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 51: Percentiles and Deciles

Binomial Distribution

If p=12 then Binomial distribution is symmetric

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml

Page 52: Percentiles and Deciles

Useful Website(s)

httpwwwcaslancsacukglossary_v11mainhtml