Wk1 statnotes

Transcript of Wk1 statnotes

Page 1: Wk1 statnotes

http://www.slideshare.net/statcave/week8-finalexamlivelecture-2010june

http://www.facebook.com/statcave

Data Types

Generally speaking, the type of data determines which statistical techniques are appropriate, so a basic understanding of data types helps in choosing statistical procedures. In SPSS, each column holds a variable and each row holds a case. There are, broadly, two major types of data:

Qualitative variables: The data values are non-numeric categories. Examples: Blood type, Gender.

Quantitative variables: The data values are counts or numerical measurements. A quantitative variable can be either discrete, such as the number of students receiving an 'A' in a class, or continuous, such as GPA or salary.

Another way of classifying data is by the measurement scales. In statistics, there are four generally used measurement scales:

Nominal data: data values are non-numeric group labels. For example, a Gender variable can be coded as male = 0 and female = 1; the numbers serve only as labels.

Ordinal data (sometimes called 'discrete data'): data values are categorical and can be ranked in a meaningful order. For example, responses from strongly disagree to strongly agree may be coded as 1 to 5.

Continuous data:

Interval data: data values lie in a real interval, which can be as wide as negative infinity to positive infinity. The difference between two values is meaningful, but the ratio of two values is not. Examples: temperature, IQ. A statement such as "today is 1.2 times hotter than yesterday" is neither useful nor meaningful.

Ratio data: Both the difference and the ratio of two values are meaningful. Examples: salary, weight.
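The four scales form a hierarchy: each level supports every comparison of the levels before it and adds one more. A minimal Python sketch of that hierarchy follows. The names `SCALE_OPERATIONS` and `is_meaningful` are illustrative only and do not come from SPSS or from these notes.

```python
# Comparisons that are meaningful on each measurement scale, as described above.
# The dictionary and helper names are hypothetical, chosen for illustration.
SCALE_OPERATIONS = {
    "nominal": {"equality"},
    "ordinal": {"equality", "order"},
    "interval": {"equality", "order", "difference"},
    "ratio": {"equality", "order", "difference", "ratio"},
}


def is_meaningful(scale: str, operation: str) -> bool:
    """Return True if `operation` is meaningful for data on `scale`."""
    return operation in SCALE_OPERATIONS[scale]


print(is_meaningful("interval", "ratio"))  # temperature: ratios are not meaningful
print(is_meaningful("ratio", "ratio"))     # salary: ratios are meaningful
```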

NOTE: The statistical procedures mentioned below are demonstrated using movie clips in the Statistical Procedures Page.

http://www.cst.cmich.edu/users/lee1c/spss/datatype.htm

Nominal

Nominal data are items differentiated by a simple naming system. A nominal scale says only that the items being measured have something in common, although what they share may not be described.

Nominal items may have numbers assigned to them. The numbers may look ordinal, but they are not -- they are used only to simplify capture and referencing.

Nominal items are usually categorical, in that they belong to a definable category, such as 'employees'.

Example

The number pinned on a sports person.

A set of countries.

Ordinal

Items on an ordinal scale are placed in order by their position on the scale. The order may indicate temporal position, superiority, and so on.

The order of items is often defined by assigning numbers to them to show their relative position. Letters or other sequential symbols may also be used as appropriate.

Ordinal items are usually categorical, in that they belong to a definable category, such as '1956 marathon runners'.

You cannot do arithmetic with ordinal numbers -- they show sequence only.

Example

The first, third and fifth person in a race.

Pay bands in an organization, as denoted by A, B, C and D.
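Because ordinal values carry sequence only, order-based summaries such as the median are safe, while arithmetic summaries such as the mean are not. A small sketch, assuming the pay bands A to D run from lowest to highest (the direction is an assumption, and the staff data are made up for illustration):

```python
# Pay bands in rank order; assumes A is the lowest band and D the highest.
BAND_ORDER = ["A", "B", "C", "D"]


def median_band(bands):
    """Return the (lower) median band using only the ordering of the bands."""
    ranks = sorted(BAND_ORDER.index(b) for b in bands)
    return BAND_ORDER[ranks[(len(ranks) - 1) // 2]]


staff_bands = ["C", "A", "B", "D", "B"]  # made-up example data
print(median_band(staff_bands))          # B
```

The function never adds or averages the ranks; it only sorts them and picks the middle position, which is all an ordinal scale permits.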

Interval

Interval data (sometimes loosely called integer data, though interval values need not be whole numbers) are measured along a scale on which each position is equidistant from the next. This allows the distance between one pair of values to be compared with the distance between another pair.

This is often used in psychological experiments that measure attributes along an arbitrary scale between two extremes.

Interval data cannot be meaningfully multiplied or divided, because the zero point of the scale is arbitrary.

Example

My level of happiness, rated from 1 to 10.

Temperature, in degrees Fahrenheit.
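One way to see why ratios of interval data are not meaningful: converting the same two temperatures from Fahrenheit to Celsius keeps the hotter day hotter, but changes the ratio between them. The sketch below uses the standard conversion C = (F - 32) × 5/9; the two readings are invented for illustration.

```python
def fahrenheit_to_celsius(f: float) -> float:
    """Standard Fahrenheit-to-Celsius conversion."""
    return (f - 32.0) * 5.0 / 9.0


yesterday_f, today_f = 50.0, 60.0  # made-up readings
yesterday_c = fahrenheit_to_celsius(yesterday_f)  # 10.0
today_c = fahrenheit_to_celsius(today_f)          # about 15.56

ratio_f = today_f / yesterday_f  # 1.2 in Fahrenheit
ratio_c = today_c / yesterday_c  # about 1.56 in Celsius

print(f"ratio in F: {ratio_f:.2f}, ratio in C: {ratio_c:.2f}")
```

The same pair of days is "1.2 times hotter" on one scale and roughly "1.56 times hotter" on the other, so the ratio describes the choice of unit rather than the weather.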

Ratio

On a ratio scale, numbers can be compared as multiples of one another. Thus one person can be twice as tall as another. Importantly, the number zero has meaning.

Thus the difference between a person aged 35 and a person aged 38 is the same as the difference between people aged 12 and 15. A person can also have an age of zero.

Ratio data can be multiplied and divided because not only is the difference between 1 and 2 the same as that between 3 and 4, but 4 is also twice as much as 2.

Interval and ratio data measure quantities and hence are quantitative. Because they can be measured on a scale, they are also called scale data.

Example

A person's weight

The number of pizzas I can eat before fainting

Mode: The mode of a set of data is the value in the set that occurs most often.

Problem: The number of points scored in a series of football games is listed below. Which score occurred most often?

7, 13, 18, 24, 9, 3, 18

Solution: Ordering the scores from least to greatest, we get:

3, 7, 9, 13, 18, 18, 24

Answer: The score which occurs most often is 18.

This problem really asked us to find the mode of a set of 7 numbers.

Bimodal: A set of data is bimodal if it has two modes (i.e., two values that occur most often, each the same number of times).

No Mode: When each value occurs only once in the data set, there is no mode for this set of data.

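Zero Mode: When 0 occurs most often in the set, the mode is 0.

A mode of 0 and no mode are two different things: in the first case the data set has a mode and its value is 0; in the second case the data set has no mode at all.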
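The cases above can be checked with a short Python sketch. The helper name `modes` is hypothetical; it follows the definitions in these notes, returning an empty list when every value occurs only once.

```python
from collections import Counter


def modes(data):
    """Return all modes in sorted order, or [] if every value occurs only once."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # no mode: each value occurs only once
    return sorted(value for value, count in counts.items() if count == top)


print(modes([7, 13, 18, 24, 9, 3, 18]))  # [18]
print(modes([1, 1, 2, 2, 3]))            # [1, 2]  (bimodal)
print(modes([4, 5, 6]))                  # []      (no mode)
print(modes([0, 0, 5]))                  # [0]     (the mode is zero)
```

Returning a list makes the difference between "the mode is 0" (`[0]`) and "there is no mode" (`[]`) explicit.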

In a blind experiment, the subjects do not know whether they are in the treatment group or the control group. In order to have a blind experiment with human subjects, it is usually necessary to administer a placebo to the control group.

Blinding is a basic tool to prevent conscious as well as subconscious bias in research. For example, in open taste tests comparing different product brands, consumers usually choose their regular brand. However, in blind taste tests, where the brand identities are concealed, consumers may favor a different brand.

In a double-blind experiment, neither the subjects nor the people evaluating the subjects know who is in the treatment group and who is in the control group. This moderates the placebo effect and guards against conscious and unconscious prejudice for or against the treatment on the part of the evaluators.

Double-blind methods can be applied to any experimental situation where there is the possibility that the results will be affected by conscious or unconscious bias on the part of the experimenter. Random assignment of the subjects to the experimental or control group is a critical part of double-blind research design. The key that identifies the subjects and the group each belongs to is kept by a third party and not given to the researchers until the study is over.

Computer-controlled experiments are sometimes also referred to as double-blind experiments, since software should not cause any bias. By analogy with the above, the part of the software that interacts with the human is the blinded researcher, while the part of the software that defines the key is the third party. An example is the ABX test, where the human subject has to identify an unknown stimulus X as being either A or B.

http://www.worldlingo.com/ma/enwiki/en/Blind_experiment

Random sampling ensures that every member of the population has an equal chance of being selected.

Simple random sampling is a sampling method in which every possible sample of a given size is equally likely to be selected.

Randomization is the process of making something random.

Randomization-based inference is especially important in experimental design and in survey sampling.

Randomization involves randomly allocating the experimental units across the treatment groups. For example, if an experiment compares a new drug against a standard drug, then the patients should be allocated to either the new drug or to the standard drug control using randomization.
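A minimal Python sketch of this kind of random allocation follows. The function name `randomize`, the group labels, and the patient IDs are all hypothetical; shuffling and then dealing units out in turn is one simple way to randomize, not the only one.

```python
import random


def randomize(units, groups=("new drug", "standard drug"), seed=None):
    """Shuffle the units, then deal them out to the groups in turn."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    allocation = {group: [] for group in groups}
    for i, unit in enumerate(shuffled):
        allocation[groups[i % len(groups)]].append(unit)
    return allocation


patients = [f"patient_{i}" for i in range(1, 9)]  # hypothetical IDs
print(randomize(patients, seed=42))
```

Dealing the shuffled units out in turn keeps the group sizes as equal as possible while leaving the choice of who lands in which group to chance.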

Simple random sampling refers to a sampling method that has the following properties.

The population consists of N objects.

The sample consists of n objects.

All possible samples of n objects are equally likely to occur.
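These properties can be illustrated in Python with the standard library. The population labels and the sizes N = 5, n = 2 are made up for the example; `random.sample` draws without replacement, so each subset of size n is equally likely.

```python
import itertools
import math
import random

population = ["a", "b", "c", "d", "e"]  # N = 5, made-up labels
n = 2

# Every subset of size n is a possible sample; there are C(N, n) of them.
all_samples = list(itertools.combinations(population, n))
print(len(all_samples), math.comb(len(population), n))  # 10 10

# random.sample draws n distinct members without replacement.
rng = random.Random(0)
sample = rng.sample(population, n)
print(sample)
```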

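The main benefit of simple random sampling is that it guards against selection bias, so the sample is representative of the population on average. No sampling method can guarantee that a particular sample is representative, but random selection is what allows valid statistical conclusions to be drawn.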

There are many ways to obtain a simple random sample. One way would be the lottery method. Each of the N population members is assigned a unique number. The numbers are placed in a bowl