Test 1 Notes for Statistics


  • 8/8/2019 Test 1 Notes for Statistics


    CHAPTER ONE: Introduction

    TERMS: We shall begin our discussion with some basic terminology:

    Statistics-a method of quantifying, organizing and summarizing data.

    Data-another name for the numbers or scores collected.

    Population-the collection of individuals with whom the research is concerned; whomever the study tries to describe.

    Sample-a portion or subset of the population of interest.

    Parameter-a characteristic of the population, such as the average age of its members.

    Statistic-a characteristic of the sample; therefore, an estimate of a parameter. The average age of the members of a sample may be the best guess of the average age of the population from which the sample was drawn.

    Inferential Statistics-a method of quantifying, organizing and summarizing data based upon a sample.

    Descriptive Statistics-a method of quantifying, organizing and summarizing data based upon a population.

    *Note...The plural of statistic (statistics) is not defined as more than one statistic, as the term statistics shall be reserved for the method of quantification and each statistic shall be evaluated one characteristic at a time.

    DISCUSSION: Statistics is a technique that is used to describe and assess the features of a specified group of people, the population.

    The term population is not necessarily defined by a geographic boundary or nationality, the way it usually is in casual conversation. If the researcher is interested in the AIDS epidemic, then all people with AIDS may be included in the referent population. If interest is limited to only those cases within a specific nationality, say the United States, then the referent population will be the US AIDS cases only. If interest is limited to young men, as they are the majority of AIDS cases, the population may be called US AIDS AMONG MEN. Note that the title of the population grows larger as the population grows more specific.

    As it is not likely that every member of the population can participate in the study, a portion of the population will actually be assessed. That portion, or sample, will serve to 'best guess' the status of the referent population.


    Statistical techniques will be used to quantify certain features, or parameters, of the population. When a sample is used, those same techniques (with minor adjustments to the formulas) will be applied to each feature, or statistic, of that sample. As a result, there are two types of Statistics: Inferential, in the case of the sample, and Descriptive, in those cases when the researcher can actually use the entire population. As you might guess, the former happens much more often than the latter. Unless the population is very small, Inferential Statistics will be used.

    LEVEL OF DATA: Traditionally, statistical techniques had been chosen to fit the level of data that was collected. The data level, or scale, refers to the degree of precision of the measure used to score the subjects. The higher the level of the data, the greater the precision, and presumably, the greater the degree of detail gleaned from the measures. This was considered very desirable, as it was presumed that higher, more sophisticated analyses, which require more precise measures, could be applied to the analysis.

    In actuality, it is common practice for researchers to ignore the level of the data. Further, there is mathematical support for the idea that, over the great number of studies a researcher might conduct in a career, any minor variation in interpretation of one's data would balance out. Inasmuch as many social scientists, and especially clinicians, will not do many studies, this generous indulgence may be rather risky. As a result, many discussions of introductory statistics advise the novice to consider the level, or scale, of the data being analyzed.

    An American psychologist named Stevens suggested a standard:

    Nominal-data which is classified or named only.

    e.g. Cats and Dogs.

    Ordinal-data which is ranked by class.

    e.g. Course grades, A > B > C > D > F.

    Interval-data ranked by equal classes.

    e.g. The number of stars a critic gives a movie.

    Ratio-data ranked by equal classes, with an absolute zero.

    e.g. Height or weight.

    Most sophisticated statistical procedures would require interval scale, though ratio was better still. The vague nominal and ordinal scales would require special techniques.


    An American by the name of Savage, however, pointed out that, as there were only two levels of techniques, only two levels of data were required.

    Discrete-data with a limited number of classes.

    e.g. It is either a dog or a cat, never both, though in some ways a dog may be preferred to a cat and vice versa.

    i.e. Nominal and ordinal scales.

    Continuous-data with an unlimited number of classes.

    e.g. Inches of height and satisfaction with one's spouse can both be considered in degrees, while an absolute zero can only be specified in the case of inches. One may no longer be satisfied with one's spouse, but it may not make sense to specify an absolute zero.

    i.e. Interval and ratio scales.

    It bears pointing out that the two levels of statistical techniques referred to are parametric and non-parametric procedures, which involve separate sets of formulas. These terms should not be equated with the two types of statistics, descriptive and inferential, which allude to the source of the data. Non-parametric procedures, which are often given less credence, will be discussed further in the final chapters of this course.

    NOTATION: Sometimes letters are used to denote specific functions in statistics, particularly Greek letters. Knowing these will aid in calculating outcomes based upon formulas. The only new symbol introduced at this point is the upper case (capital) sigma [$\Sigma$], which indicates that a summation is required. This does not negate the directives of parentheses [ ( ) ] to specify order of calculation. Therefore, in the notation $\sum X$, one is to sum all numbers in the column called X. If X is squared and no parentheses appear, then it is implied that the numbers in the column called X are each squared, and it is the squares of those numbers that are added. Therefore, $\sum X^2$ reads 'the sum of the squares'.

    If parentheses enclose the X and the upper case sigma, the numbers in the column called X are to be summed first, and it is the sum of those numbers that is squared. Therefore, $(\sum X)^2$ is called 'the square of the sum'. The 'sum of the squares', then, is not equal to the 'square of the sum', as they are distinct concepts. This distinction becomes crucial in succeeding chapters, so be sure to do the problems in the text and the work book to clarify any obscurities you may have about them.
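
    The difference between the two notations can be checked with a few lines of Python. This is only an illustrative sketch; the column X = 1, 2, 3 is a made-up example and does not come from the text.

```python
# Hypothetical column of scores called X (values chosen for illustration only).
X = [1, 2, 3]

# Sum of the squares: square each score first, then add the squares.
sum_of_squares = sum(x ** 2 for x in X)

# Square of the sum: add the scores first, then square the total.
square_of_sum = sum(X) ** 2

print("sum of the squares:", sum_of_squares)  # 1 + 4 + 9 = 14
print("square of the sum:", square_of_sum)    # (1 + 2 + 3) squared = 36
```

    For this small column the two results are 14 and 36, so the order of operations clearly matters.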

    METHODOLOGY: It was noted in the BASIC SKILLS section that the method of research determines the statistical technique required.


    The basic types of investigation in the social sciences are:

    Natural Observation (NO)-subjects are observed in their natural setting, without interference. Ideally, the observer will not be detected, though this is not always possible.

    Advantage-NO provides frequencies without the contrived trappings of a laboratory to tamper with the true flow of events. That makes NO the most realistic type of study.

    Disadvantage-NO can cause subjects to behave self-consciously if the observer is detected, and observers usually are. Further, observers who hope to stay long enough for self-consciousness to wear off may linger long enough to become engrossed in the phenomenon to the point of losing their objectivity. So NO is the least controlled type of investigation there is.

    Correlation (Corr)-Studies in which subjects may be asked questions for further clarification, usually in the form of a survey. Scales can add a degree of precision, or magnitude, as well (Likert's Strongly Agree, Moderately Agree, etc.).

    Advantage-Corr provides frequency and magnitude of response, allowing comparison of fluctuations across factors, or variables.

    Disadvantage-Corr can only assess linear relations. In the case of answer scales, such as Likert's subjective rating scale, subjects are being told to restrict their answers to those available, which may or may not fit the actuality. Further, the quality of the data received can be determined, in part, by the form of the survey.

    Consider the pros and cons of each:

    INTERVIEW-Most expensive and least honest, as the least anonymous; however, most returns and detail.

    QUESTIONNAIRE-Least expensive and most honest, as the most anonymous; however, least detail, as no one is there to take comments on each item. Even worse, many subjects will simply not respond. Losing 70% of the subjects is common.

    PHONE SURVEY-Rather popular, as it seems to have a medial impact in all aspects (expense, honesty, detail and returns). Still, if money or subjects are scarce, the researcher may not have a choice.

    The Experiment-Unlike the first two discussed, which are only quasi-experimental, the true experiment is the only investigation technique appropriate when considering a causal relation. This is because only an experiment includes a manipulation under controlled circumstances. Specifically, the presumed causal agent (the independent variable) is manipulated and the presumed effect (the dependent variable) is observed. Variables should be defined operationally; that is, in a way which can be measured. Thus, the dependent variable is also referred to as the dependent measure.

    Advantage-The experiment can support a causal inference, in addition to frequencies and magnitudes.

    Disadvantage-The experiment is the most contrived, and subjects may behave in a way quite different from how they behave outside of the testing situation (in the real world).

    Even though the results of an experiment are generally given more credence than quasi-experimental procedures, it may be necessary to use the latter for preliminary or exploratory purposes. Grant requests require some initial investigation to justify supporting a potential experiment. Further, conditions and data quality may make true experimentation impossible.

    DOCUMENTATION: Research at all levels benefits from clear documentation.

    Documentation may mean official certificates which verify subject records. In the most technical sense, the CASE STUDY is a form of record keeping, or documentation. It is not a specific research method; all manner of information about a specific case may be included: the individual's response to a survey, the individual's participation in an experiment, their IQ, health record, school grades or even their credit rating.

    DATA COLLECTION: The method of data collection also affects the data quality. This includes concerns about from whom the data was collected. If data was collected from all the major subsets of the population, the study can be described as CROSS-SECTIONAL. If subjects are followed for a long time, the study can be described as LONGITUDINAL. These are not types of research, per se. A given study can be both cross-sectional and longitudinal.

    MEDICAL RESEARCH: Medical research is especially sensitive to ethical restraints, so many studies are abridged compared to the social sciences. Natural Observation is not directly useful to the development of medical interventions. Correlations are distinguished as:

    Cross-sectional-Usually a preliminary, one-time survey.

    Retrospective-A longitudinal study of archival cases, which is convenient, but based upon the memory of survivors or old documents which may not be accurate.

    Prospective-A longitudinal study of individual cases (a cohort) followed forward in time; more accurate but very expensive.

    Medical experiments are called CLINICAL TRIALS. These are abridged in that, the moment a treatment's effectiveness is suspected, ethical restraint requires that all patients receive that treatment, even the control group.


    CHAPTER 2: DEPICTING THE DATA

    In order that observations about the population of interest might be made, measures of certain characteristics will be taken and collected into sets. This represents the data that will be analyzed. The data sets themselves are often large and unwieldy, so methods of depicting them at a glance have been developed. Representing data sets in graphs, charts or tables, then, is one way of organizing and summarizing the data. If sets are presented in terms of the way they fall across a scale (what numbers can be in the set), they are called DISTRIBUTIONS. If sets are presented in terms of how many cases of each number exist, they are called FREQUENCIES.

    DISTRIBUTIONS: A distribution is a set of numbers, generally depicted in a way that makes the possible types of members apparent (Are there twos or fives? What numbers are possible in this set?).

    When distributions are depicted in terms of counts of each type of member, they are called FREQUENCY DISTRIBUTIONS. The manner of presenting the set is called a CHART or TABLE. These are composed of columns and rows: a column for the data, and a column for the frequency. The frequency for a given number is on the same row, so that:

    X    f
    4    2
    3    0
    2    4
    1    4

    The data set, represented by the label X, depicted above is actually X = 1, 1, 1, 1, 2, 2, 2, 2, 4, 4. Notice the set has been reduced in terms of how much space is required to depict it on a page. Imagine a set with 10,000 members. Obviously, this organizes and summarizes the set in such a way that it is rendered intelligible. Indeed, sets which could cover an entire wall when members are listed one at a time can be depicted on a single page. Notice further, it is appropriate to point out when there are no members in the distribution at a particular point on the scale. In the above example, there happen to be no threes. By including the three in the X column, the scale remains complete and the reader is assured any potential threes were not overlooked. The number of members in a distribution (N) can be determined by adding the f column. In the above example, N = 10. The notation would appear N = the sum of f. You cannot determine the sum of a set by adding the X column, except in those cases in which exactly one of each possible number on the scale occurs. That is because X the scores is different from X the scale, which is what appears on the chart.
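
    As a rough sketch of the idea above, the frequency table can be rebuilt from the raw set in Python. The variable names are illustrative only; the data is the set X listed in the text.

```python
from collections import Counter

# The raw data set from the text.
X = [1, 1, 1, 1, 2, 2, 2, 2, 4, 4]

counts = Counter(X)

# Walk the whole scale from the highest to the lowest score, so that
# scores with no members (here, 3) still appear with a frequency of 0.
scale = range(max(X), min(X) - 1, -1)
table = [(score, counts.get(score, 0)) for score in scale]

print("X  f")
for score, f in table:
    print(f"{score}  {f}")

# N is the sum of the f column, not the sum of the X column.
N = sum(f for _, f in table)
print("N =", N)
```

    Adding the f column gives N = 10, matching the count of members in the original set.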


    Additional columns can be added to the table, providing even more organization to the distribution. For each extra column, the type of the distribution grows more specific.


    Types of distributions:

    Frequency distribution (FD)-A set (X) and a frequency (f).

    X    f
    4    2
    3    0
    2    4
    1    4

    Grouped frequency distribution (GFD)-A set (X) in which intervals of possible members are presented by frequency (f). Here, the number of rows required to depict the entire set can be reduced, so that an enormous data set can be depicted on a single page. However, grouped distributions lose some detail, as the original raw data is no longer observable. If a set has 20 members on the interval 10-19, the observer cannot determine whether the interval includes 20 tens or not.

    X        f
    40-49    2
    30-39    0
    20-29    4
    10-19    4

    Cumulative grouped frequency distribution (CGFD)-A set (X) in which the frequency of each interval is tallied, or accumulated, up to and including that interval, and is represented in an additional column called cumulative frequency (cf).

    X        f    cf
    40-49    2    10
    30-39    0     8
    20-29    4     8
    10-19    4     4

    Cumulative grouped percentile frequency distribution (CGPFD)-A set (X) in which the frequency of each interval is also depicted as a percent of the total (%). In fact, the percentage may be cumulative, or accumulated, in yet another column called cumulative percentile (c%).

    X        f    cf     %    c%
    40-49    2    10    20   100
    30-39    0     8     0    80
    20-29    4     8    40    80
    10-19    4     4    40    40
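
    The extra columns can be derived from the f column alone. The following Python sketch shows one way to do it for the grouped table above; the list names are illustrative, and the intervals are listed from lowest to highest so that the running total accumulates upward.

```python
from itertools import accumulate

# Intervals and frequencies from the grouped table, lowest interval first.
intervals = ["10-19", "20-29", "30-39", "40-49"]
f = [4, 4, 0, 2]

N = sum(f)                           # total number of members
cf = list(accumulate(f))             # running total of f, lowest interval upward
pct = [100 * x / N for x in f]       # each interval's share of the total
c_pct = [100 * c / N for c in cf]    # cumulative share up to each interval

# Print with the highest interval on top, as in the tables in these notes.
print("X      f  cf  %     c%")
for row in reversed(list(zip(intervals, f, cf, pct, c_pct))):
    print(*row)
```

    The top row of the cf column always equals N, and the top row of the c% column always equals 100.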


    GRAPHS: Distributions can be depicted pictorially, rendering them concrete. These pictures, or GRAPHS, should be fitted to the scale of the data set. Just as the scale of the data may be ignored in practice, the appropriateness of the graph may be obscured by the limits of the graphics package of a researcher's computer program. Still, it is helpful to note these distinctions, when possible.

    BAR GRAPH: A graph composed of distinct bars or lines for each interval of the data. The height of each bar indicates the frequency of that interval of data. These are ideal for nominal data.

    HISTOGRAMS: The bars are touching, indicating continuity of scale. In this way, the rank or order of intervals is depicted. Due to the limits of dot matrix printers, this is the most common form of graph produced by personal computers.


    POLYGON: Instead of a bar whose height indicates the frequency of an interval of data, just a dot is placed at that height, and the dots are joined by a line. So streamlined is the FREQUENCY POLYGON that multiple distributions can be depicted on a single graph. (The word polygon itself means 'many angles'.)

    OGIVE: Frequency polygons can be cumulative. This is helpful when noting additive impacts, such as total growth rates, or total losses, such as in the case of epidemics.

    STEM AND LEAF: This graph is the only picture drawn with the original data set. The raw data is stacked within columns defined by some interval. The interval is defined by some range within the data. If the columns are arranged in decades (tens), then the tens digit defines the interval. In the case of 40, four is the interval, or stem. The zero is its leaf. In the case of 45, the stem is still 4, but the leaf is five. In the case of 52, the stem is 5 and the leaf is 2. What develops is a distribution with a shape or curve, just as in the case of a polygon. However, instead of just a simple line, the observer can still see the original data set. No information is lost. Unfortunately, stem and leaf graphs can only be used for data sets of limited size, due to the physical limit of space on the page. Consider the set X = 4*, 10, 19, 21, 23, 24, 33, 36, 37, 40, 45, 46, 55, 58, 63. This would appear as:

    :  :  4  7  6
    :  9  3  6  5  8
    4  0  1  3  0  5  3
    ___________________
    0  1  2  3  4  5  6

    *Note...the first decade includes the integers 0 through 9.
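
    A stem and leaf display can also be produced with a short Python sketch. Note that this version is an assumption about layout: it prints one stem per row rather than one stem per column as drawn above, which is simply easier to print as text. The data is the set X from the text.

```python
from collections import defaultdict

# The data set from the text.
X = [4, 10, 19, 21, 23, 24, 33, 36, 37, 40, 45, 46, 55, 58, 63]

# Split every score into its tens digit (stem) and units digit (leaf).
stems = defaultdict(list)
for score in sorted(X):
    stem, leaf = divmod(score, 10)
    stems[stem].append(leaf)

# Print every stem on the scale, even one that has no leaves.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

    Every original score can still be read back from the display (stem 4 with leaves 0, 5 and 6 gives 40, 45 and 46), which is the advantage described above.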


    INTERPOLATION: To calculate the interposition of a score within an interval is to INTERPOLATE. But why would you want to? One of the values of compiling data sets into grouped percentile frequencies is to note the placement of a score within the distribution. This can be achieved by determining the percentile rank of a score. If a score is at the 50th percentile, then 50% of the distribution is below it (lesser in value). In fact, if a particular percentile rank is the initial interest, the score at that percentile rank, or position, can be determined after the fact. In the case of grouped frequencies, the exact placement of a specific score must be approximated. This is done with INTERPOLATION. Let's consider percentiles, percentile ranks and the process of interpolation separately.

    Percentile Ranks-The rank of a score as determined by the percentage of the distribution that lies below it.

    Percentile-The score that is located when the rank is noted first.

    Consider the table:

    X    f    cf     %    c%
    5    1    10    10   100
    4    2     9    20    90
    3    4     7    40    70
    2    2     3    20    30
    1    1     1    10    10

    When a data set is depicted in a percentile distribution, it is treated as a continuous scale. To read the table, then, ABSOLUTE LIMITS must be applied. The absolute upper limit of the number five is 5.5 and its absolute lower limit is 4.5. The integer 5, then, can be thought of as a continuous range from 4.5 to 5.5. All potential members of the distribution within that range can be counted as a five in the frequency column. More importantly, if a cumulative percentile must accumulate all possible members up to a point, regardless of the number of decimal places, then a percentile rank necessarily corresponds to the upper limit of that integer.

    Determining percentiles-In the above table, the 100th% is exactly equal to the percentile (score) 5.5. The score at the 70th% is 3.5. The percentile for a given percentile rank depicted in the c% column is found by simply determining the upper limit of the score on the same line.

    Determining percentile ranks-If the initial interest is a given score, its percentile rank, or position, can be determined by noting the percentile rank on the same line as the starting score. If one starts with a percentile of 2.5, the percentile rank is the 30th%. Note that this was easy because 2.5 is an upper limit.


    Interpolation-When determining a percentile rank for a score which is not an upper limit, or when determining a percentile for a percentile rank which is not depicted in the c% column, one must interpolate. To find the appropriate position within one column, one must proceed the same relative distance as in the original column. If the starting point is the percentile 2 (exactly 2 is 2.0), then the distance must be matched within the c% column. The integer 2 is exactly midway within the 2 interval (1.5 to 2.5). The percentile rank for 2 is midway within the appropriate percentile rank interval: in this instance, the 10th to the 30th%. The midpoint is the 20th percentile rank, because 20 is equidistant from 10 and 30 (ten points either way).

    In determining the appropriate percentile for a given percentile rank, the process is reversed. For example, to determine the exact score for the 60th percentile rank, which does not appear on the table, one must first locate the 60th rank. It is located between the 30th and 70th ranks, an interval of 40 percentage points. As the 60th% is 10 points below the 70th%, in a space covering 40 percentage points, the 60th% is 1/4th of the way down from the top of that percentage range. This is how far below the percentile for the 70th% the percentile for the 60th% must be. So, an equivalent range has been identified.

    The percentile for the 70th% is 3.5 and the percentile for the 30th% is 2.5. The range of percentiles (2.5 to 3.5) is equivalent to the range of ranks (30th to 70th%). The distance within the ranks is 40%, as 70 - 30 = 40. The distance within the equivalent range for the percentiles is 1, as 3.5 - 2.5 = 1. Since 60 is 1/4th of the way down within the range of 40 percentage points, the percentile is 1/4th of the way down within the range of 1 score point, or .25 points. This means the percentile for the 60th% is equal to 3.5 - .25 = 3.25. This is your answer. A percentile of 3.25 is at the 60th%. Notice that interpolation is really a way of translating scales. In this case, we went from a scale based upon 100ths to a scale based upon the raw data.

    Can you see that the rank for a percentile of 3 is the 50th, and that the percentile for the 80th% is exactly 4?
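
    Once the steps have been practiced by hand, the interpolation described above can be checked with a small Python sketch. The function names are hypothetical, and the starting point of the table (a lower limit of 0.5 at 0%) is an assumption added so that the lowest interval can also be interpolated.

```python
# Upper limits of the scores 1 to 5 and the c% reached at each limit,
# taken from the table above. The first pair (0.5, 0) is an assumption:
# the lower limit of the score 1, below which no members fall.
limits = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
cum_pct = [0, 10, 30, 70, 90, 100]


def interpolate(value, from_col, to_col):
    """Travel the same fraction of the way in to_col as value sits in from_col."""
    for i in range(len(from_col) - 1):
        lo, hi = from_col[i], from_col[i + 1]
        if lo <= value <= hi and hi > lo:
            fraction = (value - lo) / (hi - lo)
            return to_col[i] + fraction * (to_col[i + 1] - to_col[i])
    raise ValueError("value is outside the range of the table")


def percentile_rank(score):
    """Percentile rank (c%) for a given score."""
    return interpolate(score, limits, cum_pct)


def percentile(rank):
    """Score located at a given percentile rank."""
    return interpolate(rank, cum_pct, limits)


print(percentile_rank(2))   # 20.0
print(percentile(60))       # 3.25
print(percentile_rank(3))   # 50.0
print(percentile(80))       # 4.0
```

    The four printed values reproduce the worked answers in the text: a rank of 20 for the score 2, a score of 3.25 at the 60th%, a rank of 50 for the score 3, and a score of 4 at the 80th%.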

    Practice traveling the same distance in the column you are going to as you went in the column you are starting from; this will make the process seem less foreign. Numerous sample questions are provided at the end of chapter two in the text, and in the work book as well.

    CAUTION hackers. Please resist the temptation to solve this with a 'push of the button'. If you never get your 'hands in the data' you will never get a feel for what you have completed. Once you get a 'feel' for what you are doing, simplifying your work with a computer program will come easily.

    CHAPTER THREE: Measures of Central Tendency

    Sometimes a most representative number will be used to depict a distribution, for summary purposes. But what is a most representative number? It is a measure of CENTRAL TENDENCY, the place within a distribution where most members tend to occur. This chapter discusses the three types of representative numbers, as well as the computation and application of each.

    MEASURES OF CENTRAL TENDENCY: There are three methods of measuring the central tendency of a distribution.

    Types of measures:

    The Mode (Mo)-The most frequently occurring score in a distribution. It tends to fall in the middle when a distribution is symmetrical, but there is no guarantee of that, so the mode can be very misleading. In fact, there is no mode when all the members occur the same number of times. The mode will be all that is possible when the data is nominal. When data is represented in a frequency distribution, the mode is simply the score with the highest frequency. This is not appropriate in a grouped frequency distribution. Finally, the mode is oblivious to extreme scores or outliers, so it does not help you recognize that an odd member is in the set.

    The Median (Md)-The 50th%, or exact middle of a distribution. The median also falls near the center of the distribution. It is only slightly sensitive to extreme numbers, or OUTLIERS, so it may not be the true balance point of the distribution. Still, it is not oblivious to outliers and is generally not as potentially misleading as the mode. The median requires at least ordinal data, as percentiles must be ranked. When data is collected into a cumulative percentile frequency distribution, the median is the percentile equivalent to the 50th percentile rank.

    The Mean (Mn)-The arithmetic average of a distribution, or mean, is the balanced center of a distribution. Denoted by $\bar{X}$, its formula is: $\bar{X} = \sum X / N$. The sum of the distances of all the scores from the mean is always equal to zero. This is true even when only one score is greater than the mean and all the other scores in the distribution are less than the mean. For this to happen, the larger score is simply much further away from the mean. The mean is the most sensitive to outliers, so it can be a very misleading measure of central tendency when an extreme score has been added to the distribution. For this reason, it may be best to compute the mean of the distribution first with all of the scores in and then recalculate the mean with all of the outliers out, so the impact of those extreme scores can be taken into consideration. Consider the set X = 10, 10, 10, 10 and 60. Since the average is 20, all the small scores are ten points below the mean. The largest score is 40 points above the mean. Consider the set:

      X
     60
     10
     10
     10
     10
    ___
    Sum of X = 100

    If N = 5 and $\bar{X} = 100/5$, then $\bar{X}$ = ___


    Additional columns can depict the distance from $\bar{X}$:

      X     $X - \bar{X}$ (distance from the mean)
     60     60 - 20 =  40
     10     10 - 20 = -10
     10     10 - 20 = -10
     10     10 - 20 = -10
     10     10 - 20 = -10
    ___     ___
    100     $\sum (X - \bar{X}) = 0$

    Note...The sum of the distances of the scores of a distribution from the mean is always equal to zero, unless the mean is miscalculated.
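
    The same table can be verified with a few lines of Python. This is an illustrative sketch using the set from the text; the second mean, computed with the outlier removed, follows the suggestion made earlier of recalculating without extreme scores.

```python
# The set from the text: four small scores and one outlier.
X = [60, 10, 10, 10, 10]

N = len(X)
mean = sum(X) / N                      # 100 / 5 = 20.0

# Distance of each score from the mean.
deviations = [x - mean for x in X]     # [40.0, -10.0, -10.0, -10.0, -10.0]

# The distances always cancel out.
print(sum(deviations))                 # 0.0

# Recalculate with the outlier (60) left out to see its impact.
trimmed = [x for x in X if x != 60]
trimmed_mean = sum(trimmed) / len(trimmed)   # 10.0
print(mean, trimmed_mean)
```

    Removing the single score of 60 moves the mean from 20 to 10, which shows how strongly one outlier pulls it.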

    Furthermore, the mean lends itself to more complex calculations and requires a continuous scale of data. Unfortunately, the mean is extremely sensitive to outliers. That is what is happening in the example depicted above. The mean is supposed to measure the central tendency of the distribution, yet the mean is 20 and not one of the scores is a 20.

    When the distribution is symmetrical, or balanced in shape, the mean falls where most of the members of the distribution are. Regardless of whether the distribution is symmetrical or not, the sum of the squared distances from the mean is never larger than the sum of the squared distances from the mode or the median. Interestingly, when a distribution is perfectly symmetrical with a single peak, all three measures of central tendency are equal to each other.

    Let's consider a cumulative percentile frequency distribution. Notice the sum of X, $\sum X$, and the number of scores, N, seem to be computed differently. That is because frequency distributions no longer contain the original data set, unless all the numbers appear exactly once. By adding an Xf column, in which each possible score is multiplied by the frequency in the f column, you can correct this problem. Note, N still = $\sum f$, but now $\sum X = \sum Xf$. Further, the median is the score at the 50th% and the mode is simply the score with the greatest number in the f column.

     Xf     X    f    cf     %    c%
     60    60    1     5    20   100
      0    50    0     4     0    80
      0    40    0     4     0    80
      0    30    0     4     0    80
      0    20    0     4     0    80
     40    10    4     4    80    80
    ___         ___
    100           5

    $\sum X = \sum Xf = 100$
    $N = \sum f = 5$
    $\bar{X} = 20$
    Mo = 10 and Md = 10.125

    Interpolate to see the score for a rank of 50% is 10.125.
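
    The Xf column and the resulting mean and mode can be sketched in Python as follows. The list names are illustrative; the scores and frequencies are those of the table above.

```python
# Scale (X) and frequency (f) columns from the table, highest score first.
scores = [60, 50, 40, 30, 20, 10]
freqs = [1, 0, 0, 0, 0, 4]

# The Xf column: each possible score multiplied by its frequency.
Xf = [x * f for x, f in zip(scores, freqs)]

sum_X = sum(Xf)        # the sum of X is the sum of the Xf column
N = sum(freqs)         # N is the sum of the f column
mean = sum_X / N

# The mode is the score with the greatest number in the f column.
mode = scores[freqs.index(max(freqs))]

print(Xf, sum_X, N, mean, mode)
```

    Adding the X column itself would give 210, not 100, which is why the Xf column is needed.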

    DISTRIBUTION SHAPE: The shape of the distribution is depicted with a polygon. A SKEW, or pull in the distribution, will jeopardize the symmetry of that distribution. Consider these shapes.

    Symmetry-A distribution is described as symmetrical when the curve of the polygon depicts a balanced image, such as these. Only graph B is normal (balanced in form and measures of central tendency).

    [Figure: three symmetrical curves labeled A, B and C.]

    Skew-A skew, or tail, in a distribution is a pull toward extreme scores. When the majority of scores are large, and the exceptions are small, the data set (B) will have a NEGATIVE skew. The point, or skewer, will be toward the smaller numbers.

    POSITIVE skews occur when most of the numbers in the set are relatively small and the exceptions are relatively large (A & C). The tail, or skewer, will be toward the larger numbers.

    SYMMETRY AND CENTRAL TENDENCY: The position of the measures of central tendency, in relation to each other, can indicate the shape of the distribution.

    When the curve is symmetrical: Mn = Md.


    When the curve is normal: Mo = Mn = Md.

    When the curve has -skew: Mo > Md > Mn.

    When the curve has +skew: Mo < Md < Mn.


    Remember, you must account for all the spaces occupied on the scale, including all those in between. If the highest number is 9, and the lowest is 0, there are ten spaces included on that scale. Count them: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).

    The Inter-quartile Range (IQR)-You recall that you could determine the median by locating the percentile (score) at the fiftieth percentile rank (50th%). The same methods can be used to locate the score (X) at the first quartile (25th%) and the third quartile (75th%). By definition, QUARTILES divide a distribution into quarters, or fourths. The IQR is the range between the 1st and 3rd quartiles (Q). Therefore, simply subtract the score at the 25th% (Q1) from the score at the 75th% (Q3):

    IQR = X(75th%) - X(25th%)

    The Semi-Interquartile Range (SIQR)-When the data is very skewed or incomplete, the SIQR replaces the IQR. To compute the SIQR, simply divide the IQR by 2:

    SIQR = (IQR)/2
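
    As a sketch of how the two formulas fit together, the Python below locates the quartiles by interpolation and then computes the IQR and SIQR. It reuses the five-score percentile table from the interpolation section (scores 1 to 5 with c% of 10, 30, 70, 90 and 100); the function name score_at and the starting point of 0.5 at 0% are assumptions made for the example.

```python
# Upper limits of the scores 1 to 5 and the c% reached at each limit.
# The first pair (0.5, 0) is an assumed lower boundary for the table.
limits = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
cum_pct = [0, 10, 30, 70, 90, 100]


def score_at(rank):
    """Interpolate the score (percentile) located at a given percentile rank."""
    for i in range(len(cum_pct) - 1):
        lo, hi = cum_pct[i], cum_pct[i + 1]
        if lo <= rank <= hi and hi > lo:
            fraction = (rank - lo) / (hi - lo)
            return limits[i] + fraction * (limits[i + 1] - limits[i])
    raise ValueError("rank must be between 0 and 100")


q1 = score_at(25)    # score at the 25th%: 2.25
q3 = score_at(75)    # score at the 75th%: 3.75
iqr = q3 - q1        # 1.5
siqr = iqr / 2       # 0.75

print(q1, q3, iqr, siqr)
```

    For that table the middle half of the distribution spans 1.5 score points, from 2.25 to 3.75.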

    The Standard of Deviation (S)-The best way to determine the appropriateness of the mean is to determine the average (mean) amount of dispersion from the mean. Unfortunately, because the mean (X̄) is the sum of the scores (SumX) divided by the number of scores (N), the distances above and below it cancel. The sum of the distance scores from the mean always equals zero, which makes a simple average of the distances useless. Recall:

    Sum of the Distances = Sum of (X - X̄) = 0

    To correct this problem, each distance score is squared. The sum of the squared distance scores is called the SUM OF SQUARES (SS). The formula, then, is:

    SS = The Sum of (The Score - The Sample Mean) Squared

    The SS can serve as the sum of the distances, which is divided by the number of distances to suggest a mean distance. But to return to the original scale, the square root of this squared mean distance should be computed. The standard of deviation is the square root of the variance:

    (S)(S) = SS/N, so S = Square Root of (SS/N)

    The Variance (S²)-The standard of deviation (S) squared is called the variance. This is the very number one takes the square root of to determine the S; the squared mean distance from the mean. The purpose of the variance was primarily to calculate the S. It can be thought of as a squared measure of variability.

    S² = [Sum of (X - X̄) Squared] / N

    This version writes out the SUM OF SQUARES (SS) in the numerator. Use the formula you are more comfortable with. They measure the same thing.
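    The sketch below computes the SS two ways in Python and then the variance and standard of deviation with N in the denominator. The scores are hypothetical. The second (computational) form, Sum of X Squared minus (SumX) Squared over N, is a common textbook equivalent and is an assumption here, since the notes do not show it.

```python
from math import isclose, sqrt

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores, not from the notes
n = len(scores)
mean = sum(scores) / n

# Definitional formula: sum of squared distances from the mean.
ss_definitional = sum((x - mean) ** 2 for x in scores)

# Computational formula (assumed equivalent form):
# Sum of X squared minus (Sum of X) squared over N.
ss_computational = sum(x * x for x in scores) - sum(scores) ** 2 / n

variance = ss_definitional / n   # squared mean distance from the mean
std_dev = sqrt(variance)         # back on the original scale

print(ss_definitional, ss_computational, variance, std_dev)
```

    Both routes give the same SS, so the choice between them is one of convenience.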

    DEGREES OF FREEDOM: Variability can be determined in both the inferential and descriptive cases. Descriptive statistics are based upon populations. That is what the above formulas apply to. The formulas for both the variance and the standard of deviation should be adjusted for the case of samples, because there is a risk of bias in the estimate. The sample may not be the most representative of the true population. To assure unbiased estimates, subtract a 1 from the denominator, N. This adjusted denominator is called DEGREES OF FREEDOM (df = n-1).
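    A minimal Python sketch of the adjustment follows. The function name, the is_sample flag, and the scores are all hypothetical; the only difference between the two cases is the denominator.

```python
from math import sqrt

def variability(scores, is_sample):
    """Return (variance, standard deviation).

    Divides SS by N for a population, or by df = n - 1 for a sample.
    """
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)
    denominator = n - 1 if is_sample else n
    variance = ss / denominator
    return variance, sqrt(variance)

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
pop_var, pop_sd = variability(scores, is_sample=False)
samp_var, samp_sd = variability(scores, is_sample=True)
print(pop_var, pop_sd)
print(samp_var, samp_sd)
```

    Python's standard statistics module draws the same distinction with pvariance and pstdev (divide by N) versus variance and stdev (divide by n - 1).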

    NOTATION: To further distinguish the descriptive from the inferential case, English letters will be reserved for inferential statistics. The descriptive case will have Greek letters for notation. This signals that the denominator is an N, and not the df, as that is not required in the descriptive case.

    APPLICATION: While the S and S² require continuous data, the ranges only require ordinal scale. They, therefore, complement measures of central tendency of the same scale. When all the members of the distribution are multiplied by a constant, both the mean and the SS change: the mean is multiplied by the constant and the SS is multiplied by the constant squared. If the members of the distribution all have a constant added to them, the mean shifts by that constant but the SS does not change. This is true in both the descriptive and inferential cases.
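    A short Python sketch can check how the mean and the SS respond to a constant. The scores and the constant c = 3 are illustrative choices.

```python
def mean_and_ss(scores):
    """Return (mean, sum of squares) for a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)
    return mean, ss

scores = [0, 1, 2, 2, 2, 2, 3, 4]   # illustrative scores
c = 3                                # hypothetical constant

base_mean, base_ss = mean_and_ss(scores)
mult_mean, mult_ss = mean_and_ss([x * c for x in scores])
add_mean, add_ss = mean_and_ss([x + c for x in scores])

print(base_mean, base_ss)   # original mean and SS
print(mult_mean, mult_ss)   # mean times c, SS times c squared
print(add_mean, add_ss)     # mean plus c, SS unchanged
```

    Adding a constant slides the whole distribution along the scale without changing its spread, while multiplying stretches both its location and its spread.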

    Consider the data set X = 1, 4, 3, 2, 2, 0, 2, 2

    X     f    cf    c%      X - X̄   Squared   Xf
    4     1    8     100.0     2        4       4
    3     1    7      87.5     1        1       3
    2     4    6      75.0     0        0       8
    1     1    2      25.0    -1        1       1
    0     1    1      12.5    -2        4       0
    ---   --                  --       --      --
    Sum   8                    0       10      16

    Mn = SumX/N = 16/8 = 2

    Mo = 2

    Md = 50th% = 2

    Rg = 4.5 - (-.5) = 5

    IQR = X75% - X25% = 2.5 - 1.5 = 1

    SIQR = IQR/2 = 1/2 = .5

    It sometimes eases interpolation to see the set in a line. In this case, there are 2 integers in each quartile.

    X = 0 1 | 2 2 | 2 2 | 3 4
           25%   50%   75%

    Now that the ordinal measures are complete, consider the continuous solutions:

    SS = Sum (X - X̄) Squared = 10

    Descriptive: σ² = SS/N = 10/8 = 1.25       σ = Square Root of 1.25 = 1.118

    Inferential: S² = SS/df = 10/7 = 1.429     S = Square Root of 1.429 = 1.195

    Notice that the measures of variability based upon SS are always larger in the inferential case. That is the effect that the df has upon them. The estimate of spread based upon a sample has an element of risk, depending upon the degree of similarity of the sample to the population from which it is drawn. The df will inflate that 'best guess' of variability, to 'hedge your bet' and be certain to cover that true variability.
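    The size of that inflation can be sketched in Python using the SS = 10 and N = 8 from the worked example. The sample sizes in the loop are arbitrary illustrations.

```python
from math import sqrt

ss = 10  # sum of squares from the worked example
n = 8    # number of scores

pop_variance = ss / n            # descriptive: divide by N
sample_variance = ss / (n - 1)   # inferential: divide by df = n - 1

print(round(pop_variance, 3), round(sqrt(pop_variance), 3))
print(round(sample_variance, 3), round(sqrt(sample_variance), 3))

# The inflation factor N / (N - 1) shrinks toward 1 as N grows.
for size in (8, 30, 100, 1000):
    print(size, round(size / (size - 1), 4))
```

    With only eight scores the df adjustment raises the variance by about 14 percent; with large samples the two cases are nearly identical.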
