Describing, Exploring, And Comparing Data


  • 7/28/2019 Describing, Exploring, And Comparing Data

    1/61

    DESCRIBING, EXPLORING, AND COMPARING DATA

    APPLIED STATISTICS

    Submitted to : Dr. IMELDA E. CUATEL

    GRADUATE SCHOOL, UNIVERSITY OF LUZON

    Sunday 8:00-12:30

    Prepared by : SAIFULDEEN SINAN

    Introduction to Statistics

    What is Statistics?

    A set of procedures and rules for reducing large masses of data to manageable proportions and for allowing us to draw conclusions from those data.

    Statistics is a branch of mathematics that deals with the effective management and analysis of data.

    What can Stats do?

    Allow us to draw conclusions from the data

    Make data more manageable

    Allow us to do this objectively and quantitatively

    Why Statistics?

    To develop an appreciation for variability and how it affects products and processes.

    To build an appreciation for the advantages and limitations of informed observation and experimentation.

    To determine how to analyze data from designed experiments in order to build knowledge and continuously improve.

    Grouped Frequency Distributions

    A frequency distribution is a table used to organize data. The left column (called classes or groups) lists numerical intervals of the variable being studied. The right column lists the frequencies, or number of observations, for each class.

    Grouped frequency distributions can be used when the range of values in the data set is very large. The data must be grouped into classes that are more than one unit in width.

    Construction of a Frequency Distribution

    Find the highest and lowest value. Find the range.

    Select the number of classes desired.

    Find the width by dividing the range by the number of classes and rounding up.

    Select a starting point (usually the lowest value); add the width repeatedly to get the lower class limits.

    Find the upper class limits.

    Find the boundaries.

    Tally the data, find the frequencies, and find the cumulative frequencies.

    Example

    In a survey of 20 patients who smoked, the following data were obtained. Each value represents the number of cigarettes the patient smoked per day. Construct a frequency distribution using six classes.

    10 8 6 14

    22 13 17 19

    11 9 18 14

    13 12 15 15

    5 11 16 11

    Answer

    Step 1: Find the highest and lowest values: H = 22 and L = 5.

    Step 2: Find the range: R = H - L = 22 - 5 = 17.

    Step 3: Select the number of classes desired. In this case it is equal to 6.

    Step 4: Find the class width by dividing the range by the number of classes. Width = 17/6 = 2.83. This value is rounded up to 3.

    Step 5: Select a starting point for the lowest class limit. For convenience, this value is chosen to be 5, the smallest data value. The lower class limits will be 5, 8, 11, 14, 17 and 20.

    Step 6: The upper class limits will be 7, 10, 13, 16, 19 and 22.

    Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.

    Step 8: Tally the data, write the numerical values for the tallies in the frequency column, and find the cumulative frequencies.

    Class Limits   Class Boundaries   Frequency   Cumulative Frequency
    05 to 07       4.5 - 7.5          2           2
    08 to 10       7.5 - 10.5         3           5
    11 to 13       10.5 - 13.5        6           11
    14 to 16       13.5 - 16.5        5           16
    17 to 19       16.5 - 19.5        3           19
    20 to 22       19.5 - 22.5        1           20

    Note: The dash (-) in the class boundaries means "to".
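The eight steps can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `grouped_frequency` is hypothetical, and the sketch assumes integer data, so that upper class limits sit one unit below the next lower limit. It also assumes the rounded-up width covers the maximum value, which holds for this data set but can fail when the range divides evenly by the number of classes.

```python
import math

# Cigarettes smoked per day by the 20 surveyed patients (from the example).
data = [10, 8, 6, 14, 22, 13, 17, 19, 11, 9,
        18, 14, 13, 12, 15, 15, 5, 11, 16, 11]


def grouped_frequency(values, num_classes):
    """Return rows of (lower limit, upper limit, frequency, cumulative frequency)."""
    low, high = min(values), max(values)
    width = math.ceil((high - low) / num_classes)  # Step 4: round the width up
    rows, cumulative = [], 0
    for k in range(num_classes):
        lower = low + k * width          # Step 5: lower class limits
        upper = lower + width - 1        # Step 6: upper class limits (integer data)
        freq = sum(lower <= v <= upper for v in values)  # Step 8: tally
        cumulative += freq
        rows.append((lower, upper, freq, cumulative))
    return rows


table = grouped_frequency(data, 6)
for lower, upper, freq, cum in table:
    # Step 7: boundaries are the limits moved out by 0.5
    print(f"{lower:2d} to {upper:2d}   {lower - 0.5:4.1f} - {upper + 0.5:4.1f}   {freq:2d}   {cum:2d}")
```

Running it on the survey data should give the same six classes and frequencies as the table in the worked answer.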

    Histogram

    What is a histogram?

    It is "a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies."

    A histogram is like a bar chart, but there are some important differences:

    It can only be used to show continuous data

    It can only be used to show numerical data

    The data is always grouped

    So the width of a bar represents a quantitative variable x, such as age, rather than a category.

    The height of each bar indicates frequency.

    How is a Real Histogram Made?

    Example

    Consider the set below:

    {3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}

    A graph which shows how many ones, how many twos, how many threes, etc. would be meaningless. Instead we bin the data into convenient ranges. In this case, with a bin width of 10, we can easily group the data as below.

    Bin = the class size (width of the rectangles) in a histogram

    SOLUTION

    {3, 11, 12, 19, 22, 23, 24, 25, 27, 29,31, 35, 36, 37, 45, 49}.

    With a bin width of 10:

    Data Range   Frequency
    0-10         1
    10-20        3
    20-30        6
    30-40        4
    40-50        2

    Note: Changing the size of the bin changes the appearance of the graph.
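A short Python sketch of the binning step, offered as an illustration only. The helper name `bin_counts` is hypothetical. The sketch assumes each bin is half-open, so a value equal to a bin edge would fall in the higher bin; the slides do not state this convention, and none of the sixteen values sits on an edge.

```python
from collections import Counter

data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49]


def bin_counts(values, bin_width):
    """Count values per bin; each bin covers [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {start: counts[start] for start in sorted(counts)}


counts_10 = bin_counts(data, 10)
for start, freq in counts_10.items():
    print(f"{start}-{start + 10}: {freq}")

# A different bin width gives a differently shaped histogram.
counts_20 = bin_counts(data, 20)
```

With a width of 10 the counts match the table (1, 3, 6, 4, 2). With a width of 20 the same data collapse into three bars, which is the effect the note describes.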

    Histogram shapes

    Box plot

    A box plot (also referred to as a box and whisker diagram) is a diagram showing statistical distribution.

    A box plot summarizes data using the median, upper and lower quartiles, and the extreme (least and greatest) values. It allows you to see important characteristics of the data at a glance.

    We need 5 numbers, called the five-number summary:

    1. minimum value

    2. Q1

    3. median

    4. Q3

    5. maximum value

    Construction of a Box Plot

    MPG of 4-cylinder cars:

    28 32 42 37
    30 25 44 38
    24 32 33 44
    38 34 30 44
    31 28 31 29
    39 29 32 29

    To make a box plot, organize the data in order from least to greatest:

    24 25 28 28 29 29 29 30 30 31 31 32 32 32 33 34 37 38 38 39 42 44 44 44

    Then find the median of the data. It is 32.

    This divides the data in half.

    The lower half: 24 25 28 28 29 29 29 30 30 31 31 32

    The upper half: 32 32 33 34 37 38 38 39 42 44 44 44

    Find the median of the top half of the data:

    32 32 33 34 37 38 38 39 42 44 44 44

    This is called the high median, upper quartile or quartile 3. Q3 = 38.

    Take the lower half of the data and find the median of it:

    24 25 28 28 29 29 29 30 30 31 31 32

    This is called the low median, or quartile 1. Q1 = 29.

    Next, find the lowest value, 24, and the highest value, 44. Let's organize all 5 pieces of data together so we can see them:

    Lower extreme = 24

    Lower quartile (Q1) = 29

    Median (Q2) = 32

    Upper quartile (Q3) = 38

    Upper extreme (Q4) = 44
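The median-of-halves procedure can be sketched in Python as follows. This is an illustrative sketch: the function name `five_number_summary` is hypothetical. For an odd number of values the sketch leaves the middle value out of both halves, which is one common convention and an assumption here, since the slides only show an even-sized data set.

```python
from statistics import median

# MPG of 4-cylinder cars, in the order given on the slide.
mpg = [28, 32, 42, 37, 30, 25, 44, 38, 24, 32, 33, 44,
       38, 34, 30, 44, 31, 28, 31, 29, 39, 29, 32, 29]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd count, the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


summary = five_number_summary(mpg)
print(summary)
```

For the MPG data the five values should agree with the summary worked out by hand: 24, 29, 32, 38 and 44. Library routines for quartiles may use other interpolation rules and can give slightly different Q1 and Q3.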

    Next, make a number line that will best display the 5 pieces of data (24, 29, 32, 38, 44).

    Place a dot above the number line to show the lower extreme and one for the upper extreme.

    Put a vertical slash above the number line for the median and one each for the lower and upper quartiles.

    20 24 28 32 36 40 44

    Enclose the vertical slashes in a box. Draw a line from the center of the right side of the box to the upper extreme and one from the center of the left side of the box to the lower extreme, forming the whiskers.

    Then add a title. All graphs must have a title that clearly states what the graph is showing.

    [Figure: box plot titled "Miles per Gallon of 4-cylinder Cars"; horizontal axis "Miles per gallon (mpg)", marked 20 24 28 32 36 40 44.]

    OGIVE

    An ogive, sometimes called a cumulative line graph, is a line that connects points that are the cumulative percentage of observations below the upper limit of each class in a cumulative frequency distribution.

    How to Construct Ogives?

    Make a frequency table showing class boundaries and cumulative frequencies.

    For each class, put a dot over the upper class boundary at the height of the cumulative class frequency.

    Place a dot on the horizontal axis at the lower class boundary of the first class.

    Connect the dots.
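As a rough sketch of these steps, the points of an ogive can be computed before any plotting is done. The example below reuses the class boundaries and frequencies from the cigarette survey table earlier in this deck. The function name `ogive_points` is hypothetical, and the plotting itself is left out.

```python
from itertools import accumulate

# Class boundaries and frequencies from the cigarette survey example.
boundaries = [4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5]
frequencies = [2, 3, 6, 5, 3, 1]


def ogive_points(class_boundaries, class_frequencies):
    """Pair each boundary with the cumulative frequency below it.

    The first point sits on the horizontal axis at the lower boundary
    of the first class; the rest sit over the upper class boundaries.
    """
    cumulative = [0, *accumulate(class_frequencies)]
    return list(zip(class_boundaries, cumulative))


points = ogive_points(boundaries, frequencies)

# The same points expressed as cumulative percentages.
total = points[-1][1]
percent_points = [(x, 100 * c / total) for x, c in points]
print(points)
print(percent_points)
```

Connecting the listed points in order with straight line segments gives the ogive.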

    Example

    Draw the x and y axes, then plot the points.

    Pie Chart

    A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.

    How to make a Pie Chart?

    1. Organize your information

    2. Add the data all together and reach a sum

    3. Work out the angle between the two sides of each piece (its share of the sum times 360 degrees)

    4. Use a mathematical compass to draw a circle

    5. Draw the radius

    6. Draw each section division

    7. Color each segment

    Example

    A family's weekly expenditure on its house mortgage, food and fuel is as follows:

    Draw a pie chart to display the information.

    Solution:

    We can find what percentage of the total expenditure each item equals.

    Percentage of weekly expenditure on:

    To draw a pie chart, divide the circle into 100 percentage parts. Then allocate the number of percentage parts required for each item.
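The percentage and angle calculation can be sketched in Python. The expenditure figures in this transcript's example did not survive, so the amounts below are hypothetical placeholders chosen only for illustration, and the function name `pie_slices` is likewise an assumption.

```python
def pie_slices(amounts):
    """Return {label: (percentage of total, sector angle in degrees)}."""
    total = sum(amounts.values())
    return {
        label: (100 * value / total, 360 * value / total)
        for label, value in amounts.items()
    }


# Hypothetical weekly amounts, NOT the figures from the original slide.
expenditure = {"Mortgage": 300, "Food": 225, "Fuel": 75}

slices = pie_slices(expenditure)
for label, (percent, angle) in slices.items():
    print(f"{label}: {percent:.1f}% of the total, {angle:.1f} degrees")
```

Each percentage part of the circle corresponds to 3.6 degrees, so the angles always add up to 360 and the percentages to 100.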

    Measures of Central Tendency (Averages)

    A measure of central tendency is a univariate statistic that indicates, in one manner or another, the average or typical observed value of a variable in a data set.

    Central tendency = values that summarize/represent the majority of scores in a distribution

    Three main measures of central tendency:

    Mean

    Median

    Mode

    Mode

    The mode (or modal value) of a variable in a set of data is the value of the variable that is observed most frequently in that data (or, given a continuous frequency curve, is at the point of greatest density).

    Note: the mode is the value that is observed most frequently, not the frequency itself.

    The mode is defined for every type of variable [i.e., nominal, ordinal, interval, or ratio].

    [Figure: bar chart of Frequency (vertical axis, 0 to 40) against DV (horizontal axis, 1 to 9).]

    Mode = most frequently occurring data point

    Mode = 4 (it has the highest frequency, 15)

    Data Point   Frequency
    0            2
    1            5
    2            7
    3            14
    4            15
    5            8
    6            5

    Median

    Middle-most Value

    50% of observations are above the median, 50% are below it

    The difference in magnitude between the observations does not matter

    Therefore, it is not sensitive to outliers

    Formula: Median location = (n + 1)/2, the position of the median in the ordered data

    Median = the middle number when data are arranged in numerical order

    Data: 3 5 1

    Step 1: Arrange in numerical order

    1 3 5

    Step 2: Pick the middle number (3)

    Data: 3 5 7 11 14 15

    Median = (7 + 11)/2 = 9

    Median

    Median location = (N + 1)/2 = (56 + 1)/2 = 28.5

    Counting through the frequencies, the 28th value is 3 and the 29th value is 4, so:

    Median = (3 + 4)/2 = 3.5

    Data Point   Frequency
    0            2
    1            5
    2            7
    3            14
    4            15
    5            8
    6            5

    Mean

    The mean (or mean value) of a variable in a set of data is the result of adding up all the observed values of the variable and dividing by the number of cases (the average, as the term is most commonly used).

    The mean is defined if and only if the variable is at least interval in nature [i.e., interval or ratio].

    Mean = Average = $\sum X / N$

    $\sum X = 191$ and $N = 56$, so Mean = 191/56 = 3.41

    Data Point (x)   Frequency (f)   $X = f \cdot x$
    0                2               0
    1                5               5
    2                7               14
    3                14              42
    4                15              60
    5                8               40
    6                5               30
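The three averages for this frequency table can be checked with Python's standard `statistics` module. This is an illustrative sketch; expanding the table into a list of 56 individual observations is a convenience chosen here, not a step taken in the slides.

```python
from statistics import mean, median, mode

# Data point -> frequency, as in the table.
frequency_table = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}

# Expand the table into one entry per observation (56 in total).
observations = [value
                for value, freq in frequency_table.items()
                for _ in range(freq)]

n = len(observations)
data_mode = mode(observations)      # most frequent value
data_median = median(observations)  # average of the 28th and 29th values
data_mean = mean(observations)      # sum of all values divided by n

print(n, data_mode, data_median, round(data_mean, 2))
```

The mode is the data point with the largest frequency, the median falls between the 28th and 29th ordered values, and the mean is 191/56.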

    Advantages and Disadvantages of the Measures:

    Median

    1. Also unaffected by extreme scores

    Data: 5 8 11 Median = 8

    Data: 5 8 5 million Median = 8

    2. Usually its value actually occurs in the data

    3. But cannot be entered into equations, because there is no equation that defines it

    4. And not as stable from sample to sample, because it depends on the number of scores in the sample

    Advantages and Disadvantages of the Measures:

    Mean

    1. Defined algebraically

    2. Stable from sample to sample

    3. But usually does not actually occur in the data

    4. And heavily influenced by outliers

    Data: 5 8 11 Mean = 8

    Data: 5 8 5 million Mean = 1,666,671
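The contrast between the two measures can be reproduced with a few lines of Python, using the same two tiny data sets. The variable names are chosen for this sketch only.

```python
from statistics import mean, median

typical = [5, 8, 11]
with_outlier = [5, 8, 5_000_000]  # one extreme score replaces 11

# The median ignores how far the largest value is from the rest.
median_typical = median(typical)
median_outlier = median(with_outlier)

# The mean uses every value, so one extreme score drags it upward.
mean_typical = mean(typical)
mean_outlier = mean(with_outlier)

print(median_typical, median_outlier)
print(mean_typical, mean_outlier)
```

The median stays at 8 in both cases, while the mean jumps from 8 to 1,666,671.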

    Measures of Variation

    A measure of variation describes how spread out or scattered a set of data is. Such measures are also known as measures of dispersion or measures of spread.

    Measures of Variation include:

    1. The range

    2. The Variance

    3. The Standard Deviation

    The standard deviation is just the square root of the variance.

    Range: the difference between the extreme values (max - min). In the literature, the actual values (min, max) are reported more often than the difference.

    Variance: a measure of variation in a sample of data, the mean of the squared deviations of the values from the mean; often referred to as the mean square or MS.

    Standard deviation: the square root of the variance; measures the amount of variation of values around the mean.

    Example

    Heights (in inches) of 5 starting players from basketball team A:

    A: 72, 73, 76, 76, 78

    The range is the difference between the maximum and minimum values of the data set.

    Range of team A: 78 - 72 = 6

    The sample standard deviation takes into account all data values. The following procedure is used to find the sample standard deviation.

    Step 1.

    Find the mean of the data:

    $\bar{x} = \frac{72 + 73 + 76 + 76 + 78}{5} = \frac{375}{5} = 75$

    Step 2.

    Find the deviation of each score from the mean.

    Note that the sum of the deviations is zero:

    $x_i$   $x_i - \bar{x}$
    72      72 - 75 = -3
    73      73 - 75 = -2
    76      76 - 75 = 1
    76      76 - 75 = 1
    78      78 - 75 = 3

    Step 3.

    Square each deviation from the mean. Find the sum of the squared deviations.

    $x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
    72      72 - 75 = -3      9
    73      73 - 75 = -2      4
    76      76 - 75 = 1       1
    76      76 - 75 = 1       1
    78      78 - 75 = 3       9
    Sum     0                 24

    Step 4.

    The sample variance is determined by dividing the sum of the squared deviations by (n - 1), the number of scores minus one.

    For Team A, the sample variance is

    $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{24}{5 - 1} = 6$

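    Step 5.

    The standard deviation is the square root of the variance.

    The mathematical formula for the sample standard deviation is

    $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

    The sample standard deviation for Team A is

    $s = \sqrt{\frac{24}{5 - 1}} = \sqrt{6} \approx 2.45$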
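The five steps translate directly into Python. The sketch below follows them one by one with a hypothetical helper named `sample_variance`, then cross-checks the result against `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
from math import sqrt
from statistics import stdev, variance

heights = [72, 73, 76, 76, 78]  # Team A, in inches


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n                      # Step 1: the mean
    deviations = [x - x_bar for x in values]     # Step 2: deviations from the mean
    squared_sum = sum(d * d for d in deviations) # Step 3: sum of squared deviations
    return squared_sum / (n - 1)                 # Step 4: divide by n - 1


s_squared = sample_variance(heights)
s = sqrt(s_squared)                              # Step 5: square root of the variance

print(s_squared, round(s, 2))
print(variance(heights), round(stdev(heights), 2))  # library cross-check
```

Both routes should give a sample variance of 6 and a sample standard deviation of about 2.45 inches.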

    Measures of Position

    Identify the position of a data value in a data set, using various measures of position such as percentiles and quartiles

    Are used to locate the relative position of a data value in a data set

    Can be used to compare data values from different data sets

    Can be used to compare data values within the same data set

    Can be used to help determine outliers within a data set

    Include z-(standard) scores, percentiles and quartiles

    z-scores

    Also called the standard score

    Represents the number of standard deviations that a data value is from the mean for a specific distribution

    Always round the value to 2 decimal places

    Can be used to compare data values from different data sets by converting raw data to a standardized scale

    Calculation involves the mean and standard deviation of the data set

    Z-score

    Is obtained by subtracting the mean from the given data value and dividing the result by the standard deviation

    The symbol for both population and sample is z

    Can be positive, negative or zero

    A data point can be considered unusual if its z-score is sufficiently large or small

    Formula (sample):

    $z = \frac{x - \bar{x}}{s}$

    Example

    Human body temperatures have a mean of 98.20 degrees and a standard deviation of 0.62 degrees. Find the z-score for temperatures of:

    a. 100 degrees

    b. 97 degrees

    Solution

    a. z = (100 - 98.20)/0.62 = 2.90

    b. z = (97 - 98.20)/0.62 = -1.94

    Significance of Z

    Z-scores above 2 or below -2 are considered to be UNUSUAL.

    Z-scores above 3 or below -3 are considered to be VERY UNUSUAL.

    So:

    The temperature of 100 degrees (z = 2.90) is UNUSUAL.

    The temperature of 97 degrees (z = -1.94) is ordinary.
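A small Python sketch of the z-score calculation and the rule of thumb for unusual values. The function names `z_score` and `describe` are hypothetical, and the thresholds of 2 and 3 are simply the ones stated in the slides.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def describe(z):
    """Label a z-score using the thresholds from the slides."""
    if abs(z) > 3:
        return "very unusual"
    if abs(z) > 2:
        return "unusual"
    return "ordinary"


MEAN_TEMP = 98.20  # mean body temperature, in degrees
STD_TEMP = 0.62    # standard deviation, in degrees

z_100 = round(z_score(100, MEAN_TEMP, STD_TEMP), 2)
z_97 = round(z_score(97, MEAN_TEMP, STD_TEMP), 2)

print(z_100, describe(z_100))
print(z_97, describe(z_97))
```

Rounded to two decimal places, the two z-scores are 2.90 and -1.94, so only the 100-degree reading is labelled unusual.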

    Percentiles

    Are position measures used to indicate the position of an individual in a group

    Divide the data set into 100 (per cent) equal groups

    Used to compare an individual data value with the national norm

    Symbolized by $P_1, P_2, \ldots$

    Percentile rank indicates the percentage of data values that fall below the specified value:

    $\text{Percentile rank} = \frac{B + 0.5E}{n} \times 100$

    where B = number of scores below x, E = number of scores equal to x, and n = number of scores

    A percentile tells the percent of scores that are lower than a given score.

    Example: If Jason graduated 25th out of a class of 150 students, then 125 students were ranked below Jason. Jason's percentile rank would be:

    $\frac{125 + 0.5(1)}{150} \times 100 \approx 83.67$, which rounds to 84

    Jason's standing in the class at the 84th percentile is as high as or higher than 84% of the graduates.
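Jason's percentile rank can be checked with a short Python sketch. It assumes the percentile-rank formula is (B + 0.5E)/n x 100, which is consistent with the definitions of B, E and n given in the slides and with the 84th percentile result, but the formula image itself is not part of this transcript. The function name `percentile_rank` is hypothetical.

```python
def percentile_rank(below, equal, total):
    """Percentile rank, assuming the formula (B + 0.5 * E) / n * 100."""
    return (below + 0.5 * equal) / total * 100


# Jason: 125 students ranked below him, 1 score equal to his own (himself),
# and 150 students in the class.
jason = percentile_rank(below=125, equal=1, total=150)

print(round(jason, 2), round(jason))
```

The unrounded value is about 83.67, which rounds to the 84th percentile.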

    Quartiles

    Quartiles divide the data set into 4 groups, each of which has the same number of members.

    Q1 corresponds to P25

    Q2 corresponds to P50, or the median

    Q3 corresponds to P75

    Q1, Q2 and Q3 divide the ranked scores into four equal parts

    Example

    Find Q1, Q2 and Q3.

    Q2 (Median)

    The median is the average of the 6th and 7th scores.

    Q2 = (80.2 + 82.5)/2 = 81.35

    Q1

    Find the median of the first 6 scores.

    Q1 = (78.6 + 79.2)/2 = 78.9

    Q3

    Find the median of the last 6 scores.

    Q3 = (84.3 + 84.6)/2 = 84.45

    THE END
