(4) Condensation of Data
-
7/30/2019 (4) Condensation of Data
Applied Statistics and Computing Lab
CONDENSATION OF DATA
Applied Statistics and Computing Lab
Indian School of Business
-
Learning goals
Understanding a possible approach to data analysis
Studying three data representation techniques:
  Stem and leaf plot
  Frequency table
  Dot plot
-
Data Analysis
Exploratory
  Cleaning
  Summarization
  Exploration of salient features
    Location
    Variability (spread)
    Concentration
    Shape
      Skewness
      Tail information
Inferential
-
Dataset
The percentage of employees involved in a certain program of worker involvement in decision making, in 30 companies:
(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57, 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)
Arranged in ascending order:
(5, 32, 33, 34, 35, 37, 38, 42, 42, 43, 44, 45, 46, 47, 47, 48, 48, 49, 49, 50, 51, 52, 52, 53, 55, 56, 57, 58, 63, 78)
0 | 5
1 |
2 |
3 | 234578
4 | 223456778899
5 | 012235678
6 | 3
7 | 8
Data taken from Aczel A., Sounderpandian J., Complete Business Statistics
-
Stem and leaf plot
The most basic and easiest method of visualizing data in its original form
A stem and leaf plot displays the actual values of all the data points
Each value is split into a stem and a leaf, separated by |, with the stem on the left side and the leaf on the right side of the vertical line
Which part of a number serves as the stem and which part as the leaf is decided dataset by dataset
For example, for data consisting of 2-digit values, the tens digit may be taken as the stem and the units digit as the leaf, as in the previous diagram
The leaf generally consists of the last (units) digit of a number, and the remaining digits form the stem
The numbers can sometimes be rounded to a particular number of digits, with the last digit taken as the leaf
A common format applies to all the values of a dataset
All the stems must be listed, irrespective of whether any leaf follows or not
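The splitting rule described above can be sketched by hand in R. This is only an illustration, assuming 2-digit data with the tens digit as the stem and the units digit as the leaf; the variable names (involvement, stems, leaves, rows) are chosen here and are not from the slides.

```r
# Percentage of employees involved, from the Dataset slide
involvement <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
                 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

x <- sort(involvement)
stems  <- x %/% 10   # tens digit (integer division)
leaves <- x %% 10    # units digit (remainder)

# List every stem between the smallest and the largest,
# including stems that have no leaves
all_stems <- seq(min(stems), max(stems))
rows <- sapply(all_stems, function(s) {
  paste0(s, " | ", paste(leaves[stems == s], collapse = ""))
})
cat(rows, sep = "\n")
```

Stems 1 and 2 appear with no leaves, which is how the gap between 5 and the rest of the data becomes visible.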
-
Example
GPA of 50 students in the first semester exam for their second course in Quantitative Methods
The GPA range is 0-10
The numbers have 7 digits after the decimal point
Converted into a format with 1 digit after the decimal point
-
Stem and leaf plot (contd.)
The decimal point is at the |

0 | 3446
1 | 1145677
2 | 22224599
3 | 1344488
4 | 3556
5 | 23578899
6 | 4
7 | 14789
8 | 13
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 | 6

Stem 0 represents 4 values: 0.3, 0.4, 0.4, 0.6
Stem 2 represents 8 values: 2.2, 2.2, 2.2, 2.2, 2.4, 2.5, 2.9, 2.9
Stem 6 represents the only value with 6 at its units place: 6.4
For negative values, a negative (-) sign is put in front of the stem
A stem and leaf plot is a powerful tool for studying data
Gives an idea about the distribution of values: their spread and density
Useful in detecting unusual values and the value occurring with the highest frequency
Easy to read and understand
Not very informative if there are too few or too many values
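As a rough sketch of how such a plot helps in detecting unusual values, the values can be rebuilt in R from the stems and leaves shown above and checked against the stated GPA range of 0-10. The leaves are transcribed from the plot as it appears in this transcript; the names leaves and gpa are illustrative.

```r
# Leaves per stem, transcribed from the plot above (stems 9 to 17 are empty)
leaves <- c("0" = "3446", "1" = "1145677", "2" = "22224599", "3" = "1344488",
            "4" = "3556", "5" = "23578899", "6" = "4", "7" = "14789",
            "8" = "13", "18" = "6")

# The decimal point is at the |, so each value is stem + leaf / 10
gpa <- unlist(lapply(names(leaves), function(s) {
  digits <- as.numeric(strsplit(leaves[[s]], "")[[1]])
  as.numeric(s) + digits / 10
}))

length(gpa)      # number of values recovered from the plot
gpa[gpa > 10]    # values outside the stated 0-10 range
stem(gpa)        # R's own stem and leaf plot of the rebuilt values
```

The single value on stem 18 lies outside the stated range, so it stands out both in the plot and in the range check.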
-
Frequency table
A table listing the frequency counts for each value of a variable
A useful tool that gives a basic idea about the data at a quick glance
Very easy to construct and mostly self-explanatory
Can accommodate many types of data, whether categorical or numerical
Both types of numerical data, discrete and continuous, can be represented in a frequency table
-
Cars dataset
Consists of data on 804 used cars in the USA
Data is collected on 12 features, such as the price, make and model of the car, the number of cylinders, the number of doors, etc.
Collected from the Kelley Blue Book
-
Frequency table (contd.)
For the Cars data, let us take a look at various frequency tables:

Car make     Frequency of each make
Buick        80
Cadillac     80
Chevrolet    320
Pontiac      150
SAAB         114
Saturn       60

No. of cylinders    Frequency of cars with corresponding no. of cylinders
4                   394
6                   310
8                   100

Price      Frequency
8638.93    1
8769       1
8870.95    1
9041.91    1
9220.83    1
9482.22    1
9506.05    1
9563.79    1
9654.06    1
9665.85    1
9720.98    1

There are 798 unique prices!
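A minimal sketch of how table() produces such counts. The Cars data file itself is not part of this transcript, so the two vectors below are rebuilt from the frequencies shown above with rep(); they are built independently, and the pairing of make and cylinders within a row is therefore not meaningful.

```r
# Rebuild one entry per car from the counts on the slide
make <- rep(c("Buick", "Cadillac", "Chevrolet", "Pontiac", "SAAB", "Saturn"),
            times = c(80, 80, 320, 150, 114, 60))
cylinders <- rep(c(4, 6, 8), times = c(394, 310, 100))

make_freq <- table(make)       # frequency of each make
cyl_freq  <- table(cylinders)  # frequency of each cylinder count

print(make_freq)
print(cyl_freq)
```

With the real dataset, the same call would be applied directly to the corresponding column.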
-
Frequency table (contd.)
Is there a better way of tabulating the prices?
What if we split the prices into bands and calculate the frequencies?
Would such a table be useful?
The prices of cars range from $8639 to $70760
-
Frequency table for class intervals

Price range         Number of cars
[$8000, $13000)     135
[$13000, $18000)    265
[$18000, $23000)    150
[$23000, $28000)    75
[$28000, $33000)    76
[$33000, $38000)    45
[$38000, $43000)    33
[$43000, $48000)    11
[$48000, $53000)    5
[$53000, $58000)    2
[$58000, $63000)    1
[$63000, $68000)    3
[$68000, $73000)    3
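One way such bands could be computed in R is with cut(), sketched below. The first eleven prices are the ones listed on the earlier price table; the last four are hypothetical values added here only to show how prices on a band boundary are assigned. The names prices, breaks and bands are illustrative.

```r
prices <- c(8638.93, 8769, 8870.95, 9041.91, 9220.83, 9482.22,
            9506.05, 9563.79, 9654.06, 9665.85, 9720.98,  # listed on the slide
            13000, 17999.99, 18000, 70760)                # hypothetical values

# Equal-width bands of $5000 from $8000 to $73000
breaks <- seq(8000, 73000, by = 5000)

# right = FALSE gives intervals of the form [lower, upper),
# so 13000 falls in the second band and 18000 in the third
bands <- cut(prices, breaks = breaks, right = FALSE, dig.lab = 5)

table(bands)
```

Because every band is closed on the left and open on the right, no price can belong to two bands.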
-
Determining class intervals
Each band of prices, or group of values of a variable, is referred to as a class or a class interval
The number of class intervals and the size of each interval are best determined by the researcher or analyst, who has prior knowledge of the behaviour of the variable
Classes must be determined keeping the range of values in mind
Very few, wide class intervals may not be very informative, as most of the information may get hidden within the large intervals
Too many small intervals may capture a detailed picture, but such a table will be sparse and its sheer length may take away its usefulness
As far as possible, class intervals should have equal width, which makes the table easier to understand
-
Determining class intervals (contd.)
The class limits, i.e. the highest and lowest values of a class interval, must be chosen carefully
Classes must be determined such that no value of the dataset can possibly belong to more than one class interval
Two types of brackets are used: closed [] or open ()
A class interval can have one open and one closed bracket
Closed bracket => includes the number on that side of the interval
Open bracket => all numbers up to or starting from, but excluding, the number on that side of the interval
For discrete data, limits of class intervals can easily be determined in a non-overlapping manner
For continuous data, the same limit value can appear in adjacent classes, so the brackets decide which class it belongs to
Interval    Meaning
[1,3]       Includes every number from 1 to 3, including the limits
            e.g. 1, 1.3, 1.8, 2.24, 2.6, 2.98, 2.999999, 3
[1,3)       Includes every number starting from 1 and reaching up to but not including 3
            e.g. 1, 1.01, 1.3, 1.78, 2.4, 2.9, 2.99, 2.999, 2.9999, 2.99999 (there can be as many 9s after the decimal point as desired)
(1,3]       Includes every number starting after 1 (but not 1) and reaching up to and including 3
            e.g. 1.000000000001, 1.0000001, 1.1, 1.24, 1.7, 2.3, 2.69, 2.99, 3 (there can be as many zeroes after the decimal point as desired, as long as the number is greater than 1)
(1,3)       Includes every number in between 1 and 3, excluding 1 and 3
            e.g. 1.0000000000000000001, 1.15, 1.6, 1.92, 2.3, 2.89, 2.99999999999999999999999
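The four interval types can be checked with plain comparisons in R. This is a small sketch; the test values and the column names of the membership table are chosen here for illustration.

```r
x <- c(1, 1.3, 2.999, 3)

membership <- data.frame(
  x            = x,
  closed       = x >= 1 & x <= 3,  # [1,3]
  left_closed  = x >= 1 & x <  3,  # [1,3)
  right_closed = x >  1 & x <= 3,  # (1,3]
  open         = x >  1 & x <  3   # (1,3)
)

print(membership)
```

Only the limits 1 and 3 change membership from one interval type to another; the interior values belong to all four.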
-
Dot plot
A simple tool to depict the frequencies of values in a dataset
The X-axis denotes the value and the corresponding frequency is denoted on the Y-axis
Gives an idea about the distribution of values
Indicates the intervals within which the variable may not take any values
The value with the highest frequency is easily determined
To create a dot plot in R, the variable has to be numeric
In the case of a categorical variable or a variable with class intervals, an equivalent variable assigning a numeric value to each category or class must be created
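A hedged sketch of both cases follows. It uses base R's stripchart() with method = "stack" as a stand-in for a dot plot function, which is a substitution made here and not necessarily the function used for the slides. The small make vector is a hypothetical sample, and the numeric codes come from factor(), which numbers the categories in alphabetical order.

```r
# Numeric variable: percentages from the Dataset slide
involvement <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
                 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Repeated values are stacked, so the height of a stack is the frequency
stripchart(involvement, method = "stack", pch = 16,
           xlab = "Percentage of employees involved")

# Categorical variable: create an equivalent numeric variable first
make <- c("Buick", "Buick", "Buick", "Cadillac", "Cadillac", "Saturn")  # hypothetical sample
make_code <- as.integer(factor(make))  # Buick = 1, Cadillac = 2, Saturn = 3

stripchart(make_code, method = "stack", pch = 16,
           xlab = "Car make (numeric code)")
```

The numeric codes carry no meaning beyond identifying the category, so the axis should be read as labels, not as quantities.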
-
Comparison
Stem and leaf plot    Frequency table    Dot plot
Discrete data
Continuous data
  Constructing class intervals can be useful
  Need to create class intervals
Categorical data
Advantages
  Depicts actual values
  Can detect unusual observations
  Most informative with large data
  Best depiction if there are many values but only a few of them have a high frequency
Disadvantages
  Not very informative for a large dataset
  Gives less information than a stem and leaf plot
-
Height is in cms.

Height (in cms.)    Frequency
147.2               1
149.5               1
149.9               1
151.1               1
198.1               1

Out of the 507 total data points, 147 have unique height values
Clearly, for this continuous data we need to make class intervals!

Height (in cms.)    Frequency
[146, 152)          5
[152, 158)          31
[158, 164)          92
[164, 170)          101
[170, 176)          118
[176, 182)          92
[182, 188)          40
[188, 194)          26
[194, 200)          2

In this case a stem plot is not at all a good idea. Most importantly, for this variable, we do not need to know the exact values. Knowing the range within which they lie might be sufficient
-
Conclusion
Easy to construct
Important tools to get a feel of the data!
Must use the appropriate representation based on the characteristics of the data
Helpful in determining the further course of data analysis
-
R-codes
Function              R-code
Stem and leaf plot    stem(variable name)
                      Note: scale is an important parameter to explore in R's stem function
Frequency table       table(variable name)
Dot plot              install.packages("TeachingDemos")
                      library(TeachingDemos)
                      dots(variable name)
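A short sketch of exploring the scale parameter of stem(), using the percentages from the Dataset slide. In R's documentation, scale controls the length of the plot; the exact choice of stems for a given scale is left to R and is not asserted here.

```r
involvement <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
                 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

stem(involvement)             # default, scale = 1
stem(involvement, scale = 2)  # a larger scale asks for a longer plot
```

Comparing the two outputs side by side is a quick way to decide which level of detail suits the data.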
-
Thank you