![Page 1: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to](https://reader036.fdocuments.in/reader036/viewer/2022063012/5fc813ad3179bd350a462ef9/html5/thumbnails/1.jpg)
Visualization and Exploration
January 25, 2016
Visualization
Reasons to use visualization:
- to find problems with the data
- to explore dependencies and features
- to present results
General guidelines:
- Display as much information as possible with the least amount of effort required from the viewer to get it.
- Clarity is paramount: make the data stand out
  - avoid overusing colours, shapes, patterns
  - avoid distracting elements that don’t add value, e.g. grid lines, background colours
  - use the right aspect ratio
- Visualization is an iterative process
Visualization in R
Standard graphing capabilities in R are provided by the graphics package.
Package lattice improves on this by adding easy display of multivariate and conditional relationships. It is an implementation of the trellis project: http://ect.bell-labs.com/sl/project/trellis/
See chapters 3 and 4 in "Using R for Data Analysis and Graphics" for an introduction and examples: http://cran.r-project.org/doc/contrib/usingR.pdf
Also try:
demo(graphics)
demo(lattice)
Visualization in R
Package ggplot2 is the new kid on the block. It implements The Grammar of Graphics by Leland Wilkinson: https://www.springer.com/statistics/computational+statistics/book/978-0-387-24544-7
Documentation at http://docs.ggplot2.org/current/
- In ggplot2 graphs are defined on data frames.
- Graphs are produced by adding layers and transformations.
- Data are displayed using aesthetics, such as position, colour, size, shape.
Package ggplot2
Some graph elements in ggplot2:
- geom: geometric objects define the type of plot
- stat: statistical transformations
- facet: displays subsets of the data in different panels, allowing for visualization of conditional relationships.
Use the ggplot function to create a graph object and add layers with the + operator.
Use the qplot function for a simplified interface to ggplot2.
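As a rough illustration of the layered interface, the sketch below builds a scatter plot of the built-in mtcars data with ggplot() and the + operator. The choice of variables and labels is only an example, and it assumes the ggplot2 package is installed. Note that qplot() has been marked as deprecated in recent ggplot2 releases, so the explicit ggplot() form is probably the safer habit.

```r
library(ggplot2)

# Start from a data frame and map columns to aesthetics.
p <- ggplot(mtcars, aes(x = mpg, y = disp, colour = factor(cyl))) +
  geom_point() +                      # geom layer: draw one point per car
  labs(title = "Engine displacement vs MPG",
       x = "MPG",
       y = "Engine displacement (cu. in.)",
       colour = "Cylinders") +        # axis, title and legend labels
  theme_bw()                          # black-and-white theme

print(p)
```

Each `+` adds one element (a geom layer, labels, a theme) to the plot object, which is only drawn when it is printed.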
Example
qplot(mpg, disp, data=mtcars, colour=factor(cyl),
    main="Engine displacement vs MPG", xlab="MPG",
    ylab="Engine displacement (cb.in)")
[Figure: scatter plot titled "Engine displacement vs MPG"; x-axis MPG (10 to 35), y-axis Engine displacement (cb.in) (100 to 400); points coloured by factor(cyl) with levels 4, 6, 8]
Example
qplot(mpg, disp, data=mtcars, colour=factor(cyl),
    main="Engine displacement vs MPG", xlab="MPG",
    ylab="Engine displacement (cb.in)") +
    theme_bw() + labs(colour="Cylinders")
[Figure: the same scatter plot with the black-and-white theme; x-axis MPG, y-axis Engine displacement (cb.in); legend titled "Cylinders" with levels 4, 6, 8]
Dataset for examples
A copy of the file is available on the course webpage.
custdata <- read.table("custdata.tsv", header=TRUE, sep="\t")
The business objective is to predict whether a customer has health insurance. This synthetic dataset contains customer information for customers whose health insurance status is known.
Spot problems
Missing values
> dim(custdata)
[1] 1000   11
> mv <- colSums(is.na(custdata))
> cbind(mv)   # cbind to display as column
              mv
custid         0
sex            0
is.employed  328
income         0
marital.stat   0
health.ins     0
housing.type  56
recent.move   56
num.vehicles  56
age            0
state.of.res   0
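The same check can be tried on a small made-up data frame. The sketch below uses hypothetical toy data (not the course's custdata.tsv) to show how colSums(is.na(...)) counts missing values per column, and adds the share of rows affected.

```r
# Toy stand-in for custdata; the values are invented for illustration.
toy <- data.frame(
  custid       = 1:6,
  is.employed  = c(TRUE, NA, FALSE, NA, TRUE, NA),
  income       = c(22000, 35000, 0, 61000, 18000, 47000),
  num.vehicles = c(1, 2, NA, 0, 3, 2)
)

# Count of missing values in each column.
mv <- colSums(is.na(toy))

# Share of rows with a missing value, per column.
mv.share <- colMeans(is.na(toy))

print(cbind(mv, mv.share))
```

A column with a large share of missing values (such as is.employed in the real data) usually deserves a closer look before modelling.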
Spot problems
Values out of range
> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
> summary(custdata$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    38.0    50.0    51.7    64.0   146.7
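The summaries show that something is off (a negative income, ages of 0 and 146.7); a simple logical filter can then list the offending rows. The sketch below uses invented toy values, and the plausibility limits (income not negative, age between 1 and 110) are assumptions chosen for illustration, not rules from the course.

```r
# Toy data with a few implausible entries (invented values).
toy <- data.frame(
  custid = 1:6,
  income = c(-8700, 14600, 35000, 67000, 615000, 22000),
  age    = c(34, 0, 50, 146.7, 64, 38)
)

# Assumed plausibility limits; adjust them to the problem at hand.
bad.income <- toy$income < 0
bad.age    <- toy$age < 1 | toy$age > 110

# Rows that violate at least one limit.
suspect <- toy[bad.income | bad.age, ]
print(suspect)
```

Whether such rows are errors, sentinel codes for "unknown", or genuine extremes is a question for whoever produced the data.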
Values out of range
> qplot(age, data=custdata)
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
[Figure: histogram of age with the default bin width; x-axis age (0 to 150), y-axis count (0 to 90)]
Values out of range
qplot(age, data=custdata, binwidth=5)
[Figure: histogram of age with binwidth=5; x-axis age (0 to 150), y-axis count (0 to 100)]
Values out of range
qplot(income, data=custdata, binwidth=10000)
[Figure: histogram of income with binwidth=10000; x-axis income (0e+00 to 6e+05), y-axis count (0 to 150)]
Values out of range
library(scales)
qplot(income, data=custdata, binwidth=10000) +
    scale_x_continuous(labels=dollar)
[Figure: the same histogram of income with the x-axis labelled in dollars ($0 to $600,000); y-axis count (0 to 150)]
Values out of range
qplot(income, data=custdata, binwidth=5000) +
    scale_x_continuous(labels=dollar)
[Figure: histogram of income with binwidth=5000; x-axis income ($0 to $600,000), y-axis count (0 to 120)]
- qplot automatically selects the type of graph from the number and type of arguments
- for a single numerical variable the default is a histogram
- the same plot can be produced with the following commands
ggplot(custdata) +
    geom_histogram(aes(x=income), binwidth=5000) +
    scale_x_continuous(labels=dollar)
Logarithmic scale
- Use a logarithmic scale for variables where percent change is more important than change in value.
- Use a logarithmic scale when the data span a wide range, e.g. multiple orders of magnitude.
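A small numeric check makes the first point concrete: on a log10 scale, equal percent changes map to equal distances, whatever the starting value. The income figures below are made up for illustration.

```r
# Two incomes that each double (a 100% change), at very different levels.
low  <- c(10000, 20000)
high <- c(100000, 200000)

# On the original scale the two steps differ by a factor of ten ...
step.linear <- c(diff(low), diff(high))

# ... but on the log10 scale both steps have the same length, log10(2).
step.log <- c(diff(log10(low)), diff(log10(high)))

print(step.linear)
print(step.log)
```

This is why a log axis suits quantities such as income, where going from $10,000 to $20,000 matters about as much as going from $100,000 to $200,000.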
Logarithmic histogram
custdata2 <- subset(custdata, income > 0)
qplot(income, data=custdata2, binwidth=5000) +
    scale_x_log10(breaks=10^(1:6), labels=dollar)
[Figure: histogram of income on a log10 x-axis ($10 to $1,000,000) with binwidth=5000; y-axis count (0 to 750)]
Logarithmic histogram
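On a logarithmic axis the binwidth is measured in log10 units, so it should be chosen as a percent change, not a dollar amount.
qplot(income, data=custdata2, binwidth=0.05) +
    scale_x_log10(breaks=10^(1:6), labels=dollar)
[Figure: histogram of income on a log10 x-axis ($100 to $1,000,000) with binwidth=0.05; y-axis count (0 to 40)]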
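To see what binwidth=0.05 means on a log10 axis: the bins are laid out in log10 units, so each bin covers a constant ratio rather than a constant dollar range. The arithmetic below is a sketch of that reading; the bin starting points are arbitrary examples.

```r
binwidth <- 0.05                 # bin width in log10 units

# Each bin's upper edge is this factor times its lower edge (about 1.122).
ratio <- 10^binwidth

# Dollar width of a bin starting at different incomes (example values).
lower <- c(1000, 10000, 100000)
upper <- lower * ratio
width.dollars <- upper - lower

print(round(ratio, 3))
print(round(width.dollars))
```

Every bin is roughly 12% wide relative to its lower edge, so its dollar width grows with income while its width on the log axis stays fixed.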
Density plots
qplot(income, data=custdata2, geom="density") +
    scale_x_log10(breaks=10^(1:6), labels=dollar)
[Figure: density plot of income on a log10 x-axis ($100 to $1,000,000); y-axis density (0.00 to 1.00)]
Histogram vs density
- Both apply to continuous variables.
- Both give an idea of the underlying probability distribution.
- Two histograms of the same data may look very different with different bin widths, so choosing a good bin width is important.
- A density plot is a "continuous histogram". It plots an estimate of the probability density function.
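The effect of bin width, and the sense in which a density plot is a smoothed histogram, can be tried out in base R. The sketch below uses simulated lognormal data as a stand-in for income (the parameters are arbitrary assumptions); hist() and density() are base R functions used here in place of the ggplot2 geoms.

```r
set.seed(1)
# Simulated stand-in for a skewed variable such as income, on a log10 scale.
x <- log10(rlnorm(500, meanlog = 10, sdlog = 1))

# The same data binned with two different bin widths.
edges.wide   <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
edges.narrow <- seq(floor(min(x)), ceiling(max(x)), by = 0.05)
h.wide   <- hist(x, breaks = edges.wide, plot = FALSE)
h.narrow <- hist(x, breaks = edges.narrow, plot = FALSE)

# Kernel density estimate: a smooth curve whose area is approximately 1.
d <- density(x)
area <- sum(d$y) * diff(d$x[1:2])

print(length(h.wide$counts))     # number of wide bins
print(length(h.narrow$counts))   # number of narrow bins
print(area)
```

Both histograms account for all 500 observations, but the narrow one spreads them over ten times as many bins and looks much noisier; the density estimate avoids the choice of bin edges at the cost of choosing a bandwidth.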
Bar charts
A bar chart is a histogram for a categorical variable. It is the default geometry in qplot for factor and logical variables.
qplot(marital.stat, data=custdata)
[Figure: bar chart of count by marital.stat (Divorced/Separated, Married, Never Married, Widowed); y-axis count (0 to 500)]
Bar charts
qplot(state.of.res, data=custdata)
[Figure: bar chart of count by state.of.res with vertical bars; the 50 state names (Alabama to Wyoming) overlap along the x-axis; y-axis count (0 to 100)]
What a mess!
Bar charts
qplot(state.of.res, data=custdata) + coord_flip()
[Figure: the same bar chart with horizontal bars; state names (Alabama to Wyoming) along the vertical axis, count (0 to 100) along the horizontal axis; the labels still crowd each other]
Better! When you have more than a few categories, use horizontal bars!
Bar charts
qplot(state.of.res, data=custdata) + coord_flip() +
    theme(axis.text.y=element_text(size=rel(0.6)))
[Figure: horizontal bar chart of count by state.of.res with smaller state labels (Alabama to Wyoming); count axis 0 to 100]
Better yet! The labels are small, but at least they don’t overlap.
Sorted bar chart
custdata <- transform(custdata,
    state.of.res.ord=reorder(state.of.res, state.of.res, length))
qplot(state.of.res.ord, data=custdata) + coord_flip() +
    theme(axis.text.y=element_text(size=rel(0.6)))
[Figure: horizontal bar chart of count by state.of.res.ord, with states sorted by count from Delaware (fewest) to California (most); count axis 0 to 100]
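How reorder() produces the sorted levels may be easier to see on a tiny example. In the sketch below (toy values, not the course data) the factor is passed both as the variable to reorder and as the values to summarize, so length() returns the number of observations per level.

```r
# Toy factor: "x" occurs three times, "y" once, "z" twice.
f <- factor(c("x", "x", "x", "y", "z", "z"))

# Default levels are alphabetical.
print(levels(f))

# Reorder the levels by how often each one occurs (ascending).
f.ord <- reorder(f, f, length)
print(levels(f.ord))

# The counts themselves, in the new level order.
print(table(f.ord))
```

Since ggplot2 draws bars in level order, changing the level order is all that is needed to sort the chart.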
Aside: reorder a factor
Let’s reorder states by average number of vehicles per customer.
state.by.num.vehicles <- reorder(custdata$state.of.res,
    custdata$num.vehicles, mean, na.rm=TRUE)
What is the average number of vehicles per customer in each state?
In Alabama:
with(custdata, mean(
    num.vehicles[state.of.res=="Alabama"], na.rm=TRUE))
Repeat for each of the 50 states. There has to be a better way!
Aside
Using base R
# split
pieces <- split(custdata, custdata$state.of.res)
# apply
result <- lapply(pieces, function(p) data.frame(
    state.of.res=p$state.of.res[[1]],
    state.avg.vehicles=mean(p$num.vehicles, na.rm=TRUE)
))
# combine
result <- do.call("rbind", result)
Aside
Package plyr implements the split-apply-combine framework very neatly in a single function call.
library(plyr)
result <- ddply(
    custdata,           # dataframe
    "state.of.res",     # split-by variables
    summarize,          # function to apply to each piece
    # function arguments
    state.avg.vehicles=mean(num.vehicles, na.rm=TRUE)
)
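For a single summary statistic per group, base R's tapply() offers another compact route to the same numbers without an extra package. This is a sketch on invented toy data, offered as an alternative rather than as the course's method.

```r
# Toy stand-in for custdata (invented values).
toy <- data.frame(
  state.of.res = c("Alabama", "Alabama", "Alaska", "Alaska", "Alaska"),
  num.vehicles = c(2, NA, 1, 3, 5)
)

# Mean number of vehicles per state, ignoring missing values.
avg <- with(toy, tapply(num.vehicles, state.of.res, mean, na.rm = TRUE))
print(avg)

# The same result arranged as a data frame, one row per state.
result <- data.frame(state.of.res = names(avg),
                     state.avg.vehicles = as.vector(avg))
print(result)
```

tapply() does the split, apply and combine steps in one call, but returns a named vector, so the last step rebuilds a data frame by hand.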
Single variable
To summarize visualization of a single variable:
- For a numerical variable use a histogram or density plot to look for outliers or incorrect values.
- Also get a feel for the distribution: is it symmetric, normal, lognormal?
- For categorical variables use a bar chart to compare frequencies of categories.
Scatter plot
custdata2 <- with(custdata,
    subset(custdata, age>0 & age < 100 & income > 0))
qplot(age, income, data=custdata2) +
    scale_y_continuous(labels=dollar)
[Figure: scatter plot of income against age; x-axis age (20 to 80), y-axis income ($0 to $600,000)]
Scatter plot
qplot(age, income, data=custdata2, colour=health.ins) +
    scale_y_continuous(labels=dollar)
[Figure: scatter plot of income against age with points coloured by health.ins (FALSE, TRUE); x-axis age (20 to 80), y-axis income ($0 to $600,000)]
2D histogram
qplot(age, income, data=custdata2, geom="bin2d") +
    scale_y_continuous(labels=dollar)
[Figure: 2D histogram of income against age with rectangular bins; x-axis age (20 to 80), y-axis income ($0 to $600,000); fill colour shows count (4 to 16)]
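What a 2D histogram counts can be reproduced with cut() and table() in base R: each observation is assigned to a rectangular cell, and the cell counts are what the fill colour encodes. The data below are simulated, and the bin edges are arbitrary assumptions for illustration.

```r
set.seed(42)
# Simulated stand-ins for age and income.
age    <- runif(200, min = 20, max = 80)
income <- rlnorm(200, meanlog = 10, sdlog = 1)

# Assign each observation to an age bin and an income bin.
age.bin    <- cut(age, breaks = seq(20, 80, by = 20), include.lowest = TRUE)
income.bin <- cut(income, breaks = c(0, 25000, 50000, 100000, Inf))

# Cross-tabulate: one count per rectangular cell.
counts <- table(age.bin, income.bin)
print(counts)
```

The plotted version uses many more, equally sized cells, but the principle is the same: the colour of a cell stands for the number of points that fall into it.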
2D histogram
library(hexbin)
qplot(age, income, data=custdata2, geom="hex") +
    scale_y_continuous(labels=dollar)
[Figure: 2D histogram of income against age with hexagonal bins; x-axis age (25 to 75), y-axis income ($0 to $600,000); fill colour shows count (5 to 15)]
2D histogram
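library(hexbin)
ggplot(custdata2, aes(x=age, y=income)) +
    geom_hex(binwidth=c(5, 10000)) +
    geom_smooth(color="white", se=FALSE) +
    ylim(0,200000)
[Figure: hexagonal 2D histogram of income against age with a white smoothing curve; x-axis age (25 to 75), y-axis income (0 to 200000); fill colour shows count (4 to 16)]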
Scatter plot
Also works for continuous vs. categorical.
qplot(age, health.ins, data=custdata2)
[Figure: scatter plot of health.ins (FALSE, TRUE) against age (20 to 80); the points fall on two horizontal lines and overlap heavily]
Scatter plot
This is better: it gives a better feel for the density at each level.
qplot(age, health.ins, data=custdata2,
    position=position_jitter(height=0.2))
[Figure: the same scatter plot of health.ins (FALSE, TRUE) against age (20 to 80) with vertical jitter, so each level appears as a band of points]
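Jittering simply adds a small random offset to each point so that overlapping points separate. A rough base R sketch of the idea follows; the data are made up, and the uniform noise of at most 0.2 in either direction is meant to mimic height=0.2, on the assumption that position_jitter() spreads points this way.

```r
set.seed(7)
# Made-up logical variable, coded as 0/1 for plotting.
health.ins <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
y <- as.numeric(health.ins)

# Add uniform noise of at most 0.2 up or down to each point.
height <- 0.2
y.jit <- y + runif(length(y), min = -height, max = height)

# Points stay close to their level but no longer coincide.
print(round(y.jit, 2))
```

Keeping the offset well below half the distance between levels ensures that every point can still be attributed to its level.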
Scatter plot
qplot(age, health.ins, data=custdata2, color=log10(income),
    position=position_jitter(height=0.2))
[Figure: jittered scatter plot of health.ins (FALSE, TRUE) against age (20 to 80) with points coloured by log10(income) (2 to 5)]
Bar chart for two variables
Use the fill aesthetic as the second variable
qplot(marital.stat, data=custdata2, fill=health.ins)
[Figure: stacked bar chart of count by marital.stat (Divorced/Separated, Married, Never Married, Widowed), bars filled by health.ins (FALSE, TRUE); y-axis count (0 to 400)]
Bar chart for two variables
Some prefer side-by-side
ggplot(custdata2) +
    geom_bar(aes(marital.stat, fill=health.ins),
        position="dodge")
[Figure: side-by-side bar chart of count by marital.stat, one bar per value of health.ins (FALSE, TRUE); y-axis count (0 to 400)]
Bar chart for two variables
Filled bar chart shows the proportion of insured within each level of marital status.
ggplot(custdata2) +
    geom_bar(aes(marital.stat, fill=health.ins),
        position="fill")
[Figure: filled bar chart by marital.stat, each bar split by health.ins (FALSE, TRUE); the y-axis is labelled count but shows proportions (0.00 to 1.00)]
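The proportions that a filled bar chart displays can also be computed directly with table() and prop.table(). The sketch below uses a handful of invented records, not the course data.

```r
# Invented records for illustration.
toy <- data.frame(
  marital.stat = c("Married", "Married", "Married", "Married",
                   "Widowed", "Widowed", "Never Married", "Never Married"),
  health.ins   = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Counts of insured / uninsured within each marital status.
counts <- table(toy$marital.stat, toy$health.ins)

# Row-wise proportions: each row sums to 1, like each filled bar.
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

As with the filled bars, the proportions hide how many customers each level contains, so it is worth keeping the raw counts in view as well.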
Bar chart for two variables
Add a cloud of points to convey the size of each level.
last_plot() + geom_point(aes(x=marital.stat, y=-0.05),
    position=position_jitter(height=0.02), size=0.75, alpha=0.75)
[Figure: the filled bar chart by marital.stat with a jittered cloud of points drawn below each bar; legend health.ins (FALSE, TRUE); y-axis 0.00 to 1.00]
Bar chart for two variables
More than two levels
ggplot(custdata2) +
    geom_bar(aes(housing.type, fill=marital.stat),
        position="dodge")
[Figure: side-by-side bar chart of count by housing.type (Homeowner free and clear, Homeowner with mortgage/loan, Occupied with no rent, Rented, NA), bars filled by marital.stat (Divorced/Separated, Married, Never Married, Widowed); the x-axis labels overlap; y-axis count (0 to 200)]
Bar chart for two variables
Remove NA from housing.type and fix labels
ggplot(subset(custdata2, !is.na(housing.type))) +
    geom_bar(aes(housing.type, fill=marital.stat),
        position="dodge") +
    theme(axis.text.x=element_text(angle=15))
[Figure: the same side-by-side bar chart without the NA category and with the housing.type labels rotated by 15 degrees; legend marital.stat (Divorced/Separated, Married, Never Married, Widowed); y-axis count (0 to 200)]
Bar chart for two variables
Use faceting instead of fill to get a better picture.
ggplot(subset(custdata2, !is.na(housing.type))) +
    geom_bar(aes(marital.stat)) +
    facet_wrap(~housing.type, scales="free_y") +
    theme(axis.text.x=element_text(size=rel(0.8)))
[Figure: four panels, one per housing.type (Homeowner free and clear, Homeowner with mortgage/loan, Occupied with no rent, Rented), each a bar chart of count by marital.stat with its own y-axis scale]
Visualization with R
Further readings
- A short course on ggplot2 by Hadley Wickham: http://courses.had.co.nz/11-rice/