A picture speaks a thousand words - Data Visualisation with R


Page 1: A picture speaks a thousand words - Data Visualisation with R

A picture speaks a thousand words

Data Visualisation with R

Barbara Fusinska

@BasiaFusinska

Page 2: A picture speaks a thousand words - Data Visualisation with R

About me

Programmer

Machine Learning

Data Solutions Architect

@BasiaFusinska

Page 3: A picture speaks a thousand words - Data Visualisation with R

Agenda

• Exploratory Data Analysis
• Elements of EDA
• Visual artifacts
• R Visualisation ecosystem
• Base/Lattice/ggplot2 comparison
• Layers in ggplot2
• Interesting visualisations

Page 4: A picture speaks a thousand words - Data Visualisation with R

https://github.com/BasiaFusinska

https://katacoda.com/BasiaFusinska

Page 5: A picture speaks a thousand words - Data Visualisation with R

Exploratory Data Analysis (EDA) is an approach to analysing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.

Page 6: A picture speaks a thousand words - Data Visualisation with R

Why do we need visualisations?

Insight Impress

Page 7: A picture speaks a thousand words - Data Visualisation with R

Use Case - Online Learning Platform

Entities: User, Area, Vendor, Course, CourseTaken

Areas: Cloud (25%), Data Science (50%), Web (15%), Software Engineering (10%)

Vendors: Software Mind (20%), Cloud Solutions (3%), InfraNet (12%), DataLearn (7%), WWW Way (11%), Soft Skills (4%), Edu Zen (10%), Data Foundation (25%), Learning Island (5%), Design Your Way (3%)

Years: 2014, 2015, 2016

Prices: 10$ (25%), 19$ (15%), 49$ (20%), 99$ (20%), 250$ (15%), 500$ (5%)

Page 8: A picture speaks a thousand words - Data Visualisation with R

courses.aggregate

Name                 Area                  Vendor           Year  Month  Price [$]
Perez, Lisa          Data Science          Data Foundation  2015  7      99
Tran, Janiro         Software Engineering  DataLearn        2016  2      10
Bajwa, John          Cloud                 InfraNet         2015  9      250
Lindsey, Aaron       Web                   Software Mind    2014  6      19
Cooper, Duncan       Software Engineering  Learning Island  2014  7      250
Grumbach, Alexander  Web                   Design Your Way  2015  2      99
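The slides never show how courses.aggregate is loaded, so the later snippets cannot be run as they stand. A minimal sketch of a synthetic stand-in follows, assuming the percentages on the use-case slide are sampling probabilities and that all attributes are drawn independently. The user names (User1 to User500), the seed and the uniform spread over years and months are hypothetical. The row count of 5000 matches the sum of the area counts reported in the slides.

```r
# Synthetic stand-in for courses.aggregate (assumption: the real data set
# is not part of the slides; every column is sampled independently).
set.seed(2016)
n <- 5000  # the area counts shown in the slides add up to 5000

# Sampling probabilities taken from the use-case slide
areas   <- c("Cloud" = 0.25, "Data Science" = 0.50,
             "Web" = 0.15, "Software Engineering" = 0.10)
vendors <- c("Software Mind" = 0.20, "Cloud Solutions" = 0.03,
             "InfraNet" = 0.12, "DataLearn" = 0.07, "WWW Way" = 0.11,
             "Soft Skills" = 0.04, "Edu Zen" = 0.10,
             "Data Foundation" = 0.25, "Learning Island" = 0.05,
             "Design Your Way" = 0.03)
prices  <- c("10" = 0.25, "19" = 0.15, "49" = 0.20,
             "99" = 0.20, "250" = 0.15, "500" = 0.05)

courses.aggregate <- data.frame(
  name   = paste0("User", sample(1:500, n, replace = TRUE)),  # hypothetical names
  area   = factor(sample(names(areas), n, replace = TRUE, prob = areas)),
  vendor = factor(sample(names(vendors), n, replace = TRUE, prob = vendors)),
  year   = sample(2014:2016, n, replace = TRUE),
  month  = sample(1:12, n, replace = TRUE),
  price  = as.numeric(sample(names(prices), n, replace = TRUE, prob = prices))
)

head(courses.aggregate)
```

Because the columns are independent here, the plots will not reproduce the vendor/area structure of the original data; the sketch is only meant to make the code on the following slides executable.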

Page 9: A picture speaks a thousand words - Data Visualisation with R
Page 10: A picture speaks a thousand words - Data Visualisation with R

Categorical data - count occurrences

# Count occurrences
courses.areas <- table(courses.aggregate$area)

Cloud  Data Science  Software Engineering  Web
693    2271          462                   1574

Page 11: A picture speaks a thousand words - Data Visualisation with R

Bar plot – Number of courses taken by Area

# Draw the plot
barplot(courses.areas,
    ylab="Count", main="Areas")

Page 12: A picture speaks a thousand words - Data Visualisation with R

Categorical data - count occurrences

# Count occurrences
vendor.area <- table(data.frame(
    courses.aggregate$area,
    courses.aggregate$vendor))

              CSol  DataF  DataL  DesYW  EZen  …
Cloud         0     263    49     28     0
Data Science  91    636    90     0      192
Software      0     44     83     95     0
Web           0     267    207    0      158
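How table() turns two categorical columns into a contingency table can be checked on a handful of rows. The sketch below uses a made-up six-row data frame (the rows are hypothetical, only the area and vendor names come from the slides) and adds row and column totals with addmargins().

```r
# Tiny hypothetical sample: one row per course taken
df <- data.frame(
  area   = c("Cloud", "Cloud", "Web", "Web", "Web", "Data Science"),
  vendor = c("InfraNet", "InfraNet", "WWW Way", "DataLearn",
             "WWW Way", "DataLearn")
)

# Rows are areas, columns are vendors, cells are occurrence counts
vendor.area <- table(df$area, df$vendor)
vendor.area

# Same table with row and column totals appended
addmargins(vendor.area)
```

Every cell counts the rows that share that area/vendor pair, which is exactly what the stacked bar plots consume.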

Page 13: A picture speaks a thousand words - Data Visualisation with R

Stacked Bar plot – Areas by Vendors

# Draw the plot
barplot(vendor.area, ylab="Count",
    main="Areas by Vendor", col=rainbow(4))
legend("topright", fill=rainbow(4),
    legend=row.names(vendor.area))

Page 14: A picture speaks a thousand words - Data Visualisation with R

Stacked Beside Bar plot – Areas by Year

# Count occurrences
areas.year <- table(data.frame(
    courses.aggregate$area,
    courses.aggregate$year))

# Draw the plot
barplot(areas.year, ylab="Count",
    main="Areas By Year", col=rainbow(4), beside=TRUE)
legend("topleft", fill=rainbow(4),
    legend=row.names(areas.year))

Page 15: A picture speaks a thousand words - Data Visualisation with R

Stacked Bar plot – Areas by Year

# Draw the plot
barplot(areas.year, ylab="Count",
    main="Areas by year", col=rainbow(4))
legend("topright",
    legend=row.names(areas.year), fill=rainbow(4))

Page 16: A picture speaks a thousand words - Data Visualisation with R

100% Stacked Bar plot – Areas by Year

# Draw the plot
barplot(prop.table(areas.year, 2)*100,
    col=rainbow(4), ylab="%",
    main="Years by Areas")
legend("topright",
    legend=row.names(areas.year), fill=rainbow(4))
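The 100% stacked plot relies on prop.table() with margin 2, which divides every cell by its column total so that each year sums to 100. A small worked example, using a hypothetical 2 x 2 count matrix rather than the real data:

```r
# Hypothetical counts: rows are areas, columns are years
areas.year <- matrix(c(10, 30, 20, 40), nrow = 2,
                     dimnames = list(area = c("Cloud", "Web"),
                                     year = c("2014", "2015")))

# margin = 2 normalises within each column (year)
pct <- prop.table(areas.year, 2) * 100
pct

# Every column now adds up to 100
colSums(pct)

barplot(pct, col = rainbow(2), ylab = "%", main = "Areas by Year (share)")
legend("topright", legend = row.names(pct), fill = rainbow(2))
```

Using margin 1 instead would normalise within each area, which answers a different question (how an area's courses split across years).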

Page 17: A picture speaks a thousand words - Data Visualisation with R

Pie chart – Areas

# Areas occurrences
per_labels <- round(
    courses.areas/sum(courses.areas) * 100, 1)
per_labels <- paste(per_labels, "%", sep="")

# Draw the plot
pie(courses.areas, col=rainbow(4), labels=per_labels)
legend("topleft", fill=rainbow(4),
    legend=names(courses.areas))
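The percentage labels can be verified by hand with the area counts reported in the slides (693, 2271, 462 and 1574). The sketch below rebuilds courses.areas as a plain named vector, which is an assumption made only to keep the example self-contained; in the slides it is a table object.

```r
# Area counts as reported in the slides, stored as a named vector
courses.areas <- c("Cloud" = 693, "Data Science" = 2271,
                   "Software Engineering" = 462, "Web" = 1574)

# Share of each area in percent, rounded to one decimal place
per_labels <- paste0(round(courses.areas / sum(courses.areas) * 100, 1), "%")
per_labels
# "13.9%" "45.4%" "9.2%" "31.5%"

pie(courses.areas, col = rainbow(4), labels = per_labels)
legend("topleft", fill = rainbow(4), legend = names(courses.areas))
```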

Page 18: A picture speaks a thousand words - Data Visualisation with R
Page 19: A picture speaks a thousand words - Data Visualisation with R

Numerical data – summarise

# Calculate yearly revenue
revenue.year <- aggregate(price~year,
    data=courses.aggregate, sum)

Year  Price
2014  139001
2015  159002
2016  180197
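The formula interface of aggregate() reads as "summarise the left-hand side for every combination of the right-hand side". A minimal sketch on five hypothetical sales rows:

```r
# Hypothetical sales: one row per course taken
sales <- data.frame(
  year  = c(2014, 2014, 2015, 2015, 2015),
  price = c(10, 99, 19, 250, 49)
)

# Sum the prices within each year
revenue.year <- aggregate(price ~ year, data = sales, FUN = sum)
revenue.year
```

Swapping sum for length counts the courses per year instead of adding up their prices.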

Page 20: A picture speaks a thousand words - Data Visualisation with R

Bar plot – Revenue per year

# Draw the plot
barplot(revenue.year$price,
    names.arg=revenue.year$year,
    ylab="Count [$]", main="Revenue per year")

Page 21: A picture speaks a thousand words - Data Visualisation with R

Categorical data - count occurrences

# Prepare data
library(reshape)
revenue.year.area <- aggregate(
    price ~ year + area, data=courses.aggregate, sum)
rya <- t(cast(revenue.year.area, year ~ area, value="price"))

              2014    2015   2016
Cloud         127474  17873  16819
Data Science  65639   73645  74289
Software      8342    9976   11781
Web           52556   57508  77308
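The aggregate, cast and transpose steps produce an area-by-year matrix of summed prices. As an alternative sketch (not the approach used in the slides), base R's xtabs() builds the same shape in one call without the reshape package. The sales rows below are hypothetical.

```r
# Hypothetical sales: one row per course taken
sales <- data.frame(
  year  = c(2014, 2014, 2015, 2015, 2016),
  area  = c("Cloud", "Web", "Cloud", "Cloud", "Web"),
  price = c(10, 99, 19, 250, 49)
)

# Sum of price for every area (rows) and year (columns);
# combinations that never occur are filled with 0
rya <- xtabs(price ~ area + year, data = sales)
rya

barplot(rya, col = rainbow(2), ylab = "Revenue [$]",
        main = "Revenue by Year & Area")
legend("topright", fill = rainbow(2), legend = row.names(rya))
```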

Page 22: A picture speaks a thousand words - Data Visualisation with R

Stacked Bar plot – Revenue by Year and Area

# Draw the plot
barplot(rya, col=rainbow(4), ylab="Count [$]",
    main="Revenue by Year & Area")
legend("topright", fill=rainbow(4),
    legend=row.names(rya))

Page 23: A picture speaks a thousand words - Data Visualisation with R

Stacked Beside Bar plot – Areas Revenue by Year

# Draw the plot
barplot(rya, col=rainbow(4), ylab="Count [$]",
    main="Revenue by Year & Area", beside=TRUE)
# The legend uses the same four colours as the bars
legend("topright", fill=rainbow(4),
    legend=row.names(rya))

Page 24: A picture speaks a thousand words - Data Visualisation with R

Histograms – Frequency & Density

Page 25: A picture speaks a thousand words - Data Visualisation with R

Histogram – Course Prices

# Draw the plot
hist(courses.aggregate$price,
    main="Distribution of prices",
    xlab="Course price",
    breaks=20, col=heat.colors(20))

Page 26: A picture speaks a thousand words - Data Visualisation with R

Histogram – Course Prices per month

# Prepare the data
revenue.year.month <-
    aggregate(price ~ year + month,
        data=courses.aggregate, sum)

# Draw the plot
hist(revenue.year.month$price,
    main="Distribution of revenue per month",
    xlab="Revenue per month",
    breaks=20, col=heat.colors(20))

Page 27: A picture speaks a thousand words - Data Visualisation with R

Density – Course Prices per month

# Probability density
hist(revenue.year.month$price,
    main="Distribution of revenue per month",
    xlab="Revenue per month",
    breaks=20, col=heat.colors(20), prob=TRUE)
lines(density(revenue.year.month$price))
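Switching a histogram from frequency to density rescales the bars so that their total area is 1, which is what allows a kernel density curve to be overlaid on the same axis. The sketch below checks this on simulated values; the normal distribution, its parameters and the seed are assumptions chosen for illustration only.

```r
set.seed(1)
# Hypothetical monthly revenues
x <- rnorm(200, mean = 15000, sd = 3000)

# freq = FALSE puts density, not counts, on the y axis
h <- hist(x, breaks = 20, freq = FALSE, col = heat.colors(20),
          main = "Distribution of revenue per month",
          xlab = "Revenue per month")
lines(density(x))

# Bar heights times bar widths sum to 1
sum(h$density * diff(h$breaks))
```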

Page 28: A picture speaks a thousand words - Data Visualisation with R

Bivariate graphs

Page 29: A picture speaks a thousand words - Data Visualisation with R

Bar & line plot – Revenue by month

# Draw the plot
revenue.bar <- barplot(
    revenue.month$price, names.arg=labels,
    ylab="Revenue [$]", main="2016 Revenue by month")
# Overlay the units sold, scaled by 100 to share the revenue axis
lines(x=revenue.bar, y=revenue.month$units*100)
points(x=revenue.bar, y=revenue.month$units*100)
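The plot uses revenue.month and labels, neither of which is defined in the transcript. One plausible way to derive them is sketched below, assuming revenue.month holds one row per month of 2016 with the summed price and the number of courses sold (units), and that labels are abbreviated month names. The sales rows are hypothetical.

```r
# Hypothetical sales: one row per course taken
sales <- data.frame(
  year  = c(2016, 2016, 2016, 2016, 2015),
  month = c(1, 1, 2, 2, 1),
  price = c(10, 99, 19, 250, 49)
)

sales.2016 <- subset(sales, year == 2016)

# Revenue is the sum of prices, units is the number of rows per month
revenue <- aggregate(price ~ month, data = sales.2016, FUN = sum)
units   <- aggregate(price ~ month, data = sales.2016, FUN = length)

revenue.month <- data.frame(month = revenue$month,
                            price = revenue$price,
                            units = units$price)
labels <- month.abb[revenue.month$month]

revenue.month
labels
```

Both aggregate() calls group by the same column of the same data, so their rows come back in the same month order and can be combined column-wise.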

Page 30: A picture speaks a thousand words - Data Visualisation with R

Line plot & trend – Revenue by month

# Draw the plot
months <- 1:12
plot(price ~ month, data=revenue.month,
    xaxt="n", type="l", ylab="Revenue [$]", xlab="",
    main="Revenue in 2016")
axis(1, at=months, labels=labels)

# Display the trend
lines(c(1,12), c(25000, 12000), type="l",
    lty=2, col="blue")
legend("topright", c("Revenue", "Trend"),
    col=c("black", "blue"), lty=1:2)
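The trend on this slide is drawn between two hard-coded points. As an alternative sketch, a least-squares line can be fitted with lm() and drawn with abline(), so the trend follows the data. The monthly figures below are hypothetical and only mimic a declining year.

```r
# Hypothetical revenue for the twelve months of one year
revenue.month <- data.frame(
  month = 1:12,
  price = c(24000, 23500, 21000, 22000, 19500, 18000,
            18500, 16000, 15500, 14000, 13500, 12000)
)

# Least-squares linear trend: price as a function of month
fit <- lm(price ~ month, data = revenue.month)

plot(price ~ month, data = revenue.month, type = "l", xaxt = "n",
     xlab = "", ylab = "Revenue [$]", main = "Revenue by month")
axis(1, at = 1:12, labels = month.abb)
abline(fit, lty = 2, col = "blue")
legend("topright", c("Revenue", "Trend"),
       col = c("black", "blue"), lty = 1:2)

# Intercept and monthly change of the fitted trend
coef(fit)
```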

Page 31: A picture speaks a thousand words - Data Visualisation with R

Line plot & trend – Revenue by Units

# Draw the plot
plot(price~units, data=revenue.month,
    xlab="Units", ylab="Revenue [$]",
    main="Revenue by Units in 2016")
lines(c(30, 380), c(3000, 35000),
    type='l', lty=2, col="blue")
legend("topleft", c("revenue/freq", "trend"),
    col=c("black", "blue"), lty=c(0,2), pch=c(21, -1))

Page 32: A picture speaks a thousand words - Data Visualisation with R

Line plot & trend – Revenue by Units

# Draw the plot
plot(price~units, data=revenue.month.area,
    xlab="Units", ylab="Revenue [$]", col=area,
    main="Revenue by Units (All years)")
legend("topleft",
    legend=levels(revenue.month.area$area),
    col=1:length(levels(revenue.month.area$area)),
    pch=21, text.width=30)

Page 33: A picture speaks a thousand words - Data Visualisation with R

base vs. lattice vs. ggplot2

Page 34: A picture speaks a thousand words - Data Visualisation with R

Stacked Bar chart – base vs. lattice

# base
barplot(rya, col=rainbow(4), ylab="Count [$]",
    main="Revenue by Year & Area")
legend("topright", fill=rainbow(4),
    legend=row.names(rya))

# lattice
library(lattice)
barchart(Cloud + `Data Science` + `Software Engineering` + Web ~ year,
    data=t(rya), auto.key=TRUE, stack=TRUE,
    horizontal=FALSE, ylab="Count [$]",
    main="Areas by Year")

Page 35: A picture speaks a thousand words - Data Visualisation with R

Stacked Bar chart – base vs. ggplot2

# base
barplot(rya, col=rainbow(4), ylab="Count [$]",
    main="Revenue by Year & Area")
legend("topright", fill=rainbow(4),
    legend=row.names(rya))

# ggplot2
library(ggplot2)
ggplot(revenue.year.area,
    aes(x=year, y=price, fill=area)) +
  geom_bar(stat="identity") +
  ggtitle("Revenue by Year & Area") +
  ylab("Count [$]")

Page 36: A picture speaks a thousand words - Data Visualisation with R

Histogram – base vs. lattice

# base
hist(revenue.year.month$price,
    main="Distribution of revenue per month",
    xlab="Revenue per month",
    breaks=20, col=heat.colors(20))

# lattice
histogram(~price, data=revenue.year.month,
    main="Distribution of revenue per month",
    xlab="Revenue per month",
    breaks=20, type="count", col=heat.colors(20))

Page 37: A picture speaks a thousand words - Data Visualisation with R

Histogram – base vs. ggplot2

# base
hist(revenue.year.month$price,
    main="Distribution of revenue per month",
    xlab="Revenue per month",
    breaks=20, col=heat.colors(20))

# ggplot2
ggplot(revenue.year.month, aes(x=price)) +
  geom_histogram(stat="bin", binwidth=2500,
    aes(fill=..count..)) +
  ggtitle("Distribution of revenue per month") +
  xlab("Revenue per month")

Page 38: A picture speaks a thousand words - Data Visualisation with R

Box plot – base vs. lattice

# base
boxplot(price~year, data=revenue.year.month,
    col=2:4, main="Revenue by Year",
    xlab="Year", ylab="Revenue")

# lattice counterpart (assumed: bwplot with year as a factor)
bwplot(price~factor(year), data=revenue.year.month,
    main="Revenue by Year",
    xlab="Year", ylab="Revenue")

Page 39: A picture speaks a thousand words - Data Visualisation with R

Box plot – base vs. ggplot

# base
boxplot(price~year, data=revenue.year.month,
    col=2:4, main="Revenue by Year",
    xlab="Year", ylab="Revenue")

# ggplot2
ggplot(revenue.year.month,
    aes(x=factor(year), y=price)) +
  geom_boxplot(aes(fill=factor(year))) +
  ggtitle("Total by Year") +
  ylab("Revenue") + xlab("Year")

Page 40: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – base vs. lattice

# base
plot(price~units, data=revenue.month.area,
    xlab="Units", ylab="Revenue [$]", col=area,
    main="Revenue by Units (All years)")
# The legend has to be created manually

# lattice
xyplot(price~units, data=revenue.month.area,
    xlab="Units", ylab="Revenue [$]", pch=19,
    groups=area, auto.key=TRUE)

Page 41: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – base vs. ggplot2

# base
plot(price~units, data=revenue.month.area,
    xlab="Units", ylab="Revenue [$]", col=area,
    main="Revenue by Units (All years)")
# The legend has to be created manually

# ggplot2
ggplot(revenue.month.area, aes(x=units, y=price)) +
  geom_point(aes(col=area)) +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")

Page 42: A picture speaks a thousand words - Data Visualisation with R

ggplot2 & layers

Page 43: A picture speaks a thousand words - Data Visualisation with R

Scatter plot

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point()

Page 44: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – Colours per area

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area))

Page 45: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – Labels

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")

Page 46: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – Dots’ size

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area, size=dltotal)) +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")

Page 47: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – Lines

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  geom_line() +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")

Page 48: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – ab line

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  geom_abline(intercept=0, slope=110) +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")

Page 49: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – smooth line

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  stat_smooth() +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")

Page 50: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – smooth line

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  stat_smooth() +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units") +
  theme(legend.title=element_text(
    colour="chocolate", size=16, face="bold"))

Page 51: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – smooth line

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  stat_smooth() +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units") +
  theme(legend.title=element_text(
    colour="chocolate", size=16, face="bold")) +
  scale_color_discrete(name="Learning Areas")

Page 52: A picture speaks a thousand words - Data Visualisation with R

Scatter plot – smooth line

# Draw the dots
ggplot(revenue.month,
    aes(x=units, y=total)) +
  geom_point(aes(col=area)) + ...
  theme(legend.title=element_text(
    colour="chocolate", size=16, face="bold")) +
  scale_color_discrete(name="Learning Areas") +
  guides(colour=guide_legend(
    override.aes=list(size=4)))
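The progression on these slides works because a ggplot2 plot is an ordinary R object to which layers, labels and theme settings are added one at a time with +. The sketch below stores the intermediate plot in a variable and extends it step by step. The five data rows are hypothetical; the column names (units, total, area) and the slope of 110 follow the slides.

```r
library(ggplot2)

# Hypothetical monthly figures
revenue.month <- data.frame(
  units = c(30, 120, 210, 300, 380),
  total = c(3200, 12500, 20100, 29800, 35500),
  area  = c("Cloud", "Web", "Cloud", "Data Science", "Web")
)

# Step 1: data and aesthetic mapping only, nothing is drawn yet
p <- ggplot(revenue.month, aes(x = units, y = total))

# Step 2: a point layer, coloured per area
p <- p + geom_point(aes(col = area))

# Step 3: a reference line layer
p <- p + geom_abline(intercept = 0, slope = 110)

# Step 4: labels and legend title
p <- p +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units") +
  scale_color_discrete(name = "Learning Areas")

# The plot is rendered only when it is printed
print(p)
```

Keeping the plot in a variable makes it easy to try a layer, inspect the result and drop it again without retyping the whole expression.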

Page 53: A picture speaks a thousand words - Data Visualisation with R

ggplot2 & maps (ggmap)

Page 54: A picture speaks a thousand words - Data Visualisation with R

Treemap – Revenue by Vendor

# Draw the plot
library(treemap)
treemap(courses.aggregate, index=c("vendor"),
    vSize="price", title="Revenue per vendor",
    type="index")

Page 55: A picture speaks a thousand words - Data Visualisation with R

Interactive and dynamic graphs

• plotly
• ggiraph
• D3.js
• streamgraph
• animation

Page 56: A picture speaks a thousand words - Data Visualisation with R

plotly - Interactive graphs

# Draw the plot
library(plotly)
plot_ly(revenue.month.vendor, x=~units, y=~total,
    mode="markers", color=~factor(area),
    size=~dltotal/1000,
    text=~paste("Units:", units,
        "</br>Revenue", total,
        "</br>DataLearn cut:", dltotal),
    hoverinfo="text", type="scatter") %>%
  layout(title="Revenue per vendor",
    xaxis=list(title="Units"),
    yaxis=list(title="Revenue [$]"))

Page 57: A picture speaks a thousand words - Data Visualisation with R

Make an interactive graph from ggplot

# Draw the plot
library(plotly)
ggbar <- ggplot(revenue.year.area,
    aes(x=year, y=price, fill=area)) +
  geom_bar(stat="identity")
ggplotly(ggbar)

Page 58: A picture speaks a thousand words - Data Visualisation with R

Network visualisation

• igraph
• ggnet
• ggnetwork
• ggraph
• visNetwork
• sna

Page 59: A picture speaks a thousand words - Data Visualisation with R

igraph – Courses taken by Users

# Prepare the data
user.area <- data.frame(
    user=courses.aggregate$name,
    area=courses.aggregate$area)
user.area <- user.area[
    sample(1:500, 50, replace=FALSE),]
user.area <- aggregate(
    cbind(user.area[0], width=1),
    user.area, length)

# Build the graph
library(igraph)
user.area.graph <- graph.data.frame(
    user.area, directed=FALSE,
    vertices=vertices)

# Draw the plot
plot(user.area.graph, main="Courses taken by users")
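The graph is built from an edge list plus a vertices object that the transcript does not define. A minimal sketch of how both could be prepared in base R follows, assuming one vertex per distinct user and per distinct area and an edge width equal to the number of courses a user took in that area. The sample rows and the type column are hypothetical. The igraph call runs only if the package is installed and uses graph_from_data_frame(), the current name of the function that the slide calls graph.data.frame().

```r
# Hypothetical sample: one row per course taken
user.area <- data.frame(
  user = c("Perez, Lisa", "Perez, Lisa", "Bajwa, John", "Cooper, Duncan"),
  area = c("Data Science", "Data Science", "Cloud", "Cloud")
)

# Edge list: one row per user/area pair, width = number of courses
edges <- as.data.frame(table(user.area), stringsAsFactors = FALSE)
edges <- edges[edges$Freq > 0, ]
names(edges)[names(edges) == "Freq"] <- "width"

# Vertex list: every user and every area that occurs in the edges
users <- unique(edges$user)
areas <- unique(edges$area)
vertices <- data.frame(
  name = c(users, areas),
  type = rep(c("user", "area"), c(length(users), length(areas)))
)

edges
vertices

if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(edges, directed = FALSE,
                                     vertices = vertices)
  plot(g, main = "Courses taken by users")
}
```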

Page 60: A picture speaks a thousand words - Data Visualisation with R

visNetwork – Dynamic Networks

# Draw the plot
library(visNetwork)
visNetwork(nodes, edges, main="Courses taken by users")

Page 61: A picture speaks a thousand words - Data Visualisation with R

Circular graph – Area per Vendor

# Prepare the data
area.vendor <- data.frame(
    area=courses.merge$areaname,
    vendor=courses.merge$vname)
circular.data <- with(area.vendor,
    table(vendor, area))

# Draw the plot
library(circlize)
chordDiagram(
    as.data.frame(circular.data), transparency=0.5)
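Converting the contingency table with as.data.frame() yields one row per vendor/area pair with a Freq column, which is the from/to/value layout that chordDiagram() accepts for a data frame. The sketch below shows that conversion on four hypothetical rows; dropping the zero-count pairs is an optional extra step not taken in the slides, and the circlize call runs only if the package is installed.

```r
# Hypothetical sample: one row per course taken
area.vendor <- data.frame(
  area   = c("Cloud", "Cloud", "Web", "Data Science"),
  vendor = c("InfraNet", "InfraNet", "WWW Way", "DataLearn")
)

# Wide contingency table: vendors in rows, areas in columns
circular.data <- with(area.vendor, table(vendor, area))

# Long format: one row per vendor/area pair with its count
links <- as.data.frame(circular.data)

# Optional: keep only the pairs that actually occur
links <- links[links$Freq > 0, ]
links

if (requireNamespace("circlize", quietly = TRUE)) {
  circlize::chordDiagram(links, transparency = 0.5)
}
```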

Page 62: A picture speaks a thousand words - Data Visualisation with R
Page 63: A picture speaks a thousand words - Data Visualisation with R

Keep in touch

BarbaraFusinska.com

@BasiaFusinska