ggplot2
package
ggplot2
packageggplot
packageThe ggplot
package is a part of the core of tidyverse
.
ggplot2
is a plotting system for R, based on the grammar of graphics. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics 1.
It is the most elegant and versatile tool for graphically visualizing data in R, offering a coherent system (or grammar) for building graphs.
It is a concept coined by Leland Wilkinson in 2005.
An abstraction which facilitates reasoning and communicating graphics.
ggplot2
is a layered grammar of graphics which allow users to: independently specify the building blocks of a plot and combine them to create just about any kind of graphical display.
ggplot2
characteristicsAdvantages of ggplot2
:
ggplot2
has a large user base => it’s easy find to help.Weaknesses of ggplot2
rgl
or ggplot2
+ plotly
instead,igraph
insteadggplot2
graphical objectsdata
aesthetic mapping
geometric objects
statistical transformations
facets
scales
coordinate system
ggplot(data = <DATA>) +
GEOM_FUNCTION(
mapping = aes(<mappings>),
stat = <statistic transformation>,
position = <position options>,
color = <fixed color>,
<other arguments>) +
FACET_FUNCTION(<facet options>) +
SCALE_FUNCTION(<scale options>) +
theme(<theme elements>)
ggplot()
functionggplot()
function is initializes a basic graph structure.
It cannot produce a plot alone by itself.
You need to add extra components to generate a graph.
Different parts of a plot can be added together using +
. Note similarity with the %>%
operator.
Any data or arguments you supply to ggplot()
function, can later be used by added functions without repeated specification.
ggplot2
compared to base graphicsis more verbose for simple/out of the box graphics,
is less verbose for complex/custom graphics,
generates graphs by adding building blocks, instead of calling different functions to draw new layers on top,
makes it easier to edit and tweak elements of a plot,
more details on advantages of ggplot2
over base plot are in this blog.
ggplot2
has a built-in economics
dataset, which inclides time series data on US unemployment from 1967 to 2015.
economics
## # A tibble: 574 x 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <int> <dbl> <dbl> <int>
## 1 1967-07-01 507. 198712 12.5 4.5 2944
## 2 1967-08-01 510. 198911 12.5 4.7 2945
## 3 1967-09-01 516. 199113 11.7 4.6 2958
## 4 1967-10-01 513. 199311 12.5 4.9 3143
## 5 1967-11-01 518. 199498 12.5 4.7 3066
## 6 1967-12-01 526. 199657 12.1 4.8 3018
## 7 1968-01-01 532. 199808 11.7 5.1 2878
## 8 1968-02-01 534. 199920 12.2 4.5 3001
## 9 1968-03-01 545. 200056 11.6 4.1 2877
## 10 1968-04-01 545. 200208 12.2 4.6 2709
## # ... with 564 more rows
economics <- mutate(economics, unemp_rate = unemploy/pop)
plot(unemp_rate ~ date, data = economics, type = "l")
ggplot2
packagelibrary(tidyverse)
ggplot(data = economics, aes(x = date, y = unemp_rate)) + geom_line()
ggplot(data = economics, aes(x = date, y = unemp_rate))
ggplot(data = economics, aes(x = date, y = unemp_rate)) + geom_line()
ggplot(data = economics, aes(x = date, y = unemp_rate)) +
geom_line() + theme_bw()
# Add new variables for plotting
economics <- economics %>%
mutate(month = as.numeric(format(date, format="%m")),
year = as.factor(format(date, format="%Y")))
economics %>%
select(date, month, year, unemp_rate)
## # A tibble: 574 x 4
## date month year unemp_rate
## <date> <dbl> <fct> <dbl>
## 1 1967-07-01 7 1967 0.0148
## 2 1967-08-01 8 1967 0.0148
## 3 1967-09-01 9 1967 0.0149
## 4 1967-10-01 10 1967 0.0158
## 5 1967-11-01 11 1967 0.0154
## 6 1967-12-01 12 1967 0.0151
## 7 1968-01-01 1 1968 0.0144
## 8 1968-02-01 2 1968 0.0150
## 9 1968-03-01 3 1968 0.0144
## 10 1968-04-01 4 1968 0.0135
## # ... with 564 more rows
data09 <- subset(economics, year == "2009")
data14 <- subset(economics, year == "2014")
plot(unemp_rate ~ month, data = data09, ylim = c(0.02, 0.05), type = "l")
lines(unemp_rate ~ month, data = data14, col = "red")
legend("topleft", c("2009", "2014"), col = c("black", "red"), lty = c(1,1))
There is no need of specifing a legend:
ggplot(data = economics %>% filter(year %in% c(2014, 2009)),
aes(x = month, y = unemp_rate)) +
geom_line(aes(group = year, color = year))
ggplot(data = economics, aes(x = month, y = unemp_rate)) +
geom_line(aes(color = year)) +
theme(axis.text.x = element_text(angle = 45))
ggplot(data = economics, aes(x = month, y = unemp_rate)) +
geom_line(aes(group = year, color = pop)) +
theme(axis.text.x = element_text(angle = 45))
diamond
datasetdiamond
is a built-in dataset, included in tidyverse
. It contains prices and other attributes of almost 54,000 diamonds. We will use this dataset to illustrate how to use functions in ggplot2
.
data(diamonds)
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
More information with ?diamonds
. Spreadsheet view in RStudio with View(diamonds)
.
Geometric objects are the actual elements you put on the plot. Examples include:
geom_point()
, used for scatter plots)geom_text()
, geom_label()
, used for text labels)geom_line()
, used for time series, trend lines, etc.)geom_boxplot()
used for, well, boxplots!)There is no upper limit to how many geom objects you can use. You can add a geom objects to a plot using an +
operator.
To get a list of available geometric objects use the following:
help.search("geom_", package = "ggplot2")
# Note that we can save `ggplot` as an object
p <- ggplot(diamonds, aes(x = carat, y = price))
p + geom_point()
plog <- ggplot(
sample_n(diamonds, 100),
aes(x = log10(carat), y = log10(price)))
plog + geom_text(aes(label = clarity))
plog + geom_label(aes(label = clarity))
ggrepel
package for annotationggrepel
helps annotating overlapping labels.
# Uncomment the line below if you don't have 'ggrepel'
# install.packages("ggrepel")
library(ggrepel)
plog + geom_point() + geom_text_repel(aes(label = clarity), size = 3)
In ggplot an aesthetic mapping, defined with aes()
, describes how variables are mapped to visual properties or aesthetics.
The details of mapping can be described by using scale functions.
You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset.
Each type of geom
objects accepts only a subset of aesthetics; refer to the geom
help pages for details.
# We first generate a subset of 'diamnonds' dataset
dsmall <- sample_n(diamonds, 500)
p1 <- ggplot(dsmall, aes(x = carat, y = price))
# set shape by diamond cut
p1 + geom_point(aes(shape = cut))
## Warning: Using shapes for an ordinal variable is not advised
ggplot(data.frame(x = 1:5 , y = 1:25, z = 1:25), aes(x = x, y = y)) +
geom_point(aes(shape = z), size = 5, colour = "darkgreen", fill = "orange") +
scale_shape_identity()
# color by diamonds color
p1 + geom_point(aes(color = color))
p1 + geom_point(aes(shape = cut, color = color))
## Warning: Using shapes for an ordinal variable is not advised
p1 + geom_point(aes(color = color))
p1 + geom_point(color = "darkgreen")
p1 + geom_point(aes(fill = cut), size = 3, color = "black", shape = 25)
a1 <- p + geom_point(alpha = 1/5)
a2 <- p + geom_point(alpha = 1/50)
a3 <- p + geom_point(alpha = 1/500)
# We use grid.arrange from gridExtra to display multiple plots
library(gridExtra)
grid.arrange(a1, a2, a3, ncol = 3)
aes()
assign an aesthetic to a variable; it doesn’t determine how mapping should be done.
For example, aes(shape = x)
or aes(color = z)
do not specify what shapes or what colors should be used.
To choose colors/shapes/sizes etc. you need to modify the corresponding scale.
ggplot2
includes scales for:
Scales can be modified with functions of the form: scale_<aesthetic>_<type>()
, e.g. scale_color_discrete()
.
Try typing scale_<tab>()
to see a list of scale modification functions.
Common Scale Arguments:
# Square root y-axis transformation
p1 <- ggplot(dsmall, aes(x = carat, y = price))
psqrt <- p1 + geom_point() + scale_y_sqrt()
# Log base 10 y-axis transformation
plog10 <- p1 + geom_point() + scale_y_log10()
grid.arrange(psqrt, plog10, ncol = 2)
Log base 10 transformation of x and y axes. Note the differences.
ploglog1 <- p1 + geom_point() + scale_y_log10() + scale_x_log10()
ploglog2 <- ggplot(dsmall, aes(x = log(carat), y = log(price))) + geom_point()
grid.arrange(ploglog1, ploglog2, ncol = 2)
p11 <- p1 + geom_point(aes(shape = cut), size = 3)
p12 <- p1 + geom_point(aes(shape = cut), size = 3) +
scale_shape_manual(values = c(1:5))
grid.arrange(p11, p12, ncol = 2)
## Warning: Using shapes for an ordinal variable is not advised
To choose specific colors for discrete variables we use scale_color_manual
.
p1 + geom_point(aes(color = cut), size = 3) +
scale_color_manual(values = c("red", "orange", "yellow", "green", "blue"))
For continuous variables we use scale_color_gradient
, and specify the ends of the color spectrum.
p1 + geom_point(aes(color = price), size = 3) +
scale_color_gradient(low = "blue", high = "red")
You can also scale the values of the variable corresponding to color.
p1 + geom_point(aes(color = price), size = 3) +
scale_color_gradient(low = "blue", high = "red", trans = "log10")
Or and add your own breaks
p1 + geom_point(aes(color = price), size = 3) +
scale_color_gradient(low = "blue", high = "red", trans = "log10",
breaks = c(1000, 2000, 5000, 10000),
labels = c(" 1000", " 2000", " 5000", "10000"))
scale_color_brewer
lets you choose nice color palettes for discrete variables.
p1 + geom_point(aes(color = cut), size = 3) +
scale_color_brewer(palette = "Set2")
Unfortunately, scale_color_brewer
doesn’t work for continuous variables:
# This will result in an error
p1 + geom_point(aes(shape = price), size = 3) +
scale_color_brewer(palette = "Spectral")
## Error: A continuous variable can not be mapped to shape
We can get around this issue using the RColorBrewer
package and scale_color_gradientn
function, which interpolates colors from the brewer palettes.
# install.packages("RColorBrewer")
library(RColorBrewer)
p1 + geom_point(aes(color = price), size = 3) +
scale_color_gradientn(colours = brewer.pal(name = "Spectral", n = 10))
Another popular color scheme package, viridis
, supports both discrete and continuous variables:
# install.packages("viridis")
library(viridis)
p1 + geom_point(aes(color = price), size = 3) + scale_color_viridis()
p1 + geom_point(aes(color = cut), size = 3) +
scale_color_viridis(discrete = TRUE, option = "magma")
… there are also other unconventional schemes such as, one based on Wes Anderson movies :
#install.packages("wesanderson")
library(wesanderson)
names(wes_palettes)
## [1] "BottleRocket1" "BottleRocket2" "Rushmore1" "Rushmore"
## [5] "Royal1" "Royal2" "Zissou1" "Darjeeling1"
## [9] "Darjeeling2" "Chevalier1" "FantasticFox1" "Moonrise1"
## [13] "Moonrise2" "Moonrise3" "Cavalcanti1" "GrandBudapest1"
## [17] "GrandBudapest2" "IsleofDogs1" "IsleofDogs2"
# For discrete variables
p1 + geom_point(aes(color = cut), size = 3) +
scale_color_manual(values = wes_palette("Darjeeling1", n = 5))
# For continuous variables:
p1 + geom_point(aes(color = price), size = 3) +
scale_color_gradientn(colours = wes_palette("Darjeeling1", 100, type = "continuous"))
Facettng allows you to split up your data by one or more variables and plot the subsets of data together.
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]
p0 <- ggplot(data = dsmall, aes(x = carat, y = price)) +geom_point(size = 1) +
geom_smooth(aes(colour = cut, fill = cut))
p1 <- p0 + facet_wrap(~ cut)
grid.arrange(p0, p1, ncol = 2)
ggplot(diamonds, aes(x = carat)) +
geom_density() +
facet_grid(color ~ cut)
Go to “Lec4_Exercises.Rmd” on the class website.
Complete Exercise 1.
Plots often require some statistical data transformation or computation
before they can be plotted. Examples include:
boxplots: the the median, lower and upper quartiles,
histograms: group the values into bins,
bar charts: number of class occurrences.
smoothers: prediction lines / predicted y-values,
Plotting a summary (less data) can be more insightful.
ggplot(data = diamonds, aes(x = cut, y =carat)) +
geom_boxplot()
# Distribution of the carats (weights) of the diamonds.
h <- ggplot(data = diamonds, aes(x = carat)) + geom_histogram()
d <- ggplot(data = diamonds, aes(x = carat)) + geom_density()
grid.arrange(h, d, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In histograms, the smoothness is controlled with bins
and binwidth
arguments. (or by specifying using the breaks
explicitly).
p <- ggplot(data = diamonds, aes(x = carat)) + xlim(0, 3)
h1 <- p + geom_histogram(binwidth = 0.5)
h2 <- p + geom_histogram(binwidth = 0.1)
h3 <- p + geom_histogram(binwidth = 0.05)
grid.arrange(h1, h2, h3, ncol = 3)
In density plots, the bw
(the smoothing bandwidth) and adjust
arguments control the smoothness.
d1 <- p + geom_density(adjust = 5)
d2 <- p + geom_density(adjust = 1)
d3 <- p + geom_density(adjust = 1/5)
grid.arrange(d1, d2, d3, ncol = 3)
# Here we show grouping by diamonds cut.
h <- p + geom_histogram(aes(fill = cut), position = "dodge", bins = 10)
d <- p + geom_density(aes(color = cut))
grid.arrange(h, d, ncol = 2)
Instead of marginal distributions, we can plot distribution of components stacked on top of each other to see the contribution from each of group.
h <- p + geom_histogram(aes(fill = cut), position = "stack")
d <- p + geom_density(aes(fill = cut), position = "stack")
grid.arrange(h, d, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Position adjustments are used to adjust the position of each geom
. The following position adjustments are available:
position_identity
: default of most geomsposition_jitter
: adds a small amount of random variationposition_dodge
: default of geom_boxplot
position_stack
: default of geom_bar
, geom_histogram
position_fill
: useful for geom_bar
, geom_histogram
The position parameter can be set as follows:
geom_point(..., position="jitter")
Overplotting: many points overlap each other. Here variables are categorical, but sometimes rounding causes overplotting.
plt <- ggplot(diamonds, aes(x = cut, y = depth))
plt + geom_point()
plt + geom_point(position = "jitter")
A discrete analogue of a histogram is the bar chart, geom_bar()
.
Instead of partitioning the values into bins like histograms, the bar geom counts the number of instances of each discrete class. The counts are then plotted as columns for each distinct class.
If you’d like include unequal weights for different observations, you can use the weight
aesthetic.
b1 <- ggplot(diamonds, aes(x = clarity)) + geom_bar()
b2 <- ggplot(diamonds, aes(x = clarity)) + geom_bar(aes(weight = carat)) + ylab("carat")
grid.arrange(b1, b2, ncol = 2)
The left plot shows the number of diamonds in each clarity group, and the right plot shows the count weighted by carat, which is equivalent to showing the total weight of diamonds in clarity color group.
As you see, in ggplot2
(unlike base graphics) it is not necessary tabulate the values, i.e. compute the counts of each category beforehand. The computation is done automatically for you.
However, if you have already summarized data, you can still use geom_bar
but you need to specify an identity transformation, stat = "identity
rather than the default stat = "count"
.
diamond.counts <- diamonds %>%
group_by(color) %>%
summarise(count = n())
diamond.counts
## # A tibble: 7 x 2
## color count
## <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
With the frequency counts already computed, the default options of the barplot generates an error:
diamond.counts
## # A tibble: 7 x 2
## color count
## <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
ggplot(diamond.counts, aes(x=color, y=count)) + geom_bar()
## Error: stat_count() must not be used with a y aesthetic.
# You need to do the following:
ggplot(diamond.counts, aes(x=color, y=count)) + geom_bar(stat="identity")
# Smoothers help discern patterns in the data
set.seed(438756)
dsmall <- diamonds %>% sample_frac(0.1)
ggplot(dsmall, aes(x = carat, y = price)) +
geom_point(aes(color = color)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(dsmall, aes(x = carat, y = price)) +
geom_point(aes(color = color)) + geom_smooth(method = "lm")
Now that you have your beautiful plot, you may want to save it as an image.
ggsave()
is a convenient function for saving a plot.
By default, it saves the last plot that you displayed, using the size of the current graphics device. It also guesses the type of graphics device from the extension.
ggsave(filename, plot = last_plot(), device = NULL, path = NULL,
scale = 1, width = NA, height = NA, units = c("in", "cm", "mm"),
dpi = 300, limitsize = TRUE, ...)
“Device” can be either be a device function (e.g. png), or one of “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” or “wmf” (windows only).
Go back to “Lec4_Exercises.Rmd”
Complete the exercise 2 and 3