Exercise 1: Customized scatter plot

You will try to recreate a plot from an Economist article showing the relationship between well-being and financial inclusion.

You can find the accompanying article at this link

The data for the exercises, EconomistData.csv, can be downloaded from the class GitHub repository.

library(tidyverse)
── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0.9000      ✔ purrr   0.2.5      
✔ tibble  1.4.2           ✔ dplyr   0.7.99.9000
✔ tidyr   0.8.1           ✔ stringr 1.3.1      
✔ readr   1.1.1           ✔ forcats 0.3.0      
── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
url <- paste0("https://raw.githubusercontent.com/cme195/cme195.github.io/",
              "master/assets/data/EconomistData.csv")
dat <- read_csv(url)
Parsed with column specification:
cols(
  Country = col_character(),
  SEDA.Current.level = col_double(),
  SEDA.Recent.progress = col_double(),
  Wealth.to.well.being.coefficient = col_double(),
  Growth.to.well.being.coefficient = col_double(),
  Percent.of.15plus.with.bank.account = col_double(),
  EPI_regions = col_character(),
  Region = col_character()
)
head(dat)

Part 1

  1. Create a scatter plot similar to the one in the article, where the x axis shows the percent of people over the age of 15 with a bank account (the Percent.of.15plus.with.bank.account column) and the y axis shows the current SEDA score (the SEDA.Current.level column).
  2. Color all points blue.
  3. Color points according to the Region variable.
  4. Overlay a fitted smoothing trend on top of the scatter plot. Try to change the span argument in geom_smooth to a low value and see what happens.
  5. Overlay a regression line on top of the scatter plot. Hint: use geom_smooth with an appropriate method argument.
  6. Facet the previous plot by Region.
#1. Create a scatter plot with percent of people over the age of 15 with a bank
#   account on the x axis and the current SEDA score on the y axis.
p <- ggplot(
  dat, aes(x = Percent.of.15plus.with.bank.account, y = SEDA.Current.level)) 
p + geom_point()

#2. Color the points in the previous plot blue.
p + geom_point(color = "blue")

#3. Color the points in the previous plot according to the `Region`.
(p3 <- p + geom_point(aes(color = Region)))

# 4. Overlay a smoothing line on top of the scatter plot using the default method.
p3 + geom_smooth()

#4. Changing the span parameter
p3 + geom_smooth(span = 0.2)

#5. Overlay a smoothing line on top of the scatter plot using the lm method
(p5 <- p3 + geom_smooth(method = "lm"))

# 6. Facet the previous plot by Region
p5 + facet_wrap(~ Region)
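
To see why a low span gives a wigglier curve in item 4, the sketch below fits loess directly in base R. It assumes that geom_smooth is using its loess method (the default for small data sets), in which case span is passed on to loess. The data are synthetic and the names fit_wiggly, fit_smooth, rss_wiggly and rss_smooth are made up for illustration; this is not part of the exercise solution.

```r
set.seed(1)

# Synthetic noisy sine curve (an assumption, not the Economist data)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)

# span is the fraction of points used in each local fit
fit_wiggly <- loess(y ~ x, span = 0.2)  # small neighbourhoods, follows the noise
fit_smooth <- loess(y ~ x, span = 0.9)  # large neighbourhoods, much smoother

# Residual sum of squares on the training data for each fit
rss_wiggly <- sum(residuals(fit_wiggly)^2)
rss_smooth <- sum(residuals(fit_smooth)^2)
c(span_0.2 = rss_wiggly, span_0.9 = rss_smooth)
```

The low-span fit tracks the training points more closely, which is the same effect you see as a bumpier line when plotting with geom_smooth(span = 0.2).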

Exercise 2: Distribution of categorical variables

  1. Generate a bar plot showing the number of countries included in the dataset from each Region.
ggplot(dat, aes(x = Region)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

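# Optional: order the bars by decreasing number of countries.
# reorder() sorts the levels by -length(x), i.e. largest group first.
dat <- dat %>%
  mutate(reg = reorder(Region, Region, function(x) -length(x)))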
barplot <- ggplot(dat, aes(x = reg)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))
barplot

  2. Rotate the plot so the bars are horizontal.
barplot + coord_flip()
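
The reorder() call used to sort the bars can look cryptic, so here is a minimal base R sketch of what it does on a toy vector. The vector region and the object reg are illustrative stand-ins, not the course data.

```r
# Toy stand-in for the Region column: 3 x Asia, 2 x Europe, 1 x Africa
region <- c("Asia", "Europe", "Asia", "Africa", "Asia", "Europe")

# For each group, the score is minus its size; levels are sorted by
# increasing score, so the largest group comes first
reg <- reorder(region, region, function(x) -length(x))

levels(reg)  # "Asia" "Europe" "Africa"
table(reg)   # counts listed in the new level order
```

If you prefer a tidyverse helper, forcats::fct_infreq() is designed for the same job of ordering levels by frequency.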

Exercise 3: Distribution of continuous variables

  1. Create boxplots of the SEDA scores (SEDA.Current.level), separately for each Region.
  2. Overlay points on top of the box plots.
  3. The points you added are on top of each other. To distinguish them, jitter each point by a little bit in the horizontal direction.
  4. Now substitute your boxplot with a violin plot.
plt <- ggplot(dat, aes(x = Region, y = SEDA.Current.level)) + 
  theme(axis.text.x = element_text(angle = 15, hjust = 1))
plt + geom_boxplot()  

plt + geom_boxplot() + geom_point()

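# height = 0 keeps the jitter horizontal only, so the SEDA scores are not shifted
plt + geom_boxplot() + geom_jitter(width = 0.1, height = 0)

plt + geom_violin() + geom_jitter(width = 0.1, height = 0)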
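
One detail worth knowing about jitter: by default it moves points in both directions, so purely horizontal jitter needs height = 0. The sketch below checks this on the built-in iris data, which stands in for the Economist data here. The names p_jit and ld are illustrative, and the seed argument assumes ggplot2 3.0.0 or later.

```r
library(ggplot2)

# iris stands in for the Economist data; Species plays the role of Region
p_jit <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 1))

# Built data for the point layer (layer 2): with height = 0 the y values
# should be exactly the original measurements
ld <- layer_data(p_jit, 2)
all.equal(sort(ld$y), sort(iris$Sepal.Length))

p_jit
```

Setting a seed makes the jittered positions reproducible between renders, which is convenient when a plot goes into a report.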

Emulating the Economist ‘style’

Below, I will show you how to obtain an ‘Economist look’ for your scatter plot in a few lines of code. To replicate the plot we need to:

  1. Change the ordering of the regions by converting the Region column to a factor.
  2. Use settings for the markers that best match the points on the original Economist plot. Note that the points are bigger, have white borders, and use specific fill colors. The following colors match the ones on the plot: colors <- c("#28AADC","#F2583F", "#76C0C1","#24576D", "#248E84","#DCC3AA", "#96503F")
  3. Change the axes ratio.
  4. Change the plot background and theme. Note that the ggthemes package has convenient functions for generating “Economist” style plots, e.g. theme_economist_white().
  5. Format the legend.
  6. Add “Country” labels to the points.
  7. Add a title and format the axes.

First, change the order of and the labels for the regions:

regions <- c("Europe", "Asia", "Oceania", "North America", 
             "Latin America & the Caribbean", "Middle East & North Africa",
             "Sub-Saharan Africa")
# Here we are just modifying labels so that some names are on two lines
region_labels <-  c("Europe", "Asia", "Oceania", "North America",
                    "Latin America & \n the Caribbean", 
                    "Middle East & \n North Africa", "Sub-Saharan \n Africa")
dat <- dat %>%
  mutate(
    Region = as.character(Region),
    Region = factor(Region, levels = regions, labels = region_labels)
  )
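
# A small base R sketch of how factor() pairs `levels` with `labels`, on a toy
# vector (x and f are illustrative names, not the course data). Values that are
# missing from `levels` silently become NA, so it is worth checking that every
# region name in the data appears in `regions`.

```r
x <- c("Asia", "Europe", "Oceania", "Asia")

# levels sets the order; labels renames the levels position by position
f <- factor(x, levels = c("Europe", "Asia"), labels = c("EU", "AS"))

levels(f)        # "EU" "AS"
as.character(f)  # "AS" "EU" NA   "AS"   ("Oceania" is not in levels)
sum(is.na(f))    # 1
```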
custom_colors <- c("#28AADC","#F2583F", "#76C0C1","#24576D", "#248E84",
                   "#DCC3AA","#96503F")
p <- ggplot(
  dat, aes(Percent.of.15plus.with.bank.account, SEDA.Current.level)) +
  geom_point(aes(fill = Region), color = "white", size = 4, pch = 21) +
  geom_smooth(method = "lm", se = FALSE, col = "black", size = 0.5) +
  scale_fill_manual(name = "", values = custom_colors) +
  coord_fixed(ratio = 0.4) +
  scale_x_continuous(name = "% of people aged 15+ with bank account, 2014",
                     limits = c(0, 100),
                     breaks = seq(0, 100, by = 20)) +
  scale_y_continuous(name = "SEDA Score, 100=maximum",
                     limits = c(0, 100),
                     breaks = seq(0, 100, by = 20)) +
  labs(title="Laughing all the way to the bank",
       subtitle="Well-being and financial inclusion* \n 2014-15")
p
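
The plot above relies on the colors being listed in the same order as the factor levels, because scale_fill_manual() assigns an unnamed values vector to the levels in order. A named vector makes the pairing explicit. The following is a minimal sketch on toy data; toy, pal, p_toy and ld are hypothetical names used only for illustration.

```r
library(ggplot2)

# Toy data: three groups, deliberately not in alphabetical row order
toy <- data.frame(
  x = 1:3,
  y = 1:3,
  grp = factor(c("b", "a", "c"), levels = c("a", "b", "c"))
)

# Named palette: each color is tied to a level by name, not by position
pal <- c(a = "#28AADC", b = "#F2583F", c = "#76C0C1")

p_toy <- ggplot(toy, aes(x, y)) +
  # shape 21 is a circle with a separate border color and fill
  geom_point(aes(fill = grp), shape = 21, color = "white", size = 4) +
  scale_fill_manual(name = "", values = pal)

# Inspect the fill actually assigned to each point, ordered by x
ld <- layer_data(p_toy, 1)
ld <- ld[order(ld$x), ]
ld$fill

p_toy
```

For the Economist plot, a named palette could be built with setNames(custom_colors, region_labels), assuming Region has already been relabelled as in the step above.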

To change the background and theme to match the ‘Economist style’, you can install the ggthemes package that implements the themes from:

  • Base graphics
  • Tableau
  • Excel
  • Stata
  • Economist
  • Wall Street Journal
  • Edward Tufte
  • Nate Silver’s Fivethirtyeight
  • etc.
#install.packages("ggthemes")
library(ggthemes)
(p <- p + theme_economist_white(gray_bg = FALSE))

Format the legend

# Assign the result back to p so the legend formatting is kept in later steps
(p <- p + theme(
  text = element_text(color = "grey35", size = 11),
  legend.text = element_text(size = 10),
  legend.position = c(0.72, 1.12),
  legend.direction = "horizontal") +
  guides(fill = guide_legend(ncol = 4, byrow = FALSE)))

Add point labels

# Choose a subset of countries
pointsToLabel <- c(
  "Yemen", "Iraq", "Egypt", "Jordan", "Chad", "Congo", "Angola", "Albania",
  "Zimbabwe", "Uganda", "Nigeria", "Uruguay", "Kazakhstan", "India", "Turkey",
  "South Africa", "Kenya", "Russia", "Brazil", "Chile", "Saudi Arabia", 
  "Poland", "China", "Serbia", "United States", "United Kingdom")
# install.packages("ggrepel")
library(ggrepel)
(p <- p + 
    geom_text_repel(
      aes(label = Country), color = "grey20",
      data = dat %>% filter(Country %in% pointsToLabel),
      force = 15))

