Lecture 2: Communicating and Programming in R

CME/STATS 195

Lan Huong Nguyen

October 2, 2018

Data science workflow
Communicating with R Markdown
Programming
- Style
- Control flow statements
- Functions

Data Science

Data Science Workflow

Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. ¹

Import
Wrangle (tidy & transform)
Visualize
Model
Communicate

`tidyverse`

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ²

tidyverse includes packages for importing, wrangling, exploring and modeling data.

The system is intended to make data scientists more productive. To use tidyverse do the following:

# Install the package
install.packages("tidyverse")
# Load it into memory
library("tidyverse")

The `tibble` package

The tibble package is part of the core tidyverse.

Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating.

Tibbles are data frames, tweaked to make life a little easier. Unlike regular data.frames, they:

never change the type of the inputs (e.g. do not convert strings to factors!)
never change the names of variables
never create row.names()
only recycle inputs of length 1

Using `tibbles`

To use functions from tibble and other tidyverse packages:

# load it into memory
library(tidyverse)

Printing a tibble is much nicer, and the output always fits into your window:

# e.g. a built-in dataset 'diamonds' is a tibble:
class(diamonds)

## [1] "tbl_df"     "tbl"        "data.frame"

diamonds

## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

Using `tibbles`

Creating tibbles is similar to creating data.frames, but there are no strict rules on column names:

(tb <- tibble(x = 1:5, y = 1, z = x ^ 2 + y, `:)` = "smile"))

## # A tibble: 5 x 4
##       x     y     z `:)` 
##   <int> <dbl> <dbl> <chr>
## 1     1     1     2 smile
## 2     2     1     5 smile
## 3     3     1    10 smile
## 4     4     1    17 smile
## 5     5     1    26 smile

Subsetting tibbles is stricter than subsetting data.frames, and ALWAYS returns objects of the expected class: a single [ returns a tibble, a double [[ returns a vector.

class(diamonds$carat)

## [1] "numeric"

class(diamonds[["carat"]])

## [1] "numeric"

class(diamonds[, "carat"])

## [1] "tbl_df"     "tbl"        "data.frame"
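
For contrast, here is a minimal sketch of how a base data.frame behaves under the same kind of subsetting. It uses base R only; the data frame `df` and its columns are made up for illustration.

```r
# A small base data.frame; df, x and y are hypothetical names for this sketch
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# A single-column [ subset of a data.frame is simplified to a vector ...
class(df[, "x"])
# [1] "integer"

# ... unless drop = FALSE is given, which keeps the data.frame structure
class(df[, "x", drop = FALSE])
# [1] "data.frame"

# [[ extracts the column itself, for data.frames and tibbles alike
class(df[["x"]])
# [1] "integer"
```

With a tibble, no `drop = FALSE` is needed: the single [ keeps the tibble class, as the `diamonds` output above shows.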

More on `tibbles`

You can read more about other tibble features by running the following in your R console:

vignette("tibble")

Communicating with R Markdown

R Markdown

R Markdown provides a unified authoring framework for data science, combining your code, its results, and your prose commentary.

R Markdown was designed to be used:

for communicating your conclusions with people who do not want to focus on the code behind the analysis.
for collaborating with other data scientists, who are interested in both your conclusions and the code.
as a modern day lab notebook for data science, where you can capture both your work and your thought process.

R Markdown source files

R Markdown files are plain text files with the “.Rmd” extension.

    ---
    title: "Title of my first document"
    date: "2018-09-27"
    output: html_document
    ---
    
    # Section title
    
    ```{r chunk-name, include = FALSE}
    library(tidyverse)
    summary(cars)
    ```
    
    ## Subsection title
    
    ```{r pressure, echo=FALSE}
    plot(pressure)
    ```
    
    Note that the `echo = FALSE` parameter was added to the code chunk to 
    prevent printing of the R code that generated the plot.

The document begins with a YAML header marked with dashes. You can add both code chunks and plain text. Sections and subsections are marked with hash signs (#).

Compiling R Markdown files

To produce a complete report containing all text, code, and results:

In RStudio, click on “Knit” or press Cmd/Ctrl + Shift + K.
From the R command line, type rmarkdown::render("filename.Rmd")

This will display the report in the viewer pane, and create a self-contained HTML file that you can share with others.

After compiling the R Markdown document from the previous slide, you get this html.

Viewing the report in RStudio

YAML header

A YAML header is a set of key: value pairs at the start of your file. Begin and end the header with a line of three dashes (---), e.g.

---
title: "Untitled"
author: "Anonymous"
output: html_document
---

You can tell R Markdown what type of document you want to render: html_document (default), pdf_document, word_document, beamer_presentation etc.

You can print a table of contents (toc) with the following:

---
title: "Untitled"
author: "Anonymous"
output: 
  html_document:
    toc: true
---

Text in R Markdown

In “.Rmd” files, prose is written in Markdown, a lightweight markup language with a plain-text formatting syntax.

Section headers/titles:

# 1st Level Header

## 2nd Level Header

### 3rd Level Header

Text formatting:

*italic* or _italic_
**bold** or __bold__

`code`
superscript^2^ and subscript~2~

Text in R Markdown

Lists:

* unordered list
* item 2
    + sub-item 1
    + sub-item 2 
    
1. ordered list 
1. item 2. The numbers are incremented automatically in the output.

Links and images:

<http://example.com> 

[linked phrase](http://example.com)

![optional caption text](path/to/img.png)

Text in R Markdown

Tables:

Table Header | Second Header
-------------| -------------
Cell 1       | Cell 2
Cell 3       | Cell 4

Math formulae

$\alpha$ is the first letter of the Greek alphabet.

Using $$ prints a centered equation on a new line.

$$\sqrt{\alpha^2 + \beta^2} = \frac{\gamma}{2}$$

Code chunks

In R Markdown, R code must go inside code chunks, e.g.:

  ```{r chunk-name}
      x <- runif(10)
      y <- 10 * x + 4
      plot(x, y)
  ```

Keyboard shortcuts:

Insert a new code chunk: Ctrl/Cmd + Alt + I
Run current chunk: Ctrl/Cmd + Shift + Enter
Run current line (where the cursor is): Ctrl/Cmd + Enter

Chunk Options:

Chunk output can be customized with options supplied in the chunk header. Some non-default options are:

eval = FALSE : prevents code from being evaluated
include = FALSE : runs the code, but hides code and its output in the final document
echo = FALSE : hides the code, but not the results, in the final document
message = FALSE : hides messages
warning = FALSE : hides warnings
results = 'hide' : hides printed output
fig.show = 'hide' : hides plots
error = TRUE : does not stop rendering if error occurs
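
Options that should apply to every chunk do not have to be repeated in each header. A minimal sketch, assuming the knitr package (which R Markdown uses to run chunks) is installed, is to set defaults once with `knitr::opts_chunk$set()`; the particular options chosen here are only an example.

```r
# Typically placed in a first "setup" chunk of the document.
# The defaults below then apply to all following chunks,
# unless a chunk header overrides them.
knitr::opts_chunk$set(
  echo = FALSE,     # hide the code by default
  message = FALSE,  # hide messages
  warning = FALSE   # hide warnings
)
```

An individual chunk can still show its code by putting `echo = TRUE` in its own header.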

Inline code

You can evaluate R code in the middle of your text by wrapping it in backticks and starting it with the letter r:

  There are 26 letters in the alphabet, and 12 months in each year. 
  Today, there are `r as.Date("2019-08-23") - Sys.Date()` days left till my next birthday.

There are 26 letters in the alphabet, and 12 months in each year. Today, there are 325 days left till my next birthday.

More on R Markdown

R Markdown is relatively young, and growing rapidly.

Official R Markdown website: http://rmarkdown.rstudio.com

Some R Markdown advice

See your future self as a collaborator.
Ensure each notebook has a descriptive title and name.
Use the header date to record the start time.
Keep track of failed attempts.
If you discover an error in a data file, write code to fix it.
Regularly knit the notebook.
Set a random seed before sampling.
Keep track of the versions of the packages you use, e.g. by including the sessionInfo() command at the end of your document.

All of the above will help you increase the reproducibility of your work.
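
The last two points can be sketched in a few lines of base R. The seed value 195 and the variable names are arbitrary choices for this illustration.

```r
# Setting a seed makes random sampling repeatable
set.seed(195)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces exactly the same draw
set.seed(195)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)
# [1] TRUE

# Records the R version and the versions of attached packages;
# usually placed at the very end of the document
sessionInfo()
```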

Programming: style guide

Naming conventions

The first step of programming is naming things.
In the “Hadley Wickham” R style convention:

File names are meaningful. Script files end with “.R”, and R Markdown with “.Rmd”

# Good
fit-models.R
utility-functions.R

# Bad (works but does not follow style convention)
foo.r
stuff.r

Variable and function names are lowercase, concise, and use underscores to separate words.

# Good
day_one
day_1

# Bad (works but does not follow style convention)
first_day_of_the_month
DayOne

Spacing

Spacing around all infix operators (=, +, -, <-, etc.):

average <- mean(feet / 12 + inches, na.rm = TRUE)     # Good
average<-mean(feet/12+inches,na.rm=TRUE)              # Bad

Spacing before left parentheses, except in a function call

# Good
if (debug) do(x)
plot(x, y)

# Bad
if(debug)do(x)
plot (x, y)

For assignment, use ‘<-’, not ‘=’:

# Good
x <- 1 + 2

# Bad (works but does not follow style convention)
x = 1 + 2

Curly braces

An opening curly brace “{” should never go on its own line, and should always be followed by a new line.
A closing curly brace “}” should go on its own line, unless it is followed by else.
Indent the code inside curly braces.
It’s ok to leave very short statements on the same line.

# Good
if (y < 0 && debug) {
  message("Y is negative")
}
if (y == 0) {
  log(x)
} else {
  y ^ x
}

# Bad
if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

# OK (a very short statement can stay on the same line)
if (y < 0 && debug) message("Y is negative")

Comments and documentation

Comment your code!

# 'get_answer' returns the answer to life, the universe and everything else.
get_answer <- function() { return(42) }    # This is a comment

Comments are not subtitles, i.e. don’t repeat the code in the comments.

# Bad comments:
# Loop through all bananas in the bunch
for (banana in bunch) {
  # make the monkey eat one banana
  MonkeyEat(banana)
}

Use dashes to separate blocks of code:

# Generate Data ---------------------------
x <- rnorm(100)
y <- 12 * x + 5

# Plot Data -------------------------------
plot(x, y)

Programming: control flow

Booleans/logicals

Booleans are logical data types (TRUE/FALSE) associated with conditional statements, which allow different actions and change control flow.

# equal: "=="
5 == 5

## [1] TRUE

# not equal: "!="
5 != 5

## [1] FALSE

# greater than: ">"
5 > 4

## [1] TRUE

# greater than or equal: ">=" (similarly "<" and "<=")
5 >= 5

## [1] TRUE

# You can combine multiple boolean expressions
TRUE & TRUE

## [1] TRUE

TRUE & FALSE

## [1] FALSE

TRUE | FALSE

## [1] TRUE

!(TRUE)

## [1] FALSE

Booleans/logicals

In R, to combine two vectors of booleans element by element, use & (or |). Remember the recycling property of vectors.

c(TRUE, TRUE) & c(FALSE, TRUE)

## [1] FALSE  TRUE

c(5 < 4, 7 == 0, 1 < 2) | c(5 == 5, 6 > 2, !FALSE)

## [1] TRUE TRUE TRUE

c(TRUE, TRUE) & c(TRUE, FALSE, TRUE, FALSE)  # recycling

## [1]  TRUE FALSE  TRUE FALSE

Booleans/logicals

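The double operators && and || are meant for single TRUE/FALSE values, e.g. in if conditions. With longer vectors, older versions of R compared only the first elements, as in the output below; since R 4.3.0 this is an error instead.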

c(TRUE, TRUE) && c(FALSE, TRUE)

## [1] FALSE

c(5 < 4, 7 == 0, 1 < 2) || c(5 == 5, 6 > 2, !FALSE)

## [1] TRUE

c(TRUE, TRUE) && c(TRUE, FALSE, TRUE, FALSE)

## [1] TRUE
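
In practice, && and || are used with single logical values, where they also "short-circuit": the right-hand side is evaluated only if the left-hand side does not already decide the result. A minimal sketch, with made-up variable names:

```r
x <- NULL  # hypothetical input that may be missing

# The left-hand side is FALSE, so the comparison x > 0 is never evaluated
is_positive <- !is.null(x) && x > 0
is_positive
# [1] FALSE

# For longer vectors, first reduce each side to a single value
values <- c(3, 8, -1)
all(values > 0) || any(values > 5)
# [1] TRUE
```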

Booleans/logicals

Another possibility to combine booleans is to use all() or any() functions:

all(c(TRUE, FALSE, TRUE))

## [1] FALSE

any(c(TRUE, FALSE, TRUE))

## [1] TRUE

all(c(5 > -1, 3 >= 1, 1 < 1))

## [1] FALSE

any(c(5 > -1, 3 >= 1, 1 < 1))

## [1] TRUE

Control statements

Control flow is the order in which individual statements, instructions or function calls of a program are evaluated.
Control statements allow you to do more complicated tasks.
Their execution results in a choice between which of two or more paths should be followed.
- If / else
- For
- While

If statements

An if statement decides whether a block of code should be executed, based on the associated boolean expression.
Syntax: the if keyword is followed by a boolean expression wrapped in parentheses. The conditional block of code goes inside curly braces {}.

if (traffic_light == "green") {
  print("Go.")
}

‘if-else’ statements let you introduce more options

if (traffic_light == "green") {
   print("Go.")
} else {
   print("Stay.")
}

You can also use else if()

if (traffic_light == "green") {
  print("Go.")
} else if (traffic_light == "yellow") {
  print("Get ready.")
} else {
  print("Stay.")
}
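
The snippets above assume that a variable `traffic_light` already exists. A runnable sketch wraps the same logic in a function; the name `traffic_action` is made up for this example, and the strings are returned rather than printed.

```r
# Returns the action for a given traffic light colour
traffic_action <- function(traffic_light) {
  if (traffic_light == "green") {
    "Go."
  } else if (traffic_light == "yellow") {
    "Get ready."
  } else {
    "Stay."
  }
}

traffic_action("green")
# [1] "Go."
traffic_action("yellow")
# [1] "Get ready."
traffic_action("red")
# [1] "Stay."
```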

Switch statements

For a very long sequence of if statements, use the switch() function:

operator <- function(x, y, op) {
  switch(as.character(op),
    '+' = x + y,
    '-' = x - y,
    '*' = x * y,
    '/' = x / y,
    stop("Unknown op!")
  )
}

operator(2, 7, '+')

## [1] 9

operator(2, 7, '-')

## [1] -5

operator(2, 7, '/')

## [1] 0.2857143

operator(2, 7, "a")

## Error in operator(2, 7, "a"): Unknown op!
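
switch() also lets several values share one branch: an option with an empty right-hand side "falls through" to the next non-empty one. A small sketch, where the function `day_type` and its labels are invented for illustration:

```r
day_type <- function(day) {
  switch(day,
    "Sat" = ,            # empty: falls through to the next value
    "Sun" = "weekend",
    "weekday"            # unnamed last argument is the default
  )
}

day_type("Sat")
# [1] "weekend"
day_type("Mon")
# [1] "weekday"
```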

For loops

A for loop is a statement which repeats the execution of a block of code for a given number of iterations.

for (i in 1:5) {
  print(i^2)
}

## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
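
Often the results of a loop need to be stored rather than printed. A common pattern, sketched here with made-up names, is to create the output vector first and fill it by index; seq_along() generates the indices of a vector.

```r
values <- c(2, 4, 6)

# Allocate the output vector before the loop
squares <- numeric(length(values))

# seq_along(values) is 1, 2, 3 here
for (i in seq_along(values)) {
  squares[i] <- values[i]^2
}

squares
# [1]  4 16 36
```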

While loops

While loops are similar to for loops, but repeat the execution as long as the supplied boolean condition is TRUE.

i <- 1
while (i <= 5) {
  cat("i =", i, "\n")
  i <- i + 1
}

## i = 1 
## i = 2 
## i = 3 
## i = 4 
## i = 5
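
A while loop is most useful when the number of iterations is not known in advance. The sketch below, with arbitrary starting values, halves a number until it is no longer greater than 1 and counts the steps.

```r
value <- 100
steps <- 0

# Keep halving as long as the condition holds
while (value > 1) {
  value <- value / 2
  steps <- steps + 1
}

steps
# [1] 7
value
# [1] 0.78125
```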

Next

next halts the processing of the current iteration and advances the looping index.

for (i in 1:10) {
  if (i <= 5) {
    print("skip")
    next
  }
  cat(i, "is greater than 5.\n")
}

## [1] "skip"
## [1] "skip"
## [1] "skip"
## [1] "skip"
## [1] "skip"
## 6 is greater than 5.
## 7 is greater than 5.
## 8 is greater than 5.
## 9 is greater than 5.
## 10 is greater than 5.

next applies only to the innermost of nested loops.

for (i in 1:3) {
  cat("Outer-loop i: ", i, ".\n")
  for (j in 1:4) {
    if (j > i) {
      print("skip")
      next
    }
    cat("Inner-loop j:", j, ".\n")
  }
}

## Outer-loop i:  1 .
## Inner-loop j: 1 .
## [1] "skip"
## [1] "skip"
## [1] "skip"
## Outer-loop i:  2 .
## Inner-loop j: 1 .
## Inner-loop j: 2 .
## [1] "skip"
## [1] "skip"
## Outer-loop i:  3 .
## Inner-loop j: 1 .
## Inner-loop j: 2 .
## Inner-loop j: 3 .
## [1] "skip"

Break

The break statement allows us to break out of the smallest enclosing for or while loop.
The control is transferred to the first statement outside the inner-most loop.

for (i in 1:10) {
  if (i == 6) {
    print(paste("Coming out from for loop Where i = ", i))
    break
  }
  print(paste("i is now: ", i))
}

## [1] "i is now:  1"
## [1] "i is now:  2"
## [1] "i is now:  3"
## [1] "i is now:  4"
## [1] "i is now:  5"
## [1] "Coming out from for loop Where i =  6"

Exercise 1

Go to “Lec2_Exercises.Rmd” in RStudio.
Do Exercise 1.

Programming: functions

What is a function in R?

A function is a procedure/routine that performs a specific task.
Functions are used to abstract components of a larger program.
Similarly to mathematical functions, they take some input and then do something to find the result.
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
If you’ve copied and pasted a block of code more than twice, you should use a function instead.

Why should you use functions?

Functions become very useful as soon as your code becomes long enough.

Functions will make your code easier to understand.
Errors are less likely to occur and are easier to fix.
For repeated tasks, changes can be made once by editing a function and not many distant chunks of code. Example:

set.seed(1)
a <- rnorm(10); b <- rnorm(10); c <- rnorm(10); d <- rnorm(10)
# Bad (copy-and-paste is error-prone: the line for c below mistakenly uses max(b))
a <- (a - min(a, na.rm = TRUE)) /
  (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
b <- (b - min(b, na.rm = TRUE)) /
  (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
c <- (c - min(c, na.rm = TRUE)) /
  (max(b, na.rm = TRUE) - min(c, na.rm = TRUE))
d <- (d - min(d, na.rm = TRUE)) /
  (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))

# Good
rescale_data <- function(x) {
  rng <- range(x, na.rm = TRUE)
  return((x - rng[1]) / (rng[2] - rng[1]))
}
a <- rescale_data(a)
b <- rescale_data(b)
c <- rescale_data(c)
d <- rescale_data(d)
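
As a quick check of what the function does, here is a small sketch on a hand-made vector. The input values are arbitrary, and the function is repeated so that the block runs on its own.

```r
rescale_data <- function(x) {
  rng <- range(x, na.rm = TRUE)
  return((x - rng[1]) / (rng[2] - rng[1]))
}

# The minimum maps to 0, the maximum to 1, and missing values stay missing
rescaled <- rescale_data(c(5, 10, NA, 20))
rescaled
# [1] 0.0000000 0.3333333        NA 1.0000000
```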

Function Definition

To define a function you assign a variable name to a function object.
Functions take arguments, mandatory and optional.
Provide a brief description of your function in comments before the function definition.

# Computes mean and standard deviation of a vector,
# and optionally prints the results.
summarize_data <- function(x, print = FALSE) {
  center <- mean(x)
  spread <- sd(x)
  if (print) {
    cat("Mean =", center, "\n",
        "SD   =", spread, "\n")
  }
  list(mean = center, sd = spread)
}

Calling functions

# without printing
x <- rnorm(n = 500, mean = 4, sd = 1) 
y <- summarize_data(x)

# with printing
y <- summarize_data(x, print = TRUE)

## Mean = 4.009679 
##  SD   = 1.01561

# Results are stored in list "y"
y$mean

## [1] 4.009679

y$sd

## [1] 1.01561

# The order of arguments does not matter if the names are specified
y <- summarize_data(print = FALSE, x = x)

Explicit return statements

The value returned by a function is the value of the last expression it evaluates. You can choose to return early by using return(); this can make your code easier to read.

# Complicated function simplified by the use of early return statements
complicated_function <- function(x, y, z) {
  # Check some condition
  if (length(x) == 0 || length(y) == 0) {
    return(0)
  }
  # Complicated code here
}
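
A complete, runnable variant of the same idea is sketched below. The function `safe_ratio` and its choice of returning NA for empty input are assumptions made for this example, not part of the lecture.

```r
# Ratio of the sums of two vectors, with an early exit for empty input
safe_ratio <- function(x, y) {
  # Handle the special case first and leave the function immediately
  if (length(x) == 0 || length(y) == 0) {
    return(NA_real_)
  }
  # The main computation; its value is returned as the last expression
  sum(x) / sum(y)
}

safe_ratio(numeric(0), c(1, 2))
# [1] NA
safe_ratio(c(1, 2), c(3, 3))
# [1] 0.5
```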

Returning invisible objects can be done with invisible()

show_missings <- function(df) {
  cat("Missing values:", sum(is.na(df)), "\n")
  invisible(df)       # this result doesn’t get printed out
}

show_missings(mtcars)

## Missing values: 0

dim(show_missings(mtcars))

## Missing values: 0

## [1] 32 11

Environment

The environment of a function controls how R finds an object associated with a name.

f <- function(x) {
  x + y
}

R uses rules called lexical scoping to find the value associated with a name. Here, R will look for y in the environment where the function was defined.

y <- 100
f(10)

## [1] 110

This behaviour attracts bugs. You should try to avoid using global variables.
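
A short sketch of why this is fragile, and of one way around it: the same call to f() gives different results when the global y changes, whereas a version that takes y as an argument does not depend on anything outside the function. The name `f_explicit` is made up for this example.

```r
f <- function(x) {
  x + y   # y is not an argument: it is looked up where f was defined
}

y <- 100
f(10)
# [1] 110

y <- 1
f(10)     # same call, different result
# [1] 11

# Passing y explicitly removes the hidden dependency
f_explicit <- function(x, y) {
  x + y
}
f_explicit(10, y = 100)
# [1] 110
```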

apply, lapply, sapply functions

The apply family functions manipulate slices of data stored as matrices, arrays, lists and data frames in a repetitive way.
These functions avoid the explicit use of loops, and might be more computationally efficient, depending on how big a dataset is. For more details on runtimes see this link.
The apply functions allow you to perform operations with very few lines of code.
The family comprises: apply, lapply, sapply, vapply, mapply, rapply, and tapply. The difference lies in the structure of the input data and the desired format of the output.

apply function

apply operates on arrays/matrices.

In the example below we obtain column sums of matrix X.

(X <- matrix(sample(30), nrow = 5, ncol = 6))

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   11   21   10   16    7   15
## [2,]   30   13   14   27   23    2
## [3,]   18    3    5    8    4   28
## [4,]    1   20    6   24   26   25
## [5,]   19    9   12   29   22   17

apply(X, MARGIN = 2 , FUN = sum)

## [1]  79  66  47 104  82  87

Note that for a matrix, MARGIN = 1 indicates rows and MARGIN = 2 indicates columns.
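
Because X above is random, a small fixed matrix makes the two margins easier to verify by hand. This is only an illustrative sketch; the matrix M is made up.

```r
# matrix() fills column by column: the rows are (1, 3, 5) and (2, 4, 6)
M <- matrix(1:6, nrow = 2)

apply(M, MARGIN = 1, FUN = sum)   # one sum per row
# [1]  9 12

apply(M, MARGIN = 2, FUN = sum)   # one sum per column
# [1]  3  7 11

# For sums, base R also offers the shortcuts rowSums() and colSums()
all(apply(M, 1, sum) == rowSums(M))
# [1] TRUE
```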

apply function

apply can be used with user-defined functions:

# multiply each entry by 10 and add 2
apply(X, 2, function(x) 10*x + 2)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  112  212  102  162   72  152
## [2,]  302  132  142  272  232   22
## [3,]  182   32   52   82   42  282
## [4,]   12  202   62  242  262  252
## [5,]  192   92  122  292  222  172

A function can also be defined outside apply():

# Mean of x, optionally shifted by eps (despite the name, no logarithm is taken)
logColMeans <- function(x, eps = NULL) {
  if (!is.null(eps)) x <- x + eps
  return(mean(x))
}
apply(X, 2, logColMeans)

## [1] 15.8 13.2  9.4 20.8 16.4 17.4

apply(X, 2, logColMeans, eps = 0.1)

## [1] 15.9 13.3  9.5 20.9 16.5 17.5

lapply/sapply functions

lapply() is used to repeatedly apply a function to elements of a sequential object such as a vector, list, or data frame (for a data frame, the function is applied to each column).
The output is a list with the same number of elements as the input object.

# lapply returns a list
lapply(1:3, function(x) x^2)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9

sapply is the same as lapply but returns a “simplified” output.

sapply(1:3, function(x) x^2)

## [1] 1 4 9

As with apply(), user-defined functions can be used with sapply/lapply.
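
A sketch of this with the columns of a data frame is shown below. The helper `col_range` is a made-up example function, and mtcars is a built-in dataset. vapply(), another member of the family, works like sapply() but requires the expected type and length of each result to be stated.

```r
# Difference between the largest and smallest value of a vector
col_range <- function(x) {
  max(x) - min(x)
}

cars_subset <- mtcars[, c("mpg", "hp", "wt")]

# lapply: a named list with one element per column
range_list <- lapply(cars_subset, col_range)

# sapply: the same results simplified to a named numeric vector
range_vec <- sapply(cars_subset, col_range)

# vapply: like sapply, but each result must be a single number
range_checked <- vapply(cars_subset, col_range, FUN.VALUE = numeric(1))

identical(range_vec, range_checked)
# [1] TRUE
```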

Functional Programming

The idea of passing a function to another function is an extremely powerful idea, and it’s one of the behaviours that makes R a functional programming (FP) language.

The apply family of functions in base R are basically tools to extract out the duplicated code of for loops, so each common for loop pattern gets its own function.

The package purrr in the tidyverse framework solves similar problems, more in line with the ‘tidyverse-philosophy’. We will learn about it in the following lectures.
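
Both ideas can be sketched in a few lines of base R. The function `apply_twice` and the variable names are invented for this illustration.

```r
# A function that takes another function as its first argument
apply_twice <- function(f, x) {
  f(f(x))
}

apply_twice(sqrt, 16)   # sqrt(sqrt(16))
# [1] 2

# The same computation written as an explicit loop ...
squares_loop <- numeric(3)
for (i in 1:3) {
  squares_loop[i] <- i^2
}

# ... and with the loop pattern handed over to sapply()
squares_apply <- sapply(1:3, function(i) i^2)

identical(squares_loop, squares_apply)
# [1] TRUE
```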

Exercise 2 and 3

Go back to “Lec2_Exercises.Rmd” in RStudio.
Do Exercise 2 and 3.
