R Basics

Matthew DeHaven

2024-01-01

Lecture Goals

  • R Basics
    • Variable assignment and types
    • Data.frames
  • Control Flow
    • If statements, loops, apply

What is R?

  • a programming language
  • open source (free!)

Why is it called R?

  • Created in the 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland (in New Zealand)
  • A successor to the programming language “S”

Created as a programming language for statistics and graphics.

Why R for this class?

  • Created as a programming language for statistics and graphics

    • This is what we do as economists!
  • Used more outside academia than Stata
  • IMO, easier to learn than Python or Julia

    • But also is a good stepping stone to those languages
  • It is the language I know best

Why R for this class?

Chart showing Usage Shares of Programming Languages in Economics Research.

Taken from Economics and R Blog of Sebastian Kranz. Calculated from file extensions used in code files for published papers.

R Basics

Arithmetic

2 + 2
[1] 4
5 - 2
[1] 3
2 * 4
[1] 8
9 / 2
[1] 4.5
2 ^ 2
[1] 4

Arithmetic

Order of operations as expected

2 + 2 * 2 ^ 2
[1] 10

Modulo

9 %/% 2
[1] 4
9 %% 2
[1] 1
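
The two operators fit together: %/% gives the whole number of times the divisor fits, and %% gives what is left over. A small sketch of that relationship (the variable names and values are illustrative, not from the lecture):

```r
total_minutes <- 135                      # illustrative value
hours   <- total_minutes %/% 60           # whole hours: 2
minutes <- total_minutes %% 60            # leftover minutes: 15
hours * 60 + minutes == total_minutes     # TRUE: the two pieces rebuild the original

is_even <- (10 %% 2) == 0                 # TRUE: a remainder of 0 means divisible by 2
```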

Logic

Booleans are objects that are either TRUE or FALSE

They are very useful.

2 > 3
[1] FALSE
2 < 3
[1] TRUE
2 == 2
[1] TRUE
2 >= 2
[1] TRUE

Logic

  • “And” operator &
  • “Or” operator |
TRUE & FALSE
[1] FALSE
TRUE | FALSE
[1] TRUE
  • “Negation” operator !
!FALSE
[1] TRUE
!TRUE
[1] FALSE

Logic

Value Matching %in%

Checks whether the object on the left is “in” the object on the right

2 %in% 1:10
[1] TRUE
2 %in% 5:10
[1] FALSE

What is the : operator? “Sequence”

1:10
 [1]  1  2  3  4  5  6  7  8  9 10

Logic

Computers handle decimals in odd ways!

0.2 + 0.2 == 0.4
[1] TRUE
0.1 + 0.2 == 0.3
[1] FALSE

Because computers store numbers in binary, they cannot represent 0.1 exactly.

This is the same as how we cannot write 1/3 exactly in base 10.

all.equal(0.1 + 0.2, 0.3)
[1] TRUE
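
all.equal() compares two numbers up to a small tolerance. When the values differ it returns a description of the difference rather than FALSE, so a common pattern is to wrap it in isTRUE() before using it as a condition. A minimal sketch (the hand-picked tolerance is an assumption, not a recommended value):

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                      # FALSE: exact comparison of two approximations
isTRUE(all.equal(a, b))     # TRUE: equal within a small tolerance
isTRUE(all.equal(a, 0.31))  # FALSE: all.equal() alone would return a message here

# The same idea written by hand, with a tolerance we pick ourselves
tol <- 1e-8
abs(a - b) < tol            # TRUE
```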

Assignment

In R, we use a special “arrow” operator for assignment: <-

x <- 2
x
[1] 2

This lets us declare new variables or objects.

y <- 3

x + y
[1] 5

Why not use =?

Technically you can in R, but <- is preferred because = is also used to pass argument values in a function call.

Sequences

We have seen the “sequence” operator :

Which it turns out is just a shortcut for seq()

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10)
 [1]  1  2  3  4  5  6  7  8  9 10

But the function gives more options

seq(1, 10, by = 2)
[1] 1 3 5 7 9
seq(1, 10, length.out = 4)
[1]  1  4  7 10
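
A few more variations on the same arguments, as a quick sketch of what by and length.out allow:

```r
seq(10, 1, by = -3)          # counts down: 10 7 4 1
seq(0, 1, by = 0.25)         # decimal steps: 0.00 0.25 0.50 0.75 1.00
seq(2, 20, length.out = 3)   # evenly spaced: 2 11 20
10:1                         # the shortcut also counts down
```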

Help!

How do we know all of the options for a function?

Documentation!

All R functions have a help file explaining them, which can be accessed using

help(seq)

or

?seq

This documentation is also hosted online

Comments

Commenting code can be useful to yourself and others.

In R, a comment is anything on a line that follows a #

# This is a comment

## Also a comment

In VS Code

  • comment out a block of code: ⌘k⌘c
  • uncomment a block of code: ⌘k⌘u
## I use ## for actual comments
## and # for commenting out code
# x <- 5
# x

Comment Malpractice: All cap notes

Never use the following:

## TODO

## HARD CODED

## CHANGE WHEN XXXXX

## UPDATED ON XXXXX

## Set to X to run Y analysis

All of this is important information!

But it shouldn’t be stored as a comment.

Instead we should use a task manager, variables declared at the top of a script, git for version control, etc.
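
As one possible sketch of the "variables declared at the top of a script" idea, the settings below replace all-caps notes scattered through the code. Every name and value here is hypothetical.

```r
## Settings (all names and values here are hypothetical)
sample_start_year <- 1990     # instead of a "HARD CODED" note next to a magic number
run_robustness    <- FALSE    # instead of "Set to X to run Y analysis"

## Analysis
years <- sample_start_year:2000
length(years)                 # 11 years in the sample
```

Changing the analysis then means editing one clearly labeled line rather than hunting for a comment.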

Comment Malpractice: Duplicating the Code

You may be tempted to add this sort of comment

## Adding 5 to my variable and squaring it
x <- (x + 5)^2

But what happens when you decide to change your code?

## Adding 5 to my variable and squaring it
x <- (x + 5)^3

If someone else reads your code, do they trust

  • the comment, or
  • the code?

Code forces you to do exactly what you say (i.e. square, not cube). But comments do not, so they tend to get out of sync with the code.

Commenting Best Practices

“Good code does not need comments”

This is the goal.

Your code should be readable without any comments.

But that’s probably unrealistic for most of us.

Some good rules:

  1. Comments should not duplicate code
  2. Good comments do not excuse unclear code
  3. Comments should dispel confusion, not cause it

My commenting practice

I like to use comments to give sections to my code

## Reading in the data


## Running regressions


## Making Charts


## Storing results

I find this useful as a way to structure my code and make it more readable later on.

Data Types

Data Types in R

  • Character

  • Logical (“boolean”)

  • Integer

  • Numeric

  • Complex (imaginary numbers)

  • Raw (bytes)

class("text")
[1] "character"

Character type

Characters (a.k.a. “strings”) store text information

You can create a character variable with either '' or ""

x <- 'some text'
y <- "some text"
class(x)
[1] "character"
class(y)
[1] "character"

Logical type

We saw logical types before. Stored as TRUE or FALSE.

class(TRUE)
[1] "logical"
class(FALSE)
[1] "logical"

A special kind of logical value in R is the missing value, stored as NA

class(NA)
[1] "logical"

Missing values usually create more missing values

NA == 5
[1] NA
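
A short sketch of how missing values spread through calculations and comparisons, how to test for them, and the few logical cases where the answer does not depend on the missing value:

```r
NA + 1         # NA
NA > 3         # NA
NA == NA       # NA: even comparing two missing values is "unknown"
is.na(NA)      # TRUE: the reliable way to test for a missing value

# Logical cases where the result is known whatever the missing value is
NA & FALSE     # FALSE
NA | TRUE      # TRUE
```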

Integer type

All whole numbers (no decimal): (…, -2, -1, 0, 1, 2, …)

An exact number storage, compared to the approximate “numeric” type.

To create an integer value, add an L at the end of the number

class(1)
[1] "numeric"
class(1L)
[1] "integer"

Can be useful for setting ID values,

but usually we will store numbers as “numeric” type instead.
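
A brief sketch of when a value stays an integer and when R switches to numeric:

```r
class(2L + 3L)        # "integer": integer arithmetic stays integer
class(2L + 3)         # "numeric": mixing with a numeric converts the result
class(1:10)           # "integer": the sequence operator creates integers
is.integer(5)         # FALSE: 5 without the L is numeric
.Machine$integer.max  # 2147483647, the largest integer R can store
```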

Numeric

Numeric is a class that stores numbers as floating point values.

  • think of this as a very long decimal, but not infinite

In R, “double” is the only floating point type.

Equivalent to “float64” in other languages.

There used to be a “single” precision. Equivalent to “float32”.

class(1)
[1] "numeric"
class(1.23143)
[1] "numeric"
class(pi)
[1] "numeric"
class(1L / 3L)
[1] "numeric"

Converting from one type to another

R has a full set of as.___() functions for each type.

x <- 12
as.character(x)
[1] "12"
as.integer(x)
[1] 12
as.raw(x) # Hexadecimal representation of bytes
[1] 0c

Sometimes the conversion is not as expected.

as.logical(x)
[1] TRUE

Sometimes the conversion fails and returns a missing value (with a warning).

as.numeric("text")
[1] NA
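
A few more conversions that are worth knowing before relying on them, shown as a small sketch:

```r
as.integer(3.9)         # 3: the decimal part is dropped, not rounded
as.logical(0)           # FALSE: zero is FALSE, any other number is TRUE
as.logical("TRUE")      # TRUE
as.logical("yes")       # NA: not a spelling R recognizes
as.numeric("1e3")       # 1000: scientific notation in text is understood
```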

Automatic Conversions

Some languages are very strict about data types; R is not.

This is convenient, but somewhat dangerous.

paste() takes multiple strings and pastes them together

paste("Hello", "World!")
[1] "Hello World!"

R will try to convert other types to a string to paste.

paste("Hello", 123)
[1] "Hello 123"
paste("Hello", TRUE)
[1] "Hello TRUE"

This also happens for math operations. Can be unexpected.

TRUE * FALSE + 5
[1] 5
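
The conversion of TRUE to 1 and FALSE to 0 is often used on purpose, to count or average the results of a comparison. A minimal sketch with illustrative values:

```r
scores <- c(3, 8, 5, 10)    # illustrative values
above_four <- scores > 4    # FALSE TRUE TRUE TRUE
sum(above_four)             # 3: how many scores are above 4
mean(above_four)            # 0.75: the share of scores above 4
```

Not every conversion is automatic, though: "1" + 1 is an error, because R will not turn a character into a number for arithmetic.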

Data Structures

Data Structures

  • vector
  • matrix / array
  • list
  • data.frame
  • factor

Vectors

Vectors are an ordered set of values all of the same type.

They are created with the c() function (for “concatenate”).

x <- c(1, 5, 3, 7, 10)
x
[1]  1  5  3  7 10
is.vector(x)
[1] TRUE

Technically, the single values we have seen so far are vectors of length 1.

is.vector(1)
[1] TRUE
class(c(1, 5, 3))
[1] "numeric"

Vectors

Vectors all have lengths.

x <- c("a", "d", "b", "z")
length(x)
[1] 4

Vectors can have names for each element.

names(x) <- c("NameOne", "NameTwo", "NameThree", "NameFour")
x
  NameOne   NameTwo NameThree  NameFour 
      "a"       "d"       "b"       "z" 

Vectors can have NA values, but otherwise, no mixing types.

c(1, 2, "a")
[1] "1" "2" "a"
c(1, 2, NA)
[1]  1  2 NA

Accessing Vector Elements

Vector elements can be accessed by their position or name,

using square brackets [].

x <- c("providence", "boston", "new york")
x[2]
[1] "boston"
names(x) <- c("p", "b", "ny")
x["b"]
       b 
"boston" 

You can select multiple elements if you wish.

x[c(1,2)]
           p            b 
"providence"     "boston" 

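Square brackets accept more than positions and names. The sketch below shows negative positions, a logical condition, and assignment to an element, using an illustrative vector:

```r
v <- c(1, 5, 3, 7, 10)
v[-1]                  # drop the first element: 5 3 7 10
v[v > 4]               # keep elements where the condition is TRUE: 5 7 10
v[c(4, 1)]             # positions can be in any order: 7 1
v[2] <- 50             # brackets also work on the left of an assignment
v                      # 1 50 3 7 10
```
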
Applying Functions to Vectors

A lot of functions are “vectorized” to apply to each element.

x <- c(1, 5, NA, 3, 8)

x^2
[1]  1 25 NA  9 64
1 + x
[1]  2  6 NA  4  9
is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE

Other functions instead take in a whole vector and return a single value.

sum(x)
[1] NA
sum(x, na.rm = TRUE)
[1] 17

We will learn later how to vectorize any function.

Matrices and Arrays

Vectors are only one dimensional.

Matrices have 2 dimensions.

matrix(1:4, nrow = 2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Arrays have ‘n’ dimensions.

array(1:8, dim = c(2,2,2))
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

Lists

What if I need to store a mix of data types? - Use a list!

x <- list(1, 2, "a", TRUE)
x
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] "a"

[[4]]
[1] TRUE

Each element has preserved its type!

Accessing Lists

We can again access list elements by their index position.

x <- list(1, 2, "a", TRUE)
x[3]
[[1]]
[1] "a"

Note: x[3] returns a list with one element

Use double square brackets to return the element instead.

x[[3]]
[1] "a"
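
The difference between the two kinds of brackets is easiest to see by checking the class of what comes back. A minimal sketch using the same list:

```r
x <- list(1, 2, "a", TRUE)
class(x[3])        # "list": single brackets keep the list wrapper
class(x[[3]])      # "character": double brackets return the element itself
length(x[2:3])     # 2: single brackets can select several elements
x[[1]] + x[[2]]    # 3: arithmetic needs the elements, not one-element lists
```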

Named lists

Lists can also have names for each element. We can assign them using names() or at construction.

x <- list(p = "Providence", b = "Boston", nyc = "New York")
x
$p
[1] "Providence"

$b
[1] "Boston"

$nyc
[1] "New York"

We can access list elements using the dollar sign $, followed by the element name.

x$b
[1] "Boston"

Lists of vectors

We saw earlier that 1 element objects are actually vectors.

This means that we can have lists of multiple element vectors.

x <- list(v1 = c(1, 3, 9, 5), v2 = c("A", "D", "E"))
x
$v1
[1] 1 3 9 5

$v2
[1] "A" "D" "E"

Lists of lists

We can also have lists of lists!

x <- list(l1 = list(5:7, c("A", "D", "E")), l2 = list(1:5, TRUE))
x
$l1
$l1[[1]]
[1] 5 6 7

$l1[[2]]
[1] "A" "D" "E"


$l2
$l2[[1]]
[1] 1 2 3 4 5

$l2[[2]]
[1] TRUE

This can be as many list layers deep as you want.
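
To reach a value inside a nested list, chain the access operators from the outside in. A short sketch using the list above:

```r
x <- list(l1 = list(5:7, c("A", "D", "E")), l2 = list(1:5, TRUE))
x$l1[[2]]           # "A" "D" "E": second element of l1
x$l1[[2]][3]        # "E": third value of that vector
x[["l2"]][[1]][2]   # 2: the same idea using a name in double brackets
```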

Lists vs. vectors

Lists are much more general than vectors.

So why use vectors?

  • Vectors, by enforcing a single type, are much more efficient.
  • Also, functions generally won’t apply to each element in a list by default.
x <- list(1, 2, 5)
x^2
<simpleError in x^2: non-numeric argument to binary operator>

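Two possible ways around this error, as a sketch: flatten the list into a vector with unlist(), or apply the operation to each element with lapply() (covered later in this lecture).

```r
x <- list(1, 2, 5)
unlist(x)^2                       # 1 4 25: flatten to a vector, then square
lapply(x, function(v) v^2)        # squares each element and returns a list
```
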
Data.frames

Think of them as “tables” of elements.

x <- data.frame(col1 = c(5, 2, 7), col2 = c("f", "a", "e"))
x
  col1 col2
1    5    f
2    2    a
3    7    e

Behind the scenes, they are a

  • list of vectors,
  • where each vector has the same length.
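
The list structure can be checked directly. A minimal sketch using the same data.frame:

```r
x <- data.frame(col1 = c(5, 2, 7), col2 = c("f", "a", "e"))
is.list(x)       # TRUE
length(x)        # 2: the number of columns, i.e. list elements
names(x)         # "col1" "col2"
x[["col1"]]      # 5 2 7: list-style access returns the column vector
nrow(x)          # 3: the common length of the vectors
```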

Accessing data.frame elements

Data.frame values can be accessed using index values:

x[row, col]

x[2, 2]
[1] "a"

You can leave one index blank to get a whole row or column.

x[2, ]
  col1 col2
2    2    a
x[ , 2]
[1] "f" "a" "e"

Or use the column names (remember, data.frames are just lists).

x$col2
[1] "f" "a" "e"
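
The row index can also be a logical condition, and assigning to a new column name adds a column. A short sketch with the same data.frame (col3 is a made-up column for illustration):

```r
x <- data.frame(col1 = c(5, 2, 7), col2 = c("f", "a", "e"))
x[x$col1 > 3, ]            # rows where col1 is above 3 (rows 1 and 3)
x[x$col1 > 3, "col2"]      # "f" "e": combine a row condition with a column name
x$col3 <- x$col1 * 2       # assigning to a new name adds a column
x$col3                     # 10 4 14
```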

What is your data structure?

The function str() will return information about the data structure of the passed object.

str(data.frame(col1 = c(5, 2, 7), col2 = c("f", "a", "e")))
'data.frame':   3 obs. of  2 variables:
 $ col1: num  5 2 7
 $ col2: chr  "f" "a" "e"
str(c(1, 2, 5))
 num [1:3] 1 2 5
str(list(1, 2, 5))
List of 3
 $ : num 1
 $ : num 2
 $ : num 5
str(TRUE)
 logi TRUE

Factors

A specific type of vector.

Details to be covered in the problem set!

Controlling Program Flow

If Statements

If statements evaluate a condition,

and then execute code if the condition is TRUE.

x <- 3 
if (x == 3) {
  print("X is equal to 3")
}
[1] "X is equal to 3"

If we give another value for x…

x <- 5
if (x == 3) {
  print("X is equal to 3")
}

Nothing is printed, because print() was never run.

Multiple if statements

Sometimes you want to check a series of conditions,

x <- "5"
if (is.numeric(x)) {
  print("X is a number.")
} else if (is.character(x)) {
  print("X is a character.")
}
[1] "X is a character."

This code,

  1. Checks if ‘x’ is a number, then
  2. If it is not, checks if ‘x’ is a character.

Catching everything else

To catch any cases that do not pass any condition, you can use a final else block:

x <- NA
if (is.numeric(x)) {
  print("X is a number.")
} else if (is.character(x)) {
  print("X is a character.")
} else {
  print("I'm not sure what X is.")
  str(x)
}
[1] "I'm not sure what X is."
 logi NA

If (and if-else) statements are the basics of controlling the flow of your program.

You can make sections of code that only execute for one dataset, or a robustness check that runs on only one model.
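
One way this might look in a script is a logical switch that turns a block of code on or off. This is only a sketch; the variable names are hypothetical and the robustness code is left as a placeholder.

```r
run_robustness <- FALSE    # hypothetical switch, set once near the top of a script
models_run <- "baseline"   # hypothetical record of what was run

if (run_robustness) {
  ## Robustness code would go here
  models_run <- c(models_run, "robustness")
}
models_run                 # "baseline": the robustness block was skipped
```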

Loops

Loops are another key component for controlling your program flow.

Two basic loops are:

  • for(){}
  • while(){}

For Loops

For loops execute code for a defined number of times.

for (i in 1:3) {
  print(i)
}
[1] 1
[1] 2
[1] 3

The construction here is

for (each_value in vector_of_values) {
  ## Do something
}
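
Loops are usually used to compute and store something rather than just print. A minimal sketch that loops over positions and fills in a results vector created ahead of time (the values are illustrative):

```r
values <- c(2, 4, 6)
squares <- numeric(length(values))   # create the results vector up front

for (i in seq_along(values)) {       # i takes the positions 1, 2, 3
  squares[i] <- values[i]^2
}
squares                              # 4 16 36
```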

While Loops

While loops execute code repeatedly for as long as a condition is TRUE.

i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}
[1] 1
[1] 2
[1] 3

Here we emulated the function of the for loop from before.

But while loops only require one thing:

while (condition) {
  ## Do Something
}

Dangerous While Loops

It is easy to write a while loop that will run forever.

while (1 < 3) {
  print("running!")
}

This one is inane, but you can inadvertently construct them.

Why use While Loops?

While loops let you execute code for an unspecified duration.

x <- 0
while (x < 1) {
  x <- rnorm(1)
  print(x)
}
[1] -0.03751376
[1] -1.574604
[1] -0.4859675
[1] 0.4651862
[1] -0.9040981
[1] -0.2774328
[1] 0.3864344
[1] -0.06040412
[1] -0.6861798
[1] -1.906137
[1] 1.80376

Setting Safety Valves

If you are going to use a while loop, it’s a good idea to set a “safety” option to limit the maximum number of iterations.

max_iter <- 1000
i <- 0
x <- 0
while (x < 1 && i < max_iter) {
  x <- rnorm(1)
  i <- i + 1
}
print(i)
[1] 3
print(x)
[1] 1.106937
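
After the loop ends, it helps to know whether it stopped because the target was reached or because the safety valve kicked in. A sketch of one way to check, assuming a warning is the desired response; the seed value is arbitrary and only makes the random draws repeatable.

```r
set.seed(123)          # arbitrary seed so the random draws repeat across runs
max_iter <- 1000
i <- 0
x <- 0
while (x < 1 && i < max_iter) {
  x <- rnorm(1)
  i <- i + 1
}

# If x is still below 1, the loop can only have stopped because of max_iter
if (x < 1) {
  warning("Stopped after max_iter draws without reaching 1")
}
```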

Apply Functions

There are some disadvantages to loops,

  • They tend to be inefficient
  • They sometimes go on forever

One alternative is to use one of the family of apply() functions.

lapply()

We will see how to use the l + apply() function.

  • l stands for “list”, which is what the function returns.

We can rewrite our prior for(){} loop as,

x <- lapply(1:3, print)
[1] 1
[1] 2
[1] 3

The construction is…

results_as_list <- lapply(vector_of_values, function_applied_to_each_value)

lapply()

One nice thing about lapply() is that it returns the values as a list.

lapply(1:3, sqrt)
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

It also works with lists as the input,

lapply(list(1, 2, 3), sqrt)
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
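
The function passed to lapply() does not need to be a built-in one. A small sketch with a function written inline, plus a named list to show that names carry through to the result:

```r
squared <- lapply(1:3, function(v) v^2)   # a function written inline
unlist(squared)                           # 1 4 9: flatten the list to a vector

cities <- list(p = "Providence", b = "Boston")
lapply(cities, nchar)                     # names are kept: $p is 10, $b is 6
```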

Why are Apply Functions More Efficient?

Apply functions assume that each of your elements can be operated on separately.

For loops operate on each element sequentially.

  • This independence can make apply functions faster for computers to handle, and easier to run in parallel.
  • But sometimes you will have operations that depend on prior steps, and you will have to use for loops.
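
A sketch of an operation that depends on prior steps: each value below is built from the one before it, so the elements cannot be computed separately. The inputs and the parameter rho are illustrative.

```r
shocks <- c(1, 0, 0, 0)          # illustrative inputs
rho <- 0.5                       # hypothetical persistence parameter
y <- numeric(length(shocks))
y[1] <- shocks[1]

for (period in 2:length(shocks)) {
  # Each step needs the value computed in the previous step
  y[period] <- rho * y[period - 1] + shocks[period]
}
y                                # 1.000 0.500 0.250 0.125
```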

Summary

Summary

  • Introduced to R
    • “Calculator” operations and logic
    • Assigning values
    • Different data types (logical, character, numeric, etc.)
    • Different data structures (vectors, lists, data.frames, etc.)
  • Control Flow
    • If Else
    • Loops
    • lapply()

Live Coding