Introduction

What is R?

R is an open-source programming language and a free software environment for data manipulation, calculation, and graphical display. It is widely used among statisticians and data miners.

RStudio is a popular IDE (integrated development environment) for R programming. You can use the console as a calculator and run lines of code in it directly, but code written in the script pane can also be run and saved to a file. All error messages appear in the console pane. To use R’s help system, type help(pnorm) or ?pnorm. (Here pnorm is the function that returns the cumulative probability at a given quantile of a normal distribution.)
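
As a quick illustration of the help commands and of pnorm itself, the sketch below evaluates the standard normal cumulative probability at 1.96. The value 1.96 is chosen here only for illustration.

```r
# Open the documentation page for pnorm (either form works in an interactive session)
# help(pnorm)
# ?pnorm

# P(Z <= 1.96) for a standard normal variable Z
p <- pnorm(1.96)
round(p, 3)   # 0.975
```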

Advantages

  • Most comprehensive statistical analysis package available
  • Graphical capabilities are outstanding
  • Free, open source, no license restrictions
  • Over 4800 packages from multiple repositories specializing in topics like econometrics, data mining, spatial analysis, and bio-informatics
  • R plays well with many other tools for importing data (e.g. SAS, SPSS, Stata, SQL)

Packages

R ships with many basic and popular functions, but for anything else you need to install a package (also called a library). The most common package repository is CRAN, the Comprehensive R Archive Network.

To install a new package, you can:

  • via GUI:
    Tools tab \(\rightarrow\) Install Packages tab

  • via command:
    install.packages()
    e.g.: to install MASS package, use the command install.packages("MASS")

You only need to install a package once. To use the package, you must first load it with library() in each new session. Some packages come with datasets, which you can load with data().

For example, to install and load the package MASS and then load the dataset iris (iris ships with base R's datasets package, so data(iris) works even without MASS):

install.packages("MASS")
library(MASS)
data(iris)

Setting Environment

To specify the directory that datasets are read from, set your working directory at the beginning of a session with setwd(). If you are not sure which directory you are in, use getwd() to get its path.

  • setwd("path/to/your/folder/"): set working directory
  • getwd(): get working directory
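
A minimal sketch of switching the working directory and restoring it afterwards. The temporary folder returned by tempdir() is only a stand-in for your own data folder, and the name old_dir is chosen for this illustration.

```r
old_dir <- getwd()     # remember the current working directory
setwd(tempdir())       # switch to a temporary folder (stand-in for your data folder)
getwd()                # now reports the temporary folder
setwd(old_dir)         # restore the original working directory
```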

Coding Habits

  • Comment your code:
    start a comment with # when you want a note in the script that R will not run.
  • Name your objects (datasets, variables, …) sensibly
  • Regularly check your lines are working when you are coding
  • Always check your data are read correctly
  • Always make sure your output makes sense
  • Learn to debug!

Modes

Vectors of mixed mode are cast to the “highest” mode among their elements:

  • from lowest to highest: logical, numeric, complex, character
  • mode(): gives the mode of the object
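
The casting order can be checked directly with mode(). This is a small sketch; the variable names m1 to m3 and mixed are chosen here for illustration.

```r
m1 <- mode(c(TRUE, 2))         # "numeric": TRUE becomes 1
m2 <- mode(c(2, 3i))           # "complex"
m3 <- mode(c(TRUE, 2, "a"))    # "character": everything becomes a string
mixed <- c(TRUE, 2, "a")
mixed                          # "TRUE" "2" "a"
```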

Numeric Mode

Order matters

  • x = c(2, 3.2, 6, 1.4, 3, 7.8)
  • Type x in console then you will get the x vector
  • A numeric vector is an ordered collection of numbers
  • The printed output is preceded by [1], the index of the first element on that line; the value is a vector

Using assign()

  • assign("x", c(2, 3.2, 6, 1.4, 3, 7.8))

Concatenating vectors

  • y = c(x, 0, x)
  • x = c(1, 2, 3); y = c(4, 5, 6)
    v = 2*x + y + 1 gives c(7, 10, 13)
  • Vectors in the same expression don’t have to be of the same length. Shorter vectors are recycled until they match the length of the longest.
    x1 = c(1, 2); y1 = c(3, 4, 7, 8)
    x1 + y1 gives c(4, 6, 8, 10)
  • R issues a warning (not an error) when the length of the longest vector is not a multiple of the length of a shorter one; the result is still computed.
    x2 = c(1, 3, 4); y1 = c(3, 4, 7, 8)
    x2 + y1 gives c(4, 7, 11, 9) with a warning
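
The recycling rule can be tried out as follows. This is a sketch; suppressWarnings() is used only so the second sum runs quietly, and the names s1 and s2 are chosen for illustration.

```r
x1 <- c(1, 2)
y1 <- c(3, 4, 7, 8)
s1 <- x1 + y1                      # x1 is recycled to 1, 2, 1, 2
s1                                 # 4 6 8 10

x2 <- c(1, 3, 4)
# Length 4 is not a multiple of length 3: R warns but still returns a result
s2 <- suppressWarnings(x2 + y1)    # x2 is recycled to 1, 3, 4, 1
s2                                 # 4 7 11 9
```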

Arithmetic

  • Exponentiation: a^b (or equivalently a**b) computes \(a^b\)
  • Standard functions performed element-wise: log, exp, sin, cos, sqrt
  • Standard functions performed over the whole vector: sum, prod, mean, var, sd
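
A short sketch contrasting element-wise functions with functions that summarize the whole vector. The vector x is made up for illustration.

```r
x <- c(1, 4, 9)
x^2         # 1 16 81, same as x**2
sqrt(x)     # 1 2 3  (element-wise)
sum(x)      # 14     (over the whole vector)
mean(x)     # 4.666667
```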

Character Mode

  • sequence encased in " " or ' '
  • paste(): concatenation for character variables
    • numbers are cast as characters
    • arguments are separated by a space unless you tell it otherwise

Example

### Example 1
y = 2
paste("The value of y is ", y)

### Example 2
labs = paste(c("X", "Y"), 1:10, sep = ""); labs
# sep="": nothing is inserted between the pasted values
> [1] "The value of y is  2"
>  [1] "X1"  "Y2"  "X3"  "Y4"  "X5"  "Y6"  "X7"  "Y8"  "X9"  "Y10"

Logical Mode

  • Takes the values TRUE, FALSE, NA
  • logical operations
    • >, <, >=, <=
    • ==, strict equality
    • !=, not equals
    • !, not
    • &, and (element-wise); && for single values
    • |, or (element-wise); || for single values
  • Logical vector converts into numeric vector, FALSE=0, TRUE=1
    x = c(1, 1, 3, 6, 4, 1, 3, 1)
    sum(x==1) gives the count of number of 1’s
  • Numeric vector converts into logical vector, 0=FALSE, any nonzero value=TRUE
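
The conversions between logical and numeric values can be sketched like this. The names is_one, n_ones and mid are chosen here for illustration.

```r
x <- c(1, 1, 3, 6, 4, 1, 3, 1)
is_one <- x == 1             # logical vector, one entry per element
n_ones <- sum(is_one)        # TRUE counts as 1, so this counts the 1's: 4
mid <- x > 2 & x < 6         # element-wise "and": TRUE for the values 3, 4, 3
as.logical(c(0, 1, 5))       # FALSE TRUE TRUE: any nonzero number is TRUE
```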

Changing Modes

  • Functions:
    • as.logical(): changes all elements to boolean
    • as.numeric(): changes all elements to numeric
    • as.character(): changes all elements to character
    • An element that can’t be transformed gets NA

Example

x = c(TRUE, FALSE)
y = c(0, 1)
z = c('a', '1')

### Example 1: Changes x into numeric
as.numeric(x)
### Example 2: Changes y into character
as.character(y)
### Example 3: Gets NA if it can't be transformed
mode(z)
as.numeric(z)
> [1] 1 0
> [1] "0" "1"
> [1] "character"
> Warning: NAs introduced by coercion
> [1] NA  1

Factor class

  • factor(x) creates levels using the values in x
    x = c(1, 1, 2, 3, 1, 5, 5, 7, 7, 1, 3, 2, 6, 5)
    factor(x) gives levels 1, 2, 3, 5, 6, 7

Example

x = c('like', 'dislike', 'hate', 'like', "don't know", 'like', 'dislike')
# Example 1: Unordered levels
factor(x)
# Example 2: Order the levels
factor(x, levels = c('hate', 'dislike', "don't know", 'like'), ordered = TRUE)
> [1] like       dislike    hate       like       don't know like      
> [7] dislike   
> Levels: dislike don't know hate like
> [1] like       dislike    hate       like       don't know like      
> [7] dislike   
> Levels: hate < dislike < don't know < like

Indexing Vectors

Picks subsets of elements of a vector or matrix

Logical Index

Keeps the elements that are TRUE and drops those that are FALSE

Example

x = c(1:3, NA, 4)
l = c(T, T, T, F, T)
y = x[l]
y # non-missing elements of x
## [1] 1 2 3 4

Positive Integers

Index values should lie between 1 and the length of the vector

Example

x = 1:15
### Example 1
x[2:7]
### Example 2
x[c(4, 7, 12)]
### Example 3
index = rep(c(1,1,2,2), times = 2); index
z = c("x","y")
z[index]
## [1] 2 3 4 5 6 7
## [1]  4  7 12
## [1] 1 1 2 2 1 1 2 2
## [1] "x" "x" "y" "y" "x" "x" "y" "y"

Negative Integers

Excludes those elements

Example

x = 1:10
### Original vector
x
### Remove 2nd and 5th elements
x[c(-2, -5)]
>  [1]  1  2  3  4  5  6  7  8  9 10
> [1]  1  3  4  6  7  8  9 10

Character Index

Works only when a names attribute identifies the components

fruit = c(5, 10, 11, 34, 2)
names(fruit) = c('orange', 'banana', 'apple', 'peach', 'grape')
### Original vector
fruit
### Extract values with names of apple and orange
fruit[c('apple', 'orange')]
> orange banana  apple  peach  grape 
>      5     10     11     34      2
>  apple orange 
>     11      5

Replacing elements by index

Assigns new values to the elements of a vector where the logical index is TRUE. A logical index shorter than the vector is recycled.

x = 1:10
### Extract the vector with TRUE logical index
x[c(T, T, T, F, T)]
### Assign new value to values with TRUE logical index
x[c(T, T, T, F, T)] = 0
x
> [1]  1  2  3  5  6  7  8 10
>  [1] 0 0 0 4 0 0 0 0 9 0

Generating Sequences

:, colon

  • 1:30 is equivalent to c(1, 2, ..., 30)
  • The colon operator has higher priority than arithmetic operators: in 2*1:15, the colon is evaluated first, giving 2*(1:15).

Example

n = 10
### Colon performs first
1:n-1
### Generate forward
1:(n-1)
### Generate Backward
5:1
>  [1] 0 1 2 3 4 5 6 7 8 9
> [1] 1 2 3 4 5 6 7 8 9
> [1] 5 4 3 2 1

seq(), sequence

A more general way to generate sequences

Example

### Example 1 
seq(10)
### Example 2
seq(-1, 1, by = 0.2)
### Example 3
seq(length = 6, from = -5, by = 0.2)
>  [1]  1  2  3  4  5  6  7  8  9 10
>  [1] -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0
> [1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0

rep(), repeat

Example

x = 1:5
### Example 1
rep(x, times = 5)
### Example 2
rep(x, each = 5)
>  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
>  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5

Missing Values

  • NA: not available, missing
  • Operations on a vector that contains NA generally return NA
  • Most stat functions have an option for ignoring NA (na.rm = TRUE)
  • is.na()
    • returns a logical vector that is TRUE where the argument has NA, FALSE otherwise
    • expression x==NA doesn’t work because NA is not a value, it is a marker
  • NaN, Inf
    • NaN: Not a Number
    • Inf: \(\infty\); Inf marks a value too large to represent, -Inf a value too far below zero
    • is.na() returns TRUE for both NA and NaN
    • is.nan() returns TRUE only for NaN
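
The difference between NA, NaN and Inf can be checked with a small vector. This is a sketch; the vector v is made up for illustration.

```r
v <- c(1, NA, 0/0, 1/0, -1/0)    # 1, NA, NaN, Inf, -Inf
is.na(v)      # FALSE  TRUE  TRUE FALSE FALSE
is.nan(v)     # FALSE FALSE  TRUE FALSE FALSE
v == NA       # NA NA NA NA NA: a comparison with NA is always NA
```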

Example

### Incomplete operation if containing NA
x = 1:15; x[c(1, 3, 6)] = NA
mean(x)
### Some functions have argument for ignoring NA
mean(x, na.rm = TRUE)
### Logical vector of index of NAs
is.na(x)
> [1] NA
> [1] 9.166667
>  [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
> [12] FALSE FALSE FALSE FALSE

Objects

Properties

These functions give different results depending on the object.

  • length(object)
  • attributes(object)
  • attr(object, name): select a specific attribute
  • class(object): gives numeric, logical, character, matrix, array, factor of the object

Example

age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 
          79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
village = data.frame(age = age, height = height)

### Village, a data frame
village
>    age height
> 1   18   76.1
> 2   19   77.0
> 3   20   78.1
> 4   21   78.2
> 5   22   78.8
> 6   23   79.7
> 7   24   79.9
> 8   25   81.1
> 9   26   81.2
> 10  27   81.8
> 11  28   82.8
> 12  29   83.5
### length()
length(village)
> [1] 2
### attributes()
attributes(village)
> $names
> [1] "age"    "height"
> 
> $row.names
>  [1]  1  2  3  4  5  6  7  8  9 10 11 12
> 
> $class
> [1] "data.frame"
### class()
class(village)
> [1] "data.frame"
### summary()
summary(village)
>       age            height     
>  Min.   :18.00   Min.   :76.10  
>  1st Qu.:20.75   1st Qu.:78.17  
>  Median :23.50   Median :79.80  
>  Mean   :23.50   Mean   :79.85  
>  3rd Qu.:26.25   3rd Qu.:81.35  
>  Max.   :29.00   Max.   :83.50

Vectors

  • A set of elements of the same mode (numeric, character, logical, or complex)
  • can be a single value
  • can be extended without special consideration

Example

a = c(4, 6, 8)
a[5] = 9
a
> [1]  4  6  8 NA  9

Arrays

  • Data structure of all one type: vector, matrix
  • A matrix of dimension \((10 \times 1)\) is not a vector of length \(10\)
  • A matrix and a vector are different objects
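
The distinction between a \((10 \times 1)\) matrix and a vector of length \(10\) shows up in the dim attribute. A minimal sketch, with names v and m chosen for illustration:

```r
v <- 1:10
m <- matrix(v, nrow = 10, ncol = 1)
dim(v)          # NULL: a plain vector has no dim attribute
dim(m)          # 10 1
is.matrix(v)    # FALSE
is.matrix(m)    # TRUE
```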

Reshaping vector

matrix(vector, nrow = k, ncol = n) builds a \(k\) by \(n\) matrix, column-wise, left to right

Example

vector = c(1, 2, 3, 4, 5, 6)
matrix(vector, nrow = 2, ncol = 3)
>      [,1] [,2] [,3]
> [1,]    1    3    5
> [2,]    2    4    6

Stacking vectors

  • cbind(): binds vectors side by side as columns
  • rbind(): stacks vectors on top of each other as rows

Example

x = c(11, 12, 13)
y = c(77, 55, 33)

### Vertical stacking (as rows): 2 by 3 matrix
rbind(x, y)
### Horizontal stacking (as columns): 3 by 2 matrix
cbind(x, y)
>   [,1] [,2] [,3]
> x   11   12   13
> y   77   55   33
>       x  y
> [1,] 11 77
> [2,] 12 55
> [3,] 13 33

Accessing elements

matrix[i, j] gives you value from the ith row, jth column. Leave it blank if you want to access all columns or all rows.
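
A small sketch of the indexing forms described above, using a made-up 2 by 3 matrix m:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m[2, 3]    # 6: row 2, column 3
m[1, ]     # 1 3 5: all columns of row 1
m[, 2]     # 3 4: all rows of column 2
```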

List

  • Similar to an array, but items can be of different kinds
  • Flexible, we can make a list of lists
  • names() shows named items in a list
  • If list items are not named, access them with double brackets ([[]])
  • You can also create a list using objects already in your workspace
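
Creating a list from objects that already exist in the workspace might look like this. The object names coefs, label and L2 are hypothetical and used only for this sketch.

```r
coefs <- c(1, 4, 6, 8)
label <- "model A"
L2 <- list(coefficients = coefs, label = label, 1:3)   # third item is unnamed
names(L2)     # "coefficients" "label" ""
L2$label      # "model A"
L2[[3]]       # 1 2 3: the unnamed item is reached by position
```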

Example

### Example 1

L = list() # empty list

# a vector item called coefficients
L$coefficients = c(1, 4, 6, 8)
L
> $coefficients
> [1] 1 4 6 8
# assign item 4, then item 2 and 3 are set to NULL
L[[4]] = c(5, 8)
L
> $coefficients
> [1] 1 4 6 8
> 
> [[2]]
> NULL
> 
> [[3]]
> NULL
> 
> [[4]]
> [1] 5 8
# extract 4th item
L[[4]]
> [1] 5 8
# assign name to item 4
names(L)[[4]] = 'dummy'
L
> $coefficients
> [1] 1 4 6 8
> 
> [[2]]
> NULL
> 
> [[3]]
> NULL
> 
> $dummy
> [1] 5 8

Data Frame

  • Special kind of list where each item (variable) must have the same number of elements. Variables may be different types.
  • Can operate with both list commands and array commands
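
A brief sketch of the same data frame being handled with list commands and with array commands. The data frame df is made up for illustration.

```r
df <- data.frame(id = 1:3, score = c(90, 85, 70))
df$score          # list-style access: 90 85 70
df[["score"]]     # same column, also list-style
df[2, "score"]    # array-style access: 85
dim(df)           # 3 2
length(df)        # 2: a data frame is a list of its columns
```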

Example

### Generate the data frame
grp = c(1,2,2,1,1,1,2,2,1,2,2,1)
gpa = c(4.0, 3.5, 2.8, 3.9, 2.2, 3.8, 2.7, 3.8, 4.0, 3.6, 3.4, 2.1)
age = c(21, 22, 19, 32, 25, 22, 20, 23, 21, 24, 22, 30)
sex = c('F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F')
dat = data.frame(Group = grp, GPA = gpa, Age = age, Sex = sex)
dat

### (1) Access observation 1-10 and all associated variables
dat[1:10,]
### (2) Access all observations in group 1
dat[dat$Group == 1, ]
>    Group GPA Age Sex
> 1      1 4.0  21   F
> 2      2 3.5  22   M
> 3      2 2.8  19   M
> 4      1 3.9  32   M
> 5      1 2.2  25   F
> 6      1 3.8  22   F
> 7      2 2.7  20   M
> 8      2 3.8  23   M
> 9      1 4.0  21   M
> 10     2 3.6  24   F
> 11     2 3.4  22   M
> 12     1 2.1  30   F
>    Group GPA Age Sex
> 1      1 4.0  21   F
> 2      2 3.5  22   M
> 3      2 2.8  19   M
> 4      1 3.9  32   M
> 5      1 2.2  25   F
> 6      1 3.8  22   F
> 7      2 2.7  20   M
> 8      2 3.8  23   M
> 9      1 4.0  21   M
> 10     2 3.6  24   F
>    Group GPA Age Sex
> 1      1 4.0  21   F
> 4      1 3.9  32   M
> 5      1 2.2  25   F
> 6      1 3.8  22   F
> 9      1 4.0  21   M
> 12     1 2.1  30   F

Tables

  • Frequency tables
  • You can create factor variables via factor()
  • You can create frequency tables via table()
  • You can create a factor variable from a numeric vector via cut()
  • You can also create two-way frequency tables via table(a, b)

Example

### 30 tax accountants from Australia
state = c('tas', 'sa', 'qld', 'nsw', 'nsw', 'nt', 'wa', 'wa', 'qld', 'vic', 'nsw', 'vic', 
          'qld', 'qld', 'sa', 'tas', 'sa', 'nt', 'wa', 'vic', 'qld', 'nsw', 'nsw', 'wa', 
          'sa', 'act', 'nsw', 'vic', 'vic', 'act')

### Create a factor variable
statefac = factor(state)
statefac

### Create a frequency table
statefreq = table(statefac)
statefreq

### Create a factor variable from numeric vector
incomes = c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42,
            56, 61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48,
            52, 46, 59, 46, 58, 43)
cats = cut(incomes, breaks = 35+10*(0:7))
incomefac = factor(cats)
# breaks: 8 break points starting at 35 in steps of 10, giving 7 intervals
incomefac

### Two-way table
table(incomefac, statefac)
>  [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa  tas sa 
> [18] nt  wa  vic qld nsw nsw wa  sa  act nsw vic vic act
> Levels: act nsw nt qld sa tas vic wa
> statefac
> act nsw  nt qld  sa tas vic  wa 
>   2   6   2   5   4   2   5   4
>  [1] (55,65] (45,55] (35,45] (55,65] (55,65] (55,65] (55,65] (45,55]
>  [9] (55,65] (65,75] (65,75] (35,45] (55,65] (55,65] (55,65] (55,65]
> [17] (55,65] (45,55] (45,55] (55,65] (45,55] (45,55] (35,45] (45,55]
> [25] (45,55] (45,55] (55,65] (45,55] (55,65] (35,45]
> Levels: (35,45] (45,55] (55,65] (65,75]
>          statefac
> incomefac act nsw nt qld sa tas vic wa
>   (35,45]   1   1  0   1  0   0   1  0
>   (45,55]   1   1  1   1  2   0   1  3
>   (55,65]   0   3  1   3  2   2   2  1
>   (65,75]   0   1  0   0  0   0   1  0

Data Management

Data Export

Output .txt file

From a data frame

Use write.table() to output a .txt file from a data frame

Example

age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
village = data.frame(Age = age, Height = height)
### write to a file
write.table(village, file = "village.txt", sep = "\t", col.names = NA, quote = F)
# sep = "\t": separate columns by a tab
# col.names = NA: add a blank header above the row names so titles align with columns
# quote = F: suppress quotes around character values

From a matrix

Use write() to output a .txt file from a matrix

Example

x = matrix(1, 20, 20)
write(x, file = "matrix.txt", ncolumns = 20)

Save R objects

Use save() to write R objects to a file. You can restore them later with load(), without re-reading or converting the original data.

Example

save(village, file = "village.Rdata")
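
A round trip through save() and load() can be sketched as follows. The object name small and the temporary file path are assumptions made for this illustration, so the sketch does not touch your own files.

```r
small <- data.frame(Age = 18:20, Height = c(76.1, 77, 78.1))
path <- file.path(tempdir(), "small.Rdata")   # temporary location for the sketch
save(small, file = path)

rm(small)                # remove the object from the workspace
loaded <- load(path)     # load() restores it and returns the object names
loaded                   # "small"
```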

Data Import

Read data from external files. Generally in a data file, we have:

  1. 1st row is the names for the variables
  2. Other rows contain variable values separated by space, comma, … etc

read.table()

  • Reads a data frame from a text file: read.table("file/name.txt", header = TRUE)
  • The argument colClasses sets the type of each variable (e.g. numeric, character, factor)
  • The argument sep specifies the delimiter
  • Some other arguments:
    • col.names: specify column names, similar in spirit to names()
    • row.names: specify row names
    • skip: number of lines to skip before reading
    • na.strings: how missing values are recorded

Example

Instead of specifying the full path of your file, you can set your working directory to the folder that contains it.

### Example 1: Reads a data frame
scores = read.table("scores_names.txt", header = TRUE)
scores
# Extract the interesting column by $ or [['']]
scores$gender
scores[['gender']]
# One pair of brackets gives you a data frame of the column
scores['gender']
>    name gender aptitude statistics
> 1 Linda      F       95         85
> 2 Jason      M       85         95
> 3 Susan      F       80         70
> 4  Mike      M       70         65
> 5  Judy      F       60         70
> [1] F M F M F
> Levels: F M
> [1] F M F M F
> Levels: F M
>   gender
> 1      F
> 2      M
> 3      F
> 4      M
> 5      F
### Example 2: Sets the type of each variable with colClasses
scores = read.table("scores_names.txt", header = TRUE,
                    colClasses = c('character', 'character', 'integer', 'integer'))
scores[['gender']]
> [1] "F" "M" "F" "M" "F"
### Example 3: Specifies the delimiter
reading = read.table("reading.txt")
reading
reading = read.table("reading.txt", sep = ",")
reading
names(reading) = c('Name', 'Week1', 'Week2', 'Week3', 'Week4', 'Week5')
reading
>                 V1
> 1  Grace,3,1,5,2,6
> 2 Martin,1,2,4,1,3
> 3 Scott,9,10,4,8,6
>       V1 V2 V3 V4 V5 V6
> 1  Grace  3  1  5  2  6
> 2 Martin  1  2  4  1  3
> 3  Scott  9 10  4  8  6
>     Name Week1 Week2 Week3 Week4 Week5
> 1  Grace     3     1     5     2     6
> 2 Martin     1     2     4     1     3
> 3  Scott     9    10     4     8     6

read.fwf()

Takes the same arguments as read.table(), plus widths, a vector of integers giving the number of characters each variable occupies. Negative values skip that many characters.
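
To see how positive and negative widths interact, here is a small self-contained sketch. The two-line file and its layout are invented for illustration: 6 characters for the name, 2 characters to skip, then two 1-character fields.

```r
path <- tempfile(fileext = ".txt")
writeLines(c("Grace 0035",
             "Martin0124"), path)

# 6 characters for the name, skip 2, then two 1-character fields
fw <- read.fwf(path, widths = c(6, -2, 1, 1))
fw$V2    # 3 2
fw$V3    # 5 4
```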

Example

reading2 = read.fwf("readingfwf.txt", widths = c(6, -9, 1, -3, 2, -3, 1, -2, 1, -1, 1))
reading2
>       V1 V2 V3 V4 V5 V6
> 1 Grace   3  1  5  2  6
> 2 Martin  1  2  4  1  3
> 3 Scott   9 10  4  8  6

scan()

  • read.table() uses scan() and processes the results
  • scan() does the same thing as c() but without commas
  • Hit ENTER once, we continue the input onto a new line
  • Hit ENTER twice, scan() stops

Example

Enter the values into Console:

cooperation = scan()
49 64 37 52 68 54
61 79 64 29
27 58 52 41 30 40 39
44 34 44
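
When typing into the console is not practical (for instance inside a script), scan() can also read from a string through its text argument. A minimal sketch, using a shortened, made-up set of values:

```r
cooperation2 <- scan(text = "49 64 37 52
61 79 64 29", quiet = TRUE)
cooperation2            # 49 64 37 52 61 79 64 29
length(cooperation2)    # 8
```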

Built-in datasets

  • There are around 100 datasets in base R, in the package datasets
  • Others come with other packages
  • You can use data() to load those datasets and then refer to them by name
  • You sometimes need to specify the source package of a dataset

Example

data(AirPassengers)
install.packages('rpart')
data(kyphosis, package = 'rpart')

Saved R dataset

Example

load('village.Rdata')

Manipulate data frame

Example

data(airquality)
head(airquality)
>   Ozone Solar.R Wind Temp Month Day
> 1    41     190  7.4   67     5   1
> 2    36     118  8.0   72     5   2
> 3    12     149 12.6   74     5   3
> 4    18     313 11.5   62     5   4
> 5    NA      NA 14.3   56     5   5
> 6    28      NA 14.9   66     5   6
nrow(airquality)
> [1] 153
### Example 1: change Month to a factor
airquality$Month = factor(airquality$Month)

### Example 2: create new indicator variable Solar.I=1 if Solar.R > 200
airquality$Solar.I = ifelse(airquality$Solar.R > 200, 1, 0)

### Example 3: replace Temp and Ozone with their z-scores
airquality$Temp = (airquality$Temp - mean(airquality$Temp)) / sd(airquality$Temp)
airquality$Ozone = (airquality$Ozone - mean(airquality$Ozone, na.rm = T)) / sd(airquality$Ozone, na.rm = T)

### Example 4: add a new variable
airquality$New = 1:153

### Example 5: Remove variable Day (column 6)
airquality = airquality[,-6]

### Example 6: Remove all observations with NA for Ozone and Solar.R
airquality = airquality[!is.na(airquality$Solar.R) & !is.na(airquality$Ozone),]

### Print out the manipulated dataset
head(airquality)
>         Ozone Solar.R Wind       Temp Month Solar.I New
> 1 -0.03423409     190  7.4 -1.1497140     5       0   1
> 2 -0.18580489     118  8.0 -0.6214670     5       0   2
> 3 -0.91334473     149 12.6 -0.4101682     5       0   3
> 4 -0.73145977     313 11.5 -1.6779609     5       1   4
> 7 -0.57988897     299  8.6 -1.3610128     5       1   7
> 8 -0.70114561      99 13.8 -1.9949091     5       0   8
nrow(airquality)
> [1] 111

Data Merging

Vertical Merging

  • vertical binding of data frames
  • does not fill in NAs for variables that are missing in one data frame

Example

X = data.frame(a = c(1, 2, 3, 4), b = factor(c(1, 2, 2, 1)), c = c('A', 'B', 'B', 'A'))
Y = data.frame(a = c(3, 7, 8), b = factor(c(3, 3, 3)))
# rbind(X, Y) gives an error: the numbers of columns differ
# Fill in the missing variable with NA yourself
Y$c = NA
rbind(X, Y)
>   a b    c
> 1 1 1    A
> 2 2 2    B
> 3 3 2    B
> 4 4 1    A
> 5 3 3 <NA>
> 6 7 3 <NA>
> 7 8 3 <NA>

Horizontal Merging

  • merge() joins two data frames on the key variables named in its by argument
  • If there are identical variable names in the data frames, merge() will modify the name to indicate which data frame the variable comes from
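
The renaming of identical variable names can be seen in a small sketch. The data frames P and Q are made up for illustration; by default merge() appends the suffixes .x and .y.

```r
P <- data.frame(ID = c(1, 2), score = c(10, 20))
Q <- data.frame(ID = c(1, 2), score = c(30, 40))
PQ <- merge(P, Q, by = "ID")
names(PQ)     # "ID" "score.x" "score.y"
```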

Example

A = data.frame(ID = c(1, 2, 3), age = c(11, 12, 14))
B = data.frame(ID = c(2, 3, 4), sex = c('M', 'M', 'F'))

### Merge two data frames by ID
# keep only the ID values common to A and B
merge(A, B, by = c('ID'))
### Merge two data frames by ID with all = TRUE
# keep all ID values, filling missing values with NA
merge(A, B, all = TRUE)
>   ID age sex
> 1  2  12   M
> 2  3  14   M
>   ID age  sex
> 1  1  11 <NA>
> 2  2  12    M
> 3  3  14    M
> 4  4  NA    F

One-to-Many merge

Example

C = data.frame(ID = c(1, 1, 2, 1, 3, 3, 3, 4, 4), stars = c(1, 4, 3, 2, 1, 7, 5, 2, 2))
### Order by merge(A,B)
merge(merge(A, B, by = c('ID'), all = TRUE), C, by = c('ID'), all = TRUE)
### Order by C
merge(C, merge(A, B, by = c('ID'), all = TRUE), by = c('ID'), all = TRUE)
>   ID age  sex stars
> 1  1  11 <NA>     1
> 2  1  11 <NA>     4
> 3  1  11 <NA>     2
> 4  2  12    M     3
> 5  3  14    M     1
> 6  3  14    M     7
> 7  3  14    M     5
> 8  4  NA    F     2
> 9  4  NA    F     2
>   ID stars age  sex
> 1  1     1  11 <NA>
> 2  1     4  11 <NA>
> 3  1     2  11 <NA>
> 4  2     3  12    M
> 5  3     1  14    M
> 6  3     7  14    M
> 7  3     5  14    M
> 8  4     2  NA    F
> 9  4     2  NA    F

R Graphics

General functions for all graphics: title(), legend(), axis(), etc.

Bar Charts

25 people were asked for their beer preference: 1 for domestic can, 2 for domestic bottle, 3 for microbrew, and 4 for import.

Bar plot

beer = c(3, 4, 1, 1, 3, 4, 3, 3, 1, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 1, 
         1, 1, 4, 3, 1)
barplot(beer, col = "SteelBlue")

Bar plot by frequency

barplot(table(beer), col = "SteelBlue")

Bar plot by proportion

barplot(table(beer)/length(beer), col = "SteelBlue")

Bar plot by proportion horizontally

barplot(table(beer)/length(beer), col = c('lightblue', 'mistyrose', 'lightcyan', 'cornsilk'), 
        horiz = T)

Pie Charts

Pie chart

beer.counts = table(beer)
pie(beer.counts)

Pie chart with variable labels

names(beer.counts) = c('Domestic Can', 'Domestic Bottle', 'Microbrew', 'Import')
pie(beer.counts)

Pie chart with colors

pie(beer.counts, col = c('lightblue', 'mistyrose', 'lightcyan', 'cornsilk'))

Histograms

Top 25 movies on 4th-week gross receipts.

Histogram by frequency

movie = c(29.6, 28.2, 19.6, 13.7, 13.0, 7.8, 3.4, 2.0, 1.9, 1.0,
          0.7, 0.4, 0.4, 0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 
          0.2, 0.1, 0.1, 0.1, 0.1)
hist(movie, col = "SteelBlue")

Histogram by proportion

hist(movie, freq = F, col = "SteelBlue")

Histogram by specified bins

hist(movie, col = "SteelBlue", breaks = c(0, 1, 2, 3, 4, 5, 10, 20, max(movie)))

Dot Plots

Use the built-in dataset mtcars.

Dot plot

dotchart(mtcars$mpg, labels = row.names(mtcars), cex = 0.7, xlab = 'MPG')

Dot plot colored by cyl and ordered by mpg

x = mtcars[order(mtcars$mpg), 1:2] #mpg:cyl
x$cyl = factor(x$cyl)
x$color[x$cyl==4]='red'
x$color[x$cyl==6]='blue'
x$color[x$cyl==8]='darkgreen'
dotchart(x$mpg, labels = row.names(x), cex = 0.7, 
         groups = x$cyl, color = x$color, xlab = 'MPG')

Box Plots

Based on 5-number summary: min, 1stQ, med, 3rdQ, max

Single sample

x = c(7, 9.5, 10, 11, 10, 10, 8, 11, 8, 13, 11.5)
boxplot(x, xlab = 'Single Sample', ylab = 'Value Axis', main = 'Simple Box Plot', col = 'lightblue')

Multiple Samples

growth = c(75, 72, 73, 61, 67, 64, 62, 63)
sugar = c('C', 'C', 'C', 'F', 'F', 'F', 'S', 'S')
fly = data.frame(growth = growth, sugar = sugar)

boxplot(growth~sugar, data = fly, xlab = 'Sugar Type', ylab = 'Growth', col = 'bisque')
title(main = 'Growth by Sugar Type', font.main = 4)

Scatter Plots

Use the notation plot(x, y, ...) or plot(y ~ x, ...)

Scatter plot

plot(dist~speed, data = cars, xlab = 'Speed', ylab = 'Distance', col = 'blue')
title(main = 'Scatter Plot with Least Squares Line')
abline(lm(dist~speed, data = cars), col = 'red')

Scatter plot matrix

pairs(~mpg+disp+drat+wt, data = mtcars)

Multiple Plots

Control the number of plots in the graph window with:

  • par()
  • layout()

Example

### Example 1
# a window of two graphs
par(mfrow = c(2, 1))

### Example 2
# a window of four graphs
layout(matrix(c(1, 3, 2, 4), 2, 2, byrow = T), 
       widths = c(3, 1), heights = c(1, 2))

Clear the graphics device and return to the defaults:

dev.off()
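
If you want to keep the device open, one alternative is to save the previous settings returned by par() and restore them afterwards. This is a sketch under the assumption that an off-screen device is acceptable: pdf(NULL) is used only so the code runs without opening a window, and the names old_par and restored are chosen for illustration.

```r
pdf(NULL)                            # off-screen device so the sketch runs anywhere
old_par <- par(mfrow = c(1, 2))      # par() returns the previous settings
hist(mtcars$mpg, main = "MPG")       # left panel
boxplot(mtcars$mpg, main = "MPG")    # right panel
par(old_par)                         # restore the previous layout
restored <- par("mfrow")             # 1 1
dev.off()                            # close the off-screen device
```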