Getting Started in R
Introduction
What is R?
R is an open-source programming language and a free software environment for data manipulation, calculation, and graphical display. It is widely used among statisticians and data miners.
RStudio is a popular IDE (integrated development environment) for R programming. You can use the console as a calculator and run lines of code in it directly, but code written in the script pane can be both run and saved. All error messages appear in the console pane. To use R's built-in help, type help(pnorm) or ?pnorm. (Here pnorm is the function that returns the cumulative probability of a normal distribution at a given quantile.)
Advantages
- Most comprehensive statistical analysis package available
- Graphical capabilities are outstanding
- Free, open source, no license restrictions
- Over 4800 packages from multiple repositories specializing in topics like econometrics, data mining, spatial analysis, and bioinformatics
- R plays well with many other tools for importing data (e.g., SAS, SPSS, Stata, SQL, etc.)
Packages
R ships with basic and popular functions, but for anything else you need to install a package (also called a library). The most common package repository is CRAN (Comprehensive R Archive Network).
To install a new package, you can:
- via GUI: Tools tab \(\rightarrow\) Install Packages tab
- via command: install.packages()
- e.g., to install the MASS package, use the command install.packages("MASS")
You only need to install a package once. To use the package, you must first execute library(). Some packages come with datasets; you can load those datasets with data().
For example, to install the package MASS and use the dataset Boston that ships with it (iris, by contrast, is part of base R's datasets package and needs no extra package):
install.packages("MASS")
library(MASS)
data(Boston)
Setting Environment
To specify the directory that your datasets are read from, set your working directory at the beginning with setwd(). If you are not sure which directory you are in, use getwd() to get its path.
- setwd("path/to/your/folder/"): set working directory
- getwd(): get working directory
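A minimal sketch of switching directories and then returning to where you started. Here tempdir() is only a stand-in for your own data folder, and old_dir is a name chosen for illustration.

```r
# Remember the current working directory so it can be restored later
old_dir = getwd()

# tempdir() is only a placeholder for "path/to/your/folder/"
setwd(tempdir())
getwd()            # now prints the temporary directory

# Go back to the original directory
setwd(old_dir)
```

Restoring the original directory at the end keeps later relative paths in a script from silently pointing somewhere else.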
Coding Habits
- Comment your code: use # for comments in your scripts that you don’t want R to run.
- Name your objects (datasets, variables, …) sensibly
- Regularly check your lines are working when you are coding
- Always check your data are read correctly
- Always make sure your output makes sense
- Learn to debug!
Modes
Vectors of mixed mode are cast to the “highest type”:
- lowest: logical, then numeric, then complex
- highest: character
- mode(): gives the mode of the object
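The casting rule can be checked directly with mode(). The vectors below are made up for illustration; each one mixes two or more modes.

```r
# Each vector below mixes modes; R keeps the "highest" one
mode(c(TRUE, FALSE))     # "logical"
mode(c(TRUE, 2))         # "numeric": TRUE becomes 1
mode(c(2, 1+3i))         # "complex"
mode(c(TRUE, 2, "a"))    # "character": everything becomes text

mixed = c(TRUE, 2, "a")
mixed                    # "TRUE" "2" "a"
```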
Numeric Mode
Order matters
x = c(2, 3.2, 6, 1.4, 3, 7.8)
- Type x in the console to print the vector x
- A numeric vector is an ordered collection of numbers
- Printed entries are preceded by [1], the index of the first element shown on that line
Using assign()
assign("x", c(2, 3.2, 6, 1.4, 3, 7.8))
Concatenating vectors
y = c(x, 0, x)
x = c(1, 2, 3); y = c(4, 5, 6)
v = 2*x + y + 1 # gives 7 10 13
- Vectors in the same expression don’t have to be of the same length. Shorter vectors are recycled until they match the length of the longest.
x1 = c(1, 2); y1 = c(3, 4, 7, 8)
x1 + y1 # gives 4 6 8 10
- R throws a warning only when the length of the longest vector is not a multiple of the length of the shortest.
x2 = c(1, 3, 4); y1 = c(3, 4, 7, 8)
x2 + y1 # gives 4 7 11 9 with a warning, not an error
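The recycling rule can be tried out as follows. This is a small sketch using the vectors from the text; tryCatch() is used only to show that the mismatch is signalled as a warning while the sum is still computed.

```r
x1 = c(1, 2)
x2 = c(1, 3, 4)
y1 = c(3, 4, 7, 8)

# Length 4 is a multiple of length 2: x1 is recycled silently
s1 = x1 + y1                      # 4 6 8 10

# Length 4 is not a multiple of length 3: R signals a warning.
# Here the warning message is captured as text.
msg = tryCatch(x2 + y1, warning = function(w) conditionMessage(w))

# The result itself is still available
s2 = suppressWarnings(x2 + y1)    # 4 7 11 9
```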
Arithmetic
- Exponentiation: a^b (or a**b) computes \(a^b\)
- Standard functions performed element-wise: log, exp, sin, cos, sqrt
- Standard functions performed over the whole vector: sum, prod, mean, var, sd
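A short sketch of the difference between element-wise functions and whole-vector functions, using a made-up vector:

```r
x = c(1, 4, 9)

# Element-wise: the result has the same length as x
sqrt(x)       # 1 2 3
x^2           # 1 16 81
2 * x + 1     # 3 9 19

# Whole-vector: the result is a single number
sum(x)        # 14
prod(x)       # 36
mean(x)       # 4.666667
```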
Character Mode
- A sequence of characters enclosed in " " or ' '
- paste(): concatenation for character variables
- numbers are cast as characters
- arguments are separated by a space unless you tell it otherwise
Example
### Example 1
y = 2
paste("The value of y is", y)
### Example 2
labs = paste(c("X", "Y"), 1:10, sep = ""); labs
# sep = "": nothing is inserted between the pasted values
> [1] "The value of y is 2"
> [1] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10"
Logical Mode
- Takes the values TRUE, FALSE, NA
- Logical operations:
  - >, <, >=, <=
  - ==, strict equality
  - !=, not equal
  - !, not
  - &&, and (use & for element-wise comparison of vectors)
  - ||, or (use | for element-wise comparison of vectors)
- A logical vector converts into a numeric vector with FALSE=0, TRUE=1
x = c(1, 1, 3, 6, 4, 1, 3, 1)
sum(x==1) gives the count of 1’s
- A numeric vector converts into a logical vector with 0=FALSE and any non-zero value=TRUE
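The conversions between logical and numeric values can be sketched as below. The variable names (is_one, n_ones, share, small_odd) are chosen here for illustration.

```r
x = c(1, 1, 3, 6, 4, 1, 3, 1)

# An element-wise comparison returns a logical vector
is_one = x == 1
n_ones = sum(is_one)        # 4: each TRUE counts as 1
share  = mean(is_one)       # 0.5: proportion of 1's

# & combines two logical vectors element by element
small_odd = x < 4 & x %% 2 == 1

# Numeric to logical: 0 is FALSE, any other number is TRUE
as.logical(c(0, 1, 2))      # FALSE TRUE TRUE
```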
Changing Modes
- Functions:
  - as.logical(): changes all elements to logical
  - as.numeric(): changes all elements to numeric
  - as.character(): changes all elements to character
- An element that can’t be transformed becomes NA
Example
x = c(TRUE, FALSE)
y = c(0, 1)
z = c('a', '1')
### Example 1: Changes x into numeric
as.numeric(x)
### Example 2: Changes y into character
as.character(y)
### Example 3: Gets NA if it can't be transformed
mode(z)
as.numeric(z)
> [1] 1 0
> [1] "0" "1"
> [1] "character"
> Warning: NAs introduced by coercion
> [1] NA 1
Factor class
factor(x) creates levels using the values in x
x = c(1, 1, 2, 3, 1, 5, 5, 7, 7, 1, 3, 2, 6, 5)
factor(x) gives levels 1, 2, 3, 5, 6, 7
Example
x = c('like', 'dislike', 'hate', 'like', "don't know", 'like', 'dislike')
# Example 1: Unordered levels (sorted alphabetically by default)
factor(x)
# Example 2: Order the levels
factor(x, levels = c('hate', 'dislike', "don't know", 'like'), ordered = TRUE)
> [1] like dislike hate like don't know like
> [7] dislike
> Levels: dislike don't know hate like
> [1] like dislike hate like don't know like
> [7] dislike
> Levels: hate < dislike < don't know < like
Indexing Vectors
Picks subsets of elements of a vector or matrix
Logical Index
Keeps the elements that are TRUE and drops those that are FALSE
Example
x = c(1:3, NA, 4)
l = c(T, T, T, F, T)
y = x[l]
y # non-missing elements of x
## [1] 1 2 3 4
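Instead of typing the logical index by hand, it is usually built from the data. A small sketch with the same x:

```r
x = c(1:3, NA, 4)

# Build the logical index with is.na()
no_na = x[!is.na(x)]        # 1 2 3 4

# A comparison with NA gives NA, and an NA index returns NA
x[x > 2]                    # 3 NA 4

# which() returns the positions that are TRUE and drops the NA
pos = which(x > 2)          # 3 5
x[pos]                      # 3 4
```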
Positive Integers
Index values should lie between 1 and the length of the vector
Example
x = 1:15
### Example 1
x[2:7]
### Example 2
x[c(4, 7, 12)]
### Example 3
index = rep(c(1,1,2,2), times = 2); index
z = c("x","y")
z[index]
## [1] 2 3 4 5 6 7
## [1] 4 7 12
## [1] 1 1 2 2 1 1 2 2
## [1] "x" "x" "y" "y" "x" "x" "y" "y"
Negative Integers
Excludes those elements
Example
x = 1:10
### Original vector
x
### Remove 2nd and 5th elements
x[c(-2, -5)]
> [1] 1 2 3 4 5 6 7 8 9 10
> [1] 1 3 4 6 7 8 9 10
Character Index
Works only when a names attribute identifies the components
fruit = c(5, 10, 11, 34, 2)
names(fruit) = c('orange', 'banana', 'apple', 'peach', 'grape')
### Original vector
fruit
### Extract values with names of apple and orange
fruit[c('apple', 'orange')]
> orange banana apple peach grape
> 5 10 11 34 2
> apple orange
> 11 5
Replacing elements by index
Assign new values to the elements of a vector whose logical index is TRUE.
x = 1:10
### Extract the vector with TRUE logical index
x[c(T, T, T, F, T)]
### Assign new value to values with TRUE logical index
x[c(T, T, T, F, T)] = 0
x
> [1] 1 2 3 5 6 7 8 10
> [1] 0 0 0 4 0 0 0 0 9 0
Generating Sequences
:, colon
- 1:30 is equivalent to c(1, 2, ..., 30)
- :, colon has high priority; in 2*1:15, the colon is evaluated first.
Example
n = 10
### Colon performs first
1:n-1
### Generate forward
1:(n-1)
### Generate Backward
5:1
> [1] 0 1 2 3 4 5 6 7 8 9
> [1] 1 2 3 4 5 6 7 8 9
> [1] 5 4 3 2 1
seq(), sequence
A more general way to generate sequences
Example
### Example 1
seq(10)
### Example 2
seq(-1, 1, by = 0.2)
### Example 3
seq(length = 6, from = -5, by = 0.2)
> [1] 1 2 3 4 5 6 7 8 9 10
> [1] -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0
> [1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0
rep(), repeat
Example
x = 1:5
### Example 1
rep(x, times = 5)
### Example 2
rep(x, each = 5)
> [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
> [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
Missing Values
- NA: not available, missing
  - Most operations return NA if NA is present
  - Most stat functions have an option for ignoring NA (na.rm = TRUE)
- is.na()
  - returns a logical vector that is TRUE where the argument has NA, FALSE otherwise
  - the expression x==NA doesn’t work because NA is not a value, it is a marker
NaN, Inf
- NaN: Not a Number
- Inf: \(\infty\); Inf is too big, -Inf is too small
- is.na() returns TRUE for both NA and NaN
- is.nan() returns TRUE only for NaN
Example
### Incomplete operation if containing NA
x = 1:15; x[c(1, 3, 6)] = NA
mean(x)
### Some functions have argument for ignoring NA
mean(x, na.rm = TRUE)
### Logical vector of index of NAs
is.na(x)
> [1] NA
> [1] 9.166667
> [1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> [12] FALSE FALSE FALSE FALSE
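The example above covers NA only. The sketch below adds NaN and Inf and shows why x == NA is not a usable test; the vectors are made up for illustration.

```r
x = c(1, NA, 3)

x == NA           # NA NA NA: comparing with NA never gives TRUE
is.na(x)          # FALSE TRUE FALSE

0 / 0             # NaN
1 / 0             # Inf
-1 / 0            # -Inf

v = c(NA, NaN, Inf, 1)
is.na(v)          # TRUE TRUE FALSE FALSE
is.nan(v)         # FALSE TRUE FALSE FALSE
is.finite(v)      # FALSE FALSE FALSE TRUE
```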
Objects
Properties
These functions give different results depending on the object.
- length(object)
- attributes(object)
- attr(object, name): selects a specific attribute
- class(object): gives the class of the object, e.g. numeric, logical, character, matrix, array, factor
Example
age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7,
79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
village = data.frame(age = age, height = height)
### Village, a data frame
village
> age height
> 1 18 76.1
> 2 19 77.0
> 3 20 78.1
> 4 21 78.2
> 5 22 78.8
> 6 23 79.7
> 7 24 79.9
> 8 25 81.1
> 9 26 81.2
> 10 27 81.8
> 11 28 82.8
> 12 29 83.5
### length()
length(village)
> [1] 2
### attributes()
attributes(village)
> $names
> [1] "age" "height"
>
> $row.names
> [1] 1 2 3 4 5 6 7 8 9 10 11 12
>
> $class
> [1] "data.frame"
### class()
class(village)
> [1] "data.frame"
### summary()
summary(village)
> age height
> Min. :18.00 Min. :76.10
> 1st Qu.:20.75 1st Qu.:78.17
> Median :23.50 Median :79.80
> Mean :23.50 Mean :79.85
> 3rd Qu.:26.25 3rd Qu.:81.35
> Max. :29.00 Max. :83.50
Vectors
- A set of elements of the same mode (numeric, character, logical, or complex)
- can be a single value
- can be extended without special consideration; skipped positions are filled with NA
Example
a = c(4, 6, 8)
a[5] = 9
a
> [1] 4 6 8 NA 9
Arrays
- Data structure of all one type: vector, matrix
- A matrix of dimension \((10 \times 1)\) is not a vector of length \(10\)
- A matrix and a vector are different objects
Reshaping vector
matrix(vector, nrow = k, ncol = n) builds a \(k\) by \(n\) matrix, filling it column by column from left to right
Example
vector = c(1, 2, 3, 4, 5, 6)
matrix(vector, nrow = 2, ncol = 3)
> [,1] [,2] [,3]
> [1,] 1 3 5
> [2,] 2 4 6
Stacking vectors
- cbind(): combines vectors side by side as columns
- rbind(): stacks vectors on top of each other as rows
Example
x = c(11, 12, 13)
y = c(77, 55, 33)
### Vertically stacking: 2 by 3 matrix
rbind(x, y)
### Horizontally stacking: 3 by 2 matrix
cbind(x, y)
> [,1] [,2] [,3]
> x 11 12 13
> y 77 55 33
> x y
> [1,] 11 77
> [2,] 12 55
> [3,] 13 33
Accessing elements
matrix[i, j] gives you the value in the ith row and jth column. Leave i or j blank to access all rows or all columns, respectively.
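A minimal sketch of matrix indexing on a small made-up matrix m. The drop = FALSE argument is included because it ties back to the point that a one-column matrix is not the same object as a vector.

```r
m = matrix(1:6, nrow = 2, ncol = 3)

m[2, 3]                      # 6: row 2, column 3
m[1, ]                       # 1 3 5: all of row 1
m[, 2]                       # 3 4: all of column 2, returned as a vector

# drop = FALSE keeps the result as a matrix
col2 = m[, 2, drop = FALSE]
dim(col2)                    # 2 1
```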
List
- Similar to an array, but items can be of different kinds
- Flexible: we can make a list of lists
- names() shows named items in a list
- If list items are not named, access them with double brackets ([[ ]])
- You can also create a list using objects already in your workspace
Example
### Example 1
L = list() # empty list
# a vector item called coefficients
L$coefficients = c(1, 4, 6, 8)
L
> $coefficients
> [1] 1 4 6 8
# assign item 4, then item 2 and 3 are set to NULL
L[[4]] = c(5, 8)
L
> $coefficients
> [1] 1 4 6 8
>
> [[2]]
> NULL
>
> [[3]]
> NULL
>
> [[4]]
> [1] 5 8
# extract 4th item
L[[4]]
> [1] 5 8
# assign name to item 4
names(L)[[4]] = 'dummy'
L
> $coefficients
> [1] 1 4 6 8
>
> [[2]]
> NULL
>
> [[3]]
> NULL
>
> $dummy
> [1] 5 8
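The last two bullet points (building a list from existing objects and nesting lists) are not covered by the example above. A short sketch, with object names (info, nested, label) invented for illustration:

```r
age = 18:20
label = "village A"

# Build a list from objects that already exist in the workspace
info = list(age = age, label = label, flags = c(TRUE, FALSE))
names(info)            # "age" "label" "flags"

# [[ ]] and $ return the item itself
info[["age"]]          # 18 19 20
info$label             # "village A"

# [ ] returns a list containing the selected item
class(info["age"])     # "list"

# A list can hold other lists
nested = list(first = info, second = list(1, "a"))
nested$first$age[2]    # 19
```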
Data Frame
- Special kind of list where each item (variable) must have the same number of elements. Variables may be different types.
- Can operate with both list commands and array commands
Example
### Generate the data frame
grp = c(1,2,2,1,1,1,2,2,1,2,2,1)
gpa = c(4.0, 3.5, 2.8, 3.9, 2.2, 3.8, 2.7, 3.8, 4.0, 3.6, 3.4, 2.1)
age = c(21, 22, 19, 32, 25, 22, 20, 23, 21, 24, 22, 30)
sex = c('F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F')
dat = data.frame(Group = grp, GPA = gpa, Age = age, Sex = sex)
dat
### (1) Access observation 1-10 and all associated variables
dat[1:10,]
### (2) Access all observations in group 1
dat[dat$Group == 1, ]
> Group GPA Age Sex
> 1 1 4.0 21 F
> 2 2 3.5 22 M
> 3 2 2.8 19 M
> 4 1 3.9 32 M
> 5 1 2.2 25 F
> 6 1 3.8 22 F
> 7 2 2.7 20 M
> 8 2 3.8 23 M
> 9 1 4.0 21 M
> 10 2 3.6 24 F
> 11 2 3.4 22 M
> 12 1 2.1 30 F
> Group GPA Age Sex
> 1 1 4.0 21 F
> 2 2 3.5 22 M
> 3 2 2.8 19 M
> 4 1 3.9 32 M
> 5 1 2.2 25 F
> 6 1 3.8 22 F
> 7 2 2.7 20 M
> 8 2 3.8 23 M
> 9 1 4.0 21 M
> 10 2 3.6 24 F
> Group GPA Age Sex
> 1 1 4.0 21 F
> 4 1 3.9 32 M
> 5 1 2.2 25 F
> 6 1 3.8 22 F
> 9 1 4.0 21 M
> 12 1 2.1 30 F
Tables
- Frequency tables
- You can create factor variables via factor()
- You can create frequency tables via table()
- You can create factor variables from a numeric vector via cut()
- You can also create two-way frequency tables via table(a, b)
Example
### 30 tax accountants from Australia
state = c('tas', 'sa', 'qld', 'nsw', 'nsw', 'nt', 'wa', 'wa', 'qld', 'vic', 'nsw', 'vic',
'qld', 'qld', 'sa', 'tas', 'sa', 'nt', 'wa', 'vic', 'qld', 'nsw', 'nsw', 'wa',
'sa', 'act', 'nsw', 'vic', 'vic', 'act')
### Create a factor variable
statefac = factor(state)
statefac
### Create a frequency table
statefreq = table(statefac)
statefreq
### Create a factor variable from numeric vector
incomes = c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42,
56, 61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48,
52, 46, 59, 46, 58, 43)
cats = cut(incomes, breaks = 35+10*(0:7))
incomefac = factor(cats)
# breaks: 8 cut points starting at 35 in steps of 10, giving 7 intervals
incomefac
### Two-way table
table(incomefac, statefac)
> [1] tas sa qld nsw nsw nt wa wa qld vic nsw vic qld qld sa tas sa
> [18] nt wa vic qld nsw nsw wa sa act nsw vic vic act
> Levels: act nsw nt qld sa tas vic wa
> statefac
> act nsw nt qld sa tas vic wa
> 2 6 2 5 4 2 5 4
> [1] (55,65] (45,55] (35,45] (55,65] (55,65] (55,65] (55,65] (45,55]
> [9] (55,65] (65,75] (65,75] (35,45] (55,65] (55,65] (55,65] (55,65]
> [17] (55,65] (45,55] (45,55] (55,65] (45,55] (45,55] (35,45] (45,55]
> [25] (45,55] (45,55] (55,65] (45,55] (55,65] (35,45]
> Levels: (35,45] (45,55] (55,65] (65,75]
> statefac
> incomefac act nsw nt qld sa tas vic wa
> (35,45] 1 1 0 1 0 0 1 0
> (45,55] 1 1 1 1 2 0 1 3
> (55,65] 0 3 1 3 2 2 2 1
> (65,75] 0 1 0 0 0 0 1 0
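As a smaller sketch of the same tools, the block below turns counts into proportions with prop.table() and shows how cut() labels its intervals. The data (answers, scores) are made up for illustration.

```r
answers = c("yes", "no", "yes", "yes", "maybe")

counts = table(answers)       # maybe 1, no 1, yes 3
props  = prop.table(counts)   # each count divided by the total

# By default cut() builds intervals that are open on the left
# and closed on the right, so 5 falls into (0,5]
scores = c(1, 5, 10)
bins = cut(scores, breaks = c(0, 5, 10))
as.character(bins)            # "(0,5]" "(0,5]" "(5,10]"
```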
Data Management
Data Export
Output .txt file
From a data frame
Use write.table() to output a .txt file from a data frame
Example
age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
village = data.frame(Age = age, Height = height)
### write to a file
write.table(village, file = "village.txt", sep = "\t", col.names = NA, quote = F)
# sep = "\t": separate columns by a tab
# col.names = NA: add a blank header cell so the column names align with the columns when row names are written
# quote = F: suppress quotes around character values
From a matrix
Use write() to output a .txt file from a matrix
Example
x = matrix(1, 20, 20)
write(x, file = "matrix.txt", ncolumns = 20)
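One detail the all-ones matrix above cannot reveal: write() walks through the data column by column, so a matrix usually needs to be transposed first if the file should look like the printed matrix. A small sketch, assuming a temporary file (out) rather than a file in your working directory:

```r
x = matrix(1:6, nrow = 2)          # rows are 1 3 5 and 2 4 6
out = tempfile(fileext = ".txt")   # temporary file, used only for this demo

# Transpose so that each row of x becomes one line of the file
write(t(x), file = out, ncolumns = ncol(x))

lines_written = readLines(out)
lines_written                      # "1 3 5" "2 4 6"
```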
Save R objects
Use save() to save R objects; you can load them later without re-reading or converting the original data.
Example
save(village, file = "village.Rdata")
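A round trip with save() and load() can be sketched as follows. It assumes a temporary file (path) and a shortened version of the village data, so nothing is written to your working directory.

```r
village = data.frame(age = 18:20, height = c(76.1, 77, 78.1))
path = tempfile(fileext = ".RData")   # temporary file for the demo

save(village, file = path)
rm(village)                 # the object is gone from the workspace

# load() restores the object under its original name
# and returns the names of everything it restored
restored = load(path)
restored                    # "village"
```

Because load() restores objects under their saved names, it can overwrite an object of the same name that is already in the workspace.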
Data Import
Read data from external files. Generally, in a data file:
- The 1st row holds the names of the variables
- The other rows contain variable values separated by spaces, commas, etc.
read.table()
- Reads a data frame from a txt file via read.table("file/name.txt", header = TRUE)
- Reads variables as numeric or factor by using the argument colClasses
- Specifies the delimiter with the argument sep
- Some other arguments:
  - col.names: specify column names, similar in spirit to names()
  - row.names: specify row names
  - skip: number of lines to skip before reading
  - na.strings: how missing values are recorded
Example
Instead of specifying the whole path to your file, you can set your working directory to the folder that contains it.
### Example 1: Reads a data frame
scores = read.table("scores_names.txt", header = TRUE)
scores
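# Extract a column of interest with $ or [['']]
# (the factor output below comes from R < 4.0; since R 4.0, strings are read as character unless stringsAsFactors = TRUE)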
scores$gender
scores[['gender']]
# One pair of brackets gives you a data frame of the column
scores['gender']
> name gender aptitude statistics
> 1 Linda F 95 85
> 2 Jason M 85 95
> 3 Susan F 80 70
> 4 Mike M 70 65
> 5 Judy F 60 70
> [1] F M F M F
> Levels: F M
> [1] F M F M F
> Levels: F M
> gender
> 1 F
> 2 M
> 3 F
> 4 M
> 5 F
### Example 2: Sets the class of each column with colClasses
scores = read.table("scores_names.txt", header = TRUE,
colClasses = c('character', 'character', 'integer', 'integer'))
scores[['gender']]
> [1] "F" "M" "F" "M" "F"
### Example 3: Specifies the delimiter
reading = read.table("reading.txt")
reading
reading = read.table("reading.txt", sep = ",")
reading
names(reading) = c('Name', 'Week1', 'Week2', 'Week3', 'Week4', 'Week5')
reading
> V1
> 1 Grace,3,1,5,2,6
> 2 Martin,1,2,4,1,3
> 3 Scott,9,10,4,8,6
> V1 V2 V3 V4 V5 V6
> 1 Grace 3 1 5 2 6
> 2 Martin 1 2 4 1 3
> 3 Scott 9 10 4 8 6
> Name Week1 Week2 Week3 Week4 Week5
> 1 Grace 3 1 5 2 6
> 2 Martin 1 2 4 1 3
> 3 Scott 9 10 4 8 6
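The examples above depend on files that are not included here. The sketch below is self-contained: it passes made-up data through the text argument of read.table() and combines header, sep, na.strings, and colClasses. The names raw and scores2 and the use of "-" as the missing-value marker are assumptions for illustration.

```r
raw = "name,score,grade
Linda,95,A
Jason,-,B
Susan,80,A"

scores2 = read.table(text = raw, header = TRUE, sep = ",",
                     na.strings = "-",
                     colClasses = c("character", "numeric", "factor"))
str(scores2)

# The "-" entry was read as NA, so it has to be ignored explicitly
mean(scores2$score, na.rm = TRUE)    # 87.5
```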
read.fwf()
Same arguments as read.table(), but also takes widths, a vector of integers giving the number of characters each variable occupies. Negative values skip that many characters.
Example
reading2 = read.fwf("readingfwf.txt", widths = c(6, -9, 1, -3, 2, -3, 1, -2, 1, -1, 1))
reading2
> V1 V2 V3 V4 V5 V6
> 1 Grace 3 1 5 2 6
> 2 Martin 1 2 4 1 3
> 3 Scott 9 10 4 8 6
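Since readingfwf.txt is not shown, here is a smaller sketch that first writes a two-line fixed-width file to a temporary location and then reads it back. The file contents, the name fwf_file, and the chosen widths are assumptions for illustration; strip.white = TRUE is passed on to read.table() to remove padding spaces.

```r
fwf_file = tempfile(fileext = ".txt")
writeLines(c("Grace   3 1",
             "Martin  1 2"), fwf_file)

# 6 characters for the name, skip 2, 1 digit, skip 1, 1 digit
small = read.fwf(fwf_file, widths = c(6, -2, 1, -1, 1),
                 strip.white = TRUE)
small
```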
scan()
- read.table() uses scan() and processes the results
- scan() does the same thing as c() but without commas
- Hit ENTER once to continue the input on a new line
- Hit ENTER twice and scan() stops
Example
Enter the values into Console:
cooperation = scan()
49 64 37 52 68 54
61 79 64 29
27 58 52 41 30 40 39
44 34 44
Built-in datasets
- There are around 100 datasets in base R, in the package datasets
- Others come with other packages
- You can use data() to load those datasets, and call them by name
- You will sometimes need to specify the source package of the dataset
Example
data(AirPassengers)
install.packages('rpart')
data(kyphosis, package = 'rpart')
Saved R dataset
Example
load('village.Rdata')
Manipulate data frame
Example
data(airquality)
head(airquality)
> Ozone Solar.R Wind Temp Month Day
> 1 41 190 7.4 67 5 1
> 2 36 118 8.0 72 5 2
> 3 12 149 12.6 74 5 3
> 4 18 313 11.5 62 5 4
> 5 NA NA 14.3 56 5 5
> 6 28 NA 14.9 66 5 6
nrow(airquality)
> [1] 153
### Example 1: change Month to a factor
airquality$Month = factor(airquality$Month)
### Example 2: create new indicator variable Solar.I = 1 if Solar.R > 200
airquality$Solar.I = ifelse(airquality$Solar.R > 200, 1, 0)
### Example 3: replace Temp and Ozone with their z-scores
airquality$Temp = (airquality$Temp - mean(airquality$Temp)) / sd(airquality$Temp)
airquality$Ozone = (airquality$Ozone - mean(airquality$Ozone, na.rm = T)) / sd(airquality$Ozone, na.rm = T)
### Example 4: add a new variable
airquality$New = 1:153
### Example 5: Remove variable Day (column 6)
airquality = airquality[,-6]
### Example 6: Remove all observations with NA for Ozone or Solar.R
airquality = airquality[!is.na(airquality$Solar.R) & !is.na(airquality$Ozone),]
### Print out the manipulated dataset
head(airquality)
> Ozone Solar.R Wind Temp Month Solar.I New
> 1 -0.03423409 190 7.4 -1.1497140 5 0 1
> 2 -0.18580489 118 8.0 -0.6214670 5 0 2
> 3 -0.91334473 149 12.6 -0.4101682 5 0 3
> 4 -0.73145977 313 11.5 -1.6779609 5 1 4
> 7 -0.57988897 299 8.6 -1.3610128 5 1 7
> 8 -0.70114561 99 13.8 -1.9949091 5 0 8
nrow(airquality)
> [1] 111
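The row filter in Example 6 can also be written with complete.cases() or subset(). This sketch works on a fresh copy of the built-in data (named aq here for illustration) so that the modified airquality above is left alone.

```r
aq = datasets::airquality     # untouched copy of the built-in data

# TRUE for rows where both Ozone and Solar.R are present
keep = complete.cases(aq[, c("Ozone", "Solar.R")])
clean = aq[keep, ]
nrow(clean)                   # 111

# subset() is another base R way to express the same filter
clean2 = subset(aq, !is.na(Ozone) & !is.na(Solar.R))
nrow(clean2)                  # 111
```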
Data Merging
Vertical Merging
- vertical binding of data frames
- does not fill in NAs for variables that are missing in one data frame
Example
X = data.frame(a = c(1, 2, 3, 4), b = factor(c(1, 2, 2, 1)), c = c('A', 'B', 'B', 'A'))
Y = data.frame(a = c(3, 7, 8), b = factor(c(3, 3, 3)))
# rbind(X, Y) gives an error because the numbers of variables differ
# Fill in the missing variable with NA yourself
Y$c = NA
rbind(X, Y)
> a b c
> 1 1 1 A
> 2 2 2 B
> 3 3 2 B
> 4 4 1 A
> 5 3 3 <NA>
> 6 7 3 <NA>
> 7 8 3 <NA>
Horizontal Merging
- merge() joins two data frames on the variables given in its by arguments
- If the data frames share other variable names, merge() modifies the names to indicate which data frame each variable comes from
Example
A = data.frame(ID = c(1, 2, 3), age = c(11, 12, 14))
B = data.frame(ID = c(2, 3, 4), sex = c('M', 'M', 'F'))
### Merge two data frames by ID
# keep only the ID values common to A and B
merge(A, B, by = c('ID'))
### Merge two data frames by ID and keep all ID values
# IDs that appear in only one data frame get NA for the missing variables
merge(A, B, all = TRUE)
> ID age sex
> 1 2 12 M
> 2 3 14 M
> ID age sex
> 1 1 11 <NA>
> 2 2 12 M
> 3 3 14 M
> 4 4 NA F
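The renaming of shared variable names is not visible above because A and B share only ID. A sketch with two made-up data frames (A2 and B2) that both contain a score column:

```r
A2 = data.frame(ID = c(1, 2, 3), score = c(11, 12, 14))
B2 = data.frame(ID = c(2, 3, 4), score = c(20, 30, 40))

# Default suffixes mark which data frame each column came from
m = merge(A2, B2, by = "ID")
names(m)         # "ID" "score.x" "score.y"

# The suffixes argument replaces the defaults
m2 = merge(A2, B2, by = "ID", suffixes = c(".before", ".after"))
names(m2)        # "ID" "score.before" "score.after"

# all.x = TRUE keeps every row of the first data frame only
m3 = merge(A2, B2, by = "ID", all.x = TRUE)
```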
One-to-Many merge
Example
C = data.frame(ID = c(1, 1, 2, 1, 3, 3, 3, 4, 4), stars = c(1, 4, 3, 2, 1, 7, 5, 2, 2))
### Order by merge(A,B)
merge(merge(A, B, by = c('ID'), all = TRUE), C, by = c('ID'), all = TRUE)
### Order by C
merge(C, merge(A, B, by = c('ID'), all = TRUE), by = c('ID'), all = TRUE)
> ID age sex stars
> 1 1 11 <NA> 1
> 2 1 11 <NA> 4
> 3 1 11 <NA> 2
> 4 2 12 M 3
> 5 3 14 M 1
> 6 3 14 M 7
> 7 3 14 M 5
> 8 4 NA F 2
> 9 4 NA F 2
> ID stars age sex
> 1 1 1 11 <NA>
> 2 1 4 11 <NA>
> 3 1 2 11 <NA>
> 4 2 3 12 M
> 5 3 1 14 M
> 6 3 7 14 M
> 7 3 5 14 M
> 8 4 2 NA F
> 9 4 2 NA F
R Graphics
General functions for all graphics: title(), legend(), axis(), etc.
Bar Charts
25 people were asked for their beer preference: 1 for domestic can, 2 for domestic bottle, 3 for microbrew, and 4 for import.
Bar plot
beer = c(3, 4, 1, 1, 3, 4, 3, 3, 1, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 1,
1, 1, 4, 3, 1)
barplot(beer, col = "SteelBlue")
Bar plot by frequency
barplot(table(beer), col = "SteelBlue")
Bar plot by proportion
barplot(table(beer)/length(beer), col = "SteelBlue")
Bar plot by proportion, horizontal
barplot(table(beer)/length(beer), col = c('lightblue', 'mistyrose', 'lightcyan', 'cornsilk'),
horiz = T)
Pie Charts
Pie chart
beer.counts = table(beer)
pie(beer.counts)
Pie chart with variable labels
names(beer.counts) = c('Domestic Can', 'Domestic Bottle', 'Microbrew', 'Import')
pie(beer.counts)
Pie chart with colors
pie(beer.counts, col = c('lightblue', 'mistyrose', 'lightcyan', 'cornsilk'))
Histograms
Fourth-week gross receipts of the top 25 movies.
Histogram by frequency
movie = c(29.6, 28.2, 19.6, 13.7, 13.0, 7.8, 3.4, 2.0, 1.9, 1.0,
0.7, 0.4, 0.4, 0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2,
0.2, 0.1, 0.1, 0.1, 0.1)
hist(movie, col = "SteelBlue")
Histogram by proportion
hist(movie, freq = F, col = "SteelBlue")
Histogram by specified bins
hist(movie, col = "SteelBlue", breaks = c(0, 1, 2, 3, 4, 5, 10, 20, max(movie)))
Dot Plots
Use the built-in dataset mtcars.
Dot plot
dotchart(mtcars$mpg, labels = row.names(mtcars), cex = 0.7, xlab = 'MPG')
Dot plot colored by cyl and ordered by mpg
x = mtcars[order(mtcars$mpg), 1:2] #mpg:cyl
x$cyl = factor(x$cyl)
x$color[x$cyl==4]='red'
x$color[x$cyl==6]='blue'
x$color[x$cyl==8]='darkgreen'
dotchart(x$mpg, labels = row.names(x), cex = 0.7,
groups = x$cyl, color = x$color, xlab = 'MPG')
Box Plots
Based on the 5-number summary: min, 1stQ, med, 3rdQ, max
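The five numbers can be computed directly with fivenum(). A small sketch, using the same values as the single-sample example; note that fivenum() reports hinges, which can differ slightly from the quartiles given by quantile(), and that boxplot() draws whiskers only up to the most extreme non-outlying points.

```r
x = c(7, 9.5, 10, 11, 10, 10, 8, 11, 8, 13, 11.5)

# minimum, lower hinge, median, upper hinge, maximum
five = fivenum(x)
five

# The values boxplot() actually uses for whiskers, box, and median
drawn = boxplot.stats(x)$stats
drawn
```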
Single sample
x = c(7, 9.5, 10, 11, 10, 10, 8, 11, 8, 13, 11.5)
boxplot(x, xlab = 'Single Sample', ylab = 'Value Axis', main = 'Simple Box Plot', col = 'lightblue')
Multiple Samples
growth = c(75, 72, 73, 61, 67, 64, 62, 63)
sugar = c('C', 'C', 'C', 'F', 'F', 'F', 'S', 'S')
fly = data.frame(growth = growth, sugar = sugar)
boxplot(growth~sugar, data = fly, xlab = 'Sugar Type', ylab = 'Growth', col = 'bisque')
title(main = 'Growth by Sugar Type', font.main = 4)
Scatter Plots
Use the notation plot(x, y, ...) or plot(y ~ x, ...)
Scatter plot
plot(dist~speed, data = cars, xlab = 'Speed', ylab = 'Distance', col = 'blue')
title(main = 'Scatter Plot with Least Squares Line')
abline(lm(dist~speed, data = cars), col = 'red')
Scatter plot matrix
pairs(~mpg+disp+drat+wt, data = mtcars)
Multiple Plots
Set the number of plots in the graph window with:
- par()
- layout()
Example
### Example 1
# a window of two graphs
par(mfrow = c(2, 1))
### Example 2
# a window of four graphs
layout(matrix(c(1, 3, 2, 4), 2, 2, byrow = T),
widths = c(3, 1), heights = c(1, 2))
Clear the graph window and return to defaults
dev.off()
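The par() call in Example 1 only sets up the layout. A fuller sketch, which fills a two-panel window with plots of mtcars and then restores the previous settings without closing the device (old_par is a name chosen for illustration):

```r
# par() returns the old settings, so they can be restored afterwards
old_par = par(mfrow = c(1, 2))      # 1 row, 2 columns

hist(mtcars$mpg, col = "SteelBlue", main = "MPG", xlab = "MPG")
boxplot(mpg ~ cyl, data = mtcars, col = "bisque", main = "MPG by cyl")

par(old_par)                        # back to one plot per window
```

Restoring with par(old_par) keeps the current plot on screen, whereas dev.off() closes the graph window altogether.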