Packages

Quick Review: a cheat sheet for base R

Installation

### Install only once
install.packages("package_name")
### Load the package every time you want to use it
library(package_name) 

Remember, you don’t have to install every package before you start working in R. There are far too many packages (and more appear all the time); install one only when you actually need it. You will often have more than one option for a given analysis: for example, itsmr, timeSeries, forecast, and quantmod are all packages for time series analysis in R.
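
If you want to avoid re-installing something that is already on your machine, a common pattern (a minimal sketch; substitute any package name) is to install only when the package is missing:

### Install forecast only if it is not already installed
if (!requireNamespace("forecast", quietly = TRUE)) {
  install.packages("forecast")
}
library(forecast)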

Here is a quick list of useful R packages.

For Loading Datasets

haven, loading foreign datasets

  • [Package Source]
  • read_dta('filename.dta'):
    Import a Stata dataset (.dta)
  • read_sas('filename.sas7bdat'):
    Import a SAS dataset (.sas7bdat)
  • read_spss('filename.sav'):
    Import an SPSS dataset (.sav)

For Modeling

car, ANOVA functions for type II and type III ANOVA tables

  • [Package Source]
  • avPlot(model, ...)
    Added-variable plots for linear and generalized linear models
  • ncvTest(model, ...)
    Score Test for Non-Constant Error Variance

lme4, linear mixed-effects models
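
lme4 is not used in the exercises below, so here is a quick, hedged sketch of its main function lmer() on simulated data (the data frame and variable names are made up for illustration):

library(lme4)
### Simulated example: test scores for students nested in 20 schools
school_data = data.frame(
  school = factor(rep(1:20, each = 15)),
  hours  = runif(300, 0, 10)
)
school_data$score = 50 + 2*school_data$hours +
  rnorm(20)[school_data$school] + rnorm(300)
### Random intercept for each school, fixed effect of study hours
mixed = lmer(score ~ hours + (1 | school), data = school_data)
summary(mixed)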

gmodels, various tools for model fitting

  • [Package Source]
  • CrossTable(x,y, ...):
    Cross Tabulation with Tests for Factor Independence

nnet, multinomial log-linear models

  • [Package Source]
  • multinom(formula, data, ...):
    Fit multinomial log-linear models via neural networks

For Testing

lmtest, testing linear regression models

  • [Package Source]
  • waldtest(model1, model2):
    Wald test of nested models
  • resettest(formula, ...):
    Ramsey’s RESET test for omitted variables, same as ovtest in Stata
  • bptest(formula, ...):
    Breusch-Pagan test for heteroskedasticity, same as hettest in Stata
  • lrtest(model1, model2, ...):
    Likelihood ratio tests for nested models

moments

  • [Package Source]
  • skewness(x, na.rm = TRUE):
    Skewness of the given data
  • agostino.test(x, alternative = c("two.sided", "less", "greater")):
    D’Agostino test for skewness in normally distributed data
  • kurtosis(x, na.rm = TRUE):
    Pearson’s measure of kurtosis
  • anscombe.test(x, alternative = c("two.sided", "less", "greater")):
    Anscombe-Glynn test of kurtosis for normal samples

olsrr, tools for building and diagnosing OLS regression models

For Graphing

ggplot2, R’s most famous package for making beautiful graphics

Data Analysis

Education Investigation

  1. dataset
  2. script

Getting Started I

(1) Set up your environment

setwd('~/r-workshops/education_investigation/')

(2) Load your dataset

### Specify the NA strings
education = read.csv('education2000.csv', header = T, na.strings = "")
### Preview the first 7 rows of the first 8 columns
head(education[,1:8], 7)
>   id age educ paeduc    sex  race         income          region
> 1  1  26   16     16   male white           <NA> w. sou. central
> 2  2  48   15      9 female white  $8000 to 9999 w. sou. central
> 3  3  67   13     12 female white $15000 - 19999 w. sou. central
> 4  4  39   14     12 female white $25000 or more w. sou. central
> 5  5  25   14     13 female white $25000 or more w. sou. central
> 6  6  25   12     NA female white           <NA> w. sou. central
> 7  7  36   14     18   male white $25000 or more w. sou. central

Data Summary I

(1) Extract rows where miss is 0 (drop missing values)

### A new data frame containing only rows where variable "miss" is 0
education_miss0 = education[education$miss==0, ]
### Preview the first 7 rows of the first 8 columns
head(education_miss0[,1:8], 7)
>   id age educ paeduc    sex  race         income          region
> 1  1  26   16     16   male white           <NA> w. sou. central
> 2  2  48   15      9 female white  $8000 to 9999 w. sou. central
> 3  3  67   13     12 female white $15000 - 19999 w. sou. central
> 4  4  39   14     12 female white $25000 or more w. sou. central
> 5  5  25   14     13 female white $25000 or more w. sou. central
> 7  7  36   14     18   male white $25000 or more w. sou. central
> 8  8  44   14     12 female white $20000 - 24999 w. sou. central
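
An equivalent way to drop those rows, which some find easier to read, is base R’s subset():

### Same result as the logical indexing above
education_miss0 = subset(education, miss == 0)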

(2) Examine the distribution of educ

### Draw three figures in a window
layout(matrix(c(1, 2, 3), 1, 3, byrow = T))

### qqplot: check normality
qqnorm(education_miss0$educ, col = "SteelBlue")
qqline(education_miss0$educ, col = "red")

### boxplot:
boxplot(education_miss0$educ, col = "SteelBlue")
title('Boxplot of education years')
mtext('Years', side = 2, line = 2.1, cex = 0.7)

### histogram:
hist(education_miss0$educ, col = "SteelBlue", breaks = 10, freq = F, main = "", xlab = "")
title('Histogram of education years')
mtext('Education Years', side = 1, line = 2.1, cex = 0.7)
box()

(3) Test the skewness and kurtosis of educ

library(moments)
### get skewness and test it
cat('Skewness: ', skewness(education_miss0$educ))
agostino.test(education_miss0$educ)
### get kurtosis and test it
cat('Kurtosis: ', kurtosis(education_miss0$educ))
anscombe.test(education_miss0$educ)
> Skewness:  -0.1026498
> 
>   D'Agostino skewness test
> 
> data:  education_miss0$educ
> skew = -0.10265, z = -1.84620, p-value = 0.06487
> alternative hypothesis: data have a skewness
> Kurtosis:  3.749285
> 
>   Anscombe-Glynn kurtosis test
> 
> data:  education_miss0$educ
> kurt = 3.7493, z = 5.0874, p-value = 3.631e-07
> alternative hypothesis: kurtosis is not equal to 3

(4) Generate a new variable educsq in the full dataset (instead of dropping missing values)

### create a new column in original dataset
education$educsq = education$educ^2

### Examine the distribution of the new variable
### Draw two figures in a window
layout(matrix(c(1, 2), 1, 2, byrow = T))

### qqplot: check normality
qqnorm(education$educsq, col = "SteelBlue")
qqline(education$educsq, col = "red")

### histogram:
hist(education$educsq, col = "SteelBlue", breaks = 10, freq = F, main = "", xlab = "")
title('Histogram of (education years)^2')
mtext('(Education Years)^2', side = 1, line = 2.1, cex = 0.7)
box()

Modeling I

[Model 1]: Linear regression using educ as dependent variable

model1 = lm(educ ~ female + age + west + midwest + nrthest, data = education_miss0)
summary(model1)
anova(model1)
> 
> Call:
> lm(formula = educ ~ female + age + west + midwest + nrthest, 
>     data = education_miss0)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -11.318  -1.707  -0.387   2.013   7.014 
> 
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)    
> (Intercept) 14.465140   0.210479  68.725  < 2e-16 ***
> femaleYes   -0.179782   0.125302  -1.435  0.15151    
> age         -0.017404   0.003773  -4.613 4.24e-06 ***
> westYes      0.522318   0.170497   3.064  0.00222 ** 
> midwestYes   0.041180   0.163345   0.252  0.80099    
> nrthestYes   0.180938   0.178900   1.011  0.31195    
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 2.737 on 1930 degrees of freedom
> Multiple R-squared:  0.01782, Adjusted R-squared:  0.01528 
> F-statistic: 7.003 on 5 and 1930 DF,  p-value: 1.704e-06
> Analysis of Variance Table
> 
> Response: educ
>             Df  Sum Sq Mean Sq F value   Pr(>F)    
> female       1    17.5  17.529  2.3401 0.126244    
> age          1   165.5 165.498 22.0938 2.78e-06 ***
> west         1    71.4  71.439  9.5370 0.002042 ** 
> midwest      1     0.2   0.167  0.0223 0.881297    
> nrthest      1     7.7   7.662  1.0229 0.311954    
> Residuals 1930 14457.1   7.491                     
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(1) Extract specific values from the results

The result object usually contains more than what it prints out. Use str() to check what you can extract from it, for example:

str(summary(model1))

Then you can access the components in any of the following ways:

### residuals for observation 1 to 10
summary(model1)$residuals[1:10]
### You can also just do this:
residuals(model1)[1:10]

### R squared and adjusted R squared of this model
cat('R squared: ', summary(model1)$r.squared)
cat('adj. R squared: ', summary(model1)$adj.r.squared)

### coefficients of this model
summary(model1)$coefficients
>          1          2          3          4          5          7 
>  1.9873517  1.5500108 -0.1193224  0.3933791  0.1497298  0.1613868 
>          8          9         10         11 
>  0.4803967  4.3006150  2.3528256  4.6370284
>          1          2          3          4          5          7 
>  1.9873517  1.5500108 -0.1193224  0.3933791  0.1497298  0.1613868 
>          8          9         10         11 
>  0.4803967  4.3006150  2.3528256  4.6370284
> R squared:  0.01781975
> adj. R squared:  0.01527524
>                Estimate  Std. Error    t value     Pr(>|t|)
> (Intercept) 14.46513984 0.210479181 68.7248011 0.000000e+00
> femaleYes   -0.17978169 0.125301592 -1.4347918 1.515085e-01
> age         -0.01740352 0.003772953 -4.6127053 4.235625e-06
> westYes      0.52231768 0.170496770  3.0635048 2.217851e-03
> midwestYes   0.04117982 0.163345107  0.2521032 8.009882e-01
> nrthestYes   0.18093847 0.178899819  1.0113955 3.119539e-01

(2) Extract information on the significant variables

### coefficients table
coef_table1 = summary(model1)$coefficients
### extract significant variables
coef_table1[coef_table1[,"Pr(>|t|)"]<0.05, ]
>                Estimate  Std. Error   t value     Pr(>|t|)
> (Intercept) 14.46513984 0.210479181 68.724801 0.000000e+00
> age         -0.01740352 0.003772953 -4.612705 4.235625e-06
> westYes      0.52231768 0.170496770  3.063505 2.217851e-03

[Model 2]: Linear regression adding variable race

model2 = lm(educ ~ female + age + west + midwest + nrthest + black + other,
             data = education_miss0)
summary(model2)
anova(model2)
> 
> Call:
> lm(formula = educ ~ female + age + west + midwest + nrthest + 
>     black + other, data = education_miss0)
> 
> Residuals:
>      Min       1Q   Median       3Q      Max 
> -11.4228  -1.7524  -0.2311   1.9397   7.1373 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept) 14.65782    0.21358  68.630  < 2e-16 ***
> femaleYes   -0.13426    0.12472  -1.076   0.2818    
> age         -0.01813    0.00377  -4.809 1.64e-06 ***
> westYes      0.40908    0.17177   2.382   0.0173 *  
> midwestYes  -0.04601    0.16303  -0.282   0.7778    
> nrthestYes   0.09574    0.17858   0.536   0.5919    
> blackYes    -1.08581    0.20164  -5.385 8.13e-08 ***
> otherYes    -0.09373    0.28918  -0.324   0.7459    
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 2.718 on 1928 degrees of freedom
> Multiple R-squared:  0.03237, Adjusted R-squared:  0.02886 
> F-statistic: 9.215 on 7 and 1928 DF,  p-value: 2.963e-11
> Analysis of Variance Table
> 
> Response: educ
>             Df  Sum Sq Mean Sq F value    Pr(>F)    
> female       1    17.5  17.529  2.3729    0.1236    
> age          1   165.5 165.498 22.4029 2.371e-06 ***
> west         1    71.4  71.439  9.6704    0.0019 ** 
> midwest      1     0.2   0.167  0.0226    0.8805    
> nrthest      1     7.7   7.662  1.0372    0.3086    
> black        1   213.5 213.463 28.8958 8.565e-08 ***
> other        1     0.8   0.776  0.1051    0.7459    
> Residuals 1928 14242.8   7.387                      
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test Model Improvement

Although the model summary tells us the significance of each variable and of the model as a whole, it does not tell us whether model 2 improves on model 1. Use a Wald test, via waldtest(), to check:

library(lmtest)
waldtest(model1, model2)
> Wald test
> 
> Model 1: educ ~ female + age + west + midwest + nrthest
> Model 2: educ ~ female + age + west + midwest + nrthest + black + other
>   Res.Df Df    F    Pr(>F)    
> 1   1930                      
> 2   1928  2 14.5 5.616e-07 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
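
For nested linear models fitted with lm(), base R’s anova() performs the same F test, so you can cross-check the result without lmtest:

### Should reproduce the F statistic from waldtest() above
anova(model1, model2)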

[Model 3]: Linear regression adding variable paeduc

model3 = lm(educ ~ female + age + west + midwest + nrthest + black + other + paeduc,
            data = education_miss0)
summary(model3)
anova(model3)
> 
> Call:
> lm(formula = educ ~ female + age + west + midwest + nrthest + 
>     black + other + paeduc, data = education_miss0)
> 
> Residuals:
>      Min       1Q   Median       3Q      Max 
> -11.0649  -1.6561  -0.2253   1.4983   8.8581 
> 
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)    
> (Intercept) 10.490089   0.303418  34.573  < 2e-16 ***
> femaleYes   -0.149875   0.115345  -1.299 0.193976    
> age          0.007976   0.003773   2.114 0.034650 *  
> westYes      0.149304   0.159494   0.936 0.349334    
> midwestYes  -0.277784   0.151313  -1.836 0.066537 .  
> nrthestYes   0.022660   0.165197   0.137 0.890911    
> blackYes    -0.670528   0.187877  -3.569 0.000367 ***
> otherYes     0.277297   0.268208   1.034 0.301319    
> paeduc       0.271949   0.015029  18.095  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 2.514 on 1927 degrees of freedom
> Multiple R-squared:  0.1729,  Adjusted R-squared:  0.1695 
> F-statistic: 50.36 on 8 and 1927 DF,  p-value: < 2.2e-16
> Analysis of Variance Table
> 
> Response: educ
>             Df  Sum Sq Mean Sq  F value    Pr(>F)    
> female       1    17.5   17.53   2.7746  0.095934 .  
> age          1   165.5  165.50  26.1957 3.392e-07 ***
> west         1    71.4   71.44  11.3076  0.000787 ***
> midwest      1     0.2    0.17   0.0264  0.870836    
> nrthest      1     7.7    7.66   1.2128  0.270909    
> black        1   213.5  213.46  33.7879 7.176e-09 ***
> other        1     0.8    0.78   0.1228  0.726003    
> paeduc       1  2068.5 2068.52 327.4143 < 2.2e-16 ***
> Residuals 1927 12174.3    6.32                       
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Contingency Table

To examine whether there is an interaction effect between being female and being African American, first read the contingency table of the two variables:

xtabs(~female+black, data = education_miss0)
>       black
> female  No Yes
>    No  799  77
>    Yes 927 133
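
If you also want a formal test of independence between the two factors, base R’s chisq.test() accepts the table directly (note this tests the marginal association, not the regression interaction itself):

### Pearson chi-squared test on the 2x2 table
chisq.test(xtabs(~ female + black, data = education_miss0))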

[Model 4]: Linear regression adding interaction variable

### An asterisk (*) between two variables gives you both the individual terms and the interaction term
model4 = lm(educ ~ female*black + age + west + midwest + nrthest + other + paeduc,
            data = education_miss0)
summary(model4)
anova(model4)
> 
> Call:
> lm(formula = educ ~ female * black + age + west + midwest + nrthest + 
>     other + paeduc, data = education_miss0)
> 
> Residuals:
>      Min       1Q   Median       3Q      Max 
> -11.0572  -1.6506  -0.2164   1.5079   8.8441 
> 
> Coefficients:
>                     Estimate Std. Error t value Pr(>|t|)    
> (Intercept)        10.500774   0.303996  34.542  < 2e-16 ***
> femaleYes          -0.173247   0.121826  -1.422  0.15516    
> blackYes           -0.812406   0.302953  -2.682  0.00739 ** 
> age                 0.008045   0.003776   2.131  0.03324 *  
> westYes             0.148294   0.159530   0.930  0.35271    
> midwestYes         -0.281842   0.151490  -1.860  0.06297 .  
> nrthestYes          0.023178   0.165227   0.140  0.88846    
> otherYes            0.278341   0.268259   1.038  0.29959    
> paeduc              0.271941   0.015032  18.091  < 2e-16 ***
> femaleYes:blackYes  0.227250   0.380628   0.597  0.55055    
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 2.514 on 1926 degrees of freedom
> Multiple R-squared:  0.1731,  Adjusted R-squared:  0.1692 
> F-statistic: 44.78 on 9 and 1926 DF,  p-value: < 2.2e-16
> Analysis of Variance Table
> 
> Response: educ
>                Df  Sum Sq Mean Sq  F value    Pr(>F)    
> female          1    17.5   17.53   2.7737  0.095990 .  
> black           1   228.0  227.95  36.0691 2.271e-09 ***
> age             1   175.0  175.04  27.6966 1.577e-07 ***
> west            1    51.1   51.14   8.0920  0.004493 ** 
> midwest         1     2.1    2.08   0.3296  0.565960    
> nrthest         1     2.0    2.02   0.3190  0.572273    
> other           1     0.8    0.78   0.1228  0.726047    
> paeduc          1  2068.5 2068.52 327.3049 < 2.2e-16 ***
> female:black    1     2.3    2.25   0.3565  0.550551    
> Residuals    1926 12172.0    6.32                       
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Model 5]: Linear regression using educsq as dependent variable

model5 = lm(educsq ~ paeduc + age + polviews, data = education)
summary(model5)
anova(model5)
> 
> Call:
> lm(formula = educsq ~ paeduc + age + polviews, data = education)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -187.58  -46.40  -13.63   39.05  251.68 
> 
> Coefficients:
>                              Estimate Std. Error t value Pr(>|t|)    
> (Intercept)                  108.8219     8.8806  12.254  < 2e-16 ***
> paeduc                         6.8701     0.4251  16.160  < 2e-16 ***
> age                            0.3113     0.1063   2.928  0.00345 ** 
> polviewsextremely liberal     19.8902     9.0227   2.204  0.02762 *  
> polviewsextrmly conservative -15.1011     9.5278  -1.585  0.11315    
> polviewsliberal                9.6799     6.0849   1.591  0.11183    
> polviewsmoderate             -13.9357     4.6977  -2.967  0.00305 ** 
> polviewsslghtly conservative   7.2535     5.7511   1.261  0.20738    
> polviewsslightly liberal       6.1740     6.3344   0.975  0.32985    
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 69.44 on 1840 degrees of freedom
>   (968 observations deleted due to missingness)
> Multiple R-squared:  0.161,   Adjusted R-squared:  0.1574 
> F-statistic: 44.14 on 8 and 1840 DF,  p-value: < 2.2e-16
> Analysis of Variance Table
> 
> Response: educsq
>             Df  Sum Sq Mean Sq  F value    Pr(>F)    
> paeduc       1 1449137 1449137 300.5248 < 2.2e-16 ***
> age          1   43437   43437   9.0081  0.002724 ** 
> polviews     6  210140   35023   7.2632 1.105e-07 ***
> Residuals 1840 8872521    4822                       
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Diagnostics I

(1) Check assumptions of linear models via graphs

Assume that model 5 is our selected model. There are several assumptions we need to check before adopting it as the final model, and R can generate useful diagnostic graphs with the plot() command:

### Draw four figures in a window
layout(matrix(c(1, 2, 3, 4), 2, 2, byrow = T))

#### which=1 gives the residual plot (Residuals vs Fitted)
#### which=2 gives the qqplot (theoretical quantiles vs standardized residuals)
#### which=3 gives the scale-location plot (fitted values vs sqrt(|standardized residuals|))
#### which=4 gives Cook's distance (Cook's distance vs observation number)
#### which=5 gives the residuals vs leverage plot (leverage vs standardized residuals)
#### which=6 gives the leverage plot (leverage vs Cook's distance)

### Call all the default figures at once
plot(model5) # gives you which=1,2,3,5

### Call only one figure
plot(model5, which=1)

(2) Check assumptions of linear models via tests

library(lmtest)
### Ramsey's RESET test for omitted variables (same as ovtest in Stata)
resettest(model5)
### Breusch-Pagan test for heteroskedasticity (same as hettest in Stata)
bptest(model5)
> 
>   RESET test
> 
> data:  model5
> RESET = 7.6105, df1 = 2, df2 = 1838, p-value = 0.000511
> 
>   studentized Breusch-Pagan test
> 
> data:  model5
> BP = 47.202, df = 8, p-value = 1.403e-07
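
The Breusch-Pagan test above suggests heteroskedasticity. One common remedy is heteroskedasticity-consistent (robust) standard errors; here is a sketch using the sandwich package, which is not otherwise used in this workshop:

library(sandwich)
library(lmtest)
### Coefficient table recomputed with HC1 robust standard errors
coeftest(model5, vcov = vcovHC(model5, type = "HC1"))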

(3) Compute those values yourself

### Draw two figures in a window
layout(matrix(c(1, 2), 1, 2, byrow = T))

### standardized residuals vs leverage
#### Plot with the built-in method
plot(model5, which=5)
#### Plot it yourself
rsd = rstandard(model5)
lev = hatvalues(model5)
plot(lev, rsd, cex.main = 0.9,
     main = "Standardized Residuals vs Leverage",
     xlab = "Leverage", ylab = "Standardized residuals")
abline(h=0, col="red")

(4) Plot the residuals against specific variables

### Draw three figures in a window
layout(matrix(c(1, 2, 3), 1, 3, byrow = T))
### Calculate residuals
res = residuals(model5)
### Use the model frame so each variable lines up with the residuals
### (rows with missing values were dropped when model5 was fitted)
mf = model.frame(model5)
plot(mf$paeduc, res,
     xlab = 'Parents Education', ylab = "Residuals", cex.lab = 0.95,
     main = 'Residuals Plot', cex.main = 0.95)
abline(h=0, col = "red")
plot(mf$polviews, res,
     xlab = 'Polviews', ylab = "Residuals", cex.lab = 0.95,
     main = 'Residuals Plot', cex.main = 0.95)
abline(h=0, col = "red")
plot(mf$age, res,
     xlab = 'Age', ylab = "Residuals", cex.lab = 0.95,
     main = 'Residuals Plot', cex.main = 0.95)
abline(h=0, col = "red")

(5) Check the partial effect of each variable

library(car)
avPlots(model5, id.n = 3)

(6) Check outliers

### Generate Cook's distance values
cooks = cooks.distance(model5)
### Cook's distance plot
plot(model5, which = 4)
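
There is no single agreed cutoff for Cook’s distance; one common rule of thumb flags observations above 4/n:

### Indices of observations exceeding the 4/n rule of thumb
which(cooks > 4 / length(cooks))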

(7) Measure the influential points

### Generate DfBETA values for each observation
model5_dfbeta = as.data.frame(dfbeta(model5))
summary(model5_dfbeta)
### Generate DFBETAs, DFFITS, Cook's distance, and hat values
influence.measures(model5)$infmat[1:10,]
>   (Intercept)             paeduc                age            
>  Min.   :-1.1105185   Min.   :-9.756e-02   Min.   :-1.582e-02  
>  1st Qu.:-0.0805633   1st Qu.:-2.628e-03   1st Qu.:-8.816e-04  
>  Median :-0.0034487   Median : 2.488e-04   Median : 1.977e-05  
>  Mean   : 0.0000085   Mean   :-1.160e-06   Mean   :-3.700e-08  
>  3rd Qu.: 0.0617093   3rd Qu.: 2.964e-03   3rd Qu.: 9.736e-04  
>  Max.   : 1.5020763   Max.   : 4.610e-02   Max.   : 1.693e-02  
>  polviewsextremely liberal polviewsextrmly conservative
>  Min.   :-2.250266         Min.   :-2.023347           
>  1st Qu.:-0.005195         1st Qu.:-0.005636           
>  Median :-0.000091         Median : 0.000340           
>  Mean   : 0.000012         Mean   : 0.000006           
>  3rd Qu.: 0.005563         3rd Qu.: 0.007158           
>  Max.   : 3.608127         Max.   : 3.376175           
>  polviewsliberal      polviewsmoderate     polviewsslghtly conservative
>  Min.   :-0.8384594   Min.   :-0.6976161   Min.   :-0.7319911          
>  1st Qu.:-0.0065209   1st Qu.:-0.0255094   1st Qu.:-0.0078125          
>  Median :-0.0001345   Median : 0.0000451   Median :-0.0001453          
>  Mean   : 0.0000045   Mean   : 0.0000043   Mean   : 0.0000060          
>  3rd Qu.: 0.0067487   3rd Qu.: 0.0123677   3rd Qu.: 0.0085105          
>  Max.   : 0.8753166   Max.   : 0.4408732   Max.   : 0.8608520          
>  polviewsslightly liberal
>  Min.   :-0.913665       
>  1st Qu.:-0.004405       
>  Median :-0.000012       
>  Mean   : 0.000007       
>  3rd Qu.: 0.004901       
>  Max.   : 1.230383
>           dfb.1_      dfb.padc       dfb.age   dfb.plvwsxl   dfb.plvwsxc
> 1   0.0010266213  0.0041951619 -0.0058055646 -4.285246e-04  0.0004798882
> 2   0.0183474957 -0.0066922431 -0.0014549069 -1.354441e-02 -0.0133184716
> 3   0.0019382614 -0.0107578262 -0.0207309674  1.468939e-02  0.0147004648
> 4  -0.0013178430  0.0008398094  0.0015641359  2.896797e-05 -0.0000295543
> 5  -0.0036323380  0.0005374319  0.0059923657  2.286285e-04 -0.0002489021
> 7   0.0106637772 -0.0189229650 -0.0011853669  7.808168e-04 -0.0009033801
> 8  -0.0003882401  0.0004697845  0.0002504529 -1.008164e-05  0.0000122433
> 9   0.0046614009 -0.0067746096 -0.0019342603  2.159883e-04 -0.0002538480
> 10 -0.0126846809  0.0194331680  0.0043195350 -6.712959e-04  0.0637079265
> 11  0.0705087927 -0.1195048320 -0.0131476169  4.692818e-03 -0.0054442756
>      dfb.plvwslb    dfb.plvwsm   dfb.plvwssc   dfb.plvwssl       dffit
> 1  -6.084049e-04 -1.627313e-04  1.329870e-02 -4.586724e-04  0.02115599
> 2  -1.991889e-02 -2.663812e-02 -2.110831e-02 -1.941710e-02  0.03263347
> 3   2.240287e-02  2.794892e-02  2.339744e-02  2.092489e-02 -0.04079024
> 4  -4.509726e-06  7.848171e-05 -9.348809e-03  4.174512e-05 -0.01291641
> 5   2.118396e-04  2.535512e-04  3.075398e-04 -1.289507e-02 -0.01785830
> 7   1.566778e-03 -1.061676e-02  1.604404e-03  7.279276e-04 -0.02708655
> 8  -1.382358e-02  1.846227e-05 -2.673349e-05 -7.243484e-06 -0.01805686
> 9   4.959682e-04 -1.979071e-04  4.849183e-04  9.818209e-02  0.12469447
> 10 -1.475568e-03  5.176407e-04 -1.463829e-03 -5.955747e-04  0.07181244
> 11  9.651168e-03 -2.636760e-03  9.796846e-03  1.454730e-01  0.21800431
>        cov.r       cook.d         hat
> 1  1.0089110 4.975498e-05 0.004458081
> 2  1.0066258 1.183706e-04 0.003286386
> 3  1.0073143 1.849331e-04 0.004280059
> 4  1.0084356 1.854669e-05 0.003719386
> 5  1.0105757 3.545365e-05 0.005874361
> 7  1.0068488 8.155398e-05 0.003087354
> 8  1.0090771 3.624605e-05 0.004487440
> 9  0.9954559 1.725766e-03 0.005170014
> 10 1.0206703 5.732210e-04 0.016891454
> 11 0.9824403 5.265794e-03 0.007616536

Premarital Sex

  1. dataset
  2. script

Getting Started II

(1) Set up your environment

setwd('~/r-workshops/premarital_sex/')

(2) Load your dataset

### Use haven to load foreign dataset
library(haven)
premarsex = data.frame(read_dta('premarsx.dta'))
### Preview the first 7 rows of the first 8 columns
head(premarsex[,1:8], 7)
>   year id wrkstat prestige prestg80 marital childs age
> 1 1972  1       1       50       NA       5      0  23
> 2 1972  2       5       45       NA       1      5  70
> 3 1972  3       2       44       NA       1      4  48
> 4 1972  4       1       57       NA       1      0  27
> 5 1972  5       7       40       NA       1      2  61
> 6 1972  6       1       49       NA       5      0  26
> 7 1972  7       1       41       NA       3      2  28

If you click the object in the Environment pane, you will see the variable descriptions created by Stata.
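
You can also read those labels from the console: haven stores each Stata label as a "label" attribute on the column. A quick sketch, assuming the attributes survived the data.frame() conversion:

### List the Stata variable labels, one per column
sapply(premarsex, attr, which = "label")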

Data Summary II

(1) Cross Table

To decide what kind of analysis to use, first look at the frequency table of the response:

library(gmodels)
with(premarsex, CrossTable(premarsx, missing.include=T))
> 
>  
>    Cell Contents
> |-------------------------|
> |                       N |
> |         N / Table Total |
> |-------------------------|
> 
>  
> Total Observations in Table:  23270 
> 
>  
>           |         1 |         2 |         3 |         4 | 
>           |-----------|-----------|-----------|-----------|
>           |      6693 |      2373 |      5112 |      9092 | 
>           |     0.288 |     0.102 |     0.220 |     0.391 | 
>           |-----------|-----------|-----------|-----------|
> 
> 
> 
> 

(2) Generate the binary variable notwrng

The response has 4 categories:

  • 1 as Always Wrong
  • 2 as Almost Always Wrong
  • 3 as Sometimes Wrong
  • 4 as Not Wrong At All

Since we only care about whether people think it is not wrong at all, generate the binary variable notwrng:

premarsex$notwrng <- ifelse(premarsex$premarsx==4, 1, 0)
with(premarsex, CrossTable(notwrng, missing.include=T))
> 
>  
>    Cell Contents
> |-------------------------|
> |                       N |
> |         N / Table Total |
> |-------------------------|
> 
>  
> Total Observations in Table:  23270 
> 
>  
>           |         0 |         1 | 
>           |-----------|-----------|
>           |     14178 |      9092 | 
>           |     0.609 |     0.391 | 
>           |-----------|-----------|
> 
> 
> 
> 

Modeling II

[Model 1]: Controls on age, educational attainment, year questioned, and gender

LRmodel1 = glm(notwrng ~ age + educ + year2 + factor(female), 
                data = premarsex, family = 'binomial')
summary(LRmodel1)
> 
> Call:
> glm(formula = notwrng ~ age + educ + year2 + factor(female), 
>     family = "binomial", data = premarsex)
> 
> Deviance Residuals: 
>     Min       1Q   Median       3Q      Max  
> -1.5726  -1.0020  -0.6823   1.1841   2.2487  
> 
> Coefficients:
>                   Estimate Std. Error z value Pr(>|z|)    
> (Intercept)      0.5563722  0.0804612   6.915 4.69e-12 ***
> age             -0.0314628  0.0008941 -35.189  < 2e-16 ***
> educ             0.0224030  0.0047907   4.676 2.92e-06 ***
> year2            0.0238412  0.0017719  13.455  < 2e-16 ***
> factor(female)1 -0.4626203  0.0282273 -16.389  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> (Dispersion parameter for binomial family taken to be 1)
> 
>     Null deviance: 31138  on 23269  degrees of freedom
> Residual deviance: 29089  on 23265  degrees of freedom
> AIC: 29099
> 
> Number of Fisher Scoring iterations: 4
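
The logistic coefficients are on the log-odds scale. To read them as odds ratios with profile-likelihood confidence intervals, a quick sketch:

### Odds ratios and 95% confidence intervals for Model 1
exp(cbind(OR = coef(LRmodel1), confint(LRmodel1)))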

[Model 2]: Add variables for the region of the country a person is from and their religious views

LRmodel2 = glm(notwrng ~ age + educ + year2 + factor(female) + 
                 factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
                 factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest), 
                 data = premarsex, family = 'binomial')
summary(LRmodel2)
> 
> Call:
> glm(formula = notwrng ~ age + educ + year2 + factor(female) + 
>     factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) + 
>     factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest), 
>     family = "binomial", data = premarsex)
> 
> Deviance Residuals: 
>     Min       1Q   Median       3Q      Max  
> -2.0921  -0.9387  -0.6181   1.1072   2.4437  
> 
> Coefficients:
>                    Estimate Std. Error z value Pr(>|z|)    
> (Intercept)       0.4720287  0.0948079   4.979 6.40e-07 ***
> age              -0.0329677  0.0009382 -35.138  < 2e-16 ***
> educ             -0.0053257  0.0050361  -1.058    0.290    
> year2             0.0288602  0.0018573  15.539  < 2e-16 ***
> factor(female)1  -0.4444161  0.0292687 -15.184  < 2e-16 ***
> factor(protest)1  0.8444414  0.0409875  20.602  < 2e-16 ***
> factor(cathlic)1  0.7186661  0.0438504  16.389  < 2e-16 ***
> factor(jewish)1   1.7256928  0.1013053  17.035  < 2e-16 ***
> factor(othrel)1   0.0456812  0.0756039   0.604    0.546    
> factor(nonaff)1   1.5542900  0.0601495  25.840  < 2e-16 ***
> factor(nrthest)1 -0.0310514  0.0466282  -0.666    0.505    
> factor(south)1   -0.4471349  0.0436484 -10.244  < 2e-16 ***
> factor(midwest)1 -0.3423956  0.0445090  -7.693 1.44e-14 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> (Dispersion parameter for binomial family taken to be 1)
> 
>     Null deviance: 31138  on 23269  degrees of freedom
> Residual deviance: 27703  on 23257  degrees of freedom
> AIC: 27729
> 
> Number of Fisher Scoring iterations: 4

Likelihood Ratio Test

library(lmtest)
lrtest(LRmodel1, LRmodel2)
> Likelihood ratio test
> 
> Model 1: notwrng ~ age + educ + year2 + factor(female)
> Model 2: notwrng ~ age + educ + year2 + factor(female) + factor(protest) + 
>     factor(cathlic) + factor(jewish) + factor(othrel) + factor(nonaff) + 
>     factor(nrthest) + factor(south) + factor(midwest)
>   #Df LogLik Df  Chisq Pr(>Chisq)    
> 1   5 -14544                         
> 2  13 -13852  8 1385.4  < 2.2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Model 3]: Change Model 1 to an Ordinal Logistic Regression

library(MASS)
### Hess = T would store the observed information matrix; with Hess = F, summary() re-fits the model to obtain it
OLRmodel1 = polr(factor(premarsx) ~ age + educ + year2 + factor(female), 
                 data = premarsex, Hess = F)
summary(OLRmodel1)

### add p-values to the result
OLRcoef1 = coef(summary(OLRmodel1))
OLRpvalue1 = pnorm(abs(OLRcoef1[,"t value"]), lower.tail = F)*2
OLRtable1 = cbind(OLRcoef1, "p value" = OLRpvalue1)
head(OLRtable1, 8)
> Call:
> polr(formula = factor(premarsx) ~ age + educ + year2 + factor(female), 
>     data = premarsex, Hess = F)
> 
> Coefficients:
>                    Value Std. Error t value
> age             -0.03251  0.0007489  -43.41
> educ             0.04499  0.0041112   10.94
> year2            0.02015  0.0015361   13.12
> factor(female)1 -0.44643  0.0248512  -17.96
> 
> Intercepts:
>     Value    Std. Error t value 
> 1|2  -1.8604   0.0708   -26.2754
> 2|3  -1.3468   0.0703   -19.1461
> 3|4  -0.3508   0.0697    -5.0313
> 
> Residual Deviance: 57016.64 
> AIC: 57030.64
>                       Value   Std. Error    t value       p value
> age             -0.03250831 0.0007488739 -43.409589  0.000000e+00
> educ             0.04498928 0.0041112020  10.943097  7.170798e-28
> year2            0.02014909 0.0015361109  13.116948  2.633294e-39
> factor(female)1 -0.44642724 0.0248511571 -17.964042  3.726544e-72
> 1|2             -1.86039139 0.0708034308 -26.275441 3.660707e-152
> 2|3             -1.34681703 0.0703440990 -19.146127  1.042591e-81
> 3|4             -0.35078461 0.0697201122  -5.031326  4.870992e-07
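
To interpret the ordinal coefficients as proportional odds ratios, exponentiate them (the intercepts are cutpoints and are usually left on the original scale):

### Proportional odds ratios for Model 3
exp(coef(OLRmodel1))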

[Model 4]: Change Model 2 to an Ordinal Logistic Regression

OLRmodel2 = polr(factor(premarsx) ~ age + educ + year2 + factor(female) + 
                 factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
                 factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest), 
                 data = premarsex, Hess = F)
summary(OLRmodel2)
### add p-values to the result
OLRcoef2 = coef(summary(OLRmodel2))
OLRpvalue2 = pnorm(abs(OLRcoef2[,"t value"]), lower.tail = F)*2
OLRtable2 = cbind(OLRcoef2, "p value" = OLRpvalue2)
head(OLRtable2, 8)
> 
> Re-fitting to get Hessian
> Call:
> polr(formula = factor(premarsx) ~ age + educ + year2 + factor(female) + 
>     factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) + 
>     factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest), 
>     data = premarsex, Hess = F)
> 
> Coefficients:
>                     Value Std. Error  t value
> age              -0.03458  0.0007806 -44.3017
> educ              0.01562  0.0042840   3.6466
> year2             0.02647  0.0015974  16.5695
> factor(female)1  -0.42769  0.0254648 -16.7955
> factor(protest)1  0.96310  0.0340391  28.2938
> factor(cathlic)1  0.82156  0.0369296  22.2466
> factor(jewish)1   1.83844  0.0945234  19.4496
> factor(othrel)1   0.06251  0.0639205   0.9779
> factor(nonaff)1   1.68810  0.0559806  30.1552
> factor(nrthest)1  0.05027  0.0416053   1.2082
> factor(south)1   -0.48779  0.0382020 -12.7687
> factor(midwest)1 -0.26456  0.0389446  -6.7932
> 
> Intercepts:
>     Value    Std. Error t value 
> 1|2  -1.7874   0.0824   -21.6903
> 2|3  -1.2340   0.0820   -15.0440
> 3|4  -0.1647   0.0815    -2.0202
> 
> Residual Deviance: 54808.23 
> AIC: 54838.23
> 
> Re-fitting to get Hessian
>                        Value   Std. Error     t value       p value
> age              -0.03458079 0.0007805756 -44.3016548  0.000000e+00
> educ              0.01562177 0.0042839617   3.6465706  2.657635e-04
> year2             0.02646767 0.0015973736  16.5694953  1.157936e-61
> factor(female)1  -0.42769333 0.0254647793 -16.7954855  2.633410e-63
> factor(protest)1  0.96309621 0.0340391002  28.2938211 4.117067e-176
> factor(cathlic)1  0.82155729 0.0369296059  22.2465762 1.217390e-109
> factor(jewish)1   1.83844099 0.0945234362  19.4495785  2.938366e-84
> factor(othrel)1   0.06251059 0.0639205048   0.9779427  3.281027e-01

Likelihood Ratio Test

lrtest(OLRmodel1, OLRmodel2)
> Likelihood ratio test
> 
> Model 1: factor(premarsx) ~ age + educ + year2 + factor(female)
> Model 2: factor(premarsx) ~ age + educ + year2 + factor(female) + factor(protest) + 
>     factor(cathlic) + factor(jewish) + factor(othrel) + factor(nonaff) + 
>     factor(nrthest) + factor(south) + factor(midwest)
>   #Df LogLik Df  Chisq Pr(>Chisq)    
> 1   7 -28508                         
> 2  15 -27404  8 2208.4  < 2.2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Model 5]: Change Model 1 to a Multinomial Logistic Regression

library(nnet)
multiLR.model1 = multinom(factor(premarsx) ~ age + educ + year2 + factor(female), 
                 data = premarsex)
summary(multiLR.model1)
> # weights:  24 (15 variable)
> initial  value 32259.069783 
> iter  10 value 29828.540914
> iter  20 value 28410.695407
> final  value 28410.674463 
> converged
> Call:
> multinom(formula = factor(premarsx) ~ age + educ + year2 + factor(female), 
>     data = premarsex)
> 
> Coefficients:
>   (Intercept)          age       educ         year2 factor(female)1
> 2  -1.2154310 -0.007814912 0.05348480 -0.0003481443     -0.09083107
> 3   0.1684207 -0.028960138 0.08634704  0.0046113159     -0.31608858
> 4   1.4825286 -0.043220397 0.06265965  0.0255472772     -0.60942875
> 
> Std. Errors:
>   (Intercept)         age        educ       year2 factor(female)1
> 2  0.13676126 0.001407993 0.007824120 0.003031485      0.04957698
> 3  0.10937126 0.001162272 0.006427832 0.002412840      0.03913690
> 4  0.09777465 0.001067420 0.005766825 0.002171470      0.03500544
> 
> Residual Deviance: 56821.35 
> AIC: 56851.35
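
Like polr(), multinom() does not report p-values. Mirroring the earlier approach, a sketch that computes two-sided Wald z tests from the coefficient and standard-error matrices:

### Wald z statistics and two-sided p-values for Model 5
multiLR.z1 = summary(multiLR.model1)$coefficients / summary(multiLR.model1)$standard.errors
multiLR.p1 = pnorm(abs(multiLR.z1), lower.tail = F) * 2
multiLR.p1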

[Model 6]: Change Model 2 to a Multinomial Logistic Regression

multiLR.model2 = multinom(factor(premarsx) ~ age + educ + year2 + factor(female) + 
                 factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
                 factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest), 
                 data = premarsex)
summary(multiLR.model2)
> # weights:  56 (39 variable)
> initial  value 32259.069783 
> iter  10 value 29760.418294
> iter  20 value 28222.446583
> iter  30 value 27563.078483
> iter  40 value 27261.915974
> final  value 27209.172574 
> converged
> Call:
> multinom(formula = factor(premarsx) ~ age + educ + year2 + factor(female) + 
>     factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) + 
>     factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest), 
>     data = premarsex)
> 
> Coefficients:
>   (Intercept)         age       educ       year2 factor(female)1
> 2 -1.16116465 -0.01085412 0.03832584 0.006280151     -0.09764752
> 3 -0.01386741 -0.03324876 0.06063173 0.014103499     -0.32107159
> 4  1.39521840 -0.04812038 0.02408638 0.035888244     -0.60020817
>   factor(protest)1 factor(cathlic)1 factor(jewish)1 factor(othrel)1
> 2        0.6427966        0.5556867       0.6623168     -0.13253473
> 3        1.0001523        0.8755890       1.8137528      0.08100695
> 4        1.3245970        1.1332842       2.6758401      0.01450491
>   factor(nonaff)1 factor(nrthest)1 factor(south)1 factor(midwest)1
> 2       0.7760257       0.01045983     -0.4037643     -0.054510290
> 3       1.6591778       0.30758067     -0.3812802      0.002979545
> 4       2.4677059       0.12228837     -0.6658029     -0.345907455
> 
> Std. Errors:
>   (Intercept)         age        educ       year2 factor(female)1
> 2   0.1570881 0.001448807 0.008090780 0.003112966      0.05005954
> 3   0.1281241 0.001221235 0.006707647 0.002520227      0.04019498
> 4   0.1169014 0.001153154 0.006159537 0.002331667      0.03691887
>   factor(protest)1 factor(cathlic)1 factor(jewish)1 factor(othrel)1
> 2       0.06173107       0.06911996       0.2484929      0.11992390
> 3       0.05205842       0.05713776       0.1770756      0.09378696
> 4       0.04833292       0.05276240       0.1638681      0.08696960
>   factor(nonaff)1 factor(nrthest)1 factor(south)1 factor(midwest)1
> 2      0.14472709       0.08490956     0.07389685       0.07556783
> 3      0.10805990       0.06823495     0.06080813       0.06250816
> 4      0.09902944       0.06215391     0.05469907       0.05700853
> 
> Residual Deviance: 54418.35 
> AIC: 54496.35

Likelihood Ratio Test

lrtest(multiLR.model1, multiLR.model2)
> Likelihood ratio test
> 
> Model 1: factor(premarsx) ~ age + educ + year2 + factor(female)
> Model 2: factor(premarsx) ~ age + educ + year2 + factor(female) + factor(protest) + 
>     factor(cathlic) + factor(jewish) + factor(othrel) + factor(nonaff) + 
>     factor(nrthest) + factor(south) + factor(midwest)
>   #Df LogLik Df Chisq Pr(>Chisq)    
> 1  15 -28411                        
> 2  39 -27209 24  2403  < 2.2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Data Visualization

Set up your own background format

If you don’t want to use ggplot2, you can reproduce a similar look with base graphics.

Set your background as a function

ggbg = function(){
  ### A huge grey point fills the plot region to act as the background
  points(0, 0, pch=16, cex=1e6, col="grey90")
  ### White grid lines drawn on top of the grey background
  grid(col="white", lty=1)}

Specify the background

cherry = read.csv("cherry.csv", header = T)
plot(cherry$diameter, cherry$volume, panel.first = ggbg(),
     xlab="diameter", ylab="volume", pch = 19, col = 'gray50')
abline(lm(cherry$volume~cherry$diameter), col = "Steelblue", lwd = 1.5)

Use ggplot2

Boxplot

books = read.csv("books.csv", header = T)
books$Shelf = factor(books$Shelf)
library(ggplot2)
fill = "#4271AE"; line = "#1F3552"
boxp = 
  ggplot(books, aes(x = Shelf, y = ReplaceValue)) +
  geom_boxplot(fill = fill, colour = line, alpha = 0.7) +
  scale_x_discrete(name = "Shelf Number") +
  scale_y_continuous(name = "Replacement costs of books") +
  ggtitle("Boxplot of mean Replacement Costs by Shelf") +
  theme(plot.title = element_text(hjust = 0.5))
boxp

Scatter plot

statepop = read.csv("statepop.csv", header = T)
poptotal = 255077536
psi = statepop$popn / poptotal
library(ggplot2)
p1 = 
  ggplot(statepop, aes(x=popn/poptotal, y=numfarm)) + 
  geom_point(colour = "#4271AE") +
  geom_smooth(method=lm, colour = "#1F3552") +
  scale_y_continuous(name = "Number of Farms") +
  scale_x_continuous(name = "Probability of selection") +
  ggtitle("Number of Farms vs Selecting Probability") +
  theme(plot.title = element_text(hjust = 0.5))
p1
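
To save any of these ggplot objects to a file, ggsave() works on the stored plot (the filename and size here are just examples):

### Write the scatter plot to a 6 x 4 inch PNG
ggsave("farms_vs_probability.png", plot = p1, width = 6, height = 4)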