Further Implementation
Packages
Quick Review: a cheat sheet for base R
Installation
### Install only once
install.packages("package_name")
### Call the package everytime you want to use it
library(package_name)
Remember, you don’t have to install all the packages before you work on R. There are too many packages (and it’s getting more and more) in R, just install it before you really need them. You will sometimes have more than one option to do the analysis. For example, itsmr
, timeSeries
, forecast
, Quantmod
are all packages for doing time series in R.
Here is a quick list of useful R packages.
For Loading Datasets
haven
, Loading Foreign Datasets
- [Package Source]
read_dta('filename.dta')
:
Import dataset in.dta
from STATAread_sas('filename.sas')
:
Import dataset in.sas
from SASread_spss('filename.spss')
:
Import dataset in.spss
from SPSS
For Modeling
car
, ANOVA functions for type II and type III ANOVA tables
- [Package Source]
avPlot(model, ...)
Added variable plots, plots for linear and generalized linear models.ncvTest(model, ...)
Score Test for Non-Constant Error Variance
lme4
, Linear mixed effects models
- [Package Source]
lmer(formula, data, ...)
:
Fit Linear Mixed-Effects Models
gmodels
, various tools for model fitting
- [Package Source]
CrossTable(x,y, ...)
:
Cross Tabulation with Tests for Factor Independence
nnet
, for multinomial log-linear models
- [Package Source]
multinom(formula, data, ...)
:
Fit multinomial log-linear models via neural networks
For testing
lmtest
,
- [Package Source]
waldtest(model1, model2)
:
Wald test of nested modelsresettest(formula, ...)
:
Ramsey’s RESET test for omitting variables, same asovtest
in STATAbptest(formula, ...)
:
Breusch-Pagan test for heteroskedasticity, same ashettest
in STATAlrtest(model1, model2, ...)
: <br Likelihood ratio tests for nested models
moments
- [Package Source]
skewness(x, na.rm = TRUE)
:
skewness of given dataagostino.test(x, alternative = c("two.sided", "less", "greater"))
:
D’Agostino test for skewness in normally distributed datakurtosis(x, na.rm = TRUE)
:
Pearson’s measure of kurtosisanscombe.test(x, alternative = c("two.sided", "less", "greater"))
:
Anscombe-Glynn test of kurtosis for normal samples
olsrr
- [Package Source]
ols_plot_dfbetas(model)
ols_plot_dffits(model)
For Graphing
ggplot2
, R’s the most famous package for making beautiful graphics
Data Analysis
Education Investigation
Getting Started I
(1) Setup your environment
setwd('~/r-workshops/education_investigation/')
(2) Load your dataset
### Specify the NA strings
education = read.csv('education2000.csv', header = T, na.strings = "")
### Preview First 7 rows of first 8 columns
head(education[,1:8], 7)
> id age educ paeduc sex race income region
> 1 1 26 16 16 male white <NA> w. sou. central
> 2 2 48 15 9 female white $8000 to 9999 w. sou. central
> 3 3 67 13 12 female white $15000 - 19999 w. sou. central
> 4 4 39 14 12 female white $25000 or more w. sou. central
> 5 5 25 14 13 female white $25000 or more w. sou. central
> 6 6 25 12 NA female white <NA> w. sou. central
> 7 7 36 14 18 male white $25000 or more w. sou. central
Data Summary I
(1) Extract rows with miss is 0 (Drop missing values)
### A new dataframe only containing variable "miss" is 0
education_miss0 = education[education$miss==0, ]
### Preview First 7 rows of rist 8 columns
head(education_miss0[,1:8], 7)
> id age educ paeduc sex race income region
> 1 1 26 16 16 male white <NA> w. sou. central
> 2 2 48 15 9 female white $8000 to 9999 w. sou. central
> 3 3 67 13 12 female white $15000 - 19999 w. sou. central
> 4 4 39 14 12 female white $25000 or more w. sou. central
> 5 5 25 14 13 female white $25000 or more w. sou. central
> 7 7 36 14 18 male white $25000 or more w. sou. central
> 8 8 44 14 12 female white $20000 - 24999 w. sou. central
(2) Read the distribution of educ
### Draw three figures in a window
layout(matrix(c(1, 2, 3), 1, 3, byrow = T))
### qqplot: check normality
qqnorm(education_miss0$educ, col = "SteelBlue")
qqline(education_miss0$educ, col = "red")
### boxplot:
boxplot(education_miss0$educ, col = "SteelBlue")
title('Boxplot of education years')
mtext('Years', side = 2, line = 2.1, cex = 0.7)
### histogram:
hist(education_miss0$educ, col = "SteelBlue", breaks = 10, freq = F, main = "", xlab = "")
title('Histogram of education years')
mtext('Education Years', side = 1, line = 2.1, cex = 0.7)
box()
(3) Test the skewness and kurtosis of educ
library(moments)
### get skewness and test it
cat('Skeness: ', skewness(education_miss0$educ))
agostino.test(education_miss0$educ)
### get kurtosis and test it
cat('Kurtosis: ', kurtosis(education_miss0$educ))
anscombe.test(education_miss0$educ)
> Skeness: -0.1026498
>
> D'Agostino skewness test
>
> data: education_miss0$educ
> skew = -0.10265, z = -1.84620, p-value = 0.06487
> alternative hypothesis: data have a skewness
> Kurtosis: 3.749285
>
> Anscombe-Glynn kurtosis test
>
> data: education_miss0$educ
> kurt = 3.7493, z = 5.0874, p-value = 3.631e-07
> alternative hypothesis: kurtosis is not equal to 3
(4) Generate a new variable educsq instead of dropping missing values
### create a new column in original dataset
education$educsq = education$educ^2
### Read the distribution of new variable
### Draw two figures in a window
layout(matrix(c(1, 2), 1, 2, byrow = T))
### qqplot: check normality
qqnorm(education$educsq, col = "SteelBlue")
qqline(education$educsq, col = "red")
### histogram:
hist(education_miss0$educ, col = "SteelBlue", breaks = 10, freq = F, main = "", xlab = "")
title('Histogram of (education years)^2')
mtext('(Education Years)^2', side = 1, line = 2.1, cex = 0.7)
box()
Modeling I
[Model 1]: Linear regression using educ as dependent variable
model1 = lm(educ ~ female + age + west + midwest + nrthest, data = education_miss0)
summary(model1)
anova(model1)
>
> Call:
> lm(formula = educ ~ female + age + west + midwest + nrthest,
> data = education_miss0)
>
> Residuals:
> Min 1Q Median 3Q Max
> -11.318 -1.707 -0.387 2.013 7.014
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 14.465140 0.210479 68.725 < 2e-16 ***
> femaleYes -0.179782 0.125302 -1.435 0.15151
> age -0.017404 0.003773 -4.613 4.24e-06 ***
> westYes 0.522318 0.170497 3.064 0.00222 **
> midwestYes 0.041180 0.163345 0.252 0.80099
> nrthestYes 0.180938 0.178900 1.011 0.31195
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 2.737 on 1930 degrees of freedom
> Multiple R-squared: 0.01782, Adjusted R-squared: 0.01528
> F-statistic: 7.003 on 5 and 1930 DF, p-value: 1.704e-06
> Analysis of Variance Table
>
> Response: educ
> Df Sum Sq Mean Sq F value Pr(>F)
> female 1 17.5 17.529 2.3401 0.126244
> age 1 165.5 165.498 22.0938 2.78e-06 ***
> west 1 71.4 71.439 9.5370 0.002042 **
> midwest 1 0.2 0.167 0.0223 0.881297
> nrthest 1 7.7 7.662 1.0229 0.311954
> Residuals 1930 14457.1 7.491
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(1) Extract specific values from the results
The results usually have objects more than what it prints out. Use str()
to check what you can extract from the results, for example:
str(summary(model1))
Then you will be able to call the results by following options:
### residuals for observation 1 to 10
summary(model1)$residuals[1:10]
# you can also just do this:
residuals(model1)[1:10]
### R squared and adjusted R squared of this model
cat('R squared: ', summary(model1)$r.squared)
cat('adj. R squared: ', summary(model1)$adj.r.squared)
### coefficients of this model
summary(model1)$coefficients
> 1 2 3 4 5 7
> 1.9873517 1.5500108 -0.1193224 0.3933791 0.1497298 0.1613868
> 8 9 10 11
> 0.4803967 4.3006150 2.3528256 4.6370284
> 1 2 3 4 5 7
> 1.9873517 1.5500108 -0.1193224 0.3933791 0.1497298 0.1613868
> 8 9 10 11
> 0.4803967 4.3006150 2.3528256 4.6370284
> R squared: 0.01781975
> adj. R squared: 0.01527524
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 14.46513984 0.210479181 68.7248011 0.000000e+00
> femaleYes -0.17978169 0.125301592 -1.4347918 1.515085e-01
> age -0.01740352 0.003772953 -4.6127053 4.235625e-06
> westYes 0.52231768 0.170496770 3.0635048 2.217851e-03
> midwestYes 0.04117982 0.163345107 0.2521032 8.009882e-01
> nrthestYes 0.18093847 0.178899819 1.0113955 3.119539e-01
(2) Extract the variables information if they are significant
### coefficients table
coef_table1 = summary(model1)$coefficients
### extract significant variables
coef_table1[coef_table1[,"Pr(>|t|)"]<0.05, ]
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 14.46513984 0.210479181 68.724801 0.000000e+00
> age -0.01740352 0.003772953 -4.612705 4.235625e-06
> westYes 0.52231768 0.170496770 3.063505 2.217851e-03
[Model 2]: Linear regression adding variable race
model2 = lm(educ ~ female + age + west + midwest + nrthest + black + other,
data = education_miss0)
summary(model2)
anova(model2)
>
> Call:
> lm(formula = educ ~ female + age + west + midwest + nrthest +
> black + other, data = education_miss0)
>
> Residuals:
> Min 1Q Median 3Q Max
> -11.4228 -1.7524 -0.2311 1.9397 7.1373
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 14.65782 0.21358 68.630 < 2e-16 ***
> femaleYes -0.13426 0.12472 -1.076 0.2818
> age -0.01813 0.00377 -4.809 1.64e-06 ***
> westYes 0.40908 0.17177 2.382 0.0173 *
> midwestYes -0.04601 0.16303 -0.282 0.7778
> nrthestYes 0.09574 0.17858 0.536 0.5919
> blackYes -1.08581 0.20164 -5.385 8.13e-08 ***
> otherYes -0.09373 0.28918 -0.324 0.7459
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 2.718 on 1928 degrees of freedom
> Multiple R-squared: 0.03237, Adjusted R-squared: 0.02886
> F-statistic: 9.215 on 7 and 1928 DF, p-value: 2.963e-11
> Analysis of Variance Table
>
> Response: educ
> Df Sum Sq Mean Sq F value Pr(>F)
> female 1 17.5 17.529 2.3729 0.1236
> age 1 165.5 165.498 22.4029 2.371e-06 ***
> west 1 71.4 71.439 9.6704 0.0019 **
> midwest 1 0.2 0.167 0.0226 0.8805
> nrthest 1 7.7 7.662 1.0372 0.3086
> black 1 213.5 213.463 28.8958 8.565e-08 ***
> other 1 0.8 0.776 0.1051 0.7459
> Residuals 1928 14242.8 7.387
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Test Model Improvement
Although the summary of the model tells us the siginificance of the variables to the dependent variable and also the significance of the model, it does not tell us the improvement of the model 2 to model 1. Use Wald test to see if it improves just by waldtest()
:
library(lmtest)
waldtest(model1, model2)
> Wald test
>
> Model 1: educ ~ female + age + west + midwest + nrthest
> Model 2: educ ~ female + age + west + midwest + nrthest + black + other
> Res.Df Df F Pr(>F)
> 1 1930
> 2 1928 2 14.5 5.616e-07 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Model 3]: Linear regression adding variable paeduc
model3 = lm(educ ~ female + age + west + midwest + nrthest + black + other + paeduc,
data = education_miss0)
summary(model3)
anova(model3)
>
> Call:
> lm(formula = educ ~ female + age + west + midwest + nrthest +
> black + other + paeduc, data = education_miss0)
>
> Residuals:
> Min 1Q Median 3Q Max
> -11.0649 -1.6561 -0.2253 1.4983 8.8581
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 10.490089 0.303418 34.573 < 2e-16 ***
> femaleYes -0.149875 0.115345 -1.299 0.193976
> age 0.007976 0.003773 2.114 0.034650 *
> westYes 0.149304 0.159494 0.936 0.349334
> midwestYes -0.277784 0.151313 -1.836 0.066537 .
> nrthestYes 0.022660 0.165197 0.137 0.890911
> blackYes -0.670528 0.187877 -3.569 0.000367 ***
> otherYes 0.277297 0.268208 1.034 0.301319
> paeduc 0.271949 0.015029 18.095 < 2e-16 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 2.514 on 1927 degrees of freedom
> Multiple R-squared: 0.1729, Adjusted R-squared: 0.1695
> F-statistic: 50.36 on 8 and 1927 DF, p-value: < 2.2e-16
> Analysis of Variance Table
>
> Response: educ
> Df Sum Sq Mean Sq F value Pr(>F)
> female 1 17.5 17.53 2.7746 0.095934 .
> age 1 165.5 165.50 26.1957 3.392e-07 ***
> west 1 71.4 71.44 11.3076 0.000787 ***
> midwest 1 0.2 0.17 0.0264 0.870836
> nrthest 1 7.7 7.66 1.2128 0.270909
> black 1 213.5 213.46 33.7879 7.176e-09 ***
> other 1 0.8 0.78 0.1228 0.726003
> paeduc 1 2068.5 2068.52 327.4143 < 2.2e-16 ***
> Residuals 1927 12174.3 6.32
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Contingency Table
To examine if there is interaction effect between being female and being african american, read the contigency table between two variables:
xtabs(~female+black, data = education_miss0)
> black
> female No Yes
> No 799 77
> Yes 927 133
[Model 4]: Linear regression adding interaction variable
### Adding asterisk(*) between two variables gives you the individual and interaction variables
model4 = lm(educ ~ female*black + age + west + midwest + nrthest + other + paeduc,
data = education_miss0)
summary(model4)
anova(model4)
>
> Call:
> lm(formula = educ ~ female * black + age + west + midwest + nrthest +
> other + paeduc, data = education_miss0)
>
> Residuals:
> Min 1Q Median 3Q Max
> -11.0572 -1.6506 -0.2164 1.5079 8.8441
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 10.500774 0.303996 34.542 < 2e-16 ***
> femaleYes -0.173247 0.121826 -1.422 0.15516
> blackYes -0.812406 0.302953 -2.682 0.00739 **
> age 0.008045 0.003776 2.131 0.03324 *
> westYes 0.148294 0.159530 0.930 0.35271
> midwestYes -0.281842 0.151490 -1.860 0.06297 .
> nrthestYes 0.023178 0.165227 0.140 0.88846
> otherYes 0.278341 0.268259 1.038 0.29959
> paeduc 0.271941 0.015032 18.091 < 2e-16 ***
> femaleYes:blackYes 0.227250 0.380628 0.597 0.55055
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 2.514 on 1926 degrees of freedom
> Multiple R-squared: 0.1731, Adjusted R-squared: 0.1692
> F-statistic: 44.78 on 9 and 1926 DF, p-value: < 2.2e-16
> Analysis of Variance Table
>
> Response: educ
> Df Sum Sq Mean Sq F value Pr(>F)
> female 1 17.5 17.53 2.7737 0.095990 .
> black 1 228.0 227.95 36.0691 2.271e-09 ***
> age 1 175.0 175.04 27.6966 1.577e-07 ***
> west 1 51.1 51.14 8.0920 0.004493 **
> midwest 1 2.1 2.08 0.3296 0.565960
> nrthest 1 2.0 2.02 0.3190 0.572273
> other 1 0.8 0.78 0.1228 0.726047
> paeduc 1 2068.5 2068.52 327.3049 < 2.2e-16 ***
> female:black 1 2.3 2.25 0.3565 0.550551
> Residuals 1926 12172.0 6.32
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Model 5]: Linear regression using educsq as dependent variable
model5 = lm(educsq ~ paeduc + age + polviews, data = education)
summary(model5)
anova(model5)
>
> Call:
> lm(formula = educsq ~ paeduc + age + polviews, data = education)
>
> Residuals:
> Min 1Q Median 3Q Max
> -187.58 -46.40 -13.63 39.05 251.68
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 108.8219 8.8806 12.254 < 2e-16 ***
> paeduc 6.8701 0.4251 16.160 < 2e-16 ***
> age 0.3113 0.1063 2.928 0.00345 **
> polviewsextremely liberal 19.8902 9.0227 2.204 0.02762 *
> polviewsextrmly conservative -15.1011 9.5278 -1.585 0.11315
> polviewsliberal 9.6799 6.0849 1.591 0.11183
> polviewsmoderate -13.9357 4.6977 -2.967 0.00305 **
> polviewsslghtly conservative 7.2535 5.7511 1.261 0.20738
> polviewsslightly liberal 6.1740 6.3344 0.975 0.32985
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 69.44 on 1840 degrees of freedom
> (968 observations deleted due to missingness)
> Multiple R-squared: 0.161, Adjusted R-squared: 0.1574
> F-statistic: 44.14 on 8 and 1840 DF, p-value: < 2.2e-16
> Analysis of Variance Table
>
> Response: educsq
> Df Sum Sq Mean Sq F value Pr(>F)
> paeduc 1 1449137 1449137 300.5248 < 2.2e-16 ***
> age 1 43437 43437 9.0081 0.002724 **
> polviews 6 210140 35023 7.2632 1.105e-07 ***
> Residuals 1840 8872521 4822
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Diagnostics I
(1) Check assumptions of linear models via graphs
Assume that model 5 is our selected model, then there are several assumptions we need to check before using it as final model. R can easily generate useful graphs by plot()
command:
### Draw four figures in a window
layout(matrix(c(1, 2, 3, 4), 2, 2, byrow = T))
#### which=1 gives residual plot (Residuals vs Fitted)
#### which=2 gives qqplot (theoretical quantiles vs standardized residuals)
#### which=3 gives scale-location plot (fitted values vs sqrt(standardized residuals))
#### which=4 gives cook's distance (cooks.distance vs observation number)
#### which=5 gives residual plot (Residuals vs Standardized residuals)
#### which=6 gives leverage plot (leverage vs cooks distance)
### Call all the figures once
plot(model5) # gives you which=1,2,3,5
### Call only one figure
plot(model5, which=1)
(2) Check assumptions of linear models via tests
library(lmtest)
### Ramsey's RESET test for omitting variables (same as ovtest in STATA)
resettest(model5)
### Breusch-Pagan test for heteroskedasticity (same as hettest in STATA)
bptest(model5)
>
> RESET test
>
> data: model5
> RESET = 7.6105, df1 = 2, df2 = 1838, p-value = 0.000511
>
> studentized Breusch-Pagan test
>
> data: model5
> BP = 47.202, df = 8, p-value = 1.403e-07
(3) Get those specific values
### Draw two figures in a window
layout(matrix(c(1, 2), 1, 2, byrow = T))
### standardized residuals vs leverage
#### Plot by built-in
plot(model5, which=5)
#### Plot by your own
rsd = rstandard(model5)
lev = hatvalues(model5)
plot(lev, rsd, cex.main = 0.9,
main = "Residuals-squared vs Leverage",
xlab = "Leverage", ylab = "Standardized residuals")
abline(h=0, col="red")
(4) Compare residuals and specific variables
### Draw three figures in a window
layout(matrix(c(1, 2, 3), 1, 3, byrow = T))
### Calculate residuals
res = residuals(model5)
plot(education$paeduc[!is.na(education$paeduc)], res[!is.na(education$paeduc)],
xlab = 'Parents Education', ylab = "Residuals", cex.lab = 0.95,
main = 'Residuals Plot', cex.main = 0.95)
abline(h=0, col = "red")
plot(education$polviews[!is.na(education$polviews)], res[!is.na(education$polviews)],
xlab = 'Polviews', ylab = "Residuals", cex.lab = 0.95,
main = 'Residuals Plot', cex.main = 0.95)
abline(h=0, col = "red")
plot(education$age[!is.na(education$age)], res[!is.na(education$age)],
xlab = 'Age', ylab = "Residuals", cex.lab = 0.95,
main = 'Residuals Plot', cex.main = 0.95)
abline(h=0, col = "red")
(5) Check the side effect of variables
library(car)
avPlots(model5, id.n = 3)
(6) Check outliers
### generate cook's distance values
cooks = cooks.distance(model5)
### cook's distance plot
plot(model5, which = 4)
(7) Measure the influencial points
### Generate DfBETA values for each observation
model5_dfbeta = as.data.frame(dfbeta(model5))
summary(model5_dfbeta)
### Generate DFBeta, DFFIT, cooks.distance, hatbalues
influence.measures(model5)$infmat[1:10,]
> (Intercept) paeduc age
> Min. :-1.1105185 Min. :-9.756e-02 Min. :-1.582e-02
> 1st Qu.:-0.0805633 1st Qu.:-2.628e-03 1st Qu.:-8.816e-04
> Median :-0.0034487 Median : 2.488e-04 Median : 1.977e-05
> Mean : 0.0000085 Mean :-1.160e-06 Mean :-3.700e-08
> 3rd Qu.: 0.0617093 3rd Qu.: 2.964e-03 3rd Qu.: 9.736e-04
> Max. : 1.5020763 Max. : 4.610e-02 Max. : 1.693e-02
> polviewsextremely liberal polviewsextrmly conservative
> Min. :-2.250266 Min. :-2.023347
> 1st Qu.:-0.005195 1st Qu.:-0.005636
> Median :-0.000091 Median : 0.000340
> Mean : 0.000012 Mean : 0.000006
> 3rd Qu.: 0.005563 3rd Qu.: 0.007158
> Max. : 3.608127 Max. : 3.376175
> polviewsliberal polviewsmoderate polviewsslghtly conservative
> Min. :-0.8384594 Min. :-0.6976161 Min. :-0.7319911
> 1st Qu.:-0.0065209 1st Qu.:-0.0255094 1st Qu.:-0.0078125
> Median :-0.0001345 Median : 0.0000451 Median :-0.0001453
> Mean : 0.0000045 Mean : 0.0000043 Mean : 0.0000060
> 3rd Qu.: 0.0067487 3rd Qu.: 0.0123677 3rd Qu.: 0.0085105
> Max. : 0.8753166 Max. : 0.4408732 Max. : 0.8608520
> polviewsslightly liberal
> Min. :-0.913665
> 1st Qu.:-0.004405
> Median :-0.000012
> Mean : 0.000007
> 3rd Qu.: 0.004901
> Max. : 1.230383
> dfb.1_ dfb.padc dfb.age dfb.plvwsxl dfb.plvwsxc
> 1 0.0010266213 0.0041951619 -0.0058055646 -4.285246e-04 0.0004798882
> 2 0.0183474957 -0.0066922431 -0.0014549069 -1.354441e-02 -0.0133184716
> 3 0.0019382614 -0.0107578262 -0.0207309674 1.468939e-02 0.0147004648
> 4 -0.0013178430 0.0008398094 0.0015641359 2.896797e-05 -0.0000295543
> 5 -0.0036323380 0.0005374319 0.0059923657 2.286285e-04 -0.0002489021
> 7 0.0106637772 -0.0189229650 -0.0011853669 7.808168e-04 -0.0009033801
> 8 -0.0003882401 0.0004697845 0.0002504529 -1.008164e-05 0.0000122433
> 9 0.0046614009 -0.0067746096 -0.0019342603 2.159883e-04 -0.0002538480
> 10 -0.0126846809 0.0194331680 0.0043195350 -6.712959e-04 0.0637079265
> 11 0.0705087927 -0.1195048320 -0.0131476169 4.692818e-03 -0.0054442756
> dfb.plvwslb dfb.plvwsm dfb.plvwssc dfb.plvwssl dffit
> 1 -6.084049e-04 -1.627313e-04 1.329870e-02 -4.586724e-04 0.02115599
> 2 -1.991889e-02 -2.663812e-02 -2.110831e-02 -1.941710e-02 0.03263347
> 3 2.240287e-02 2.794892e-02 2.339744e-02 2.092489e-02 -0.04079024
> 4 -4.509726e-06 7.848171e-05 -9.348809e-03 4.174512e-05 -0.01291641
> 5 2.118396e-04 2.535512e-04 3.075398e-04 -1.289507e-02 -0.01785830
> 7 1.566778e-03 -1.061676e-02 1.604404e-03 7.279276e-04 -0.02708655
> 8 -1.382358e-02 1.846227e-05 -2.673349e-05 -7.243484e-06 -0.01805686
> 9 4.959682e-04 -1.979071e-04 4.849183e-04 9.818209e-02 0.12469447
> 10 -1.475568e-03 5.176407e-04 -1.463829e-03 -5.955747e-04 0.07181244
> 11 9.651168e-03 -2.636760e-03 9.796846e-03 1.454730e-01 0.21800431
> cov.r cook.d hat
> 1 1.0089110 4.975498e-05 0.004458081
> 2 1.0066258 1.183706e-04 0.003286386
> 3 1.0073143 1.849331e-04 0.004280059
> 4 1.0084356 1.854669e-05 0.003719386
> 5 1.0105757 3.545365e-05 0.005874361
> 7 1.0068488 8.155398e-05 0.003087354
> 8 1.0090771 3.624605e-05 0.004487440
> 9 0.9954559 1.725766e-03 0.005170014
> 10 1.0206703 5.732210e-04 0.016891454
> 11 0.9824403 5.265794e-03 0.007616536
Premarital Sex
Getting Started II
- Setup your environment
setwd('~/r-workshops/premarital_sex/')
- Load your dataset
### Use haven to load foreign dataset
library(haven)
premarsex = data.frame(read_dta('premarsx.dta'))
### Preview First 7 rows of first 8 columns
head(premarsex[,1:8], 7)
> year id wrkstat prestige prestg80 marital childs age
> 1 1972 1 1 50 NA 5 0 23
> 2 1972 2 5 45 NA 1 5 70
> 3 1972 3 2 44 NA 1 4 48
> 4 1972 4 1 57 NA 1 0 27
> 5 1972 5 7 40 NA 1 2 61
> 6 1972 6 1 49 NA 5 0 26
> 7 1972 7 1 41 NA 3 2 28
If you click the object in the environment, you will see the variable descriptions created by STATA.
Data Summary II
(1) Cross Table
To determine what kind of analysis to use, use the cross table:
library(gmodels)
with(premarsex, CrossTable(premarsx, missing.include=T))
>
>
> Cell Contents
> |-------------------------|
> | N |
> | N / Table Total |
> |-------------------------|
>
>
> Total Observations in Table: 23270
>
>
> | 1 | 2 | 3 | 4 |
> |-----------|-----------|-----------|-----------|
> | 6693 | 2373 | 5112 | 9092 |
> | 0.288 | 0.102 | 0.220 | 0.391 |
> |-----------|-----------|-----------|-----------|
>
>
>
>
(2) Generate binary variable notwrng
4 categories of response:
- 1 as Always Wrong
- 2 as Almost Always Wrong
- 3 as Sometimes Wrong
- 4 as Not Wrong At All
Since we only care about whether people think it is wrong at all, generate the variable notwrng
premarsex$notwrng <- ifelse(premarsex$premarsx==4, 1, 0)
with(premarsex, CrossTable(notwrng, missing.include=T))
>
>
> Cell Contents
> |-------------------------|
> | N |
> | N / Table Total |
> |-------------------------|
>
>
> Total Observations in Table: 23270
>
>
> | 0 | 1 |
> |-----------|-----------|
> | 14178 | 9092 |
> | 0.609 | 0.391 |
> |-----------|-----------|
>
>
>
>
Modeling II
[Model 1]: Controls on age, educational attainment, year questioned, and gender
LRmodel1 = glm(notwrng ~ age + educ + year2 + factor(female),
data = premarsex, family = 'binomial')
summary(LRmodel1)
>
> Call:
> glm(formula = notwrng ~ age + educ + year2 + factor(female),
> family = "binomial", data = premarsex)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -1.5726 -1.0020 -0.6823 1.1841 2.2487
>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 0.5563722 0.0804612 6.915 4.69e-12 ***
> age -0.0314628 0.0008941 -35.189 < 2e-16 ***
> educ 0.0224030 0.0047907 4.676 2.92e-06 ***
> year2 0.0238412 0.0017719 13.455 < 2e-16 ***
> factor(female)1 -0.4626203 0.0282273 -16.389 < 2e-16 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for binomial family taken to be 1)
>
> Null deviance: 31138 on 23269 degrees of freedom
> Residual deviance: 29089 on 23265 degrees of freedom
> AIC: 29099
>
> Number of Fisher Scoring iterations: 4
[Model 2]: Add in variables based on the region of the country a person is from and their religious views
LRmodel2 = glm(notwrng ~ age + educ + year2 + factor(female) +
factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest),
data = premarsex, family = 'binomial')
summary(LRmodel2)
>
> Call:
> glm(formula = notwrng ~ age + educ + year2 + factor(female) +
> factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
> factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest),
> family = "binomial", data = premarsex)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -2.0921 -0.9387 -0.6181 1.1072 2.4437
>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 0.4720287 0.0948079 4.979 6.40e-07 ***
> age -0.0329677 0.0009382 -35.138 < 2e-16 ***
> educ -0.0053257 0.0050361 -1.058 0.290
> year2 0.0288602 0.0018573 15.539 < 2e-16 ***
> factor(female)1 -0.4444161 0.0292687 -15.184 < 2e-16 ***
> factor(protest)1 0.8444414 0.0409875 20.602 < 2e-16 ***
> factor(cathlic)1 0.7186661 0.0438504 16.389 < 2e-16 ***
> factor(jewish)1 1.7256928 0.1013053 17.035 < 2e-16 ***
> factor(othrel)1 0.0456812 0.0756039 0.604 0.546
> factor(nonaff)1 1.5542900 0.0601495 25.840 < 2e-16 ***
> factor(nrthest)1 -0.0310514 0.0466282 -0.666 0.505
> factor(south)1 -0.4471349 0.0436484 -10.244 < 2e-16 ***
> factor(midwest)1 -0.3423956 0.0445090 -7.693 1.44e-14 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for binomial family taken to be 1)
>
> Null deviance: 31138 on 23269 degrees of freedom
> Residual deviance: 27703 on 23257 degrees of freedom
> AIC: 27729
>
> Number of Fisher Scoring iterations: 4
Likelihood Ratio Test
library(lmtest)
lrtest(LRmodel1, LRmodel2)
> Likelihood ratio test
>
> Model 1: notwrng ~ age + educ + year2 + factor(female)
> Model 2: notwrng ~ age + educ + year2 + factor(female) + factor(protest) +
> factor(cathlic) + factor(jewish) + factor(othrel) + factor(nonaff) +
> factor(nrthest) + factor(south) + factor(midwest)
> #Df LogLik Df Chisq Pr(>Chisq)
> 1 5 -14544
> 2 13 -13852 8 1385.4 < 2.2e-16 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Model 3]: Change Model 1 as an Ordinal Logistic Regression
library(MASS)
### Hess=F gives you the observed information matrix
OLRmodel1 = polr(factor(premarsx) ~ age + educ + year2 + factor(female),
data = premarsex, Hess = F)
summary(OLRmodel1)
### add p-values to the result
OLRcoef1 = coef(summary(OLRmodel1))
OLRpvalue1 = pnorm(abs(OLRcoef1[,"t value"]), lower.tail = F)*2
OLRtable1 = cbind(OLRcoef1, "p value" = OLRpvalue1)
head(OLRtable1, 8)
> Call:
> polr(formula = factor(premarsx) ~ age + educ + year2 + factor(female),
> data = premarsex, Hess = F)
>
> Coefficients:
> Value Std. Error t value
> age -0.03251 0.0007489 -43.41
> educ 0.04499 0.0041112 10.94
> year2 0.02015 0.0015361 13.12
> factor(female)1 -0.44643 0.0248512 -17.96
>
> Intercepts:
> Value Std. Error t value
> 1|2 -1.8604 0.0708 -26.2754
> 2|3 -1.3468 0.0703 -19.1461
> 3|4 -0.3508 0.0697 -5.0313
>
> Residual Deviance: 57016.64
> AIC: 57030.64
> Value Std. Error t value p value
> age -0.03250831 0.0007488739 -43.409589 0.000000e+00
> educ 0.04498928 0.0041112020 10.943097 7.170798e-28
> year2 0.02014909 0.0015361109 13.116948 2.633294e-39
> factor(female)1 -0.44642724 0.0248511571 -17.964042 3.726544e-72
> 1|2 -1.86039139 0.0708034308 -26.275441 3.660707e-152
> 2|3 -1.34681703 0.0703440990 -19.146127 1.042591e-81
> 3|4 -0.35078461 0.0697201122 -5.031326 4.870992e-07
[Model 4]: Change Model 2 as an Ordinal Logistic Regression
OLRmodel2 = polr(factor(premarsx) ~ age + educ + year2 + factor(female) +
factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest),
data = premarsex, Hess = F)
summary(OLRmodel2)
### add p-values to the result
OLRcoef2 = coef(summary(OLRmodel2))
OLRpvalue2 = pnorm(abs(OLRcoef2[,"t value"]), lower.tail = F)*2
OLRtable2 = cbind(OLRcoef2, "p value" = OLRpvalue2)
head(OLRtable2, 8)
>
> Re-fitting to get Hessian
> Call:
> polr(formula = factor(premarsx) ~ age + educ + year2 + factor(female) +
> factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
> factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest),
> data = premarsex, Hess = F)
>
> Coefficients:
> Value Std. Error t value
> age -0.03458 0.0007806 -44.3017
> educ 0.01562 0.0042840 3.6466
> year2 0.02647 0.0015974 16.5695
> factor(female)1 -0.42769 0.0254648 -16.7955
> factor(protest)1 0.96310 0.0340391 28.2938
> factor(cathlic)1 0.82156 0.0369296 22.2466
> factor(jewish)1 1.83844 0.0945234 19.4496
> factor(othrel)1 0.06251 0.0639205 0.9779
> factor(nonaff)1 1.68810 0.0559806 30.1552
> factor(nrthest)1 0.05027 0.0416053 1.2082
> factor(south)1 -0.48779 0.0382020 -12.7687
> factor(midwest)1 -0.26456 0.0389446 -6.7932
>
> Intercepts:
> Value Std. Error t value
> 1|2 -1.7874 0.0824 -21.6903
> 2|3 -1.2340 0.0820 -15.0440
> 3|4 -0.1647 0.0815 -2.0202
>
> Residual Deviance: 54808.23
> AIC: 54838.23
>
> Re-fitting to get Hessian
> Value Std. Error t value p value
> age -0.03458079 0.0007805756 -44.3016548 0.000000e+00
> educ 0.01562177 0.0042839617 3.6465706 2.657635e-04
> year2 0.02646767 0.0015973736 16.5694953 1.157936e-61
> factor(female)1 -0.42769333 0.0254647793 -16.7954855 2.633410e-63
> factor(protest)1 0.96309621 0.0340391002 28.2938211 4.117067e-176
> factor(cathlic)1 0.82155729 0.0369296059 22.2465762 1.217390e-109
> factor(jewish)1 1.83844099 0.0945234362 19.4495785 2.938366e-84
> factor(othrel)1 0.06251059 0.0639205048 0.9779427 3.281027e-01
Likelihood Ratio Test
lrtest(OLRmodel1, OLRmodel2)
> Likelihood ratio test
>
> Model 1: factor(premarsx) ~ age + educ + year2 + factor(female)
> Model 2: factor(premarsx) ~ age + educ + year2 + factor(female) + factor(protest) +
> factor(cathlic) + factor(jewish) + factor(othrel) + factor(nonaff) +
> factor(nrthest) + factor(south) + factor(midwest)
> #Df LogLik Df Chisq Pr(>Chisq)
> 1 7 -28508
> 2 15 -27404 8 2208.4 < 2.2e-16 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Model 5]: Change Model 1 to Multinomial Logistic Regression
library(nnet)
multiLR.model1 = multinom(factor(premarsx) ~ age + educ + year2 + factor(female),
data = premarsex)
summary(multiLR.model1)
> # weights: 24 (15 variable)
> initial value 32259.069783
> iter 10 value 29828.540914
> iter 20 value 28410.695407
> final value 28410.674463
> converged
> Call:
> multinom(formula = factor(premarsx) ~ age + educ + year2 + factor(female),
> data = premarsex)
>
> Coefficients:
> (Intercept) age educ year2 factor(female)1
> 2 -1.2154310 -0.007814912 0.05348480 -0.0003481443 -0.09083107
> 3 0.1684207 -0.028960138 0.08634704 0.0046113159 -0.31608858
> 4 1.4825286 -0.043220397 0.06265965 0.0255472772 -0.60942875
>
> Std. Errors:
> (Intercept) age educ year2 factor(female)1
> 2 0.13676126 0.001407993 0.007824120 0.003031485 0.04957698
> 3 0.10937126 0.001162272 0.006427832 0.002412840 0.03913690
> 4 0.09777465 0.001067420 0.005766825 0.002171470 0.03500544
>
> Residual Deviance: 56821.35
> AIC: 56851.35
[Model 6]: Change Model 2 to Multinomial Logistic Regression
multiLR.model2 = multinom(factor(premarsx) ~ age + educ + year2 + factor(female) +
factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest),
data = premarsex)
summary(multiLR.model2)
> # weights: 56 (39 variable)
> initial value 32259.069783
> iter 10 value 29760.418294
> iter 20 value 28222.446583
> iter 30 value 27563.078483
> iter 40 value 27261.915974
> final value 27209.172574
> converged
> Call:
> multinom(formula = factor(premarsx) ~ age + educ + year2 + factor(female) +
> factor(protest) + factor(cathlic) + factor(jewish) + factor(othrel) +
> factor(nonaff) + factor(nrthest) + factor(south) + factor(midwest),
> data = premarsex)
>
> Coefficients:
> (Intercept) age educ year2 factor(female)1
> 2 -1.16116465 -0.01085412 0.03832584 0.006280151 -0.09764752
> 3 -0.01386741 -0.03324876 0.06063173 0.014103499 -0.32107159
> 4 1.39521840 -0.04812038 0.02408638 0.035888244 -0.60020817
> factor(protest)1 factor(cathlic)1 factor(jewish)1 factor(othrel)1
> 2 0.6427966 0.5556867 0.6623168 -0.13253473
> 3 1.0001523 0.8755890 1.8137528 0.08100695
> 4 1.3245970 1.1332842 2.6758401 0.01450491
> factor(nonaff)1 factor(nrthest)1 factor(south)1 factor(midwest)1
> 2 0.7760257 0.01045983 -0.4037643 -0.054510290
> 3 1.6591778 0.30758067 -0.3812802 0.002979545
> 4 2.4677059 0.12228837 -0.6658029 -0.345907455
>
> Std. Errors:
> (Intercept) age educ year2 factor(female)1
> 2 0.1570881 0.001448807 0.008090780 0.003112966 0.05005954
> 3 0.1281241 0.001221235 0.006707647 0.002520227 0.04019498
> 4 0.1169014 0.001153154 0.006159537 0.002331667 0.03691887
> factor(protest)1 factor(cathlic)1 factor(jewish)1 factor(othrel)1
> 2 0.06173107 0.06911996 0.2484929 0.11992390
> 3 0.05205842 0.05713776 0.1770756 0.09378696
> 4 0.04833292 0.05276240 0.1638681 0.08696960
> factor(nonaff)1 factor(nrthest)1 factor(south)1 factor(midwest)1
> 2 0.14472709 0.08490956 0.07389685 0.07556783
> 3 0.10805990 0.06823495 0.06080813 0.06250816
> 4 0.09902944 0.06215391 0.05469907 0.05700853
>
> Residual Deviance: 54418.35
> AIC: 54496.35
Likelihood Ratio Test
lrtest(multiLR.model1, multiLR.model2)
> Likelihood ratio test
>
> Model 1: factor(premarsx) ~ age + educ + year2 + factor(female)
> Model 2: factor(premarsx) ~ age + educ + year2 + factor(female) + factor(protest) +
> factor(cathlic) + factor(jewish) + factor(othrel) + factor(nonaff) +
> factor(nrthest) + factor(south) + factor(midwest)
> #Df LogLik Df Chisq Pr(>Chisq)
> 1 15 -28411
> 2 39 -27209 24 2403 < 2.2e-16 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Data Visualization
Set up your own background format
If you don’t want to use ggplot2
, you can actually generate a similar format as it.
Set your background as a function
ggbg = function(){
points(0, 0, pch=16, cex=1e6, col="grey90")
grid(col="white", lty=1)}
Specify the background
cherry = read.csv("cherry.csv", header = T)
plot(cherry$diameter, cherry$volume, panel.first = ggbg(),
xlab="diameter", ylab="volume", pch = 19, col = 'gray50')
abline(lm(cherry$volume~cherry$diameter), col = "Steelblue", lwd = 1.5)
Use ggplot2
Boxplot
books = read.csv("books.csv", header = T)
books$Shelf = factor(books$Shelf)
library(ggplot2)
fill = "#4271AE"; line = "#1F3552"
boxp =
ggplot(books, aes(x = Shelf, y = ReplaceValue)) +
geom_boxplot(fill = fill, colour = line, alpha = 0.7) +
scale_x_discrete(name = "Shelf Number") +
scale_y_continuous(name = "Replacement costs of books") +
ggtitle("Boxplot of mean Replacement Costs by Shelf") +
theme(plot.title = element_text(hjust = 0.5))
boxp
Scatter plot
statepop = read.csv("statepop.csv", header = T)
poptotal = 255077536
psi = statepop$popn / poptotal
library(ggplot2)
p1 =
ggplot(statepop, aes(x=popn/poptotal, y=numfarm)) +
geom_point(colour = "#4271AE") +
geom_smooth(method=lm, colour = "#1F3552") +
scale_y_continuous(name = "Number of Farms") +
scale_x_continuous(name = "Probability of selection") +
ggtitle("Number of Farms vs Selecting Probability") +
theme(plot.title = element_text(hjust = 0.5))
p1