How to do clustering for panel data model in R
Clustering standard errors in R
An introduction of clustering in panel data models
In my last post, I briefly introduced standard error clustering in panel data settings. In this post, I will continue the topic and present how to do the clustering in
Before we move to the coding part, I’d like to clarify several things. First, the panel model I focus on in this post refers to the fixed effect panel data model, which is distinct from the random effect model. These two models differ from each other in terms of the assumption of the unobserved individual effects. The random effect model assumes the individual effects are unrelated to the independent variables, whereas the fixed effect model allows the presence of arbitrary correlations. You are encouraged to go over the textbook by Wooldridge (2010) if you want to learn more.
In practice, the fixed effect model is much more privileged than the random effect model. Because we suspect there exist unobserved individual effects that are correlated with the covariates and we use the fixed effect model to control for the unobservables to alleviate omitted variable bias.
Back to our topic, clustering in the fixed effect model is straightforward. We can either cluster at
\(i\) (individual level) or
\(t\) (time level). The former case is robust to heterogeneity and serial correlation within the individual, whereas the latter one allows correlations across individuals yet is unable to take account of the serial correlations.
You may ask how to determine which level to cluster at? The rule of thumb is that in panels where you have a large number of individuals and a short period, you should cluster at the individual level. Analogously, if you have a few individuals that were observed during a long time slot, you should cluster at the time level.
Alternatively, you can also “cluster” at both individual and time levels (also known as two-way clustering). See here for more details.
Clustering in the random effect model is slightly different than that in the fixed effect counterpart. With a few more assumptions (again refer to Wooldridge (2010) for more details), we can use GLS to estimate standard errors. In such cases, we gain efficiency (because we have a smaller number of unknowns to estimate in the variance and covariance matrix), however, the payoff is the standard errors are less robust because we imposed additional assumptions in the first place. Thus, in some circumstances, regular clustering is also recommended in random effect model settings.
Clustering in R
Before we perform clustering, we need to run the panel data model first. You can either use the
lm function or the
plm function from the
plm package. I personally prefer the latter over the former. Thus, in this post, I am going to stick with the
Importing the data
I am going to run a panel data model to examine the effects of weather conditions on crop yields. The data can be downloaded from here.
library(tidyverse) library(plm) library(lmtest) library(clubSandwich) library(sandwich) df <- read_csv('dataforclustering.csv', col_types = cols(new_code = col_character())) head(df)
## # A tibble: 6 × 5 ## new_code Year wheat tavg pc ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 130402 1987 1.64 8.15 103. ## 2 130402 1988 1.62 8.29 177. ## 3 130402 1989 5.37 9.17 95.4 ## 4 130402 1999 4.75 10.3 41.4 ## 5 130402 2000 4.75 9.29 88.9 ## 6 130402 2001 4.75 8.83 151.
As you can see,
new_coderepresents the counties (individuals).
Yearindicates the time index.
wheatrefers to winter wheat yield in tons/hectare.
tavgstands for the average temperature over the growing season (October to May).
pcis the accumulation of precipitation over the growing season.
Running the fixed effect model
# Note I only included the county fixed effect in the model. reg <- df %>% pdata.frame(index = c('new_code','Year')) %>% # specify the inidivual and year index plm(formula = log(wheat) ~ tavg + I(tavg^2) + pc + I(pc^2), data = ., effect = 'individual', model = 'within') summary(reg)
## Oneway (individual) effect Within Model ## ## Call: ## plm(formula = log(wheat) ~ tavg + I(tavg^2) + pc + I(pc^2), data = ., ## effect = "individual", model = "within") ## ## Unbalanced Panel: n = 352, T = 10-34, N = 8867 ## ## Residuals: ## Min. 1st Qu. Median 3rd Qu. Max. ## -7.158354 -0.132313 0.081255 0.255566 1.144955 ## ## Coefficients: ## Estimate Std. Error t-value Pr(>|t|) ## tavg 4.4441e-01 5.0411e-02 8.8156 < 2.2e-16 *** ## I(tavg^2) -8.5789e-03 2.6044e-03 -3.2940 0.0009917 *** ## pc 1.8988e-03 2.0575e-04 9.2285 < 2.2e-16 *** ## I(pc^2) -2.6568e-06 3.3927e-07 -7.8309 5.425e-15 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Total Sum of Squares: 2107.6 ## Residual Sum of Squares: 1726.4 ## R-Squared: 0.18087 ## Adj. R-Squared: 0.1467 ## F-statistic: 469.818 on 4 and 8511 DF, p-value: < 2.22e-16
As can be been, with raw (no clustering) standard errors, we got statistically significant coefficients on temperature and precipitation. We then move forward to clustering.
Clustering the standard erros
There are three functions available to do the clustering.
- You can use the
vcovHCfunction in the
plmpackage to construct the variance-covariance matrix.
- Alternatively, you can use the
vcovHCfunction from the
sandwichpackage to do the task.
vcovCRfunction from the
clubSandwichpackage to do the task.
# The first approach: vcovHC from the plm package # 'arellano' is the recommanded method as it does not # impose restrictions on the structure of the variance-covariance matrix. # cluster = 'group' indicates we cluster at county level # type = 'HC3' is for small-sample adjustment. method1 <- coeftest(reg, vcov = function(x) plm::vcovHC(x, method = 'arellano', cluster = 'group', type = 'HC3')) # The second approach: vcovHC from the sandwich package # You may have noticed that there does not exist a parameter for us # to specify the clustering level. method2 <- coeftest(reg, vcov = function(x) sandwich::vcovHC(x, type = 'HC3')) # The third approach: vcovHC from the clubSandwich package # type = 'CR2' is recommanded for small-sample adjustment. # cluster = df$new_code indicates we cluster at the individual level. method3 <- coeftest(reg, vcov = function(x) clubSandwich::vcovCR(x, type = 'CR2', cluster = df$new_code)) # Compare the estimated standard errors comse <- bind_cols(method1[,2], method2[,2], method3[,2]) %>% rename('plm' = '...1', 'sandwich' = '...2', 'clubsandwich' = '...3') %>% print(.)
## # A tibble: 4 × 3 ## plm sandwich clubsandwich ## <dbl> <dbl> <dbl> ## 1 0.0721 0.0721 0.0724 ## 2 0.00360 0.00360 0.00361 ## 3 0.000233 0.000233 0.000234 ## 4 0.000000375 0.000000375 0.000000377
As we can see,
sandwich gave us identical clustered standard errors, whereas
clubsanwich returned slightly larger standard errors. Note that in the analysis above, we clustered at the county (individual) level.
Things are different if we clustered at the year (time) level. Simply change the
cluster = 'group' to
cluster = 'time' in method one, and adjust
cluster = df$new_code to
cluster = df$Year in method three. See below. The
vcovHC function from the
sandwich package does not allow us to choose which level to cluster the standard errors at.
method1_time <- coeftest(reg, vcov = function(x) plm::vcovHC(x, method = 'arellano', cluster = 'time', type = 'HC3')) method3_time <- coeftest(reg, vcov = function(x) clubSandwich::vcovCR(x, type = 'CR2', cluster = df$Year)) comse_time <- bind_cols(method1_time[,2], method2[,2], method3_time[,2]) %>% rename('plm' = '...1', 'sandwich' = '...2', 'clubsandwich' = '...3') %>% print(.)
## # A tibble: 4 × 3 ## plm sandwich clubsandwich ## <dbl> <dbl> <dbl> ## 1 0.189 0.0721 0.208 ## 2 0.00865 0.00360 0.00996 ## 3 0.00114 0.000233 0.00124 ## 4 0.00000138 0.000000375 0.00000150
In applications where you cluster standard errors at the individual level, all three methods should work just fine. However, if you want to cluster at the time level (or other alternative levels), you may refer to the embedded
vcovHC function in the
plm package or the
vcovCR function from the
Well, actually I would say it’s better to just stick with the
clubSandwichapproaches regardless of what level your standard errors were clustered at. The
sandwich package is sort of particularly designed for clustering in
Jeffrey M Wooldridge, 2010. “Econometric Analysis of Cross Section and Panel Data,” MIT Press Books, The MIT Press, edition 2.