Fast Fama-MacBeth in R

Introduction

Students often work with panel data in their theses. Mitchell Petersen’s 2009 RFS paper is an invaluable resource for applied work, and his website provides excellent guidance for Stata and SPSS users. For R users, the website points to Wayne Chang’s fantastic webpage.

Sometimes, students use a Fama-MacBeth (1973) regression. The first step is to run one cross-sectional regression every period. Time-series averages and standard errors of coefficients and intercepts are computed in the second step.

A common way is to use the pmg command from the plm package. This indeed yields the desired result using Mitchell Petersen’s test data set (see Data for Testing Standard Error Estimation Programs).

library(foreign)
df_petersen <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

library(plm)
summary(pmg(y ~ x, data = df_petersen, index = c("year", "firmid")))

## Mean Groups model
## 
## Call:
## pmg(formula = y ~ x, data = df_petersen, index = c("year", "firmid"))
## 
## Balanced Panel: n = 10, T = 500, N = 5000
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -6.70005900 -1.36768322 -0.01673249  1.32802212  8.76554780 
## 
## Coefficients:
##             Estimate Std. Error z-value Pr(>|z|)    
## (Intercept) 0.031278   0.023356  1.3392   0.1805    
## x           1.035586   0.033342 31.0599   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Total Sum of Squares: 25368
## Residual Sum of Squares: 20024
## Multiple R-squared: 0.21067

There is an easy and fast way to do it with data.table, an extension of data.frame that is highly recommended. We first load the package and create a data.table.

library(data.table)

## 
## Attaching package: 'data.table'

## The following object is masked from 'package:plm':
## 
##     between

dt_petersen <- as.data.table(df_petersen)

The data has 500 firms observed in 10 different years.

dt_petersen

##       firmid year            x          y
##    1:      1    1 -1.113972664  2.2515347
##    2:      1    2 -0.080853760  1.2423458
##    3:      1    3 -0.237607241 -1.4263762
##    4:      1    4 -0.152485684 -1.1093940
##    5:      1    5 -0.001426160  0.9146864
##   ---                                    
## 4996:    500    6 -0.077056959  3.7205019
## 4997:    500    7  0.218846902  0.5591205
## 4998:    500    8 -0.155530021 -3.7667847
## 4999:    500    9 -0.040172249  0.9033538
## 5000:    500   10 -0.001171455 -0.5297611

There are no missings.

any(is.na(dt_petersen))

## [1] FALSE

We then define a function that returns a list with the desired statistics from a single linear regression.

get_info_from_first_stage_regression <- function(my_y, my_x) {
  model <- lm(my_y ~ my_x)
  return(list(reg_coef = model$coefficients['my_x'],
              intercept = model$coefficients['(Intercept)'],
              r_squared = summary(model)$r.squared)
         )
}

We now run ten regressions, one regression each year, and store the results in a new data.table.

first_stage <- dt_petersen[, get_info_from_first_stage_regression(y, x), by = year]
first_stage

##     year  reg_coef    intercept r_squared
##  1:    1 0.9983268  0.142618359 0.1965707
##  2:    2 1.0765835  0.132873422 0.2107832
##  3:    3 1.0908976 -0.007500073 0.2279037
##  4:    4 1.1543529 -0.061696570 0.2471550
##  5:    5 1.0908093  0.082588085 0.2440966
##  6:    6 0.8342679  0.023133011 0.1559411
##  7:    7 1.0889767 -0.054278636 0.2266773
##  8:    8 0.9830493 -0.015115750 0.1871137
##  9:    9 0.8966287 -0.009813824 0.1617335
## 10:   10 1.1419682  0.079971630 0.2281170

Finally, we compute averages and the (unadjusted) standard errors.

first_stage[, .(mean_reg_coef      = mean(reg_coef), 
                se_reg_coef        = sd(reg_coef) / sqrt(nrow(first_stage)),
                mean_intercept     = mean(intercept), 
                se_reg_intercept   = sd(intercept) / sqrt(nrow(first_stage)),
                mean_r_squared     = mean(r_squared))
            ]

##    mean_reg_coef se_reg_coef mean_intercept se_reg_intercept mean_r_squared
## 1:      1.035586  0.03334159     0.03127797       0.02335649      0.2086092

Some simple benchmarking:

start_time <- Sys.time()
summary(pmg(y ~ x, data = df_petersen, index = c("year", "firmid"))) 
end_time <- Sys.time()

end_time - start_time

## Time difference of 0.0431776 secs

Now with data.table and a single function call:

start_time <- Sys.time()
first_stage <- dt_petersen[, get_info_from_first_stage_regression(y, x), by = year]
first_stage[, .(mean_reg_coef      = mean(reg_coef), 
                se_reg_coef        = sd(reg_coef) / sqrt(nrow(first_stage)),
                mean_intercept     = mean(intercept), 
                se_reg_intercept   = sd(intercept) / sqrt(nrow(first_stage)),
                mean_r_squared     = mean(r_squared))
            ]
end_time <- Sys.time()

end_time - start_time

## Time difference of 0.01808119 secs

If we increase the amount of data from 5000 to 5m observations:

dt_petersen_new <-
  dt_petersen[, .(firmid = 1:5000000, 
                  year, 
                  x = x + runif(n = 5000000, min = -0.1, max = 0.1), 
                  y = y + runif(n = 5000000, min = -0.1, max = 0.1))
              ]
dt_petersen_new

##           firmid year             x          y
##       1:       1    1 -1.0809877790  2.3084572
##       2:       2    2 -0.0702489842  1.1471552
##       3:       3    3 -0.2343303021 -1.3555752
##       4:       4    4 -0.0967945960 -1.0649213
##       5:       5    5  0.0541833037  0.9259824
##      ---                                      
## 4999996: 4999996    6 -0.0008292205  3.6925623
## 4999997: 4999997    7  0.2654742213  0.5327368
## 4999998: 4999998    8 -0.0847743160 -3.7414310
## 4999999: 4999999    9  0.0041157940  1.0001923
## 5000000: 5000000   10  0.0886689224 -0.5676410

pmg:

start_time <- Sys.time()
summary(pmg(y ~ x, data = dt_petersen_new, index = c("year", "firmid"))) 
end_time <- Sys.time()

end_time - start_time

## Time difference of 1.648223 mins

data.table:

start_time <- Sys.time()
first_stage <- dt_petersen_new[, get_info_from_first_stage_regression(y, x), by = year]
first_stage[, .(mean_reg_coef      = mean(reg_coef), 
                se_reg_coef        = sd(reg_coef) / sqrt(nrow(first_stage)),
                mean_intercept     = mean(intercept), 
                se_reg_intercept   = sd(intercept) / sqrt(nrow(first_stage)),
                mean_r_squared     = mean(r_squared)
                )]
end_time <- Sys.time()

end_time - start_time

## Time difference of 3.063729 secs

Literature:

Fama, Eugene F., and MacBeth, James D. (1973). Risk, Return, and Equilibrium: Empirical Tests. Journal of Political Economy. 81 (3): 607–636.
Petersen, Mitchell (2009). Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches. Review of Financial Studies. 22 (1): 435–480.