Students often work with panel data in their theses. Mitchell Petersen’s 2009 RFS paper is an invaluable resource for applied work, and his website provides excellent guidance for Stata and SPSS users; for R users, it points to Wayne Chang’s fantastic webpage.
Sometimes, students use a Fama-MacBeth (1973) regression. In the first step, one cross-sectional regression is run per period; in the second step, the time-series averages and standard errors of the estimated coefficients and intercepts are computed.
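As a minimal base-R sketch of these two steps on simulated data (all names and the data-generating process here are illustrative, not part of Petersen’s data set):

```r
# Two-step Fama-MacBeth on a simulated panel (base R only).
set.seed(1)
n_years <- 10
n_firms <- 50
sim <- data.frame(year = rep(1:n_years, each = n_firms),
                  x    = rnorm(n_years * n_firms))
sim$y <- 1 + sim$x + rnorm(nrow(sim))

# Step 1: one cross-sectional regression per year, keep the slopes.
slopes <- sapply(split(sim, sim$year),
                 function(d) coef(lm(y ~ x, data = d))[["x"]])

# Step 2: time-series average and standard error of the slopes.
fm_coef <- mean(slopes)
fm_se   <- sd(slopes) / sqrt(n_years)
round(c(coef = fm_coef, se = fm_se), 4)
```

The same averaging applies to the intercepts; the standard error is simply the time-series standard deviation of the per-period estimates divided by the square root of the number of periods.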
A common way is to use the pmg command from the plm package. Note the index order c("year", "firmid"): with the time index first, pmg runs one regression per year and averages the coefficients across years, which is exactly the Fama-MacBeth estimator. This indeed yields the desired result with Mitchell Petersen’s test data set (see Data for Testing Standard Error Estimation Programs on his website).
library(foreign)
df_petersen <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
library(plm)
summary(pmg(y ~ x, data = df_petersen, index = c("year", "firmid")))
## Mean Groups model
##
## Call:
## pmg(formula = y ~ x, data = df_petersen, index = c("year", "firmid"))
##
## Balanced Panel: n = 10, T = 500, N = 5000
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -6.70005900 -1.36768322 -0.01673249 1.32802212 8.76554780
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 0.031278 0.023356 1.3392 0.1805
## x 1.035586 0.033342 31.0599 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Total Sum of Squares: 25368
## Residual Sum of Squares: 20024
## Multiple R-squared: 0.21067
There is an easy and fast way to do the same with data.table, an extension of data.frame that is highly recommended. We first load the package and create a data.table.
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:plm':
##
## between
dt_petersen <- as.data.table(df_petersen)
The data set contains 500 firms observed in 10 different years.
dt_petersen
## firmid year x y
## 1: 1 1 -1.113972664 2.2515347
## 2: 1 2 -0.080853760 1.2423458
## 3: 1 3 -0.237607241 -1.4263762
## 4: 1 4 -0.152485684 -1.1093940
## 5: 1 5 -0.001426160 0.9146864
## ---
## 4996: 500 6 -0.077056959 3.7205019
## 4997: 500 7 0.218846902 0.5591205
## 4998: 500 8 -0.155530021 -3.7667847
## 4999: 500 9 -0.040172249 0.9033538
## 5000: 500 10 -0.001171455 -0.5297611
There are no missing values.
any(is.na(dt_petersen))
## [1] FALSE
We then define a function that returns a list with the desired statistics from a single linear regression.
get_info_from_first_stage_regression <- function(my_y, my_x) {
  model <- lm(my_y ~ my_x)
  return(list(reg_coef = model$coefficients['my_x'],
              intercept = model$coefficients['(Intercept)'],
              r_squared = summary(model)$r.squared))
}
We now run ten regressions, one per year, and store the results in a new data.table.
first_stage <- dt_petersen[, get_info_from_first_stage_regression(y, x), by = year]
first_stage
## year reg_coef intercept r_squared
## 1: 1 0.9983268 0.142618359 0.1965707
## 2: 2 1.0765835 0.132873422 0.2107832
## 3: 3 1.0908976 -0.007500073 0.2279037
## 4: 4 1.1543529 -0.061696570 0.2471550
## 5: 5 1.0908093 0.082588085 0.2440966
## 6: 6 0.8342679 0.023133011 0.1559411
## 7: 7 1.0889767 -0.054278636 0.2266773
## 8: 8 0.9830493 -0.015115750 0.1871137
## 9: 9 0.8966287 -0.009813824 0.1617335
## 10: 10 1.1419682 0.079971630 0.2281170
Finally, we compute the averages and the (unadjusted) standard errors. Note that the coefficient, intercept, and their standard errors match the pmg output above.
first_stage[, .(mean_reg_coef = mean(reg_coef),
se_reg_coef = sd(reg_coef) / sqrt(nrow(first_stage)),
mean_intercept = mean(intercept),
se_reg_intercept = sd(intercept) / sqrt(nrow(first_stage)),
mean_r_squared = mean(r_squared))
]
## mean_reg_coef se_reg_coef mean_intercept se_reg_intercept mean_r_squared
## 1: 1.035586 0.03334159 0.03127797 0.02335649 0.2086092
Some simple benchmarking:
start_time <- Sys.time()
summary(pmg(y ~ x, data = df_petersen, index = c("year", "firmid")))
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.0431776 secs
Now with data.table and a single function call:
start_time <- Sys.time()
first_stage <- dt_petersen[, get_info_from_first_stage_regression(y, x), by = year]
first_stage[, .(mean_reg_coef = mean(reg_coef),
se_reg_coef = sd(reg_coef) / sqrt(nrow(first_stage)),
mean_intercept = mean(intercept),
se_reg_intercept = sd(intercept) / sqrt(nrow(first_stage)),
mean_r_squared = mean(r_squared))
]
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.01808119 secs
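As an aside, base R’s system.time() avoids the manual Sys.time() bookkeeping and additionally reports user and system CPU time. The snippet below times an arbitrary toy expression just to show the pattern:

```r
# system.time() returns user, system, and elapsed times for any expression.
timing <- system.time({
  for (i in 1:200) {
    d <- data.frame(x = rnorm(100), y = rnorm(100))
    lm(y ~ x, data = d)
  }
})
timing[["elapsed"]]
```

For more careful comparisons, repeating each run several times smooths out one-off timing noise.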
If we increase the amount of data from 5,000 to 5 million observations:
dt_petersen_new <-
dt_petersen[, .(firmid = 1:5000000,
year,
x = x + runif(n = 5000000, min = -0.1, max = 0.1),
y = y + runif(n = 5000000, min = -0.1, max = 0.1))
]
dt_petersen_new
## firmid year x y
## 1: 1 1 -1.0809877790 2.3084572
## 2: 2 2 -0.0702489842 1.1471552
## 3: 3 3 -0.2343303021 -1.3555752
## 4: 4 4 -0.0967945960 -1.0649213
## 5: 5 5 0.0541833037 0.9259824
## ---
## 4999996: 4999996 6 -0.0008292205 3.6925623
## 4999997: 4999997 7 0.2654742213 0.5327368
## 4999998: 4999998 8 -0.0847743160 -3.7414310
## 4999999: 4999999 9 0.0041157940 1.0001923
## 5000000: 5000000 10 0.0886689224 -0.5676410
pmg:
start_time <- Sys.time()
summary(pmg(y ~ x, data = dt_petersen_new, index = c("year", "firmid")))
end_time <- Sys.time()
end_time - start_time
## Time difference of 1.648223 mins
data.table:
start_time <- Sys.time()
first_stage <- dt_petersen_new[, get_info_from_first_stage_regression(y, x), by = year]
first_stage[, .(mean_reg_coef = mean(reg_coef),
se_reg_coef = sd(reg_coef) / sqrt(nrow(first_stage)),
mean_intercept = mean(intercept),
se_reg_intercept = sd(intercept) / sqrt(nrow(first_stage)),
mean_r_squared = mean(r_squared)
)]
end_time <- Sys.time()
end_time - start_time
## Time difference of 3.063729 secs
Literature:
Fama, Eugene F., and MacBeth, James D. (1973). Risk, Return, and Equilibrium: Empirical Tests. Journal of Political Economy. 81 (3): 607–636.
Petersen, Mitchell (2009). Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches. Review of Financial Studies. 22 (1): 435–480.