Introduction

Especially in a Master’s thesis, you will encounter methods that were rarely or never discussed in your courses. A typical response is to find an R package or a Stata command that does the work for you. While there is nothing wrong with delegating computational details to a package, developing an intuition for the methods used in your thesis is still necessary and helpful. Here is the ordered probit model as an example.

Basics

Look for an intuitive explanation that gives you a good start, maybe Simon Jackman’s teaching notes.

I start with simplified code from the Library of Statistical Techniques (LOST).

library(data.table)
library(MASS)
library(ggplot2)

We use data on marital happiness. Fair (1978) uses this data set, which is now part of Rdatasets (documentation).

mar <- fread('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Fair.csv')

We do not attempt to replicate the original analysis, nor are we trying to estimate the most reasonable model for this data set. We use the data and the model below just for educational illustration.

# just keep rate, child, and age
mar <- mar[, .(id, rate, age, child)]
mar
##       id rate age child
##   1:   1    4  37    no
##   2:   2    4  27    no
##   3:   3    4  32   yes
##   4:   4    5  57   yes
##   5:   5    3  22    no
##  ---                   
## 597: 597    5  22   yes
## 598: 598    4  32   yes
## 599: 599    5  32   yes
## 600: 600    2  22   yes
## 601: 601    5  32   yes

Each row belongs to one respondent. The variable rate is the self-rating of marriage, ranging from 1 (very unhappy) to 5 (very happy). The dummy variable child tells you if a person has kids.
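
To get a feel for the data, it helps to cross-tabulate the rating against the child dummy (a quick check added here; run it to see the counts):

# how the marriage ratings are distributed among respondents with and without kids
mar[, table(rate, child)]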

A common choice for estimating ordered probit models in R is the polr function from the MASS package.

# See how age and children predict marital happiness
m <- polr(factor(rate) ~ age + child,
          data = mar, 
          method = 'probit'
)

summary(m)
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = factor(rate) ~ age + child, data = mar, method = "probit")
## 
## Coefficients:
##             Value Std. Error t value
## age      -0.01725   0.005249  -3.287
## childyes -0.39102   0.111463  -3.508
## 
## Intercepts:
##     Value    Std. Error t value 
## 1|2  -2.8407   0.2001   -14.1932
## 2|3  -1.9759   0.1752   -11.2794
## 3|4  -1.4140   0.1696    -8.3384
## 4|5  -0.5443   0.1650    -3.2987
## 
## Residual Deviance: 1596.248 
## AIC: 1608.248
logLik(m)
## 'log Lik.' -798.1242 (df=6)

Sometimes students paste the summary output of polr into their theses without further explanation. But: What are the intercepts here (sometimes called cutoff values or thresholds)? What do the coefficients tell us? Students need to develop a basic understanding of what these statistics mean. I recommend playing around with some data to build up these intuitions, as we do in the following.

Let’s have a look at the predicted probabilities for each observation. We ask R to show us just the first five observations.

m$fitted.values[1:5,]
##             1          2          3         4         5
## 1 0.013818096 0.07667856 0.12846113 0.3184791 0.4625631
## 2 0.008776590 0.05671942 0.10601361 0.2971975 0.5312928
## 3 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 4 0.071270311 0.20242134 0.21047418 0.3125635 0.2032706
## 5 0.006924422 0.04826966 0.09526513 0.2841035 0.5654373

The fitted values tell us how likely a respondent is to choose 1, 2, 3, 4, or 5 on the Likert scale. Note that the probabilities for each respondent sum to one.
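
A quick check (added here, not part of the original output) confirms this:

# each row of fitted probabilities sums to one (up to floating-point error)
rowSums(m$fitted.values[1:5, ])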

The predicted response for each observation is simply the category with the highest predicted probability.

predict(m)[1:5]
## [1] 5 5 5 4 5
## Levels: 1 2 3 4 5
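
You can convince yourself that predict() just picks the column with the largest fitted probability (a small verification added here):

# column index of the largest fitted probability per row; should match predict(m)[1:5]
apply(m$fitted.values[1:5, ], 1, which.max)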

Where do the predicted probabilities come from, and what is the role of the cutoff values?

The key is the construct of a continuous latent variable, usually denoted \(y^*\). It is assumed to be a linear combination of the covariates plus a disturbance. In our example, \(y^* = \beta_1 \times age + \beta_2 \times I(kids) + \epsilon\), where \(I(kids)\) is an indicator variable equal to one if a respondent has children and zero otherwise. The disturbance term \(\epsilon\) is assumed to follow a standard normal distribution.

The value of the latent variable for subject \(i\) (\(y^*_i\)) is unobservable. But there is a link between \(y^*_i\) and the observed category chosen by a respondent, and this is where the cutoff values come into play. We have five ratings, so we need four cutoff values \(\mu_1 < \mu_2 < \mu_3 < \mu_4\) to split the real line into five parts. If \(y^*_i\) is smaller than the smallest cutoff value \(\mu_1\), we see the lowest response. In our example, this refers to someone who rates her or his marriage as “very unhappy”. If \(y^*_i\) lies between the smallest cutoff value \(\mu_1\) and the second smallest cutoff value \(\mu_2\), we see a person who rates her or his marriage as “unhappy”, and so forth.
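
To make this mapping concrete, here is a small simulation sketch (not part of the original analysis): we build a latent value from the estimated coefficients plus a standard normal draw and let the estimated cutoffs slice it into the five categories.

# simulate the data-generating process implied by the fitted model
set.seed(1)
beta <- m$coefficients                        # estimated 'age' and 'childyes' coefficients
y_star <- beta['age'] * mar$age +
  beta['childyes'] * ifelse(mar$child == "yes", 1, 0) +
  rnorm(nrow(mar))                            # standard normal disturbance
# the four estimated cutoffs split the real line into five intervals
simulated_rate <- cut(y_star, breaks = c(-Inf, m$zeta, Inf), labels = 1:5)
table(simulated_rate)                         # compare with table(mar$rate)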

The final step uses the distributional assumption about the disturbance term. Assume we already know the optimal parameters \(\beta_1\) and \(\beta_2\). Assume further that a person has children and is 27 years old (like the respondent in the 20th row of the data set).

unique(mar$child)
## [1] "no"  "yes"
mar[, child_numeric := ifelse(child == "yes", 1, 0)]

# define subject no.
subject_no <- 20
mar[subject_no, age]
## [1] 27
mar[subject_no, child_numeric]
## [1] 1

How likely is it that this person rates her or his marriage as “very unhappy”? We first calculate the deterministic part of the latent variable, \(\hat{y}^*_i = \beta_1 \times age_i + \beta_2 \times I(kids_i)\). Since the disturbance term \(\epsilon\) has zero mean, this is also the predicted value of \(y^*_i\).

# latent variable 
latent_value <- as.numeric(m$coefficients['age'] * mar[subject_no, age] + m$coefficients['childyes'] * mar[subject_no, child_numeric])
latent_value
## [1] -0.8567661

Then, the probability of seeing the rating “very unhappy” is equal to \(Prob[\hat{y}^*_i + \epsilon < \mu_1]\). By assumption, \(\epsilon\) follows a standard normal distribution, so \(Prob[\hat{y}^*_i + \epsilon < \mu_1] = Prob[\epsilon < \mu_1 - \hat{y}^*_i] = \Phi[\mu_1 - \hat{y}^*_i]\), where \(\Phi\) is the cumulative distribution function of the standard normal. The output of polr at the beginning gives us the estimated cutoff values.

# the estimated cutoff values
m$zeta
##        1|2        2|3        3|4        4|5 
## -2.8406574 -1.9759479 -1.4139629 -0.5442665

We need the first cutoff value to calculate \(\Phi[\mu_1 - \hat{y}^*_i]\) for our observation.

m$zeta['1|2']
##       1|2 
## -2.840657
pnorm(m$zeta['1|2'] - latent_value)
##        1|2 
## 0.02363398

We move on and ask how likely it is for our respondent to rate her or his marriage as “unhappy”: \(Prob[\mu_1 < \hat{y}^*_i + \epsilon < \mu_2] = Prob[\mu_1 - \hat{y}^*_i < \epsilon < \mu_2 - \hat{y}^*_i] = \Phi[\mu_2 - \hat{y}^*_i] - \Phi[\mu_1 - \hat{y}^*_i]\).

pnorm(m$zeta['2|3'] - latent_value) - pnorm(m$zeta['1|2'] - latent_value)
##       2|3 
## 0.1078973

Applying the same logic to the other categories, we get:

pnorm(m$zeta['3|4'] - latent_value) - pnorm(m$zeta['2|3'] - latent_value)
##       3|4 
## 0.1571652
pnorm(m$zeta['4|5'] - latent_value) - pnorm(m$zeta['3|4'] - latent_value)
##       4|5 
## 0.3339731
pnorm(latent_value - m$zeta['4|5'])
##       4|5 
## 0.3773304

These values are the same as the fitted.values of the fitted polr object.

# same as
m$fitted.values[subject_no, ]
##          1          2          3          4          5 
## 0.02363398 0.10789732 0.15716519 0.33397307 0.37733044
# prediction
predict(m)[subject_no]
## [1] 5
## Levels: 1 2 3 4 5

The following plot summarizes our example graphically (with some help from here and here).

ggplot(data.frame(x = c(-5, 5)), aes(x = x)) +
  stat_function(fun = function(x) dnorm(x, mean = latent_value)) +
  geom_vline(xintercept=c(m$zeta['1|2'], m$zeta['2|3'], m$zeta['3|4'], m$zeta['4|5']), linetype="dotted")

Estimation

To estimate the model, we are looking for the cutoff values \(\mu_1\) to \(\mu_4\) and the coefficients \(\beta_1\) and \(\beta_2\) that fit the data best. What does “fit the data best” mean?

Excursion: Think of a simple model predicting that every respondent chooses the first option in a binary choice problem with a probability of \(0.6\). We have a sample of two people. One respondent chooses A, and the second chooses B. How likely is it that we observe these data if our model is true? It is \(0.6 \times 0.4 = 0.24\). What about a model that predicts that every person chooses the first option with a probability of \(0.5\)? Under this model, there is a \(0.5 \times 0.5 = 0.25\) chance that we see the data. So the second model seems to fit our data better.
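
More generally, with one A and one B choice, a model that assigns probability \(p\) to option A gives the observed data a likelihood of \(p \times (1 - p)\), which is largest at \(p = 0.5\). A quick check (a small addition to the excursion):

# likelihood of observing one A and one B choice for different values of P(A)
p <- seq(0.1, 0.9, by = 0.1)
data.table(p = p, likelihood = p * (1 - p))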

Maximum likelihood estimation generalizes this intuition, but it usually works with a sum of log-probabilities rather than a product of many probabilities. With many observations, the multiplicative approach is numerically problematic. To see this, note that it is easy to calculate \(0.6 \times 0.4 = 0.24\). But what about \((0.6 \times 0.4)^{100}\) for a data set that is 100 times as large?

(0.6 * 0.4) ^ 100
## [1] 1.049843e-62

Still works. But what if we have 2000 observations, that is, 1000 pairs of A and B choices, under the first model?

(0.6 * 0.4) ^ 1000
## [1] 0

This term evaluates to \(0\). Of course, we know that is wrong: we should see a tiny number that is certainly larger than \(0\). However, the probability has become so small that my machine can no longer represent it. The solution is to maximize the natural log of the probability. \(\ln \left( \prod_{i=1}^N p_i \right)\) has the same maximum as \(\prod_{i=1}^N p_i\), but because \(\ln \left( \prod_{i=1}^N p_i \right) = \sum_{i=1}^N \ln(p_i)\), there is no need to cope with tiny numbers in large samples.
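
To see the trick at work with the numbers from above (a short illustration added here):

# the equivalent sum of logs for the 1000 A/B pairs is a perfectly ordinary number
1000 * (log(0.6) + log(0.4))   # roughly -1427, no underflow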

Let’s apply the same logic to estimating the ordered probit model. We want to choose the model’s parameters so that they maximize the probability of seeing the data. With \(601\) observations, \(5\) categories, and the conventions \(\mu_0 = -\infty\) and \(\mu_5 = \infty\), the log-likelihood is:

\(\ln \mathcal{L} = \sum_{i=1}^{601} \sum_{j=1}^5 Z_{ij} \ln\left(Prob[rate_i = j]\right) = \sum_{i=1}^{601} \sum_{j=1}^5 Z_{ij} \ln\left(\Phi(\mu_j - \hat{y}^*_i) - \Phi(\mu_{j-1} - \hat{y}^*_i)\right)\)

\(Z_{ij}\) is one if respondent \(i\) has chosen category \(j\) and zero otherwise. Note that most of the summands above are zero, simply because \(Z_{ij}\) equals zero in four out of five cases.

We define a few functions to calculate this number for our model and data. There is no need to bother with the details.

# define a function that calculates all predicted probabilities
get_predicted_probabilities <- function(my_zetas, my_coeff) {
  
  # Input: - my_zetas : A named vector of four threshold values, labeled '1|2', '2|3', '3|4', and '4|5'
  #        - my_coeff : A named vector of two beta coefficients, labeled 'age' and 'childyes'
  # Output: A data.table with predicted probabilities for each observation
  #         Predicted probabilities of 1 are replaced with 1.0 - 0.000001
  #         Predicted probabilities of 0 are replaced with 0.000001
  
  return_matrix <- matrix(data = NA, nrow = nrow(mar), ncol = 5)
  latent_values <- my_coeff['age'] * mar[, age] + my_coeff['childyes'] * mar[, child_numeric]
  return_matrix[, 1] <- pnorm(my_zetas['1|2'] - latent_values)
  return_matrix[, 2] <- pnorm(my_zetas['2|3'] - latent_values) - pnorm(my_zetas['1|2'] - latent_values)
  return_matrix[, 3] <- pnorm(my_zetas['3|4'] - latent_values) - pnorm(my_zetas['2|3'] - latent_values)
  return_matrix[, 4] <- pnorm(my_zetas['4|5'] - latent_values) - pnorm(my_zetas['3|4'] - latent_values)
  return_matrix[, 5] <- 1 - pnorm(my_zetas['4|5'] - latent_values)
  colnames(return_matrix) <- c("Prob1", "Prob2", "Prob3", "Prob4", "Prob5")
  return_matrix[return_matrix == 1] <- 1.0 - 0.000001
  return_matrix[return_matrix == 0] <- 0.000001
   
  return(as.data.table(return_matrix)[])
}

get_prediction <- function(predicted_probabilities) {
  
  # Input: - predicted_probabilities : A data.table formatted as the output of get_predicted_probabilities
  # Output: A data.table with predicted choices (as 0/1 indicator columns) for every respondent
  
  my_predicted_choices <- copy(predicted_probabilities)
  my_predicted_choices[, I := 1:.N]
  my_predicted_choices[, row_max := max(Prob1, Prob2, Prob3, Prob4, Prob5), by = I]
  my_predicted_choices[, Pred1 := ifelse(Prob1 == row_max, 1, 0)]
  my_predicted_choices[, Pred2 := ifelse(Prob2 == row_max, 1, 0)]
  my_predicted_choices[, Pred3 := ifelse(Prob3 == row_max, 1, 0)]
  my_predicted_choices[, Pred4 := ifelse(Prob4 == row_max, 1, 0)]
  my_predicted_choices[, Pred5 := ifelse(Prob5 == row_max, 1, 0)]
  my_predicted_choices[, c("Prob1", "Prob2", "Prob3", "Prob4", "Prob5", "I", "row_max") := NULL]
  return(my_predicted_choices[])
  
}

Short note for people interested in the details of the data.table package: If you find the []’s in the return statements weird, see the answer to frequently asked question 2.23 about data.table.

# sanity check
test_probs <- get_predicted_probabilities(my_zetas = m$zeta, my_coeff = m$coefficients)
test_probs
##            Prob1      Prob2      Prob3     Prob4     Prob5
##   1: 0.013818096 0.07667856 0.12846113 0.3184791 0.4625631
##   2: 0.008776590 0.05671942 0.10601361 0.2971975 0.5312928
##   3: 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
##   4: 0.071270311 0.20242134 0.21047418 0.3125635 0.2032706
##   5: 0.006924422 0.04826966 0.09526513 0.2841035 0.5654373
##  ---                                                      
## 597: 0.019219588 0.09479887 0.14594887 0.3295293 0.4105034
## 598: 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 599: 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 600: 0.019219588 0.09479887 0.14594887 0.3295293 0.4105034
## 601: 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
m$fitted.values[1:5, ]
##             1          2          3         4         5
## 1 0.013818096 0.07667856 0.12846113 0.3184791 0.4625631
## 2 0.008776590 0.05671942 0.10601361 0.2971975 0.5312928
## 3 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 4 0.071270311 0.20242134 0.21047418 0.3125635 0.2032706
## 5 0.006924422 0.04826966 0.09526513 0.2841035 0.5654373
m$fitted.values[597:601, ]
##              1          2         3         4         5
## 597 0.01921959 0.09479887 0.1459489 0.3295293 0.4105034
## 598 0.02887163 0.12194612 0.1680214 0.3361218 0.3450391
## 599 0.02887163 0.12194612 0.1680214 0.3361218 0.3450391
## 600 0.01921959 0.09479887 0.1459489 0.3295293 0.4105034
## 601 0.02887163 0.12194612 0.1680214 0.3361218 0.3450391
test_predict <- get_prediction(predicted_probabilities = test_probs)
test_predict  
##      Pred1 Pred2 Pred3 Pred4 Pred5
##   1:     0     0     0     0     1
##   2:     0     0     0     0     1
##   3:     0     0     0     0     1
##   4:     0     0     0     1     0
##   5:     0     0     0     0     1
##  ---                              
## 597:     0     0     0     0     1
## 598:     0     0     0     0     1
## 599:     0     0     0     0     1
## 600:     0     0     0     0     1
## 601:     0     0     0     0     1
predict(m)[1:5]
## [1] 5 5 5 4 5
## Levels: 1 2 3 4 5

Finally, we define the log-likelihood function and store the observed choices in a matrix of zeros and ones.

# estimation: maximum likelihood
ll <- function(params) {
  zetas <- c('1|2' = params[1], '2|3' = params[2], '3|4' = params[3], '4|5' = params[4])
  coeffs <- c('age' = params[5], 'childyes' = params[6])
  my_probs <- get_predicted_probabilities(zetas, coeffs)
  return(sum(my_choices * log(my_probs)))
}

my_choices <- matrix(data = NA, nrow = nrow(mar), ncol = 5)
for (i in 1:5) {
  my_choices[, i] <- ifelse(mar$rate == i, 1, 0)
}
my_choices[1:5, ]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    1    0
## [2,]    0    0    0    1    0
## [3,]    0    0    0    1    0
## [4,]    0    0    0    0    1
## [5,]    0    0    1    0    0
mar$rate[1:5]
## [1] 4 4 4 5 3

We get the same log-likelihood if we plug the estimated parameters from the polr output above into our function ll.

params <- as.vector(c(unlist(m$zeta), unlist(m$coefficients)))
ll(params)
## [1] -798.1242

We get a smaller log-likelihood when we add small random numbers to these parameters.

ll(params + runif(n = 6, min = -0.01, max = 0.01))
## [1] -804.2977

polr found the best fitting parameters for us.

The MASS package is beneficial for several reasons. The code above is only valid for our model and our data; if we change either, we have to change the code. Furthermore, maximizing ll with out-of-the-box algorithms and unwisely chosen starting values may …

negll <- function(params) {return (-1 * ll(params))}
nlm(negll, c(0, 0.1, 0.2, 0.3, 0, 0))
## $minimum
## [1] 798.1242
## 
## $estimate
## [1] -2.84092297 -1.97620273 -1.41421462 -0.54451101 -0.01725857 -0.39096819
## 
## $gradient
## [1] -3.321458e-06  3.394147e-06 -4.421377e-06  0.000000e+00  2.522711e-04
## [6]  1.341505e-05
## 
## $code
## [1] 1
## 
## $iterations
## [1] 62

… or may not work.

nlm(negll, c(0, 0.1, 0.2, 0.3, 0.2, -0.2))
## $minimum
## [1] 8082.074
## 
## $estimate
## [1]  -0.35959241   0.35551324   0.05124773   3.07565150 -91.94363894
## [6]  -2.30315152
## 
## $gradient
## [1] 0 0 0 0 0 0
## 
## $code
## [1] 1
## 
## $iterations
## [1] 1
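
If you do optimize by hand, one simple safeguard is to restart the optimizer from several (ordered) random starting values and keep the best run. A minimal sketch, assuming the negll function and the objects defined above:

# restart nlm from a few random starting points and keep the run with the lowest
# negative log-likelihood; sorting the cutoff starts keeps the intervals well ordered
set.seed(123)
runs <- lapply(1:5, function(i) {
  start <- c(sort(runif(4, min = -3, max = 1)), runif(2, min = -0.5, max = 0.5))
  nlm(negll, start)
})
best <- runs[[which.min(sapply(runs, function(r) r$minimum))]]
best$minimum    # compare with the 798.1242 found by polr
best$estimate   # compare with m$zeta and m$coefficients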

So, we are thankful that a package handles all these details for us. However, the point of this markdown is that we should still aim for a solid intuitive understanding of our methods.

Interpretation of Coefficients

There are several ways to think about the coefficients. Here is what I think is meaningful in our context.

We start by fixing age at the mean.

mean(mar$age)
## [1] 32.48752

The predicted latent value of a person at the mean age with children is …

latent_value_with_kids <- as.numeric(m$coefficients['age'] * mean(mar$age) + m$coefficients['childyes'] * 1)
latent_value_with_kids
## [1] -0.9514251

… and for a person at the mean age without kids, it is …

latent_value_without_kids <- as.numeric(m$coefficients['age'] * mean(mar$age) + m$coefficients['childyes'] * 0)
latent_value_without_kids
## [1] -0.5604054

The predicted probabilities with kids are …

pnorm(m$zeta['1|2'] - latent_value_with_kids)
##        1|2 
## 0.02943035
pnorm(m$zeta['2|3'] - latent_value_with_kids) - pnorm(m$zeta['1|2'] - latent_value_with_kids)
##       2|3 
## 0.1233639
pnorm(m$zeta['3|4'] - latent_value_with_kids) - pnorm(m$zeta['2|3'] - latent_value_with_kids)
##       3|4 
## 0.1690536
pnorm(m$zeta['4|5'] - latent_value_with_kids) - pnorm(m$zeta['3|4'] - latent_value_with_kids)
##       4|5 
## 0.3362064
1 - pnorm(m$zeta['4|5'] - latent_value_with_kids)
##       4|5 
## 0.3419458

The predicted probabilities without kids are …

pnorm(m$zeta['1|2'] - latent_value_without_kids)
##        1|2 
## 0.01129637
pnorm(m$zeta['2|3'] - latent_value_without_kids) - pnorm(m$zeta['1|2'] - latent_value_without_kids)
##        2|3 
## 0.06715837
pnorm(m$zeta['3|4'] - latent_value_without_kids) - pnorm(m$zeta['2|3'] - latent_value_without_kids)
##       3|4 
## 0.1182204
pnorm(m$zeta['4|5'] - latent_value_without_kids) - pnorm(m$zeta['3|4'] - latent_value_without_kids)
##       4|5 
## 0.3097631
1 - pnorm(m$zeta['4|5'] - latent_value_without_kids)
##       4|5 
## 0.4935618
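
As a cross-check (not part of the original post), the same probabilities can be obtained directly from predict() by handing it a small newdata frame that contains both levels of child:

# predicted probabilities at the mean age, with and without kids
newdat <- data.frame(age = mean(mar$age), child = c("yes", "no"))
predict(m, newdata = newdat, type = "probs")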

In this model, people without kids rate their marriages higher, and the calculations above tell us how large this difference is in terms of predicted probabilities.

Graphically, the latent-value distribution of a person at the mean age without kids is shown in blue, and that of a person at the mean age with kids in red. The cutoff values are fixed; the predicted probabilities, i.e., the areas between the cutoffs, differ.

ggplot(data.frame(x = c(-5, 5)), aes(x = x)) +
   stat_function(fun = function(x) dnorm(x, mean = latent_value_with_kids), col = c("red")) +
   stat_function(fun = function(x) dnorm(x, mean = latent_value_without_kids), col = c("blue")) +
  geom_vline(xintercept=c(m$zeta['1|2'], m$zeta['2|3'], m$zeta['3|4'], m$zeta['4|5']), linetype="dotted")

Literature:

  • Arel-Bundock, Vincent (2023). Rdatasets: A collection of datasets originally distributed in various R packages. R package version 1.0.0, https://vincentarelbundock.github.io/Rdatasets.

  • Fair, Ray C. (1978). A theory of extramarital affairs. Journal of Political Economy 86, 45-61.