Especially in a Master thesis, you will encounter methods not or rarely discussed in your courses. A typical response is to find an R package or a Stata command that does the work for you. While there is nothing wrong with delegating computational details to a package, developing an intuition for the methods used in your thesis is still necessary and helpful. Here is an ordered probit model as an example.

Look for an intuitive explanation that gives you a good start, maybe Simon Jackman’s teaching notes.

I start with simplified code from the Library of Statistical Techniques (LOST).

```
library(data.table)
library(MASS)
library(ggplot2)
```

We use data on marital happiness. Fair (1986) uses this data set, which is now part of Rdatasets (documentation).

`mar <- fread('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Fair.csv')`

We do not attempt to replicate the original analysis, nor are we trying to estimate the most reasonable model for this data set. We use the data and the model below just for educational illustration.

```
# just keep rate, child, and age
mar <- mar[, .(id, rate, age, child)]
mar
```

```
## id rate age child
## 1: 1 4 37 no
## 2: 2 4 27 no
## 3: 3 4 32 yes
## 4: 4 5 57 yes
## 5: 5 3 22 no
## ---
## 597: 597 5 22 yes
## 598: 598 4 32 yes
## 599: 599 5 32 yes
## 600: 600 2 22 yes
## 601: 601 5 32 yes
```

Each row belongs to one respondent. The variable `rate`

is
the self-rating of marriage, ranging from 1 (very unhappy) to 5 (very
happy). The dummy variable `child`

tells you if a person has
kids.

A common choice for estimating ordered probit models is the
`polr`

command of the `MASS`

package.

```
# See how age and children predict marital happiness
m <- polr(factor(rate) ~ age + child,
data = mar,
method = 'probit'
)
summary(m)
```

```
##
## Re-fitting to get Hessian
```

```
## Call:
## polr(formula = factor(rate) ~ age + child, data = mar, method = "probit")
##
## Coefficients:
## Value Std. Error t value
## age -0.01725 0.005249 -3.287
## childyes -0.39102 0.111463 -3.508
##
## Intercepts:
## Value Std. Error t value
## 1|2 -2.8407 0.2001 -14.1932
## 2|3 -1.9759 0.1752 -11.2794
## 3|4 -1.4140 0.1696 -8.3384
## 4|5 -0.5443 0.1650 -3.2987
##
## Residual Deviance: 1596.248
## AIC: 1608.248
```

`logLik(m)`

`## 'log Lik.' -798.1242 (df=6)`

Sometimes students paste the summary output of the `polr`

command without further explanation into their theses. But: What are the
intercepts here (sometimes called cutoff values or thresholds)? What are
the coefficients telling us? Students need to develop a basic
understanding of what these statistics mean. I recommend playing around
with some data to build up these intuitions as we do in the
following.

Let’s have a look at the predicted probabilities for each observation. We ask R to show us just the first five observations.

`m$fitted.values[1:5,]`

```
## 1 2 3 4 5
## 1 0.013818096 0.07667856 0.12846113 0.3184791 0.4625631
## 2 0.008776590 0.05671942 0.10601361 0.2971975 0.5312928
## 3 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 4 0.071270311 0.20242134 0.21047418 0.3125635 0.2032706
## 5 0.006924422 0.04826966 0.09526513 0.2841035 0.5654373
```

The fitted values tell how likely a respondent is to choose 1, 2, 3, 4, or 5 on the Likert scale. Note that probabilities for each respondent sum up to one.

The predicted results for each observation are just the response with the highest predicted probability.

`predict(m)[1:5]`

```
## [1] 5 5 5 4 5
## Levels: 1 2 3 4 5
```

Where do the predicted probabilities come from, and what is the role of the cutoff values?

The key is the construct of a continuous latent variable, usually denoted \(y^*\). It is assumed to be a linear combination of some covariates. In our example, \(y^* = \beta_1 \times age + \beta_2 \times I(kids)+\epsilon\). \(I(kids)\) is an indicator variable equal to one if a respondent has children and zero otherwise. The disturbance term \(\epsilon\) is assumed to have a standard normal distribution.

The value of the latent variable for subject i (\(y^*_i\)) is unobservable. But there is a link between \(y^*_i\) and the observed category chosen by a respondent. This is where the cutoff values come into play. We have five ratings. We need four cutoff points to split the real axis into five parts. If \(y^*_i\) is smaller than the smallest cutoff value \(\mu_0\), then we see the lowest response. In our example, this refers to someone who rates her or his marriage as “very unhappy”. If \(y^*_i\) lies between the smallest cutoff value \(\mu_0\) and the second smallest cutoff value \(\mu_1\), we see a person that rates her or his marriage as “unhappy”, and so forth.

The final step uses the assumption about the disturbance term. Assume we already know the optimal params \(\beta_1\) and \(\beta_2\). Assume further that a person has children and is 27 years old (like the respondent in the 20th row of the dataset).

`unique(mar$child)`

`## [1] "no" "yes"`

```
mar[, child_numeric := ifelse(child == "yes", 1, 0)]
# define subject no.
subject_no <- 20
mar[subject_no, age]
```

`## [1] 27`

`mar[subject_no, child_numeric]`

`## [1] 1`

How likely is it that this person rates her or his marriage as “very poor”? We first calculate the predicted value of the latent variable \(y^*_i\), where \(\Phi\) is the cumulative distribution function of the standard normal (note that the disturbance term \(\epsilon\) has zero mean).

```
# latent variable
latent_value <- as.numeric(m$coefficients['age'] * mar[subject_no, age] + m$coefficients['childyes'] * mar[subject_no, child_numeric])
latent_value
```

`## [1] -0.8567661`

Then, the probability pf seeing a rating “very poor” is equal to
\(Prob[y^*_i + \epsilon < \mu_0]\).
By assumption, \(\mu\) follows a
standard normal, so \(Prob[y^*_i + \epsilon
< \mu_0] = Prob[\epsilon < \mu_0 - y^*_i] = \Phi[\mu_0 -
y^*_i]\). The output of `polr`

at the beginning gives
us the cutoff values.

```
# the estimated cutoff values
m$zeta
```

```
## 1|2 2|3 3|4 4|5
## -2.8406574 -1.9759479 -1.4139629 -0.5442665
```

We need the first cutoff value to calculate \(\Phi[\mu_0 - y^*_i]\) for our observation.

`m$zeta['1|2']`

```
## 1|2
## -2.840657
```

`pnorm(m$zeta['1|2'] - latent_value)`

```
## 1|2
## 0.02363398
```

We move on and ask you likely it is for our respondent to rate her or his marriage as “unhappy”: \(Prob[\mu_0 < y^*_i + \epsilon < \mu_1] = Prob[\mu_0 - y^*_i < \epsilon < \mu_1 - y^*_i] = \Phi[\mu_1 - y^*_i] - \Phi[\mu_0 - y^*_i]\).

`pnorm(m$zeta['2|3'] - latent_value) - pnorm(m$zeta['1|2'] - latent_value)`

```
## 2|3
## 0.1078973
```

Applying the same logic to the other categories, we get:

`pnorm(m$zeta['3|4'] - latent_value) - pnorm(m$zeta['2|3'] - latent_value)`

```
## 3|4
## 0.1571652
```

`pnorm(m$zeta['4|5'] - latent_value) - pnorm(m$zeta['3|4'] - latent_value)`

```
## 4|5
## 0.3339731
```

`pnorm(latent_value - m$zeta['4|5'])`

```
## 4|5
## 0.3773304
```

These values are the same as the fitted.values of the fitted
`polr`

object.

```
# same as
m$fitted.values[subject_no, ]
```

```
## 1 2 3 4 5
## 0.02363398 0.10789732 0.15716519 0.33397307 0.37733044
```

```
# prediction
predict(m)[subject_no]
```

```
## [1] 5
## Levels: 1 2 3 4 5
```

The following plot summarizes our example graphically (with some help from here and here).

```
ggplot(data.frame(x = c(-5, 5)), aes(x = x)) +
stat_function(fun = function(x) dnorm(x, mean = latent_value)) +
geom_vline(xintercept=c(m$zeta['1|2'], m$zeta['2|3'], m$zeta['3|4'], m$zeta['4|5']), linetype="dotted")
```

To estimate the model, we are looking for the cutoff values \(\mu_0\) to \(\mu_4\) and the coefficients \(\beta_1\) and \(\beta_2\) that fit the data best. What does “fit the data best” mean?

Excursion: Think of a simple model predicting that every respondent chooses the first option in a binary choice problem with a probability of \(0.6\). We have a sample of two people. One respondent chooses A, and the second chooses B. How likely is it that we observe the data assuming our model is true? It’s \(0.6 \times 0.4 = 0.24\). What about a model that predicts that every person chooses the first option with a probability of \(0.5\). Under this model, there is a \(0.5 \times 0.5 = 0.25\) chance that we see the data. So the second model seems to be more accurate for our data.

Maximum likelihood estimation generalizes this intuition, but it is usually done with a sum over the logarithm of probabilities and not with a multiplication of many probabilities. With many observations, the multiplicative procedure is numerically problematic. To see this, note that it is easy to calculate \(0.6 \times 0.4 = 0.24\). But what about \((0.6 \times 0.4)^{100}\) if a data set that is 100 times as large?

`(0.6 * 0.4) ^ 100`

`## [1] 1.049843e-62`

Still works. But what if we have 2000 obs or 1000 pairs of A and B choices under the first model?

`(0.6 * 0.4) ^ 1000`

`## [1] 0`

This term evaluates to \(0\). Of course, we know that’s wrong. We should see a tiny number, which is certainly larger than \(0\). However, our probability has become so small that my machine can no longer handle it. The solution is the maximize the natural log of the probability. \(ln \left( \prod_{i=1}^N p_i \right)\) has the same max as \(\prod_{i=1}^N p_i\), but because of \(ln \left( \prod_{i=1}^N p_i \right) = \sum_{i=1}^N ln(p_i)\), there is no need to cope with tiny numbers in large samples.

Let’s apply the same logic to estimating the ordered probit model. We want to choose the model’s parameters to maximize the probability of seeing the data. The log-likelihood with \(601\) observations, \(5\) categories, \(\mu_0 = - \infty\) and \(\mu_5 = \infty\) is, therefore:

\(ln \mathcal{L} = \sum_{i=1}^{601} \sum_{j=1}^5 Z_{ij} ln(Prob[Z_{ij} = j]) = \sum_{i=1}^{601} \sum_{j=1}^5 Z_{ij} ln(\Phi(\mu_j - y^*_{ij}) - \Phi(\mu_{j-1} - y^*_{ij}))\)

\(Z_{ij}\) is one if respondent \(i\) has chosen category \(j\) and zero otherwise. Note that most of the summands above are zero; just because \(Z_{ij}\) is equal to zero in four out of five cases.

We define a few functions to calculate this number for our model and data. There is no need to bother with the details.

```
# define a function that calculates all predicted probabilities
get_predicted_probabilities <- function(my_zetas, my_coeff) {
# Input: - my_zetas : A list of four threshold values, labeled with '1|2', '2|3', '3|4', and '4|5'.
# - my_coeff : A list of two beta coefficents, labeled with 'age' and 'childyes'
# Output: A data.table with predicted probabilites for each observation
# Predicted values of 1 are replaced with 1.0 - 0.000001
# Predicted values of 0 are replaced with 0.000001
return_matrix <- matrix(data = NA, nrow = nrow(mar), ncol = 5)
latent_values <- my_coeff['age'] * mar[, age] + my_coeff['childyes'] * mar[, child_numeric]
return_matrix[, 1] <- pnorm(my_zetas['1|2'] - latent_values)
return_matrix[, 2] <- pnorm(my_zetas['2|3'] - latent_values) - pnorm(my_zetas['1|2'] - latent_values)
return_matrix[, 3] <- pnorm(my_zetas['3|4'] - latent_values) - pnorm(my_zetas['2|3'] - latent_values)
return_matrix[, 4] <- pnorm(my_zetas['4|5'] - latent_values) - pnorm(my_zetas['3|4'] - latent_values)
return_matrix[, 5] <- 1 - pnorm(my_zetas['4|5'] - latent_values)
colnames(return_matrix) <- c("Prob1", "Prob2", "Prob3", "Prob4", "Prob5")
return_matrix[return_matrix == 1] <- 1.0 - 0.000001
return_matrix[return_matrix == 0] <- 0.000001
return(as.data.table(return_matrix)[])
}
get_prediction <- function(predicted_probabilities) {
# Input: - predicted_probabilities : A data.table formatted as the output of get_predicted_probabilities
# Output: A data.table with predicted choices for every responded
my_predicted_choices <- copy(predicted_probabilities)
my_predicted_choices[, I := 1:.N]
my_predicted_choices[, row_max := max(Prob1, Prob2, Prob3, Prob4, Prob5), by = I]
my_predicted_choices[, Pred1 := ifelse(Prob1 == row_max, 1, 0)]
my_predicted_choices[, Pred2 := ifelse(Prob2 == row_max, 1, 0)]
my_predicted_choices[, Pred3 := ifelse(Prob3 == row_max, 1, 0)]
my_predicted_choices[, Pred4 := ifelse(Prob4 == row_max, 1, 0)]
my_predicted_choices[, Pred5 := ifelse(Prob5 == row_max, 1, 0)]
my_predicted_choices[, c("Prob1", "Prob2", "Prob3", "Prob4", "Prob5", "I", "row_max") := NULL]
return(my_predicted_choices[])
}
```

Short note for people interested in the details of the
`data.table`

package: If you find the `[]`

’s in
the return statements weird, see the answer
to frequently asked question 2.23 about data.table.

```
# sanity check
test_probs <- get_predicted_probabilities(my_zetas = m$zeta, my_coeff = m$coefficients)
test_probs
```

```
## Prob1 Prob2 Prob3 Prob4 Prob5
## 1: 0.013818096 0.07667856 0.12846113 0.3184791 0.4625631
## 2: 0.008776590 0.05671942 0.10601361 0.2971975 0.5312928
## 3: 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 4: 0.071270311 0.20242134 0.21047418 0.3125635 0.2032706
## 5: 0.006924422 0.04826966 0.09526513 0.2841035 0.5654373
## ---
## 597: 0.019219588 0.09479887 0.14594887 0.3295293 0.4105034
## 598: 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 599: 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 600: 0.019219588 0.09479887 0.14594887 0.3295293 0.4105034
## 601: 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
```

`m$fitted.values[1:5, ]`

```
## 1 2 3 4 5
## 1 0.013818096 0.07667856 0.12846113 0.3184791 0.4625631
## 2 0.008776590 0.05671942 0.10601361 0.2971975 0.5312928
## 3 0.028871630 0.12194612 0.16802137 0.3361218 0.3450391
## 4 0.071270311 0.20242134 0.21047418 0.3125635 0.2032706
## 5 0.006924422 0.04826966 0.09526513 0.2841035 0.5654373
```

`m$fitted.values[597:601, ]`

```
## 1 2 3 4 5
## 597 0.01921959 0.09479887 0.1459489 0.3295293 0.4105034
## 598 0.02887163 0.12194612 0.1680214 0.3361218 0.3450391
## 599 0.02887163 0.12194612 0.1680214 0.3361218 0.3450391
## 600 0.01921959 0.09479887 0.1459489 0.3295293 0.4105034
## 601 0.02887163 0.12194612 0.1680214 0.3361218 0.3450391
```

```
test_predict <- get_prediction(predicted_probabilities = test_probs)
test_predict
```

```
## Pred1 Pred2 Pred3 Pred4 Pred5
## 1: 0 0 0 0 1
## 2: 0 0 0 0 1
## 3: 0 0 0 0 1
## 4: 0 0 0 1 0
## 5: 0 0 0 0 1
## ---
## 597: 0 0 0 0 1
## 598: 0 0 0 0 1
## 599: 0 0 0 0 1
## 600: 0 0 0 0 1
## 601: 0 0 0 0 1
```

`predict(m)[1:5]`

```
## [1] 5 5 5 4 5
## Levels: 1 2 3 4 5
```

Finally, we define the log-likelihood function and the observed choices in a matrix with zeros and ones.

```
# estimation: maximum likelihood
ll <- function(params) {
zetas <- c('1|2' = params[1], '2|3' = params[2], '3|4' = params[3], '4|5' = params[4])
coeffs <- c('age' = params[5], 'childyes' = params[6])
my_probs <- get_predicted_probabilities(zetas, coeffs)
return(sum(my_choices * log(my_probs)))
}
my_choices <- matrix(data = NA, nrow = nrow(mar), ncol = 5)
for (i in 1:5) {
my_choices[, i] <- ifelse(mar$rate == i, 1, 0)
}
my_choices[1:5, ]
```

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 1 0
## [2,] 0 0 0 1 0
## [3,] 0 0 0 1 0
## [4,] 0 0 0 0 1
## [5,] 0 0 1 0 0
```

`mar$rate[1:5]`

`## [1] 4 4 4 5 3`

We get the same log-likelihood if we use the optimal values from the
`polr`

output above and evaluate the log-likelihood with our
function `ll`

.

```
params <- as.vector(c(unlist(m$zeta), unlist(m$coefficients)))
ll(params)
```

`## [1] -798.1242`

We get a smaller log-likelihood when adding small random numbers to the optimal params.

`ll(params + runif(n = 6, min = -0.01, max = 0.01))`

`## [1] -804.2977`

`polr`

found the best fitting parameters for us.

The `MASS`

package is beneficial for several reasons. The
code above is only valid for our model and our data. If we change any of
these, we have to change the code. Furthermore, maximizing
`ll`

with out-of-the-box algorithms and unwise chosen
starting values may …

```
negll <- function(params) {return (-1 * ll(params))}
nlm(negll, c(0, 0.1, 0.2, 0.3, 0, 0))
```

```
## $minimum
## [1] 798.1242
##
## $estimate
## [1] -2.84092297 -1.97620273 -1.41421462 -0.54451101 -0.01725857 -0.39096819
##
## $gradient
## [1] -3.321458e-06 3.394147e-06 -4.421377e-06 0.000000e+00 2.522711e-04
## [6] 1.341505e-05
##
## $code
## [1] 1
##
## $iterations
## [1] 62
```

… or may not work.

`nlm(negll, c(0, 0.1, 0.2, 0.3, 0.2, -0.2))`

```
## $minimum
## [1] 8082.074
##
## $estimate
## [1] -0.35959241 0.35551324 0.05124773 3.07565150 -91.94363894
## [6] -2.30315152
##
## $gradient
## [1] 0 0 0 0 0 0
##
## $code
## [1] 1
##
## $iterations
## [1] 1
```

So, we are thankful that a package handles all these details for us. However, the point of this markdown is that we should still aim for a solid intuitive understanding of our methods.

There are several ways to think about the coefficients. Here is what I think is meaningful in our context.

We start by fixing age at the mean.

`mean(mar$age)`

`## [1] 32.48752`

The predicted latent value of a person at the mean age with children is …

```
latent_value_with_kids <- as.numeric(m$coefficients['age'] * mean(mar$age) + m$coefficients['childyes'] * 1)
latent_value_with_kids
```

`## [1] -0.9514251`

… and for a person at the mean age without kids, it is ….

```
latent_value_without_kids <- as.numeric(m$coefficients['age'] * mean(mar$age) + m$coefficients['childyes'] * 0)
latent_value_without_kids
```

`## [1] -0.5604054`

The predicted probabilities with kids are …

`pnorm(m$zeta['1|2'] - latent_value_with_kids)`

```
## 1|2
## 0.02943035
```

`pnorm(m$zeta['2|3'] - latent_value_with_kids) - pnorm(m$zeta['1|2'] - latent_value_with_kids)`

```
## 2|3
## 0.1233639
```

`pnorm(m$zeta['3|4'] - latent_value_with_kids) - pnorm(m$zeta['2|3'] - latent_value_with_kids)`

```
## 3|4
## 0.1690536
```

`pnorm(m$zeta['4|5'] - latent_value_with_kids) - pnorm(m$zeta['3|4'] - latent_value_with_kids)`

```
## 4|5
## 0.3362064
```

`1 - pnorm(m$zeta['4|5'] - latent_value_with_kids)`

```
## 4|5
## 0.3419458
```

The predicted probabilities without kids are …

`pnorm(m$zeta['1|2'] - latent_value_without_kids)`

```
## 1|2
## 0.01129637
```

`pnorm(m$zeta['2|3'] - latent_value_without_kids) - pnorm(m$zeta['1|2'] - latent_value_without_kids)`

```
## 2|3
## 0.06715837
```

`pnorm(m$zeta['3|4'] - latent_value_without_kids) - pnorm(m$zeta['2|3'] - latent_value_without_kids)`

```
## 3|4
## 0.1182204
```

`pnorm(m$zeta['4|5'] - latent_value_without_kids) - pnorm(m$zeta['3|4'] - latent_value_without_kids)`

```
## 4|5
## 0.3097631
```

`1 - pnorm(m$zeta['4|5'] - latent_value_without_kids)`

```
## 4|5
## 0.4935618
```

People without kids rate their marriages higher, and the calculations above tell us how large this effect is in terms of predicted probabilities.

Assessed graphically, we have a person with a mean age and no kids in blue and a person with a mean age with kids in red. The cutoff values are fixed, and predicted probabilities, the areas between the cutoffs, differ.

```
ggplot(data.frame(x = c(-5, 5)), aes(x = x)) +
stat_function(fun = function(x) dnorm(x, mean = latent_value_with_kids), col = c("red")) +
stat_function(fun = function(x) dnorm(x, mean = latent_value_without_kids), col = c("blue")) +
geom_vline(xintercept=c(m$zeta['1|2'], m$zeta['2|3'], m$zeta['3|4'], m$zeta['4|5']), linetype="dotted")
```

Literature:

Arel-Bundock, Vincent (2023). Rdatasets: A collection of datasets originally distributed in various R packages. R package version 1.0.0, https://vincentarelbundock.github.io/Rdatasets.

Fair, Ray C. (1978). A theory of extramarital affairs. Journal of Political Economy 86, 45-61.