(with Matthias Zuckschwert)
We continue with our previous example from Simple Test for Theses I.
In Kahneman and Tversky (1979), survey participants had the hypothetical choice between the risky gamble
$A: \begin{cases} 2{,}500 & \text{ with probability } 0.33\\ 2{,}400 & \text{ with probability } 0.66\\ 0 & \text{ with probability } 0.01\\ \end{cases}$
and the sure payoff
$B: \begin{cases} 2{,}400& \text{ with probability } 1.00\\ \end{cases}$.
Most people chose $B$ in the original sample.
The second choice was between
$A': \begin{cases} 2{,}500 & \text{ with probability } 0.33\\ 0 & \text{ with probability } 0.67\\ \end{cases}$
and
$B': \begin{cases} 2{,}400 & \text{ with probability } 0.34\\ 0 & \text{ with probability } 0.66\\ \end{cases}$.
Most people chose $A'$ in Kahneman and Tversky's original sample.
Similarly, in the replication by Ruggeri et al. (2020), 126 out of 282 Australian participants (45%) preferred $A$ over $B$, and 199 out of 282 (71%) preferred $A'$ over $B'$.
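As a quick arithmetic check of these shares (a minimal sketch; the counts come from the contingency table below):

n = 282
print(f"{126 / n:.1%}")  # share preferring A over B   -> 44.7%
print(f"{199 / n:.1%}")  # share preferring A' over B' -> 70.6%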
In the following, answers to the two lottery pairs are labeled so that AA' means that the participant chooses A over B and A' over B'. The possible answer patterns are AA', BB', AB', and BA'. The first two are in line with expected utility theory (EU), whereas the latter two are not (Allais, 1953).
The 282 participants answered the two lottery choices as follows:
| | A' | B' | Sum |
|---|---|---|---|
| A | 95 | 31 | 126 |
| B | 104 | 52 | 156 |
| Sum | 199 | 83 | 282 |
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar  # McNemar test for paired binary data
from scipy.stats import binom  # binomial distribution for the exact test
print('numpy version: ' + np.__version__)
numpy version: 1.19.2
Most people chose BA'. For many people, this makes intuitive sense: they choose the sure 2,400 in the first decision situation, and in the second decision situation the probability of winning a large prize barely differs (0.33 vs. 0.34), while the prize in the case of success differs by a meaningful amount (2,500 vs. 2,400). However, this choice pattern violates the independence axiom. (Decision analysis students should be able to show that; a sketch follows.)
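A minimal sketch of that derivation (normalizing $u(0) = 0$, an assumption made only to shorten the algebra): under expected utility,

$$B \succ A \;\Leftrightarrow\; u(2{,}400) > 0.33\,u(2{,}500) + 0.66\,u(2{,}400) \;\Leftrightarrow\; 0.34\,u(2{,}400) > 0.33\,u(2{,}500),$$

while

$$A' \succ B' \;\Leftrightarrow\; 0.33\,u(2{,}500) > 0.34\,u(2{,}400).$$

The two inequalities contradict each other, so the pattern BA' (and, by the mirrored argument, AB') is inconsistent with expected utility. Intuitively, $A'$ and $B'$ arise from $A$ and $B$ by removing a common 0.66 chance of winning 2,400, and the independence axiom says exactly that removing a common component should not reverse the preference.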
What pattern would we expect if expected utility theory validly described actual choice behavior? An expected utility maximizer would choose either AA' or BB'. We may occasionally see AB' or BA', but these patterns would just be random mistakes. In particular, AB' and BA' should then appear equally often, as the simulation sketch below illustrates.
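A hedged sketch of that argument (not part of the original analysis; the type share and the error probability are arbitrary assumptions): suppose every participant is an EU maximizer of type AA' or BB' and, with the same small probability, gives a wrong answer to each question.

import numpy as np

rng = np.random.default_rng(seed=1)
n_sim = 100_000
eps = 0.1  # per-question error probability (assumption)

# True types: True = AA'-type, False = BB'-type (the 50/50 split is an assumption)
types = rng.random(n_sim) < 0.5
# Independent response errors flip the answer to each question
err1 = rng.random(n_sim) < eps
err2 = rng.random(n_sim) < eps
chooses_A = types ^ err1        # answer to the first pair
chooses_A_prime = types ^ err2  # answer to the second pair

# Both discordant patterns occur with probability eps * (1 - eps) = 0.09
print(np.mean(chooses_A & ~chooses_A_prime))  # share of AB'
print(np.mean(~chooses_A & chooses_A_prime))  # share of BA'

Both shares come out approximately equal (here about 0.09), regardless of how the two types are mixed; only a systematic difference between the AB' and BA' counts speaks against expected utility.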
Thus, we formulate the null hypothesis that AB' and BA' are equally likely. The data appear inconsistent with the null: we count BA' (104) far more often than AB' (31). The McNemar test uses the following test statistic (see Moffatt, 2015, for details, and Conlisk, 1989, for an alternative test):
$$\frac{\left[ n(AB') - n(BA') \right]^2}{n(AB') + n(BA')},$$ where $n(AB')$ and $n(BA')$ are the numbers of participants who choose patterns AB' and BA', respectively. The test statistic is exactly zero if $n(AB')$ and $n(BA')$ are equal, and it grows with the absolute difference between the two counts.
The test statistic in our example is:
(31 - 104) ** 2 / (31 + 104)
39.474074074074075
How likely is a test statistic as extreme as, or even more extreme than, 39.474 if the null is true?
Given the simplicity of the test statistic, we can use the binomial distribution to calculate the probability exactly. Under the null (AB' and BA' are equally likely), each of the 135 discordant choices is AB' with probability 0.5. First, we determine how likely it is to observe 31 or fewer AB' among these 135 choices:
binom.cdf(k=31, n=135, p=0.5)
1.0278682020214401e-10
Similarly, how likely is it to observe 104 or more BA' among the 135 discordant choices if the null is true (the second way in which the data could be extreme)?
1 - binom.cdf(k=103, n=135, p=0.5)
1.027867790881487e-10
We sum up the two probabilities (that is, we perform a two-sided test):
1 - binom.cdf(k=103, n=135, p=0.5) + binom.cdf(k=31, n=135, p=0.5)
2.0557359929029271e-10
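Because the Binomial(135, 0.5) distribution is symmetric, the two tails coincide and the two-sided p-value is simply twice the lower tail. SciPy also packages this as a ready-made test (a sketch assuming SciPy ≥ 1.7, where the function is called binomtest; older versions name it binom_test):

from scipy.stats import binomtest

# Exact two-sided binomial test: 31 "successes" (AB') out of 135 discordant choices
print(binomtest(k=31, n=135, p=0.5).pvalue)  # same value as above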
That is the exact p-value of the McNemar test. In Python, we can define the data
data = np.array([[95, 31], [104, 52]])
data
array([[ 95,  31],
       [104,  52]])
and call (see statsmodels docs for details)
print(mcnemar(data, exact=True))
pvalue      2.0557364040428802e-10
statistic   31.0
With exact=True, the reported statistic is the smaller of the two off-diagonal counts, min(31, 104) = 31, i.e., the count that enters the binomial test. Often, it is more convenient to work with a continuous approximation. (For example, you may encounter numerical problems if your distribution is too extreme.)
print(mcnemar(data, exact=False, correction=False))
pvalue      3.324439111401887e-10
statistic   39.474074074074075
Here, our test statistic reappears, and the p-value is quite close to the p-value of the exact test.
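The approximate p-value is the survival function of a chi-squared distribution with one degree of freedom, evaluated at the test statistic; we can verify this directly:

from scipy.stats import chi2

# Upper-tail probability of chi-squared(1) at the McNemar statistic
print(chi2.sf((31 - 104) ** 2 / (31 + 104), df=1))  # approx. 3.32e-10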
Note: In R, this test is one line of code: "mcnemar.test(matrix(c(95, 104, 31, 52), nrow = 2), correct = F)". Try it online.
Literature:
- Allais, M. (1953). Le comportement de l'homme rationnel devant le risque: Critique des postulats et axiomes de l'école américaine. Econometrica, 21(4), 503-546.
- Conlisk, J. (1989). Three variants on the Allais example. American Economic Review, 79(3), 392-407.
- Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47(2), 263-291.
- Moffatt, P. G. (2015). Experimetrics: Econometrics for experimental economics. Palgrave Macmillan.
- Ruggeri, K., et al. (2020). Replicating patterns of prospect theory for decision under risk. Nature Human Behaviour, 4, 622-633.