In Kahneman and Tversky (1979), survey participants had the hypothetical choice between the risky gamble
$A: \begin{cases} 2{,}500 & \text{ with probability } 0.33\\ 2{,}400 & \text{ with probability } 0.66\\ 0 & \text{ with probability } 0.01\\ \end{cases}$
and 2,400 for sure
$B: \begin{cases} 2{,}400& \text{ with probability } 1.00\\ \end{cases}$.
Most people chose $B$ in the original sample.
In the replication by Ruggeri et al. (2020), 126 out of 282 Australian participants (44.7%) chose $A$. Suppose we assume that half of all Australians prefer $A$. What is the probability of observing an empirical distribution at least as extreme as the one in our sample (in other words, what is the p-value)?
A common approach to this question is a one-way chi-square test (also known as the chi-square goodness-of-fit test). It tests the null hypothesis that the categorical data follow pre-specified frequencies. The data and code supplementing Ruggeri et al. (2020) use this test, calling scipy's chisquare function for each item in each country (not reported in the paper).
import scipy
import matplotlib
import numpy as np
from scipy.stats import chisquare
from matplotlib import pyplot as plt
print('numpy version: ' + np.__version__)
print('scipy version: ' + scipy.__version__)
print('matplotlib version: ' + matplotlib.__version__)
# set seed
np.random.seed(1)
numpy version: 1.19.2
scipy version: 1.6.2
matplotlib version: 3.5.0
chisquare(f_obs=[126,156])
Power_divergenceResult(statistic=3.1914893617021276, pvalue=0.0740225429253814)
Calling the function chisquare with the empirically observed choice frequencies yields a chi-squared test statistic of 3.19 and a p-value of 0.074.
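Under the null hypothesis, the test statistic approximately follows a chi-square distribution with $k - 1 = 1$ degree of freedom (one less than the number of categories). To see where the p-value comes from, you can evaluate the survival function of that distribution directly:
from scipy.stats import chi2
# probability of a chi-square value of at least 3.19, given 1 degree of freedom
chi2.sf(3.1914893617021276, df=1)
This returns exactly the p-value reported by chisquare.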
Note: In R, you can just call "chisq.test(x=c(126, 156), correct=F)". rdrr.io allows you to run this online without installing R.
Let us try to develop an intuition. The empirically observed and the expected frequencies under the null hypothesis look like this:
| Choice   | A   | B   |
|----------|-----|-----|
| Observed | 126 | 156 |
| Expected | 141 | 141 |
The test statistic is the sum of the squared deviations between the observed and expected frequencies, each divided by the corresponding expected frequency, $\sum \frac{(observed - expected)^2}{expected}$, sometimes written as $\chi^2 = \sum_i \frac{(O_i-E_i)^2}{E_i}$. The larger the test statistic, the larger the differences between observed and expected frequencies. In our example:
(126 - 141) ** 2 / 141 + (156 - 141) ** 2 / 141
3.1914893617021276
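As a sanity check, scipy's chisquare function also accepts the expected frequencies explicitly via its f_exp parameter; with equal expected frequencies this reproduces the default result:
chisquare(f_obs=[126, 156], f_exp=[141, 141])
Power_divergenceResult(statistic=3.1914893617021276, pvalue=0.0740225429253814)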
How likely is a value greater than or equal to 3.19 if the null hypothesis is true? scipy's chisquare function already provides an answer (0.074), but let us run a short simulation to obtain an independent estimate.
The function get_simulated_samples repeatedly generates samples of the same size. You don't need to understand every single line of the code. However, you should get the idea of the simulation and develop an intuition of the p-value.
def get_simulated_samples(sim_runs, sample_size):
    results = []
    for _ in range(sim_runs):
        # flip a coin sample_size times and record the outcomes
        # (returns an array of size sample_size of 0's and 1's)
        random_choices = np.random.randint(low=0, high=2, size=sample_size, dtype=int)
        # count the "A" choices
        no_A_choices = random_choices.sum()
        # calculate the expected number of A choices under the null hypothesis
        exp_A_choices = sample_size / 2
        # append the X^2 statistic of this simulation run to our results list
        results.append(((no_A_choices - exp_A_choices) ** 2 +
                        ((sample_size - no_A_choices) - exp_A_choices) ** 2) / exp_A_choices)
    return results
With the parameters sim_runs and sample_size, you determine how many samples to draw and how large each sample should be. For each simulation run, the function assumes that A and B choices are equally likely, randomly generates a series of choices, and computes the test statistic.
small_sample = get_simulated_samples(sim_runs=10000, sample_size=282)
The function returns a list of these statistics. Therefore, the length of the list is equal to the number of simulation runs.
len(small_sample)
10000
plt.hist(small_sample)
[Histogram of the 10,000 simulated test statistics: strongly right-skewed, with about 78% of the values below 1.5 and a thin tail reaching roughly 15.4.]
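To judge how well the simulation matches the theory, you can overlay the density of the chi-square distribution with one degree of freedom (a sketch; the number of bins and the evaluation grid are arbitrary choices):
from scipy.stats import chi2
x = np.linspace(0.05, 16, 200)
# plot the simulated statistics as a density histogram
plt.hist(small_sample, bins=20, density=True, label='simulated test statistics')
# overlay the theoretical chi-square density with df = 1
plt.plot(x, chi2.pdf(x, df=1), label='chi-square density, df = 1')
# mark the observed test statistic
plt.axvline(3.19, linestyle='--', label='observed statistic')
plt.legend()
plt.xlabel('test statistic')
plt.show()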
How often do we get test statistics larger than 3.19?
np.sum(np.array(small_sample)>3.1914893617021276)
668
... as a fraction:
np.sum(np.array(small_sample)>3.1914893617021276) / len(small_sample)
0.0668
That is our simulation-based estimate of the p-value, and it is close to the p-value we obtain from scipy. If the choice proportions are truly equal and the sample size is 282, we should see a test statistic larger than 3.19 in roughly 7% of all cases.
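By the way, the simulation does not need the intermediate arrays of individual coin flips. A minimal vectorized sketch (np.random.binomial draws the number of A choices for all runs at once):
# number of A choices in each of 10,000 simulated samples of size 282
a_counts = np.random.binomial(n=282, p=0.5, size=10000)
# chi-square statistic for each simulated sample
stats = (a_counts - 141) ** 2 / 141 + ((282 - a_counts) - 141) ** 2 / 141
# fraction of simulated statistics larger than the observed one
print(np.mean(stats > 3.1914893617021276))
The printed fraction should again land near 0.07.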
Bottom Line: Using Python or R, you can add a statistical test to your thesis just by running a single line of code.
Final Comment: The test also works with more than two categories and with null hypotheses other than equal frequencies. If the number of observations becomes small, exact tests exist; a sketch for the two-category case follows below.
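For two categories, the exact counterpart is the binomial test. A minimal sketch, assuming scipy 1.7 or newer (older versions provide scipy.stats.binom_test instead):
from scipy.stats import binomtest  # requires scipy >= 1.7
# exact two-sided test: 126 A choices out of 282, null proportion 0.5
result = binomtest(k=126, n=282, p=0.5)
print(result.pvalue)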
Literature: