Simple tests for theses I: One-way chi-square test

In Kahneman and Tversky (1979), survey participants had the hypothetical choice between the risky gamble

$A: \begin{cases} 2{,}500 & \text{ with probability } 0.33\\ 2{,}400 & \text{ with probability } 0.66\\ 0 & \text{ with probability } 0.01\\ \end{cases}$

and 2,400 for sure

$B: \begin{cases} 2{,}400& \text{ with probability } 1.00\\ \end{cases}$.

Most people chose $B$ in the original sample.
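Note that the majority chooses the sure payoff even though $A$ has the slightly higher expected value. A quick calculation (not part of the original material) makes this explicit:

# expected values of the two options (illustrative check only)
ev_A = 0.33 * 2500 + 0.66 * 2400 + 0.01 * 0
ev_B = 1.00 * 2400
print(ev_A, ev_B)  # 2409.0 2400.0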

In the replication by Ruggeri et al. (2020), 126 out of 282 Australian participants (44.7%) chose $A$. Suppose we assume that half of all Australians prefer $A$. Under this assumption, what is the probability of observing an empirical distribution at least as extreme as the one in our sample (in other words, what is the p-value)?

A common approach to this question is a one-way chi-square test (sometimes called a chi-square goodness-of-fit test). It tests the null hypothesis that categorical data follow pre-defined expected frequencies. The data and code supplementing Ruggeri et al. (2020) use this test by calling scipy's chisquare function for each item in each country (not reported in the paper).

In [1]:
import scipy
import matplotlib
import numpy as np
from scipy.stats import chisquare
from matplotlib import pyplot as plt
print('numpy version: ' + np.__version__)
print('scipy version: ' + scipy.__version__)
print('matplotlib version: ' + matplotlib.__version__)

# set seed
np.random.seed(1)
numpy version: 1.19.2
scipy version: 1.6.2
matplotlib version: 3.5.0
In [2]:
chisquare(f_obs=[126,156])
Out[2]:
Power_divergenceResult(statistic=3.1914893617021276, pvalue=0.0740225429253814)

Calling the function chisquare with the empirically observed choice frequencies yields a chi-squared test statistic of 3.19 and a p-value of 0.074.

Note: In R, you can just call "chisq.test(x=c(126, 156), correct=F)". rdrr.io allows you to do that online without the need to install R.

Let us try to develop an intuition. The empirically observed and the expected frequencies under the null hypothesis look like this:

Choice       A     B
Observed   126   156
Expected   141   141
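
If you want to make the null hypothesis explicit, you can also pass the expected frequencies to chisquare yourself; with equal expected frequencies this should reproduce the result from above (a small check that is not part of the original code):

chisquare(f_obs=[126, 156], f_exp=[141, 141])
# Power_divergenceResult(statistic=3.1914893617021276, pvalue=0.0740225429253814)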

The test statistic is the sum of the squared deviations between the observed and expected frequencies, divided by the expected frequency, $\sum \frac{(observed - expected)^2}{expected}$, sometimes written as $\chi^2 = \sum_i \frac{(O_i-E_i)^2}{E_i}$. The larger the test statistic, the larger the differences between observed and expected frequencies. In our example:

In [3]:
(126 - 141) ** 2 / 141 + (156 - 141) ** 2 / 141
Out[3]:
3.1914893617021276

How likely is a value greater than or equal to 3.19 if the null hypothesis is true? scipy's chisquare function already provides an answer (0.074), but we run a short simulation to obtain our own estimate.
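
For reference, scipy's 0.074 comes from the chi-square distribution with $k - 1 = 1$ degree of freedom: the p-value is the probability mass to the right of the observed statistic. A quick analytical check (not part of the original code):

from scipy.stats import chi2
chi2.sf(3.1914893617021276, df=1)  # approximately 0.074; survival function of chi-square with df=1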

The function get_simulated_samples repeatedly generates samples of the same size. You don't need to understand every single line of the code. However, you should get the idea of the simulation and develop an intuition for the p-value.

In [4]:
def get_simulated_samples(sim_runs, sample_size):
    results = []
    for _ in range(sim_runs):
        # flip a coin sample_size times and record the outcomes (returns an array of sample_size 0's and 1's)
        random_choices = np.random.randint(low=0, high=2, size=sample_size, dtype=int)
        # count the "A" choices
        no_A_choices = random_choices.sum()
        # calculate the expected A choices
        exp_A_choices = sample_size / 2
        # append the X^2 statistic of this simulation run to our results list
        results.append(((no_A_choices - exp_A_choices) ** 2 + 
                        ((sample_size - no_A_choices) - exp_A_choices) ** 2) / exp_A_choices)
    return results

With the parameters sim_runs and sample_size, you determine how many samples to draw and how large each sample should be. For each simulation run, the function assumes that A and B choices are equally likely, randomly generates a series of choices, and computes the test statistic.
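
As an aside, the same simulation could be written more compactly by drawing the number of A choices directly from a binomial distribution; this vectorized sketch is an equivalent alternative, not the code used below:

def get_simulated_samples_vectorized(sim_runs, sample_size):
    # number of "A" choices in each simulated sample, assuming P(A) = 0.5 under the null
    a_counts = np.random.binomial(n=sample_size, p=0.5, size=sim_runs)
    expected = sample_size / 2
    # chi-square statistic for the two categories, computed for all runs at once
    return ((a_counts - expected) ** 2 + ((sample_size - a_counts) - expected) ** 2) / expected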

In [5]:
small_sample = get_simulated_samples(sim_runs=10000, sample_size=282)

The function returns a list of these statistics. Therefore, the length of the list is equal to the number of simulation runs.

In [6]:
len(small_sample)
Out[6]:
10000
In [7]:
plt.hist(small_sample)
Out[7]:
(array([7.806e+03, 1.336e+03, 5.850e+02, 1.400e+02, 8.500e+01, 2.300e+01,
        1.600e+01, 6.000e+00, 2.000e+00, 1.000e+00]),
 array([ 0.        ,  1.54468085,  3.0893617 ,  4.63404255,  6.1787234 ,
         7.72340426,  9.26808511, 10.81276596, 12.35744681, 13.90212766,
        15.44680851]),
 <BarContainer object of 10 artists>)

How often do we get test statistics larger than 3.19?

In [8]:
np.sum(np.array(small_sample)>3.1914893617021276)
Out[8]:
668

... as a fraction:

In [9]:
np.sum(np.array(small_sample)>3.1914893617021276) / len(small_sample)
Out[9]:
0.0668

That is our simulation-based estimate of the p-value. It is close to the p-value that we obtain from scipy. If both choices are in fact equally likely and the sample size is 282, we should see a test statistic larger than 3.19 in roughly 7% of all cases.
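
To connect the simulation with scipy's analytical answer, you can overlay the chi-square density with one degree of freedom on a normalized histogram of the simulated statistics (an optional illustration, not part of the original notebook):

from scipy.stats import chi2
xs = np.linspace(0.05, 16, 400)  # start slightly above 0 because the df=1 density diverges at 0
plt.hist(small_sample, bins=30, density=True, alpha=0.5, label='simulated statistics')
plt.plot(xs, chi2.pdf(xs, df=1), label='chi-square density, df=1')
plt.axvline(3.19, linestyle='--', label='observed statistic')
plt.legend()
plt.show()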

Bottom Line: Using scipy (or R), you can add a statistical test to your thesis just by running a single line of code.

Final Comment: The test also works with more than two categories and with null hypotheses other than equal frequencies. Exact tests exist if the number of observations becomes small.
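
For illustration, a test against a null with three categories and unequal frequencies, and an exact binomial test for the two-category case, could look like this (the three-category counts are hypothetical, sketch only; note that scipy.stats.binom_test has been replaced by scipy.stats.binomtest in newer scipy versions):

# three categories with a 50/30/20 null hypothesis (hypothetical counts for illustration)
chisquare(f_obs=[140, 90, 52], f_exp=[141, 84.6, 56.4])

# exact binomial test for the original two-category data
from scipy.stats import binom_test
binom_test(126, n=282, p=0.5)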

Literature:

  • Kahneman, D., and Tversky, A., 1979, Prospect theory: An analysis of decision under risk, Econometrica 47, 263-291.
  • Ruggeri, K. et al., 2020, Replicating patterns of prospect theory for decision under risk, Nature Human Behaviour 4, 622-633.