Simpson's Paradox: Cholesterol Example

This app illustrates Simpson's Paradox using the cholesterol example in Glymour et al. (2016, Section 1.2). In this example, we want to assess the effect of exercise on cholesterol in various age groups. The scatter plot below suggests that, overall, more exercise goes along with higher cholesterol. However, if we consider each age group separately, the sign of the association is negative: within every group, more exercise goes along with lower cholesterol.

Does More Exercise Lead to Higher Cholesterol?

We simulate data to illustrate Simpson's Paradox in the cholesterol example. Simpson's Paradox describes the phenomenon that a statistical association, such as a positive correlation, holds in a population of interest but is reversed in every subpopulation. If we consider the population without using the information on the age group, we find a positive correlation between exercise and cholesterol. However, when we perform the same analysis within each age group, the correlation is negative. This sign flip arises because, in our data example, people in higher age groups tend to exercise more and also tend to have higher cholesterol. Hence, age is a common cause of cholesterol and exercise, which we have to account for in our analysis.
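The sign flip described above can be reproduced with a few lines of simulation. The following is a minimal sketch in Python, not the code used by the app: the age groups, coefficients, noise levels and variable names are all assumptions chosen only so that age raises both exercise and cholesterol while exercise itself lowers cholesterol.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data-generating process (assumed values, not the app's):
# age pushes both exercise and cholesterol up; exercise pushes cholesterol down.
ages = np.repeat([20, 30, 40, 50, 60], 200)  # five age groups, 200 people each
exercise = 0.2 * ages + rng.normal(0.0, 1.0, size=ages.size)
cholesterol = 5.0 * ages - 4.0 * exercise + rng.normal(0.0, 5.0, size=ages.size)

# Correlation in the pooled data, ignoring age.
pooled_corr = float(np.corrcoef(exercise, cholesterol)[0, 1])

# Correlation within each age group.
group_corrs = {
    int(a): float(np.corrcoef(exercise[ages == a], cholesterol[ages == a])[0, 1])
    for a in np.unique(ages)
}

print(f"pooled correlation: {pooled_corr:+.2f}")
for a, corr in group_corrs.items():
    print(f"age {a}: correlation {corr:+.2f}")
```

Under these assumptions the pooled correlation is positive because the spread in age dominates the pooled data, whereas each within-group correlation is negative.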

Data Example: Scatter Plot, Causal Diagram, Regression

The scatter plot below shows the simulated data points. The color of each point indicates its age group.

Click on the checkbox to see how the scatter plot, regression line, regression output and DAG change when we condition on the confounder 'age'. The regression output shows the coefficient estimate for the variable 'Exercise', and the corresponding regression lines are added to the scatter plot. The Directed Acyclic Graph (DAG) below illustrates that age is a common cause of Exercise and Cholesterol.

Options

Regression Output

Output from a linear regression of the variable 'Cholesterol' on 'Exercise' according to the selected options: the regression is estimated either on the entire data set or separately within each age group.
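To make the two regression options concrete, the sketch below fits the same simple linear regression once on the pooled data and once per age group, and compares the slope on exercise. It is an illustration under assumed parameters, not the app's implementation: the helper names `simulate` and `exercise_slope` and all numeric values are hypothetical.

```python
import numpy as np


def simulate(n_per_group=200, seed=1):
    """Simulate hypothetical data in which age confounds exercise and cholesterol."""
    rng = np.random.default_rng(seed)
    age = np.repeat([20, 30, 40, 50, 60], n_per_group)
    exercise = 0.2 * age + rng.normal(0.0, 1.0, size=age.size)
    cholesterol = 5.0 * age - 4.0 * exercise + rng.normal(0.0, 5.0, size=age.size)
    return age, exercise, cholesterol


def exercise_slope(exercise, cholesterol):
    """Least-squares slope from regressing cholesterol on exercise (with intercept)."""
    slope, _intercept = np.polyfit(exercise, cholesterol, deg=1)
    return float(slope)


age, exercise, cholesterol = simulate()

# Option 1: one regression on the entire data set.
pooled_slope = exercise_slope(exercise, cholesterol)

# Option 2: a separate regression within each age group.
slopes_by_age = {
    int(a): exercise_slope(exercise[age == a], cholesterol[age == a])
    for a in np.unique(age)
}

print(f"pooled slope on exercise: {pooled_slope:+.2f}")
for a, slope in slopes_by_age.items():
    print(f"age {a}: slope on exercise {slope:+.2f}")
```

With these assumed parameters, the pooled slope is positive, while each within-group slope is negative and close to the simulated effect of exercise.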

Scatter Plot

DAG

Code

The code is available at the GitHub repository https://github.com/DigitalCausalityLab/simpsonsparadox.

If you find a bug or have suggestions for improvement, please open an issue on GitHub.

References

Glymour, Madelyn, Judea Pearl, and Nicholas P. Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.