Use of Causal Diagrams to Inform the Analysis of Observational Studies

Observational studies usually involve some sort of multivariable analysis. To make sense of the association between an explanatory variable (E) and an outcome (O), it is necessary to control for confounders, such as age in clinical studies. A confounder (C) is a variable that is associated with both E and O. Indeed, it is a cause of both E and O, as shown by the direction of the arrows in Figure 1.

Fig 1. Causal Diagram for a Confounder
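
To make the role of a confounder concrete, here is a minimal simulation sketch. It assumes a linear world with made-up coefficients in which C causes both E and O, while E has no effect on O at all. The function names and all numbers are illustrative assumptions, not taken from any real study.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (one regressor plus an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - mean_y) - b * (xi - mean_x) for xi, yi in zip(x, y)]

def simulate_confounding(n=20_000, seed=1):
    """Hypothetical data: C -> E and C -> O, with no arrow from E to O."""
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]
    e = [ci + rng.gauss(0, 1) for ci in c]
    o = [2 * ci + rng.gauss(0, 1) for ci in c]
    return c, e, o

c, e, o = simulate_confounding()
crude = slope(e, o)  # ignores C
# Adjusting for C: regress the part of O not explained by C
# on the part of E not explained by C.
adjusted = slope(residuals(c, e), residuals(c, o))
print(f"crude slope of O on E:    {crude:.2f}")
print(f"adjusted slope of O on E: {adjusted:.2f}")
```

Under these assumptions the crude slope should come out close to 1 even though E does nothing to O, and the adjusted slope should be close to 0. The residual-on-residual step gives the same coefficient as including C alongside E in a single linear regression.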

A common error is to mistake a confounder for a mediator. If the variable lies on the causal pathway between E and O, then it is a Mediator – M in Figure 2.

Fig 2. Causal Diagram to Distinguish Between a Confounder (C) and a Mediator (M)
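
One way to keep the two roles apart is to write the causal diagram down explicitly and read the roles off the arrows. The sketch below is a deliberate simplification rather than a full adjustment-set algorithm. The edge list is our reading of Figure 2 from the text (C to E, C to O, E to M, M to O), and the extra variable Z, which only ‘precedes’ E, is a hypothetical addition for illustration.

```python
# Arrows of the diagram as (cause, effect) pairs. Z is a hypothetical
# variable added for illustration; it affects O only through E.
EDGES = {("C", "E"), ("C", "O"), ("E", "M"), ("M", "O"), ("Z", "E")}

def descendants(node, edges):
    """All variables reachable from `node` by following arrows forwards."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for cause, effect in edges:
            if cause == current and effect not in found:
                found.add(effect)
                stack.append(effect)
    return found

def role(var, exposure, outcome, edges):
    """Simplified classification of `var` for the exposure -> outcome question."""
    # Mediator: lies on a directed path from exposure to outcome.
    if var in descendants(exposure, edges) and outcome in descendants(var, edges):
        return "mediator"
    # Confounder: causes the exposure, and also reaches the outcome
    # by a route that does not pass through the exposure.
    bypass = {(a, b) for a, b in edges if exposure not in (a, b)}
    if exposure in descendants(var, edges) and outcome in descendants(var, bypass):
        return "confounder"
    return "other"

for name in ("C", "M", "Z"):
    print(name, role(name, "E", "O", EDGES))
```

The point of the exercise is that the classification comes from the arrows, which encode causal theory, and not from the data.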

Failing to make this distinction, and therefore adjusting for M as if it were a confounder, will reduce or remove the estimated effect of E on the outcome. In a study of the effect of money spent on tobacco on lung cancer, it would be self-defeating to adjust for smoking! If we are interested in decomposing different causal pathways, then we should adapt the multivariable analysis to examine how much of the effect of E on O is explained by the putative mediator (M in Figure 2), using a structural equation model or ‘mediation’ analysis.
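
A small simulation sketch shows what adjusting for a mediator does. It assumes a linear model with invented coefficients: E has a direct effect of 0.5 on O and a further effect of 1.0 that runs through M, so the total effect is 1.5. The names and numbers are illustrative assumptions, and the simple difference used here is only one (linear, no-interaction) way of decomposing an effect.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (one regressor plus an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - mean_y) - b * (xi - mean_x) for xi, yi in zip(x, y)]

def simulate_mediation(n=20_000, seed=2):
    """Hypothetical data: E -> M -> O plus a direct arrow E -> O."""
    rng = random.Random(seed)
    e = [rng.gauss(0, 1) for _ in range(n)]
    m = [ei + rng.gauss(0, 1) for ei in e]            # E -> M, coefficient 1.0
    o = [0.5 * ei + 1.0 * mi + rng.gauss(0, 1)        # direct 0.5, via M 1.0
         for ei, mi in zip(e, m)]
    return e, m, o

e, m, o = simulate_mediation()
total = slope(e, o)                                   # M left out of the model
direct = slope(residuals(m, e), residuals(m, o))      # M adjusted for
print(f"total effect of E on O:   {total:.2f}")
print(f"effect after adjusting M: {direct:.2f}")
print(f"share explained by M:     {(total - direct) / total:.2f}")
```

With these assumed coefficients the unadjusted estimate should sit near 1.5 and the M-adjusted estimate near 0.5. Reporting only the adjusted figure would understate the effect of E by about two thirds.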

There are some issues to consider:

  1. It may not be possible to say for certain whether a variable is a mediator or a confounder, and some variables may be both. In that case, try the analysis three ways and compare the results: omit the variable, treat it as a confounder, or treat it as a mediator.
  2. It is hard to know which variables to include as confounders. A dataset was sent for analysis by 29 different teams of statisticians.[1] They came up with results that varied wildly, because they adjusted for different combinations of variables. The corollary is that the choice of variables should not be left to statisticians – it turns on causal theory that distinguishes between variables that are likely to have arrows pointing from E to O via M, and those with arrows pointing from C to both E and O (Figure 2). Context matters!
  3. There may be an interaction between variables, such that the causal effect of one variable on E or O is amplified or attenuated in the presence of another. Four variables, each with four ‘levels’, yield 256 possible combinations of levels, far too many to test for interactions indiscriminately. So, again, theory is needed to determine which variables to include in such interaction tests.
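
The arithmetic behind the third point can be checked in a few lines. This is a sketch under stated assumptions: the four variable names are hypothetical, and the count of interaction terms assumes ordinary dummy coding, in which a variable with four levels contributes three indicator terms.

```python
from itertools import combinations, product

variables = ["A", "B", "C", "D"]   # hypothetical variable names
n_levels = 4                       # levels per variable

# Every distinct combination of levels across the four variables.
cells = list(product(range(n_levels), repeat=len(variables)))

# Pairs of variables that could interact (two-way interactions).
pairs = list(combinations(variables, 2))

# Under dummy coding, each pair needs (levels - 1) * (levels - 1) terms.
terms_per_pair = (n_levels - 1) ** 2
two_way_terms = len(pairs) * terms_per_pair

print(len(cells))       # 256 combinations of levels
print(len(pairs))       # 6 pairs of variables
print(two_way_terms)    # 54 two-way interaction terms
```

Even restricting attention to two-way interactions leaves dozens of terms, which is why some prior view of which interactions are plausible is needed before testing.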

A variable may exist that is an independent cause of C or M (let’s call these C* and M*), as in Figure 3. There is no reason to adjust for these variables. Likewise, do not adjust for any variable that ‘precedes’ E, as also shown in Figure 3.

Fig 3. Variables That Cause Change in Other Variables

In this example, C* and M* are not causally linked to O, except through C and M respectively. But a situation may occur where such a link is possible. It is well known that maternal smoking is causally linked both to low birth-weight and to neonatal death, as per Figure 4. The theory is that smoking is toxic and leads to a small baby and, via that pathway and others, to neonatal death.

Fig 4. Causal Pathway for Smoking and Neonatal Deaths

If this analysis is conducted controlling for ‘small baby’, then smoking is associated with lower mortality – it appears protective. The obvious fault is to control for a variable on the causal pathway, as per Figure 2. But that explains only why the association might be reduced, not why it is reversed.

The explanation for the reversal lies in a putative third variable (perhaps a ‘genetic’ defect, G), which predisposes to both a small baby and neonatal death (Figure 5). Note that both E and G collide on M, and such a scenario leads to ‘collider bias’: controlling for one source of bias opens the door to another. Among small babies, those born to smokers are less likely to carry G, because smoking already accounts for their small size, and so they fare better than the small babies of non-smokers. It is well known that there may be unobserved (‘lurking’) confounders in any association. The same applies, of course, to a variable that might completely alter the meaning of an association once one has conditioned on another variable.

Fig 5. Collider Bias
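
The reversal can be reproduced in a toy simulation. Every probability below is invented purely for illustration and none is an estimate from real data. In this made-up world smoking is harmful both directly and by making babies small, while a rare defect G also makes babies small and sharply raises the risk of death.

```python
import random

def simulate_births(n=200_000, seed=3):
    """Hypothetical births: returns (smoker, small_baby, death) per birth."""
    rng = random.Random(seed)
    births = []
    for _ in range(n):
        smoker = rng.random() < 0.30
        defect = rng.random() < 0.05          # G, unobserved by the analyst
        # Smoking and G both 'collide' on small baby.
        if smoker and defect:
            p_small = 0.90
        elif defect:
            p_small = 0.80
        elif smoker:
            p_small = 0.30
        else:
            p_small = 0.05
        small = rng.random() < p_small
        # Smoking, a small baby and G each raise the risk of death.
        p_death = 0.005 + 0.005 * smoker + 0.02 * small + 0.30 * defect
        death = rng.random() < p_death
        births.append((smoker, small, death))
    return births

def death_risk(births, smoker, small=None):
    """Proportion dying among births with the given smoking (and size) status."""
    deaths = [d for s, b, d in births
              if s == smoker and (small is None or b == small)]
    return sum(deaths) / len(deaths)

births = simulate_births()
print("all babies:   smokers", round(death_risk(births, True), 3),
      "non-smokers", round(death_risk(births, False), 3))
print("small babies: smokers", round(death_risk(births, True, small=True), 3),
      "non-smokers", round(death_risk(births, False, small=True), 3))
```

Across all babies, smokers should show the higher death risk, as the assumed causal model dictates. Restricted to small babies the comparison flips, because conditioning on the collider makes G much more common among the small babies of non-smokers.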

These analyses show that conducting a multivariable analysis is not, or rather should never be, an entirely data-driven / empirical exercise. Choices have to be made, such that the statistical model informs on, but does not determine, the causal model. For a brilliant illustration of extensive causal chains involving confounders, colliders and mediators, see the paper by Andrew Forbes and colleagues.[2]

Note, we are not arguing against adjustment per se. It is an essential part of the analysis. We argue against adjusting without reference to a causal model.

Richard Lilford, ARC WM Director; Sam Watson, Senior Lecturer [With thanks to Peter Diggle (Lancaster University & Health Data Research UK) for comments.]


  1. Silberzahn R, et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Adv Methods Pract Psychol Sci. 2018; 1(3).
  2. Williamson EJ, et al. Introduction to causal diagrams for confounder selection. Respirology. 2014; 19(3): 303-11.
