Use of Causal Diagrams to Inform the Analysis of Observational Studies

Observational studies usually involve some sort of multi-variable analysis. To make sense of the association between an explanatory variable (E) and an outcome (O), it is necessary to control for confounders – age for example in clinical studies. A confounder (C) is a variable that is associated with both E and O. Indeed it is causal of E and O as shown by the direction of arrows in Figure 1.

A common error is to mistake a confounder for a mediator. If the variable lies on the causal pathway between E and O, then it is a Mediator – M in Figure 2.

Failure to make this distinction, and to adjust for M, will reduce or remove the effect of E on the outcome. In a study of the effect of money spent on tobacco on lung cancer, it would be self-defeating to adjust for smoking! If we are interested in decomposing different causal pathways, then we should adapt the multivariable analysis to examine how much of the effect of E or O is explained by the putative mediator (M in Figure 2) – a structural equation model or ‘mediator’ analysis.

There are some issues to consider:

1. It may not be possible to say for certain whether a variable is a mediator or confounder and some variables may be both. Then try the analysis three ways: omit it, treat it as a confounder, or treat it as a mediator.
2. It is hard to know which variables to include as confounders. A dataset was sent for analysis by 29 different teams of statisticians.[1] They came up with different results that varied wildly. This was because they adjusted for different combinations of variables. The corollary is that choice of variables should not be left to statisticians – it turns on causal theory that distinguishes between variables that are likely to have arrows pointing from E and O via M, and those pointing from C to both E and O (Figure 2). Context matters!
3. There may be an interaction between variables, such that the causal effect of one variable on E or O is amplified or attenuated in the presence of another. Given four variables, each with four ‘levels’, yields 256 possible first order interactions. So, again, theory is needed to determine which variables to include in such interaction tests.

A variable may exist that is an independent cause of C or M (let’s call these C* and M*), as in Figure 3. There is no reason to adjust for these variables. Likewise, do not adjust for any variable that ‘precedes’ E, as also shown in Figure 3.

In this example, C* and M* are not causally linked to O, except through C and M respectively. But a situation may occur where such a link is possible. It is well known that maternal smoking is causally linked to both low birth-weight and to neonatal deaths, as per Figure 4. The theory is that smoking is toxic and leads to both a small baby and, via that pathway and other pathways, leads to neonatal death.

If this analysis is conducted controlling for ‘small baby’, then smoking is associated with lower mortality – it appears protective. The obvious fault was to control for a variable on the causal pathway, as per Figure 2. But this could explain why the association may be reduced, but not reversed.

The explanation for the reversal lies in a putative third variable (perhaps a ‘genetic’ defect, G), which predisposes to both a small baby and neonatal death (Figure 5). Note, that both E and G collide on M, and such a scenario leads to ‘collider bias’ – by controlling for one source of bias, the door is opened to another. It is well known that there may be unobserved (‘lurking’) confounders in any association. The same applies, of course, to a variable that might completely alter the meaning of an association once one has conditioned on another variable.

These analyses show that conducting a multivariable analysis is not, or rather should never be, an entirely data-driven / empirical exercise. Choices have to be made, such that the statistical model informs on, but does not determine, the causal model. For a brilliant example of extensive causal chains involving confounders, colliders and mediators, see an example from Andrew Forbes and colleagues.[2]

Note, we are not arguing against adjustment per se. It is an essential part of the analysis. We argue against adjusting without reference to a causal model.

Richard Lilford, ARC WM Director; Sam Watson, Senior Lecturer [With thanks to Peter Diggle (Lancaster University & Health Data Research UK) for comments.]

References:

1. Silberzahn R, et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Adv Method Pract Psychological Science. 2018; 1(3).
2. Williamson EJ, et al. Introduction to causal diagrams for confounder selection. Respirology. 2014; 19(3): 303-11.

The Land War in the Fight Against COVID-19

Gone are the days of thinking there is a quick fix to the COVID-19 pandemic. Another country-wide lockdown would reduce COVID-19 infection, but at the same time would damage the economy and pose a threat to other long-term health conditions, with disproportionate effects on the more disadvantaged groups in society. The Great Barrington Declaration – aiming for herd immunity while sequestering high-risk people – does not bear close examination.[1] Vaccination is not an automatic get out of jail card – we do not yet know when vaccination will be available at the required volume, nor what degree of protection it will confer. So, this is the land war. We must work on supply chains, procedures, detection and contact tracing, getting ever slicker at the operation. Personal protection, social distancing and graded lockdowns can all play a part, but only if they are accepted by the general public, who deserve clear explanations of when, where and why unwelcome restrictions will be imposed and what these restrictions are intended to achieve.

While central government has an obvious role to play, it has become clear that the battle must go local; and the more local the better. The risk of being hospitalised with COVID-19 in Birmingham varies dramatically across the various electoral wards, with the seven-day rolling rate of new cases (for week ending 14 October 2020) ranging from 43.8 per 100,000 in Nechells, to 825.8 in Selly Oak.[2] So, supported by the MRC, NIHR ARC West Midlands and our host hospital (University Hospitals Birmingham NHS Foundation Trust) we are developing a computer application to track the evolving pattern of the COVID-19 pandemic. We have developed software that uses geostatistical models to identify “hot spots”, however one defines them, across a broad space such as an urban conurbation or a country. Within such a space we identify localities at whatever scale is relevant for local decision-making and that the data can support. We can map rates of infection per unit of population in real time on these maps to show the current state of the epidemic and its direction of travel (see Example). These maps can direct decision-makers to specific localities where incidence is increasing rapidly and hence where urgent action is needed.

But there is a problem with policy action directed at small areas and particular communities – dictatorial edicts are likely to provoke resentment rather than effective action, especially when carried out at a very local level. It is one thing to place restrictions across a whole country or even a large city, but quite another to try to lockdown an area such as Lady Pool in Birmingham or Chapel Town in Leeds. Indeed, the disease has highest incidence in BAME communities who may feel victimised or disenfranchised. Already only 18% of people fully comply with UK regulations regarding self-isolation.[3] So here we come to the second use of our application and the maps it produces.

We think that policy-makers should increasingly turn to local communities and ask them to be the architects, not recipients, of policy. In essence we are arguing for an ‘assets-based’ or ‘participatory’ approach based on ‘co-invention’. And here our application can help by providing scientific data at a local level in a form that can be easily assimilated. We are arguing at a local level for the type of thing that Prof Chris Witty used at a national level in his Downing Street presentation with the Prime Minister and Chancellor (12 October 2020). There is evidence that populations relate well to local maps and they are sometimes used in qualitative research as a method to promote discussion among people.[4] The approach we are advocating here, of high-risk spatio-temporal identification, followed by case-area targeted intervention, has proven effective in limiting the spread of cholera outbreaks,[5] and we advocate a similar approach with respect to the COVID-19 pandemic.

2. Whether you would like to hear more or use the application when it is developed.
3. Whether you have examples of similar initiatives elsewhere in the world.
4. Whether you would like to collaborate.

Richard Lilford, ARC WM Director; Sam Watson, Senior Lecturer; Peter Diggle, Distinguished Professor at Lancaster University

Example of Real-Time Surveillance of COVID-19

For this example we have aggregated the results to MSOA (middle-layer Super Output Area) level across the catchment area of University Hospitals Birmingham NHS Foundation Trust, although we have retained other areas of Birmingham to make the boundary of the city clear. One could aggregate to smaller or larger levels as needed. A case here is an admission to hospital for COVID-19.

We have produced these outputs as if we were working on March 26 2020 using data from the preceding two weeks. The first thing someone interested in tracking COVID-19 in the city might ask is what is the incidence of the disease that day?

There is a lot of variation across the different MSOAs, with one area standing out as being high (yellow area). The variation here could be explained by differences in demographics or socioeconomic status, and we might want to ask whether any differences are for unexpected reasons. We can break down the incidence into
different components:

Where:

• Expected is the number of cases we would expect that day from each area based on the size of its population.
• Observed shows the relative risk in each area associated with observable characteristics
(age, ethnicity, and deprivation). For example, consider if the average incidence across the city were one case per 10,000 person-days. An area with a larger proportion of older residents would have a high risk; if this risk were double the average then it would have a relative risk of two.
• Latent is the relative risks in each area due to unexplained factors or unobserved
variables. Our area with more older people may have an expected incidence of two cases per 10,000 person-days (a ‘baseline’ of 1 per 10,000 person-days times a relative risk of two), but if we observe an average rate of four cases per 10,000 person-days, then there is an additional unexplained relative risk of 2.
• Posterior SD indicates the predictive variance.

So based on these plots the area with high incidence in the North of Birmingham would appear to be higher than we would expect based on the observed variables by factor of 2 or 3. This may indicate the need for public health intervention. We might finally ask, how this compares to previous days?

The next plot shows the incidence rate ratio, which here is the ratio of incidence compared to seven days prior for each area. A value of one indicates no change, two a doubling, and so forth. One can clearly see that it is above one, i.e. it is increasing, city-wide. The greatest relative increases are centred on the area we identified as being of high concern.

References:

1. Alwan NA, et al. Scientific consensus on the COVID-19 pandemic: we need to act now. Lancet. 2020.
2. Public Health England. Coronavirus (COVID-19) in the UK: Interactive Map. 19 October 2020.
3. Smith LE, et al. Adherence to the test, trace and isolate system: results from a time series of 21 nationally representative surveys in the UK (the COVID-19 Rapid Survey of Adherence to Interventions and Responses [CORSAIR] study). MedRXiv. 2020. [Pre-print].
4. Boschmann EE, Cubbon E. Sketch maps and qualitative GIS: Using cartographies of individual spatial narratives in geographic research. Professional Geographer. 2014;66(2):236-48.
5. Ratnayake R, et al. Highly targeted spatiotemporal interventions against cholera epidemics, 2000-19: a scoping review. Lancet Infect Dis. 2020.