Statistical Adjustment in Nutrition (Part 1 – Why and How?)

Shaun Ward

Founder and Writer at My Nutrition Science

Why is Statistical Adjustment Needed?

I recently published an article named “observational research is valid”. However, one area that I couldn’t fully expand upon for the sake of article length was confounding and statistical adjustment methods. These are both important to comprehend if the goal is to understand nutrition research fully and be confident in supporting or refuting claims about causality.

Understanding why statistical adjustment ties in with causality is explained by the “fundamental problem of causality”. This states that it’s impossible to directly observe a cause in real life because it’s impossible to know definitively what would have happened to someone if they had acted differently. What we think would have happened is known as a ‘potential outcome’ and can only be imputed or estimated. Thus, inferring causality relates to “what if?” questions in counterfactual worlds. As we don’t have access to a counterfactual world, though, we rely on the scientific method and real-world data. For example, we can measure outcome differences between distinct groups, and then we can make assumptions about how those group differences relate to our understanding of causality.

These assumptions about causality are needed because, many times, the outcome differences between groups are simply associative and do not represent a causal effect. Although an association is easily calculated from the observed data, causation must extend the association using models of the counterfactual world. The pivotal question for causality then becomes: under which conditions can real-world data be used to justify a causal inference? By conditions, I am really referring to the external factor(s), called covariate(s), that may otherwise explain the measured exposure-outcome association. For example, we often see systematic covariate imbalances between groups interfering with and distorting the “true” relationship. In these cases, the covariate(s) are named confounders: they are confounding the results of the relationship that we are interested in. Considering this, the better question then becomes “how can we account for confounders so that real-world data can be used to support a causal inference?”. It is for these reasons that researchers have to minimise (ideally eliminate) the influence of confounders through various methods in order to justify a causal inference. One of these methods is using statistical adjustment during the analysis phase.

Statistical Adjustment and Observational Research

Let’s quickly clarify why statistical adjustment is mainly used for observational research and not for randomised controlled trials (RCTs). The primary reason is that RCTs better account for the influence of confounders in the research design phase by way of ‘randomisation’. Randomisation is when participants are assigned to an intervention or control group by chance, with the purpose of distributing covariates equally between groups and leaving the exposure of interest as the only difference between them. This makes it easier to attribute the outcome differences between groups directly to the exposure of interest.

But due to the absence of randomisation in observational research, inferring causal associations is often doubted because confounders are harder to account for. Healthy user bias is the most popular example in nutrition. Since the quantities and types of nutrients and foods are linked to other health-related behaviours (dietary or lifestyle), it is harder to separate them from other influential factors. For instance, a specific food may be consistently associated with higher cardiovascular disease (CVD) risk, but one could argue that groups eating more of that particular food may also eat more food in general, smoke more, be less active, and drink more alcohol than groups that eat less of it. All of these other factors may confound the relationship between the outcome and the nutrient or food that we are interested in.

These research design considerations are directly relatable to a key concept known as “exchangeability”. Exchangeability means that the observed study outcomes would have been the same if one exposure group was (hypothetically) subject to the comparative group’s exposure level. It assumes that we overcome the fundamental problem of causality and can support a causal effect with confidence. Of course, we are never certain of exchangeability because we can’t access a parallel universe, but we can assume exchangeability under certain conditions.

The conditions for exchangeability in RCTs are settled by the randomisation process alone, with little need for further consideration. However, for observational research, because of the unequal distribution of covariates between groups, exchangeability can only be assumed once the confounder effects are removed. More formally, this is known as conditional exchangeability, wherein exchangeability holds only within population subgroups (strata) of the confounding variables. Although there are other research design considerations that help to account for confounders and meet the exchangeability assumption, statistical adjustment methods are usually required in observational research.

Graphical Models for Identifying Confounders

Before we discuss statistical adjustment methods, we should first know how to identify the confounders we actually need to adjust for; otherwise, the adjustment is in itself a rather pointless endeavour. We can only take our causal effect estimates as seriously as we take the conditions that are needed to endow them with a causal interpretation. This is why graphical models of causal assumptions are so important, especially directed acyclic graphs (DAGs). DAGs are used to map out the assumed paths by which an exposure influences an outcome. These causal assumptions are based on prior topic knowledge of the covariates that have a known relationship with the exposure and outcome of interest. By this method, confounding issues can be planned for and controlled.

Looking at DAGs can be a little confusing though. You will typically see multiple arrows pointing from one variable to another and it all appears rather messy. To help break it down, it’s good to be aware of what each variable represents depending on its position within the DAG:

  • A variable with an arrow pointing away from it is called an ancestor of the variable it points toward.
  • The variable that the ancestor points toward is called a descendant.
  • Other than the exposure and the outcome, all variables on a DAG are also named either colliders or non-colliders.
  • A collider is a variable on the path with two arrows pointing into it; it is a shared descendant of the variables on either side of it.
  • Non-collider variables are either forks or mediators. A fork has arrows pointing out of it to the variables before and after it on the path (a shared ancestor). A mediator has an arrow pointing into it and an arrow pointing out of it along the path.

Once all relevant variables have been included, a completed DAG displays how the associations between the exposure and outcome of interest are transmitted through the various paths between them; otherwise said to be through the ‘open’ paths between them.

An exposure-outcome path is said to be ‘open’ conditional on a set of covariates if (1) every collider on the path is in the conditioning set or has a descendant in the conditioning set, and (2) no non-collider on the path is in the conditioning set. Conversely, an exposure-outcome path is said to be ‘closed’ conditional on a set of covariates if either (1) at least one collider on the path is neither in the conditioning set nor has a descendant in the conditioning set, or (2) at least one non-collider on the path is in the conditioning set.
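To make these two rules concrete, here is a minimal sketch of how the open/closed check could be written in code. The function name, the toy path, and the role labels are all hypothetical and exist only to illustrate the conditions above; dedicated DAG software (DAGitty, for example) performs this kind of check across the whole graph for you.

```python
# Minimal sketch (hypothetical example): checking whether a single
# exposure-outcome path is open, given the role of each intermediate
# variable ("collider" or "non-collider") and a conditioning set.
# Descendant information for colliders is passed in explicitly.

def path_is_open(path_roles, conditioning_set, collider_descendants=None):
    """path_roles: dict mapping each intermediate variable on the path
    to "collider" or "non-collider".
    conditioning_set: set of variable names we condition (adjust) on.
    collider_descendants: dict mapping a collider to the set of its
    descendants in the DAG (defaults to none)."""
    collider_descendants = collider_descendants or {}
    for var, role in path_roles.items():
        if role == "collider":
            descendants = collider_descendants.get(var, set())
            # A collider blocks the path unless it (or one of its
            # descendants) is conditioned on.
            if var not in conditioning_set and not (descendants & conditioning_set):
                return False
        else:
            # A conditioned non-collider (fork or mediator) blocks the path.
            if var in conditioning_set:
                return False
    return True

# Backdoor path: exposure <- smoking -> outcome (smoking is a fork).
print(path_is_open({"smoking": "non-collider"}, set()))        # True: open, confounded
print(path_is_open({"smoking": "non-collider"}, {"smoking"}))  # False: closed by adjustment
```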

Anyway, how do we take the above and use it to minimise confounding in observational studies? Well, DAGs are critical for knowing which covariates to condition on to remove confounding from the exposure-outcome relationship and obtain a more accurate representation of the truth. To know which variables to condition on, though, we must identify ‘backdoor paths’ on the DAG. A backdoor path is one that indirectly connects the exposure to the outcome of interest without being part of the mapped causal path, such that even if the exposure has no direct effect on the outcome, the backdoor path would make it seem as though there is an effect. Pearl et al. [1] state that if you condition on a set of covariates blocking all of the backdoor paths from the exposure to the outcome of interest, and no variable within this set is a descendant of the exposure, then controlling for these covariates will appropriately control for confounding. This is referred to as the backdoor path adjustment criterion.

Knowing this makes it simpler to understand why observational studies that do not account for backdoor paths open themselves up to confounding bias. As it relates to DAGs, confounding is defined as an association between exposure and outcome that arises from a biasing path from exposure to outcome, beginning with an arrow into the exposure and ending with an arrow into the outcome. Thus, the best types of observational studies are usually those which account for backdoor paths and minimise confounding. A sufficient condition for the absence of confounding is that none of the variables in the conditioning set are themselves affected by the exposure, and that the conditioning set adequately blocks all backdoor paths from the exposure to the outcome. So, provided a researcher has good knowledge of the causal diagram, the backdoor path criterion can be used to identify a set of covariates that suffices to adjust for confounding.

The practical takeaway here is that, if you’re approaching a causal question with observational research, it’s always good to ask whether the set of covariates used for statistical adjustment are sufficient for the results to be interpreted causally. Or at the very least, can we be certain enough that any possible confounding isn’t going to meaningfully influence the result? The answer to these questions isn’t necessarily “yes” or “no”, but more so relates to a scale of confidence that is partially subjective.

Overadjustment

To follow on from the topic of identifying confounders, it’s good to be aware of the prevalent issue of overadjustment in nutrition research. The issue, specifically, is that some researchers appear to condition on as many covariates as possible with a “more equals better” attitude. This is not appropriate and definitely something to look out for when analysing study methodology.

Simply put, overadjustment is when researchers wrongly condition on a covariate that is on, or is a descendant of a covariate that is on, a causal pathway from exposure to outcome. It can be thought of as an unnecessary or misinterpreted adjustment that introduces bias. This bias is pervasive in nutrition studies because researchers often adjust for the mediator variables through which a nutrient acts. Since nutrients rarely affect an outcome directly and almost always act through mediating pathways, conditioning on a mediator unjustly removes (at least part of) the weight of the exposure-outcome effect. This makes it impossible to estimate the total impact of an exposure on an outcome through all causal pathways. For example, if researchers adjust for cholesterol when analysing the relationship between saturated fat and CVD, this is a form of overadjustment, as we know that saturated fat changes CVD risk via changes to cholesterol levels (or ApoB to be specific).

Thus, when deciding which covariates to control for, we must always be mindful of seeking what is known as a ‘minimally sufficient adjustment set’. This means that every covariate included in the adjustment set is needed to control for confounding, and removing any covariate from the set would no longer suffice to control for confounding. By this definition, a confounder can actually be defined as any covariate within a minimally sufficient adjustment set.

Statistical Adjustment Methods and Stratification

Ok, so now say we have identified our minimal set of covariates needed to measure a causal effect. In this case, the next step is to use statistical adjustment methods to account for these confounders during analysis. Let’s start with stratification because it’s the easiest adjustment method to understand first, and is often the starting point in many textbooks dealing with confounding in the analysis phase.

The use of stratification is based on the idea that even if the population of study has diverse characteristics, we can assume that at least within a subgroup (stratum) of some set of covariates, different groups are comparable in their outcomes and causal effects are better identified. In other words, making outcome comparisons within appropriate levels of the confounder(s) means that the confounder(s) cannot confound the comparisons. If you picture a DAG, stratification is essentially measuring the effect within strata where the arrows from the confounder(s) to the exposure do not exist. A typical example would be stratifying by age (a common confounder) when examining the relationship between obesity and CVD.

Or we can take a hypothetical example more applicable to nutrition: let’s say smoking is a confounder for the effect of red meat on CVD risk. We might then ask “what is the effect of red meat (<3 servings/week vs >3 servings/week) on CVD when adjusted for smoking?”. To quantify this, we can make comparisons across four population strata: (1) high red meat eaters that smoke, (2) high red meat eaters that do not smoke, (3) low red meat eaters that smoke, and (4) low red meat eaters that do not smoke. Once we stratify the data into two or more levels of the confounding factor, it’s then possible to compute a weighted average of the risk ratios or odds ratios across strata, i.e. across subgroups defined by levels of the confounder. The Mantel-Haenszel method is typically used to calculate this weighted average, and the weighted average of risk ratios after stratification for confounder(s) is called the adjusted summary effect estimate.
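As a concrete illustration, below is a minimal sketch of a Mantel-Haenszel pooled risk ratio across smoking strata, comparing high versus low red meat intake. The counts are entirely made up and only demonstrate how stratum-specific comparisons are combined into one adjusted summary estimate.

```python
# Minimal sketch (hypothetical counts): a Mantel-Haenszel pooled risk ratio
# across strata of a single confounder (smoking), comparing high vs low
# red meat intake on CVD events.

def mantel_haenszel_rr(strata):
    """Each stratum is (cases_exposed, total_exposed, cases_unexposed, total_unexposed)."""
    num = 0.0
    den = 0.0
    for a, n1, c, n0 in strata:
        t = n1 + n0
        num += a * n0 / t   # exposed cases weighted by size of unexposed group
        den += c * n1 / t   # unexposed cases weighted by size of exposed group
    return num / den

strata = [
    # (CVD cases, total) in high red meat vs (CVD cases, total) in low red meat
    (30, 200, 20, 300),  # smokers
    (15, 400, 10, 500),  # non-smokers
]
print(f"Adjusted (Mantel-Haenszel) risk ratio: {mantel_haenszel_rr(strata):.2f}")
```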

If we relate this back to the fundamental problem of causality, we can say that we’re using a set of covariates—in this case only one covariate—to make different exposure groups more comparable in their ‘potential outcomes’ so that the exchangeability condition for causality is met. Thus, given a set of measured covariates—what we call “adjusting for” or “conditioning on” those covariates—exchangeability applies much as it does with RCTs. Or to be technically correct, conditional exchangeability is said to apply, because exchangeability in observational research relies on whether we have conditioned on the covariates appropriately. Conditional exchangeability allows us to estimate the average causal effect within strata of covariates.

In sum, because there is no randomisation process in observational research, investigators try to collect data and stratify by, or otherwise control for, a sufficient set of covariates so that the exchangeability assumption is plausible. Graphical models are critical for knowing what covariate data should be collected during the study and conditioned upon during analysis.

The Need for Multivariable Analysis

Although stratified analysis works best when only one or two confounders require consideration, if the number of potential confounders or the level of their grouping is large, we run into many practical and statistical issues. This is certainly the case for multifactorial diseases that have a large number of risk factors in addition to the exposure that we’re interested in. In this case, stratification becomes an unwieldy technique for eliminating confounding.

As seen in the prior section’s example, stratifying by just one variable already requires the analysis of four groups, presented in a 2 × 2 table. The prior example was that if we want to adjust for smoking when making comparisons between high and low red meat eaters, we have to analyse (1) high red meat eaters that smoke, (2) high red meat eaters that do not smoke, (3) low red meat eaters that smoke, and (4) low red meat eaters that do not smoke.

But if we then introduce a second variable by which to stratify—let’s say age—now we have eight groups to analyse (2 × 2 × 2 = 8). Add a third variable—let’s say exercise—now you have 16 groups (2 × 2 × 2 × 2 = 16). Add a fourth—let’s say energy intake—now you have 32 groups (2 × 2 × 2 × 2 × 2 = 32); and this goes on until you have hundreds if not thousands of groups to analyse. Clearly, this is extremely impractical and time-consuming.

Statistically speaking, while one can theoretically stratify based on any number of covariates for a fixed sample size, increasing the number of stratified covariates reduces the amount of data per group. If we have a high number of covariates to control for, which is usually the case in nutrition, we then risk estimating causal effects from data susceptible to high variance because so few participants are in each stratum. Sparse-strata methods such as Mantel-Haenszel can enable analyses when there are few subjects per stratum, but eventually the number will fall to 0 or 1 and the method will no longer be reasonable to apply. For this reason, stratification is rarely used as an exclusive tool to control for multiple confounders in observational nutrition studies. Instead, it is used as an assisting tool in combination with other methods to identify effect measure modification, i.e. to demonstrate that the strength of the association between an exposure and an outcome depends on the value of another factor that is not a confounder.

This highlights the need for multivariable analysis as a statistical adjustment tool in nutrition research, as it offers the only practical solution. I find that the coverage of basic biostatistics books often stops short of multivariable analysis, while multivariable analysis books are littered with mathematical formulas and derivations that hinder basic understanding. I’ll try to bridge the gap here because multivariable analysis is vital for comprehending the many exposure-outcome relationships in nutrition where multiple confounders are at play. Multivariable analysis opens the possibility of adjusting for many confounding variables in just one (assumed true) mathematical model, which provides a chance to replicate the randomisation design of RCTs by statistically approximating equal comparison groups.

Take, for example, a real analysis of smoking and the risk of death. In the bivariate analysis, which does not use multivariable analysis but only stratifies by smoking status, persistent smokers appear to have a lower risk of death compared to nonsmokers. Of course, this finding is misleading because, compared to nonsmokers, persistent smokers were younger, had angina for a shorter time period, were less likely to have diabetes and hypertension, and had less severe coronary artery disease. Only with multivariable adjustment for the baseline differences between the smoking groups was it possible to show that persistent smokers in fact have a significantly greater risk of death than nonsmokers – a more sensible result based on our understanding of the harms of smoking.

How Does Multivariable Analysis Work?

Multivariable analysis is better known as multiple regression. Regression is the general configuration of the mathematical model needed for analysis, and it is used as a measure of the relationship between the mean value of one variable (e.g. an outcome) and corresponding values of other variables (e.g. an exposure). Multiple regression, though, generally concerns averages of an outcome within levels of multiple covariates (the exposure and confounders).

It’s easiest to understand regression by illustrating it on a graph. The vertical Y-axis represents the dependent variable (outcome), while the horizontal X-axis represents the independent variable (exposure). If we are analysing just one independent and one dependent variable, and the relationship is assumed to be linear, this is known as simple linear regression and the linear regression model is: Y = a + bX.

Y is the dependent variable; a is the intercept/constant (the value of Y when X is 0); b is the regression coefficient (the slope/gradient of the line); X is the independent variable.

The objective of linear regression is to find the line of best fit through the individual data points, such that the residuals (the distances of the data points from the regression line) are minimised as much as possible. This is also known as the least-squares regression line. The strength and direction of the linear relationship are measured by the correlation coefficient, or R, which ranges between -1 and 1. Values of -1 and 1 indicate that there is no error and no residuals, meaning that the individual data points all fall perfectly on the regression line; 1 being a perfect positive correlation and -1 being a perfect negative correlation. At the other end, 0 indicates maximum error and residuals, meaning that the regression line is not compatible with the individual data points because they are scattered so randomly. Between these values, we have a number that represents a correlation with some degree of error.
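To ground this, here is a minimal sketch of fitting a simple linear regression by least squares and computing R (and, anticipating the next paragraph, R-squared) on simulated data. All numbers are invented purely for illustration.

```python
# Minimal sketch (simulated data): fitting Y = a + bX by least squares
# and computing the correlation coefficient R and R-squared. Uses numpy
# only; the exposure and outcome values are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)                  # hypothetical exposure
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)    # hypothetical outcome with noise

b, a = np.polyfit(x, y, 1)                   # slope b and intercept a
r = np.corrcoef(x, y)[0, 1]                  # correlation coefficient R

print(f"Y = {a:.2f} + {b:.2f}X, R = {r:.2f}, R-squared = {r**2:.2f}")
```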

If we then take the R-value and multiply it by itself, we obtain R-squared. R-squared is a statistical measure that represents the proportion of the variance in the dependent variable (outcome) that is explained by the independent variable (exposure) in a regression model. So, if the R-squared of a model is 0.50, then approximately 50% of the observed variation in the outcome can be explained by the model’s exposure input. However, R-squared only works as intended in a simple linear regression model with one exposure and one outcome variable, with no confounders to adjust for. When we need to account for several confounders, which is almost always the case, the R-squared must be adjusted.

Thus, we typically rely on multiple linear regression to analyse complex relationships. Contrary to how it sounds, though, it is not just a series of simple linear regressions; rather, it combines multiple independent variables (the exposure and the confounders) affecting a single outcome into one generic equation that is similar to simple linear regression. The generic algebraic model that is used for all multivariable analyses is: Y = b0 + b1X1 + b2X2 + b3X3 + …

Y is the dependent (outcome) variable; b0 is the intercept or constant; b1, b2, b3, and so on are the individual regression coefficients showing the impact of each X on Y; X1, X2, X3, and so on are the independent variables. X1 is usually the exposure of interest, and the following X’s are the confounders that need to be used for statistical adjustment.

So, what on earth is this model actually doing in concept? Well, the best way to think of it is as conceptually similar (though not identical) to regressing the outcome on the confounder variable(s), then taking the residuals of that model and regressing them on the exposure variable. This is often termed “holding the confounders constant” and helps to quantify how much of the change in the outcome is explained by variation in the exposure, independent of the effect of the confounder(s) on the outcome. If it so happens that the confounder effects on the outcome explain away the original exposure-outcome association, then a causal relationship is unlikely. But if there still appears to be a positive or negative association between the exposure and outcome after accounting for the confounder effects, then a causal relationship is plausible.
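Here is a minimal sketch of that idea on simulated data: a hypothetical confounder (“age”) influences both the exposure and the outcome, so the unadjusted coefficient is biased away from the true effect, while the multivariable model that includes the confounder recovers something close to it. All variable names and effect sizes are invented for illustration.

```python
# Minimal sketch (simulated data): adjusting for a confounder with
# multiple linear regression. "age" confounds the exposure-outcome
# relationship; the unadjusted slope is biased, the adjusted one is not.

import numpy as np

rng = np.random.default_rng(1)
n = 5_000
age = rng.normal(50, 10, n)                                   # confounder
exposure = 0.1 * age + rng.normal(0, 1, n)                    # age influences exposure
outcome = 0.3 * exposure + 0.2 * age + rng.normal(0, 1, n)    # true effect of exposure = 0.3

def ols_coefs(y, *predictors):
    """Least-squares coefficients for y regressed on an intercept plus predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

unadjusted = ols_coefs(outcome, exposure)        # Y = b0 + b1*exposure
adjusted = ols_coefs(outcome, exposure, age)     # Y = b0 + b1*exposure + b2*age

print(f"Unadjusted exposure coefficient: {unadjusted[1]:.2f}")  # biased upward
print(f"Adjusted exposure coefficient:   {adjusted[1]:.2f}")    # close to 0.3
```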

Another great way to conceptualise multiple regression is with a 3-dimensional graph, as opposed to the standard 2-dimensional graph used for simple regression. When we add other independent variables (confounders) to our mathematical model, we are essentially adding another dimension to a standard 2D graph. When moving from two dimensions to three, things change: with two dimensions we have a line, but with three dimensions we have a plane (i.e. a flat surface). This means that the regression line (least-squares line) is now more of a regression plane; however, the same concept applies, whereby the regression minimises the residuals around the 3D plane. Achieving this, it is then possible to reduce what is a relatively complicated pattern of 3D data back into a 2D regression line of the exposure-outcome relationship once modelled with, or adjusted for, confounders. This formulates a simplified picture of reality.

So, whereas before the R-squared was telling us the proportion of variation in the dependent (outcome) variable that is explained by variance in the independent (exposure) variable, multivariable regression gives us an adjusted R-squared that tells us the proportion of outcome variance explained by the model once the confounder effects are added, penalised for the number of variables included. As such, the adjusted R-squared adds precision and reliability to R-squared by considering the impact of additional independent variables that would otherwise skew the original R-squared measurements.

The adjusted R-squared increases only if a newly added confounder improves the model more than would be expected by chance, and it decreases when a confounder improves the model by less than expected by chance. However, the ordinary R-squared always increases with every confounder added, regardless of whether the confounder is legitimate or erroneous. This can be misleading and again highlights the need to ensure that we use a minimally sufficient adjustment set for multivariable analysis.
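For reference, the usual adjusted R-squared is a simple function of the ordinary R-squared, the sample size, and the number of predictors; a minimal sketch with made-up numbers is below.

```python
# Minimal sketch: adjusted R-squared from R-squared, sample size n, and
# number of predictors k (the formula itself, not tied to any dataset).

def adjusted_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"{adjusted_r_squared(0.50, n=200, k=5):.3f}")  # slightly below 0.50: extra predictors are penalised
```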

So, now that we have covered the concept of multivariable analysis, we can see that it differs a heck of a lot from stratification. The main difference is that regression analyses pool everyone in the study into the same model, rather than comparing one exposure group to another per se. So, whereas regression analyses estimate effects as contrasts of averages in the same population under the same conditions, stratification does the opposite and measures associations in different populations under different conditions. Both, though, are helpful tools for statistically adjusting observational data to support or reject causal inferences.

To Note

I hope that you have enjoyed this article up to this point. We will call this part 1 to separate the simpler concepts from the more complex ones. The latter will be dealt with in part 2, which will include topics such as the types of multivariable analysis, regression assumptions, and nutrition-specific adjustment issues. Catch you next time!
