Chapter 17. Understanding covariates: simple regression and analyses that combine covariates and factors

Until now, we have considered statistical tools that allow one to compare and estimate differences of averages between groups (e.g., t-tests, 1- and multi-Factor GLMs):  i.e., we have modeled data where the independent variable comprises ‘treatments’ or ‘levels’.  Sometimes we wish to examine effects of a ‘continuous’ variable, instead.  Height, weight, speed, mass, volume, length and density are all examples of ‘continuous’ variables:  they can have any real-number value within a given range.  This chapter introduces approaches to model continuous data as an independent variable.  We refer to continuous independent variables as covariates.

We first introduce linear regression.  This technique fits a straight line through data with continuous data on both the x- and y-axes (the independent and dependent variables, respectively).  Linear regression allows one to estimate the slope (and y-intercept) for the relationship between the x- and y-variables, with appropriate estimates of uncertainty (i.e., standard errors, 95% confidence intervals); it also provides evidence (p-values) to judge whether the estimates of the slope or intercept differ from zero.  (That said, one could also use the output from linear regression to judge whether the line’s slope differs from any arbitrary value, not just zero, which we describe below.)

This technique allows us to ask questions like, “does a flower’s width (independent variable) affect the amount of pollen removed from the flower (dependent variable)?”, or “does metabolic rate (dependent variable) change with ambient temperature (independent variable)?”.  Note that, when conducting linear regression, we assume that:

  1. The covariate (i.e., x-axis; independent variable) affects the dependent variable (y-axis).  Therefore, we must consider the variables’ functional relationship to decide which variable will be independent and which dependent.  For example, in the flower size example, above, we would model flower size and amount of pollen removed as the independent and dependent variables, respectively.  This is because it makes biological sense to hypothesize that flower size affects pollen removal (e.g., by affecting how a pollinator handles a flower), but it makes little sense to hypothesize that the amount of pollen removed would determine how big a flower was.

  2. The covariate (x-variable) is measured precisely (i.e., with little measurement error) or is controlled by the experimenter.

Note that some biological disciplines commonly analyze models that include multiple covariates (i.e., ‘multiple regression’).  We will discuss analyses with multiple covariates in the future. 

We next introduce models that include a single covariate as well as one or more factors.  This approach was once called ‘ANCOVA’ (i.e., ANalysis of COVAriance); in a general linear model context, we simply note that a glm includes one or more covariates and one or more factors.

As we saw in the Chapter, “Analyzing experiments with multiple factors”, models that include both a covariate and at least one factor allow a researcher to assess evidence for multiple hypotheses, simultaneously. 

For example, imagine that we wished to compare the dispersal of seeds from maple trees vs. ash trees.  Both trees produce seeds with ‘wings’, but their morphology differs.  We might conduct an experiment involving a random sample of seeds from each tree species.  We could drop a seed from a known height and measure the distance it travels; we could repeat this process for many seeds for each species at a variety of known heights (ranging from, say, 3 to 25 meters, which spans biologically plausible heights).  With these data, we could address several hypotheses:

  1. Does the covariate (Height) affect Dispersal distance after accounting for effects of the factor (tree Species)?  i.e., this hypothesis tests whether we have evidence for a linear relationship between Height and Distance, accounting for differences between species.
  2. Does Dispersal distance differ between levels of the factor (Tree Species) after accounting for effects of the covariate (Height)?
  3. Do the slopes of the relationships between Height and Dispersal distance differ between levels of the factor (Tree Species)?  i.e., do we find evidence for an interaction between the covariate and the factor?

As noted in the Chapter, “Analyzing experiments with multiple factors”, these hypotheses differ qualitatively from those we might ask with, say, a 1-factor glm.  Therefore, understanding analyses that include covariates increases the scope of biological questions we might investigate beyond simpler methods.  Studies in ecology and evolution analyze models with covariates on a regular basis.  However, my experience is that analyses in biomedical sciences rarely include covariates (with notable exceptions, e.g., epidemiology), but might benefit from doing so.

We might include a covariate in an analysis for several reasons.  Foremost, we could include a covariate in a model because we’re specifically interested in its biological effect; this reason should be self-evident.  However, we might also model the effects of a covariate not because the covariate interests us biologically, per se, but because including the covariate may help us understand effects of another term in our model (e.g., a factor). 

First, we might include a covariate to account for confounding effects in a study.  For example, imagine that we wished to test whether blood pressure differed between adult human females vs. males.  To test this, we might measure blood pressure for an appropriate sample of many females and males and analyze the data with a 1-factor general linear model (blood pressure and Sex would be the dependent and independent variables, respectively). 

However, we might also know that body size can affect blood pressure and that, on average, mass differs between females and males.  Therefore, if we found evidence for differences in blood pressure between females and males in our 1-factor glm, we might wonder whether an apparent effect of Sex arose due to differences in body size between the Sexes, rather than other biological aspects of Sex.  A model that included body size as a covariate would help resolve this issue because the results for the effect of Sex would have accounted for effects of body size; i.e., we test whether Sex affects blood pressure independent of differences due to body size.
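The sketch below illustrates this adjustment numerically, with invented data in which blood pressure depends only on body mass but males happen to be heavier.  We fit bp = b0 + b1*mass + b2*sex by ordinary least squares (sex coded 0 = female, 1 = male), solving the normal equations directly:

```python
# Covariate adjustment sketch (invented data): the raw Sex difference
# is large, but the mass-adjusted Sex effect (b2) is essentially zero.
mass = [55, 60, 65, 70, 75, 80, 85, 90]
sex  = [0, 0, 0, 0, 1, 1, 1, 1]
bp   = [127.5, 130.0, 132.5, 135.0, 137.5, 140.0, 142.5, 145.0]

# Design matrix with an intercept column.
X = [[1.0, m, s] for m, s in zip(mass, sex)]

# Normal equations: (X'X) b = X'y.
xtx = [[sum(row[r] * row[c] for row in X) for c in range(3)]
       for r in range(3)]
xty = [sum(row[r] * yi for row, yi in zip(X, bp)) for r in range(3)]

# Solve the 3x3 system by Gaussian elimination with partial pivoting.
for col in range(3):
    pivot = max(range(col, 3), key=lambda r: abs(xtx[r][col]))
    xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
    xty[col], xty[pivot] = xty[pivot], xty[col]
    for r in range(col + 1, 3):
        f = xtx[r][col] / xtx[col][col]
        for c in range(col, 3):
            xtx[r][c] -= f * xtx[col][c]
        xty[r] -= f * xty[col]
b = [0.0, 0.0, 0.0]
for r in (2, 1, 0):
    b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, 3))) / xtx[r][r]

# Raw difference in mean bp between the sexes vs. the adjusted Sex effect.
raw_diff = sum(bp[4:]) / 4 - sum(bp[:4]) / 4
print(round(raw_diff, 2), [round(v, 2) for v in b])
```

Here the unadjusted Sex difference is 10 units, yet the coefficient for Sex in the covariate model is essentially zero: the apparent Sex effect was entirely attributable to body mass.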

Clearly, this approach can deepen understanding of biology.  Second, we might include a covariate because, if the covariate accounts for a reasonable amount of variation in the dependent variable, we increase statistical power to examine effects of a factor that interests us; again, this provides clear benefits.

Conversely, including a covariate that does not explain a reasonable amount of variation in the dependent variable decreases power to examine effects of a factor (the covariate costs degrees of freedom without much reducing residual variation).  Therefore, we should think carefully about including a given covariate in an analysis.  But, with this careful thinking, covariates improve analyses and provide deeper biological understanding.

The videos and practice problems, below, equip you with the skills to implement models with a covariate.

Document
experimental data ms1 (15.86 KB / XLSX)

The Powerpoint presentation below provides questions that review basic concepts from this chapter.  Note that the questions sometimes present more than one correct answer, and sometimes all the options are incorrect!  The point of these questions is to get you to think and to reinforce basic concepts from the videos.  You can find the answers to the questions in the ‘notes’ section beneath each slide.

