Chapter 17. Understanding covariates: simple regression and analyses that combine covariates and factors

Until now, we have considered statistical tools that allow one to compare and estimate differences of averages between groups (e.g., t-tests, 1- and multi-Factor GLMs):  i.e., we have modeled data where the independent variable comprises ‘treatments’ or ‘levels’.  Sometimes we wish to examine effects of a ‘continuous’ variable, instead.  Height, weight, speed, mass, volume, length and density are all examples of ‘continuous’ variables:  they can have any real-number value within a given range.  This chapter introduces approaches to model continuous data as an independent variable.  We refer to continuous independent variables as covariates.

We first introduce linear regression.  This technique fits a straight line through data with continuous data on both the x- and y-axes (the independent and dependent variables, respectively).  Linear regression allows one to estimate the slope (and y-intercept) for the relationship between the x- and y-variables, with appropriate estimates of uncertainty (i.e., standard errors, 95% confidence intervals); it also provides evidence (p-values) to judge whether the estimates of the slope or intercept differ from zero.  (That said, one could also use the output from linear regression to judge whether the line’s slope differs from any arbitrary value, not just zero, which we describe below.)

This technique allows us to ask questions like, “does a flower’s width (independent variable) affect the amount of pollen removed from the flower (dependent variable)?”, or “does metabolic rate (dependent variable) change with ambient temperature (independent variable)?”.  Note that, when conducting linear regression, we assume that:

  1. The covariate (i.e., x-axis; independent variable) affects the dependent variable (y-axis).  Therefore, we must consider the variables’ functional relationship to decide which variable will be independent and which dependent.  For example, in the flower size example, above, we would model flower size and amount of pollen removed as the independent and dependent variables, respectively.  This is because it makes biological sense to hypothesize that flower size affects pollen removal (e.g., by affecting how a pollinator handles a flower), but it makes little sense to hypothesize that the amount of pollen removed would determine how big a flower was.

  2. The covariate (x-variable) is measured precisely (i.e., with little measurement error) or is controlled by the experimenter.

Note that some biological disciplines commonly analyze models that include multiple covariates (i.e., ‘multiple regression’).  We will discuss analyses with multiple covariates in the future. 

We next introduce models that include a single covariate as well as one or more factors.  This approach was once called ‘ANCOVA’ (i.e., ANalysis of COVAriance); in a general linear model context, we simply note that a glm includes one or more covariates and one or more factors.

As we saw in the Chapter, “Analyzing experiments with multiple factors”, models that include both a covariate and at least one factor allow a researcher to assess evidence for multiple hypotheses, simultaneously. 

For example, imagine that we wished to compare the dispersal of seeds from maple trees vs. ash trees.  Both trees produce seeds with ‘wings’, but their morphology differs.  We might conduct an experiment involving a random sample of seeds from each tree species.  We could drop a seed from a known height and measure the distance it travels; we could repeat this process for many seeds for each species at a variety of known heights (ranging from, say, 3 to 25 meters, which spans biologically plausible heights).  With these data, we could address several hypotheses:

  1. Does the covariate (Height) affect Dispersal distance after accounting for effects of the factor (tree Species)?  i.e., this hypothesis tests whether we have evidence for a linear relationship between Height and Distance, accounting for differences between species.
  2. Does Dispersal distance differ between levels of the factor (Tree Species) after accounting for effects of the covariate (Height)?
  3. Do the slopes of the relationships between Height and Dispersal distance differ between levels of the factor (Tree Species)?  i.e., do we find evidence for an interaction between the covariate and the factor?

As noted in the Chapter, “Analyzing experiments with multiple factors”, these hypotheses differ qualitatively from those we might ask with, say, a 1-factor glm.  Therefore, understanding analyses that include covariates increases the scope of biological questions we might investigate beyond simpler methods.  Studies in ecology and evolution analyze models with covariates on a regular basis.  However, my experience is that analyses in biomedical sciences rarely include covariates (with notable exceptions, e.g., epidemiology), but might benefit from doing so.

We might include a covariate in an analysis for several reasons.  Foremost, we could include a covariate in a model because we’re specifically interested in its biological effect; this reason should be self-evident.  However, we might also model the effects of a covariate not because the covariate interests us biologically, per se, but because including the covariate may help us understand effects of another term in our model (e.g., a factor). 

First, we might include a covariate to account for confounding effects in a study.  For example, imagine that we wished to test whether blood pressure differed between adult human females vs. males.  To test this, we might measure blood pressure for an appropriate sample of many females and males and analyze the data with a 1-factor general linear model (blood pressure and Sex would be the dependent and independent variables, respectively). 

However, we might also know that body size can affect blood pressure and that, on average, mass differs between females and males.  Therefore, if we found evidence for differences in blood pressure between females and males in our 1-factor glm, we might wonder whether an apparent effect of Sex arose due to differences in body size between the Sexes, rather than other biological aspects of Sex.  A model that included body size as a covariate would help resolve this issue because the results for the effect of Sex would have accounted for effects of body size; i.e., we test whether Sex affects blood pressure independent of differences due to body size.
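The sketch below illustrates this adjustment numerically, with invented data in which blood pressure depends only on body mass but males happen to be heavier.  We fit bp = b0 + b1*mass + b2*sex by ordinary least squares (sex coded 0 = female, 1 = male), solving the normal equations directly:

```python
# Covariate adjustment sketch (invented data): the raw Sex difference
# is large, but the mass-adjusted Sex effect (b2) is essentially zero.
mass = [55, 60, 65, 70, 75, 80, 85, 90]
sex  = [0, 0, 0, 0, 1, 1, 1, 1]
bp   = [127.5, 130.0, 132.5, 135.0, 137.5, 140.0, 142.5, 145.0]

# Design matrix with an intercept column.
X = [[1.0, m, s] for m, s in zip(mass, sex)]

# Normal equations: (X'X) b = X'y.
xtx = [[sum(row[r] * row[c] for row in X) for c in range(3)]
       for r in range(3)]
xty = [sum(row[r] * yi for row, yi in zip(X, bp)) for r in range(3)]

# Solve the 3x3 system by Gaussian elimination with partial pivoting.
for col in range(3):
    pivot = max(range(col, 3), key=lambda r: abs(xtx[r][col]))
    xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
    xty[col], xty[pivot] = xty[pivot], xty[col]
    for r in range(col + 1, 3):
        f = xtx[r][col] / xtx[col][col]
        for c in range(col, 3):
            xtx[r][c] -= f * xtx[col][c]
        xty[r] -= f * xty[col]
b = [0.0, 0.0, 0.0]
for r in (2, 1, 0):
    b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, 3))) / xtx[r][r]

# Raw difference in mean bp between the sexes vs. the adjusted Sex effect.
raw_diff = sum(bp[4:]) / 4 - sum(bp[:4]) / 4
print(round(raw_diff, 2), [round(v, 2) for v in b])
```

Here the unadjusted Sex difference is 10 units, yet the coefficient for Sex in the covariate model is essentially zero: the apparent Sex effect was entirely attributable to body mass.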

Clearly, this approach can deepen understanding of biology.  Second, we might include a covariate because, if the covariate accounts for a reasonable amount of variation in the dependent variable, we increase statistical power to examine effects of a factor that interests us; again, this provides clear benefits.

Conversely, including a covariate that does not explain a reasonable amount of variation in the dependent variable decreases power to examine effects of a factor (the covariate costs degrees of freedom without much reducing residual variation).  Therefore, we should think carefully about including a given covariate in an analysis.  But, with this careful thinking, covariates improve analyses and provide deeper biological understanding.

The videos and practice problems, below, equip you with the skills to implement models with a covariate.

Document
experimental data ms1 (15.86 KB / XLSX)

The Powerpoint presentation below provides questions that review basic concepts from this chapter.  Note that the questions sometimes present more than one correct answer, and sometimes all the options are incorrect!  The point of these questions is to get you to think and to reinforce basic concepts from the videos.  You can find the answers to the questions in the ‘notes’ section beneath each slide.

