Chapter 15. Dealing with violated assumptions

All statistical tests make assumptions because the mathematical frameworks that underlie the tests require them; if those assumptions are not met, our conclusions may be unreliable.  How do we check whether our data meet these assumptions?  What are the consequences of violating these assumptions?  This chapter addresses both of these questions in the context of 1-factor general linear models; later chapters will elaborate for alternative model types.

If our data do not meet assumptions, what should we do?  First, we need to determine whether the violation is serious:  is it likely to alter our conclusions?  This chapter provides guidance for judging whether violating the assumptions of equal variance and normality may influence your conclusions.  (These two assumptions can be tested with the data; the remaining assumptions of randomization and independence are informed by previous chapters.)  If your data violate assumptions in a serious way, several approaches may resolve the issue:

Determine whether “transforming” the data resolves the issue.  This chapter focuses on data transformation.

Check that your model is adequate.  Sometimes violated assumptions can be remedied by accounting for another variable in your statistical model.  We present an example of this in the chapter ‘Analysing experiments with more than 1 factor’ (see video, ‘Multi-Factor GLM: Introductory Example Analysis’).

Use a different type of model.  For example, Generalized linear models might meet your needs: they can analyse forms of data (e.g., binary, zero-inflated, or ordinal data) that General linear models cannot.  We will address Generalized linear models in later chapters (yet to be developed).  Alternatively, you could develop your own models that explicitly address the aspects of your data that violate assumptions, for example using ‘Stan’, a Bayesian modelling language that can be used from R.  We will address some of these options in the future.

Use computational methods, such as permutation (randomization) tests and bootstrapping.  The chapter ‘Comparing averages between two (or fewer) groups’ provides brief thoughts on ‘bootstrapping’ to compare central values between two distributions.  We also briefly introduced one form of randomization test in the chapter ‘Using R to introduce basic concepts of hypothesis testing’.  We will provide more extensive coverage of computational methods in the future; a minimal sketch appears below.
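
To give a flavour of this option, here is a minimal sketch of a permutation test in R; the data are simulated, and the group labels, sample sizes, and number of permutations are arbitrary choices for illustration.

set.seed(1)
## Simulated example data: two hypothetical groups of 20 observations
y     <- c(rnorm(20, mean = 10), rnorm(20, mean = 12))
group <- rep(c("A", "B"), each = 20)

## Observed difference between the group means
obs_diff <- diff(tapply(y, group, mean))

## Permutation test: shuffle group labels to build the null distribution
perm_diff <- replicate(5000, diff(tapply(y, sample(group), mean)))

## Two-sided p-value: how often does shuffling produce a difference
## at least as extreme as the one we observed?
mean(abs(perm_diff) >= abs(obs_diff))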

A perspective on non-parametric tests

You might notice that we have not mentioned non-parametric tests.  We do not address non-parametric tests for two reasons.  First, non-parametric tests involve (often unappreciated) assumptions that can undermine an analysis.  For example, researchers often use a Mann-Whitney U test to analyse data that fail to meet the assumptions of a t-test.

This approach can be problematic, however, because a Mann-Whitney U test provides evidence to assess whether two distributions differ, not to determine whether median values differ between two distributions.  Therefore, if the two distributions being compared differ in shape, then a small p-value might arise (at least in part) due to shape differences, and not due to differences in median values.  In other words, in order to use a Mann-Whitney U test to evaluate evidence for different median values between groups, the researcher must be confident that the two distributions have sufficiently similar shapes. 
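
A small, hypothetical simulation in R illustrates the point (the distributions and sample sizes below are arbitrary choices).  Both groups have a median of exactly zero, yet the Mann-Whitney U test rejects the null hypothesis far more often than the nominal 5% rate, purely because the two distributions differ in shape.

set.seed(42)
## Two distributions with identical medians (zero) but mirrored skew
p_vals <- replicate(1000, {
  g1 <- rexp(50) - log(2)   # right-skewed; median = 0
  g2 <- log(2) - rexp(50)   # left-skewed;  median = 0
  wilcox.test(g1, g2)$p.value
})

## Proportion of 'significant' results: well above 0.05, despite equal medians
mean(p_vals < 0.05)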

Such unappreciated assumptions make non-parametric tests less desirable than the alternative methods suggested above.  Second, non-parametric methods generally do not provide meaningful estimates of effect size with appropriate measures of uncertainty.  As effect sizes (with appropriate 95% CIs) can offer more insight than p-values (see the chapter ‘Abandon statistical significance’), non-parametric tests offer less than other methods that estimate effect sizes.

I have ‘outlier’ data points – what should I do?

‘Outliers’ are data points that appear unusual.  They can be inconvenient because they may cause a dataset to violate a test’s assumptions.  As a result, it can be tempting to remove outliers from a dataset.  But is this wise?  A problem arises when we choose to remove outliers.  When we collect data to address a biological question, we want the data to be representative of the phenomenon we are studying; therefore, if we remove outliers we run the risk of analysing a dataset that is no longer appropriate for addressing our biological question.  That said, in exceptional cases it is reasonable to remove outliers.  For example, a measurement might be physically impossible, such as a negative ‘mass’.
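
As a minimal sketch (the data frame ‘dat’ and its ‘mass’ column are hypothetical), removing impossible values explicitly in the analysis script documents the decision:

## Hypothetical data frame 'dat' with a 'mass' column
impossible <- dat$mass < 0       # negative mass is physically impossible
sum(impossible)                  # how many measurements are affected?
dat_clean  <- dat[!impossible, ] # keep only possible measurements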

While it makes sense to remove ‘impossible’ measurements, we must be careful not to mistake ‘improbable’ for ‘impossible’, because ‘improbable’ phenotypes do arise occasionally and represent real biology.  Alternatively, a researcher can check their lab notes to determine whether they observed anything unusual about the conditions associated with the outlying data point during the experiment.

For example, if the researcher had previously noted that a focal individual (the outlier) was much harder to measure accurately, we might be justified in removing the outlier because independent information suggests that the observation may have high measurement error.  Even here, however, we must exercise caution.

How do I (Crispin) deal with outliers?  I analyse the data twice: once with the outliers included and once with the outliers removed.  I then compare the conclusions between the two analyses.

If the two analyses yield similar conclusions, then we needn’t worry about the outliers; I would explain in any publication that I analysed the data twice and reached similar conclusions both times.  Alternatively, if the conclusions differ between the analyses, then it becomes difficult to know which conclusion to trust.  In this case, I would publish the results and conclusions of both analyses to allow readers to decide for themselves which conclusion to trust more.  The transparency of this approach benefits science.
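
A minimal sketch of this ‘analyse twice’ approach in R, assuming a hypothetical data frame ‘dat’ with a response ‘y’, a factor ‘group’, and a logical column ‘is_outlier’ flagging the suspect observations:

## Fit the same model with and without the flagged outliers
m_all     <- lm(y ~ group, data = dat)
m_trimmed <- lm(y ~ group, data = subset(dat, !is_outlier))

## Compare effect sizes and their uncertainty between the two fits
summary(m_all)$coefficients
summary(m_trimmed)$coefficients
confint(m_all)
confint(m_trimmed)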

Data transformation

The videos below focus on data transformation to resolve violations of the assumptions of equal variance and normality.  We will check these assumptions using plots of model residuals.  Note that R’s plot() function produces four diagnostic plots when we provide it with output from a general linear model fitted with the function, lm().

For now, we will not use the fourth plot to check assumptions.  You might ask, ‘why?’  The fourth plot provides a means to identify possibly unusual data points.  We will ignore this plot (for now) because we will make it a habit to plot our data (usually plotting individual observations) before we analyse it; this plot of the raw data, along with the first three residual plots, will provide sufficient information to identify unusual data points.  That said, we will use the fourth residual plot when we analyse models with multiple covariates (see Chapter ‘Understanding covariates…’; we will introduce analyses of multiple covariates in future videos).
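
As a minimal sketch of this kind of workflow (the data frame ‘dat’, response ‘y’, and factor ‘group’ are hypothetical), we plot the raw observations, inspect the first three residual plots, and, if needed, refit the model after a log transformation:

## Plot the raw observations first
stripchart(y ~ group, data = dat, vertical = TRUE, method = "jitter")

## Fit a 1-factor general linear model and inspect residual plots 1-3
m <- lm(y ~ group, data = dat)
plot(m, which = 1:3)   # residuals vs fitted, normal Q-Q, scale-location

## If residual spread increases with the fitted values, a log
## transformation of the response may resolve the problem
m_log <- lm(log(y) ~ group, data = dat)
plot(m_log, which = 1:3)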

The PowerPoint presentation below provides questions that review basic concepts from this chapter.  Note that the questions sometimes present more than one correct answer, and sometimes all the options are incorrect!  The point of these questions is to get you to think and to reinforce basic concepts from the videos.  You can find the answers in the ‘notes’ section beneath each slide.

