Okay, in this video we're going to discuss how one-way analysis of variance, or the one-factor general linear model, works. Before getting into how this test works, I want to very briefly discuss why we're visiting this topic in the first place. Why bother learning how this test actually works? You don't need to know exactly how a test works in order to use it. So you could in fact skip this video, and the next video as well, and jump to a later video where we talk about using analysis of variance in practice. There we'll learn what the assumptions are and go through an example of a one-factor general linear model in R. You could do that. So why bother learning how it works?
There are a couple of reasons, which I'll go into very briefly. The big reason, from a teaching perspective, is my belief, misguided or not, that when people understand how something works, it becomes easier to be comfortable with it. I'm aware that some students, not all, can be intimidated by the analysis of data. It can seem like a scary thing. For those students, I think it's really important to understand how analysis of variance works, because what you're going to see is that when we go into the guts of it, it's really not that complicated. It's basically adding up a bunch of numbers and dividing a bunch of numbers, and the logic is relatively straightforward. So my main motivation for explaining how one-way analysis of variance works is just to make people more comfortable with this general topic.
Two other very brief reasons. One is that if you don't understand how the test works or what it's doing, then you're not going to understand all the outputs you get when you actually perform an analysis. And I think it's important to understand what the output actually means.
Learning how the test works is the best way of doing that. The last reason is just that the more you know about how to analyze data, the better off you'll be when you actually are analyzing data. Some of the concepts introduced here are more universally applicable, so this discussion will help your general understanding of statistics overall. That's why we're doing this. Our goal in this video and the next, once again, is just to understand how these tests actually work.
Before jumping in, I want you to note that most of the figures I'm using in this video are modified from this really nice book, Modern Statistics for the Life Sciences by Grafen and Hails. These figures are openly available for anyone to use on this website. I really want to highlight this book because a lot of what I teach in this video, and in subsequent videos that deal with other types of general linear models, is largely driven by the perspective in this book. I think this book is excellent, and if you're looking for a companion book to go along with these videos, this would be a great choice.
Okay, analysis of variance. How does it work? To explain how it works, we're going to imagine a particular experiment that we might conduct. The experiment involves understanding the rate of growth of Scots pine, which can vary greatly among individuals and potentially among locations. Here's a picture of our wonderful Scots pine. There's a Scots pine there, and there, and there. As I've already said, the growth rate of Scots pines can vary a lot. One of the questions we might ask of Scots pines is this: is some of the variation in growth rate, in other words the fact that some individuals grow faster than others, partly due to the underlying rock type that they're growing on?
In other words, can the soil substrate and the rock they're living on influence their growth rate? That's the main question we're going to address with the imaginary experiment we'll be walking through. To answer this question, we would go out into nature and measure the growth rate of Scots pines from multiple forests. And we'd find multiple forests on each of, let's say, three different rock types. So we'd find some forests that are growing on schist, some growing on granite and some growing on sandstone. We have three different treatment groups here, essentially three different environments. We want to compare the average growth rate on sandstone with that on granite and on schist, and we're going to do that with analysis of variance.
This slide represents the overall approach for how analysis of variance works. We have two columns of histograms here: one where we're imagining we have no effect, and one where we're imagining that there is an effect of rock type. In other words, in this second case we're imagining that rock type does influence growth rate, so that growth on some types of rock will be higher than on other types of rock. We're going to walk through this slide to illustrate where analysis of variance gets the signal in the data in order to answer the question of whether or not rock type influences growth.
Let's start by focusing on the left. Here is a histogram of all the growth rates that we had in our imaginary experiment. There are some growth rates that are pretty high, around 25, and some that are relatively low, around five. These are imaginary data, so we're not going to worry about what the units are here. What we're going to imagine is that this total set of data can be split up into our three different environments.
So some of these data will come from schist, some will come from granite, and some will come from sandstone. If we plot the data from each of those groups, schist, granite and sandstone, separately, we can notice two things. The first is that we see a lot of variation in growth rate within schist, within granite and within sandstone. So within each of these environments we see a lot of variation in growth rate. We're going to return to that in a moment. We can also see, though, that the mean values for schist, granite and sandstone are pretty much equivalent. So there's very little variation among our different rock types. The average of this one is pretty similar to the average of this one and of that one. This situation would be consistent with data where there was no effect of rock type.
Let's contrast that with the different situation we have on the right. This top panel is identical to the top panel over here, so it represents our total distribution of growth rates. But now we're imagining that when we split this top histogram into the three different rock types, we get a slightly different situation. Once again, we can see there's lots of variation within each of these groups, and we'll return to that in just a moment. But this time I want you to notice that the mean value tends to be a little bit different among our groups. In other words, it looks like we have variation among the groups, among schist and granite and sandstone, in addition to variation within the groups.
The way analysis of variance works is that it analyzes the variation within the groups and compares that to the variation we see among the groups. Here, with these histograms, we can see that there does appear to be variation among our groups, whereas on the left there really was no visible signal of variation among the groups.
So what analysis of variance does is take the variation among our groups, among the rock types in this scenario, and compare it to the variation within the rock types. It's by comparing the variation among rock types to the variation within rock types that we can infer whether or not there's an effect of rock type on growth rate. That's the big picture for understanding how analysis of variance works. The rest of this video and the next video are basically going to involve going into the details of that larger idea.
Before we do that, I want to stop and think about this variation we see within rock types. I want you to stop the video for a moment and think about what could be causing that variation within the rock types, because that's real variation. Just stop the video for a moment.
Okay, now that you've had a think, I'll provide a few suggestions. It could be that we have variation within these treatments, or within these rock types, just for the boring reason of measurement error. Maybe it's very difficult to measure growth rate, and so maybe it's measurement error that's leading to some of this variation. That's very possible. It's also possible that we have differences in genotypes, so there might be some genotypes that cause individual trees to grow faster than others. Alternatively, there could be differences in the environment. Some of the trees within a particular forest might have more light, so they might be able to grow faster than trees that are more shaded. Some may have more water, et cetera. So this variation that we see within these groups is going to be random variation caused by a combination of measurement error, most likely genetic effects, and almost certainly environmental effects. We can just keep that in mind.
The reason for this error is not really important for the point of this video; it's just for building your biological intuition of what's going on with these data. Let's put that in our back pocket and keep it with us.
So the overall basis of analysis of variance is this. We can take the total variation in growth rate, like we had in the top panel of our previous slide, and we can split that variation into two types: variation among rock types and variation within rock types. Then we can compare those two sources of variation. In particular, we can create a ratio that looks like this, where on the top we have variation both among rock types and within rock types, and we compare that to the variation we find just within rock types.
Now, with this in mind, I want you to stop the video again and ask yourself: what ratio would we expect here if there truly was no effect of rock type? In other words, if there truly should be no differences among the rock types? Stop the video and think about what ratio we would expect if there was no effect of rock type.
Okay, now that you've had a think, I'll say that if there is no effect of rock type, then we really do not expect there to be any variation among rock types, or at least none due to the rock type. In that case, this portion of the numerator, this portion at the top of the fraction, will be equal to 0. If this is equal to 0, what are we left with? We're left with variation within rock types divided by variation within rock types. That's one number divided by itself, which is equal to 1. What this means is that when we perform an analysis of variance, we're calculating this ratio and asking whether or not we have good evidence that this ratio is greater than one.
If it's greater than one, then we have good evidence that there is variation among the rock types. So that's the basis of analysis of variance. We have this ratio of variation among and within rock types divided by variation within rock types, and we compare it to this value of one. If the ratio is greater than one, which I've indicated here, then that is consistent with there being an effect of rock type on growth rate. That's the general basis for how analysis of variance works.
What we're going to do now is talk about how we can quantify this variation. What math would we use to actually calculate the amount of variation within and among rock types? I'll tell you in a moment that this is basically just addition. It's very straightforward. The first step in quantifying the variation among and within rock types is to recognize that we have a total amount of variation in our data set that we are then going to partition into among-group and within-group effects.
So let's imagine that this represents our data set, where each of these data points represents a tree, and let's say each was sampled from a different forest. So we have one, two, three, four, five. These five data points represent the measurements from, let's say, one tree each, sampled from five different forests, and we have a number of different forests here. So we have multiple measurements of growth rate on rock type A, multiple measurements on rock type B, and multiple measurements on rock type C. So schist, granite and sandstone, I think it was.
We would like to quantify the total amount of variation in this data set. How do we do that? We start by calculating the mean value for all of these data, and that mean value is indicated by this dotted line here.
What we can do then is calculate the difference between each data point and this overall mean. This overall mean for all the data I'm just going to call the grand mean. For each of these data points, we calculate the difference between that particular data point and the grand mean. So we'll take that difference, and that difference, and that difference, and we'll do that for all the data points: that one, that one, that one, and so on. Some of these differences are going to be positive, as for all the data points that lie above the mean, and some will be negative, as for the ones below the mean.
What we're going to do then is take these differences and add them up somehow. But if we just add up all of these raw differences, I'll tell you they're going to add up to 0. That's just one of the properties of what it means to have calculated a mean for the data. It's not very useful if these all sum to 0. So instead we're going to take all of these differences and square them. We'll take this difference and square it, this difference and square it, that difference and square it, and so on. We end up with a series of squared values, and all the values that were negative become positive once we've squared them.
We then take all of those squared values and sum them up. When we do that, we get a measure of total variability, and this measure is called our total sum of squares. It's called the total sum of squares because we're quantifying the total variation in our data set.
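The total sum of squares just described can be sketched in a few lines of Python. This is only an illustration: the growth values and the names `growth`, `all_values`, `grand_mean`, `deviations` and `ss_total` are made up for the sketch and do not come from the imaginary experiment or its figures.

```python
# Hypothetical growth rates (arbitrary units), five trees per rock type.
# These numbers are invented purely for illustration.
growth = {
    "schist": [12, 13, 14, 13, 13],
    "granite": [8, 9, 10, 9, 9],
    "sandstone": [4, 5, 6, 5, 5],
}

# Pool every measurement, ignoring which rock type it came from.
all_values = [value for values in growth.values() for value in values]

# The grand mean is the mean of all the data.
grand_mean = sum(all_values) / len(all_values)

# Raw differences from the grand mean: positive above it, negative below it.
deviations = [value - grand_mean for value in all_values]

# Squaring removes the signs; adding the squares gives the total sum of squares.
ss_total = sum(d ** 2 for d in deviations)

print("grand mean:", grand_mean)                   # 9.0
print("sum of raw differences:", sum(deviations))  # 0.0
print("total sum of squares:", ss_total)           # 166.0
```

With these made-up numbers the raw differences cancel out to zero, which is why they are squared before being added up.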
It's called a sum of squares because we've taken some differences and squared them, so those are our squares, and then we've taken all of those squared values and added them up. We've summed them. So we have our sum of squares for the total variation in our data. There's our total sum of squares. I'll just point out that this total sum of squares is our measure of the total amount of variation in our data set, so the larger our total sum of squares is, the more variable the data are.
We're now going to imagine doing the exact same thing, but in a case where the differences among our rock groups are much more obvious, just to make the signal in this example stand out a little bit more. Go back to that previous slide for a second. There might be differences in this figure among the rock types. It looks like the average for rock type A may be a little bit higher than for B, and C might be intermediate. But that's relatively subtle. In this next case, the difference is much more pronounced. Here we have data from rock type A, blue is rock type B, and here is rock type C.
We're going to do the exact same thing. With this more pronounced scenario, we calculate the total sum of squares, which again gives us the total variation. Once again, just to review: we take each of our data points, find the difference between that data point and the overall mean of the data, take that difference and square it, and then add it up. We do that for all the data points here, and that gives us our total sum of squares.
We can take this total sum of squares, this total variation, and divide it into two different sorts. The first sort is what we call the sum of squares within groups. This is the variation we have within our treatments.
To calculate the variation within our groups, the sum of squares within groups, what we would do, or what our statistical package does for us, is calculate a mean value within each of our three different groups. So this line here is the mean value for rock type A, that's the mean value for rock type B, and there's the mean value for rock type C. Now, following what you just learned about how to calculate the total sum of squares, I want to ask you: how do you think we would quantify the sum of squares within treatments? Stop and think about that for a moment.
Okay, now that you've had a think, I'll point out that the way we calculate the sum of squares within treatments is completely analogous to what we did to calculate the total sum of squares. For one treatment at a time, we find the difference between a given data point and the mean for its treatment. We take that difference and we square it. We do that for every single data point in this first treatment. So we take this difference between this data point and that mean, this difference between that point and the mean, and so on. We take all those differences, square them, and then add them up. We do the same thing for our next treatment, and for the next treatment. Then we just add up all of those sums of squares we got from there, there and there.
So once again, we have squared differences. We have squared the differences between our values and the mean, and we've summed them up, which gives us our sum of squares. But we've calculated it differently this time, because we've calculated the sum of squares within the groups. This gives us a measurement of the amount of variation that we have within each of our treatments. Let's ask ourselves again: what do you think would cause this variation within the treatments? I won't ask you to pause the video this time, because we've already talked about this.
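The within-group calculation described above can be sketched the same way. Again, this is a minimal illustration under assumptions: the growth values are invented, and `sum_of_squares` is a hypothetical helper written for this sketch, not a function from any statistics package.

```python
# Hypothetical growth rates (arbitrary units), five trees per rock type.
# These numbers are invented purely for illustration.
growth = {
    "schist": [12, 13, 14, 13, 13],
    "granite": [8, 9, 10, 9, 9],
    "sandstone": [4, 5, 6, 5, 5],
}


def sum_of_squares(values):
    """Sum of squared differences between each value and the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values)


# One treatment at a time: compare each tree with the mean of its own rock type.
ss_within_by_group = {rock: sum_of_squares(values) for rock, values in growth.items()}

# Add up the three per-group sums to get the sum of squares within groups.
ss_within = sum(ss_within_by_group.values())

print(ss_within_by_group)  # {'schist': 2.0, 'granite': 2.0, 'sandstone': 2.0}
print("sum of squares within groups:", ss_within)  # 6.0
```

The only change from the total sum of squares is which mean each data point is compared with: its own group mean rather than the grand mean.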
Earlier on I asked you what would cause variation in growth rate within the treatments, and we talked about how there can be variation among our measurements due to measurement error, there could be genetic effects, there could be environmental effects, et cetera. The variation within treatments that we see on this slide has the same causes as the variation within treatments we saw on a previous slide.
We call this the sum of squares within groups, but we refer to it by a number of different names. We can call it the sum of squares within groups, or the error sum of squares, or the residual sum of squares. So this type of variation goes by a number of different names. Just keep your ears open for any of those possible terms.
We said before that we can take our total variation and divide it into variation within groups and variation among groups. We just talked about how to calculate variation within groups. On this slide, we're going to talk about how you can calculate variation among the groups. To do this, we don't work with the individual data points anymore. Instead, we work with the mean values from each of our three treatments, and we also work with the mean for all the data in total. This dark thick line here is the original grand mean that we were looking at when we calculated our total sum of squares a few slides ago. When we calculate the sum of squares among groups, we're working with that overall grand mean and the mean for each of these different groups.
Can you imagine how we're going to calculate the sum of squares among groups? Just stop and think about this for a moment. Okay, as you might guess, if we want to calculate the sum of squares among groups, we use very much the same approach as we did already.
We take this value, this mean, and find the difference between it and the grand mean, which is given by this vertical dotted line. We take that amount, which in this case looks like it's just greater than four, because this mean here looks like it's just greater than nine. So we take that value that's just greater than four and we square it, giving us a value of just greater than 16. We do the same thing for the small negative value here. We take this value, which looks like it's less than a half, we square it, and we get a value that's probably a little bit less than 0.25. Here again, we have a value that's around four, but this time negative four. We take that value, we square it, and that again gives us a value of around 16. So in this example, we can actually guesstimate what our approximate sum of squares among groups would be. You've got 16 plus 16 plus something well under one, so it would be something around 32. That's how we would calculate our sum of squares among the groups.
I've said here that the sum of squares among groups is due to variation between the treatments. In other words, we can have differences among these means specifically because of differences that are inherent to these different environments. For one thing, our different rock environments may actually influence growth rate, and that can be one cause of this among-group variation. I want to note, however, that there's also some error in our estimate of these means, which arises because of the variation within groups. The within-group variation we talked about on the previous slides is going to introduce some error, and that will cause some random variation in the value of these particular means.
As a result, this variation within groups is also going to affect our calculation of the among-group sum of squares. I almost got myself twisted around there. So what causes variation between groups? Well, we just talked about that. This type of sum of squares goes by a number of different names. We can call it the among-groups sum of squares, or the treatment sum of squares, or other similar terms. Just keep your ears open and keep your brains open, and it should be pretty clear what type of sum of squares we're talking about in a particular situation.
Let's take a step back now and review. We've been talking about an experiment that's summarized by these data here, where we have a number of independent data points of growth rate from three different treatments, and where there looks to be, in this particular example, a small amount of variation within our different treatment groups and a fair amount of variation among our treatment groups.
We're going to wrap up this video in just a moment, and to do that I want to review the ideas we've discussed. In order to perform analysis of variance, what we do, or what the computer does for us, is calculate the total variation in our data, which gets divided into two types: the among-group variation and the within-group variation. We then calculate this ratio. If this ratio is much greater than one, and if the probability of getting a ratio this large just due to random error is sufficiently low, then we can reject our null hypothesis. In other words, if the p-value associated with this ratio is particularly small, then we have good evidence to reject our null hypothesis. In that case, you would conclude that rock type affects growth rate.
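To tie the review together, here is a small Python sketch of the split of total variation into among-group and within-group pieces. It rests on assumptions worth flagging: the growth values are invented, all the variable names are hypothetical, and the among-group sum uses the standard convention of counting each group's squared difference once for every tree in that group, a step the guesstimate in the walkthrough does not spell out. With that weighting, the two pieces add up exactly to the total.

```python
# Hypothetical growth rates (arbitrary units), five trees per rock type.
# These numbers are invented purely for illustration.
growth = {
    "schist": [12, 13, 14, 13, 13],
    "granite": [8, 9, 10, 9, 9],
    "sandstone": [4, 5, 6, 5, 5],
}

all_values = [value for values in growth.values() for value in values]
grand_mean = sum(all_values) / len(all_values)
group_means = {rock: sum(values) / len(values) for rock, values in growth.items()}

# Squared difference between each group mean and the grand mean.
squared_mean_diffs = {rock: (mean - grand_mean) ** 2 for rock, mean in group_means.items()}

# Assumption: standard weighting, one copy of the squared difference per tree.
ss_among = sum(len(growth[rock]) * squared_mean_diffs[rock] for rock in growth)

# Within groups: each tree compared with the mean of its own rock type.
ss_within = sum(
    (value - group_means[rock]) ** 2
    for rock, values in growth.items()
    for value in values
)

# Total: each tree compared with the grand mean.
ss_total = sum((value - grand_mean) ** 2 for value in all_values)

print(squared_mean_diffs)  # {'schist': 16.0, 'granite': 0.0, 'sandstone': 16.0}
print("among:", ss_among, "within:", ss_within, "total:", ss_total)
# among: 160.0 within: 6.0 total: 166.0
```

The unweighted squared differences here add up to 32, and multiplying by the five trees per group gives 160. The sketch stops at the sums of squares and does not compute the ratio or a p-value.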
Okay, that's our review of the big idea of analysis of variance. I just want to end by pointing out that our discussion so far has been slightly misleading. I've been misleading you very slightly. That's because, as we noted in a previous video, ANOVA refers to analysis of variance. Variance. So far we've just been talking about variation. We've been discussing this ratio as a ratio of variation. What we need to do now is convert this ratio of variation into a ratio of variance. How do we do that? Well, that is going to be the topic of our next video. So I'll stop this video now. I hope this video has been helpful, and thank you very much.