Okay, in this video we're going to discuss one of the consequences that arises when we conduct experiments with low statistical power. We're going to be addressing this terrifying question: is most published research wrong? I want to start by thanking Nick Colegrave, because he actually created most of the slides in this video. He created them for his own purposes earlier, but then he very kindly shared them with me for purposes like this, and I decided not to reinvent the wheel because the original wheel was so nice. So thank you, Nick. The aim of this video is to understand why low-powered studies affect reproducibility. The ideas in this video first came to light in 2005 in an amazing paper published in PLOS Medicine with, again, a terrifying title: "Why most published research findings are false". We're also going to be drawing heavily from a follow-up paper from 2013, "Power failure: why small sample size undermines the reliability of neuroscience". And of course, I couldn't make a video without pointing you to a nice book by Graeme Ruxton and Nick Colegrave; they talk about these issues in this book as well. Okay. Now, as scientists, our job is to figure out how the world works. We want to know what the world is really like. To figure that out, we create hypotheses, and the most basic type of hypothesis we can create is a null hypothesis, which says that basically there's nothing interesting going on. So, for example, if we wanted to compare the mean values of two different groups, the null hypothesis would be that the mean of this group is equal to the mean of that group; in other words, our null hypothesis would be that there is no difference between the groups. Now, there is some truth to the universe: either the truth is that the null hypothesis is true, or the truth is that the null hypothesis is false. We try to obtain evidence about whether or not the null hypothesis is false by conducting experiments. So we conduct experiments to see whether or not we can reject the null hypothesis. When we do this, there are actually four different types of conclusions we can draw from our experiment, and these four types arise from the different combinations of reality, the truth of the universe, and the outcome we get from our experiment. I'm going to talk about these different combinations using this two-by-two table here. Along the left we have the possibilities for the universe: the truth can either be that the null hypothesis is true, so there really is no difference between these groups, for example; or the truth can be that the groups truly are different, and so the null hypothesis is false. That's the truth. We conduct experiments to try to learn something about this, and our experiments can lead us to two general outcomes: we can either reject the null hypothesis or fail to reject the null hypothesis. Notice that I have not said "accept the null hypothesis". We never accept the null hypothesis; we only reject it or fail to reject it. Okay? Now, it turns out that there are two ways to be correct and two ways to be wrong, given these possible combinations. The first way to be correct is if the truth about the universe is that the null hypothesis is true, and the outcome of our experiment is to fail to reject the null hypothesis.
So the truth is that there's nothing interesting going on, and our experiment fails to give us evidence that there is something interesting going on. This leads us to a correct conclusion, which is that we have no evidence to say that the null hypothesis is false. The other way to be correct is if the null hypothesis truly is false, so that's the truth of the universe, and our experiment causes us to reject the null hypothesis. Okay, so those are the two ways we can be correct. There are two ways to be incorrect as well. The first way is to commit what's called a type one error. This involves the case where the null hypothesis truly is true, so the truth is that there's nothing interesting going on in the universe with respect to our scientific question, and yet our experiment causes us to believe that there is something interesting going on. In other words, the outcome of our experiment causes us to reject the null hypothesis, even though we shouldn't have. Okay? The other way of being incorrect is to commit what's called a type two error. In this case, the truth is that there is something interesting going on in the universe, so the truth is that the null hypothesis is false, yet our experiment fails to reject the null hypothesis. In other words, our experiment fails to give us any signal that would lead us to believe that the null hypothesis was actually false. Okay, that's called a type two error. Before we talk about power, I want to say a little bit more about these type one and type two error rates. The type one error rate is traditionally specified by the value alpha, and the type one error rate is the probability of making a type one error given that the null hypothesis is actually true. This rate of making type one errors depends on how small we decide to set the threshold value that we compare our p-value against. In other words, our type one error rate depends on the critical p-value that we use to decide whether or not to reject our null hypothesis. In biology, the critical p-value is typically 0.05, and the most common approach in biology right now is to reject the null hypothesis if our p-value is less than 0.05. I'm going to stop myself here for a moment and raise one caveat to this whole video. In some other videos, I discussed why this approach of deciding whether or not to reject the null hypothesis by comparing the p-value against an arbitrary threshold is a bad idea. In those videos I explained that the American Statistical Association has now put its foot down and said that this type of approach, where we compare a p-value to an arbitrary threshold value to decide whether or not to reject a null hypothesis, is no longer considered a wise way to do science. However, for the sake of this video, I am going to continue to adopt the perspective where we decide whether or not to reject a null hypothesis depending on whether or not our p-value is less than 0.05. The reason I'm doing that, even though that approach is not wise, is that it really does remain, for the moment anyway, the most common approach in biology. So I'm creating this video from this perspective so that it can inform this approach that remains so common.
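As a little aside that isn't from the original slides: if you want to see that 5% type one error rate in action, here is a minimal Python simulation of the idea. It repeatedly compares two groups drawn from the same distribution, so the null hypothesis really is true, and counts how often a t-test gives p < 0.05 anyway. The group size and number of repetitions are just arbitrary choices for illustration.

```python
# Illustrative sketch (not from the video): when the null hypothesis is true
# and we reject whenever p < 0.05, we should reject ~5% of the time by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the SAME distribution, so the null is truly true.
    group_a = rng.normal(loc=0.0, scale=1.0, size=20)
    group_b = rng.normal(loc=0.0, scale=1.0, size=20)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < alpha:
        false_positives += 1

print(f"Estimated type one error rate: {false_positives / n_experiments:.3f}")  # ~0.05
```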
Okay, so now back to our normal stream of thought. We're stepping back into the frame of mind where we reject our null hypothesis if our p-value is less than 0.05. When we adopt this perspective, we accept the fact that data as extreme as ours could arise by random chance 5% of the time, even if the null hypothesis is true. In other words, if the null hypothesis is true, then, taking this perspective, we will make a type one error 5% of the time. Okay? Now, throughout this video, I'm going to use graphics like this, where I take the two-by-two table we were discussing before and resize its cells to help us visualize some of the ideas we're going to be discussing. It isn't exactly the same two-by-two table, but it is a two-by-two table, just with wonky shapes. In the bottom row of this table we have the case where the null hypothesis is true, and that's where we'll focus our attention for now; in a few slides we'll consider the case where the null hypothesis is false. For the moment, though, we're still thinking about type one errors. When the null hypothesis is true, we will commit a type one error 5% of the time when we use a p-value of 0.05 as our arbitrary threshold for deciding whether or not to reject the null hypothesis. So when our null hypothesis is true, we commit this type of error 5% of the time, but we will make a correct conclusion 95% of the time. Okay? Now let's talk about type two errors. Type two errors are often referred to as false negatives. For a type two error, we're talking about a case where something really is going on in the universe, but our experiment causes us to conclude that there isn't anything going on, very loosely speaking. The convention is to refer to the type two error rate as beta. We can't specify beta very precisely in advance of doing an experiment, because beta will depend on the nature of the data and on what the world is actually like. However, we can do things to control beta when we're designing our experiment. In particular, we can design our experiments to have an appropriate level of statistical power, where statistical power is the probability of not making a type two error if the null hypothesis is false. So statistical power is conventionally described as one minus beta: one minus the probability of making a type two error. Just to remind you of some things we've talked about in previous videos, there are three things that determine the statistical power of an experiment. The first is the size of the effect that we want to understand: the bigger the effect, the higher the power will be. The second is the amount of variability in our data: the less variable our data are, the higher our power will be. And third, for a given experimental design, power will increase as we increase the number of samples. Okay? Now, if we have a high-powered study, and by high-powered I mean a power of 80 percent, that is the convention for what's considered a high-powered study. That does not mean that everyone always designs their study to have 80 percent power; in some cases, you might design a study to have higher power or maybe lower power.
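Another aside that isn't part of the original slides: a simple way to see how effect size, variability, and sample size drive power is to estimate power by simulation. The sketch below repeatedly simulates a two-group experiment in which the null hypothesis is truly false and records how often a t-test rejects it; the particular numbers here (an effect size of 0.5, 20 or 64 samples per group) are just illustrative assumptions.

```python
# Illustrative sketch (not from the video): estimating statistical power by
# simulation as the fraction of experiments that correctly reject a false null.
import numpy as np
from scipy import stats

def estimated_power(effect_size, n_per_group, sd=1.0, alpha=0.05,
                    n_sims=5_000, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        group_a = rng.normal(0.0, sd, n_per_group)
        group_b = rng.normal(effect_size, sd, n_per_group)  # null truly false
        _, p = stats.ttest_ind(group_a, group_b)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

print(estimated_power(effect_size=0.5, n_per_group=20))          # fairly low power
print(estimated_power(effect_size=0.5, n_per_group=64))          # roughly 80% power
print(estimated_power(effect_size=0.5, n_per_group=20, sd=2.0))  # more noise, less power
```

Bigger effects, less noisy data, or more samples all push that estimate upward, which is exactly the set of three determinants of power listed above.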
But those kinds of design decisions are not the point of this video; I'm getting myself slightly off track here. The point is that, by convention, we consider a high-powered study to be one that has 80 percent power. So when our study has this conventional 80 percent power, the landscape of our conclusions will look like this. When the null hypothesis is false, so when the truth of the universe is that there is something interesting going on, then with 80 percent power we will correctly reject our null hypothesis 80 percent of the time, but we will make a type two error, failing to reject the null hypothesis, 20 percent of the time. Okay? So that's what our landscape of conclusions looks like in a high-powered study. If we have a low-powered study, say 20 percent power, this is what our landscape of outcomes will look like. With only 20 percent power, among the cases where the null hypothesis is false, we will correctly reject the null hypothesis only 20% of the time, and we will fail to reject the null hypothesis, making a type two error, 80 percent of the time. Now, traditionally, we tend to focus on the type one error rate when we're thinking about false positives; that's typically where our attention has been. In other words, when thinking about false positives, we're typically thinking about this cell here. That's maybe best summed up by saying that a competent scientist prefers not to know about something than to falsely assert something. That's the philosophy that might explain this focus on type one error rates when thinking about false positives. Power has received less attention in the context of false positives, because it's frequently assumed that power doesn't matter if the study provides a significant result. That's because it's traditionally argued that sample size affects type two errors, but not the type one error rate. In other words, people may often use this argument: my experiment was small, but the results are significant, so even though I had a really small study, my result must be really important. That's the kind of logic that may often be used. But what we're going to point out at this point in the video is that power can affect your probability of publishing a type one error. That is, power can affect the type one error rate among the published studies in the literature. Okay, how does that work? Well, to understand this, we're going to conduct a thought experiment. Let's imagine that there are 200 different laboratories that are all carrying out drug trials on different compounds, on different drugs. Let's assume that half of these labs are studying drugs that actually do work, so for those 100 labs the null hypothesis is actually false, whereas the other 100 labs are working on drugs that truly have no effect. What would our landscape of conclusions look like for high- or low-powered studies in this scenario? That's what we're going to think about. So here's the landscape we were considering before, our landscape of possible outcomes when we are considering a high-powered study, and the one change I've made here is what I've pointed out at the top.
At the top, we have the 100 cases where the null hypothesis is false; that will be the situation half the time. The other half of the time, the null hypothesis will be true. Okay? So let's start by looking at the cases where the null hypothesis truly is true. In this case, among our 100 labs, five of them will commit a type one error, and so they will conclude that the drug has an effect when the truth is that the drug does not have an effect, whereas 95 of those labs will correctly fail to find evidence for an effect of the drug. So that's what we have for the case where the null hypothesis is true. Now, for the cases where the null hypothesis is false, we again have 100 trials. When we have 80 percent power, this means that 80 of those 100 labs will correctly reject the null hypothesis; so 80 percent of the time we find an effect of the drug. Whereas for 20 of the 100 labs, we fail to detect an effect and we commit a type two error. Now, let's just focus on the statistically significant results. We can ask: among these significant results, what is our false positive rate? Well, in this case, our false positive rate among the statistically significant results is equal to 5 divided by (80 plus 5); we're looking at the false positives divided by the total number of significant outcomes. So five divided by 85, which is roughly 6%. What this tells us is that in high-powered studies, about 6% of the published significant results are going to be wrong, based on our scenario of 100 labs testing drugs that do work and 100 labs testing drugs that do not work. So with high-powered studies, in this example, 6% of the published results are wrong. That's not so bad; that's pretty good. How does that change, though, when we have a low-powered study? When our power is 20 percent, only 20 of those 100 labs are likely to detect an effect, so we expect there to be 20 labs that find a difference between the drug and the other treatment, whereas the other 80 labs make type two errors. Now what is our false positive rate when we focus on the statistically significant results? In this case, it's equal to five divided by 25, which is equal to 20%. So when we have low-powered studies, about 20 percent of the published significant p-values are going to be wrong. Okay? Now, this is a case where the null hypothesis is equally likely to be false or true. Not all research areas are likely to be in as nice a situation as the one we've just outlined, where your null hypothesis is false half the time. That would honestly be a pretty nice situation to be in as a scientist, or maybe it wouldn't, depending on your perspective. But it would mean that when you're setting out an experiment and dreaming up a particular hypothesis to test, your null hypothesis will be false very frequently. When we're doing exploratory studies, venturing out into areas of science where we know very little, so when there's very little background knowledge, which we often consider to be cutting-edge science, it's much more likely that our null hypotheses will actually be true. Okay?
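If you'd like to check these numbers yourself before we move on, here is a small helper (my own sketch, not from the papers or the original slides) that reproduces the arithmetic of this thought experiment: the share of statistically significant results that are false positives, given how many labs are studying a real effect, how many are not, the power, and the significance threshold. It also covers the exploratory-field scenario we're about to walk through.

```python
# Illustrative sketch (not from the video): what fraction of the statistically
# significant results are false positives in the 200-labs thought experiment?
def share_of_significant_results_that_are_wrong(n_null_false, n_null_true,
                                                power, alpha=0.05):
    true_positives = n_null_false * power   # labs that correctly reject a false null
    false_positives = n_null_true * alpha   # labs that commit a type one error
    return false_positives / (true_positives + false_positives)

# 100 labs studying a real effect, 100 studying no effect:
print(share_of_significant_results_that_are_wrong(100, 100, power=0.80))  # ~0.06
print(share_of_significant_results_that_are_wrong(100, 100, power=0.20))  # 0.20

# Exploratory field: only 30 of the 200 labs are studying a real effect:
print(share_of_significant_results_that_are_wrong(30, 170, power=0.20))   # ~0.59
```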
And so what we're going to do now is consider what this landscape of outcomes would look like when we still have our 200 labs, but we're in an exploratory field where it's much more likely for the null hypothesis to be true than for it to be false. Let's imagine that we have 170 labs where the null hypothesis is true, and 30 labs where the null hypothesis is false. In this case, here is our number of statistically significant results when we have a low-powered study. In this scenario, you can see that our false positive rate is going to be much higher. When we're in an exploratory field, low-powered studies are going to lead us to incorrect conclusions: among the significant p-values, 59% will be wrong, based on this scenario. In other words, when we're in a field of exploratory research and we're using low-powered experiments, the significant p-values that are published from that type of research are going to be incorrect 59 percent of the time. So what this illustrates is that low-powered studies can increase the proportion of published significant results that are wrong, and this is especially true if we're in an exploratory field of research. Okay? Now, is this a real problem? To answer that question, we need to have a sense of the true level of statistical power that is typical for various fields of biology. The paper that I pointed to at the beginning of this video reaches a very terrifying conclusion: after having looked at a broad array of the neuroscience literature, the authors conclude that the average statistical power of studies in the field of neuroscience is probably no more than between 8% and 31 percent. This harkens back to the terrible low-power scenario we were talking about earlier: in that kind of situation, if you're in an exploratory field and you have low-powered studies, you're very likely to reach false conclusions based on the significant p-values that you find. In other words, a large proportion of the significant p-values that you obtain are likely to lead you to wrong conclusions. What about ecology and evolution? This is a paper from 2016 published in Trends in Ecology & Evolution, and I've pulled it up because it draws together the conclusions from a variety of other papers that examined the amount of statistical power found in various areas of ecology and evolution. The authors point out that studies in these fields typically have about 13 to 16 percent power to detect small effects, about 45 percent power to detect medium effects, and 65 to 72% power to detect large effects. So the situation in ecology and evolution is not a good one either. Okay, so what can we do? Well, there's really one main thing we can do, which is to conduct experiments that have reasonable power. And that's really what I'm trying to emphasize in this video: when we conduct experiments with low power, it can have devastating consequences for our understanding of science and the literature. So I hope that the other videos you might watch in this series can give you advice on how to design good experiments with a reliable level of power, so that we can obtain reliable and reproducible results. I'm going to end the video there; I hope it's been helpful, and thank you very much.