Okay, in this video we're going to address this question: what is independence as it relates to data, say, in an experiment? We're going to address this question from two different perspectives.
We're going to start, though, with a bet. My bet is that you already have a good gut sense of what independence is, and we're going to test that with this example. Let's imagine that we wanted to characterize something about the people who live in the city of Edinburgh: say, their height. And you were considering two different ways of doing this.
The first would be to randomly select 100 people. We can imagine assigning a number to all half a million people who live in the city and then randomly sampling 100 numbers from that list. You pull out those individuals, you measure each of them, and then you determine the average height and the variation in height for the people who live in this city. That's one approach.
The second approach involves randomly sampling not individuals but ten families, and then randomly sampling ten individuals within each of those families. That also gives us 100 individuals. In this case, too, we measure the height of all 100 individuals and the variation among them.
Which of these two approaches do you think is going to better characterize the distribution of height for people in this city? Do you think it's going to be the approach in blue or the approach in green? Stop and think about this for a second.
Okay. Based on my past experience, I'm going to bet that you have chosen the first one, random sampling of individuals, and this is the correct answer. The approach here on the right in green, which involves randomly sampling families, is not going to give rise to independent data, which is the theme of this video.
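One way to check the bet is to run both sampling approaches many times on a made-up city and see how much the estimated average height jumps around from survey to survey. The sketch below is only an illustration under stated assumptions: the city size (2,000 families of 12), the 170 cm mean, the 6 cm family and individual effects, and all function names are invented here and do not come from the video.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed toy city: 2,000 families of 12 people each. Each height (cm) is a
# city mean plus an effect shared by the family plus an individual effect.
MEAN, FAMILY_SD, PERSON_SD = 170.0, 6.0, 6.0
families = []
for _ in range(2000):
    shared = rng.gauss(0, FAMILY_SD)
    families.append([MEAN + shared + rng.gauss(0, PERSON_SD) for _ in range(12)])
everyone = [height for family in families for height in family]


def sample_individuals():
    """Approach 1: 100 people drawn at random from the whole city."""
    return rng.sample(everyone, 100)


def sample_families():
    """Approach 2: 10 random families, then 10 random people in each."""
    heights = []
    for family in rng.sample(families, 10):
        heights.extend(rng.sample(family, 10))
    return heights


def spread_of_sample_means(draw_sample, repeats=500):
    """Standard deviation of the sample mean across repeated surveys."""
    means = [statistics.mean(draw_sample()) for _ in range(repeats)]
    return statistics.stdev(means)


spread_individuals = spread_of_sample_means(sample_individuals)
spread_families = spread_of_sample_means(sample_families)
print(f"city mean height:             {statistics.mean(everyone):.1f} cm")
print(f"spread of means, 100 people:  {spread_individuals:.2f} cm")
print(f"spread of means, 10 families: {spread_families:.2f} cm")
```

Under these assumptions, the family-based surveys should give average heights that vary noticeably more from one survey to the next, even though both approaches measure 100 people. How large the gap is depends entirely on the assumed family effect.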
It may not be clear yet why we would say that these data are not independent, so we're going to spend the rest of this video trying to understand more clearly what is meant by independence. We're going to discuss two different perspectives on independence and this approach to sampling.
Here's the first perspective. We can say that data in a sample are independent if knowing something about one of our subjects provides us with no useful information about something we're interested in for another subject. Non-independence refers to the opposite scenario: our data are non-independent if knowing something about one of our subjects does provide us with useful information about something we're interested in for another subject.
Let's return to our Edinburgh example. How does that relate here? Well, we have sampled families, and we know that families are most often made up of individuals who are related to one another; in other words, individuals who share genes. We also know that height is strongly determined by the genes you have in your cells. And we know that families often experience a similar environment: people within a family will often have the same diet, and they'll experience all sorts of other conditions that are similar. Those shared environmental conditions could also affect the height of people within a family in similar ways.
So suppose I were to take a particular individual from your sample of 100 individuals, where we've randomly sampled ten families, and ask you whether you thought that individual was likely to have above- or below-average height. We can imagine that it would help you to answer that question if I told you that another individual in the sample, who was in the same family, was far above average height.
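The first perspective can be sketched as a small simulation: how often is a person above average height, given that someone else is far above average, when that someone else is a family member versus an unrelated stranger? Everything numeric here is an assumption made for illustration (the 170 cm mean, the 6 cm family and individual effects, and the 10 cm cutoff for "far above average"), and the function names are hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

MEAN, FAMILY_SD, PERSON_SD = 170.0, 6.0, 6.0  # assumed values, in cm
TALL = MEAN + 10.0  # assumed cutoff for "far above average height"


def family_pair():
    """Heights of two people who share a family effect (genes, environment)."""
    shared = rng.gauss(0, FAMILY_SD)
    return (MEAN + shared + rng.gauss(0, PERSON_SD),
            MEAN + shared + rng.gauss(0, PERSON_SD))


def stranger_pair():
    """Heights of two people taken from two different families."""
    return (family_pair()[0], family_pair()[0])


def share_above_average_given_tall_other(make_pair, n_pairs=50_000):
    """Among pairs whose first member is tall, how often is the second
    member above the population mean?"""
    hits = total = 0
    for _ in range(n_pairs):
        other, person = make_pair()
        if other > TALL:
            total += 1
            if person > MEAN:
                hits += 1
    return hits / total


given_tall_relative = share_above_average_given_tall_other(family_pair)
given_tall_stranger = share_above_average_given_tall_other(stranger_pair)
print(f"above average, given a tall relative: {given_tall_relative:.2f}")
print(f"above average, given a tall stranger: {given_tall_stranger:.2f}")
```

With these assumed numbers, knowing about a tall stranger should leave the guess at roughly a coin flip, while knowing about a tall relative should shift it well above that. That shift is what "useful information" means in the first perspective.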
In other words, if I told you that a sibling of the individual whose height I'm asking you to predict was much taller than usual, that would probably lead you to conclude that the individual I've asked you about is also likely to have above-average height. What I'm trying to do here is show more explicitly what I mean by knowing something, and so explain when data are considered independent or not independent. Data are not independent if knowing something about one individual provides you with useful information about another individual. Data are independent if knowing something about one individual provides you with no useful information for predicting the state of another individual. So that's our first perspective.
Here's a second perspective. We can say that data in a sample are independent when, on average, two individuals in a sample are no more similar to one another than two randomly selected individuals from the population. That might sound a bit strange; we'll walk through what I mean with an example in just a moment. So data are independent when the average similarity of individuals within a sample is consistent with the similarity between two randomly chosen individuals in the population that your sample came from. That means that data in a sample are non-independent if, on average, two individuals in the sample are more similar to one another than two randomly selected individuals from the population would be.
Let's flesh this out by being a bit more explicit. Let's imagine that our population looks like this, where each letter represents an individual in our population, and individuals that share a letter, we're going to say, are very similar to one another.
That might be because they share genes or because they share a similar environment. So these four individuals that all have the letter A are very similar to one another, but they're different from this individual with the letter C. And these individuals with the letter Q are all similar to one another, and they're likely to be very dissimilar to this individual with the letter P.
I've been a bit unrealistic here in how I'm characterizing individuals as similar or dissimilar, because that's not how the real world usually works. Usually, how similar individuals are varies continuously. For example, related individuals might be very similar to one another if they're identical twins. A first cousin might be more similar to you than a random person from the population, but that first cousin will be less similar to you than your identical twin. That's what reality is like; I've just tried to simplify things in this case by calling individuals similar or not.
So our second perspective on independence involves trying to determine the average similarity between individuals. What do I mean by that? What I mean is that we could take, say, this individual here, characterized by the letter A, and compare this individual to all the other individuals in the population and find the average similarity between this individual and the others. This individual will be highly similar to this other individual, because they both have the letter A. The first individual will also be very similar to this individual with the letter A, and similarly to this other individual with the letter A. But our first individual will be very dissimilar to these individuals with B and D, and so on for all the rest of the individuals.
Similarly, this individual with the letter Q will be really similar to all the other Qs, but really dissimilar to all the other individuals in the population. We could go through this process of taking each individual, comparing them to every other individual in the population, and calculating their average similarity. Then we could find the average of all these comparisons overall. That would give us a measure of how similar the individuals in this population are to one another.
If we take a random sample of this population, then that similarity among individuals will be reflected in our random sample, and that's what I've tried to illustrate here. You can see that we have some individuals that are similar to one another, because we have two individuals that share the same letter: two letter Qs. But just like in the original population, most individuals are going to be dissimilar to one another. So if we calculated the average similarity among individuals in our sample, we would expect it to be consistent with the average similarity among individuals in our population. Because of that, we can say that our random sample involves independent data: our subjects are independent within our sample.
Let's compare that to a case where we did not sample randomly. Instead, let's imagine that we went into a particular region and just sampled all the individuals that were close to us in that region. We might end up getting lots of Qs if we just sampled here, so we have lots of individuals that are similar to one another from these Qs. Similarly, we might sample all the Us: there are three U individuals here, and we've sampled all of them in our non-random sample.
You can see we have multiple Os, multiple As, and a couple of individuals that are dissimilar to the rest. If we were to go through the process of comparing each individual in this sample to every other and calculating how similar they are, we would find that the average similarity among individuals in this sample is much higher than the average similarity among individuals in our population. Because the similarity among individuals in this second sample is higher than the similarity among individuals in the population, we say that these data are non-independent.
This should now connect naturally back to our Edinburgh example. Since we have ten different families, if we were to compare the similarity among individuals within our sample of 100 individuals, made up of ten families of ten, we would be comparing individuals within the same family much more frequently than we would if we were comparing individuals drawn from the population at large. That would lead to a higher degree of similarity among the subjects in our sample than we would expect to find among individuals in our original population. So this perspective also leads us to conclude that this form of sampling will give us a sample of non-independent data.
So those are our two perspectives for understanding the notion of independence in an experiment. I hope this video has been helpful. In the next video, we're going to take this notion of independence and apply it more explicitly in the context of an experiment. With that, I will stop and say thank you very much.
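For anyone who wants to try the second perspective by hand, the letter example can be sketched in a few lines. This is only an illustration under assumptions that are not in the video: a population of 26 letters with four individuals each, a similarity score of 1 for a shared letter and 0 otherwise, samples of 12, and a layout in which individuals sharing a letter sit next to one another.

```python
import itertools
import random
import string


def average_similarity(individuals):
    """Share of all pairs that carry the same letter (1 = similar, 0 = not)."""
    pairs = list(itertools.combinations(individuals, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


# Assumed toy population: 26 letters, 4 individuals per letter, laid out so
# that individuals sharing a letter are neighbours in the list.
population = [letter for letter in string.ascii_uppercase for _ in range(4)]
population_similarity = average_similarity(population)

# Random sampling: average the result over many random samples of 12.
rng = random.Random(2)  # fixed seed so the sketch is repeatable
random_similarity = sum(
    average_similarity(rng.sample(population, 12)) for _ in range(2000)
) / 2000

# Non-random sampling: take 12 neighbours from one region of the layout.
nearby_sample = population[40:52]
nearby_similarity = average_similarity(nearby_sample)

print(f"population:         {population_similarity:.3f}")
print(f"random samples:     {random_similarity:.3f}")
print(f"one nearby sample:  {nearby_similarity:.3f}")
```

In this toy setup the random samples should, on average, match the population's similarity closely, while the sample of neighbours comes out far more similar than the population. That gap is what marks the second sample as non-independent under this perspective.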