Distributions
You will need to refer to previous assignments to complete this assignment.
1. Make a variable called sample.data1 that has 1000 data points, a mean of 500, and a standard deviation of 100.
2. Plot a histogram of sample.data1: hist(sample.data1)
3. This data has a "normal" or Gaussian distribution. Why is this? Think of sampling the height of people in Canada. If you pick people at random off the street, you will have the occasional short person and the occasional tall person, but most people will have a height that is close to the average height in Canada. But again, why is this? As it turns out, in nature, most phenomena (but not all) have a distribution that is normal. Indeed, this is a fundamental principle of research statistics and is formally known as the Assumption of Normality. It is important to note that most classic statistical analyses do not work properly if the assumption of normality is not met. In truth, the assumption of normality is a bit more complex than this, but what I have stated here is roughly true. More on why normal distributions are normal can be found HERE.
4. So, the rnorm command makes data that is normally distributed. You can use the runif command to make randomly (uniformly) distributed data: sample.data2 = runif(1000,1,1000). This command makes a sample of 1000 numbers between 1 and 1000, where every value in that range is equally likely. Look at the histogram of sample.data2 and compare it to the histogram of sample.data1. You should see quite clearly that the distribution of scores is roughly flat and not normal.
5. Recall that we have stated that most naturally occurring phenomena generate data that is normally distributed. It is always worth checking that this is true when working with data. If your data is not normal, then you may have to analyze it differently.
6. Having said Point 5, what is truly important for most statistical tests is the assumption that the sampling distribution of the mean for your data is normally distributed. What is a sampling distribution of a mean? Let's create one. Try the following code:
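# Preallocate a vector to hold the 10000 sample means
sample.means = numeric(10000)
for (counter in 1:10000)
{
  # Draw one sample of 20 values from a normal distribution (mean 300, SD 25)
  sample.data = rnorm(20,300,25)
  # Store the mean of this sample
  sample.means[counter] = mean(sample.data)
}
# Plot the distribution of the 10000 sample means
hist(sample.means)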
So let's look at what this code does. It creates 10000 samples of size 20, each drawn from a normal distribution with a mean of 300 and a standard deviation of 25, and stores the mean of each sample. It then plots the histogram of the sample means. This is a sampling distribution of the mean! It is the distribution of the means of a number of samples. Theoretically, the sampling distribution of the mean shows EVERY VALUE THE MEAN CAN TAKE and THE PROBABILITY OF GETTING A GIVEN MEAN VALUE. If you think of the histogram you have just made, it roughly has both of these properties. What value is at the 50th percentile? One way to think of the sampling distribution of the mean is that it shows the range of values the mean can take (and the likelihood of getting a given value). Experimentally, this is important to know: when we collect a sample of data we hope that the mean of our sample is close to the population mean, but it is important to never forget that it could actually be quite far from the true population mean!
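The questions raised above (the value at the 50th percentile, the range of values the mean takes) can be checked directly on the simulated means. The following is only a sketch under some assumptions: the fixed seed, the variable names n.samples, sample.size, median.of.means, range.of.means and prop.near, and the choice of "within 5 units of 300" are all illustrative and not part of the original code.

```r
# Sketch: summarize the sampling distribution built in the tutorial code.
set.seed(1)  # assumption: fixed seed so the run is repeatable

n.samples <- 10000   # number of samples, as in the tutorial
sample.size <- 20    # size of each sample, as in the tutorial

# Rebuild the vector of sample means (mean 300, SD 25)
sample.means <- numeric(n.samples)
for (counter in 1:n.samples) {
  sample.means[counter] <- mean(rnorm(sample.size, 300, 25))
}

# Value at the 50th percentile of the sample means
median.of.means <- unname(quantile(sample.means, 0.5))

# Smallest and largest sample mean actually observed
range.of.means <- range(sample.means)

# Proportion of sample means that fall within 5 units of 300
# (the cutoff of 5 is an arbitrary illustrative choice)
prop.near <- mean(abs(sample.means - 300) <= 5)

print(median.of.means)
print(range.of.means)
print(prop.near)
```

Note that the observed range only covers the means that happened to occur in this simulation; a different seed or more samples will give slightly different extremes.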
Assignment Questions
1. Make 10 samples of normally distributed data, each with a mean of 500 and a standard deviation of 100, but with increasing sample sizes (10, 20, 30, 50, 100, 1000, 2500, 5000, 10000, 100000). Make a plot that shows the histograms for each of your 10 samples. What happens to the shape of the histogram with increasing sample size? Make sure you repeat this exercise a few times to verify your claim. Send an R script that does this.
2. Repeat Question 1 using randomly (uniformly) distributed data. What changes do you note in the histograms with increasing sample size?
3. Construct a sampling distribution of the mean that has 100000 samples with a size of 30 and a mean of 500 and a standard deviation of 100 using normally distributed data. What is the range of values the mean can take? What is the probability of getting a mean of 460? (THIS IS TRICKY TO DO, BUT I AM SURE GOOGLE CAN HELP YOU!)
4. Repeat Question 3 using randomly distributed data. What do you note about the shape of the sampling distribution of the mean? How does this differ from your answer for Question 2?
5. Repeat Question 3 using 3 different sample sizes (10, 50, 100). How does this change the sampling distribution of the mean?
6. Read the following brief summary HERE.
7. What is the Central Limit Theorem? How does the Lyon paper (above) differ in its explanation of why most data is normally distributed?
8. What is the Standard Error of the Mean?
9. Compute the Standard Error of the Mean for your answer to Question 3. Frequently we compute the Standard Error of the Mean from a data set for a single sample without knowing the sampling distribution of the mean. How is this possible?
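Several of the questions above ask for more than one histogram in a single plot. One possible way to lay this out in base R is with par(mfrow = ...). This is only a layout sketch under stated assumptions: the four sample sizes, the seed, and the names sizes, samples and old.par are hypothetical choices for illustration, not the sizes or script the questions ask for.

```r
# Sketch: draw several histograms on one page using a plotting grid.
set.seed(2)  # assumption: fixed seed so the run is repeatable

sizes <- c(10, 100, 1000, 10000)  # hypothetical subset of sample sizes

# Split the plotting area into 2 rows and 2 columns,
# keeping the previous settings so they can be restored afterwards
old.par <- par(mfrow = c(2, 2))

samples <- list()
for (i in seq_along(sizes)) {
  # One normally distributed sample per panel (mean 500, SD 100)
  samples[[i]] <- rnorm(sizes[i], 500, 100)
  hist(samples[[i]], main = paste("n =", sizes[i]), xlab = "Value")
}

# Restore the original plotting layout
par(old.par)
```

For a different number of panels, change the rows and columns passed to mfrow so that their product is at least the number of histograms.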