2B. Variance
1. Load the file sample_data.txt into a table called "data" using read.table.
2. Check that the mean of column V1 is 298.16 and that the mean of column V2 is 350.79.
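For example, Steps 1 and 2 might look like the following (a minimal sketch - this assumes sample_data.txt sits in your working directory and, as read.table assumes by default, is whitespace-separated with no header row, so the columns are named V1 and V2 automatically):
data = read.table("sample_data.txt")   # read the file into a table called data
mean(data$V1)                          # should come out to 298.16
mean(data$V2)                          # should come out to 350.79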
3. You will note that the individual numbers in each column are not 298.16 or 350.79 - indeed, these numbers VARY quite a bit around the mean. Let's examine that variability by plotting data$V1 and data$V2. Try the following code:
par(mfrow=c(1,2))                  # tells R you want to arrange two plots side by side: 1 row, 2 columns
plot(data$V1, ylim = c(250,400))   # plots data$V1 in the left panel with y-axis limits of 250 and 400
plot(data$V2, ylim = c(250,400))   # plots data$V2 in the right panel with y-axis limits of 250 and 400
Notice that the numbers are distributed about the mean values, in other words, there is VARIABILITY of these values around the mean.
4. Try the following command: data$V3 = data$V1 - mean(data$V1). This creates a new column, data$V3, where each entry is the corresponding value in data$V1 with the mean subtracted. What is the sum of this column? (Hint: sum(data$V3)). This will give you a really small number - in principle the sum should be exactly zero, but because of floating-point rounding error it rarely comes out exactly zero. Try plotting data$V3. You should see that it looks a lot like a plot of data$V1, except that the mean is now zero. This is, of course, because you have subtracted the mean. The values in data$V3 are simply the deviations of the scores from the mean. Recall that in the previous lesson we described the mean as a model. The mean is a model of the data, the best estimate of what the next score will be, or of the data as a whole. However, you can also think of the data this way: data = model + variability, or data = mean + variability. Typically, in statistics, we use the word error instead of variability. The error is the difference between individual scores and the model.
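To see the data = mean + error idea in a quick check (this assumes you have created data$V3 as above):
reconstructed = mean(data$V1) + data$V3   # rebuild each score from the model (the mean) plus its error
all.equal(reconstructed, data$V1)         # the reconstruction should match the original values
sum(data$V3)                              # the errors themselves sum to (essentially) zero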
5. As opposed to keeping all of the "error" scores, we may wish to describe how much variability there is around the mean. Create a new column, data$V4, using the following code: data$V4 = data$V3*data$V3. This column has a special name in statistics: it contains the squared errors from the mean. Importantly, this column does not sum to zero - it sums to a positive number (because there are no negative values once the errors are squared). The sum of this column for a given set of data is called the Sum of Squared Errors, or SSE. If you divide this number by n-1 (the number of data points minus 1, so 99 for this data), you have a quantity called VARIANCE - the variance is a single number that describes the variability in the data. For example, if everyone had the same score as the mean, the variance would be zero because the sum of squared errors would be zero. Compute the variance now: sum(data$V4)/99. We did not actually have to do all this work; usually we would take the shortcut: var(data$V3). Note, the actual number we get for the variance only has meaning if you understand the data. For example, if I tell you the mean of a data set is 60 and the variance is 20, that does not mean a lot to you. However, if I told you that the mean of 60 was an age representing age of death and 20 was the variance, that is quite scary! Imagine what the distribution of age of death would look like! The variance also has relative meaning - you could compare the variance between two groups, and that may be of interest to you.
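For reference, here is the long-way computation written without hard-coding the 99 (a minimal sketch, assuming data$V4 exists as created above):
n = length(data$V1)   # number of data points (100 for this file)
SSE = sum(data$V4)    # the Sum of Squared Errors
SSE / (n - 1)         # the variance, computed the long way
var(data$V1)          # the shortcut; same as var(data$V3), since subtracting a constant does not change the variance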
6. Compute the variance for data$V2. I would recommend using the long approach we used above (by doing the math and creating new columns) and also by using the command var.
7. Standard Deviations. Sometimes researchers use a different quantity to represent the variability in a data set - the standard deviation. Mathematically, the standard deviation is just the square root of the variance: sqrt(var(data$V1)). There is also a simple command for it: sd(data$V1). This would be a really good time to find a textbook and read (a lot more) about variance and standard deviation.
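A quick sanity check (nothing new here, just the two commands from above side by side) that they give the same number:
sqrt(var(data$V1))   # standard deviation computed as the square root of the variance
sd(data$V1)          # built-in command; should match the line above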
8. Sometimes statisticians like to create data to test ideas. In R this is very easy, for instance:
newdata = rnorm(1000, 300, 25)   # creates a new data set called newdata: 1000 scores drawn from a normal distribution with mean 300 and standard deviation 25
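Because rnorm draws random values, the sample mean and standard deviation will be close to, but not exactly, 300 and 25. You can check:
mean(newdata)   # close to 300, but not exactly
sd(newdata)     # close to 25, but not exactly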
9. Another way to examine variance is to use a histogram. Try the following command: hist(newdata). Use a textbook and again read about histograms. Essentially, a histogram breaks your data into bins (R picks a sensible number of bins by default, but you can specify any number) and counts the number of data points that fall into the range of each bin. If you try the following two commands: output = hist(newdata) and then simply output, you should see a lot of information about what a histogram is made up of. Make sure you know this information. NOTE: your histogram should look approximately normal - what some would call a "bell curve". There are reasons for this, and we will talk a lot more about normality later.
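To look at the individual pieces of that output object (using the same names as above):
output = hist(newdata)   # draw the histogram and store its components
names(output)            # the component names, starting with "breaks"
output$breaks            # the bin boundaries
output$counts            # how many data points fall into each bin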
Assignment Questions
1. Why is the sum of a column of numbers with the mean subtracted equal to zero?
2. Use the commands in this assignment to create a plot (which you will submit) that has 1 row, 3 columns, and shows plots of three data sets, each of size 1000 and with a mean of 500, but one with a standard deviation of 10, one with a standard deviation of 50, and the last with a standard deviation of 100. Can you visually see the difference in the variance? HINT: Set your y limits to 0 and 1000 for all three plots for this to work! Describe the increase in variability you see in words. Use a computation of the variance for each data set to justify your description.
3. For the data you created in Question 2, create a new variable called squared errors. This variable should have 6 columns: the first three containing your three columns of data, and the next three containing the corresponding squared errors for each of those columns (so the squared errors form a 1000 row by 3 column block). Compute the sum of squared errors for each of your data columns and report these numbers. Also report the variance of each.
4. Repeat Question 2, but using histograms instead of plots. For your comparison to be valid both the x and y axis limits have to be the same for all your plots! Use hist(mydata$V1,xlim = c(1, 1000), ylim = c(1,300)) as a guide to help you with this.
5. If you did Step 9 above correctly, you would have seen all the outputs of a histogram in R, starting with $breaks. What does each of these outputs mean?