2C. CONFIDENCE INTERVALS
When we do research, we typically measure data for one or more groups/conditions. However, we are not really interested in differences between the particular groups/conditions we happened to measure. We are actually trying to make statements about populations, that is, about the world in general.
For example, imagine we measure the weight of people in Victoria prior to and after an exercise program. Our research question relates to the impact of exercise on weight loss - do people who exercise lose weight? Now, let's say we do some fancy statistics and find out that people do lose weight during an exercise program. Is our real interest related to just the impact of exercise on weight for people who live in Victoria? Or instead, are we more interested in the impact of exercise on weight in general?
To judge how well a finding like this holds in general, we need to compute statistics that allow us to make judgements about a population based on sample data.
One of the most useful statistics for doing this is the 95% confidence interval. A 95% confidence interval is essentially a range around a sample mean that shows where the true population mean plausibly lies. In other words, if a sample mean is 100 with a 95% confidence interval of 75 to 125, we can be 95% confident that the true population mean lies between 75 and 125. Strictly speaking, the 95% describes the procedure rather than any single interval: if we repeated the study many times and computed an interval each time, about 95% of those intervals would contain the true population mean. You can find out more about 95% confidence intervals HERE and HERE.
Load the following data into R: weightdata.txt. Assign column one the name time and column two the name weight. The commands in this lesson assume the data are stored in a data frame called weightdata.
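As a minimal sketch of the loading step, the code below assumes weightdata.txt is a whitespace-separated text file with two columns and no header row. The file layout is not stated in this lesson, so adjust the header and sep arguments if your copy differs. To keep the example runnable without the course file, it first writes a few made-up values to a temporary file; those numbers are hypothetical.

```r
# Hypothetical stand-in for weightdata.txt: two whitespace-separated columns, no header.
# The numbers are made up for illustration only.
tmp = tempfile(fileext = ".txt")
writeLines(c("1 80.2", "1 75.4", "2 78.9", "2 74.1"), tmp)

# Read the file; with the real data, replace tmp with "weightdata.txt".
weightdata = read.table(tmp, header = FALSE)

# Name column one "time" and column two "weight".
names(weightdata) = c("time", "weight")

# Quick check of the structure of the resulting data frame.
str(weightdata)
```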
The formula for a 95% confidence interval is relatively simple. The half-width of the confidence interval (the distance from the sample mean to either end of the interval) is equal to:
The Critical T Value * The Standard Deviation of the Data / The Square Root of the Sample Size
Let's briefly look at what each of these is.
The Critical T Value
The simple version is that the critical t value is a factor that scales the interval appropriately given the sample size and the desired confidence level (e.g., 95%, 99%, etc.). The scaling assumes that the sample mean comes from a sampling distribution of means that is normally distributed - there will be more on this in Section 4. However, the short version is that the critical t value is a scaling factor. Note, the inputs used to compute the critical t value are a probability and the degrees of freedom. The probability starts from 1 minus the desired level of confidence, so 0.05 for a 95% confidence interval. Because the interval has two sides, that 0.05 is split evenly between the two tails of the distribution (0.025 each), so the critical value is taken at the 0.975 quantile. The degrees of freedom are n - 1, or one less than the number of data points the confidence interval is being generated for. For more on degrees of freedom read THIS.
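To see how the critical t value behaves as a scaling factor, here is a small sketch using R's qt function. It assumes a two-sided interval, so the probability passed to qt is 1 minus half of 0.05. The variable names conf, alpha and crit are just illustrative.

```r
conf = 0.95            # desired confidence level
alpha = 1 - conf       # 0.05 for a 95% interval
df = 39                # degrees of freedom for 40 data points (n - 1)

# Two-sided interval: put alpha/2 in each tail and take the upper quantile.
crit = qt(1 - alpha / 2, df)
crit                   # roughly 2.02

# The lower-tail quantile is the same number with the opposite sign.
qt(alpha / 2, df)

# A higher confidence level or a smaller sample gives a larger critical value,
# and therefore a wider interval.
qt(1 - 0.01 / 2, df)   # 99% interval, same sample size
qt(1 - alpha / 2, 9)   # 95% interval, only 10 data points
```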
The Standard Deviation of the Data
See the previous lesson! In short, it is simply the standard deviation of the numbers the confidence interval is computed from.
The Square Root of the Sample Size
The square root of the number of data points the confidence interval represents.
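Written in symbols, the three pieces combine as follows. The notation is an assumption of this sketch rather than something defined earlier in the lesson: the sample mean is \bar{x}, the standard deviation is s, the sample size is n, and the confidence level is 1 - \alpha.

```latex
\[
\text{CI} = \bar{x} \pm t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}
\]
```

For a 95% interval, \alpha = 0.05, so the critical value is the 0.975 quantile of the t distribution with n - 1 degrees of freedom.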
Computing a Confidence Interval in R
For this exercise we will ignore the time variable and just compute the confidence interval for all 40 data points. In R:
df = length(weightdata$weight) - 1
This computes the degrees of freedom of the sample, which is 39 in this instance. I am hoping you understand what the length command does!
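crt = qt(0.975,df)
This computes the critical t value for a 95% confidence interval given the afore-computed degrees of freedom. Because the interval has two sides, the 5% that falls outside it is split into 2.5% per tail, so we ask for the 0.975 quantile. Using qt(0.05,df) instead would give the (negative) critical value for a 90% interval.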
std = sd(weightdata$weight)
The standard deviation of the weight data.
sqn = sqrt(length(weightdata$weight))
The square root of the number of data points.
ci = crt * std / sqn
The half-width of the confidence interval is equal to the critical t value times the standard deviation of the data divided by the square root of the number of data points.
Note, if the critical t value is taken from the lower tail of the distribution (a probability below 0.5), this will be a negative number, so it is appropriate to go:
ci = abs(crt*std/sqn)
This forces a positive value using the absolute value (abs) function.
Finally, the lower and upper bounds of the confidence interval would be:
mean(weightdata$weight) - ci
mean(weightdata$weight) + ci
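The steps above can be collected into one small helper, sketched here under the assumption of a two-sided interval. The function name mean_ci and the example vector w are hypothetical (the numbers are made up, not taken from weightdata.txt). As a sanity check, the result is compared against the interval reported by R's built-in t.test function.

```r
# Hypothetical helper: two-sided confidence interval for the mean of x.
mean_ci = function(x, conf = 0.95) {
  n = length(x)
  crt = qt(1 - (1 - conf) / 2, n - 1)   # critical t value
  half = crt * sd(x) / sqrt(n)          # half-width of the interval
  c(lower = mean(x) - half, upper = mean(x) + half)
}

# Made-up weights for illustration only.
w = c(80.2, 75.4, 78.9, 74.1, 82.5, 77.0, 79.3, 76.8)

mean_ci(w)

# t.test reports the same 95% interval in its conf.int element.
t.test(w)$conf.int
```

With the real data, the call would be mean_ci(weightdata$weight).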
Assignment Questions
1. Compute the confidence intervals for time == 1 and time == 2 for the weight data.
2. Compute the confidence interval for the difference scores (i.e., the difference scores reflecting the difference in weight data between time == 2 and time == 1 for each person).
Challenge Question
1. Plot the mean of the difference scores as a bar graph with the confidence interval range of the difference scores as the error bar.