Unpaired Two-Sample t Test in R
What is an unpaired-samples t test?
Like paired-samples t tests, independent-samples t tests also test hypotheses about differences between two means; however, the means are for the same variable but for two different populations. For example, the following research hypotheses would use independent-samples t tests:
1. Biology graduates have a different average annual income than chemistry graduates.
2. Husbands average more hours of sleep per night than wives.
3. Hotdogs have a different average calorie content than cheeseburgers.
As with paired-samples t tests, there are again two groups of scores. However, this time, the scores in both groups are scores on the same variable. What distinguishes the groups is that they represent different populations: for example, biology graduates and chemistry graduates, or hotdogs and cheeseburgers.
The Independent-Samples T Test procedure is used to compare just two groups in the population. It would not be used to test a hypothesis about differences in the average number of calories for hotdogs, cheeseburgers, and Jewish pizza. Comparisons of three or more groups require a different procedure (one-way analysis of variance), one presented in the next chapter.
Before we start using software, we should know how the test works by hand, so let's see the formulas!
The formula for the unpaired t test with equal variances is:
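In standard textbook notation, the pooled (equal-variance) statistic can be written as follows. The symbols are assumed here, not taken from the original: sample means \bar{x}_1 and \bar{x}_2, sample variances s_1^2 and s_2^2, sample sizes n_1 and n_2, and pooled variance s_p^2.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2},
\qquad
df = n_1 + n_2 - 2
```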
The formula for the unpaired t test with unequal variances is:
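A common textbook form of the unequal-variance (Welch) statistic is sketched below. The notation is assumed here: sample means \bar{x}_i, sample variances s_i^2, and sample sizes n_i. The degrees of freedom use the Welch-Satterthwaite approximation and are usually not a whole number.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}
```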
Using the formulas above, we test the difference between the means of two independent samples, with the following hypotheses:
H0: There is no difference between the two means
H1: There is a difference between the two means
For the example we use the following data. [Data taken from Moore, D. and McCabe, G. (2002) Introduction to the Practice of Statistics, New York: WH Freeman.]
A US magazine, Consumer Reports, carried out a survey of the calorie content of a number of different brands of hotdogs. The calorie content of 20 beef and 17 poultry hotdogs was recorded; the values are listed in the Data preparation section below.
Question: Is there a difference in calorie content between beef and poultry hotdogs?
The hypotheses are:
H0: There is no difference in mean calorie content between the beef and poultry hotdogs
H1: There is a difference in mean calorie content between the beef and poultry hotdogs
Before using the formula, we need the quantities that go into it: the sample size, mean, and standard deviation of each group.
With those values in hand, we plug them into the formula to get the t statistic.
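The hand calculation can be reproduced in a few lines of R. This is a minimal sketch that assumes the beef and poultry values listed in the Data preparation section; the variable names (n1, v1, se_welch, and so on) are introduced here for illustration only. It computes both the unequal-variance (Welch) and the equal-variance (pooled) versions so the two can be compared.

```r
# Calorie values as listed in the Data preparation section
beef <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111,
          141, 153, 190, 157, 131, 149, 135, 132)
poultry <- c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 135, 142, 86,
             143, 152, 146, 144)

# Ingredients of the formula: sample sizes, means, and variances
n1 <- length(beef);  n2 <- length(poultry)
m1 <- mean(beef);    m2 <- mean(poultry)
v1 <- var(beef);     v2 <- var(poultry)

# Unequal-variance (Welch) statistic and Welch-Satterthwaite df
se_welch <- sqrt(v1 / n1 + v2 / n2)
t_welch  <- (m1 - m2) / se_welch
df_welch <- se_welch^4 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Equal-variance (pooled) statistic and df
sp2       <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled  <- (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2

round(c(t_welch = t_welch, df_welch = df_welch,
        t_pooled = t_pooled, df_pooled = df_pooled), 3)
```

The Welch statistic should land close to the hand-computed t value quoted in the text, and the pooled statistic comes out slightly larger.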
Once we have the value of t, we compare it with the critical value from the t table to decide about the hypothesis. For that we first need the degrees of freedom. With 20 beef and 17 poultry observations, the equal-variance test has df = 20 + 17 - 2 = 35, and the unequal-variance (Welch) test has a slightly smaller, non-integer df of about 32. We use a 95% confidence level (two-sided alpha = 0.05). If the t table has no row for the exact df, don't be confused: use the nearest smaller df that is listed, which gives a slightly conservative critical value. For df = 30 the two-sided critical value is 2.042.
The conclusion is:
Because the t statistic, 4.301, is greater than the critical value from the t table, 2.042, we reject H0 and conclude that the mean calorie content of beef hotdogs is significantly different from that of poultry hotdogs.
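Instead of reading a printed table, the critical value can be looked up with R's qt() function. The sketch below assumes a two-sided test at alpha = 0.05, the t value quoted in the text, and the equal-variance degrees of freedom n1 + n2 - 2 for samples of 20 and 17; the variable names are illustrative.

```r
alpha  <- 0.05
t_stat <- 4.301            # t value quoted in the text
df_eq  <- 20 + 17 - 2      # equal-variance degrees of freedom (assumption)

# Two-sided critical values: exact df, and the nearest smaller table row
crit_exact <- qt(1 - alpha / 2, df = df_eq)
crit_table <- qt(1 - alpha / 2, df = 30)

# Decision rule: reject H0 when |t| exceeds the critical value
reject_h0 <- abs(t_stat) > crit_exact

# Two-sided p-value for the same t and df
p_two_sided <- 2 * pt(-abs(t_stat), df = df_eq)

c(crit_exact = crit_exact, crit_table = crit_table)
reject_h0
p_two_sided
```

Using qt() avoids interpolating between table rows, which matters most when the degrees of freedom are small.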
Okay, let's try solving this case with R!
Before that, here is the flow of a t test in R:
1. Data preparation
2. Packages
3. Normality test
4. Variance test
5. T test
Data preparation
Here we use the same beef and poultry case as before: put the values into two vectors and build a data frame from them.
# calculate with R
beef <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111,
          141, 153, 190, 157, 131, 149, 135, 132)
poultry <- c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 135, 142, 86,
             143, 152, 146, 144)

# create data frame
my_data <- data.frame(
  group = rep(c("beef", "poultry"), times = c(20, 17)),
  calorie = c(beef, poultry)
)
View(my_data)
With the data prepared, the next step is to install the packages needed to compute the t test in R. If the packages are already installed, just load them with library().
library(car)    # extra diagnostics (optional: shapiro.test() and var.test() are in base R's stats package)
library(ggpubr) # visualization
library(dplyr)  # summary statistics
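One way to check which of these packages still need installing is sketched below. The variable names pkgs, installed, and missing_pkgs are illustrative, and the install line is left commented out so the snippet does not download anything by itself.

```r
# Packages used in this walkthrough
pkgs <- c("car", "ggpubr", "dplyr")

# TRUE for each package that can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that are not installed yet
missing_pkgs <- pkgs[!installed]
missing_pkgs

# Uncomment to install whatever is missing:
# install.packages(missing_pkgs)
```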
Next, calculate summary statistics for the “beef” and “poultry” groups with the dplyr package, to see the count, mean, and standard deviation of the calorie variable in each group, using the following script:
# summary by groups
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(calorie, na.rm = TRUE),
    sd = sd(calorie, na.rm = TRUE)
  )
From the results above, the means of beef and poultry are about 157 and 122, with standard deviations of about 22.6 and 25.5.
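As a cross-check that does not depend on dplyr, the same per-group summary can be sketched in base R. This assumes the same vectors and data frame as in the Data preparation section; group_summary is an illustrative name.

```r
beef <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111,
          141, 153, 190, 157, 131, 149, 135, 132)
poultry <- c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 135, 142, 86,
             143, 152, 146, 144)
my_data <- data.frame(
  group = rep(c("beef", "poultry"), times = c(length(beef), length(poultry))),
  calorie = c(beef, poultry)
)

# Split the calorie values by group and summarise each piece;
# the result is a matrix with one column per group
group_summary <- sapply(
  split(my_data$calorie, my_data$group),
  function(x) c(count = length(x), mean = mean(x), sd = sd(x))
)

# One row per group, rounded for display
round(t(group_summary), 2)
```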
Next, explore the data with a box plot to see whether there are outliers, using the ggpubr package (built on ggplot2) to make the plot more visually appealing.
# boxplot visualization
library("ggpubr")
ggboxplot(my_data, x = "group", y = "calorie",
          color = "group", palette = c("#00AFBB", "#E7B800"),
          ylab = "Calorie", xlab = "Groups")
Variance Test
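Next, test whether the two groups have equal or different variances.
# variance test (F test) comparing the calorie variance of the two groups
var.test(calorie ~ group, data = my_data, alternative = "two.sided", conf.level = 0.95)
The call uses the formula calorie ~ group so that the beef calories are compared with the poultry calories; passing my_data$group as the second argument would compare the calories with the group labels, which is not a valid variance test. The F statistic is the ratio of the two sample variances, about 0.79 here, and its p-value is well above the significance level alpha = 0.05, so there is no significant evidence that the variances differ. The classic t test, which assumes equal variances, would therefore be acceptable; the Welch t test, which does not assume equal variances, is the safer choice and is the one used below.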
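To see what the F test compares, the statistic can be computed by hand as the ratio of the two sample variances and checked against var.test(). This is a sketch using the same beef and poultry values; f_ratio, p_lower, and p_value are illustrative names.

```r
beef <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111,
          141, 153, 190, 157, 131, 149, 135, 132)
poultry <- c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 135, 142, 86,
             143, 152, 146, 144)

# F statistic: ratio of the two sample variances
f_ratio <- var(beef) / var(poultry)
df1 <- length(beef) - 1
df2 <- length(poultry) - 1

# Two-sided p-value from the F distribution
p_lower <- pf(f_ratio, df1, df2)
p_value <- 2 * min(p_lower, 1 - p_lower)

# The built-in test on the two vectors should agree with the hand calculation
ft <- var.test(beef, poultry)

c(f_by_hand = f_ratio, f_var_test = unname(ft$statistic))
c(p_by_hand = p_value, p_var_test = ft$p.value)
```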
Normality Test
Next, check the normality condition. When a sample is small (fewer than 30 observations), the data should be approximately normally distributed. For larger samples the requirement matters less, because the sampling distribution of the mean is approximately normal (central limit theorem). Here I use the Shapiro-Wilk test to check whether the data are normally distributed.
#normality test
library(car)
shapiro.test(my_data$calorie)
In the output, the p-value is greater than alpha = 0.05, so we do not reject the hypothesis that the data are normally distributed.
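Strictly speaking, the normality assumption applies to each group separately rather than to the pooled calorie column. A minimal sketch of a per-group check, assuming the same data frame as above (shapiro_by_group is an illustrative name), is:

```r
beef <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111,
          141, 153, 190, 157, 131, 149, 135, 132)
poultry <- c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 135, 142, 86,
             143, 152, 146, 144)
my_data <- data.frame(
  group = rep(c("beef", "poultry"), times = c(length(beef), length(poultry))),
  calorie = c(beef, poultry)
)

# Run the Shapiro-Wilk test inside each group and keep only the p-values
shapiro_by_group <- tapply(
  my_data$calorie, my_data$group,
  function(x) shapiro.test(x)$p.value
)
shapiro_by_group
```

A p-value above 0.05 in a group means normality is not rejected for that group; with samples this small the test has limited power, so the box plot is a useful companion.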
Unpaired T test
After all of the above checks, run the unpaired t test. Setting var.equal = FALSE asks R for the Welch version, which does not assume that the two variances are equal.
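# t test
t.test(beef, poultry, var.equal = FALSE)                    # two vectors, two-sided Welch test
t.test(calorie ~ group, data = my_data, var.equal = FALSE)  # same test, formula interface
t.test(calorie ~ group, data = my_data,
       var.equal = FALSE, alternative = "greater")          # one-sided: beef mean > poultry mean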
The p-value of the two-sided test is 0.0001455, which is less than the significance level alpha = 0.05. We can conclude that the mean calorie content of beef hotdogs is significantly different from that of poultry hotdogs.
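The numbers in the printed output can also be pulled out of the object that t.test() returns, which is handy for reporting. The sketch below assumes the same two vectors; res is an illustrative name, while statistic, parameter, p.value, conf.int, and estimate are components of the returned object.

```r
beef <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111,
          141, 153, 190, 157, 131, 149, 135, 132)
poultry <- c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 135, 142, 86,
             143, 152, 146, 144)

# Two-sided Welch t test stored in an object
res <- t.test(beef, poultry, var.equal = FALSE)

res$statistic   # t value
res$parameter   # Welch degrees of freedom
res$p.value     # two-sided p-value
res$conf.int    # 95% confidence interval for the difference in means
res$estimate    # the two group means
```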
source :