In regression, a dummy variable is a numerical variable used to represent subgroups of the sample in a study. Dummy variables are also known as indicator variables, design variables, contrasts, one-hot coding, and binary basis variables. A dummy variable is used to include a categorical (qualitative) variable, or factor, in a regression model.
If the categorical variable has N levels, then (N-1) dummy variables are used to represent it; the remaining level serves as the reference level or baseline.
The level of the categorical variable with which all of the other levels are compared is known as the reference level. All interpretations are made by comparison with this reference level (baseline).
X is a dummy variable which represents the gender of an individual.
X = 0, if male
X = 1, if female
E is a categorical variable that represents the education level of an individual. E has three levels:
Undergraduate
Graduate
Postgraduate
Since E has three levels, two dummy variables are needed to represent it.
E | Dummy_1 | Dummy_2 |
---|---|---|
Undergraduate | 1 | 0 |
Graduate | 0 | 1 |
Postgraduate | 0 | 0 |
The order of the levels of a categorical variable, and hence the reference level, can be changed when creating the dummies. In this example, the reference level is “Postgraduate”.
Dummy variables enable us to use a single regression equation to represent multiple groups, rather than writing a separate equation model for each subgroup.
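For example, with the education variable E above, a single equation covers all three groups (a sketch, using the dummy coding in the table and an arbitrary response \(Y\)):

\[ Y = \beta_0 + \beta_1\, Dummy\_1 + \beta_2\, Dummy\_2 + \epsilon \]

For a postgraduate both dummies are 0, so the mean of \(Y\) is \(\beta_0\); for an undergraduate it is \(\beta_0 + \beta_1\); and for a graduate it is \(\beta_0 + \beta_2\).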
Even though a dummy variable codes a nominal-level variable, it can be treated as an interval-level variable statistically.
An interaction effect exists when the effect of one variable depends on the value of another variable.
The difference between the true population parameter and the null hypothesis value is called an effect. An effect is also known as a population effect or simply the difference.
E.g., the mean difference in weight loss between males and females is an effect.
The true population parameter is not known; samples are taken from the population, and a statistical test, such as a t-test or a one-way ANOVA, is used to determine whether an effect exists.
R will create the dummy variables automatically.
By default, in R the first level of the categorical variable, ordered alphabetically (or numerically if the categorical variable is coded as 0, 1, 2, …), is defined as the baseline or reference level.
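As a minimal sketch of this default behaviour, using the education example from earlier (the object name educ is just for illustration):
educ <- factor(c("Undergraduate", "Graduate", "Postgraduate"))
levels(educ)       # "Graduate" comes first alphabetically, so it is the baseline
contrasts(educ)    # shows the dummy (treatment) coding that lm() would use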
DietWeigthLoss <- read.delim("DietWeigthLoss.csv")
head(DietWeigthLoss,5)
## WeightLoss Diet
## 1 9.9 A
## 2 9.6 A
## 3 8.0 A
## 4 4.9 A
## 5 10.2 A
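The data as read in contain a Diet column; the code below refers to a factor version, Diet_factor, which is assumed to have been created from it along these lines (a sketch, since the original conversion step is not shown):
DietWeigthLoss$Diet_factor <- factor(DietWeigthLoss$Diet)   # convert Diet to a factor for the regression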
levels(DietWeigthLoss$Diet_factor)
## [1] "A" "B" "C" "D"
Since the Diet variable has four levels, three dummy variables are needed to represent it.
In this example, the baseline or reference level is Diet category A.
The dummy coding for an individual in each Diet group is:
Diet | Dummy_B | Dummy_C | Dummy_D |
---|---|---|---|
A | 0 | 0 | 0 |
B | 1 | 0 | 0 |
C | 0 | 1 | 0 |
D | 0 | 0 | 1 |
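With Diet A as the baseline, the model fitted below can be written as (a sketch of the corresponding equation):

\[ WeightLoss = \beta_0 + \beta_1\, Dummy\_B + \beta_2\, Dummy\_C + \beta_3\, Dummy\_D + \epsilon \]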
model1 <- lm(WeightLoss ~ Diet_factor,DietWeigthLoss)
summary(model1)
##
## Call:
## lm(formula = WeightLoss ~ Diet_factor, data = DietWeigthLoss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1067 -1.1883 0.1033 1.2600 3.7933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.1800 0.5946 15.439 < 2e-16 ***
## Diet_factorB -0.2733 0.8409 -0.325 0.746355
## Diet_factorC 2.9333 0.8409 3.488 0.000954 ***
## Diet_factorD 1.3600 0.8409 1.617 0.111430
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.303 on 56 degrees of freedom
## Multiple R-squared: 0.2468, Adjusted R-squared: 0.2065
## F-statistic: 6.118 on 3 and 56 DF, p-value: 0.001128
Reject \(H_0\) if p-value\(\leq 0.05\)
We have enough evidence to say that Diet_factorC is significant at the 5% level of significance (Diet_factorB and Diet_factorD are not significant).
\(\beta_0\) = 9.18
The mean WeightLoss for someone in the reference group, Diet A, is 9.18.
\(\beta_2\) = 2.933
The increase in mean WeightLoss for category C relative to category A is 2.933.
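One way to see what these coefficients mean is to compare them with the raw group means (a sketch; the group means should match the fitted values from model1):
aggregate(WeightLoss ~ Diet_factor, data = DietWeigthLoss, FUN = mean)   # mean WeightLoss per Diet group
# the mean for A should equal the intercept (9.18), and the mean for C should equal 9.18 + 2.933 = 12.113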
If we want to change the reference level or baseline in R, we can use the following code and refit the model.
DietWeigthLoss$Diet_factor_new <- relevel(DietWeigthLoss$Diet_factor, ref = "C")
model2 <- lm(WeightLoss ~ Diet_factor_new,DietWeigthLoss)
summary(model2)
##
## Call:
## lm(formula = WeightLoss ~ Diet_factor_new, data = DietWeigthLoss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1067 -1.1883 0.1033 1.2600 3.7933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.1133 0.5946 20.372 < 2e-16 ***
## Diet_factor_newA -2.9333 0.8409 -3.488 0.000954 ***
## Diet_factor_newB -3.2067 0.8409 -3.813 0.000344 ***
## Diet_factor_newD -1.5733 0.8409 -1.871 0.066571 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.303 on 56 degrees of freedom
## Multiple R-squared: 0.2468, Adjusted R-squared: 0.2065
## F-statistic: 6.118 on 3 and 56 DF, p-value: 0.001128
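Note how the two sets of coefficients are related: the new intercept, \(12.1133 = 9.1800 + 2.9333\), is the mean WeightLoss for the new reference group C, and the coefficient for A, \(-2.9333\), is simply the negative of Diet_factorC in model1. The fit itself (residuals, \(R^2\), F-statistic) is unchanged; only the choice of reference level differs.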
LungCapData <- read.csv("LungCapData2.csv")
head(LungCapData,5)
## Age LungCap Height Gender Smoke
## 1 9 3.124 57.0 female no
## 2 8 3.172 67.5 female no
## 3 7 3.160 54.5 female no
## 4 9 2.674 53.0 male no
## 5 9 3.685 57.0 male no
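As with the diet data, the code below refers to factor versions of Gender and Smoke; these are assumed to have been created along these lines (a sketch):
LungCapData$Gender_factor <- factor(LungCapData$Gender)   # convert Gender to a factor
LungCapData$Smoke_factor  <- factor(LungCapData$Smoke)    # convert Smoke to a factor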
levels(LungCapData$Gender_factor)
## [1] "female" "male"
levels(LungCapData$Smoke_factor)
## [1] "no" "yes"
Since Gender has two categories, only one dummy variable is needed. In this example, the baseline or reference level is the female category.
Gender | Dummy_Male |
---|---|
Female | 0 |
Male | 1 |
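With female as the baseline, model3 below corresponds to the equation (a sketch):

\[ LungCap = \beta_0 + \beta_1\, Age + \beta_2\, Dummy\_Male + \epsilon \]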
model3 <- lm(LungCap ~ Age + Gender_factor,LungCapData)
summary(model3)
##
## Call:
## lm(formula = LungCap ~ Age + Gender_factor, data = LungCapData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2448 -1.0552 -0.1115 0.9527 5.9218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.15587 0.23190 -4.984 7.98e-07 ***
## Age 0.66134 0.02165 30.553 < 2e-16 ***
## Gender_factormale 0.97000 0.12783 7.588 1.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.633 on 651 degrees of freedom
## Multiple R-squared: 0.607, Adjusted R-squared: 0.6058
## F-statistic: 502.7 on 2 and 651 DF, p-value: < 2.2e-16
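model3 forces the Age slope to be the same for males and females (two parallel lines that differ only in intercept). To allow the slopes to differ, model4 adds an Age by Gender interaction term; the corresponding equation (a sketch) is:

\[ LungCap = \beta_0 + \beta_1\, Age + \beta_2\, Dummy\_Male + \beta_3\, (Age \times Dummy\_Male) + \epsilon \]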
model4 <- lm(LungCap ~ Age + Gender_factor + Age*Gender_factor ,LungCapData)
summary(model4)
##
## Call:
## lm(formula = LungCap ~ Age + Gender_factor + Age * Gender_factor,
## data = LungCapData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9222 -1.0301 -0.1480 0.9962 5.6060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.54840 0.30660 1.789 0.0741 .
## Age 0.48819 0.02986 16.351 < 2e-16 ***
## Gender_factormale -2.32760 0.42824 -5.435 7.74e-08 ***
## Age:Gender_factormale 0.33225 0.04136 8.033 4.47e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.559 on 650 degrees of freedom
## Multiple R-squared: 0.6425, Adjusted R-squared: 0.6408
## F-statistic: 389.4 on 3 and 650 DF, p-value: < 2.2e-16
Reject \(H_0\) if p-value\(\leq 0.05\)
The interaction effect is significant at the 5% level of significance.
\(\beta_0\) = 0.5484
The mean LungCap for someone in the reference group (female) when Age = 0 is 0.5484.
\(\beta_1\) = 0.48819
For females, mean LungCap increases by 0.48819 for each unit increase in Age.
\(\beta_2\) = -2.32760
When Age = 0, the mean LungCap for males is 2.32760 lower than for females; this is the difference in intercepts between the two groups.
\(\beta_3\) = 0.33225
The Age slope for males is 0.33225 higher than for females; that is, for each unit increase in Age, mean LungCap increases by an additional 0.33225 for males relative to females.
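Putting these together gives one fitted line per gender (derived from the coefficients above):

\[ \text{Female: } \widehat{LungCap} = 0.5484 + 0.48819\, Age \]
\[ \text{Male: } \widehat{LungCap} = (0.5484 - 2.3276) + (0.48819 + 0.33225)\, Age = -1.7792 + 0.82044\, Age \]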
SPSS will not create the dummy variables automatically. We have to define the dummies before fitting the model.
In SPSS, Recode into Different Variables is used to create the dummy variables.
Using Linear Regression in SPSS, we can fit simple and multiple regression models.
We have to create a new variable to define the interaction. In this example, we create the new variable by multiplying Age and Gender_factor_Male for each individual.
Python will create the dummy variables automatically. By default, in Python the first level of the categorical variable, ordered alphabetically (or numerically if the categorical variable is coded as 0, 1, 2, …), is defined as the baseline or reference level.
Minitab will create the dummy variables automatically. By default, in Minitab the first level of the categorical variable, ordered alphabetically (or numerically if the categorical variable is coded as 0, 1, 2, …), is defined as the baseline or reference level.
On an interaction plot, parallel lines indicate that there is no interaction effect while different slopes suggest that one might be present. Below is the plot for Age*Gender_factor_Male.
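A similar plot can be drawn in R along these lines (a sketch, assuming model4 and LungCapData from above; the colours and layout are arbitrary choices):
plot(LungCapData$Age, LungCapData$LungCap, pch = 16,
     xlab = "Age", ylab = "LungCap",
     col = ifelse(LungCapData$Gender_factor == "male", "blue", "red"))
# fitted line for females (the reference group)
abline(a = coef(model4)[1], b = coef(model4)[2], col = "red")
# fitted line for males: intercept and slope shifted by the Gender and interaction terms
abline(a = coef(model4)[1] + coef(model4)[3],
       b = coef(model4)[2] + coef(model4)[4], col = "blue")
legend("topleft", legend = c("female", "male"), col = c("red", "blue"), lty = 1)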
The crossed lines on the graph suggest that there is an interaction effect, which the significant p-value for the Age*Gender_factor_Male term confirms.
By comparing the R, SPSS, Python, and Minitab outputs, you can see that all four packages give the same fitted model and the same conclusions.
For more details, you can refer to the What is p-value?, ANOVA, and Multiple linear regression blogs.
Trochim, William M. K. n.d. “Dummy Variables.” Conjoint.ly. https://conjointly.com/kb/dummy-variables/.
UCLA. n.d. “Coding Systems for Categorical Variables in Regression Analysis.” https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/.
Frost, Jim. n.d. “Understanding Interaction Effects in Statistics.” Statistics By Jim. https://statisticsbyjim.com/regression/interaction-effects/.
Marin, Mike. n.d. “R Users’ Guide.” https://sites.google.com/site/rusersguide/.
Bock, Tim. n.d. “What Are Dummy Variables.” Displayr. https://www.displayr.com/what-are-dummy-variables/.