Dummy variable in regression

In regression, a dummy variable is a numerical variable used to represent subgroups of the sample in a study. Dummy variables are also known as indicator variables, design variables, contrasts, one-hot coding, and binary basis variables. A dummy variable is used to include a categorical or qualitative variable (factor) in a regression model.

If the categorical variable has N levels, then N-1 dummy variables are needed to represent it. The remaining level is used as the reference level or baseline.

Reference level or baseline

The level of the categorical variable to which all of the other levels are compared is known as the reference level. All coefficient interpretations are made by comparing with the reference level or baseline.

Coding system for categorical variables in regression analysis

Example 1:

X is a dummy variable which represents the gender of an individual.

X = 0, if male

X = 1, if female

Example 2:

E is the categorical variable which represents the education levels of an individual. E has three levels as follows:

  • Undergraduate

  • Graduate

  • Postgraduate

Since variable E has three levels, there are 2 dummy variables to represent E.

E              Dummy_1   Dummy_2
Undergraduate     1         0
Graduate          0         1
Postgraduate      0         0
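This coding can be sketched in Python with pandas (pandas is an assumption here; the blog uses R, SPSS, Python, and Minitab). Listing "Postgraduate" first in the category order makes it the dropped reference level:

```python
import pandas as pd

# education levels; putting "Postgraduate" first makes it the reference level
E = pd.Categorical(
    ["Undergraduate", "Graduate", "Postgraduate"],
    categories=["Postgraduate", "Undergraduate", "Graduate"],
)

# drop_first=True drops the reference level's column, leaving N-1 dummies
dummies = pd.get_dummies(E, drop_first=True, dtype=int)
print(dummies)
```

The resulting columns match the table above: Undergraduate is (1, 0), Graduate is (0, 1), and the reference level Postgraduate is all zeros.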

The order of the levels of categorical variables and the reference level can be changed when creating dummies. In this example, the reference level is “Postgraduate”.

Advantage of using a dummy variable

Interaction effect

An interaction effect exists when the effect of one variable depends on the value of another variable.

Effect

The difference between the true population parameter and the null hypothesis value is called an effect. An effect is also known as a population effect or the difference.

E.g., the mean difference in weight loss between males and females is the effect.

The true population parameter is not known. Samples are taken from the population, and a statistical test, such as a t-test or a one-way ANOVA, is used to determine whether an effect exists.
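As a minimal sketch of how such a test quantifies an effect, the pooled two-sample t statistic can be computed by hand (the numbers below are made up for illustration, not taken from the blog's data):

```python
from statistics import mean, variance

# hypothetical weight-loss samples for males and females
male = [4.0, 5.0, 6.0]
female = [8.0, 9.0, 10.0]

n1, n2 = len(male), len(female)
m1, m2 = mean(male), mean(female)

# pooled sample variance (both groups happen to have variance 1.0 here)
sp2 = ((n1 - 1) * variance(male) + (n2 - 1) * variance(female)) / (n1 + n2 - 2)

# t statistic for the observed effect (difference in sample means)
t = (m1 - m2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5
```

A large |t| (compared against a t distribution with n1 + n2 - 2 degrees of freedom) is evidence that the effect is real rather than sampling noise.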

Regression with a dummy variable in R

Specifics in R:

R creates the dummy variables automatically.

By default, R takes the first level of the categorical variable, in alphabetical order (or numerical order if the variable is coded as 0, 1, 2, ...), as the baseline or reference level.

Multiple Linear regression with dummy variables

DietWeigthLoss <- read.csv("DietWeigthLoss.csv")
head(DietWeigthLoss,5)
##   WeightLoss Diet
## 1        9.9    A
## 2        9.6    A
## 3        8.0    A
## 4        4.9    A
## 5       10.2    A
DietWeigthLoss$Diet_factor <- as.factor(DietWeigthLoss$Diet)  # convert to a factor so R creates the dummies
levels(DietWeigthLoss$Diet_factor)
## [1] "A" "B" "C" "D"

Since the Diet categorical variable has four levels, three dummy variables are needed to represent it.

In this example, the baseline or reference level is Diet category A.

The dummy coding for an individual in each Diet group is:

Diet   Dummy_B   Dummy_C   Dummy_D
A         0         0         0
B         1         0         0
C         0         1         0
D         0         0         1

Model equation

\[\begin{aligned} \hat{Y} = \beta_0+ \beta_1 X_B+ \beta_2 X_C + \beta_3 X_D\\ \end{aligned}\]
model1 <- lm(WeightLoss ~ Diet_factor, data = DietWeigthLoss)
summary(model1)
## 
## Call:
## lm(formula = WeightLoss ~ Diet_factor, data = DietWeigthLoss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1067 -1.1883  0.1033  1.2600  3.7933 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.1800     0.5946  15.439  < 2e-16 ***
## Diet_factorB  -0.2733     0.8409  -0.325 0.746355    
## Diet_factorC   2.9333     0.8409   3.488 0.000954 ***
## Diet_factorD   1.3600     0.8409   1.617 0.111430    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.303 on 56 degrees of freedom
## Multiple R-squared:  0.2468, Adjusted R-squared:  0.2065 
## F-statistic: 6.118 on 3 and 56 DF,  p-value: 0.001128

Hypothesis to be tested

\[\begin{aligned} H_0&:\beta_i = 0 \text{ (the coefficient is not significant)}\\ H_1&:\beta_i \neq 0 \text{ (the coefficient is significant)} \end{aligned}\]

Decision rule

Reject \(H_0\) if p-value\(\leq 0.05\)

Conclusion

Since only the p-value for Diet_factorC (0.000954) is below 0.05, we have enough evidence to say that the Diet_factorC coefficient is significant at the 5% level of significance; Diet_factorB and Diet_factorD are not significant.

Interpretations

\(\beta_0\) = 9.18

The mean WeightLoss for someone in the reference group A is 9.18.

\(\beta_2\) = 2.933

The mean WeightLoss for category C is 2.933 higher than for the reference category A.

Change Reference(Baseline) category

If we want to change the reference level or baseline in R, we can use the following code and then fit the model.

DietWeigthLoss$Diet_factor_new <- relevel(DietWeigthLoss$Diet_factor, ref = "C")
model2 <- lm(WeightLoss ~ Diet_factor_new, data = DietWeigthLoss)
summary(model2)
## 
## Call:
## lm(formula = WeightLoss ~ Diet_factor_new, data = DietWeigthLoss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1067 -1.1883  0.1033  1.2600  3.7933 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       12.1133     0.5946  20.372  < 2e-16 ***
## Diet_factor_newA  -2.9333     0.8409  -3.488 0.000954 ***
## Diet_factor_newB  -3.2067     0.8409  -3.813 0.000344 ***
## Diet_factor_newD  -1.5733     0.8409  -1.871 0.066571 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.303 on 56 degrees of freedom
## Multiple R-squared:  0.2468, Adjusted R-squared:  0.2065 
## F-statistic: 6.118 on 3 and 56 DF,  p-value: 0.001128
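The relevel idea has a pandas analogue (a sketch, assuming pandas dummy coding rather than the blog's R workflow): reorder the categories so the new reference level comes first before creating the dummies:

```python
import pandas as pd

diet = ["A", "B", "C", "D"]

# default: the first alphabetical level "A" is the reference (its column is dropped)
default = pd.get_dummies(pd.Categorical(diet), drop_first=True, dtype=int)

# "relevel": list "C" first so it becomes the reference instead
releveled = pd.get_dummies(
    pd.Categorical(diet, categories=["C", "A", "B", "D"]),
    drop_first=True,
    dtype=int,
)
```

With the default ordering the dummy columns are B, C, D; after the relevel they are A, B, D, mirroring model2's Diet_factor_newA/B/D coefficients.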

Multiple Linear regression with a dummy variable

LungCapData <- read.csv("LungCapData2.csv")
head(LungCapData,5)
##   Age LungCap Height Gender Smoke
## 1   9   3.124   57.0 female    no
## 2   8   3.172   67.5 female    no
## 3   7   3.160   54.5 female    no
## 4   9   2.674   53.0   male    no
## 5   9   3.685   57.0   male    no
LungCapData$Gender_factor <- as.factor(LungCapData$Gender)  # convert to factors so R creates the dummies
LungCapData$Smoke_factor <- as.factor(LungCapData$Smoke)
levels(LungCapData$Gender_factor)
## [1] "female" "male"
levels(LungCapData$Smoke_factor)
## [1] "no"  "yes"

Since Gender has two categories, only one dummy variable is needed. In this example, the baseline or reference level is the female Gender category.

Gender   Dummy_Male
Female       0
Male         1
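With only two levels, the dummy can be sketched as a simple 0/1 comparison (pandas assumed; the column name is hypothetical):

```python
import pandas as pd

gender = pd.Series(["female", "male", "male", "female"])

# 1 for male, 0 for female (female is the reference level)
dummy_male = (gender == "male").astype(int)
```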

Model equation

\[\begin{aligned} \hat{Y} = \beta_0+ \beta_1 \text{Age}+ \beta_2 \text{Gender_factor_Male} \\ \end{aligned}\]
model3 <- lm(LungCap ~ Age + Gender_factor, data = LungCapData)
summary(model3)
## 
## Call:
## lm(formula = LungCap ~ Age + Gender_factor, data = LungCapData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2448 -1.0552 -0.1115  0.9527  5.9218 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.15587    0.23190  -4.984 7.98e-07 ***
## Age                0.66134    0.02165  30.553  < 2e-16 ***
## Gender_factormale  0.97000    0.12783   7.588 1.13e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.633 on 651 degrees of freedom
## Multiple R-squared:  0.607,  Adjusted R-squared:  0.6058 
## F-statistic: 502.7 on 2 and 651 DF,  p-value: < 2.2e-16

Multiple Linear regression with an interaction

Model equation

\[\begin{aligned} \hat{Y} = \beta_0+ \beta_1 \text{Age}+ \beta_2 \text{Gender_factor_Male} + \beta_3 \text{Age *Gender_factor_Male} \\ \end{aligned}\]
model4 <- lm(LungCap ~ Age + Gender_factor + Age*Gender_factor, data = LungCapData)
summary(model4)
## 
## Call:
## lm(formula = LungCap ~ Age + Gender_factor + Age * Gender_factor, 
##     data = LungCapData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9222 -1.0301 -0.1480  0.9962  5.6060 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.54840    0.30660   1.789   0.0741 .  
## Age                    0.48819    0.02986  16.351  < 2e-16 ***
## Gender_factormale     -2.32760    0.42824  -5.435 7.74e-08 ***
## Age:Gender_factormale  0.33225    0.04136   8.033 4.47e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.559 on 650 degrees of freedom
## Multiple R-squared:  0.6425, Adjusted R-squared:  0.6408 
## F-statistic: 389.4 on 3 and 650 DF,  p-value: < 2.2e-16

Hypothesis to be tested

\[\begin{aligned} H_0&:\beta_3 = 0 \text{ (the interaction effect is not significant)}\\ H_1&:\beta_3 \neq 0 \text{ (the interaction effect is significant)} \end{aligned}\]

Decision rule

Reject \(H_0\) if p-value\(\leq 0.05\)

Conclusion

Since the p-value for Age:Gender_factormale (4.47e-15) is below 0.05, the interaction effect is significant at the 5% level of significance.

Interpretations

\(\beta_0\) = 0.5484

The mean LungCap for a female with Age = 0 is 0.5484.

\(\beta_1\) = 0.48819

For females, the mean LungCap increases by 0.48819 for each unit increase in Age.

\(\beta_2\) = -2.32760

At Age = 0, the mean LungCap for males is 2.32760 lower than for females.

\(\beta_3\) = 0.33225

For each unit increase in Age, the mean LungCap for males increases by an additional 0.33225 relative to females; that is, the Age slope for males is 0.48819 + 0.33225 = 0.82044.

Regression with a dummy variable in SPSS

Disadvantage in SPSS:

SPSS does not create the dummy variables automatically. We have to define the dummies before fitting the model.

In SPSS, Recode into Different Variables is used to create the dummy variables.

By using Linear Regression in SPSS, we can fit the simple and multiple regression models.

Multiple Linear regression with dummy variables

Multiple Linear regression with a dummy variable

Multiple Linear regression with an interaction

We have to create a new variable to define the interaction. In this example, we create the new variable by multiplying Age and Gender_factor_Male for each individual.

Regression with a dummy variable in Python

Specifics in Python:

Python creates the dummy variables automatically. By default, Python takes the first level of the categorical variable, in alphabetical order (or numerical order if the variable is coded as 0, 1, 2, ...), as the baseline or reference level.
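A small pandas check of that default behaviour (a sketch; patsy and statsmodels formulas follow the same first-level convention):

```python
import pandas as pd

# string levels: "female" comes first alphabetically, so it is the reference
g = pd.get_dummies(pd.Series(["male", "female", "male"]), drop_first=True, dtype=int)

# numeric codes 0, 1, 2: level 0 is the reference
e = pd.get_dummies(pd.Series([2, 0, 1]).astype("category"), drop_first=True, dtype=int)
```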

Multiple Linear regression with dummy variables

Example:
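A minimal self-contained sketch with pandas and numpy (toy data standing in for DietWeigthLoss.csv, not the blog's actual file) shows that the dummy-coded fit recovers the group means:

```python
import numpy as np
import pandas as pd

# toy stand-in data: group means are A=9, B=9, C=12, D=10
df = pd.DataFrame({
    "Diet": list("AAABBBCCCDDD"),
    "WeightLoss": [8, 9, 10, 8, 9, 10, 11, 12, 13, 9, 10, 11],
})

# design matrix: intercept + dummies for B, C, D (A is the reference)
X = pd.get_dummies(df["Diet"], drop_first=True, dtype=float)
X.insert(0, "Intercept", 1.0)

beta, *_ = np.linalg.lstsq(X.to_numpy(), df["WeightLoss"].to_numpy(dtype=float), rcond=None)
# beta[0] is the mean of reference group A; beta[1:] are the B, C, D mean differences
```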

Multiple Linear regression with a dummy variable

Example:
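A similar sketch with one dummy plus a continuous predictor (toy data constructed to lie exactly on the model plane, so the fit recovers the coefficients exactly):

```python
import numpy as np

# toy data generated from LungCap = -1 + 0.5*Age + 1.0*Male (no noise)
age = np.array([6.0, 8.0, 10.0, 6.0, 8.0, 10.0])
male = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = -1.0 + 0.5 * age + 1.0 * male

# design matrix: intercept, Age, male dummy (female is the reference)
X = np.column_stack([np.ones(6), age, male])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```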

Multiple Linear regression with an interaction
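The interaction model can be sketched the same way (toy data again): the interaction column is simply the elementwise product of Age and the male dummy:

```python
import numpy as np

# toy data from LungCap = 0.5 + 0.4*Age - 2.0*Male + 0.3*Age*Male (no noise)
age = np.array([6.0, 8.0, 10.0, 6.0, 8.0, 10.0])
male = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = 0.5 + 0.4 * age - 2.0 * male + 0.3 * age * male

# the interaction column is the product of Age and the male dummy
X = np.column_stack([np.ones(6), age, male, age * male])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# female slope is beta[1]; male slope is beta[1] + beta[3]
female_slope, male_slope = beta[1], beta[1] + beta[3]
```

Because of the interaction term, the two genders get different Age slopes, which is exactly what the crossed lines on an interaction plot show.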

Regression with a dummy variable in Minitab

Specifics in Minitab:

Minitab creates the dummy variables automatically. By default, Minitab takes the first level of the categorical variable, in alphabetical order (or numerical order if the variable is coded as 0, 1, 2, ...), as the baseline or reference level.

Simple Linear regression with a dummy variable

Multiple Linear regression with dummy variables

Multiple Linear regression with a dummy variable

Multiple Linear regression with an interaction

Interaction plot

On an interaction plot, parallel lines indicate that there is no interaction effect while different slopes suggest that one might be present. Below is the plot for Age*Gender_factor_Male.

The crossed lines on the graph suggest that there is an interaction effect, which the significant p-value for the Age*Gender_factor_Male term confirms.

By comparing the R, SPSS, Python, and Minitab outputs, you can see that:

  • R output values are rounded to five decimal places
  • SPSS & Minitab output values are rounded to three decimal places
  • Python output values are rounded to four decimal places

For more details, you can refer to the What is p-value?, ANOVA, and Multiple linear regression blogs.
