# Packages
library(tidyverse) # For data manipulation
library(gtsummary) # Descriptive statistics
library(nnet) # For multinomial regression
library(caret) # To check the accuracy of the model
library(performance) # For model fit measures6 Session 6: Multinomial model
6.1 Introduction
Multinomial regression is useful when the dependent variable is categorical and has more than two categories. This week, we’ll build on the logistic regression example from last week, where we explored mode choice for work trips and how free parking influences that decision. This time, we’ll break down mode choice at the person level, with three possible outcomes: driving, active travel, or transit.
We will continue to work with the Puget Sound Household Travel Survey.
6.2 Preliminaries
For today’s session we will need the following packages. Make sure that you have these packages installed in your machine.
As before, we read the data in to the R session. Again, we will obtain information from trips, persons, and households tables.
# Load data
trips <- read_csv('data/Trips.csv')
persons <- read_csv('data/Persons.csv')
households <- read_csv('data/Households.csv')We limit the data for the year 2023 only.
# Filter data for the year 2023 only
households <- households %>% filter(survey_year == 2023)
trips <- trips %>% filter(survey_year == 2023)
persons <- persons %>% filter(survey_year == 2023)6.3 Dependent variable: Main mode choice to work per person
In this example, we focus on trips to work only. So, we keep trips which its destination is ‘Work’ only.
trips_work <- trips %>%
filter('Work' == dest_purpose_cat)Note that in the outcomes in a multinomial model must be mutually exclusive, which means that each individual can only belong to one outcome category at a time. In this example, we identify only one ‘main’ mode of transport for each person when travelling to work.
Here, the main mode is defined as the one used most frequently. If a person reported using more than one mode the same number of times, we’ll select the one that covered the greatest distance. To start, we’ll summarise the frequency and distance reported for each person and each mode.
mode_summary <- trips_work %>%
group_by(person_id, mode_class) %>%
summarise(
mode_frequency = n(),
mode_distance = sum(distance_miles, na.rm = TRUE),
) And, we have a quick look to the result:
mode_summary# A tibble: 2,137 × 4
# Groups: person_id [1,821]
person_id mode_class mode_frequency mode_distance
<dbl> <chr> <int> <dbl>
1 2300017304 Drive SOV 4 16.2
2 2300021301 Bike 1 1.36
3 2300074501 Transit 5 25.1
4 2300082802 Transit 1 2.07
5 2300116701 Drive SOV 1 25.2
6 2300124701 Drive SOV 1 3.90
7 2300124702 Bike 1 2.35
8 2300124702 Drive SOV 2 5.15
9 2300132201 Drive SOV 3 11.2
10 2300149601 Transit 1 9.64
# ℹ 2,127 more rows
Looking at the variables available and the summary that you just have created. Think how you can identify the ‘main’ mode before coding it.
Now, implement the code which select the most frequently used mode for each person in the data. If there’s a tie, we’ll choose the mode that covered the most miles.
# Define only one 'main' mode per person based on frequency and distance
mode_main <- mode_summary %>%
group_by(person_id) %>%
# Keep the most frequent modes per individual
slice_max(order_by = mode_frequency, with_ties = TRUE) %>%
# If tied, pick the one with more miles
slice_max(order_by = mode_distance, with_ties = FALSE) %>%
ungroup()For simplicity in our model, we will group mode into three broader classes, namely: Drive, Transit, and Active. For this exercise, we drop all other classes.
# reclassify
mode_main <- mode_main %>%
mutate(
mode_main = case_when(
grepl('Drive', mode_class) ~ 'Drive',
grepl('Transit', mode_class) ~ 'Transit',
grepl('Bike|Walk', mode_class) ~ 'Active',
# All other classes are ignored using NA
TRUE ~ NA
)
)
# Keep only relevant column, i.e. main mode per individual
mode_main <- mode_main %>%
select(person_id, mode_main)How is the distribution of the main mode looking?
count(mode_main, mode_main)# A tibble: 4 × 2
mode_main n
<chr> <int>
1 Active 336
2 Drive 1190
3 Transit 258
4 <NA> 37
Would it be possible to aggregate modal choice at a different unit of observation? Are there implications?
6.4 Independent variables
In addition to the dependent variable (mode choice in three categories), we need to define the independent variables that we think are related to mode choice to work. For this, we include information from trips, individual, and households.
We summarise trip information to work at the person level.
trips_distance <- trips_work %>%
group_by(person_id) %>%
summarise(
total_miles = sum(distance_miles, na.rm = TRUE),
num_days_trips = n_distinct(daynum),
avg_distance = total_miles / num_days_trips
) %>%
filter(num_days_trips > 0 & avg_distance < 100)Later, we supplement the dependent variable information with trips, persons, and households.
# Join the tables together
persons_main <- mode_main %>%
left_join(trips_distance, by = 'person_id') %>%
left_join(persons, by = 'person_id') %>%
left_join(households, by = 'household_id')Have a look to the number of observations in persons_main compared to persons. Why do we see this difference?
As in our prior lab, we format and label the variables that we will use:
# Write the name and appropriate order of the levels for hhincome_broad
income_labs <- c(
"Under $25,000",
"$25,000-$49,999",
"$50,000-$74,999",
"$75,000-$99,999",
"$100,000-$199,999",
"$200,000 or more"
)
# Categories for higher education (explicit definition)
he_levs <- c(
"Bachelor degree",
"Graduate/post-graduate degree",
"Associates degree"
)
# Format and create appropriate variables
persons_main <- persons_main %>%
mutate(
# Whether person has free parking at work
free_parking = ifelse(commute_subsidy_3 == 'Selected', 'Yes', 'No'),
# Simplify education, e.g. graduate and bachelor vs all other
higher_education = ifelse(education %in% he_levs, 'Yes', 'No'),
children = ifelse(numchildren == '0 children', 'No', 'Yes'),
hhincome_broad = factor(hhincome_broad, levels = income_labs)
)And, keep only the variables of interest for now.
# First, select variables of interest
persons_main <- persons_main %>%
select(
mode_main,
free_parking,
avg_distance,
gender,
higher_education,
children,
hhincome_broad
)
# We also remove incomplete observations
persons_main <- persons_main %>%
drop_na()6.5 Descriptive statistics
Before moving into the modelling stage, we check the main descriptive statistics.
We split the data according to the main dependent variable, whether respondents used active mode to travel to work.
persons_main %>%
tbl_summary(
by = mode_main,
statistic = all_continuous() ~ "{mean} ± {sd}"
) %>%
add_overall()| Characteristic | Overall N = 1,6581 |
Active N = 3181 |
Drive N = 1,0991 |
Transit N = 2411 |
|---|---|---|---|---|
| free_parking | 558 (34%) | 53 (17%) | 472 (43%) | 33 (14%) |
| avg_distance | 8 ± 8 | 2 ± 3 | 10 ± 9 | 7 ± 6 |
| gender | ||||
| Boy/Man (cisgender or transgender) | 775 (47%) | 164 (52%) | 494 (45%) | 117 (49%) |
| Girl/Woman (cisgender or transgender) | 761 (46%) | 129 (41%) | 529 (48%) | 103 (43%) |
| Non-binary/Something else fits better | 43 (2.6%) | 12 (3.8%) | 21 (1.9%) | 10 (4.1%) |
| Prefer not to answer | 79 (4.8%) | 13 (4.1%) | 55 (5.0%) | 11 (4.6%) |
| higher_education | 1,308 (79%) | 274 (86%) | 843 (77%) | 191 (79%) |
| children | 383 (23%) | 35 (11%) | 317 (29%) | 31 (13%) |
| hhincome_broad | ||||
| Under $25,000 | 54 (3.3%) | 19 (6.0%) | 18 (1.6%) | 17 (7.1%) |
| $25,000-$49,999 | 186 (11%) | 46 (14%) | 105 (9.6%) | 35 (15%) |
| $50,000-$74,999 | 233 (14%) | 37 (12%) | 151 (14%) | 45 (19%) |
| $75,000-$99,999 | 202 (12%) | 49 (15%) | 130 (12%) | 23 (9.5%) |
| $100,000-$199,999 | 592 (36%) | 94 (30%) | 432 (39%) | 66 (27%) |
| $200,000 or more | 391 (24%) | 73 (23%) | 263 (24%) | 55 (23%) |
| 1 n (%); Mean ± SD | ||||
Are there noticeable differences between respondent groups?
6.6 Multinomial model estimation
We will need a category of reference for outcome variable in multinomial models. By default in R, the reference is the first level/category of the outcome variable. However, we will choose ‘Drive’ for our case.
persons_main <- persons_main %>%
mutate(mode_main = fct_relevel(mode_main, 'Drive'))We can fit the multinomial model using a similar syntax to what we’ve done before. The main mode (the dependent variable) is explained by whether individuals have free parking at work, as well as other control variables related to trips and socio-demographic characteristics. Important note: the multinom() function is not part of base R, but it’s available in from the nnet:: package.
# Estimate Multinomial model
multinomial_model1 <- multinom(
formula = mode_main ~ free_parking + avg_distance + gender + higher_education + children,
data = persons_main
)# weights: 27 (16 variable)
initial value 1821.499175
iter 10 value 1150.503869
iter 20 value 1125.841550
final value 1125.139744
converged
The standard scale of the model output is the log of the odds (similar to logistic regression). We have already discussed some aspects of interpreting results on this scale, so we will avoid repetition and move directly to interpreting the coefficients on the odds ratio scale.
6.6.1 Exponentied coefficients: Odds ratio
As we know, the log-odds scale is not very intuitive. Similar to what we did with logistic regression, we can exponentiate the coefficients to obtain the odds ratios, which allows us to interpret the coefficients in terms of percentage changes, holding other factors constant.
multinomial_model1 %>%
tbl_regression(exponentiate = TRUE) %>%
add_glance_table()| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Active | |||
| free_parking | |||
| No | — | — | |
| Yes | 0.32 | 0.22, 0.46 | <0.001 |
| avg_distance | 0.67 | 0.63, 0.71 | <0.001 |
| gender | |||
| Boy/Man (cisgender or transgender) | — | — | |
| Girl/Woman (cisgender or transgender) | 0.50 | 0.36, 0.68 | <0.001 |
| Non-binary/Something else fits better | 1.23 | 0.50, 3.03 | 0.7 |
| Prefer not to answer | 0.75 | 0.36, 1.57 | 0.4 |
| higher_education | |||
| No | — | — | |
| Yes | 2.04 | 1.36, 3.06 | <0.001 |
| children | |||
| No | — | — | |
| Yes | 0.39 | 0.25, 0.59 | <0.001 |
| Transit | |||
| free_parking | |||
| No | — | — | |
| Yes | 0.22 | 0.15, 0.32 | <0.001 |
| avg_distance | 0.95 | 0.93, 0.98 | <0.001 |
| gender | |||
| Boy/Man (cisgender or transgender) | — | — | |
| Girl/Woman (cisgender or transgender) | 0.67 | 0.49, 0.91 | 0.011 |
| Non-binary/Something else fits better | 1.56 | 0.68, 3.56 | 0.3 |
| Prefer not to answer | 0.77 | 0.38, 1.57 | 0.5 |
| higher_education | |||
| No | — | — | |
| Yes | 1.29 | 0.90, 1.84 | 0.2 |
| children | |||
| No | — | — | |
| Yes | 0.37 | 0.25, 0.56 | <0.001 |
| NA | |||
| edf | 16.0 | ||
| Deviance | 2,250 | ||
| AIC | 2,282 | ||
| No. Obs. | 1,658 | ||
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
Let’s give it a go to the key independent variable:
Compared to individuals without free parking at work, the odds of using active travel are 68% lower than driving for those with free parking at their workplace. Similarly, the odds of using transit are about 80% lower than driving for people with free parking at work relative to those without it.
More generally, we observe that the likelihood of both engaging in active travel and using transit is lower than driving for individuals with free parking at work compared to those without it. Additionally, the effect is more pronounced for transit than for active travel.
As you can see, we can use the following formula to express the changes in percent terms (1 - coefficient) * 100. However, we can also use the direct coefficient and talk about X times changes. This is helpful when we have large coefficients. For example:
The odds of engaging in active travel, compared to driving, are twice as high for individuals with a higher education degree compared to those without one.
For continuous variables the interpretation is as follows:
For every additional average mile to work, the odds of engaging in active travel are 33% lower than driving. Likewise, the odds of riding transit are 5% lower than driving for every additional mile to work.
Generally speaking, the odds of both active travel and transit use are lower relative to driving and decrease as the distance to work increases. However, this effect is more pronounced for active travel. Does this difference align with your prior expectations?
Can you provide the interpretation for rest of the coefficients?
6.7 Model assessment
6.7.1 Goodness-of-fit
As with logistic regression, we can use the Akaike information criterion (AIC). Lower AIC values indicate a better-fitting model, i.e., a model that haves a good balance between goodness of fit and complexity.
We can compute a pseudo-r-squared value and other measures for mutlinomial models, too:
# Measures of fit
model_performance(multinomial_model1)Can't calculate log-loss.
# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------
2282.279 | 2282.611 | 2368.893 | 0.220 | 0.219 | 0.360 | 1.171
From the table the column ‘R2 (adj.)’ corresponds to McFadden’s pseudo R-squared values. This value ranges from 0 to 1. Higher values indicate a better-fitting model.
Additionally, we can look at the predictions. For instance, the model predictions look as following:
# Model predictions
multinomial_model1$fitted.values[1:6, ] Drive Active Transit
1 0.7496419 1.326472e-01 0.11771087
2 0.2252198 6.138324e-01 0.16094781
3 0.4512202 2.769393e-01 0.27184047
4 0.2675356 5.474510e-01 0.18501343
5 0.9517953 5.069152e-05 0.04815401
6 0.7430277 1.395053e-01 0.11746701
predict(multinomial_model1)[1:6][1] Drive Active Drive Active Drive Drive
Levels: Drive Active Transit
We can check these predictions against observed data using a confusion matrix, and related measures:
# Correctly predicted classes
predicted_classes <- predict(multinomial_model1)
confusionMatrix(predicted_classes, persons_main$mode_main)Confusion Matrix and Statistics
Reference
Prediction Drive Active Transit
Drive 1010 123 191
Active 89 195 50
Transit 0 0 0
Overall Statistics
Accuracy : 0.7268
95% CI : (0.7046, 0.7481)
No Information Rate : 0.6628
P-Value [Acc > NIR] : 1.238e-08
Kappa : 0.3676
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: Drive Class: Active Class: Transit
Sensitivity 0.9190 0.6132 0.0000
Specificity 0.4383 0.8963 1.0000
Pos Pred Value 0.7628 0.5838 NaN
Neg Pred Value 0.7335 0.9071 0.8546
Prevalence 0.6628 0.1918 0.1454
Detection Rate 0.6092 0.1176 0.0000
Detection Prevalence 0.7986 0.2014 0.0000
Balanced Accuracy 0.6786 0.7547 0.5000
The diagonal values in the Confusion Matrix indicate the number of correct predictions. The sensitivity in Statistics by Class indicates the percentage of correct predictions. For further information consult the function documentation typing ?confusionMatrix.
6.8 Individual activities
Run a similar analysis at the person level, but this time focus on sustainable mode choice for a purpose of your choice other than work. Specifically, disaggregate active travel into ‘Walk’ and ‘Bike’, and include ‘Transit’. Exclude all other modes from the model. Examine the relationship of employer subsidies for public transport in the model, such as free passes or fares. This information can be found in the ‘persons’ table. Interpret the results of the model.
6.9 Bonus: Multinomial model assumption checks
Generally, there are two main assumptions for the multinomial model (Harris, 2020):
- Independence of errors, and
- independence of irrelevant alternatives (IIA).
Independence of errors relates to the structure of the data or the way it has been collected. In this case, individuals might have dependencies at the household level.
The IIA assumption states that the odds of choosing between any two alternatives are unaffected by the presence of other alternatives. For example, the odds of choosing active travel over driving are assumed to be unaffected by the presence of transit. In other words, if transit were removed from the model, the relative odds between drive and active would remain the same.
The Hausman-McFadden test is helpful to check if the IIA assumption holds. For our example, we can run the check as demonstrated below. However, we require another package, mlogit, and the structure of the data needs to be transformed.
library(mlogit)
# Make sure all categorical variables are factors
persons_main <- persons_main %>%
mutate(across(is.character, as.factor))
# Convert data to mlogit format
mlogit_data <- mlogit.data(
persons_main,
choice = "mode_main",
shape = "wide"
)
# Refit the model using mlogit()
mlogit_model <- mlogit(
formula = mode_main ~ 0 | avg_distance + gender + higher_education + children,
data = mlogit_data
)
# Refit the model using mlogit() with an alternative specification
mlogit_model_alt <- mlogit(
formula = mode_main ~ 0 | avg_distance + gender + higher_education + children,
data = mlogit_data,
alt.subset = c('Drive', 'Active')
)
# Perform Hausman-McFadden test
hmftest(x = mlogit_model, z = mlogit_model_alt)
Hausman-McFadden test
data: mlogit_data
chisq = -26.446, df = 7, p-value = 1
alternative hypothesis: IIA is rejected
The null hypothesis assumes that IIA holds. Since it’s rejected in our results, this suggests that the IIA assumption does NOT hold in this model. The implication is that the multinomial logit model might not be appropriate because the choice probabilities are affected by irrelevant alternatives. In such a case, a nested logit model can be an alternative. However, this is out of the scope of this course. More information is available at https://github.com/friendly/nestedLogit.
6.10 References and further reading
- Harris, J.K., 2019. Statistics with R: solving problems using real-world data. SAGE Publications. Chapter 11.1-11.6
- Martin, P. (2022) Regression Models for Categorical and Count Data. The Sage Quantitative Research Kit. Sage, London. Ch. 4
- Chapter 8.1: Agresti, A. (2012). Categorical Data Analysis. John Wiley & Sons. [UoG library - ebook]
- Chapter 9: Heeringa, S. G., West, B. T., & Berglund, P. A. (2017). Applied Survey Data Analysis, Second Edition. Chapman and Hall/CRC. https://doi.org/10.1201/9781315153278 [UoG library - ebook]
- Ozbilen, B., Akar, G., White, K., Dabelko-Schoeny, H., & Cao, Q. (2022). Analysing the travel behaviour of older adults: what are the determinants of sustainable mobility? Ageing and Society, 1–29. https://doi.org/10.1017/S0144686X22001180