6  Session 6: Multinomial model

6.1 Introduction

Multinomial regression is useful when the dependent variable is categorical and has more than two categories. This week, we’ll build on the logistic regression example from last week, where we explored mode choice for work trips and how free parking influences that decision. This time, we’ll break down mode choice at the person level, with three possible outcomes: driving, active travel, or transit.

We will continue to work with the Puget Sound Household Travel Survey.

6.2 Preliminaries

For today’s session we will need the following packages. Make sure that you have these packages installed in your machine.

# Packages 
library(tidyverse) # For data manipulation
library(gtsummary) # Descriptive statistics
library(nnet) # For multinomial regression
library(caret) # To check the accuracy of the model
library(performance) # For model fit measures

As before, we read the data in to the R session. Again, we will obtain information from trips, persons, and households tables.

# Load data
trips <- read_csv('data/Trips.csv')
persons <- read_csv('data/Persons.csv')
households <- read_csv('data/Households.csv')

We limit the data for the year 2023 only.

# Filter data for the year 2023 only
households <- households %>% filter(survey_year == 2023)
trips <- trips %>% filter(survey_year == 2023)
persons <- persons %>% filter(survey_year == 2023)

6.3 Dependent variable: Main mode choice to work per person

In this example, we focus on trips to work only. So, we keep trips which its destination is ‘Work’ only.

trips_work <- trips %>% 
  filter('Work' == dest_purpose_cat)

Note that in the outcomes in a multinomial model must be mutually exclusive, which means that each individual can only belong to one outcome category at a time. In this example, we identify only one ‘main’ mode of transport for each person when travelling to work.

Here, the main mode is defined as the one used most frequently. If a person reported using more than one mode the same number of times, we’ll select the one that covered the greatest distance. To start, we’ll summarise the frequency and distance reported for each person and each mode.

mode_summary <- trips_work %>% 
  group_by(person_id, mode_class) %>% 
  summarise(
    mode_frequency = n(),
    mode_distance = sum(distance_miles, na.rm = TRUE),
  ) 

And, we have a quick look to the result:

mode_summary
# A tibble: 2,137 × 4
# Groups:   person_id [1,821]
    person_id mode_class mode_frequency mode_distance
        <dbl> <chr>               <int>         <dbl>
 1 2300017304 Drive SOV               4         16.2 
 2 2300021301 Bike                    1          1.36
 3 2300074501 Transit                 5         25.1 
 4 2300082802 Transit                 1          2.07
 5 2300116701 Drive SOV               1         25.2 
 6 2300124701 Drive SOV               1          3.90
 7 2300124702 Bike                    1          2.35
 8 2300124702 Drive SOV               2          5.15
 9 2300132201 Drive SOV               3         11.2 
10 2300149601 Transit                 1          9.64
# ℹ 2,127 more rows
Reflection

Looking at the variables available and the summary that you just have created. Think how you can identify the ‘main’ mode before coding it.

Now, implement the code which select the most frequently used mode for each person in the data. If there’s a tie, we’ll choose the mode that covered the most miles.

# Define only one 'main' mode per person based on frequency and distance
mode_main <- mode_summary %>% 
  group_by(person_id) %>% 
  # Keep the most frequent modes per individual
  slice_max(order_by = mode_frequency, with_ties = TRUE) %>% 
  # If tied, pick the one with more miles
  slice_max(order_by = mode_distance, with_ties = FALSE)  %>% 
  ungroup()

For simplicity in our model, we will group mode into three broader classes, namely: Drive, Transit, and Active. For this exercise, we drop all other classes.

# reclassify
mode_main <- mode_main %>% 
  mutate(
    mode_main = case_when(
      grepl('Drive', mode_class) ~ 'Drive',
      grepl('Transit', mode_class) ~ 'Transit',
      grepl('Bike|Walk', mode_class) ~ 'Active',
      # All other classes are ignored using NA
      TRUE ~ NA
    )
  )

# Keep only relevant column, i.e. main mode per individual
mode_main <- mode_main %>% 
  select(person_id, mode_main)

How is the distribution of the main mode looking?

count(mode_main, mode_main)
# A tibble: 4 × 2
  mode_main     n
  <chr>     <int>
1 Active      336
2 Drive      1190
3 Transit     258
4 <NA>         37
Reflection

Would it be possible to aggregate modal choice at a different unit of observation? Are there implications?

6.4 Independent variables

In addition to the dependent variable (mode choice in three categories), we need to define the independent variables that we think are related to mode choice to work. For this, we include information from trips, individual, and households.

We summarise trip information to work at the person level.

trips_distance <- trips_work %>% 
  group_by(person_id) %>% 
  summarise(
    total_miles = sum(distance_miles, na.rm = TRUE),
    num_days_trips = n_distinct(daynum),
    avg_distance = total_miles / num_days_trips
  ) %>% 
  filter(num_days_trips > 0 & avg_distance < 100)

Later, we supplement the dependent variable information with trips, persons, and households.

# Join the tables together
persons_main <- mode_main %>% 
  left_join(trips_distance, by = 'person_id') %>% 
  left_join(persons, by = 'person_id') %>%
  left_join(households, by = 'household_id')
Reflection

Have a look to the number of observations in persons_main compared to persons. Why do we see this difference?

As in our prior lab, we format and label the variables that we will use:

# Write the name and appropriate order of the levels for hhincome_broad
income_labs <- c(
  "Under $25,000", 
  "$25,000-$49,999", 
  "$50,000-$74,999", 
  "$75,000-$99,999", 
  "$100,000-$199,999", 
  "$200,000 or more"
)

# Categories for higher education (explicit definition)
he_levs <- c(
  "Bachelor degree",
  "Graduate/post-graduate degree",
  "Associates degree"
)

# Format and create appropriate variables
persons_main <- persons_main %>% 
  mutate(
    # Whether person has free parking at work
    free_parking = ifelse(commute_subsidy_3 == 'Selected', 'Yes', 'No'),
    # Simplify education, e.g. graduate and bachelor vs all other
    higher_education = ifelse(education %in% he_levs, 'Yes', 'No'),
    children = ifelse(numchildren == '0 children', 'No', 'Yes'),
    hhincome_broad = factor(hhincome_broad, levels = income_labs)
)

And, keep only the variables of interest for now.

# First, select variables of interest
persons_main <- persons_main %>% 
  select(
    mode_main, 
    free_parking,
    avg_distance,
    gender,
    higher_education,
    children,
    hhincome_broad
  )

# We also remove incomplete observations
persons_main <- persons_main %>% 
  drop_na()

6.5 Descriptive statistics

Before moving into the modelling stage, we check the main descriptive statistics.

We split the data according to the main dependent variable, whether respondents used active mode to travel to work.

persons_main %>% 
  tbl_summary(
    by = mode_main,
    statistic = all_continuous() ~ "{mean} ± {sd}"
  ) %>% 
  add_overall()
Characteristic Overall
N = 1,6581
Active
N = 3181
Drive
N = 1,0991
Transit
N = 2411
free_parking 558 (34%) 53 (17%) 472 (43%) 33 (14%)
avg_distance 8 ± 8 2 ± 3 10 ± 9 7 ± 6
gender



    Boy/Man (cisgender or transgender) 775 (47%) 164 (52%) 494 (45%) 117 (49%)
    Girl/Woman (cisgender or transgender) 761 (46%) 129 (41%) 529 (48%) 103 (43%)
    Non-binary/Something else fits better 43 (2.6%) 12 (3.8%) 21 (1.9%) 10 (4.1%)
    Prefer not to answer 79 (4.8%) 13 (4.1%) 55 (5.0%) 11 (4.6%)
higher_education 1,308 (79%) 274 (86%) 843 (77%) 191 (79%)
children 383 (23%) 35 (11%) 317 (29%) 31 (13%)
hhincome_broad



    Under $25,000 54 (3.3%) 19 (6.0%) 18 (1.6%) 17 (7.1%)
    $25,000-$49,999 186 (11%) 46 (14%) 105 (9.6%) 35 (15%)
    $50,000-$74,999 233 (14%) 37 (12%) 151 (14%) 45 (19%)
    $75,000-$99,999 202 (12%) 49 (15%) 130 (12%) 23 (9.5%)
    $100,000-$199,999 592 (36%) 94 (30%) 432 (39%) 66 (27%)
    $200,000 or more 391 (24%) 73 (23%) 263 (24%) 55 (23%)
1 n (%); Mean ± SD
Reflection

Are there noticeable differences between respondent groups?

6.6 Multinomial model estimation

We will need a category of reference for outcome variable in multinomial models. By default in R, the reference is the first level/category of the outcome variable. However, we will choose ‘Drive’ for our case.

persons_main <- persons_main %>% 
  mutate(mode_main = fct_relevel(mode_main,  'Drive'))

We can fit the multinomial model using a similar syntax to what we’ve done before. The main mode (the dependent variable) is explained by whether individuals have free parking at work, as well as other control variables related to trips and socio-demographic characteristics. Important note: the multinom() function is not part of base R, but it’s available in from the nnet:: package.

# Estimate Multinomial model
multinomial_model1 <- multinom(
  formula = mode_main ~ free_parking + avg_distance + gender + higher_education + children, 
  data = persons_main
)
# weights:  27 (16 variable)
initial  value 1821.499175 
iter  10 value 1150.503869
iter  20 value 1125.841550
final  value 1125.139744 
converged

The standard scale of the model output is the log of the odds (similar to logistic regression). We have already discussed some aspects of interpreting results on this scale, so we will avoid repetition and move directly to interpreting the coefficients on the odds ratio scale.

6.6.1 Exponentied coefficients: Odds ratio

As we know, the log-odds scale is not very intuitive. Similar to what we did with logistic regression, we can exponentiate the coefficients to obtain the odds ratios, which allows us to interpret the coefficients in terms of percentage changes, holding other factors constant.

multinomial_model1 %>% 
  tbl_regression(exponentiate = TRUE) %>% 
  add_glance_table()
Characteristic OR 95% CI p-value
Active
free_parking


    No
    Yes 0.32 0.22, 0.46 <0.001
avg_distance 0.67 0.63, 0.71 <0.001
gender


    Boy/Man (cisgender or transgender)
    Girl/Woman (cisgender or transgender) 0.50 0.36, 0.68 <0.001
    Non-binary/Something else fits better 1.23 0.50, 3.03 0.7
    Prefer not to answer 0.75 0.36, 1.57 0.4
higher_education


    No
    Yes 2.04 1.36, 3.06 <0.001
children


    No
    Yes 0.39 0.25, 0.59 <0.001
Transit
free_parking


    No
    Yes 0.22 0.15, 0.32 <0.001
avg_distance 0.95 0.93, 0.98 <0.001
gender


    Boy/Man (cisgender or transgender)
    Girl/Woman (cisgender or transgender) 0.67 0.49, 0.91 0.011
    Non-binary/Something else fits better 1.56 0.68, 3.56 0.3
    Prefer not to answer 0.77 0.38, 1.57 0.5
higher_education


    No
    Yes 1.29 0.90, 1.84 0.2
children


    No
    Yes 0.37 0.25, 0.56 <0.001
NA
edf 16.0

Deviance 2,250

AIC 2,282

No. Obs. 1,658

Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Let’s give it a go to the key independent variable:

Compared to individuals without free parking at work, the odds of using active travel are 68% lower than driving for those with free parking at their workplace. Similarly, the odds of using transit are about 80% lower than driving for people with free parking at work relative to those without it.

More generally, we observe that the likelihood of both engaging in active travel and using transit is lower than driving for individuals with free parking at work compared to those without it. Additionally, the effect is more pronounced for transit than for active travel.

As you can see, we can use the following formula to express the changes in percent terms (1 - coefficient) * 100. However, we can also use the direct coefficient and talk about X times changes. This is helpful when we have large coefficients. For example:

The odds of engaging in active travel, compared to driving, are twice as high for individuals with a higher education degree compared to those without one.

For continuous variables the interpretation is as follows:

For every additional average mile to work, the odds of engaging in active travel are 33% lower than driving. Likewise, the odds of riding transit are 5% lower than driving for every additional mile to work.

Reflection

Generally speaking, the odds of both active travel and transit use are lower relative to driving and decrease as the distance to work increases. However, this effect is more pronounced for active travel. Does this difference align with your prior expectations?

Can you provide the interpretation for rest of the coefficients?

6.7 Model assessment

6.7.1 Goodness-of-fit

As with logistic regression, we can use the Akaike information criterion (AIC). Lower AIC values indicate a better-fitting model, i.e., a model that haves a good balance between goodness of fit and complexity.

We can compute a pseudo-r-squared value and other measures for mutlinomial models, too:

# Measures of fit
model_performance(multinomial_model1)
Can't calculate log-loss.
# Indices of model performance

AIC      |     AICc |      BIC |    R2 | R2 (adj.) |  RMSE | Sigma
------------------------------------------------------------------
2282.279 | 2282.611 | 2368.893 | 0.220 |     0.219 | 0.360 | 1.171

From the table the column ‘R2 (adj.)’ corresponds to McFadden’s pseudo R-squared values. This value ranges from 0 to 1. Higher values indicate a better-fitting model.

Additionally, we can look at the predictions. For instance, the model predictions look as following:

# Model predictions
multinomial_model1$fitted.values[1:6, ]
      Drive       Active    Transit
1 0.7496419 1.326472e-01 0.11771087
2 0.2252198 6.138324e-01 0.16094781
3 0.4512202 2.769393e-01 0.27184047
4 0.2675356 5.474510e-01 0.18501343
5 0.9517953 5.069152e-05 0.04815401
6 0.7430277 1.395053e-01 0.11746701
predict(multinomial_model1)[1:6]
[1] Drive  Active Drive  Active Drive  Drive 
Levels: Drive Active Transit

We can check these predictions against observed data using a confusion matrix, and related measures:

# Correctly predicted classes
predicted_classes <- predict(multinomial_model1)
confusionMatrix(predicted_classes, persons_main$mode_main)
Confusion Matrix and Statistics

          Reference
Prediction Drive Active Transit
   Drive    1010    123     191
   Active     89    195      50
   Transit     0      0       0

Overall Statistics
                                          
               Accuracy : 0.7268          
                 95% CI : (0.7046, 0.7481)
    No Information Rate : 0.6628          
    P-Value [Acc > NIR] : 1.238e-08       
                                          
                  Kappa : 0.3676          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: Drive Class: Active Class: Transit
Sensitivity                0.9190        0.6132         0.0000
Specificity                0.4383        0.8963         1.0000
Pos Pred Value             0.7628        0.5838            NaN
Neg Pred Value             0.7335        0.9071         0.8546
Prevalence                 0.6628        0.1918         0.1454
Detection Rate             0.6092        0.1176         0.0000
Detection Prevalence       0.7986        0.2014         0.0000
Balanced Accuracy          0.6786        0.7547         0.5000

The diagonal values in the Confusion Matrix indicate the number of correct predictions. The sensitivity in Statistics by Class indicates the percentage of correct predictions. For further information consult the function documentation typing ?confusionMatrix.

6.8 Individual activities

Run a similar analysis at the person level, but this time focus on sustainable mode choice for a purpose of your choice other than work. Specifically, disaggregate active travel into ‘Walk’ and ‘Bike’, and include ‘Transit’. Exclude all other modes from the model. Examine the relationship of employer subsidies for public transport in the model, such as free passes or fares. This information can be found in the ‘persons’ table. Interpret the results of the model.

6.9 Bonus: Multinomial model assumption checks

Generally, there are two main assumptions for the multinomial model (Harris, 2020):

  1. Independence of errors, and
  2. independence of irrelevant alternatives (IIA).

Independence of errors relates to the structure of the data or the way it has been collected. In this case, individuals might have dependencies at the household level.

The IIA assumption states that the odds of choosing between any two alternatives are unaffected by the presence of other alternatives. For example, the odds of choosing active travel over driving are assumed to be unaffected by the presence of transit. In other words, if transit were removed from the model, the relative odds between drive and active would remain the same.

The Hausman-McFadden test is helpful to check if the IIA assumption holds. For our example, we can run the check as demonstrated below. However, we require another package, mlogit, and the structure of the data needs to be transformed.

library(mlogit)

# Make sure all categorical variables are factors
persons_main <- persons_main %>% 
  mutate(across(is.character, as.factor)) 

# Convert data to mlogit format
mlogit_data <- mlogit.data(
  persons_main, 
  choice = "mode_main", 
  shape = "wide"
)

# Refit the model using mlogit()
mlogit_model <- mlogit(
  formula = mode_main ~  0 | avg_distance + gender + higher_education + children, 
  data = mlogit_data
)

# Refit the model using mlogit() with an alternative specification
mlogit_model_alt <- mlogit(
  formula = mode_main ~  0 | avg_distance + gender + higher_education + children, 
  data = mlogit_data,
  alt.subset = c('Drive', 'Active')
)

# Perform Hausman-McFadden test
hmftest(x = mlogit_model, z = mlogit_model_alt)

    Hausman-McFadden test

data:  mlogit_data
chisq = -26.446, df = 7, p-value = 1
alternative hypothesis: IIA is rejected

The null hypothesis assumes that IIA holds. Since it’s rejected in our results, this suggests that the IIA assumption does NOT hold in this model. The implication is that the multinomial logit model might not be appropriate because the choice probabilities are affected by irrelevant alternatives. In such a case, a nested logit model can be an alternative. However, this is out of the scope of this course. More information is available at https://github.com/friendly/nestedLogit.

6.10 References and further reading

  • Harris, J.K., 2019. Statistics with R: solving problems using real-world data. SAGE Publications. Chapter 11.1-11.6
  • Martin, P. (2022) Regression Models for Categorical and Count Data. The Sage Quantitative Research Kit. Sage, London. Ch. 4
  • Chapter 8.1: Agresti, A. (2012). Categorical Data Analysis. John Wiley & Sons. [UoG library - ebook]
  • Chapter 9: Heeringa, S. G., West, B. T., & Berglund, P. A. (2017). Applied Survey Data Analysis, Second Edition. Chapman and Hall/CRC. https://doi.org/10.1201/9781315153278 [UoG library - ebook]
  • Ozbilen, B., Akar, G., White, K., Dabelko-Schoeny, H., & Cao, Q. (2022). Analysing the travel behaviour of older adults: what are the determinants of sustainable mobility? Ageing and Society, 1–29. https://doi.org/10.1017/S0144686X22001180