4  Session 4: Logistic regression Part 1

4.1 Preliminaries

We will continue to work with the Puget Sound Household Travel Survey. If you do not have the data or project set for this, please check the preliminary instructions for Week 1.

4.1.1 Data manipulation

For today’s session we will need the following packages (Make sure that you have these packages installed in your machine, if not install them using the install.packages() function).

# Packages 
library(tidyverse) # For data manipulation
library(gtsummary) # Descriptive statistics
library(performance) # Model checks

You will also need to install a helper package: install.packages("broom.helpers").

Now, we will read the data in to the R session. You will need information from trips, persons, and households tables.

# Load data
households <- read_csv('data/Households.csv')
persons <- read_csv('data/Persons.csv')
trips <- read_csv('data/Trips.csv')

Let’s work with data for the year 2023 only.

# Filter data for the year 2023 only
households <- households %>% filter(survey_year == 2023)
trips <- trips %>% filter(survey_year == 2023)
persons <- persons %>% filter(survey_year == 2023)

4.1.2 Summarising data at the appropriate level

What is the unit of observation?

Many transport datasets have a relational (hierarchical) structure. For example, repeated measurements may be nested within individuals, individuals within household, or households within places.

Before analysing the data, you will need to summarise the relevant information at the appropriate level of this hierarchy—that is, for the unit of observation that matches your research question.

Ask yourself:

  • What does one row of data represent at this stage of the analysis?
  • Are there multiple observations per unit?
  • Do I need to aggregate, average, or otherwise summarise the data to move to a higher-level unit?

Take a moment to identify the unit of observation you are working with as you go through the following sections.

In this activity, we focus on trips to work only.

trips_work <- trips %>% 
  filter('Work' == dest_purpose_cat)

The dependent variable we are interested in is whether the travellers used an active transport mode or not to travel to work. This is a binary variable. So, we will use logistic regression to identify the relationships with other variables.

The criterion for constructing this variable is that the person has made at least one trip to work during the reported period. This information is obtained at the trip level and then summarised at the person level, as shown below.

# Summarise trip data at the person level
trips_summary <- trips_work %>% 
  group_by(person_id) %>% 
  summarise(
    # Count the number of work trips made using an active mode, e.g. bike or walk
    n_active_trips = sum(grepl("Walk|Bike", mode_class)),
    # Create a binary variable, ie whether they made any active trips
    active_binary = ifelse(n_active_trips > 0, 1, 0),
    
    # Total distance travelled across all work trips
    total_miles = sum(distance_miles, na.rm = TRUE),
    # Number of days with observed trips
    num_days_trips = n_distinct(daynum),
    # Average distance travelled per day
    avg_distance = total_miles / num_days_trips
)

Remember, persons can generate many trips a day and use different modes. This is why we need to summarise the trip data at the person level.

We keep persons who reported trips to work only, and whose distance is less than 50 miles. This can be a reasonable threshold, as we are interested in active transport modes. In practice, it is always important to justify choices, e.g. literature, expert opinion, or empirical evidence.

trips_summary <- trips_summary %>% 
  filter(num_days_trips > 0 & avg_distance < 50)
Reflection

What share of respondents use an active travel mode when travelling to work?

4.1.3 Independent variables

In addition to the travel behaviour information, we want to know more about the demographics and characteristics of the household. Thus, we join the trip summary and household data at the person level.

persons_main <- persons %>% 
  left_join(trips_summary, by = 'person_id') %>% 
  left_join(households, by = 'household_id')

Next, we format the independent variables and simplify some of these.

We will also create the key independent variable, namely: having free parking at work. We wish to examine if and how this is related to the use of active transport modes. We are also including further demographic control variables at the person and household level.

# Write the name and appropriate order of the levels for hhincome_broad
income_labs <- c(
  "Under $25,000", 
  "$25,000-$49,999", 
  "$50,000-$74,999", 
  "$75,000-$99,999", 
  "$100,000-$199,999", 
  "$200,000 or more"
)

# Categories for higher education (explicit definition)
he_levs <- c(
  "Bachelor degree",
  "Graduate/post-graduate degree",
  "Associates degree"
)

# Format and create appropriate variables
persons_main <- persons_main %>% 
  mutate(
    # Whether person has free parking at work
    free_parking = ifelse(commute_subsidy_3 == 'Selected', 'Yes', 'No'),
    # Simplify education, e.g. graduate and bachelor vs all other
    higher_education = ifelse(education %in% he_levs, 'Yes', 'No'),
    children = ifelse(numchildren == '0 children', 'No', 'Yes'),
    hhincome_broad = factor(hhincome_broad, levels = income_labs)
)

Before moving to the analysis, we will keep only complete observations for the variables of interest. We will remove incomplete cases—that is, persons with missing values in any of the selected columns.

Be careful: removing incomplete cases

This step requires caution, especially when working with supplementary variables where missing values may be expected.

For example, the variable age of children will naturally be missing for persons without children. Treating these cases as incomplete would incorrectly remove valid observations and result in the loss of valuable information.

Before dropping cases, always consider why values are missing and whether they are meaningful for the unit of observation and research question.

# First, select variables of interest
persons_main <- persons_main %>% 
  select(
    active_binary, 
    free_parking,
    avg_distance,
    gender,
    higher_education,
    children,
    hhincome_broad
  )

# Then, removing incomplete observations
persons_main <- persons_main %>% 
  drop_na()

4.2 Descriptive statistics

Before moving on to the modelling stage, always check that the descriptive statistics are plausible and consistent, i.e. do values make sense?!. This step helps you catch data issues, unexpected patterns, or coding errors early—before they affect your models. Let’s do that now.

Print a simple summary of the sample including all variables selected. We split the data according to the main dependent variable, whether respondents used active mode to travel to work.

persons_main %>% 
  tbl_summary(
    by = active_binary,
    statistic = all_continuous() ~ "{mean} ± {sd}"
  ) %>% 
  add_overall()
Characteristic Overall
N = 1,6811
0
N = 1,2381
1
N = 4431
free_parking 563 (33%) 478 (39%) 85 (19%)
avg_distance 8 ± 8 9 ± 8 4 ± 5
gender


    Boy/Man (cisgender or transgender) 787 (47%) 562 (45%) 225 (51%)
    Girl/Woman (cisgender or transgender) 772 (46%) 583 (47%) 189 (43%)
    Non-binary/Something else fits better 43 (2.6%) 29 (2.3%) 14 (3.2%)
    Prefer not to answer 79 (4.7%) 64 (5.2%) 15 (3.4%)
higher_education 1,326 (79%) 949 (77%) 377 (85%)
children 384 (23%) 326 (26%) 58 (13%)
hhincome_broad


    Under $25,000 56 (3.3%) 29 (2.3%) 27 (6.1%)
    $25,000-$49,999 191 (11%) 132 (11%) 59 (13%)
    $50,000-$74,999 238 (14%) 178 (14%) 60 (14%)
    $75,000-$99,999 203 (12%) 143 (12%) 60 (14%)
    $100,000-$199,999 597 (36%) 466 (38%) 131 (30%)
    $200,000 or more 396 (24%) 290 (23%) 106 (24%)
1 n (%); Mean ± SD
Reflection
  • Are there noticeable differences between respondent groups?
  • Can you make conclusions from this?

4.2.1 Write the data

Before moving ahead, save the data subset with the variables that you have formatted and created. This will also be helpful for the next session.

write_rds(persons_main,'data/persons_main.rds')

4.3 Logistic regression

We fit a logistic regression model to identify the relationships between the dependent variable (active travel) and key independent variable (having free parking at the workplace), including control variables, such as average trip distance, gender, higher education, presence of children at household, and household income.

To fit the logistic regression we use the glm() function and specify the family ‘binomial’. This uses the first category as the reference group (the group without the outcome) and treats the second category as the outcome group. In this case, ‘0’ = no active travel (the reference group), and ‘1’ = active travel. This is critical to make correct coefficient interpretations.

logit_model1 <- glm(
  active_binary ~ free_parking + avg_distance + gender + higher_education + children + hhincome_broad,
  family = "binomial", 
  data = persons_main
)

Print the results of the model in the log odds ratio scale.

logit_model1 %>% 
  tbl_regression() %>% 
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE) %>% 
  add_glance_table()
Characteristic log(OR)1 SE 95% CI p-value
free_parking



    No
    Yes -0.77*** 0.146 -1.1, -0.49 <0.001
avg_distance -0.18*** 0.015 -0.21, -0.15 <0.001
gender



    Boy/Man (cisgender or transgender)
    Girl/Woman (cisgender or transgender) -0.49*** 0.130 -0.74, -0.23 <0.001
    Non-binary/Something else fits better -0.24 0.384 -1.0, 0.50 0.5
    Prefer not to answer -0.66* 0.329 -1.3, -0.04 0.044
higher_education



    No
    Yes 0.66*** 0.172 0.32, 1.0 <0.001
children



    No
    Yes -0.68*** 0.172 -1.0, -0.35 <0.001
hhincome_broad



    Under $25,000
    $25,000-$49,999 -0.64 0.343 -1.3, 0.03 0.063
    $50,000-$74,999 -0.74* 0.346 -1.4, -0.07 0.032
    $75,000-$99,999 -0.71* 0.347 -1.4, -0.03 0.040
    $100,000-$199,999 -0.90** 0.326 -1.5, -0.26 0.006
    $200,000 or more -0.62 0.338 -1.3, 0.04 0.065
Null deviance 1,939


Null df 1,680


Log-likelihood -789


AIC 1,605


BIC 1,675


Deviance 1,579


Residual df 1,668


No. Obs. 1,681


Abbreviations: CI = Confidence Interval, OR = Odds Ratio, SE = Standard Error
1 *p<0.05; **p<0.01; ***p<0.001

For now, the interpretation of the coefficients is on the log-odds scale. For example, the coefficient for free parking at work is -0.75 and statistically significant. This means that having free parking is associated with a decrease in the log-odds of using active transport modes compared to people without free parking:

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{FreeParking} + \dots \]

where \(p\) is the probability of using an active mode, and \(\beta_1 = -0.77\).

For continuous variables: We expect the log-odds of using active transport modes to decrease by 0.18 for every additional mile to work. This is also a significant predictor.

Reflection

Can you provide the interpretation for other coefficients?

4.3.1 Final thoughts

So far, the results of the logistic regression are interpreted in terms of the log of the odds. We can exponentiate the coefficient to interpret it as an odds ratio, which is often more intuitive. This is what we will do in the next hands-on session, including the discussion of model fit measures and model assumption checks.