1 Session 1. Working environment and formatting data

1.1 Setting the working environment

A good practice is to have an RStudio for each different project you are working on. This creates a folder in which, ideally, you should include all the input and output data.

Before moving any further, make sure you have your own RStudio project in your session. Staff will be more than happy to assist you here if needed.

1.2 Lets Practice using some transport data

In this course, we will look into Puget Sound Regional Travel Surve. This workbook uses dataset version ‘2023.5’.

Download the dataset available from the course Moodle page under the ‘Lab Activities’ tile.
Unzip the file. You can see that the following tables in CSV files:
- Trips
- Persons
- Households
- Vehicles
- Days
Move the files into your RStudio project. It is a good practice to keep inputs, code, and outputs separate. I suggest you to create a folder called data in your project folder and paste all the survey files in it.

The codebook is available on Moodle, too. Additional information offered by the source are available at https://www.psrc.org/our-work/household-travel-survey-program.

1.3 Hands on

The objective of this first part is to reproduce the results shown below. For now, I encourage you to type the code, rather than copying, this can help you to actively think of what the code is doing.

As a first step, create an RMardown/Quarto file, and save it in your RStudio project (remove all the default contents, if included).

First, load the packages required for the analysis (install if needed).

# for data manipulation and visualization
library(tidyverse)

Then, read the ‘Household’ table from the CSV file. Make sure to adjust the directory to match the location where you save the Survey files.

# Reading the Household data
households <- read_csv("data/Households.csv")

A quick check. How many rows and columns are there in the dataset?

nrow(households)

[1] 12118

ncol(households)

[1] 69

That’s many columns. Check the first first 20 variable names in the houshold table.

head(colnames(household), 20)

[1] "family"      "dob_child1"  "dob_child2"  "name_child1" "name_child2"

What do all these names mean?!

Download the dataset codebook from Moodle, and keep it in your project folder. Spend 5-10 minutes to explore the structure and get familiar with the contents of the dataset.

1.3.1 Subsetting

Lets subset some variables of interest from the Household table. Also, keep the records for the year 2023 only.

# Subset the data to select only the variables we need for the year 2023
household_subset <- households %>%
  filter(survey_year == 2023) %>%
  select(
    household_id, hhincome_broad, hhsize, 
    home_county, vehicle_count, hh_race_category, 
    numadults, numchildren, numworkers
  )

We can glimpse the data to see the type of the variables are included.

glimpse(household_subset)

Rows: 3,870
Columns: 9
$ household_id     <dbl> 23007083, 23007112, 23003118, 23007129, 23007152, 230…
$ hhincome_broad   <chr> "$75,000-$99,999", "$50,000-$74,999", "$25,000-$49,99…
$ hhsize           <chr> "3 people", "2 people", "1 person", "2 people", "3 pe…
$ home_county      <chr> "Kitsap County", "King County", "King County", "King …
$ vehicle_count    <chr> "4 vehicles", "1 vehicle", "1 vehicle", "1 vehicle", …
$ hh_race_category <chr> "AANHPI non-Hispanic", "White non-Hispanic", "AANHPI …
$ numadults        <chr> "3 adults", "2 adults", "1 adult", "2 adults", "2 adu…
$ numchildren      <chr> "0 children", "0 children", "0 children", "0 children…
$ numworkers       <chr> "3 workers", "2 workers", "1 worker", "0 workers", "2…

1.4 Formating variables

Some columns are not in it’s appropriate type, as they mix text with numbers. Thus, we will have to extract the numbers and set categorical variables to factors.

household_subset <- household_subset %>%
  mutate(
    hhsize = parse_number(hhsize),
    vehicle_count = parse_number(vehicle_count),
    numadults = parse_number(numadults),
    numchildren = parse_number(numchildren),
    numworkers = parse_number(numworkers)
  )

Categorical variables as factors.

# Convert all required variables to factors
household_subset <- household_subset %>%
  mutate(
    across(c(hhincome_broad, home_county, hh_race_category), 
    factor
))

Let’s glimpse the formatted variables.

glimpse(household_subset)

Rows: 3,870
Columns: 9
$ household_id     <dbl> 23007083, 23007112, 23003118, 23007129, 23007152, 230…
$ hhincome_broad   <fct> "$75,000-$99,999", "$50,000-$74,999", "$25,000-$49,99…
$ hhsize           <dbl> 3, 2, 1, 2, 3, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2,…
$ home_county      <fct> Kitsap County, King County, King County, King County,…
$ vehicle_count    <dbl> 4, 1, 1, 1, 1, 2, 1, 1, 2, 0, 1, 2, 1, 2, 1, 3, 1, 2,…
$ hh_race_category <fct> AANHPI non-Hispanic, White non-Hispanic, AANHPI non-H…
$ numadults        <dbl> 3, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2,…
$ numchildren      <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ numworkers       <dbl> 3, 2, 1, 0, 2, 1, 1, 2, 1, 1, 1, 0, 1, 1, 2, 0, 1, 1,…

The income bands are ordered categories. However, they are unordered. Rearrenge them as following:

Format the order of household income labels

# Type ordered income labs
income_labs <- c(
  "Under $25,000", 
  "$25,000-$49,999", 
  "$50,000-$74,999", 
  "$75,000-$99,999", 
  "$100,000-$199,999", 
  "$200,000 or more"
)

household_subset <- household_subset %>% 
  mutate(hhincome_broad = factor(hhincome_broad, levels = income_labs))

1.5 Descriptive statistics

We can have a quick and dirt descriptive statistics overview by producing a default summary.

summary(household_subset)

  household_id                hhincome_broad     hhsize     
 Min.   :23000173   Under $25,000    : 402   Min.   :1.000  
 1st Qu.:23073829   $25,000-$49,999  : 490   1st Qu.:1.000  
 Median :23154700   $50,000-$74,999  : 509   Median :2.000  
 Mean   :23173722   $75,000-$99,999  : 423   Mean   :1.954  
 3rd Qu.:23265921   $100,000-$199,999:1065   3rd Qu.:2.000  
 Max.   :23423045   $200,000 or more : 605   Max.   :9.000  
                    NA's             : 376                  
           home_county   vehicle_count  
 King County     :2793   Min.   :0.000  
 Kitsap County   : 215   1st Qu.:1.000  
 Pierce County   : 416   Median :1.000  
 Snohomish County: 446   Mean   :1.389  
                         3rd Qu.:2.000  
                         Max.   :8.000  
                                        
                               hh_race_category   numadults    
 AANHPI non-Hispanic                   : 667    Min.   :1.000  
 Black or African American non-Hispanic: 121    1st Qu.:1.000  
 Hispanic                              : 260    Median :2.000  
 Missing/No response                   : 640    Mean   :1.683  
 Some Other Races non-Hispanic         : 132    3rd Qu.:2.000  
 White non-Hispanic                    :2050    Max.   :9.000  
                                                               
  numchildren       numworkers   
 Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.0000   Median :1.000  
 Mean   :0.2716   Mean   :1.107  
 3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :5.0000   Max.   :6.000

For tailored descriptive statistic summarise, we can do the following:

household_subset %>% 
  summarise(hhsize_mean = mean(hhsize), hhsize_sd = sd(hhsize))

# A tibble: 1 × 2
  hhsize_mean hhsize_sd
        <dbl>     <dbl>
1        1.95      1.11

An we can easily extend this breaking down the descriptive statistics by group a categorical characteristic of the household. In the example below, by the home County.

 household_subset %>% 
  group_by(home_county) %>% 
  summarise(mean = mean(hhsize), sd = sd(hhsize))

# A tibble: 4 × 3
  home_county       mean    sd
  <fct>            <dbl> <dbl>
1 King County       1.87  1.06
2 Kitsap County     2.18  1.12
3 Pierce County     2.09  1.17
4 Snohomish County  2.25  1.26

Try summarising by hh_race_category.

These examples provide useful and flexible summaries. However, it’d require some work to format it for a final output. Packages that can help you creating descriptive statistic tables are: gtsummary, modelsummary, summarytools.

1.6 Missing data

Examine the variable hh_race_category. Are the missing values?

Following the conventions in R, missing values should be treated as NA. Format these as appropriate using the fct_recode().

There are many reasons why some data might be missing.

1.7 Wraping up

1.7.1 Write the formated subset

We will finish this session here.

We will write the formatted household subset. So, you do not have to go over the same steps in the next session.

write_rds(household_subset,file = 'data/household_subset.RDS')

1.7.2 Final reflections

What year are you including in the data?
What level of unit is represented in the table you prepared?