# for data manipulation and visualization
library(tidyverse) 1 Session 1. Working environment and formatting data
1.1 Setting the working environment
A good practice is to have an RStudio for each different project you are working on. This creates a folder in which, ideally, you should include all the input and output data.
- Before moving any further, make sure you have your own
RStudioproject in your session. Staff will be more than happy to assist you here if needed.
1.2 Lets Practice using some transport data
In this course, we will look into Puget Sound Regional Travel Surve. This workbook uses dataset version ‘2023.5’.
- Download the dataset available from the course Moodle page under the ‘Lab Activities’ tile.
- Unzip the file. You can see that the following tables in
CSVfiles:- Trips
- Persons
- Households
- Vehicles
- Days
- Move the files into your
RStudioproject. It is a good practice to keep inputs, code, and outputs separate. I suggest you to create a folder calleddatain your project folder and paste all the survey files in it.
The codebook is available on Moodle, too. Additional information offered by the source are available at https://www.psrc.org/our-work/household-travel-survey-program.
1.3 Hands on
The objective of this first part is to reproduce the results shown below. For now, I encourage you to type the code, rather than copying, this can help you to actively think of what the code is doing.
As a first step, create an RMardown/Quarto file, and save it in your RStudio project (remove all the default contents, if included).
First, load the packages required for the analysis (install if needed).
Then, read the ‘Household’ table from the CSV file. Make sure to adjust the directory to match the location where you save the Survey files.
# Reading the Household data
households <- read_csv("data/Households.csv")A quick check. How many rows and columns are there in the dataset?
nrow(households)[1] 12118
ncol(households)[1] 69
That’s many columns. Check the first first 20 variable names in the houshold table.
head(colnames(household), 20)[1] "family" "dob_child1" "dob_child2" "name_child1" "name_child2"
What do all these names mean?!
Download the dataset codebook from Moodle, and keep it in your project folder. Spend 5-10 minutes to explore the structure and get familiar with the contents of the dataset.
1.3.1 Subsetting
Lets subset some variables of interest from the Household table. Also, keep the records for the year 2023 only.
# Subset the data to select only the variables we need for the year 2023
household_subset <- households %>%
filter(survey_year == 2023) %>%
select(
household_id, hhincome_broad, hhsize,
home_county, vehicle_count, hh_race_category,
numadults, numchildren, numworkers
)We can glimpse the data to see the type of the variables are included.
glimpse(household_subset)Rows: 3,870
Columns: 9
$ household_id <dbl> 23007083, 23007112, 23003118, 23007129, 23007152, 230…
$ hhincome_broad <chr> "$75,000-$99,999", "$50,000-$74,999", "$25,000-$49,99…
$ hhsize <chr> "3 people", "2 people", "1 person", "2 people", "3 pe…
$ home_county <chr> "Kitsap County", "King County", "King County", "King …
$ vehicle_count <chr> "4 vehicles", "1 vehicle", "1 vehicle", "1 vehicle", …
$ hh_race_category <chr> "AANHPI non-Hispanic", "White non-Hispanic", "AANHPI …
$ numadults <chr> "3 adults", "2 adults", "1 adult", "2 adults", "2 adu…
$ numchildren <chr> "0 children", "0 children", "0 children", "0 children…
$ numworkers <chr> "3 workers", "2 workers", "1 worker", "0 workers", "2…
1.4 Formating variables
Some columns are not in it’s appropriate type, as they mix text with numbers. Thus, we will have to extract the numbers and set categorical variables to factors.
household_subset <- household_subset %>%
mutate(
hhsize = parse_number(hhsize),
vehicle_count = parse_number(vehicle_count),
numadults = parse_number(numadults),
numchildren = parse_number(numchildren),
numworkers = parse_number(numworkers)
)Categorical variables as factors.
# Convert all required variables to factors
household_subset <- household_subset %>%
mutate(
across(c(hhincome_broad, home_county, hh_race_category),
factor
))Let’s glimpse the formatted variables.
glimpse(household_subset)Rows: 3,870
Columns: 9
$ household_id <dbl> 23007083, 23007112, 23003118, 23007129, 23007152, 230…
$ hhincome_broad <fct> "$75,000-$99,999", "$50,000-$74,999", "$25,000-$49,99…
$ hhsize <dbl> 3, 2, 1, 2, 3, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2,…
$ home_county <fct> Kitsap County, King County, King County, King County,…
$ vehicle_count <dbl> 4, 1, 1, 1, 1, 2, 1, 1, 2, 0, 1, 2, 1, 2, 1, 3, 1, 2,…
$ hh_race_category <fct> AANHPI non-Hispanic, White non-Hispanic, AANHPI non-H…
$ numadults <dbl> 3, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2,…
$ numchildren <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ numworkers <dbl> 3, 2, 1, 0, 2, 1, 1, 2, 1, 1, 1, 0, 1, 1, 2, 0, 1, 1,…
The income bands are ordered categories. However, they are unordered. Rearrenge them as following:
Format the order of household income labels
# Type ordered income labs
income_labs <- c(
"Under $25,000",
"$25,000-$49,999",
"$50,000-$74,999",
"$75,000-$99,999",
"$100,000-$199,999",
"$200,000 or more"
)
household_subset <- household_subset %>%
mutate(hhincome_broad = factor(hhincome_broad, levels = income_labs))1.5 Descriptive statistics
We can have a quick and dirt descriptive statistics overview by producing a default summary.
summary(household_subset) household_id hhincome_broad hhsize
Min. :23000173 Under $25,000 : 402 Min. :1.000
1st Qu.:23073829 $25,000-$49,999 : 490 1st Qu.:1.000
Median :23154700 $50,000-$74,999 : 509 Median :2.000
Mean :23173722 $75,000-$99,999 : 423 Mean :1.954
3rd Qu.:23265921 $100,000-$199,999:1065 3rd Qu.:2.000
Max. :23423045 $200,000 or more : 605 Max. :9.000
NA's : 376
home_county vehicle_count
King County :2793 Min. :0.000
Kitsap County : 215 1st Qu.:1.000
Pierce County : 416 Median :1.000
Snohomish County: 446 Mean :1.389
3rd Qu.:2.000
Max. :8.000
hh_race_category numadults
AANHPI non-Hispanic : 667 Min. :1.000
Black or African American non-Hispanic: 121 1st Qu.:1.000
Hispanic : 260 Median :2.000
Missing/No response : 640 Mean :1.683
Some Other Races non-Hispanic : 132 3rd Qu.:2.000
White non-Hispanic :2050 Max. :9.000
numchildren numworkers
Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.000
Median :0.0000 Median :1.000
Mean :0.2716 Mean :1.107
3rd Qu.:0.0000 3rd Qu.:2.000
Max. :5.0000 Max. :6.000
For tailored descriptive statistic summarise, we can do the following:
household_subset %>%
summarise(hhsize_mean = mean(hhsize), hhsize_sd = sd(hhsize))# A tibble: 1 × 2
hhsize_mean hhsize_sd
<dbl> <dbl>
1 1.95 1.11
An we can easily extend this breaking down the descriptive statistics by group a categorical characteristic of the household. In the example below, by the home County.
household_subset %>%
group_by(home_county) %>%
summarise(mean = mean(hhsize), sd = sd(hhsize))# A tibble: 4 × 3
home_county mean sd
<fct> <dbl> <dbl>
1 King County 1.87 1.06
2 Kitsap County 2.18 1.12
3 Pierce County 2.09 1.17
4 Snohomish County 2.25 1.26
Try summarising by hh_race_category.
These examples provide useful and flexible summaries. However, it’d require some work to format it for a final output. Packages that can help you creating descriptive statistic tables are: gtsummary, modelsummary, summarytools.
1.6 Missing data
Examine the variable hh_race_category. Are the missing values?
Following the conventions in R, missing values should be treated as NA. Format these as appropriate using the fct_recode().
There are many reasons why some data might be missing.
1.7 Wraping up
1.7.1 Write the formated subset
We will finish this session here.
We will write the formatted household subset. So, you do not have to go over the same steps in the next session.
write_rds(household_subset,file = 'data/household_subset.RDS')1.7.2 Final reflections
- What year are you including in the data?
- What level of unit is represented in the table you prepared?