This short post can be viewed as an unofficial appendix to Grant McDermott’s terrific lecture on “Regression analysis in R” (go read it!). In particular, it is meant to extend the “Dummy variables” section of that lecture by introducing you to the recipes package, authored by Max Kuhn and Hadley Wickham.
The recipes Package
recipes basically provides a “tidy” approach to data preprocessing. Though recipes’ true greatness reveals itself in the “feature engineering” stage of building machine learning models, I find it extremely useful even for the simple task of generating dummy variables before running a linear regression.
The approach of recipes, as its name hints, is related to the process of cooking (or baking…). Your variables are the ingredients, and recipes’ collection of step_{X} functions defines what you want to do with them. If you follow the recipe’s instructions carefully, you will end up with a new (and tasty) data frame that includes the new and transformed variables you need.
Generating dummies using recipes
In this tutorial, we will focus on one recipes function, step_dummy, which makes the task of generating dummies a breeze.
We start by loading the tidyverse and recipes packages:
library(tidyverse)
library(recipes)
Like Grant, we’ll be working with the starwars data frame.
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke~ 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl~ red 33 <NA>
## 4 Dart~ 202 136 none white yellow 41.9 male
## 5 Leia~ 150 49 brown light brown 19 female
## 6 Owen~ 178 120 brown, gr~ light blue 52 male
## 7 Beru~ 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg~ 183 84 black light brown 24 male
## 10 Obi-~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
Let’s get down to cooking. We first need to prepare our data frame. In this case, we will use Grant’s humans data frame.
humans <- starwars %>%
  filter(species == "Human") %>%
  select(name:species)
humans
## # A tibble: 35 x 10
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke~ 172 77 blond fair blue 19 male
## 2 Dart~ 202 136 none white yellow 41.9 male
## 3 Leia~ 150 49 brown light brown 19 female
## 4 Owen~ 178 120 brown, gr~ light blue 52 male
## 5 Beru~ 165 75 brown light blue 47 female
## 6 Bigg~ 183 84 black light brown 24 male
## 7 Obi-~ 182 77 auburn, w~ fair blue-gray 57 male
## 8 Anak~ 188 84 blond fair blue 41.9 male
## 9 Wilh~ 180 NA auburn, g~ fair blue 64 male
## 10 Han ~ 180 80 brown fair brown 29 male
## # ... with 25 more rows, and 2 more variables: homeworld <chr>,
## # species <chr>
Now it is time to define the ingredients of our recipe, which are basically the variables in humans:
humans_rec <- humans %>%
  recipe(mass ~ .)
summary(humans_rec)
## # A tibble: 10 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 name nominal predictor original
## 2 height numeric predictor original
## 3 hair_color nominal predictor original
## 4 skin_color nominal predictor original
## 5 eye_color nominal predictor original
## 6 birth_year numeric predictor original
## 7 gender nominal predictor original
## 8 homeworld nominal predictor original
## 9 species nominal predictor original
## 10 mass numeric outcome original
Note that we’ve defined mass as our “outcome” variable, while the rest of the variables are defined as “predictors” (this is the ML terminology for dependent and independent variables, respectively).
In the next step, we will write down our recipe for our variables (yeah, I know. A recipe for humans. Yuck. I blame Grant for choosing this data frame identifier…). Each step in the recipe contains instructions about what to do with some or all of the variables included in that step.
In the following example, we will use step_dummy to generate numeric (zero-one) columns for each possible category of hair_color and skin_color. Then, we will use the prep function to associate our recipe with the humans data frame.
humans_cell <- humans_rec %>%
  step_dummy(skin_color, hair_color) %>%
  prep(training = humans)
summary(humans_cell)
## # A tibble: 22 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 name nominal predictor original
## 2 height numeric predictor original
## 3 eye_color nominal predictor original
## 4 birth_year numeric predictor original
## 5 gender nominal predictor original
## 6 homeworld nominal predictor original
## 7 species nominal predictor original
## 8 mass numeric outcome original
## 9 skin_color_fair numeric predictor derived
## 10 skin_color_light numeric predictor derived
## # ... with 12 more rows
Note that the summary now lists 14 new variable definitions, marked as “derived” (22 variables in total, with the original skin_color and hair_color columns replaced by their dummies). The new dummies carry straightforward names: for example, in the 9th row you can find skin_color_fair, the dummy for skin_color == "fair".
Calling the juice() function generates a new data frame according to our predefined recipe.
humans_juiced <- juice(humans_cell)
humans_juiced
## # A tibble: 35 x 22
## name height eye_color birth_year gender homeworld species mass
## <fct> <int> <fct> <dbl> <fct> <fct> <fct> <dbl>
## 1 Luke~ 172 blue 19 male Tatooine Human 77
## 2 Dart~ 202 yellow 41.9 male Tatooine Human 136
## 3 Leia~ 150 brown 19 female Alderaan Human 49
## 4 Owen~ 178 blue 52 male Tatooine Human 120
## 5 Beru~ 165 blue 47 female Tatooine Human 75
## 6 Bigg~ 183 brown 24 male Tatooine Human 84
## 7 Obi-~ 182 blue-gray 57 male Stewjon Human 77
## 8 Anak~ 188 blue 41.9 male Tatooine Human 84
## 9 Wilh~ 180 blue 64 male Eriadu Human NA
## 10 Han ~ 180 brown 29 male Corellia Human 80
## # ... with 25 more rows, and 14 more variables: skin_color_fair <dbl>,
## # skin_color_light <dbl>, skin_color_pale <dbl>, skin_color_tan <dbl>,
## # skin_color_white <dbl>, hair_color_auburn..grey <dbl>,
## # hair_color_auburn..white <dbl>, hair_color_black <dbl>,
## # hair_color_blond <dbl>, hair_color_brown <dbl>,
## # hair_color_brown..grey <dbl>, hair_color_grey <dbl>,
## # hair_color_none <dbl>, hair_color_white <dbl>
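A quick aside: juice() returns the processed training data. If you later need to apply the exact same transformations to new observations (say, a test set), recipes provides the bake() function for that. A minimal sketch (in recent versions of recipes the data argument is called new_data; older versions used newdata):

```r
# Apply the prepped recipe to "new" data -- here we simply reuse the
# humans data frame, but this could be any data frame with the same
# columns (e.g. a held-out test set).
humans_baked <- bake(humans_cell, new_data = humans)

# The result carries the same dummy columns as the juiced data
names(humans_baked)
```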
Done! Now, let’s take a closer look at our new skin_color dummies:
humans_juiced %>% select(starts_with("skin_color"))
## # A tibble: 35 x 5
## skin_color_fair skin_color_light skin_color_pale skin_color_tan
## <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 1 0 0
## 5 0 1 0 0
## 6 0 1 0 0
## 7 1 0 0 0
## 8 1 0 0 0
## 9 1 0 0 0
## 10 1 0 0 0
## # ... with 25 more rows, and 1 more variable: skin_color_white <dbl>
As you can see, instead of skin_color we now have five zero-one numeric (<dbl>) columns, each corresponding to a specific category of skin_color, excluding “dark”, which is set as the base category. Note that unless instructed otherwise, step_dummy results in \(C-1\) dummies, where \(C\) is the number of categories. I.e., step_dummy excludes one category by default.
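If you do want a full set of \(C\) indicator columns (one-hot encoding, common in ML pipelines), step_dummy accepts a one_hot argument for exactly that. A short sketch, assuming a reasonably recent version of recipes:

```r
# One-hot encoding: keep a dummy for every category, including "dark"
humans_one_hot <- humans %>%
  recipe(mass ~ .) %>%
  step_dummy(skin_color, one_hot = TRUE) %>%
  prep() %>%
  juice()

# Six skin_color_* columns instead of five
humans_one_hot %>% select(starts_with("skin_color"))
```

Keep in mind that a full set of dummies plus an intercept is perfectly collinear, so the \(C-1\) default is usually what you want for regression.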
NOTE: Unlike lm(), which automatically handles factor variables for you, with most machine learning models it is well advised to work with numeric columns as input. As we just saw, recipes was built with this feature in mind.
Thanks to the pipe operator we can do all of the above in a single command:
humans_juiced <- humans %>%
  recipe( ~ .) %>%
  step_dummy(hair_color, skin_color) %>%
  prep() %>%
  juice()
humans_juiced
## # A tibble: 35 x 22
## name height mass eye_color birth_year gender homeworld species
## <fct> <int> <dbl> <fct> <dbl> <fct> <fct> <fct>
## 1 Luke~ 172 77 blue 19 male Tatooine Human
## 2 Dart~ 202 136 yellow 41.9 male Tatooine Human
## 3 Leia~ 150 49 brown 19 female Alderaan Human
## 4 Owen~ 178 120 blue 52 male Tatooine Human
## 5 Beru~ 165 75 blue 47 female Tatooine Human
## 6 Bigg~ 183 84 brown 24 male Tatooine Human
## 7 Obi-~ 182 77 blue-gray 57 male Stewjon Human
## 8 Anak~ 188 84 blue 41.9 male Tatooine Human
## 9 Wilh~ 180 NA blue 64 male Eriadu Human
## 10 Han ~ 180 80 brown 29 male Corellia Human
## # ... with 25 more rows, and 14 more variables:
## # hair_color_auburn..grey <dbl>, hair_color_auburn..white <dbl>,
## # hair_color_black <dbl>, hair_color_blond <dbl>,
## # hair_color_brown <dbl>, hair_color_brown..grey <dbl>,
## # hair_color_grey <dbl>, hair_color_none <dbl>, hair_color_white <dbl>,
## # skin_color_fair <dbl>, skin_color_light <dbl>, skin_color_pale <dbl>,
## # skin_color_tan <dbl>, skin_color_white <dbl>
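And, coming back to where we started, the resulting data frame can be fed straight into a linear regression. A quick illustration (not a serious model of Star Wars physiology), regressing mass on height and a couple of the skin_color dummies:

```r
# The dummies behave like any other numeric regressor
fit <- lm(mass ~ height + skin_color_fair + skin_color_light,
          data = humans_juiced)
summary(fit)
```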
More steps
recipes comes with many helpful preprocessing steps. For example, step_interact generates columns with interaction terms, step_log performs a log transformation, and step_pca replaces highly correlated variables with their principal component(s). Here is a complete list of steps:
## [1] "step_arrange" "step_bagimpute" "step_bin2factor"
## [4] "step_BoxCox" "step_bs" "step_center"
## [7] "step_classdist" "step_corr" "step_count"
## [10] "step_date" "step_depth" "step_discretize"
## [13] "step_downsample" "step_dummy" "step_factor2string"
## [16] "step_filter" "step_geodist" "step_holiday"
## [19] "step_hyperbolic" "step_ica" "step_integer"
## [22] "step_interact" "step_intercept" "step_inverse"
## [25] "step_invlogit" "step_isomap" "step_knnimpute"
## [28] "step_kpca" "step_lag" "step_lincomb"
## [31] "step_log" "step_logit" "step_lowerimpute"
## [34] "step_meanimpute" "step_medianimpute" "step_modeimpute"
## [37] "step_mutate" "step_naomit" "step_nnmf"
## [40] "step_novel" "step_ns" "step_num2factor"
## [43] "step_nzv" "step_ordinalscore" "step_other"
## [46] "step_pca" "step_pls" "step_poly"
## [49] "step_profile" "step_range" "step_ratio"
## [52] "step_regex" "step_relu" "step_rm"
## [55] "step_rollimpute" "step_sample" "step_scale"
## [58] "step_shuffle" "step_slice" "step_spatialsign"
## [61] "step_sqrt" "step_string2factor" "step_unorder"
## [64] "step_upsample" "step_window" "step_YeoJohnson"
## [67] "step_zv"
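Steps can be chained freely within one recipe. For example, here is a hypothetical recipe (just a sketch to illustrate the syntax) that log-transforms height, adds a height-by-birth_year interaction, and still generates our dummies:

```r
humans %>%
  recipe(mass ~ height + birth_year + skin_color) %>%
  step_log(height) %>%                             # log-transform height
  step_interact(terms = ~ height:birth_year) %>%   # add an interaction column
  step_dummy(skin_color) %>%                       # dummies, as before
  prep() %>%
  juice()
```

The steps run in the order they are written, so the interaction here uses the already log-transformed height.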
Further resources
- Grant’s “Lecture 8: Regression analysis in R”
- The recipes official website.
- This parsnip package vignette which shows how recipes fits within the workflow of building machine learning models.