recipes for Dummies | ITAMAR CASPI (2024)

This short post can be viewed as an unofficial appendix to Grant McDermott’s terrific lecture on “Regression analysis in R” (go read it!). In particular, it is meant to extend the “Dummy variables” section of that lecture by introducing you to the recipes package, authored by Max Kuhn and Hadley Wickham.

The recipes Package

recipes basically provides a “tidy” approach to data preprocessing. Though recipes’ true greatness reveals itself in the “feature engineering” stage of building machine learning models, I find it extremely useful even for the simple task of generating dummy variables before running a linear regression.

The approach of recipes, as its name hints, is related to the process of cooking (or baking…). Your variables are the ingredients, and recipes’ collection of step_{X} functions defines what you want to do with your ingredients. If you follow the recipe’s instructions carefully you will end up with a new (and tasty) data frame that includes the new and transformed variables you need.

Generating dummies using recipes

In this tutorial, we will focus on one recipes function called step_dummy that makes the task of generating dummies a breeze.

We start by loading the tidyverse and recipes packages:

library(tidyverse)
library(recipes)

Like Grant, we’ll be working with the starwars data frame.

starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Luke~    172    77 blond      fair       blue            19   male
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>
##  3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>
##  4 Dart~    202   136 none       white      yellow          41.9 male
##  5 Leia~    150    49 brown      light      brown           19   female
##  6 Owen~    178   120 brown, gr~ light      blue            52   male
##  7 Beru~    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>
##  9 Bigg~    183    84 black      light      brown           24   male
## 10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

Let’s get down to cooking. We first need to prepare our data frame. In this case, we will use Grant’s humans data frame.

humans <- starwars %>%
  filter(species == "Human") %>%
  select(name:species)
humans
## # A tibble: 35 x 10
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Luke~    172    77 blond      fair       blue            19   male
##  2 Dart~    202   136 none       white      yellow          41.9 male
##  3 Leia~    150    49 brown      light      brown           19   female
##  4 Owen~    178   120 brown, gr~ light      blue            52   male
##  5 Beru~    165    75 brown      light      blue            47   female
##  6 Bigg~    183    84 black      light      brown           24   male
##  7 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male
##  8 Anak~    188    84 blond      fair       blue            41.9 male
##  9 Wilh~    180    NA auburn, g~ fair       blue            64   male
## 10 Han ~    180    80 brown      fair       brown           29   male
## # ... with 25 more rows, and 2 more variables: homeworld <chr>,
## #   species <chr>

Now it is time to define the ingredients of our recipe, which are basically the variables in humans:

humans_rec <- humans %>%
  recipe(mass ~ .)
summary(humans_rec)
## # A tibble: 10 x 4
##    variable   type    role      source
##    <chr>      <chr>   <chr>     <chr>
##  1 name       nominal predictor original
##  2 height     numeric predictor original
##  3 hair_color nominal predictor original
##  4 skin_color nominal predictor original
##  5 eye_color  nominal predictor original
##  6 birth_year numeric predictor original
##  7 gender     nominal predictor original
##  8 homeworld  nominal predictor original
##  9 species    nominal predictor original
## 10 mass       numeric outcome   original

Note that we’ve defined mass as our “outcome” variable and the rest of the variables as “predictors” (this is what ML folks call dependent and independent variables).

In the next step, we will write down our recipe for our variables (yeah. I know. Recipe for humans. Yuck. I blame Grant for choosing this data frame identifier…). Each step in the recipe contains instructions about what to do with some or all of our variables.

In the following example, we will use step_dummy to generate numeric (zero-one) columns for the categories of hair_color and skin_color. Then, we will use the prep function to estimate the recipe on the humans data frame, that is, to work out from the data which dummy columns are needed.

humans_cell <- humans_rec %>%
  step_dummy(skin_color, hair_color) %>%
  prep(training = humans)
summary(humans_cell)
## # A tibble: 22 x 4
##    variable         type    role      source
##    <chr>            <chr>   <chr>     <chr>
##  1 name             nominal predictor original
##  2 height           numeric predictor original
##  3 eye_color        nominal predictor original
##  4 birth_year       numeric predictor original
##  5 gender           nominal predictor original
##  6 homeworld        nominal predictor original
##  7 species          nominal predictor original
##  8 mass             numeric outcome   original
##  9 skin_color_fair  numeric predictor derived
## 10 skin_color_light numeric predictor derived
## # ... with 12 more rows

Note that the recipe now defines 22 variables instead of the original 10: skin_color and hair_color are gone, and 14 new dummy variables have taken their place (a net addition of 12). The new dummies have straightforward names. For example, in the 9th row, you can find skin_color_fair, the dummy for skin_color == "fair".

Calling the juice() function generates a new data frame according to our predefined recipe.

humans_juiced <- juice(humans_cell)
humans_juiced
## # A tibble: 35 x 22
##    name  height eye_color birth_year gender homeworld species  mass
##    <fct>  <int> <fct>          <dbl> <fct>  <fct>     <fct>   <dbl>
##  1 Luke~    172 blue            19   male   Tatooine  Human      77
##  2 Dart~    202 yellow          41.9 male   Tatooine  Human     136
##  3 Leia~    150 brown           19   female Alderaan  Human      49
##  4 Owen~    178 blue            52   male   Tatooine  Human     120
##  5 Beru~    165 blue            47   female Tatooine  Human      75
##  6 Bigg~    183 brown           24   male   Tatooine  Human      84
##  7 Obi-~    182 blue-gray       57   male   Stewjon   Human      77
##  8 Anak~    188 blue            41.9 male   Tatooine  Human      84
##  9 Wilh~    180 blue            64   male   Eriadu    Human      NA
## 10 Han ~    180 brown           29   male   Corellia  Human      80
## # ... with 25 more rows, and 14 more variables: skin_color_fair <dbl>,
## #   skin_color_light <dbl>, skin_color_pale <dbl>, skin_color_tan <dbl>,
## #   skin_color_white <dbl>, hair_color_auburn..grey <dbl>,
## #   hair_color_auburn..white <dbl>, hair_color_black <dbl>,
## #   hair_color_blond <dbl>, hair_color_brown <dbl>,
## #   hair_color_brown..grey <dbl>, hair_color_grey <dbl>,
## #   hair_color_none <dbl>, hair_color_white <dbl>

Done! Now, let’s take a closer look at our new skin_color dummies:

humans_juiced %>% select(starts_with("skin_color"))
## # A tibble: 35 x 5
##    skin_color_fair skin_color_light skin_color_pale skin_color_tan
##              <dbl>            <dbl>           <dbl>          <dbl>
##  1               1                0               0              0
##  2               0                0               0              0
##  3               0                1               0              0
##  4               0                1               0              0
##  5               0                1               0              0
##  6               0                1               0              0
##  7               1                0               0              0
##  8               1                0               0              0
##  9               1                0               0              0
## 10               1                0               0              0
## # ... with 25 more rows, and 1 more variable: skin_color_white <dbl>

As you can see, instead of skin_color we now have five zero-one numeric (<dbl>) columns, each corresponding to a specific category of skin_color, excluding “dark”, which is set as the base category. Note that unless instructed otherwise, step_dummy results in \(C - 1\) dummies, where \(C\) is the number of categories. That is, step_dummy excludes one category by default.
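To make the \(C - 1\) default concrete, here is a minimal sketch on a made-up data frame (toy, mass and skin are illustrative, not part of the humans data). It assumes a version of recipes in which step_dummy has a one_hot argument and bake() accepts new_data = NULL; with one_hot = TRUE the base category gets its own column as well.

```r
library(recipes)

# Toy data (hypothetical): one outcome and one factor with C = 3 levels.
toy <- data.frame(
  mass = c(77, 136, 49, 84),
  skin = factor(c("dark", "fair", "light", "fair"))
)

# Default: C - 1 dummies; the first level ("dark") is the base category.
reference_coded <- recipe(mass ~ ., data = toy) %>%
  step_dummy(skin) %>%
  prep() %>%
  bake(new_data = NULL)

# one_hot = TRUE: C dummies, one per level.
one_hot_coded <- recipe(mass ~ ., data = toy) %>%
  step_dummy(skin, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)

names(reference_coded)
names(one_hot_coded)
```

The full set of \(C\) dummies is perfectly collinear with an intercept, which is why dropping one category is the sensible default when the data are headed for a linear regression.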

NOTE: Unlike lm(), which automatically handles factor variables for you, most machine learning algorithms are best fed numeric columns as input. As we just saw, recipes was built with this in mind.
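For comparison, the sketch below shows what lm() does with a factor behind the scenes, using only base R. The data frame and its column names are made up for illustration; model.matrix() builds the same design matrix that lm() uses, with one dummy per non-base level.

```r
# Toy data (hypothetical): a numeric outcome and a three-level factor.
toy <- data.frame(
  mass = c(77, 136, 49, 120, 75, 84),
  skin = factor(c("fair", "white", "light", "light", "light", "fair"))
)

# The design matrix lm() builds internally: an intercept plus C - 1 dummies.
X <- model.matrix(mass ~ skin, data = toy)
colnames(X)

# lm() reports one coefficient per column of that matrix.
fit <- lm(mass ~ skin, data = toy)
names(coef(fit))
```

Here the base category is "fair", the first factor level, so the remaining coefficients are differences relative to it.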

Thanks to the pipe operator we can do all of the above in a single command:

humans_juiced <- humans %>%
  recipe(~ .) %>%
  step_dummy(hair_color, skin_color) %>%
  prep() %>%
  juice()
humans_juiced
## # A tibble: 35 x 22
##    name  height  mass eye_color birth_year gender homeworld species
##    <fct>  <int> <dbl> <fct>          <dbl> <fct>  <fct>     <fct>
##  1 Luke~    172    77 blue            19   male   Tatooine  Human
##  2 Dart~    202   136 yellow          41.9 male   Tatooine  Human
##  3 Leia~    150    49 brown           19   female Alderaan  Human
##  4 Owen~    178   120 blue            52   male   Tatooine  Human
##  5 Beru~    165    75 blue            47   female Tatooine  Human
##  6 Bigg~    183    84 brown           24   male   Tatooine  Human
##  7 Obi-~    182    77 blue-gray       57   male   Stewjon   Human
##  8 Anak~    188    84 blue            41.9 male   Tatooine  Human
##  9 Wilh~    180    NA blue            64   male   Eriadu    Human
## 10 Han ~    180    80 brown           29   male   Corellia  Human
## # ... with 25 more rows, and 14 more variables:
## #   hair_color_auburn..grey <dbl>, hair_color_auburn..white <dbl>,
## #   hair_color_black <dbl>, hair_color_blond <dbl>,
## #   hair_color_brown <dbl>, hair_color_brown..grey <dbl>,
## #   hair_color_grey <dbl>, hair_color_none <dbl>, hair_color_white <dbl>,
## #   skin_color_fair <dbl>, skin_color_light <dbl>, skin_color_pale <dbl>,
## #   skin_color_tan <dbl>, skin_color_white <dbl>
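juice() returns the processed version of the data the recipe was prepared on. A prepared recipe can also be applied to other data with bake(). The sketch below uses made-up data frames (train and new_people are illustrative names) and assumes a recent version of recipes, where, as far as I know, bake(new_data = NULL) is the suggested equivalent of juice().

```r
library(recipes)

# Toy training data (hypothetical). The factor declares a "dark" level
# even though no training row uses it.
skin_levels <- c("dark", "fair", "light")
train <- data.frame(
  mass = c(77, 49, 120, 75),
  skin = factor(c("fair", "light", "light", "fair"), levels = skin_levels)
)

# New observations that should get exactly the same dummy columns.
new_people <- data.frame(
  mass = c(80, 90),
  skin = factor(c("dark", "light"), levels = skin_levels)
)

prepped <- recipe(mass ~ ., data = train) %>%
  step_dummy(skin) %>%
  prep()

# Processed training data (what juice() would return).
train_out <- bake(prepped, new_data = NULL)

# The same recipe applied to the new data.
new_out <- bake(prepped, new_data = new_people)
new_out
```

Because the dummy columns are fixed when the recipe is prepared, the new data end up with the same columns as the training data, which is what you want when a model fitted on one data set is used to predict on another.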

More steps

recipes comes with many helpful preprocessing steps. For example, step_interact generates columns with interaction terms, step_log performs a log transformation, and step_pca replaces the selected numeric variables with their principal component(s). Here is a complete list of steps:

##  [1] "step_arrange"       "step_bagimpute"     "step_bin2factor"
##  [4] "step_BoxCox"        "step_bs"            "step_center"
##  [7] "step_classdist"     "step_corr"          "step_count"
## [10] "step_date"          "step_depth"         "step_discretize"
## [13] "step_downsample"    "step_dummy"         "step_factor2string"
## [16] "step_filter"        "step_geodist"       "step_holiday"
## [19] "step_hyperbolic"    "step_ica"           "step_integer"
## [22] "step_interact"      "step_intercept"     "step_inverse"
## [25] "step_invlogit"      "step_isomap"        "step_knnimpute"
## [28] "step_kpca"          "step_lag"           "step_lincomb"
## [31] "step_log"           "step_logit"         "step_lowerimpute"
## [34] "step_meanimpute"    "step_medianimpute"  "step_modeimpute"
## [37] "step_mutate"        "step_naomit"        "step_nnmf"
## [40] "step_novel"         "step_ns"            "step_num2factor"
## [43] "step_nzv"           "step_ordinalscore"  "step_other"
## [46] "step_pca"           "step_pls"           "step_poly"
## [49] "step_profile"       "step_range"         "step_ratio"
## [52] "step_regex"         "step_relu"          "step_rm"
## [55] "step_rollimpute"    "step_sample"        "step_scale"
## [58] "step_shuffle"       "step_slice"         "step_spatialsign"
## [61] "step_sqrt"          "step_string2factor" "step_unorder"
## [64] "step_upsample"      "step_window"        "step_YeoJohnson"
## [67] "step_zv"
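Steps can be chained, and they are applied in the order in which they are written. As a rough sketch (the toy data frame and its columns are made up for illustration, and bake(new_data = NULL) assumes a recent version of recipes), here is a recipe that log-transforms one predictor and turns another into a dummy:

```r
library(recipes)

# Toy data (hypothetical): one numeric and one nominal predictor.
toy <- data.frame(
  mass   = c(77, 136, 49, 120),
  height = c(172, 202, 150, 178),
  gender = factor(c("male", "male", "female", "male"))
)

toy_baked <- recipe(mass ~ ., data = toy) %>%
  step_log(height) %>%    # natural log of height, replacing the original column
  step_dummy(gender) %>%  # gender_male dummy; "female" is the base category
  prep() %>%
  bake(new_data = NULL)

toy_baked
```

Note that step_log overwrites height in place, so any later step that refers to height sees the logged values.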

Further resources
