Tidy Tuesday Exercise 2

Egg Production

4/12/2023

Get the Data

Upload public access data for exercise via tidytuesdayR package.

library(tidytuesdayR)

Warning: package 'tidytuesdayR' was built under R version 4.2.2

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.2.3

Warning: package 'ggplot2' was built under R version 4.2.3

Warning: package 'tibble' was built under R version 4.2.3

Warning: package 'tidyr' was built under R version 4.2.3

Warning: package 'readr' was built under R version 4.2.3

Warning: package 'dplyr' was built under R version 4.2.3

Warning: package 'stringr' was built under R version 4.2.2

Warning: package 'forcats' was built under R version 4.2.3

Warning: package 'lubridate' was built under R version 4.2.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(here)

Warning: package 'here' was built under R version 4.2.2

here() starts at C:/Users/Raquel/GitHub/MADA/RaquelFrancisco-MADA-portfolio

library(janitor)

Warning: package 'janitor' was built under R version 4.2.2


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(tidymodels)

Warning: package 'tidymodels' was built under R version 4.2.3

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.4     ✔ rsample      1.1.1
✔ dials        1.2.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.0     ✔ yardstick    1.1.0
✔ recipes      1.0.5

Warning: package 'broom' was built under R version 4.2.3

Warning: package 'dials' was built under R version 4.2.3

Warning: package 'infer' was built under R version 4.2.2

Warning: package 'modeldata' was built under R version 4.2.2

Warning: package 'parsnip' was built under R version 4.2.3

Warning: package 'recipes' was built under R version 4.2.3

Warning: package 'rsample' was built under R version 4.2.2

Warning: package 'tune' was built under R version 4.2.3

Warning: package 'workflows' was built under R version 4.2.3

Warning: package 'workflowsets' was built under R version 4.2.3

Warning: package 'yardstick' was built under R version 4.2.2

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org

library(dplyr)
library(ranger)

Warning: package 'ranger' was built under R version 4.2.3

library(rpart)

Warning: package 'rpart' was built under R version 4.2.3


Attaching package: 'rpart'

The following object is masked from 'package:dials':

    prune

library(glmnet)

Warning: package 'glmnet' was built under R version 4.2.3

Loading required package: Matrix

Warning: package 'Matrix' was built under R version 4.2.3


Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-7

library(rpart.plot)

Warning: package 'rpart.plot' was built under R version 4.2.3

library(vip)

Warning: package 'vip' was built under R version 4.2.3


Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2023-04-11')

--- Compiling #TidyTuesday Information for 2023-04-11 ----
--- There are 2 files available ---
--- Starting Download ---


    Downloading file 1 of 2: `egg-production.csv`
    Downloading file 2 of 2: `cage-free-percentages.csv`

--- Download complete ---

tuesdata <- tidytuesdayR::tt_load(2023, week = 15)

--- Compiling #TidyTuesday Information for 2023-04-11 ----
--- There are 2 files available ---
--- Starting Download ---


    Downloading file 1 of 2: `egg-production.csv`
    Downloading file 2 of 2: `cage-free-percentages.csv`

--- Download complete ---

summary(tuesdata)

                      Length Class       Mode
egg-production        6      spec_tbl_df list
cage-free-percentages 4      spec_tbl_df list

eggproduction <- tuesdata$`egg-production`
cagefreepercentages <- tuesdata$`cage-free-percentages`

Data Dictonary

US Egg Production

The data this week comes from The Humane League’s US Egg Production dataset by Samara Mendez. Dataset and code is available for this project on OSF at US Egg Production Data Set.

This dataset tracks the supply of cage-free eggs in the United States from December 2007 to February 2021. For TidyTuesday we’ve used data through February 2021, but the full dataset, with data through the present, is available in the OSF project.

In this project, they synthesize an analysis-ready data set that tracks cage-free hens and the supply of cage-free eggs relative to the overall numbers of hens and table eggs in the United States. The data set is based on reports produced by the United States Department of Agriculture (USDA), which are published weekly or monthly. They supplement these data with definitions and a taxonomy of egg products drawn from USDA and industry publications. The data include flock size (both absolute and relative) and egg production of cage-free hens as well as all table-egg-laying hens in the US, collected to understand the impact of the industry’s cage-free transition on hens. Data coverage ranges from December 2007 to February 2021.

Egg Production

variable	class	description
observed_month	double	Month in which report observations are collected,Dates are recorded in ISO 8601 format YYYY-MM-DD
prod_type	character	type of egg product: hatching, table eggs
prod_process	character	type of production process and housing: cage-free (organic), cage-free (non-organic), all. The value ‘all’ includes cage-free and conventional housing.
n_hens	double	number of hens produced by hens for a given month-type-process combo
n_eggs	double	number of eggs producing eggs for a given month-type-process combo
source	character	Original USDA report from which data are sourced. Values correspond to titles of PDF reports. Date of report is included in title.

Cage Free Percentages

variable	class	description
observed_month	double	Month in which report observations are collected,Dates are recorded in ISO 8601 format YYYY-MM-DD
percent_hens	double	observed or computed percentage of cage-free hens relative to all table-egg-laying hens
percent_eggs	double	computed percentage of cage-free eggs relative to all table eggs,This variable is not available for data sourced from the Egg Markets Overview report
source	character	Original USDA report from which data are sourced. Values correspond to titles of PDF reports. Date of report is included in title.

#Exploring Egg Production

tibble(eggproduction)

# A tibble: 220 × 6
   observed_month prod_type     prod_process   n_hens     n_eggs source         
   <date>         <chr>         <chr>           <dbl>      <dbl> <chr>          
 1 2016-07-31     hatching eggs all          57975000 1147000000 ChicEggs-09-23…
 2 2016-08-31     hatching eggs all          57595000 1142700000 ChicEggs-10-21…
 3 2016-09-30     hatching eggs all          57161000 1093300000 ChicEggs-11-22…
 4 2016-10-31     hatching eggs all          56857000 1126700000 ChicEggs-12-23…
 5 2016-11-30     hatching eggs all          57116000 1096600000 ChicEggs-01-24…
 6 2016-12-31     hatching eggs all          57750000 1132900000 ChicEggs-02-28…
 7 2017-01-31     hatching eggs all          57991000 1123400000 ChicEggs-03-21…
 8 2017-02-28     hatching eggs all          58286000 1014500000 ChicEggs-04-21…
 9 2017-03-31     hatching eggs all          58735000 1128500000 ChicEggs-05-22…
10 2017-04-30     hatching eggs all          59072000 1097200000 ChicEggs-06-23…
# ℹ 210 more rows

ggplot() +
    geom_point(data = eggproduction, aes(x = n_hens, y = n_eggs, color = prod_process), shape = 19) +
    ggtitle("Laying Efficency", subtitle = "Number of hens vs number of eggs laid") +
    labs(x = "Hens", y = "Eggs")

ggplot() +
    geom_point(data = eggproduction, aes(x = n_hens, y = n_eggs, color = prod_process), shape = 19) +
    ggtitle("Laying Efficency", subtitle = "Number of hens vs number of eggs laid") +
    labs(x = "Hens", y = "Eggs") +
    xlim(0, 70000000) +
    ylim(0, 1700000000)

Warning: Removed 55 rows containing missing values (`geom_point()`).

ggplot() +
    geom_line(data = eggproduction, aes(x = observed_month, y = n_eggs, color = prod_process)) +
    ggtitle("Eggs Production over time", subtitle = "") +
    labs(x = "Time", y = "Eggs")

#Exploring Cage Free %

tibble(cagefreepercentages)

# A tibble: 96 × 4
   observed_month percent_hens percent_eggs source                             
   <date>                <dbl>        <dbl> <chr>                              
 1 2007-12-31              3.2           NA Egg-Markets-Overview-2019-10-19.pdf
 2 2008-12-31              3.5           NA Egg-Markets-Overview-2019-10-19.pdf
 3 2009-12-31              3.6           NA Egg-Markets-Overview-2019-10-19.pdf
 4 2010-12-31              4.4           NA Egg-Markets-Overview-2019-10-19.pdf
 5 2011-12-31              5.4           NA Egg-Markets-Overview-2019-10-19.pdf
 6 2012-12-31              6             NA Egg-Markets-Overview-2019-10-19.pdf
 7 2013-12-31              5.9           NA Egg-Markets-Overview-2019-10-19.pdf
 8 2014-12-31              5.7           NA Egg-Markets-Overview-2019-10-19.pdf
 9 2015-12-31              8.6           NA Egg-Markets-Overview-2019-10-19.pdf
10 2016-04-30              9.9           NA Egg-Markets-Overview-2016-12-02.pdf
# ℹ 86 more rows

ggplot() +
    geom_line(data = cagefreepercentages, aes(x = observed_month, y = percent_eggs), color =  'darkgreen') +
    geom_line(data = cagefreepercentages, aes(x = observed_month, y = percent_hens), color =  'brown') +
    ggtitle("% Hens and Eggs relative to all Tables", subtitle = "") +
    labs(x = "Year", y = "Hens & Eggs (%)")

Warning: Removed 11 rows containing missing values (`geom_line()`).

Will merge data via date

ALLdata <- inner_join(cagefreepercentages, eggproduction, by="observed_month")

Warning in inner_join(cagefreepercentages, eggproduction, by = "observed_month"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 11 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

tibble(ALLdata)

# A tibble: 238 × 9
   observed_month percent_hens percent_eggs source.x      prod_type prod_process
   <date>                <dbl>        <dbl> <chr>         <chr>     <chr>       
 1 2016-08-31             10.1         9.63 computed      hatching… all         
 2 2016-08-31             10.1         9.63 computed      table eg… all         
 3 2016-08-31             10.1         9.63 computed      table eg… cage-free (…
 4 2016-08-31             10.1         9.63 computed      table eg… cage-free (…
 5 2016-08-31             12          NA    Egg-Markets-… hatching… all         
 6 2016-08-31             12          NA    Egg-Markets-… table eg… all         
 7 2016-08-31             12          NA    Egg-Markets-… table eg… cage-free (…
 8 2016-08-31             12          NA    Egg-Markets-… table eg… cage-free (…
 9 2016-09-30             10.1         9.56 computed      hatching… all         
10 2016-09-30             10.1         9.56 computed      table eg… all         
# ℹ 228 more rows
# ℹ 3 more variables: n_hens <dbl>, n_eggs <dbl>, source.y <chr>

#Now to avoid confusion remove sources and prod_process "all"

allclean <- ALLdata %>%
  select(observed_month  | percent_hens | n_hens | percent_eggs |  n_eggs | prod_process) %>%
  filter(!prod_process == 'all')

tibble(allclean)

# A tibble: 120 × 6
   observed_month percent_hens   n_hens percent_eggs     n_eggs prod_process    
   <date>                <dbl>    <dbl>        <dbl>      <dbl> <chr>           
 1 2016-08-31             10.1 17000000         9.63 397884291. cage-free (non-…
 2 2016-08-31             10.1 13500000         9.63 315968297. cage-free (orga…
 3 2016-08-31             12   17000000        NA    397884291. cage-free (non-…
 4 2016-08-31             12   13500000        NA    315968297. cage-free (orga…
 5 2016-09-30             10.1 17000000         9.56 383774914. cage-free (non-…
 6 2016-09-30             10.1 13500000         9.56 304762114. cage-free (orga…
 7 2016-10-31             12.3 23500000        11.6  546374469. cage-free (non-…
 8 2016-10-31             12.3 14100000        11.6  327825000  cage-free (orga…
 9 2016-11-30             12.1 23500000        11.4  530864743. cage-free (non-…
10 2016-11-30             12.1 14100000        11.4  318519771. cage-free (orga…
# ℹ 110 more rows

More data visualization

ggplot() +
  geom_point(data = allclean, aes(x = observed_month, y = percent_hens, size = n_hens), fill = 'brown', shape = 21, colour = "black") +
  geom_point(data = allclean, aes(x = observed_month, y = percent_eggs, size = n_eggs), fill = 'gold', shape = 21, colour = "black")  +
  ggtitle("Cage free eggs vs Years", subtitle = "") +
labs(x = "Year", y = "% Eggs and Hens")

Warning: Removed 12 rows containing missing values (`geom_point()`).

ggplot() +
  geom_line(data = allclean, aes(x = observed_month, y = n_hens, colour = prod_process)) +
  ggtitle("Cage free hens", subtitle = "Organic vs non-Organic") +
labs(x = "Year", y = "Hens")

ggplot() +
  geom_line(data = allclean, aes(x = observed_month, y = n_eggs, colour = prod_process)) +
  ggtitle("Cage free egg production", subtitle = "Organic vs non-Organic") +
labs(x = "Year", y = "Eggs")

As the demand for cage free eggs increases so do the number of hens producing them.However, it seems like egg production may be less efficient in organic hens. We are going to explore!

#Fitting a model ##Hypothesis: Non-organic egg production is more efficient (more eggs produced per hen), then organic egg production in cage-free facilities.

###Data Setup

#prepare data for machine learning

set.seed(123)
# Fix the random numbers by setting the seed 

cdata_split <- initial_split(allclean, prop = 2.8/4, strata = n_eggs) #70% training, 30% testing

# Create data frames for the two sets:
train_cdata <- training(cdata_split)
test_cdata  <- testing(cdata_split)

#5-fold cross-validation, 5 times repeated
fold_cdata <- vfold_cv(train_cdata, v = 5, repeats = 5, strata = prod_process)

#Create a recipe for the data and fitting. 
data_recipe <- recipe(n_eggs ~ prod_process, data = train_cdata) %>%
  step_dummy(all_nominal(), -all_outcomes())

Null model performance

null_recipe <- recipe(n_eggs ~ 1, data = train_cdata) %>%
  step_dummy(all_nominal(), -all_outcomes())

# Logistic model recipe
recipe_mod <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

# Model workflow to pair model and recipe 
null_flow <- workflow() %>% 
  add_model(recipe_mod) %>% 
  add_recipe(null_recipe)

#fit the null model to the folds made from the train data set.
null_train <- fit_resamples(null_flow, resamples = fold_cdata)

→ A | warning: There was 1 warning in `dplyr::summarise()`.
               ℹ In argument: `.estimate = metric_fn(truth = n_eggs, estimate = .pred, na_rm =
                 na_rm)`.
               Caused by warning:
               ! A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.

There were issues with some computations   A: x1

There were issues with some computations   A: x7

There were issues with some computations   A: x13

There were issues with some computations   A: x21

There were issues with some computations   A: x25

#Compute the RMSE for both training and test data
Null_Met <- collect_metrics(null_train)
Null_Met

# A tibble: 2 × 6
  .metric .estimator       mean     n   std_err .config             
  <chr>   <chr>           <dbl> <int>     <dbl> <chr>               
1 rmse    standard   391114191.    25 10381628. Preprocessor1_Model1
2 rsq     standard         NaN      0       NA  Preprocessor1_Model1

#RMSE = 391114191

###Test Data with a linear model

lm_flow <- workflow() %>% 
  add_model(recipe_mod) %>% 
  add_recipe(data_recipe)

fit<- lm_flow %>%
  fit(data = train_cdata)

fit %>%
  extract_fit_parsnip() %>%
  tidy()

# A tibble: 2 × 5
  term                                estimate std.error statistic  p.value
  <chr>                                  <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                       990483520. 35900514.      27.6 2.94e-43
2 prod_process_cage.free..organic. -637575077. 50770994.     -12.6 8.55e-21

###Performance metrics

aug_test <- augment(fit, train_cdata)
rmse <- aug_test %>% rmse(truth = n_eggs, .pred)
rsq <- aug_test %>% rsq(truth = n_eggs, .pred)
metrics<- full_join(rmse, rsq)

Joining with `by = join_by(.metric, .estimator, .estimate)`

metrics

# A tibble: 2 × 3
  .metric .estimator     .estimate
  <chr>   <chr>              <dbl>
1 rmse    standard   229875450.   
2 rsq     standard           0.658

#RMSE = 2.298755e+08
#Something is going wrong here

#Residuals?
egg_mod<- lm(n_eggs ~ prod_process, data = train_cdata)
res<- resid(egg_mod)
plot(fitted(egg_mod), res)

#Tree model

#Tune
tune_spec_dtree <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune()) %>%
  set_engine("rpart") %>% 
  set_mode("regression")
tune_spec_dtree

Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = tune()
  tree_depth = tune()

Computational engine: rpart

dtree_wf <- workflow() %>%
  add_model(tune_spec_dtree) %>%
  add_recipe(data_recipe)

#create a regular grid of values to try using some convenience functions 

tree_grid_dtree <-
  grid_regular(
    cost_complexity(), 
    tree_depth(), 
    levels = 5)

tree_grid_dtree

# A tibble: 25 × 2
   cost_complexity tree_depth
             <dbl>      <int>
 1    0.0000000001          1
 2    0.0000000178          1
 3    0.00000316            1
 4    0.000562              1
 5    0.1                   1
 6    0.0000000001          4
 7    0.0000000178          4
 8    0.00000316            4
 9    0.000562              4
10    0.1                   4
# ℹ 15 more rows

#### Tuning using Cross-validation

dtree_resample <- 
  dtree_wf %>% 
  tune_grid(
    resamples = fold_cdata,
    grid = tree_grid_dtree)

dtree_resample %>%
  collect_metrics()

# A tibble: 50 × 8
   cost_complexity tree_depth .metric .estimator      mean     n std_err .config
             <dbl>      <int> <chr>   <chr>          <dbl> <int>   <dbl> <chr>  
 1    0.0000000001          1 rmse    standard     2.35e+8    25 6.29e+6 Prepro…
 2    0.0000000001          1 rsq     standard     6.78e-1    25 1.72e-2 Prepro…
 3    0.0000000178          1 rmse    standard     2.35e+8    25 6.29e+6 Prepro…
 4    0.0000000178          1 rsq     standard     6.78e-1    25 1.72e-2 Prepro…
 5    0.00000316            1 rmse    standard     2.35e+8    25 6.29e+6 Prepro…
 6    0.00000316            1 rsq     standard     6.78e-1    25 1.72e-2 Prepro…
 7    0.000562              1 rmse    standard     2.35e+8    25 6.29e+6 Prepro…
 8    0.000562              1 rsq     standard     6.78e-1    25 1.72e-2 Prepro…
 9    0.1                   1 rmse    standard     2.35e+8    25 6.29e+6 Prepro…
10    0.1                   1 rsq     standard     6.78e-1    25 1.72e-2 Prepro…
# ℹ 40 more rows

Plot

dtree_resample %>%
  collect_metrics() %>%
  mutate(tree_depth = factor(tree_depth)) %>%
  ggplot(aes(cost_complexity, mean, color = tree_depth)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "plasma", begin = .9, end = 0)

dtree_resample %>%
  show_best(n=1)

Warning: No value of `metric` was given; metric 'rmse' will be used.

# A tibble: 1 × 8
  cost_complexity tree_depth .metric .estimator       mean     n std_err .config
            <dbl>      <int> <chr>   <chr>           <dbl> <int>   <dbl> <chr>  
1    0.0000000001          1 rmse    standard   234787516.    25  6.29e6 Prepro…

#these numbers are strange...

best_tree <- dtree_resample %>%
  select_best()

Warning: No value of `metric` was given; metric 'rmse' will be used.

best_tree

# A tibble: 1 × 3
  cost_complexity tree_depth .config              
            <dbl>      <int> <chr>                
1    0.0000000001          1 Preprocessor1_Model01

#Make workflow

dtree_final_wf <- 
  dtree_wf %>% 
  finalize_workflow(best_tree)

dtree_final_fit <- 
  dtree_final_wf %>%
  fit(train_cdata) 

dtree_residuals <- dtree_final_fit %>%
  augment(train_cdata) %>%
  select(c(.pred, n_eggs)) %>%
  mutate(.resid = n_eggs - .pred) 
  
#calculate residuals and make new row.
dtree_residuals

# A tibble: 84 × 3
        .pred     n_eggs     .resid
        <dbl>      <dbl>      <dbl>
 1 352908443. 304762114. -48146329.
 2 352908443. 327825000  -25083443.
 3 352908443. 318519771. -34388672.
 4 352908443. 325639234. -27269209.
 5 352908443. 330010766. -22897678.
 6 352908443. 298074240  -54834203.
 7 352908443. 333754149. -19154295.
 8 352908443. 333754149. -19154295.
 9 352908443. 330689829. -22218615.
10 352908443. 341712823. -11195620.
# ℹ 74 more rows

##Plot Pred vs Actual & Residuals

#Actual
dtree_pred_plot <- ggplot(dtree_residuals, 
                          aes(x = n_eggs, 
                              y = .pred)) + 
  geom_point() + 
  labs(title = "Predictions vs Actual: Decision Tree", 
       x = "Egg Outcome", 
       y = "Egg Prediction")

dtree_pred_plot

#Residuals

dtree_residual_plot <- ggplot(dtree_residuals, 
                              aes(y = .resid, 
                                  x = .pred)) + 
  geom_point() + 
  labs(title = "Predictions vs Residuals: Decision Tree", 
       x = "Egg Prediction", 
       y = "Residuals")

plot(dtree_residual_plot)

Regression Tree

tree <- rpart(n_eggs ~ prod_process, data = train_cdata)

rpart.plot(tree)

p <- predict(tree, train_cdata)

rmse<- sqrt(mean(train_cdata$n_eggs - p))

r2<- (cor(train_cdata$n_eggs,p))

r2

[1] 0.8111138

rmse

[1] 0.0004083695

###Random Forest Model

#detect cores for RFM computation
cores <- parallel::detectCores()
cores

[1] 4

#Specify Model
rf_mod <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger", num.threads = cores) %>% 
  set_mode("regression")

#Crea Wortekflow
rf_wf <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(data_recipe)

#Create Tuning Grid
rf_grid  <- expand.grid(mtry = c(3, 4, 5, 6),
                        min_n = c(40,50,60), 
                        trees = c(500,1000)  )

#Cross-validation
rf_resample <- 
  rf_wf %>% 
  tune_grid(fold_cdata,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(yardstick::rmse))

i Creating pre-processing data to finalize unknown parameter: mtry

rf_resample %>%
  collect_metrics()

# A tibble: 23 × 8
    mtry min_n .metric .estimator       mean     n  std_err .config             
   <int> <int> <chr>   <chr>           <dbl> <int>    <dbl> <chr>               
 1     1    34 rmse    standard   234869869.    25 6236595. Preprocessor1_Model…
 2     1     8 rmse    standard   234791478.    25 6314424. Preprocessor1_Model…
 3     1    30 rmse    standard   234709445.    25 6291365. Preprocessor1_Model…
 4     1    36 rmse    standard   234677565.    25 6304081. Preprocessor1_Model…
 5     1    27 rmse    standard   234944370.    25 6302353. Preprocessor1_Model…
 6     1    11 rmse    standard   234869238.    25 6302126. Preprocessor1_Model…
 7     1     6 rmse    standard   234888556.    25 6333906. Preprocessor1_Model…
 8     1    23 rmse    standard   234716166.    25 6280759. Preprocessor1_Model…
 9     1    19 rmse    standard   234744630.    25 6313443. Preprocessor1_Model…
10     1    15 rmse    standard   234777118.    25 6259570. Preprocessor1_Model…
# ℹ 13 more rows

#select best models
rf_resample %>%
  show_best(n=1)

# A tibble: 1 × 8
   mtry min_n .metric .estimator       mean     n  std_err .config              
  <int> <int> <chr>   <chr>           <dbl> <int>    <dbl> <chr>                
1     1    36 rmse    standard   234677565.    25 6304081. Preprocessor1_Model04

#Selects best performing model
best_rf <- rf_resample %>%
  select_best(method = "rmse")

#RMSE = 234720024 STD_ERR = 6282911

Plot Performance

#Plot of actual train data
rf_resample %>%
  autoplot()

#Final Fit

rf_final_wf <- 
  rf_wf %>% 
  finalize_workflow(best_rf)

rf_final_fit <- 
  rf_final_wf %>%
  fit(train_cdata) 

#Calculate residuals
rf_residuals <- rf_final_fit %>%
  augment(train_cdata) %>% 
  select(c(.pred, n_eggs)) %>%
  mutate(.resid = n_eggs - .pred) 

rf_residuals

# A tibble: 84 × 3
        .pred     n_eggs     .resid
        <dbl>      <dbl>      <dbl>
 1 352715875. 304762114. -47953761.
 2 352715875. 327825000  -24890875.
 3 352715875. 318519771. -34196104.
 4 352715875. 325639234. -27076641.
 5 352715875. 330010766. -22705110.
 6 352715875. 298074240  -54641635.
 7 352715875. 333754149. -18961727.
 8 352715875. 333754149. -18961727.
 9 352715875. 330689829. -22026047.
10 352715875. 341712823. -11003052.
# ℹ 74 more rows

#Model Predictions from Tuned vs Actual Outcomes

rf_pred_plot <- ggplot(rf_residuals, 
                          aes(x = n_eggs, 
                              y = .pred)) + 
  geom_point() + 
  labs(title = "Predictions vs Actual: Random Forest", 
       x = "Eggs Actual", 
       y = "Eggs Prediction")

rf_pred_plot

rf_residual_plot <- ggplot(rf_residuals, 
                              aes(y = .resid, 
                                  x = .pred)) + 
  geom_point() + 
  labs(title = "Predictions vs Residuals: Random Forest", 
       x = "Egg Prediction", 
       y = "Residuals")

plot(rf_residual_plot) #view plot

Final Assignment

These models acted strangely for me. I believe the best performing (i.e., the most normal) was the Tree model.

#Model
dtree_final_fit_TEST <- 
  dtree_final_wf %>%
  fit(test_cdata) 

dtree_residualsT <- dtree_final_fit_TEST %>%
  augment(test_cdata) %>%
  select(c(.pred, n_eggs)) %>%
  mutate(.resid = n_eggs - .pred) 
  
#calculate residuals and make new row.
dtree_residualsT

# A tibble: 36 × 3
        .pred     n_eggs      .resid
        <dbl>      <dbl>       <dbl>
 1 964685623. 397884291. -566801331.
 2 359979620  315968297.  -44011323.
 3 359979620  315968297.  -44011323.
 4 964685623. 383774914. -580910709.
 5 964685623. 542733120  -421952503.
 6 359979620  325639234.  -34340386.
 7 964685623. 542733120  -421952503.
 8 964685623. 550017411. -414668211.
 9 359979620  330689829.  -29289791.
10 964685623. 636840617. -327845006.
# ℹ 26 more rows

##Plot Pred vs Actual & Residuals

#Actual
dtree_pred_plot2 <- ggplot(dtree_residualsT, 
                          aes(x = n_eggs, 
                              y = .pred)) + 
  geom_point() + 
  labs(title = "Predictions vs Actual: Decision Tree", 
       x = "Egg Outcome", 
       y = "Egg Prediction")

dtree_pred_plot2

#Residuals

dtree_residual_plot2 <- ggplot(dtree_residualsT, 
                              aes(y = .resid, 
                                  x = .pred)) + 
  geom_point() + 
  labs(title = "Predictions vs Residuals: Decision Tree", 
       x = "Egg Prediction", 
       y = "Residuals")

plot(dtree_residual_plot2)

dtree_final_fit_TEST

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
n= 36 

node), split, n, deviance, yval
      * denotes terminal node

1) root 36 6.282364e+18 662332600  
  2) prod_process_cage.free..organic.>=0.5 18 2.316908e+16 359979600 *
  3) prod_process_cage.free..organic.< 0.5 18 2.968171e+18 964685600 *

Actually I am unsure…