Flu Analysis: Fitting

Load the data and packages:

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.2.2

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

Warning: package 'stringr' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidymodels) #build models

Warning: package 'tidymodels' was built under R version 4.2.2

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.3     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.4

Warning: package 'broom' was built under R version 4.2.2

Warning: package 'dials' was built under R version 4.2.2

Warning: package 'infer' was built under R version 4.2.2

Warning: package 'modeldata' was built under R version 4.2.2

Warning: package 'parsnip' was built under R version 4.2.2

Warning: package 'recipes' was built under R version 4.2.2

Warning: package 'rsample' was built under R version 4.2.2

Warning: package 'tune' was built under R version 4.2.2

Warning: package 'workflows' was built under R version 4.2.2

Warning: package 'workflowsets' was built under R version 4.2.2

Warning: package 'yardstick' was built under R version 4.2.2

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages

library(here) #help read/import data

Warning: package 'here' was built under R version 4.2.2

here() starts at C:/Users/Raquel/GitHub/MADA/RaquelFrancisco-MADA-portfolio

library(ggplot2) #data visualization
library(broom.mixed) # for converting bayesian models to tidy tibbles

Warning: package 'broom.mixed' was built under R version 4.2.2

library(dotwhisker)  # for visualizing regression results

Warning: package 'dotwhisker' was built under R version 4.2.2

library(performance) #evaluate model fit and performance

Warning: package 'performance' was built under R version 4.2.2


Attaching package: 'performance'

The following objects are masked from 'package:yardstick':

    mae, rmse

data <- readRDS(here('fluanalysis/data/SypAct_clean.rds')) #upload cleaned data
tibble(data) # to get a look at the data

# A tibble: 730 × 26
   SwollenLymph…¹ Chest…² Chill…³ Nasal…⁴ Sneeze Fatigue Subje…⁵ Heada…⁶ Weakn…⁷
   <fct>          <fct>   <fct>   <fct>   <fct>  <fct>   <fct>   <fct>   <fct>  
 1 Yes            No      No      No      No     Yes     Yes     Yes     Mild   
 2 Yes            Yes     No      Yes     No     Yes     Yes     Yes     Severe 
 3 Yes            Yes     Yes     Yes     Yes    Yes     Yes     Yes     Severe 
 4 Yes            Yes     Yes     Yes     Yes    Yes     Yes     Yes     Severe 
 5 Yes            No      Yes     No      No     Yes     Yes     Yes     Modera…
 6 No             No      Yes     No      Yes    Yes     Yes     Yes     Modera…
 7 No             No      Yes     No      No     Yes     Yes     No      Mild   
 8 No             Yes     Yes     Yes     Yes    Yes     Yes     Yes     Severe 
 9 Yes            Yes     Yes     Yes     No     Yes     Yes     Yes     Modera…
10 No             Yes     No      Yes     No     Yes     No      Yes     Modera…
# … with 720 more rows, 17 more variables: CoughIntensity <fct>, Myalgia <fct>,
#   RunnyNose <fct>, AbPain <fct>, ChestPain <fct>, Diarrhea <fct>,
#   EyePn <fct>, Insomnia <fct>, ItchyEye <fct>, Nausea <fct>, EarPn <fct>,
#   Pharyngitis <fct>, Breathless <fct>, ToothPn <fct>, Vomit <fct>,
#   Wheeze <fct>, BodyTemp <dbl>, and abbreviated variable names
#   ¹SwollenLymphNodes, ²ChestCongestion, ³ChillsSweats, ⁴NasalCongestion,
#   ⁵SubjectiveFever, ⁶Headache, ⁷Weakness

str(data)

'data.frame':   730 obs. of  26 variables:
 $ SwollenLymphNodes: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 2 1 ...
  ..- attr(*, "label")= chr "Swollen Lymph Nodes"
 $ ChestCongestion  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
  ..- attr(*, "label")= chr "Chest Congestion"
 $ ChillsSweats     : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 2 2 1 ...
  ..- attr(*, "label")= chr "Chills/Sweats"
 $ NasalCongestion  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
  ..- attr(*, "label")= chr "Nasal Congestion"
 $ Sneeze           : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 1 2 1 1 ...
  ..- attr(*, "label")= chr "Sneeze"
 $ Fatigue          : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "label")= chr "Fatigue"
 $ SubjectiveFever  : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
  ..- attr(*, "label")= chr "Subjective Fever"
 $ Headache         : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
  ..- attr(*, "label")= chr "Headache"
 $ Weakness         : Factor w/ 4 levels "None","Mild",..: 2 4 4 4 3 3 2 4 3 3 ...
 $ CoughIntensity   : Factor w/ 4 levels "None","Mild",..: 4 4 2 3 1 3 4 3 3 3 ...
  ..- attr(*, "label")= chr "Cough Severity"
 $ Myalgia          : Factor w/ 4 levels "None","Mild",..: 2 4 4 4 2 3 2 4 3 2 ...
 $ RunnyNose        : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 2 2 2 2 ...
  ..- attr(*, "label")= chr "Runny Nose"
 $ AbPain           : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Abdominal Pain"
 $ ChestPain        : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 2 1 1 1 ...
  ..- attr(*, "label")= chr "Chest Pain"
 $ Diarrhea         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ EyePn            : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Eye Pain"
 $ Insomnia         : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 1 2 2 2 ...
  ..- attr(*, "label")= chr "Sleeplessness"
 $ ItchyEye         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Itchy Eyes"
 $ Nausea           : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 1 1 2 2 ...
 $ EarPn            : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Ear Pain"
 $ Pharyngitis      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 1 1 ...
  ..- attr(*, "label")= chr "Sore Throat"
 $ Breathless       : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 2 ...
  ..- attr(*, "label")= chr "Breathlessness"
 $ ToothPn          : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 2 1 ...
  ..- attr(*, "label")= chr "Tooth Pain"
 $ Vomit            : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
  ..- attr(*, "label")= chr "Vomiting"
 $ Wheeze           : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 1 ...
  ..- attr(*, "label")= chr "Wheezing"
 $ BodyTemp         : num  98.3 100.4 100.8 98.8 100.5 ...

Code will include:

Fitting a linear model to the continuous outcome (Body temperature) using only the main predictor of interest.
Fitting another linear model to the continuous outcome using all (important) predictors of interest.
Comparing the model results for the model with just the main predictor and all predictors.
Fitting a logistic model to the categorical outcome (Nausea) using only the main predictor of interest.
Fitting another logistic model to the categorical outcome using all (important) predictors of interest.
Compares the model results for the categorical model with just the main predictor and all predictors.

Linear Model: Body Temperature vs. Myalgia

#plot suspect interaction
ggplot(data,
       aes(x = Myalgia, y = BodyTemp)) + 
  geom_boxplot()

lm_mod <- linear_reg() #note the default is lm(), thus we do not need to "set" the computational engine

lm_fit1 <- lm_mod %>% 
  fit(BodyTemp ~ Myalgia, data = data)

tidy(lm_fit1) #several significant results

# A tibble: 4 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)       98.7       0.134    734.    0     
2 MyalgiaMild        0.322     0.157      2.04  0.0413
3 MyalgiaModerate    0.271     0.150      1.81  0.0711
4 MyalgiaSevere      0.404     0.175      2.31  0.0214

tidy(lm_fit1) %>% ##help visualize model with dot-whisker plot
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))

Linear Model: Body Temperature vs. All Variables

lm_fit_all1 <- lm_mod %>% 
  fit(BodyTemp ~ ., data = data)

tidy(lm_fit_all1) #several significant results

# A tibble: 32 × 5
   term                 estimate std.error statistic   p.value
   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)          97.9        0.304   323.     0        
 2 SwollenLymphNodesYes -0.166      0.0920   -1.81   0.0714   
 3 ChestCongestionYes    0.103      0.0971    1.06   0.291    
 4 ChillsSweatsYes       0.192      0.127     1.51   0.131    
 5 NasalCongestionYes   -0.222      0.113    -1.95   0.0512   
 6 SneezeYes            -0.373      0.0980   -3.80   0.000156 
 7 FatigueYes            0.273      0.161     1.70   0.0901   
 8 SubjectiveFeverYes    0.437      0.103     4.25   0.0000244
 9 HeadacheYes           0.00498    0.125     0.0397 0.968    
10 WeaknessMild          0.00552    0.189     0.0292 0.977    
# … with 22 more rows

tidy(lm_fit_all1) %>% ##help visualize regression with dot-whisker plot w/ 95% CI
  dwplot(dot_args = list(size = 2, color = "black"), #Coefficient Estimates
         whisker_args = list(color = "black"), #CI
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))

Model Output Comparison

#Check the goodness of "fit" with these 2 lm() models
check_model(lm_fit1$fit) #allows for streamlined way to look at QQ plot, normality, etc

check_model(lm_fit_all1$fit)

Variable `Component` is not in your data frame :/

#Compare output
#Visually
dwplot(list(Myalgia = lm_fit1, AllVariables = lm_fit_all1)) #compare dwplots

glance(lm_fit1) %>%
  dplyr::select(adj.r.squared, AIC, BIC, p.value)

# A tibble: 1 × 4
  adj.r.squared   AIC   BIC p.value
          <dbl> <dbl> <dbl>   <dbl>
1       0.00387 2337. 2360.   0.121

glance(lm_fit_all1) %>%
  dplyr::select(adj.r.squared, AIC, BIC, p.value)

# A tibble: 1 × 4
  adj.r.squared   AIC   BIC      p.value
          <dbl> <dbl> <dbl>        <dbl>
1        0.0853 2302. 2453. 0.0000000247

compare_performance(lm_fit1,lm_fit_all1) #better way

# Comparison of Model Performance Indices

Name        | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
----------------------------------------------------------------------------------------------------------
lm_fit1     |   _lm | 2336.6 (<.001) | 2336.6 (<.001) | 2359.5 (>.999) | 0.008 |     0.004 | 1.191 | 1.194
lm_fit_all1 |   _lm | 2301.6 (>.999) | 2304.8 (>.999) | 2453.1 (<.001) | 0.124 |     0.085 | 1.119 | 1.144

#Via ANOVA
anova(lm_fit1$fit, lm_fit_all1$fit)

Analysis of Variance Table

Model 1: BodyTemp ~ Myalgia
Model 2: BodyTemp ~ SwollenLymphNodes + ChestCongestion + ChillsSweats + 
    NasalCongestion + Sneeze + Fatigue + SubjectiveFever + Headache + 
    Weakness + CoughIntensity + Myalgia + RunnyNose + AbPain + 
    ChestPain + Diarrhea + EyePn + Insomnia + ItchyEye + Nausea + 
    EarPn + Pharyngitis + Breathless + ToothPn + Vomit + Wheeze
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    726 1035.07                                  
2    698  913.79 28    121.28 3.3086 3.294e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic Model: Nausea vs. Myalgia

#plot suspect interaction
ggplot(data, aes(x = Myalgia, y = Nausea)) + 
  geom_count()

glm_mod <- logistic_reg(mode = "classification",
  engine = "glm",
  penalty = NULL,
  mixture = NULL) #define mode so it is a glm

glm_fit1 <- glm_mod %>% 
  fit(Nausea ~ Myalgia, data = data)

tidy(glm_fit1) #several significant results

# A tibble: 4 × 5
  term            estimate std.error statistic     p.value
  <chr>              <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)       -1.37      0.280    -4.90  0.000000980
2 MyalgiaMild        0.291     0.321     0.905 0.366      
3 MyalgiaModerate    0.926     0.302     3.07  0.00217    
4 MyalgiaSevere      1.42      0.337     4.22  0.0000244

tidy(glm_fit1) %>% ##help visualize model with dot-whisker plot
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))

Logistic Model: Nausea vs. All Variables

glm_fit_all1 <- glm_mod %>% 
  fit(Nausea ~ ., data = data)

tidy(glm_fit_all1) #several significant results

# A tibble: 32 × 5
   term                 estimate std.error statistic p.value
   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)             0.207     7.81     0.0265  0.979 
 2 SwollenLymphNodesYes   -0.247     0.196   -1.26    0.207 
 3 ChestCongestionYes      0.265     0.210    1.26    0.207 
 4 ChillsSweatsYes         0.291     0.287    1.01    0.311 
 5 NasalCongestionYes      0.440     0.253    1.74    0.0822
 6 SneezeYes               0.165     0.210    0.787   0.431 
 7 FatigueYes              0.231     0.371    0.622   0.534 
 8 SubjectiveFeverYes      0.255     0.223    1.14    0.253 
 9 HeadacheYes             0.337     0.285    1.18    0.236 
10 WeaknessMild           -0.120     0.447   -0.269   0.788 
# … with 22 more rows

tidy(glm_fit_all1) %>% ##help visualize regression with dot-whisker plot w/ 95% CI
  dwplot(dot_args = list(size = 2, color = "black"), #Coefficient Estimates
         whisker_args = list(color = "black"), #CI
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))

Model Output Comparison

#Check the goodness of "fit" with these 2 lm() models
check_model(glm_fit1$fit) #QQ plot seems off, residuals may be an issue

check_model(glm_fit_all1$fit)

Variable `Component` is not in your data frame :/

#Compare output
#Visually
dwplot(list(Myalgia = glm_fit1, AllVariables = glm_fit_all1)) #compare dwplots

glance(glm_fit1) %>%
  dplyr::select( AIC, BIC)

# A tibble: 1 × 2
    AIC   BIC
  <dbl> <dbl>
1  920.  939.

glance(glm_fit_all1) %>%
  dplyr::select( AIC, BIC) #better fit

# A tibble: 1 × 2
    AIC   BIC
  <dbl> <dbl>
1  816.  963.

compare_performance(glm_fit1,glm_fit_all1)

# Comparison of Model Performance Indices

Name         | Model | AIC (weights) | AICc (weights) | BIC (weights) | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log | Score_spherical |   PCP
--------------------------------------------------------------------------------------------------------------------------------------------------
glm_fit1     |  _glm | 920.3 (<.001) |  920.3 (<.001) | 938.7 (>.999) |     0.044 | 0.466 | 1.121 |    0.625 |  -110.929 |           0.006 | 0.565
glm_fit_all1 |  _glm | 816.1 (>.999) |  819.2 (>.999) | 963.1 (<.001) |     0.246 | 0.415 | 1.038 |    0.515 |      -Inf |           0.002 | 0.657

#Via ANOVA
anova(glm_fit1$fit, glm_fit_all1$fit)

Analysis of Deviance Table

Model 1: Nausea ~ Myalgia
Model 2: Nausea ~ SwollenLymphNodes + ChestCongestion + ChillsSweats + 
    NasalCongestion + Sneeze + Fatigue + SubjectiveFever + Headache + 
    Weakness + CoughIntensity + Myalgia + RunnyNose + AbPain + 
    ChestPain + Diarrhea + EyePn + Insomnia + ItchyEye + EarPn + 
    Pharyngitis + Breathless + ToothPn + Vomit + Wheeze + BodyTemp
  Resid. Df Resid. Dev Df Deviance
1       726     912.28            
2       698     752.13 28   160.15