Flu Analysis: Wrangling

Load the data and packages:

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.2
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(here)
Warning: package 'here' was built under R version 4.2.2
here() starts at C:/Users/Raquel/GitHub/MADA/RaquelFrancisco-MADA-portfolio
library(tidymodels)
Warning: package 'tidymodels' was built under R version 4.2.2
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.3     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.4     
Warning: package 'broom' was built under R version 4.2.2
Warning: package 'dials' was built under R version 4.2.2
Warning: package 'infer' was built under R version 4.2.2
Warning: package 'modeldata' was built under R version 4.2.2
Warning: package 'parsnip' was built under R version 4.2.2
Warning: package 'recipes' was built under R version 4.2.2
Warning: package 'rsample' was built under R version 4.2.2
Warning: package 'tune' was built under R version 4.2.2
Warning: package 'workflows' was built under R version 4.2.2
Warning: package 'workflowsets' was built under R version 4.2.2
Warning: package 'yardstick' was built under R version 4.2.2
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
raw_flu <- readRDS(here('fluanalysis/data/SympAct_Any_Pos.Rda'))
tibble(raw_flu) # to get a look at the data
# A tibble: 735 × 63
   DxName1       DxName2 DxName3 DxName4 DxName5 Uniqu…¹ Activ…² Activ…³ Swoll…⁴
   <fct>         <fct>   <fct>   <fct>   <fct>   <chr>     <int> <fct>   <fct>  
 1 Influenza li… <NA>    <NA>    <NA>    <NA>    340_17…      10 10      Yes    
 2 Acute tonsil… Influe… <NA>    <NA>    <NA>    340_17…       6 6       Yes    
 3 Influenza li… Acute … <NA>    <NA>    <NA>    342_17…       2 2       Yes    
 4 Influenza li… Unspec… <NA>    <NA>    <NA>    342_17…       2 2       Yes    
 5 Acute pharyn… Influe… <NA>    <NA>    <NA>    342_17…       5 5       Yes    
 6 Influenza li… <NA>    <NA>    <NA>    <NA>    343_17…       3 3       No     
 7 Fever, unspe… Influe… <NA>    <NA>    <NA>    343_17…       4 4       No     
 8 Acute upper … Impact… <NA>    <NA>    <NA>    344_17…       0 0       No     
 9 Influenza li… Acute … Fever,… Other … Headac… 344_17…       0 0       Yes    
10 Influenza li… <NA>    <NA>    <NA>    <NA>    344_17…       5 5       No     
# … with 725 more rows, 54 more variables: ChestCongestion <fct>,
#   ChillsSweats <fct>, NasalCongestion <fct>, CoughYN <fct>, Sneeze <fct>,
#   Fatigue <fct>, SubjectiveFever <fct>, Headache <fct>, Weakness <fct>,
#   WeaknessYN <fct>, CoughIntensity <fct>, CoughYN2 <fct>, Myalgia <fct>,
#   MyalgiaYN <fct>, RunnyNose <fct>, AbPain <fct>, ChestPain <fct>,
#   Diarrhea <fct>, EyePn <fct>, Insomnia <fct>, ItchyEye <fct>, Nausea <fct>,
#   EarPn <fct>, Hearing <fct>, Pharyngitis <fct>, Breathless <fct>, …

Remove unwanted data columns

+ Feature/Variable removal

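Remove all variables that have Score, Total, FluA, FluB, Dxname, or Activity in their name. Also remove the variable Unique.Visit, then remove any observations with missing values (NA).

After these steps the data set contains 32 variables, most of them coding for the presence or absence of a symptom. Only one, body temperature (BodyTemp), is continuous. The code below also drops the repeated yes/no versions of Weakness, Cough, and Myalgia, which leaves the 28 variables shown in the summary further down.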

flu_clean <- raw_flu %>%
  # Drop every column whose name contains Score, Total, FluA, FluB, Dxname, or Activity
  # (contains() ignores case by default, so "Dxname" also matches DxName1, DxName2, ...)
  select(-contains(c("Score", "Total", "FluA", "FluB", "Dxname", "Activity"))) %>%
  # Drop the Unique.Visit identifier column
  select(-contains("Unique.Visit")) %>%
  # Drop all rows that contain an NA
  drop_na() %>%
  # Drop the repeated yes/no versions of Weakness, Cough, and Myalgia
  select(-contains(c("WeaknessYN", "CoughYN", "CoughYN2", "MyalgiaYN")))
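
To see how this kind of name-based removal behaves, here is a minimal sketch on a small made-up data set. The object `toy` and its values are hypothetical and only mimic a few of the real column names; the point is that `contains()` matches case-insensitively by default and that one pattern such as "CoughYN" also catches "CoughYN2".

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: column names mimic the real ones, values are invented
toy <- tibble(
  DxName1      = c("a", "b", "c"),
  Unique.Visit = c("v1", "v2", "v3"),
  CoughYN      = c("Yes", "No", "Yes"),
  CoughYN2     = c("Yes", "No", "Yes"),
  Sneeze       = c("Yes", NA, "No"),
  BodyTemp     = c(98.6, 99.1, 100.2)
)

toy_clean <- toy %>%
  # "Dxname" matches DxName1 because contains() ignores case by default;
  # "CoughYN" matches both CoughYN and CoughYN2
  select(-contains(c("Dxname", "Unique.Visit", "CoughYN"))) %>%
  # Drop the row with the missing Sneeze value
  drop_na()

names(toy_clean)
nrow(toy_clean)
```

If an exact, case-sensitive match is preferred, `contains()` accepts `ignore.case = FALSE`, and `all_of()` can be used to name columns exactly.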

Remove binary predictors with <50 entries

summary(flu_clean)
 SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion Sneeze   
 No :418           No :323         No :130      No :167         No :339  
 Yes:312           Yes:407         Yes:600      Yes:563         Yes:391  
                                                                         
                                                                         
                                                                         
                                                                         
 Fatigue   SubjectiveFever Headache      Weakness    CoughIntensity
 No : 64   No :230         No :115   None    : 49   None    : 47   
 Yes:666   Yes:500         Yes:615   Mild    :223   Mild    :154   
                                     Moderate:338   Moderate:357   
                                     Severe  :120   Severe  :172   
                                                                   
                                                                   
     Myalgia    RunnyNose AbPain    ChestPain Diarrhea  EyePn     Insomnia 
 None    : 79   No :211   No :639   No :497   No :631   No :617   No :315  
 Mild    :213   Yes:519   Yes: 91   Yes:233   Yes: 99   Yes:113   Yes:415  
 Moderate:325                                                              
 Severe  :113                                                              
                                                                           
                                                                           
 ItchyEye  Nausea    EarPn     Hearing   Pharyngitis Breathless ToothPn  
 No :551   No :475   No :568   No :700   No :119     No :436    No :565  
 Yes:179   Yes:255   Yes:162   Yes: 30   Yes:611     Yes:294    Yes:165  
                                                                         
                                                                         
                                                                         
                                                                         
 Vision    Vomit     Wheeze       BodyTemp     
 No :711   No :652   No :510   Min.   : 97.20  
 Yes: 19   Yes: 78   Yes:220   1st Qu.: 98.20  
                               Median : 98.50  
                               Mean   : 98.94  
                               3rd Qu.: 99.30  
                               Max.   :103.10  
# Both Vision and Hearing have fewer than 50 entries in one category (19 and 30 "Yes", respectively), so remove them

flu_clean_2 <- flu_clean %>%
  select(-contains(c("Vision", "Hearing")))
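
Instead of reading the counts off the summary by eye, the sparse binary predictors could also be found programmatically. The following is a minimal sketch in base R; the helper name `sparse_binary`, its `min_n` argument, and the `toy` data are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: names of two-level factors in which at least one
# level has fewer than `min_n` entries
sparse_binary <- function(data, min_n = 50) {
  is_sparse <- vapply(
    data,
    function(col) {
      is.factor(col) && nlevels(col) == 2 && min(table(col)) < min_n
    },
    logical(1)
  )
  names(data)[is_sparse]
}

# Made-up data to show the behaviour
toy <- data.frame(
  Vision   = factor(c(rep("No", 95), rep("Yes", 5))),   # sparse: 5 "Yes"
  Sneeze   = factor(c(rep("No", 50), rep("Yes", 50))),  # balanced: kept
  Weakness = factor(rep(c("None", "Mild", "Moderate", "Severe"), 25)),  # not binary
  BodyTemp = seq(97, 103, length.out = 100)             # not a factor
)

flagged <- sparse_binary(toy)
flagged
```

Applied to `flu_clean`, the returned names could be passed to `select(-all_of(...))`. Factors with more than two levels, such as Weakness, are deliberately skipped because the rule here only concerns binary predictors.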

Save file into project:

saveRDS(flu_clean_2, file = here("fluanalysis", "data", "SypAct_clean.rds")) # saves the cleaned data frame as an RDS file
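
One reason to save as RDS rather than CSV is that column types, including factor levels, are kept when the file is read back. A small round-trip sketch, using a temporary file and made-up data rather than the project file (the names `toy`, `tmp`, and `restored` are hypothetical):

```r
# Made-up data with a factor and a numeric column
toy <- data.frame(
  Sneeze   = factor(c("Yes", "No", "Yes"), levels = c("No", "Yes")),
  BodyTemp = c(98.6, 99.1, 100.2)
)

# Write to a temporary file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, file = tmp)
restored <- readRDS(tmp)

# The restored object has the same columns, types, and factor levels
identical(toy, restored)
levels(restored$Sneeze)
```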