Remove all variables that have Score, Total, FluA, FluB, Dxname, or Activity in their name. Also remove the variable Unique.Visit. Remove any observations that contain NA values.
The data set now contains 32 variables coding for the presence or absence of a symptom. Only one, body temperature (BodyTemp), is continuous.
flu_clean <- raw_flu %>%
  # Remove all variables (columns) whose names include Score, Total, FluA, FluB, Dxname, or Activity
  select(-contains(c("Score", "Total", "FluA", "FluB", "Dxname", "Activity"))) %>%
  # Remove all columns whose names include Unique.Visit
  select(-contains("Unique.Visit")) %>%
  # Drop all rows that contain NAs
  drop_na() %>%
  # Remove repeated variables (Weakness, Cough, and Myalgia)
  select(-contains(c("WeaknessYN", "CoughYN", "CoughYN2", "MyalgiaYN")))
Remove binary predictors that have fewer than 50 entries in either category (Yes or No):
summary(flu_clean)
 SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion Sneeze
 No :418           No :323         No :130      No :167         No :339
 Yes:312           Yes:407         Yes:600      Yes:563         Yes:391
 Fatigue   SubjectiveFever Headache  Weakness      CoughIntensity
 No : 64   No :230         No :115   None    : 49  None    : 47
 Yes:666   Yes:500         Yes:615   Mild    :223  Mild    :154
                                     Moderate:338  Moderate:357
                                     Severe  :120  Severe  :172
 Myalgia       RunnyNose AbPain    ChestPain Diarrhea  EyePn     Insomnia
 None    : 79  No :211   No :639   No :497   No :631   No :617   No :315
 Mild    :213  Yes:519   Yes: 91   Yes:233   Yes: 99   Yes:113   Yes:415
 Moderate:325
 Severe  :113
 ItchyEye  Nausea    EarPn     Hearing   Pharyngitis Breathless ToothPn
 No :551   No :475   No :568   No :700   No :119     No :436    No :565
 Yes:179   Yes:255   Yes:162   Yes: 30   Yes:611     Yes:294    Yes:165
 Vision    Vomit     Wheeze    BodyTemp
 No :711   No :652   No :510   Min.   : 97.20
 Yes: 19   Yes: 78   Yes:220   1st Qu.: 98.20
                               Median : 98.50
                               Mean   : 98.94
                               3rd Qu.: 99.30
                               Max.   :103.10
# Both Vision and Hearing have fewer than 50 entries for one of Yes/No, so remove them
flu_clean_2 <- flu_clean %>%
  select(-contains(c("Vision", "Hearing")))
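As a cross-check on reading the counts off the summary by eye, the rare binary predictors could also be found programmatically. The following is only a minimal sketch on a small made-up data frame: `toy`, `rare_binary`, and `toy_2` are hypothetical names, and the values are not from the flu data. It assumes that binary predictors are stored as two-level factors, as the summary output suggests.

```r
library(dplyr)

# Made-up data standing in for flu_clean (hypothetical values, 200 rows)
toy <- tibble(
  Vision   = factor(c(rep("No", 190), rep("Yes", 10)), levels = c("No", "Yes")),
  Fatigue  = factor(c(rep("No", 120), rep("Yes", 80)), levels = c("No", "Yes")),
  BodyTemp = seq(97, 103, length.out = 200)
)

# Names of two-level factors whose rarest level has fewer than 50 entries
rare_binary <- toy %>%
  select(where(~ is.factor(.x) && nlevels(.x) == 2)) %>%
  select(where(~ min(table(.x)) < 50)) %>%
  names()

# Drop those columns; multi-level factors and numeric columns are untouched
toy_2 <- toy %>%
  select(-all_of(rare_binary))
```

Restricting the check to two-level factors keeps multi-level variables such as Weakness or CoughIntensity from being dropped just because one of their categories is small.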
Save file into project:
saveRDS(flu_clean_2, file=here("fluanalysis", "data", "SypAct_clean.rds")) #will save as a data.frame in the RDS file
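To confirm that a saved file reads back unchanged, a round-trip check along these lines may help. This is a sketch that uses a temporary file and a made-up data frame (`toy`, `tmp`, and `toy_back` are hypothetical names) rather than the project path used above.

```r
# Made-up data frame standing in for flu_clean_2 (hypothetical values)
toy <- data.frame(
  Fatigue  = factor(c("Yes", "No", "Yes")),
  BodyTemp = c(98.2, 99.3, 103.1)
)

# Write to a temporary .rds file, then read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, file = tmp)
toy_back <- readRDS(tmp)

# TRUE if column classes, factor levels, and values survived the round trip
round_trip_ok <- identical(toy, toy_back)
```

Unlike a CSV export, an RDS file preserves factor levels and column types, so the cleaned data should not need to be re-coded after loading.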