R Coding Exercise

Import data and begin to clean data using Tidyverse

#If needed, install the package that has our data (dslabs) and the package that will help us clean the data (tidyverse)
#install.packages("dslabs")
#install.packages("tidyverse")
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.2
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
Warning: package 'dslabs' was built under R version 4.2.2
#look at help file for gapminder data
#?gapminder
#get an overview of data structure
str(gapminder)
'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
#get a summary of data
summary(gapminder)
                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586  
#determine the type of object gapminder is
class(gapminder)
[1] "data.frame"
#view the data as a tibble; as_tibble() returns a new object, so this only prints it and Data itself stays a data.frame
Data <- gapminder
as_tibble(Data)
# A tibble: 10,545 × 9
   country   year infant_mortality life_expectancy fertility population      gdp
   <fct>    <int>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
 1 Albania   1960            115.             62.9      6.19    1636054 NA      
 2 Algeria   1960            148.             47.5      7.65   11124892  1.38e10
 3 Angola    1960            208              36.0      7.32    5270844 NA      
 4 Antigua…  1960             NA              63.0      4.43      54681 NA      
 5 Argenti…  1960             59.9            65.4      3.11   20619075  1.08e11
 6 Armenia   1960             NA              66.9      4.55    1867396 NA      
 7 Aruba     1960             NA              65.7      4.82      54208 NA      
 8 Austral…  1960             20.3            70.9      3.45   10292328  9.67e10
 9 Austria   1960             37.3            68.8      2.7     7065525  5.24e10
10 Azerbai…  1960             NA              61.3      5.57    3897889 NA      
# ℹ 10,535 more rows
# ℹ 2 more variables: continent <fct>, region <fct>

Keep only the Africa data

africadata <- Data %>%
  filter(continent == "Africa")
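
One side effect worth knowing about: filtering rows does not remove unused factor levels, so `country` in africadata presumably still carries all 185 levels reported by str(gapminder). The sketch below uses a small made-up data frame (the `toy` object and its values are hypothetical, not gapminder data) and base R subsetting in place of filter() to show the behavior and how droplevels() trims it.

```r
# Hypothetical toy data standing in for gapminder; values are made up.
toy <- data.frame(
  country   = factor(c("Algeria", "Angola", "Albania", "Austria")),
  continent = factor(c("Africa", "Africa", "Europe", "Europe"))
)

# Base-R counterpart of filter(continent == "Africa");
# the two agree here because continent has no missing values
toy_africa <- toy[toy$continent == "Africa", ]

nrow(toy_africa)              # number of rows kept
nlevels(toy_africa$country)   # unused country levels are still attached

# droplevels() removes factor levels that no longer occur in the data
toy_africa_clean <- droplevels(toy_africa)
nlevels(toy_africa_clean$country)
```

This matters mostly for plots and tables that enumerate factor levels; the row contents are the same either way.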

Subset and review new data

deadbabies <- africadata %>%
  select(infant_mortality, life_expectancy)

str(deadbabies)
'data.frame':   2907 obs. of  2 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
summary(deadbabies)
 infant_mortality life_expectancy
 Min.   : 11.40   Min.   :13.20  
 1st Qu.: 62.20   1st Qu.:48.23  
 Median : 93.40   Median :53.98  
 Mean   : 95.12   Mean   :54.38  
 3rd Qu.:124.70   3rd Qu.:60.10  
 Max.   :237.40   Max.   :77.60  
 NA's   :226                     
popafrica <- africadata %>%
  select(population, life_expectancy)

str(popafrica)
'data.frame':   2907 obs. of  2 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...
summary(popafrica)
   population        life_expectancy
 Min.   :    41538   Min.   :13.20  
 1st Qu.:  1605232   1st Qu.:48.23  
 Median :  5570982   Median :53.98  
 Mean   : 12235961   Mean   :54.38  
 3rd Qu.: 13888152   3rd Qu.:60.10  
 Max.   :182201962   Max.   :77.60  
 NA's   :51                         

Plot Data

Population Vs Life Exp

#Pop on x-axis
plot1c <- ggplot(data = africadata) +
  geom_point(mapping = aes(x = population, y = life_expectancy, color = country)) +
  scale_x_continuous(trans = 'log10')

plot1y <- ggplot(data = africadata) +
  geom_point(mapping = aes(x = population, y = life_expectancy, color = year)) +
  scale_x_continuous(trans = 'log10')

plot1c
Warning: Removed 51 rows containing missing values (`geom_point()`).

plot1y
Warning: Removed 51 rows containing missing values (`geom_point()`).
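
The log10 x-axis is used because population spans several orders of magnitude. A quick check, using the minimum and maximum copied from the summary(popafrica) output (the variable names here are only for illustration):

```r
# Population range for the Africa subset, copied from summary(popafrica)
pop_min <- 41538
pop_max <- 182201962

# Ratio between the largest and the smallest population
pop_ratio <- pop_max / pop_min

# Number of orders of magnitude the x-axis has to cover
orders_of_magnitude <- log10(pop_ratio)
orders_of_magnitude
```

On a linear axis, a range this wide would squeeze most countries against the left edge of the plot.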

Inf Mortality Vs Life Exp

plot2y <- ggplot(data = africadata) +
  geom_point(mapping = aes(x = life_expectancy, y = infant_mortality, color = year)) +
  scale_x_continuous(trans = 'log10')

plot2c <- ggplot(data = africadata) +
  geom_point(mapping = aes(x = life_expectancy, y = infant_mortality, color = country)) +
  scale_x_continuous(trans = 'log10')

plot2c
Warning: Removed 226 rows containing missing values (`geom_point()`).

plot2y
Warning: Removed 226 rows containing missing values (`geom_point()`).

Finding all the NAs (i.e., the missing data)

africadata %>%
  summarise(count = sum(is.na(infant_mortality)))
  count
1   226
africadata %>%
  summarise(count = sum(is.na(life_expectancy))) 
  count
1     0
africadata %>%
  summarise(count = sum(is.na(population))) 
  count
1    51
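
The three separate counts can also be obtained in one call by counting NAs per column. A minimal sketch on a made-up data frame (the `toy` values are hypothetical); on the real data the corresponding call would be colSums(is.na(africadata)):

```r
# Hypothetical toy data with a few missing values; not gapminder data
toy <- data.frame(
  infant_mortality = c(148.2, NA, 186.9, NA),
  life_expectancy  = c(47.5, 36.0, 38.3, 50.3),
  population       = c(11124892, 5270844, NA, 524029)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```

The result is a named vector with one NA count per column, which makes it easy to see at a glance which variables are incomplete.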

Now Filter 2000 data from africadata

africa2000data <- africadata %>%
  filter(year == 2000)
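
Since year is stored as an integer (see the str(gapminder) output), comparing it with the number 2000 and with the string '2000' selects the same rows, but the string form only works because R coerces the years to character before comparing. A small sketch with a made-up vector (`years` is hypothetical):

```r
# Hypothetical integer years, same type as gapminder$year
years <- c(1999L, 2000L, 2001L, 2000L)

numeric_match <- years == 2000     # numeric comparison
string_match  <- years == "2000"   # years are coerced to character first

identical(numeric_match, string_match)

# Because the second form is a string comparison, a differently
# formatted string for the same number does not match anything
padded_match <- years == "2000.0"
any(padded_match)
```

Comparing against a number avoids depending on how the value happens to be formatted as text.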

Plot 2000 data

2000: Population Vs Life Exp

plot3c <- ggplot(data = africa2000data) +
  geom_point(mapping = aes(x = population, y = life_expectancy, color = country)) +
  scale_x_continuous(trans = 'log10')

plot3c

2000: Inf Mort Vs Life Exp

plot4c <- ggplot(data = africa2000data) + 
  geom_point(mapping = aes(x = life_expectancy, y = infant_mortality, color=country))

plot4c

Simple Stats to Evaluate Data

fit1 <- lm(infant_mortality ~ life_expectancy, africa2000data)
fit2 <- lm(population ~ life_expectancy, africa2000data)

summary(fit1)

Call:
lm(formula = infant_mortality ~ life_expectancy, data = africa2000data)

Residuals:
    Min      1Q  Median      3Q     Max 
-67.262  -9.806  -1.891  12.460  52.285 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     219.0135    21.4781  10.197 1.05e-13 ***
life_expectancy  -2.4854     0.3769  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.55 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08
summary(fit2)

Call:
lm(formula = population ~ life_expectancy, data = africa2000data)

Residuals:
      Min        1Q    Median        3Q       Max 
-18308728 -12957963  -6425955   2079794 107435285 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      5074933   21193712   0.239    0.812
life_expectancy   187799     371938   0.505    0.616

Residual standard error: 22250000 on 49 degrees of freedom
Multiple R-squared:  0.005176,  Adjusted R-squared:  -0.01513 
F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159
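
To make the fit1 output more concrete, the sketch below plugs the reported coefficients into the fitted line and checks two identities that hold for a model with a single predictor. All numbers are copied from the summary(fit1) printout; the life expectancy of 60 is an arbitrary value chosen for illustration.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 219.0135
slope     <- -2.4854

# Predicted infant mortality at a life expectancy of 60 years
# (60 is an arbitrary illustration value)
pred_60 <- intercept + slope * 60
pred_60

# With one predictor, the squared t value of the slope equals the F-statistic
t_value  <- -6.594
f_from_t <- t_value^2

# R-squared can be recovered from F and the residual degrees of freedom
f_stat    <- 43.48
df_resid  <- 49
r2_from_f <- f_stat / (f_stat + df_resid)

c(f_from_t = f_from_t, r2_from_f = r2_from_f)
```

With the fitted objects available, coef(fit1), summary(fit1)$r.squared, and predict(fit1, newdata = data.frame(life_expectancy = 60)) give these quantities directly without copying numbers by hand.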

Visualizing Stats

2000: Inf Mort Vs Life Exp

ggplot(africa2000data, aes(x = life_expectancy, y = infant_mortality)) + geom_point() +
  stat_smooth(method = "lm", col = "red")
`geom_smooth()` using formula = 'y ~ x'

2000: Population Vs Life Exp

ggplot(africa2000data, aes(x = life_expectancy, y = population)) + geom_point() +
  stat_smooth(method = "lm", col = "red")
`geom_smooth()` using formula = 'y ~ x'