FFCRegressionImputation
This vignette demonstrates the integration of the ffc-data-processing
methods with the imputation methods from the FFCRegressionImputation
package by Anna Filippova.
First, install the FFCRegressionImputation package from GitHub using devtools.
if (!"devtools" %in% installed.packages()) install.packages("devtools")
if (!"FFCRegressionImputation" %in% installed.packages()) {
devtools::install_github("annafil/FFCRegressionImputation")
}
library(FFCRegressionImputation)
Then, source the required packages and functions from ffc-data-processing.
source("init.R")
First, we use the Stata background.dta file to extract information about variable types. For this vignette, only the constructed variables are treated. Subsetting the variables makes later steps computationally feasible.
This function chain is very similar to the code in the vignette Example 2, but broken out into individual steps.
The results of the data processing are summarized below. Categorical variables are now factors, and continuous variables (as well as the challengeID identifier) are numeric.
# get covariate variable type information from Stata file
background_dta <- read_dta("data/background.dta")
background_variable_types <-
  background_dta %>%
  subset_vars_keep(get_vars_constructed) %>%
  recode_na_character() %>%
  labelled_to_factor() %>%
  labelled_to_numeric() %>%
  character_to_factor() %>%
  character_to_numeric() %>%
  # use a less conservative threshold than default
  character_to_factor_or_numeric(threshold = 29)
summarize_variable_classes(background_variable_types)
## factor numeric
## 269 79
Using this processed data set, we store the names of the categorical and continuous variables separately.
categorical_vars <- get_vars_categorical(background_variable_types)
continuous_vars <- get_vars_continuous(background_variable_types)
It’s important to check that we’ve classified the variables appropriately; we can do this using the metadata provided by the summarize_variables function.
variable_metadata <-
  background_dta %>%
  subset_vars_keep(get_vars_constructed) %>%
  summarize_variables(background_variable_types) %>%
  arrange(variable_type, desc(unique_values))
variable_metadata %>% head(10) %>% print(width = 91)
## # A tibble: 10 x 4
## variable label unique_values
## <chr> <chr> <dbl>
## 1 cf4age Constructed - Father's age (years) 41
## 2 cm4age Constructed - Mother's age (years) 34
## 3 cm3amrf Constructed - Mother age when married father (years) 26
## 4 cm4amrf Constructed - Mother age when married father (years) 26
## 5 cm3alvf Constructed - Mother age when started living with father (years) 24
## 6 cm4alvf Constructed - Mother age when started living with father (years) 24
## 7 cm4b_age Constructed - Child age at time of mother interview (months) 17
## 8 cf4b_age Constructed - Child age at time of father interview (months) 17
## 9 cm2relf Constructed - Mother's romantic relationship w/BF at one-year 10
## 10 cm3relf Constructed - Mother relationship with father at year three 10
## # ... with 1 more variables: variable_type <chr>
It turns out that some censored continuous variables are converted to factors instead of numerics. For this example, we want to discard the information about censoring for the first 8 variables shown above. In other words, we are going to treat someone who is labelled as “20 and younger” as if we knew they were exactly 20 years old.
We do this by removing those variable names from the categorical_vars list, and adding them to the continuous_vars list.
censored_vars <- variable_metadata$variable[1:8]
categorical_vars <-
  categorical_vars[!categorical_vars %in% censored_vars]
continuous_vars <- c(continuous_vars, censored_vars)
We now have 261 categorical variables and 86 continuous variables, in addition to the challengeID identifier.
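As a quick optional check (not part of the original pipeline), these counts can be confirmed directly from the two name vectors:
# quick check of the variable counts after reclassifying the censored variables
length(categorical_vars)
length(continuous_vars)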
These steps make use of functions from FFCRegressionImputation to do logical and regression-based imputation.
The initialization function from FFCRegressionImputation reads in and operates on the background.csv file rather than the Stata file. That isn’t a problem, because we’ve already retrieved and stored the variable type information we need from the background.dta file.
The function logically imputes some missing values for age-related variables. It also generates variables with NA count information, but these aren’t included in subsequent steps of this vignette.
# run initial cleaning and imputation for all variables on csv file
background_csv <- initImputation(data = "data/background.csv")
## Importing data...
## Run logical imputation...
## [1] "Generating refusalcount, dontknowcount, nacount..."
## [1] "Running logical age imputation ... "
## [1] "Done with logical age imputation!"
## Drop missing data...
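If you want to look at the count variables announced in the log above, something like the following should work, assuming they are stored as columns with the names shown in the log:
# optional: peek at the count variables reported in the log above
# (assumes columns named refusalcount, dontknowcount, nacount exist)
background_csv %>%
  select(challengeID, refusalcount, dontknowcount, nacount) %>%
  head()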
Next, use the previously calculated information about which variables are categorical and which are continuous to split the new data frame into two sets of constructed variables. In this vignette, we will only run regression-based imputation on the continuous variables.
Note that initImputation renames three constructed variables by prefixing them with c-. These variables are included manually in get_vars_constructed and are not renamed there, so the new names need to be explicitly included here.
# split csv file using variable type information
background_categorical <-
  background_csv %>%
  select(challengeID, one_of(categorical_vars), co5oint, ct5int)
background_continuous <-
  background_csv %>%
  select(challengeID, one_of(continuous_vars), cn5d2_age)
# not used, included for illustrative purposes only
background_other <-
  background_csv %>%
  select(-one_of(continuous_vars, categorical_vars))
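As an optional sanity check before moving on, you can confirm the sizes of the two subsets used below:
# optional: confirm the dimensions of the categorical and continuous subsets
dim(background_categorical)
dim(background_continuous)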
The corMatrix function computes correlations between covariates and returns an object containing both the correlation matrix and a data frame for use in subsequent imputation.
This is the most computationally intensive step, and subsetting the background covariates down to the constructed continuous covariates speeds it up considerably. It can also be run in parallel, as it is here.
# impute numeric covariates using regression imputation
output <- corMatrix(data = background_continuous, parallel = 1)
## Enabling parallelization
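If you want to see the two components referenced in the next step, you can inspect the returned object directly:
# top-level structure of the corMatrix() result ($df and $corMatrix are used below)
str(output, max.level = 1)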
regImputation uses the most correlated predictors to impute missing values in the given data frame. Here, we impute only the continuous variables.
By default, this uses an OLS linear model; it can run a LASSO with the polywog package instead.
background_continuous_imputed <-
  regImputation(output$df, output$corMatrix, parallel = 1)
## Using lm...
## Enabling parallelization
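To gauge how much missingness remains in the imputed continuous data, a simple count is enough (an optional check, not part of the original workflow):
# count remaining missing values in the imputed continuous data
sum(is.na(background_continuous_imputed))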
Now, we can join the categorical variables we set aside and the newly imputed continuous variables back together. For consistency’s sake, we rearrange them into the original order.
Some variables have been dropped due to lack of variation, and these now generate warnings which can be ignored.
# merge continuous and categorical variables back together
background_imputed <-
  full_join(background_categorical, background_continuous_imputed,
            by = "challengeID") %>%
  # put variables in order
  select(challengeID, one_of(get_vars_constructed(background_csv)))
## Warning: Unknown variables: `cm3alvf`
Finally, read in and merge the outcome data. (merge_train is a wrapper around dplyr::full_join with special handling for the joining variable challengeID.)
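For intuition only, a wrapper along these lines would reproduce the basic behaviour described above; this is an illustrative sketch, not the actual merge_train() source:
# illustrative sketch only -- not the actual merge_train() implementation
merge_train_sketch <- function(background, train) {
  stopifnot("challengeID" %in% names(background),
            "challengeID" %in% names(train))
  dplyr::full_join(background, train, by = "challengeID") %>%
    dplyr::arrange(challengeID)
}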
# merge background data with train
train <- read_csv("data/train.csv")
## Parsed with column specification:
## cols(
## challengeID = col_integer(),
## gpa = col_double(),
## grit = col_double(),
## materialHardship = col_double(),
## eviction = col_integer(),
## layoff = col_integer(),
## jobTraining = col_integer()
## )
ffc <- merge_train(background_imputed, train)
To build a model of these outcomes, you would also want to handle the missing values in the categorical variables, either by treating them identically to the continuous variables and running them through regImputation, or by using some other imputation strategy.
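As one example of such a strategy, the sketch below imputes each categorical variable with its most common observed level. This is a minimal illustration rather than a recommendation, and it assumes the categorical columns are factors or character vectors:
# illustrative sketch: mode imputation for the categorical variables
impute_mode <- function(x) {
  if (all(is.na(x))) return(x)
  most_common <- names(sort(table(x), decreasing = TRUE))[1]
  x[is.na(x)] <- most_common
  x
}
background_categorical_imputed <-
  background_categorical %>%
  mutate_at(vars(-challengeID), impute_mode)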