RNAseqCovarImpute - Impute Covariate Data in RNA Sequencing Studies
The RNAseqCovarImpute package makes linear model analysis
for RNA sequencing read counts compatible with multiple
imputation (MI) of missing covariates. A major problem with
implementing MI in RNA sequencing studies is that the outcome
data must be included in the imputation prediction models to
avoid bias. This is difficult in omics studies because the
outcome data are high-dimensional. The first method we developed in the
RNAseqCovarImpute package surmounts the problem of
high-dimensional outcome data by binning genes into smaller
groups that are analyzed pseudo-independently. This method implements
covariate MI in gene expression studies by 1) randomly binning
genes into smaller groups, 2) creating M imputed datasets
separately within each bin, where the imputation predictor
matrix includes all covariates and the log counts per million
(CPM) for the genes within each bin, 3) estimating gene
expression changes using the `limma::voom` and
`limma::lmFit` functions, separately on each of the M imputed datasets
within each gene bin, 4) un-binning the gene sets and stacking
the M sets of model results, then applying the
`limma::squeezeVar` function, an empirical Bayes variance
shrinkage procedure, to each of the M sets of model results, 5) pooling
the results with Rubin’s rules to produce combined
coefficients, standard errors, and P-values, and 6) adjusting
P-values for multiplicity to control the false discovery rate
(FDR). A faster method uses principal component analysis (PCA)
to avoid binning genes while still retaining outcome
information in the MI models. Binning genes into smaller groups
requires running the MI and limma-voom analysis many times
(typically hundreds). The more computationally efficient MI PCA
method implements covariate MI in gene expression studies by 1)
performing PCA on the log CPM values for all genes using the
Bioconductor `PCAtools` package, 2) creating M imputed datasets
where the imputation predictor matrix includes all covariates
and the retained PCs, with the number of PCs chosen by, e.g., Horn’s
parallel analysis or the number of PCs that account for >80%
of explained variation, 3) running the standard limma-voom
pipeline (`voom`, then `lmFit`, then
`eBayes`) on each of the M imputed datasets, 4) pooling the
results with Rubin’s rules to produce combined coefficients,
standard errors, and P-values, and 5) adjusting P-values for
multiplicity to control the false discovery rate (FDR).
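
Both methods end with the same two steps: pooling with Rubin's rules and adjusting P-values for multiplicity. The sketch below illustrates those steps for a single coefficient. It is not the package's implementation, and it assumes the Benjamini-Hochberg procedure for the FDR adjustment, which the description above does not specify. The function names `pool_rubin` and `bh_adjust` are hypothetical, and Python is used only for illustration. The inputs are assumed to be the M per-imputation coefficient estimates and their squared standard errors. The P-value step, which needs a t reference distribution, is omitted.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one coefficient across M imputed datasets with Rubin's rules.

    estimates: the M per-imputation coefficient estimates.
    variances: the M per-imputation squared standard errors.
    Returns (pooled coefficient, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M >= 2 estimates and one variance per estimate")
    q_bar = sum(estimates) / m  # pooled coefficient
    u_bar = sum(variances) / m  # within-imputation variance
    # Between-imputation variance of the estimates.
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance
    return q_bar, math.sqrt(total)


def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    # Walk from the largest P-value to the smallest, keeping a running
    # minimum so the adjusted values are monotone in the raw values.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(order):
        rank = n - k  # rank of this P-value in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Example with M = 3 imputations for one gene and one coefficient.
coef, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
fdr = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

In this example the between-imputation variance (1.0) is larger than the within-imputation variance (0.5), so the pooled standard error is noticeably larger than any single-imputation standard error. That inflation is how Rubin's rules carry uncertainty about the missing covariate values into the final inference.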