scPower optimizes the design of scRNA-Seq experiments

scPower is an R package for power analysis and design of single cell transcriptomics experiments for differential expression (DE) and eQTL analysis. A detection of cell-type specific DE and eQTL genes is possible with the help of single cell RNA-seq. It enables the user to calculate the power for a given experimental setup and to choose for a restricted budget the optimal combination of experimental parameters which maximizes the power. Here, we show the power analysis for a DE study, for an eQTL study it works in similar way.

Load necessary libraries and set up input and output locations.

suppressPackageStartupMessages({
library(scPower)
library(reshape2)
library(ggplot2)
})
set.seed(124)

Generation of new priors from the ROSMAP data set

Two types of priors are necessary for scPower simulations:

the cell type specific expression distribution of all genes for the expression probability curves, which can be fitted from single cell RNA-seq data of the same technology (e.g. data from a pilot study). The distribution include the read-UMI fit to estimate the UMI counts per cell, the gamma mixed fits for the mean values of each gene and the mean-dispersion fits for the dispersion values of each gene.
effect sizes and expression ranks of the DE genes, which can be taken from any kind of study or simulated. We use log2 fold changes from inhibitory neurons here.

They can be generated from existing data set. Here, we use 4 samples from the ROSMAP dataset (https://pubmed.ncbi.nlm.nih.gov/31042697/).

Load count matrix of 4 samples (2 ADs and 2 CTRLs) from the ROAMAP dataset.

#SampleNames <- unique(ROSMAP$individualID)

#R8744945.counts <- ROSMAP@assays$RNA@counts[,ROSMAP$individualID == "R8744945"]
#R9426782.counts <- ROSMAP@assays$RNA@counts[,ROSMAP$individualID == "R9426782"]
#R9307768.counts <- ROSMAP@assays$RNA@counts[,ROSMAP$individualID == "R9307768"]
#R8608442.counts <- ROSMAP@assays$RNA@counts[,ROSMAP$individualID == "R8608442"]

#list(R8744945.counts,R9426782.counts,R9307768.counts,R8608442.counts) -> countMatrix
#names(countMatrix) <- SampleNames[1:4]
#save(countMatrix, file="CountMatrix1to4.RData")

load("CountMatrix1to4.RData")
#Dimensions of the three count matrices
sapply(countMatrix,dim)

##      R8744945 R9426782 R9307768 R8608442
## [1,]    17775    17775    17775    17775
## [2,]      484      297     1169     2024

Counting observed expressed genes

The number of expressed genes in the data set can be estimated by reformating the 2d count matrix into a 3d pseudobulk matrix, using the function “create.pseudobulk”. Therefore, an annotation data frame is required with individual and cell type annotations for each cells. The rows of the annotaiton data frame must match the columns in the count matrix (same ordering of cells). After the pseudobulk matrix is created, the expressed genes are extracted using the function “calculate.gene.counts”.

expressed.genes.df <- NULL
for(name in names(countMatrix)){
    count.matrix <- countMatrix[[name]]
    annot.df <- data.frame(individual=paste0("S",rep(1:5,length.out=ncol(count.matrix))),
                       cell.type=rep("Mic",ncol(count.matrix)))
    #Reformat count matrix into pseudobulk matrix
    pseudo.bulk <- create.pseudobulk(count.matrix, annot.df)
    #Calculate expressed genes in the pseudobulk matrix
    expressed.genes <- calculate.gene.counts(pseudo.bulk, min.counts=3, perc.indiv = 0.5)
    #Get the number of expressed genes
    num.expressed.genes <- nrow(expressed.genes)
  
    #Save expressed genes
    expressed.genes.df <- rbind(expressed.genes.df, data.frame(matrix=name,
                num.cells=ncol(count.matrix), expressed.genes=num.expressed.genes))
}

print(expressed.genes.df)

##     matrix num.cells expressed.genes
## 1 R8744945       484           11206
## 2 R9426782       297           10207
## 3 R9307768      1169           12196
## 4 R8608442      2024           12283

Estimation of negative binomial paramters for each gene

As a first step for fitting the expression distribution, the negative binomial fit for each gene, i.e. the mean and disperion parameter, is estimated with “nbinom.estimation”. The function uses DESeq for library normalization and parameter estimation. The “poscounts” normalization is used here as the snRNA-Seq data is very sparse.

The function returns a list with three elements: the normalized mean values, the dispersion values and the parameters of the mean-dispersion function fitted from DESeq.

norm.mean.values<-NULL
disp.param<-NULL
for(name in names(countMatrix)){
    temp <- nbinom.estimation(as.matrix(countMatrix[[name]]), "poscounts")
  
    #Save the normalized mean values
    norm.mean.values.temp <- temp[[1]]
    norm.mean.values.temp$matrix <- name
    norm.mean.values <- rbind(norm.mean.values,norm.mean.values.temp)
  
    #Save the parameter of the mean-dispersion function
    disp.param.temp <- temp[[3]]
    disp.param.temp$matrix <- name
    disp.param <- rbind(disp.param,disp.param.temp)
}

head(norm.mean.values)

##         gene        mean   matrix
## 1 FO538757.2 0.347628457 R8744945
## 2     SAMD11 0.002081430 R8744945
## 3      NOC2L 0.227857100 R8744945
## 4     KLHL17 0.019762945 R8744945
## 5    PLEKHN1 0.000000000 R8744945
## 6      PERM1 0.002087122 R8744945

print(disp.param)

##   asymptDisp  extraPois   matrix
## 1   0.610894 0.15445884 R8744945
## 2   1.531188 0.10420800 R9426782
## 3   0.988553 0.10010291 R9307768
## 4   1.636793 0.08543065 R8608442

Estimation of a gamma mixed distribution over all means

A mixed distribution of a zero component and two left censored gamma components is fitted for each mean value vector. As a fraction of the zero values can be expressed genes below the detection threshold of the experiment, the gamma distributions are modelled as left censored. Here, we use 1/num_cells_celltype as the censoring point.

gamma.fits <- NULL
for(name in names(countMatrix)){
  
  #Number of cells per cell type as censoring point
  censoredPoint <- 1 / ncol(countMatrix[[name]])
  
  norm.mean.values.temp <- norm.mean.values[norm.mean.values$matrix == name,]
  gamma.fit.temp <- mixed.gamma.estimation(norm.mean.values.temp$mean, num.genes.kept = 17775, censoredPoint = censoredPoint)
  gamma.fit.temp$matrix <- name
  gamma.fits <- rbind(gamma.fits,gamma.fit.temp)
}

## Warning in em(mean.vals, ncomp = 3, prop = c(zero.prop, 1 - (zero.prop + : Problem in the EM algorithm: likelihood is decreasing!

## Warning in em(mean.vals, ncomp = 3, prop = c(zero.prop, 1 - (zero.prop + : Problem in the EM algorithm: likelihood is decreasing!

print(gamma.fits)

##                p1        p2        s1        s2       r1        r2   matrix
## shape  0.01852296 0.9310087 0.4194624 0.8134898 2.258687 0.5103385 R8744945
## shape1 0.01789153 0.9316999 0.5549020 1.2062766 3.923257 1.1764886 R9426782
## shape2 0.01447667 0.9350203 0.4031303 0.8505455 2.937134 0.7061003 R9307768
## shape3 0.01401963 0.9351703 0.4109758 0.8033830 5.389559 1.0674429 R8608442

Parameterization of the parameters of the gamma fits by the mean UMI counts per cell

The gamma fits over all matrices are parameterized by the mean UMI counts per cell.

#Estimate the mean umi values per cell for each matrix
umi.values <- NULL
for(name in names(countMatrix)){
  mean.umi <- meanUMI.calculation(as.matrix(countMatrix[[name]]))
  umi.values <- rbind(umi.values,data.frame(mean.umi,matrix=name))
}

print(umi.values)

##   mean.umi   matrix
## 1 4772.351 R8744945
## 2 3832.569 R9426782
## 3 3497.281 R9307768
## 4 2007.899 R8608442

gamma.fits <- merge(gamma.fits,umi.values,by="matrix")

#Convert the gamma fits from the shape-rate parametrization to the mean-sd parametrization
gamma.fits <- convert.gamma.parameters(gamma.fits)

#Visualize the linear relationship between gamma parameters and UMI values in plots
plot.values <- melt(gamma.fits,id.vars=c("matrix","mean.umi"))
plot.values <- plot.values[plot.values$variable %in% c("mean1","mean2","sd1","sd2","p1","p2"),]

ggplot(plot.values,aes(x=mean.umi,y=value)) + geom_point()+geom_line() + facet_wrap(~variable,ncol=2,scales="free")

#Fit relationship between gamma parameters and UMI values
gamma.linear.fit.new <- umi.gamma.relation(gamma.fits)
print(gamma.linear.fit.new)

##   parameter     intercept      meanUMI
## 1        p1  0.0101833024 1.713494e-06
## 2     mean1 -0.0025287642 3.903387e-05
## 3     mean2  0.1473966685 2.825596e-04
## 4       sd1 -0.0001510715 5.757131e-05
## 5       sd2  0.1681951885 2.958093e-04
## 6        p3  0.0504856631 0.000000e+00

Estimation of median dispersion function for each cell type

For the dispersion parameter, no relation with the UMI counts was found. Therefore, simply the median value over all subsampling runs is taken for each parameter of the mean-dispersion function.

disp.fun.general.new <- dispersion.function.estimation(disp.param)
print(disp.fun.general.new)

##   asymptDisp extraPois
## 1   1.259871 0.1021555

Fitting a functions for UMI counts dependent on read depth

As a last point, the relationship between reads and UMIs is fitted logarithmically. Therefore, the number of mapped reads is necessary, which can be gained from summary statistics after the mapping (e.g. in the cellranger summary statistics).

# Annotation of cell type for all fitted data frames
gamma.linear.fit.new$ct <- "New_ct"
disp.fun.general.new$ct <- "New_ct"

#Number of mapped reads taken from cellranger summary statistics
mapped.reads <- data.frame(matrix = names(countMatrix), transcriptome.mapped.reads=c(21186,15690,12563,11324))

#Plot relationship between mean reads per cell and mean UMI per cell
read.umis <- merge(umi.values,mapped.reads,by="matrix")
print(read.umis)

##     matrix mean.umi transcriptome.mapped.reads
## 1 R8608442 2007.899                      11324
## 2 R8744945 4772.351                      21186
## 3 R9307768 3497.281                      12563
## 4 R9426782 3832.569                      15690

ggplot(read.umis,aes(x=transcriptome.mapped.reads,y=mean.umi))+ geom_point()+geom_line()

#Fit relationship between mean reads per cell and mean UMI per cell
read.umi.fit.new <- umi.read.relation(read.umis)
print(read.umi.fit.new)

##             intercept    reads
## (Intercept) -32948.49 3800.074

Validation of expression probability model

In the section “Counting observed expressed genes”, the number of expressed genes in our example count matrices were calculated and saved in the data frame “expressed.genes.df”. To validate our model, we try to predict the same number of expressed genes using our fitted model.

#Merge the observed numbers of expressed genes with the read depth
expressed.genes.df <- merge(expressed.genes.df, mapped.reads, by="matrix")

#Get the number of cells per cell type and individual
expressed.genes.df$cells.indiv <- expressed.genes.df$num.cells/5
expressed.genes.df$estimated.genes <- NA
for(i in 1:nrow(expressed.genes.df)){
  
  #Vector with the expression probability for each gene
  expr.prob <- estimate.exp.prob.param(nSamples=5, readDepth=expressed.genes.df$transcriptome.mapped.reads[i],
                                    nCellsCt=expressed.genes.df$cells.indiv[i],
                                    read.umi.fit = read.umi.fit.new,
                                    gamma.mixed.fits = gamma.linear.fit.new,
                                    ct = "New_ct",
                                    disp.fun.param = disp.fun.general.new,
                                    min.counts = 3,
                                    perc.indiv = 0.5)
  
  #Expected number of expressed genes
  expressed.genes.df$estimated.genes[i] <- round(sum(expr.prob))
  
}
print(expressed.genes.df)

##     matrix num.cells expressed.genes transcriptome.mapped.reads cells.indiv estimated.genes
## 1 R8608442      2024           12283                      11324       404.8           15215
## 2 R8744945       484           11206                      21186        96.8           13228
## 3 R9307768      1169           12196                      12563       233.8           14228
## 4 R9426782       297           10207                      15690        59.4           10533

plot.expressed.genes.df<-reshape2::melt(expressed.genes.df, id.vars=c("matrix","num.cells","cells.indiv","transcriptome.mapped.reads"))

ggplot(plot.expressed.genes.df,aes(x=transcriptome.mapped.reads,y=value,
        color=variable)) + geom_point()+geom_line()

Use of expression probability model for power calculation

The fitted values can adapt the power calculation to the specific conditions of the experiment, e.g. a specific cell type of interest.

power<-power.general.restrictedDoublets(nSamples=20,nCells=5000,readDepth=25000,
                                        ct.freq=0.2,type="de",
                                        ref.study=scPower::de.ref.study,
                                        ref.study.name="Nicodemus-Johnson_AEC",
                                        cellsPerLane=20000,
                                        read.umi.fit = read.umi.fit.new,
                                        gamma.mixed.fits = gamma.linear.fit.new,
                                        ct="New_ct",
                                        disp.fun.param=disp.fun.general.new,
                                        mappingEfficiency = 0.8,
                                        min.UMI.counts = 3,
                                        perc.indiv.expr = 0.5,
                                        sign.threshold = 0.05,
                                        MTmethod="FDR")

print(power)

##                    name powerDetect exp.probs     power sampleSize totalCells usableCells multipletFraction ctCells readDepth readDepthSinglet mappedReadDepth
## 1 Nicodemus-Johnson_AEC   0.9023859         1 0.9023859         20       5000        4233            0.1534     847     25000         22206.67        17765.33
##   expressedGenes
## 1          17523

Selection of optimal parameter combination for a restricted budget

scPower gives the user the opportunity to select the best parameter combination for a restricted budget. In the following example, we optimize the experimental design for a budget of 30,000. For two of the three parameters, sample size, cells per individual and read depth, vectors with potential values are required, the third variable is determined uniquely given the other two and the overall budget. Which of the variables is left out, can be freely chosen. In this example we set the number of cells per individual and read depths in the parameters readDepthRange and cellPersRange, and leave out the sample size, which will be determined by the algorithm.

In this example, we set useSimulatedPower as FALSE to speed up the calculation. It should be set as TRUE to increase accuracy.

opt.design <- optimize.constant.budget(totalBudget = 30000,type = "de",
                                     ct="New_ct",ct.freq=0.2,
                                     costKit = 5000,
                                     costFlowCell = 13000,
                                     readsPerFlowcell = 4100*10^6,
                                     ref.study = scPower::de.ref.study,
                                     ref.study.name = "Nicodemus-Johnson_AEC",
                                     samplesPerLane = 4,
                                     read.umi.fit = read.umi.fit.new,
                                     gamma.mixed.fits = gamma.linear.fit.new,
                                     disp.fun.param = disp.fun.general.new,
                                     nSamplesRange=NULL,
                                     nCellsRange = seq(1000,10000,by=1000),
                                     readDepthRange = seq(20000,70000,by=5000),
                                     mappingEfficiency = 0.8,
                                     sign.threshold = 0.05,
                                     MTmethod="FDR",
                                     useSimulatedPower = FALSE,
                                     speedPowerCalc = FALSE)


head(opt.design)

##                    name powerDetect exp.probs     power sampleSize totalCells usableCells multipletFraction ctCells readDepth readDepthSinglet mappedReadDepth
## 1 Nicodemus-Johnson_AEC   0.8350547 0.8986476 0.9208892        110       1000         969           0.03068     194     20000         19504.20        15603.36
## 2 Nicodemus-Johnson_AEC   0.9589929 0.9934820 0.9655084         88       2000        1877           0.06136     375     20000         19039.82        15231.86
## 3 Nicodemus-Johnson_AEC   0.9776422 0.9999998 0.9776424         74       3000        2724           0.09204     545     20000         18597.04        14877.63
## 4 Nicodemus-Johnson_AEC   0.9818622 1.0000000 0.9818622         64       4000        3509           0.12272     702     20000         18171.00        14536.80
## 5 Nicodemus-Johnson_AEC   0.9831178 1.0000000 0.9831178         56       5000        4233           0.15340     847     20000         17765.33        14212.27
## 6 Nicodemus-Johnson_AEC   0.9833437 1.0000000 0.9833437         50       6000        4896           0.18408     979     20000         17378.01        13902.41
##   expressedGenes
## 1          14424
## 2          15915
## 3          16581
## 4          16965
## 5          17215
## 6          17387

#Optimal experimental design combination
print(opt.design[which.max(opt.design$powerDetect),])

##                     name powerDetect exp.probs     power sampleSize totalCells usableCells multipletFraction ctCells readDepth readDepthSinglet mappedReadDepth
## 26 Nicodemus-Johnson_AEC   0.9870326         1 0.9870326         38       6000        4896           0.18408     979     30000         26067.01        20853.61
##    expressedGenes
## 26          17905

Session info

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MKmisc_1.8     Matrix_1.3-4   ggplot2_3.3.5  reshape2_1.4.4 scPower_1.0.1  rmarkdown_2.9 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7         lattice_0.20-44    prettyunits_1.1.1  ps_1.6.0           assertthat_0.2.1   rprojroot_2.0.2    digest_0.6.27      utf8_1.2.2        
##  [9] R6_2.5.0           plyr_1.8.6         evaluate_0.14      highr_0.9          pillar_1.6.2       rlang_0.4.11       curl_4.3.2         callr_3.7.0       
## [17] jquerylib_0.1.4    desc_1.3.0         labeling_0.4.2     devtools_2.4.3     stringr_1.4.0      munsell_0.5.0      compiler_4.1.1     xfun_0.25         
## [25] pkgconfig_2.0.3    pkgbuild_1.2.0     htmltools_0.5.1.1  tidyselect_1.1.1   tibble_3.1.3       fansi_0.5.0        crayon_1.4.1       dplyr_1.0.7       
## [33] withr_2.4.2        grid_4.1.1         jsonlite_1.7.2     gtable_0.3.0       lifecycle_1.0.0    DBI_1.1.1          magrittr_2.0.1     scales_1.1.1      
## [41] cli_3.0.1          stringi_1.7.3      cachem_1.0.5       farver_2.1.0       fs_1.5.0           remotes_2.3.0      testthat_3.0.4     limma_3.48.3      
## [49] robustbase_0.93-8  bslib_0.2.5.1      ellipsis_0.3.2     vctrs_0.3.8        generics_0.1.0     RColorBrewer_1.1-2 tools_4.1.1        glue_1.4.2        
## [57] DEoptimR_1.0-9     purrr_0.3.4        processx_3.5.2     pkgload_1.2.1      fastmap_1.1.0      yaml_2.2.1         colorspace_2.0-2   sessioninfo_1.1.1 
## [65] memoise_2.0.0      knitr_1.33         sass_0.4.0         usethis_2.0.1