scPower is an R package for power analysis and design of single cell transcriptomics experiments for differential expression (DE) and eQTL analysis. A detection of cell-type specific DE and eQTL genes is possible with the help of single cell RNA-seq. It enables the user to calculate the power for a given experimental setup and to choose for a restricted budget the optimal combination of experimental parameters which maximizes the power. Here, we show the power analysis for a DE study, for an eQTL study it works in similar way.
Load necessary libraries and set up input and output locations.
suppressPackageStartupMessages({
library(scPower)
library(reshape2)
library(ggplot2)
})
set.seed(124)
Two types of priors are necessary for scPower simulations:
the cell type specific expression distribution of all genes for the expression probability curves, which can be fitted from single cell RNA-seq data of the same technology (e.g. data from a pilot study). The distribution include the read-UMI fit to estimate the UMI counts per cell, the gamma mixed fits for the mean values of each gene and the mean-dispersion fits for the dispersion values of each gene.
effect sizes and expression ranks of the DE genes, which can be taken from any kind of study or simulated. We use log2 fold changes from inhibitory neurons here.
They can be generated from existing data set. Here, we use 4 samples from the ROSMAP dataset (https://pubmed.ncbi.nlm.nih.gov/31042697/).
#SampleNames <- unique(ROSMAP$individualID)
#R8744945.counts <- ROSMAP@assays$RNA@counts[,ROSMAP$individualID == "R8744945"]
#R9426782.counts <- ROSMAP@assays$RNA@counts[,ROSMAP$individualID == "R9426782"]
#R9307768.counts <- ROSMAP@assays$RNA@counts[,ROSMAP$individualID == "R9307768"]
#R8608442.counts <- ROSMAP@assays$RNA@counts[,ROSMAP$individualID == "R8608442"]
#list(R8744945.counts,R9426782.counts,R9307768.counts,R8608442.counts) -> countMatrix
#names(countMatrix) <- SampleNames[1:4]
#save(countMatrix, file="CountMatrix1to4.RData")
load("CountMatrix1to4.RData")
#Dimensions of the three count matrices
sapply(countMatrix,dim)
## R8744945 R9426782 R9307768 R8608442
## [1,] 17775 17775 17775 17775
## [2,] 484 297 1169 2024
The number of expressed genes in the data set can be estimated by reformating the 2d count matrix into a 3d pseudobulk matrix, using the function “create.pseudobulk”. Therefore, an annotation data frame is required with individual and cell type annotations for each cells. The rows of the annotaiton data frame must match the columns in the count matrix (same ordering of cells). After the pseudobulk matrix is created, the expressed genes are extracted using the function “calculate.gene.counts”.
expressed.genes.df <- NULL
for(name in names(countMatrix)){
count.matrix <- countMatrix[[name]]
annot.df <- data.frame(individual=paste0("S",rep(1:5,length.out=ncol(count.matrix))),
cell.type=rep("Mic",ncol(count.matrix)))
#Reformat count matrix into pseudobulk matrix
pseudo.bulk <- create.pseudobulk(count.matrix, annot.df)
#Calculate expressed genes in the pseudobulk matrix
expressed.genes <- calculate.gene.counts(pseudo.bulk, min.counts=3, perc.indiv = 0.5)
#Get the number of expressed genes
num.expressed.genes <- nrow(expressed.genes)
#Save expressed genes
expressed.genes.df <- rbind(expressed.genes.df, data.frame(matrix=name,
num.cells=ncol(count.matrix), expressed.genes=num.expressed.genes))
}
print(expressed.genes.df)
## matrix num.cells expressed.genes
## 1 R8744945 484 11206
## 2 R9426782 297 10207
## 3 R9307768 1169 12196
## 4 R8608442 2024 12283
As a first step for fitting the expression distribution, the negative binomial fit for each gene, i.e. the mean and disperion parameter, is estimated with “nbinom.estimation”. The function uses DESeq for library normalization and parameter estimation. The “poscounts” normalization is used here as the snRNA-Seq data is very sparse.
The function returns a list with three elements: the normalized mean values, the dispersion values and the parameters of the mean-dispersion function fitted from DESeq.
norm.mean.values<-NULL
disp.param<-NULL
for(name in names(countMatrix)){
temp <- nbinom.estimation(as.matrix(countMatrix[[name]]), "poscounts")
#Save the normalized mean values
norm.mean.values.temp <- temp[[1]]
norm.mean.values.temp$matrix <- name
norm.mean.values <- rbind(norm.mean.values,norm.mean.values.temp)
#Save the parameter of the mean-dispersion function
disp.param.temp <- temp[[3]]
disp.param.temp$matrix <- name
disp.param <- rbind(disp.param,disp.param.temp)
}
head(norm.mean.values)
## gene mean matrix
## 1 FO538757.2 0.347628457 R8744945
## 2 SAMD11 0.002081430 R8744945
## 3 NOC2L 0.227857100 R8744945
## 4 KLHL17 0.019762945 R8744945
## 5 PLEKHN1 0.000000000 R8744945
## 6 PERM1 0.002087122 R8744945
print(disp.param)
## asymptDisp extraPois matrix
## 1 0.610894 0.15445884 R8744945
## 2 1.531188 0.10420800 R9426782
## 3 0.988553 0.10010291 R9307768
## 4 1.636793 0.08543065 R8608442
A mixed distribution of a zero component and two left censored gamma components is fitted for each mean value vector. As a fraction of the zero values can be expressed genes below the detection threshold of the experiment, the gamma distributions are modelled as left censored. Here, we use 1/num_cells_celltype as the censoring point.
gamma.fits <- NULL
for(name in names(countMatrix)){
#Number of cells per cell type as censoring point
censoredPoint <- 1 / ncol(countMatrix[[name]])
norm.mean.values.temp <- norm.mean.values[norm.mean.values$matrix == name,]
gamma.fit.temp <- mixed.gamma.estimation(norm.mean.values.temp$mean, num.genes.kept = 17775, censoredPoint = censoredPoint)
gamma.fit.temp$matrix <- name
gamma.fits <- rbind(gamma.fits,gamma.fit.temp)
}
## Warning in em(mean.vals, ncomp = 3, prop = c(zero.prop, 1 - (zero.prop + : Problem in the EM algorithm: likelihood is decreasing!
## Warning in em(mean.vals, ncomp = 3, prop = c(zero.prop, 1 - (zero.prop + : Problem in the EM algorithm: likelihood is decreasing!
print(gamma.fits)
## p1 p2 s1 s2 r1 r2 matrix
## shape 0.01852296 0.9310087 0.4194624 0.8134898 2.258687 0.5103385 R8744945
## shape1 0.01789153 0.9316999 0.5549020 1.2062766 3.923257 1.1764886 R9426782
## shape2 0.01447667 0.9350203 0.4031303 0.8505455 2.937134 0.7061003 R9307768
## shape3 0.01401963 0.9351703 0.4109758 0.8033830 5.389559 1.0674429 R8608442
The gamma fits over all matrices are parameterized by the mean UMI counts per cell.
#Estimate the mean umi values per cell for each matrix
umi.values <- NULL
for(name in names(countMatrix)){
mean.umi <- meanUMI.calculation(as.matrix(countMatrix[[name]]))
umi.values <- rbind(umi.values,data.frame(mean.umi,matrix=name))
}
print(umi.values)
## mean.umi matrix
## 1 4772.351 R8744945
## 2 3832.569 R9426782
## 3 3497.281 R9307768
## 4 2007.899 R8608442
gamma.fits <- merge(gamma.fits,umi.values,by="matrix")
#Convert the gamma fits from the shape-rate parametrization to the mean-sd parametrization
gamma.fits <- convert.gamma.parameters(gamma.fits)
#Visualize the linear relationship between gamma parameters and UMI values in plots
plot.values <- melt(gamma.fits,id.vars=c("matrix","mean.umi"))
plot.values <- plot.values[plot.values$variable %in% c("mean1","mean2","sd1","sd2","p1","p2"),]
ggplot(plot.values,aes(x=mean.umi,y=value)) + geom_point()+geom_line() + facet_wrap(~variable,ncol=2,scales="free")
#Fit relationship between gamma parameters and UMI values
gamma.linear.fit.new <- umi.gamma.relation(gamma.fits)
print(gamma.linear.fit.new)
## parameter intercept meanUMI
## 1 p1 0.0101833024 1.713494e-06
## 2 mean1 -0.0025287642 3.903387e-05
## 3 mean2 0.1473966685 2.825596e-04
## 4 sd1 -0.0001510715 5.757131e-05
## 5 sd2 0.1681951885 2.958093e-04
## 6 p3 0.0504856631 0.000000e+00
For the dispersion parameter, no relation with the UMI counts was found. Therefore, simply the median value over all subsampling runs is taken for each parameter of the mean-dispersion function.
disp.fun.general.new <- dispersion.function.estimation(disp.param)
print(disp.fun.general.new)
## asymptDisp extraPois
## 1 1.259871 0.1021555
As a last point, the relationship between reads and UMIs is fitted logarithmically. Therefore, the number of mapped reads is necessary, which can be gained from summary statistics after the mapping (e.g. in the cellranger summary statistics).
# Annotation of cell type for all fitted data frames
gamma.linear.fit.new$ct <- "New_ct"
disp.fun.general.new$ct <- "New_ct"
#Number of mapped reads taken from cellranger summary statistics
mapped.reads <- data.frame(matrix = names(countMatrix), transcriptome.mapped.reads=c(21186,15690,12563,11324))
#Plot relationship between mean reads per cell and mean UMI per cell
read.umis <- merge(umi.values,mapped.reads,by="matrix")
print(read.umis)
## matrix mean.umi transcriptome.mapped.reads
## 1 R8608442 2007.899 11324
## 2 R8744945 4772.351 21186
## 3 R9307768 3497.281 12563
## 4 R9426782 3832.569 15690
ggplot(read.umis,aes(x=transcriptome.mapped.reads,y=mean.umi))+ geom_point()+geom_line()
#Fit relationship between mean reads per cell and mean UMI per cell
read.umi.fit.new <- umi.read.relation(read.umis)
print(read.umi.fit.new)
## intercept reads
## (Intercept) -32948.49 3800.074
In the section “Counting observed expressed genes”, the number of expressed genes in our example count matrices were calculated and saved in the data frame “expressed.genes.df”. To validate our model, we try to predict the same number of expressed genes using our fitted model.
#Merge the observed numbers of expressed genes with the read depth
expressed.genes.df <- merge(expressed.genes.df, mapped.reads, by="matrix")
#Get the number of cells per cell type and individual
expressed.genes.df$cells.indiv <- expressed.genes.df$num.cells/5
expressed.genes.df$estimated.genes <- NA
for(i in 1:nrow(expressed.genes.df)){
#Vector with the expression probability for each gene
expr.prob <- estimate.exp.prob.param(nSamples=5, readDepth=expressed.genes.df$transcriptome.mapped.reads[i],
nCellsCt=expressed.genes.df$cells.indiv[i],
read.umi.fit = read.umi.fit.new,
gamma.mixed.fits = gamma.linear.fit.new,
ct = "New_ct",
disp.fun.param = disp.fun.general.new,
min.counts = 3,
perc.indiv = 0.5)
#Expected number of expressed genes
expressed.genes.df$estimated.genes[i] <- round(sum(expr.prob))
}
print(expressed.genes.df)
## matrix num.cells expressed.genes transcriptome.mapped.reads cells.indiv estimated.genes
## 1 R8608442 2024 12283 11324 404.8 15215
## 2 R8744945 484 11206 21186 96.8 13228
## 3 R9307768 1169 12196 12563 233.8 14228
## 4 R9426782 297 10207 15690 59.4 10533
plot.expressed.genes.df<-reshape2::melt(expressed.genes.df, id.vars=c("matrix","num.cells","cells.indiv","transcriptome.mapped.reads"))
ggplot(plot.expressed.genes.df,aes(x=transcriptome.mapped.reads,y=value,
color=variable)) + geom_point()+geom_line()
The fitted values can adapt the power calculation to the specific conditions of the experiment, e.g. a specific cell type of interest.
power<-power.general.restrictedDoublets(nSamples=20,nCells=5000,readDepth=25000,
ct.freq=0.2,type="de",
ref.study=scPower::de.ref.study,
ref.study.name="Nicodemus-Johnson_AEC",
cellsPerLane=20000,
read.umi.fit = read.umi.fit.new,
gamma.mixed.fits = gamma.linear.fit.new,
ct="New_ct",
disp.fun.param=disp.fun.general.new,
mappingEfficiency = 0.8,
min.UMI.counts = 3,
perc.indiv.expr = 0.5,
sign.threshold = 0.05,
MTmethod="FDR")
print(power)
## name powerDetect exp.probs power sampleSize totalCells usableCells multipletFraction ctCells readDepth readDepthSinglet mappedReadDepth
## 1 Nicodemus-Johnson_AEC 0.9023859 1 0.9023859 20 5000 4233 0.1534 847 25000 22206.67 17765.33
## expressedGenes
## 1 17523
scPower gives the user the opportunity to select the best parameter combination for a restricted budget. In the following example, we optimize the experimental design for a budget of 30,000. For two of the three parameters, sample size, cells per individual and read depth, vectors with potential values are required, the third variable is determined uniquely given the other two and the overall budget. Which of the variables is left out, can be freely chosen. In this example we set the number of cells per individual and read depths in the parameters readDepthRange and cellPersRange, and leave out the sample size, which will be determined by the algorithm.
In this example, we set useSimulatedPower as FALSE to speed up the calculation. It should be set as TRUE to increase accuracy.
opt.design <- optimize.constant.budget(totalBudget = 30000,type = "de",
ct="New_ct",ct.freq=0.2,
costKit = 5000,
costFlowCell = 13000,
readsPerFlowcell = 4100*10^6,
ref.study = scPower::de.ref.study,
ref.study.name = "Nicodemus-Johnson_AEC",
samplesPerLane = 4,
read.umi.fit = read.umi.fit.new,
gamma.mixed.fits = gamma.linear.fit.new,
disp.fun.param = disp.fun.general.new,
nSamplesRange=NULL,
nCellsRange = seq(1000,10000,by=1000),
readDepthRange = seq(20000,70000,by=5000),
mappingEfficiency = 0.8,
sign.threshold = 0.05,
MTmethod="FDR",
useSimulatedPower = FALSE,
speedPowerCalc = FALSE)
head(opt.design)
## name powerDetect exp.probs power sampleSize totalCells usableCells multipletFraction ctCells readDepth readDepthSinglet mappedReadDepth
## 1 Nicodemus-Johnson_AEC 0.8350547 0.8986476 0.9208892 110 1000 969 0.03068 194 20000 19504.20 15603.36
## 2 Nicodemus-Johnson_AEC 0.9589929 0.9934820 0.9655084 88 2000 1877 0.06136 375 20000 19039.82 15231.86
## 3 Nicodemus-Johnson_AEC 0.9776422 0.9999998 0.9776424 74 3000 2724 0.09204 545 20000 18597.04 14877.63
## 4 Nicodemus-Johnson_AEC 0.9818622 1.0000000 0.9818622 64 4000 3509 0.12272 702 20000 18171.00 14536.80
## 5 Nicodemus-Johnson_AEC 0.9831178 1.0000000 0.9831178 56 5000 4233 0.15340 847 20000 17765.33 14212.27
## 6 Nicodemus-Johnson_AEC 0.9833437 1.0000000 0.9833437 50 6000 4896 0.18408 979 20000 17378.01 13902.41
## expressedGenes
## 1 14424
## 2 15915
## 3 16581
## 4 16965
## 5 17215
## 6 17387
#Optimal experimental design combination
print(opt.design[which.max(opt.design$powerDetect),])
## name powerDetect exp.probs power sampleSize totalCells usableCells multipletFraction ctCells readDepth readDepthSinglet mappedReadDepth
## 26 Nicodemus-Johnson_AEC 0.9870326 1 0.9870326 38 6000 4896 0.18408 979 30000 26067.01 20853.61
## expressedGenes
## 26 17905
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MKmisc_1.8 Matrix_1.3-4 ggplot2_3.3.5 reshape2_1.4.4 scPower_1.0.1 rmarkdown_2.9
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 lattice_0.20-44 prettyunits_1.1.1 ps_1.6.0 assertthat_0.2.1 rprojroot_2.0.2 digest_0.6.27 utf8_1.2.2
## [9] R6_2.5.0 plyr_1.8.6 evaluate_0.14 highr_0.9 pillar_1.6.2 rlang_0.4.11 curl_4.3.2 callr_3.7.0
## [17] jquerylib_0.1.4 desc_1.3.0 labeling_0.4.2 devtools_2.4.3 stringr_1.4.0 munsell_0.5.0 compiler_4.1.1 xfun_0.25
## [25] pkgconfig_2.0.3 pkgbuild_1.2.0 htmltools_0.5.1.1 tidyselect_1.1.1 tibble_3.1.3 fansi_0.5.0 crayon_1.4.1 dplyr_1.0.7
## [33] withr_2.4.2 grid_4.1.1 jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1
## [41] cli_3.0.1 stringi_1.7.3 cachem_1.0.5 farver_2.1.0 fs_1.5.0 remotes_2.3.0 testthat_3.0.4 limma_3.48.3
## [49] robustbase_0.93-8 bslib_0.2.5.1 ellipsis_0.3.2 vctrs_0.3.8 generics_0.1.0 RColorBrewer_1.1-2 tools_4.1.1 glue_1.4.2
## [57] DEoptimR_1.0-9 purrr_0.3.4 processx_3.5.2 pkgload_1.2.1 fastmap_1.1.0 yaml_2.2.1 colorspace_2.0-2 sessioninfo_1.1.1
## [65] memoise_2.0.0 knitr_1.33 sass_0.4.0 usethis_2.0.1