An accurate and robust imputation method scImpute for single-cell RNA-seq data
http://jsb.ucla.edu/sites/default/files/publications/NC_scImpute.pdf
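A methods paper from UCLA, published in Nature Communications in 2018. I tried the software and it runs fine, but on my own data it showed no clear advantage, so more testing is needed.
Still, the principle deserves a thorough understanding so that later analyses can draw on it.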
The workflow in the imputation step of the scImpute method: scImpute first learns each gene's dropout probability in each cell by fitting a mixture model. Next, scImpute imputes the (highly probable) dropout values in cell j (gene set $A_j$) by borrowing information of the same gene in other similar cells, which are selected based on gene set $B_j$ (not severely affected by dropout events).
Illustration of how the selecting matrix works:
Comparison with other tools, with a no-imputation control and parallel controls: scImpute, MAGIC, SAVER
one more dimension reduction plot:
clustering:
The results come first, and they do look good. So how are the algorithm and the statistical model actually built?
step by step Factorization:
establishing the normalized count matrix
1. PCA is performed on matrix X for dimension reduction and the resulting matrix is denoted as Z, where columns represent cells and rows represent principal components (PCs). The purpose of dimension reduction is to reduce the impact of large portions of dropout values. The PCs are selected such that at least 40% of the variance in data could be explained.
2. Based on the PCA-transformed data Z, the distance matrix $D_{J \times J}$ between the cells could be calculated. For each cell j, we denote its distance to the nearest neighbor as $l_j$. For the set $L = \{l_1, \ldots, l_J\}$, we denote its first quartile as $Q_1$, and third quartile as $Q_3$. The outlier cells are those cells which do not have close neighbors:
Equation 1:
$$O = \{\, j : l_j > Q_3 + 1.5\,(Q_3 - Q_1) \,\}$$
For each outlier cell, we set its candidate neighbor set $N_j = \emptyset$. Please note that the outlier cells could be a result of experimental/technical errors or biases, but they may also represent real biological variation as rare cell types. scImpute would not impute gene expression values in outlier cells, nor use them to impute gene expression values in other cells.
3. The remaining cells $\{1, \ldots, J\} \setminus O$ are clustered into K groups by spectral clustering (ref. 23). We denote $g_j = k$ if cell j is assigned to cluster k (k = 1, …, K). Hence, cell j has the candidate neighbor set $N_j = \{\, j' : g_{j'} = g_j,\ j' \neq j \,\}$.
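In short: PCA reduces the dimension of the normalized matrix from the previous step, giving the PCA-transformed matrix Z. From Z we compute the cell-to-cell distance matrix D (J × J), and outliers are identified and filtered out based on each cell's distance to its nearest neighbor in this matrix, according to equation 1.
The remaining cells (those outside the outlier set O) are then clustered into K groups, and these K groups can be regarded as the subpopulations of the cell population.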
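The cell-selection steps above can be sketched in a few lines of NumPy. This is only a rough illustration and not the scImpute implementation: the function name `select_candidate_cells` is made up, the outlier rule is assumed to be the usual boxplot convention l_j > Q3 + 1.5 (Q3 − Q1), and the spectral clustering of the remaining cells is left out (any spectral clustering routine, for example scikit-learn's `SpectralClustering`, could be applied to the non-outlier columns of Z).

```python
import numpy as np


def select_candidate_cells(X, min_variance=0.40):
    """Hypothetical helper. X is a normalized genes x cells matrix."""
    # Center every gene across cells before PCA.
    Xc = X - X.mean(axis=1, keepdims=True)
    # PCA via SVD: rows of Vt hold the cell coordinates on each PC.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(S ** 2) / np.sum(S ** 2)
    # Smallest number of PCs whose cumulative variance reaches the target.
    n_pc = min(int(np.searchsorted(explained, min_variance)) + 1, len(S))
    Z = S[:n_pc, None] * Vt[:n_pc, :]          # PCs x cells

    # Euclidean distance matrix D (J x J) between cells.
    diff = Z[:, :, None] - Z[:, None, :]
    D = np.sqrt((diff ** 2).sum(axis=0))
    np.fill_diagonal(D, np.inf)                # ignore self-distances
    nearest = D.min(axis=1)                    # l_j for every cell

    q1, q3 = np.percentile(nearest, [25, 75])
    # Assumed outlier rule (boxplot convention): l_j > Q3 + 1.5 * (Q3 - Q1).
    outliers = np.flatnonzero(nearest > q3 + 1.5 * (q3 - q1))
    return Z, nearest, outliers
```

Cells returned in `outliers` would get an empty candidate neighbor set; only the others go on to clustering.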
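############### With the question of which cells are usable settled, we need a statistically meaningful description of how expression is distributed across this large number of single cells:
For each gene i, its expression in cell subpopulation k is modeled as a random variable $X_i^{(k)}$ with density function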
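Equation 2:
$$f_{X_i^{(k)}}(x) = \lambda_i^{(k)}\,\mathrm{Gamma}\big(x;\ \alpha_i^{(k)}, \beta_i^{(k)}\big) + \big(1 - \lambda_i^{(k)}\big)\,\mathrm{Normal}\big(x;\ \mu_i^{(k)}, \sigma_i^{(k)}\big)$$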
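where $\lambda_i^{(k)}$ is the dropout rate of gene i in cell subpopulation k, α and β are the shape and rate parameters of gene i's Gamma distribution, and μ and σ are the mean and standard deviation of gene i's Normal distribution. All of these parameters are estimated with the EM (Expectation-Maximization) algorithm, an iterative procedure for computing maximum-likelihood estimates.
The point of this formula is that, across a gene's different expression levels, it gives a better way to judge whether an observed value is a dropout or reflects real biological variation.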
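Equation 3:
$$d_{ij} = \frac{\hat\lambda_i^{(k)}\,\mathrm{Gamma}\big(X_{ij};\ \hat\alpha_i^{(k)}, \hat\beta_i^{(k)}\big)}{\hat\lambda_i^{(k)}\,\mathrm{Gamma}\big(X_{ij};\ \hat\alpha_i^{(k)}, \hat\beta_i^{(k)}\big) + \big(1 - \hat\lambda_i^{(k)}\big)\,\mathrm{Normal}\big(X_{ij};\ \hat\mu_i^{(k)}, \hat\sigma_i^{(k)}\big)}$$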
Equation 3 is the formula for the dropout probability $d_{ij}$ of gene i in cell j.
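To make the dropout probability concrete, here is a small sketch that evaluates it for one observed value. It assumes the mixture parameters have already been estimated (the EM fit itself is not shown) and that the dropout probability is the posterior weight of the Gamma component in the Gamma-Normal mixture. The function names are hypothetical and the Gamma density uses the shape/rate parametrization.

```python
import math


def gamma_pdf(x, shape, rate):
    """Gamma density (shape/rate parametrization); zero for x <= 0."""
    if x <= 0:
        return 0.0
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1.0) * math.log(x) - rate * x)
    return math.exp(log_pdf)


def normal_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def dropout_probability(x, lam, shape, rate, mean, sd):
    """Posterior weight of the Gamma (dropout) component at value x."""
    drop = lam * gamma_pdf(x, shape, rate)
    real = (1.0 - lam) * normal_pdf(x, mean, sd)
    total = drop + real
    # Both densities underflowed: the probability is undefined here.
    return drop / total if total > 0.0 else float("nan")


# Made-up parameters for one gene in one subpopulation (illustration only).
params = dict(lam=0.4, shape=1.5, rate=8.0, mean=2.0, sd=0.5)
d_low = dropout_probability(0.01, **params)   # value near zero
d_high = dropout_probability(2.0, **params)   # value near the Normal mean
```

With these made-up parameters, a value near zero gets a dropout probability close to 1 and a value near the Normal mean gets one close to 0, which is the behavior the threshold t relies on.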
Now for the core of the paper: how to impute the dropout values identified above:
Imputation of dropout values. Now, we impute the gene expressions cell by cell. For each cell j, we select a gene set $A_j$ in need of imputation based on the genes' dropout probabilities in cell j: $A_j = \{i : d_{ij} \ge t\}$, where t is a threshold on dropout probabilities. We also have a gene set $B_j = \{i : d_{ij} < t\}$ that have accurate gene expression with high confidence and do not need imputation. We learn cells' similarities through the gene set $B_j$. Then we impute the expression of genes in the set $A_j$ by borrowing information from the same gene's expression in other similar cells learned from $B_j$. Supplementary Figs. 19 and 20c give some real data distributions of genes' zero count proportions across cells and genes' dropout probabilities, showing that it is reasonable to divide genes into two sets. To learn the cells similar to cell j from $B_j$, we use the non-negative least squares (NNLS) regression:
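Equation 4:
$$\hat\beta^{(j)} = \arg\min_{\beta^{(j)}} \big\| X_{B_j,j} - X_{B_j,N_j}\,\beta^{(j)} \big\|_2^2 \quad \text{subject to } \beta^{(j)} \ge 0$$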
Recall that $N_j$ represents the indices of cells that are candidate neighbors of cell j. The response $X_{B_j,j}$ is a vector representing the $B_j$ rows in the j-th column of X, the design matrix $X_{B_j,N_j}$ is a sub-matrix of X with dimensions $|B_j| \times |N_j|$, and the coefficients $\beta^{(j)}$ is a vector of length $|N_j|$. Note that NNLS itself has the property of leading to a sparse estimate $\hat\beta^{(j)}$, whose components may have exact zeros (ref. 39), so NNLS can be used to select similar cells of cell j from its neighbors $N_j$. Finally, the estimated coefficients $\hat\beta^{(j)}$ from the set $B_j$ are used to impute the expression of genes in the set $A_j$ in cell j:
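Equation 5:
$$\hat X_{ij} = \begin{cases} X_{ij}, & i \in B_j \\ X_{i,N_j}\,\hat\beta^{(j)}, & i \in A_j \end{cases}$$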
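After all that, the core idea is this: We learn cells' similarities through the gene set Bj. Then we impute the expression of genes in the set Aj by borrowing information from the same gene's expression in other similar cells learned from Bj. NNLS stands for non-negative least squares regression, and it is what picks out the cells similar to cell j among its candidate neighbors. The coefficient vector β̂(j) is estimated from gene set Bj and is then used to impute the dropout-affected expression values of the genes in Aj. Here Aj is the set of genes that need imputation, while Bj is the set of genes measured accurately enough to need none; the split comes from comparing dij with the threshold t. The resulting estimate β̂(j) is sparse, meaning that some of its components may be exactly zero (whose components may have exact zeros), so cells with a zero coefficient contribute nothing to the imputation.
At this point every entry Xij of the matrix to be imputed falls into one of two cases, depending on whether gene i belongs to Aj or to Bj.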
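The imputation of a single cell can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: `impute_cell` is a hypothetical name, X is a genes × cells matrix, `d` holds precomputed dropout probabilities, `neighbors` lists the candidate neighbor cells of cell j, and the regression is solved with `scipy.optimize.nnls`. Genes in Bj keep their observed values, and genes in Aj are replaced by the weighted combination of the neighbors' values.

```python
import numpy as np
from scipy.optimize import nnls


def impute_cell(X, d, j, neighbors, t=0.5):
    """Hypothetical helper: impute column j of X (genes x cells)."""
    in_A = d[:, j] >= t          # genes that need imputation in cell j
    in_B = ~in_A                 # genes trusted as accurately measured

    # Fit cell j on its candidate neighbors using only the trusted genes.
    design = X[in_B][:, neighbors]
    beta, _ = nnls(design, X[in_B, j])

    # Keep trusted values; replace the rest by the weighted neighbor values.
    x_hat = X[:, j].astype(float)
    x_hat[in_A] = X[in_A][:, neighbors] @ beta
    return x_hat, beta


# Toy data: on the trusted genes, cell 0 is the average of cells 1 and 2.
X = np.array([[3.0, 2.0, 4.0, 1.0],
              [3.0, 4.0, 2.0, 7.0],
              [3.0, 6.0, 0.0, 3.0],
              [0.0, 8.0, 6.0, 5.0],    # gene 3 looks like a dropout in cell 0
              [3.0, 2.0, 4.0, 9.0]])
d = np.zeros_like(X)
d[3, 0] = 0.95                         # made-up dropout probability
x_hat, beta = impute_cell(X, d, j=0, neighbors=[1, 2, 3], t=0.5)
```

In this toy case the fitted weights select cells 1 and 2 and give cell 3 a zero coefficient, which illustrates how NNLS doubles as a neighbor-selection step.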
We construct a separate regression model for each cell to impute the expression of genes with high dropout probabilities. The whole scImpute procedure needs only two user-set parameters: K, the number of groups to cluster the cells into, and t, the threshold on dropout probabilities.
Advantages of scImpute according to the article: scImpute simultaneously determines the values that need imputation, and would not introduce biases to the high expression values of accurately measured genes. That said, scImpute's imputation is relatively conservative: it does not over-impute, nor does it leave the data too sparse.
############validation step
Generation of simulated scRNA-seq data. ## see the paper for details
Four evaluation measures of clustering results
(adjusted Rand index, Jaccard index, normalized mutual information (nmi), and purity)
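adjusted Rand index (ARI): a classic, widely used measure for evaluating clustering results; it penalizes both false-positive and false-negative assignments.
Jaccard index (JI): similar to ARI, but it does not accurately reflect true negatives.
NMI: interprets the similarity between two groupings of the cells from an information-theoretic point of view.
purity: the percentage of samples that belong to the dominant true class of the cluster they were assigned to.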
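Three of these measures are easy to write down with pair counting, which also makes the "true negative" remark concrete. The sketch below uses the standard textbook definitions, which may differ in detail from what the paper's evaluation code does; the function names are made up and NMI is left out. It assumes at least one pair of cells shares a label in either grouping, otherwise the ratios are undefined.

```python
from collections import Counter
from itertools import combinations


def pair_counts(truth, pred):
    """Count cell pairs by agreement between true and predicted groups."""
    tp = fp = fn = tn = 0
    for (t1, p1), (t2, p2) in combinations(zip(truth, pred), 2):
        same_truth, same_pred = t1 == t2, p1 == p2
        if same_truth and same_pred:
            tp += 1      # together in both
        elif same_pred:
            fp += 1      # wrongly put together
        elif same_truth:
            fn += 1      # wrongly split apart
        else:
            tn += 1      # apart in both
    return tp, fp, fn, tn


def jaccard_index(truth, pred):
    """Pairs together in both, over pairs together in at least one."""
    tp, fp, fn, _ = pair_counts(truth, pred)
    return tp / (tp + fp + fn)


def adjusted_rand_index(truth, pred):
    """Rand index corrected for the agreement expected by chance."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    n_pairs = tp + fp + fn + tn
    expected = (tp + fp) * (tp + fn) / n_pairs
    max_index = 0.5 * ((tp + fp) + (tp + fn))
    return (tp - expected) / (max_index - expected)


def purity(truth, pred):
    """Fraction of cells in the majority true class of their cluster."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)
```

Note that `jaccard_index` never looks at `tn`, while the chance correction in `adjusted_rand_index` depends on the total number of pairs and therefore on the true negatives as well.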