• SC3聚类 | 拉普拉斯矩阵 | Laplacian matrix | 图论 | R代码


    Laplacian和PCA貌似是同一种性质的方法,坐标系变换。只是拉普拉斯属于图论的范畴,术语更加专业了。

    要看就把一篇文章看完整,再看其中有什么值得借鉴的,总结归纳理解后的东西才是属于你的。

    问题:

    1. 这篇文章有哪些亮点决定他能发NM?单细胞,consensus,较好的表现,包装了一些专业的术语,显得自己很专业,其实真正做的东西很少;

    2. consensus方法的本质是什么?

    3. 工具的评估准则?ARI,silhouette index

    4. SC3的最大缺点是什么?速度太慢,超过1000个细胞就非常耗费计算和存储资源

    5. 能看懂SC3这个R包的逻辑吗?核心的就4步,多种距离度量,转换,kmeans聚类,consensus;

    The main sc3 method explained above is a wrapper that calls several other SC3 methods in the following order:

    • sc3_prepare
    • sc3_estimate_k - Tracy-Widom theory - random matrix theory (RMT)
    • sc3_calc_dists
    • sc3_calc_transfs
    • sc3_kmeans
    • sc3_calc_consens
    • sc3_calc_biology

    6. 有很多问题没有回答,这篇文章偏技工!核心就是kmeans,打了个复杂的包而已。

    • 不同距离的度量有什么差异?
    • 为什么要做两种转换PCA和laplacian?
    • 为什么选择了kmeans?不知道它有天然的劣势吗
    • 做consensus的理论依据是什么?凭什么说做了一致性后结果就更好?

    最近在看SC3聚类这篇文章,SC3使用了这个工具。

    SC3: consensus clustering of single-cell RNA-seq data

    All distance matrices are then transformed using either principal component analysis (PCA) or by calculating the eigenvectors of the associated graph Laplacian (L = I – D–1/2AD–1/2, where I is the identity matrix, A is a similarity matrix (A = e–A′/max(A′)), where A′ is a distance matrix) and D is the degree matrix of A, a diagonal matrix that contains the row-sums of A on the diagonal (Dii = ΣjAij). The columns of the resulting matrices are then sorted in ascending order by their corresponding eigenvalues.

    先看下该工具的功能:SC3 package manual

    跑一下常规代码:

    library(SingleCellExperiment)
    library(SC3)
    library(scater)
    
    head(ann)
    yan[1:3, 1:3]
    
    # create a SingleCellExperiment object
    sce <- SingleCellExperiment(
      assays = list(
        counts = as.matrix(yan),
        logcounts = log2(as.matrix(yan) + 1)
      ), 
      colData = ann
    )
    
    # define feature names in feature_symbol column
    rowData(sce)$feature_symbol <- rownames(sce)
    # remove features with duplicated names
    sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]
    
    # define spike-ins
    isSpike(sce, "ERCC") <- grepl("ERCC", rowData(sce)$feature_symbol)
    
    plotPCA(sce, colour_by = "cell_type1")
    
    sce <- sc3(sce, ks = 2:4, biology = TRUE)
    # sc3_interactive(sce)
    # sc3_export_results_xls(sce)
    
    ######################################
    sce <- sc3_prepare(sce)
    
    sce <- sc3_estimate_k(sce)
    
    sce <- sc3_calc_dists(sce)
    names(metadata(sce)$sc3$distances)
    
    sce <- sc3_calc_transfs(sce)
    names(metadata(sce)$sc3$transformations)
    metadata(sce)$sc3$distances
    
    sce <- sc3_kmeans(sce, ks = 2:4)
    names(metadata(sce)$sc3$kmeans)
    
    col_data <- colData(sce)
    head(col_data[ , grep("sc3_", colnames(col_data))])
    sce <- sc3_calc_consens(sce)
    names(metadata(sce)$sc3$consensus)
    names(metadata(sce)$sc3$consensus$`3`)
    
    col_data <- colData(sce)
    head(col_data[ , grep("sc3_", colnames(col_data))])
    
    sce <- sc3_calc_biology(sce, ks = 2:4)
    
    sce <- sc3_run_svm(sce, ks = 2:4)
    col_data <- colData(sce)
    head(col_data[ , grep("sc3_", colnames(col_data))])
    

      

    接下来会尝试拆一下该工具。

    怎么拆这个工具?

    这种封装的很好的R包其实比较难拆,一般的通过函数名字就可以看到R代码,但这里你输入函数名,如sc3_calc_dists,看到的只是以下的封装好的代码:

    new("nonstandardGenericFunction", .Data = function (object) 
    {
        standardGeneric("sc3_calc_dists")
    }, generic = structure("sc3_calc_dists", package = "SC3"), package = "SC3", 
        group = list(), valueClass = character(0), signature = "object", 
        default = NULL, skeleton = (function (object) 
        stop("invalid call in method dispatch to 'sc3_calc_dists' (no default method)", 
            domain = NA))(object))  

    暂时还不熟悉这种形式,所以只能通过函数名去GitHub里面查了。

    GitHub真的很优秀,可以直接查文件内部代码,可以很快定位到sc3_calc_dists。

    再配合这个目录插件,效率提高了不少,https://www.octotree.io/?utm_source=lite&utm_medium=extension

    以下是封装前的代码:

    #' Calculate distances between the cells.
    #' 
    #' This function calculates distances between the cells. It
    #' creates and populates the following items of the code{sc3} slot of the code{metadata(object)}:
    #' itemize{
    #'   item code{distances} - contains a list of distance matrices corresponding to
    #'   Euclidean, Pearson and Spearman distances.
    #' }
    #' 
    #' @name sc3_calc_dists
    #' @aliases sc3_calc_dists, sc3_calc_dists,SingleCellExperiment-method
    #' 
    #' @param object an object of code{SingleCellExperiment} class
    #' 
    #' @return an object of code{SingleCellExperiment} class
    #' 
    #' @importFrom doRNG %dorng%
    #' @importFrom foreach foreach %dopar%
    #' @importFrom parallel makeCluster stopCluster
    #' @importFrom doParallel registerDoParallel
    sc3_calc_dists.SingleCellExperiment <- function(object) {
        dataset <- get_processed_dataset(object)
        
        # check whether in the SVM regime
        if (!is.null(metadata(object)$sc3$svm_train_inds)) {
            dataset <- dataset[, metadata(object)$sc3$svm_train_inds]
        }
        
        # NULLing the variables to avoid notes in R CMD CHECK
        i <- NULL
        
        distances <- c("euclidean", "pearson", "spearman")
        
        message("Calculating distances between the cells...")
        
        if (metadata(object)$sc3$n_cores > length(distances)) {
            n_cores <- length(distances)
        } else {
            n_cores <- metadata(object)$sc3$n_cores
        }
        
        cl <- parallel::makeCluster(n_cores, outfile = "")
        doParallel::registerDoParallel(cl, cores = n_cores)
        
        # calculate distances in parallel
        dists <- foreach::foreach(i = distances) %dorng% {
            try({
                calculate_distance(dataset, i)
            })
        }
        
        # stop local cluster
        parallel::stopCluster(cl)
        
        names(dists) <- distances
        
        metadata(object)$sc3$distances <- dists
        return(object)
    }
    
    #' @rdname sc3_calc_dists
    #' @aliases sc3_calc_dists
    setMethod("sc3_calc_dists", signature(object = "SingleCellExperiment"), sc3_calc_dists.SingleCellExperiment)  

    通过setMethod链接到一起的。

    顺路找到了原函数:

    #' Calculate a distance matrix
    #'
    #' Distance between the cells, i.e. columns, in the input expression matrix are
    #' calculated using the Euclidean, Pearson and Spearman metrics to construct
    #' distance matrices.
    #'
    #' @param data expression matrix
    #' @param method one of the distance metrics: 'spearman', 'pearson', 'euclidean'
    #' @return distance matrix
    #'
    #' @importFrom stats cor dist
    #' 
    #' @useDynLib SC3
    #' @importFrom Rcpp sourceCpp
    #'
    calculate_distance <- function(data, method) {
        return(if (method == "spearman") {
            as.matrix(1 - cor(data, method = "spearman"))
        } else if (method == "pearson") {
            as.matrix(1 - cor(data, method = "pearson"))
        } else {
            ED2(data)
        })
    }
    
    #' Distance matrix transformation
    #'
    #' All distance matrices are transformed using either principal component 
    #' analysis (PCA) or by calculating the 
    #' eigenvectors of the graph Laplacian (Spectral). 
    #' The columns of the resulting matrices are then sorted in 
    #' descending order by their corresponding eigenvalues.
    #'
    #' @param dists distance matrix
    #' @param method transformation method: either 'pca' or
    #' 'laplacian'
    #' @return transformed distance matrix
    #'
    #' @importFrom stats prcomp cmdscale
    #'
    transformation <- function(dists, method) {
        if (method == "pca") {
            t <- prcomp(dists, center = TRUE, scale. = TRUE)
            return(t$rotation)
        } else if (method == "laplacian") {
            L <- norm_laplacian(dists)
            l <- eigen(L)
            # sort eigenvectors by their eigenvalues
            return(l$vectors[, order(l$values)])
        }
    }
    
    #' Calculate consensus matrix
    #'
    #' Consensus matrix is calculated using the Cluster-based Similarity 
    #' Partitioning Algorithm (CSPA). For each clustering solution a binary 
    #' similarity matrix is constructed from the corresponding cell labels: 
    #' if two cells belong to the same cluster, their similarity is 1, otherwise 
    #' the similarity is 0. A consensus matrix is calculated by averaging all 
    #' similarity matrices.
    #'
    #' @param clusts a matrix containing clustering solutions in columns
    #' @return consensus matrix
    #' 
    #' @useDynLib SC3
    #' @importFrom Rcpp sourceCpp
    #' @export
    consensus_matrix <- function(clusts) {
        res <- consmx(clusts)
        colnames(res) <- as.character(c(1:nrow(clusts)))
        rownames(res) <- as.character(c(1:nrow(clusts)))
        return(res)
    }
    
    • 距离计算
    • 转换
    • consensus

    都在这里。。。  

    ED2是他们实验室自己用Rcpp写的一个计算欧氏距离的工具。

    transformation输入的是对称的距离矩阵(行列都是样本细胞),然后做完PCA,返回了rotation,不知道这样做有什么意义?

    还真有用PCA来处理距离相似度矩阵的,MDS,目的就是降维,因为后面要用kmean聚类;

    然后对每一个降维了的矩阵用kmeans;

    consensus用的是这个算法:Cluster-based Similarity Partitioning Algorithm (CSPA),做这个的意义何在?输入是每个细胞的多重聚类结果,然后做了一个一致性统一。

    参考:

    拉普拉斯矩阵(Laplacian matrix)

  • 相关阅读:
    [ html canvas getImageData Object.data.length ] canvas绘图属性 getImageData Object.data.length 属性讲解
    [ html canvas 模仿支付宝刮刮卡效果 ] canvas绘图属性 模仿支付宝刮刮卡效果实例演示
    [ javascript html Dom image 对象事件加载方式 ] 对象事件加载方式
    [ javascript New Image() ] New Image() 对象讲解
    Django安装及环境配置
    python使用opencv实现人脸识别系统
    redis 操作常用命令
    关于Python获取SQLSERVER数据库中文显示乱码问题
    Topshelf创建windows服务初探
    Js获取ip地址
  • 原文地址:https://www.cnblogs.com/leezx/p/10878506.html
Copyright © 2020-2023  润新知