• Spark (1.1) MLlib source code analysis


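    Spark MLlib 1.1 added the stat package, which contains a number of statistics-related functions. This article analyzes the principle behind the chi-squared test in that package and how it is implemented: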

    I. Basic principle

      The stat package implements Pearson's chi-squared test, which covers the following two kinds of test:

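        (1) Goodness of Fit test: checks whether the frequency distribution of a set of observations differs from a theoretical distribution.

        (2) Independence test: checks whether paired observations drawn from two variables are independent of each other (for example: draw one person from country A and one from country B each time, and see whether their response is unrelated to nationality).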

      Formula:

     \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}

        where O denotes the observed value and E the expected value.

      For the underlying theory, see: http://zh.wikipedia.org/wiki/%E7%9A%AE%E7%88%BE%E6%A3%AE%E5%8D%A1%E6%96%B9%E6%AA%A2%E5%AE%9A
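      To make the formula concrete, here is a minimal sketch in plain Scala. It does not use MLlib; the object name `PearsonSketch`, the method `chiSquared`, and the die-roll counts are assumptions chosen for illustration only.

```scala
// Illustrative sketch of Pearson's statistic; not part of MLlib.
object PearsonSketch {
  /** Sums (O - E)^2 / E over paired observed and expected counts. */
  def chiSquared(observed: Seq[Double], expected: Seq[Double]): Double = {
    require(observed.length == expected.length,
      "observed and expected must have the same length")
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: a die rolled 60 times; a fair die expects 10 per face.
    val observed = Seq(8.0, 9.0, 12.0, 11.0, 6.0, 14.0)
    val expected = Seq.fill(6)(10.0)
    println(chiSquared(observed, expected))
  }
}
```

      For these counts the squared deviations are 4, 1, 4, 1, 16 and 16, so the statistic comes out at about 42 / 10 = 4.2.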

    II. Java API usage example

      https://github.com/tovin-xu/mllib_example/blob/master/src/main/java/com/mllib/example/stat/ChiSquaredSuite.java

    III. Source code analysis

      1. External API

        The Statistics class exposes four external interfaces:

    // Goodness of Fit test
    def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult = {
        ChiSqTest.chiSquared(observed, expected)
      }
    //Goodness of Fit test
    def chiSqTest(observed: Vector): ChiSqTestResult = ChiSqTest.chiSquared(observed)
    
    //independence test
    def chiSqTest(observed: Matrix): ChiSqTestResult = ChiSqTest.chiSquaredMatrix(observed)
    //independence test
    def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
        ChiSqTest.chiSquaredFeatures(data)
    }
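      As a rough illustration of how the vector and matrix overloads might be called from Scala, consider the sketch below. It assumes spark-mllib is on the classpath; the object name `ChiSqApiSketch` and the input counts are made up for the example. The `RDD[LabeledPoint]` overload is left out because it needs a running SparkContext.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.stat.Statistics

// Illustrative caller; the data below is hypothetical.
object ChiSqApiSketch {
  def main(args: Array[String]): Unit = {
    // Goodness of fit against the uniform distribution (no expected vector passed).
    val observed = Vectors.dense(8.0, 9.0, 12.0, 11.0, 6.0, 14.0)
    val gof = Statistics.chiSqTest(observed)
    println(s"statistic=${gof.statistic} df=${gof.degreesOfFreedom} pValue=${gof.pValue}")

    // Independence test on a 2 x 3 contingency table.
    // Matrices.dense takes the values in column-major order.
    val table = Matrices.dense(2, 3, Array(10.0, 30.0, 20.0, 20.0, 30.0, 10.0))
    val ind = Statistics.chiSqTest(table)
    println(s"statistic=${ind.statistic} df=${ind.degreesOfFreedom} pValue=${ind.pValue}")
  }
}
```

      Neither call needs a SparkContext, since both work on local vectors and matrices.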

      2. Goodness of Fit test implementation

      This one is fairly simple. The key step is computing the chi-squared value from (observed - expected)^2 / expected.

     /*
       * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
       * Uniform distribution is assumed when `expected` is not passed in.
       */
      def chiSquared(observed: Vector,
          expected: Vector = Vectors.dense(Array[Double]()),
          methodName: String = PEARSON.name): ChiSqTestResult = {
    
        // Validate input arguments
        val method = methodFromString(methodName)
        if (expected.size != 0 && observed.size != expected.size) {
          throw new IllegalArgumentException("observed and expected must be of the same size.")
        }
        val size = observed.size
        if (size > 1000) {
          logWarning("Chi-squared approximation may not be accurate due to low expected frequencies "
            + s" as a result of a large number of categories: $size.")
        }
        val obsArr = observed.toArray
        // If expected is not set, default every entry to 1.0 / size
        val expArr = if (expected.size == 0) Array.tabulate(size)(_ => 1.0 / size) else expected.toArray
    
        // Both observed and expected values must be non-negative
        if (!obsArr.forall(_ >= 0.0)) {
          throw new IllegalArgumentException("Negative entries disallowed in the observed vector.")
        }
        if (expected.size != 0 && ! expArr.forall(_ >= 0.0)) {
          throw new IllegalArgumentException("Negative entries disallowed in the expected vector.")
        }
    
        // Determine the scaling factor for expected
        val obsSum = obsArr.sum
        val expSum = if (expected.size == 0.0) 1.0 else expArr.sum
        val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum
    
        // compute chi-squared statistic
        val statistic = obsArr.zip(expArr).foldLeft(0.0) { case (stat, (obs, exp)) =>
          if (exp == 0.0) {
            if (obs == 0.0) {
              throw new IllegalArgumentException("Chi-squared statistic undefined for input vectors due"
                + " to 0.0 values in both observed and expected.")
            } else {
              return new ChiSqTestResult(0.0, size - 1, Double.PositiveInfinity, PEARSON.name,
                NullHypothesis.goodnessOfFit.toString)
            }
          }
          // Accumulate (observed - expected)^2 / expected
          if (scale == 1.0) {
            stat + method.chiSqFunc(obs, exp)
          } else {
            stat + method.chiSqFunc(obs, exp * scale)
          }
        }
        val df = size - 1
        val pValue = chiSquareComplemented(df, statistic)
        new ChiSqTestResult(pValue, df, statistic, PEARSON.name, NullHypothesis.goodnessOfFit.toString)
      }
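      The scaling step is the least obvious part of the method above: when the sums of `observed` and `expected` differ, `expected` is rescaled to the observed total, so relative frequencies and raw counts lead to the same statistic. The following plain-Scala sketch isolates that logic. The names `GoodnessOfFitSketch` and `statistic` are illustrative assumptions, and the zero-expected handling of the real method is deliberately left out.

```scala
// Illustrative sketch of the scaling logic; not the MLlib implementation.
object GoodnessOfFitSketch {
  /** Rescales `expected` to the observed total, then sums (O - E)^2 / E. */
  def statistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length,
      "observed and expected must be of the same size")
    val obsSum = observed.sum
    val expSum = expected.sum
    // Same tolerance as in the method above: skip scaling if the sums already agree.
    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum
    observed.zip(expected).map { case (o, e) =>
      val scaled = e * scale
      (o - scaled) * (o - scaled) / scaled
    }.sum
  }

  def main(args: Array[String]): Unit = {
    val observed = Array(20.0, 30.0, 50.0)
    // Expected given as relative frequencies, then as raw counts.
    println(statistic(observed, Array(0.25, 0.25, 0.5)))
    println(statistic(observed, Array(25.0, 25.0, 50.0)))
  }
}
```

      Both calls work with effective expected counts of 25, 25 and 50, so they yield the same statistic, 1 + 1 + 0 = 2.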

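      3. Independence test implementation

        First compute the expected values with the formula below, where the matrix has r rows and c columns and N is the sum of all observed values: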

         E_{i,j} = \frac{\left(\sum_{n_c=1}^{c} O_{i,n_c}\right) \cdot \left(\sum_{n_r=1}^{r} O_{n_r,j}\right)}{N}

        Then compute the chi-squared value from (observed - expected)^2 / expected.
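        As a small worked example (the table is hypothetical and chosen so the arithmetic stays simple), take a 2 x 3 table whose row sums are both 60 and whose column sums are all 40:

```latex
\begin{aligned}
O &= \begin{pmatrix} 10 & 20 & 30 \\ 30 & 20 & 10 \end{pmatrix}, \qquad N = 120, \\
E_{i,j} &= \frac{60 \cdot 40}{120} = 20 \quad \text{for every cell}, \\
\chi^2 &= \frac{10^2 + 0^2 + 10^2 + 10^2 + 0^2 + 10^2}{20} = \frac{400}{20} = 20.
\end{aligned}
```

        The degrees of freedom for this table are (2 - 1)(3 - 1) = 2.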

    /*
       * Pearson's independence test on the input contingency matrix.
       * TODO: optimize for SparseMatrix when it becomes supported.
       */
      def chiSquaredMatrix(counts: Matrix, methodName:String = PEARSON.name): ChiSqTestResult = {
        val method = methodFromString(methodName)
        val numRows = counts.numRows
        val numCols = counts.numCols
    
        // get row and column sums
        val colSums = new Array[Double](numCols)
        val rowSums = new Array[Double](numRows)
        val colMajorArr = counts.toArray
        var i = 0
        while (i < colMajorArr.size) {
          val elem = colMajorArr(i)
          if (elem < 0.0) {
            throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
          }
          colSums(i / numRows) += elem
          rowSums(i % numRows) += elem
          i += 1
        }
        val total = colSums.sum
    
        // second pass to collect statistic
        var statistic = 0.0
        var j = 0
        while (j < colMajorArr.size) {
          val col = j / numRows
          val colSum = colSums(col)
          if (colSum == 0.0) {
            throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
              + s"0 sum in column [$col].")
          }
          val row = j % numRows
          val rowSum = rowSums(row)
          if (rowSum == 0.0) {
            throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
              + s"0 sum in row [$row].")
          }
          val expected = colSum * rowSum / total
          statistic += method.chiSqFunc(colMajorArr(j), expected)
          j += 1
        }
        val df = (numCols - 1) * (numRows - 1)
        val pValue = chiSquareComplemented(df, statistic)
        new ChiSqTestResult(pValue, df, statistic, methodName, NullHypothesis.independence.toString)
      }
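      The index arithmetic in the two passes above relies on the matrix being stored column-major: element `i` of the flat array belongs to row `i % numRows` and column `i / numRows`. The plain-Scala sketch below reproduces just that part without MLlib. The names `IndependenceSketch`, `statistic` and `degreesOfFreedom` are illustrative assumptions, and the zero row/column sum checks are omitted.

```scala
// Illustrative sketch of the two-pass computation; not the MLlib implementation.
object IndependenceSketch {
  /** Pearson statistic for a contingency table stored in column-major order. */
  def statistic(numRows: Int, numCols: Int, colMajor: Array[Double]): Double = {
    require(colMajor.length == numRows * numCols, "array size must be numRows * numCols")
    val rowSums = new Array[Double](numRows)
    val colSums = new Array[Double](numCols)
    // First pass: accumulate row and column sums.
    for (i <- colMajor.indices) {
      rowSums(i % numRows) += colMajor(i)
      colSums(i / numRows) += colMajor(i)
    }
    val total = colSums.sum
    // Second pass: expected = rowSum * colSum / total, then (O - E)^2 / E.
    colMajor.indices.map { i =>
      val expected = rowSums(i % numRows) * colSums(i / numRows) / total
      val diff = colMajor(i) - expected
      diff * diff / expected
    }.sum
  }

  def degreesOfFreedom(numRows: Int, numCols: Int): Int = (numRows - 1) * (numCols - 1)

  def main(args: Array[String]): Unit = {
    // Hypothetical 2 x 3 table with rows (10, 20, 30) and (30, 20, 10), column-major.
    val table = Array(10.0, 30.0, 20.0, 20.0, 30.0, 10.0)
    println(statistic(2, 3, table))
    println(degreesOfFreedom(2, 3))
  }
}
```

      For this table every expected count is 60 * 40 / 120 = 20, so the statistic is 400 / 20 = 20 with 2 degrees of freedom.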

    Copyright notice: this is an original blog article and may not be reproduced without permission.

  • Original article: https://www.cnblogs.com/zfyouxi/p/4731120.html