• Data mining: a source-code analysis of the Apriori association-rule algorithm in Weka


      Compared with machine learning, the Apriori association-rule algorithm leans more toward data mining.

    1) Calling Weka's Apriori association-rule algorithm from a test program, as follows:

    try {
                File file = new File("F:/tools/lib/data/contact-lenses.arff"); // note: "F:\tools" would contain a tab escape (\t)
                ArffLoader loader = new ArffLoader();
                loader.setFile(file);
                Instances m_instances = loader.getDataSet();
                
                Discretize discretize = new Discretize();
                discretize.setInputFormat(m_instances);
                m_instances = Filter.useFilter(m_instances, discretize);
                Apriori apriori = new Apriori();
                apriori.buildAssociations(m_instances);
                System.out.println(apriori.toString());
            } catch (Exception e) {
                e.printStackTrace();
            }

    Steps

    1 Read the data file and load the sample set as Instances

    2 Discretize the attributes with the Discretize filter

    3 Build the Apriori association-rule model

    4 Output the large (frequent) itemsets and the set of association rules

    2) When the associator is created, it calls the method that sets the default options:

      public void resetOptions() {
    
        m_removeMissingCols = false;
        m_verbose = false;
        m_delta = 0.05;
        m_minMetric = 0.90;
        m_numRules = 10;
        m_lowerBoundMinSupport = 0.1;
        m_upperBoundMinSupport = 1.0;
        m_significanceLevel = -1;
        m_outputItemSets = false;
        m_car = false;
        m_classIndex = -1;
      }

    A detailed explanation of these options is given in Note 1 at the end.
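These defaults mean Apriori does not mine at one fixed support: it starts just below m_upperBoundMinSupport and repeatedly lowers the threshold by m_delta until enough rules are found or the lower bound is reached. The schedule can be sketched standalone (plain Java, not Weka code; SupportSchedule and its method are illustrative names):

```java
public class SupportSchedule {
    // Enumerate the candidate minimum-support values the Apriori loop
    // iterates over: from (upper - delta) down to the lower bound, step delta.
    public static java.util.List<Double> schedule(double upper, double lower, double delta) {
        java.util.List<Double> out = new java.util.ArrayList<>();
        double s = upper - delta;
        while (s > lower || Math.abs(s - lower) < 1e-9) {
            out.add(Math.round(s * 100.0) / 100.0); // round to avoid FP drift
            s -= delta;
        }
        return out;
    }

    public static void main(String[] args) {
        // Defaults from resetOptions(): upper = 1.0, lower = 0.1, delta = 0.05
        java.util.List<Double> s = schedule(1.0, 0.1, 0.05);
        System.out.println(s.size() + " candidate supports, from " + s.get(0)
            + " down to " + s.get(s.size() - 1));
    }
}
```

With the defaults this yields 18 candidate values, 0.95 down to 0.10. In the console run in Note 2 below, the loop stops at support 0.20 (the 16th candidate) because 10 rules had already been found, which matches "Number of cycles performed: 16".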

    3) Analysis of the buildAssociations method; source code:

    public void buildAssociations(Instances instances) throws Exception {
    
        double[] confidences, supports;
        int[] indices;
        FastVector[] sortedRuleSet;
        int necSupport = 0;
    
        instances = new Instances(instances);
    
        if (m_removeMissingCols) {
          instances = removeMissingColumns(instances);
        }
        if (m_car && m_metricType != CONFIDENCE)
          throw new Exception("For CAR-Mining metric type has to be confidence!");
    
        // only set class index if CAR is requested
        if (m_car) {
          if (m_classIndex == -1) {
            instances.setClassIndex(instances.numAttributes() - 1);
          } else if (m_classIndex <= instances.numAttributes() && m_classIndex > 0) {
            instances.setClassIndex(m_classIndex - 1);
          } else {
            throw new Exception("Invalid class index.");
          }
        }
    
        // can associator handle the data?
        getCapabilities().testWithFail(instances);
    
        m_cycles = 0;
    
        // make sure that the lower bound is equal to at least one instance
        double lowerBoundMinSupportToUse = (m_lowerBoundMinSupport
            * instances.numInstances() < 1.0) ? 1.0 / instances.numInstances()
            : m_lowerBoundMinSupport;
    
        if (m_car) {
          // m_instances does not contain the class attribute
          m_instances = LabeledItemSet.divide(instances, false);
    
          // m_onlyClass contains only the class attribute
          m_onlyClass = LabeledItemSet.divide(instances, true);
        } else
          m_instances = instances;
    
        if (m_car && m_numRules == Integer.MAX_VALUE) {
          // Set desired minimum support
          m_minSupport = lowerBoundMinSupportToUse;
        } else {
          // Decrease minimum support until desired number of rules found.
          m_minSupport = m_upperBoundMinSupport - m_delta;
          m_minSupport = (m_minSupport < lowerBoundMinSupportToUse) ? lowerBoundMinSupportToUse
              : m_minSupport;
        }
    
        do {
    
          // Reserve space for variables
          m_Ls = new FastVector();
          m_hashtables = new FastVector();
          m_allTheRules = new FastVector[6];
          m_allTheRules[0] = new FastVector();
          m_allTheRules[1] = new FastVector();
          m_allTheRules[2] = new FastVector();
          if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
            m_allTheRules[3] = new FastVector();
            m_allTheRules[4] = new FastVector();
            m_allTheRules[5] = new FastVector();
          }
          sortedRuleSet = new FastVector[6];
          sortedRuleSet[0] = new FastVector();
          sortedRuleSet[1] = new FastVector();
          sortedRuleSet[2] = new FastVector();
          if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
            sortedRuleSet[3] = new FastVector();
            sortedRuleSet[4] = new FastVector();
            sortedRuleSet[5] = new FastVector();
          }
          if (!m_car) {
            // Find large itemsets and rules
            findLargeItemSets();
            if (m_significanceLevel != -1 || m_metricType != CONFIDENCE)
              findRulesBruteForce();
            else
              findRulesQuickly();
          } else {
            findLargeCarItemSets();
            findCarRulesQuickly();
          }
    
          // prune rules for upper bound min support
          if (m_upperBoundMinSupport < 1.0) {
            pruneRulesForUpperBoundSupport();
          }
    
          int j = m_allTheRules[2].size() - 1;
          supports = new double[m_allTheRules[2].size()];
          for (int i = 0; i < (j + 1); i++)
            supports[j - i] = ((double) ((ItemSet) m_allTheRules[1]
                .elementAt(j - i)).support()) * (-1);
          indices = Utils.stableSort(supports);
          for (int i = 0; i < (j + 1); i++) {
            sortedRuleSet[0].addElement(m_allTheRules[0].elementAt(indices[j - i]));
            sortedRuleSet[1].addElement(m_allTheRules[1].elementAt(indices[j - i]));
            sortedRuleSet[2].addElement(m_allTheRules[2].elementAt(indices[j - i]));
            if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
              sortedRuleSet[3].addElement(m_allTheRules[3]
                  .elementAt(indices[j - i]));
              sortedRuleSet[4].addElement(m_allTheRules[4]
                  .elementAt(indices[j - i]));
              sortedRuleSet[5].addElement(m_allTheRules[5]
                  .elementAt(indices[j - i]));
            }
          }
    
          // Sort rules according to their confidence
          m_allTheRules[0].removeAllElements();
          m_allTheRules[1].removeAllElements();
          m_allTheRules[2].removeAllElements();
          if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
            m_allTheRules[3].removeAllElements();
            m_allTheRules[4].removeAllElements();
            m_allTheRules[5].removeAllElements();
          }
          confidences = new double[sortedRuleSet[2].size()];
          int sortType = 2 + m_metricType;
    
          for (int i = 0; i < sortedRuleSet[2].size(); i++)
            confidences[i] = ((Double) sortedRuleSet[sortType].elementAt(i))
                .doubleValue();
          indices = Utils.stableSort(confidences);
          for (int i = sortedRuleSet[0].size() - 1; (i >= (sortedRuleSet[0].size() - m_numRules))
              && (i >= 0); i--) {
            m_allTheRules[0].addElement(sortedRuleSet[0].elementAt(indices[i]));
            m_allTheRules[1].addElement(sortedRuleSet[1].elementAt(indices[i]));
            m_allTheRules[2].addElement(sortedRuleSet[2].elementAt(indices[i]));
            if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
              m_allTheRules[3].addElement(sortedRuleSet[3].elementAt(indices[i]));
              m_allTheRules[4].addElement(sortedRuleSet[4].elementAt(indices[i]));
              m_allTheRules[5].addElement(sortedRuleSet[5].elementAt(indices[i]));
            }
          }
    
          if (m_verbose) {
            if (m_Ls.size() > 1) {
              System.out.println(toString());
            }
          }
    
          if (m_minSupport == lowerBoundMinSupportToUse
              || m_minSupport - m_delta > lowerBoundMinSupportToUse)
            m_minSupport -= m_delta;
          else
            m_minSupport = lowerBoundMinSupportToUse;
    
          necSupport = Math.round((float) ((m_minSupport * m_instances
              .numInstances()) + 0.5));
    
          m_cycles++;
        } while ((m_allTheRules[0].size() < m_numRules)
            && (Utils.grOrEq(m_minSupport, lowerBoundMinSupportToUse))
            /* (necSupport >= lowerBoundNumInstancesSupport) */
            /* (Utils.grOrEq(m_minSupport, m_lowerBoundMinSupport)) */&& (necSupport >= 1));
        m_minSupport += m_delta;
      }

    Main steps:

    1 removeMissingColumns deletes the columns whose values are all missing;

    2 If m_car is true, the data is split: m_car means mining class association rules, so LabeledItemSet.divide separates the instances into the class attribute (m_onlyClass) and all the remaining attributes (m_instances);

    3 findLargeItemSets finds the large (frequent) itemsets; its source code is shown below;

    4 findRulesQuickly generates the complete set of association rules from the large itemsets;

    5 pruneRulesForUpperBoundSupport removes the rules whose support exceeds the upper bound of the minimum support;

    6 The rule set is sorted by support and then by confidence (or the chosen metric).

    4) Source code of findLargeItemSets, which finds the large (frequent) itemsets:

    private void findLargeItemSets() throws Exception {
    
        FastVector kMinusOneSets, kSets;
        Hashtable hashtable;
        int necSupport, necMaxSupport, i = 0;
    
        // Find large itemsets
    
        // minimum support
        necSupport = (int) (m_minSupport * m_instances.numInstances() + 0.5);
        necMaxSupport = (int) (m_upperBoundMinSupport * m_instances.numInstances() + 0.5);
    
        kSets = AprioriItemSet.singletons(m_instances);
        AprioriItemSet.upDateCounters(kSets, m_instances);
        kSets = AprioriItemSet.deleteItemSets(kSets, necSupport,
            m_instances.numInstances());
        if (kSets.size() == 0)
          return;
        do {
          m_Ls.addElement(kSets);
          kMinusOneSets = kSets;
          kSets = AprioriItemSet.mergeAllItemSets(kMinusOneSets, i,
              m_instances.numInstances());
          hashtable = AprioriItemSet.getHashtable(kMinusOneSets,
              kMinusOneSets.size());
          m_hashtables.addElement(hashtable);
          kSets = AprioriItemSet.pruneItemSets(kSets, hashtable);
          AprioriItemSet.upDateCounters(kSets, m_instances);
          kSets = AprioriItemSet.deleteItemSets(kSets, necSupport,
              m_instances.numInstances());
          i++;
        } while (kSets.size() > 0);
      }

    Main steps:

    1 AprioriItemSet.singletons converts the header information of the given data set into a set of singleton (one-item) itemsets; the values in the header are in lexicographic order;

    2 upDateCounters scans the instances and updates the support counter of every candidate itemset, which yields the frequent 1-itemsets after pruning;

    3 AprioriItemSet.deleteItemSets removes the itemsets whose support lies outside the required support range;

    4 mergeAllItemSets (source below) iteratively generates the candidate k-itemsets from the frequent (k-1)-itemsets, and deleteItemSets again discards those outside the support range.
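The level-wise rhythm above, L(k-1) → candidates C(k) → L(k), can be reproduced on a toy transaction set. The sketch below is plain Java, not Weka's classes; itemsets are sorted integer lists and ToyApriori is an illustrative name:

```java
import java.util.*;

public class ToyApriori {
    // Count how many transactions contain every item of the candidate.
    static int support(List<Integer> cand, List<Set<Integer>> txns) {
        int n = 0;
        for (Set<Integer> t : txns) if (t.containsAll(cand)) n++;
        return n;
    }

    // Level-wise mining: frequent 1-itemsets, then merge, prune by support, repeat.
    public static List<List<Integer>> frequentItemSets(List<Set<Integer>> txns, int minCount) {
        SortedSet<Integer> items = new TreeSet<>();
        for (Set<Integer> t : txns) items.addAll(t);
        List<List<Integer>> current = new ArrayList<>(), all = new ArrayList<>();
        for (int it : items) {
            List<Integer> s = Collections.singletonList(it);
            if (support(s, txns) >= minCount) current.add(s);
        }
        while (!current.isEmpty()) {
            all.addAll(current);
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i++)
                for (int j = i + 1; j < current.size(); j++) {
                    List<Integer> a = current.get(i), b = current.get(j);
                    int k = a.size();
                    // merge only itemsets that share their first k-1 items
                    if (!a.subList(0, k - 1).equals(b.subList(0, k - 1))) continue;
                    List<Integer> cand = new ArrayList<>(a);
                    cand.add(b.get(k - 1));
                    if (support(cand, txns) >= minCount) next.add(cand);
                }
            current = next;
        }
        return all;
    }

    public static void main(String[] args) {
        List<Set<Integer>> txns = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 2, 3)),
            new HashSet<>(Arrays.asList(1, 2)),
            new HashSet<>(Arrays.asList(1, 3)),
            new HashSet<>(Arrays.asList(2, 3)));
        System.out.println(frequentItemSets(txns, 2));
    }
}
```

With minCount = 2 every single item and every pair is frequent, but the triple {1,2,3} appears in only one transaction and is pruned, so six itemsets survive.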

    5) Source code of mergeAllItemSets, which generates the candidate k-itemsets from the (k-1)-itemsets:

    public static FastVector mergeAllItemSets(FastVector itemSets, int size,
          int totalTrans) {
    
        FastVector newVector = new FastVector();
        ItemSet result;
        int numFound, k;
    
        for (int i = 0; i < itemSets.size(); i++) {
          ItemSet first = (ItemSet) itemSets.elementAt(i);
          out: for (int j = i + 1; j < itemSets.size(); j++) {
            ItemSet second = (ItemSet) itemSets.elementAt(j);
            result = new AprioriItemSet(totalTrans);
            result.m_items = new int[first.m_items.length];
    
            // Find and copy common prefix of size 'size'
            numFound = 0;
            k = 0;
            while (numFound < size) {
              if (first.m_items[k] == second.m_items[k]) {
                if (first.m_items[k] != -1)
                  numFound++;
                result.m_items[k] = first.m_items[k];
              } else
                break out;
              k++;
            }
    
            // Check difference
            while (k < first.m_items.length) {
              if ((first.m_items[k] != -1) && (second.m_items[k] != -1))
                break;
              else {
                if (first.m_items[k] != -1)
                  result.m_items[k] = first.m_items[k];
                else
                  result.m_items[k] = second.m_items[k];
              }
              k++;
            }
            if (k == first.m_items.length) {
              result.m_counter = 0;
              newVector.addElement(result);
            }
          }
        }
        return newVector;
      }
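Weka stores an itemset as one int per attribute, with -1 meaning "attribute not in the set". The merge logic above can be illustrated standalone; in this hedged sketch (toy arrays, not Weka's AprioriItemSet), 'size' is the number of leading items the two sets must share:

```java
public class ItemSetMerge {
    // Merge two itemsets (one int per attribute, -1 = absent) into a larger
    // candidate, mirroring mergeAllItemSets: they must agree on their first
    // 'size' items and must not both set any later attribute.
    // Returns null when the pair does not merge.
    public static int[] merge(int[] first, int[] second, int size) {
        int[] result = new int[first.length];
        java.util.Arrays.fill(result, -1);
        int numFound = 0, k = 0;
        // copy the common prefix of 'size' items
        while (numFound < size) {
            if (k >= first.length) return null;        // not enough shared items
            if (first[k] != second[k]) return null;    // prefixes differ: no merge
            if (first[k] != -1) numFound++;
            result[k] = first[k];
            k++;
        }
        // the remaining items must come from one set each, never from both
        for (; k < first.length; k++) {
            if (first[k] != -1 && second[k] != -1) return null;
            result[k] = (first[k] != -1) ? first[k] : second[k];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, -1, 0, -1};  // items on attributes 0 and 2
        int[] b = {1, -1, -1, 2};  // items on attributes 0 and 3
        // shared prefix of one item (attribute 0) -> merged 3-itemset
        System.out.println(java.util.Arrays.toString(merge(a, b, 1)));
    }
}
```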

    generateRules is then called to produce the association rules.

    6) Source code of generateRules, which produces the association rules:

    public FastVector[] generateRules(double minConfidence,
          FastVector hashtables, int numItemsInSet) {
    
        FastVector premises = new FastVector(), consequences = new FastVector(), conf = new FastVector();
        FastVector[] rules = new FastVector[3], moreResults;
        AprioriItemSet premise, consequence;
        Hashtable hashtable = (Hashtable) hashtables.elementAt(numItemsInSet - 2);
    
        // Generate all rules with one item in the consequence.
        for (int i = 0; i < m_items.length; i++)
          if (m_items[i] != -1) {
            premise = new AprioriItemSet(m_totalTransactions);
            consequence = new AprioriItemSet(m_totalTransactions);
            premise.m_items = new int[m_items.length];
            consequence.m_items = new int[m_items.length];
            consequence.m_counter = m_counter;
    
            for (int j = 0; j < m_items.length; j++)
              consequence.m_items[j] = -1;
            System.arraycopy(m_items, 0, premise.m_items, 0, m_items.length);
            premise.m_items[i] = -1;
    
            consequence.m_items[i] = m_items[i];
            premise.m_counter = ((Integer) hashtable.get(premise)).intValue();
            premises.addElement(premise);
            consequences.addElement(consequence);
            conf.addElement(new Double(confidenceForRule(premise, consequence)));
          }
        rules[0] = premises;
        rules[1] = consequences;
        rules[2] = conf;
        pruneRules(rules, minConfidence);
    
        // Generate all the other rules
        moreResults = moreComplexRules(rules, numItemsInSet, 1, minConfidence,
            hashtables);
        if (moreResults != null)
          for (int i = 0; i < moreResults[0].size(); i++) {
            rules[0].addElement(moreResults[0].elementAt(i));
            rules[1].addElement(moreResults[1].elementAt(i));
            rules[2].addElement(moreResults[2].elementAt(i));
          }
        return rules;
      }
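The first stage above builds every rule with a one-item consequence and computes its confidence as support(whole itemset) / support(premise), looking the premise's count up in a hashtable. A standalone sketch of that stage (toy String items and a plain support map; OneItemConsequences is an illustrative name, not Weka's API):

```java
import java.util.*;

public class OneItemConsequences {
    // For an itemset S with known support counts, emit every rule
    // (S \ {x}) ==> {x} together with its confidence.
    public static Map<String, Double> rules(Set<String> itemset,
                                            Map<Set<String>, Integer> support) {
        Map<String, Double> out = new LinkedHashMap<>();
        int total = support.get(itemset); // support of the whole itemset
        for (String x : itemset) {
            Set<String> premise = new TreeSet<>(itemset);
            premise.remove(x);
            double conf = (double) total / support.get(premise);
            out.put(premise + " ==> [" + x + "]", conf);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Set<String>, Integer> sup = new HashMap<>();
        sup.put(new TreeSet<>(Arrays.asList("a", "b")), 6); // {a,b} in 6 transactions
        sup.put(new TreeSet<>(Arrays.asList("a")), 8);      // {a} in 8
        sup.put(new TreeSet<>(Arrays.asList("b")), 6);      // {b} in 6
        System.out.println(rules(new TreeSet<>(Arrays.asList("a", "b")), sup));
    }
}
```

Here [a] ==> [b] gets confidence 6/8 = 0.75 while [b] ==> [a] gets 6/6 = 1.0, which is why the same itemset can yield rules on both sides of the confidence threshold.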

    A few remarks

    1) If you do not want items with value 0 to appear in the output, you can mark them as the missing value '?': the algorithm automatically drops all-missing columns, so those values take no part in rule generation;

    2) Sorting the rules by confidence is what associative classifiers need; if you only want to extract the rules, the sorting can be skipped.

    Notes

    1) Detailed explanation of Weka's Apriori options

    1.        car  If set to true, class association rules are mined instead of general association rules, i.e. only rules whose consequence is the class attribute are kept (the class is chosen via classIndex).
    2.        classIndex  Index of the class attribute. If set to -1, the last attribute is taken as the class.
    3.        delta  The step by which the support is iteratively decreased, until the minimum support is reached or the required number of rules has been generated.
    4.        lowerBoundMinSupport  Lower bound for the minimum support.
    5.        metricType  The metric by which rules are ranked: confidence (class association rules can only be mined with confidence), lift, leverage, or conviction.
    Besides confidence, Weka provides several metrics that measure how strongly a rule associates A with B:
    a)        Lift: P(A,B)/(P(A)P(B)). Lift = 1 means A and B are independent; the larger the value (> 1), the less likely it is that A and B co-occur in one basket by chance, i.e. the stronger the association.
    b)        Leverage: P(A,B) - P(A)P(B). Leverage = 0 means A and B are independent; the larger the leverage, the closer the relationship between A and B.
    c)        Conviction: P(A)P(!B)/P(A,!B), where !B means B does not occur. Conviction also measures the independence of A and B. From its relation to lift (negate B, substitute into the lift formula, and take the reciprocal) one can see that the larger the value, the stronger the association.
    6.        minMetric  Minimum value of the chosen metric.
    7.        numRules  Number of rules to find.
    8.        outputItemSets  If set to true, the itemsets are included in the output.
    9.        removeAllMissingCols  Remove the columns whose values are all missing.
    10.    significanceLevel  Significance level for the significance test (used with the confidence metric only).
    11.    upperBoundMinSupport  Upper bound for the minimum support; the iteration starts at this value and decreases from it.
    12.    verbose  If set to true, the algorithm runs in verbose mode.
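The three metrics in option 5 follow directly from the joint and marginal probabilities. A standalone sketch (helper names are my own; the formulas are the ones given above):

```java
public class RuleMetrics {
    // Lift: P(A,B) / (P(A)P(B)); 1.0 means A and B are independent.
    public static double lift(double pAB, double pA, double pB) {
        return pAB / (pA * pB);
    }

    // Leverage: P(A,B) - P(A)P(B); 0.0 means independence.
    public static double leverage(double pAB, double pA, double pB) {
        return pAB - pA * pB;
    }

    // Conviction: P(A)P(!B) / P(A,!B), with P(A,!B) = P(A) - P(A,B).
    public static double conviction(double pAB, double pA, double pB) {
        return pA * (1.0 - pB) / (pA - pAB);
    }

    public static void main(String[] args) {
        // independent items: P(A,B) = P(A)P(B) = 0.5 * 0.4
        double pA = 0.5, pB = 0.4, pAB = 0.2;
        System.out.println(lift(pAB, pA, pB));       // 1.0: no association
        System.out.println(leverage(pAB, pA, pB));   // 0.0
        System.out.println(conviction(pAB, pA, pB)); // 1.0
    }
}
```

Note that conviction divides by P(A,!B), which is zero for a rule that always holds, so a full implementation has to treat that case (infinite conviction) specially.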

    2) Console output

    Apriori
    =======
    
    Minimum support: 0.2 (5 instances)
    Minimum metric <confidence>: 0.9
    Number of cycles performed: 16
    
    Generated sets of large itemsets:
    
    Size of set of large itemsets L(1): 11
    
    Size of set of large itemsets L(2): 21
    
    Size of set of large itemsets L(3): 6
    
    Best rules found:
    
     1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12    conf:(1)
     2. spectacle-prescrip=myope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
     3. spectacle-prescrip=hypermetrope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
     4. astigmatism=no tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
     5. astigmatism=yes tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
     6. contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
     7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
     8. tear-prod-rate=normal contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
     9. astigmatism=no contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
    10. contact-lenses=soft 5 ==> astigmatism=no tear-prod-rate=normal 5    conf:(1)

    Please credit the source when reposting: http://www.cnblogs.com/rongyux/

  • Original post: https://www.cnblogs.com/rongyux/p/5384184.html