• Heritrix 3.1.0 Source Code Analysis (21)


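    The abstract class Scoper from the previous article holds a member variable, DecideRule scope. Before continuing with the processor classes (I will return to them later), I need to digress and cover the DecideRule scope object. As mentioned before, the DecideRule scope member controls whether a CrawlURI caUri object falls within the crawl scope.

    As usual, let's first look at the class diagram for DecideRule and its related classes.

    DecideRule is an abstract class that decides whether a CrawlURI caUri object is accepted or rejected.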

    public DecideResult decisionFor(CrawlURI uri) {
            if (!getEnabled()) {
                return DecideResult.NONE;
            }
            DecideResult result = innerDecide(uri);
            if (result == DecideResult.NONE) {
                return result;
            }
    
            return result;
        }
        
        
        protected abstract DecideResult innerDecide(CrawlURI uri);
        
        
        public DecideResult onlyDecision(CrawlURI uri) {
            return null;
        }
    
        public boolean accepts(CrawlURI uri) {
            return DecideResult.ACCEPT == decisionFor(uri);
        }

    The abstract method DecideResult innerDecide(CrawlURI uri) above is implemented by subclasses.

    DecideResult is an enum with three values:
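
    The relationship between decisionFor and innerDecide is a template method: the base class handles the enabled check, and the subclass supplies only the decision. The following is a minimal sketch of that shape. It uses simplified stand-in types (a String instead of CrawlURI, a local Result enum), and HttpsOnlyRule is a hypothetical subclass invented for illustration, not a Heritrix class.

```java
// Simplified stand-ins for illustration only; these are NOT the Heritrix classes.
class DecideRuleSketch {
    enum Result { ACCEPT, NONE, REJECT }

    /** Mirrors the shape of DecideRule: a public entry point plus an abstract hook. */
    static abstract class Rule {
        private boolean enabled = true;

        void setEnabled(boolean enabled) {
            this.enabled = enabled;
        }

        // A disabled rule abstains; otherwise the subclass hook decides.
        final Result decisionFor(String uri) {
            if (!enabled) {
                return Result.NONE;
            }
            return innerDecide(uri);
        }

        boolean accepts(String uri) {
            return decisionFor(uri) == Result.ACCEPT;
        }

        protected abstract Result innerDecide(String uri);
    }

    /** Hypothetical subclass: accepts https URIs and abstains on everything else. */
    static class HttpsOnlyRule extends Rule {
        @Override
        protected Result innerDecide(String uri) {
            return uri.startsWith("https://") ? Result.ACCEPT : Result.NONE;
        }
    }

    public static void main(String[] args) {
        Rule rule = new HttpsOnlyRule();
        System.out.println(rule.decisionFor("https://example.com/")); // ACCEPT
        System.out.println(rule.decisionFor("ftp://example.com/"));   // NONE
        rule.setEnabled(false);
        System.out.println(rule.decisionFor("https://example.com/")); // NONE
    }
}
```

    Note that accepts returns true only for ACCEPT, so a rule that abstains with NONE is treated the same as one that rejects.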

    /**
     * The decision of a DecideRule.
     * 
     * @author pjack
     */
    public enum DecideResult {
    
        /** Indicates the URI was accepted. */
        ACCEPT, 
        
        /** Indicates the URI was neither accepted nor rejected. */
        NONE, 
        
        /** Indicates the URI was rejected. */
        REJECT;
    
        
        public static DecideResult invert(DecideResult result) {
            switch (result) {
                case ACCEPT:
                    return REJECT;
                case REJECT:
                    return ACCEPT;
                default:
                    return result;
            }
        }
    }
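
    To make the behaviour of invert concrete, the sketch below copies the enum into a wrapper class (InvertSketch is a name made up for this example) and prints the result for each value. The point to notice is that NONE is left unchanged: inverting "no opinion" is still "no opinion".

```java
// Stand-alone copy of the enum for illustration; InvertSketch is a hypothetical wrapper.
class InvertSketch {
    enum DecideResult {
        ACCEPT, NONE, REJECT;

        // Swaps ACCEPT and REJECT; NONE passes through untouched.
        static DecideResult invert(DecideResult result) {
            switch (result) {
                case ACCEPT:
                    return REJECT;
                case REJECT:
                    return ACCEPT;
                default:
                    return result;
            }
        }
    }

    public static void main(String[] args) {
        for (DecideResult r : DecideResult.values()) {
            System.out.println(r + " -> " + DecideResult.invert(r));
        }
        // Prints:
        // ACCEPT -> REJECT
        // NONE -> NONE
        // REJECT -> ACCEPT
    }
}
```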

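    Next, let's look at an important subclass, DecideRuleSequence. It holds a collection of DecideRule objects, and its DecideResult innerDecide(CrawlURI uri) method iterates over that collection, calling DecideResult decisionFor(CrawlURI uri) on each element (a combination of the Composite and Iterator patterns). The last rule that returns a result other than NONE determines the final decision.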

    @SuppressWarnings("unchecked")
        public List<DecideRule> getRules() {
            return (List<DecideRule>) kp.get("rules");
        }
        public void setRules(List<DecideRule> rules) {
            kp.put("rules", rules);
        }
    
        public DecideResult innerDecide(CrawlURI uri) {
            DecideRule decisiveRule = null;
            int decisiveRuleNumber = -1;
            DecideResult result = DecideResult.NONE;
            List<DecideRule> rules = getRules();
            int max = rules.size();
            
            for (int i = 0; i < max; i++) {
                DecideRule rule = rules.get(i);
                if (rule.onlyDecision(uri) != result) {
                    DecideResult r = rule.decisionFor(uri);
                    if (LOGGER.isLoggable(Level.FINEST)) {
                        LOGGER.finest("DecideRule #" + i + " " + 
                                rule.getClass().getName() + " returned " + r + " for url: " + uri);
                    }
                    if (r != DecideResult.NONE) {
                        result = r;
                        decisiveRule = rule;
                        decisiveRuleNumber = i;
                    }
                }
            }
    
            if (fileLogger != null) {
                fileLogger.info(decisiveRuleNumber + " " + decisiveRule.getClass().getSimpleName() + " " + result + " " + uri);
            }
    
            return result;
        }
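
    The core of this loop is a "last non-NONE decision wins" fold over the rule list. The sketch below isolates that logic using plain Java functions in place of DecideRule objects. All names here (SequenceSketch, Result, the three sample rules) are assumptions made for illustration, and the onlyDecision shortcut and the logging of the real method are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative stand-in for the sequence logic; not the Heritrix implementation.
class SequenceSketch {
    enum Result { ACCEPT, NONE, REJECT }

    // Each rule maps a URI string to a decision; NONE means "no opinion".
    static Result decide(List<Function<String, Result>> rules, String uri) {
        Result result = Result.NONE;
        for (Function<String, Result> rule : rules) {
            Result r = rule.apply(uri);
            if (r != Result.NONE) {
                result = r; // a later non-NONE decision overrides earlier ones
            }
        }
        return result;
    }

    // Hypothetical three-rule sequence: reject all, accept one host, reject .exe files.
    static List<Function<String, Result>> sampleRules() {
        List<Function<String, Result>> rules = new ArrayList<>();
        rules.add(uri -> Result.REJECT);
        rules.add(uri -> uri.contains("example.com") ? Result.ACCEPT : Result.NONE);
        rules.add(uri -> uri.endsWith(".exe") ? Result.REJECT : Result.NONE);
        return rules;
    }

    public static void main(String[] args) {
        List<Function<String, Result>> rules = sampleRules();
        System.out.println(decide(rules, "http://example.com/a.html"));    // ACCEPT
        System.out.println(decide(rules, "http://example.com/setup.exe")); // REJECT
        System.out.println(decide(rules, "http://other.org/"));            // REJECT
    }
}
```

    Because later rules override earlier ones, the order of the list matters: moving the reject-all rule to the end of this sample would reject every URI.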

    At runtime, the elements of this collection can be seen in the crawler-beans.cxml configuration file:

    <!-- SCOPE: rules for which discovered URIs to crawl; order is very 
          important because last decision returned other than 'NONE' wins. -->
     <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
      <!-- <property name="logToFile" value="false" /> -->
      <property name="rules">
       <list>
        <!-- Begin by REJECTing all... -->
        <bean class="org.archive.modules.deciderules.RejectDecideRule">
        </bean>
        <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->
        <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
         <!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
         <!-- <property name="alsoCheckVia" value="false" /> -->
         <!-- <property name="surtsSourceFile" value="" /> -->
         <!-- <property name="surtsDumpFile" value="${launchId}/surts.dump" /> -->
         <!-- <property name="surtsSource">
               <bean class="org.archive.spring.ConfigString">
                <property name="value">
                 <value>
                  # example.com
                  # http://www.example.edu/path1/
                  # +http://(org,example,
                 </value>
                </property> 
               </bean>
              </property> -->
        </bean>
        <!-- ...but REJECT those more than a configured link-hop-count from start... -->
        <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
         <!-- <property name="maxHops" value="20" /> -->
        </bean>
        <!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
        <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
         <!-- <property name="maxTransHops" value="2" /> -->
         <!-- <property name="maxSpeculativeHops" value="1" /> -->
        </bean>
        <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
        <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
              <property name="decision" value="REJECT"/>
              <property name="seedsAsSurtPrefixes" value="false"/>
              <property name="surtsDumpFile" value="${launchId}/negative-surts.dump" /> 
         <!-- <property name="surtsSource">
               <bean class="org.archive.spring.ConfigFile">
                <property name="path" value="negative-surts.txt" />
               </bean>
              </property> -->
        </bean>
        <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
        <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
              <property name="decision" value="REJECT"/>
         <!-- <property name="listLogicalOr" value="true" /> -->
         <!-- <property name="regexList">
               <list>
               </list>
              </property> -->
        </bean>
        <!-- ...and REJECT those with suspicious repeating path-segments... -->
        <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
         <!-- <property name="maxRepetitions" value="2" /> -->
        </bean>
        <!-- ...and REJECT those with more than threshold number of path-segments... -->
        <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
         <!-- <property name="maxPathDepth" value="20" /> -->
        </bean>
        <!-- ...but always ACCEPT those marked as prerequisite for another URI... -->
        <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
        </bean>
        <!-- ...but always REJECT those with unsupported URI schemes -->
        <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
        </bean>
       </list>
      </property>
     </bean>

    The abstract class PredicatedDecideRule extends DecideRule:

     @Override
        protected DecideResult innerDecide(CrawlURI uri) {
            if (evaluate(uri)) {
                return getDecision();
            }
            return DecideResult.NONE;
        }
    
        protected abstract boolean evaluate(CrawlURI object);

    The boolean evaluate(CrawlURI object) method is implemented by subclasses.

    I won't go through the other implementation classes one by one.
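
    In other words, a predicated rule separates "does this URI match?" from "what to decide when it does". The sketch below shows that split with simplified stand-in types. PredicatedRule, RegexRule and the ACCEPT default used here are assumptions made for this example, not the Heritrix classes themselves.

```java
import java.util.regex.Pattern;

// Simplified stand-ins for illustration only; these are NOT the Heritrix classes.
class PredicatedSketch {
    enum Result { ACCEPT, NONE, REJECT }

    /** Returns the configured decision when the predicate matches, NONE otherwise. */
    static abstract class PredicatedRule {
        private Result decision = Result.ACCEPT; // default chosen for this sketch

        Result getDecision() {
            return decision;
        }

        void setDecision(Result decision) {
            this.decision = decision;
        }

        Result innerDecide(String uri) {
            return evaluate(uri) ? getDecision() : Result.NONE;
        }

        protected abstract boolean evaluate(String uri);
    }

    /** Hypothetical subclass: the predicate is a full-string regex match. */
    static class RegexRule extends PredicatedRule {
        private final Pattern pattern;

        RegexRule(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        @Override
        protected boolean evaluate(String uri) {
            return pattern.matcher(uri).matches();
        }
    }

    public static void main(String[] args) {
        RegexRule rule = new RegexRule(".*\\.(jpg|png)");
        rule.setDecision(Result.REJECT);
        System.out.println(rule.innerDecide("http://example.com/a.png"));      // REJECT
        System.out.println(rule.innerDecide("http://example.com/index.html")); // NONE
    }
}
```

    This is the same pattern the configuration above relies on when it sets a decision property of REJECT on a rule bean: the predicate stays the same and only the outcome on a match changes.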

    ---------------------------------------------------------------------------

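    This Heritrix 3.1.0 source code analysis series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037547.html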

  • Original article: https://www.cnblogs.com/chenying99/p/3037547.html