• Heritrix 3.1.0 Source Code Analysis (Part 20)


    Continuing from the previous article, this post analyzes the processors associated with the CandidateChain candidateChain processor chain.

    The CandidateChain processor chain contains two processors:

    org.archive.crawler.prefetch.CandidateScoper

    org.archive.crawler.prefetch.FrontierPreparer

    To understand the processors above, we first need to look at another abstract class, Scoper, which extends the abstract parent class Processor. It controls whether a CrawlURI caUri object is within the crawl scope and holds a DecideRule scope member variable:

    protected DecideRule scope;

    public DecideRule getScope() {
        return this.scope;
    }

    @Autowired
    public void setScope(DecideRule scope) {
        this.scope = scope;
    }
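
    Because scope is just a DecideRule, any rule or rule sequence can be injected through setScope(). The following custom rule is my own illustration (HttpOnlyDecideRule does not exist in Heritrix) and assumes the Heritrix 3 DecideRule base class exposes a protected innerDecide() hook returning a DecideResult:

    import org.archive.modules.CrawlURI;
    import org.archive.modules.deciderules.DecideResult;
    import org.archive.modules.deciderules.DecideRule;

    // Hypothetical rule: ACCEPT http/https URIs and leave everything else
    // undecided (NONE), so later rules in a sequence may still decide.
    public class HttpOnlyDecideRule extends DecideRule {
        private static final long serialVersionUID = 1L;

        @Override
        protected DecideResult innerDecide(CrawlURI uri) {
            String s = uri.toString();
            if (s.startsWith("http:") || s.startsWith("https:")) {
                return DecideResult.ACCEPT;
            }
            return DecideResult.NONE;
        }
    }

    In the default configuration the scope bean is a DecideRuleSequence combining several stock rules; the point here is only that Scoper delegates the whole decision to whatever DecideRule is wired in.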

    The key methods of this class are shown below (they call the DecideResult decisionFor(CrawlURI uri) method of the DecideRule scope member variable):

    /**
         * Schedule the given {@link CrawlURI CrawlURI} with the Frontier.
         * @param caUri The CrawlURI to be scheduled.
         * @return true if CrawlURI was accepted by crawl scope, false
         * otherwise.
         */
        protected boolean isInScope(CrawlURI caUri) {
            boolean result = false;
            //System.out.println(this.getClass().getName()+":"+"scope name:"+scope.getClass().getName());
            DecideResult dr = scope.decisionFor(caUri);
            if (dr == DecideResult.ACCEPT) {
                result = true;
                if (fileLogger != null) {
                    fileLogger.info("ACCEPT " + caUri); 
                }
            } else {
                outOfScope(caUri);
            }
            return result;
        }
        
        /**
         * Called when a CrawlURI is ruled out of scope.
         * Override if you don't want logs as coming from this class.
         * @param caUri CrawlURI that is out of scope.
         */
        protected void outOfScope(CrawlURI caUri) {
            if (fileLogger != null) {
                fileLogger.info("REJECT " + caUri); 
            }
        }

    Subclasses of Scoper call the methods above to decide whether a CrawlURI caUri object falls outside the crawl scope. Both CandidateScoper and FrontierPreparer are subclasses of Scoper, and so are others such as Preselector.

    The CandidateScoper class is very simple: it overrides the ProcessResult innerProcessResult(CrawlURI curi) method of the Processor class.

    @Override
        protected ProcessResult innerProcessResult(CrawlURI curi) throws InterruptedException {
            if (!isInScope(curi)) {
                // Scope rejected
                curi.setFetchStatus(S_OUT_OF_SCOPE);
                return ProcessResult.FINISH;
            }
            return ProcessResult.PROCEED;
        }

    The expression !isInScope(curi) calls the method of the abstract parent class Scoper to determine whether the current CrawlURI curi object is out of scope.
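
    To see both outcomes in isolation, here is a rough test sketch of my own (not from the Heritrix sources). It assumes the public Processor.process() entry point and the stock AcceptDecideRule/RejectDecideRule classes; a real crawl sets up additional context that this bare-bones snippet does not:

    import org.archive.crawler.prefetch.CandidateScoper;
    import org.archive.modules.CrawlURI;
    import org.archive.modules.ProcessResult;
    import org.archive.modules.deciderules.AcceptDecideRule;
    import org.archive.modules.deciderules.RejectDecideRule;
    import org.archive.net.UURIFactory;

    public class CandidateScoperSketch {
        public static void main(String[] args) throws Exception {
            CrawlURI curi = new CrawlURI(UURIFactory.getInstance("http://example.com/"));

            // Accept-everything scope: the candidate should PROCEED down the chain.
            CandidateScoper scoper = new CandidateScoper();
            scoper.setScope(new AcceptDecideRule());
            System.out.println(scoper.process(curi) == ProcessResult.PROCEED); // expected: true

            // Reject-everything scope: processing should FINISH and the
            // fetch status should be set to S_OUT_OF_SCOPE.
            scoper.setScope(new RejectDecideRule());
            System.out.println(scoper.process(curi) == ProcessResult.FINISH);  // expected: true
            System.out.println(curi.getFetchStatus());
        }
    }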

    The FrontierPreparer class mainly sets related values on the CrawlURI curi object to prepare it for fetching (I did not find this class calling the isInScope()/outOfScope() methods of its abstract parent class Scoper):

     /* (non-Javadoc)
         * @see org.archive.modules.Processor#innerProcess(org.archive.modules.CrawlURI)
         */
        @Override
        protected void innerProcess(CrawlURI curi) {
            prepare(curi);
        }
        
        /**
         * Apply all configured policies to CrawlURI
         * 
         * @param curi CrawlURI
         */
        public void prepare(CrawlURI curi) {
            
            // set schedulingDirective
            curi.setSchedulingDirective(getSchedulingDirective(curi));
                
            // set canonicalized version
            curi.setCanonicalString(canonicalize(curi));
            
            // set queue key
            curi.setClassKey(getClassKey(curi));
            
            // set cost
            curi.setHolderCost(getCost(curi));
            
            // set URI precedence
            getUriPrecedencePolicy().uriScheduled(curi);
    
    
        }

    The void prepare(CrawlURI curi) method above sets the relevant values on the CrawlURI curi object; the methods that compute those values are as follows:

    /**
         * Calculate the coarse, original 'schedulingDirective' prioritization
         * for the given CrawlURI
         * 
         * @param curi
         * @return
         */
        protected int getSchedulingDirective(CrawlURI curi) {
            if(StringUtils.isNotEmpty(curi.getPathFromSeed())) {
                char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length()-1);
                if(lastHop == 'R') {
                    // refer
                    return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
                } 
            }
            if (getPreferenceDepthHops() == 0) {
                return HIGH;
                // this implies seed redirects are treated as path
                // length 1, which I belive is standard.
                // curi.getPathFromSeed() can never be null here, because
                // we're processing a link extracted from curi
            } else if (getPreferenceDepthHops() > 0 && 
                curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
                return HIGH;
            } else {
                // optionally preferencing embeds up to MEDIUM
                int prefHops = getPreferenceEmbedHops(); 
                if (prefHops > 0) {
                    int embedHops = curi.getTransHops();
                    if (embedHops > 0 && embedHops <= prefHops
                            && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                        // number of embed hops falls within the preferenced range, and
                        // uri is not already MEDIUM -- so promote it
                        return MEDIUM;
                    }
                }
                // Everything else stays as previously assigned
                // (probably NORMAL, at least for now)
                return curi.getSchedulingDirective();
            }
        }
        /**
         * Canonicalize passed CrawlURI. This method differs from
         * {@link #canonicalize(UURI)} in that it takes a look at
         * the CrawlURI context possibly overriding any canonicalization effect if
         * it could make us miss content. If canonicalization produces an URL that
         * was 'alreadyseen', but the entry in the 'alreadyseen' database did
         * nothing but redirect to the current URL, we won't get the current URL;
         * we'll think we've already see it. Examples would be archive.org
         * redirecting to www.archive.org or the inverse, www.netarkivet.net
         * redirecting to netarkivet.net (assuming stripWWW rule enabled).
         * <p>Note, this method under circumstance sets the forceFetch flag.
         * 
         * @param cauri CrawlURI to examine.
         * @return Canonicalized <code>cacuri</code>.
         */
        protected String canonicalize(CrawlURI cauri) {
            String canon = getCanonicalizationPolicy().canonicalize(cauri.getURI());
            if (cauri.isLocation()) {
                // If the via is not the same as where we're being redirected (i.e.
                // we're not being redirected back to the same page, AND the
                // canonicalization of the via is equal to the the current cauri, 
                // THEN forcefetch (Forcefetch so no chance of our not crawling
                // content because alreadyseen check things its seen the url before.
                // An example of an URL that redirects to itself is:
                // http://bridalelegance.com/images/buttons3/tuxedos-off.gif.
                // An example of an URL whose canonicalization equals its via's
                // canonicalization, and we want to fetch content at the
                // redirection (i.e. need to set forcefetch), is netarkivet.dk.
                if (!cauri.toString().equals(cauri.getVia().toString()) &&
                        getCanonicalizationPolicy().canonicalize(
                                cauri.getVia().toCustomString()).equals(canon)) {
                    cauri.setForceFetch(true);
                }
            }
            return canon;
        }
        
        /**
         * @param cauri CrawlURI we're to get a key for.
         * @return a String token representing a queue
         */
        public String getClassKey(CrawlURI curi) {
            assert KeyedProperties.overridesActiveFrom(curi);      
            String queueKey = getQueueAssignmentPolicy().getClassKey(curi);
            return queueKey;
        }
        
        /**
         * Return the 'cost' of a CrawlURI (how much of its associated
         * queue's budget it depletes upon attempted processing)
         * 
         * @param curi
         * @return the associated cost
         */
        protected int getCost(CrawlURI curi) {
            assert KeyedProperties.overridesActiveFrom(curi);
            
            int cost = curi.getHolderCost();
            if (cost == CrawlURI.UNCALCULATED) {
                cost = getCostAssignmentPolicy().costOf(curi);
            }
            return cost;
        }

    These methods rely on the corresponding policy classes; that is a fairly large topic, so I will leave it for later articles.
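
    As a small preview of what one of those policy classes looks like, here is a hedged sketch of a custom cost policy. It is my own illustration (QueryPenaltyCostAssignmentPolicy is not part of Heritrix) and assumes the abstract org.archive.crawler.frontier.CostAssignmentPolicy exposes a single costOf(CrawlURI) method, as the getCost() code above suggests:

    import org.archive.crawler.frontier.CostAssignmentPolicy;
    import org.archive.modules.CrawlURI;

    // Hypothetical policy: URIs carrying a query string deplete two units of
    // their queue's budget, everything else costs one unit (like the stock
    // unit-cost behavior).
    public class QueryPenaltyCostAssignmentPolicy extends CostAssignmentPolicy {
        private static final long serialVersionUID = 1L;

        @Override
        public int costOf(CrawlURI curi) {
            return curi.toString().contains("?") ? 2 : 1;
        }
    }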

    The Preselector class filters CrawlURI curi objects according to configurable regular expressions (and can optionally block all URIs or recheck scope):

    @Override
        protected ProcessResult innerProcessResult(CrawlURI puri) {
            CrawlURI curi = (CrawlURI)puri;
            
            // Check if uris should be blocked
            if (getBlockAll()) {
                curi.setFetchStatus(S_BLOCKED_BY_USER);
                return ProcessResult.FINISH;
            }
    
            // Check if allowed by regular expression
            String regex = getAllowByRegex();
            if (regex != null && !regex.equals("")) {
                if (!TextUtils.matches(regex, curi.toString())) {
                    curi.setFetchStatus(S_BLOCKED_BY_USER);
                    return ProcessResult.FINISH;
                }
            }
    
            // Check if blocked by regular expression
            regex = getBlockByRegex();
            if (regex != null && !regex.equals("")) {
                if (TextUtils.matches(regex, curi.toString())) {
                    curi.setFetchStatus(S_BLOCKED_BY_USER);
                    return ProcessResult.FINISH;
                }
            }
    
            // Possibly recheck scope
            if (getRecheckScope()) {
                if (!isInScope(curi)) {
                    // Scope rejected
                    curi.setFetchStatus(S_OUT_OF_SCOPE);
                    return ProcessResult.FINISH;
                }
            }
            
            return ProcessResult.PROCEED;
        }

    The corresponding configuration example in the crawler-beans.cxml configuration file is as follows:

     <!-- FETCH CHAIN --> 
     <!-- first, processors are declared as top-level named beans -->
     <bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
          <!-- <property name="recheckScope" value="false" />-->
         <!--  <property name="blockAll" value="false" />-->
         <!--  <property name="blockByRegex" value="" />-->
         <!--  <property name="allowByRegex" value="" />-->
     </bean>
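
    Note that allowByRegex and blockByRegex are matched against the full URI string; assuming TextUtils.matches() behaves like java.util.regex full matching (Matcher.matches()), a partial match is not enough. A quick illustration with plain java.util.regex (the pattern is just an example value, not a Heritrix default):

    import java.util.regex.Pattern;

    public class RegexDemo {
        public static void main(String[] args) {
            // Example allowByRegex value: accept only URIs under example.com
            String allow = "https?://(www\\.)?example\\.com/.*";
            System.out.println(Pattern.matches(allow, "http://www.example.com/index.html")); // true
            System.out.println(Pattern.matches(allow, "http://other.org/page"));             // false
        }
    }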

    ---------------------------------------------------------------------------

    This Heritrix 3.1.0 source code analysis series is my own original work.

    Please credit the source when reposting: 刺猬的温驯 on 博客园 (cnblogs).

    Original link: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037360.html
