• Heritrix 3.1.0 Source Code Analysis (Part 19)


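    This article continues the analysis of the processor-related source code in Heritrix 3.1.0.

    As usual, let's first look at the class UML diagram.

    All processors extend the abstract base class Processor. Its most important methods are shown below.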

    /**
         * Processes the given URI.  First checks {@link #ENABLED} and
         * {@link #DECIDE_RULES}.  If ENABLED is false, then nothing happens.
         * If the DECIDE_RULES indicate REJECT, then the 
         * {@link #innerRejectProcess(ProcessorURI)} method is invoked, and
         * the process method returns.
         * 
         * <p>Next, the {@link #shouldProcess(ProcessorURI)} method is 
         * consulted to see if this Processor knows how to handle the given
         * URI.  If it returns false, then nothing futher occurs.
         * 
         * <p>FIXME: Should innerRejectProcess be called when ENABLED is false,
         * or when shouldProcess returns false?  The previous Processor 
         * implementation didn't handle it that way.
         * 
         * <p>Otherwise, the URI is considered valid.  This processor's count
         * of handled URIs is incremented, and the 
         * {@link #innerProcess(ProcessorURI)} method is invoked to actually
         * perform the process.
         * 
         * @param uri  The URI to process
         * @throws  InterruptedException   if the thread is interrupted
         */
        public ProcessResult process(CrawlURI uri) 
        throws InterruptedException {
            if (!getEnabled()) {
                return ProcessResult.PROCEED;
            }
            
            if (getShouldProcessRule().decisionFor(uri) == DecideResult.REJECT) {
                innerRejectProcess(uri);
                return ProcessResult.PROCEED;
            }
            
            if (shouldProcess(uri)) {
                uriCount.incrementAndGet();
                return innerProcessResult(uri);
            } else {
                return ProcessResult.PROCEED;
            }
        }

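    After the enabled check and the decide-rule check, process() asks whether this processor should handle the URI. shouldProcess(CrawlURI uri) is an abstract method implemented by subclasses: each concrete processor decides whether the current CrawlURI needs to pass through it.

    If it does, process() goes on to call ProcessResult innerProcessResult(CrawlURI uri), which some subclasses override.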

    protected ProcessResult innerProcessResult(CrawlURI uri)
            throws InterruptedException {
        innerProcess(uri);
        return ProcessResult.PROCEED;
    }

    This in turn calls void innerProcess(CrawlURI uri), an abstract method implemented by subclasses.

    /**
         * Actually performs the process.  By the time this method is invoked,
         * it is known that the given URI passes the {@link #ENABLED}, the 
         * {@link #DECIDE_RULES} and the {@link #shouldProcess(ProcessorURI)}
         * tests.  
         * 
         * @param uri    the URI to process
         * @throws InterruptedException   if the thread is interrupted
         */
        protected abstract void innerProcess(CrawlURI uri) 
        throws InterruptedException;
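
    The call sequence above is a template method: process() performs the gating checks, and a subclass only supplies shouldProcess and innerProcess. The sketch below models that flow with simplified stand-in types. ProcessorSketch, SimpleProcessor and HttpOnlyProcessor are hypothetical names, a plain String stands in for CrawlURI, and the decide-rule check is left out, so this is an illustration of the pattern rather than Heritrix code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified, hypothetical model of the Processor template method.
// None of these types are Heritrix classes.
final class ProcessorSketch {

    enum Result { PROCEED, FINISH }

    abstract static class SimpleProcessor {
        private boolean enabled = true;
        private final AtomicLong uriCount = new AtomicLong();

        void setEnabled(boolean enabled) { this.enabled = enabled; }
        long getUriCount() { return uriCount.get(); }

        // Gatekeeper: a disabled processor or an unsuitable URI is skipped
        // silently and the chain simply proceeds.
        final Result process(String uri) {
            if (!enabled) {
                return Result.PROCEED;
            }
            if (!shouldProcess(uri)) {
                return Result.PROCEED;
            }
            uriCount.incrementAndGet();
            return innerProcessResult(uri);
        }

        // Default behaviour: do the work, then let the chain continue.
        // A subclass may override this to return a different result.
        protected Result innerProcessResult(String uri) {
            innerProcess(uri);
            return Result.PROCEED;
        }

        protected abstract boolean shouldProcess(String uri);

        protected abstract void innerProcess(String uri);
    }

    // Example subclass: only handles http:// URIs and records what it saw.
    static final class HttpOnlyProcessor extends SimpleProcessor {
        final StringBuilder seen = new StringBuilder();

        @Override
        protected boolean shouldProcess(String uri) {
            return uri.startsWith("http://");
        }

        @Override
        protected void innerProcess(String uri) {
            seen.append(uri).append('\n');
        }
    }

    public static void main(String[] args) {
        HttpOnlyProcessor p = new HttpOnlyProcessor();
        p.process("http://example.org/"); // handled
        p.process("dns:example.org");     // skipped by shouldProcess
        System.out.println(p.getUriCount()); // prints 1
    }
}
```

    Only URIs that pass every check are counted, which matches the position of uriCount.incrementAndGet() in the real process() method.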

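    The subclasses of Processor fall logically into several categories. At runtime they belong to different processor chains, and each category also has its own place in the class inheritance hierarchy.

    In this article and the ones that follow I can only pick some of the processors to analyze.

    CandidatesProcessor: this processor holds a CandidateChain candidateChain member and invokes the processors of that chain.

    The candidate CrawlURI objects that pass through this processor are finally handed to BdbFrontier's schedule(CrawlURI cURI) method, which adds them to the BDB database.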

     /**
         * Candidate chain
         */
        protected CandidateChain candidateChain;
        public CandidateChain getCandidateChain() {
            return this.candidateChain;
        }
        @Autowired
        public void setCandidateChain(CandidateChain candidateChain) {
            this.candidateChain = candidateChain;
        }
        
        /**
         * The frontier to use.
         */
        protected Frontier frontier;
        public Frontier getFrontier() {
            return this.frontier;
        }
        @Autowired
        public void setFrontier(Frontier frontier) {
            this.frontier = frontier;
        }

    The processing method that is actually invoked is as follows.

    /* (non-Javadoc)
         * @see org.archive.modules.Processor#innerProcess(org.archive.modules.CrawlURI)
         */
        @Override
        protected void innerProcess(final CrawlURI curi) throws InterruptedException {
            // Handle any prerequisites when S_DEFERRED for prereqs
            if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
                CrawlURI prereq = curi.getPrerequisiteUri();
                prereq.setFullVia(curi); 
                sheetOverlaysManager.applyOverlaysTo(prereq);
                try {
                    KeyedProperties.clearOverridesFrom(curi); 
                    KeyedProperties.loadOverridesFrom(prereq);
    
                    getCandidateChain().process(prereq, null);
                    
                    if(prereq.getFetchStatus()>=0) {
                        
                        frontier.schedule(prereq);
                    } else {
                        curi.setFetchStatus(S_PREREQUISITE_UNSCHEDULABLE_FAILURE);
                    }
                } finally {
                    KeyedProperties.clearOverridesFrom(prereq); 
                    KeyedProperties.loadOverridesFrom(curi);
                }
                return;
            }
    
            // Don't consider candidate links of error pages
            if (curi.getFetchStatus() < 200 || curi.getFetchStatus() >= 400) {
                curi.getOutLinks().clear();
                return;
            }
    
            for (Link wref: curi.getOutLinks()) {
                CrawlURI candidate;
                try {
                    candidate = curi.createCrawlURI(curi.getBaseURI(),wref);
                    // at least for duration of candidatechain, offer
                    // access to full CrawlURI of via
                    candidate.setFullVia(curi); 
                } catch (URIException e) {
                    loggerModule.logUriError(e, curi.getUURI(), 
                            wref.getDestination().toString());
                    continue;
                }
                sheetOverlaysManager.applyOverlaysTo(candidate);
                try {
                    KeyedProperties.clearOverridesFrom(curi); 
                    KeyedProperties.loadOverridesFrom(candidate);
                    
                    if(getSeedsRedirectNewSeeds() && curi.isSeed() 
                            && wref.getHopType() == Hop.REFER
                            && candidate.getHopCount() < SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS) {
                        candidate.setSeed(true);                     
                    }
                    getCandidateChain().process(candidate, null); 
                    if(candidate.getFetchStatus()>=0) {
                        if(checkForSeedPromotion(candidate)) {
                            /*
                             * We want to guarantee crawling of seed version of
                             * CrawlURI even if same url has already been enqueued,
                             * see https://webarchive.jira.com/browse/HER-1891
                             */
                            candidate.setForceFetch(true);                        
                            getSeeds().addSeed(candidate);
                        } else {                        
                            frontier.schedule(candidate);
                        }
                        curi.getOutCandidates().add(candidate);
                    }
                    
                } finally {
                    KeyedProperties.clearOverridesFrom(candidate); 
                    KeyedProperties.loadOverridesFrom(curi);
                }
            }
            curi.getOutLinks().clear();
        }
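
    Leaving aside prerequisites, seed promotion and the settings overlays, the loop above reduces to three rules: outlinks of error pages are discarded, each remaining outlink is run through the candidate chain, and only candidates whose fetch status is still non-negative are scheduled. The sketch below models just that decision. CandidateFlowSketch, Candidate and the REJECTED marker are hypothetical, a Predicate stands in for the whole candidate chain, and a returned list stands in for the frontier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, simplified model of the outlink handling in
// CandidatesProcessor.innerProcess. Not Heritrix code.
final class CandidateFlowSketch {

    // Arbitrary negative marker used by this sketch only;
    // it is not a Heritrix status constant.
    static final int REJECTED = -1;

    static final class Candidate {
        final String uri;
        int fetchStatus = 0; // 0 = not yet attempted
        Candidate(String uri) { this.uri = uri; }
    }

    // Returns the URIs that would be handed to the frontier.
    static List<String> scheduleOutlinks(int parentStatus,
                                         List<String> outlinks,
                                         Predicate<String> inScope) {
        List<String> scheduled = new ArrayList<>();

        // Don't consider candidate links of error pages.
        if (parentStatus < 200 || parentStatus >= 400) {
            return scheduled;
        }

        for (String link : outlinks) {
            Candidate candidate = new Candidate(link);

            // Stand-in for getCandidateChain().process(candidate, null):
            // a rejected candidate ends up with a negative status.
            if (!inScope.test(candidate.uri)) {
                candidate.fetchStatus = REJECTED;
            }

            // Same test as in the original: only non-negative statuses
            // reach the frontier.
            if (candidate.fetchStatus >= 0) {
                scheduled.add(candidate.uri);
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        List<String> links =
                Arrays.asList("http://example.org/a", "http://other.test/b");
        Predicate<String> scope = u -> u.contains("example.org");

        System.out.println(scheduleOutlinks(200, links, scope)); // prints [http://example.org/a]
        System.out.println(scheduleOutlinks(404, links, scope)); // prints []
    }
}
```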

    Looking at the crawl job configuration file crawler-beans.cxml, the processors that make up the CandidateChain candidateChain are declared as follows.

    <!-- CANDIDATE CHAIN --> 
     <!-- first, processors are declared as top-level named beans -->
     <bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
     </bean>
     <bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer">
      <!-- <property name="preferenceDepthHops" value="-1" /> -->
      <!-- <property name="preferenceEmbedHops" value="1" /> -->
      <!-- <property name="canonicalizationPolicy"> 
            <ref bean="canonicalizationPolicy" />
           </property> -->
      <!-- <property name="queueAssignmentPolicy"> 
            <ref bean="queueAssignmentPolicy" />
           </property> -->
      <!-- <property name="uriPrecedencePolicy"> 
            <ref bean="uriPrecedencePolicy" />
           </property> -->
      <!-- <property name="costAssignmentPolicy"> 
            <ref bean="costAssignmentPolicy" />
           </property> -->
     </bean>
     <!-- now, processors are assembled into ordered CandidateChain bean -->
     <bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
      <property name="processors">
       <list>
        <!-- apply scoping rules to each individual candidate URI... -->
        <ref bean="candidateScoper"/>
        <!-- ...then prepare those ACCEPTed to be enqueued to frontier. -->
        <ref bean="preparer"/>
       </list>
      </property>
     </bean>
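
    The order of the list matters: the comments in the configuration say that scoping is applied first and only accepted URIs are then prepared for the frontier. The sketch below shows one way such an ordered chain can behave, assuming that a processor may end the chain early by returning a FINISH-like result. ChainSketch, Step, Scoper and Preparer are hypothetical names, and the scope rule is a made-up substring test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of an ordered processor chain with early exit.
// Not Heritrix code.
final class ChainSketch {

    enum Result { PROCEED, FINISH }

    interface Step {
        String name();
        Result apply(String uri);
    }

    // Stand-in for the scoping step: rejects anything outside example.org.
    static final class Scoper implements Step {
        @Override public String name() { return "candidateScoper"; }
        @Override public Result apply(String uri) {
            return uri.contains("example.org") ? Result.PROCEED : Result.FINISH;
        }
    }

    // Stand-in for the preparing step: always lets the chain continue.
    static final class Preparer implements Step {
        @Override public String name() { return "preparer"; }
        @Override public Result apply(String uri) {
            return Result.PROCEED;
        }
    }

    static final class Chain {
        private final List<Step> steps;

        Chain(List<Step> steps) { this.steps = steps; }

        // Runs the steps in declaration order and returns the names of
        // the steps that were actually invoked.
        List<String> process(String uri) {
            List<String> ran = new ArrayList<>();
            for (Step step : steps) {
                ran.add(step.name());
                if (step.apply(uri) == Result.FINISH) {
                    break; // later steps never see this URI
                }
            }
            return ran;
        }
    }

    public static void main(String[] args) {
        Chain chain = new Chain(Arrays.asList(new Scoper(), new Preparer()));
        System.out.println(chain.process("http://example.org/a")); // prints [candidateScoper, preparer]
        System.out.println(chain.process("http://other.test/b"));  // prints [candidateScoper]
    }
}
```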

    ---------------------------------------------------------------------------

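    This Heritrix 3.1.0 source code analysis series is my own original work.

    Please credit the source when reposting: cnblogs (博客园), 刺猬的温驯

    Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/23/3036954.html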

  • Original article: https://www.cnblogs.com/chenying99/p/3036954.html