The first processor entered by a CrawlURI taken from the BdbFrontier object's next method (which draws from the BdbWorkQueue identified by some classKey) is the Preselector processor. It filters the CrawlURI against the regular expressions configured in the configuration file; only CrawlURI objects that pass this filter move on to the next processor. Preselector extends the Scoper class (which I analyzed in an earlier article, so I won't repeat that here). The class is fairly simple, so I only show the relevant processing method:
@Override
protected ProcessResult innerProcessResult(CrawlURI puri) {
    CrawlURI curi = (CrawlURI)puri;

    // Check if uris should be blocked
    if (getBlockAll()) {
        curi.setFetchStatus(S_BLOCKED_BY_USER);
        return ProcessResult.FINISH;
    }

    // Check if allowed by regular expression
    String regex = getAllowByRegex();
    if (regex != null && !regex.equals("")) {
        if (!TextUtils.matches(regex, curi.toString())) {
            curi.setFetchStatus(S_BLOCKED_BY_USER);
            return ProcessResult.FINISH;
        }
    }

    // Check if blocked by regular expression
    regex = getBlockByRegex();
    if (regex != null && !regex.equals("")) {
        if (TextUtils.matches(regex, curi.toString())) {
            curi.setFetchStatus(S_BLOCKED_BY_USER);
            return ProcessResult.FINISH;
        }
    }

    // Possibly recheck scope
    if (getRecheckScope()) {
        if (!isInScope(curi)) {
            // Scope rejected
            curi.setFetchStatus(S_OUT_OF_SCOPE);
            return ProcessResult.FINISH;
        }
    }
    return ProcessResult.PROCEED;
}
The last step of the method above decides whether to recheck the crawl scope (by calling the parent Scoper class's boolean isInScope(CrawlURI caUri) method); this option is false by default.
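The allow/block decision order in the method above can be illustrated with a small standalone sketch. Note that this is not Heritrix code: it uses java.util.regex.Pattern directly in place of Heritrix's TextUtils (both perform a full match of the whole string), and shouldBlock is a hypothetical helper name.

```java
import java.util.regex.Pattern;

public class PreselectorSketch {
    // Mimics Preselector's decision order: a non-empty allow regex must
    // match the full URI string, and a non-empty block regex must not.
    public static boolean shouldBlock(String uri, String allowRegex, String blockRegex) {
        if (allowRegex != null && !allowRegex.isEmpty()
                && !Pattern.matches(allowRegex, uri)) {
            return true; // not allowed -> S_BLOCKED_BY_USER
        }
        if (blockRegex != null && !blockRegex.isEmpty()
                && Pattern.matches(blockRegex, uri)) {
            return true; // explicitly blocked -> S_BLOCKED_BY_USER
        }
        return false; // PROCEED
    }

    public static void main(String[] args) {
        System.out.println(shouldBlock("http://example.com/page",
                ".*example\\.com.*", ".*\\.pdf$"));
        System.out.println(shouldBlock("http://example.com/file.pdf",
                ".*example\\.com.*", ".*\\.pdf$"));
    }
}
```

Because the allow check runs first, a URI excluded by allowByRegex is blocked even if blockByRegex would not have matched it.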
What needs to be understood here is that this processor's regex filtering happens after the CrawlURI has already been added to a BdbWorkQueue, just before actual fetching. This is different from the CandidatesProcessor, which filters CrawlURI objects before they enter a BdbWorkQueue. If you want to configure filtering rules for CrawlURI objects, I recommend doing so in the modules related to the CandidatesProcessor.
The corresponding regular expressions can be set as properties of the Preselector bean in the crawler-beans.cxml configuration file:
<!-- first, processors are declared as top-level named beans -->
<bean id="preselector" class="org.archive.crawler.prefetch.MyPreselector">
 <!-- <property name="recheckScope" value="false" /> -->
 <!-- <property name="blockAll" value="false" /> -->
 <!-- <property name="blockByRegex" value="" /> -->
 <!-- <property name="allowByRegex" value="" /> -->
</bean>
A CrawlURI that passes the Preselector processor next enters the PreconditionEnforcer processor, which can be called the precondition processor: it mainly handles DNS resolution, robots.txt verification, and authentication. Its relevant processing method is as follows:
@Override
protected ProcessResult innerProcessResult(CrawlURI puri) {
    CrawlURI curi = (CrawlURI)puri;

    // DNS resolution precondition
    if (considerDnsPreconditions(curi)) {
        return ProcessResult.FINISH;
    }

    // make sure we only process schemes we understand (i.e. not dns),
    // i.e. skip further checks when the scheme is neither http nor https
    String scheme = curi.getUURI().getScheme().toLowerCase();
    if (! (scheme.equals("http") || scheme.equals("https"))) {
        logger.fine("PolitenessEnforcer doesn't understand uri's of type "
            + scheme + " (ignoring)");
        return ProcessResult.PROCEED;
    }

    // robots.txt precondition
    if (considerRobotsPreconditions(curi)) {
        return ProcessResult.FINISH;
    }

    // authentication precondition
    if (!curi.isPrerequisite() && credentialPrecondition(curi)) {
        return ProcessResult.FINISH;
    }

    // OK, it's allowed
    // For all curis that will in fact be fetched, set appropriate delays.
    // TODO: SOMEDAY: allow per-host, per-protocol, etc. factors
    // curi.setDelayFactor(getDelayFactorFor(curi));
    // curi.setMinimumDelay(getMinimumDelayFor(curi));
    return ProcessResult.PROCEED;
}
If a precondition exists, it is set on the current CrawlURI and the current processor chain (the FetchChain) is exited.
Let's first analyze the first precondition, DNS resolution, in the boolean considerDnsPreconditions(CrawlURI curi) method:
/**
 * @param curi CrawlURI whose dns prerequisite we're to check.
 * @return true if no further processing in this module should occur
 */
protected boolean considerDnsPreconditions(CrawlURI curi) {
    if (curi.getUURI().getScheme().equals("dns")) {
        // DNS URIs never have a DNS precondition:
        // a dns: URI is itself a prerequisite
        curi.setPrerequisite(true);
        return false;
    } else if (curi.getUURI().getScheme().equals("whois")) {
        return false;
    }

    // serverCache: org.archive.modules.net.BdbServerCache
    CrawlServer cs = serverCache.getServerFor(curi.getUURI());
    if (cs == null) {
        curi.setFetchStatus(S_UNFETCHABLE_URI);
        // curi.skipToPostProcessing();
        return true;
    }

    // If we've done a dns lookup and it didn't resolve a host
    // cancel further fetch-processing of this URI, because
    // the domain is unresolvable
    CrawlHost ch = serverCache.getHostFor(curi.getUURI());
    if (ch == null || ch.hasBeenLookedUp() && ch.getIP() == null) {
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("no dns for " + ch
                + " cancelling processing for CrawlURI " + curi.toString());
        }
        curi.setFetchStatus(S_DOMAIN_PREREQUISITE_FAILURE);
        // curi.skipToPostProcessing();
        return true;
    }

    // If we haven't done a dns lookup and this isn't a dns uri
    // shoot that off and defer further processing:
    // the cached IP lookup has expired and the scheme itself is not dns
    if (isIpExpired(curi) && !curi.getUURI().getScheme().equals("dns")) {
        logger.fine("Deferring processing of CrawlURI " + curi.toString()
            + " for dns lookup.");
        String preq = "dns:" + ch.getHostName();
        try {
            // prerequisite: DNS resolution
            curi.markPrerequisite(preq);
        } catch (URIException e) {
            throw new RuntimeException(e); // shouldn't ever happen
        }
        return true;
    }
    // DNS preconditions OK
    return false;
}
The boolean isIpExpired(CrawlURI curi) method determines whether the IP of the CrawlHost associated with the current CrawlURI still needs to be looked up, i.e. whether a previous lookup is missing or has expired:
/**
 * Return true if ip should be looked up.
 *
 * @param curi the URI to check.
 * @return true if ip should be looked up.
 */
public boolean isIpExpired(CrawlURI curi) {
    CrawlHost host = serverCache.getHostFor(curi.getUURI());
    if (!host.hasBeenLookedUp()) {
        // IP has not been looked up yet.
        return true;
    }
    if (host.getIpTTL() == CrawlHost.IP_NEVER_EXPIRES) {
        // IP never expires (numeric IP)
        return false;
    }
    long duration = getIpValidityDurationSeconds();
    if (duration == 0) {
        // Never expire ip if duration is null (set by user or more likely,
        // set to zero in case where we tried in FetchDNS but failed).
        return false;
    }
    long ttl = host.getIpTTL();
    if (ttl > duration) {
        // Use the larger of the operator-set minimum duration
        // or the DNS record TTL
        duration = ttl;
    }
    // Duration and ttl are in seconds. Convert to millis.
    if (duration > 0) {
        duration *= 1000;
    }
    return (duration + host.getIpFetched()) < System.currentTimeMillis();
}
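The expiry arithmetic above reduces to a few lines: take the larger of the DNS TTL and the operator-set minimum, convert to milliseconds, and compare against the lookup timestamp. The following standalone sketch shows just that arithmetic; the class and method names are hypothetical, and the -1 sentinel is my assumption mirroring CrawlHost.IP_NEVER_EXPIRES.

```java
public class IpExpirySketch {
    // sentinel for "numeric IP, never re-resolve" (assumed value)
    static final long IP_NEVER_EXPIRES = -1;

    // ttlSeconds: TTL from the DNS record; minDurationSeconds: operator-set
    // minimum validity; ipFetchedMillis: when the lookup was performed.
    public static boolean isExpired(long ttlSeconds, long minDurationSeconds,
                                    long ipFetchedMillis, long nowMillis) {
        if (ttlSeconds == IP_NEVER_EXPIRES) {
            return false; // numeric IP, never expires
        }
        if (minDurationSeconds == 0) {
            return false; // duration 0 means "never expire"
        }
        // use the larger of the operator minimum and the DNS TTL
        long duration = Math.max(minDurationSeconds, ttlSeconds);
        // seconds -> millis, then compare with the lookup time
        return duration * 1000 + ipFetchedMillis < nowMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // looked up two hours ago with a one-hour TTL and minimum
        System.out.println(isExpired(3600, 3600, now - 2 * 3600 * 1000, now));
    }
}
```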
If the IP lookup is missing or expired, the current CrawlURI's prerequisite is set to dns:host. The CrawlURI's markPrerequisite(String preq) method is as follows:
/**
 * Do all actions associated with setting a <code>CrawlURI</code> as
 * requiring a prerequisite.
 *
 * @param preq Object to set a prerequisite.
 * @return the newly created prerequisite CrawlURI
 * @throws URIException
 */
public CrawlURI markPrerequisite(String preq) throws URIException {
    UURI src = getUURI();
    UURI dest = UURIFactory.getInstance(preq);
    LinkContext lc = LinkContext.PREREQ_MISC;
    Hop hop = Hop.PREREQ;
    Link link = new Link(src, dest, lc, hop);
    CrawlURI caUri = createCrawlURI(getBaseURI(), link);
    // TODO: consider moving some of this to candidate-handling
    int prereqPriority = getSchedulingDirective() - 1;
    if (prereqPriority < 0) {
        prereqPriority = 0;
        logger.severe("Unable to promote prerequisite " + caUri +
            " above " + this);
    }
    caUri.setSchedulingDirective(prereqPriority);
    caUri.setForceFetch(true);
    setPrerequisiteUri(caUri);
    incrementDeferrals();
    setFetchStatus(S_DEFERRED);
    return caUri;
}
The method above first creates the prerequisite CrawlURI caUri, gives it a scheduling level one higher than the current URI's, and finally stores it under the key String A_PREREQUISITE_URI = "prerequisite-uri" in the current CrawlURI's Map&lt;String,Object&gt; data member:
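The "one level higher" promotion works because in Heritrix a smaller SchedulingConstants value means higher priority (HIGHEST is 0): markPrerequisite subtracts 1 and clamps at 0. A minimal sketch of that clamping, where promote is a hypothetical helper name:

```java
public class PrereqPrioritySketch {
    // Heritrix SchedulingConstants: HIGHEST=0 ... NORMAL=3;
    // lower value = scheduled sooner
    public static int promote(int schedulingDirective) {
        int prereqPriority = schedulingDirective - 1;
        if (prereqPriority < 0) {
            prereqPriority = 0; // already HIGHEST; cannot promote further
        }
        return prereqPriority;
    }

    public static void main(String[] args) {
        System.out.println(promote(3)); // a NORMAL uri yields a MEDIUM prereq
    }
}
```

This guarantees the prerequisite (e.g. the dns: URI) is fetched before the URI that depends on it, since it always sorts at least one level ahead in the queue.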
/**
 * Set a prerequisite for this URI.
 * <p>
 * A prerequisite is a URI that must be crawled before this URI can be
 * crawled.
 *
 * @param pre CrawlURI to set as prereq.
 */
public void setPrerequisiteUri(CrawlURI pre) {
    getData().put(A_PREREQUISITE_URI, pre);
}
After exiting the current processor chain (the FetchChain), the URI enters the next processor chain (the DispositionChain), where the CandidatesProcessor schedules the prerequisite into a BdbWorkQueue. The relevant code is:
@Override
protected void innerProcess(final CrawlURI curi) throws InterruptedException {
    // Handle any prerequisites when S_DEFERRED for prereqs
    if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
        CrawlURI prereq = curi.getPrerequisiteUri();
        prereq.setFullVia(curi);
        sheetOverlaysManager.applyOverlaysTo(prereq);
        try {
            KeyedProperties.clearOverridesFrom(curi);
            KeyedProperties.loadOverridesFrom(prereq);
            getCandidateChain().process(prereq, null);
            if (prereq.getFetchStatus() >= 0) {
                // schedule into a BdbWorkQueue
                frontier.schedule(prereq);
            } else {
                curi.setFetchStatus(S_PREREQUISITE_UNSCHEDULABLE_FAILURE);
            }
        } finally {
            KeyedProperties.clearOverridesFrom(prereq);
            KeyedProperties.loadOverridesFrom(curi);
        }
        return;
    }
    // remainder of the method omitted
}
Back in the PreconditionEnforcer processor, the second precondition is robots.txt verification; the processing logic is similar to DNS verification. The boolean considerRobotsPreconditions(CrawlURI curi) method is as follows:
/**
 * Consider the robots precondition.
 *
 * @param curi CrawlURI we're checking for any required preconditions.
 * @return True, if this <code>curi</code> has a precondition or processing
 *         should be terminated for some other reason. False if
 *         we can proceed to process this url.
 */
protected boolean considerRobotsPreconditions(CrawlURI curi) {
    // author's modification: skip robots verification entirely
    // (the original logic below is never reached)
    if (true) return false;

    // treat /robots.txt fetches specially
    UURI uuri = curi.getUURI();
    try {
        if (uuri != null && uuri.getPath() != null &&
                curi.getUURI().getPath().equals("/robots.txt")) {
            // allow processing to continue; /robots.txt is itself a prerequisite
            curi.setPrerequisite(true);
            return false;
        }
    } catch (URIException e) {
        logger.severe("Failed get of path for " + curi);
    }

    CrawlServer cs = serverCache.getServerFor(curi.getUURI());
    // require /robots.txt if not present:
    // check whether the cached robots have expired
    if (cs.isRobotsExpired(getRobotsValidityDurationSeconds())) {
        // Need to get robots
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("No valid robots for " + cs + "; deferring " + curi);
        }
        // Robots expired - should be refetched even though its already
        // crawled.
        try {
            String prereq = curi.getUURI().resolve("/robots.txt").toString();
            // set the prerequisite
            curi.markPrerequisite(prereq);
        } catch (URIException e1) {
            logger.severe("Failed resolve using " + curi);
            throw new RuntimeException(e1); // shouldn't ever happen
        }
        return true;
    }

    // test against robots.txt if available
    if (cs.isValidRobots()) {
        String ua = metadata.getUserAgent();
        RobotsPolicy robots = metadata.getRobotsPolicy();
        if (!robots.allows(ua, curi, cs.getRobotstxt())) {
            if (getCalculateRobotsOnly()) {
                // annotate URI as excluded, but continue to process normally
                curi.getAnnotations().add("robotExcluded");
                return false;
            }
            // mark as precluded; in FetchHTTP, this will
            // prevent fetching and cause a skip to the end
            // of processing (unless an intervening processor
            // overrules)
            curi.setFetchStatus(S_ROBOTS_PRECLUDED);
            curi.setError("robots.txt exclusion");
            logger.fine("robots.txt precluded " + curi);
            return true;
        }
        return false;
    }

    // No valid robots found => Attempt to get robots.txt failed
    // curi.skipToPostProcessing();
    curi.setFetchStatus(S_ROBOTS_PREREQUISITE_FAILURE);
    curi.setError("robots.txt prerequisite failed");
    if (logger.isLoggable(Level.FINE)) {
        logger.fine("robots.txt prerequisite failed " + curi);
    }
    return true;
}
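The robots.allows(...) call above delegates to Heritrix's RobotsPolicy; conceptually it checks the URI's path against the Disallow rules that apply to the crawler's user-agent. The following greatly simplified sketch shows only that prefix check (it is not Heritrix's actual parser, and ignores Allow rules, wildcards, and user-agent matching):

```java
import java.util.List;

public class RobotsSketch {
    // disallowedPrefixes: the Disallow path prefixes already parsed
    // for our user-agent from robots.txt
    public static boolean allows(String path, List<String> disallowedPrefixes) {
        for (String prefix : disallowedPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false; // would lead to S_ROBOTS_PRECLUDED
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allows("/public/a.html", List.of("/private/")));
    }
}
```

In the real code, a disallowed URI either gets S_ROBOTS_PRECLUDED, or, with calculateRobotsOnly set, is only annotated "robotExcluded" and fetched anyway.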
The third precondition is authentication; the boolean credentialPrecondition(final CrawlURI curi) method is as follows:
/**
 * Consider credential preconditions.
 *
 * Looks to see if any credential preconditions (e.g. html form login
 * credentials) for this <code>CrawlServer</code>. If there are, have they
 * been run already? If not, make the running of these logins a precondition
 * of accessing any other url on this <code>CrawlServer</code>.
 *
 * <p>
 * One day, do optimization and avoid running the bulk of the code below.
 * Argument for running the code everytime is that overrides and refinements
 * may change what comes back from credential store.
 *
 * @param curi CrawlURI we're checking for any required preconditions.
 * @return True, if this <code>curi</code> has a precondition that needs to
 *         be met before we can proceed. False if we can proceed to process
 *         this url.
 */
// Note (author): consider the case where different classKeys map to the
// same host; the check should really be keyed by host.
protected boolean credentialPrecondition(final CrawlURI curi) {
    boolean result = false;
    CredentialStore cs = getCredentialStore();
    if (cs == null) {
        logger.severe("No credential store for " + curi);
        return result;
    }
    // iterate over the Collection<Credential> held by the CredentialStore
    for (Credential c : cs.getAll()) {
        // is the current CrawlURI itself a credential prerequisite?
        if (c.isPrerequisite(curi)) {
            // This credential has a prereq. and this curi is it. Let it
            // through. Add its avatar to the curi as a mark. Also, does
            // this curi need to be posted? Note, we do this test for
            // is it a prereq BEFORE we do the check that curi is of the
            // credential domain because such as yahoo have you go to
            // another domain altogether to login.
            // attach this credential to the CrawlURI
            c.attach(curi);
            curi.setFetchType(CrawlURI.FetchType.HTTP_POST);
            break;
        }
        // the Credential's root URI must match the serverName of the
        // CrawlURI's CrawlServer, i.e. the credential applies to this domain
        if (!c.rootUriMatch(serverCache, curi)) {
            continue;
        }
        // the Credential must define a prerequisite
        // (for form credentials, the login URL)
        if (!c.hasPrerequisite(curi)) {
            continue;
        }
        // Check whether authentication has already been run: the outer loop
        // walks all Credentials from the configuration; authenticated()
        // walks the Set<Credential> already stored on the CrawlURI's
        // CrawlServer. This assumes all CrawlURIs of one domain share the
        // same CrawlServer object (otherwise other CrawlURIs of the same
        // domain would have to authenticate again).
        if (!authenticated(c, curi)) {
            // Hasn't been authenticated. Queue it and move on (Assumption
            // is that we can do one authentication at a time -- usually one
            // html form).
            // the login URL
            String prereq = c.getPrerequisite(curi);
            if (prereq == null || prereq.length() <= 0) {
                CrawlServer server = serverCache.getServerFor(curi.getUURI());
                logger.severe(server.getName() + " has " +
                    " credential(s) of type " + c + " but prereq" +
                    " is null.");
            } else {
                try {
                    // set the prerequisite
                    curi.markPrerequisite(prereq);
                } catch (URIException e) {
                    logger.severe("unable to set credentials prerequisite " + prereq);
                    loggerModule.logUriError(e, curi.getUURI(), prereq);
                    return false;
                }
                result = true;
                if (logger.isLoggable(Level.FINE)) {
                    logger.fine("Queueing prereq " + prereq + " of type " +
                        c + " for " + curi);
                }
                break;
            }
        }
    }
    return result;
}
The boolean authenticated(final Credential credential, final CrawlURI curi) method checks whether authentication has already been performed, i.e. whether the CrawlServer object already holds the given Credential:
/**
 * Has passed credential already been authenticated.
 *
 * @param credential Credential to test.
 * @param curi CrawlURI.
 * @return True if already run.
 */
protected boolean authenticated(final Credential credential, final CrawlURI curi) {
    // get the CrawlServer for the current CrawlURI
    CrawlServer server = serverCache.getServerFor(curi.getUURI());
    if (!server.hasCredentials()) {
        return false;
    }
    // the Set<Credential> already persisted on the CrawlServer
    Set<Credential> credentials = server.getCredentials();
    for (Credential cred : credentials) {
        // same key and same credential type
        if (cred.getKey().equals(credential.getKey()) &&
                cred.getClass().isInstance(credential)) {
            return true;
        }
    }
    return false;
}
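The matching rule above (same key and same concrete credential type, via getClass().isInstance(...)) can be shown in isolation. The classes below are stand-ins I invented for illustration, not Heritrix's Credential hierarchy:

```java
import java.util.Set;

public class AuthSketch {
    // minimal stand-in for Heritrix's Credential:
    // only the key and the runtime class matter for the check
    static class Cred {
        final String key;
        Cred(String key) { this.key = key; }
    }
    static class FormCred extends Cred {
        FormCred(String key) { super(key); }
    }

    public static boolean authenticated(Cred candidate, Set<Cred> stored) {
        for (Cred cred : stored) {
            // mirrors cred.getKey().equals(...) && cred.getClass().isInstance(...)
            if (cred.key.equals(candidate.key)
                    && cred.getClass().isInstance(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Cred> stored = Set.of(new FormCred("login"));
        System.out.println(authenticated(new FormCred("login"), stored));
    }
}
```

Note the direction of the type test: a plain Cred candidate does not match a stored FormCred even with the same key, which is why a different credential type for the same server triggers authentication again.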
---------------------------------------------------------------------------
This Heritrix 3.1.0 source-code analysis series is my original work.
Please credit the source when reprinting: cnblogs (博客园), 刺猬的温驯
Original link: http://www.cnblogs.com/chenying99/archive/2013/04/30/3052319.html