• Heritrix 3.1.0 Source Code Analysis (Part 31)


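    A CrawlURI taken from the BdbFrontier's next method (that is, from the BdbWorkQueue identified by some classkey) first enters the Preselector processor. This processor filters the CrawlURI against the regular expressions set in the configuration file, and only a CrawlURI that passes the filter moves on to the next processor. Preselector extends the Scoper class (which I covered in an earlier article and will not repeat here). The class is fairly simple, so I only show the method that does the work: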

    @Override
        protected ProcessResult innerProcessResult(CrawlURI puri) {
            CrawlURI curi = (CrawlURI)puri;
            
            // Check if uris should be blocked
            if (getBlockAll()) {
                curi.setFetchStatus(S_BLOCKED_BY_USER);
                return ProcessResult.FINISH;
            }
    
            // Check if allowed by regular expression
            String regex = getAllowByRegex();
            if (regex != null && !regex.equals("")) {
                if (!TextUtils.matches(regex, curi.toString())) {
                    curi.setFetchStatus(S_BLOCKED_BY_USER);
                    return ProcessResult.FINISH;
                }
            }
    
            // Check if blocked by regular expression
            regex = getBlockByRegex();
            if (regex != null && !regex.equals("")) {
                if (TextUtils.matches(regex, curi.toString())) {
                    curi.setFetchStatus(S_BLOCKED_BY_USER);
                    return ProcessResult.FINISH;
                }
            }
    
            // Possibly recheck scope
            if (getRecheckScope()) {
                if (!isInScope(curi)) {
                    // Scope rejected
                    curi.setFetchStatus(S_OUT_OF_SCOPE);
                    return ProcessResult.FINISH;
                }
            }
            
            return ProcessResult.PROCEED;
        }

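    The last step of the method above decides whether to re-apply the scope check (by calling boolean isInScope(CrawlURI caUri), inherited from the Scoper parent class). The recheckScope setting defaults to false.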
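    The order of the checks in the method above can be sketched in isolation. This is a minimal illustration, not Heritrix code: the class PreselectorSketch and its Decision enum are hypothetical, and java.util.regex.Pattern.matches stands in for TextUtils.matches on the assumption that both require the pattern to match the whole URI string.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for the Preselector regex checks; not Heritrix code. */
class PreselectorSketch {
    enum Decision { PROCEED, BLOCKED }

    static Decision decide(String uri, boolean blockAll,
                           String allowByRegex, String blockByRegex) {
        // 1. blockAll rejects everything.
        if (blockAll) {
            return Decision.BLOCKED;
        }
        // 2. A non-empty allow pattern must match the whole URI.
        if (allowByRegex != null && !allowByRegex.isEmpty()
                && !Pattern.matches(allowByRegex, uri)) {
            return Decision.BLOCKED;
        }
        // 3. A non-empty block pattern that matches the whole URI rejects it.
        if (blockByRegex != null && !blockByRegex.isEmpty()
                && Pattern.matches(blockByRegex, uri)) {
            return Decision.BLOCKED;
        }
        return Decision.PROCEED;
    }

    public static void main(String[] args) {
        String allow = "https?://example\\.com/.*";
        String block = ".*\\.(jpg|png)";
        System.out.println(decide("http://example.com/index.html", false, allow, block)); // PROCEED
        System.out.println(decide("http://example.com/logo.png", false, allow, block));   // BLOCKED
        System.out.println(decide("http://other.org/", false, allow, block));             // BLOCKED
    }
}
```

    Under that whole-string assumption, a block pattern such as \.png alone would never match a full URI; it would have to be written as .*\.png.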

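    The point to understand here is that this processor's regex filtering runs after the CrawlURI has already been added to a BdbWorkQueue, just before it is actually fetched. The CandidatesProcessor, by contrast, screens a CrawlURI before it enters a BdbWorkQueue. For rules that filter CrawlURI objects, I recommend configuring them in the modules associated with the CandidatesProcessor.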

    The regular expressions can be set as properties of the Preselector bean in the crawler-beans.cxml configuration file:

     <!-- first, processors are declared as top-level named beans -->
     <bean id="preselector" class="org.archive.crawler.prefetch.MyPreselector">
          <!-- <property name="recheckScope" value="false" />-->
         <!--  <property name="blockAll" value="false" />-->
         <!--  <property name="blockByRegex" value="" />-->
         <!--  <property name="allowByRegex" value="" />-->
     </bean>

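    A CrawlURI that passes the Preselector processor next enters the PreconditionEnforcer processor. This processor enforces preconditions: it checks DNS, checks the robots rules, handles authentication, and so on. Its processing method is as follows: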

    @Override
        protected ProcessResult innerProcessResult(CrawlURI puri) {
            CrawlURI curi = (CrawlURI)puri;
            // DNS resolution check
            if (considerDnsPreconditions(curi)) {
                return ProcessResult.FINISH;
            }
    
            // make sure we only process schemes we understand (i.e. not dns):
            // a CrawlURI whose scheme is neither http nor https skips the remaining checks
            String scheme = curi.getUURI().getScheme().toLowerCase();
            if (! (scheme.equals("http") || scheme.equals("https"))) {
                logger.fine("PolitenessEnforcer doesn't understand uri's of type " +
                    scheme + " (ignoring)");
                return ProcessResult.PROCEED;
            }
            // robots.txt check
            if (considerRobotsPreconditions(curi)) {
                return ProcessResult.FINISH;
            }
    //        System.out.println("!curi.isPrerequisite():"+!curi.isPrerequisite());
            // credential (login) check
            if (!curi.isPrerequisite() && credentialPrecondition(curi)) {
                return ProcessResult.FINISH;
            }
    
            // OK, it's allowed
    
            // For all curis that will in fact be fetched, set appropriate delays.
            // TODO: SOMEDAY: allow per-host, per-protocol, etc. factors
            // curi.setDelayFactor(getDelayFactorFor(curi));
            // curi.setMinimumDelay(getMinimumDelayFor(curi));
    
            return ProcessResult.PROCEED;
        }

    If a precondition exists, the processor sets it as a prerequisite on the current CrawlURI, and the URI leaves the current processor chain (the FetchChain).
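    The control flow of innerProcessResult above can be reduced to a small decision function. This is only a sketch: PreconditionOrderSketch is a hypothetical class, and each boolean parameter stands for the return value of the corresponding real check (considerDnsPreconditions, considerRobotsPreconditions, credentialPrecondition).

```java
import java.util.Locale;

/** Hypothetical model of the check order in PreconditionEnforcer; not Heritrix code. */
class PreconditionOrderSketch {
    enum Result { PROCEED, FINISH }

    static Result enforce(String scheme, boolean dnsBlocks, boolean robotsBlocks,
                          boolean isPrerequisite, boolean credentialBlocks) {
        // DNS is checked first, for every scheme.
        if (dnsBlocks) {
            return Result.FINISH;
        }
        // Non-http(s) URIs (for example dns:) skip the robots and credential checks.
        String s = scheme.toLowerCase(Locale.ROOT);
        if (!(s.equals("http") || s.equals("https"))) {
            return Result.PROCEED;
        }
        if (robotsBlocks) {
            return Result.FINISH;
        }
        // A URI that is itself a prerequisite is never held back by credentials.
        if (!isPrerequisite && credentialBlocks) {
            return Result.FINISH;
        }
        return Result.PROCEED;
    }

    public static void main(String[] args) {
        System.out.println(enforce("http", true, false, false, false)); // FINISH
        System.out.println(enforce("dns", false, true, false, true));   // PROCEED
        System.out.println(enforce("https", false, false, true, true)); // PROCEED
        System.out.println(enforce("http", false, false, false, true)); // FINISH
    }
}
```

    The first check that reports a pending precondition ends processing in this chain; later checks are not consulted.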

    Let us look at the first precondition, DNS resolution, which is handled by the boolean considerDnsPreconditions(CrawlURI curi) method:

    /**
         * @param curi CrawlURI whose dns prerequisite we're to check.
         * @return true if no further processing in this module should occur
         */
        protected boolean considerDnsPreconditions(CrawlURI curi) {
            if(curi.getUURI().getScheme().equals("dns")){
                // DNS URIs never have a DNS precondition
                // a dns: URI is itself a prerequisite
                curi.setPrerequisite(true);
                return false; 
            } else if (curi.getUURI().getScheme().equals("whois")) {
                return false;
            }
    
            //serverCache:org.archive.modules.net.BdbServerCache
            CrawlServer cs = serverCache.getServerFor(curi.getUURI());        
            if(cs == null) {
                curi.setFetchStatus(S_UNFETCHABLE_URI);
    //            curi.skipToPostProcessing();
                return true;
            }
    
            // If we've done a dns lookup and it didn't resolve a host
            // cancel further fetch-processing of this URI, because
            // the domain is unresolvable
            CrawlHost ch = serverCache.getHostFor(curi.getUURI());        
            if (ch == null || ch.hasBeenLookedUp() && ch.getIP() == null) {
                if (logger.isLoggable(Level.FINE)) {
                    logger.fine( "no dns for " + ch +
                        " cancelling processing for CrawlURI " + curi.toString());
                }
                curi.setFetchStatus(S_DOMAIN_PREREQUISITE_FAILURE);
    //            curi.skipToPostProcessing();
                return true;
            }
    
            // If we haven't done a dns lookup  and this isn't a dns uri
            // shoot that off and defer further processing
            // the IP is missing or expired, and this CrawlURI is not itself a dns: URI
            if (isIpExpired(curi) && !curi.getUURI().getScheme().equals("dns")) {
                logger.fine("Deferring processing of CrawlURI " + curi.toString()
                    + " for dns lookup.");
                String preq = "dns:" + ch.getHostName();
                try {
                    // set the DNS lookup as a prerequisite
                    curi.markPrerequisite(preq);
                } catch (URIException e) {
                    throw new RuntimeException(e); // shouldn't ever happen
                }
                return true;
            }
            
            // DNS preconditions OK
            return false;
        }

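    The boolean isIpExpired(CrawlURI curi) method decides whether the IP needs to be looked up: it returns true when the CrawlHost for the current CrawlURI has never been looked up, or when the IP it holds has expired.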

    /** Return true if ip should be looked up.
         *
         * @param curi the URI to check.
         * @return true if ip should be looked up.
         */
        public boolean isIpExpired(CrawlURI curi) {
            CrawlHost host = serverCache.getHostFor(curi.getUURI());
            if (!host.hasBeenLookedUp()) {
                // IP has not been looked up yet.
                return true;
            }
    
            if (host.getIpTTL() == CrawlHost.IP_NEVER_EXPIRES) {
                // IP never expires (numeric IP)
                return false;
            }
    
            long duration = getIpValidityDurationSeconds();
            if (duration == 0) {
                // Never expire ip if duration is null (set by user or more likely,
                // set to zero in case where we tried in FetchDNS but failed).
                return false;
            }
            
            long ttl = host.getIpTTL();
            if (ttl > duration) {
                // Use the larger of the operator-set minimum duration 
                // or the DNS record TTL
                duration = ttl;
            }
    
            // Duration and ttl are in seconds.  Convert to millis.
            if (duration > 0) {
                duration *= 1000;
            }
    
            return (duration + host.getIpFetched()) < System.currentTimeMillis();
        }
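    The expiry arithmetic in isIpExpired can be checked with plain numbers. The sketch below is a hypothetical pure-function version: IpExpirySketch, its parameters, and the value used for the IP_NEVER_EXPIRES sentinel are assumptions for illustration, and it assumes non-negative durations.

```java
/** Hypothetical pure-function version of the IP expiry rule; not Heritrix code. */
class IpExpirySketch {
    // Assumed sentinel value; the real constant lives in CrawlHost.
    static final long IP_NEVER_EXPIRES = -1;

    static boolean isIpExpired(boolean lookedUp, long ttlSeconds,
                               long minValiditySeconds,
                               long fetchedAtMillis, long nowMillis) {
        if (!lookedUp) {
            return true;               // never looked up: must resolve
        }
        if (ttlSeconds == IP_NEVER_EXPIRES) {
            return false;              // numeric IP: nothing to refresh
        }
        if (minValiditySeconds == 0) {
            return false;              // operator disabled expiry
        }
        // Use the larger of the operator minimum and the DNS record TTL.
        long durationSeconds = Math.max(minValiditySeconds, ttlSeconds);
        return fetchedAtMillis + durationSeconds * 1000 < nowMillis;
    }

    public static void main(String[] args) {
        long oneHour = 3_600_000L;
        // TTL 300 s, operator minimum 21600 s (6 h), fetched at t = 0.
        System.out.println(isIpExpired(true, 300, 21_600, 0, oneHour));        // false
        System.out.println(isIpExpired(true, 300, 21_600, 0, 7 * oneHour));    // true
        // A 24 h TTL outlasts the 6 h minimum.
        System.out.println(isIpExpired(true, 86_400, 21_600, 0, 7 * oneHour)); // false
    }
}
```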

    If the IP has not been looked up or has expired, the prerequisite of the current CrawlURI is set to dns:host. The CrawlURI markPrerequisite(String preq) method of CrawlURI is as follows:

    /**
         * Do all actions associated with setting a <code>CrawlURI</code> as
         * requiring a prerequisite.
         *
         * @param lastProcessorChain Last processor chain reference.  This chain is
         * where this <code>CrawlURI</code> goes next.
         * @param preq Object to set a prerequisite.
         * @return the newly created prerequisite CrawlURI
         * @throws URIException
         */
        public CrawlURI markPrerequisite(String preq) 
        throws URIException {
            UURI src = getUURI();
            UURI dest = UURIFactory.getInstance(preq);
            LinkContext lc = LinkContext.PREREQ_MISC;
            Hop hop = Hop.PREREQ;
            Link link = new Link(src, dest, lc, hop);
            CrawlURI caUri = createCrawlURI(getBaseURI(), link);
            // TODO: consider moving some of this to candidate-handling
            int prereqPriority = getSchedulingDirective() - 1;
            if (prereqPriority < 0) {
                prereqPriority = 0;
                logger.severe("Unable to promote prerequisite " + caUri + " above " + this);
            }
            caUri.setSchedulingDirective(prereqPriority);
            caUri.setForceFetch(true);
            setPrerequisiteUri(caUri);
            incrementDeferrals();
            setFetchStatus(S_DEFERRED);
            
            return caUri;
        }

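    This method first creates the prerequisite CrawlURI caUri and gives it a scheduling directive one level higher than the current URI. It then stores the prerequisite in the current CrawlURI's Map<String,Object> data member under the key String A_PREREQUISITE_URI = "prerequisite-uri", and marks the current URI as S_DEFERRED: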

    /**
         * Set a prerequisite for this URI.
         * <p>
         * A prerequisite is a URI that must be crawled before this URI can be
         * crawled.
         *
         * @param link Link to set as prereq.
         */
        public void setPrerequisiteUri(CrawlURI pre) {
            getData().put(A_PREREQUISITE_URI, pre);
        }
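    The bookkeeping done by markPrerequisite and setPrerequisiteUri can be modelled with a plain map. This is a simplified sketch: PrerequisiteSketch and its Uri class are hypothetical, and it assumes that a lower scheduling directive means a more urgent URI, which the clamp at 0 in the real method suggests.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of prerequisite bookkeeping; not Heritrix code. */
class PrerequisiteSketch {
    static final String A_PREREQUISITE_URI = "prerequisite-uri";

    /** One level more urgent, clamped at 0 (assumed to be the most urgent level). */
    static int promote(int schedulingDirective) {
        return Math.max(0, schedulingDirective - 1);
    }

    static class Uri {
        final String url;
        final int schedulingDirective;
        boolean forceFetch;
        final Map<String, Object> data = new HashMap<>();

        Uri(String url, int schedulingDirective) {
            this.url = url;
            this.schedulingDirective = schedulingDirective;
        }

        /** Creates the prerequisite and remembers it under the well-known key. */
        Uri markPrerequisite(String prereqUrl) {
            Uri prereq = new Uri(prereqUrl, promote(schedulingDirective));
            prereq.forceFetch = true;            // fetch even if already seen
            data.put(A_PREREQUISITE_URI, prereq);
            return prereq;
        }
    }

    public static void main(String[] args) {
        Uri page = new Uri("http://example.com/index.html", 3);
        Uri prereq = page.markPrerequisite("dns:example.com");
        System.out.println(prereq.url + " directive=" + prereq.schedulingDirective);
        // prints "dns:example.com directive=2"
    }
}
```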

    After leaving the current chain (the FetchChain), the CrawlURI enters the next chain (the DispositionChain), where the CandidatesProcessor adds the prerequisite to a BdbWorkQueue. The relevant code:

    @Override
        protected void innerProcess(final CrawlURI curi) throws InterruptedException {
            // Handle any prerequisites when S_DEFERRED for prereqs
            if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
                CrawlURI prereq = curi.getPrerequisiteUri();
                prereq.setFullVia(curi); 
                sheetOverlaysManager.applyOverlaysTo(prereq);
                try {
                    KeyedProperties.clearOverridesFrom(curi); 
                    KeyedProperties.loadOverridesFrom(prereq);
                    getCandidateChain().process(prereq, null);
                    
                    if(prereq.getFetchStatus()>=0) {
                        // add to the BdbWorkQueue
                        frontier.schedule(prereq);
                    } else {
                        curi.setFetchStatus(S_PREREQUISITE_UNSCHEDULABLE_FAILURE);
                    }
                } finally {
                    KeyedProperties.clearOverridesFrom(prereq); 
                    KeyedProperties.loadOverridesFrom(curi);
                }
                return;
            }
        // remainder of the method omitted
    }
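    The overall effect of deferral is that the prerequisite is fetched before the URI that needed it. The toy loop below illustrates only that ordering. DeferralSketch is hypothetical, the queue is a plain in-memory deque rather than a BdbWorkQueue, and the way the deferred URI is put back in line is an assumption, since the code in this article does not show it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical toy crawl loop showing prerequisite ordering; not Heritrix code. */
class DeferralSketch {
    /** Returns the order in which URIs end up being fetched. */
    static List<String> crawl(String start, String prereq) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> fetched = new HashSet<>();
        List<String> order = new ArrayList<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String uri = queue.poll();
            boolean needsPrereq = uri.equals(start) && !fetched.contains(prereq);
            if (needsPrereq) {
                // Deferred: put the URI back and schedule its prerequisite ahead of it.
                queue.addFirst(uri);
                queue.addFirst(prereq);
                continue;
            }
            fetched.add(uri);
            order.add(uri);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.com/page", "dns:example.com"));
        // prints [dns:example.com, http://example.com/page]
    }
}
```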

    Back in the PreconditionEnforcer, the second precondition is the robots rules check. Its logic resembles the DNS check. The boolean considerRobotsPreconditions(CrawlURI curi) method is as follows:

    /**
         * Consider the robots precondition.
         *
         * @param curi CrawlURI we're checking for any required preconditions.
         * @return True, if this <code>curi</code> has a precondition or processing
         *         should be terminated for some other reason.  False if
         *         we can proceed to process this url.
         */
        protected boolean considerRobotsPreconditions(CrawlURI curi) {
            // treat /robots.txt fetches specially
            // to skip the robots check entirely, return false here
            UURI uuri = curi.getUURI();
            try {
                if (uuri != null && uuri.getPath() != null &&
                        curi.getUURI().getPath().equals("/robots.txt")) {
                    // allow processing to continue
                    // robots.txt is itself a prerequisite
                    curi.setPrerequisite(true);
                    return false;
                }
            } catch (URIException e) {
                logger.severe("Failed get of path for " + curi);
            }
            
            CrawlServer cs = serverCache.getServerFor(curi.getUURI());
            // require /robots.txt if not present
            // check whether the cached robots.txt has expired
            if (cs.isRobotsExpired(getRobotsValidityDurationSeconds())) {
                // Need to get robots
                if (logger.isLoggable(Level.FINE)) {
                    logger.fine( "No valid robots for " + cs  +
                        "; deferring " + curi);
                }
    
                // Robots expired - should be refetched even though its already
                // crawled.
                try {
                    String prereq = curi.getUURI().resolve("/robots.txt").toString();
                    // set the prerequisite
                    curi.markPrerequisite(prereq);
                }
                catch (URIException e1) {
                    logger.severe("Failed resolve using " + curi);
                    throw new RuntimeException(e1); // shouldn't ever happen
                }
                return true;
            }
            // test against robots.txt if available
            if (cs.isValidRobots()) {
                String ua = metadata.getUserAgent();
                RobotsPolicy robots = metadata.getRobotsPolicy();
                if(!robots.allows(ua, curi, cs.getRobotstxt())) {
                    if(getCalculateRobotsOnly()) {
                        // annotate URI as excluded, but continue to process normally
                        curi.getAnnotations().add("robotExcluded");
                        return false; 
                    }
                    // mark as precluded; in FetchHTTP, this will
                    // prevent fetching and cause a skip to the end
                    // of processing (unless an intervening processor
                    // overrules)
                    curi.setFetchStatus(S_ROBOTS_PRECLUDED);
                    curi.setError("robots.txt exclusion");
                    logger.fine("robots.txt precluded " + curi);
                    return true;
                }
                return false;
            }
            // No valid robots found => Attempt to get robots.txt failed
    //        curi.skipToPostProcessing();
            curi.setFetchStatus(S_ROBOTS_PREREQUISITE_FAILURE);
            curi.setError("robots.txt prerequisite failed");
            if (logger.isLoggable(Level.FINE)) {
                logger.fine("robots.txt prerequisite failed " + curi);
            }
            return true;
        }
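    The prerequisite URL in the method above comes from resolving "/robots.txt" against the current URI. The same resolution can be tried with the JDK's java.net.URI, used here as a stand-in for Heritrix's UURI. RobotsUrlSketch is a hypothetical class for illustration.

```java
import java.net.URI;

/** Hypothetical helper showing how a robots.txt URL is derived; not Heritrix code. */
class RobotsUrlSketch {
    static String robotsUrlFor(String pageUrl) {
        // A path-absolute reference keeps scheme, host and port, and drops path and query.
        return URI.create(pageUrl).resolve("/robots.txt").toString();
    }

    public static void main(String[] args) {
        System.out.println(robotsUrlFor("http://example.com/a/b.html?x=1"));
        // prints http://example.com/robots.txt
        System.out.println(robotsUrlFor("https://example.com:8443/docs/"));
        // prints https://example.com:8443/robots.txt
    }
}
```

    Because the result depends only on scheme, host and port, every page on the same server shares one robots.txt prerequisite.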

    The third precondition is authentication. The boolean credentialPrecondition(final CrawlURI curi) method is as follows:

    /**
        * Consider credential preconditions.
        *
        * Looks to see if any credential preconditions (e.g. html form login
        * credentials) for this <code>CrawlServer</code>. If there are, have they
        * been run already? If not, make the running of these logins a precondition
        * of accessing any other url on this <code>CrawlServer</code>.
        *
        * <p>
        * One day, do optimization and avoid running the bulk of the code below.
        * Argument for running the code everytime is that overrides and refinements
        * may change what comes back from credential store.
        *
        * @param curi CrawlURI we're checking for any required preconditions.
        * @return True, if this <code>curi</code> has a precondition that needs to
        *         be met before we can proceed. False if we can proceed to process
        *         this url.
        */
        /**
         * Note: consider the case where the classkeys differ but the host is actually the same; the host should be the basis.
         * @param curi
         * @return
         */
        protected boolean credentialPrecondition(final CrawlURI curi) {
    
            boolean result = false;
            
            CredentialStore cs = getCredentialStore();
            if (cs == null) {
                logger.severe("No credential store for " + curi);
                return result;
            }
            //System.out.println(cs.getAll().size());
            // iterate over the Collection<Credential> held by the CredentialStore
            for (Credential c: cs.getAll()) {
                // is the current CrawlURI itself this credential's prerequisite?
                if (c.isPrerequisite(curi)) {
                    // This credential has a prereq. and this curi is it.  Let it
                    // through.  Add its avatar to the curi as a mark.  Also, does
                    // this curi need to be posted?  Note, we do this test for
                    // is it a prereq BEFORE we do the check that curi is of the
                    // credential domain because such as yahoo have you go to
                    // another domain altogether to login.
                    // attach the current credential to the CrawlURI
                    c.attach(curi);
                    curi.setFetchType(CrawlURI.FetchType.HTTP_POST);
                    break;
                }
                // skip credentials whose domain does not match the serverName of the
                // CrawlServer for this CrawlURI, i.e. whose domain differs from the CrawlURI's
                if (!c.rootUriMatch(serverCache, curi)) {
                    continue;
                }
                // skip credentials that have no prerequisite (for form login, the prerequisite is the login URL)
                if (!c.hasPrerequisite(curi)) {
                    continue;
                }
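                // Has this credential already been authenticated?
                // At this point the domain of the CrawlURI is known to match the domain of Credential c.
                //
                // authenticated() gets the CrawlServer for the CrawlURI and scans its
                // Set<Credential> for an entry that matches Credential c.
                // The outer loop covers every Credential in the configuration; the inner loop
                // covers the Credentials already stored on the server for this CrawlURI.
                // CrawlURIs in the same domain must map to the same CrawlServer object,
                // otherwise other CrawlURIs in that domain would have to authenticate again.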
                if (!authenticated(c, curi)) {
                    // Hasn't been authenticated.  Queue it and move on (Assumption
                    // is that we can do one authentication at a time -- usually one
                    // html form).
                    // the login URL
                    String prereq = c.getPrerequisite(curi);
                    if (prereq == null || prereq.length() <= 0) {
                        CrawlServer server = serverCache.getServerFor(curi.getUURI());
                        logger.severe(server.getName() + " has "
                            + " credential(s) of type " + c + " but prereq"
                            + " is null.");
                    } else {
                        try {
                            // add the prerequisite
                            curi.markPrerequisite(prereq);
                        } catch (URIException e) {
                            logger.severe("unable to set credentials prerequisite "+prereq);
                            loggerModule.logUriError(e,curi.getUURI(),prereq);
                            return false; 
                        }
                        result = true;
                        if (logger.isLoggable(Level.FINE)) {
                            logger.fine("Queueing prereq " + prereq + " of type " +
                                c + " for " + curi);
                        }
                        // leave the loop
                        break;
                    }
                }
            }
            return result;
        }

    The boolean authenticated(final Credential credential, final CrawlURI curi) method decides whether authentication has already taken place, that is, whether the CrawlServer already holds the given Credential:

    /**
         * Has passed credential already been authenticated.
         *
         * @param credential Credential to test.
         * @param curi CrawlURI.
         * @return True if already run.
         */
        protected boolean authenticated(final Credential credential, final CrawlURI curi) {
            // get the CrawlServer for the CrawlURI
            CrawlServer server = serverCache.getServerFor(curi.getUURI());
            if (!server.hasCredentials()) {
                return false;
            }
            // the Set<Credential> already persisted on the CrawlServer
            Set<Credential> credentials = server.getCredentials();
            for (Credential cred: credentials) {
                // same key and same type
                if (cred.getKey().equals(credential.getKey()) 
                        && cred.getClass().isInstance(credential)) {
                    return true; 
                }
            }
            return false;
        }
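    The matching rule in authenticated (same key, compatible class) does not rely on object identity, so a freshly configured credential can match one already stored on the server. The sketch below illustrates this with hypothetical classes: AuthenticatedSketch, Credential, FormLogin and BasicAuth are invented for the example and are not the Heritrix types.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical model of the "already authenticated" test; not Heritrix code. */
class AuthenticatedSketch {
    static class Credential {
        final String key;
        Credential(String key) { this.key = key; }
    }
    static class FormLogin extends Credential {
        FormLogin(String key) { super(key); }
    }
    static class BasicAuth extends Credential {
        BasicAuth(String key) { super(key); }
    }

    /** True if a stored credential has the same key and a compatible class. */
    static boolean authenticated(Credential candidate, Set<Credential> stored) {
        for (Credential cred : stored) {
            if (cred.key.equals(candidate.key)
                    && cred.getClass().isInstance(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Credential> stored = Collections.singleton(new FormLogin("example.com"));
        System.out.println(authenticated(new FormLogin("example.com"), stored)); // true
        System.out.println(authenticated(new BasicAuth("example.com"), stored)); // false
        System.out.println(authenticated(new FormLogin("other.org"), stored));   // false
    }
}
```

    Note that cred.getClass().isInstance(candidate) also accepts a candidate whose class is a subclass of the stored credential's class.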

    ---------------------------------------------------------------------------

    This Heritrix 3.1.0 source code analysis series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/30/3052319.html

  • Original article: https://www.cnblogs.com/chenying99/p/3052319.html