The first processor entered by a CrawlURI taken from the BdbFrontier object's next method (which draws from the BdbWorkQueue identified by some classKey) is the Preselector processor. It filters the CrawlURI against the regular expressions configured in the configuration file; only CrawlURI objects that pass this filter move on to the next processor. Preselector extends the Scoper class (which I analyzed in an earlier article, so I won't repeat that here). The class is fairly simple, so I only show the relevant processing method:
@Override
protected ProcessResult innerProcessResult(CrawlURI puri) {
    CrawlURI curi = (CrawlURI)puri;

    // Check if uris should be blocked
    if (getBlockAll()) {
        curi.setFetchStatus(S_BLOCKED_BY_USER);
        return ProcessResult.FINISH;
    }

    // Check if allowed by regular expression
    String regex = getAllowByRegex();
    if (regex != null && !regex.equals("")) {
        if (!TextUtils.matches(regex, curi.toString())) {
            curi.setFetchStatus(S_BLOCKED_BY_USER);
            return ProcessResult.FINISH;
        }
    }

    // Check if blocked by regular expression
    regex = getBlockByRegex();
    if (regex != null && !regex.equals("")) {
        if (TextUtils.matches(regex, curi.toString())) {
            curi.setFetchStatus(S_BLOCKED_BY_USER);
            return ProcessResult.FINISH;
        }
    }

    // Possibly recheck scope
    if (getRecheckScope()) {
        if (!isInScope(curi)) {
            // Scope rejected
            curi.setFetchStatus(S_OUT_OF_SCOPE);
            return ProcessResult.FINISH;
        }
    }
    return ProcessResult.PROCEED;
}
The last step of the method above decides whether to recheck the crawl scope (by calling the parent Scoper class's boolean isInScope(CrawlURI caUri) method); this option is false by default.
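The allow/block decision order in the method above can be illustrated with a small standalone sketch. Note that this is not Heritrix code: it uses java.util.regex.Pattern directly in place of Heritrix's TextUtils (both perform a full match of the whole string), and shouldBlock is a hypothetical helper name.

```java
import java.util.regex.Pattern;

public class PreselectorSketch {
    // Mimics Preselector's decision order: a non-empty allow regex must
    // match the full URI string, and a non-empty block regex must not.
    public static boolean shouldBlock(String uri, String allowRegex, String blockRegex) {
        if (allowRegex != null && !allowRegex.isEmpty()
                && !Pattern.matches(allowRegex, uri)) {
            return true; // not allowed -> S_BLOCKED_BY_USER
        }
        if (blockRegex != null && !blockRegex.isEmpty()
                && Pattern.matches(blockRegex, uri)) {
            return true; // explicitly blocked -> S_BLOCKED_BY_USER
        }
        return false; // PROCEED
    }

    public static void main(String[] args) {
        System.out.println(shouldBlock("http://example.com/page",
                ".*example\\.com.*", ".*\\.pdf$"));
        System.out.println(shouldBlock("http://example.com/file.pdf",
                ".*example\\.com.*", ".*\\.pdf$"));
    }
}
```

Because the allow check runs first, a URI excluded by allowByRegex is blocked even if blockByRegex would not have matched it.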
What needs to be understood here is that this processor's regex filtering happens after the CrawlURI has already been added to a BdbWorkQueue, just before actual fetching. This is different from the CandidatesProcessor, which filters CrawlURI objects before they enter a BdbWorkQueue. If you want to configure filtering rules for CrawlURI objects, I recommend doing so in the modules related to the CandidatesProcessor.
The corresponding regular expressions can be set as properties of the Preselector bean in the crawler-beans.cxml configuration file:
<!-- first, processors are declared as top-level named beans -->
<bean id="preselector" class="org.archive.crawler.prefetch.MyPreselector">
 <!-- <property name="recheckScope" value="false" /> -->
 <!-- <property name="blockAll" value="false" /> -->
 <!-- <property name="blockByRegex" value="" /> -->
 <!-- <property name="allowByRegex" value="" /> -->
</bean>
A CrawlURI that passes the Preselector processor next enters the PreconditionEnforcer processor, which can be called the precondition processor: it mainly handles DNS resolution, robots.txt verification, and authentication. Its relevant processing method is as follows:
@Override
protected ProcessResult innerProcessResult(CrawlURI puri) {
    CrawlURI curi = (CrawlURI)puri;

    // DNS resolution precondition
    if (considerDnsPreconditions(curi)) {
        return ProcessResult.FINISH;
    }

    // make sure we only process schemes we understand (i.e. not dns),
    // i.e. skip further checks when the scheme is neither http nor https
    String scheme = curi.getUURI().getScheme().toLowerCase();
    if (! (scheme.equals("http") || scheme.equals("https"))) {
        logger.fine("PolitenessEnforcer doesn't understand uri's of type "
            + scheme + " (ignoring)");
        return ProcessResult.PROCEED;
    }

    // robots.txt precondition
    if (considerRobotsPreconditions(curi)) {
        return ProcessResult.FINISH;
    }

    // authentication precondition
    if (!curi.isPrerequisite() && credentialPrecondition(curi)) {
        return ProcessResult.FINISH;
    }

    // OK, it's allowed
    // For all curis that will in fact be fetched, set appropriate delays.
    // TODO: SOMEDAY: allow per-host, per-protocol, etc. factors
    // curi.setDelayFactor(getDelayFactorFor(curi));
    // curi.setMinimumDelay(getMinimumDelayFor(curi));
    return ProcessResult.PROCEED;
}
If a precondition exists, it is set on the current CrawlURI and the current processor chain (the FetchChain) is exited.
Let's first analyze the first precondition, DNS resolution, in the boolean considerDnsPreconditions(CrawlURI curi) method:
/**
 * @param curi CrawlURI whose dns prerequisite we're to check.
 * @return true if no further processing in this module should occur
 */
protected boolean considerDnsPreconditions(CrawlURI curi) {
    if (curi.getUURI().getScheme().equals("dns")) {
        // DNS URIs never have a DNS precondition:
        // a dns: URI is itself a prerequisite
        curi.setPrerequisite(true);
        return false;
    } else if (curi.getUURI().getScheme().equals("whois")) {
        return false;
    }

    // serverCache: org.archive.modules.net.BdbServerCache
    CrawlServer cs = serverCache.getServerFor(curi.getUURI());
    if (cs == null) {
        curi.setFetchStatus(S_UNFETCHABLE_URI);
        // curi.skipToPostProcessing();
        return true;
    }

    // If we've done a dns lookup and it didn't resolve a host
    // cancel further fetch-processing of this URI, because
    // the domain is unresolvable
    CrawlHost ch = serverCache.getHostFor(curi.getUURI());
    if (ch == null || ch.hasBeenLookedUp() && ch.getIP() == null) {
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("no dns for " + ch
                + " cancelling processing for CrawlURI " + curi.toString());
        }
        curi.setFetchStatus(S_DOMAIN_PREREQUISITE_FAILURE);
        // curi.skipToPostProcessing();
        return true;
    }

    // If we haven't done a dns lookup and this isn't a dns uri
    // shoot that off and defer further processing:
    // the cached IP lookup has expired and the scheme itself is not dns
    if (isIpExpired(curi) && !curi.getUURI().getScheme().equals("dns")) {
        logger.fine("Deferring processing of CrawlURI " + curi.toString()
            + " for dns lookup.");
        String preq = "dns:" + ch.getHostName();
        try {
            // prerequisite: DNS resolution
            curi.markPrerequisite(preq);
        } catch (URIException e) {
            throw new RuntimeException(e); // shouldn't ever happen
        }
        return true;
    }
    // DNS preconditions OK
    return false;
}
The boolean isIpExpired(CrawlURI curi) method determines whether the IP of the CrawlHost associated with the current CrawlURI still needs to be looked up, i.e. whether a previous lookup is missing or has expired:
/**
 * Return true if ip should be looked up.
 *
 * @param curi the URI to check.
 * @return true if ip should be looked up.
 */
public boolean isIpExpired(CrawlURI curi) {
    CrawlHost host = serverCache.getHostFor(curi.getUURI());
    if (!host.hasBeenLookedUp()) {
        // IP has not been looked up yet.
        return true;
    }
    if (host.getIpTTL() == CrawlHost.IP_NEVER_EXPIRES) {
        // IP never expires (numeric IP)
        return false;
    }
    long duration = getIpValidityDurationSeconds();
    if (duration == 0) {
        // Never expire ip if duration is null (set by user or more likely,
        // set to zero in case where we tried in FetchDNS but failed).
        return false;
    }
    long ttl = host.getIpTTL();
    if (ttl > duration) {
        // Use the larger of the operator-set minimum duration
        // or the DNS record TTL
        duration = ttl;
    }
    // Duration and ttl are in seconds. Convert to millis.
    if (duration > 0) {
        duration *= 1000;
    }
    return (duration + host.getIpFetched()) < System.currentTimeMillis();
}
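The expiry arithmetic above reduces to a few lines: take the larger of the DNS TTL and the operator-set minimum, convert to milliseconds, and compare against the lookup timestamp. The following standalone sketch shows just that arithmetic; the class and method names are hypothetical, and the -1 sentinel is my assumption mirroring CrawlHost.IP_NEVER_EXPIRES.

```java
public class IpExpirySketch {
    // sentinel for "numeric IP, never re-resolve" (assumed value)
    static final long IP_NEVER_EXPIRES = -1;

    // ttlSeconds: TTL from the DNS record; minDurationSeconds: operator-set
    // minimum validity; ipFetchedMillis: when the lookup was performed.
    public static boolean isExpired(long ttlSeconds, long minDurationSeconds,
                                    long ipFetchedMillis, long nowMillis) {
        if (ttlSeconds == IP_NEVER_EXPIRES) {
            return false; // numeric IP, never expires
        }
        if (minDurationSeconds == 0) {
            return false; // duration 0 means "never expire"
        }
        // use the larger of the operator minimum and the DNS TTL
        long duration = Math.max(minDurationSeconds, ttlSeconds);
        // seconds -> millis, then compare with the lookup time
        return duration * 1000 + ipFetchedMillis < nowMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // looked up two hours ago with a one-hour TTL and minimum
        System.out.println(isExpired(3600, 3600, now - 2 * 3600 * 1000, now));
    }
}
```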
If the IP lookup is missing or expired, the current CrawlURI's prerequisite is set to dns:host. The CrawlURI's markPrerequisite(String preq) method is as follows:
/**
 * Do all actions associated with setting a <code>CrawlURI</code> as
 * requiring a prerequisite.
 *
 * @param preq Object to set a prerequisite.
 * @return the newly created prerequisite CrawlURI
 * @throws URIException
 */
public CrawlURI markPrerequisite(String preq) throws URIException {
    UURI src = getUURI();
    UURI dest = UURIFactory.getInstance(preq);
    LinkContext lc = LinkContext.PREREQ_MISC;
    Hop hop = Hop.PREREQ;
    Link link = new Link(src, dest, lc, hop);
    CrawlURI caUri = createCrawlURI(getBaseURI(), link);
    // TODO: consider moving some of this to candidate-handling
    int prereqPriority = getSchedulingDirective() - 1;
    if (prereqPriority < 0) {
        prereqPriority = 0;
        logger.severe("Unable to promote prerequisite " + caUri +
            " above " + this);
    }
    caUri.setSchedulingDirective(prereqPriority);
    caUri.setForceFetch(true);
    setPrerequisiteUri(caUri);
    incrementDeferrals();
    setFetchStatus(S_DEFERRED);
    return caUri;
}
The method above first creates the prerequisite CrawlURI caUri, gives it a scheduling level one higher than the current URI's, and finally stores it under the key String A_PREREQUISITE_URI = "prerequisite-uri" in the current CrawlURI's Map&lt;String,Object&gt; data member:
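The "one level higher" promotion works because in Heritrix a smaller SchedulingConstants value means higher priority (HIGHEST is 0): markPrerequisite subtracts 1 and clamps at 0. A minimal sketch of that clamping, where promote is a hypothetical helper name:

```java
public class PrereqPrioritySketch {
    // Heritrix SchedulingConstants: HIGHEST=0 ... NORMAL=3;
    // lower value = scheduled sooner
    public static int promote(int schedulingDirective) {
        int prereqPriority = schedulingDirective - 1;
        if (prereqPriority < 0) {
            prereqPriority = 0; // already HIGHEST; cannot promote further
        }
        return prereqPriority;
    }

    public static void main(String[] args) {
        System.out.println(promote(3)); // a NORMAL uri yields a MEDIUM prereq
    }
}
```

This guarantees the prerequisite (e.g. the dns: URI) is fetched before the URI that depends on it, since it always sorts at least one level ahead in the queue.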
/**
 * Set a prerequisite for this URI.
 * <p>
 * A prerequisite is a URI that must be crawled before this URI can be
 * crawled.
 *
 * @param pre CrawlURI to set as prereq.
 */
public void setPrerequisiteUri(CrawlURI pre) {
    getData().put(A_PREREQUISITE_URI, pre);
}
After exiting the current processor chain (the FetchChain), the URI enters the next processor chain (the DispositionChain), where the CandidatesProcessor schedules the prerequisite into a BdbWorkQueue. The relevant code is:
@Override
protected void innerProcess(final CrawlURI curi) throws InterruptedException {
    // Handle any prerequisites when S_DEFERRED for prereqs
    if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
        CrawlURI prereq = curi.getPrerequisiteUri();
        prereq.setFullVia(curi);
        sheetOverlaysManager.applyOverlaysTo(prereq);
        try {
            KeyedProperties.clearOverridesFrom(curi);
            KeyedProperties.loadOverridesFrom(prereq);
            getCandidateChain().process(prereq, null);
            if (prereq.getFetchStatus() >= 0) {
                // schedule into a BdbWorkQueue
                frontier.schedule(prereq);
            } else {
                curi.setFetchStatus(S_PREREQUISITE_UNSCHEDULABLE_FAILURE);
            }
        } finally {
            KeyedProperties.clearOverridesFrom(prereq);
            KeyedProperties.loadOverridesFrom(curi);
        }
        return;
    }
    // remainder of the method omitted
}
Back in the PreconditionEnforcer processor, the second precondition is robots.txt verification; the processing logic is similar to DNS verification. The boolean considerRobotsPreconditions(CrawlURI curi) method is as follows:
/**
 * Consider the robots precondition.
 *
 * @param curi CrawlURI we're checking for any required preconditions.
 * @return True, if this <code>curi</code> has a precondition or processing
 *         should be terminated for some other reason. False if
 *         we can proceed to process this url.
 */
protected boolean considerRobotsPreconditions(CrawlURI curi) {
    // author's modification: skip robots verification entirely
    // (the original logic below is never reached)
    if (true) return false;

    // treat /robots.txt fetches specially
    UURI uuri = curi.getUURI();
    try {
        if (uuri != null && uuri.getPath() != null &&
                curi.getUURI().getPath().equals("/robots.txt")) {
            // allow processing to continue; /robots.txt is itself a prerequisite
            curi.setPrerequisite(true);
            return false;
        }
    } catch (URIException e) {
        logger.severe("Failed get of path for " + curi);
    }

    CrawlServer cs = serverCache.getServerFor(curi.getUURI());
    // require /robots.txt if not present:
    // check whether the cached robots have expired
    if (cs.isRobotsExpired(getRobotsValidityDurationSeconds())) {
        // Need to get robots
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("No valid robots for " + cs + "; deferring " + curi);
        }
        // Robots expired - should be refetched even though its already
        // crawled.
        try {
            String prereq = curi.getUURI().resolve("/robots.txt").toString();
            // set the prerequisite
            curi.markPrerequisite(prereq);
        } catch (URIException e1) {
            logger.severe("Failed resolve using " + curi);
            throw new RuntimeException(e1); // shouldn't ever happen
        }
        return true;
    }

    // test against robots.txt if available
    if (cs.isValidRobots()) {
        String ua = metadata.getUserAgent();
        RobotsPolicy robots = metadata.getRobotsPolicy();
        if (!robots.allows(ua, curi, cs.getRobotstxt())) {
            if (getCalculateRobotsOnly()) {
                // annotate URI as excluded, but continue to process normally
                curi.getAnnotations().add("robotExcluded");
                return false;
            }
            // mark as precluded; in FetchHTTP, this will
            // prevent fetching and cause a skip to the end
            // of processing (unless an intervening processor
            // overrules)
            curi.setFetchStatus(S_ROBOTS_PRECLUDED);
            curi.setError("robots.txt exclusion");
            logger.fine("robots.txt precluded " + curi);
            return true;
        }
        return false;
    }

    // No valid robots found => Attempt to get robots.txt failed
    // curi.skipToPostProcessing();
    curi.setFetchStatus(S_ROBOTS_PREREQUISITE_FAILURE);
    curi.setError("robots.txt prerequisite failed");
    if (logger.isLoggable(Level.FINE)) {
        logger.fine("robots.txt prerequisite failed " + curi);
    }
    return true;
}
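The robots.allows(...) call above delegates to Heritrix's RobotsPolicy; conceptually it checks the URI's path against the Disallow rules that apply to the crawler's user-agent. The following greatly simplified sketch shows only that prefix check (it is not Heritrix's actual parser, and ignores Allow rules, wildcards, and user-agent matching):

```java
import java.util.List;

public class RobotsSketch {
    // disallowedPrefixes: the Disallow path prefixes already parsed
    // for our user-agent from robots.txt
    public static boolean allows(String path, List<String> disallowedPrefixes) {
        for (String prefix : disallowedPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false; // would lead to S_ROBOTS_PRECLUDED
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allows("/public/a.html", List.of("/private/")));
    }
}
```

In the real code, a disallowed URI either gets S_ROBOTS_PRECLUDED, or, with calculateRobotsOnly set, is only annotated "robotExcluded" and fetched anyway.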
The third precondition is authentication; the boolean credentialPrecondition(final CrawlURI curi) method is as follows:
/**
 * Consider credential preconditions.
 *
 * Looks to see if any credential preconditions (e.g. html form login
 * credentials) for this <code>CrawlServer</code>. If there are, have they
 * been run already? If not, make the running of these logins a precondition
 * of accessing any other url on this <code>CrawlServer</code>.
 *
 * <p>
 * One day, do optimization and avoid running the bulk of the code below.
 * Argument for running the code everytime is that overrides and refinements
 * may change what comes back from credential store.
 *
 * @param curi CrawlURI we're checking for any required preconditions.
 * @return True, if this <code>curi</code> has a precondition that needs to
 *         be met before we can proceed. False if we can proceed to process
 *         this url.
 */
// Note (author): consider the case where different classKeys map to the
// same host; the check should really be keyed by host.
protected boolean credentialPrecondition(final CrawlURI curi) {
    boolean result = false;
    CredentialStore cs = getCredentialStore();
    if (cs == null) {
        logger.severe("No credential store for " + curi);
        return result;
    }
    // iterate over the Collection<Credential> held by the CredentialStore
    for (Credential c : cs.getAll()) {
        // is the current CrawlURI itself a credential prerequisite?
        if (c.isPrerequisite(curi)) {
            // This credential has a prereq. and this curi is it. Let it
            // through. Add its avatar to the curi as a mark. Also, does
            // this curi need to be posted? Note, we do this test for
            // is it a prereq BEFORE we do the check that curi is of the
            // credential domain because such as yahoo have you go to
            // another domain altogether to login.
            // attach this credential to the CrawlURI
            c.attach(curi);
            curi.setFetchType(CrawlURI.FetchType.HTTP_POST);
            break;
        }
        // the Credential's root URI must match the serverName of the
        // CrawlURI's CrawlServer, i.e. the credential applies to this domain
        if (!c.rootUriMatch(serverCache, curi)) {
            continue;
        }
        // the Credential must define a prerequisite
        // (for form credentials, the login URL)
        if (!c.hasPrerequisite(curi)) {
            continue;
        }
        // Check whether authentication has already been run: the outer loop
        // walks all Credentials from the configuration; authenticated()
        // walks the Set<Credential> already stored on the CrawlURI's
        // CrawlServer. This assumes all CrawlURIs of one domain share the
        // same CrawlServer object (otherwise other CrawlURIs of the same
        // domain would have to authenticate again).
        if (!authenticated(c, curi)) {
            // Hasn't been authenticated. Queue it and move on (Assumption
            // is that we can do one authentication at a time -- usually one
            // html form).
            // the login URL
            String prereq = c.getPrerequisite(curi);
            if (prereq == null || prereq.length() <= 0) {
                CrawlServer server = serverCache.getServerFor(curi.getUURI());
                logger.severe(server.getName() + " has " +
                    " credential(s) of type " + c + " but prereq" +
                    " is null.");
            } else {
                try {
                    // set the prerequisite
                    curi.markPrerequisite(prereq);
                } catch (URIException e) {
                    logger.severe("unable to set credentials prerequisite " + prereq);
                    loggerModule.logUriError(e, curi.getUURI(), prereq);
                    return false;
                }
                result = true;
                if (logger.isLoggable(Level.FINE)) {
                    logger.fine("Queueing prereq " + prereq + " of type " +
                        c + " for " + curi);
                }
                break;
            }
        }
    }
    return result;
}
The boolean authenticated(final Credential credential, final CrawlURI curi) method checks whether authentication has already been performed, i.e. whether the CrawlServer object already holds the given Credential:
/**
 * Has passed credential already been authenticated.
 *
 * @param credential Credential to test.
 * @param curi CrawlURI.
 * @return True if already run.
 */
protected boolean authenticated(final Credential credential, final CrawlURI curi) {
    // get the CrawlServer for the current CrawlURI
    CrawlServer server = serverCache.getServerFor(curi.getUURI());
    if (!server.hasCredentials()) {
        return false;
    }
    // the Set<Credential> already persisted on the CrawlServer
    Set<Credential> credentials = server.getCredentials();
    for (Credential cred : credentials) {
        // same key and same credential type
        if (cred.getKey().equals(credential.getKey()) &&
                cred.getClass().isInstance(credential)) {
            return true;
        }
    }
    return false;
}
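The matching rule above (same key and same concrete credential type, via getClass().isInstance(...)) can be shown in isolation. The classes below are stand-ins I invented for illustration, not Heritrix's Credential hierarchy:

```java
import java.util.Set;

public class AuthSketch {
    // minimal stand-in for Heritrix's Credential:
    // only the key and the runtime class matter for the check
    static class Cred {
        final String key;
        Cred(String key) { this.key = key; }
    }
    static class FormCred extends Cred {
        FormCred(String key) { super(key); }
    }

    public static boolean authenticated(Cred candidate, Set<Cred> stored) {
        for (Cred cred : stored) {
            // mirrors cred.getKey().equals(...) && cred.getClass().isInstance(...)
            if (cred.key.equals(candidate.key)
                    && cred.getClass().isInstance(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Cred> stored = Set.of(new FormCred("login"));
        System.out.println(authenticated(new FormCred("login"), stored));
    }
}
```

Note the direction of the type test: a plain Cred candidate does not match a stored FormCred even with the same key, which is why a different credential type for the same server triggers authentication again.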
---------------------------------------------------------------------------
This Heritrix 3.1.0 source-code analysis series is my original work.
Please credit the source when reprinting: cnblogs (博客园), 刺猬的温驯
Original link: http://www.cnblogs.com/chenying99/archive/2013/04/30/3052319.html