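The previous article analyzed how Heritrix 3.1.0 adds CrawlURI curi objects. So how are the seed CrawlURI curi objects loaded when the system initializes?
Recall from the earlier articles that when we issue the launch command for a crawl job, what actually gets called is the void requestCrawlStart() method of the CrawlController object: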
/**
 * Operator requested crawl begin
 */
public void requestCrawlStart() {
    hasStarted = true;
    sendCrawlStateChangeEvent(State.PREPARING, CrawlStatus.PREPARING);

    if(recoveryCheckpoint==null) {
        // only announce (trigger scheduling of) seeds
        // when doing a cold (non-recovery) start
        getSeeds().announceSeeds();
    }

    setupToePool();

    // A proper exit will change this value.
    this.sExit = CrawlStatus.FINISHED_ABNORMAL;

    if (getPauseAtStart()) {
        // frontier is already paused unless started, so just
        // 'complete'/ack pause
        completePause();
    } else {
        getFrontier().run();
    }
}
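On a cold (non-recovery) start, this goes on to call getSeeds().announceSeeds(). The object actually returned by getSeeds() here is a TextSeedModule (injected by Spring), and its void announceSeeds() method is what runs: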
/**
 * Announce all seeds from configured source to SeedListeners
 * (including nonseed lines mixed in).
 * @see org.archive.modules.seeds.SeedModule#announceSeeds()
 */
public void announceSeeds() {
    if(getBlockAwaitingSeedLines()>-1) {
        final CountDownLatch latch = new CountDownLatch(getBlockAwaitingSeedLines());
        new Thread(){
            @Override
            public void run() {
                announceSeeds(latch);
                while(latch.getCount()>0) {
                    latch.countDown();
                }
            }
        }.start();
        try {
            latch.await();
        } catch (InterruptedException e) {
            // do nothing
        }
    } else {
        announceSeeds(null);
    }
}
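In the if branch above, the CountDownLatch latch is a countdown latch rather than a thread counter: seeds are announced on a new thread, and the caller blocks only until getBlockAwaitingSeedLines() lines have been processed (or the seed source runs out, at which point the remaining count is drained). In the else branch null is passed and the seeds are announced synchronously. Either way, the next call is void announceSeeds(CountDownLatch latchOrNull):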
protected void announceSeeds(CountDownLatch latchOrNull) {
    BufferedReader reader = new BufferedReader(textSource.obtainReader());
    try {
        announceSeedsFromReader(reader,latchOrNull);
    } finally {
        IOUtils.closeQuietly(reader);
    }
}
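The latch behavior described above can be tried out in isolation. The following is a minimal sketch, not Heritrix code: the class LatchedAnnounceDemo and its announce method are hypothetical names, and plain strings stand in for seed lines. It only illustrates how a CountDownLatch lets the caller resume after a given number of lines while a worker thread keeps going.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; it mimics only the latch handling of announceSeeds().
class LatchedAnnounceDemo {

    /**
     * Handles the lines on a worker thread and blocks the caller until
     * blockAwaiting lines are done (or the lines run out).
     * Returns {lines handled when the caller was released, lines handled in total}.
     */
    static int[] announce(List<String> lines, int blockAwaiting) {
        List<String> handled = new CopyOnWriteArrayList<>();
        CountDownLatch latch = new CountDownLatch(blockAwaiting);

        Thread worker = new Thread(() -> {
            for (String line : lines) {
                handled.add(line);   // stand-in for seedLine()/nonseedLine()
                latch.countDown();   // one count per processed line
            }
            // Fewer lines than expected: drain the latch so the caller is released.
            while (latch.getCount() > 0) {
                latch.countDown();
            }
        });
        worker.start();

        try {
            latch.await();                         // caller resumes here
            int handledAtRelease = handled.size();
            worker.join();                         // demo only: wait to read the final total
            return new int[] { handledAtRelease, handled.size() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        int[] r = announce(Arrays.asList("a", "b", "c", "d", "e"), 2);
        System.out.println("handled at release: " + r[0] + ", handled in total: " + r[1]);
    }
}
```

The first number is at least the requested count but otherwise depends on thread timing, which is the point: the caller is guaranteed a minimum amount of progress and nothing more.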
This first obtains a Reader (a StringReader) from the ReadSource textSource (an org.archive.spring.ConfigString), then calls void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull):
/**
 * Announce all seeds (and nonseed possible-directive lines) from
 * the given Reader
 * @param reader source of seed/directive lines
 * @param latchOrNull if non-null, sent countDown after each line, allowing
 * another thread to proceed after a configurable number of lines processed
 */
protected void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull) {
    String s;
    Iterator<String> iter =
        new RegexLineIterator(
                new LineReadingIterator(reader),
                RegexLineIterator.COMMENT_LINE,
                RegexLineIterator.NONWHITESPACE_ENTRY_TRAILING_COMMENT,
                RegexLineIterator.ENTRY);

    int count = 0;
    while (iter.hasNext()) {
        s = (String) iter.next();
        if(Character.isLetterOrDigit(s.charAt(0))) {
            // consider a likely URI
            seedLine(s);
            count++;
            if(count%20000==0) {
                System.runFinalization();
            }
        } else {
            // report just in case it's a useful directive
            nonseedLine(s);
        }
        if(latchOrNull!=null) {
            latchOrNull.countDown();
        }
    }
    publishConcludedSeedBatch();
}
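The line-by-line decision in announceSeedsFromReader can be approximated in plain Java. This is a rough sketch under stated assumptions: SeedLineClassifierDemo, classify and stripComment are hypothetical names, and the comment handling is a simplification that stands in for RegexLineIterator rather than reproducing its regular expressions. Only the Character.isLetterOrDigit check on the first character is taken from the method above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; an approximation of the seed/nonseed split.
class SeedLineClassifierDemo {

    /** Labels each usable line as "seed" or "nonseed". */
    static List<String> classify(String seedText) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(seedText))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = stripComment(line);
                if (entry.isEmpty()) {
                    continue; // blank line or comment-only line
                }
                if (Character.isLetterOrDigit(entry.charAt(0))) {
                    result.add("seed: " + entry);     // would go to seedLine()
                } else {
                    result.add("nonseed: " + entry);  // would go to nonseedLine()
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /** Simplified assumption: '#' starts a comment at line start or after a space. */
    static String stripComment(String line) {
        String trimmed = line.trim();
        if (trimmed.startsWith("#")) {
            return "";
        }
        int idx = trimmed.indexOf(" #");
        return idx >= 0 ? trimmed.substring(0, idx).trim() : trimmed;
    }

    public static void main(String[] args) {
        String text = "# comment\nhttp://example.com/ # home\n\n+http://example.org/\nexample.net\n";
        for (String s : classify(text)) {
            System.out.println(s);
        }
    }
}
```

With the sample text in main, the two lines starting with a letter are labeled as seeds and the line starting with + is labeled as a nonseed line.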
This iterates over the lines and, for each one that starts with a letter or digit (a likely URL), calls void seedLine(String uri):
/**
 * Handle a read line that is probably a seed.
 *
 * @param uri String seed-containing line
 */
protected void seedLine(String uri) {
    if (!uri.matches("[a-zA-Z][\\w+\\-]+:.*")) { // Rfc2396 s3.1 scheme,
                                                 // minus '.'
        // Does not begin with scheme, so try http://
        uri = "http://" + uri;
    }
    try {
        UURI uuri = UURIFactory.getInstance(uri);
        CrawlURI curi = new CrawlURI(uuri);
        curi.setSeed(true);
        curi.setSchedulingDirective(SchedulingConstants.MEDIUM);
        if (getSourceTagSeeds()) {
            curi.setSourceTag(curi.toString());
        }
        publishAddedSeed(curi);
    } catch (URIException e) {
        // try as nonseed line as fallback
        nonseedLine(uri);
    }
}
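To see what the scheme check at the top of seedLine does with different inputs, the regular expression can be run on its own. A small sketch follows; SeedSchemeDemo and withDefaultScheme are hypothetical names, and only the regex and the http:// fallback are copied from the method above.

```java
// Hypothetical demo class; isolates the scheme check used in seedLine().
class SeedSchemeDemo {

    /** Prepends http:// when the line does not start with something scheme-like. */
    static String withDefaultScheme(String uri) {
        if (!uri.matches("[a-zA-Z][\\w+\\-]+:.*")) {
            return "http://" + uri;
        }
        return uri;
    }

    public static void main(String[] args) {
        String[] samples = {
            "example.com",
            "https://example.com/",
            "example.com:8080/a",
            "localhost:8080"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + withDefaultScheme(s));
        }
    }
}
```

One consequence worth noticing: a bare host:port such as localhost:8080 contains no dot before the colon, so as far as this regex alone is concerned it already looks like it starts with a scheme and is left unchanged, whereas example.com:8080/a gets the http:// prefix.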
Finally, it calls void publishAddedSeed(CrawlURI curi) of the parent class SeedModule (the observer pattern):
protected void publishAddedSeed(CrawlURI curi) {
    for (SeedListener l: seedListeners) {
        l.addedSeed(curi);
    }
}
The BdbFrontier class implements the SeedListener interface indirectly, through the void addedSeed(CrawlURI puri) method of the abstract class AbstractFrontier:
/**
 * When notified of a seed via the SeedListener interface,
 * schedule it.
 *
 * @see org.archive.modules.seeds.SeedListener#addedSeed(org.archive.modules.CrawlURI)
 */
public void addedSeed(CrawlURI puri) {
    schedule(puri);
}
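The publish/listen relationship between the seed module and the frontier can be reduced to a few lines. The sketch below is illustrative only: SeedObserverDemo and its nested Listener, Module and Frontier types are hypothetical stand-ins for SeedListener, SeedModule and AbstractFrontier, and a String replaces CrawlURI.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical demo class; a stripped-down version of the observer wiring.
class SeedObserverDemo {

    /** Stand-in for SeedListener. */
    interface Listener {
        void addedSeed(String uri);
    }

    /** Stand-in for SeedModule: keeps listeners and notifies each of them. */
    static class Module {
        private final Set<Listener> listeners = new LinkedHashSet<>();

        void addListener(Listener l) {
            listeners.add(l);
        }

        void publishAddedSeed(String uri) {
            for (Listener l : listeners) {
                l.addedSeed(uri);
            }
        }
    }

    /** Stand-in for the frontier: being told about a seed means scheduling it. */
    static class Frontier implements Listener {
        final Queue<String> scheduled = new ArrayDeque<>();

        @Override
        public void addedSeed(String uri) {
            schedule(uri);
        }

        void schedule(String uri) {
            scheduled.add(uri);
        }
    }

    public static void main(String[] args) {
        Module module = new Module();
        Frontier frontier = new Frontier();
        module.addListener(frontier);
        module.publishAddedSeed("http://example.com/");
        module.publishAddedSeed("http://example.org/");
        System.out.println("scheduled: " + frontier.scheduled);
    }
}
```

The module never references the frontier type directly, so any number of other listeners could be registered alongside it without changing the seed-reading code.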
---------------------------------------------------------------------------
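This Heritrix 3.1.0 source code analysis series is my own original work.
When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/20/3031924.html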