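Next we look at the void finished(CrawlURI curi) method of the BdbFrontier class, which performs the wrap-up work for a CrawlURI object.
The method is defined in AbstractFrontier, the superclass of BdbFrontier's superclass: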
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.AbstractFrontier
/**
 * Note that the previously emitted CrawlURI has completed
 * its processing (for now).
 *
 * The CrawlURI may be scheduled to retry, if appropriate,
 * and other related URIs may become eligible for release
 * via the next next() call, as a result of finished().
 *
 * (non-Javadoc)
 * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
 */
public void finished(CrawlURI curi) {
    try {
        KeyedProperties.loadOverridesFrom(curi);
        processFinish(curi);
    } finally {
        KeyedProperties.clearOverridesFrom(curi);
    }
}
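The try/finally shape of finished() guarantees that the per-URI overrides are cleared even when processFinish throws. The following is only a minimal sketch of that pattern: OverrideScopeSketch, its thread-local stack, and the "budget" key are hypothetical stand-ins, and it makes no claim about how KeyedProperties is actually implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

/** Hypothetical stand-in for per-URI setting overrides; not the Heritrix KeyedProperties class. */
class OverrideScopeSketch {
    // Per-thread stack of active override maps (an assumption of this sketch).
    private static final ThreadLocal<Deque<Map<String, String>>> ACTIVE =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void loadOverrides(Map<String, String> overrides) {
        ACTIVE.get().push(overrides);
    }

    static void clearOverrides() {
        ACTIVE.get().pop();
    }

    /** Returns the innermost active override for key, or defaultValue if none is active. */
    static String lookup(String key, String defaultValue) {
        for (Map<String, String> m : ACTIVE.get()) {
            if (m.containsKey(key)) {
                return m.get(key);
            }
        }
        return defaultValue;
    }

    /** Same shape as finished(): load overrides, do the work, always clear. */
    static String finishedWith(Map<String, String> overrides, boolean fail) {
        loadOverrides(overrides);
        try {
            if (fail) {
                throw new IllegalStateException("simulated failure");
            }
            return lookup("budget", "default");
        } finally {
            clearOverrides(); // runs on both the normal and the exceptional path
        }
    }
}
```

While the work runs, lookups see the override; after the call returns or throws, lookups fall back to the default.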
finished() in turn calls the void processFinish(CrawlURI curi) method of BdbFrontier, which is defined in WorkQueueFrontier, the direct superclass of BdbFrontier:
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.WorkQueueFrontier
/**
 * Note that the previously emitted CrawlURI has completed
 * its processing (for now).
 *
 * The CrawlURI may be scheduled to retry, if appropriate,
 * and other related URIs may become eligible for release
 * via the next next() call, as a result of finished().
 *
 * TODO: make as many decisions about what happens to the CrawlURI
 * (success, failure, retry) and queue (retire, snooze, ready) as
 * possible elsewhere, such as in DispositionProcessor. Then, break
 * this into simple branches or focused methods for each case.
 *
 * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
 */
protected void processFinish(CrawlURI curi) {
    // assert Thread.currentThread() == managerThread;
    long now = System.currentTimeMillis();

    curi.incrementFetchAttempts();
    logNonfatalErrors(curi);

    WorkQueue wq = (WorkQueue) curi.getHolder();
    // always refresh budgeting values from current curi
    // (whose overlay settings should be active here)
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());

    assert (wq.peek(this) == curi) : "unexpected peek " + wq;

    int holderCost = curi.getHolderCost();

    if (needsReenqueuing(curi)) {
        // codes/errors which don't consume the URI, leaving it atop queue
        if (curi.getFetchStatus() != S_DEFERRED) {
            wq.expend(holderCost); // all retries but DEFERRED cost
        }
        long delay_ms = retryDelayFor(curi) * 1000;
        curi.processingCleanup(); // lose state that shouldn't burden retry
        wq.unpeek(curi);
        wq.update(this, curi); // rewrite any changes
        handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, DEFERRED_FOR_RETRY));
        doJournalReenqueued(curi);
        wq.makeDirty();
        return; // no further dequeueing, logging, rescheduling to occur
    }

    // Curi will definitely be disposed of without retry, so remove from queue
    wq.dequeue(this, curi);
    decrementQueuedCount(1);
    largestQueues.update(wq.getClassKey(), wq.getCount());
    log(curi);

    if (curi.isSuccess()) {
        // codes deemed 'success'
        incrementSucceededFetchCount();
        totalProcessedBytes.addAndGet(curi.getRecordedSize());
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, SUCCEEDED));
        doJournalFinishedSuccess(curi);
    } else if (isDisregarded(curi)) {
        // codes meaning 'undo' (even though URI was enqueued,
        // we now want to disregard it from normal success/failure tallies)
        // (eg robots-excluded, operator-changed-scope, etc)
        incrementDisregardedUriCount();
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, DISREGARDED));
        holderCost = 0; // no charge for disregarded URIs
        // TODO: consider reinstating forget-URI capability, so URI could be
        // re-enqueued if discovered again
        doJournalDisregarded(curi);
    } else {
        // codes meaning 'failure'
        incrementFailedFetchCount();
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, FAILED));
        // if exception, also send to crawlErrors
        if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
            Object[] array = { curi };
            loggerModule.getRuntimeErrors().log(Level.WARNING,
                    curi.getUURI().toString(), array);
        }
        // charge queue any extra error penalty
        wq.noteError(getErrorPenaltyAmount());
        doJournalFinishedFailure(curi);
    }

    wq.expend(holderCost); // successes & failures charge cost to queue

    long delay_ms = curi.getPolitenessDelay();
    handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
    wq.makeDirty();

    if (curi.getRescheduleTime() > 0) {
        // marked up for forced-revisit at a set time
        curi.processingCleanup();
        curi.resetForRescheduling();
        futureUris.put(curi.getRescheduleTime(), curi);
        futureUriCount.incrementAndGet();
    } else {
        curi.stripToMinimal();
        curi.processingCleanup();
    }
}
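The branches of processFinish can be summarized as a four-way disposition plus a rule for how much cost is charged to the queue through expend(). The sketch below is a simplified model, not Heritrix code: DispositionSketch and its boolean parameters are hypothetical stand-ins for needsReenqueuing(curi), curi.isSuccess(), isDisregarded(curi) and the S_DEFERRED status check, and the extra error penalty applied through noteError() on failure is deliberately left out.

```java
/** Simplified model of the branches in processFinish(); all names here are hypothetical. */
class DispositionSketch {
    enum Disposition { DEFERRED_FOR_RETRY, SUCCEEDED, DISREGARDED, FAILED }

    /** The flags stand in for needsReenqueuing(curi), curi.isSuccess() and isDisregarded(curi). */
    static Disposition classify(boolean needsRetry, boolean success, boolean disregarded) {
        if (needsRetry) {
            return Disposition.DEFERRED_FOR_RETRY; // URI stays atop its queue
        }
        if (success) {
            return Disposition.SUCCEEDED;
        }
        if (disregarded) {
            return Disposition.DISREGARDED;
        }
        return Disposition.FAILED;
    }

    /**
     * Cost passed to expend() for a given disposition.
     * statusIsDeferred stands in for curi.getFetchStatus() == S_DEFERRED.
     */
    static int chargedCost(Disposition d, int holderCost, boolean statusIsDeferred) {
        switch (d) {
            case DEFERRED_FOR_RETRY:
                return statusIsDeferred ? 0 : holderCost; // all retries but DEFERRED cost
            case DISREGARDED:
                return 0; // no charge for disregarded URIs
            default:
                return holderCost; // successes and failures charge the full cost
        }
    }
}
```

The retry check comes first because it returns early in processFinish: a URI that needs re-enqueuing is never dequeued, logged, or counted as a success or failure.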
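The method above first obtains the holder property of CrawlURI curi (the BdbWorkQueue object that corresponds to this CrawlURI's classkey value; this touches on the work-queue scheduling of Heritrix 3.1.0, which a later post will analyze),
and then, unless the URI has to be re-enqueued for a retry, calls the BdbWorkQueue object's synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method, which is inherited from WorkQueue: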
org.archive.crawler.frontier.BdbWorkQueue
org.archive.crawler.frontier.WorkQueue
/**
 * Remove the peekItem from the queue and adjusts the count.
 *
 * @param frontier Work queues manager.
 */
protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
    try {
        deleteItem(frontier, peekItem);
    } catch (IOException e) {
        //FIXME better exception handling
        e.printStackTrace();
        throw new RuntimeException(e);
    }
    unpeek(expected);
    count--;
    lastDequeueTime = System.currentTimeMillis();
}
org.archive.crawler.frontier.BdbWorkQueue
protected void deleteItem(final WorkQueueFrontier frontier,
        final CrawlURI peekItem) throws IOException {
    try {
        final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
        queues.delete(peekItem);
    } catch (DatabaseException e) {
        throw new IOException(e);
    }
}
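Finally, deleteItem calls the void delete(CrawlURI item) method of the BdbMultipleWorkQueues object. That method was covered in an earlier post of this series, so it is not repeated here.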
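The bookkeeping in dequeue (delete the peeked item, clear the peek, decrement the count, record the time) can be pictured with an in-memory stand-in. This is only a sketch under stated assumptions: PeekQueueSketch and its ArrayDeque backing store are hypothetical, the real BdbWorkQueue keeps its items in Berkeley DB, and the identity check in dequeue is this sketch's own choice.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** In-memory stand-in for a work queue; hypothetical, not the Heritrix WorkQueue class. */
class PeekQueueSketch<T> {
    private final Deque<T> items = new ArrayDeque<>(); // hypothetical backing store
    private T peekItem;           // item currently handed out but still stored in the queue
    private long count;           // number of items in the queue
    private long lastDequeueTime; // wall-clock time of the most recent dequeue

    synchronized void enqueue(T item) {
        items.addLast(item);
        count++;
    }

    /** Hands out the head item without removing it. */
    synchronized T peek() {
        if (peekItem == null) {
            peekItem = items.peekFirst();
        }
        return peekItem;
    }

    /** Removes the peeked item: delete from the store, clear the peek, adjust the count. */
    synchronized void dequeue(T expected) {
        if (peekItem == null || peekItem != expected) {
            throw new IllegalStateException("unexpected peek");
        }
        items.removeFirst();  // plays the role of deleteItem(frontier, peekItem)
        peekItem = null;      // plays the role of unpeek(expected)
        count--;
        lastDequeueTime = System.currentTimeMillis();
    }

    synchronized long getCount() {
        return count;
    }

    synchronized long getLastDequeueTime() {
        return lastDequeueTime;
    }
}
```

A peeked item still counts toward the queue size; only dequeue removes it, which is why processFinish can leave a URI atop its queue when a retry is needed.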
---------------------------------------------------------------------------
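This Heritrix 3.1.0 source code analysis series is my own original work.
If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯
Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html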