Heritrix 3.1.0 Source Code Analysis (Part 13)

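Next we look at BdbFrontier's void finished(CrawlURI curi) method, which wraps up the processing of a CrawlURI object.

BdbFrontier does not define this method itself. It is inherited from AbstractFrontier, the parent of BdbFrontier's parent class.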

org.archive.crawler.frontier.BdbFrontier

org.archive.crawler.frontier.AbstractFrontier

    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     *  (non-Javadoc)
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    public void finished(CrawlURI curi) {
        try {
            KeyedProperties.loadOverridesFrom(curi);
            processFinish(curi);
        } finally {
            KeyedProperties.clearOverridesFrom(curi); 
        }
    }
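
The try/finally shape of finished() guarantees that the overrides loaded from the CrawlURI are cleared again even if processFinish throws. Below is a minimal sketch of that pattern, assuming the overrides are scoped to the current thread. OverrideScope, FinishedDemo and the "budget" key are hypothetical names made up for illustration; this is not the actual KeyedProperties implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-thread stack of setting overrides.
final class OverrideScope {
    private static final ThreadLocal<Deque<Map<String, Object>>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Make the given overrides visible to lookups on this thread.
    static void load(Map<String, Object> overrides) {
        STACK.get().push(overrides);
    }

    // Remove the given overrides again.
    static void clear(Map<String, Object> overrides) {
        STACK.get().remove(overrides);
    }

    // The most recently loaded override wins; otherwise use the default.
    static Object lookup(String key, Object defaultValue) {
        for (Map<String, Object> m : STACK.get()) {
            if (m.containsKey(key)) {
                return m.get(key);
            }
        }
        return defaultValue;
    }
}

final class FinishedDemo {
    // Mirrors the shape of finished(): load, do the work, always clear.
    static int budgetSeen(Map<String, Object> overrides, boolean fail) {
        try {
            OverrideScope.load(overrides);
            if (fail) {
                throw new IllegalStateException("simulated processing failure");
            }
            return (Integer) OverrideScope.lookup("budget", 3000);
        } finally {
            OverrideScope.clear(overrides);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> overrides = new HashMap<>();
        overrides.put("budget", 500);
        System.out.println("inside scope:  " + budgetSeen(overrides, false));
        System.out.println("outside scope: " + OverrideScope.lookup("budget", 3000));
    }
}
```

Because the clear call sits in the finally block, a failure inside the work step cannot leak one URI's settings into the processing of the next URI on the same thread.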

finished() then calls void processFinish(CrawlURI curi), which BdbFrontier inherits from its parent class WorkQueueFrontier.

org.archive.crawler.frontier.BdbFrontier

org.archive.crawler.frontier.WorkQueueFrontier

    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     * TODO: make as many decisions about what happens to the CrawlURI
     * (success, failure, retry) and queue (retire, snooze, ready) as 
     * possible elsewhere, such as in DispositionProcessor. Then, break
     * this into simple branches or focused methods for each case. 
     *  
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    protected void processFinish(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;
        
        long now = System.currentTimeMillis();

        curi.incrementFetchAttempts();
        logNonfatalErrors(curi);
        
        WorkQueue wq = (WorkQueue) curi.getHolder();
        // always refresh budgeting values from current curi
        // (whose overlay settings should be active here)
        wq.setSessionBudget(getBalanceReplenishAmount());
        wq.setTotalBudget(getQueueTotalBudget());
        
        assert (wq.peek(this) == curi) : "unexpected peek " + wq;

        int holderCost = curi.getHolderCost();

        if (needsReenqueuing(curi)) {
            // codes/errors which don't consume the URI, leaving it atop queue
            if(curi.getFetchStatus()!=S_DEFERRED) {
                wq.expend(holderCost); // all retries but DEFERRED cost
            }
            long delay_ms = retryDelayFor(curi) * 1000;
            curi.processingCleanup(); // lose state that shouldn't burden retry
            wq.unpeek(curi);
            wq.update(this, curi); // rewrite any changes
            handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DEFERRED_FOR_RETRY));
            doJournalReenqueued(curi);
            wq.makeDirty();
            return; // no further dequeueing, logging, rescheduling to occur
        }

        // Curi will definitely be disposed of without retry, so remove from queue
        wq.dequeue(this,curi);
        decrementQueuedCount(1);
        largestQueues.update(wq.getClassKey(), wq.getCount());
        log(curi);

        
        if (curi.isSuccess()) {
            // codes deemed 'success' 
            incrementSucceededFetchCount();
            totalProcessedBytes.addAndGet(curi.getRecordedSize());
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,SUCCEEDED));
            doJournalFinishedSuccess(curi);
           
        } else if (isDisregarded(curi)) {
            // codes meaning 'undo' (even though URI was enqueued, 
            // we now want to disregard it from normal success/failure tallies)
            // (eg robots-excluded, operator-changed-scope, etc)
            incrementDisregardedUriCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DISREGARDED));
            holderCost = 0; // no charge for disregarded URIs
            // TODO: consider reinstating forget-URI capability, so URI could be
            // re-enqueued if discovered again
            doJournalDisregarded(curi);
            
        } else {
            // codes meaning 'failure'
            incrementFailedFetchCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,FAILED));
            // if exception, also send to crawlErrors
            if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
                Object[] array = { curi };
                loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                        .toString(), array);
            }        
            // charge queue any extra error penalty
            wq.noteError(getErrorPenaltyAmount());
            doJournalFinishedFailure(curi);
            
        }

        wq.expend(holderCost); // successes & failures charge cost to queue
        
        long delay_ms = curi.getPolitenessDelay();
        handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
        wq.makeDirty();
        
        if(curi.getRescheduleTime()>0) {
            // marked up for forced-revisit at a set time
            curi.processingCleanup();
            curi.resetForRescheduling(); 
            futureUris.put(curi.getRescheduleTime(),curi);
            futureUriCount.incrementAndGet(); 
        } else {
            curi.stripToMinimal();
            curi.processingCleanup();
        }
    }
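
The branch order in processFinish can be condensed into a small decision function: a URI that needs re-enqueuing stays on its queue (and is charged unless it was merely deferred), while any other URI is removed and counted as succeeded, disregarded or failed. The following is a sketch only. Disposition, FinishOutcome and FinishSketch are hypothetical names, and the boolean parameters stand in for the checks that the real method performs on the CrawlURI.

```java
// Hypothetical names throughout; this only mirrors the branch order of processFinish.
enum Disposition { DEFERRED_FOR_RETRY, SUCCEEDED, DISREGARDED, FAILED }

final class FinishOutcome {
    final Disposition disposition;
    final int charged;              // cost expended against the queue's budget
    final boolean removedFromQueue; // false when the URI stays atop its queue

    FinishOutcome(Disposition disposition, int charged, boolean removedFromQueue) {
        this.disposition = disposition;
        this.charged = charged;
        this.removedFromQueue = removedFromQueue;
    }
}

final class FinishSketch {
    static FinishOutcome classify(boolean needsReenqueuing, boolean deferred,
                                  boolean success, boolean disregarded,
                                  int holderCost) {
        if (needsReenqueuing) {
            // Every retry is charged except a plain deferral.
            int charged = deferred ? 0 : holderCost;
            return new FinishOutcome(Disposition.DEFERRED_FOR_RETRY, charged, false);
        }
        if (success) {
            return new FinishOutcome(Disposition.SUCCEEDED, holderCost, true);
        }
        if (disregarded) {
            // Disregarded URIs (e.g. robots-excluded) are not charged.
            return new FinishOutcome(Disposition.DISREGARDED, 0, true);
        }
        // Failures are charged; the real method also notes an error penalty.
        return new FinishOutcome(Disposition.FAILED, holderCost, true);
    }

    public static void main(String[] args) {
        FinishOutcome o = classify(false, false, false, true, 1);
        System.out.println(o.disposition + " charged=" + o.charged
                + " removed=" + o.removedFromQueue);
    }
}
```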

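processFinish first reads the holder property of the CrawlURI curi, which is the BdbWorkQueue object that corresponds to that CrawlURI's classKey. (This touches on work-queue scheduling in Heritrix 3.1.0, which a later post will analyze.)

If the URI does not need to be re-enqueued for a retry, processFinish then calls the BdbWorkQueue object's synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method, which is inherited from WorkQueue.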

org.archive.crawler.frontier.BdbWorkQueue

org.archive.crawler.frontier.WorkQueue

    /**
     * Remove the peekItem from the queue and adjusts the count.
     * 
     * @param frontier  Work queues manager.
     */
    protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
        try {
            deleteItem(frontier, peekItem);
        } catch (IOException e) {
            //FIXME better exception handling
            e.printStackTrace();
            throw new RuntimeException(e);
        }
        unpeek(expected);
        count--;
        lastDequeueTime = System.currentTimeMillis();
    }

org.archive.crawler.frontier.BdbWorkQueue

    protected void deleteItem(final WorkQueueFrontier frontier,
            final CrawlURI peekItem) throws IOException {
        try {
            final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
            queues.delete(peekItem);
        } catch (DatabaseException e) {
            throw new IOException(e);
        }
    }

Finally, deleteItem calls void delete(CrawlURI item) on the BdbMultipleWorkQueues object. An earlier post in this series already covered that method, so it is not repeated here.
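
The bookkeeping that dequeue performs (delete the peeked item from storage, clear the peek, decrement the count) can be illustrated with an in-memory stand-in. This is a rough sketch under stated assumptions: PeekQueue is a hypothetical class, a TreeMap takes the place of the Berkeley DB store, and plain strings take the place of CrawlURI objects.

```java
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a work queue with peek/unpeek/dequeue.
final class PeekQueue {
    private final TreeMap<Long, String> store = new TreeMap<>(); // sorted backing store
    private long nextKey;
    private Long peekKey;    // key of the item currently handed out, if any
    private String peekItem; // the item currently handed out, if any
    private long count;

    void enqueue(String uri) {
        store.put(nextKey++, uri);
        count++;
    }

    // Return the head item without removing it; repeated calls return the same item.
    String peek() {
        if (peekItem == null && !store.isEmpty()) {
            peekKey = store.firstKey();
            peekItem = store.get(peekKey);
        }
        return peekItem;
    }

    // Forget the handed-out item; it stays in the store.
    void unpeek(String expected) {
        if (peekItem == null || !peekItem.equals(expected)) {
            throw new IllegalStateException("unexpected peek item");
        }
        peekItem = null;
        peekKey = null;
    }

    // Remove the handed-out item from the store and adjust the count.
    void dequeue(String expected) {
        if (peekItem == null) {
            throw new IllegalStateException("nothing has been peeked");
        }
        store.remove(peekKey);
        unpeek(expected);
        count--;
    }

    long getCount() {
        return count;
    }
}

final class PeekQueueDemo {
    public static void main(String[] args) {
        PeekQueue q = new PeekQueue();
        q.enqueue("http://example.com/a");
        q.enqueue("http://example.com/b");
        String head = q.peek();
        q.dequeue(head);
        System.out.println("next=" + q.peek() + " count=" + q.getCount());
    }
}
```

Keeping the peeked item in the store until dequeue is called is what lets the retry branch of processFinish simply unpeek the URI and leave it at the head of its queue.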

---------------------------------------------------------------------------

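This Heritrix 3.1.0 source code analysis series is my own original work.

When reposting, please credit the source: 刺猬的温驯 on cnblogs (博客园).

Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html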

Original article: https://www.cnblogs.com/chenying99/p/3025419.html