• Heritrix 3.1.0 Source Code Analysis (13)


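    Next, we look at the void finished(CrawlURI curi) method of the BdbFrontier class, which wraps up the processing of a CrawlURI object.

    The method is defined in AbstractFrontier, the parent of BdbFrontier's parent class: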

    org.archive.crawler.frontier.BdbFrontier

          org.archive.crawler.frontier.AbstractFrontier

    /**
         * Note that the previously emitted CrawlURI has completed
         * its processing (for now).
         *
         * The CrawlURI may be scheduled to retry, if appropriate,
         * and other related URIs may become eligible for release
         * via the next next() call, as a result of finished().
         *
         *  (non-Javadoc)
         * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
         */
        public void finished(CrawlURI curi) {
            try {
                KeyedProperties.loadOverridesFrom(curi);
                processFinish(curi);
            } finally {
                KeyedProperties.clearOverridesFrom(curi); 
            }
        }
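
The try/finally shape of finished() guarantees that per-URI overrides are removed even when the processing step throws. Below is a minimal sketch of that pattern, assuming a thread-local override map. The class OverrideScopeSketch, its methods, and the setting value 3000 are hypothetical illustrations, not the actual KeyedProperties implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of scoped per-URI overrides; not Heritrix code. */
public class OverrideScopeSketch {

    // Overrides active for the current thread (stand-in for a CrawlURI's overlay settings).
    private static final ThreadLocal<Map<String, Object>> OVERRIDES =
            ThreadLocal.withInitial(HashMap::new);

    // Global defaults used when no override is active (illustrative value).
    private static final Map<String, Object> DEFAULTS = new HashMap<>();
    static {
        DEFAULTS.put("balanceReplenishAmount", 3000);
    }

    static void loadOverrides(Map<String, Object> perUri) {
        OVERRIDES.get().putAll(perUri);
    }

    static void clearOverrides() {
        OVERRIDES.get().clear();
    }

    /** Returns the override if one is active, otherwise the default. */
    static Object get(String key) {
        Map<String, Object> active = OVERRIDES.get();
        return active.containsKey(key) ? active.get(key) : DEFAULTS.get(key);
    }

    /** Mirrors the shape of finished(): load, process, always clear. */
    static Object finished(Map<String, Object> perUri) {
        try {
            loadOverrides(perUri);
            return get("balanceReplenishAmount"); // "processing" sees the override
        } finally {
            clearOverrides();                     // runs even if processing throws
        }
    }

    public static void main(String[] args) {
        Map<String, Object> perUri = new HashMap<>();
        perUri.put("balanceReplenishAmount", 500);
        System.out.println(finished(perUri));              // prints 500
        System.out.println(get("balanceReplenishAmount")); // prints 3000
    }
}
```

Because the clearing happens in finally, a setting read after finished() returns falls back to the default again, which is the property the real method relies on.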

    finished() then calls void processFinish(CrawlURI curi), which BdbFrontier inherits from its parent class WorkQueueFrontier:

    org.archive.crawler.frontier.BdbFrontier

                    org.archive.crawler.frontier.WorkQueueFrontier

    /**
         * Note that the previously emitted CrawlURI has completed
         * its processing (for now).
         *
         * The CrawlURI may be scheduled to retry, if appropriate,
         * and other related URIs may become eligible for release
         * via the next next() call, as a result of finished().
         *
         * TODO: make as many decisions about what happens to the CrawlURI
         * (success, failure, retry) and queue (retire, snooze, ready) as 
         * possible elsewhere, such as in DispositionProcessor. Then, break
         * this into simple branches or focused methods for each case. 
         *  
         * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
         */
        protected void processFinish(CrawlURI curi) {
    //        assert Thread.currentThread() == managerThread;
            
            long now = System.currentTimeMillis();
    
            curi.incrementFetchAttempts();
            logNonfatalErrors(curi);
            
            WorkQueue wq = (WorkQueue) curi.getHolder();
            // always refresh budgeting values from current curi
            // (whose overlay settings should be active here)
            wq.setSessionBudget(getBalanceReplenishAmount());
            wq.setTotalBudget(getQueueTotalBudget());
            
            assert (wq.peek(this) == curi) : "unexpected peek " + wq;
    
            int holderCost = curi.getHolderCost();
    
            if (needsReenqueuing(curi)) {
                // codes/errors which don't consume the URI, leaving it atop queue
                if(curi.getFetchStatus()!=S_DEFERRED) {
                    wq.expend(holderCost); // all retries but DEFERRED cost
                }
                long delay_ms = retryDelayFor(curi) * 1000;
                curi.processingCleanup(); // lose state that shouldn't burden retry
                wq.unpeek(curi);
                wq.update(this, curi); // rewrite any changes
                handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
                appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DEFERRED_FOR_RETRY));
                doJournalReenqueued(curi);
                wq.makeDirty();
                return; // no further dequeueing, logging, rescheduling to occur
            }
    
            // Curi will definitely be disposed of without retry, so remove from queue
            wq.dequeue(this,curi);
            decrementQueuedCount(1);
            largestQueues.update(wq.getClassKey(), wq.getCount());
            log(curi);
    
            
            if (curi.isSuccess()) {
                // codes deemed 'success' 
                incrementSucceededFetchCount();
                totalProcessedBytes.addAndGet(curi.getRecordedSize());
                appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,SUCCEEDED));
                doJournalFinishedSuccess(curi);
               
            } else if (isDisregarded(curi)) {
                // codes meaning 'undo' (even though URI was enqueued, 
                // we now want to disregard it from normal success/failure tallies)
                // (eg robots-excluded, operator-changed-scope, etc)
                incrementDisregardedUriCount();
                appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DISREGARDED));
                holderCost = 0; // no charge for disregarded URIs
                // TODO: consider reinstating forget-URI capability, so URI could be
                // re-enqueued if discovered again
                doJournalDisregarded(curi);
                
            } else {
                // codes meaning 'failure'
                incrementFailedFetchCount();
                appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,FAILED));
                // if exception, also send to crawlErrors
                if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
                    Object[] array = { curi };
                    loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                            .toString(), array);
                }        
                // charge queue any extra error penalty
                wq.noteError(getErrorPenaltyAmount());
                doJournalFinishedFailure(curi);
                
            }
    
            wq.expend(holderCost); // successes & failures charge cost to queue
            
            long delay_ms = curi.getPolitenessDelay();
            handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
            wq.makeDirty();
            
            if(curi.getRescheduleTime()>0) {
                // marked up for forced-revisit at a set time
                curi.processingCleanup();
                curi.resetForRescheduling(); 
                futureUris.put(curi.getRescheduleTime(),curi);
                futureUriCount.incrementAndGet(); 
            } else {
                curi.stripToMinimal();
                curi.processingCleanup();
            }
        }
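
The branching in processFinish() comes down to one decision per URI: retry, success, disregard, or failure, each with its own effect on the queue's budget. The sketch below restates only that accounting, under simplifying assumptions. The Outcome enum, the QueueBudget class, and the settle method are hypothetical names, and the real WorkQueue bookkeeping is more involved.

```java
/** Hypothetical restatement of the cost accounting in processFinish(); not Heritrix code. */
public class DispositionSketch {

    /** Simplified input: what the fetch attempt turned out to be. */
    enum Outcome { RETRY, RETRY_DEFERRED, SUCCESS, DISREGARD, FAILURE }

    /** Mirrors the four events published in the listing above. */
    enum Disposition { DEFERRED_FOR_RETRY, SUCCEEDED, DISREGARDED, FAILED }

    /** Simplified stand-in for the budget fields of a work queue. */
    static final class QueueBudget {
        long expended;   // total cost charged so far
        long penalties;  // total error penalty noted so far
        int errors;      // number of failures noted

        void expend(int cost) { expended += cost; }

        void noteError(int penalty) { errors++; penalties += penalty; }
    }

    static Disposition settle(Outcome outcome, int holderCost, int errorPenalty, QueueBudget q) {
        switch (outcome) {
            case RETRY_DEFERRED:
                // deferred retries are the one case that costs nothing
                return Disposition.DEFERRED_FOR_RETRY;
            case RETRY:
                q.expend(holderCost);
                return Disposition.DEFERRED_FOR_RETRY;
            case SUCCESS:
                q.expend(holderCost);
                return Disposition.SUCCEEDED;
            case DISREGARD:
                q.expend(0); // no charge for disregarded URIs
                return Disposition.DISREGARDED;
            default: // FAILURE
                q.noteError(errorPenalty);
                q.expend(holderCost);
                return Disposition.FAILED;
        }
    }

    public static void main(String[] args) {
        QueueBudget q = new QueueBudget();
        for (Outcome o : Outcome.values()) {
            System.out.println(o + " -> " + settle(o, 1, 2, q));
        }
        // RETRY, SUCCESS and FAILURE each charge 1; the other two charge nothing
        System.out.println("expended=" + q.expended + " penalties=" + q.penalties);
    }
}
```

Separating the classification from the bookkeeping in this way is roughly what the TODO in the Javadoc above suggests: decide the disposition elsewhere, then keep each branch small.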

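    The method above first reads the holder property of the CrawlURI, which is the BdbWorkQueue object for that CrawlURI's class key. (This touches on how Heritrix 3.1.0 schedules its work queues, which a later article will analyze.)

    If the URI does not need to be re-enqueued for a retry, the method then calls the BdbWorkQueue object's synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method, defined in the parent class WorkQueue: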

    org.archive.crawler.frontier.BdbWorkQueue

          org.archive.crawler.frontier.WorkQueue

    /**
         * Remove the peekItem from the queue and adjusts the count.
         * 
         * @param frontier  Work queues manager.
         */
        protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
            try {
                deleteItem(frontier, peekItem);
            } catch (IOException e) {
                //FIXME better exception handling
                e.printStackTrace();
                throw new RuntimeException(e);
            }
            unpeek(expected);
            count--;
            lastDequeueTime = System.currentTimeMillis();
        }
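
The difference between the retry path (unpeek) and the disposal path (dequeue) is easier to see on a small in-memory queue. The following is a sketch only: PeekQueueSketch is a hypothetical class that uses an ArrayDeque of strings in place of the BDB-backed store, and it throws where the listing above relies on an assert.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical in-memory illustration of the peek / unpeek / dequeue contract. */
public class PeekQueueSketch {

    private final Deque<String> items = new ArrayDeque<>();
    private String peekItem; // the item currently handed out, if any
    private long count;

    synchronized void enqueue(String uri) {
        items.addLast(uri);
        count++;
    }

    /** Hands out the head item without removing it. */
    synchronized String peek() {
        if (peekItem == null) {
            peekItem = items.peekFirst();
        }
        return peekItem;
    }

    /** Retry path: forget the handed-out item but leave it atop the queue. */
    synchronized void unpeek(String expected) {
        if (peekItem == null || !peekItem.equals(expected)) {
            throw new IllegalStateException("unexpected peekItem: " + peekItem);
        }
        peekItem = null;
    }

    /** Disposal path: remove the handed-out item and adjust the count. */
    synchronized void dequeue(String expected) {
        if (peekItem == null || !peekItem.equals(expected)) {
            throw new IllegalStateException("unexpected peekItem: " + peekItem);
        }
        items.removeFirst();
        peekItem = null;
        count--;
    }

    synchronized long getCount() {
        return count;
    }

    public static void main(String[] args) {
        PeekQueueSketch q = new PeekQueueSketch();
        q.enqueue("http://example.com/a");
        q.enqueue("http://example.com/b");

        String first = q.peek();
        q.unpeek(first);                // retry: item stays on the queue
        System.out.println(q.peek());   // prints http://example.com/a
        q.dequeue(first);               // finished: item is removed
        System.out.println(q.peek());   // prints http://example.com/b
        System.out.println(q.getCount()); // prints 1
    }
}
```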

    org.archive.crawler.frontier.BdbWorkQueue

    protected void deleteItem(final WorkQueueFrontier frontier,
                final CrawlURI peekItem) throws IOException {
            try {
                final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                    .getWorkQueues();
                 queues.delete(peekItem);
            } catch (DatabaseException e) {
                throw new IOException(e);
            }
        }
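
Taken together, deleteItem() and dequeue() translate a storage-level exception twice: first into an IOException, then into a RuntimeException, with the original kept as the cause each time. Here is a small sketch of that chain. StoreException and the Store interface are hypothetical stand-ins for the database layer, not real BDB or Heritrix types.

```java
import java.io.IOException;

/** Hypothetical sketch of the two-step exception translation shown above. */
public class ExceptionWrappingSketch {

    /** Stand-in for the storage layer's own exception type. */
    static class StoreException extends Exception {
        StoreException(String message) {
            super(message);
        }
    }

    /** Stand-in for the object that actually deletes the stored item. */
    interface Store {
        void delete(String key) throws StoreException;
    }

    /** Like deleteItem(): converts the storage exception into an IOException. */
    static void deleteItem(Store store, String key) throws IOException {
        try {
            store.delete(key);
        } catch (StoreException e) {
            throw new IOException(e); // keep the original as the cause
        }
    }

    /** Like dequeue(): callers see only an unchecked exception. */
    static void dequeue(Store store, String key) {
        try {
            deleteItem(store, key);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Store failing = key -> {
            throw new StoreException("delete failed for " + key);
        };
        try {
            dequeue(failing, "uri-1");
        } catch (RuntimeException e) {
            // RuntimeException -> IOException -> StoreException
            System.out.println(e.getCause().getCause().getMessage());
        }
    }
}
```

Since each wrapper passes the previous exception as its cause, the root failure remains reachable through getCause() even though the caller only sees a RuntimeException.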

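    Finally, the void delete(CrawlURI item) method of the BdbMultipleWorkQueues object is called. An earlier article in this series already covered that method, so it is not repeated here.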

    ---------------------------------------------------------------------------

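    This Heritrix 3.1.0 source code analysis series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html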

  • Original article: https://www.cnblogs.com/chenying99/p/3025419.html