• Heritrix 3.1.0 源码解析(十)


    本文要分析的是Heritrix3.1.0的Frontier组件,先熟悉一下相关的UML类图

    通过浏览该图,我们可以清楚的看出Frontier组件的相关接口和类的继承和调用关系,不必我再文字描述了 

    在分析BdbFrontier类的相关方法之前,有必要熟悉一下在系统环境中该对象的初始状态,以及状态的改变

    这部分内容需要回顾 Heritrix 3.1.0 源码解析(六)一文,本人就不在这里重复了 

    BdbFrontier类重要的有几个方法,分别是添加CrawlURI,获取CrawlURI,完成CrawlURI

    void schedule(CrawlURI curi)

    CrawlURI next()

    void finished(CrawlURI curi) 

    下面首先来分析void schedule(CrawlURI curi)方法,在它的父类WorkQueueFrontier里面 

    org.archive.crawler.frontier.BdbFrontier

        org.archive.crawler.frontier.WorkQueueFrontier

    /**
         * Arrange for the given CrawlURI to be visited, if it is not
         * already enqueued/completed. 
         * 
         * Differs from superclass in that it operates in calling thread, rather 
         * than deferring operations via in-queue to managerThread. TODO: settle
         * on either defer or in-thread approach after testing. 
         *
         * @see org.archive.crawler.framework.Frontier#schedule(org.archive.modules.CrawlURI)
         */
        @Override
        public void schedule(CrawlURI curi) {
            sheetOverlaysManager.applyOverlaysTo(curi);
            try {
                KeyedProperties.loadOverridesFrom(curi);
                if(curi.getClassKey()==null) {
                    // remedial processing
                    preparer.prepare(curi);
                }
                processScheduleIfUnique(curi);
            } finally {
                KeyedProperties.clearOverridesFrom(curi); 
            }
        }

    org.archive.crawler.frontier.BdbFrontier

        org.archive.crawler.frontier.WorkQueueFrontier

     /**
         * Arrange for the given CrawlURI to be visited, if it is not
         * already scheduled/completed.
         *
         * @see org.archive.crawler.framework.Frontier#schedule(org.archive.modules.CrawlURI)
         */
        protected void processScheduleIfUnique(CrawlURI curi) {
    //        assert Thread.currentThread() == managerThread;
            assert KeyedProperties.overridesActiveFrom(curi); 
            
            // Canonicalization may set forceFetch flag.  See
            // #canonicalization(CrawlURI) javadoc for circumstance.
            String canon = curi.getCanonicalString();
            if (curi.forceFetch()) {
                uriUniqFilter.addForce(canon, curi);
            } else {
                uriUniqFilter.add(canon, curi);
            }
        }

    void processScheduleIfUnique(CrawlURI curi)方法调用了BdbUriUniqFilter类成员的void add(String key, CrawlURI value)方法

    在分析BdbUriUniqFilter类成员的void add(String key, CrawlURI value)方法前,首先有必要查看一下该对象的状态

    org.archive.crawler.util.BdbUriUniqFilter类在它的初始化方法里面初始化下面成员变量,前者是BDB数据库,后者是BDB数据库的数据对象(在BdbUriUniqFilter类方法里面我还没找发现该变量的作用)

    protected transient Database alreadySeen = null;
    protected transient DatabaseEntry value = null;

    public void start() {
            if(isRunning()) {
                return; 
            }
            boolean isRecovery = (recoveryCheckpoint != null);
            try {
                BdbModule.BdbConfig config = getDatabaseConfig();
                config.setAllowCreate(!isRecovery);
                initialize(bdb.openDatabase(DB_NAME, config, isRecovery));
            } catch (DatabaseException e) {
                throw new IllegalStateException(e);
            }
            if(isRecovery) {
                JSONObject json = recoveryCheckpoint.loadJson(beanName);
                try {
                    count.set(json.getLong("count"));
                } catch (JSONException e) {
                    throw new RuntimeException(e);
                }           
            }
            isRunning = true; 
        }

     进一步调用 void initialize(Database db)

    /**
         * Method shared by constructors.
         * @param env Environment to use.
         * @throws DatabaseException
         */
        protected void initialize(Database db) throws DatabaseException {
            open(db);
        }

    初始化数据库Database alreadySeen,void open(final Database db)

    protected void open(final Database db)
        throws DatabaseException {
            this.alreadySeen = db;
            this.value = new DatabaseEntry("".getBytes());
        }

    另外在BdbFrontier对象的初始化方法里面我们发现执行了BdbUriUniqFilter类成员的void setDestination(CrawlUriReceiver receiver)方法,这里是设置BdbFrontier对象本身,在BdbUriUniqFilter对象的相关方法里面调用了BdbFrontier对象的void receive(CrawlURI item)方法(BdbFrontier类实现了CrawlUriReceiver接口),这里类似于回调接口

    org.archive.crawler.frontier.BdbFrontier

         org.archive.crawler.frontier.WorkQueueFrontier

    public void start() {
            if(isRunning()) {
                return; 
            }
            uriUniqFilter.setDestination(this);
            super.start();
            try {
                initInternalQueues();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
    
        }

    BdbUriUniqFilter对象的状态分析完毕,现在我们再来分析上面uriUniqFilter.add(canon, curi)方法,在BdbUriUniqFilter的父类里面

    org.archive.crawler.util.BdbUriUniqFilter

         org.archive.crawler.util.SetBasedUriUniqFilter

    public void add(String key, CrawlURI value) {
            profileLog(key);
            if (setAdd(key)) {
                this.receiver.receive(value);
                if (setCount() % 50000 == 0) {
                    LOGGER.log(Level.FINE, "count: " + setCount() + " totalDups: "
                            + duplicateCount + " recentDups: "
                            + (duplicateCount - duplicatesAtLastSample));
                    duplicatesAtLastSample = duplicateCount;
                }
            } else {
                duplicateCount++;
            }
        }

    进一步调用BdbUriUniqFilter本身的boolean setAdd(CharSequence uri)方法

    org.archive.crawler.util.BdbUriUniqFilter 

    protected boolean setAdd(CharSequence uri) {
            DatabaseEntry key = new DatabaseEntry();
            LongBinding.longToEntry(createKey(uri), key);
            long started = 0;
            
            OperationStatus status = null;
            try {
                if (logger.isLoggable(Level.INFO)) {
                    started = System.currentTimeMillis();
                }
                status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY);
                if (logger.isLoggable(Level.INFO)) {
                    aggregatedLookupTime +=
                        (System.currentTimeMillis() - started);
                }
            } catch (DatabaseException e) {
                logger.severe(e.getMessage());
            }
            if (status == OperationStatus.SUCCESS) {
                count.incrementAndGet();
                if (logger.isLoggable(Level.FINE)) {
                    final int logAt = 10000;
                    if (count.get() > 0 && ((count.get() % logAt) == 0)) {
                        logger.fine("Average lookup " +
                            (aggregatedLookupTime / logAt) + "ms.");
                        aggregatedLookupTime = 0;
                    }
                }
            }
            if(status == OperationStatus.KEYEXIST) {
                return false; // not added
            } else {
                return true;
            }
        }

     这里有意思的是,我们可以看到Heritrix是怎么实现URL排重的

    **
         * Create fingerprint.
         * Pubic access so test code can access createKey.
         * @param uri URI to fingerprint.
         * @return Fingerprint of passed <code>url</code>.
         */
        public static long createKey(CharSequence uri) {
            String url = uri.toString();
            int index = url.indexOf(COLON_SLASH_SLASH);
            if (index > 0) {
                index = url.indexOf('/', index + COLON_SLASH_SLASH.length());
            }
            CharSequence hostPlusScheme = (index == -1)? url: url.subSequence(0, index);
            long tmp = FPGenerator.std24.fp(hostPlusScheme);
            return tmp | (FPGenerator.std40.fp(url) >>> 24);
        }

    这里需要注意的是Database alreadySeen数据库并没用真正存储URL链接地址,而是URL链接的Fingerprint值

    BdbUriUniqFilter本身的boolean setAdd(CharSequence uri)方法的真正意义在于判断URL链接是否添加过,除此之外,大概没有其他含义

    接下来我们看到void add(String key, CrawlURI value)方法后面调用了this.receiver.receive(value)方法

    我们前面分析BdbUriUniqFilter初始化状态的时候已经提到了,BdbUriUniqFilter对象是通过回调BdbFrontier对象的void receive(CrawlURI item)方法(BdbFrontier类实现了CrawlUriReceiver接口)

    BdbFrontier类的父类的父类AbstractFrontier我们发现该方法void receive(CrawlURI curi)

     org.archive.crawler.frontier.BdbFrontier

           org.archive.crawler.frontier.AbstractFrontier

    /**
         * Accept the given CrawlURI for scheduling, as it has
         * passed the alreadyIncluded filter. 
         * 
         * Choose a per-classKey queue and enqueue it. If this
         * item has made an unready queue ready, place that 
         * queue on the readyClassQueues queue. 
         * @param caUri CrawlURI.
         */
        public void receive(CrawlURI curi) {
            sheetOverlaysManager.applyOverlaysTo(curi);
            // prefer doing asap if already in manager thread
            try {
                KeyedProperties.loadOverridesFrom(curi);
                processScheduleAlways(curi);
            } finally {
                KeyedProperties.clearOverridesFrom(curi); 
            }
        }

    继续调用BdbFrontier类的void processScheduleAlways(CrawlURI curi)方法,在它的父类WorkQueueFrontier里面

     org.archive.crawler.frontier.BdbFrontier

            org.archive.crawler.frontier.WorkQueueFrontier 

    /**
         * Accept the given CrawlURI for scheduling, as it has
         * passed the alreadyIncluded filter. 
         * 
         * Choose a per-classKey queue and enqueue it. If this
         * item has made an unready queue ready, place that 
         * queue on the readyClassQueues queue. 
         * @param caUri CrawlURI.
         */
        protected void processScheduleAlways(CrawlURI curi) {
    //        assert Thread.currentThread() == managerThread;
            assert KeyedProperties.overridesActiveFrom(curi); 
            
            prepForFrontier(curi);
            sendToQueue(curi);
        }

    void prepForFrontier(CrawlURI curi)方法是设置curi的序号

    void sendToQueue(CrawlURI curi)方法添加CrawlURI curi到工作队列里面(该方法是由BdbFrontier类的void schedule(CrawlURI curi)发起的调用的方法里面的实质性方法),在BdbFrontier类的父类WorkQueueFrontier里面

     org.archive.crawler.frontier.BdbFrontier

              org.archive.crawler.frontier.WorkQueueFrontier

    /**
         * Send a CrawlURI to the appropriate subqueue.
         * 
         * @param curi
         */
        protected void sendToQueue(CrawlURI curi) {
    //        assert Thread.currentThread() == managerThread;
            
            WorkQueue wq = getQueueFor(curi.getClassKey());
            synchronized(wq) {
                int originalPrecedence = wq.getPrecedence();
                wq.enqueue(this, curi);
                // always take budgeting values from current curi
                // (whose overlay settings should be active here)
                wq.setSessionBudget(getBalanceReplenishAmount());
                wq.setTotalBudget(getQueueTotalBudget());
                
                if(!wq.isRetired()) {
                    incrementQueuedUriCount();
                    int currentPrecedence = wq.getPrecedence();
                    if(!wq.isManaged() || currentPrecedence < originalPrecedence) {
                        // queue newly filled or bumped up in precedence; ensure enqueuing
                        // at precedence level (perhaps duplicate; if so that's handled elsewhere)
                        deactivateQueue(wq);
                    }
                }
            }
            // Update recovery log.
            doJournalAdded(curi);
            wq.makeDirty();
            largestQueues.update(wq.getClassKey(), wq.getCount());
        }

    void sendToQueue(CrawlURI curi)方法首先是根据CrawlURI curi的ClassKey得到工作队列,这里是BdbWorkQueue(后面还设置了BdbWorkQueue对象的相关属性,涉及到Heritrix3.1.0工作队列的调度,后文再分析),

    然后调用它的long enqueue(final WorkQueueFrontier frontier,CrawlURI curi)方法,在BdbWorkQueue类的父类WorkQueue里面

    org.archive.crawler.frontier.BdbWorkQueue

             org.archive.crawler.frontier.WorkQueue

    /**
         * Add the given CrawlURI, noting its addition in running count. (It
         * should not already be present.)
         * 
         * @param frontier Work queues manager.
         * @param curi CrawlURI to insert.
         */
        protected synchronized long enqueue(final WorkQueueFrontier frontier,
            CrawlURI curi) {
            try {
                insert(frontier, curi, false);
            } catch (IOException e) {
                //FIXME better exception handling
                e.printStackTrace();
                throw new RuntimeException(e);
            }
            count++;
            enqueueCount++;
            return count;
        }

    继续调用BdbWorkQueue对象的void insert(final WorkQueueFrontier frontier, CrawlURI curi,boolean overwriteIfPresent)方法,也在在BdbWorkQueue类的父类WorkQueue里面

    org.archive.crawler.frontier.BdbWorkQueue

          org.archive.crawler.frontier.WorkQueue

    /**
         * Insert the given curi, whether it is already present or not. 
         * @param frontier WorkQueueFrontier.
         * @param curi CrawlURI to insert.
         * @throws IOException
         */
        private void insert(final WorkQueueFrontier frontier, CrawlURI curi,
                boolean overwriteIfPresent)
            throws IOException {
            insertItem(frontier, curi, overwriteIfPresent);
            lastQueued = curi.toString();
        }

    进一步调用BdbWorkQueue对象的void insertItem(final WorkQueueFrontier frontier, final CrawlURI curi, boolean overwriteIfPresent)方法

    org.archive.crawler.frontier.BdbWorkQueue

    protected void insertItem(final WorkQueueFrontier frontier,
                final CrawlURI curi, boolean overwriteIfPresent) throws IOException {
            try {
                final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                    .getWorkQueues();
                queues.put(curi, overwriteIfPresent);
                if (LOGGER.isLoggable(Level.FINE)) {
                    LOGGER.fine("Inserted into " + getPrefixClassKey(this.origin) +
                        " (count " + Long.toString(getCount())+ "): " +
                            curi.toString());
                }
            } catch (DatabaseException e) {
                throw new IOException(e);
            }
        }

    我们从这里可以看到,最终是调用BdbFrontier的成员变量BdbMultipleWorkQueues pendingUris的void put(CrawlURI curi, boolean overwriteIfPresent) 方法,将CrawlURI curi放入待采集队列

    现在总结一下:

    BdbFrontier对象的void schedule(CrawlURI curi)方法首先通过调用BdbUriUniqFilter对象void add(String key, CrawlURI value)方法,

    里面先判断当前CrawlURI curi是否已经添加过,然后回调BdbFrontier对象的void receive(CrawlURI curi)方法,

    而BdbFrontier对象的void receive(CrawlURI curi)方法:根据当前CrawlURI curi的ClassKey获取相应的BdbWorkQueue对象,然后调用BdbWorkQueue对象的long enqueue(final WorkQueueFrontier frontier,CrawlURI curi)方法,最后调用将CrawlURI curi添加到BdbFrontier的成员变量BdbMultipleWorkQueues pendingUris的void put(CrawlURI curi, boolean overwriteIfPresent) 方法,归入pendingUris队列

    ---------------------------------------------------------------------------

    本系列Heritrix 3.1.0 源码解析系本人原创

    转载请注明出处 博客园 刺猬的温驯

    本文链接 http://www.cnblogs.com/chenying99/archive/2013/04/16/3023659.html

  • 相关阅读:
    VML 和 SVG 的区别
    ie神器htc
    js函数实现递归自调用的方法
    http状态码
    高级算法——贪心算法(背包问题)
    高级算法——贪心算法(找零问题)
    关于arguments.callee
    检索算法——二分查找
    检索算法——顺序查找(最大值、最小值、自组织数据)
    高级排序算法——快速排序(一种分而治之的算法)
  • 原文地址:https://www.cnblogs.com/chenying99/p/3023659.html
Copyright © 2020-2023  润新知