本文要分析的是Heritrix3.1.0的Frontier组件,先熟悉一下相关的UML类图
通过浏览该图,我们可以清楚的看出Frontier组件的相关接口和类的继承和调用关系,不必我再文字描述了
在分析BdbFrontier类的相关方法之前,有必要熟悉一下在系统环境中该对象的初始状态,以及状态的改变
这部分内容需要回顾 Heritrix 3.1.0 源码解析(六)一文,本人就不在这里重复了
BdbFrontier类重要的有几个方法,分别是添加CrawlURI,获取CrawlURI,完成CrawlURI
void schedule(CrawlURI curi)
CrawlURI next()
void finished(CrawlURI curi)
下面首先来分析void schedule(CrawlURI curi)方法,在它的父类WorkQueueFrontier里面
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.WorkQueueFrontier
/** * Arrange for the given CrawlURI to be visited, if it is not * already enqueued/completed. * * Differs from superclass in that it operates in calling thread, rather * than deferring operations via in-queue to managerThread. TODO: settle * on either defer or in-thread approach after testing. * * @see org.archive.crawler.framework.Frontier#schedule(org.archive.modules.CrawlURI) */ @Override public void schedule(CrawlURI curi) { sheetOverlaysManager.applyOverlaysTo(curi); try { KeyedProperties.loadOverridesFrom(curi); if(curi.getClassKey()==null) { // remedial processing preparer.prepare(curi); } processScheduleIfUnique(curi); } finally { KeyedProperties.clearOverridesFrom(curi); } }
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.WorkQueueFrontier
/** * Arrange for the given CrawlURI to be visited, if it is not * already scheduled/completed. * * @see org.archive.crawler.framework.Frontier#schedule(org.archive.modules.CrawlURI) */ protected void processScheduleIfUnique(CrawlURI curi) { // assert Thread.currentThread() == managerThread; assert KeyedProperties.overridesActiveFrom(curi); // Canonicalization may set forceFetch flag. See // #canonicalization(CrawlURI) javadoc for circumstance. String canon = curi.getCanonicalString(); if (curi.forceFetch()) { uriUniqFilter.addForce(canon, curi); } else { uriUniqFilter.add(canon, curi); } }
void processScheduleIfUnique(CrawlURI curi)方法调用了BdbUriUniqFilter类成员的void add(String key, CrawlURI value)方法
在分析BdbUriUniqFilter类成员的void add(String key, CrawlURI value)方法前,首先有必要查看一下该对象的状态
org.archive.crawler.util.BdbUriUniqFilter类在它的初始化方法里面初始化下面成员变量,前者是BDB数据库,后者是BDB数据库的数据对象(在BdbUriUniqFilter类方法里面我还没找发现该变量的作用)
protected transient Database alreadySeen = null;
protected transient DatabaseEntry value = null;
public void start() { if(isRunning()) { return; } boolean isRecovery = (recoveryCheckpoint != null); try { BdbModule.BdbConfig config = getDatabaseConfig(); config.setAllowCreate(!isRecovery); initialize(bdb.openDatabase(DB_NAME, config, isRecovery)); } catch (DatabaseException e) { throw new IllegalStateException(e); } if(isRecovery) { JSONObject json = recoveryCheckpoint.loadJson(beanName); try { count.set(json.getLong("count")); } catch (JSONException e) { throw new RuntimeException(e); } } isRunning = true; }
进一步调用 void initialize(Database db)
/** * Method shared by constructors. * @param env Environment to use. * @throws DatabaseException */ protected void initialize(Database db) throws DatabaseException { open(db); }
初始化数据库Database alreadySeen,void open(final Database db)
protected void open(final Database db) throws DatabaseException { this.alreadySeen = db; this.value = new DatabaseEntry("".getBytes()); }
另外在BdbFrontier对象的初始化方法里面我们发现执行了BdbUriUniqFilter类成员的void setDestination(CrawlUriReceiver receiver)方法,这里是设置BdbFrontier对象本身,在BdbUriUniqFilter对象的相关方法里面调用了BdbFrontier对象的void receive(CrawlURI item)方法(BdbFrontier类实现了CrawlUriReceiver接口),这里类似于回调接口
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.WorkQueueFrontier
public void start() { if(isRunning()) { return; } uriUniqFilter.setDestination(this); super.start(); try { initInternalQueues(); } catch (Exception e) { throw new IllegalStateException(e); } }
BdbUriUniqFilter对象的状态分析完毕,现在我们再来分析上面uriUniqFilter.add(canon, curi)方法,在BdbUriUniqFilter的父类里面
org.archive.crawler.util.BdbUriUniqFilter
org.archive.crawler.util.SetBasedUriUniqFilter
public void add(String key, CrawlURI value) { profileLog(key); if (setAdd(key)) { this.receiver.receive(value); if (setCount() % 50000 == 0) { LOGGER.log(Level.FINE, "count: " + setCount() + " totalDups: " + duplicateCount + " recentDups: " + (duplicateCount - duplicatesAtLastSample)); duplicatesAtLastSample = duplicateCount; } } else { duplicateCount++; } }
进一步调用BdbUriUniqFilter本身的boolean setAdd(CharSequence uri)方法
org.archive.crawler.util.BdbUriUniqFilter
protected boolean setAdd(CharSequence uri) { DatabaseEntry key = new DatabaseEntry(); LongBinding.longToEntry(createKey(uri), key); long started = 0; OperationStatus status = null; try { if (logger.isLoggable(Level.INFO)) { started = System.currentTimeMillis(); } status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY); if (logger.isLoggable(Level.INFO)) { aggregatedLookupTime += (System.currentTimeMillis() - started); } } catch (DatabaseException e) { logger.severe(e.getMessage()); } if (status == OperationStatus.SUCCESS) { count.incrementAndGet(); if (logger.isLoggable(Level.FINE)) { final int logAt = 10000; if (count.get() > 0 && ((count.get() % logAt) == 0)) { logger.fine("Average lookup " + (aggregatedLookupTime / logAt) + "ms."); aggregatedLookupTime = 0; } } } if(status == OperationStatus.KEYEXIST) { return false; // not added } else { return true; } }
这里有意思的是,我们可以看到Heritrix是怎么实现URL排重的
** * Create fingerprint. * Pubic access so test code can access createKey. * @param uri URI to fingerprint. * @return Fingerprint of passed <code>url</code>. */ public static long createKey(CharSequence uri) { String url = uri.toString(); int index = url.indexOf(COLON_SLASH_SLASH); if (index > 0) { index = url.indexOf('/', index + COLON_SLASH_SLASH.length()); } CharSequence hostPlusScheme = (index == -1)? url: url.subSequence(0, index); long tmp = FPGenerator.std24.fp(hostPlusScheme); return tmp | (FPGenerator.std40.fp(url) >>> 24); }
这里需要注意的是Database alreadySeen数据库并没用真正存储URL链接地址,而是URL链接的Fingerprint值
BdbUriUniqFilter本身的boolean setAdd(CharSequence uri)方法的真正意义在于判断URL链接是否添加过,除此之外,大概没有其他含义
接下来我们看到void add(String key, CrawlURI value)方法后面调用了this.receiver.receive(value)方法
我们前面分析BdbUriUniqFilter初始化状态的时候已经提到了,BdbUriUniqFilter对象是通过回调BdbFrontier对象的void receive(CrawlURI item)方法(BdbFrontier类实现了CrawlUriReceiver接口)
在BdbFrontier类的父类的父类AbstractFrontier我们发现该方法void receive(CrawlURI curi)
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.AbstractFrontier
/** * Accept the given CrawlURI for scheduling, as it has * passed the alreadyIncluded filter. * * Choose a per-classKey queue and enqueue it. If this * item has made an unready queue ready, place that * queue on the readyClassQueues queue. * @param caUri CrawlURI. */ public void receive(CrawlURI curi) { sheetOverlaysManager.applyOverlaysTo(curi); // prefer doing asap if already in manager thread try { KeyedProperties.loadOverridesFrom(curi); processScheduleAlways(curi); } finally { KeyedProperties.clearOverridesFrom(curi); } }
继续调用BdbFrontier类的void processScheduleAlways(CrawlURI curi)方法,在它的父类WorkQueueFrontier里面
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.WorkQueueFrontier
/** * Accept the given CrawlURI for scheduling, as it has * passed the alreadyIncluded filter. * * Choose a per-classKey queue and enqueue it. If this * item has made an unready queue ready, place that * queue on the readyClassQueues queue. * @param caUri CrawlURI. */ protected void processScheduleAlways(CrawlURI curi) { // assert Thread.currentThread() == managerThread; assert KeyedProperties.overridesActiveFrom(curi); prepForFrontier(curi); sendToQueue(curi); }
void prepForFrontier(CrawlURI curi)方法是设置curi的序号
void sendToQueue(CrawlURI curi)方法添加CrawlURI curi到工作队列里面(该方法是由BdbFrontier类的void schedule(CrawlURI curi)发起的调用的方法里面的实质性方法),在BdbFrontier类的父类WorkQueueFrontier里面
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.WorkQueueFrontier
/** * Send a CrawlURI to the appropriate subqueue. * * @param curi */ protected void sendToQueue(CrawlURI curi) { // assert Thread.currentThread() == managerThread; WorkQueue wq = getQueueFor(curi.getClassKey()); synchronized(wq) { int originalPrecedence = wq.getPrecedence(); wq.enqueue(this, curi); // always take budgeting values from current curi // (whose overlay settings should be active here) wq.setSessionBudget(getBalanceReplenishAmount()); wq.setTotalBudget(getQueueTotalBudget()); if(!wq.isRetired()) { incrementQueuedUriCount(); int currentPrecedence = wq.getPrecedence(); if(!wq.isManaged() || currentPrecedence < originalPrecedence) { // queue newly filled or bumped up in precedence; ensure enqueuing // at precedence level (perhaps duplicate; if so that's handled elsewhere) deactivateQueue(wq); } } } // Update recovery log. doJournalAdded(curi); wq.makeDirty(); largestQueues.update(wq.getClassKey(), wq.getCount()); }
void sendToQueue(CrawlURI curi)方法首先是根据CrawlURI curi的ClassKey得到工作队列,这里是BdbWorkQueue(后面还设置了BdbWorkQueue对象的相关属性,涉及到Heritrix3.1.0工作队列的调度,后文再分析),
然后调用它的long enqueue(final WorkQueueFrontier frontier,CrawlURI curi)方法,在BdbWorkQueue类的父类WorkQueue里面
org.archive.crawler.frontier.BdbWorkQueue
org.archive.crawler.frontier.WorkQueue
/** * Add the given CrawlURI, noting its addition in running count. (It * should not already be present.) * * @param frontier Work queues manager. * @param curi CrawlURI to insert. */ protected synchronized long enqueue(final WorkQueueFrontier frontier, CrawlURI curi) { try { insert(frontier, curi, false); } catch (IOException e) { //FIXME better exception handling e.printStackTrace(); throw new RuntimeException(e); } count++; enqueueCount++; return count; }
继续调用BdbWorkQueue对象的void insert(final WorkQueueFrontier frontier, CrawlURI curi,boolean overwriteIfPresent)方法,也在在BdbWorkQueue类的父类WorkQueue里面
org.archive.crawler.frontier.BdbWorkQueue
org.archive.crawler.frontier.WorkQueue
/** * Insert the given curi, whether it is already present or not. * @param frontier WorkQueueFrontier. * @param curi CrawlURI to insert. * @throws IOException */ private void insert(final WorkQueueFrontier frontier, CrawlURI curi, boolean overwriteIfPresent) throws IOException { insertItem(frontier, curi, overwriteIfPresent); lastQueued = curi.toString(); }
进一步调用BdbWorkQueue对象的void insertItem(final WorkQueueFrontier frontier, final CrawlURI curi, boolean overwriteIfPresent)方法
org.archive.crawler.frontier.BdbWorkQueue
protected void insertItem(final WorkQueueFrontier frontier, final CrawlURI curi, boolean overwriteIfPresent) throws IOException { try { final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier) .getWorkQueues(); queues.put(curi, overwriteIfPresent); if (LOGGER.isLoggable(Level.FINE)) { LOGGER.fine("Inserted into " + getPrefixClassKey(this.origin) + " (count " + Long.toString(getCount())+ "): " + curi.toString()); } } catch (DatabaseException e) { throw new IOException(e); } }
我们从这里可以看到,最终是调用BdbFrontier的成员变量BdbMultipleWorkQueues pendingUris的void put(CrawlURI curi, boolean overwriteIfPresent) 方法,将CrawlURI curi放入待采集队列
现在总结一下:
BdbFrontier对象的void schedule(CrawlURI curi)方法首先通过调用BdbUriUniqFilter对象void add(String key, CrawlURI value)方法,
里面先判断当前CrawlURI curi是否已经添加过,然后回调BdbFrontier对象的void receive(CrawlURI curi)方法,
而BdbFrontier对象的void receive(CrawlURI curi)方法:根据当前CrawlURI curi的ClassKey获取相应的BdbWorkQueue对象,然后调用BdbWorkQueue对象的long enqueue(final WorkQueueFrontier frontier,CrawlURI curi)方法,最后调用将CrawlURI curi添加到BdbFrontier的成员变量BdbMultipleWorkQueues pendingUris的void put(CrawlURI curi, boolean overwriteIfPresent) 方法,归入pendingUris队列
---------------------------------------------------------------------------
本系列Heritrix 3.1.0 源码解析系本人原创
转载请注明出处 博客园 刺猬的温驯
本文链接 http://www.cnblogs.com/chenying99/archive/2013/04/16/3023659.html