接下来分析BdbFrontier类的CrawlURI next()方法,该方法是获取下一个待采集的CrawlURI对象
该方法是在BdbFrontier类的父类的父类AbstractFrontier里面
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.AbstractFrontier
/* (non-Javadoc) * @see org.archive.crawler.framework.Frontier#next() */ public CrawlURI next() throws InterruptedException { CrawlURI crawlable = null; while(crawlable==null) { outboundLock.readLock().lockInterruptibly(); // try filling outbound until we get something to work on crawlable = findEligibleURI(); outboundLock.readLock().unlock(); } return crawlable; }
继续调用BdbFrontier类的CrawlURI findEligibleURI()方法,在它的父类WorkQueueFrontier里面
/** * Return the next CrawlURI eligible to be processed (and presumably * visited/fetched) by a a worker thread. * * Relies on the readyClassQueues having been loaded with * any work queues that are eligible to provide a URI. * * @return next CrawlURI eligible to be processed, or null if none available * * @see org.archive.crawler.framework.Frontier#next() */ protected CrawlURI findEligibleURI() { // wake any snoozed queues wakeQueues(); // consider rescheduled URIS checkFutures(); // find a non-empty ready queue, if any // TODO: refactor to untangle these loops, early-exits, etc! WorkQueue readyQ = null; findauri: while(true) { findaqueue: do { String key = readyClassQueues.poll(); if(key==null) { // no ready queues; try to activate one if(!getInactiveQueuesByPrecedence().isEmpty() && highestPrecedenceWaiting < getPrecedenceFloor()) { activateInactiveQueue(); continue findaqueue; } else { // nothing ready or readyable break findaqueue; } } readyQ = getQueueFor(key); if(readyQ==null) { // readyQ key wasn't in all queues: unexpected logger.severe("Key "+ key + " in readyClassQueues but not allQueues"); break findaqueue; } if(readyQ.getCount()==0) { // readyQ is empty and ready: it's exhausted readyQ.noteExhausted(); readyQ.makeDirty(); readyQ = null; continue; } if(!inProcessQueues.add(readyQ)) { // double activation; discard this and move on // (this guard allows other enqueuings to ready or // the various inactive-by-precedence queues to // sometimes redundantly enqueue a queue key) readyQ = null; continue; } // queue has gone 'in process' readyQ.considerActive(); readyQ.setWakeTime(0); // clear obsolete wake time, if any readyQ.setSessionBudget(getBalanceReplenishAmount()); readyQ.setTotalBudget(getQueueTotalBudget()); if (readyQ.isOverSessionBudget()) { deactivateQueue(readyQ); readyQ.makeDirty(); readyQ = null; continue; } if (readyQ.isOverTotalBudget()) { retireQueue(readyQ); readyQ.makeDirty(); readyQ = null; continue; } } while (readyQ == null); if (readyQ == null) { // no queues left in ready or readiable break findauri; } returnauri: while(true) { // loop left by explicit return or break on empty CrawlURI curi = null; curi = readyQ.peek(this); if(curi == null) { // should not reach logger.severe("No CrawlURI from ready non-empty queue " + readyQ.classKey + "\n" + readyQ.shortReportLegend() + "\n" + readyQ.shortReportLine() + "\n"); break returnauri; } // from queues, override names persist but not map source curi.setOverlayMapsSource(sheetOverlaysManager); // TODO: consider optimizations avoiding this recalc of // overrides when not necessary sheetOverlaysManager.applyOverlaysTo(curi); // check if curi belongs in different queue String currentQueueKey; try { KeyedProperties.loadOverridesFrom(curi); currentQueueKey = getClassKey(curi); } finally { KeyedProperties.clearOverridesFrom(curi); } if (currentQueueKey.equals(curi.getClassKey())) { // curi was in right queue, emit noteAboutToEmit(curi, readyQ); return curi; } // URI's assigned queue has changed since it // was queued (eg because its IP has become // known). Requeue to new queue. // TODO: consider synchronization on readyQ readyQ.dequeue(this,curi); doJournalRelocated(curi); curi.setClassKey(currentQueueKey); decrementQueuedCount(1); curi.setHolderKey(null); sendToQueue(curi); if(readyQ.getCount()==0) { // readyQ is empty and ready: it's exhausted // release held status, allowing any subsequent // enqueues to again put queue in ready // FIXME: tiny window here where queue could // receive new URI, be readied, fail not-in-process? inProcessQueues.remove(readyQ); readyQ.noteExhausted(); readyQ.makeDirty(); readyQ = null; continue findauri; } } } if(inProcessQueues.size()==0) { // Nothing was ready or in progress or imminent to wake; ensure // any piled-up pending-scheduled URIs are considered uriUniqFilter.requestFlush(); } // if truly nothing ready, wait a moment before returning null // so that loop in surrounding next() has a chance of getting something // next time if(getTotalEligibleInactiveQueues()==0) { try { Thread.sleep(1000); } catch (InterruptedException e) { // } } // nothing eligible return null; }
首先是根据ClassKey获取WorkQueue类型对象,这里是BdbWorkQueue对象(这个ClassKey值的获取涉及到Heritrix3.1.0工作队列的调度,后文再分析),
然后是调用BdbWorkQueue对象的CrawlURI peek(final WorkQueueFrontier frontier)方法,在它的父类WorkQueue里面
/** * Return the topmost queue item -- and remember it, * such that even later higher-priority inserts don't * change it. * * TODO: evaluate if this is really necessary * @param frontier Work queues manager * * @return topmost queue item, or null */ public synchronized CrawlURI peek(final WorkQueueFrontier frontier) { if(peekItem == null && count > 0) { try { peekItem = peekItem(frontier); } catch (IOException e) { //FIXME better exception handling logger.log(Level.SEVERE,"peek failure",e); e.printStackTrace(); // throw new RuntimeException(e); } if(peekItem != null) { lastPeeked = peekItem.toString(); } } return peekItem; }
进一步调用CrawlURI peekItem(final WorkQueueFrontier frontier)方法
org.archive.crawler.frontier.BdbWorkQueue
protected CrawlURI peekItem(final WorkQueueFrontier frontier) throws IOException { final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier) .getWorkQueues(); DatabaseEntry key = new DatabaseEntry(origin); CrawlURI curi = null; int tries = 1; while(true) { try { curi = queues.get(key); } catch (DatabaseException e) { LOGGER.log(Level.SEVERE,"peekItem failure; retrying",e); } // ensure CrawlURI, if any, came from acceptable range: if(!ArchiveUtils.startsWith(key.getData(),origin)) { LOGGER.severe( "inconsistency: "+classKey+"("+ getPrefixClassKey(origin)+") with " + getCount() + " items gave " + curi +"("+getPrefixClassKey(key.getData())); // clear curi to allow retry curi = null; // reset key to original origin for retry key.setData(origin); } if (curi!=null) { // success break; } if (tries>3) { LOGGER.severe("no item where expected in queue "+classKey); break; } tries++; LOGGER.severe("Trying get #" + Integer.toString(tries) + " in queue " + classKey + " with " + getCount() + " items using key " + getPrefixClassKey(key.getData())); } return curi; }
上面我们可以看到,之后调用的是BdbMultipleWorkQueues对象的方法,传入DatabaseEntry key = new DatabaseEntry(origin)参数
这里的origin是byte[]类型的,是根据BdbWorkQueue工作队列的classKey值算出来的,在BdbWorkQueue的构造函数里面
/** * Create a virtual queue inside the given BdbMultipleWorkQueues * * @param classKey */ public BdbWorkQueue(String classKey, BdbFrontier frontier) { super(classKey); this.origin = BdbMultipleWorkQueues.calculateOriginKey(classKey); if (LOGGER.isLoggable(Level.FINE)) { LOGGER.fine(getPrefixClassKey(this.origin) + " " + classKey); } // add the queue-front 'cap' entry; see... // http://sourceforge.net/tracker/index.php?func=detail&aid=1262665&group_id=73833&atid=539102 frontier.getWorkQueues().addCap(origin); }
至于怎么算出来的,BdbMultipleWorkQueues类的byte[] calculateOriginKey(String classKey)静态方法
/** * Calculate the 'origin' key for a virtual queue of items * with the given classKey. This origin key will be a * prefix of the keys for all items in the queue. * * @param classKey String key to derive origin byte key from * @return a byte array key */ static byte[] calculateOriginKey(String classKey) { byte[] classKeyBytes = null; int len = 0; try { classKeyBytes = classKey.getBytes("UTF-8"); len = classKeyBytes.length; } catch (UnsupportedEncodingException e) { // should be impossible; all JVMs must support UTF-8 e.printStackTrace(); } byte[] keyData = new byte[len+1]; System.arraycopy(classKeyBytes,0,keyData,0,len); keyData[len]=0; return keyData; }
最后BdbMultipleWorkQueues对象的CrawlURI get(DatabaseEntry headKey)方法,我们在前面的文章已经看过了,这里不再贴出
---------------------------------------------------------------------------
本系列Heritrix 3.1.0 源码解析系本人原创
转载请注明出处 博客园 刺猬的温驯
本文链接 http://www.cnblogs.com/chenying99/archive/2013/04/17/3025414.html