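In Heritrix 3.1.0, the Frontier component manages its URL queues with a BDB database (Berkeley DB Java Edition, "je"), which stores the CrawlURI objects. Let's first look at how Heritrix 3.1.0 implements its BDB module.
As we know, creating a BDB database starts with building a database environment. In the Heritrix 3.1.0 BDB module, the EnhancedEnvironment class wraps the BDB environment (it extends je's Environment). If you are not familiar with BDB, it is worth googling it first.
The source of the EnhancedEnvironment class is as follows: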
/**
 * Version of BDB_JE Environment with additional convenience features, such as
 * a shared, cached StoredClassCatalog. (Additional convenience caching of
 * Databases and StoredCollections may be added later.)
 *
 * @author gojomo
 */
public class EnhancedEnvironment extends Environment {
    StoredClassCatalog classCatalog;
    Database classCatalogDB;

    /**
     * Constructor
     *
     * @param envHome directory in which to open environment
     * @param envConfig config options
     * @throws DatabaseException
     */
    public EnhancedEnvironment(File envHome, EnvironmentConfig envConfig)
            throws DatabaseException {
        super(envHome, envConfig);
    }

    /**
     * Return a StoredClassCatalog backed by a Database in this environment,
     * either pre-existing or created (and cached) if necessary.
     *
     * @return the cached class catalog
     */
    public StoredClassCatalog getClassCatalog() {
        if(classCatalog == null) {
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            dbConfig.setReadOnly(this.getConfig().getReadOnly());
            try {
                classCatalogDB = openDatabase(null, "classCatalog", dbConfig);
                classCatalog = new StoredClassCatalog(classCatalogDB);
            } catch (DatabaseException e) {
                // TODO Auto-generated catch block
                throw new RuntimeException(e);
            }
        }
        return classCatalog;
    }

    @Override
    public synchronized void close() throws DatabaseException {
        if(classCatalogDB!=null) {
            classCatalogDB.close();
        }
        super.close();
    }

    /**
     * Create a temporary test environment in the given directory.
     * @param dir target directory
     * @return EnhancedEnvironment
     */
    public static EnhancedEnvironment getTestEnvironment(File dir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(false);
        EnhancedEnvironment env;
        try {
            env = new EnhancedEnvironment(dir, envConfig);
        } catch (DatabaseException e) {
            throw new RuntimeException(e);
        }
        return env;
    }
}
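As the source shows, besides inheriting je's Environment functionality, the class adds a StoredClassCatalog getClassCatalog() method, which BDB needs in order to store custom objects. On first use the method also opens the "classCatalog" database (classCatalogDB) and builds the StoredClassCatalog on top of it; later calls return the cached instance.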
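To make the "create on first use, cache, and close together with the owner" idea concrete, here is a minimal sketch using only the JDK. LazyCatalogSketch and CatalogDb are hypothetical stand-ins, not Heritrix or je classes, and the synchronized on getCatalog() is an addition of this sketch (the original getClassCatalog() is not synchronized).

```java
class LazyCatalogSketch implements AutoCloseable {
    /** Hypothetical stand-in for the catalog database handle. */
    static class CatalogDb {
        boolean open = true;
        void close() { open = false; }
    }

    private CatalogDb catalogDb;    // created on first use, then reused
    private boolean envOpen = true; // stand-in for the environment's own state

    /** Create the catalog on the first call; return the cached one afterwards. */
    synchronized CatalogDb getCatalog() {
        if (catalogDb == null) {
            catalogDb = new CatalogDb();
        }
        return catalogDb;
    }

    boolean isOpen() { return envOpen; }

    /** Close the dependent catalog first, then the owner itself. */
    @Override
    public synchronized void close() {
        if (catalogDb != null) {
            catalogDb.close();
        }
        envOpen = false;
    }

    public static void main(String[] args) {
        LazyCatalogSketch env = new LazyCatalogSketch();
        CatalogDb first = env.getCatalog();
        System.out.println(first == env.getCatalog()); // same cached instance
        env.close();
        System.out.println(first.open);                // catalog closed with its owner
    }
}
```

The close order matters in the real class for the same reason: the catalog database lives inside the environment, so it is closed before super.close() runs.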
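So where are BDB databases actually created and operated on? That is the job of the BdbModule class, which we analyze next (BdbModule implements a series of interfaces, which I won't explain in detail for now).
The BdbModule source is rather long, so I won't paste it in full; I only show the relevant code as the analysis goes along.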
private static class DatabasePlusConfig implements Serializable {
    private static final long serialVersionUID = 1L;
    public transient Database database;
    public BdbConfig config;
}

/**
 * Configuration object for databases. Needed because
 * {@link DatabaseConfig} is not serializable. Also it prevents invalid
 * configurations. (All databases opened through this module must be
 * deferred-write, because otherwise they can't sync(), and you can't
 * run a checkpoint without doing sync() first.)
 *
 * @author pjack
 *
 */
public static class BdbConfig implements Serializable {
    private static final long serialVersionUID = 1L;

    boolean allowCreate;
    boolean sortedDuplicates;
    boolean transactional;
    boolean deferredWrite = true;

    public BdbConfig() {
    }

    public boolean isAllowCreate() {
        return allowCreate;
    }

    public void setAllowCreate(boolean allowCreate) {
        this.allowCreate = allowCreate;
    }

    public boolean getSortedDuplicates() {
        return sortedDuplicates;
    }

    public void setSortedDuplicates(boolean sortedDuplicates) {
        this.sortedDuplicates = sortedDuplicates;
    }

    public DatabaseConfig toDatabaseConfig() {
        DatabaseConfig result = new DatabaseConfig();
        result.setDeferredWrite(deferredWrite);
        result.setTransactional(transactional);
        result.setAllowCreate(allowCreate);
        result.setSortedDuplicates(sortedDuplicates);
        return result;
    }

    public boolean isTransactional() {
        return transactional;
    }

    public void setTransactional(boolean transactional) {
        this.transactional = transactional;
    }

    public void setDeferredWrite(boolean b) {
        this.deferredWrite = true;
    }
}
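The code above defines the static nested classes DatabasePlusConfig and BdbConfig. The former is private and can only be instantiated inside BdbModule; the latter is public and can be created from outside.
As you can see, besides its Database database member (marked transient, so it is not serialized), DatabasePlusConfig also holds a BdbConfig config member.
BdbConfig wraps the BDB database configuration. Once its fields are set, its DatabaseConfig toDatabaseConfig() method returns the corresponding BDB configuration object. Note that setDeferredWrite(boolean b) ignores its argument and always sets deferredWrite to true, which matches the class comment: every database opened through this module must be deferred-write.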
public DatabaseConfig toDatabaseConfig() {
    DatabaseConfig result = new DatabaseConfig();
    result.setDeferredWrite(deferredWrite);
    result.setTransactional(transactional);
    result.setAllowCreate(allowCreate);
    result.setSortedDuplicates(sortedDuplicates);
    return result;
}
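The reason for this pair of classes is easier to see in a small experiment. The sketch below uses only the JDK: NativeConfig is a hypothetical stand-in for je's non-serializable DatabaseConfig, PortableConfig plays the role of BdbConfig, and HandlePlusConfig plays the role of DatabasePlusConfig. It is an illustration under those assumptions, not the Heritrix code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ConfigSketch {
    /** Hypothetical stand-in for a non-serializable native config object. */
    static class NativeConfig {
        boolean allowCreate;
        boolean deferredWrite;
    }

    /** Serializable settings that can be turned into a NativeConfig on demand. */
    static class PortableConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        boolean allowCreate;
        boolean deferredWrite = true;

        NativeConfig toNativeConfig() {
            NativeConfig result = new NativeConfig();
            result.allowCreate = allowCreate;
            result.deferredWrite = deferredWrite;
            return result;
        }
    }

    /** Pairs a live handle (not serialized) with the settings needed to reopen it. */
    static class HandlePlusConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        transient Object handle;
        PortableConfig config;
    }

    /** Serialize and deserialize a value in memory. */
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HandlePlusConfig hpc = new HandlePlusConfig();
        hpc.handle = new Object();
        hpc.config = new PortableConfig();
        hpc.config.allowCreate = true;

        HandlePlusConfig copy = roundTrip(hpc);
        System.out.println(copy.handle == null);       // transient handle is dropped
        System.out.println(copy.config.allowCreate);   // settings survive
    }
}
```

After the round trip the handle is gone but the settings are intact, so the database can be reopened from the config alone.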
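The next part of the BdbModule source holds the settings for the BDB environment; these parameters are used later by the method that instantiates the environment.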
protected ConfigPath dir = new ConfigPath("bdbmodule subdirectory","state");
public ConfigPath getDir() {
    return dir;
}
public void setDir(ConfigPath dir) {
    this.dir = dir;
}

int cachePercent = -1;
public int getCachePercent() {
    return cachePercent;
}
public void setCachePercent(int cachePercent) {
    this.cachePercent = cachePercent;
}

boolean useSharedCache = true;
public boolean getUseSharedCache() {
    return useSharedCache;
}
public void setUseSharedCache(boolean useSharedCache) {
    this.useSharedCache = useSharedCache;
}

/**
 * Expected number of concurrent threads; used to tune nLockTables
 * according to JE FAQ
 * http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
 */
int expectedConcurrency = 64;
public int getExpectedConcurrency() {
    return expectedConcurrency;
}
public void setExpectedConcurrency(int expectedConcurrency) {
    this.expectedConcurrency = expectedConcurrency;
}

/**
 * Whether to use hard-links to log files to collect/retain
 * the BDB log files needed for a checkpoint. Default is true.
 * May not work on Windows (especially on pre-NTFS filesystems).
 * If false, the BDB 'je.cleaner.expunge' value will be set to
 * 'false', as well, meaning BDB will *not* delete obsolete JDB
 * files, but only rename the '.DEL'. They will have to be
 * manually deleted to free disk space, but .DEL files referenced
 * in any checkpoint's 'jdbfiles.manifest' should be retained to
 * keep the checkpoint valid.
 */
boolean useHardLinkCheckpoints = true;
public boolean getUseHardLinkCheckpoints() {
    return useHardLinkCheckpoints;
}
public void setUseHardLinkCheckpoints(boolean useHardLinkCheckpoints) {
    this.useHardLinkCheckpoints = useHardLinkCheckpoints;
}

private transient EnhancedEnvironment bdbEnvironment;

private transient StoredClassCatalog classCatalog;
Two member variables that follow are particularly important:
@SuppressWarnings("rawtypes")
private Map<String,ObjectIdentityCache> oiCaches =
    new ConcurrentHashMap<String,ObjectIdentityCache>();

private Map<String,DatabasePlusConfig> databases =
    new ConcurrentHashMap<String,DatabasePlusConfig>();
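Both are maps keyed by database name. The former holds the cache-management objects (used by the BdbFrontier module to manage the work-queue cache); the latter holds DatabasePlusConfig objects, through which the module hands out BDB database instances.
Now look at its initialization method start() (a method of the Spring framework's Lifecycle interface, which BdbModule implements):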
public synchronized void start() {
    if (isRunning()) {
        return;
    }

    isRunning = true;

    try {
        boolean isRecovery = false;
        if(recoveryCheckpoint!=null) {
            isRecovery = true;
            doRecover();
        }

        setup(getDir().getFile(), !isRecovery);
    } catch (DatabaseException e) {
        throw new IllegalStateException(e);
    } catch (IOException e) {
        throw new IllegalStateException(e);
    }
}
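doRecover() restores the environment from a checkpoint. setup(getDir().getFile(), !isRecovery) then initializes the EnhancedEnvironment (the wrapper around the database environment) and the StoredClassCatalog object; when recovering, the create flag is false, so the environment must already exist.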
protected void setup(File f, boolean create)
        throws DatabaseException, IOException {
    EnvironmentConfig config = new EnvironmentConfig();
    config.setAllowCreate(create);
    config.setLockTimeout(75, TimeUnit.MINUTES); // set to max
    if(getCachePercent()>0) {
        config.setCachePercent(getCachePercent());
    }
    config.setSharedCache(getUseSharedCache());

    // we take the advice literally from...
    // http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
    long nLockTables = getExpectedConcurrency()-1;
    while(!BigInteger.valueOf(nLockTables).isProbablePrime(Integer.MAX_VALUE)) {
        nLockTables--;
    }
    config.setConfigParam("je.lock.nLockTables", Long.toString(nLockTables));

    // triple this value to 6K because stats show many faults
    config.setConfigParam("je.log.faultReadSize", "6144");

    if(!getUseHardLinkCheckpoints()) {
        // to support checkpoints by textual manifest only,
        // prevent BDB's cleaner from deleting log files
        config.setConfigParam("je.cleaner.expunge", "false");
    } // else leave whatever other setting was already in place

    org.archive.util.FileUtils.ensureWriteableDirectory(f);
    this.bdbEnvironment = new EnhancedEnvironment(f, config);
    this.classCatalog = this.bdbEnvironment.getClassCatalog();
    if(!create) {
        // freeze last log file -- so that originating checkpoint isn't fouled
        DbBackup dbBackup = new DbBackup(bdbEnvironment);
        dbBackup.startBackup();
        dbBackup.endBackup();
    }
}
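The nLockTables loop in setup() is easy to try in isolation: it starts one below the expected concurrency and counts down until it hits a prime. Here is a minimal JDK-only sketch of that calculation. LockTableSketch and lockTablesFor are names made up for this example, and the lower bound of 2 is a guard added by the sketch that the original loop does not have.

```java
import java.math.BigInteger;

class LockTableSketch {
    /**
     * Largest prime strictly below expectedConcurrency, mirroring the loop
     * in setup(). The floor of 2 is an assumption of this sketch so that
     * tiny inputs cannot run past zero.
     */
    static long lockTablesFor(int expectedConcurrency) {
        long n = Math.max(2, expectedConcurrency - 1);
        while (n > 2 && !BigInteger.valueOf(n).isProbablePrime(100)) {
            n--;
        }
        return n;
    }

    public static void main(String[] args) {
        // 64 is the default expectedConcurrency in BdbModule: 63 and 62 are
        // composite, 61 is prime.
        System.out.println(lockTablesFor(64)); // prints 61
    }
}
```

So with the default settings the module ends up passing "61" as je.lock.nLockTables.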
Databases are opened through openDatabase(String name, BdbConfig config, boolean usePriorData):
/**
 * Open a Database inside this BdbModule's environment, and
 * remember it for automatic close-at-module-stop.
 *
 * @param name
 * @param config
 * @param usePriorData
 * @return
 * @throws DatabaseException
 */
public Database openDatabase(String name, BdbConfig config, boolean usePriorData)
        throws DatabaseException {
    if (bdbEnvironment == null) {
        // proper initialization hasn't occurred
        throw new IllegalStateException("BdbModule not started");
    }
    if (databases.containsKey(name)) {
        DatabasePlusConfig dpc = databases.get(name);
        if(dpc.config == config) {
            // object-identical configs: OK to share DB
            return dpc.database;
        }
        // unshared config object: might be name collision; error
        throw new IllegalStateException("Database already exists: " +name);
    }

    DatabasePlusConfig dpc = new DatabasePlusConfig();
    if (!usePriorData) {
        try {
            bdbEnvironment.truncateDatabase(null, name, false);
        } catch (DatabaseNotFoundException e) {
            // Ignored
        }
    }
    dpc.database = bdbEnvironment.openDatabase(null, name, config.toDatabaseConfig());
    dpc.config = config;
    databases.put(name, dpc);
    return dpc.database;
}
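The method first checks whether the Map<String,DatabasePlusConfig> databases member already has an entry under that name. If it does and the caller passed the very same BdbConfig object (compared with ==), the existing Database is returned; if the config object differs, the name is treated as a collision and an IllegalStateException is thrown. Only when no entry exists is the database opened (after being truncated when usePriorData is false) and registered in the map.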
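The lookup rule can be reproduced without BDB. The following JDK-only sketch keeps the same three outcomes (share, collide, create); RegistrySketch, Config and Handle are hypothetical stand-ins for BdbModule, BdbConfig and Database, and the truncate step is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RegistrySketch {
    /** Hypothetical stand-in for BdbConfig. */
    static class Config { }

    /** Hypothetical stand-in for an opened Database. */
    static class Handle {
        final String name;
        Handle(String name) { this.name = name; }
    }

    /** Pairs a handle with the config object it was opened with. */
    private static class Entry {
        final Handle handle;
        final Config config;
        Entry(Handle handle, Config config) {
            this.handle = handle;
            this.config = config;
        }
    }

    private final Map<String, Entry> databases = new ConcurrentHashMap<>();

    Handle open(String name, Config config) {
        Entry existing = databases.get(name);
        if (existing != null) {
            if (existing.config == config) {
                // identical config object: the caller is the original owner
                return existing.handle;
            }
            // different config object: treat as a name collision
            throw new IllegalStateException("Database already exists: " + name);
        }
        Entry created = new Entry(new Handle(name), config);
        databases.put(name, created);
        return created.handle;
    }

    public static void main(String[] args) {
        RegistrySketch registry = new RegistrySketch();
        Config shared = new Config();
        Handle first = registry.open("queue", shared);
        System.out.println(first == registry.open("queue", shared)); // shared handle
        try {
            registry.open("queue", new Config());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Comparing configs by identity rather than by value means that two callers who happen to build equal settings still cannot grab each other's database by accident.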
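The next method returns a StoredQueue. The queue stores elements of the type given by the Class<K> clazz parameter, and its database configuration comes from StoredQueue.databaseConfig() (StoredQueue's own default configuration).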
public <K extends Serializable> StoredQueue<K> getStoredQueue(String dbname, Class<K> clazz, boolean usePriorData) {
    try {
        Database queueDb;
        queueDb = openDatabase(dbname,
                StoredQueue.databaseConfig(), usePriorData);
        return new StoredQueue<K>(queueDb, clazz, getClassCatalog());
    } catch (DatabaseException e) {
        throw new RuntimeException(e);
    }
}
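When the StoredQueue is instantiated, the StoredClassCatalog passed in is used to create an EntryBinding<E> object (Heritrix also has a KryoBinding<K>, for example). A binding converts between a serializable class and BDB's data entries; K is a serializable type (<K extends Serializable>).
A short detour is worthwhile here: let's look at the source of StoredQueue. It extends AbstractQueue<E> and implements queue operations whose elements are stored in a BDB database.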
/**
 * Queue backed by a JE Collections StoredSortedMap.
 *
 * @author gojomo
 *
 * @param <E>
 */
public class StoredQueue<E extends Serializable> extends AbstractQueue<E> {
    @SuppressWarnings("unused")
    private static final Logger logger =
        Logger.getLogger(StoredQueue.class.getName());

    transient StoredSortedMap<Long,E> queueMap; // Long -> E
    transient Database queueDb; // Database
    AtomicLong tailIndex; // next spot for insert
    transient volatile E peekItem = null;

    /**
     * Create a StoredQueue backed by the given Database.
     *
     * The Class of values to be queued may be provided; there is only a
     * benefit when a primitive type is specified. A StoredClassCatalog
     * must be provided if a primitive type is not supplied.
     *
     * @param db
     * @param clsOrNull
     * @param classCatalog
     */
    public StoredQueue(Database db, Class<E> clsOrNull, StoredClassCatalog classCatalog) {
        hookupDatabase(db, clsOrNull, classCatalog);
        tailIndex = new AtomicLong(queueMap.isEmpty() ? 0L : queueMap.lastKey()+1);
    }

    /**
     * @param db
     * @param clsOrNull
     * @param classCatalog
     */
    public void hookupDatabase(Database db, Class<E> clsOrNull, StoredClassCatalog classCatalog) {
        EntryBinding<E> valueBinding = TupleBinding.getPrimitiveBinding(clsOrNull);
        if(valueBinding == null) {
            valueBinding = new SerialBinding<E>(classCatalog, clsOrNull);
        }
        queueDb = db;
        queueMap = new StoredSortedMap<Long,E>(
                db,
                TupleBinding.getPrimitiveBinding(Long.class),
                valueBinding,
                true);
    }

    @Override
    public Iterator<E> iterator() {
        return queueMap.values().iterator();
    }

    @Override
    public int size() {
        try {
            return Math.max(0,
                    (int)(tailIndex.get()
                          - queueMap.firstKey()));
        } catch (IllegalStateException ise) {
            return 0;
        } catch (NoSuchElementException nse) {
            return 0;
        } catch (NullPointerException npe) {
            return 0;
        }
    }

    @Override
    public boolean isEmpty() {
        if(peekItem!=null) {
            return false;
        }
        try {
            return queueMap.isEmpty();
        } catch (IllegalStateException de) {
            return true;
        }
    }

    public boolean offer(E o) {
        long targetIndex = tailIndex.getAndIncrement();
        queueMap.put(targetIndex, o);
        return true;
    }

    public synchronized E peek() {
        if(peekItem == null) {
            if(queueMap.isEmpty()) {
                return null;
            }
            peekItem = queueMap.remove(queueMap.firstKey());
        }
        return peekItem;
    }

    public synchronized E poll() {
        E head = peek();
        peekItem = null;
        return head;
    }

    /**
     * A suitable DatabaseConfig for the Database backing a StoredQueue.
     * (However, it is not necessary to use these config options.)
     *
     * @return DatabaseConfig suitable for queue
     */
    public static BdbModule.BdbConfig databaseConfig() {
        BdbModule.BdbConfig dbConfig = new BdbModule.BdbConfig();
        dbConfig.setTransactional(false);
        dbConfig.setAllowCreate(true);
        return dbConfig;
    }

    public void close() {
        try {
            queueDb.sync();
            queueDb.close();
        } catch (DatabaseException e) {
            throw new RuntimeException(e);
        }
    }
}
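je provides the StoredSortedMap<Long,E> class for working with the data in a BDB database as if it were a sorted map. At this point we can think of a StoredQueue as a queue whose elements live in a BDB database (wrapped by a StoredSortedMap): offer() writes the element under the next tailIndex key, and the head of the queue is always the entry with the smallest key.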
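The "queue on top of a sorted map" idea works with any sorted map, so it can be tried with a plain in-memory TreeMap. The sketch below is an assumption-laden simplification, not the Heritrix class: MapQueueSketch is a made-up name, nothing is persisted, and peek() leaves the head in the map instead of removing it and caching it in peekItem as StoredQueue does.

```java
import java.util.AbstractQueue;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

class MapQueueSketch<E> extends AbstractQueue<E> {
    // In-memory stand-in for StoredSortedMap<Long,E>; keys are insertion indexes.
    private final TreeMap<Long, E> map = new TreeMap<>();
    private final AtomicLong tailIndex = new AtomicLong(0); // next spot for insert

    @Override
    public synchronized boolean offer(E e) {
        map.put(tailIndex.getAndIncrement(), e);
        return true;
    }

    @Override
    public synchronized E peek() {
        Map.Entry<Long, E> head = map.firstEntry();
        return head == null ? null : head.getValue();
    }

    @Override
    public synchronized E poll() {
        // smallest key == oldest element, so this is FIFO order
        Map.Entry<Long, E> head = map.pollFirstEntry();
        return head == null ? null : head.getValue();
    }

    @Override
    public synchronized Iterator<E> iterator() {
        return map.values().iterator();
    }

    @Override
    public synchronized int size() {
        return map.size();
    }

    public static void main(String[] args) {
        MapQueueSketch<String> queue = new MapQueueSketch<>();
        queue.offer("a");
        queue.offer("b");
        queue.offer("c");
        System.out.println(queue.poll()); // prints a
        System.out.println(queue.size()); // prints 2
    }
}
```

Because the keys only ever grow, the real StoredQueue can rebuild tailIndex after a restart from lastKey()+1, which is exactly what its constructor does.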
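The next part is cache management. It manages caches of objects that implement the IdentityCacheable interface; BdbWorkQueue, for example, implements that interface indirectly, which is how work-queue objects end up being cached. The ObjectIdentityBdbManualCache itself is also backed by a BDB database.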
/**
 * Get an ObjectIdentityBdbCache, backed by a BDB Database of the
 * given name, with the given value class type. If 'recycle' is true,
 * reuse values already in the database; otherwise start with an
 * empty cache.
 *
 * @param <V>
 * @param dbName
 * @param recycle
 * @param valueClass
 * @return
 * @throws DatabaseException
 */
public <V extends IdentityCacheable> ObjectIdentityBdbManualCache<V> getOIBCCache(String dbName, boolean recycle,
        Class<? extends V> valueClass)
        throws DatabaseException {
    if (!recycle) {
        try {
            bdbEnvironment.truncateDatabase(null, dbName, false);
        } catch (DatabaseNotFoundException e) {
            // ignored
        }
    }
    ObjectIdentityBdbManualCache<V> oic = new ObjectIdentityBdbManualCache<V>();
    oic.initialize(bdbEnvironment, dbName, valueClass, classCatalog);
    oiCaches.put(dbName, oic);
    return oic;
}

public <V extends IdentityCacheable> ObjectIdentityCache<V> getObjectCache(String dbName, boolean recycle,
        Class<V> valueClass)
        throws DatabaseException {
    return getObjectCache(dbName, recycle, valueClass, valueClass);
}

/**
 * Get an ObjectIdentityCache, backed by a BDB Database of the given
 * name, with objects of the given valueClass type. If 'recycle' is
 * true, reuse values already in the database; otherwise start with
 * an empty cache.
 *
 * @param <V>
 * @param dbName
 * @param recycle
 * @param valueClass
 * @return
 * @throws DatabaseException
 */
public <V extends IdentityCacheable> ObjectIdentityCache<V> getObjectCache(String dbName, boolean recycle,
        Class<V> declaredClass, Class<? extends V> valueClass)
        throws DatabaseException {
    @SuppressWarnings("unchecked")
    ObjectIdentityCache<V> oic = oiCaches.get(dbName);
    if(oic!=null) {
        return oic;
    }
    oic = getOIBCCache(dbName, recycle, valueClass);
    return oic;
}
After that comes the code that writes a checkpoint and recovers from one:
public void doCheckpoint(Checkpoint checkpointInProgress)
        throws IOException {
    // First sync objectCaches
    for (@SuppressWarnings("rawtypes") ObjectIdentityCache oic : oiCaches.values()) {
        oic.sync();
    }

    try {
        // sync all databases
        for (DatabasePlusConfig dbc: databases.values()) {
            dbc.database.sync();
        }

        // Do a force checkpoint. Thats what a sync does (i.e. doSync).
        CheckpointConfig chkptConfig = new CheckpointConfig();
        chkptConfig.setForce(true);

        // Mark Hayes of sleepycat says:
        // "The default for this property is false, which gives the current
        // behavior (allow deltas). If this property is true, deltas are
        // prohibited -- full versions of internal nodes are always logged
        // during the checkpoint. When a full version of an internal node
        // is logged during a checkpoint, recovery does not need to process
        // it at all. It is only fetched if needed by the application,
        // during normal DB operations after recovery. When a delta of an
        // internal node is logged during a checkpoint, recovery must
        // process it by fetching the full version of the node from earlier
        // in the log, and then applying the delta to it. This can be
        // pretty slow, since it is potentially a large amount of
        // random I/O."
        // chkptConfig.setMinimizeRecoveryTime(true);
        bdbEnvironment.checkpoint(chkptConfig);
        LOGGER.fine("Finished bdb checkpoint.");

        DbBackup dbBackup = new DbBackup(bdbEnvironment);
        try {
            dbBackup.startBackup();

            File envCpDir = new File(dir.getFile(),checkpointInProgress.getName());
            org.archive.util.FileUtils.ensureWriteableDirectory(envCpDir);
            File logfilesList = new File(envCpDir,"jdbfiles.manifest");
            String[] filedata = dbBackup.getLogFilesInBackupSet();
            for (int i=0; i<filedata.length;i++) {
                File f = new File(dir.getFile(),filedata[i]);
                filedata[i] += ","+f.length();
                if(getUseHardLinkCheckpoints()) {
                    File hardLink = new File(envCpDir,filedata[i]);
                    if (!FilesystemLinkMaker.makeHardLink(f.getAbsolutePath(), hardLink.getAbsolutePath())) {
                        LOGGER.log(Level.SEVERE, "unable to create required checkpoint link "+hardLink);
                    }
                }
            }
            FileUtils.writeLines(logfilesList,Arrays.asList(filedata));
            LOGGER.fine("Finished processing bdb log files.");
        } finally {
            dbBackup.endBackup();
        }
    } catch (DatabaseException e) {
        throw new IOException(e);
    }
}

@SuppressWarnings("unchecked")
protected void doRecover() throws IOException {
    File cpDir = new File(dir.getFile(),recoveryCheckpoint.getName());
    File logfilesList = new File(cpDir,"jdbfiles.manifest");
    List<String> filesAndLengths = FileUtils.readLines(logfilesList);
    HashMap<String,Long> retainLogfiles = new HashMap<String,Long>();
    for(String line : filesAndLengths) {
        String[] fileAndLength = line.split(",");
        long expectedLength = Long.valueOf(fileAndLength[1]);
        retainLogfiles.put(fileAndLength[0],expectedLength);

        // check for files in checkpoint directory; relink to environment as necessary
        File cpFile = new File(cpDir, line);
        File destFile = new File(dir.getFile(), fileAndLength[0]);
        if(cpFile.exists()) {
            if(cpFile.length()!=expectedLength) {
                LOGGER.warning(cpFile.getName()+" expected "+expectedLength+" actual "+cpFile.length());
                // TODO: is truncation necessary?
            }
            if(destFile.exists()) {
                if(!destFile.delete()) {
                    LOGGER.log(Level.SEVERE, "unable to delete obstructing file "+destFile);
                }
            }
            int status = CLibrary.INSTANCE.link(cpFile.getAbsolutePath(), destFile.getAbsolutePath());
            if (status!=0) {
                LOGGER.log(Level.SEVERE, "unable to create required restore link "+destFile);
            }
        }
    }

    IOFileFilter filter = FileFilterUtils.orFileFilter(
            FileFilterUtils.suffixFileFilter(".jdb"),
            FileFilterUtils.suffixFileFilter(".del"));
    filter = FileFilterUtils.makeFileOnly(filter);

    // reverify environment directory is as it was at checkpoint time,
    // deleting any extra files
    for(File f : dir.getFile().listFiles((FileFilter)filter)) {
        if(retainLogfiles.containsKey(f.getName())) {
            // named file still exists under original name
            long expectedLength = retainLogfiles.get(f.getName());
            if(f.length()!=expectedLength) {
                LOGGER.warning(f.getName()+" expected "+expectedLength+" actual "+f.length());
                // TODO: truncate? this unexpected length mismatch
                // probably only happens if there was already a recovery
                // where the affected file was the last of the set, in
                // which case BDB appends a small amount of (harmless?) data
                // to the previously-undersized file
            }
            retainLogfiles.remove(f.getName());
            continue;
        }
        // file as now-named not in restore set; check if un-".DEL" renaming needed
        String undelName = f.getName().replace(".del", ".jdb");
        if(retainLogfiles.containsKey(undelName)) {
            // file if renamed matches desired file name
            long expectedLength = retainLogfiles.get(undelName);
            if(f.length()!=expectedLength) {
                LOGGER.warning(f.getName()+" expected "+expectedLength+" actual "+f.length());
                // TODO: truncate to expected size?
            }
            if(!f.renameTo(new File(f.getParentFile(),undelName))) {
                throw new IOException("Unable to rename " + f + " to " + undelName);
            }
            retainLogfiles.remove(undelName);
        }
        // file not needed; delete/move-aside
        if(!f.delete()) {
            LOGGER.warning("unable to delete "+f);
            org.archive.util.FileUtils.moveAsideIfExists(f);
        }
        // TODO: log/warn of ruined later checkpoints?
    }
    if(retainLogfiles.size()>0) {
        // some needed files weren't present
        LOGGER.severe("Checkpoint corrupt, needed log files missing: "+retainLogfiles);
    }
}
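The piece that ties doCheckpoint() and doRecover() together is the jdbfiles.manifest file: one "fileName,length" line per log file. Here is a minimal JDK-only sketch of that format and of the length check done during recovery. ManifestSketch and its method names are made up for this example, the file names are invented sample data, and no files or hard links are touched.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ManifestSketch {
    /** One "fileName,length" line per log file, the shape doCheckpoint() writes. */
    static List<String> toLines(Map<String, Long> fileLengths) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : fileLengths.entrySet()) {
            lines.add(e.getKey() + "," + e.getValue());
        }
        return lines;
    }

    /** Parse the lines back into name -> expected length, as doRecover() does. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> expected = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fileAndLength = line.split(",");
            expected.put(fileAndLength[0], Long.parseLong(fileAndLength[1]));
        }
        return expected;
    }

    /** Names that are missing from 'actual' or whose length differs from the manifest. */
    static List<String> mismatches(Map<String, Long> expected, Map<String, Long> actual) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Long> e : expected.entrySet()) {
            if (!e.getValue().equals(actual.get(e.getKey()))) {
                bad.add(e.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Map<String, Long> atCheckpoint = new LinkedHashMap<>();
        atCheckpoint.put("00000000.jdb", 1000L); // invented sample data
        atCheckpoint.put("00000001.jdb", 2048L);

        List<String> manifest = toLines(atCheckpoint);
        System.out.println(manifest);

        Map<String, Long> onDisk = new LinkedHashMap<>();
        onDisk.put("00000000.jdb", 1000L);
        System.out.println(mismatches(parse(manifest), onDisk));
    }
}
```

In the real doRecover() a length mismatch only produces a warning, while a file that is missing altogether ends up in the final "Checkpoint corrupt" message.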
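Finally there is the getStoredMap(String dbName, Class<K> keyClass, Class<V> valueClass, boolean allowDuplicates, boolean usePriorData) method. It creates a temporary DisposableStoredSortedMap<K,V>, which extends je's StoredSortedMap; think of it as a temporary map stored in a BDB database and wrapped by StoredSortedMap. The Class<K> keyClass and Class<V> valueClass parameters give the key and value types, and calling dispose() on the returned map also removes its entry from the databases map.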
/**
 * Creates a database-backed TempStoredSortedMap for transient
 * reporting requirements. Calling the returned map's destroy()
 * method when done discards the associated Database.
 *
 * @param <K>
 * @param <V>
 * @param dbName Database name to use; if null a name will be synthesized
 * @param keyClass Class of keys; should be a Java primitive type
 * @param valueClass Class of values; may be any serializable type
 * @param allowDuplicates whether duplicate keys allowed
 * @return
 */
public <K,V> DisposableStoredSortedMap<K, V> getStoredMap(String dbName, Class<K> keyClass, Class<V> valueClass, boolean allowDuplicates, boolean usePriorData) {
    BdbConfig config = new BdbConfig();
    config.setSortedDuplicates(allowDuplicates);
    config.setAllowCreate(!usePriorData);
    Database mapDb;
    if(dbName==null) {
        dbName = "tempMap-"+System.identityHashCode(this)+"-"+sn;
        sn++;
    }
    final String openName = dbName;
    try {
        mapDb = openDatabase(openName,config,usePriorData);
    } catch (DatabaseException e) {
        throw new RuntimeException(e);
    }
    EntryBinding<V> valueBinding = TupleBinding.getPrimitiveBinding(valueClass);
    if(valueBinding == null) {
        valueBinding = new SerialBinding<V>(classCatalog, valueClass);
    }
    DisposableStoredSortedMap<K,V> storedMap = new DisposableStoredSortedMap<K, V>(
            mapDb,
            TupleBinding.getPrimitiveBinding(keyClass),
            valueBinding,
            true) {
        @Override
        public void dispose() {
            super.dispose();
            DatabasePlusConfig dpc = BdbModule.this.databases.remove(openName);
            if (dpc == null) {
                BdbModule.LOGGER.log(Level.WARNING,"No such database: " + openName);
            }
        }
    };
    return storedMap;
}
This analysis still leaves many questions open; we will pick them up in later posts.
---------------------------------------------------------------------------
This Heritrix 3.1.0 source code analysis series is my own original work.
When reposting, please credit the source: 博客园, 刺猬的温驯
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/14/3019757.html