A Persistent Queue Built on Berkeley DB

This is an original article. Reposting is welcome, but please credit the source: http://guoyunsky.iteye.com/blog/1169912

This blog has moved to my own site: http://www.yun5u.com/

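Queues are everywhere, but most of them keep their data in memory. With too much data there is a risk of running out of memory, and data that sits in memory for a long time also hurts performance. A crawler is a typical case: the URLs waiting to be fetched are kept in memory, and once there are too many of them, memory is exhausted. While reading the Heritrix source code I found that Heritrix implements a persistent queue on top of BDB, so I pulled that code out into standalone classes. It has served me well in everyday use, and I am sharing it here. Because the data is persisted, it is not thrown away after a single run the way in-memory data is, and it can keep accumulating across runs.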

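BDB is well known for being fast and embeddable. Still, I think this persistent queue is best suited to a single machine. It is not a good fit for a distributed setup, which needs communication, load balancing, redundancy and so on; another database can be used there instead.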

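Here is a rough outline of how it works. BDB is a key-value database, and a queue is FIFO. So the persistent queue uses the position as the BDB key and the data as the BDB value, and two variables record the positions of the two ends of the queue, the head and the tail. When an item is inserted, the tail position becomes its key. When an item is taken out, the head position is used as the key to fetch the data. That is the basic idea. The class also extends AbstractQueue. The code follows; it depends on bdb-je, commons-io and junit, which can be downloaded from the attachment.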
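The scheme described above can be tried out without Berkeley DB at all. What follows is only a minimal sketch, assuming an in-memory TreeMap as a stand-in for the BDB key-value store; the class name PositionKeyedQueue is hypothetical and is not part of the original code.

```java
import java.util.AbstractQueue;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a FIFO queue whose elements live in a key-value store
// keyed by position. A TreeMap stands in for the BDB database here.
class PositionKeyedQueue<E> extends AbstractQueue<E> {
    private final Map<Long, E> store = new TreeMap<>(); // position -> element
    private long head = 0; // position of the next element to read
    private long tail = 0; // position the next inserted element will get

    @Override
    public synchronized boolean offer(E e) {
        if (e == null) throw new NullPointerException();
        store.put(tail++, e); // the tail position becomes the key
        return true;
    }

    @Override
    public synchronized E peek() {
        // the head position is the key of the oldest element
        return head < tail ? store.get(head) : null;
    }

    @Override
    public synchronized E poll() {
        if (head >= tail) return null;
        return store.remove(head++); // fetch by head position, then advance
    }

    @Override
    public synchronized int size() {
        return (int) (tail - head);
    }

    @Override
    public synchronized Iterator<E> iterator() {
        return store.values().iterator(); // TreeMap iterates in key order
    }

    public static void main(String[] args) {
        PositionKeyedQueue<String> q = new PositionKeyedQueue<>();
        q.offer("a");
        q.offer("b");
        q.offer("c");
        System.out.println(q.poll()); // a
        System.out.println(q.peek()); // b
        System.out.println(q.size()); // 2
    }
}
```

Swapping the TreeMap for a map backed by a database is what turns this into a persistent queue; the head and tail bookkeeping stays the same.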
1. A custom BDB environment class that caches and shares the StoredClassCatalog
Java code
package com.guoyun.util;

import java.io.File;

import com.sleepycat.bind.serial.StoredClassCatalog;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

/**
 * BDB database environment that caches a StoredClassCatalog so it can be shared.
 *
 * @contributor guoyun
 */
public class BdbEnvironment extends Environment {
    StoredClassCatalog classCatalog;
    Database classCatalogDB;

    /**
     * Constructor
     *
     * @param envHome   database environment directory
     * @param envConfig database environment configuration
     * @throws DatabaseException
     */
    public BdbEnvironment(File envHome, EnvironmentConfig envConfig) throws DatabaseException {
        super(envHome, envConfig);
    }

    /**
     * Returns the StoredClassCatalog, opening it on first use.
     *
     * @return the cached class catalog
     */
    public StoredClassCatalog getClassCatalog() {
        if (classCatalog == null) {
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            try {
                classCatalogDB = openDatabase(null, "classCatalog", dbConfig);
                classCatalog = new StoredClassCatalog(classCatalogDB);
            } catch (DatabaseException e) {
                throw new RuntimeException(e);
            }
        }
        return classCatalog;
    }

    @Override
    public synchronized void close() throws DatabaseException {
        // close the catalog database before closing the environment itself
        if (classCatalogDB != null) {
            classCatalogDB.close();
        }
        super.close();
    }

}
2. The persistent queue built on BDB
Java code
package com.guoyun.util;

import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.util.AbstractQueue;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.commons.io.FileUtils;

import com.sleepycat.bind.EntryBinding;
import com.sleepycat.bind.serial.SerialBinding;
import com.sleepycat.bind.serial.StoredClassCatalog;
import com.sleepycat.bind.tuple.TupleBinding;
import com.sleepycat.collections.StoredMap;
import com.sleepycat.collections.StoredSortedMap;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.DatabaseExistsException;
import com.sleepycat.je.DatabaseNotFoundException;
import com.sleepycat.je.EnvironmentConfig;

/**
 * A persistent queue backed by BDB. It extends AbstractQueue and is Serializable.
 * Unlike an ordinary Queue, it must be closed once it is no longer needed.
 * Compared with an in-memory Queue, inserting and fetching values takes somewhat longer.
 * It extends AbstractQueue rather than implementing the Queue interface directly because
 * only a few methods such as offer, peek and poll need to be implemented;
 * AbstractQueue builds the others, such as remove and addAll, on top of them.
 *
 * @contributor guoyun
 * @param <E> type of the queued values
 */
public class BdbPersistentQueue<E extends Serializable> extends AbstractQueue<E> implements
        Serializable {
    private static final long serialVersionUID = 3427799316155220967L;
    private transient BdbEnvironment dbEnv;        // database environment, not serialized
    private transient Database queueDb;            // database that stores the values and makes the queue persistent, not serialized
    private transient StoredMap<Long,E> queueMap;  // persistent map: key is the position, value is the item, not serialized
    private transient String dbDir;                // directory of the database
    private transient String dbName;               // name of the database
    private AtomicLong headIndex;                  // head position
    private AtomicLong tailIndex;                  // tail position
    private transient E peekItem = null;           // item most recently returned by peek

    /**
     * Constructor that takes an existing BDB database.
     *
     * @param db           database that stores the queue
     * @param valueClass   class of the queued values
     * @param classCatalog class catalog used for serial binding
     */
    public BdbPersistentQueue(Database db, Class<E> valueClass, StoredClassCatalog classCatalog) {
        this.queueDb = db;
        this.dbName = db.getDatabaseName();
        headIndex = new AtomicLong(0);
        tailIndex = new AtomicLong(0);
        bindDatabase(queueDb, valueClass, classCatalog);
    }

    /**
     * Constructor that takes the database location and name and creates the database itself.
     *
     * @param dbDir      directory of the database
     * @param dbName     name of the database
     * @param valueClass class of the queued values
     */
    public BdbPersistentQueue(String dbDir, String dbName, Class<E> valueClass) {
        headIndex = new AtomicLong(0);
        tailIndex = new AtomicLong(0);
        this.dbDir = dbDir;
        this.dbName = dbName;
        createAndBindDatabase(dbDir, dbName, valueClass);
    }

    /**
     * Binds the database.
     *
     * @param db           database that stores the queue
     * @param valueClass   class of the queued values
     * @param classCatalog class catalog used for serial binding
     */
    public void bindDatabase(Database db, Class<E> valueClass, StoredClassCatalog classCatalog) {
        // use a tuple binding when one exists for the value class
        EntryBinding<E> valueBinding = TupleBinding.getPrimitiveBinding(valueClass);
        if (valueBinding == null) {
            valueBinding = new SerialBinding<E>(classCatalog, valueClass); // otherwise fall back to serial binding
        }
        queueDb = db;
        queueMap = new StoredSortedMap<Long,E>(
                db,                                            // db
                TupleBinding.getPrimitiveBinding(Long.class),  // key
                valueBinding,                                  // value
                true);                                         // allow write
    }

    /**
     * Creates the environment and the database, then binds the database.
     *
     * @param dbDir      directory of the database
     * @param dbName     name of the database
     * @param valueClass class of the queued values
     * @throws DatabaseNotFoundException
     * @throws DatabaseExistsException
     * @throws DatabaseException
     * @throws IllegalArgumentException
     */
    private void createAndBindDatabase(String dbDir, String dbName, Class<E> valueClass)
            throws DatabaseNotFoundException, DatabaseExistsException, DatabaseException,
            IllegalArgumentException {
        // database location
        File envFile = new File(dbDir);

        // environment configuration
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(false);

        // database configuration
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(false);
        dbConfig.setDeferredWrite(true);

        // create the environment
        dbEnv = new BdbEnvironment(envFile, envConfig);
        // open the database
        Database db = dbEnv.openDatabase(null, dbName, dbConfig);
        // bind the database
        bindDatabase(db, valueClass, dbEnv.getClassCatalog());
    }

    /**
     * Iterator over the values.
     */
    @Override
    public Iterator<E> iterator() {
        return queueMap.values().iterator();
    }

    /**
     * Size of the queue.
     */
    @Override
    public int size() {
        synchronized (tailIndex) {
            synchronized (headIndex) {
                return (int) (tailIndex.get() - headIndex.get());
            }
        }
    }

    /**
     * Inserts a value.
     */
    @Override
    public boolean offer(E e) {
        synchronized (tailIndex) {
            queueMap.put(tailIndex.getAndIncrement(), e); // insert at the tail
        }
        return true;
    }

    /**
     * Returns the value at the head without removing it.
     */
    @Override
    public E peek() {
        synchronized (headIndex) {
            if (peekItem != null) {
                return peekItem;
            }
            E headItem = null;
            while (headItem == null && headIndex.get() < tailIndex.get()) { // still within range
                headItem = queueMap.get(headIndex.get());
                if (headItem != null) {
                    peekItem = headItem;
                    continue;
                }
                headIndex.incrementAndGet(); // nothing stored here, advance the head position
            }
            return headItem;
        }
    }

    /**
     * Removes and returns the element at the head.
     */
    @Override
    public E poll() {
        synchronized (headIndex) {
            E headItem = peek();
            if (headItem != null) {
                queueMap.remove(headIndex.getAndIncrement());
                peekItem = null;
                return headItem;
            }
        }
        return null;
    }

    /**
     * Closes the BDB database in use, but not the database environment.
     */
    public void close() {
        try {
            if (queueDb != null) {
                queueDb.sync();
                queueDb.close();
            }
        } catch (DatabaseException e) {
            e.printStackTrace();
        } catch (UnsupportedOperationException e) {
            e.printStackTrace();
        }
    }

    /**
     * Clears the queue: empties the database and deletes the directory it lives in.
     * Use with care. To keep the data, call close() instead.
     */
    @Override
    public void clear() {
        try {
            close();
            if (dbEnv != null && queueDb != null) {
                dbEnv.removeDatabase(null, dbName == null ? queueDb.getDatabaseName() : dbName);
                dbEnv.close();
            }
        } catch (DatabaseNotFoundException e) {
            e.printStackTrace();
        } catch (DatabaseException e) {
            e.printStackTrace();
        } finally {
            try {
                if (this.dbDir != null) {
                    FileUtils.deleteDirectory(new File(this.dbDir));
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

}
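The class comment says that AbstractQueue derives methods such as remove and addAll from offer, peek and poll. As a small illustration of that point, here is a sketch with a hypothetical TinyQueue backed by an ArrayDeque (not part of the original code): only the abstract methods are written by hand, and everything else is inherited.

```java
import java.util.AbstractQueue;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: only offer, peek, poll, size and iterator are written
// by hand; add, addAll, remove and element are inherited from AbstractQueue.
class TinyQueue<E> extends AbstractQueue<E> {
    private final ArrayDeque<E> items = new ArrayDeque<>();

    @Override public boolean offer(E e) { return items.offerLast(e); }
    @Override public E peek() { return items.peekFirst(); }
    @Override public E poll() { return items.pollFirst(); }
    @Override public int size() { return items.size(); }
    @Override public Iterator<E> iterator() { return items.iterator(); }

    public static void main(String[] args) {
        TinyQueue<String> q = new TinyQueue<>();
        q.add("x");                        // AbstractQueue.add delegates to offer
        q.addAll(Arrays.asList("y", "z")); // AbstractQueue.addAll delegates to add
        System.out.println(q.element());   // AbstractQueue.element delegates to peek
        System.out.println(q.remove());    // AbstractQueue.remove delegates to poll
        q.remove();
        q.remove();
        try {
            q.remove();                    // poll() returned null, so remove() throws
        } catch (NoSuchElementException expected) {
            System.out.println("empty");
        }
    }
}
```

The inherited remove() throws NoSuchElementException on an empty queue, which is the behavior the drain helper in the test class below relies on to know when to stop.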
3. Test class, checking data correctness and performance
Java code
package com.guoyun.util;

import java.io.File;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

import junit.framework.TestCase;

public class BdbPersistentQueueTest extends TestCase {
    Queue<String> memoryQueue;
    Queue<String> persistentQueue;

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        memoryQueue = new LinkedBlockingQueue<String>();
        String dbDir = "E:/java/test/bdbDir";
        File file = new File(dbDir);
        if (!file.exists() || !file.isDirectory()) {
            file.mkdirs();
        }
        persistentQueue = new BdbPersistentQueue<String>(dbDir, "pq", String.class);
    }

    @Override
    protected void tearDown() throws Exception {
        super.tearDown();
        memoryQueue.clear();
        memoryQueue = null;
        persistentQueue.clear();
        persistentQueue = null;
    }

    /**
     * Drains the queue.
     *
     * @param queue the queue to drain
     * @return number of items drained
     */
    public int drain(Queue<String> queue) {
        int count = 0;
        while (true) {
            try {
                queue.remove();
                count++;
            } catch (NoSuchElementException e) {
                // remove() throws once the queue is empty
                return count;
            }
        }
    }

    /**
     * Fills the queue with the strings "0" to size-1.
     *
     * @param queue the queue to fill
     * @param size  number of items to insert
     */
    public void fill(Queue<String> queue, int size) {
        for (int i = 0; i < size; i++) {
            queue.add(i + "");
        }
    }

    public void checkTime(int size) {
        System.out.println("1. Time to fill and drain the in-memory Queue");
        long time = 0;
        long start = System.nanoTime();
        fill(memoryQueue, size);
        time = System.nanoTime() - start;
        System.out.println(" Filled " + size + " items in " + (double) time / 1000000 + " ms, per item: " + ((double) time / size) + " ns");
        start = System.nanoTime();
        drain(memoryQueue);
        time = System.nanoTime() - start;
        System.out.println(" Drained " + size + " items in " + (double) time / 1000000 + " ms, per item: " + ((double) time / size) + " ns");

        System.out.println("2. Time to fill and drain the persistent Queue");
        start = System.nanoTime();
        fill(persistentQueue, size);
        time = System.nanoTime() - start;
        // nanoseconds / 1,000,000 = milliseconds
        System.out.println(" Filled " + size + " items in " + (double) time / 1000000 + " ms, per item: " + ((double) time / size / 1000000) + " ms");
        start = System.nanoTime();
        drain(persistentQueue);
        time = System.nanoTime() - start;
        // nanoseconds / 1,000 = microseconds, so this figure is labeled in µs, not ms
        System.out.println(" Drained " + size + " items in " + (double) time / 1000000 + " ms, per item: " + ((double) time / size / 1000) + " µs");
    }

    /**
     * Test at the hundred-thousand scale.
     */
    public void testTime_hundredThousand() {
        System.out.println("======== Testing 100000 (one hundred thousand) items =================");
        checkTime(100000);
    }

    /**
     * Test at the million scale.
     */
    public void testTime_mil() {
        System.out.println("======== Testing 1000000 (one million) items =================");
        checkTime(1000000);
    }

    /**
     * Test at the ten-million scale. Take care to avoid running out of memory.
     */
    public void testTime_tenMil() {
        System.out.println("======== Testing 10000000 (ten million) items =================");
        checkTime(10000000);
    }

    /**
     * Checks that the queue holds exactly the expected data, in order.
     *
     * @param queue     the queue to check
     * @param queueName name used in failure messages
     * @param size      expected number of items
     */
    public void checkDataExact(Queue<String> queue, String queueName, int size) {
        assertEquals("Error size of " + queueName, size, queue.size());
        String value = null;
        for (int i = 0; i < size; i++) {
            value = queue.remove();
            assertEquals("Error " + queueName + ":" + i, i + "", value);
        }
    }

    /**
     * Tests that the data in both queues is correct, including the length.
     */
    public void testExact() {
        int size = 100;
        fill(memoryQueue, size);
        fill(persistentQueue, size);

        checkDataExact(memoryQueue, "MemoryQueue", 100);
        checkDataExact(persistentQueue, "PersistentQueue", 100);
    }

}
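The per-item figures printed by checkTime all come from dividing an elapsed time in nanoseconds, so the unit depends on the divisor: dividing by 1,000 gives microseconds and dividing by 1,000,000 gives milliseconds. The following sketch makes that arithmetic explicit; QueueTiming and its method names are hypothetical helpers, not part of the original test, and an ArrayDeque is used only so the example runs without BDB.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical helper: converts a total elapsed time in nanoseconds
// into per-item figures with explicit units.
class QueueTiming {
    static double nanosPerItem(long totalNanos, int items) {
        return (double) totalNanos / items;
    }

    static double microsPerItem(long totalNanos, int items) {
        return nanosPerItem(totalNanos, items) / 1_000.0;
    }

    static double millisPerItem(long totalNanos, int items) {
        return nanosPerItem(totalNanos, items) / 1_000_000.0;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        int size = 100_000;

        long start = System.nanoTime();
        for (int i = 0; i < size; i++) {
            queue.add(Integer.toString(i));
        }
        long fillNanos = System.nanoTime() - start;

        start = System.nanoTime();
        while (queue.poll() != null) {
            // drain until the queue is empty
        }
        long drainNanos = System.nanoTime() - start;

        System.out.printf("fill:  %.3f us per item%n", microsPerItem(fillNanos, size));
        System.out.printf("drain: %.3f us per item%n", microsPerItem(drainNanos, size));
    }
}
```

For example, a drain of 100,000 items that takes 2,104,765,179 ns works out to 21.04765179 microseconds per item, which is 0.02104765179 milliseconds.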
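4. Performance results

======== Testing 100000 (one hundred thousand) items =================
1. Time to fill and drain the in-memory Queue
Filled 100000 items in 53.550787 ms, per item: 535.50787 ns
Drained 100000 items in 27.09901 ms, per item: 270.9901 ns
2. Time to fill and drain the persistent Queue
Filled 100000 items in 1399.644305 ms, per item: 0.01399644305 ms
Drained 100000 items in 2104.765179 ms, per item: 21.04765179 µs

Persistent writes take about 26 times as long as in-memory writes, and reads about 78 times as long.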

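======== Testing 1000000 (one million) items =================
1. Time to fill and drain the in-memory Queue
Filled 1000000 items in 699.105888 ms, per item: 699.105888 ns
Drained 1000000 items in 158.792281 ms, per item: 158.792281 ns
2. Time to fill and drain the persistent Queue
Filled 1000000 items in 11978.132218 ms, per item: 0.011978132218 ms
Drained 1000000 items in 22355.617205 ms, per item: 22.355617204999998 µs

Persistent writes take about 17 times as long as in-memory writes, and reads about 141 times as long.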

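======== Testing 10000000 (ten million) items =================
1. Time to fill and drain the in-memory Queue
Filled 10000000 items in 9678.377046 ms, per item: 967.8377046 ns
Drained 10000000 items in 1473.416825 ms, per item: 147.3416825 ns
2. Time to fill and drain the persistent Queue
Filled 10000000 items in 151177.036391 ms, per item: 0.0151177036391 ms
Drained 10000000 items in 361642.655135 ms, per item: 36.164265513500006 µs

Persistent writes take about 16 times as long as in-memory writes, and reads about 245 times as long.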

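As the numbers show, writing or reading a single item takes well under a millisecond (roughly 12 to 36 microseconds), even with ten million items, so BDB's performance is genuinely impressive. As the data grows, the write cost relative to the in-memory queue shrinks (from about 26 times to about 16 times), while the read cost grows (from about 78 times to about 245 times).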
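The slowdown factors quoted with the results can be rechecked from the total times in the output. Here is a small sketch of that arithmetic, using the millisecond totals copied from the three runs; the class name SlowdownRatios is hypothetical.

```java
// Hypothetical sketch: recomputes how many times slower the persistent queue
// was than the in-memory queue, from the total times reported in the results.
class SlowdownRatios {
    static double ratio(double persistentMs, double memoryMs) {
        return persistentMs / memoryMs;
    }

    public static void main(String[] args) {
        // each row: memory fill, memory drain, persistent fill, persistent drain (ms)
        double[][] runs = {
            {53.550787, 27.09901, 1399.644305, 2104.765179},          // 100,000 items
            {699.105888, 158.792281, 11978.132218, 22355.617205},     // 1,000,000 items
            {9678.377046, 1473.416825, 151177.036391, 361642.655135}  // 10,000,000 items
        };
        for (double[] r : runs) {
            System.out.printf("write x%.1f, read x%.1f%n",
                    ratio(r[2], r[0]), ratio(r[3], r[1]));
        }
    }
}
```

Each ratio is simply the persistent total divided by the in-memory total for the same operation and the same number of items.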
Original address: https://www.cnblogs.com/zheh/p/3934344.html