First, you need a rough understanding of MongoDB's database, collection, and document concepts.
A MongoDB collection corresponds to a SQL table, and a document corresponds to a row.
The required dependency:
<!-- MongoDB Java driver (not a JDBC driver) -->
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongo-java-driver</artifactId>
    <version>3.4.3</version>
</dependency>
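Next comes how to use the API to read Documents.
With this background in place, let me describe the pitfall I ran into.
Paged queries keep getting slower!
First, 100,000 documents clearly cannot be fetched in one go and held in a List, or memory would blow up, so I planned to read them page by page. skip and limit are exactly what paging needs; the code is as follows: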
private List<Document> page(MongoCollection<Document> collection, int count, int pageSize) {
    List<Document> result = new ArrayList<>();
    long beginTime = System.currentTimeMillis();
    FindIterable<Document> documents = collection.find().skip(count).limit(pageSize);
    try (MongoCursor<Document> cursor = documents.iterator()) {
        while (cursor.hasNext()) {
            result.add(cursor.next());
        }
    }
    long duration = System.currentTimeMillis() - beginTime;
    log.info("It takes {} ms to page from {} - {}", duration, count, count + result.size() - 1);
    return result;
}
The calling code is as follows:
long total = collection.count();
log.info("The collection {} contains {} documents.", collectionName, total);
int count = 0; // number of documents already processed in this collection
boolean hasMore = true;
while (hasMore) {
    List<Document> documents = page(collection, count, 500);
    // ... process the retrieved list of Documents, e.g. insert them into the new database.
    hasMore = documents.size() == 500;
    count += documents.size();
}
However, this approach has a problem, as the log shows:
The collection test_big_data contains 100002 documents.
It takes 594 ms to page from 0 - 499
It takes 554 ms to page from 500 - 999
It takes 549 ms to page from 1000 - 1499
It takes 565 ms to page from 1500 - 1999
It takes 561 ms to page from 2000 - 2499
It takes 583 ms to page from 2500 - 2999
It takes 583 ms to page from 3000 - 3499
It takes 596 ms to page from 3500 - 3999
It takes 595 ms to page from 4000 - 4499
It takes 615 ms to page from 4500 - 4999
It takes 614 ms to page from 5000 - 5499
It takes 632 ms to page from 5500 - 5999
It takes 653 ms to page from 6000 - 6499
It takes 653 ms to page from 6500 - 6999
It takes 645 ms to page from 7000 - 7499
It takes 669 ms to page from 7500 - 7999
It takes 685 ms to page from 8000 - 8499
It takes 671 ms to page from 8500 - 8999
It takes 695 ms to page from 9000 - 9499
It takes 706 ms to page from 9500 - 9999
It takes 692 ms to page from 10000 - 10499
It takes 719 ms to page from 10500 - 10999
It takes 709 ms to page from 11000 - 11499
It takes 722 ms to page from 11500 - 11999
It takes 739 ms to page from 12000 - 12499
It takes 749 ms to page from 12500 - 12999
It takes 768 ms to page from 13000 - 13499
It takes 755 ms to page from 13500 - 13999
It takes 770 ms to page from 14000 - 14499
It takes 795 ms to page from 14500 - 14999
It takes 797 ms to page from 15000 - 15499
It takes 836 ms to page from 15500 - 15999
It takes 809 ms to page from 16000 - 16499
It takes 831 ms to page from 16500 - 16999
It takes 843 ms to page from 17000 - 17499
It takes 875 ms to page from 17500 - 17999
It takes 910 ms to page from 18000 - 18499
It takes 872 ms to page from 18500 - 18999
It takes 937 ms to page from 19000 - 19499
It takes 898 ms to page from 19500 - 19999
It takes 913 ms to page from 20000 - 20499
It takes 926 ms to page from 20500 - 20999
It takes 966 ms to page from 21000 - 21499
It takes 970 ms to page from 21500 - 21999
It takes 957 ms to page from 22000 - 22499
It takes 989 ms to page from 22500 - 22999
It takes 1009 ms to page from 23000 - 23499
It takes 1011 ms to page from 23500 - 23999
It takes 1031 ms to page from 24000 - 24499
It takes 1038 ms to page from 24500 - 24999
It takes 1066 ms to page from 25000 - 25499
It takes 1068 ms to page from 25500 - 25999
It takes 1085 ms to page from 26000 - 26499
It takes 1123 ms to page from 26500 - 26999
It takes 1111 ms to page from 27000 - 27499
It takes 1109 ms to page from 27500 - 27999
It takes 1159 ms to page from 28000 - 28499
It takes 1134 ms to page from 28500 - 28999
It takes 1144 ms to page from 29000 - 29499
It takes 1152 ms to page from 29500 - 29999
It takes 1165 ms to page from 30000 - 30499
It takes 1179 ms to page from 30500 - 30999
It takes 1216 ms to page from 31000 - 31499
It takes 1247 ms to page from 31500 - 31999
It takes 1230 ms to page from 32000 - 32499
It takes 1250 ms to page from 32500 - 32999
It takes 1283 ms to page from 33000 - 33499
It takes 1264 ms to page from 33500 - 33999
It takes 1301 ms to page from 34000 - 34499
It takes 1251 ms to page from 34500 - 34999
It takes 1297 ms to page from 35000 - 35499
It takes 1316 ms to page from 35500 - 35999
It takes 1327 ms to page from 36000 - 36499
It takes 1348 ms to page from 36500 - 36999
It takes 1359 ms to page from 37000 - 37499
It takes 1343 ms to page from 37500 - 37999
It takes 1363 ms to page from 38000 - 38499
It takes 1402 ms to page from 38500 - 38999
It takes 1351 ms to page from 39000 - 39499
It takes 1410 ms to page from 39500 - 39999
It takes 1407 ms to page from 40000 - 40499
It takes 1400 ms to page from 40500 - 40999
It takes 1426 ms to page from 41000 - 41499
It takes 1405 ms to page from 41500 - 41999
It takes 1443 ms to page from 42000 - 42499
It takes 1474 ms to page from 42500 - 42999
It takes 1459 ms to page from 43000 - 43499
It takes 1446 ms to page from 43500 - 43999
It takes 1519 ms to page from 44000 - 44499
It takes 1537 ms to page from 44500 - 44999
It takes 1579 ms to page from 45000 - 45499
It takes 1506 ms to page from 45500 - 45999
It takes 1563 ms to page from 46000 - 46499
It takes 1572 ms to page from 46500 - 46999
It takes 1602 ms to page from 47000 - 47499
It takes 1623 ms to page from 47500 - 47999
It takes 1639 ms to page from 48000 - 48499
It takes 1633 ms to page from 48500 - 48999
It takes 1613 ms to page from 49000 - 49499
It takes 1661 ms to page from 49500 - 49999
It takes 1641 ms to page from 50000 - 50499
It takes 1677 ms to page from 50500 - 50999
It takes 1635 ms to page from 51000 - 51499
It takes 1729 ms to page from 51500 - 51999
It takes 1741 ms to page from 52000 - 52499
It takes 1700 ms to page from 52500 - 52999
It takes 1747 ms to page from 53000 - 53499
It takes 1703 ms to page from 53500 - 53999
It takes 1736 ms to page from 54000 - 54499
It takes 1725 ms to page from 54500 - 54999
It takes 1766 ms to page from 55000 - 55499
It takes 1849 ms to page from 55500 - 55999
It takes 1837 ms to page from 56000 - 56499
It takes 1836 ms to page from 56500 - 56999
It takes 1817 ms to page from 57000 - 57499
It takes 1845 ms to page from 57500 - 57999
It takes 1870 ms to page from 58000 - 58499
It takes 1857 ms to page from 58500 - 58999
It takes 1920 ms to page from 59000 - 59499
It takes 1884 ms to page from 59500 - 59999
It takes 1874 ms to page from 60000 - 60499
It takes 1876 ms to page from 60500 - 60999
It takes 1895 ms to page from 61000 - 61499
It takes 1958 ms to page from 61500 - 61999
It takes 1917 ms to page from 62000 - 62499
It takes 1914 ms to page from 62500 - 62999
It takes 1890 ms to page from 63000 - 63499
It takes 1943 ms to page from 63500 - 63999
It takes 1956 ms to page from 64000 - 64499
It takes 2021 ms to page from 64500 - 64999
It takes 1984 ms to page from 65000 - 65499
It takes 1972 ms to page from 65500 - 65999
It takes 1992 ms to page from 66000 - 66499
It takes 1959 ms to page from 66500 - 66999
It takes 1997 ms to page from 67000 - 67499
It takes 2084 ms to page from 67500 - 67999
It takes 2148 ms to page from 68000 - 68499
It takes 2159 ms to page from 68500 - 68999
It takes 2185 ms to page from 69000 - 69499
It takes 2171 ms to page from 69500 - 69999
It takes 2053 ms to page from 70000 - 70499
It takes 2109 ms to page from 70500 - 70999
It takes 2380 ms to page from 71000 - 71499
It takes 2126 ms to page from 71500 - 71999
It takes 2183 ms to page from 72000 - 72499
It takes 2186 ms to page from 72500 - 72999
It takes 2215 ms to page from 73000 - 73499
It takes 2160 ms to page from 73500 - 73999
It takes 2259 ms to page from 74000 - 74499
It takes 2178 ms to page from 74500 - 74999
It takes 2231 ms to page from 75000 - 75499
It takes 2273 ms to page from 75500 - 75999
It takes 2259 ms to page from 76000 - 76499
It takes 2323 ms to page from 76500 - 76999
It takes 2293 ms to page from 77000 - 77499
It takes 2302 ms to page from 77500 - 77999
It takes 2274 ms to page from 78000 - 78499
It takes 2379 ms to page from 78500 - 78999
It takes 2358 ms to page from 79000 - 79499
It takes 2384 ms to page from 79500 - 79999
It takes 2290 ms to page from 80000 - 80499
It takes 2324 ms to page from 80500 - 80999
It takes 2416 ms to page from 81000 - 81499
It takes 2650 ms to page from 81500 - 81999
It takes 2545 ms to page from 82000 - 82499
It takes 2468 ms to page from 82500 - 82999
It takes 2388 ms to page from 83000 - 83499
It takes 2468 ms to page from 83500 - 83999
It takes 2565 ms to page from 84000 - 84499
It takes 2492 ms to page from 84500 - 84999
It takes 2554 ms to page from 85000 - 85499
It takes 2520 ms to page from 85500 - 85999
It takes 2523 ms to page from 86000 - 86499
It takes 2585 ms to page from 86500 - 86999
It takes 2540 ms to page from 87000 - 87499
It takes 2555 ms to page from 87500 - 87999
It takes 2592 ms to page from 88000 - 88499
It takes 2585 ms to page from 88500 - 88999
It takes 2647 ms to page from 89000 - 89499
It takes 2536 ms to page from 89500 - 89999
It takes 2519 ms to page from 90000 - 90499
It takes 2582 ms to page from 90500 - 90999
It takes 2519 ms to page from 91000 - 91499
It takes 2567 ms to page from 91500 - 91999
It takes 2582 ms to page from 92000 - 92499
It takes 2568 ms to page from 92500 - 92999
It takes 2734 ms to page from 93000 - 93499
It takes 2736 ms to page from 93500 - 93999
It takes 2648 ms to page from 94000 - 94499
It takes 2850 ms to page from 94500 - 94999
It takes 2664 ms to page from 95000 - 95499
It takes 2714 ms to page from 95500 - 95999
It takes 2653 ms to page from 96000 - 96499
It takes 2696 ms to page from 96500 - 96999
It takes 2768 ms to page from 97000 - 97499
It takes 2755 ms to page from 97500 - 97999
It takes 2776 ms to page from 98000 - 98499
It takes 2767 ms to page from 98500 - 98999
It takes 2888 ms to page from 99000 - 99499
It takes 2814 ms to page from 99500 - 99999
It takes 2366 ms to page from 100000 - 100001
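The log shows that each page takes longer than the last, rising from about 0.6 s for the first page to about 2.8 s near the end. This does not match my expectation that every page would take the same time to query.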
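The roughly linear growth in the log fits the documented behaviour of skip: the server has to walk past the skipped documents before it can return a page. The sketch below is only an arithmetic model of that cost, not driver code. The class and method names are hypothetical, and it assumes each query walks `offset + returned` documents.

```java
// Hypothetical cost model: assumes a skip/limit query walks past every
// skipped document and then reads the documents it returns.
final class SkipCostModel {

    /** Documents walked in total when paging the whole collection with skip/limit. */
    static long walkedWithSkip(long total, int pageSize) {
        if (total < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("total must be >= 0 and pageSize > 0");
        }
        long walked = 0;
        long offset = 0;
        while (offset < total) {
            long returned = Math.min(pageSize, total - offset);
            walked += offset + returned; // skipped documents plus the page itself
            offset += returned;
        }
        return walked;
    }

    /** Documents walked when a single cursor reads the collection once. */
    static long walkedWithSingleCursor(long total) {
        return total;
    }

    public static void main(String[] args) {
        long total = 100_002L; // collection size taken from the log above
        System.out.println("skip/limit:    " + walkedWithSkip(total, 500));
        System.out.println("single cursor: " + walkedWithSingleCursor(total));
    }
}
```

Under this model, paging 100,002 documents in pages of 500 walks 10,150,002 documents in total, about 100 times as many as a single pass over the collection.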
Create only one cursor
So I tried a different way of querying:
long total = collection.count();
log.info("The collection {} contains {} documents.", collectionName, total);
/*
 * Retrieve all documents:
 * 1. Obtain the iterable, FindIterable<Document>
 * 2. Obtain the cursor, MongoCursor<Document>
 * 3. Iterate over the retrieved documents through the cursor
 */
int count = 0;
FindIterable<Document> documents = collection.find();
try (MongoCursor<Document> cursor = documents.iterator()) {
    List<Document> list = new ArrayList<>();
    long begin = System.currentTimeMillis();
    while (cursor.hasNext()) {
        list.add(cursor.next());
        count++;
        // Flush every 100 documents, and once more when the last document is reached.
        if (count % 100 == 0 || total - count == 0) {
            long duration = System.currentTimeMillis() - begin;
            log.info("It takes {} ms to current {}", duration, count);
            // Consume the data here!
            list.clear();
            begin = System.currentTimeMillis();
        }
    }
}
Partial log output:
The collection test_big_data contains 100002 documents.
It takes 0 ms to current 100
It takes 1545 ms to current 200
It takes 0 ms to current 300
It takes 0 ms to current 400
It takes 0 ms to current 500
It takes 0 ms to current 600
It takes 0 ms to current 700
It takes 0 ms to current 800
It takes 0 ms to current 900
It takes 0 ms to current 1000
It takes 0 ms to current 1100
It takes 0 ms to current 1200
It takes 0 ms to current 1300
It takes 0 ms to current 1400
It takes 1497 ms to current 1500
It takes 0 ms to current 1600
It takes 0 ms to current 1700
It takes 0 ms to current 1800
It takes 0 ms to current 1900
It takes 0 ms to current 2000
It takes 0 ms to current 2100
It takes 0 ms to current 2200
It takes 0 ms to current 2300
It takes 0 ms to current 2400
It takes 0 ms to current 2500
It takes 0 ms to current 2600
It takes 0 ms to current 2700
It takes 0 ms to current 2800
It takes 1486 ms to current 2900
It takes 0 ms to current 3000
It takes 0 ms to current 3100
It takes 0 ms to current 3200
It takes 0 ms to current 3300
It takes 0 ms to current 3400
It takes 0 ms to current 3500
It takes 0 ms to current 3600
It takes 0 ms to current 3700
It takes 0 ms to current 3800
It takes 0 ms to current 3900
It takes 0 ms to current 4000
It takes 0 ms to current 4100
It takes 0 ms to current 4200
It takes 1488 ms to current 4300
It takes 0 ms to current 4400
It takes 0 ms to current 4500
It takes 0 ms to current 4600
It takes 0 ms to current 4700
It takes 0 ms to current 4800
It takes 0 ms to current 4900
It takes 0 ms to current 5000
It takes 0 ms to current 5100
It takes 0 ms to current 5200
It takes 0 ms to current 5300
It takes 0 ms to current 5400
It takes 0 ms to current 5500
It takes 0 ms to current 5600
It takes 1503 ms to current 5700
It takes 0 ms to current 5800
It takes 0 ms to current 5900
It takes 0 ms to current 6000
It takes 0 ms to current 6100
It takes 0 ms to current 6200
It takes 0 ms to current 6300
It takes 0 ms to current 6400
It takes 0 ms to current 6500
It takes 0 ms to current 6600
It takes 0 ms to current 6700
It takes 0 ms to current 6800
It takes 0 ms to current 6900
It takes 0 ms to current 7000
It takes 1475 ms to current 7100
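In this log most groups of 100 documents take 0 ms, with a pause of about 1.5 s roughly every 1,400 documents. This is consistent with the cursor buffering a batch of documents and going back to the server only when the buffer runs out. Of course, the number of documents between pauses depends mainly on the size of each Document and the total size of the buffer, so your results will differ from mine.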
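To make the pattern of "many 0 ms reads, then one slow read" concrete, here is a minimal pure-Java model of a cursor that refills a local buffer one batch at a time. It is a sketch, not the driver's implementation. `BatchedCursorModel` and its fields are hypothetical names, and the batch size of 1,400 is an assumption read off the log above. In the real driver the batch size can be suggested with `FindIterable.batchSize(int)`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical model of a batching cursor: next() is served from a local
// buffer, and a (simulated) server round trip happens only when it is empty.
final class BatchedCursorModel implements Iterator<Integer> {
    private final int total;
    private final int batchSize;
    private final Deque<Integer> buffer = new ArrayDeque<>();
    private int fetched = 0;
    private int roundTrips = 0;

    BatchedCursorModel(int total, int batchSize) {
        if (total < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("total must be >= 0 and batchSize > 0");
        }
        this.total = total;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        if (buffer.isEmpty() && fetched < total) {
            fetchNextBatch(); // the only slow step in this model
        }
        return !buffer.isEmpty();
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffer.removeFirst();
    }

    private void fetchNextBatch() {
        roundTrips++;
        int end = Math.min(total, fetched + batchSize);
        while (fetched < end) {
            buffer.addLast(fetched++);
        }
    }

    int roundTrips() {
        return roundTrips;
    }
}
```

Iterating 7,100 elements with a batch size of 1,400 in this model takes 6 round trips; every other call to `next()` is answered from memory, which is why most log lines above show 0 ms.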
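In summary
Conclusion: when using mongo-java-driver to scan an entire collection, paging with skip and limit, which creates a new cursor for every page, is less efficient than reading the whole collection through a single cursor.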
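The single-cursor loop above can also be written as a small reusable helper, sketched here under a few assumptions. `ChunkedReader` and `forEachChunk` are hypothetical names, not part of mongo-java-driver. The helper accepts any `java.util.Iterator` and flushes the final partial chunk after the loop, so it does not need the total count from `collection.count()`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: reads an iterator once and hands the elements to a
// handler in fixed-size chunks, so only one chunk is held in memory at a time.
final class ChunkedReader {

    /** Returns the number of elements read from the source. */
    static <T> long forEachChunk(Iterator<T> source, int chunkSize, Consumer<List<T>> handler) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be > 0");
        }
        List<T> chunk = new ArrayList<>(chunkSize);
        long seen = 0;
        while (source.hasNext()) {
            chunk.add(source.next());
            seen++;
            if (chunk.size() == chunkSize) {
                handler.accept(chunk);
                chunk = new ArrayList<>(chunkSize); // fresh list; the handler may keep the old one
            }
        }
        if (!chunk.isEmpty()) {
            handler.accept(chunk); // final partial chunk
        }
        return seen;
    }
}
```

Since `MongoCursor<Document>` implements `Iterator<Document>`, the cursor opened in the try-with-resources block above could be passed as `source`, with the "consume the data" step supplied as the handler.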