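When data is stored in files, how should we think about read performance? Suppose we only ever read the data line by line: how should the files be laid out to get good performance?
Requirements: 1. locate data by line number; 2. read any number of consecutive lines (similar to limit offset, size in SQL).
Functionally this is simple: once the data is stored line by line, both operations are possible. What we want to discuss is how to make the reads efficient. Suppose we have one field available for storing some index information: how should we use it?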
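As the title of this article suggests, there are two candidate approaches:
1. Store the data in multiple shard files, use the index field to find the right shard, and read that shard file directly;
2. Store the data in a single file, use the index field to look up the byte offset of the line, then jump to that position with seek() or a similar call and read the requested lines.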
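For the second approach, the index has to map a line number to a byte offset. The following is a minimal sketch of how such an index could be built in one pass over the file. LineOffsetIndex and its methods are hypothetical names introduced for illustration, and the sketch assumes lines are terminated by '\n' and that the index fits in memory.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: records the byte offset at which every line starts. */
final class LineOffsetIndex {
    // offsets[i] is the byte offset of line i (0-based line numbers)
    private final long[] offsets;

    private LineOffsetIndex(long[] offsets) {
        this.offsets = offsets;
    }

    /** Scans the file once and records where each line begins. */
    static LineOffsetIndex build(Path file) throws IOException {
        long[] buf = new long[1024];
        int count = 0;
        long pos = 0;
        boolean atLineStart = true;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (count == buf.length) {
                        buf = Arrays.copyOf(buf, count * 2); // grow the array
                    }
                    buf[count++] = pos;
                    atLineStart = false;
                }
                if (b == '\n') {
                    atLineStart = true; // the next byte, if any, starts a new line
                }
                pos++;
            }
        }
        return new LineOffsetIndex(Arrays.copyOf(buf, count));
    }

    int lineCount() {
        return offsets.length;
    }

    /** Byte offset to seek to in order to start reading at the given line. */
    long offsetOf(int lineNumber) {
        return offsets[lineNumber];
    }
}
```

With such an index, a request like limit offset, size becomes: look up offsetOf(offset), seek there, and read size lines. In practice the offsets would be persisted in the index field rather than rebuilt on every read.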
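So which approach is better? Intuitively the first one seems better, but by how much? The first approach also has a drawback: it produces many small files, which are harder to manage. If the second approach is not much slower, it may well be a good choice.
To compare the two, we naturally need to measure. Load testing gives us data easily: we could use an external tool such as JMeter and compare the results outside the code, or write a unit test that loops n times and compare the elapsed times. A better option is JMH, which produces concrete performance comparisons in a unit-test-like style; that is the approach and the purpose of this article.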
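For comparison, the hand-written loop mentioned above might look like the following minimal sketch. NaiveTimer and averageNanos are hypothetical names, not part of any library. The sketch has no warm-up phase, no forked JVM and no protection against JIT optimizations, which is the gap JMH is designed to fill.

```java
/** Hypothetical helper: times a task with a plain loop (no warm-up, no fork). */
final class NaiveTimer {
    /** Runs the task n times and returns the average nanoseconds per call. */
    static double averageNanos(Runnable task, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            task.run();
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / n;
    }
}
```

The numbers from such a loop include class loading and interpreter time from the first calls, so they are best treated as a rough indication only.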
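1. Introducing JMH
JMH (Java Microbenchmark Harness) is a micro-benchmark framework developed by the same OpenJDK/Oracle engineers who work on the Java compiler. It is mainly used for method-level performance testing, and it can report results at microsecond or even nanosecond granularity.
To use JMH, you only need to add two dependencies.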
<!-- benchmark -->
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>1.19</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>1.19</version>
    <scope>test</scope>
</dependency>
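2. Writing the test cases with JMH
The test cases themselves are independent of the tool: what matters is working out which aspects need to be tested and then writing one case for each. JMH only collects the performance data once those basic cases exist. For the purpose of this test, we mainly need a few cases covering the different ways of reading the files, as follows: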
package com.my.test.benchmark;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/**
 * Compares reading from one large file with reading from small shard files.
 */
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class SplitFileWithOneFileReadBenchmarkTest {

    // Backslashes must be escaped in Java string literals
    private static final String SPLIT_FILE_PATH = "C:\\Users\\Administrator\\Desktop\\1.txt";
    private static final String ONE_FILE_PATH = "C:\\Users\\Administrator\\Desktop\\1-all.txt";

    // Entry point: run this main() to start the benchmark
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(SplitFileWithOneFileReadBenchmarkTest.class.getSimpleName())
                // Warm-up: avoids bias from first-load overhead
                .warmupIterations(1)
                // Number of measurement iterations
                .measurementIterations(5)
                // Number of forked JVM processes that run the benchmark
                .forks(1)
                .build();
        new Runner(opt).run();
    }

    /**
     * Read scenario: read 10,000 lines
     */
    private int maxReadLines = 10000;

    /**
     * Offset to skip to in the file (an offset, not a line number; in a real
     * scenario the line number would be mapped to a byte position)
     */
    private int seekFilePos = 1_000_000;
    private int seekFilePosSmallest = 0;
    private int seekFilePosLargest = 15_000_000;

    @Setup
    public void setup() {
        // do something for init...
    }

    @TearDown
    public void tearDown() {
        // do something when test end
    }

    /**
     * Reads from a small (shard) file with BufferedReader.
     *
     * @return the lines read
     * @throws IOException if reading fails
     */
    @Benchmark
    public List<String> testSplitReadUseBufferReader() throws IOException {
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(SPLIT_FILE_PATH))) {
            return readLimitLinesFromBuffer(bufferedReader);
        }
    }

    /**
     * Shared read method for BufferedReader.
     *
     * @param bufferedReader reader positioned at the first line to read
     * @return at most maxReadLines lines
     * @throws IOException if reading fails
     */
    private List<String> readLimitLinesFromBuffer(BufferedReader bufferedReader) throws IOException {
        String lineData;
        int lineNum = 0;
        List<String> result = new ArrayList<>();
        while ((lineData = bufferedReader.readLine()) != null) {
            result.add(lineData);
            // >= so that exactly maxReadLines lines are read, not one more
            if (++lineNum >= maxReadLines) {
                break;
            }
        }
        return result;
    }

    /**
     * Shared read method for RandomAccessFile.
     *
     * @param randomAccessFile file positioned at the first line to read
     * @return at most maxReadLines lines
     * @throws IOException if reading fails
     */
    private List<String> readLimitLinesFromRandomAccessFile(
            RandomAccessFile randomAccessFile) throws IOException {
        String lineData;
        int lineNum = 0;
        List<String> result = new ArrayList<>();
        while ((lineData = randomAccessFile.readLine()) != null) {
            result.add(lineData);
            if (++lineNum >= maxReadLines) {
                break;
            }
        }
        return result;
    }

    /**
     * Reads a fixed number of lines from the single file (seek position known).
     *
     * @return the lines read
     * @throws IOException if reading fails
     */
    @Benchmark
    public List<String> testOneFileReadUseBufferReader() throws IOException {
        return readFileLimitDataWithSeek(ONE_FILE_PATH, seekFilePos);
    }

    /**
     * Reads from the beginning of the large file.
     */
    @Benchmark
    public List<String> testOneFileReadFromBeginUseBufferReader() throws IOException {
        return readFileLimitDataWithSeek(ONE_FILE_PATH, seekFilePosSmallest);
    }

    /**
     * Reads from near the end of the large file.
     */
    @Benchmark
    public List<String> testOneFileReadFromEndUseBufferReader() throws IOException {
        return readFileLimitDataWithSeek(ONE_FILE_PATH, seekFilePosLargest);
    }

    /**
     * Reads lines from a file, starting at the given position.
     *
     * @param filePath path of the file
     * @param seekPos position to skip to
     * @return the lines read
     * @throws IOException if reading fails
     */
    private List<String> readFileLimitDataWithSeek(String filePath,
                                                   int seekPos) throws IOException {
        try (FileReader reader = new FileReader(filePath)) {
            // Reader.skip() counts characters, which equals bytes only for
            // single-byte encoded data; the first line read may be partial
            reader.skip(seekPos);
            BufferedReader bufferedReader = new BufferedReader(reader);
            return readLimitLinesFromBuffer(bufferedReader);
        }
    }

    /**
     * Reads from the single file with RandomAccessFile.
     */
    @Benchmark
    public List<String> testOneFileReadUseRandomAccessFile() throws IOException {
        try (RandomAccessFile reader = new RandomAccessFile(ONE_FILE_PATH, "r")) {
            reader.seek(seekFilePos);
            return readLimitLinesFromRandomAccessFile(reader);
        }
    }

    /**
     * Reads from the shard file with RandomAccessFile.
     */
    @Benchmark
    public List<String> testSplitReadUseRandomAccessFile() throws IOException {
        try (RandomAccessFile reader = new RandomAccessFile(SPLIT_FILE_PATH, "r")) {
            return readLimitLinesFromRandomAccessFile(reader);
        }
    }
}
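I will not go through the JMH parameters in detail; the important ones are explained in the comments in main(). Beyond that, the only thing to remember is to put the @Benchmark annotation on every method whose performance you want to measure.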
(For the official JMH documentation see: http://openjdk.java.net/projects/code-tools/jmh/ ; official benchmark samples: http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/)
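The benchmark class reads two files, 1.txt and 1-all.txt, which have to exist before it is run. The original data files are not available, so the following is only a sketch of how comparable test data could be generated. TestDataGenerator, the line format and the line counts are all assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes numbered lines to a file for benchmarking. */
final class TestDataGenerator {
    /**
     * Writes lineCount lines, numbered from firstLine, to the target file.
     * Each line looks like "line-<n>,some-payload-for-benchmarking".
     */
    static void writeLines(Path target, long firstLine, long lineCount) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.US_ASCII)) {
            for (long n = firstLine; n < firstLine + lineCount; n++) {
                out.write("line-" + n + ",some-payload-for-benchmarking");
                out.write('\n');
            }
        }
    }
}
```

For example, writeLines(Paths.get("1-all.txt"), 0, 1_000_000) would produce a file of roughly 40 MB with this line format, large enough for the biggest offset used in the benchmark (15,000,000) to fall inside it, and writeLines(Paths.get("1.txt"), 0, 10_000) would produce one shard of 10,000 lines.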
With the test cases above (the external files they depend on must be prepared by ourselves), we can obtain concrete results, as follows:
# JMH version: 1.19
# VM version: JDK 1.8.0_131, VM 25.131-b11
# VM invoker: C:\Program Files\Java\jdk1.8.0_131\jre\bin\java.exe
# VM options: -javaagent:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\lib\idea_rt.jar=61902:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testOneFileReadFromBeginUseBufferReader

# Run progress: 0.00% complete, ETA 00:00:36
# Fork: 1 of 1
# Warmup Iteration   1: 469.938 ops/s
Iteration   1: 948.380 ops/s
Iteration   2: 988.579 ops/s
Iteration   3: 1085.332 ops/s
Iteration   4: 1120.864 ops/s
Iteration   5: 1105.987 ops/s

Result "com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testOneFileReadFromBeginUseBufferReader":
  1049.828 ±(99.9%) 295.170 ops/s [Average]
  (min, avg, max) = (948.380, 1049.828, 1120.864), stdev = 76.655
  CI (99.9%): [754.659, 1344.998] (assumes normal distribution)

# JMH version: 1.19
# VM version: JDK 1.8.0_131, VM 25.131-b11
# VM invoker: C:\Program Files\Java\jdk1.8.0_131\jre\bin\java.exe
# VM options: -javaagent:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\lib\idea_rt.jar=61902:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testOneFileReadFromEndUseBufferReader

# Run progress: 16.67% complete, ETA 00:00:38
# Fork: 1 of 1
# Warmup Iteration   1: 29.599 ops/s
Iteration   1: 34.859 ops/s
Iteration   2: 35.343 ops/s
Iteration   3: 35.197 ops/s
Iteration   4: 32.628 ops/s
Iteration   5: 35.747 ops/s

Result "com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testOneFileReadFromEndUseBufferReader":
  34.755 ±(99.9%) 4.739 ops/s [Average]
  (min, avg, max) = (32.628, 34.755, 35.747), stdev = 1.231
  CI (99.9%): [30.016, 39.494] (assumes normal distribution)

# JMH version: 1.19
# VM version: JDK 1.8.0_131, VM 25.131-b11
# VM invoker: C:\Program Files\Java\jdk1.8.0_131\jre\bin\java.exe
# VM options: -javaagent:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\lib\idea_rt.jar=61902:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testOneFileReadUseBufferReader

# Run progress: 33.33% complete, ETA 00:00:30
# Fork: 1 of 1
# Warmup Iteration   1: 317.167 ops/s
Iteration   1: 356.527 ops/s
Iteration   2: 359.695 ops/s
Iteration   3: 376.455 ops/s
Iteration   4: 387.701 ops/s
Iteration   5: 386.874 ops/s

Result "com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testOneFileReadUseBufferReader":
  373.451 ±(99.9%) 56.720 ops/s [Average]
  (min, avg, max) = (356.527, 373.451, 387.701), stdev = 14.730
  CI (99.9%): [316.730, 430.171] (assumes normal distribution)

# JMH version: 1.19
# VM version: JDK 1.8.0_131, VM 25.131-b11
# VM invoker: C:\Program Files\Java\jdk1.8.0_131\jre\bin\java.exe
# VM options: -javaagent:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\lib\idea_rt.jar=61902:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testOneFileReadUseRandomAccessFile

# Run progress: 50.00% complete, ETA 00:00:22
# Fork: 1 of 1
# Warmup Iteration   1: 1.505 ops/s
Iteration   1: 1.545 ops/s
Iteration   2: 1.543 ops/s
Iteration   3: 1.537 ops/s
Iteration   4: 1.549 ops/s
Iteration   5: 1.547 ops/s

Result "com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testOneFileReadUseRandomAccessFile":
  1.544 ±(99.9%) 0.018 ops/s [Average]
  (min, avg, max) = (1.537, 1.544, 1.549), stdev = 0.005
  CI (99.9%): [1.526, 1.562] (assumes normal distribution)

# JMH version: 1.19
# VM version: JDK 1.8.0_131, VM 25.131-b11
# VM invoker: C:\Program Files\Java\jdk1.8.0_131\jre\bin\java.exe
# VM options: -javaagent:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\lib\idea_rt.jar=61902:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testSplitReadUseBufferReader

# Run progress: 66.67% complete, ETA 00:00:15
# Fork: 1 of 1
# Warmup Iteration   1: 833.442 ops/s
Iteration   1: 1015.799 ops/s
Iteration   2: 1049.247 ops/s
Iteration   3: 1091.090 ops/s
Iteration   4: 1207.990 ops/s
Iteration   5: 1185.992 ops/s

Result "com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testSplitReadUseBufferReader":
  1110.023 ±(99.9%) 323.884 ops/s [Average]
  (min, avg, max) = (1015.799, 1110.023, 1207.990), stdev = 84.112
  CI (99.9%): [786.140, 1433.907] (assumes normal distribution)

# JMH version: 1.19
# VM version: JDK 1.8.0_131, VM 25.131-b11
# VM invoker: C:\Program Files\Java\jdk1.8.0_131\jre\bin\java.exe
# VM options: -javaagent:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\lib\idea_rt.jar=61902:D:\Program Files\JetBrains\IntelliJ IDEA 2017.3.1\bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testSplitReadUseRandomAccessFile

# Run progress: 83.33% complete, ETA 00:00:07
# Fork: 1 of 1
# Warmup Iteration   1: 3.891 ops/s
Iteration   1: 4.004 ops/s
Iteration   2: 3.984 ops/s
Iteration   3: 4.057 ops/s
Iteration   4: 4.052 ops/s
Iteration   5: 4.078 ops/s

Result "com.my.test.benchmark.SplitFileWithOneFileReadBenchmarkTest.testSplitReadUseRandomAccessFile":
  4.035 ±(99.9%) 0.151 ops/s [Average]
  (min, avg, max) = (3.984, 4.035, 4.078), stdev = 0.039
  CI (99.9%): [3.884, 4.186] (assumes normal distribution)

# Run complete. Total time: 00:00:46

Benchmark                                                                       Mode  Cnt     Score     Error  Units
SplitFileWithOneFileReadBenchmarkTest.testOneFileReadFromBeginUseBufferReader  thrpt    5  1049.828 ± 295.170  ops/s
SplitFileWithOneFileReadBenchmarkTest.testOneFileReadFromEndUseBufferReader    thrpt    5    34.755 ±   4.739  ops/s
SplitFileWithOneFileReadBenchmarkTest.testOneFileReadUseBufferReader           thrpt    5   373.451 ±  56.720  ops/s
SplitFileWithOneFileReadBenchmarkTest.testOneFileReadUseRandomAccessFile       thrpt    5     1.544 ±   0.018  ops/s
SplitFileWithOneFileReadBenchmarkTest.testSplitReadUseBufferReader             thrpt    5  1110.023 ± 323.884  ops/s
SplitFileWithOneFileReadBenchmarkTest.testSplitReadUseRandomAccessFile         thrpt    5     4.035 ±   0.151  ops/s
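In practice we only need to look at the summary table at the end. Roughly, it tells us:
1. In this test BufferedReader is far faster than RandomAccessFile, by more than two orders of magnitude (1110.023 vs 4.035 ops/s on the shard file, 373.451 vs 1.544 ops/s on the single file). RandomAccessFile.readLine() is unbuffered and reads one byte at a time, so we can rule it out as a way of reading lines;
2. When the requested lines are near the beginning of the file, the single large file and the small shard file perform about the same (1049.828 vs 1110.023 ops/s, well within the error margins);
3. As the offset grows, the single large file gets slower and slower: about 3x slower than the shard file at offset 1,000,000 (373.451 ops/s) and about 32x slower, more than an order of magnitude, at offset 15,000,000 (34.755 ops/s). Note that FileReader.skip() reads and discards the skipped characters instead of performing a real seek, so its cost grows with the offset.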
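The benchmark does not include a case that combines a real seek with buffered reading. The following is a minimal sketch of such a variant, positioning a FileChannel at a byte offset and wrapping it in a BufferedReader. SeekingLineReader is a hypothetical name, the UTF-8 charset is an assumption, and this variant was not measured in the results above, so no conclusion about its speed should be drawn from this article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: seeks to a byte offset, then reads lines with buffering. */
final class SeekingLineReader {
    /**
     * Reads at most maxLines lines starting at byteOffset.
     * byteOffset is assumed to point at the first byte of a line.
     */
    static List<String> readLines(Path file, long byteOffset, int maxLines) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Moves the read position without reading the skipped bytes
            channel.position(byteOffset);
            BufferedReader reader = new BufferedReader(Channels.newReader(channel, "UTF-8"));
            List<String> result = new ArrayList<>();
            String line;
            while (result.size() < maxLines && (line = reader.readLine()) != null) {
                result.add(line);
            }
            return result;
        }
    }
}
```

Adding this as another @Benchmark method against the same files would show whether the slowdown at large offsets comes from the single-file layout itself or from the way the offset is reached.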
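3. Conclusion
When the amount of stored data is large, we recommend storing it in shard files and using the index field to locate the right shard; in this test that layout gave the best read performance.
Note: JMH benchmarks should be started with Run, not Debug. With forks(1) the benchmark executes in a separate forked JVM, so breakpoints set in the IDE are not hit.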
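Locating a shard from a line number can be plain arithmetic when every shard holds the same number of lines. The following minimal sketch illustrates that idea. ShardLocator, the fixed linesPerShard value and the "<index>.txt" naming scheme are assumptions for illustration, not details taken from the benchmark.

```java
/** Hypothetical helper: maps a global line number to a shard file. */
final class ShardLocator {
    private final int linesPerShard;

    ShardLocator(int linesPerShard) {
        if (linesPerShard <= 0) {
            throw new IllegalArgumentException("linesPerShard must be positive");
        }
        this.linesPerShard = linesPerShard;
    }

    /** Index of the shard that contains the given 0-based line number. */
    long shardIndex(long lineNumber) {
        return lineNumber / linesPerShard;
    }

    /** 0-based position of the line inside its shard. */
    int lineInShard(long lineNumber) {
        return (int) (lineNumber % linesPerShard);
    }

    /** Assumed naming scheme: shard n is stored as "n.txt". */
    String shardFileName(long lineNumber) {
        return shardIndex(lineNumber) + ".txt";
    }

    /** Index of the last shard touched by a read of size lines starting at offset. */
    long lastShardIndex(long offset, long size) {
        return shardIndex(offset + size - 1);
    }
}
```

For example, with 10,000 lines per shard, line 25,000 lives in shard 2 at position 5,000, and a read of 20 lines starting at line 9,990 spans shards 0 and 1, so the reader has to continue into the next shard file.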