最近工作碰到一个问题,如何将大量数据(100MB+)导入到远程的mysql server上。
尝试1:
Statement执行executeBatch的方法。每次导入1000条记录。时间为12s/1000条。比较慢。
对于1M次的插入这意味着需要4个多小时,期间还会因为网络状况,数据库负载等因素而把载入延迟提升到85s/1000条甚至更高。
效果较差。
尝试2:
使用PreparedStatement,该方法需要预先给定insert操作的“格式”。
实测用这种方式插入的效率为每秒钟数十行。
注意,将rewriteBatchedStatements设为true之后,在不到一分钟时间里面就将78万条数据全部导入数据库了。这是一个行之有效的方法。
代码:
1 import java.io.BufferedReader; 2 import java.io.FileReader; 3 import java.sql.Connection; 4 import java.sql.DriverManager; 5 import java.sql.PreparedStatement; 6 7 /** 8 * 9 */ 10 public class PreparedStatementTestMain { 11 private static PreparedStatement ps; 12 public static void main(String[] args) { 13 try{ 14 Class.forName("com.mysql.jdbc.Driver"); 15 Connection conn = DriverManager.getConnection("jdbc:mysql://remote-host/test?user=xxx&password=xxx"); 16 String sql = "insert into test values(?,?,?,?,?,?,?,?,?,?,?)"; 17 ps = conn.prepareStatement(sql); 18 19 BufferedReader in = new BufferedReader(new FileReader("xxxx")); 20 String line; 21 int count =0; 22 while((line = in.readLine())!=null){ 23 count+=1; 24 String[] values = line.split(" ",-1); 25 //ps.setInt(1,count); 26 for(int i =1;i<values.length;i++) { 27 // if(i==6){ 28 // ps.setInt(i+1,Integer.parseInt(values[i])); 29 // }else{ 30 // if(values[i]==null){ 31 // ps.setString(i," "); 32 // }else { 33 ps.setString(i, values[i]); 34 // } 35 // } 36 } 37 ps.addBatch(); 38 System.out.println("Line "+count); 39 } 40 ps.executeBatch(); 41 ps.close(); 42 }catch(Exception e){ 43 e.printStackTrace(); 44 } 45 } 46 }
尝试3:
使用mysqlimport工具。经过实测,速度接近于尝试2中加上rewriteBatchedStatements之后的速度。不过前提是数据必须要保存为文件。
另外一个思路:
多线程插入。
测试:
import java.sql.*; import java.util.Properties; import java.util.Random; import java.util.concurrent.*; public class TestMultiThreadInsert { private static final String dbClassName = "com.mysql.jdbc.Driver"; private static final String CONNECTION = "jdbc:mysql://host/"; private static final String USER = "x"; private static final String PASSWORD = "xxx"; private static final int THREAD_NUM=10; private static void executeSQL(Connection conn, String sql) throws SQLException { Statement stmt = conn.createStatement(); stmt.execute(sql); } private static void ResetEnvironment() throws SQLException { Properties p = new Properties(); p.put("user", USER); p.put("password", PASSWORD); Connection conn = DriverManager.getConnection(CONNECTION, p); { for (String query: new String[] { "USE test", "CREATE TABLE IF NOT EXISTS MTI (ID INT AUTO_INCREMENT PRIMARY KEY,MASSAGE VARCHAR(9) NOT NULL)", "TRUNCATE TABLE MTI" }) { executeSQL(conn, query); } } } private static void worker() { Properties properties = new Properties(); properties.put("user", USER); properties.put("password", PASSWORD); try{ Connection conn = DriverManager.getConnection(CONNECTION, properties); executeSQL(conn, "USE test"); while (!Thread.interrupted()) { executeSQL(conn, String.format("INSERT INTO MTI VALUES (NULL,'hello')")); System.out.println("Inserting "+value+" finished."); } } catch (SQLException e) { e.printStackTrace(); } } public static void main(String[] args) throws ClassNotFoundException, SQLException, InterruptedException { Class.forName(dbClassName); ResetEnvironment(); ExecutorService executor = Executors.newFixedThreadPool(THREAD_NUM); for (int i = 0; i < THREAD_NUM; i++) { executor.submit(new Runnable() { public void run() { worker(); } }); } Thread.sleep(20000); executor.shutdownNow(); if (!executor.awaitTermination(5, TimeUnit.SECONDS)) { System.err.println("Pool did not terminate"); } } }
20个线程分别单条单条地插入,20秒钟插入2923条;
10个线程分别单条单条地插入,20秒钟插入1699条;
1个线程单条单条地插入,20秒钟330条。
预测:将多线程与PreparedStatement结合预计可以提高插入速度。
但是使用多线程插入会不可避免的要考虑到一个问题:写锁。
虽然上面的程序确实证明了多线程插入的可行性,但是背后的逻辑是什么样的呢?有必要进行一下解读。
上面的代码中的多线程对应的是多个连接(可参考:https://dev.mysql.com/doc/refman/5.5/en/connection-threads.html),通过多线程主要是提高了命令提交速度,而不是多个执行线程。至于如何执行,还需要考察InnoDB(目前所用的数据库引擎)对数据插入的处理机制。
为了解决这个问题,通过搜索,查到了这些可能存在联系的关键词:
1.io_threads(https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-multiple_io_threads.html),
2.锁,
3.insert buffer(https://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-performance-change_buffering.html)。
关于insert buffer,理解下面这句话是关键:
innodb使用insert buffer"欺骗"数据库:对于为非唯一索引,辅助索引的修改操作并非实时更新索引的叶子页,而是把若干对同一页面的更新缓存起来做合并为一次性更新操作,转化随机IO 为顺序IO,这样可以避免随机IO带来性能损耗,提高数据库的写性能。
要理解上面那句话,要先知道innoDB使用了什么样的数据结构来存储数据。
B+Tree!
关于B+Tree 网上一堆说明,这里不作赘述。
关于io_threads,是利用多个线程来处理对数据页的读写。
有一个问题依然没有说明白:锁!
(官方)locking
The system of protecting a transaction from seeing or changing data that is being queried or changed by other transactions. The locking strategy must balance reliability and consistency of database operations (the principles of the ACID philosophy) against the performance needed for goodconcurrency. Fine-tuning the locking strategy often involves choosing an isolation level and ensuring all your database operations are safe and reliable for that isolation level.
innodb为了提高读的性能,自定义了read write lock,也就是读写锁。其设计原则是:
1、同一时刻允许多个线程同时读取内存中的变量
2、同一时刻只允许一个线程更改内存中的变量
3、同一时刻当有线程在读取变量时不允许任何线程写存在
4、同一时刻当有线程在更改变量时不允许任何线程读,也不允许出自己以外的线程写(线程内可以递归占有锁)。
5、当有rw_lock处于线程读模式下是有线程写等待,这时候如果再有其他线程读请求锁的时,这个读请求将处于等待前面写完成。
既然有了锁,那么如何利用多线程的写操作来提高效率呢?
思考角度:提高互斥锁的切换效率!
怎么做到?
参考http://www.2cto.com/database/201411/352586.html
https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-latching.html
On many platforms, Atomic operations can often be used to synchronize the actions of multiple threads more efficiently than Pthreads. Each operation to acquire or release a lock can be done in fewer CPU instructions, wasting less time when threads contend for access to shared data structures. This in turn means greater scalability on multi-core platforms.
On platforms where the GCC, Windows, or Solaris functions for atomic memory access are not available, InnoDB
uses the traditional Pthreads method of implementing mutexes and read/write locks.
mutex
Informal abbreviation for "mutex variable". (Mutex itself is short for "mutual exclusion".) The low-level object that InnoDB uses to represent and enforce exclusive-access locks to internal in-memory data structures. Once the lock is acquired, any other process, thread, and so on is prevented from acquiring the same lock. Contrast with rw-locks, which InnoDB uses to represent and enforce shared-access locks to internal in-memory data structures. Mutexes and rw-locks are known collectively as latches.
rw-lock
The low-level object that InnoDB uses to represent and enforce shared-access locks to internal in-memory data structures following certain rules. Contrast with mutexes, which InnoDB uses to represent and enforce exclusive access to internal in-memory data structures. Mutexes and rw-locks are known collectively as latches.
rw-lock
types include s-locks
(shared locks), x-locks
(exclusive locks), and sx-locks
(shared-exclusive locks).
-
An
s-lock
provides read access to a common resource. -
An
x-lock
provides write access to a common resource while not permitting inconsistent reads by other threads. -
An
sx-lock
provides write access to a common resource while permitting inconsistent reads by other threads.sx-locks
were introduced in MySQL 5.7 to optimize concurrency and improve scalability for read-write workloads.
The following matrix summarizes rw-lock type compatibility.
S | SX | X | |
---|---|---|---|
S |
Compatible | Compatible | Conflict |
SX |
Compatible | Conflict | Conflict |
X |
Conflict | Conflict | Conflict |
补充:
rewriteBatchedStatements到底为什么对速度优化这个多?
一种说法:这样做的目的是为了让mysql能够将多个mysql insert语句打包成一个packet和mysql服务器通信。这样可以极大降低网络开销。
另一种说法:
Rewriting Batches
“rewriteBatchedStatements=true”
Affects (Prepared)Statement.add/executeBatch()
Core concept - remove latency
Special treatment for prepared INSERT statements
——Mark Matthews - Sun Microsystems
PreparedStatement VS Statement
数据库系统会对sql语句进行预编译处理(如果JDBC驱动支持的话),预处理语句将被预先编译好,这条预编译的sql查询语句能在将来的使用中重用。