• 数据仓库:Mysql大量数据快速导出


    背景

    写这篇文章主要是介绍一下我做数据仓库ETL同步的过程中遇到的一些有意思的内容和提升程序运行效率的过程。

    关系型数据库:

      项目初期:游戏的运营数据比较轻量,相关的运营数据是通过Java后台程序聚合查询关系型数据库MySQL完全可以应付,系统通过定时任务每日统计相关数据,等待运营人员查询即可。

      项目中后期:随着开服数量增多,玩家数量越来越多,数据库的数据量越来越大,运营后台查询效率越来越低。对于普通的关系型来说,如MySQL,当单表存储记录数超过500万条后,数据库查询性能将变得极为缓慢,而往往我们都不会只做单表查询,还有多表join。这里假如有100个游戏服,每个服有20张表,而每个表有500W数据,那么:

      总数据量 = 100 * 20 * 500W = 10亿  按当时的库表结构,换算成磁盘空间,约为100G左右

    我的天呐,现在没有单机的内存能同一时间载入100G的数据

    https://www.zhihu.com/question/19719997

      所以,考虑到这一点,Hive被提出来解决难题!

     

    数据仓库

    Hive适合做海量数据的数据仓库工具, 因为数据仓库中的数据有这两个特点:最全的历史数据(海量)、相对稳定的;所谓相对稳定,指的是数据仓库不同于业务系统数据库,数据经常会被更新,数据一旦进入数据仓库,很少会被更新和删除,只会被大量查询。而Hive,也是具备这两个特点

    二、项目架构设计

     在这里先说下初期项目架构的探索,因为数据流向,其实最终就是从MYSQL--------->Hive中,我使用的是Jdbc方式。为什么不使用下列工具呢?

    • Sqoop, 因为该游戏每个服有将近80张表,然后又有很多服,以后还会更多,而每个服的库表数据结构其实是完全一样的,只是IP地址不一样,使用Sqoop的话,将会需要维护越来越多的脚本,再者Sqoop没法处理原始数据中一些带有Hive表定义的行列分隔符
    • DataX 阿里开源的数据同步中间件,没做过详细研究

    1、全局缓存队列

    使用生产者消费者模型,中间使用内存,数据落地成txt

    首先生产者通过Jdbc获取源数据内容,放入固定大小的缓存队列,同时消费者不断的从缓存读取数据,根据不同的数据类型分别读取出来,并逐条写入相应的txt文件。

    速度每秒约8000条。

    这样做表面上看起来非常美好,流水式的处理,来一条处理一下,可是发现消费的速度远远赶不上生产的速度,生产出来的数据会堆积在缓存队列里面,假如队列不固定长度的话,这时候还会大量消耗内存,所以为了提升写入的速度,决定采用下一种方案

    2、每一张表一个缓存队列及writer接口

    每张表各自起一个生产者消费者模型,消费者启动时初始化相应的writer接口,架构设计如下:

    table1的生产者通过Jdbc获取源数据内容,放入table自带的固定大小的缓存队列,同时table1相应的消费者不断的从缓存读取数据,根据不同的数据类型分别读取出来,并逐条写入相应的txt文件。

    速度每秒约2W条。

     这样生产者线程可以并发的进行,通过控制生产者线程的数量,可以大大提高处理的效率, 项目关键代码如下:

    1)线程池

    /***
     * 
     * 
     * @描述 任务线程池
     */
    public class DumpExecuteService {
    
        private static ExecutorService dumpServerWorkerService; // 游戏服任务
        private static ExecutorService dumpTableWorkerService; // 表数据任务
        private static ExecutorService dumpReaderWorkerService; // 读取数据任务
        private static ExecutorService dumpWriterWorkerService; // 写数据结果任务
    
        /***
         * 初始化任务线程池
         * @param concurrencyDBCount 并发数量
         */
        public synchronized static void startup(int concurrencyDBCount) {
    
            if (dumpServerWorkerService != null)
                return;
    
            if (concurrencyDBCount > 2)
                concurrencyDBCount = 2; // 最多支持两个数据库任务并发执行
    
            if (concurrencyDBCount < 1)
                concurrencyDBCount = 1;
    
            dumpServerWorkerService = Executors.newFixedThreadPool(concurrencyDBCount, new NamedThreadFactory(
                    "DumpExecuteService.dumpServerWorkerService" + System.currentTimeMillis()));
            dumpTableWorkerService = Executors.newFixedThreadPool(2, new NamedThreadFactory("DumpExecuteService.dumpTableWorkerService"
                    + System.currentTimeMillis()));
            dumpWriterWorkerService = Executors.newFixedThreadPool(8, new NamedThreadFactory("DumpExecuteService.dumpWriterWorkerService"
                    + System.currentTimeMillis()));
            dumpReaderWorkerService = Executors.newFixedThreadPool(2, new NamedThreadFactory("DumpExecuteService.dumpReaderWorkerService"
                    + System.currentTimeMillis()));
        }
    
        public static Future<Integer> submitDumpServerWorker(DumpServerWorkerLogic worker) {
            return dumpServerWorkerService.submit(worker);
        }
    
        public static Future<Integer> submitDumpWriteWorker(DumpWriteWorkerLogic worker) {
            return dumpWriterWorkerService.submit(worker);
        }
    
        public static Future<Integer> submitDumpReadWorker(DumpReadWorkerLogic worker) {
            return dumpReaderWorkerService.submit(worker);
        }
    
        public static Future<Integer> submitDumpTableWorker(DumpTableWorkerLogic worker) {
            return dumpTableWorkerService.submit(worker);
        }
    
        /***
         * 关闭线程池
         */
        public synchronized static void shutdown() {
    
            //执行线程池关闭...
        }
    }

    说明:该类定义4个线程池,分别用于执行不同的任务

    2)游戏服任务线程池

    /**
     * 1) 获取 游戏服log库数据库连接 
    2) 依次处理单张表
    */ public class DumpServerWorkerLogic extends AbstractLogic implements Callable<Integer> { private static Logger logger = LoggerFactory.getLogger(DumpServerWorkerLogic.class); private final ServerPO server;// 数据库 private final String startDate;// 开始时间 private SourceType sourceType;// 数据来源类型 private Map<String, Integer> resultDBMap;// 表记录计数 private GameType gameType; public DumpServerWorkerLogic(ServerPO server, String startDate, SourceType sourceType, Map<String, Integer> resultDBMap, GameType gameType) { CheckUtil.checkNotNull("DumpServerWorkerLogic.server", server); CheckUtil.checkNotNull("DumpServerWorkerLogic.startDate", startDate); CheckUtil.checkNotNull("DumpServerWorkerLogic.sourceType", sourceType); CheckUtil.checkNotNull("DumpServerWorkerLogic.resultDBMap", resultDBMap); CheckUtil.checkNotNull("DumpServerWorkerLogic.gameType", gameType); this.server = server; this.startDate = startDate; this.sourceType = sourceType; this.resultDBMap = resultDBMap; this.gameType = gameType; } @Override public Integer call() { // 获取连接, 并取得该库的所有表 Connection conn = null; try { conn = JdbcUtils.getDbConnection(server); } catch (Exception e) { throw new GameRuntimeException(e.getMessage(), e); } List<String> tableNames = null; DumpDbInfoBO dumpDbInfoBO = DumpConfig.getDumpDbInfoBO(); int totalRecordCount = 0; try { switch (this.sourceType) { case GAME_LOG: tableNames = JdbcUtils.getAllTableNames(conn); break; case INFOCENTER: tableNames = dumpDbInfoBO.getIncludeInfoTables(); tableNames.add("pay_action"); break; case EVENT_LOG: tableNames = new ArrayList<String>(); Date date = DateTimeUtil.string2Date(startDate, "yyyy-MM-dd"); String sdate = DateTimeUtil.date2String(date, "yyyyMMdd"); String smonth = DateTimeUtil.date2String(date, "yyyyMM"); tableNames.add("log_device_startup" + "_" + smonth); tableNames.add("log_device" + "_" + sdate); break; } // 遍历table for (String tableName : tableNames) { // 过滤 if (dumpDbInfoBO.getExcludeTables().contains(tableName)) continue; DumpTableWorkerLogic tableTask = new DumpTableWorkerLogic(conn, server, tableName, startDate, resultDBMap, gameType, sourceType); Future<Integer> tableFuture = DumpExecuteService.submitDumpTableWorker(tableTask); int count = tableFuture.get(); totalRecordCount += count; logger.info(String.format("DumpServerWorkerLogic %s-%s.%s be done", startDate, server.getLogDbName(), tableName)); } return totalRecordCount; } catch (Exception e) { throw new GameRuntimeException(e, "DumpTableWorkerLogic fail. server={%s}, errorMsg={%s} ",server.getId(), e.getMessage()); } finally { JdbcUtils.closeConnection(conn); } } }

     3)表处理任务,一个表一个

    /***
     * 
     * 
     * @描述 创建一个表查询结果写任务 (一个表一个)
     */
    public class DumpTableWorkerLogic implements Callable<Integer> {
        private static Logger logger = LoggerFactory.getLogger(DumpTableWorkerLogic.class);
    
        private final String tableName;
        private final Connection conn;
    
        private ServerPO server;
    
        private String startDate;
    
        private Map<String, Integer> resultDBMap;// 表记录计数
    
        private GameType gameType;
    
        private SourceType sourceType;// 数据来源类型
    
        public DumpTableWorkerLogic(Connection conn, ServerPO server, String tableName, String startDate, Map<String, Integer> resultDBMap,
                GameType gameType, SourceType sourceType) {
            CheckUtil.checkNotNull("DumpTableWorkerLogic.conn", conn);
            CheckUtil.checkNotNull("DumpTableWorkerLogic.tableName", tableName);
            CheckUtil.checkNotNull("DumpTableWorkerLogic.server", server);
            CheckUtil.checkNotNull("DumpTableWorkerLogic.startDate", startDate);
            CheckUtil.checkNotNull("DumpTableWorkerLogic.resultDBMap", resultDBMap);
            CheckUtil.checkNotNull("DumpTableWorkerLogic.gameType", gameType);
            CheckUtil.checkNotNull("DumpServerWorkerLogic.sourceType", sourceType);
    
            this.conn = conn;
            this.tableName = tableName;
            this.server = server;
            this.startDate = startDate;
            this.resultDBMap = resultDBMap;
            this.gameType = gameType;
            this.sourceType = sourceType;
    
            logger.info("DumpTableWorkerLogic[{}] Reg", tableName);
        }
    
        @Override
        public Integer call() {
            logger.info("DumpTableWorkerLogic[{}] Start", tableName);
    
            // 写检查结果任务
            DumpWriteWorkerLogic writerWorker = new DumpWriteWorkerLogic(server, tableName, startDate, resultDBMap, gameType,
                    sourceType);
            Future<Integer> writeFuture = DumpExecuteService.submitDumpWriteWorker(writerWorker);
            logger.info("DumpTableWorkerLogic[{}] writer={}", tableName);
    
            // 数据查询任务
            DumpReadWorkerLogic readerWorker = new DumpReadWorkerLogic(conn, tableName, writerWorker, startDate);
            DumpExecuteService.submitDumpReadWorker(readerWorker);
            logger.info("DumpTableWorkerLogic[{}] reader={}", tableName);
    
            try {
                int writeCount = writeFuture.get();
                logger.info("DumpTableWorkerLogic[{}] ---" + startDate + "---" + server.getId() + "---" + tableName + "---导出数据条数---"
                        + writeCount);
                return writeCount;
            }  catch (Exception e) {
                throw new GameRuntimeException(e, "DumpTableWorkerLogic fail. tableName={%s}, errorMsg={%s} ",tableName, e.getMessage());
            }
        }
    
    }

    4)单表读取任务线程

    /***
     * mysql读取数据任务
     * 
     */
    public class DumpReadWorkerLogic implements Callable<Integer> {
    
        private static Logger logger = LoggerFactory.getLogger(DumpReadWorkerLogic.class);
    
        private String tableName;
    
        private final Connection conn;
    
        private DumpWriteWorkerLogic writerWorker; // 写结果数据任务
    
        private String startDate;// 开始导出日期
    
        private static final int LIMIT = 50000;// 限制sql一次读出条数
    
        public DumpReadWorkerLogic(Connection conn, String tableName, DumpWriteWorkerLogic writerWorker, String startDate) {
            CheckUtil.checkNotNull("MysqlDataReadWorker.conn", conn);
            CheckUtil.checkNotNull("MysqlDataReadWorker.tableName", tableName);
            CheckUtil.checkNotNull("MysqlDataReadWorker.startDate", startDate);
    
            this.conn = conn;
            this.tableName = tableName;
            this.writerWorker = writerWorker;
            this.startDate = startDate;
    
            logger.info("DumpReadWorkerLogic Reg. tableName={}", this.tableName);
        }
    
        @Override
        public Integer call() {
            try {
                List<Map<String, Object>> result = JdbcUtils.queryForList(conn, "show full fields from " + tableName);
    
                int index = 0;
                String querySql = "";
    
                int totalCount = 0;
                while (true) {
                    int offset = index * LIMIT;
                    querySql = DumpLogic.getTableQuerySql(result, tableName, true, startDate) + " limit " + offset + "," + LIMIT;
                    int row = DumpLogic.query(conn, querySql, writerWorker);
                    totalCount += row;
                    logger.info("tableName=" + tableName + ", offset=" + offset + ", index=" + index + ", row=" + row + ", limit=" + LIMIT);
                    if (row < LIMIT)
                        break;
                    index++;
                }
                writerWorker.prepareClose();
                logger.info(startDate + "---" + tableName + "---Read.End");
                return totalCount;
            }
            catch (Exception e) {
                throw new GameRuntimeException(e, "MysqlDataReadWorker fail. tableName={%s}, errorMsg={%s} ",tableName, e.getMessage());
            }
        }
    
    }

    5)单表写入任务线程

    /***
     * 
     * 
     * @描述 mysql数据导出任务
     */
    public class DumpWriteWorkerLogic implements Callable<Integer> {
    
        private static final Logger logger = LoggerFactory.getLogger(DumpWriteWorkerLogic.class);
        private String tableName;// 表名
    
        private AtomicBoolean alive; // 线程是否活着
    
        private BufferedWriter writer;
    
        private ArrayBlockingQueue<String> queue; // 消息队列
    
        private ServerPO server;// 服务器
    
        private String startDate;// 开始时间
    
        private Map<String, Integer> resultDBMap;// 当天某服某表数量记录
    
        private GameType gameType;
    
        private SourceType sourceType;// 数据来源类型
    
        public DumpWriteWorkerLogic(ServerPO server, String tableName, String startDate, Map<String, Integer> resultDBMap, GameType gameType,
                SourceType sourceType) {
            CheckUtil.checkNotNull("DumpWriteWorkerLogic.tableName", tableName);
            CheckUtil.checkNotNull("DumpWriteWorkerLogic.server", server);
            CheckUtil.checkNotNull("DumpWriteWorkerLogic.startDate", startDate);
            CheckUtil.checkNotNull("DumpWriteWorkerLogic.resultDBMap", resultDBMap);
            CheckUtil.checkNotNull("DumpWriteWorkerLogic.gameType", gameType);
            CheckUtil.checkNotNull("DumpWriteWorkerLogic.sourceType", sourceType);
    
            this.tableName = tableName;
            this.server = server;
            this.startDate = startDate;
            this.queue = new ArrayBlockingQueue<>(65536);
            this.alive = new AtomicBoolean(true);
            this.gameType = gameType;
            this.sourceType = sourceType;
            this.writer = createWriter();
            this.resultDBMap = resultDBMap;
    
            logger.info("DumpWriteWorkerLogic Reg. tableName={}", this.tableName);
        }
    
        /***
         * 创建writer, 若文件不存在,会新建文件
         * 
         * @param serverId
         * @return
         */
        private BufferedWriter createWriter() {
            try {
                File toFile = FileUtils.getFilenameOfDumpTable(sourceType, tableName, startDate, gameType, ".txt");
                if (!toFile.exists()) {
                    FileUtils.createFile(sourceType, tableName, startDate, gameType);
                }
                return new BufferedWriter(new OutputStreamWriter(new FileOutputStream(toFile, true), Charsets.UTF_8), 5 * 1024 * 1024);
            } catch (Exception e) {
                throw new GameRuntimeException(e, "DumpWriteWorkerLogic createWriter fail. server={%s}, errorMsg={%s} ",server.getId(), e.getMessage());
            }
        }
    
        /***
         * 写入文件
         * 
         * @param line
         *            一条记录
         */
        private void writeToFile(String line) {
            try {
                this.writer.write(line + "
    ");
            } catch (Exception e) {
                throw new GameRuntimeException(e, "DumpWriteWorkerLogic writeToFile fail. errorMsg={%s} ", e.getMessage());
            }
        }
    
        /**
         * 记录数据到消息队列; 如果消息队列满了, 会阻塞直到可以put为止
         * 
         * @param result
         */
        public void putToWriterQueue(String line) {
    
            CheckUtil.checkNotNull("DumpWriteWorkerLogic putToWriterQueue", line);
    
            try {
                queue.put(line);
            } catch (InterruptedException e) {
                throw new GameRuntimeException(e, "DumpWriteWorkerLogic putToWriterQueue fail. errorMsg={%s} ", e.getMessage());
            }
        }
    
        /**
         * 准备关闭 (通知一下"需要处理的用户数据都处理完毕了"; task 写完数据, 就可以完毕了)
         */
        public void prepareClose() {
            alive.set(false);
        }
    
        @Override
        public Integer call() {
            logger.info("DumpWriteWorkerLogic Start. tableName={}", this.tableName);
            try {
                int totalCount = 0;
                while (alive.get() || !queue.isEmpty()) {
                    List<String> dataList = new ArrayList<String>();
                    queue.drainTo(dataList);
                    int count = processDataList(dataList);
                    totalCount += count;
                }
                logger.info("DumpWriteWorkerLogic ---" + startDate + "---" + tableName + "---Writer.End");
                return totalCount;
            } catch (Exception exp) {
                throw new GameRuntimeException(exp, "DumpWriteWorkerLogic call() fail. errorMsg={%s} ", exp.getMessage());
            } finally {
                FileUtil.close(this.writer);
            }
        }
    
        /***
         * 处理数据:写入本地文件及map
         * 
         * @param dataList
         *            数据集合
         * @return
         */
        private int processDataList(List<String> dataList) {
            int totalCount = 0;
    
            // 所有记录
            String key = server.getId() + "#" + tableName + "#" + sourceType.getIndex();
            if (dataList != null && dataList.size() > 0) {
    
                for (String line : dataList) {
    
                    // 按行写入文件
                    writeToFile(line);
    
                    // 记录到result_data_record_count
                    if (resultDBMap.get(key) != null) {
                        resultDBMap.put(key, resultDBMap.get(key) + 1);
                    }
                    else {
                        resultDBMap.put(key, 1);
                    }
    
                    totalCount++;
                }
            }
    
            return totalCount;
        }
    
    }

    内存优化

    1、使用Jdbc方式获取数据,如果这个数据表比较大,那么获取数据的速度特别慢;

    2、这个进程还会占用非常大的内存,并且GC不掉。分析原因,Jdbc获取数据的时候,会一次将所有数据放入到内存,如果同步的数据表非常大,那么甚至会将内存撑爆。

    那么优化的方法是让Jdbc不是一次全部将数据拿到内存,而是分页获取,每次最大limit数设置为50000,请参考read线程。

    经过这种架构优化后,5000W数据大约花费40min可完成导出

    说明:

    因为本文只是记录项目的设计过程,详细的代码后面会开源。

  • 相关阅读:
    可用性战术
    操作系统学习笔记_02
    SpringBoot程序启动报错:FileNotFoundException:class path resource [***] cannot be opened because it does not exist
    【SQL干货】求去重后的count数
    mybatis连接Oracle执行begin..end批量操作返回行数问题
    【SQL干货】一条语句搞定订单的排序
    【SQL干货】删除重复数据记录
    【SQL干货】一条sql按季度统计交易数据
    java语法糖类型推导/类型推断(type inference)
    ThreadLocal小解
  • 原文地址:https://www.cnblogs.com/ITtangtang/p/7612237.html
Copyright © 2020-2023  润新知