• MySQL: bulk-deleting duplicate data with stored procedures


    Table structure (the data is loaded into two columns, number and imsi):

    LOAD DATA INFILE '/usr/local/phone_imsi_12' replace INTO TABLE tbl_imsi2number_new FIELDS TERMINATED BY ' ' ENCLOSED BY '' (number,imsi);

    First, deduplicate with a plain SQL statement:

    delete from tbl_imsi2number_new where imsi in (select imsi from (select imsi from tbl_imsi2number_new group by imsi having count(imsi) > 1) a) and number not in (select * from (select min(number) from tbl_imsi2number_new group by imsi having count(imsi) > 1 ) b);
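
    The statement above compares number against the minimum values of all duplicated imsi groups at once, so a row can survive because its number happens to be the minimum of a different imsi. A minimal sketch of a variant that pairs each imsi with its own minimum, using a multi-table DELETE joined to a derived table; the aliases t, d and keep_number are introduced here for illustration only:

    ```sql
    -- For every duplicated imsi, keep only the row(s) holding its smallest number.
    DELETE t
    FROM tbl_imsi2number_new AS t
    JOIN (
        -- One row per duplicated imsi, with the number that should be kept.
        SELECT imsi, MIN(number) AS keep_number
        FROM tbl_imsi2number_new
        GROUP BY imsi
        HAVING COUNT(*) > 1
    ) AS d ON t.imsi = d.imsi
    WHERE t.number <> d.keep_number;
    ```

    Rows that are identical in both imsi and number are left in place by either form, since nothing distinguishes the copy to keep.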

    The data set is too large for that, though (over 300 million rows in total), so the work was split up as follows:

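    Load the data received from the development team into one big table with LOAD DATA INFILE, with no deduplication and no constraints of any kind. Then split the data into several dozen small files and load those into several dozen small tables. Each small table is compared against the big table and deduplicated, which yields deduplicated small tables. Finally, the rows of each deduplicated small table are hashed on a column to a two-digit bucket number, mod(conv(right(md5(imsi),2),16,10),100), and bulk-inserted by bucket.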
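
    To make the bucket expression concrete: RIGHT(MD5(imsi),2) takes the last two hex digits of the hash, CONV(...,16,10) turns them into a number from 0 to 255, and MOD(...,100) folds that into a bucket from 0 to 99. A small sketch, using the literal 'abc' as a stand-in value rather than a real imsi:

    ```sql
    -- MD5('abc') = 900150983cd24fb0d6963f7d28e17f72, so the last two hex digits are '72'.
    SELECT RIGHT(MD5('abc'), 2)                         AS last_hex,   -- '72'
           CONV(RIGHT(MD5('abc'), 2), 16, 10)           AS as_decimal, -- '114'
           MOD(CONV(RIGHT(MD5('abc'), 2), 16, 10), 100) AS split;      -- 14

    -- Row counts per bucket for the big table.
    SELECT MOD(CONV(RIGHT(MD5(imsi), 2), 16, 10), 100) AS split,
           COUNT(*) AS row_count
    FROM tbl_imsi2number_new
    GROUP BY split
    ORDER BY split;
    ```

    Because 256 is not a multiple of 100, buckets 0 to 55 each receive three of the 256 possible values and buckets 56 to 99 receive two, so the buckets come out only roughly even.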

    1. Stored procedure (deduplication):

    DELIMITER //

    /* tblname selects the table name dynamically */
    CREATE PROCEDURE create_imsi(IN tblname varchar(200))
    begin
    declare age int default 1;
    declare done int(1) default 0;
    declare v_imsi varchar(200);

    /* declare the cursor */
    declare cur_l cursor for select imsi from sqlstr;

    /* handler: set done once the cursor has no more rows */
    DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' set done=1;
    drop view if exists sqlstr;

    /* define the view: imsi values in the big table that are duplicated in the small table */
    set @tbl = CONCAT("create view sqlstr as select a.imsi from tbl_new a,(select imsi from phone_",tblname," group by imsi having count(imsi) > 1) b where a.imsi = b.imsi group by a.imsi");

    /* execute the view statement */

    PREPARE stmt FROM @tbl;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    OPEN cur_l;
    FETCH cur_l INTO v_imsi;
    while (done <> 1)
    do

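    /* compare against the big table and delete the duplicates from the small table;
       the value is quoted because imsi can contain letters */
    set @del = CONCAT("delete from phone_",tblname," where imsi='",v_imsi,"'");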
    PREPARE stmt1 FROM @del;
    EXECUTE stmt1;
    DEALLOCATE PREPARE stmt1;
    FETCH cur_l INTO v_imsi;
    end while;
    close cur_l;
    end//

    DELIMITER ;
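
    A sketch of how the procedure might be called, assuming a small table named phone_12 exists (the suffix 12 is hypothetical; the procedure prepends phone_ to whatever is passed in), followed by a query that lists any imsi values still duplicated inside the small table afterwards:

    ```sql
    -- Run the deduplication against the small table phone_12.
    CALL create_imsi('12');

    -- Any imsi returned here still occurs more than once in phone_12.
    SELECT imsi, COUNT(*) AS copies
    FROM phone_12
    GROUP BY imsi
    HAVING COUNT(*) > 1;
    ```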

    2. Insert into the new tables according to the hash:

    DELIMITER //
    CREATE PROCEDURE insert_imsi(IN tblname varchar(20))
    begin
    declare age int default 1;
    declare done int(1) default 0;
    declare done1 int(1) default 0;
    declare v_imsi varchar(200);
    declare v_e varchar(2000);
    declare v_number varchar(3000);
    declare v_ctype varchar(2000);
    /* one row per bucket; without DISTINCT every source row would re-run the same insert */
    declare cur_l cursor for select distinct split from sqlstr;
    DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' set done=1;
    DECLARE CONTINUE HANDLER FOR 1146 set done1=3;
    DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' set done1=1;
    DECLARE CONTINUE HANDLER FOR SQLSTATE '42000' set done1=2;
    DECLARE CONTINUE HANDLER FOR SQLSTATE 'HY000' set done1=3;
    drop view if exists sqlstx;
    drop view if exists sqlstr;
    set @sqlstx = CONCAT("create view sqlstr as SELECT imsi,number,ctype,mod(conv(right(md5(imsi),2),16,10),100) split from imsi_phone_",tblname);
    PREPARE stmt1 FROM @sqlstx;
    EXECUTE stmt1;
    DEALLOCATE PREPARE stmt1;
    OPEN cur_l;
    FETCH cur_l INTO v_e;
    WHILE done <> 1
    DO
      /* copy every row of this bucket into its target table */
      set @ins = concat("insert into imsi_",v_e,"(imsi,number,ctype) select imsi,number,ctype from sqlstr where split = '",v_e,"'");
      PREPARE stmt3 FROM @ins;
      EXECUTE stmt3;
      DEALLOCATE PREPARE stmt3;
      FETCH cur_l INTO v_e;
    END WHILE;
    close cur_l;
    end//

    DELIMITER ;
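
    The procedure inserts into tables named imsi_ followed by the bucket number, so those tables have to exist before it runs. A minimal sketch of creating them in a loop; the procedure name create_bucket_tables, the column types and the primary key are assumptions made for illustration, not taken from the original schema:

    ```sql
    DELIMITER //

    CREATE PROCEDURE create_bucket_tables()
    BEGIN
      DECLARE i INT DEFAULT 0;
      -- Buckets run from 0 to 99 because the hash is taken modulo 100.
      WHILE i < 100 DO
        SET @ddl = CONCAT(
          'CREATE TABLE IF NOT EXISTS imsi_', i,
          ' (imsi VARCHAR(32) NOT NULL,',   -- assumed type
          '  number VARCHAR(32),',          -- assumed type
          '  ctype VARCHAR(32),',           -- assumed type
          '  PRIMARY KEY (imsi)) ENGINE=InnoDB');
        PREPARE s FROM @ddl;
        EXECUTE s;
        DEALLOCATE PREPARE s;
        SET i = i + 1;
      END WHILE;
    END//

    DELIMITER ;

    CALL create_bucket_tables();
    ```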

    Errors encountered:

    1. ERROR 1243 (HY000) at line 1: Unknown prepared statement handler (stmt3) given to EXECUTE

    2. ERROR 1054 (42S22) at line 1: Unknown column '000cdc41b2a02518' in 'where clause'

    The cause: the imsi values contain letters, so the value after the = sign has to be wrapped in single quotes, and the statement originally was not. With the quotes added it reads: set @dat = concat("insert into imsi_",v_e,"(imsi,number,ctype) select imsi,number,ctype from imsi_phone_",tblname," where imsi='",v_imsi,"'");
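
    One way to sidestep the quoting problem is to pass the value through a ? placeholder instead of concatenating it into the SQL text; only the table name, which cannot be a placeholder, is concatenated. A minimal sketch, assuming a hypothetical small table phone_12 and using the value from the error message:

    ```sql
    -- Only the identifier is concatenated; the value is bound at EXECUTE time.
    SET @del = CONCAT('DELETE FROM phone_', '12', ' WHERE imsi = ?');
    SET @v = '000cdc41b2a02518';

    PREPARE stmt FROM @del;
    EXECUTE stmt USING @v;   -- USING takes user variables such as @v
    DEALLOCATE PREPARE stmt;
    ```

    Inside a stored procedure the cursor value would first be copied into a user variable (for example SET @v = v_imsi;) before the EXECUTE.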

    Parameter tuning:

    The tables use the InnoDB engine, so the tuning below is specific to InnoDB:

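    1. innodb_flush_log_at_trx_commit: set to 0 or 2 to reduce flushing (1, the default, flushes the log to disk at every commit).
    2. set sql_log_bin=0: temporarily stop writing the binary log for the session.
    3. sync_binlog: set to 0 to reduce flushing.
    4. innodb_buffer_pool_size: set as large as possible.
    5. set foreign_key_checks=0: skip foreign key checks.
    6. Drop unnecessary indexes; if there is duplicate data, though, a primary key is required.
    7. innodb_change_buffer_max_size: the upper limit is 50; I set it to 40, because the load is insert-heavy and benefits from the insert (change) buffer.
    8. binlog_cache_size: if the binary log must stay on, set this as large as possible; with sync_binlog at 0 the cache should be large.
    9. innodb_flush_method: the flush mode; set to O_DIRECT.
    10. innodb_io_capacity: governs dirty-page flushing; set it according to your disks, 800 to 900 for SAS.
    11. innodb_log_buffer_size and innodb_sort_buffer_size: set as large as possible.
    12. unique_checks: turn the check off: set unique_checks=0;
    13. alter table tablename disable keys; tells the table to skip index maintenance, if it has indexes. (This statement only takes effect on non-unique indexes of MyISAM tables; InnoDB tables ignore it.)
    14. Turn off autocommit to reduce log flushing: SET autocommit=0;
    15. Set innodb_autoinc_lock_mode=2.
    16. Define a primary key (the clustered index); inserting in primary-key order is faster.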
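
    As a rough sketch, the settings that can be changed at runtime might be applied around a load like this. The numeric values are the ones mentioned in the list, except innodb_flush_log_at_trx_commit, which is set to 2 here because 1 flushes at every commit. innodb_flush_method and innodb_autoinc_lock_mode are startup options and belong in the server configuration file instead:

    ```sql
    -- Server-wide settings (need a privileged account).
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;
    SET GLOBAL sync_binlog = 0;
    SET GLOBAL innodb_io_capacity = 800;
    SET GLOBAL innodb_change_buffer_max_size = 40;

    -- Session-level switches, set on the connection that runs the load.
    SET sql_log_bin = 0;
    SET foreign_key_checks = 0;
    SET unique_checks = 0;
    SET autocommit = 0;

    LOAD DATA INFILE '/usr/local/phone_imsi_12' REPLACE INTO TABLE tbl_imsi2number_new
        FIELDS TERMINATED BY ' ' ENCLOSED BY '' (number, imsi);
    COMMIT;

    -- Restore the session defaults afterwards.
    SET autocommit = 1;
    SET unique_checks = 1;
    SET foreign_key_checks = 1;
    SET sql_log_bin = 1;
    ```

    The global values should likewise be set back once the load is finished, since relaxed flushing trades durability for speed.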

  • Original article: https://www.cnblogs.com/magmell/p/8941338.html