• 【Oracle/Sql】清除重复而带出的性能问题


    笔者使用的环境:

    # 类别 版本
    1 操作系统 Win10
    2 数据库 Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
    3 硬件环境 T440p
    4 内存 8G

    有这样一张表:

    CREATE TABLE tb_sc
    (
        id NUMBER not null primary key,
        studentid int not null,
        courseid int not null,
        score int not null
    )

    可以用以下SQL给它充值:

    Insert into tb_sc
    select rownum,dbms_random.value(0,100000),dbms_random.value(1,5),dbms_random.value(0,150) from dual
    connect by level<=200000
    order by dbms_random.random

    填充20万数据,很快就完成了。

    因为上面学生科目是随机数产生的,因此会出现同一studentid,同一科目id,而有不同考分的多条记录,这在现实的一次高考中是不会发生的,因此需要清除掉重复的记录。

    我采用的方案是留下学生id和科目id相同而分数最高的一条,可以用以下sql来得到记录:

    select studentid,courseid,max(score) from tb_sc group by studentid,courseid

    看以下DB里这样的记录约有十六万条,也就是说有四万多条记录要清除掉:

    SQL> select count(*) from
      2  ( select studentid,courseid,max(score) from tb_sc group by studentid,courseid );
    
      COUNT(*)
    ----------
        162315

    使用如下语句就能找到这十六万条记录:

    select tb_sc.* from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2
    where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score 

    然后使用以下语句能将这十六万条输入导入到一张新表:

    create table tb_sc_nodup2 as select tb_sc.* from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2
    where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score 

    下面是执行记录:

    SQL> create table tb_sc_nodup2 as select tb_sc.* from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2
      2  where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score ;
    
    表已创建。
    
    已用时间:  00: 00: 00.42

    可见,创建新表还是挺快的。之后truancate掉旧表,再insert into 旧表 select * from 新表也花不了多少时间。

    也可以使用以下sql取删除掉重复项 :

    delete from tb_sc where id not in (select tb_sc.id from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2
    where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score)

    看看这下要花多长时间:

    SQL> delete from tb_sc where id not in (select tb_sc.id from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2
      2  where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score);
    
    已删除37448行。
    
    已用时间:  00: 00: 00.81

    也还可以。和导入新表删旧表再倒回来估计耗时相去不远。

    但如果采用如下sql去删除:

    delete from tb_sc where (studentid,courseid,score) not in (select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid)

    那花的时间可就长了,长到能令人怀疑人生。因为它相当于跑了个双重循环(20万*16万=320亿),再用六个量进行三三比对,这六个量还没有一个是主键,自然就慢得吓人了。

    虽然说小表一般不会产生性能问题,但sq书写不合理也一样会导致性能问题的。

    这个语句到底多慢呢,如果是20万数据量我等不起,让我们通过减少数据量再进行测试:

    SQL> truncate table tb_sc;
    
    表被截断。
    
    SQL> Insert into tb_sc
      2  select rownum,dbms_random.value(0,10000),dbms_random.value(1,5),dbms_random.value(0,150) from dual
      3  connect by level<=10000
      4  order by dbms_random.random;
    
    已创建10000行。
    
    SQL> commit;
    
    提交完成。
    
    SQL> select count(*) from
      2  ( select studentid,courseid,max(score) from tb_sc group by studentid,courseid );
    
      COUNT(*)
    ----------
          8969

    commit之后,tb_sc表里有一万条数据,大约有一千多条是不符合要求的。

    再打开执行计划看看:

    SQL> set autotrace trace exp;
    SQL> delete from tb_sc where (studentid,courseid,score) not in (select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid);
    
    已删除1029行。
    
    
    执行计划
    ----------------------------------------------------------
    Plan hash value: 3448751082
    
    -------------------------------------------------------------------------------
    | Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------
    |   0 | DELETE STATEMENT      |       |   449 | 23348 | 77150  (15)| 00:15:26 |
    |   1 |  DELETE               | TB_SC |       |       |            |          |
    |*  2 |   FILTER              |       |       |       |            |          |
    |   3 |    TABLE ACCESS FULL  | TB_SC |  8971 |   455K|     9   (0)| 00:00:01 |
    |*  4 |    FILTER             |       |       |       |            |          |
    |   5 |     HASH GROUP BY     |       |  8971 |   341K|    11  (19)| 00:00:01 |
    |   6 |      TABLE ACCESS FULL| TB_SC |  8971 |   341K|     9   (0)| 00:00:01 |
    -------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - filter( NOT EXISTS (SELECT 0 FROM "TB_SC" "TB_SC" GROUP BY
                  "STUDENTID","COURSEID" HAVING LNNVL("STUDENTID"<>:B1) AND
                  LNNVL("COURSEID"<>:B2) AND LNNVL(MAX("SCORE")<>:B3)))
       4 - filter(LNNVL("STUDENTID"<>:B1) AND LNNVL("COURSEID"<>:B2) AND
                  LNNVL(MAX("SCORE")<>:B3))
    
    Note
    -----
       - dynamic sampling used for this statement (level=2)

    和猜测差不多,里面有两次全表扫描,两次比较,但是cost在delete statement处骤升让人惊异。万条记录还能执行并把执行计划跑出来,二十万就遥遥无期了。

    看来以后写in查询走非索引列是不合适的,这也是在调试别的课题时得到的收获吧。

    汇聚点滴,终成大洋!

    --2020年1月24日--

     附:与not in等效之not exist语句

    delete from tb_sc where not exists (
    select 'x' from tb_sc a,
                     (select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid) b
    where a.studentid=b.studentid and a.courseid=b.courseid and a.score=b.score and tb_sc.id=a.id)

    --2020年1月24日 18点37分--

  • 相关阅读:
    原码/反码/补码详解
    树剖LCA讲解
    code vs 2166 Bessie的体重问题
    code vs 3376 符号三角形
    洛谷 P3078 [USACO13MAR]扑克牌型Poker Hands
    vjudge B
    POJ 1013 Counterfeit Dollar
    POJ 1008 Maya Calendar
    POJ 1006 Biorhythms
    POJ 3252 Round Numbers
  • 原文地址:https://www.cnblogs.com/heyang78/p/12231995.html
Copyright © 2020-2023  润新知