笔者使用的环境:
# | 类别 | 版本 |
1 | 操作系统 | Win10 |
2 | 数据库 | Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production |
3 | 硬件环境 | T440p |
4 | 内存 | 8G |
有这样一张表:
CREATE TABLE tb_sc ( id NUMBER not null primary key, studentid int not null, courseid int not null, score int not null )
可以用以下SQL给它充值:
Insert into tb_sc select rownum,dbms_random.value(0,100000),dbms_random.value(1,5),dbms_random.value(0,150) from dual connect by level<=200000 order by dbms_random.random
填充20万数据,很快就完成了。
因为上面学生科目是随机数产生的,因此会出现同一studentid,同一科目id,而有不同考分的多条记录,这在现实的一次高考中是不会发生的,因此需要清除掉重复的记录。
我采用的方案是留下学生id和科目id相同而分数最高的一条,可以用以下sql来得到记录:
select studentid,courseid,max(score) from tb_sc group by studentid,courseid
看以下DB里这样的记录约有十六万条,也就是说有四万多条记录要清除掉:
SQL> select count(*) from 2 ( select studentid,courseid,max(score) from tb_sc group by studentid,courseid ); COUNT(*) ---------- 162315
使用如下语句就能找到这十六万条记录:
select tb_sc.* from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2 where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score
然后使用以下语句能将这十六万条输入导入到一张新表:
create table tb_sc_nodup2 as select tb_sc.* from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2 where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score
下面是执行记录:
SQL> create table tb_sc_nodup2 as select tb_sc.* from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2 2 where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score ; 表已创建。 已用时间: 00: 00: 00.42
可见,创建新表还是挺快的。之后truancate掉旧表,再insert into 旧表 select * from 新表也花不了多少时间。
也可以使用以下sql取删除掉重复项 :
delete from tb_sc where id not in (select tb_sc.id from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2 where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score)
看看这下要花多长时间:
SQL> delete from tb_sc where id not in (select tb_sc.id from tb_sc,( select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid ) tb_sc2 2 where tb_sc.studentid=tb_sc2.studentid and tb_sc.courseid=tb_sc2.courseid and tb_sc.score=tb_sc2.score); 已删除37448行。 已用时间: 00: 00: 00.81
也还可以。和导入新表删旧表再倒回来估计耗时相去不远。
但如果采用如下sql去删除:
delete from tb_sc where (studentid,courseid,score) not in (select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid)
那花的时间可就长了,长到能令人怀疑人生。因为它相当于跑了个双重循环(20万*16万=320亿),再用六个量进行三三比对,这六个量还没有一个是主键,自然就慢得吓人了。
虽然说小表一般不会产生性能问题,但sq书写不合理也一样会导致性能问题的。
这个语句到底多慢呢,如果是20万数据量我等不起,让我们通过减少数据量再进行测试:
SQL> truncate table tb_sc; 表被截断。 SQL> Insert into tb_sc 2 select rownum,dbms_random.value(0,10000),dbms_random.value(1,5),dbms_random.value(0,150) from dual 3 connect by level<=10000 4 order by dbms_random.random; 已创建10000行。 SQL> commit; 提交完成。 SQL> select count(*) from 2 ( select studentid,courseid,max(score) from tb_sc group by studentid,courseid ); COUNT(*) ---------- 8969
commit之后,tb_sc表里有一万条数据,大约有一千多条是不符合要求的。
再打开执行计划看看:
SQL> set autotrace trace exp; SQL> delete from tb_sc where (studentid,courseid,score) not in (select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid); 已删除1029行。 执行计划 ---------------------------------------------------------- Plan hash value: 3448751082 ------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------- | 0 | DELETE STATEMENT | | 449 | 23348 | 77150 (15)| 00:15:26 | | 1 | DELETE | TB_SC | | | | | |* 2 | FILTER | | | | | | | 3 | TABLE ACCESS FULL | TB_SC | 8971 | 455K| 9 (0)| 00:00:01 | |* 4 | FILTER | | | | | | | 5 | HASH GROUP BY | | 8971 | 341K| 11 (19)| 00:00:01 | | 6 | TABLE ACCESS FULL| TB_SC | 8971 | 341K| 9 (0)| 00:00:01 | ------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter( NOT EXISTS (SELECT 0 FROM "TB_SC" "TB_SC" GROUP BY "STUDENTID","COURSEID" HAVING LNNVL("STUDENTID"<>:B1) AND LNNVL("COURSEID"<>:B2) AND LNNVL(MAX("SCORE")<>:B3))) 4 - filter(LNNVL("STUDENTID"<>:B1) AND LNNVL("COURSEID"<>:B2) AND LNNVL(MAX("SCORE")<>:B3)) Note ----- - dynamic sampling used for this statement (level=2)
和猜测差不多,里面有两次全表扫描,两次比较,但是cost在delete statement处骤升让人惊异。万条记录还能执行并把执行计划跑出来,二十万就遥遥无期了。
看来以后写in查询走非索引列是不合适的,这也是在调试别的课题时得到的收获吧。
汇聚点滴,终成大洋!
--2020年1月24日--
附:与not in等效之not exist语句
delete from tb_sc where not exists ( select 'x' from tb_sc a, (select studentid,courseid,max(score) as score from tb_sc group by studentid,courseid) b where a.studentid=b.studentid and a.courseid=b.courseid and a.score=b.score and tb_sc.id=a.id)
--2020年1月24日 18点37分--