除了平均数外,众数和中位数也是衡量集合的标尺之一,中位数是将集合升序排列后恰好位于正中间的元素,如果集合总数为偶数,则取中间两个元素的平均值作为中位数,下文将就用SQL去求中位数展开讨论。
首先建表:
create table tb_coder( id number(4,0) not null primary key, name nvarchar2(20) not null, salary integer not null)
然后充值:
insert into tb_coder select rownum,dbms_random.string('*',dbms_random.value(2,20)),dbms_random.value(3000,20000) from dual connect by level<11 order by dbms_random.random;
然后我们看看表中数据:
SQL> select * from tb_coder order by salary; ID NAME SALARY ---------- ---------------------------------------- ---------- 9 DTVMMQSOAJA 4947 10 CSIZK 5424 2 TYLWH 5676 7 EPUIILOYENSD 10316 3 MSKHEGSWNIFZWB 15153 4 OFZKCXNWZWSJR 15326 5 HOJLJWIOYNDBB 16101 8 XJQTKPDXAIOOWJ 16843 6 RWRGMEIRGZN 18599 1 OZKFBBFYWGEERAUGSNH 19710 已选择10行。
现在可以观察得知,总个数为10,是偶数,中位数就该是处于中间两个数15153和15326的平均值:15329.5
经过思考,我发现解决问题的关键是得知中间在哪,为此我添加了两列以帮助思考:
SQL> select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary; ID SALARY SEQ REVSEQ ---------- ---------- ---------- ---------- 9 4947 1 10 10 5424 2 9 2 5676 3 8 7 10316 4 7 3 15153 5 6 4 15326 6 5 5 16101 7 4 8 16843 8 3 6 18599 9 2 1 19710 10 1 已选择10行。
进过观察可以知道,用seq减去revseq再求绝对值,值大的一定靠两边,值小的一定靠中间。于是再查:
SQL> select a.*,abs(a.seq-a.revseq) as diff from 2 (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a; ID SALARY SEQ REVSEQ DIFF ---------- ---------- ---------- ---------- ---------- 9 4947 1 10 9 10 5424 2 9 7 2 5676 3 8 5 7 10316 4 7 3 3 15153 5 6 1 4 15326 6 5 1 5 16101 7 4 3 8 16843 8 3 5 6 18599 9 2 7 1 19710 10 1 9 已选择10行。
从上面已经明显看出,diff值最小的就是我们要找的15153和15326两条记录。
如果我取diff值最小的记录,求平均值不就是中位数了吗,于是有了下面的SQL:
SQL> select avg(b.salary) from 2 (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) b 3 where b.diff=(select min(c.diff) from 4 (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) c); AVG(B.SALARY) ------------- 15239.5
结果等于预期,这时偶数个情况,那么奇数个呢?
让我们先删除最后一行:
SQL> delete from tb_coder where id=10; 已删除 1 行。 SQL> commit; 提交完成。
再看结果集:
SQL> select * from tb_coder order by salary; ID NAME SALARY ---------- ---------------------------------------- ---------- 9 DTVMMQSOAJA 4947 2 TYLWH 5676 7 EPUIILOYENSD 10316 3 MSKHEGSWNIFZWB 15153 4 OFZKCXNWZWSJR 15326 5 HOJLJWIOYNDBB 16101 8 XJQTKPDXAIOOWJ 16843 6 RWRGMEIRGZN 18599 1 OZKFBBFYWGEERAUGSNH 19710 已选择9行。
只剩下九条,那么处于中间的15326应该是中位数。
看看是不是:
SQL> select avg(b.salary) from 2 (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) b 3 where b.diff=(select min(c.diff) from 4 (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) c); AVG(B.SALARY) ------------- 15326
结果等于预期!
附:另一种中位数求法,这种方法比较简洁,但是是别人的方法《SQL进阶教程》P61-P62,不是我自己想出来的。
select avg(distinct salary) from ( select t1.salary from tb_coder t1,tb_coder t2 group by t1.salary having sum(case when t1.salary>=t2.salary then 1 else 0 end)>=count(*)/2 and sum(case when t1.salary<=t2.salary then 1 else 0 end)>=count(*)/2 )
另外众数也是衡量集合属性的标尺之一,下面列出了两种求众数的方法供大家参考(一千元分一个档):
--求众数(使用谓词) select a.sal,count(*) as cnt from (select round(salary/1000)*1000 as sal from tb_coder) a group by a.sal having count(*) >= all(select count(*) from (select round(salary/1000)*1000 as sal from tb_coder) b group by b.sal) --求众数(使用rownum) select b.sal,b.cnt from ( select a.sal,count(*) as cnt from (select round(salary/1000)*1000 as sal from tb_coder) a group by a.sal order by cnt desc ) b where rownum=1
--2020年4月5日--
以上用到的全部SQL:
create table tb_coder( id number(4,0) not null primary key, name nvarchar2(20) not null, salary integer not null) insert into tb_coder select rownum,dbms_random.string('*',dbms_random.value(2,20)),dbms_random.value(3000,20000) from dual connect by level<11 order by dbms_random.random; select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a select b.* from (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) b order by b.diff select b.* from (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) b where b.diff=(select min(c.diff) from (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) c) select avg(b.salary) from (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) b where b.diff=(select min(c.diff) from (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_coder order by salary) a ) c)