所谓的数据稠化,就是补全缺失的数据。因为在数据库表中,存储的数据经常是稀疏的(sparse data),也就是不完整的。比如记录一个员工每个月的销售额,用这么一个销售表来记录:SalesRecord(Name(姓名),Date(日期),Sales(销售额)),假设某个月这个员工请假没上班,对应的没有销售额,一般也不会将这个员工的销售额存储为0,而是直接不存储,这样在销售表中就会产生缺失的行,导致的结果就是这些销售数据在时间上是不连续的,或者说就是缺失的。为了后续的一些统计,需要对数据进行补全也就是所谓的数据稠化。
下面做一个简单的例子,
图 1
需要知道每个人每科的成绩,这里就是需要补全数据,比如(Lucy的Chinese成绩??),因此,最终我们需要的到
的结果是这样的(红色背景是补全的内容):
图 2
步骤如下:
先创建一张成绩表Scores
1 --创建Scores表 2 create table Scores( 3 stuName varchar2(10), 4 subject varchar2(10), 5 score number );
然后插入数据,得到图1.
接下来,我们是实现一维(学科)数据稠密,也就是对学科,每个人都有每个学科,首先想到的是要找出所有学科(暂且这么做,以后经常是对另外一张学科表join),
废话少说,找出所有学科:
select distinct subject from Scores;
我们可以根据原表利用partition by()语法来进行下一步操作
1 --一维稠密数据
2 SELECT scores.stuname,
3 m.subject,
4 NVL(scores.score,0)
5 FROM scores
6 PARTITION BY (scores.stuname) --这里是重点
7 right join
8 (SELECT DISTINCT subject FROM scores) m
9 ON scores.subject=m.subject;
最终得到结果:
显然,上面的代码看起来很杂乱,我们可以来个with as 语法,使代码看起来清晰:
with
v1 as (select distinct subject from scores),
SELECT scores.stuname,v1.subject,NVL(scores.score,0) FROM scores
PARTITION BY (scores.stuname)
right join v1
ON scores.subject=v1.subject;
实现了一维的数据稠密,那么给表再添加个字段—年份,需要知道每个人,每年的每科成绩又怎么稠密呢?同样的道理
先实现一维的稠化,再在稠化后的基础上再稠化一次,以此类推就可以了嘛!
with
v1 as (select * from scores), --这是原表
v2 as (select distinct subject from scores), --所有科目
v3 as (select distinct dateyear from scores), --所有年份
v4 as (select v1.stuname,v2.subject,v1.score,v1.dateyear from v1
partition by (v1.stuname)
right join v2
on v1.subject = v2.subject) --v4就是对学科稠密化后的表,如图2所示
select v4.stuname , v4.subject , NVL(v4.score,0),v3.dateyear from v4
partition by (stuname,subject) --注意这里
right join v3 --最后在对v4进行年份的稠密,
on v4.dateyear=v3.dateyear;
这样就大功告成啦,每个人,每年的每课成绩均可有了。接下来,需要对着表进行行列转换如下图所示,这里我们以后再说!
下面实现二维数据稠化,我们同样有如下表:
YEARMONTH | STUDENT | SUBJECT | SCORE |
201601 | Jim | Chinese | 78 |
201601 | Jim | Math | 34 |
201603 | Jim | English | 89 |
201605 | Jim | Physics | 88 |
201608 | Jim | Math | 67 |
201601 | Joe | Math | 87 |
201602 | Joe | Chinese | 87 |
201604 | Joe | Chinese | 55 |
201609 | Joe | Math | 45 |
201609 | Joe | Physics | 90 |
YEARMONTH | STUDENT | SUBJECT | SCORE |
201601 | Jim | Chinese | 78 |
201601 | Jim | Math | 34 |
201601 | Jim | English | 0 |
201601 | Jim | Physics | 0 |
201602 | Jim | Chinese | 0 |
201602 | Jim | Math | 0 |
201602 | Jim | English | 0 |
201602 | Jim | Physic | 0 |
201603 | Jim | Chinese | 0 |
201603 | Jim | Math | 0 |
201603 | Jim | English | 89 |
201603 | Jim | Physics | 0 |
如右表所示,部分补全数据为红色背景的数据,其分数默认为0,这样我们就能看到
每个人(student维度)在所有时间(yearmonth维度)的每科(subject维度)的分数(score度量)
我们的做法应该是怎么样呢?
我们先做前期的准备,创建一张成绩表,并插入相应的数据
--创建学生成绩表 CREATE TABLE stu_score ( yearmonth number, student VARCHAR2(20), subject varchar2(20), score number ) --往表中插入数据 INSERT INTO stu_score VALUES(201601,'Jim','Chinese',78); INSERT INTO stu_score VALUES(201601,'Jim','Math',34); INSERT INTO stu_score VALUES(201603,'Jim','English',89); INSERT INTO stu_score VALUES(201605,'Jim','Physics',88); INSERT INTO stu_score VALUES(201608,'Jim','Math',67); INSERT INTO stu_score VALUES(201601,'Joe','Math',87); INSERT INTO stu_score VALUES(201602,'Joe','Chinese',87); INSERT INTO stu_score VALUES(201604,'Joe','Chinese',55); INSERT INTO stu_score VALUES(201609,'Joe','Math',45); INSERT INTO stu_score VALUES(201609,'Joe','Physics',90);
同样,创建一张时间维度表
--创建时间维度表 CREATE TABLE DIM_DATE ( yearmonth number ); INSERT INTO DIM_DATE VALUES(201601); INSERT INTO DIM_DATE VALUES(201602); INSERT INTO DIM_DATE VALUES(201603); INSERT INTO DIM_DATE VALUES(201604); INSERT INTO DIM_DATE VALUES(201605); INSERT INTO DIM_DATE VALUES(201606); INSERT INTO DIM_DATE VALUES(201607); INSERT INTO DIM_DATE VALUES(201608); INSERT INTO DIM_DATE VALUES(201609); INSERT INTO DIM_DATE VALUES(201610); INSERT INTO DIM_DATE VALUES(201611); INSERT INTO DIM_DATE VALUES(201612);
然后,我们需要做的就是去稠化这些数据,保证在每个维度都有数据
WITH sub AS ( SELECT DISTINCT subject FROM stu_score ), t1 as( SELECT t.yearmonth,t.student,sub.subject,t.score FROM stu_score t PARTITION BY (t.student) RIGHT JOIN sub ON t.subject=sub.subject) SELECT dim_date.yearmonth,t1.student,t1.subject,nvl(t1.score,0) FROM t1 PARTITION BY (student,subject) right JOIN DIM_DATE ON dim_date.yearmonth = t1.yearmonth;
或者,不用创建临时表,直接合并
WITH sub AS ( --学科维度表,将所有学科选出 SELECT DISTINCT subject FROM stu_score ), SELECT dim_date.yearmonth,t1.student,t1.subject,nvl(t1.score,0) FROM (SELECT t.yearmonth,t.student,sub.subject,t.score FROM stu_score t PARTITION BY (t.student) RIGHT JOIN sub ON t.subject=sub.subject)t1 --对学科稠化,每个人在每个学科都有数据 PARTITION BY (student,subject) right JOIN DIM_DATE ON dim_date.yearmonth = t1.yearmonth; --对日期稠化,保证每个日期都有数据