题目
1.用拉链表实现核心交易分析中DIM层商家维表,并实现该拉链表的回滚(自己构造数据,编写SQL,并要有相应的文字说明);
2. 在会员分析中计算 沉默会员数 和 流失会员数
沉默会员的定义:只在安装当天启动过App,而且安装时间是在7天前
流失会员的定义:最近30天未登录的会员
3. 在核心交易分析中完成如下指标的计算
统计2020年每个季度的销售订单笔数、订单总额
统计2020年每个月的销售订单笔数、订单总额
统计2020年每周(周一到周日)的销售订单笔数、订单总额
统计2020年国家法定节假日、休息日、工作日的订单笔数、订单总额
详解(1)
一、创建表加载数据(准备工作)
1.创建商家信息表--拉链表(dim层)
特点:
- 相较于商家维表,多了两个字段startdate,enddate
- 拉链表不是分区表
DROP TABLE IF EXISTS `dim.dim_trade_shops`; CREATE EXTERNAL TABLE `dim.dim_trade_shops`( `shopid` int, `userid` int, `areaid` int, `shopname` string, `shoplevel` tinyint, `status` tinyint, `createtime` string, `modifytime` string,
`startdate` string,
`enddate` string
)COMMENT '商家店铺表';
2.准备三天的数据
/root/data/shops/shop-2020-11-20.dat
100050,1,100225,WSxxx营超市,1,1,2020-06-28,2020-11-20 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100053,3,100011,华为xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100054,4,100159,小米xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100055,5,100211,苹果xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
/root/data/shops/shop-2020-11-21.dat
100057,7,100311,三只xxx鼠零食,1,1,2020-06-28,2020-11-21 13:22:22
100058,8,100329,良子xxx铺美食,1,1,2020-06-28,2020-11-21 13:22:22
100054,4,100159,小米xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
100055,5,100211,苹果xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
/root/data/shops/shop-2020-11-22.dat
100059,9,100225,乐居xxx日用品,1,1,2020-06-28,2020-11-22 13:22:22
100060,10,100211,同仁xxx大健康,1,1,2020-06-28,2020-11-22 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,2,2020-06-28,2020-11-22 13:22:22
3.创建商家信息维表(ods层)
DROP TABLE IF EXISTS `ods.ods_trade_shops`; CREATE EXTERNAL TABLE `ods.ods_trade_shops`( `shopid` int, `userid` int, `areaid` int, `shopname` string, `shoplevel` tinyint, `status` tinyint, `createtime` string, `modifytime` string ) COMMENT '商家店铺表' PARTITIONED BY (`dt` string) row format delimited fields terminated by ',' location '/user/data/trade.db/shops';
二、拉链表的实现
- ods_trade_shops(分区表) => 每日变更的数据(修改的+新增的) / 历史数据(第一天)
- dim_trade_shops(拉链表)=> 多了两个字段 startdate 、 enddate
1、ods_trade_shops 初始化(2020-11-20)。获取历史数据
load data local inpath '/root/data/shops/shop-2020-11-20.dat' into table ods.ods_trade_shops partition(dt = '2020-11-20');
100050,1,100225,WSxxx营超市,1,1,2020-06-28,2020-11-20 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100053,3,100011,华为xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100054,4,100159,小米xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100055,5,100211,苹果xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
2.初始化拉链表(2020-11-20)。ods_trade_shops => dim_trade_shops
insert overwrite table dim.dim_trade_shops select shopid,userid,areaid,shopname,shoplevel,status,createtime,modifytime, dt as startdate, '9999-12-31' as enddate from ods.ods_trade_shops where dt = '2020-11-20';
3.导入2020-11-21数据,构建拉链表
load data local inpath '/root/data/shops/shop-2020-11-21.dat' into table ods.ods_trade_shops partition(dt = '2020-11-21');
100057,7,100311,三只xxx鼠零食,1,1,2020-06-28,2020-11-21 13:22:22
100058,8,100329,良子xxx铺美食,1,1,2020-06-28,2020-11-21 13:22:22
100054,4,100159,小米xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
100055,5,100211,苹果xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
4.构建拉链表(新增数据 + 历史数据)
新增数据处理逻辑与家在历史数据相似,startdate=dt enddate = '9999-12-31'
第一步:新增数据的处理办法
select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, dt as startdate,
'9999-12-31' AS enddate
from ods.ods_trade_shops
where dt = '2020-11-21';
第二步:历史数据的处理办法
select b.shopid, b.userid ,b.areaid , b.shopname, b.shoplevel,b.status, b.createtime, b.modifytime,
b.startdate, CASE WHEN a.shopid is not null and b.enddate ='9999-12-31' THEN date_add('2020-11-21',-1) ELSE b.enddate
end as enddate from (select *
from ods.ods_trade_shops
where dt='2020-11-21') a
right join dim.dim_trade_shops b
on a.shopid = b.shopid;
综上,拉链表的最终处理(新增+历史数据)
insert overwrite table dim.dim_trade_shops select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, dt as startdate, '9999-12-31' AS enddate from ods.ods_trade_shops where dt = '2020-11-21'
union all
select b.shopid, b.userid ,b.areaid , b.shopname, b.shoplevel,b.status, b.createtime, b.modifytime, b.startdate, CASE WHEN a.shopid is not null and b.enddate ='9999-12-31' THEN date_add('2020-11-21',-1) ELSE b.enddate end as enddate from (select * from ods.ods_trade_shops where dt='2020-11-21') a right join dim.dim_trade_shops b on a.shopid = b.shopid;
处理拉链表的脚本:shopszipper.sh
#!/bin/bash source /etc/profile if [ -n "$1" ] ;then do_date=$1 else do_date=`date -d "-1 day" +%F` fi sql=" insert overwrite table dim.dim_trade_shops select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, dt as startdate, '9999-12-31' AS enddate from ods.ods_trade_shops where dt = '$do_date' union all select b.shopid, b.userid ,b.areaid , b.shopname, b.shoplevel,b.status, b.createtime, b.modifytime, b.startdate, CASE WHEN a.shopid is not null and b.enddate ='9999-12-31' THEN date_add('$do_date',-1) ELSE b.enddate end as enddate from (select * from ods.ods_trade_shops where dt='$do_date') a right join dim.dim_trade_shops b on a.shopid = b.shopid; " hive -e "$sql"
5.结果验证
ods维表初始化(导入2020-11-20的数据)
ods维表导入2020-11-21日的数据
拉链表更新后(维表导入2020-11-21数据,并执行拉链表的操作后的结果)
三、拉链表的回滚
回滚逻辑:将拉链表恢复到rollback_date那一天的数据
- end_date < rollback_date,即结束日期 < 回滚日期。表示该行数据在rollback_date 之前产生,这些数据需要原样保留
- start_date <= rollback_date <= end_date,即开始日期 <= 回滚日期 <= 结束日期。这些数据是回滚日期之后产生的,但是需要修改。将end_date 改为 9999-12-31
- 其他数据不用管
例如,回滚到2020-11-21的数据
1.处理end_date < rollback_date 的数据:保留
select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, startdate, enddate
1 as tag from dim.dim_trade_shops where enddate < '2020-11-21';
2.处理start_date < rollback_date < end_date 的数据:设置enddate=9999-12-31
select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, startdate, enddate 2 as tag from dim.dim_trade_shops where startdate <= '2020-11-21' and '2020-11-21' <= enddate;
3.将数据写入拉链表
drop table tmp.test; create table tmp.test as select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, startdate, enddate, 1 as tag from dim.dim_trade_shops where enddate < '2020-11-21'union all select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, startdate, enddate, 2 as tag from dim.dim_trade_shops where startdate <= '2020-11-21' and '2020-11-21' <= enddate;
4.结果验证
回滚的表:
原有的表:
详解(2)
2. 在会员分析中计算沉默会员数和流失会员数
沉默会员的定义:只在安装当天启动过App,而且安装时间是在7天前
select count(*) clientNum from dws.dws_member_start_day where dt <= date_add(current_date,-7) and device_id in( select device_id from ( select device_id, count(*) cnt from dws.dws_member_start_day group by device_id having cnt = 1 )tmp );
流失会员的定义:最近30天未登录的会员
select count(distinct device_id) lossNum from dws.dws_member_start_day where device_id not in ( select distinct device_id from dws.dws_member_start_day where dt >=date_add(current_date,30) tem );
详解(3)
3. 在核心交易分析中完成如下指标的计算
分析:以createTime来确定当前订单属于哪个季度,哪个月,哪个周,维护一张节假日,休息日,工作日的配置表,判断属于哪个特殊日期,计算orderId的数量,计算totalMoney的总和
create table dwd.dwd_trade_orders( `orderId` int, `orderNo` string, `userId` bigint, `status` tinyint, `productMoney` decimal, `totalMoney` decimal, `payMethod` tinyint, `isPay` tinyint, `areaId` int, `tradeSrc` tinyint, `tradeType` int, `isRefund` tinyint, `dataFlag` tinyint, `createTime` string, `payTime` string, `modifiedTime` string, `start_date` string, `end_date` string ) COMMENT '订单事实拉链表' partitioned by (dt string) STORED AS PARQUET;
create table special_day( `special_day_name` string, `start_date` string, `end_date` string ) COMMENT '订单事实拉链表'
统计2020年每个季度的销售订单笔数、订单总额
select (weekofyear(createTime)/3.1)+1,count(orderId),sum(totalMoney) from dwd_trade_orders group by (weekofyear(createTime)/3.1)+1 where year(createTime)='2020'
统计2020年每个月的销售订单笔数、订单总额
select month(createTime),count(orderId),sum(totalMoney) from dwd_trade_orders group by month(createTime) where year(createTime)='2020'
统计2020年每周(周一到周日)的销售订单笔数、订单总额
select weekofyear(createTime),count(orderId),sum(totalMoney) from dwd_trade_orders group by weekofyear(createTime) where year(createTime)='2020'
统计2020年国家法定节假日、休息日、工作日的订单笔数、订单总额
select special_day_name,count(orderId),sum(totalMoney) from( select a.orderId,a.totalMoney,b.special_day_name from dwd_trade_orders a right join special_day b on b.start_date<=a.createTime<=end_date ) t1 group by special_day_name where year(createTime)='2020';