• 数据仓库(十)——ADS 层


    建表说明

    ADS层不涉及建模,建表根据具体需求而定。

    第一章 访客主题

    1.1 访客统计

    该需求为访客综合统计,其中包含若干指标,以下为对每个指标的解释说明。

    指标

    说明

    对应字段

    访客数

    统计访问人数

    uv_count

    页面停留时长

    统计所有页面访问记录总时长,以秒为单位

    duration_sec

    平均页面停留时长

    统计每个会话平均停留时长,以秒为单位

    avg_duration_sec

    页面浏览总数

    统计所有页面访问记录总数

    page_count

    平均页面浏览数

    统计每个会话平均浏览页面数

    avg_page_count

    会话总数

    统计会话总数

    sv_count

    跳出数

    统计只浏览一个页面的会话个数

    bounce_count

    跳出率

    只有一个页面的会话的比例

    bounce_rate

    1.建表语句

    DROP TABLE IF EXISTS ads_visit_stats;
    CREATE EXTERNAL TABLE ads_visit_stats (
      `dt` STRING COMMENT '统计日期',
      `is_new` STRING COMMENT '新老标识,1:新,0:老',
      `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
      `channel` STRING COMMENT '渠道',
      `uv_count` BIGINT COMMENT '日活(访问人数)',
      `duration_sec` BIGINT COMMENT '页面停留总时长',
      `avg_duration_sec` BIGINT COMMENT '一次会话,页面停留平均时长,单位为描述',
      `page_count` BIGINT COMMENT '页面总浏览数',
      `avg_page_count` BIGINT COMMENT '一次会话,页面平均浏览数',
      `sv_count` BIGINT COMMENT '会话次数',
      `bounce_count` BIGINT COMMENT '跳出数',
      `bounce_rate` DECIMAL(16,2) COMMENT '跳出率'
    ) COMMENT '访客统计'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_visit_stats/';

    2.数据装载
    思路分析:该需求的关键点为会话的划分,总体实现思路可分为以下几步:
    第一步:对所有页面访问记录进行会话的划分。
    第二步:统计每个会话的浏览时长和浏览页面数。
    第三步:统计上述各指标。

    insert overwrite table ads_visit_stats
    select * from ads_visit_stats
    union
    select
        '2020-06-14' dt,
        is_new,
        recent_days,
        channel,
        count(distinct(mid_id)) uv_count,
        cast(sum(duration)/1000 as bigint) duration_sec,
        cast(avg(duration)/1000 as bigint) avg_duration_sec,
        sum(page_count) page_count,
        cast(avg(page_count) as bigint) avg_page_count,
        count(*) sv_count,
        sum(if(page_count=1,1,0)) bounce_count,
        cast(sum(if(page_count=1,1,0))/count(*)*100 as decimal(16,2)) bounce_rate
    from
    (
        select
            session_id,
            mid_id,
            is_new,
            recent_days,
            channel,
            count(*) page_count,
            sum(during_time) duration
        from
        (
            select
                mid_id,
                channel,
                recent_days,
                is_new,
                last_page_id,
                page_id,
                during_time,
                concat(mid_id,'-',last_value(if(last_page_id is null,ts,null),true) over (partition by recent_days,mid_id order by ts)) session_id
            from
            (
                select
                    mid_id,
                    channel,
                    last_page_id,
                    page_id,
                    during_time,
                    ts,
                    recent_days,
                    if(visit_date_first>=date_add('2020-06-14',-recent_days+1),'1','0') is_new
                from
                (
                    select
                        t1.mid_id,
                        t1.channel,
                        t1.last_page_id,
                        t1.page_id,
                        t1.during_time,
                        t1.dt,
                        t1.ts,
                        t2.visit_date_first
                    from
                    (
                        select
                            mid_id,
                            channel,
                            last_page_id,
                            page_id,
                            during_time,
                            dt,
                            ts
                        from dwd_page_log
                        where dt>=date_add('2020-06-14',-30)
                    )t1
                    left join
                    (
                        select
                            mid_id,
                            visit_date_first
                        from dwt_visitor_topic
                        where dt='2020-06-14'
                    )t2
                    on t1.mid_id=t2.mid_id
                )t3 lateral view explode(Array(1,7,30)) tmp as recent_days
                where dt>=date_add('2020-06-14',-recent_days+1)
            )t4
        )t5
        group by session_id,mid_id,is_new,recent_days,channel
    )t6
    group by is_new,recent_days,channel;

    1.2 路径分析

    用户路径分析,顾名思义,就是指用户在APP或网站中的访问路径。为了衡量网站优化的效果或营销推广的效果,以及了解用户行为偏好,时常要对访问路径进行分析。

    用户访问路径的可视化通常使用桑基图。如下图所示,该图可真实还原用户的访问路径,包括页面跳转和页面访问次序。

    桑基图需要我们提供每种页面跳转的次数,每个跳转由source/target表示,source指跳转起始页面,target表示跳转终到页面。

    1.建表语句

    DROP TABLE IF EXISTS ads_page_path;
    CREATE EXTERNAL TABLE ads_page_path
    (
        `dt` STRING COMMENT '统计日期',
        `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
        `source` STRING COMMENT '跳转起始页面ID',
        `target` STRING COMMENT '跳转终到页面ID',
        `path_count` BIGINT COMMENT '跳转次数'
    )  COMMENT '页面浏览路径'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_page_path/';

    2.数据装载
    思路分析:该需求要统计的就是每种跳转的次数,故理论上对source/target进行分组count()即可。统计时需注意以下两点:
    第一点:桑基图的source不允许为空,但target可为空。
    第二点:桑基图所展示的流程不允许存在环。

    insert overwrite table ads_page_path
    select * from ads_page_path
    union
    select
        '2020-06-14',
        recent_days,
        source,
        target,
        count(*)
    from
    (
        select
            recent_days,
            concat('step-',step,':',source) source,
            concat('step-',step+1,':',target) target
        from
        (
            select
                recent_days,
                page_id source,
                lead(page_id,1,null) over (partition by recent_days,session_id order by ts) target,
                row_number() over (partition by recent_days,session_id order by ts) step
            from
            (
                select
                    recent_days,
                    last_page_id,
                    page_id,
                    ts,
                    concat(mid_id,'-',last_value(if(last_page_id is null,ts,null),true) over (partition by mid_id,recent_days order by ts)) session_id
                from dwd_page_log lateral view explode(Array(1,7,30)) tmp as recent_days
                where dt>=date_add('2020-06-14',-30)
                and dt>=date_add('2020-06-14',-recent_days+1)
            )t2
        )t3
    )t4
    group by recent_days,source,target;

    第二章 用户主题

    2.1 用户统计

    该需求为用户综合统计,其中包含若干指标,以下为对每个指标的解释说明。

    指标

    说明

    对应字段

    新增用户数

    统计新增注册用户人数

    new_user_count

    新增下单用户数

    统计新增下单用户人数

    new_order_user_count

    下单总金额

    统计所有订单总额

    order_final_amount

    下单用户数

    统计下单用户总数

    order_user_count

    未下单用户数

    统计活跃但未下单用户数

    no_order_user_count


    1.建表语句

    DROP TABLE IF EXISTS ads_user_total;
    CREATE EXTERNAL TABLE `ads_user_total` (
      `dt` STRING COMMENT '统计日期',
      `recent_days` BIGINT COMMENT '最近天数,0:累积值,1:最近1天,7:最近7天,30:最近30天',
      `new_user_count` BIGINT COMMENT '新注册用户数',
      `new_order_user_count` BIGINT COMMENT '新增下单用户数',
      `order_final_amount` DECIMAL(16,2) COMMENT '下单总金额',
      `order_user_count` BIGINT COMMENT '下单用户数',
      `no_order_user_count` BIGINT COMMENT '未下单用户数(具体指活跃用户中未下单用户)'
    ) COMMENT '用户统计'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_user_total/';

    2.数据装载

    insert overwrite table ads_user_total
    select * from ads_user_total
    union
    select
        '2020-06-14',
        recent_days,
        sum(if(login_date_first>=recent_days_ago,1,0)) new_user_count,
        sum(if(order_date_first>=recent_days_ago,1,0)) new_order_user_count,
        sum(order_final_amount) order_final_amount,
        sum(if(order_final_amount>0,1,0)) order_user_count,
        sum(if(login_date_last>=recent_days_ago and order_final_amount=0,1,0)) no_order_user_count
    from
    (
        select
            recent_days,
            user_id,
            login_date_first,
            login_date_last,
            order_date_first,
            case when recent_days=0 then order_final_amount
                 when recent_days=1 then order_last_1d_final_amount
                 when recent_days=7 then order_last_7d_final_amount
                 when recent_days=30 then order_last_30d_final_amount
            end order_final_amount,
            if(recent_days=0,'1970-01-01',date_add('2020-06-14',-recent_days+1)) recent_days_ago
        from dwt_user_topic lateral view explode(Array(0,1,7,30)) tmp as recent_days
        where dt='2020-06-14'
    )t1
    group by recent_days;

    2.2 用户变动统计

    该需求包括两个指标,分别为流失用户数和回流用户数,以下为对两个指标的解释说明。

    指标

    说明

    对应字段

    流失用户数

    之前活跃过的用户,最近一段时间未活跃,就称为流失用户。此处要求统计7日前(只包含7日前当天)活跃,但最近7日未活跃的用户总数。

    user_churn_count

    回流用户数

    之前的活跃用户,一段时间未活跃(流失),今日又活跃了,就称为回流用户。此处要求统计回流用户总数。

    new_order_user_count

    1.建表语句

    DROP TABLE IF EXISTS ads_user_change;
    CREATE EXTERNAL TABLE `ads_user_change` (
      `dt` STRING COMMENT '统计日期',
      `user_churn_count` BIGINT COMMENT '流失用户数',
      `user_back_count` BIGINT COMMENT '回流用户数'
    ) COMMENT '用户变动统计'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_user_change/';

    2.数据装载
    思路分析:
    流失用户:末次活跃时间为7日前的用户即为流失用户。
    回流用户:末次活跃时间为今日,上次活跃时间在8日前的用户即为回流用户。

    insert overwrite table ads_user_change
    select * from ads_user_change
    union
    select
        churn.dt,
        user_churn_count,
        user_back_count
    from
    (
        select
            '2020-06-14' dt,
            count(*) user_churn_count
        from dwt_user_topic
        where dt='2020-06-14'
        and login_date_last=date_add('2020-06-14',-7)
    )churn
    join
    (
        select
            '2020-06-14' dt,
            count(*) user_back_count
        from
        (
            select
                user_id,
                login_date_last
            from dwt_user_topic
            where dt='2020-06-14'
            and login_date_last='2020-06-14'
        )t1
        join
        (
            select
                user_id,
                login_date_last login_date_previous
            from dwt_user_topic
            where dt=date_add('2020-06-14',-1)
        )t2
        on t1.user_id=t2.user_id
        where datediff(login_date_last,login_date_previous)>=8
    )back
    on churn.dt=back.dt;

    2.3 用户行为漏斗分析

    漏斗分析是一个数据分析模型,它能够科学反映一个业务过程从起点到终点各阶段用户转化情况。由于其能将各阶段环节都展示出来,故哪个阶段存在问题,就能一目了然。

    该需求要求统计一个完整的购物流程各个阶段的人数。
    1.建表语句

    DROP TABLE IF EXISTS ads_user_action;
    CREATE EXTERNAL TABLE `ads_user_action` (
      `dt` STRING COMMENT '统计日期',
      `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
      `home_count` BIGINT COMMENT '浏览首页人数',
      `good_detail_count` BIGINT COMMENT '浏览商品详情页人数',
      `cart_count` BIGINT COMMENT '加入购物车人数',
      `order_count` BIGINT COMMENT '下单人数',
      `payment_count` BIGINT COMMENT '支付人数'
    ) COMMENT '漏斗分析'
    ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_user_action/';

    2.数据装载

    with
    tmp_page as
    (
        select
            '2020-06-14' dt,
            recent_days,
            sum(if(array_contains(pages,'home'),1,0)) home_count,
            sum(if(array_contains(pages,'good_detail'),1,0)) good_detail_count
        from
        (
            select
                recent_days,
                mid_id,
                collect_set(page_id) pages
            from
            (
                select
                    dt,
                    mid_id,
                    page.page_id
                from dws_visitor_action_daycount lateral view explode(page_stats) tmp as page
                where dt>=date_add('2020-06-14',-29)
                and page.page_id in('home','good_detail')
            )t1 lateral view explode(Array(1,7,30)) tmp as recent_days
            where dt>=date_add('2020-06-14',-recent_days+1)
            group by recent_days,mid_id
        )t2
        group by recent_days
    ),
    tmp_cop as
    (
        select
            '2020-06-14' dt,
            recent_days,
            sum(if(cart_count>0,1,0)) cart_count,
            sum(if(order_count>0,1,0)) order_count,
            sum(if(payment_count>0,1,0)) payment_count
        from
        (
            select
                recent_days,
                user_id,
                case
                    when recent_days=1 then cart_last_1d_count
                    when recent_days=7 then cart_last_7d_count
                    when recent_days=30 then cart_last_30d_count
                end cart_count,
                case
                    when recent_days=1 then order_last_1d_count
                    when recent_days=7 then order_last_7d_count
                    when recent_days=30 then order_last_30d_count
                end order_count,
                case
                    when recent_days=1 then payment_last_1d_count
                    when recent_days=7 then payment_last_7d_count
                    when recent_days=30 then payment_last_30d_count
                end payment_count
            from dwt_user_topic lateral view explode(Array(1,7,30)) tmp as recent_days
            where dt='2020-06-14'
        )t1
        group by recent_days
    )
    insert overwrite table ads_user_action
    select * from ads_user_action
    union
    select
        tmp_page.dt,
        tmp_page.recent_days,
        home_count,
        good_detail_count,
        cart_count,
        order_count,
        payment_count
    from tmp_page
    join tmp_cop
    on tmp_page.recent_days=tmp_cop.recent_days;

    2.4 用户留存率

    留存分析一般包含新增留存和活跃留存分析。

    新增留存分析是分析某天的新增用户中,有多少人有后续的活跃行为。活跃留存分析是分析某天的活跃用户中,有多少人有后续的活跃行为。

    留存分析是衡量产品对用户价值高低的重要指标。

    此处要求统计新增留存率,新增留存率具体是指留存用户数与新增用户数的比值,例如2020-06-14新增100个用户,1日之后(2020-06-15)这100人中有80个人活跃了,那2020-06-14的1日留存数则为80,2020-06-14的1日留存率则为80%。

    要求统计每天的1至7日留存率,如下图所示。

    1.建表语句

    DROP TABLE IF EXISTS ads_user_retention;
    CREATE EXTERNAL TABLE ads_user_retention (
      `dt` STRING COMMENT '统计日期',
      `create_date` STRING COMMENT '用户新增日期',
      `retention_day` BIGINT COMMENT '截至当前日期留存天数',
      `retention_count` BIGINT COMMENT '留存用户数量',
      `new_user_count` BIGINT COMMENT '新增用户数量',
      `retention_rate` DECIMAL(16,2) COMMENT '留存率'
    ) COMMENT '用户留存率'
    ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_user_retention/';

    2.数据装载

    insert overwrite table ads_user_retention
    select * from ads_user_retention
    union
    select
        '2020-06-14',
        login_date_first create_date,
        datediff('2020-06-14',login_date_first) retention_day,
        sum(if(login_date_last='2020-06-14',1,0)) retention_count,
        count(*) new_user_count,
        cast(sum(if(login_date_last='2020-06-14',1,0))/count(*)*100 as decimal(16,2)) retention_rate
    from dwt_user_topic
    where dt='2020-06-14'
    and login_date_first>=date_add('2020-06-14',-7)
    and login_date_first<'2020-06-14'
    group by login_date_first;

    第三章 商品主题

    3.1 商品统计

    该指标为商品综合统计,包含每个spu被下单总次数和被下单总金额。

    1.建表语句

    DROP TABLE IF EXISTS ads_order_spu_stats;
    CREATE EXTERNAL TABLE `ads_order_spu_stats` (
        `dt` STRING COMMENT '统计日期',
        `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
        `spu_id` STRING COMMENT '商品ID',
        `spu_name` STRING COMMENT '商品名称',
        `tm_id` STRING COMMENT '品牌ID',
        `tm_name` STRING COMMENT '品牌名称',
        `category3_id` STRING COMMENT '三级品类ID',
        `category3_name` STRING COMMENT '三级品类名称',
        `category2_id` STRING COMMENT '二级品类ID',
        `category2_name` STRING COMMENT '二级品类名称',
        `category1_id` STRING COMMENT '一级品类ID',
        `category1_name` STRING COMMENT '一级品类名称',
        `order_count` BIGINT COMMENT '订单数',
        `order_amount` DECIMAL(16,2) COMMENT '订单金额'
    ) COMMENT '商品销售统计'
    ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_order_spu_stats/';

    2.数据装载

    insert overwrite table ads_order_spu_stats
    select * from ads_order_spu_stats
    union
    select
        '2020-06-14' dt,
        recent_days,
        spu_id,
        spu_name,
        tm_id,
        tm_name,
        category3_id,
        category3_name,
        category2_id,
        category2_name,
        category1_id,
        category1_name,
        sum(order_count),
        sum(order_amount)
    from
    (
        select
            recent_days,
            sku_id,
            case
                when recent_days=1 then order_last_1d_count
                when recent_days=7 then order_last_7d_count
                when recent_days=30 then order_last_30d_count
            end order_count,
            case
                when recent_days=1 then order_last_1d_final_amount
                when recent_days=7 then order_last_7d_final_amount
                when recent_days=30 then order_last_30d_final_amount
            end order_amount
        from dwt_sku_topic lateral view explode(Array(1,7,30)) tmp as recent_days
        where dt='2020-06-14'
    )t1
    left join
    (
        select
            id,
            spu_id,
            spu_name,
            tm_id,
            tm_name,
            category3_id,
            category3_name,
            category2_id,
            category2_name,
            category1_id,
            category1_name
        from dim_sku_info
        where dt='2020-06-14'
    )t2
    on t1.sku_id=t2.id
    group by recent_days,spu_id,spu_name,tm_id,tm_name,category3_id,category3_name,category2_id,category2_name,category1_id,category1_name;

    3.2 品牌复购率

    品牌复购率是指一段时间内重复购买某品牌的人数与购买过该品牌的人数的比值。重复购买即购买次数大于等于2,购买过即购买次数大于1。

    此处要求统计最近1,7,30天的各品牌复购率。

    1.建表语句

    DROP TABLE IF EXISTS ads_repeat_purchase;
    CREATE EXTERNAL TABLE `ads_repeat_purchase` (
      `dt` STRING COMMENT '统计日期',
      `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
      `tm_id` STRING COMMENT '品牌ID',
      `tm_name` STRING COMMENT '品牌名称',
      `order_repeat_rate` DECIMAL(16,2) COMMENT '复购率'
    ) COMMENT '品牌复购率'
    ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_repeat_purchase/';

    2.数据装载
    思路分析:该需求可分两步实现:
    第一步:统计每个用户购买每个品牌的次数。
    第二步:分别统计购买次数大于1的人数和大于2的人数。

    insert overwrite table ads_repeat_purchase
    select * from ads_repeat_purchase
    union
    select
        '2020-06-14' dt,
        recent_days,
        tm_id,
        tm_name,
        cast(sum(if(order_count>=2,1,0))/sum(if(order_count>=1,1,0))*100 as decimal(16,2))
    from
    (
        select
            recent_days,
            user_id,
            tm_id,
            tm_name,
            sum(order_count) order_count
        from
        (
            select
                recent_days,
                user_id,
                sku_id,
                count(*) order_count
            from dwd_order_detail lateral view explode(Array(1,7,30)) tmp as recent_days
            where dt>=date_add('2020-06-14',-29)
            and dt>=date_add('2020-06-14',-recent_days+1)
            group by recent_days, user_id,sku_id
        )t1
        left join
        (
            select
                id,
                tm_id,
                tm_name
            from dim_sku_info
            where dt='2020-06-14'
        )t2
        on t1.sku_id=t2.id
        group by recent_days,user_id,tm_id,tm_name
    )t3
    group by recent_days,tm_id,tm_name;

    第四章 订单主题

    4.1 订单统计

    该需求包含订单总数,订单总金额和下单总人数。
    1.建表语句

    DROP TABLE IF EXISTS ads_order_total;
    CREATE EXTERNAL TABLE `ads_order_total` (
      `dt` STRING COMMENT '统计日期',
      `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
      `order_count` BIGINT COMMENT '订单数',
      `order_amount` DECIMAL(16,2) COMMENT '订单金额',
      `order_user_count` BIGINT COMMENT '下单人数'
    ) COMMENT '订单统计'
    ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_order_total/';

    2.数据装载

    insert overwrite table ads_order_total
    select * from ads_order_total
    union
    select
        '2020-06-14',
        recent_days,
        sum(order_count),
        sum(order_final_amount) order_final_amount,
        sum(if(order_final_amount>0,1,0)) order_user_count
    from
    (
        select
            recent_days,
            user_id,
            case when recent_days=0 then order_count
                 when recent_days=1 then order_last_1d_count
                 when recent_days=7 then order_last_7d_count
                 when recent_days=30 then order_last_30d_count
            end order_count,
            case when recent_days=0 then order_final_amount
                 when recent_days=1 then order_last_1d_final_amount
                 when recent_days=7 then order_last_7d_final_amount
                 when recent_days=30 then order_last_30d_final_amount
            end order_final_amount
        from dwt_user_topic lateral view explode(Array(1,7,30)) tmp as recent_days
        where dt='2020-06-14'
    )t1
    group by recent_days;

    4.2 各地区订单统计

    该需求包含各省份订单总数和订单总金额。
    1.建表语句

    DROP TABLE IF EXISTS ads_order_by_province;
    CREATE EXTERNAL TABLE `ads_order_by_province` (
      `dt` STRING COMMENT '统计日期',
      `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
      `province_id` STRING COMMENT '省份ID',
      `province_name` STRING COMMENT '省份名称',
      `area_code` STRING COMMENT '地区编码',
      `iso_code` STRING COMMENT '国际标准地区编码',
      `iso_code_3166_2` STRING COMMENT '国际标准地区编码',
      `order_count` BIGINT COMMENT '订单数',
      `order_amount` DECIMAL(16,2) COMMENT '订单金额'
    ) COMMENT '各地区订单统计'
    ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_order_by_province/';

    2.数据装载

    insert overwrite table ads_order_by_province
    select * from ads_order_by_province
    union
    select
        dt,
        recent_days,
        province_id,
        province_name,
        area_code,
        iso_code,
        iso_3166_2,
        order_count,
        order_amount
    from
    (
        select
            '2020-06-14' dt,
            recent_days,
            province_id,
            sum(order_count) order_count,
            sum(order_amount) order_amount
        from
        (
            select
                recent_days,
                province_id,
                case
                    when recent_days=1 then order_last_1d_count
                    when recent_days=7 then order_last_7d_count
                    when recent_days=30 then order_last_30d_count
                end order_count,
                case
                    when recent_days=1 then order_last_1d_final_amount
                    when recent_days=7 then order_last_7d_final_amount
                    when recent_days=30 then order_last_30d_final_amount
                end order_amount
            from dwt_area_topic lateral view explode(Array(1,7,30)) tmp as recent_days
            where dt='2020-06-14'
        )t1
        group by recent_days,province_id
    )t2
    join dim_base_province t3
    on t2.province_id=t3.id;

    第五章 优惠券主题

    5.1 优惠券统计

    该需求要求统计最近30日发布的所有优惠券的领用情况和补贴率,补贴率是指,优惠金额与使用优惠券的订单的原价金额的比值。
    1.建表语句

    DROP TABLE IF EXISTS ads_coupon_stats;
    CREATE EXTERNAL TABLE ads_coupon_stats (
      `dt` STRING COMMENT '统计日期',
      `coupon_id` STRING COMMENT '优惠券ID',
      `coupon_name` STRING COMMENT '优惠券名称',
      `start_date` STRING COMMENT '发布日期',
      `rule_name` STRING COMMENT '优惠规则,例如满100元减10元',
      `get_count`  BIGINT COMMENT '领取次数',
      `order_count` BIGINT COMMENT '使用(下单)次数',
      `expire_count`  BIGINT COMMENT '过期次数',
      `order_original_amount` DECIMAL(16,2) COMMENT '使用优惠券订单原始金额',
      `order_final_amount` DECIMAL(16,2) COMMENT '使用优惠券订单最终金额',
      `reduce_amount` DECIMAL(16,2) COMMENT '优惠金额',
      `reduce_rate` DECIMAL(16,2) COMMENT '补贴率'
    ) COMMENT '商品销售统计'
    ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_coupon_stats/';

    2.数据装载

    insert overwrite table ads_coupon_stats
    select * from ads_coupon_stats
    union
    select
        '2020-06-14' dt,
        t1.id,
        coupon_name,
        start_date,
        rule_name,
        get_count,
        order_count,
        expire_count,
        order_original_amount,
        order_final_amount,
        reduce_amount,
        reduce_rate
    from
    (
        select
            id,
            coupon_name,
            date_format(start_time,'yyyy-MM-dd') start_date,
            case
                when coupon_type='3201' then concat('',condition_amount,'元减',benefit_amount,'')
                when coupon_type='3202' then concat('',condition_num,'件打', (1-benefit_discount)*10,'')
                when coupon_type='3203' then concat('',benefit_amount,'')
            end rule_name
        from dim_coupon_info
        where dt='2020-06-14'
        and date_format(start_time,'yyyy-MM-dd')>=date_add('2020-06-14',-29)
    )t1
    left join
    (
        select
            coupon_id,
            get_count,
            order_count,
            expire_count,
            order_original_amount,
            order_final_amount,
            order_reduce_amount reduce_amount,
            cast(order_reduce_amount/order_original_amount as decimal(16,2)) reduce_rate
        from dwt_coupon_topic
        where dt='2020-06-14'
    )t2
    on t1.id=t2.coupon_id;

    第六章 活动主题

    6.1 活动统计

    该需求要求统计最近30日发布的所有活动的参与情况和补贴率,补贴率是指,优惠金额与参与活动的订单原价金额的比值。
    1.建表语句

    DROP TABLE IF EXISTS ads_activity_stats;
    CREATE EXTERNAL TABLE `ads_activity_stats` (
      `dt` STRING COMMENT '统计日期',
      `activity_id` STRING COMMENT '活动ID',
      `activity_name` STRING COMMENT '活动名称',
      `start_date` STRING COMMENT '活动开始日期',
      `order_count` BIGINT COMMENT '参与活动订单数',
      `order_original_amount` DECIMAL(16,2) COMMENT '参与活动订单原始金额',
      `order_final_amount` DECIMAL(16,2) COMMENT '参与活动订单最终金额',
      `reduce_amount` DECIMAL(16,2) COMMENT '优惠金额',
      `reduce_rate` DECIMAL(16,2) COMMENT '补贴率'
    ) COMMENT '商品销售统计'
    ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/gmall/ads/ads_activity_stats/';

    2.数据装载

    insert overwrite table ads_activity_stats
    select * from ads_activity_stats
    union
    select
        '2020-06-14' dt,
        t4.activity_id,
        activity_name,
        start_date,
        order_count,
        order_original_amount,
        order_final_amount,
        reduce_amount,
        reduce_rate
    from
    (
        select
            activity_id,
            activity_name,
            date_format(start_time,'yyyy-MM-dd') start_date
        from dim_activity_rule_info
        where dt='2020-06-14'
        and date_format(start_time,'yyyy-MM-dd')>=date_add('2020-06-14',-29)
        group by activity_id,activity_name,start_time
    )t4
    left join
    (
        select
            activity_id,
            sum(order_count) order_count,
            sum(order_original_amount) order_original_amount,
            sum(order_final_amount) order_final_amount,
            sum(order_reduce_amount) reduce_amount,
            cast(sum(order_reduce_amount)/sum(order_original_amount)*100 as decimal(16,2)) reduce_rate
        from dwt_activity_topic
        where dt='2020-06-14'
        group by activity_id
    )t5
    on t4.activity_id=t5.activity_id;

    第七章 ADS层业务数据导入脚本

    1)编写脚本
    (1)在/home/atguigu/bin目录下创建脚本dwt_to_ads.sh

    [atguigu@hadoop102 bin]$ vim dwt_to_ads.sh

    在脚本中填写如下内容

    #!/bin/bash
    
    APP=gmall
    
    # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
    if [ -n "$2" ] ;then
        do_date=$2
    else
        do_date=`date -d "-1 day" +%F`
    fi
    
    ads_activity_stats="
    insert overwrite table ${APP}.ads_activity_stats
    select * from ${APP}.ads_activity_stats
    union
    select
        '$do_date' dt,
        t4.activity_id,
        activity_name,
        start_date,
        order_count,
        order_original_amount,
        order_final_amount,
        reduce_amount,
        reduce_rate
    from
    (
        select
            activity_id,
            activity_name,
            date_format(start_time,'yyyy-MM-dd') start_date
        from ${APP}.dim_activity_rule_info
        where dt='$do_date'
        and date_format(start_time,'yyyy-MM-dd')>=date_add('$do_date',-29)
        group by activity_id,activity_name,start_time
    )t4
    left join
    (
        select
            activity_id,
            sum(order_count) order_count,
            sum(order_original_amount) order_original_amount,
            sum(order_final_amount) order_final_amount,
            sum(order_reduce_amount) reduce_amount,
            cast(sum(order_reduce_amount)/sum(order_original_amount)*100 as decimal(16,2)) reduce_rate
        from ${APP}.dwt_activity_topic
        where dt='$do_date'
        group by activity_id
    )t5
    on t4.activity_id=t5.activity_id;
    "
    ads_coupon_stats="
    insert overwrite table ${APP}.ads_coupon_stats
    select * from ${APP}.ads_coupon_stats
    union
    select
        '$do_date' dt,
        t1.id,
        coupon_name,
        start_date,
        rule_name,
        get_count,
        order_count,
        expire_count,
        order_original_amount,
        order_final_amount,
        reduce_amount,
        reduce_rate
    from
    (
        select
            id,
            coupon_name,
            date_format(start_time,'yyyy-MM-dd') start_date,
            case
                when coupon_type='3201' then concat('',condition_amount,'元减',benefit_amount,'')
                when coupon_type='3202' then concat('',condition_num,'件打', (1-benefit_discount)*10,'')
                when coupon_type='3203' then concat('',benefit_amount,'')
            end rule_name
        from ${APP}.dim_coupon_info
        where dt='$do_date'
        and date_format(start_time,'yyyy-MM-dd')>=date_add('$do_date',-29)
    )t1
    left join
    (
        select
            coupon_id,
            get_count,
            order_count,
            expire_count,
            order_original_amount,
            order_final_amount,
            order_reduce_amount reduce_amount,
            cast(order_reduce_amount/order_original_amount as decimal(16,2)) reduce_rate
        from ${APP}.dwt_coupon_topic
        where dt='$do_date'
    )t2
    on t1.id=t2.coupon_id;
    "
    
    ads_order_by_province="
    insert overwrite table ${APP}.ads_order_by_province
    select * from ${APP}.ads_order_by_province
    union
    select
        dt,
        recent_days,
        province_id,
        province_name,
        area_code,
        iso_code,
        iso_3166_2,
        order_count,
        order_amount
    from
    (
        select
            '$do_date' dt,
            recent_days,
            province_id,
            sum(order_count) order_count,
            sum(order_amount) order_amount
        from
        (
            select
                recent_days,
                province_id,
                case
                    when recent_days=1 then order_last_1d_count
                    when recent_days=7 then order_last_7d_count
                    when recent_days=30 then order_last_30d_count
                end order_count,
                case
                    when recent_days=1 then order_last_1d_final_amount
                    when recent_days=7 then order_last_7d_final_amount
                    when recent_days=30 then order_last_30d_final_amount
                end order_amount
            from ${APP}.dwt_area_topic lateral view explode(Array(1,7,30)) tmp as recent_days
            where dt='$do_date'
        )t1
        group by recent_days,province_id
    )t2
    join ${APP}.dim_base_province t3
    on t2.province_id=t3.id;
    "
    
    ads_order_spu_stats="
    insert overwrite table ${APP}.ads_order_spu_stats
    select * from ${APP}.ads_order_spu_stats
    union
    select
        '$do_date' dt,
        recent_days,
        spu_id,
        spu_name,
        tm_id,
        tm_name,
        category3_id,
        category3_name,
        category2_id,
        category2_name,
        category1_id,
        category1_name,
        sum(order_count),
        sum(order_amount)
    from
    (
        select
            recent_days,
            sku_id,
            case
                when recent_days=1 then order_last_1d_count
                when recent_days=7 then order_last_7d_count
                when recent_days=30 then order_last_30d_count
            end order_count,
            case
                when recent_days=1 then order_last_1d_final_amount
                when recent_days=7 then order_last_7d_final_amount
                when recent_days=30 then order_last_30d_final_amount
            end order_amount
        from ${APP}.dwt_sku_topic lateral view explode(Array(1,7,30)) tmp as recent_days
        where dt='$do_date'
    )t1
    left join
    (
        select
            id,
            spu_id,
            spu_name,
            tm_id,
            tm_name,
            category3_id,
            category3_name,
            category2_id,
            category2_name,
            category1_id,
            category1_name
        from ${APP}.dim_sku_info
        where dt='$do_date'
    )t2
    on t1.sku_id=t2.id
    group by recent_days,spu_id,spu_name,tm_id,tm_name,category3_id,category3_name,category2_id,category2_name,category1_id,category1_name;
    "
    
    ads_order_total="
    insert overwrite table ${APP}.ads_order_total
    select * from ${APP}.ads_order_total
    union
    select
        '$do_date',
        recent_days,
        sum(order_count),
        sum(order_final_amount) order_final_amount,
        sum(if(order_final_amount>0,1,0)) order_user_count
    from
    (
        select
            recent_days,
            user_id,
            case when recent_days=0 then order_count
                 when recent_days=1 then order_last_1d_count
                 when recent_days=7 then order_last_7d_count
                 when recent_days=30 then order_last_30d_count
            end order_count,
            case when recent_days=0 then order_final_amount
                 when recent_days=1 then order_last_1d_final_amount
                 when recent_days=7 then order_last_7d_final_amount
                 when recent_days=30 then order_last_30d_final_amount
            end order_final_amount
        from ${APP}.dwt_user_topic lateral view explode(Array(1,7,30)) tmp as recent_days
        where dt='$do_date'
    )t1
    group by recent_days;
    "
    
    ads_page_path="
    insert overwrite table ${APP}.ads_page_path
    select * from ${APP}.ads_page_path
    union
    select
        '$do_date',
        recent_days,
        source,
        target,
        count(*)
    from
    (
        select
            recent_days,
            concat('step-',step,':',source) source,
            concat('step-',step+1,':',target) target
        from
        (
            select
                recent_days,
                page_id source,
                lead(page_id,1,null) over (partition by recent_days,session_id order by ts) target,
                row_number() over (partition by recent_days,session_id order by ts) step
            from
            (
                select
                    recent_days,
                    last_page_id,
                    page_id,
                    ts,
                    concat(mid_id,'-',last_value(if(last_page_id is null,ts,null),true) over (partition by mid_id,recent_days order by ts)) session_id
                from ${APP}.dwd_page_log lateral view explode(Array(1,7,30)) tmp as recent_days
                where dt>=date_add('$do_date',-30)
                and dt>=date_add('$do_date',-recent_days+1)
            )t2
        )t3
    )t4
    group by recent_days,source,target;
    "
    
    ads_repeat_purchase="
    insert overwrite table ${APP}.ads_repeat_purchase
    select * from ${APP}.ads_repeat_purchase
    union
    select
        '$do_date' dt,
        recent_days,
        tm_id,
        tm_name,
        cast(sum(if(order_count>=2,1,0))/sum(if(order_count>=1,1,0))*100 as decimal(16,2))
    from
    (
        select
            recent_days,
            user_id,
            tm_id,
            tm_name,
            sum(order_count) order_count
        from
        (
            select
                recent_days,
                user_id,
                sku_id,
                count(*) order_count
            from ${APP}.dwd_order_detail lateral view explode(Array(1,7,30)) tmp as recent_days
            where dt>=date_add('$do_date',-29)
            and dt>=date_add('$do_date',-recent_days+1)
            group by recent_days, user_id,sku_id
        )t1
        left join
        (
            select
                id,
                tm_id,
                tm_name
            from ${APP}.dim_sku_info
            where dt='$do_date'
        )t2
        on t1.sku_id=t2.id
        group by recent_days,user_id,tm_id,tm_name
    )t3
    group by recent_days,tm_id,tm_name;
    "
    
    ads_user_action="
    with
    tmp_page as
    (
        select
            '$do_date' dt,
            recent_days,
            sum(if(array_contains(pages,'home'),1,0)) home_count,
            sum(if(array_contains(pages,'good_detail'),1,0)) good_detail_count
        from
        (
            select
                recent_days,
                mid_id,
                collect_set(page_id) pages
            from
            (
                select
                    dt,
                    mid_id,
                    page.page_id
                from ${APP}.dws_visitor_action_daycount lateral view explode(page_stats) tmp as page
                where dt>=date_add('$do_date',-29)
                and page.page_id in('home','good_detail')
            )t1 lateral view explode(Array(1,7,30)) tmp as recent_days
            where dt>=date_add('$do_date',-recent_days+1)
            group by recent_days,mid_id
        )t2
        group by recent_days
    ),
    tmp_cop as
    (
        select
            '$do_date' dt,
            recent_days,
            sum(if(cart_count>0,1,0)) cart_count,
            sum(if(order_count>0,1,0)) order_count,
            sum(if(payment_count>0,1,0)) payment_count
        from
        (
            select
                recent_days,
                user_id,
                case
                    when recent_days=1 then cart_last_1d_count
                    when recent_days=7 then cart_last_7d_count
                    when recent_days=30 then cart_last_30d_count
                end cart_count,
                case
                    when recent_days=1 then order_last_1d_count
                    when recent_days=7 then order_last_7d_count
                    when recent_days=30 then order_last_30d_count
                end order_count,
                case
                    when recent_days=1 then payment_last_1d_count
                    when recent_days=7 then payment_last_7d_count
                    when recent_days=30 then payment_last_30d_count
                end payment_count
            from ${APP}.dwt_user_topic lateral view explode(Array(1,7,30)) tmp as recent_days
            where dt='$do_date'
        )t1
        group by recent_days
    )
    insert overwrite table ${APP}.ads_user_action
    select * from ${APP}.ads_user_action
    union
    select
        tmp_page.dt,
        tmp_page.recent_days,
        home_count,
        good_detail_count,
        cart_count,
        order_count,
        payment_count
    from tmp_page
    join tmp_cop
    on tmp_page.recent_days=tmp_cop.recent_days;
    "
    
    ads_user_change="
    insert overwrite table ${APP}.ads_user_change
    select * from ${APP}.ads_user_change
    union
    select
        churn.dt,
        user_churn_count,
        user_back_count
    from
    (
        select
            '$do_date' dt,
            count(*) user_churn_count
        from ${APP}.dwt_user_topic
        where dt='$do_date'
        and login_date_last=date_add('$do_date',-7)
    )churn
    join
    (
        select
            '$do_date' dt,
            count(*) user_back_count
        from
        (
            select
                user_id,
                login_date_last
            from ${APP}.dwt_user_topic
            where dt='$do_date'
            and login_date_last='$do_date'
        )t1
        join
        (
            select
                user_id,
                login_date_last login_date_previous
            from ${APP}.dwt_user_topic
            where dt=date_add('$do_date',-1)
        )t2
        on t1.user_id=t2.user_id
        where datediff(login_date_last,login_date_previous)>=8
    )back
    on churn.dt=back.dt;
    "
    
    ads_user_retention="
    insert overwrite table ${APP}.ads_user_retention
    select * from ${APP}.ads_user_retention
    union
    select
        '$do_date',
        login_date_first create_date,
        datediff('$do_date',login_date_first) retention_day,
        sum(if(login_date_last='$do_date',1,0)) retention_count,
        count(*) new_user_count,
        cast(sum(if(login_date_last='$do_date',1,0))/count(*)*100 as decimal(16,2)) retention_rate
    from ${APP}.dwt_user_topic
    where dt='$do_date'
    and login_date_first>=date_add('$do_date',-7)
    and login_date_first<'$do_date'
    group by login_date_first;
    "
    
    ads_user_total="
    insert overwrite table ${APP}.ads_user_total
    select * from ${APP}.ads_user_total
    union
    select
        '$do_date',
        recent_days,
        sum(if(login_date_first>=recent_days_ago,1,0)) new_user_count,
        sum(if(order_date_first>=recent_days_ago,1,0)) new_order_user_count,
        sum(order_final_amount) order_final_amount,
        sum(if(order_final_amount>0,1,0)) order_user_count,
        sum(if(login_date_last>=recent_days_ago and order_final_amount=0,1,0)) no_order_user_count
    from
    (
        select
            recent_days,
            user_id,
            login_date_first,
            login_date_last,
            order_date_first,
            case when recent_days=0 then order_final_amount
                 when recent_days=1 then order_last_1d_final_amount
                 when recent_days=7 then order_last_7d_final_amount
                 when recent_days=30 then order_last_30d_final_amount
            end order_final_amount,
            if(recent_days=0,'1970-01-01',date_add('$do_date',-recent_days+1)) recent_days_ago
        from ${APP}.dwt_user_topic lateral view explode(Array(0,1,7,30)) tmp as recent_days
        where dt='$do_date'
    )t1
    group by recent_days;
    "
    
    ads_visit_stats="
    insert overwrite table ${APP}.ads_visit_stats
    select * from ${APP}.ads_visit_stats
    union
    select
        '$do_date' dt,
        is_new,
        recent_days,
        channel,
        count(distinct(mid_id)) uv_count,
        cast(sum(duration)/1000 as bigint) duration_sec,
        cast(avg(duration)/1000 as bigint) avg_duration_sec,
        sum(page_count) page_count,
        cast(avg(page_count) as bigint) avg_page_count,
        count(*) sv_count,
        sum(if(page_count=1,1,0)) bounce_count,
        cast(sum(if(page_count=1,1,0))/count(*)*100 as decimal(16,2)) bounce_rate
    from
    (
        select
            session_id,
            mid_id,
            is_new,
            recent_days,
            channel,
            count(*) page_count,
            sum(during_time) duration
        from
        (
            select
                mid_id,
                channel,
                recent_days,
                is_new,
                last_page_id,
                page_id,
                during_time,
                concat(mid_id,'-',last_value(if(last_page_id is null,ts,null),true) over (partition by recent_days,mid_id order by ts)) session_id
            from
            (
                select
                    mid_id,
                    channel,
                    last_page_id,
                    page_id,
                    during_time,
                    ts,
                    recent_days,
                    if(visit_date_first>=date_add('$do_date',-recent_days+1),'1','0') is_new
                from
                (
                    select
                        t1.mid_id,
                        t1.channel,
                        t1.last_page_id,
                        t1.page_id,
                        t1.during_time,
                        t1.dt,
                        t1.ts,
                        t2.visit_date_first
                    from
                    (
                        select
                            mid_id,
                            channel,
                            last_page_id,
                            page_id,
                            during_time,
                            dt,
                            ts
                        from ${APP}.dwd_page_log
                        where dt>=date_add('$do_date',-30)
                    )t1
                    left join
                    (
                        select
                            mid_id,
                            visit_date_first
                        from ${APP}.dwt_visitor_topic
                        where dt='$do_date'
                    )t2
                    on t1.mid_id=t2.mid_id
                )t3 lateral view explode(Array(1,7,30)) tmp as recent_days
                where dt>=date_add('$do_date',-recent_days+1)
            )t4
        )t5
        group by session_id,mid_id,is_new,recent_days,channel
    )t6
    group by is_new,recent_days,channel;
    "
    
    case $1 in
        "ads_activity_stats" )
            hive -e "$ads_activity_stats"
        ;;
        "ads_coupon_stats" )
            hive -e "$ads_coupon_stats"
        ;;
        "ads_order_by_province" )
            hive -e "$ads_order_by_province"
        ;;
        "ads_order_spu_stats" )
            hive -e "$ads_order_spu_stats"
        ;;
        "ads_order_total" )
            hive -e "$ads_order_total"
        ;;
        "ads_page_path" )
            hive -e "$ads_page_path"
        ;;
        "ads_repeat_purchase" )
            hive -e "$ads_repeat_purchase"
        ;;
        "ads_user_action" )
            hive -e "$ads_user_action"
        ;;
        "ads_user_change" )
            hive -e "$ads_user_change"
        ;;
        "ads_user_retention" )
            hive -e "$ads_user_retention"
        ;;
        "ads_user_total" )
            hive -e "$ads_user_total"
        ;;
        "ads_visit_stats" )
            hive -e "$ads_visit_stats"
        ;;
        "all" )
            hive -e "$ads_activity_stats$ads_coupon_stats$ads_order_by_province$ads_order_spu_stats$ads_order_total$ads_page_path$ads_repeat_purchase$ads_user_action$ads_user_change$ads_user_retention$ads_user_total$ads_visit_stats"
        ;;
    esac

    (2)增加脚本执行权限

    [atguigu@hadoop102 bin]$ chmod 777 dwt_to_ads.sh

    2)脚本使用
    (1)执行脚本

    [atguigu@hadoop102 bin]$ dwt_to_ads.sh all 2020-06-14                                                           

    (2)查看数据是否导入

  • 相关阅读:
    super与this的区别?
    springboot(4)-thymeleaf
    springboot(3)-自定义josn
    springboot(2)-Http协议接口开发
    springboot(1)-HelloWorld
    SpringSecurityOauth2.0
    Docker 应用部署
    RabbitMQ02
    RabbitMQ01
    011通用寄存器
  • 原文地址:https://www.cnblogs.com/wkfvawl/p/15907996.html
Copyright © 2020-2023  润新知