Azkaban Scheduling
Copyright notice: this is original work; please do not repost, or legal action may be taken. These are personal notes and demos from study and work, to be supplemented and improved over time. Author: Steven (Ding Zhenchun)
Six emphases of learning: goals, ideas, methods, practice, habits, and review.
1. Overview
Azkaban is a scheduling system commonly used to schedule big-data jobs. It consists of two programs: web and executor. The web server provides the UI and user interaction; the executor server performs the scheduling and submits jobs for execution.
2. Installation
https://github.com/azkaban/azkaban/commit/16b9c637cb1ba98932da7e1f69b2f93e7882b723
3. Startup
3.1 Start the web server
$>/soft/azkaban/web/bin/azkaban-web-start.sh
3.2 Start the executor server
$>/soft/azkaban-exec/bin/azkaban-executor-start.sh
4. Log in to the web UI
http://s101:8081
5. Login page
6. Create a project
6.1 Click the Create Project button
6.2 Enter the project information
6.3 View the created project
7. Create job files and their dependencies
7.1 Write the job files
Job files use the .job extension. Any shell script referenced by a job must exist on the executor node, because at run time the executor invokes the script locally on that machine.
Note: some invisible characters produced by Windows editors are treated as illegal by the Linux interpreter. Scripts written on Windows often carry CRLF line endings, which show up in Linux as "^M" and cannot be interpreted; if you edit scripts on Windows, use an editor that saves Unix (LF) line endings.
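One reliable fix is to strip the carriage returns on the Linux side (dos2unix, if installed, does the same thing); a minimal sketch with sed:

```shell
# create a script with Windows CRLF line endings, then strip the \r with sed
printf '#!/bin/bash\r\necho hello\r\n' > demo_crlf.sh
sed -i 's/\r$//' demo_crlf.sh
# the script now runs cleanly under bash
bash demo_crlf.sh    # prints "hello"
```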
7.1.1 Load HDFS data into the raw Hive log table
-
Create a SQL script that loads the HDFS data into the raw Hive table, using Hive configuration variables in the script. Mind the partition clause:
[load_data_to_hive_raw_logs.sql]
use big12_umeng;
load data inpath 'hdfs://mycluster/user/centos/big12_umeng/raw-logs/${hiveconf:ym}/${hiveconf:day}/${hiveconf:hm}'
into table raw_logs partition(ym=${hiveconf:ym}, day=${hiveconf:day});
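Hive substitutes the ${hiveconf:...} references with the values passed on the command line via -hiveconf before running the statements. The effect can be mimicked with sed (an illustration only, not Hive's actual parser):

```shell
# simulate ${hiveconf:ym}/${hiveconf:day} substitution for ym=201809, day=12
ym=201809
day=12
echo 'partition(ym=${hiveconf:ym},day=${hiveconf:day})' \
  | sed -e "s/\${hiveconf:ym}/$ym/g" -e "s/\${hiveconf:day}/$day/g"
# prints: partition(ym=201809,day=12)
```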
-
Write a shell script that invokes the SQL script above:
[load_data_to_hive_raw_logs.sh]
#!/bin/bash
cd /home/centos/big12_umeng
if [[ $# = 0 ]] ; then
  # no arguments: default to yesterday, formatted as yyyyMM-dd
  time=`date -d "-1 days" "+%Y%m-%d"`
else
  time=$1$2-$3
fi
# external time variable: persist the time for downstream jobs
echo -n $time > _time
ym=`echo $time | awk -F '-' '{print $1}'`
day=`echo $time | awk -F '-' '{print $2}'`
hive -hiveconf ym=${ym} -hiveconf day=${day} -f load_data_to_hive_raw_logs.sql
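When arguments are supplied, the script expects the year, month, and day separately and builds the time string as $1$2-$3; the awk splitting then recovers the partition fields:

```shell
# with explicit arguments the script builds time as $1$2-$3;
# e.g. "sh load_data_to_hive_raw_logs.sh 2018 09 12" yields time=201809-12
y=2018; m=09; d=12
time=$y$m-$d
ym=`echo $time | awk -F '-' '{print $1}'`    # partition year-month: 201809
day=`echo $time | awk -F '-' '{print $2}'`   # partition day: 12
echo "ym=$ym day=$day"                       # prints "ym=201809 day=12"
```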
-
Write the Azkaban job file:
[1_load_data_to_hive_raw_logs.job]
type=command
command=sh /home/centos/big12_umeng/load_data_to_hive_raw_logs.sh
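A .job file is an ordinary key=value properties file; the sketch below only demonstrates the format by writing a sample file and reading its keys back (Azkaban itself parses these files when the project zip is uploaded):

```shell
# write a sample job file, then read its key=value pairs back (format demo only)
cat > demo.job <<'EOF'
type=command
command=sh /home/centos/big12_umeng/load_data_to_hive_raw_logs.sh
EOF
while IFS='=' read -r key value; do
  echo "key: $key | value: $value"
done < demo.job
```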
7.1.2 Fork the raw Hive log table into five log sub-tables
-
Write the data-cleaning SQL script:
[data_clean_startuplog.sql]
-- appstartuplog
use big12_umeng;
insert into appstartuplogs partition(ym, day)
select
  t.appChannel, t.appId, t.appPlatform, t.appVersion,
  t.brand, t.carrier, t.country, t.createdAtMs,
  t.deviceId, t.deviceStyle, t.ipAddress, t.network,
  t.osType, t.province, t.screenSize, t.tenantId,
  formatbyday(t.createdatms, 0, 'yyyyMM'),
  formatbyday(t.createdatms, 0, 'dd')
from (
  select forkstartuplogs(servertimestr, clienttimems, clientip, json)
  from raw_logs
  where concat(ym, day) = '${ymd}'
) t;
-
Write a shell script that invokes the SQL script above:
[data_clean.sh]
#!/bin/bash
cd /home/centos/big12_umeng
spark-submit --master yarn \
  --jars /home/centos/big12_umeng/umeng_hive.jar \
  --class com.oldboy.umeng.spark.stat.DataClean \
  /home/centos/big12_umeng/umeng_spark.jar \
  $1 `cat _time | awk -F '-' '{print $1$2}'`
The event-log, error-log, and remaining sub-tables are cleaned the same way.
-
Write the Azkaban job file:
[2_1_data_clean_startuplog.job]
type=command
command=sh /home/centos/big12_umeng/data_clean.sh /home/centos/big12_umeng/data_clean_startuplog.sql
dependencies=1_load_data_to_hive_raw_logs
-
Write the remaining four job files in the same pattern:
[2_2_data_clean_eventlog.job]
type=command
command=sh /home/centos/big12_umeng/data_clean.sh /home/centos/big12_umeng/data_clean_eventlog.sql
dependencies=1_load_data_to_hive_raw_logs
[2_3_data_clean_errorlog.job]
type=command
command=sh /home/centos/big12_umeng/data_clean.sh /home/centos/big12_umeng/data_clean_errorlog.sql
dependencies=1_load_data_to_hive_raw_logs
[2_4_data_clean_pagelog.job]
type=command
command=sh /home/centos/big12_umeng/data_clean.sh /home/centos/big12_umeng/data_clean_pagelog.sql
dependencies=1_load_data_to_hive_raw_logs
[2_5_data_clean_usagelog.job]
type=command
command=sh /home/centos/big12_umeng/data_clean.sh /home/centos/big12_umeng/data_clean_usagelog.sql
dependencies=1_load_data_to_hive_raw_logs
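Azkaban builds the flow DAG from these dependencies keys; a throwaway shell sketch (not part of Azkaban) that prints the edges declared in a directory of .job files:

```shell
# create two toy job files, then print "dependency -> job" edges from them
mkdir -p demo_jobs && cd demo_jobs
printf 'type=command\ncommand=echo load\n' > 1_load.job
printf 'type=command\ncommand=echo clean\ndependencies=1_load\n' > 2_clean.job
for f in *.job; do
  # a job may list several comma-separated dependencies
  deps=$(grep '^dependencies=' "$f" | cut -d= -f2 | tr ',' ' ')
  for d in $deps; do
    echo "$d -> ${f%.job}"
  done
done
# prints: 1_load -> 2_clean
```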
7.1.3 Run the statistical analysis jobs
7.1.3.1 Active-user statistics
-
Daily active users (for yesterday)
-
Write the SQL script:
[stat_act_day.sql]
use big12_umeng;
create table if not exists stat_act_day(
  day string,
  appid string,
  appplatform string,
  brand string,
  devicestyle string,
  ostype string,
  appversion string,
  cnt int
)
row format delimited
fields terminated by ','
lines terminated by '\n';

insert into table stat_act_day
select
  '${ymd}',
  ifnull(tt.appid, 'NULLL'),
  ifnull(tt.appplatform, 'NULLL'),
  ifnull(tt.brand, 'NULLL'),
  ifnull(tt.devicestyle, 'NULLL'),
  ifnull(tt.ostype, 'NULLL'),
  ifnull(tt.appversion, 'NULLL'),
  count(tt.deviceid)
from (
  select t.appid, t.appplatform, t.brand, t.devicestyle, t.ostype, t.appversion, t.deviceid
  from (
    select appid, appplatform, brand, devicestyle, ostype, appversion, deviceid
    from appstartuplogs
    where concat(ym, day) = '${ymd}'
    group by appid, appplatform, brand, devicestyle, ostype, appversion, deviceid
    with cube
  ) t
  where t.appid is not null and t.deviceid is not null
) tt
group by tt.appid, tt.appplatform, tt.brand, tt.devicestyle, tt.ostype, tt.appversion
order by tt.appid, tt.appplatform, tt.brand, tt.devicestyle, tt.ostype, tt.appversion;
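GROUP BY ... WITH CUBE aggregates over every subset of the grouped columns (2^N grouping sets for N columns), producing NULLs for the columns left out of each set; that is why the outer query filters out rows where appid or deviceid came back null. Enumerating the grouping sets for two columns as a sketch:

```shell
# WITH CUBE over N columns aggregates once per subset of the columns: 2^N sets.
# Enumerate the grouping sets for two columns (appid, brand):
cols=(appid brand)
n=${#cols[@]}
for ((mask = 0; mask < (1 << n); mask++)); do
  subset=()
  for ((i = 0; i < n; i++)); do
    if (( mask & (1 << i) )); then subset+=("${cols[i]}"); fi
  done
  echo "grouping set: (${subset[*]})"
done
# prints 4 sets: (), (appid), (brand), (appid brand)
```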
-
Write a shell script that invokes the SQL script:
[stat_act_day.sh]
#!/bin/bash
cd /home/centos/big12_umeng
spark-submit --master yarn \
  --jars /home/centos/big12_umeng/umeng_hive.jar \
  --class com.oldboy.umeng.spark.stat.StatActDay \
  umeng_spark.jar \
  $1 `cat _time | awk -F '-' '{print $1$2}'`
-
Write the job file:
[3_1_stat_act_day.job]
type=command
command=sh /home/centos/big12_umeng/stat_act_day.sh /home/centos/big12_umeng/stat_act_day.sql
dependencies=2_1_data_clean_startuplog
-
Weekly active users (yesterday's week)
Omitted.
-
Monthly active users (yesterday's month)
Omitted.
7.1.3.2 New-user statistics
-
New users (for yesterday)
-
Write the SQL script:
[stat_new_day.sql]
use big12_umeng;
create table if not exists stat_new_day(
  day string,
  appid string,
  appplatform string,
  brand string,
  devicestyle string,
  ostype string,
  appversion string,
  cnt int
)
row format delimited
fields terminated by ','
lines terminated by '\n';

insert into table stat_new_day
select
  '${ymd}',
  t.appid,
  t.appplatform,
  t.brand,
  t.devicestyle,
  t.ostype,
  t.appversion,
  count(t.deviceid) cnt
from (
  select appid, appplatform, brand, devicestyle, ostype, appversion, deviceid,
         min(createdatms) firsttime
  from appstartuplogs
  group by appid, appplatform, brand, devicestyle, ostype, appversion, deviceid
  with cube
) t
where t.appid is not null
  and t.deviceid is not null
  and formatbyday(t.firsttime, 0, 'yyyyMMdd') = '${ymd}'
group by t.appid, t.appplatform, t.brand, t.devicestyle, t.ostype, t.appversion
order by t.appid, t.appplatform, t.brand, t.devicestyle, t.ostype, t.appversion;
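The new-user rule hinges on min(createdatms): a device counts as new on the target day only if that day holds its earliest startup. The same first-seen logic on toy data, sketched with awk:

```shell
# first-seen day per device: a device is "new" on its minimum day (toy data)
printf 'd1 20180901\nd1 20180902\nd2 20180902\n' |
  awk '{ if (!($1 in first) || $2 < first[$1]) first[$1] = $2 }
       END { for (d in first) print d, first[d] }' | sort
# prints:
# d1 20180901
# d2 20180902
```

So d1 is new on 20180901 only; on 20180902 it is active but not new.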
-
Write the shell script:
[stat_new_day.sh]
#!/bin/bash
cd /home/centos/big12_umeng
spark-submit --master yarn \
  --jars /home/centos/big12_umeng/umeng_hive.jar \
  --class com.oldboy.umeng.spark.stat.StatNewDay \
  umeng_spark.jar \
  $1 `cat _time | awk -F '-' '{print $1$2}'`
-
Write the job file:
[3_2_stat_new_day.job]
type=command
command=sh /home/centos/big12_umeng/stat_new_day.sh /home/centos/big12_umeng/stat_new_day.sql
dependencies=2_1_data_clean_startuplog
-
New users (yesterday's week)
-
Write the SQL script:
[stat_new_week.sql]
use big12_umeng;
create table if not exists stat_new_week(
  day string,
  appid string,
  appplatform string,
  brand string,
  devicestyle string,
  ostype string,
  appversion string,
  cnt int
)
row format delimited
fields terminated by ','
lines terminated by '\n';

insert into table stat_new_week
select
  '${ymd}',
  appid,
  appplatform,
  brand,
  devicestyle,
  ostype,
  appversion,
  sum(cnt) cnt
from stat_new_day
where formatbyweek(day, 'yyyyMMdd', 0, 'yyyyMMdd') = formatbyweek('${ymd}', 'yyyyMMdd', 0, 'yyyyMMdd')
group by appid, appplatform, brand, devicestyle, ostype, appversion;
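formatbyweek is a project UDF (assumed here to map a yyyyMMdd date to its week), so the weekly table is simply the daily table re-bucketed by week and summed. The same rollup on toy data with GNU date and awk:

```shell
# roll daily counts up into ISO week buckets (GNU date; toy "day count" data)
printf '20180903 5\n20180904 7\n20180910 2\n' |
  while read d c; do
    echo "$(date -d "$d" +%G-W%V) $c"   # e.g. 20180903 -> 2018-W36
  done |
  awk '{ sum[$1] += $2 } END { for (w in sum) print w, sum[w] }' | sort
# prints:
# 2018-W36 12
# 2018-W37 2
```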
-
Write a shell script that invokes the SQL script:
[stat_new_week.sh]
#!/bin/bash
cd /home/centos/big12_umeng
spark-submit --master yarn \
  --jars /home/centos/big12_umeng/umeng_hive.jar \
  --class com.oldboy.umeng.spark.stat.StatNewWeek \
  umeng_spark.jar \
  $1 `cat _time | awk -F '-' '{print $1$2}'`
-
Write the job file:
[3_3_stat_new_week.job]
type=command
command=sh /home/centos/big12_umeng/stat_new_week.sh /home/centos/big12_umeng/stat_new_week.sql
dependencies=3_2_stat_new_day
-
New users (yesterday's month)
Omitted.
7.1.4 Export the statistics to MySQL
-
Write the sqoop script:
[export_hive_stat_act_day_to_mysql.sh]
#!/bin/bash
sqoop-export \
  --driver com.mysql.jdbc.Driver \
  --connect jdbc:mysql://192.168.231.1:3306/big12 \
  --username root --password root \
  --columns day,appid,appplatform,brand,devicestyle,ostype,appversion,cnt \
  --table stat_act_day \
  --export-dir /user/hive/warehouse/big12_umeng.db/stat_act_day \
  -m 3
-
Write the job file:
[4_1_export_hive_stat_act_day_to_mysql.job]
type=command
command=sh /home/centos/big12_umeng/export_hive_stat_act_day_to_mysql.sh
dependencies=3_1_stat_act_day
7.1.5 The end job
-
Write the job file:
[end.job]
type=noop
dependencies=4_1_export_hive_stat_act_day_to_mysql
8.2 Package the job files into one zip
$>cd /home/centos/umeng/job
# zip all files in the current directory into 1.zip
$>zip 1.zip *
8.3 The contents of 1.zip are as follows
9. Upload the zip file
9.1 Click the Upload button
9.2 Select the zip file
9.3 View the upload result
9.4 Expand the flow to view the job dependencies
10. Executing and scheduling jobs
After upload, any individual job can be executed, with or without its dependencies. Individual execution is normally used for testing; once the jobs pass, the flow can be put on a schedule.