1.hive基础:order by,sort by,distribute by,cluster by
order by 全局排序 所有的map结果提交至一个reduce里,如果在严格模式下,order by 需要指定 limit 数据条数,不然数据量巨大的情况下
会造成崩溃无输出结果。
涉及属性:set hive.mapred.mode=nonstrict/strict
设置为严格模式后。order by 后面需要接 limit
sort by 多个reduce,reduce内部有序,不保证全局有序。
distribute by distribute by 一般要和 sort by 一起使用,即将某一块数据归给(distribute by)某一个reducer 处理,然后在指定的 reducer 中进行 sort by 排序。
cluster by =distribute by + sort by (分桶字段和排序字段为同一个字段)
create table company_info (
personid string,
company string,
money float
)row format delimited fields terminated by "";
load data local inpath "/home/dip/company.txt"
[dip@dip005 lzm]$ cat company.txt
p1 公司1 150
p2 公司2 230
p1 公司3 160
p3 公司4 30
p4 公司4 20
p4 公司6 90
p5 公司5 500
练习
#全局排序
select * from company_info order by money desc ;
#reduce内部有序,如果启动多个reduce,reduce之间没有有序
select * from company_info sort by money desc;
#默认第一个字段分桶,这里我们用第二个字段分桶distribute by 然后用第三个字段做sort by
2、hive基础:行转列、列转行 UDAF UDTF
2.1;列转行
concat()函数用于将多个字符串连接成一个字符串 SELECT CONCAT(id, ‘,’, name) AS con FROM info
conact_ws()函数是caoncat的特殊形式,第一个参数是分隔符号。 SELECT CONCAT("_",id, name) AS con FROM info
group_concat()函数返回一个字符串结果
hive中collect_list()函数和collect_set()函数将某列转为一个数组返回
[dip@dip005 lzm]$ cat person_info.tsv
孙悟空 白羊座 A
大海 射手座 A
宋宋 白羊座 B
猪八戒 白羊座 A
凤姐 射手座 A
create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "";
load data local inpath "/home/dip/lzm/person_info.tsv" into table person_info
#请将具有相同星座和血型的人返回
select t1.base,collect_set(name) from
(select name,concat_ws(",",constellation,blood_type) base
from person_info)t1
group by
t1.base;
将collect_set()返回的列表进行处理:
select base,concat_ws("|",collect_set(name)) from
(select name,concat_ws(",",constellation,blood_type) base
from person_info)t1
group by
base;
2.2行转列:
“fields terminated by”:字段与字段之间的分隔符。
“collection items terminated by”:一个字段中各个子元素 item 的分隔符。
[dip@dip005 lzm]$ head movie_info.tsv
《疑犯追踪》 悬疑,动作,科幻,剧情
《lie to me》 悬疑,警匪,动作,心理,剧情
《战狼2》 战争,动作,灾难
create table movie_info(
movie string,
category array<string>
)row format delimited fields terminated by ""
collection items terminated by ",";
load data local inpath "/home/dip/lzm/movie_info.tsv" into table movie_info;
lateral view 与 explode函数的使用
youtube数据实战:
源数据解释:youtube_video_ori
视频表
video_id 视频id
uploader 视频上传者
age 视频年龄
category 视频类别
length 视频长度
views 观看次数
rate 视频评分
ratings 流量
comments 评论数
related_ids 相关视频id (不多于20个)
用户表
uploader 视频上传者
videos 上传视频数
friends 朋友的数量
练习
一、各种 TopN 指标:
01统计视频观看数 top10
select * from youtube_video_ori order by views desc limit 10;
02统计视频类别热度 top10 (视频数最多的十个类别)
考点:把category进行 一行转多行,然后聚合统计次数
方法一:
select t1.category_name,count(*) as cnt
from
(select video_id,category_name from youtube_video_ori lateral view explode(category) youtube_video_ori as category_name)t1
group by category_name
order by cnt desc
limit 10
;
方法二:
select t1.category_name ,count(*) as cnt from
(select explode(category) as category_name from youtube_video_ori)t1
group by category_name
order by cnt desc
limit 10;
03统计视频观看数 Top20 所属类别包含这 Top20 视频的个数
①计算观看视频数top20
select * from youtube_video_ori order by views desc limit 20;
②类别一行转多行
select video_id,category_name from
(select * from youtube_video_ori order by views desc limit 20)t1
lateral view explode(category) t1 as category_name;
③统计每个类别包含视频个数视频
select category_name,count(*) from
(select video_id,category_name from
(select * from youtube_video_ori order by views desc limit 20)t1
lateral view explode(category) t1 as category_name)t2
group by category_name;
04统计视频观看数 Top50 所关联视频的所属类别Rank
①top前50
select * from youtube_video_ori order by views desc limit 50;
②关联视频id一行转多行:
select video_id,related_ids_name from
(select * from youtube_video_ori order by views desc limit 50)t1
lateral view explode(related_ids)t1 as related_ids_name;
③video_id做关联,与表youtube_video_ori关联,统计关联视频所属id
select t2.related_ids_name,youtube_video_ori.category from
(select video_id,related_ids_name from
(select * from youtube_video_ori order by views desc limit 50)t1
lateral view explode(related_ids)t1 as related_ids_name)t2
join youtube_video_ori
on t2.video_id =youtube_video_ori.video_id;
④类别一行转多行 ,为key,聚类统计数量
select category_name,count(*) as cnt from
(select t2.related_ids_name,youtube_video_ori.category from
(select video_id,related_ids_name from
(select * from youtube_video_ori order by views desc limit 50)t1
lateral view explode(related_ids)t1 as related_ids_name)t2
join youtube_video_ori
on t2.video_id =youtube_video_ori.video_id)t3
lateral view explode(category) t3 as category_name
group by category_name
order by cnt;
05统计每个类别中视频观看数的 Top10
查询有多少类别
hive -e "use hue_test1;select distinct(category_name) from youtube_video_ori lateral view explode(category)youtube_video_ori as category_name"| grep -v WARN>a.txt
cat a.txt|while read line
do
hive -e "use hue_test1;select video_id,category_name,views from youtube_video_ori lateral view explode(category)youtube_video_ori as category_name where category_name='$line' order by views desc limit 10" |grep -v WARN>>b.txt
done
06统计上传视频最多的用户 Top10 以及他们上传的视频
select youtube_video_ori.uploader,youtube_video_ori.video_id from
(
select uploader,count(*)as up_cnt from youtube_video_ori
group by uploader
order by up_cnt desc
limit 10)t1
join youtube_video_ori on t1.uploader = youtube_video_ori.uploader;