• Hive学习小记-(17)inline(array(struct))与explode


    inline

    前情提要:inline无法作用于map,array(map)

    关于inline:在横表纵表转换一节已经试过,map无法使用inline;

    在这里将map转成array,发现还是无法用inline,看来inline只适用array(struct)格式;

    # map转array,还是不能用lateral view inline;inline只适用于array(struct)格式
    sc.sql(''' select id ,array(str_to_map(concat_ws(',',collect_set(concat_ws(':',prod_nm,cast(bal as string)))))) from test_youhua.zongbiao group by id ''') # 查询结果已经转成了ARRAY 1 [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}] 2 [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
    # 还是不能用inline
    sc.sql(''' select map_tmp_tbl.id,c1 from  (
        select id
                ,array(str_to_map(concat_ws(',',collect_set(concat_ws(':',prod_nm,cast(bal as string)))))) as array_map_col
            from test_youhua.zongbiao 
            group by id
    ) as map_tmp_tbl  lateral view inline(map_tmp_tbl.array_map_col) t1 as c1  ''').show()
    
    # 报错,inline使用的格式为array(struct),这里格式array(map)不匹配
    AnalysisException: "cannot resolve 'inline(map_tmp_tbl.`array_map_col`)' due to data type mismatch: input to function inline should be array of struct type, not ArrayType(MapType(StringType,StringType,true),false);

    看了这篇怎么感觉可以应用于array(map)???

    https://blog.csdn.net/JnYoung/article/details/106169019

    不一样的,这个示例named_struct_1字段事先就存成了struct类型。

    那接下来老老实实建一个array(struct)格式字段来处理吧

    (1)数据准备-建表insert-select:直接将map转array后不能用inline的数据存成array(struct):不能,会报错字段类型不匹配。

    这里有点奇怪,hive是schema on read,insert的时候会检查字段格式是否一致吗??

    比如parquet不支持date格式,insert进去也只是显示空字段,而不是一开始就insert报错

    # 建表
    create table if not exists test_youhua.test_array_struct_inline(
    custom_id int comment "客户id",
    all_bal array<struct<baoxian:float, cunkuan:float, jijin:float>> comment '资产配置'
    ) 
    comment "array_struct_客户资产配置表"
    ;
    # 插入数据
    insert overwrite test_youhua.test_array_struct_inline
    select id
    ,array(str_to_map(concat_ws(',',collect_set(concat_ws(':',prod_nm,cast(bal as string))))))
    from test_youhua.zongbiao 
    group by id
    
    # 报错:字段类型不一致
    FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table 
    because column number/types are different 'test_array_struct_inline': 
    Cannot convert column 1 from array<map<string,string>> to array<struct<baoxian:float,cunkuan:float,jijin:float>>.

    (2)数据准备-直接load文件到test_youhua.test_array_struct_inline

    # 文件准备 test_array_struct_inline, xftp到Linux
    1    [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]
    2    [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
    
    # load 到HDFS
    hdfs dfs -put /opt/module/hive/my_input/test_array_struct_inline  hdfs:///user/hive/warehouse/test_youhua.db/test_array_struct_inline
    
    # 查询hive数据,数据确实已经load上去了,但是读不出来
    sc.sql(""" select * from test_youhua.test_array_struct_inline""").show()
    +---------+-------+
    |custom_id|all_bal|
    +---------+-------+
    |     null|   null|
    |     null|   null|
    +---------+-------+
    
    # 猜测是分隔符的原因,重新指定一下分隔符
    sc.sql(""" drop table test_youhua.test_array_struct_inline""")
    sc.sql("""create table if not exists test_youhua.test_array_struct_inline(
    custom_id int comment "客户id",
    all_bal array<struct<baoxian:float, cunkuan:float, jijin:float>> comment '资产配置'
    ) 
    comment "array_struct_客户资产配置表"
    row format delimited fields terminated by ','
    collection items terminated by '_'
    """)
    
    !hdfs dfs -put /opt/module/hive/my_input/test_array_struct_inline  hdfs:///user/hive/warehouse/test_youhua.db/test_array_struct_inline
    
    sc.sql(""" select * from test_youhua.test_array_struct_inline""").show()
    
    #无论怎样改都不行,读不出来,可能是array嵌套struct,影响了分隔符指定的缘故(其实是因为json格式需要导入serde包)

     (3)数据准备-用json包指定row format读文件

    其实之前数据无法正常read是因为json的分隔符的原因,需要导入jsonserde包

    参考:Hive学习小记-(16)hive加载解析json文件 稍微修改了一下文件:

    {"custom_id":"1","all_bal":[{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]}
    {"custom_id":"2","all_bal":[{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]}
    hive> add jar /opt/module/hive/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar
    hive> create table if not exists test_youhua.test_array_struct_inline(
    > custom_id string comment "客户id",
    > all_bal array<struct<baoxian:string, cunkuan:string, jijin:string>> comment '资产配置'
    > ) 
    > comment "array_struct_客户资产配置表"
    > row format serde 'org.apache.hive.hcatalog.data.JsonSerDe';
    OK
    Time taken: 0.09 seconds
    hive> select * from test_youhua.test_array_struct_inline;
    OK
    Time taken: 0.089 seconds
    # 数据导入并且读取成功
    hive> select * from test_youhua.test_array_struct_inline; OK 1 [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}] 2 [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}] Time taken: 0.107 seconds, Fetched: 2 row(s)
    #注意这里字段类型全部改为string,否则select会报错:
    Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
    at [Source: java.io.ByteArrayInputStream@ab327c; line: 1, column: 41]

    (4)用inline可以打开array(struct),对比explode只是打开array

    参考:https://blog.csdn.net/weixin_42003671/article/details/88132666

    # 原始数据
    hive> select * from test_youhua.test_array_struct_inline;
    OK
    1    [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]
    2    [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
    Time taken: 0.103 seconds, Fetched: 2 row(s)
    
    #这个时候如果想读取jijin、baoxian、cunkuan的余额,需要:
    #这是在array的元素只有一条struct数据的情况,如果有多条struct元素,通过array[i]的形式来寻找某个key的value会比较困难,这时候就要借助inline和explode来将多个struct的某个key对应的value转到一列
    hive> select all_bal[0].jijin,all_bal[0].baoxian,all_bal[0].cunkuan from test_youhua.test_array_struct_inline;
    OK
    1.1    1.2    1.3
    2.67    2.34    2.1
    Time taken: 0.587 seconds, Fetched: 2 row(s)
    
    #用 inline 将多个struct的某个key对应的value转到一列
    hive> select tmp.custom_id,c1,c2,c3 from test_youhua.test_array_struct_inline  as tmp lateral view inline(tmp.all_bal) t1 as c1,c2,c3;
    OK
    1    1.2    1.3    1.1
    2    2.34    2.1    2.67
    Time taken: 0.093 seconds, Fetched: 2 row(s)
    
    #对比用explode来转,explode只能打开一层,即去掉了array的[]
    hive> select tmp.custom_id,c1 from test_youhua.test_array_struct_inline as tmp lateral view explode(tmp.all_bal) t1 as c1;
    OK
    1    {"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}
    2    {"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}
    
    #explode还需再加上struct.key来进一步取key对应的value值;这么看inline能够比explode打开更深一层,inline可以直接取到value,explode还要再通过struct.key形式来取value
    hive> select tmp.custom_id,c1.jijin from test_youhua.test_array_struct_inline as tmp lateral view explode(tmp.all_bal) t1 as c1; 
    OK
    1    1.1
    2    2.67
    Time taken: 0.122 seconds, Fetched: 2 row(s)
    
    #参考链接里这种写法不行,需要数据明确key、value
    select tmp.custom_id,c1.value from test_youhua.test_array_struct_inline as tmp lateral view explode(tmp.all_bal) t1 as c1 where c1.key="jijin";
    # 报错:
    RuntimeException cannot find field key(lowercase form: key) in [baoxian, cunkuan, jijin]

    tips

    (1)org.apache.hive.hcatalog.data.JsonSerDe 对复杂类型支持不足

    参考:https://www.cnblogs.com/aprilrain/p/6916359.html

    (2)insert数据到array(struct)-用named_struct

    参考这个:https://blog.csdn.net/random0815/article/details/85252250

    以及报错解决:https://blog.csdn.net/qq_36203774/article/details/102599260

    insert into test_youhua.test_array_struct_inline 
    select "4",array(named_struct('baoxian','1.46','cunkuan','1.46','jijin','1.46'));
    # 报错
    ParseException line 1:124 Failed to recognize predicate '<EOF>'. Failed rule: 'regularBody' in statement
    
    #报错解决,建临时表tmp,加from改写,这里要注意struct里面字段顺序
    with tmp as 
    (select "3",array(named_struct('baoxian','1.45','cunkuan','1.45','jijin','1.45')))
    insert into test_youhua.test_array_struct_inline
    select * from tmp;
    
    select * from test_youhua.test_array_struct_inline
    #数据插入成功
    3    [{"baoxian":"1.45","cunkuan":"1.45","jijin":"1.45"}]
    1    [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]
    2    [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
    
    # 用struct来select可以,但是无法insert到指定的array(struct)中
    with tmp as 
    (select "6",array(struct('1.45','1.45','1.45')))
    insert into test_youhua.test_array_struct_inline
    select * from tmp;
    # struct报错,列名无法对应
    Cannot insert into target table because column number/types are different 'test_array_struct_inline': Cannot convert column 1 from array<struct<col1:string,col2:string,col3:string>> to array<struct<baoxian:string,cunkuan:string,jijin:string>>.

    (3)map、array、struct读数据的方式

    select array[1],map['xiao song'],struct.city from test
  • 相关阅读:
    2017/7/26 SCJP英语学习
    JSF(JavaServer Faces)简介
    Java回话技术
    2.2 对象深拷贝、浅复制、序列化
    编码与解码
    pycharm 教程(一)安装和首次使用
    Java Eclipse进行断点调试
    详细介绍如何在Eclipse中使用SVN
    SVN客户端安装与使用
    炸鸡
  • 原文地址:https://www.cnblogs.com/foolangirl/p/14326886.html
Copyright © 2020-2023  润新知