• Hive on spark和Hive on mr在处理orc格式表格时数据不一致问题探究


    一、问题描述

      在使用CDH6.3.2的集群处理数据时,当创建的表格为orc格式,且表格中存在null字段时,where中的<>条件没有生效;

    建表语句为:

    CREATE TABLE DWD_PC_INT_ZM_StockPoolComponent(
    ID bigint ,
    JSID bigint ,
    InsertTime TIMESTAMP ,
    UpdateTime TIMESTAMP ,
    CalDate TIMESTAMP ,
    InnerCode int ,
    SecuCode string ,--(30) ,
    SecuAbbr string ,--(100) ,
    SecuMarket int ,
    RuleCode bigint ,
    RuleDesc string ,--(1000),
    UDATE string,
    DELDATE string
    )stored as orcfile;

    插入数据为:

    INSERT INTO dwd.dwd_pc_int_zm_stockpoolcomponent (id,jsid,inserttime,updatetime,caldate,innercode,secucode,secuabbr,secumarket,rulecode,ruledesc,udate,deldate) VALUES
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',152,'000405','ST 鑫 光',90,100000000100002,'退市股票:退市日期:20040319','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',91,'000047','ST 中 侨',90,100000000100002,'退市股票:退市日期:20030530','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',76,'000033','新 都 退',90,100000000100002,'退市股票:退市日期:20170707','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',61,'000024','招商地产',90,100000000100002,'退市股票:退市日期:20151230','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',51,'000018','神城A退',90,100000000100002,'退市股票:退市日期:20200107','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',44,'000015','PT中浩A',90,100000000100002,'退市股票:退市日期:20011022','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',41,'000013','*ST石化A',90,100000000100002,'退市股票:退市日期:20040920','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',23,'000007','*ST 全新',90,100000000100002,'ST、*ST','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',17,'000005','ST 星 源',90,100000000100002,'ST、*ST','20211021','99991231'),
    (NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',11,'000003','PT金田A',90,100000000100002,'退市股票:退市日期:20020614','20211021','99991231');

    建表语句为:

    CREATE TABLE tmp.TMP_DWD_PC_INT_ZM_StockPoolComponent_02(
    CalDate TIMESTAMP ,
    InnerCode int ,
    SecuCode string ,--(30) ,
    SecuAbbr string ,--(100) ,
    SecuMarket int ,
    RuleCode bigint ,
    RuleDesc string --(1000)
    )stored as orcfile;

    插入数据为:

    INSERT INTO tmp_dwd_pc_int_zm_stockpoolcomponent_02 (caldate,innercode,secucode,secuabbr,secumarket,rulecode,ruledesc) VALUES
    ('2021-10-17 00:00:00.0',23603,'000136','民生策略A',NULL,101000000100001,NULL),
    ('2021-10-17 00:00:00.0',309519,'000090','民生信用债A',NULL,101000000100001,NULL),
    ('2021-10-17 00:00:00.0',309520,'000089','民生信用债C',NULL,101000000100001,NULL),
    ('2021-10-17 00:00:00.0',22892,'000068','民生转债C',NULL,101000000100001,NULL),
    ('2021-10-17 00:00:00.0',22891,'000067','民生转债A',NULL,101000000100001,NULL),
    ('2021-10-17 00:00:00.0',44,'000015','PT中浩A',90,100000000100002,'退市股票:退市日期:20011022'),
    ('2021-10-17 00:00:00.0',41,'000013','*ST石化A',90,100000000100002,'退市股票:退市日期:20040920'),
    ('2021-10-17 00:00:00.0',23,'000007','*ST 全新',90,100000000100002,'ST、*ST'),
    ('2021-10-17 00:00:00.0',17,'000005','ST 星 源',90,100000000100002,'ST、*ST'),
    ('2021-10-17 00:00:00.0',11,'000003','PT金田A',90,100000000100002,'退市股票:退市日期:20020614');

    执行查询语句为:

    SELECT
    A.CalDate,
    A.InnerCode,
    A.SecuCode,
    A.SecuAbbr,
    A.SecuMarket,
    A.RuleCode,
    A.RuleDesc,
    2 AS DataFlag
    FROM TMP_DWD_PC_INT_ZM_StockPoolComponent_02 A
    JOIN DWD_PC_INT_ZM_StockPoolComponent B ON A.CalDate=B.CalDate AND A.RuleCode=B.RuleCode AND A.SecuCode=B.SecuCode AND NVL(A.SecuMarket,0)=NVL(B.SecuMarket,0)
    AND B.DELDATE='99991231'
    WHERE NVL(A.InnerCode,0)<>NVL(B.InnerCode,0)
    OR NVL(A.SecuAbbr,'')<>NVL(B.SecuAbbr,'')
    OR NVL(A.RuleDesc,'')<>NVL(B.RuleDesc,'')

    经过测试发现:

    1、对于CDH6.3.2的集群

      如果使用hive  on spark执行引擎,可以查询出数据;使用hive on mr 或者sparksql来执行是查询不出来数据的;将表格格式切换为txt或者parquet时,使用mr或者hive on spark也查询不出数据

    2、对于TDH6.0.2的集群

    使用tdh集群来查询也查询不出数据;

    3、将CDH集群切换到5.16.4版本

    使用hive  on spark、hive on mr引擎也查不出结果;

    二、问题总结

     虽然具体原因没有定位到,但是可以确定的是hive2.1.1版本对hive on spark支持的不友好,会有bug存在。

       1、由此可见,cdh6.3.2的集群对于在表格格式为orc格式时,使用hive  on spark执行引擎会有出错的情况;CDH6.3.2的集群中的hive版本2.1.1对orc格式的表格处理不友好,会有bug存在;

       2、推荐大家在建立数据库表格时,使用parquet格式。

    现在Cloudera公司在主推CDP集群,后面cdh版本不在维护,hive on mr和hive on spark引擎在CDP7.1推出后也不在支持了。后面会主推hive  on tz。

  • 相关阅读:
    插入排序
    python -- 给电视剧重命名
    程序员你为什么迷茫?
    如何把自己打造成技术圈的 papi 酱
    GitHub中国区前100名到底是什么样的人?
    Python+opencv 图像拼接
    VS2015 新Web项目(C#6)出现CS1617错误的解决
    .Net Task<T>的一种比较神奇的卡死情况(Wait/Result卡死, await能得到结果)
    Xamarin Android自学和实践步骤
    跨过几个坑,终于完成了我的第一个Xamarin Android App!
  • 原文地址:https://www.cnblogs.com/chhyan-dream/p/15471053.html
Copyright © 2020-2023  润新知