Hive – partition table query failed when stored as parquet

Hive was originally developed at Facebook to analyze and extract useful information from its very large data sets, and it is now widely used at other organizations such as Netflix and FINRA.

Use-case:

Nowadays most of us use various techniques to improve the performance of Hive queries. The two most common are:
1. Partitioning
2. Storing data in Parquet format.
Partitioning is a well-known concept to anyone who processes, analyzes, or aggregates data through Apache Hive, and the Parquet file format incorporates several features that make it well suited to data warehouse-style workloads.

What many of us do not know is that, in affected versions of Apache Hive, a query that filters on a partition column fails when the partitioned table is stored in Parquet format.

Let’s take a detailed look at it.

The sample data is a pipe-delimited file in HDFS, which we will load into a managed, non-partitioned Hive table.

First, create a managed Hive table named “hive_emp1”.
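The original statement is not reproduced here, so the following is only a sketch of what creating such a table might look like. The column names and types (emp_id, emp_name, dept) are assumptions made for illustration; only the table name, the pipe delimiter, and the trailing sex column come from the text.

```sql
-- Hypothetical schema: emp_id, emp_name and dept are illustrative assumptions.
-- Only the table name, the '|' delimiter and the last column (sex) are from the article.
CREATE TABLE hive_emp1 (
  emp_id   INT,
  emp_name STRING,
  dept     STRING,
  sex      STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
```

Because the table is created without the EXTERNAL keyword, Hive manages its data directory, which matches the "managed table" described above.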

Next, load the data from HDFS into the Hive table (hive_emp1) created in the previous step.
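A load of this kind might look like the sketch below. The HDFS path is a placeholder, not the path used in the original post.

```sql
-- The path below is a hypothetical placeholder; substitute the real HDFS location.
-- Without the LOCAL keyword, LOAD DATA moves the file from its HDFS location
-- into the table's warehouse directory.
LOAD DATA INPATH '/tmp/hive_emp_sample.txt' INTO TABLE hive_emp1;
```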

Take a look at the data in the Hive table created above.

The table contains a few males and 2 females, represented by ‘M’ and ‘F’ respectively in the last column (sex).
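To inspect the loaded rows and check the split by sex, queries along these lines could be used. This is a sketch; the alias cnt is illustrative.

```sql
-- Show the loaded rows.
SELECT * FROM hive_emp1;

-- Count rows per value of the sex column ('M' / 'F').
SELECT sex, COUNT(*) AS cnt
FROM hive_emp1
GROUP BY sex;
```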

Now create another Hive table named “hive_emp_dynpart”, partitioned on two columns (dept and gender) and stored in Parquet format.
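A minimal sketch of such a table definition follows. The non-partition columns (emp_id, emp_name) are assumptions for illustration; the table name, the two partition columns, and the Parquet storage format come from the text.

```sql
-- emp_id and emp_name are hypothetical regular columns.
-- Partition columns are declared only in PARTITIONED BY, not in the column list:
-- their values live in the partition directory names, not inside the Parquet files.
CREATE TABLE hive_emp_dynpart (
  emp_id   INT,
  emp_name STRING
)
PARTITIONED BY (dept STRING, gender STRING)
STORED AS PARQUET;
```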

Set hive.exec.dynamic.partition to true and hive.exec.dynamic.partition.mode to nonstrict so that data can be loaded into the table using dynamic partitioning.
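In the Hive CLI, these two properties can be set for the current session as follows:

```sql
-- Allow Hive to create partitions from the values of the inserted rows.
SET hive.exec.dynamic.partition=true;
-- nonstrict mode allows every partition column to be dynamic
-- (strict mode requires at least one static partition value).
SET hive.exec.dynamic.partition.mode=nonstrict;
```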

Then insert the data from hive_emp1 into hive_emp_dynpart, letting Hive create the partitions dynamically.
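The insert could be sketched as below, again assuming the hypothetical columns emp_id and emp_name in both tables. With dynamic partitioning, the partition columns must come last in the SELECT list, in the same order as in the PARTITION clause.

```sql
-- emp_id and emp_name are assumed column names, not taken from the original post.
-- The sex column of hive_emp1 feeds the gender partition column.
INSERT INTO TABLE hive_emp_dynpart PARTITION (dept, gender)
SELECT emp_id, emp_name, dept, sex AS gender
FROM hive_emp1;
```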

Issue:

When you query the hive_emp_dynpart table with a filter on one of the partition columns, you get the error below. Filters on regular (non-partition) columns work fine.

The failing query and its error output are:

hive> select * from hive_emp_dynpart where gender = 'M';
OK
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [gender] was not found in schema!
Time taken: 0.255 seconds

Error Description:

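This is a known bug in Apache Hive (HIVE-11401) in predicate pushdown for partitioned tables stored as Parquet. The filter on the partition column is pushed down to the Parquet reader, but partition columns are stored in the partition directory names rather than in the Parquet files themselves, so the reader cannot find the column in the file schema.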

Resolution:

A known workaround is to disable predicate pushdown by setting the property hive.optimize.index.filter to false.

Now query the table again using the same command.
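Put together, the workaround and the retried query look like this in the Hive CLI:

```sql
-- Workaround: disable the filter pushdown for the current session.
SET hive.optimize.index.filter=false;

-- Same query as before, filtering on the partition column.
SELECT * FROM hive_emp_dynpart WHERE gender = 'M';
```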

Conclusion:

On affected versions, you need to set the property to false in every Hive session before running the query, because a SET command applies only to the current session.

Hive 2.3.0 fixes this issue: https://issues.apache.org/jira/browse/HIVE-15782
Original article: https://www.cnblogs.com/candlia/p/11920260.html
