• Spark throws java.lang.RuntimeException: serious problem at OrcInputFormat.generateSplitsInfo when reading an empty ORC file


    Reproducing the problem:

    G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell
    2020-12-26 10:20:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://DESKTOP-01KN1P4:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608949256544).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
    ++
    ||
    ++
    ++
    
    (In another window, create an empty file) touch /tmp/empty_orc/zero.orc
    
    scala> sql("select * from empty_orc").show
    
    java.lang.RuntimeException: serious problem
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
      at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
      at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
      at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
      at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
      at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
      at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
      ... 49 elided
    Caused by: java.lang.NullPointerException
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
      ... 99 more

    The root cause is that the query fails when the ORC table's directory contains an empty (zero-byte) file. The bug is tracked in:

    SPARK-19809: NullPointerException on zero-size ORC file (https://issues.apache.org/jira/browse/SPARK-19809)

    SPARK-29773: Unable to process empty ORC files in Hive Table using Spark SQL (https://issues.apache.org/jira/browse/SPARK-29773)
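
    Besides the configuration workaround described next, the zero-byte files can simply be removed before the table is queried. A minimal cleanup sketch using the Hadoop FileSystem API follows (the table path /tmp/empty_orc is taken from the reproduction above, and deleting the empty files is assumed to be acceptable for this table):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Resolve the filesystem that holds the table directory.
    val tablePath = new Path("/tmp/empty_orc")
    val fs = FileSystem.get(tablePath.toUri, spark.sparkContext.hadoopConfiguration)

    // Delete every zero-length file under the table directory so that
    // OrcInputFormat never has to generate splits for an empty ORC file.
    fs.listStatus(tablePath)
      .filter(s => s.isFile && s.getLen == 0)
      .foreach(s => fs.delete(s.getPath, false))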

    Workaround: set spark.sql.hive.convertMetastoreOrc=true

    G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell --conf spark.sql.hive.convertMetastoreOrc=true
    2020-12-26 10:29:06 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://DESKTOP-01KN1P4:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608949754291).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sql("select * from empty_orc").show
    
    +---+
    |  a|
    +---+
    +---+
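
    The same setting can also be applied in application code instead of on the spark-shell command line. A minimal sketch (the application name is just a placeholder):

    import org.apache.spark.sql.SparkSession

    // Route Hive ORC serde tables through Spark's own ORC reader,
    // which tolerates zero-byte files, instead of Hive's OrcInputFormat.
    val spark = SparkSession.builder()
      .appName("empty-orc-demo")   // placeholder name
      .enableHiveSupport()
      .config("spark.sql.hive.convertMetastoreOrc", "true")
      .getOrCreate()

    spark.sql("select * from empty_orc").show()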

    The Spark documentation describes this as follows:

    ORC Files

    Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For the Hive ORC serde tables (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true.

    https://spark.apache.org/docs/2.3.3/sql-programming-guide.html#orc-files
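
    For reference, the configurations mentioned in that paragraph can also be set at runtime in an existing session. A sketch (the values shown are the ones the quoted documentation describes):

    // spark.sql.orc.impl and spark.sql.orc.enableVectorizedReader select the
    // vectorized reader for native ORC tables; convertMetastoreOrc additionally
    // routes Hive ORC serde tables through the same reader.
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

    After these are set, the empty_orc table from the reproduction can be queried without hitting the OrcInputFormat split generation path.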

• Original article: https://www.cnblogs.com/flowerbirds/p/14191707.html