Spark issues


    Problem 1

      Running sc.textFile("hdfs://test02.com:8020/tmp/w").count in spark-shell throws the following exception:

    java.lang.RuntimeException: Error in configuring object

    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)

    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)

    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)

    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:186)

    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)

    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

    at scala.Option.getOrElse(Option.scala:120)

    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

    at scala.Option.getOrElse(Option.scala:120)

    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1517)

    at org.apache.spark.rdd.RDD.count(RDD.scala:1006)

    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)

    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)

    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)

    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)

    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:33)

    at $iwC$$iwC$$iwC.<init>(<console>:35)

    at $iwC$$iwC.<init>(<console>:37)

    at $iwC.<init>(<console>:39)

    at <init>(<console>:41)

    at .<init>(<console>:45)

    at .<clinit>(<console>)

    at .<init>(<console>:7)

    at .<clinit>(<console>)

    at $print(<console>)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)

    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)

    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)

    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)

    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)

    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)

    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)

    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)

    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)

    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)

    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)

    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)

    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)

    at org.apache.spark.repl.Main$.main(Main.scala:31)

    at org.apache.spark.repl.Main.main(Main.scala)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)

    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)

    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)

    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)

    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

    Caused by: java.lang.reflect.InvocationTargetException

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)

    ... 61 more

    Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.

    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)

    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)

    at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)

    ... 66 more

    Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found

    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)

    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)

    ... 68 more

    Cause:

      Compression is enabled in Hadoop's core-site.xml / mapred-site.xml and the configured codec is LZO, so files written or uploaded to HDFS are automatically LZO-compressed. When you then read such a file with sc.textFile, you get a pile of "LZO not found" exceptions.

      The root cause is that Spark cannot find hadoop-lzo.jar and the native lzo libraries. Make sure lzo, lzop, and hadoop-lzo.jar are installed on every machine in the cluster, then edit spark-env.sh and add SPARK_LIBRARY_PATH and SPARK_CLASSPATH, where SPARK_LIBRARY_PATH points to the native lzo libraries and SPARK_CLASSPATH points to hadoop-lzo.jar. If you test from spark-shell, you also need to pass --jars and --driver-library-path when starting it.

      On a CDH cluster hadoop-lzo is already installed; on a vanilla Apache Hadoop cluster you have to install it yourself.

    Solution:

    • Install the hadoop-lzo package on every machine in the cluster.

         In general that means the following steps on each machine (a hedged install sketch follows this list):

         Install the native lzo library

         Install the lzop executable

         Install Twitter's hadoop-lzo.jar
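
         A hedged sketch of those steps on a RHEL/CentOS node (the package names, build procedure, and target paths are assumptions; adjust them to your distribution and Hadoop layout):

         # native LZO library and the lzop tool
         sudo yum install -y lzo lzo-devel lzop

         # build Twitter's hadoop-lzo, then copy the jar and native libs next to Hadoop
         git clone https://github.com/twitter/hadoop-lzo.git
         cd hadoop-lzo
         mvn clean package -DskipTests
         sudo cp target/hadoop-lzo-*.jar /opt/hadoop/lib/
         sudo cp -r target/native/Linux-amd64-64/lib/* /opt/hadoop/lib/native/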

         

    • Add the SPARK_LIBRARY_PATH and SPARK_CLASSPATH variables in spark-env.sh

      Add the following:

     export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native

    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native

    export SPARK_CLASSPATH=/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib:$SPARK_CLASSPATH

    • Then start spark-shell as follows. Note that whether you run in local mode or against a master URL, you need both of these options:

      ./spark-shell  --jars <full path to hadoop-lzo.jar>  --driver-library-path <hadoop-lzo native library directory>

      After that you can read HDFS data with textFile from the spark-shell.

    For example, start spark-shell like this:

    ./bin/spark-shell --jars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo.jar --driver-library-path /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
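
    A quick check from the newly started shell, reusing the path from the error at the top of this section (adjust host, port, and path to your cluster):

    val lines = sc.textFile("hdfs://test02.com:8020/tmp/w")
    println(lines.count())   // should now succeed instead of throwing ClassNotFoundException for LzoCodec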

    Problem 2

      

      The Scala version in the Maven pom.xml must match the Scala version your Spark artifacts were built for.

      If the Scala version in pom.xml is 2.11:

      <properties>

        <scala.version>2.11.4</scala.version>

      </properties>

     then the Spark dependency should also use the 2.11 suffix:

         <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->

        <dependency>

          <groupId>org.apache.spark</groupId>

          <artifactId>spark-core_2.11</artifactId>

          <version>1.6.1</version>

        </dependency>

    The same applies when using scalatest and scalactic.
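
    For example, a scalatest dependency would likewise need the _2.11 suffix (the version below is only illustrative):

        <dependency>
          <groupId>org.scalatest</groupId>
          <artifactId>scalatest_2.11</artifactId>
          <version>2.2.6</version>
          <scope>test</scope>
        </dependency>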

      

    Problem 3: comparator exception

    Caused by: java.lang.IllegalArgumentException: Comparison method violates its general contract!

    at java.util.TimSort.mergeHi(TimSort.java:899)

    at java.util.TimSort.mergeAt(TimSort.java:516)

    at java.util.TimSort.mergeCollapse(TimSort.java:441)

    at java.util.TimSort.sort(TimSort.java:245)

    at java.util.Arrays.sort(Arrays.java:1438)

    at scala.collection.SeqLike$class.sorted(SeqLike.scala:618)

    at scala.collection.mutable.ArrayOps$ofRef.sorted(ArrayOps.scala:186)

    at scala.collection.SeqLike$class.sortWith(SeqLike.scala:575)

    at scala.collection.mutable.ArrayOps$ofRef.sortWith(ArrayOps.scala:186)

    at bitnei.utils.Utils$.sortBy(Utils.scala:116)

    at FsmTest$$anonfun$1$$anonfun$apply$mcV$sp$4.apply(FsmTest.scala:30)

    ... 54 more

    The cause is that the comparison function returns true for two equal values in both directions (when a equals b, both a <= b and b <= a hold), which violates the strict ordering contract that sortWith/TimSort requires. The original code:

    def compareDate(dateA: String, dateB: String): Boolean = {
      val dateFormat = new java.text.SimpleDateFormat("yyyyMMddHHmmss")
      val timeA = dateFormat.parse(dateA).getTime
      val timeB = dateFormat.parse(dateB).getTime

      timeA <= timeB  // bug: returns true for equal timestamps in either direction
    }

    Changing the <= above to < fixes the problem.
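
    The corrected function:

    def compareDate(dateA: String, dateB: String): Boolean = {
      val dateFormat = new java.text.SimpleDateFormat("yyyyMMddHHmmss")
      // Strict < means equal timestamps are "not less than" each other in both
      // directions, which satisfies the ordering contract sortWith expects.
      dateFormat.parse(dateA).getTime < dateFormat.parse(dateB).getTime
    }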

    Problem 4: spark-jobserver Maven issue


    <dependency>
      <groupId>spark.jobserver</groupId>
      <artifactId>job-server-api_2.11</artifactId>
      <version>0.6.2</version>
    </dependency>

    When depending on job-server-api_2.11 as above, Maven cannot resolve some of its dependencies because the default central repository does not host them. Add the spark-jobserver repository to the pom, both as a <pluginRepository> (under <pluginRepositories>) and as a <repository> (under <repositories>):

    <pluginRepository>
      <id>dl-bintray.com/</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>https://dl.bintray.com/spark-jobserver/maven/</url>
    </pluginRepository>

    <repository>
      <id>dl-bintray.com/</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>https://dl.bintray.com/spark-jobserver/maven/</url>
    </repository>

    After that the dependencies resolve.

    Problem 5: Spark cannot resolve the HDFS nameservice when reading data

    java.lang.IllegalArgumentException: java.net.UnknownHostException: nameservice1

    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)

    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)

    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)

    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:448)

    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:410)

    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:128)

    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2308)

    Solution:

    Add the following to spark-defaults.conf:
    spark.files=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml

    In other words, ship core-site.xml and hdfs-site.xml via spark.files. nameservice1 is a logical HDFS HA name defined in hdfs-site.xml, so the executors must have that file available in order to resolve it.
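
    Once those files are shipped with the job, the logical name should resolve. A minimal check (the path is a placeholder; nameservice1 is the logical name from the error above):

    val rdd = sc.textFile("hdfs://nameservice1/tmp/some/path")
    println(rdd.count())   // previously failed with UnknownHostException: nameservice1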

    Problem 6

     org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/vehicle/result/mid/2016/09/13/_temporary/0/_temporary/attempt_201611121515_0001_m_000005_188/part-00005 could only be replicated to 0 nodes instead of minReplication (=1).  There are 6 datanode(s) running and no node(s) are excluded in this operation.

    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1541)

    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3289)

    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:668)

    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:212)

    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:483)

    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)

    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)

    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)

    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)

    at java.security.AccessController.doPrivileged(Native Method)

    at javax.security.auth.Subject.doAs(Subject.java:415)

    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)

    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)

    at org.apache.hadoop.ipc.Client.call(Client.java:1468)

    at org.apache.hadoop.ipc.Client.call(Client.java:1399)

    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)

    at com.sun.proxy.$Proxy13.addBlock(Unknown Source)

    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)

    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)

    at com.sun.proxy.$Proxy14.addBlock(Unknown Source)

    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)

    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)

    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)

    Cause: the HDFS disks were full. A command-line check is sketched below.
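
    One way to confirm this, using the standard HDFS CLI:

    hdfs dfsadmin -report   # per-datanode capacity, used and remaining space
    hdfs dfs -df -h /       # overall filesystem usage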

    JDBC issues

    1. Adding ojdbc as a Maven dependency:

    First install the ojdbc jar into the local Maven repository:

    C:\Users\franciswang>mvn install:install-file -Dfile=d:/spark/lib/ojdbc6-11.2.0.3.0.jar -DgroupId=com.oracle -DartifactId=ojdbc6 -Dversion=11.2.0 -Dpackaging=jar

    Then reference it in the pom:


    <dependency>
      <groupId>com.oracle</groupId>
      <artifactId>ojdbc6</artifactId>
      <version>11.2.0</version>
    </dependency>

    When the project was packaged and run with Spark on Linux, it kept failing with "Cannot find or load oracle.jdbc.driver.OracleDriver". Putting ojdbc.jar on the classpath, in the project's lib directory, or under jdk/jre/lib did not help; it only worked after copying it into jdk/jre/lib/ext/. Reference:

    http://stackoverflow.com/questions/17701610/cannot-find-or-load-oracle-jdbc-driver-oracledriver
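
    A minimal sanity check that the driver is actually visible at runtime (the connection URL, user, and password below are placeholders, not from the original setup):

    import java.sql.DriverManager

    // Throws ClassNotFoundException when ojdbc6.jar is not on the classpath.
    Class.forName("oracle.jdbc.driver.OracleDriver")

    // Hypothetical connection details; replace with your own host, service name, and credentials.
    val conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/orcl", "user", "password")
    println(conn.getMetaData.getDatabaseProductVersion)
    conn.close()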
