• Spark 1.6 Installation


    Hadoop must be installed before Spark; see "Installing Hadoop 2.6".

    1. Install the JDK and Scala, and configure /etc/hosts:

    10.1.234.209 master

    10.1.234.210 slave1

    2. Download and extract spark-1.6.1-bin-hadoop2.6

    Set SPARK_HOME=/home/c3/apps/spark-1.6.1-bin-hadoop2.6

    cp spark-env.sh.template spark-env.sh

    vi spark-env.sh

    export JAVA_HOME=/home/c3/apps/jdk1.8
    export SCALA_HOME=/home/c3/apps/scala-2.11.8
    export SPARK_MASTER_IP=master
    export SPARK_WORKER_CORES=2
    export SPARK_WORKER_MEMORY=1g
    export HADOOP_HOME=/home/c3/apps/hadoop-2.6.0
    export HADOOP_CONF_DIR=/home/c3/apps/hadoop-2.6.0/etc/hadoop

    cp slaves.template slaves

    vi slaves

     slave1 

    At this point Spark is configured. Spark can run in two modes here: standalone cluster and Spark on YARN.

    In standalone cluster mode, Spark's own Master and Worker processes are started, and Spark allocates resources itself.

    With Spark on YARN, applications are submitted to YARN, which handles resource allocation. In this case there is no need to start Master and Worker processes; only a working Hadoop environment is required. YARN support has existed since Spark 0.6.

    3. Submitting jobs to the standalone cluster

    sbin/start-all.sh

    Check with jps: a Master process should be running on master, and a Worker process on slave1.

    Spark master web UI: http://master:8080 (the master UI defaults to port 8080; worker UIs default to 8081)

    In this mode, jobs are submitted like this (--jars takes a comma-separated list of dependency jars, with no spaces):

    spark-submit --master spark://master:7077 --name hello-spark --class com.App --jars /xxxxx.jar,/xxxxx.jar

    4. Submitting jobs to YARN

    Start Hadoop and YARN first, then submit like this:

    spark-submit --master yarn --name hello-spark --class com.App --jars /xxxxx.jar,/xxxxx.jar

    In addition, --master can also be mesos or local; with local, the job runs on the local machine.
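    As an illustrative sketch (not Spark's own parsing logic), a small Python helper shows how these --master strings are distinguished; the categories mirror the modes discussed above:

```python
import re

def classify_master(master: str) -> str:
    """Classify a Spark --master string into its deployment mode.

    Illustrative only; Spark itself accepts more variants than this sketch.
    """
    if master.startswith("spark://"):
        return "standalone"   # e.g. spark://master:7077
    if master.startswith("mesos://"):
        return "mesos"
    if master == "yarn":
        return "yarn"         # resources are managed by YARN
    if re.fullmatch(r"local(\[(\d+|\*)\])?", master):
        return "local"        # run on the local machine
    return "unknown"

print(classify_master("spark://master:7077"))  # standalone
print(classify_master("yarn"))                 # yarn
print(classify_master("local[*]"))             # local
```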

    5. Accessing Hive from spark-sql

    1) Start the Hive metastore service: $HIVE_HOME/bin/hive --service metastore

    2) Copy mysql-connector-java-5.0.8-bin.jar into $SPARK_HOME/jars/

    3) Copy hive-site.xml into $SPARK_HOME/conf

    Alternatively, use ln to link hdfs-site.xml, core-site.xml, yarn-site.xml, and hive-site.xml into $SPARK_HOME/conf.

    A hive-site.xml for this setup looks like the following:

    
    
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://192.168.194.72:3306/hive4yinjq?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
            <description></description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>admin</value>
            <description>Username to use against metastore database</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>123456</value>
            <description>password to use against metastore database</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
            <description>Driver class name for a JDBC metastore</description>
        </property>
        <property>
            <name>hive.metastore.uris</name>
            <value>thrift://127.0.0.1:9083</value>
            <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.
            </description>
        </property>
    </configuration>
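    The property names above are standard Hive metastore settings. As a quick sanity check (an illustrative sketch; the `load_properties` helper is not part of the setup), Python's standard library can parse a Hadoop-style XML config and confirm the required keys are present:

```python
import xml.etree.ElementTree as ET

# A trimmed hive-site.xml; values are placeholders matching the example above.
HIVE_SITE = """<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.194.72:3306/hive?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://127.0.0.1:9083</value>
    </property>
</configuration>
"""

def load_properties(xml_text: str) -> dict:
    """Return {name: value} for every <property> in a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

props = load_properties(HIVE_SITE)
# spark-sql needs at least the metastore URI to find Hive's tables.
assert props["hive.metastore.uris"].startswith("thrift://")
print(props["hive.metastore.uris"])
```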
    
    

    Additionally, spark-sql can be used in three ways:

    Spark Shell: run queries with spark.sql("select * from test")
    CLI: start bin/spark-sql
    Thrift Server:

    Start the Thrift JDBC Server (default port: 10000):

    cd $SPARK_HOME/sbin
    start-thriftserver.sh

    To change the Thrift JDBC Server's default listening port, use --hiveconf:

    start-thriftserver.sh --hiveconf hive.server2.thrift.port=14000

    HiveServer2 clients are documented at: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients

    Running the Thrift Server on YARN:

    ./spark-daemon.sh submit org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 default --name "Spark-Thrift-Server-local2" --master yarn \
    --hiveconf hive.server2.thrift.port=11111 \
    --hiveconf hive.server2.thrift.bind.host=0.0.0.0 \
    --executor-memory 1G \
    --executor-cores 1

    Note

    default is the name of the YARN queue.

    The following parameters are also available:

    --conf spark.driver.memory=10G
    --conf spark.scheduler.mode=FAIR
    --conf spark.sql.small.file.combine=true
    --conf hive.input.format=org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
    --hiveconf hive.metastore.client.socket.lifetime=1s
    --hiveconf hive.metastore.failure.retries=3
    --hiveconf hive.server2.thrift.port=11111
    --hiveconf hive.server2.thrift.bind.host=0.0.0.0
    --hiveconf hive.exec.dynamic.partition.mode=nonstrict
    --master yarn
    --executor-memory 4G
    --num-executors 25
    --executor-cores 2
    --driver-cores 2

    Start beeline:

    cd $SPARK_HOME/bin
    beeline -u jdbc:hive2://hadoop000:10000/default -n hadoop

    or, from within beeline:

    beeline> !connect jdbc:hive2://localhost:10000/default
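    The JDBC URL passed to beeline has the form jdbc:hive2://host:port/database. A small Python sketch (illustrative only; the hostnames are the ones used above, and real JDBC drivers accept more options than this handles) shows how the pieces break down:

```python
from urllib.parse import urlparse

def parse_hive2_url(jdbc_url: str):
    """Split a jdbc:hive2:// URL into (host, port, database).

    Illustrative helper; it does not handle extras such as
    ;principal=... that real HiveServer2 URLs may carry.
    """
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    parsed = urlparse(jdbc_url[len(prefix):])  # hive2://host:port/db
    if parsed.scheme != "hive2":
        raise ValueError("not a hive2 URL: " + jdbc_url)
    return parsed.hostname, parsed.port, parsed.path.lstrip("/")

host, port, db = parse_hive2_url("jdbc:hive2://hadoop000:10000/default")
print(host, port, db)  # hadoop000 10000 default
```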

    Problems encountered

    Problem 1

    Caused by: MetaException(message:Version information not found in metastore. )
        at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6664)
        at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6645)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)
        at com.sun.proxy.$Proxy6.verifySchema(Unknown Source)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:572)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
        at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
        at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)

    Solution

    When a SQLContext instance is created, the Hive version Spark was built against must match the version recorded in the Hive metastore. This check can be disabled by setting hive.metastore.schema.verification to false (the default is true) in $HIVE_HOME/conf/hive-site.xml:

    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
        <description>
            Enforce metastore schema version consistency. 
            True: Verify that version information stored in the metastore is compatible with that from the Hive jars. Also disable automatic schema migration; users must migrate the schema manually after a Hive upgrade, which ensures proper metastore schema migration. (Default)
            False: Warn if the version information stored in the metastore does not match that from the Hive jars.
        </description>
    </property>

    Problem 2:

    Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH

    Solution:

    cp mysql-connector-java-5.0.8-bin.jar ../../spark-2.1.1-bin-hadoop2.6/jars/

    Problem 3:

    Error:scalac: error while loading package, Missing dependency 'bad symbolic reference to org.apache.spark.annotation encountered in class file 'package.class'.
    Cannot access term annotation in package org.apache.spark. The current classpath may be
    missing a definition for org.apache.spark.annotation, or package.class may have been compiled against a version that's
    incompatible with the one found on the current classpath.', required by /home/yinjq/.m2/repository/org/apache/spark/spark-sql_2.11/2.1.1/spark-sql_2.11-2.1.1.jar(org/apache/spark/sql/package.class)

    Solution:

    Remove <scope>provided</scope> from the Spark dependencies:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.1</version>
            <scope>provided</scope>
        </dependency>
    
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.1</version>
            <scope>provided</scope>
        </dependency>

    Problem 4: error when running a Spark application on YARN

    17/08/03 17:49:14 ERROR util.Utils: Uncaught exception in thread Yarn application state monitor
    org.apache.spark.SparkException: Exception thrown in awaitResult
        at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:512)
        at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:93)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:151)
        at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
        at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
        at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1833)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1832)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:108)
    Caused by: java.io.IOException: Failed to send RPC 8393667808095342301 to /10.12.130.6:49562: java.nio.channels.ClosedChannelException
        at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:249)
        at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:233)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:514)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:488)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:427)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:129)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735)
        at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:36)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1072)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1126)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1061)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

    Solution:

    The diagnostic on the YARN side reads: Failing this attempt. Diagnostics: Container [pid=7574,containerID=container_1554900460598_0010_02_000001] is running beyond virtual memory limits. Current usage: 296.3 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.

    Based on this diagnostic, the memory allocated to the container was too small: YARN killed the process, which caused the ClosedChannelException.

    Add the following properties to yarn-site.xml:

    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
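    The 2.1 GB virtual-memory limit in the diagnostic comes from yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1: each container may use up to its physical allocation times that ratio in virtual memory. A quick check of the numbers above (assuming the default ratio):

```python
# Numbers taken from the YARN diagnostic above.
physical_limit_gb = 1.0   # container's physical memory allocation
vmem_pmem_ratio = 2.1     # YARN default for yarn.nodemanager.vmem-pmem-ratio
vmem_used_gb = 2.2        # virtual memory the container actually used

vmem_limit_gb = physical_limit_gb * vmem_pmem_ratio
print(f"virtual memory limit: {vmem_limit_gb:.1f} GB")  # 2.1 GB

# The container exceeded its virtual limit, so YARN killed it.
assert vmem_used_gb > vmem_limit_gb
```

    Rather than disabling the checks entirely as above, raising yarn.nodemanager.vmem-pmem-ratio or the container memory allocation are gentler alternatives.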

    Problem 5: an authentication issue

    User: yinjq is not allowed to impersonate hadoop), serverProtocolVersion:null)

    Solution:

    See https://stackoverflow.com/questions/36909002/authorizationexception-user-not-allowed-to-impersonate-user

    Problem 6:

    17/08/01 15:19:52 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

    Solution:

    free -g showed that memory was exhausted; free the page cache:

    echo 3 > /proc/sys/vm/drop_caches  

    Problem 7:

    Hive Schema version 2.3.0 does not match metastore's schema version 1.2.0 Metastore is not upgraded or corrupt)  

    Solution:

    Set the following in hive-site.xml:

    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    Problem 8:

    java.sql.SQLException: Column name pattern can not be NULL or empty  

    Solution:

    Replace mysql-connector-java-6.0.3.jar under $HIVE_HOME/lib with mysql-connector-java-5.1.39.jar. Root cause: mysql-connector-java 6.x is incompatible with 5.1.x here because the default of the nullNamePatternMatchesAll connection property changed between 5.1 and 6.0: it defaults to true in 5.1 and to false in 6.0.

    How to run a Scala program

    scala -cp spark.jar com.App

    Checking the JDK version of a class file

    javap -verbose Configuration.class|grep major

    Look for the major version attribute in the output, for example:

    major version: 52 means the .class file was compiled with JDK 1.8.

    Java SE 8 = 52 (0x34 hex)
    Java SE 7 = 51 (0x33 hex)
    J2SE 6.0 = 50 (0x32 hex)
    J2SE 5.0 = 49 (0x31 hex)
    JDK 1.4 = 48 (0x30 hex)
    JDK 1.3 = 47 (0x2F hex)
    JDK 1.2 = 46 (0x2E hex)
    JDK 1.1 = 45 (0x2D hex)
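    A .class file begins with the magic number 0xCAFEBABE, followed by two-byte minor and major version fields. A short Python sketch (standard library only) reads the major version directly, mirroring what javap reports:

```python
import struct

# Map class-file major versions to JDK releases (subset of the table above).
MAJOR_TO_JDK = {45: "1.1", 46: "1.2", 47: "1.3", 48: "1.4",
                49: "5.0", 50: "6.0", 51: "7", 52: "8"}

def class_major_version(data: bytes) -> int:
    """Return the major version from the first 8 bytes of a .class file."""
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    return major

# First 8 bytes of a class file compiled with JDK 1.8 (major version 52).
header = bytes.fromhex("cafebabe00000034")
major = class_major_version(header)
print(major, "=> JDK", MAJOR_TO_JDK[major])  # 52 => JDK 8
```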

     

     
  • Original source: https://www.cnblogs.com/machong/p/5627086.html