• Summary of Spark cluster configuration details


    Change the owner and group of the installation directories:

    sudo chown -R hadoop:hadoop spark-1.6.1-bin-hadoop2.6

    sudo chown -R hadoop:hadoop jdk1.8.0_101

    sudo chown -R hadoop:hadoop scala2.11.6
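
    A quick way to confirm the ownership change took effect (directory names taken from the commands above; adjust the path to wherever they were unpacked):

    ls -ld spark-1.6.1-bin-hadoop2.6 jdk1.8.0_101 scala2.11.6   # owner and group should both read hadoop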

    1. In the /etc directory

    vi hosts

    192.168.xxx.xxx data6 (master node)

    192.168.xxx.xxx data2 (worker node)

    192.168.xxx.xxx data3 (worker node)
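
    With these entries in place on every machine, a quick check that the nodes resolve and reach each other by hostname (hostnames as defined above):

    ping -c 2 data2
    ping -c 2 data3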

    2. In the spark/conf/ directory

    vi slaves

    data6

    data2

    data3

    vi spark-env.sh

    export JAVA_HOME=/app/jdk1.7

    export SPARK_MASTER_IP=data6

    export SPARK_WORKER_INSTANCES=1

    export SPARK_WORKER_MEMORY=30g

    export SPARK_WORKER_CORES=6

    export SPARK_LOG_DIR=/data/tmp

    export SPARK_PID_DIR=/data/tmp

    export SPARK_DAEMON_JAVA_OPTS="-Djava.io.tmpdir=/home/tmp"

    export PYSPARK_PYTHON=/opt/anaconda3/bin/python3

    export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/ipython3

    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip 0.0.0.0 --port 9999"

    export PATH=$PATH:/usr/local/bin

    export SPARK_CLASSPATH=/app/spark-1.6.1/lib/spark-examples-1.6.1-hadoop2.4.0.jar:/app/spark-1.6.1/lib/spark-assembly-1.6.1-hadoop2.4.0.jar:/app/spark-1.6.1/lib/spark-1.6.1-yarn-shuffle.jar:/app/spark-1.6.1/lib/nlp-lang-1.5.jar:/app/spark-1.6.1/lib/mysql-connector-java-5.1.26-bin.jar:/app/spark-1.6.1/lib/datanucleus-rdbms-3.2.9.jar:/app/spark-1.6.1/lib/datanucleus-core-3.2.10.jar:/app/spark-1.6.1/lib/datanucleus-api-jdo-3.2.6.jar:/app/spark-1.6.1/lib/ansj_seg-3.7.3-all-in-one.jar
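
    With conf/slaves and spark-env.sh filled in, the standalone cluster can be brought up from the master (data6) using the scripts shipped with Spark. A minimal sketch; 8080 is the default port of the standalone master web UI:

    sbin/start-all.sh    # starts the Master here and a Worker on every host listed in conf/slaves
    jps                  # Master (and Worker) processes should appear
    # master web UI: http://data6:8080

    Because PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS above point at IPython's notebook mode, running bin/pyspark on this node starts a notebook server on port 9999 instead of the usual interactive shell.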

    vi hive-site.xml

    <configuration>

    <property>

    <name>hive.metastore.uris</name>

    <value>thrift://data6:9083</value>

    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>

    </property>

    <property>

    <name>hive.server2.thrift.min.worker.threads</name>

    <value>5</value>

    <description>Minimum number of Thrift worker threads</description>

    </property>

    <property>

    <name>hive.server2.thrift.max.worker.threads</name>

    <value>500</value>

    <description>Maximum number of Thrift worker threads</description>

    </property>

    <property>

    <name>hive.server2.thrift.port</name>

    <value>11000</value>

    <description>Port number of HiveServer2 Thrift interface.</description>

    </property>

    <property>

    <name>hive.server2.thrift.bind.host</name>

    <value>data6</value>

    <description>Bind host on which to run the HiveServer2 Thrift interface</description>

    </property>

    <property>

    <name>mapred.reduce.tasks</name>

    <value>40</value>

    </property>

    </configuration>
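
    A short sketch of how this file is typically exercised: start the remote metastore on data6 (the thrift URI above points at port 9083), then query it through Spark, which reads conf/hive-site.xml automatically:

    hive --service metastore &           # on data6, serves thrift://data6:9083
    bin/spark-sql -e "show databases;"   # from the Spark directory; should list the Hive databases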

    vi log4j.properties

    # Settings to quiet third party logs that are too verbose

    log4j.logger.org.spark-project.jetty=WARN

    log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR

    log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO

    log4j.logger.org.apache.spark.repl.SparkILoop$SparkLoopInterpreter=INFO

    log4j.logger.parquet=ERROR

    # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support

    log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL

    log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
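
    These overrides live in conf/log4j.properties, which Spark does not create by default; a minimal setup, assuming the installation path from the chown step above:

    cd /app/spark-1.6.1-bin-hadoop2.6/conf          # path assumed; use your actual Spark directory
    cp log4j.properties.template log4j.properties   # the template ships with Spark
    vi log4j.properties                             # append the overrides listed above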

    Spark cluster setup: passwordless SSH login

    Machine preparation

    I have three machines; the left column is the IP address, the right column the hostname, and all three have a user named spark. Verify with ping that the three machines can reach each other.

    192.168.248.150 spark-master

    192.168.248.153 ubuntu-worker

    192.168.248.155 spark-worker1

    Add these entries to /etc/hosts on all three machines.

    Configuration

    We need to configure spark-master so that it can log in to ubuntu-worker and spark-worker1 without a password.

    1. Install ssh

      sudo apt-get install openssh-server

    2. Generate a key pair

      Run ssh-keygen -t rsa and press Enter at every prompt.

    3. Copy the id_rsa.pub file from the spark-master node to the other two nodes:

      scp id_rsa.pub spark@ubuntu-worker:~/.ssh/

    4. On each of the other two nodes, append the public key to the authorized-keys file:

      cat id_rsa.pub >> authorized_keys

    5. On both workers, set the permissions of authorized_keys to 600 (or 644) and of the .ssh directory to 700:

      chmod 700 .ssh

      chmod 600  authorized_keys

    6. Verify:

      Log in to spark-master and run ssh ubuntu-worker in a terminal; if the login succeeds without a password prompt, the setup works.
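
    Steps 3 to 5 can also be collapsed into one command per worker with ssh-copy-id, which appends the key and sets up the file permissions for you; run this on spark-master:

    ssh-copy-id spark@ubuntu-worker
    ssh-copy-id spark@spark-worker1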

    HDFS download failure issue

    When downloads from the NameNode's monitoring web page fail in IE on a Windows machine, edit the Windows hosts file C:\WINDOWS\system32\drivers\etc\hosts and add the hostnames and IP addresses of the Hadoop cluster machines (the same entries as in /etc/hosts on the cluster) so that IE can resolve them; the downloads then work.

    If the NameNode fails to start, format it:

    hadoop namenode -format

    Then start HDFS again.
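
    Note that formatting wipes the NameNode metadata, so it is only appropriate on a fresh or expendable HDFS. A sketch of restarting and checking, assuming Hadoop's sbin scripts are on the PATH:

    start-dfs.sh    # starts the NameNode, DataNodes and SecondaryNameNode
    jps             # NameNode should now be listed on the master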

  • Original article: https://www.cnblogs.com/zhw-080/p/5942431.html