• spark部署


    cloudera官网spark:
    https://docs.cloudera.com/documentation/enterprise/6/6.2/topics/spark.html
    spark官网:
    https://spark.apache.org/documentation.html
    CDH安装spark:
    https://my.oschina.net/gywbest/blog/3054588
    spark开发:
    https://docs.cloudera.com/documentation/enterprise/6/6.2/topics/spark_applications_developing.html
    spark部署使用
    https://www.cnblogs.com/qingyunzong/p/8888080.html GG


    问题:
    pattition和shuffle的区别

    在CDH中开启spark:

    添加服务:

     

    选spark---继续

    角色分配,保持默认的点继续

    因为没有启用TLS/SSL,点继续:

     

    自动执行安装:

     

    完成,点重启过时服务部署客户端:

     

    登陆任意节点,输入/usr/bin/spark-shell

    jps可以看到:

     

    A Hive Gateway role is recommended but was not found on these hosts: zhep-opay-temp-[1-4].novalocal; zhep-opay-temp-big-data-1.novalocal. This will prevent Spark applications from reading Hive tables.

    #执行数据量不大的任务,可在CDH中配置hive用spark引擎跑.按下面修改后,用beeline跑sql就调用的是spark,速度会快些.

    之后跑数据量大的还是改回mapreduce,

     

     

    自己搭建的spark:

    In a CDH deployment, SPARK_HOME defaults to /usr/lib/spark in package installations and /opt/cloudera/parcels/CDH/lib/spark in parcel installations. In a Cloudera Manager deployment, the shells are also available from /usr/bin.

    即CDH中要用spark-shell目录在,
    /opt/cloudera/parcels/CDH/lib/
    或: /usr/bin
    #要在有hdfs权限的帐号下,su - hdfs:
    SPARK_HOME/bin/spark-shell
    SPARK_HOME/bin/pyspark


    scala运行wordcount:
    spark-submit --class com.cloudera.sparkwordcount.SparkWordCount
    --master本地--deploy-mode客户端--executor-memory 1g
    --name wordcount --conf“ spark.app.id = wordcount”
    sparkwordcount-1.0-SNAPSHOT-jar-with-dependencies.jar
    hdfs:// namenode_host:8020 / 路径/至 /inputfile.txt 2

    pyspark运行wordcount:
    spark-submit --master yarn --deploy-mode client --executor-memory 1g
    --name wordcount --conf“ spark.app.id = wordcount” wordcount.py
    hdfs:// namenode_host:8020 / 路径/至 /inputfile.txt 2

    sparksql:
    https://docs.cloudera.com/documentation/enterprise/6/6.2/topics/spark_sparksql.html
    spark-shell
    scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TextFile")
    scala> sqlContext.sql("CREATE TABLE sample_08 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TextFile")

    [0: jdbc:hive2://hostname.com:> show tables;
    +------------+--+
    | tab_name |
    +------------+--+
    | sample_07 |
    | sample_08 |
    +------------+--+


    安装:
    https://spark.apache.org/docs/2.4.4/
    wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

    tar -xzvf spark-2.4.4-bin-hadoop2.7.tgz -C /opt/module/

    vim /etc/profile
    export SPARK_HOME=/opt/module/spark-2.4.4-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

    rename '.template' '' /opt/module/spark-2.4.4-bin-hadoop2.7/conf/*.template ; ll /opt/module/spark-2.4.4-bin-hadoop2.7/conf

    cd /opt/module/spark-2.4.4-bin-hadoop2.7/conf
    vim spark-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64
    export HADOOP_HOME=/hongfeng/software/hadoop-3.1.2
    export HADOOP_CONF_DIR=/hongfeng/software/hadoop-3.1.2/etc/hadoop

    vim spark-defaults.conf
    spark.master spark://oride-dr-algo2:7077
    spark.driver.memory 3g

    vim slaves
    oride-dr-algo3
    oride-dr-algo4
    oride-dr-algo2

    -------------------------------------------------------------------
    在CDH集群中单独装:
    vim spark-env.sh
    export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
    export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    vim spark-defaults.conf
    spark.master spark://zhep-opay-temp-1.novalocal:7077
    spark.driver.memory 3g

    vim slaves
    172.28.18.139 oride-dr-algo3
    172.28.19.172 oride-dr-algo4
    172.28.20.168 oride-dr-algo2

    ---------------------------------------------------------------------

    把/etc/profile和/opt/module/spark-2.4.4-bin-hadoop2.7发送到其它节点3,4

    启动:
    /opt/module/spark-2.4.4-bin-hadoop2.7/sbin/start-all.sh
    停止
    /opt/module/spark-2.4.4-bin-hadoop2.7/sbin/stop-all.sh

    #这表示已经启动
    [root@zhep-opay-temp-1 ~]# /opt/module/spark-2.4.4-bin-hadoop2.7/sbin/start-all.sh
    starting org.apache.spark.deploy.master.Master, logging to /opt/module/spark-2.4.4-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-zhep-opay-temp-1.novalocal.out
    195.189.142.83: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark-2.4.4-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-zhep-opay-temp-2.novalocal.out
    195.189.142.86: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark-2.4.4-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-zhep-opay-temp-3.novalocal.out
    195.189.142.89: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark-2.4.4-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-zhep-opay-temp-1.novalocal.out
    [root@zhep-opay-temp-1 ~]# jps
    #用jps查看,主节点出现一个master进程,从节点是worker进程

    #8080可以看到spark的webui
    http://195.189.142.89:8080/


    #standalone模式提交作业:
    spark-submit --class org.apache.spark.examples.JavaSparkPi
    > --master spark://oride-dr-algo2:7077
    > /opt/module/spark-2.4.4-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.4.jar


    #yarn模式
    官网:https://spark.apache.org/docs/2.4.4/running-on-yarn.html
    https://www.cnblogs.com/qingyunzong/p/8888080.html

    #client模式
    spark-shell --master yarn --deploy-mode client

    #yarn-cluster模式提交spark
    执行Spark自带的示例程序PI
    spark-submit --class org.apache.spark.examples.SparkPi
    > --master yarn
    > --deploy-mode cluster #用yarn cluster模式
    > --driver-memory 500m #指定使用内存
    > --executor-memory 500m
    > --executor-cores 1 #指定使用cpu核数
    > /home/hadoop/apps/spark/examples/jars/spark-examples_2.11-2.3.0.jar
    > 10

    #可以用这个做测试:
    spark-submit --class org.apache.spark.examples.JavaSparkPi
    > --master yarn-cluster
    > /opt/module/spark-2.4.4-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.4.jar

    yarn模式在yarn可看到刚才运行的作业: 

    用yarn模式的日志不像standalone模式,查看8080端口就可以了,需要整合:
    在CDH中:(默认已经是开启)

    日志保留时间:

    #如是单独搭建的hadoop:
    见spark pdf

  • 相关阅读:
    werfault进程使用CPU率高
    oracel 拆分字符串
    TCP TIME WAIT
    netstat 命令
    Java 理论与实践: 并发集合类
    DIV与SPAN之间有什么区别
    oracle超出打开游标的最大数的原因和解决方案
    Quartz表达式
    Axis创建webservice客户端和服务端
    Web服务cxf框架发布2
  • 原文地址:https://www.cnblogs.com/hongfeng2019/p/11621629.html
Copyright © 2020-2023  润新知