Spark Streaming: Real-Time Processing of Kafka Data


    Writing a Spark Streaming program in Python to process Kafka data in real time requires some familiarity with how Spark works and with the basics of Kafka.

    1 Configure the Spark Environment for Kafka

    First, download spark-streaming-kafka, the library that connects Spark to Kafka. Then move the downloaded jar into the /opt/spark/spark-2.4.0-bin-hadoop2.7/jars directory with the following command:

    sudo mv ~/下载/spark-streaming-kafka-0-10_2.11-2.4.0.jar /opt/spark/spark-2.4.0-bin-hadoop2.7/jars
    

    Next, create a kafka directory under /opt/spark/spark-2.4.0-bin-hadoop2.7/jars and copy all the libraries from /opt/kafka/kafka_2.11-0.10.2.2/libs into /opt/spark/spark-2.4.0-bin-hadoop2.7/jars/kafka, as shown below.
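
    For example, using the same paths as above:

    sudo mkdir /opt/spark/spark-2.4.0-bin-hadoop2.7/jars/kafka
    sudo cp /opt/kafka/kafka_2.11-0.10.2.2/libs/* /opt/spark/spark-2.4.0-bin-hadoop2.7/jars/kafka/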

    Add the paths of the Kafka-related jars to spark-env.sh; after the change, spark-env.sh should look something like this:

    export SPARK_DIST_CLASSPATH=$(/opt/hadoop/hadoop-2.7.6/bin/hadoop classpath):/opt/spark/spark-2.4.0-bin-hadoop2.7/jars/kafka/*:/opt/kafka/kafka_2.11-0.10.2.2/libs/*
    

    Since I am using Python 3 while Spark defaults to Python 2, here is how to set the Python environment for Spark.
    Two places need to be changed. The first is spark-env.sh in the conf directory: add the following at the top of the file:

    export PYSPARK_PYTHON=/root/anaconda3/bin/python3.7
    

    Here python3.7 is the version installed on my machine; in general, any Python 3 version should work for this exercise.
    The second place to modify is the bin/pyspark script under the Spark installation directory (here /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/pyspark). Locate the following section in that file:

    # Determine the Python executable to use for the executors:
    if [[ -z "$PYSPARK_PYTHON" ]]; then
      if [[ $PYSPARK_DRIVER_PYTHON == *ipython* && ! $WORKS_WITH_IPYTHON ]]; then
        echo "IPython requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON" 1>&2
        exit 1
      else
        PYSPARK_PYTHON=python
      fi
    fi
    export PYSPARK_PYTHON
    

    Change PYSPARK_PYTHON=python above to PYSPARK_PYTHON=python3.7.
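
    To check that pyspark now picks up Python 3, start the interactive shell; the startup banner should report something like "Using Python version 3.7.x" (the exact string depends on your installation):

    cd /opt/spark/spark-2.4.0-bin-hadoop2.7/bin
    ./pyspark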

    Also place spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar in /opt/kafka/kafka_2.11-0.10.2.2/libs. The Python API (KafkaUtils.createStream, used below) comes from the spark-streaming-kafka-0-8 integration, so this assembly jar is the one passed to spark-submit later.

    2 Create the PySpark Project

    Save the following code as sparkstreaming_kafka.py in /opt/spark/spark-2.4.0-bin-hadoop2.7/bin:

    #sparkstreaming_kafka.py
    
    """
    Created on Mon Mar 16 14:23:45 2020
    
    @author: yoyoyo
    """
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
    
    def start():
        # local[2]: at least two threads are needed, one for the Kafka receiver
        # and one for processing; batch interval is 2 seconds
        sc = SparkContext(master="local[2]", appName="PythonSparkStreaming")
        ssc = StreamingContext(sc, 2)
        # ZooKeeper quorum used by the Kafka 0.8-style receiver
        zkQuorum = 'xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181'
        # topic name -> number of receiver threads
        topic = {'test': 1}
        groupid = "testConsumer"
        kvs = KafkaUtils.createStream(ssc, zkQuorum, groupid, topic)
        # each record is a (key, value) pair; keep only the message value
        lines = kvs.map(lambda x: x[1])
        # split each line into words and count occurrences per batch
        words = lines.flatMap(lambda line: line.split(" "))
        pairs = words.map(lambda word: (word, 1))
        wordCount = pairs.reduceByKey(lambda a, b: a + b)
        wordCount.pprint()
    
        ssc.start()
        ssc.awaitTermination()
    
    if __name__ == '__main__':
        start()
    

    3 Run the Program
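
    Before submitting the job, Kafka and its ZooKeeper ensemble must be running, and the test topic must exist. If it does not, it can be created with the standard Kafka command-line tool (replace the placeholder ZooKeeper address with your own quorum):

    /opt/kafka/kafka_2.11-0.10.2.2/bin/kafka-topics.sh --create --zookeeper xx.xx.xx.xx:2181 --replication-factor 1 --partitions 1 --topic test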

    (base) root@node3:/opt/spark/spark-2.4.0-bin-hadoop2.7/bin# spark-submit --jars /opt/kafka/kafka_2.11-0.10.2.2/libs/spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar sparkstreaming_kafka.py 2>error.log
    

    To save all output to the error.log file and run the program in the background, execute:

    (base) root@node3:/opt/spark/spark-2.4.0-bin-hadoop2.7/bin# spark-submit --jars /opt/kafka/kafka_2.11-0.10.2.2/libs/spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar sparkstreaming_kafka.py > error.log 2>&1 &
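
    Once the job is running, feed it some test data with Kafka's console producer (the broker address is a placeholder; 9092 is Kafka's default broker port):

    /opt/kafka/kafka_2.11-0.10.2.2/bin/kafka-console-producer.sh --broker-list xx.xx.xx.xx:9092 --topic test

    Type a few lines of text into the producer; since both stdout and stderr were redirected, the per-batch word counts printed by pprint() can be followed with tail -f error.log.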
    
Original article: https://www.cnblogs.com/eugene0/p/12549500.html