• Hadoop Series Notes --- PySpark


    【Conversions between PySpark RDD, DataFrame, and Pandas】

    Official docs: https://spark.apache.org/docs/latest/rdd-programming-guide.html

     1.  RDD => DataFrame

     2.  RDD => Pandas

     3.  DataFrame => RDD

     4.  DataFrame => Pandas

     5.  Pandas => DataFrame
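
    A minimal sketch covering the five conversions listed above, assuming a local SparkSession; the column names and sample data are illustrative:

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.master("local[*]").appName("conversions").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([Row(name="a", age=1), Row(name="b", age=2)])

    # 1. RDD => DataFrame
    df = spark.createDataFrame(rdd)

    # 2. RDD => Pandas (the simplest route is via a DataFrame)
    pdf_from_rdd = spark.createDataFrame(rdd).toPandas()

    # 3. DataFrame => RDD (an RDD of Row objects)
    rdd_from_df = df.rdd

    # 4. DataFrame => Pandas
    pdf = df.toPandas()

    # 5. Pandas => DataFrame
    df_from_pdf = spark.createDataFrame(pdf)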

    【PySpark and HBase Interaction】

      JAR dependency: spark-examples_2.11-2.4.0.jar

     References:

           PySpark reading and writing HBase

           Spark reading and writing HBase data (Scala)

           HBase filters (these can be translated into the corresponding JVM-side filters where needed)

      spark-examples source code: https://github.com/apache/spark/tree/branch-2.3

     1. read

    # Read from HBase via newAPIHadoopRDD (assumes `sc`, `hosts`, `table` and `columns` are already defined)
    conf = {
        "inputFormatClass": "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "keyClass": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "valueClass": "org.apache.hadoop.hbase.client.Result",
        "keyConverter": "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        "valueConverter": "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        "conf": {
            "hbase.zookeeper.quorum": hosts,
            "hbase.mapreduce.inputtable": table,
            "hbase.mapreduce.scan.columns": columns,
        }
    }
    hbase_rdd = sc.newAPIHadoopRDD(**conf)
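
    A sketch of consuming hbase_rdd from the read above; it assumes (based on how HBaseResultToStringConverter serialises results) that each record is a (rowkey, value) pair whose value holds one JSON string per cell, joined by newlines:

    import json

    def parse_cells(record):
        # value looks like '{"qualifier":"name","value":"Alice",...}\n{...}'
        rowkey, value = record
        return rowkey, [json.loads(cell) for cell in value.split("\n")]

    parsed = hbase_rdd.map(parse_cells)
    print(parsed.take(3))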

     2. write

    # Write to HBase via saveAsNewAPIHadoopDataset (assumes `hosts` and `table` are defined);
    # note this method takes no "path" argument, only conf/keyConverter/valueConverter
    conf = {
        "keyConverter": "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter",
        "valueConverter": "org.apache.spark.examples.pythonconverters.StringListToPutConverter",
        "conf": {
            "hbase.zookeeper.quorum": hosts,
            "hbase.mapred.outputtable": table,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"
        }
    }
    rdd.saveAsNewAPIHadoopDataset(**conf)
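
    For reference, a sketch of building an RDD in the shape this write expects before calling rdd.saveAsNewAPIHadoopDataset(**conf) as above: with StringToImmutableBytesWritableConverter and StringListToPutConverter each element should be a (rowkey, [rowkey, column_family, qualifier, value]) pair of strings (the sample rows below are illustrative):

    raw = ["r1,info,name,Alice", "r2,info,name,Bob"]   # rowkey,column_family,qualifier,value
    rdd = sc.parallelize(raw).map(lambda line: (line.split(",")[0], line.split(",")))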

    【PySpark and Elasticsearch Interaction】

      JAR dependency: elasticsearch-hadoop-6.4.1.jar

      Official docs: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html

      References:

            Spark reading and writing Elasticsearch

            Accessing Elasticsearch data from PySpark

     1. read

    # Read from Elasticsearch via newAPIHadoopRDD (assumes `sc`, `hosts`, `_index`, `_type` and `query` are defined)
    conf = {
        "inputFormatClass": "org.elasticsearch.hadoop.mr.EsInputFormat",
        "keyClass": "org.apache.hadoop.io.NullWritable",
        "valueClass": "org.elasticsearch.hadoop.mr.LinkedMapWritable",
        "conf": {
            "es.nodes": hosts,
            "es.output.json": "true",
            "es.resource": f"{_index}/{_type}",
            "es.query": query
        }
    }
    es_rdd = sc.newAPIHadoopRDD(**conf)   # each record is read back as a pair like ("", "{...}")
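
    A sketch of unpacking the pairs returned above: with "es.output.json": "true" each record arrives as (key, json_string), matching the ("", "{}") note, so the payload can be decoded directly:

    import json

    docs = es_rdd.map(lambda kv: json.loads(kv[1]))   # keep only the decoded JSON document
    print(docs.take(5))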

     2. write

    # Write to Elasticsearch via saveAsNewAPIHadoopFile (assumes the variables below are defined)
    conf = {
        "path": "-",
        "outputFormatClass": "org.elasticsearch.hadoop.mr.EsOutputFormat",
        "keyClass": "org.apache.hadoop.io.NullWritable",
        "valueClass": "org.elasticsearch.hadoop.mr.LinkedMapWritable",
        "conf": {
            "es.nodes": hosts,
            "es.resource": f"{_index}/{_type}",
            "es.input.json": "true",
            "es.index.auto.create": index_auto_create,  # whether to create the index automatically
            "es.mapping.id": None if not mapping_id or index_auto_create else mapping_id
        }
    }
    rdd.saveAsNewAPIHadoopFile(**conf)   # each record to write must be a pair like ("", "{...}")
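
    A sketch of preparing records in the ("", json_string) shape required above (assuming `sc` is available; the documents are illustrative):

    import json

    docs = [{"id": 1, "title": "hello"}, {"id": 2, "title": "world"}]
    rdd = sc.parallelize(docs).map(lambda doc: ("", json.dumps(doc)))
    # then call rdd.saveAsNewAPIHadoopFile(**conf) as above
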
     3. Reading and writing data with Spark SQL

    Reference: https://blog.icocoro.me/2018/03/07/1803-pyspark-elasticsearch/ (accessing Elasticsearch data from PySpark); a sketch follows below.
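
    A minimal sketch of the Spark SQL route through the elasticsearch-hadoop connector (format "org.elasticsearch.spark.sql"), assuming an existing SparkSession `spark` with elasticsearch-hadoop-6.4.1.jar on its classpath; the node address and index/type names are illustrative:

    df = (spark.read.format("org.elasticsearch.spark.sql")
          .option("es.nodes", "127.0.0.1:9200")
          .option("es.resource", "my_index/my_type")
          .load())
    df.show()

    (df.write.format("org.elasticsearch.spark.sql")
        .option("es.nodes", "127.0.0.1:9200")
        .option("es.resource", "my_index_copy/my_type")
        .mode("append")
        .save())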

    【PySpark and MongoDB Interaction】

       JAR dependencies: mongo-java-driver-3.6.2.jar, mongo-hadoop-core-2.0.2.jar,

                     [spark-core_2.11-2.3.4.jar, spark-sql_2.11-2.3.4.jar]

       Reference: accessing MongoDB data from PySpark

       The spark.sql approach is the simplest to configure:

    from pyspark.sql import SparkSession

    # input_collection, pipeline, sql and output_collection must be defined beforehand
    input_uri = "mongodb://127.0.0.1:27017/spark.spark_test"
    output_uri = "mongodb://127.0.0.1:27017/spark.spark_test"
    spark = SparkSession \
        .builder \
        .master("local") \
        .appName("MyApp") \
        .config("spark.mongodb.input.uri", input_uri) \
        .config("spark.mongodb.output.uri", output_uri) \
        .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
        .getOrCreate()

    df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("collection", input_collection) \
        .option("pipeline", pipeline).load()

    df.printSchema()
    df.registerTempTable("test_collection")
    test_df = spark.sql(sql)  # run a SQL statement against the temp table
    # valid values for the mode argument:
    # * `append`: Append contents of this :class:`DataFrame` to existing data.
    # * `overwrite`: Overwrite existing data.
    # * `error` or `errorifexists`: Throw an exception if data already exists.
    # * `ignore`: Silently ignore this operation if data already exists.

    df.write.format("com.mongodb.spark.sql.DefaultSource").option("collection", output_collection).mode("append").save()
    spark.stop()

     The mongo-hadoop approach below raised exceptions in testing; see the official docs for details: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage

     1. read

    conf = {
        'inputFormatClass': 'com.mongodb.hadoop.MongoInputFormat',
        'keyClass': 'org.apache.hadoop.io.Text',
        'valueClass': 'org.apache.hadoop.io.MapWritable',
        'conf': {
            'mongo.input.uri': 'mongodb://localhost:27017/db.collection',
            'mongo.input.query': query,
            'mongo.input.split.create_input_splits': 'false'
        }
    }
    mongo_rdd = sc.newAPIHadoopRDD(**conf)

     2. write

    conf = {
        'path': '-',
        'outputFormatClass': 'com.mongodb.hadoop.MongoOutputFormat',
        'keyClass': 'org.apache.hadoop.io.Text',
        'valueClass': 'org.apache.hadoop.io.MapWritable',
        'conf': {
            'mongo.output.uri': 'mongodb://localhost:27017/output.collection'
        }
    }
    rdd.saveAsNewAPIHadoopFile(**conf)

    Configuring Spark to access HBase

    Fixing java.lang.ClassNotFoundException: org.apache.hadoop.hbase.io.ImmutableBytesWritable

    The root cause is the JAR packaging:

    the spark-examples_2.11-2.3.4.jar shipped with Spark does not contain org.apache.hadoop.hbase.io.ImmutableBytesWritable,

    so a JAR that does contain this class is needed; recommended: spark-examples_2.11-1.6.0-typesafe-001.jar.

    Also set:

    export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):$(/usr/local/hbase/bin/hbase classpath):/usr/local/spark/jars/*

    JAR download: https://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.11/1.6.0-typesafe-001

     

    Alternative approach:

    Reference: https://blog.csdn.net/weixin_39594447/article/details/86696182

    1. Prepare the HBase classpath files
    mkdir $SPARK_HOME/jars/hbase/
    cp -rf $HBASE_HOME/lib/hbase* $SPARK_HOME/jars/hbase
    cp $HBASE_HOME/lib/guava* $SPARK_HOME/jars/hbase
    cp $HBASE_HOME/lib/htrace* $SPARK_HOME/jars/hbase
    cp $HBASE_HOME/lib/protobuf-java* $SPARK_HOME/jars/hbase
    cp spark-examples_2.11-1.6.0-typesafe-001.jar $SPARK_HOME/jars/hbase
    2. Set the environment
    export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):$(/usr/local/hbase/bin/hbase classpath):/usr/local/spark/jars/hbase/*

    Common Spark objects in PySpark / Java / Scala interop

    Reference: https://blog.csdn.net/yolohohohoho/article/details/89811462

    SparkContext
    If your Scala code needs access to the SparkContext (sc), the Python code must pass sc._jsc, and the Scala method should accept a JavaSparkContext parameter and unwrap it into a Scala SparkContext:
    import org.apache.spark.api.java.JavaSparkContext
    def method(jsc: JavaSparkContext) = {
      val sc = JavaSparkContext.toSparkContext(jsc)
    }
    For the SQLContext, the Scala SQLContext can be passed from Python by sending sqlContext._ssql_ctx; it can be used directly on the Scala side. A Python-side sketch of both calls follows below.
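
    A Python-side sketch of both calls; the Scala object com.example.Helper and its methods are hypothetical, only the sc._jsc / sqlContext._ssql_ctx plumbing comes from the text above:

    helper = sc._jvm.com.example.Helper         # reach the (hypothetical) Scala object via Py4J
    helper.method(sc._jsc)                      # hand over the JavaSparkContext
    helper.methodWithSql(sqlContext._ssql_ctx)  # hand over the Scala SQLContext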

    RDDs
    You can pass them from Python to Scala via rdd._jrdd. On the Scala side, the underlying RDD can be extracted from the JavaRDD (jrdd) by accessing jrdd.rdd. To convert it back on the Python side, use:
    from pyspark.rdd import RDD
    pythonRDD = RDD(jrdd, sc)

    To send a DataFrame (df) from Python, pass its df._jdf attribute.

    To return a Scala DataFrame to Python, convert it on the Python side with:

    from pyspark.sql import DataFrame
    pythonDf = DataFrame(jdf, sqlContext)

    DataFrames can also be shared by registering them with registerTempTable and looking them up through the sqlContext, as sketched below.
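
    A small sketch of that table-based route (the table name is illustrative): register the DataFrame under a name in Python, then look it up on the JVM side through the shared SQLContext.

    df.registerTempTable("shared_df")   # Python side
    # Scala side, for reference: val sharedDf = sqlContext.table("shared_df")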

     
    Example code repository: https://github.com/jiangsiwei2018/BigData.git