• [imooc (慕课网) hands-on course] Part 9: Entering the world of big-data Spark SQL, with imooc log analysis as the example


    Ad hoc queries
    Regular queries

    Load Data
    1) Load into: RDD or DataFrame/Dataset
    2) Load from: local files or the cloud (HDFS/S3)

    Load the data as an RDD
    val masterLog = sc.textFile("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-hadoop001.out")
    val workerLog = sc.textFile("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop001.out")
    val allLog = sc.textFile("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/logs/*out*")

    masterLog.count
    workerLog.count
    allLog.count
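
    With only an RDD, any question beyond a plain count has to be written as a chain of transformations. The following is a minimal sketch of counting lines per log level. The sample lines, the names sampleLog and levelCounts, and the position of the level token are assumptions for illustration, not the content of the real log files. In spark-shell, `spark` and `sc` already exist and getOrCreate() reuses them.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Local session so the sketch is self-contained; spark-shell already provides one.
    val spark = SparkSession.builder().master("local[*]").appName("rdd-log-demo").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample lines standing in for the master/worker log files.
    val sampleLog = sc.parallelize(Seq(
      "17/03/01 10:00:01 INFO Master: Starting Spark master",
      "17/03/01 10:00:02 WARN Utils: Your hostname resolves to a loopback address",
      "17/03/01 10:00:03 INFO Master: I have been elected leader",
      "17/03/01 10:00:04 ERROR Master: Failed to bind"
    ))

    // Count lines per log level using RDD operations only (no SQL involved).
    // Assumption: the level is the third space-separated token of each line.
    val levelCounts = sampleLog
      .map(_.split(" "))
      .filter(_.length > 2)
      .map(parts => (parts(2), 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

    levelCounts.foreach { case (level, n) => println(s"$level: $n") }
    ```

    This works, but every new question needs new transformation code, which is what motivates querying the same data with SQL.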

    Open problem: how can we query this data with SQL?

    import org.apache.spark.sql.Row
    val masterRDD = masterLog.map(x => Row(x))
    import org.apache.spark.sql.types._
    val schemaString = "line"

    val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)

    val masterDF = spark.createDataFrame(masterRDD, schema)
    masterDF.show
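
    Once the lines are in a DataFrame, they can be registered as a temporary view and queried with SQL. A minimal sketch under stated assumptions: the sample lines, the view name master_log and the variable names are hypothetical, and the real masterLog RDD would take the place of `lines`.

    ```scala
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Local session so the sketch is self-contained; spark-shell already provides one.
    val spark = SparkSession.builder().master("local[*]").appName("log-sql-demo").getOrCreate()

    // Hypothetical sample lines standing in for the master log.
    val lines = spark.sparkContext.parallelize(Seq(
      "17/03/01 10:00:01 INFO Master: Starting Spark master",
      "17/03/01 10:00:03 INFO Master: I have been elected leader",
      "17/03/01 10:00:04 ERROR Master: Failed to bind"
    ))

    // Same RDD[Row] + schema approach as above: one string column named "line".
    val rowRDD = lines.map(x => Row(x))
    val schema = StructType(Seq(StructField("line", StringType, nullable = true)))
    val logDF = spark.createDataFrame(rowRDD, schema)

    // Register the DataFrame under a name that SQL statements can refer to.
    logDF.createOrReplaceTempView("master_log")

    // COUNT(*) comes back as a Long in the first column of the single result row.
    val infoCount = spark
      .sql("SELECT COUNT(*) AS n FROM master_log WHERE line LIKE '% INFO %'")
      .first()
      .getLong(0)

    println(s"INFO lines: $infoCount")
    ```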

    JSON/Parquet
    val usersDF = spark.read.format("parquet").load("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet")
    usersDF.show

    spark.sql("select * from parquet.`file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet`").show
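
    The two access paths above (the reader API and SQL run directly on a file) can be tried without the bundled example file. The sketch below writes a small Parquet dataset to a temporary directory and reads it back both ways. The rows, the column names and the temporary location are assumptions chosen for illustration.

    ```scala
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession

    // Local session so the sketch is self-contained; spark-shell already provides one.
    val spark = SparkSession.builder().master("local[*]").appName("parquet-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical rows; the column names are assumptions for this sketch.
    val users = Seq(("Alyssa", "red"), ("Ben", "blue")).toDF("name", "favorite_color")

    // Write to a fresh temporary location, expressed as a file:// URI.
    val parquetUri = Files.createTempDirectory("users-parquet").resolve("users.parquet").toUri.toString
    users.write.parquet(parquetUri)

    // Path 1: the DataFrame reader.
    val viaReader = spark.read.format("parquet").load(parquetUri)
    viaReader.show()

    // Path 2: SQL directly on the files, without registering a view first.
    val viaSql = spark.sql(s"SELECT name FROM parquet.`$parquetUri` WHERE favorite_color = 'red'")
    val redFans = viaSql.collect().map(_.getString(0)).toSeq
    println(redFans)
    ```

    Parquet stores the schema with the data, so neither path needs a StructType to be declared by hand.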

    Drill: a big data processing framework

    Reading data from the cloud: HDFS/S3
    val hdfsRDD = sc.textFile("hdfs://path/file")
    val s3RDD = sc.textFile("s3a://bucket/object")
    S3 URI schemes: s3a (the current Hadoop S3 connector) / s3n (the older connector)

    spark.read.format("text").load("hdfs://path/file")
    spark.read.format("text").load("s3a://bucket/object")
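
    The reader call is the same for every storage system; only the URI scheme changes. For s3a, credentials also have to reach the Hadoop configuration. The following is a sketch under assumptions: it takes credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if they are set, it assumes the hadoop-aws module is on the classpath when an s3a URI is actually read, and readText is a hypothetical helper name.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Local session so the sketch is self-contained; spark-shell already provides one.
    val spark = SparkSession.builder().master("local[*]").appName("cloud-read-demo").getOrCreate()
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Pass S3A credentials through only if both environment variables exist;
    // nothing is hard-coded here.
    for {
      accessKey <- sys.env.get("AWS_ACCESS_KEY_ID")
      secretKey <- sys.env.get("AWS_SECRET_ACCESS_KEY")
    } {
      hadoopConf.set("fs.s3a.access.key", accessKey)
      hadoopConf.set("fs.s3a.secret.key", secretKey)
    }

    // Hypothetical helper: one reader for file://, hdfs:// and s3a:// URIs.
    // The text source yields a DataFrame with a single string column named "value".
    def readText(uri: String): DataFrame = spark.read.format("text").load(uri)

    // Usage with the placeholder URIs from above (not executed in this sketch):
    //   readText("hdfs://path/file")
    //   readText("s3a://bucket/object")
    ```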

    val df = spark.read.format("json").load("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")

    df.show
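
    With JSON, Spark infers the schema from the records themselves, and a field missing from a record becomes null. A minimal sketch, assuming two hypothetical records written to a temporary file in one-object-per-line form (the layout the json source reads by default); the view name people and the variable names are also assumptions.

    ```scala
    import java.nio.charset.StandardCharsets
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession

    // Local session so the sketch is self-contained; spark-shell already provides one.
    val spark = SparkSession.builder().master("local[*]").appName("json-demo").getOrCreate()

    // Hypothetical records, one JSON object per line; the first has no "age" field.
    val jsonPath = Files.createTempFile("people", ".json")
    val records = Seq("""{"name":"Michael"}""", """{"name":"Andy","age":30}""")
    Files.write(jsonPath, records.mkString("\n").getBytes(StandardCharsets.UTF_8))

    // Schema is inferred from the data; no StructType is declared.
    val peopleDF = spark.read.format("json").load(jsonPath.toUri.toString)
    peopleDF.printSchema()

    // Query the inferred columns with SQL; rows with a null age do not match the filter.
    peopleDF.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
      .collect()
      .map(_.getString(0))
      .toSeq

    println(adults)
    ```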

    TPC-DS

    spark-packages.org

  • Original post: https://www.cnblogs.com/kkxwz/p/8493777.html