• Spark SQL


    1. Introduction to Spark SQL

    Official site: http://spark.apache.org/

    Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and also acts as a distributed SQL query engine.

     

    2. Why learn Spark SQL

    2.1 We have already learned Hive, which translates Hive SQL into MapReduce jobs and submits them to the cluster, greatly reducing the complexity of writing MapReduce programs by hand.

    2.2 However, the MapReduce computation model is slow and the code is verbose, and most developers are used to writing SQL. Spark SQL emerged to fill this gap: it translates SQL into RDD operations and submits them to the cluster, which executes far faster, and Spark SQL can also read data from Hive.
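    As a rough way to see this translation at work, a query's compiled plan can be printed with explain() before it runs on the cluster. The sketch below is illustrative only: the tiny in-memory table and its values are assumptions, not data from this post.

    package SparkSql

    import org.apache.spark.sql.SparkSession

    object ExplainDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ExplainDemo").master("local[1]").getOrCreate()
        import spark.implicits._

        // A tiny in-memory DataFrame stands in for a real table
        val df = Seq((1, "SMITH"), (2, "ALLEN")).toDF("id", "name")
        df.createTempView("emp_demo")

        // explain() prints the plan Spark SQL compiles the query into
        // before executing it as distributed operations on the cluster
        spark.sql("select name from emp_demo where id = 1").explain()

        spark.stop()
      }
    }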

    3. Spark SQL features

    3.1 Integrated: SQL queries can be mixed with Spark programs
    3.2 Uniform data access: the same API for different data sources
    3.3 Hive integration: runs SQL and HiveQL against existing warehouses
    3.4 Standard connectivity: JDBC and ODBC
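    "Uniform data access" means the same DataFrameReader API is used no matter what the source format is. A minimal sketch, assuming local files at the paths shown (only emp.csv appears later in this post; the JSON and Parquet files are hypothetical):

    package SparkSql

    import org.apache.spark.sql.SparkSession

    object UniformAccessDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("UniformAccessDemo").master("local[1]").getOrCreate()

        // The same read API covers different sources; only the method called changes
        val csvDF = spark.read.csv("E:\\emp.csv")             // the CSV file used later in this post
        val jsonDF = spark.read.json("E:\\emp.json")          // hypothetical JSON file
        val parquetDF = spark.read.parquet("E:\\emp.parquet") // hypothetical Parquet file

        // All three results are DataFrames and can be registered and queried the same way
        csvDF.show()

        spark.stop()
      }
    }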

    4. The DataFrame concept

    (1) A DataFrame is a dataset organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.

    (2) A DataFrame can be constructed from a wide range of sources, such as structured data files, Hive tables, external databases, or existing RDDs.

    (3) The DataFrame API is available in Scala, Java, Python, and R.

     

    (4) Compared with an RDD, which is a distributed collection of Java objects, a DataFrame additionally carries structural information about the data, i.e. the schema (supplying a schema is one way of creating a DataFrame, described in detail later).

     

    (5) A DataFrame is a distributed collection of Row objects. Besides offering richer operators than an RDD, its more important advantages are improved execution efficiency, reduced data reads, and execution-plan optimization. The two fragments below show a line of text being turned into a Row, and an RDD being converted to a DataFrame with named columns:

     

     // lines is an RDD[String] (e.g. the emp.csv lines shown in section 6); each line becomes a Row
     val row = lines.map(_.split(",")).map(tp => Row(tp(0).toInt, tp(1), tp(2), tp(3), tp(4).toDouble, tp(5).toDouble, tp(6).toInt))

     

     // unit is an RDD[String]; toDF requires importing spark.implicits._ and names the columns
     val frame: DataFrame = unit.map(tp => {
       val splits = tp.split(" ")
       val id = splits(0).toInt
       val name = splits(1)
       val age = splits(2).toInt
       val result = splits(3).toInt
       (id, name, age, result)
     }).toDF("id", "name", "age", "result")

     

    5. The Dataset concept

     

    (1) A Dataset is a distributed collection of data. It is a new interface added in Spark 1.6 and a higher-level abstraction on top of DataFrame.

     

    (2) A Dataset combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine.

     

    (3) A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, etc.).

     

    (4) The Dataset API is supported in Scala and Java; Python does not support the Dataset API. A minimal Scala sketch follows.
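    The sketch below builds a strongly typed Dataset from a case class (the Person class and its values are illustrative assumptions, not data from this post):

    package SparkSql

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Person(id: Int, name: String, age: Int)

    object DatasetDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("DatasetDemo").master("local[1]").getOrCreate()
        import spark.implicits._

        // Construct a strongly typed Dataset from JVM objects
        val ds: Dataset[Person] = Seq(Person(1, "SMITH", 30), Person(2, "ALLEN", 25)).toDS()

        // Functional transformations keep the static type; Spark SQL's optimized engine still applies
        ds.filter(p => p.age > 26).map(p => p.name).show()

        spark.stop()
      }
    }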

    6. Creating a DataFrame

    Sample data: emp.csv

    7369,SMITH,CLERK,1980/12/17,1800,200,20
    7499,ALLEN,SALESMAN,1981/2/20,1600,300,30
    7521,WARD,SALESMAN,1981/2/22,1250,500,30
    7566,JONES,MANAGER,1981/4/2,2975,900,20
    7654,MARTIN,SALESMAN,1981/9/28,1250,1400,30
    7698,BLAKE,MANAGER,1981/5/1,2850,122,30
    7782,CLARK,MANAGER,1981/6/9,2450,344,10
    7788,SCOTT,ANALYST,1987/4/19,3000,555,20
    7839,KING,PRESIDENT,1981/11/17,5000,6666,10
    7844,TURNER,SALESMAN,1981/9/8,1500,0,30
    7876,ADAMS,CLERK,1987/5/23,1100,777,20
    7900,JAMES,CLERK,1981/12/3,1950,888,30
    7902,FORD,ANALYST,1981/12/3,3000,555,20
    7934,MILLER,CLERK,1982/1/23,1300,57,10
    7950,ZHAOYI,NANYOU,1980/1/1,1000,86,10

     

    Column descriptions:

    1. Employee number: id
    2. Name: name
    3. Job title: job
    4. Hire date: date
    5. Salary: sale
    6. Bonus: money
    7. Department number: number

     

    Method 1: SQLContext (the entry point used before Spark 2.0)

     

    package SparkSql

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.{SparkConf, SparkContext}

    object SqlDemo01 {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[1]")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        // Read the CSV file into a DataFrame and name its columns
        val lines: DataFrame = sqlContext.read.csv("E:\\emp.csv").toDF("id", "name", "job", "date", "sale", "money", "number")
        // Register a temporary view
        lines.createTempView("test01")
        // Run a SQL query against the view
        sqlContext.sql("select id from test01").limit(10).show()
        sc.stop()
      }
    }

     

     

     Method 2: SparkSession (Spark 2.0 and later)

    package SparkSql
    
    import org.apache.spark.sql.{DataFrame, SparkSession}
    
    object SqlDemo02 {
      def main(args: Array[String]): Unit = {
        val session = SparkSession.builder ().appName ( this.getClass.getSimpleName ).master ( "local[1]" ).getOrCreate ()
        // Data source: read the CSV and name its columns
        val data: DataFrame = session.read.csv("E:\\emp.csv").toDF("id", "name", "job", "date", "sale", "money", "number")
        // Register a temporary view
        data.createTempView("test01")
        // Run a SQL query against the view
        session.sql("select * from test01").show()
        session.stop ()
    
      }
    }

     Method 3: StructType (explicit schema)

    package SparkSql
    
    
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    
    object SqlDemo03 {
      def main(args: Array[String]): Unit = {
        val session = SparkSession.builder ().appName ( this.getClass.getSimpleName ).master ( "local[1]" ).getOrCreate ()
    
        // Data source: read the raw CSV lines as an RDD[String]
        val lines: RDD[String] = session.sparkContext.textFile("E:\\emp.csv")

        // Build a Row from each CSV line
        val row = lines.map(tp => {
          val splits = tp.split(",")
          Row(splits(0).toInt, splits(1), splits(2), splits(3), splits(4).toDouble, splits(5).toDouble, splits(6).toDouble)
        })

        // Define the schema explicitly with StructType
        val schema = StructType(
          List(
            StructField("id", IntegerType),
            StructField("name", StringType),
            StructField("job", StringType),
            StructField("date", StringType),
            StructField("sale", DoubleType),
            StructField("money", DoubleType),
            StructField("number", DoubleType)))

        val frame: DataFrame = session.createDataFrame(row, schema)
        // Register a temporary view
        frame.createTempView("test01")

        // Run a SQL query
        session.sql("select id,name from test01").show()

        session.stop()
      }
    }

     Method 4: case class (the case class acts as the table schema)

    package SparkSql
    
    import org.apache.spark.sql.{DataFrame, SparkSession}
    
    object CaseDemo {
      /**
        * Create a DataFrame via a case class.
        */
      def main(args: Array[String]): Unit = {
        val session = SparkSession.builder ().appName ( this.getClass.getSimpleName ).master ( "local[1]" ).getOrCreate ()
    
        // Read the data and split each line on commas
        val unit = session.sparkContext.textFile("E:\\emp.csv").map(tp => tp.split(","))

        // To convert an RDD to a DataFrame we must import the implicit conversions
        import session.implicits._
        val frame: DataFrame = unit.map(tp => Casewords(tp(0).toInt, tp(1), tp(2), tp(3), tp(4).toDouble, tp(5).toDouble, tp(6).toDouble)).toDF()
        frame.createTempView ( "test01" )
    
        session.sql (
          """
            |select *
            |from test01
          """.stripMargin
        ).show()

        session.stop()
      }
    }
    
    case class Casewords(
                          id: Int,
                          name: String,
                          job: String,
                          date: String,
                          sale: Double,
                          money: Double,
                          number: Double)

    7. Saving data to MySQL

    package SparkSql
    
    
    import java.util.Properties
    
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    
      object SqlDemo03 {
        def main(args: Array[String]): Unit = {
          val session = SparkSession.builder ().appName ( this.getClass.getSimpleName ).master ( "local[*]" ).getOrCreate ()
    
          // Data source: read the raw CSV lines as an RDD[String]
          val lines: RDD[String] = session.sparkContext.textFile("E:\\emp.csv")

          // Build a Row from each CSV line
          val row = lines.map(tp => {
            val splits = tp.split(",")
            Row(splits(0).toInt, splits(1), splits(2), splits(3), splits(4).toDouble, splits(5).toDouble, splits(6).toDouble)
          })
    
          // Define the schema explicitly with StructType
          val schema = StructType(
            List(
              StructField("id", IntegerType),
              StructField("name", StringType),
              StructField("job", StringType),
              StructField("date", StringType),
              StructField("sale", DoubleType),
              StructField("money", DoubleType),
              StructField("number", DoubleType)))
    
    
          val frame: DataFrame = session.createDataFrame(row, schema)
          // Register a temporary view
          frame.createTempView("jobs")

          // Run a SQL query
          val resultDF: DataFrame = session.sql("select id,name from jobs")

          // Save the result to a MySQL table
          // Create a Properties object holding the MySQL username and password
          val pro = new Properties()
          pro.setProperty("user", "root")
          pro.setProperty("password", "123456")
          resultDF.write.jdbc("jdbc:mysql://localhost:3306/test1?characterEncoding=UTF-8&serverTimezone=GMT%2B8", "jobs", pro)
          // To append to an existing table instead, use a save mode (requires importing org.apache.spark.sql.SaveMode):
          // resultDF.write.mode(SaveMode.Append).jdbc("jdbc:mysql://localhost:3306/test1?characterEncoding=UTF-8&serverTimezone=GMT%2B8", "jobs1", pro)

          // Also write the result as JSON; coalesce(1) merges the output into a single partition/file
          resultDF.coalesce(1).write.json("E:\\jobs")

          session.stop()
        }
      }
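     As a quick sanity check, the same JDBC connection settings can be used to read the table back into a DataFrame. This is a minimal sketch, not part of the original post, and it assumes the MySQL JDBC driver is on the classpath:

    package SparkSql

    import java.util.Properties

    import org.apache.spark.sql.SparkSession

    object JdbcReadDemo {
      def main(args: Array[String]): Unit = {
        val session = SparkSession.builder().appName("JdbcReadDemo").master("local[*]").getOrCreate()

        // Same connection properties that were used for the write above
        val pro = new Properties()
        pro.setProperty("user", "root")
        pro.setProperty("password", "123456")

        // Read the "jobs" table back and print it
        val jobsDF = session.read.jdbc("jdbc:mysql://localhost:3306/test1?characterEncoding=UTF-8&serverTimezone=GMT%2B8", "jobs", pro)
        jobsDF.show()

        session.stop()
      }
    }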
  • Original post: https://www.cnblogs.com/wangshuang123/p/11089192.html