• SparkSQL+Hive


    My setup here is Spark 3.0.1 + Hive 3.1.2.

    First, start the Hive metastore -- on node2:

    nohup /usr/local/hive/bin/hive --service metastore &

    jps
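
    jps should show a RunJar process for the metastore. Optionally, confirm it is listening on the default metastore port 9083 (a sketch; netstat requires the net-tools package on some distributions):

    netstat -lnpt | grep 9083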

    Copy Hive's configuration file hive-site.xml into the spark/conf directory, and upload the MySQL driver jar into spark/jars -- on node1. (You can also distribute the config file and jar to other machines so SparkSQL can operate on Hive from there; see the sketch below.)
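
    For example, on node1 (a sketch assuming the install paths used in this post; the driver jar name is illustrative -- use the version matching your MySQL):

    cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf/
    cp mysql-connector-java-5.1.38.jar /usr/local/spark/jars/
    # optionally distribute both to another node, e.g. node2, to run SparkSQL there too
    scp /usr/local/spark/conf/hive-site.xml node2:/usr/local/spark/conf/
    scp /usr/local/spark/jars/mysql-connector-java-5.1.38.jar node2:/usr/local/spark/jars/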

    My hive-site.xml:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>123456</value>
        </property>
        <property>
            <name>datanucleus.readOnlyDatastore</name>
            <value>false</value>
        </property>
        <property>
            <name>datanucleus.fixedDatastore</name>
            <value>false</value>
        </property>
        <property>
            <name>datanucleus.autoCreateSchema</name>
            <value>true</value>
        </property>
        <property>
            <name>datanucleus.schema.autoCreateAll</name>
            <value>true</value>
        </property>
        <property>
            <name>datanucleus.autoCreateTables</name>
            <value>true</value>
        </property>
        <property>
            <name>datanucleus.autoCreateColumns</name>
            <value>true</value>
        </property>
        <property>
            <name>hive.metastore.local</name>
            <value>true</value>
        </property>
        <!-- Show column names in query output -->
        <property>
            <name>hive.cli.print.header</name>
            <value>true</value>
        </property>
        <!-- Show the current database name -->
        <property>
            <name>hive.cli.print.current.db</name>
            <value>true</value>
        </property>
        <property>
            <name>hive.server2.thrift.port</name>
            <value>10000</value>
        </property>
        <property>
            <name>hive.server2.thrift.bind.host</name>
            <value>192.168.58.128</value>
        </property>
        <property>
            <name>hive.server2.session.check.interval</name>
            <value>0</value>
        </property>
    </configuration>
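
    Note that this file does not set hive.metastore.uris, so spark-sql connects to the MySQL-backed metastore directly. To route through the metastore service started above instead, the property can be added -- a sketch, assuming the service runs on node2 at the default port 9083:

        <property>
            <!-- assumption: adjust the host to wherever the metastore service runs -->
            <name>hive.metastore.uris</name>
            <value>thrift://node2:9083</value>
        </property>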

    Start the HDFS service (start-all.sh also brings up YARN):

    start-all.sh

    Launch the spark-sql command line from spark/bin:

    cd /usr/local/spark/bin
    spark-sql
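
    spark-sql accepts the usual spark-submit options, which is convenient for testing; a sketch (values are illustrative):

    spark-sql --master local[*] --conf spark.sql.shuffle.partitions=4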

    Run SQL statements -- on node1:

    show databases;
    show tables;
    CREATE TABLE person (id int, name string, age int) row format delimited fields terminated by ' ';
    LOAD DATA LOCAL INPATH 'file:///home/hadoop/文档/person.txt' INTO TABLE person;
    show tables;
    select * from person;

    person.txt:

    1 zhangsan 20
    2 lisi 29
    3 wangwu 25
    4 zhaoliu 30
    5 tianqi 35
    6 kobe 40

    Integrating Hive in code

    Add the dependency:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>

    Note: the Hive metastore must be running first:

    nohup /usr/local/hive/bin/hive --service metastore &

    Write the code:

    package cn.itcast.sql
    
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession
    
    /**
     * Author itcast
     * Desc Demonstrates SparkSQL integration with Hive: creating, loading, and querying a Hive table
     */
    object Demo09_Hive {
      def main(args: Array[String]): Unit = {
        //TODO 0. Prepare the environment -- extra configuration and Hive syntax support are required
        val spark: SparkSession = SparkSession.builder().appName("sparksql").master("local[*]")
          .config("spark.sql.shuffle.partitions", "4")//keep the shuffle partition count small for this test; the default is 200, tune to cluster size in real jobs
          .config("spark.sql.warehouse.dir", "hdfs://master:9000/user/hive/warehouse")//location of the Hive warehouse on HDFS
          .config("hive.metastore.uris", "thrift://master:9083")//address of the running Hive metastore service
          .enableHiveSupport()//enable support for Hive syntax
          .getOrCreate()
        val sc: SparkContext = spark.sparkContext
        sc.setLogLevel("WARN")
        import spark.implicits._
    
        //TODO 1. Work with Hive
        spark.sql("show databases").show(false)
        spark.sql("show tables").show(false)
        spark.sql("CREATE TABLE person4 (id int, name string, age int) row format delimited fields terminated by ' '")
        spark.sql("LOAD DATA LOCAL INPATH 'file:///D:/person.txt' INTO TABLE person4")
        spark.sql("show tables").show(false)
        spark.sql("select * from person4").show(false)
    
        spark.stop()
      }
    }
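
    Once Hive support is enabled, the same tables can also be used through the DataFrame API instead of SQL strings. A minimal sketch, assuming the SparkSession `spark` from the demo above and the person4 table it created (person4_adults is a hypothetical output table):

    //read the Hive table as a DataFrame
    val personDF = spark.table("person4")
    //filter with the DataFrame API -- equivalent to a SQL WHERE clause
    personDF.filter("age >= 30").show(false)
    //write the filtered rows back to Hive as a new managed table
    personDF.filter("age >= 30").write.mode("overwrite").saveAsTable("person4_adults")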


    Spark distributed SQL engine

    Likewise, SparkSQL supports the following access methods:

    1. The SparkSQL command line

    2. Start the SparkSQL ThriftServer on one machine, then access it from other machines with beeline

    Connect with the beeline client

    Keep Hive's metastore service running:

    nohup /usr/local/hive/bin/hive --service metastore &

    Start the SparkSQL ThriftServer -- similar to HiveServer2:

    cd /usr/local/spark/sbin/
    start-thriftserver.sh
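
    start-thriftserver.sh accepts the same command-line options as spark-submit, plus --hiveconf for HiveServer2-style properties. A sketch, assuming the port and host used in this post:

    ./start-thriftserver.sh \
      --hiveconf hive.server2.thrift.port=10000 \
      --hiveconf hive.server2.thrift.bind.host=master \
      --master local[*]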

    From another machine (here I run everything on master), use beeline to connect to the SparkSQL ThriftServer.

    Open a new terminal:

    cd /usr/local/spark/bin
    beeline
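
    Inside beeline, connect to the ThriftServer before running SQL. The username/password are whatever your setup expects; a sketch, assuming the server runs on master at the default port 10000:

    !connect jdbc:hive2://master:10000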

    Run the SQL:

    show databases;
    show tables;
    select * from person;

    View the Spark jobs at http://master:4040/jobs/

    Access via code

    Add the dependency:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive-thriftserver_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
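
    If the org.apache.hive.jdbc.HiveDriver class used below is not found on the classpath, the Hive JDBC driver may also need to be declared explicitly; a sketch, with the version matching the Hive 3.1.2 used in this post:

    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>3.1.2</version>
    </dependency>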

    Use JDBC code to access the SparkSQL ThriftServer:

    package cn.itcast.sql
    
    import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
    
    /**
     * Author itcast
     * Desc Demonstrates using JDBC to access SparkSQL's ThriftServer
     */
    object Demo10_ThriftServer{
      def main(args: Array[String]): Unit = {
        //0. Load the JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver")
    
        //1. Get a connection
        val conn: Connection = DriverManager.getConnection(
          "jdbc:hive2://master:10000/default", //看上去像是在使用Hive的server2,本质上使用Spark的ThriftServer
          "root",
          "4135"
        )
    
        //2. Write the SQL
        val sql = """select id,name,age from person"""
    
        //3. Get a PreparedStatement
        val ps: PreparedStatement = conn.prepareStatement(sql)
    
        //4. Execute the query
        val rs: ResultSet = ps.executeQuery()
    
        //5. Process the results
        while (rs.next()){
          val id: Int = rs.getInt("id")
          val name: String = rs.getString("name")
          val age: Int = rs.getInt("age")
          println(s"id=${id},name=${name},age=${age}")
        }
    
        //6. Close resources
        if(rs != null) rs.close()
        if(ps != null) ps.close()
        if(conn != null) conn.close()
      }
    }

    Shut down:

    Run on the virtual machine:

    cd /usr/local/spark/sbin
    stop-thriftserver.sh
    cd ~
    stop-all.sh
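
    The metastore started earlier with nohup keeps running in the background. To stop it, find its process and kill it (a sketch -- the metastore shows up in jps as RunJar; <pid> is a placeholder):

    jps | grep RunJar
    kill <pid>    #replace <pid> with the process id printed above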