• Using Spark SQL: the Spark SQL CLI

    Spark SQL CLI overview

    The Spark SQL CLI makes it more convenient to query Hive directly from Spark SQL through the Hive metastore. In the current version, the Spark SQL CLI cannot talk to the ThriftServer.

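    Before using the Spark SQL CLI, make sure the JDBC driver for the metastore database (MySQL in this setup) is on the Spark classpath: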



    export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/software/mysql-connector-java-5.1.27-bin.jar
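
    The export above only helps if the jar really exists at that path. The following is a minimal sketch of a guarded version; the helper name add_to_spark_classpath is hypothetical (it is not part of Spark), and the jar path is the one used in this article, so adjust it to your installation.

```shell
# Hypothetical helper: append a jar to SPARK_CLASSPATH only if the file exists.
add_to_spark_classpath() {
  jar="$1"
  if [ ! -f "$jar" ]; then
    echo "missing jar: $jar" >&2
    return 1
  fi
  # ${VAR:+...} avoids a leading ':' when SPARK_CLASSPATH is empty or unset.
  export SPARK_CLASSPATH="${SPARK_CLASSPATH:+$SPARK_CLASSPATH:}$jar"
}

# Usage with the path from this article (assumed to exist on your machine):
# add_to_spark_classpath /home/hadoop/software/mysql-connector-java-5.1.27-bin.jar
```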

    Spark SQL CLI command-line options:

    cd $SPARK_HOME/bin
    spark-sql --help
    Usage: ./bin/spark-sql [options] [cli option]
    Spark assembly has been built with Hive, including Datanucleus jars on classpath
      --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
      --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                                  on one of the worker machines inside the cluster ("cluster")
                                  (Default: client).
      --class CLASS_NAME          Your application's main class (for Java / Scala apps).
      --name NAME                 A name of your application.
      --jars JARS                 Comma-separated list of local jars to include on the driver
                                  and executor classpaths.
      --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                                  on the PYTHONPATH for Python apps.
      --files FILES               Comma-separated list of files to be placed in the working
                                  directory of each executor.
      --conf PROP=VALUE           Arbitrary Spark configuration property.
      --properties-file FILE      Path to a file from which to load extra properties. If not
                                  specified, this will look for conf/spark-defaults.conf.
      --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
      --driver-java-options       Extra Java options to pass to the driver.
      --driver-library-path       Extra library path entries to pass to the driver.
      --driver-class-path         Extra class path entries to pass to the driver. Note that
                                  jars added with --jars are automatically included in the
                                  classpath.
      --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
      --help, -h                  Show this help message and exit
      --verbose, -v               Print additional debug output
     Spark standalone with cluster deploy mode only:
      --driver-cores NUM          Cores for driver (Default: 1).
      --supervise                 If given, restarts the driver on failure.
     Spark standalone and Mesos only:
      --total-executor-cores NUM  Total cores for all executors.
     YARN-only:
      --executor-cores NUM        Number of cores per executor (Default: 1).
      --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
      --num-executors NUM         Number of executors to launch (Default: 2).
      --archives ARCHIVES         Comma separated list of archives to be extracted into the
                                  working directory of each executor.
    CLI options:
     -d,--define <key=value>          Variable subsitution to apply to hive
                                      commands. e.g. -d A=B or --define A=B
        --database <databasename>     Specify the database to use
     -e <quoted-query-string>         SQL from command line
     -f <filename>                    SQL from files
     -h <hostname>                    connecting to Hive Server on remote host
        --hiveconf <property=value>   Use value for given property
        --hivevar <key=value>         Variable subsitution to apply to hive
                                      commands. e.g. --hivevar A=B
     -i <filename>                    Initialization SQL file
     -p <port>                        connecting to Hive Server on port number
     -S,--silent                      Silent mode in interactive shell
     -v,--verbose                     Verbose mode (echo executed SQL to the console)
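
    The -e, -f and --database options listed above allow non-interactive use. The sketch below wraps spark-sql in a dry-run helper so the resulting command lines can be inspected without a cluster. The helper name run_spark_sql, the DRY_RUN variable and the file name queries.sql are assumptions made for illustration; only the flags themselves come from the help output.

```shell
# Hypothetical dry-run wrapper: with DRY_RUN=1 it prints the spark-sql
# command line instead of executing it; otherwise it runs spark-sql.
run_spark_sql() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "spark-sql $*"
  else
    spark-sql "$@"
  fi
}

DRY_RUN=1

# Run a single statement and exit (-e), against a chosen database (--database).
run_spark_sql --database default -e "SELECT count(*) FROM page_views"

# Run every statement in a script file (-f); queries.sql is a placeholder name.
run_spark_sql -f queries.sql
```

    Unsetting DRY_RUN (or setting it to 0) makes the same calls invoke the real spark-sql, which must then be on the PATH.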


    When the master is set to yarn (spark-sql --master yarn), the whole job can be monitored at http://hadoop000:8088.

    Note: if spark.master spark://hadoop000:7077 is configured in $SPARK_HOME/conf/spark-defaults.conf, spark-sql runs on the standalone cluster even when no master is specified at startup.
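
    To see which master a plain spark-sql launch would pick up, the property can be read back from the configuration file. This is a minimal sketch that assumes whitespace-separated "key value" lines, as in the note above; configured_master is a hypothetical helper name, not part of Spark.

```shell
# Hypothetical helper: print the spark.master value from a spark-defaults.conf file.
# Assumes "key value" lines separated by whitespace; lines starting with '#'
# are comments and never match because their first field is not "spark.master".
configured_master() {
  conf="${1:-$SPARK_HOME/conf/spark-defaults.conf}"
  awk '$1 == "spark.master" { print $2 }' "$conf"
}

# Usage (reads $SPARK_HOME/conf/spark-defaults.conf when no file is given):
# configured_master
```

    A --master flag given on the command line takes precedence over the value in spark-defaults.conf.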


    Starting spark-sql: because spark.master spark://hadoop000:7077 is already configured in my spark-defaults.conf, no master is specified when launching spark-sql.

    cd $SPARK_HOME/bin
    spark-sql
    SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 limit 10;
    SELECT session_id, count(*) c FROM page_views group by session_id order by c desc limit 10;


    The page_views table queried above is created and loaded as follows:

    create table page_views(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
    );
    load data local inpath '/home/spark/software/data/page_views.dat' overwrite into table page_views;
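
    The statements in this section can also be collected in a script and passed to spark-sql with -f. The sketch below writes them in dependency order (create, load, then query). The script location and the file name page_views.sql are placeholders, and the final spark-sql call is left commented out because it needs a working Spark and Hive metastore setup.

```shell
# Write the statements to a script file; the path is a placeholder.
script="${TMPDIR:-/tmp}/page_views.sql"

# The quoted 'EOF' keeps the shell from expanding anything inside the SQL.
cat > "$script" <<'EOF'
create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
);
load data local inpath '/home/spark/software/data/page_views.dat' overwrite into table page_views;
SELECT session_id, count(*) c FROM page_views group by session_id order by c desc limit 10;
EOF

# Run the script non-interactively (requires a configured Spark installation):
# spark-sql -f "$script"
```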
