• Spark2.1.0——剖析spark-shell


            在《Spark2.1.0——运行环境准备》一文介绍了如何准备基本的Spark运行环境,并在《Spark2.1.0——Spark初体验》一文通过在spark-shell中执行word count的过程,让读者了解到可以使用spark-shell提交Spark作业。现在读者应该很想知道spark-shell究竟做了什么呢?

    脚本分析

            在Spark安装目录的bin文件夹下可以找到spark-shell,其中有代码清单1-1所示的一段脚本。

    代码清单1-1       spark-shell脚本

    function main() {
      if $cygwin; then
        stty -icanon min 1 -echo > /dev/null 2>&1
        export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
        "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
        stty icanon echo > /dev/null 2>&1
      else
        export SPARK_SUBMIT_OPTS
        "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
      fi
    }
    

    我们看到脚本spark-shell里执行了spark-submit脚本,那么打开spark-submit脚本,发现代码清单1-2中所示的脚本。

    代码清单1-2        spark-submit脚本

    if [ -z "${SPARK_HOME}" ]; then
      source "$(dirname "$0")"/find-spark-home
    fi
    
    # disable randomized hash for string in Python 3.3+
    export PYTHONHASHSEED=0
    
    exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
    

    可以看到spark-submit中又执行了脚本spark-class。打开脚本spark-class,首先发现以下一段脚本:

    # Find the java binary
    if [ -n "${JAVA_HOME}" ]; then
      RUNNER="${JAVA_HOME}/bin/java"
    else
      if [ "$(command -v java)" ]; then
        RUNNER="java"
      else
        echo "JAVA_HOME is not set" >&2
        exit 1
      fi
    fi
    

    上面的脚本是为了找到Java命令。在spark-class脚本中还会找到以下内容:

    build_command() {
      "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
      printf "%d" $?
    }
    
    CMD=()
    while IFS= read -d '' -r ARG; do
      CMD+=("$ARG")
    done < <(build_command "$@")
    

    根据代码清单1-2,脚本spark-submit在执行spark-class脚本时,给它增加了参数SparkSubmit 。所以读到这,应该知道Spark启动了以SparkSubmit为主类的JVM进程。

    远程监控

            为便于在本地对Spark进程进行远程监控,在spark-shell脚本中找到以下配置:

    SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"

    并追加以下jmx配置:

    -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=10207 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

    如果Spark安装在其他机器,那么在本地打开jvisualvm后需要添加远程主机,如图1所示:

    图1  添加远程主机

    右键单击已添加的远程主机,添加JMX连接,如图2:

    图2  添加JMX连接

            如果Spark安装在本地,那么打开jvisualvm后就会在应用程序窗口看到org.apache.spark.deploy.SparkSubmit进程,只需双击即可。

            选择右侧的“线程”选项卡,选择main线程,然后点击“线程Dump”按钮,如图3。

    图3 查看Spark线程

    从线程Dump的内容中找到线程main的信息如代码清单1-3所示。

    代码清单1-3       main线程的Dump信息

    "main" #1 prio=5 os_prio=31 tid=0x00007fa012802000 nid=0x1303 runnable [0x000000010d11c000]
       java.lang.Thread.State: RUNNABLE
    	at java.io.FileInputStream.read0(Native Method)
    	at java.io.FileInputStream.read(FileInputStream.java:207)
    	at jline.internal.NonBlockingInputStream.read(NonBlockingInputStream.java:169)
    	- locked <0x00000007837a8ab8> (a jline.internal.NonBlockingInputStream)
    	at jline.internal.NonBlockingInputStream.read(NonBlockingInputStream.java:137)
    	at jline.internal.NonBlockingInputStream.read(NonBlockingInputStream.java:246)
    	at jline.internal.InputStreamReader.read(InputStreamReader.java:261)
    	- locked <0x00000007837a8ab8> (a jline.internal.NonBlockingInputStream)
    	at jline.internal.InputStreamReader.read(InputStreamReader.java:198)
    	- locked <0x00000007837a8ab8> (a jline.internal.NonBlockingInputStream)
    	at jline.console.ConsoleReader.readCharacter(ConsoleReader.java:2145)
    	at jline.console.ConsoleReader.readLine(ConsoleReader.java:2349)
    	at jline.console.ConsoleReader.readLine(ConsoleReader.java:2269)
    	at scala.tools.nsc.interpreter.jline.InteractiveReader.readOneLine(JLineReader.scala:57)
    	at scala.tools.nsc.interpreter.InteractiveReader$$anonfun$readLine$2.apply(InteractiveReader.scala:37)
    	at scala.tools.nsc.interpreter.InteractiveReader$$anonfun$readLine$2.apply(InteractiveReader.scala:37)
    	at scala.tools.nsc.interpreter.InteractiveReader$.restartSysCalls(InteractiveReader.scala:44)
    	at scala.tools.nsc.interpreter.InteractiveReader$class.readLine(InteractiveReader.scala:37)
    	at scala.tools.nsc.interpreter.jline.InteractiveReader.readLine(JLineReader.scala:28)
    	at scala.tools.nsc.interpreter.ILoop.readOneLine(ILoop.scala:404)
    	at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:413)
    	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
    	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
    	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
    	at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
    	at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
    	at org.apache.spark.repl.Main$.doMain(Main.scala:68)
    	at org.apache.spark.repl.Main$.main(Main.scala:51)
    	at org.apache.spark.repl.Main.main(Main.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

    从main线程的栈信息中看出程序的调用顺序:SparkSubmit.main→repl.Main→Iloop.process。

    源码分析

    我们根据上面的线索,直接阅读Iloop的process方法的源码(Iloop是Scala语言自身的类库中的用于实现交互式shell的实现类,提供对REPL(Read-eval-print-loop)的实现),见代码清单1-4。

    代码清单1-4       process的实现

      def process(settings: Settings): Boolean = savingContextLoader {
        this.settings = settings
        createInterpreter()
    
        // sets in to some kind of reader depending on environmental cues
        in = in0.fold(chooseReader(settings))(r => SimpleReader(r, out, interactive = true))
        globalFuture = future {
          intp.initializeSynchronous()
          loopPostInit()
          !intp.reporter.hasErrors
        }
        loadFiles(settings)
        printWelcome()
    
        try loop() match {
          case LineResults.EOF => out print Properties.shellInterruptedString
          case _               =>
        }
        catch AbstractOrMissingHandler()
        finally closeInterpreter()
    
        true
      }

    根据代码清单1-4,Iloop的process方法调用了loadFiles方法。Spark中的SparkILoop继承了Iloop并重写了loadFiles方法,其实现如下:

      override def loadFiles(settings: Settings): Unit = {
        initializeSpark()
        super.loadFiles(settings)
      }

    根据上面展示的代码,loadFiles方法调用了SparkILoop的initializeSpark方法,initializeSpark的实现见代码清单1-5。

    代码清单1-5        initializeSpark的实现

      def initializeSpark() {
        intp.beQuietDuring {
          processLine("""
            @transient val spark = if (org.apache.spark.repl.Main.sparkSession != null) {
                org.apache.spark.repl.Main.sparkSession
              } else {
                org.apache.spark.repl.Main.createSparkSession()
              }
            @transient val sc = {
              val _sc = spark.sparkContext
              if (_sc.getConf.getBoolean("spark.ui.reverseProxy", false)) {
                val proxyUrl = _sc.getConf.get("spark.ui.reverseProxyUrl", null)
                if (proxyUrl != null) {
                  println(s"Spark Context Web UI is available at ${proxyUrl}/proxy/${_sc.applicationId}")
                } else {
                  println(s"Spark Context Web UI is available at Spark Master Public URL")
                }
              } else {
                _sc.uiWebUrl.foreach {
                  webUrl => println(s"Spark context Web UI available at ${webUrl}")
                }
              }
              println("Spark context available as 'sc' " +
                s"(master = ${_sc.master}, app id = ${_sc.applicationId}).")
              println("Spark session available as 'spark'.")
              _sc
            }
            """)
          processLine("import org.apache.spark.SparkContext._")
          processLine("import spark.implicits._")
          processLine("import spark.sql")
          processLine("import org.apache.spark.sql.functions._")
          replayCommandStack = Nil // remove above commands from session history.
        }
      }

    我们看到initializeSpark向交互式shell发送了一大串代码,Scala的交互式shell将调用org.apache.spark.repl.Main的createSparkSession方法(见代码清单1-6)创建SparkSession。我们看到常量spark将持有SparkSession的引用,并且sc持有SparkSession内部初始化好的SparkContext。所以我们才能够在spark-shell的交互式shell中使用sc和spark。

    代码清单1-6        createSparkSession的实现

      def createSparkSession(): SparkSession = {
        val execUri = System.getenv("SPARK_EXECUTOR_URI")
        conf.setIfMissing("spark.app.name", "Spark shell")
        conf.set("spark.repl.class.outputDir", outputDir.getAbsolutePath())
        if (execUri != null) {
          conf.set("spark.executor.uri", execUri)
        }
        if (System.getenv("SPARK_HOME") != null) {
          conf.setSparkHome(System.getenv("SPARK_HOME"))
        }
    
        val builder = SparkSession.builder.config(conf)
        if (conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase == "hive") {
          if (SparkSession.hiveClassesArePresent) {
            sparkSession = builder.enableHiveSupport().getOrCreate()
            logInfo("Created Spark session with Hive support")
          } else {
            builder.config(CATALOG_IMPLEMENTATION.key, "in-memory")
            sparkSession = builder.getOrCreate()
            logInfo("Created Spark session")
          }
        } else {
          sparkSession = builder.getOrCreate()
          logInfo("Created Spark session")
        }
        sparkContext = sparkSession.sparkContext
        sparkSession
      }

    根据代码清单1-6,createSparkSession方法通过SparkSession的API创建SparkSession实例。本书将有关SparkSession等API的内容在《Spark内核设计的艺术》一书的第10章讲解,初次接触Spark的读者现在只需要了解即可。

    关于《Spark内核设计的艺术 架构设计与实现》

    经过近一年的准备,基于Spark2.1.0版本的《Spark内核设计的艺术 架构设计与实现》一书现已出版发行,图书如图:
     
    纸质版售卖链接如下:
  • 相关阅读:
    Educational Codeforces Round 99 (Rated for Div. 2) (FG咕咕)
    2020 ccpc 威海
    【dp每日一题】CF 543A. Writing Code
    【dp每日一题】CF543C. Remembering Strings
    【dp每日一题】CF545C. Woodcutters
    【dp每日一题】CF566F. Clique in the Divisibility Graph
    Codeforces Round #686 (Div. 3)
    【dp每日一题】CF 559C. Gerald and Giant Chess
    2019 ICPC Asia Yinchuan Regional
    Educational Codeforces Round 98 (Rated for Div. 2)
  • 原文地址:https://www.cnblogs.com/jiaan-geng/p/9065021.html
Copyright © 2020-2023  润新知