• A brief introduction to using spark-shell (Scala)


    1. Start the spark-shell

    ./bin/spark-shell
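
    spark-shell accepts the same options as spark-submit; for example, you can pass a master URL explicitly (the local[4] value below is an illustrative choice, not from the original post):

    ./bin/spark-shell --master local[4]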

    The :help command lists the commands available in the shell:

    scala> :help
    All commands can be abbreviated, e.g., :he instead of :help.
    :edit <id>|<line>        edit history
    :help [command]          print this summary or command-specific help
    :history [num]           show the history (optional num is commands to show)
    :h? <string>             search the history
    :imports [name name ...] show import history, identifying sources of names
    :implicits [-v]          show the implicits in scope
    :javap <path|class>      disassemble a file or class name
    :line <id>|<line>        place line(s) at the end of history
    :load <path>             interpret lines in a file
    :paste [-raw] [path]     enter paste mode or paste a file
    :power                   enable power user mode
    :quit                    exit the interpreter
    :replay [options]        reset the repl and replay all previous commands
    :require <path>          add a jar to the classpath
    :reset [options]         reset the repl to its initial state, forgetting all session entries
    :save <path>             save replayable session to a file
    :sh <command line>       run a shell command (result is implicitly => List[String])
    :settings <options>      update compiler options, if possible; see reset
    :silent                  disable/enable automatic printing of results
    :type [-v] <expr>        display the type of an expression without evaluating it
    :kind [-v] <expr>        display the kind of expression's type
    :warnings                show the suppressed warnings from the most recent line which had any
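
    For example, :type reports the static type of an expression without evaluating it; applied to the load expression used in step 3 below (same HDFS path), it shows the result is an RDD:

    scala> :type sc.textFile("hdfs://cluster1/input/README.txt")
    org.apache.spark.rdd.RDD[String]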

    2. Load a file with spark (the pre-created SparkSession) to create a Dataset

    scala> val textFile = spark.read.textFile("hdfs://cluster1/input/README.txt")
    textFile: org.apache.spark.sql.Dataset[String] = [value: string]
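
    The Dataset can be transformed directly; as a quick sketch (the filter term "Hadoop" is just an illustrative choice), count the lines that mention it:

    scala> textFile.filter(line => line.contains("Hadoop")).count()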

    3. Load a file with sc (the pre-created SparkContext) to create an RDD. Note that this shadows the textFile Dataset from step 2; the steps below operate on the RDD.

    scala> val textFile=sc.textFile("hdfs://cluster1/input/README.txt")
    textFile: org.apache.spark.rdd.RDD[String] = hdfs://cluster1/input/README.txt MapPartitionsRDD[1] at textFile at <console>:24
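
    The RDD API exposes lower-level details; for instance (a minimal check, using nothing beyond the file already loaded above), the number of partitions Spark split the file into:

    scala> textFile.getNumPartitions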

    4. Count how many lines (items) textFile contains

    scala> textFile.count()    // Number of items in this RDD
    res0: Long = 31

    5. View the first item

    scala> textFile.first()   // First item in this RDD
    res1: String = For the latest information about Hadoop, please visit our website at:
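
    To inspect more than one line, take(n) returns the first n items as an array (n = 5 here is an arbitrary choice):

    scala> textFile.take(5)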

    All of the above is quite simple; next comes a complete word count.

    6. Word count

    scala> val wordsRdd=textFile.flatMap(line=>line.split(" "))
    wordsRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26
    
    scala> val kvsRdd=wordsRdd.map(word=>(word,1))
    kvsRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28
    
    scala> val countRdd=kvsRdd.reduceByKey(_+_)
    countRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:30
    
    scala> countRdd.collect()
    res2: Array[(String, Int)] = Array((under,1), (this,3), (distribution,2), (Technology,1), (country,1), (is,1), (Jetty,1), (currently,1), (permitted.,1), (check,1), (have,1), (Security,1), (U.S.,1), (with,1), (BIS,1), (This,1), (mortbay.org.,1), ((ECCN),1), (using,2), (security,1), (Department,1), (export,1), (reside,1), (any,1), (algorithms.,1), (from,1), (re-export,2), (has,1), (SSL,1), (Industry,1), (Administration,1), (details,1), (provides,1), (http://hadoop.apache.org/core/,1), (country's,1), (Unrestricted,1), (740.13),1), (policies,1), (country,,1), (concerning,1), (uses,1), (Apache,1), (possession,,2), (information,2), (our,2), (as,1), ("",18), (Bureau,1), (wiki,,1), (please,2), (form,1), (information.,1), (ENC,1), (Export,2), (included,1), (asymmetric,1), (Commodity,1), (For,1),...
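
    collect() returns the pairs in no particular order; to see the most frequent words first, sort by count in descending order (a small sketch; the top-10 cutoff is arbitrary):

    scala> countRdd.sortBy(_._2, ascending = false).take(10)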

    That is all for this post for now; it will be expanded in later updates.

  • Original article: https://www.cnblogs.com/tijun/p/7562885.html