• Spark Learning 12 (WordCount in spark-shell)


    The file words.txt sits in the directory /home/hadoop/2016113012:

    hello scala
    hello java
    hello python
    hello wujiadong
    

    Upload the file to HDFS

    hadoop@slave01:~/2016113012$ hadoop fs -put /home/hadoop/2016113012/words.txt /student/2016113012/spark
    hadoop@slave01:~/2016113012$ hadoop fs -lsr /student/2016113012
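    
    Optionally, the upload can be double-checked by printing the file back from HDFS (same path as above):
    hadoop@slave01:~/2016113012$ hadoop fs -cat /student/2016113012/spark/words.txt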
    

    Start the spark shell

    
    Methods 1 and 2 run Spark in local mode, because no master address is specified.
    
    Method 1: no extra parameters
    hadoop@master:~$ spark-shell
    Method 2: set resource parameters
    hadoop@master:~$ spark-shell  --executor-memory 2g --total-executor-cores 2 --executor-cores 1
    Method 3: specify the master address (not used here so far; a sketch is given after the notes below)
    
    
    Notes:
    --executor-memory 2g: each executor gets 2 GB of memory
    --total-executor-cores 2: the application uses 2 CPU cores in total across the cluster
    --executor-cores: number of CPU cores used by each executor
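    
    For reference, a sketch of method 3, assuming a standalone master is listening at spark://master:7077 (the actual host and port depend on the cluster setup):
    hadoop@master:~$ spark-shell --master spark://master:7077 --executor-memory 2g --total-executor-cores 2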
    
    
    The spark shell initializes the SparkContext class as the object sc by default; user code that needs it can use sc directly.
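    
    A quick way to confirm that sc exists and which master the shell connected to (the values depend on how spark-shell was started; with no --master it typically reports local[*]):
    scala> sc.master
    scala> sc.appName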
    

    Writing the Spark program in Scala inside the spark shell

    scala> val fileRDD = sc.textFile("hdfs://master:9000/student/2016113012/spark/words.txt")
    fileRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:15
    
    scala> val wordRDD = fileRDD.flatMap(_.split(" "))
    wordRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at flatMap at <console>:17
    
    scala> val wordPair = wordRDD.map((_,1))
    wordPair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:19
    
    scala> val result = wordPair.reduceByKey(_+_) 
    17/03/04 21:08:37 INFO FileInputFormat: Total input paths to process : 1
    result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:21
    
    scala> result.sortBy(_._2,false)
    res1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[9] at sortBy at <console>:24
    
    scala> result.sortBy(_._2,false).collect()
    17/03/04 21:09:49 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
    17/03/04 21:09:49 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
    17/03/04 21:09:49 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
    17/03/04 21:09:49 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
    17/03/04 21:09:49 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
    res2: Array[(String, Int)] = Array((hello,4), (scala,1), (wujiadong,1), (python,1), (java,1))
    scala> result.sortBy(_._2,false).saveAsTextFile("hdfs://master:9000/wordcount_out") 
    17/03/04 21:11:03 INFO FileOutputCommitter: Saved output of task 'attempt_201703042111_0005_m_000000_4' to hdfs://master:9000/wordcount_out/_temporary/0/task_201703042111_0005_m_000000
    
    
    View the results
    hadoop@master:~$ hadoop fs -ls hdfs://master:9000/wordcount_out
    17/03/04 21:12:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 2 items
    -rw-r--r--   3 hadoop supergroup          0 2017-03-04 21:11 hdfs://master:9000/wordcount_out/_SUCCESS
    -rw-r--r--   3 hadoop supergroup         54 2017-03-04 21:11 hdfs://master:9000/wordcount_out/part-00000
    hadoop@master:~$ hadoop fs -text hdfs://master:9000/wordcount_out/part-00000
    17/03/04 21:14:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    (hello,4)
    (scala,1)
    (wujiadong,1)
    (python,1)
    (java,1)
    
    
    The whole job in a single line
    scala> sc.textFile("hdfs://master:9000/student/2016113012/spark/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
    res9: Array[(String, Int)] = Array((scala,1), (wujiadong,1), (python,1), (hello,4), (java,1))
    //or write the output to HDFS
    scala> sc.textFile("hdfs://master:9000/student/2016113012/spark/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._2,false).saveAsTextFile("hdfs://master:9000/spark_out")
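    
    The saved result can also be read back from inside the shell instead of with hadoop fs, assuming the saveAsTextFile line above has already been run:
    scala> sc.textFile("hdfs://master:9000/spark_out").collect.foreach(println)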
    
    
    Notes:
    sc is the SparkContext object, the entry point for submitting a Spark program. The spark shell has already initialized the SparkContext class as sc by default, so sc can be used directly.
    textFile() reads data from HDFS
    flatMap(_.split(" ")) maps each line to words and then flattens the result
    map((_,1)) turns each word into a (word, 1) tuple
    reduceByKey(_+_) reduces by key, summing the values
    sortBy(_._2,false): sorts by value in descending order
    saveAsTextFile("") writes the result to HDFS
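    
    For comparison, a minimal sketch of the same WordCount as a standalone Scala application rather than shell input. The object name and the output path wordcount_app_out are illustrative, and it assumes Spark is on the classpath (e.g. run via spark-submit); in spark-shell the sc object is created for you instead.
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount")
        val sc = new SparkContext(conf)  // spark-shell creates this sc automatically
        sc.textFile("hdfs://master:9000/student/2016113012/spark/words.txt")
          .flatMap(_.split(" "))         // split each line into words
          .map((_, 1))                   // pair each word with a count of 1
          .reduceByKey(_ + _)            // sum the counts for each word
          .sortBy(_._2, false)           // sort by count, descending
          .saveAsTextFile("hdfs://master:9000/wordcount_app_out")  // illustrative output path
        sc.stop()
      }
    }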
    
  • Original post: https://www.cnblogs.com/wujiadong2014/p/6502741.html