• (6) Spark-Eclipse Development Environment WordCount (Java & Python versions of Spark)


    WordCount in the Spark-Eclipse development environment

    Video tutorials:

    1. Youku

    2. YouTube

    Install Eclipse

    Extract eclipse-jee-mars-2-win32-x86_64.zip

    Java WordCount

    Extract spark-2.0.0-bin-hadoop2.6.tgz

    Create a Java Project named Spark

    Copy all of the jar files under spark-2.0.0-bin-hadoop2.6/jars into the Spark project's lib directory

    Add them to the Build Path

    package com.bean.spark.wordcount;

    import java.util.Arrays;
    import java.util.Iterator;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.api.java.function.VoidFunction;

    import scala.Tuple2;

    public class WordCount {

        public static void main(String[] args) {
            // Create a SparkConf object holding the application's configuration
            SparkConf conf = new SparkConf();
            conf.setMaster("local");
            conf.setAppName("wordcount");

            // Create the context object: JavaSparkContext for Java development,
            // SparkContext for Scala. The context connects to the Spark cluster
            // and creates RDDs, accumulators, and broadcast variables.
            JavaSparkContext sc = new JavaSparkContext(conf);

            // textFile (defined on SparkContext) reads a text file from HDFS, from
            // a node's local file system, or from any Hadoop-supported file system.
            // It returns a JavaRDD<String> with one element per line of the file.
            JavaRDD<String> lines = sc.textFile("D:/tools/data/wordcount/wordcount.txt");

            // Split each line into words: flatMap is a transformation (its argument
            // implements the FlatMapFunction interface) that emits every word of
            // every line.
            JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Iterator<String> call(String s) throws Exception {
                    return Arrays.asList(s.split(" ")).iterator();
                }
            });

            // Give each word an initial count of 1: mapToPair is a transformation
            // (its argument implements the PairFunction interface; the three type
            // parameters of PairFunction<String, String, Integer> are <input word,
            // Tuple2 key, Tuple2 value>) that returns a new RDD, a JavaPairRDD.
            JavaPairRDD<String, Integer> word = words.mapToPair(new PairFunction<String, String, Integer>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<String, Integer> call(String s) throws Exception {
                    return new Tuple2<String, Integer>(s, 1);
                }
            });

            // Count how many times each word occurs: reduceByKey is a transformation
            // (its argument implements the Function2 interface) that reduces the
            // values of each key. It returns a JavaPairRDD in which each Tuple2's
            // key is a word and its value is the summed count of that word.
            JavaPairRDD<String, Integer> counts = word.reduceByKey(new Function2<Integer, Integer, Integer>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Integer call(Integer s1, Integer s2) throws Exception {
                    return s1 + s2;
                }
            });

            counts.foreach(new VoidFunction<Tuple2<String, Integer>>() {

                private static final long serialVersionUID = 1L;

                @Override
                public void call(Tuple2<String, Integer> wordcount) throws Exception {
                    System.out.println(wordcount._1 + " : " + wordcount._2);
                }
            });

            // Write the result to the file system.
            /*
             * HDFS, using the new API
             * (org.apache.hadoop.mapreduce.lib.output.TextOutputFormat):
             * counts.saveAsNewAPIHadoopFile("hdfs://master:9000/data/wordcount/output", Text.class, IntWritable.class, TextOutputFormat.class, new Configuration());
             *
             * Or write to HDFS with the default TextOutputFormat (mind the HDFS write
             * permissions; if denied, run: hdfs dfs -chmod -R 777 /data/wordCount/output):
             * wordCount.saveAsTextFile("hdfs://soy1:9000/data/wordCount/output");
             */
            counts.saveAsTextFile("D:/tools/data/wordcount/output");

            // Close the SparkContext to finish the job
            sc.close();
        }
    }
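    The anonymous inner classes above are the pre-Java-8 style. Since Spark 2.0 runs on Java 8, the same pipeline collapses into lambdas; a minimal sketch (the class name WordCountLambda is just for illustration, the input path and APIs are the same as above):

    package com.bean.spark.wordcount;

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class WordCountLambda {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setMaster("local").setAppName("wordcount");
            JavaSparkContext sc = new JavaSparkContext(conf);

            sc.textFile("D:/tools/data/wordcount/wordcount.txt")
              .flatMap(s -> Arrays.asList(s.split(" ")).iterator()) // line -> words
              .mapToPair(s -> new Tuple2<>(s, 1))                   // word -> (word, 1)
              .reduceByKey((a, b) -> a + b)                         // sum the 1s per word
              .foreach(t -> System.out.println(t._1 + " : " + t._2));

            sc.close();
        }
    }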

    If running it fails with an error

    Add this line to the code; any spot works, as long as it runs before the JavaSparkContext is initialized:

    System.setProperty("hadoop.home.dir", "D:/tools/spark-2.0.0-bin-hadoop2.6");

    Extract hadoop2.6(x64)工具.zip into the D:\tools\spark-2.0.0-bin-hadoop2.6\bin directory
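    In context, the property sits at the top of main, before any Spark object is created; a minimal sketch assuming the directory layout above (on Windows, hadoop.home.dir is where Hadoop looks for helper binaries such as bin\winutils.exe, which is the usual cause of this error):

    public static void main(String[] args) {
        // Set before any Spark/Hadoop code runs, so Hadoop can locate
        // %hadoop.home.dir%\bin\winutils.exe on Windows
        System.setProperty("hadoop.home.dir", "D:/tools/spark-2.0.0-bin-hadoop2.6");

        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("wordcount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... the rest of the job is unchanged ...
    }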

    Python WordCount

    Integrate the Python plugin into Eclipse

    Extract pydev.zip and copy the contents of its features and plugins folders into the corresponding Eclipse directories

    # -*- coding: utf-8 -*-

    from __future__ import print_function
    from operator import add
    import os
    from pyspark.context import SparkContext
    '''
    wordcount
    '''
    if __name__ == "__main__":
        os.environ["HADOOP_HOME"] = "D:/tools/spark-2.0.0-bin-hadoop2.6"
        sc = SparkContext()
        lines = sc.textFile("file:///D:/tools/data/wordcount/wordcount.txt").map(lambda r: r[0:])
        # Note the backslash continuations: without them the chained
        # .map/.reduceByKey lines are a syntax error in Python
        counts = lines.flatMap(lambda x: x.split(' ')) \
                      .map(lambda x: (x, 1)) \
                      .reduceByKey(add)
        output = counts.collect()
        for (word, count) in output:
            print("%s: %i" % (word, count))
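    As a quick sanity check: if wordcount.txt contained just the line "hello world hello" (a made-up example), the loop above would print, in no guaranteed order:

    hello: 2
    world: 1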

    Submit the code to run on the cluster

    Java:

    [hadoop@master application]$ spark-submit --master spark://master:7077 --class com.bean.spark.wordcount.WordCount spark.jar

    Python:

    [hadoop@master application]$ spark-submit --master spark://master:7077 wordcount.py
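    One caveat when submitting to the cluster: a master set in code with conf.setMaster("local") takes precedence over the --master flag, and the D:/ paths only exist on the Windows development machine. For a cluster run, the relevant lines would change roughly as follows (the HDFS paths are illustrative):

    // Let spark-submit's --master flag decide where the job runs
    SparkConf conf = new SparkConf();
    conf.setAppName("wordcount");

    // Read the input from (and write the output to) HDFS instead of D:/
    JavaRDD<String> lines = sc.textFile("hdfs://master:9000/data/wordcount/wordcount.txt");
    counts.saveAsTextFile("hdfs://master:9000/data/wordcount/output");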

  • Original article: https://www.cnblogs.com/LgyBean/p/6251344.html