在IDEA中编写Spark的WordCount程序

1：spark shell仅在测试和验证我们的程序时使用的较多，在生产环境中，通常会在IDE中编制程序，然后打成jar包，然后提交到集群，最常用的是创建一个Maven项目，利用Maven来管理jar包的依赖。

2：配置Maven的pom.xml：

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.bie</groupId>
    <artifactId>sparkWordCount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.10.6</scala.version>
        <scala.compat.version>2.10</scala.compat.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.2</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-make:transitive</arg>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.bie.WordCount</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

注意：配置好pom.xml以后，点击Enable Auto-Import即可;

3：将src/main/java和src/test/java分别修改成src/main/scala和src/test/scala，与pom.xml中的配置保持一致（）；

4：新建一个scala class，类型为Object，然后编写spark程序，如下所示：

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {
    //创建SparkConf()并且设置App的名称
    val conf = new SparkConf().setAppName("wordCount");
    //创建SparkContext,该对象是提交spark app的入口
    val sc = new SparkContext(conf);
    //使用sc创建rdd,并且执行相应的transformation和action
    sc.textFile(args(0)).flatMap(_.split(" ")).map((_ ,1)).reduceByKey(_ + _,1).sortBy(_._2,false).saveAsTextFile(args(1));
    //停止sc，结束该任务
    sc.stop();
  }
}

5：使用Maven打包：首先修改pom.xml中的mainClass，使其和自己的类路径对应起来：

然后，点击idea右侧的Maven Project选项，点击Lifecycle,选择clean和package，然后点击Run Maven Build：

等待编译完成，选择编译成功的jar包，并将该jar上传到Spark集群中的某个节点上：

记得，启动你的hdfs和Spark集群，然后使用spark-submit命令提交Spark应用（注意参数的顺序）：

可以看下简单的几行代码，但是打成的包就将近百兆，都是封装好的啊，感觉牛人太多了。

然后开始进行Spark Submit提交操作，命令如下所示：

[root@master spark-1.6.1-bin-hadoop2.6]# bin/spark-submit 
> --class com.bie.WordCount 
> --master spark://master:7077 
> --executor-memory 512M 
> --total-executor-cores 2 
> /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar 
> hdfs://master:9000/wordcount.txt 
> hdfs://master:9000/output

或者如下：
bin/spark-submit --class com.bie.WordCount --master spark://master:7077 --executor-memory 512M --total-executor-cores 2 /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount.txt hdfs://master:9000/outpu

操作如下所示：

可以在图形化页面看到多了一个Application：

然后呢，就出错了，学知识，不出点错，感觉都不正常：

  1 org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
  2     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
  3     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
  4     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
  5     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
  6     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
  7     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
  8     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
  9     at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.removeExecutor(CoarseGrainedSchedulerBackend.scala:359)
 10     at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.executorRemoved(SparkDeploySchedulerBackend.scala:144)
 11     at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:186)
 12     at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
 13     at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
 14     at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
 15     at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
 16     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 17     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 18     at java.lang.Thread.run(Thread.java:745)
 19 Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
 20     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 21     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 22     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 23     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 24     at scala.concurrent.Await$.result(package.scala:107)
 25     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
 26     ... 12 more
 27 18/02/23 01:28:46 WARN NettyRpcEndpointRef: Error sending message [message = UpdateBlockInfo(BlockManagerId(driver, 192.168.3.129, 60565),broadcast_1_piece0,StorageLevel(false, true, false, false, 1),2358,0,0)] in 1 attempts
 28 org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
 29     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
 30     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
 31     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
 32     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
 33     at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
 34     at scala.util.Try$.apply(Try.scala:161)
 35     at scala.util.Failure.recover(Try.scala:185)
 36     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
 37     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
 38     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 39     at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
 40     at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
 41     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
 42     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
 43     at scala.concurrent.Promise$class.complete(Promise.scala:55)
 44     at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
 45     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
 46     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
 47     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 48     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
 49     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
 50     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
 51     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
 52     at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 53     at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
 54     at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
 55     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
 56     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
 57     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
 58     at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
 59     at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
 60     at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
 61     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 62     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 63     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
 64     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
 65     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 66     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 67     at java.lang.Thread.run(Thread.java:745)
 68 Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
 69     at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
 70     ... 7 more
 71 
 72 
 73 18/02/23 01:30:48 WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(0,Command exited with code 1)] in 2 attempts
 74 org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
 75     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
 76     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
 77     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
 78     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
 79     at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
 80     at scala.util.Try$.apply(Try.scala:161)
 81     at scala.util.Failure.recover(Try.scala:185)
 82     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
 83     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
 84     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 85     at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
 86     at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
 87     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
 88     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
 89     at scala.concurrent.Promise$class.complete(Promise.scala:55)
 90     at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
 91     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
 92     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
 93     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 94     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
 95     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
 96     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
 97     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
 98     at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 99     at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
100     at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
101     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
102     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
103     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
104     at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
105     at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
106     at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
107     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
108     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
109     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
110     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
111     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
112     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
113     at java.lang.Thread.run(Thread.java:745)
114 Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
115     at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
116     ... 7 more
117 18/02/23 01:30:49 WARN NettyRpcEndpointRef: Error sending message [message = UpdateBlockInfo(BlockManagerId(driver, 192.168.3.129, 60565),broadcast_1_piece0,StorageLevel(false, true, false, false, 1),2358,0,0)] in 2 attempts
118 org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
119     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
120     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
121     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
122     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
123     at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
124     at scala.util.Try$.apply(Try.scala:161)
125     at scala.util.Failure.recover(Try.scala:185)
126     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
127     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
128     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
129     at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
130     at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
131     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
132     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
133     at scala.concurrent.Promise$class.complete(Promise.scala:55)
134     at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
135     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
136     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
137     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
138     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
139     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
140     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
141     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
142     at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
143     at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
144     at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
145     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
146     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
147     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
148     at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
149     at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
150     at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
151     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
152     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
153     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
154     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
155     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
156     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
157     at java.lang.Thread.run(Thread.java:745)
158 Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
159     at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
160     ... 7 more

解决思路，百度了一下，也没缕出思路，就只知道是连接超时了，超过了120s，然后呢，我感觉是自己内存设置小了，因为开的虚拟机，主机8G，三台虚拟机，每台分了1G内存，然后设置Spark可以占用800M，跑程序的时候，第一次设置为512M，就连接超时了，第二次设置为了700M，顺利跑完，可以看看跑的过程，还是很有意思的：

  1 [root@master hadoop]# bin/spark-submit --class com.bie.WordCount --master spark://master:7077 --executor-memory 700M --total-executor-cores 2 /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount.txt hdfs://master:9000/output
  2 bash: bin/spark-submit: No such file or directory
  3 [root@master hadoop]# cd spark-1.6.1-bin-hadoop2.6/
  4 [root@master spark-1.6.1-bin-hadoop2.6]# bin/spark-submit --class com.bie.WordCount --master spark://master:7077 --executor-memory 700M --total-executor-cores 2 /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount.txt hdfs://master:9000/output
  5 18/02/23 01:45:46 INFO SparkContext: Running Spark version 1.6.1
  6 18/02/23 01:45:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  7 18/02/23 01:45:47 INFO SecurityManager: Changing view acls to: root
  8 18/02/23 01:45:47 INFO SecurityManager: Changing modify acls to: root
  9 18/02/23 01:45:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
 10 18/02/23 01:45:48 INFO Utils: Successfully started service 'sparkDriver' on port 60097.
 11 18/02/23 01:45:49 INFO Slf4jLogger: Slf4jLogger started
 12 18/02/23 01:45:49 INFO Remoting: Starting remoting
 13 18/02/23 01:45:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.3.129:55353]
 14 18/02/23 01:45:49 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 55353.
 15 18/02/23 01:45:49 INFO SparkEnv: Registering MapOutputTracker
 16 18/02/23 01:45:49 INFO SparkEnv: Registering BlockManagerMaster
 17 18/02/23 01:45:49 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-17e41a67-b880-4c06-95eb-a0f64928f668
 18 18/02/23 01:45:49 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
 19 18/02/23 01:45:49 INFO SparkEnv: Registering OutputCommitCoordinator
 20 18/02/23 01:45:50 INFO Utils: Successfully started service 'SparkUI' on port 4040.
 21 18/02/23 01:45:50 INFO SparkUI: Started SparkUI at http://192.168.3.129:4040
 22 18/02/23 01:45:50 INFO HttpFileServer: HTTP File server directory is /tmp/spark-99c897ab-ea17-4797-8695-3a5df89ed490/httpd-f346e1dd-642d-437d-8947-6190f2e83065
 23 18/02/23 01:45:50 INFO HttpServer: Starting HTTP Server
 24 18/02/23 01:45:50 INFO Utils: Successfully started service 'HTTP file server' on port 35900.
 25 18/02/23 01:45:51 INFO SparkContext: Added JAR file:/home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar at http://192.168.3.129:35900/jars/sparkWordCount-1.0-SNAPSHOT.jar with timestamp 1519379151547
 26 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Connecting to master spark://master:7077...
 27 18/02/23 01:45:52 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20180223014552-0004
 28 18/02/23 01:45:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37128.
 29 18/02/23 01:45:52 INFO NettyBlockTransferService: Server created on 37128
 30 18/02/23 01:45:52 INFO BlockManagerMaster: Trying to register BlockManager
 31 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Executor added: app-20180223014552-0004/0 on worker-20180223110813-192.168.3.131-55934 (192.168.3.131:55934) with 1 cores
 32 18/02/23 01:45:52 INFO SparkDeploySchedulerBackend: Granted executor ID app-20180223014552-0004/0 on hostPort 192.168.3.131:55934 with 1 cores, 700.0 MB RAM
 33 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Executor added: app-20180223014552-0004/1 on worker-20180223110811-192.168.3.130-40991 (192.168.3.130:40991) with 1 cores
 34 18/02/23 01:45:52 INFO SparkDeploySchedulerBackend: Granted executor ID app-20180223014552-0004/1 on hostPort 192.168.3.130:40991 with 1 cores, 700.0 MB RAM
 35 18/02/23 01:45:52 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.3.129:37128 with 517.4 MB RAM, BlockManagerId(driver, 192.168.3.129, 37128)
 36 18/02/23 01:45:52 INFO BlockManagerMaster: Registered BlockManager
 37 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Executor updated: app-20180223014552-0004/1 is now RUNNING
 38 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Executor updated: app-20180223014552-0004/0 is now RUNNING
 39 18/02/23 01:45:53 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
 40 18/02/23 01:45:53 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
 41 18/02/23 01:45:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 146.7 KB, free 146.7 KB)
 42 18/02/23 01:45:53 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 160.6 KB)
 43 18/02/23 01:45:53 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.3.129:37128 (size: 13.9 KB, free: 517.4 MB)
 44 18/02/23 01:45:53 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:13
 45 Java HotSpot(TM) Client VM warning: You have loaded library /tmp/libnetty-transport-native-epoll4006421548933729587.so which might have disabled stack guard. The VM will try to fix the stack guard now.
 46 It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
 47 18/02/23 01:45:55 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slaver1:60082) with ID 1
 48 18/02/23 01:45:55 INFO BlockManagerMasterEndpoint: Registering block manager slaver1:56633 with 282.5 MB RAM, BlockManagerId(1, slaver1, 56633)
 49 18/02/23 01:45:56 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
 50 18/02/23 01:45:56 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
 51 18/02/23 01:45:56 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
 52 18/02/23 01:45:56 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
 53 18/02/23 01:45:56 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
 54 18/02/23 01:45:57 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slaver2:46908) with ID 0
 55 18/02/23 01:45:57 INFO SparkContext: Starting job: saveAsTextFile at WordCount.scala:13
 56 18/02/23 01:45:58 INFO FileInputFormat: Total input paths to process : 1
 57 18/02/23 01:45:58 INFO BlockManagerMasterEndpoint: Registering block manager slaver2:36572 with 282.5 MB RAM, BlockManagerId(0, slaver2, 36572)
 58 18/02/23 01:45:58 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:13)
 59 18/02/23 01:45:58 INFO DAGScheduler: Registering RDD 5 (sortBy at WordCount.scala:13)
 60 18/02/23 01:45:58 INFO DAGScheduler: Got job 0 (saveAsTextFile at WordCount.scala:13) with 1 output partitions
 61 18/02/23 01:45:58 INFO DAGScheduler: Final stage: ResultStage 2 (saveAsTextFile at WordCount.scala:13)
 62 18/02/23 01:45:58 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
 63 18/02/23 01:45:58 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
 64 18/02/23 01:45:58 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:13), which has no missing parents
 65 18/02/23 01:45:59 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 164.7 KB)
 66 18/02/23 01:45:59 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 167.0 KB)
 67 18/02/23 01:45:59 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.3.129:37128 (size: 2.3 KB, free: 517.4 MB)
 68 18/02/23 01:45:59 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
 69 18/02/23 01:45:59 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:13)
 70 18/02/23 01:45:59 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
 71 18/02/23 01:45:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slaver2, partition 0,NODE_LOCAL, 2196 bytes)
 72 18/02/23 01:45:59 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slaver1, partition 1,NODE_LOCAL, 2196 bytes)
 73 18/02/23 01:47:19 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on slaver1:56633 (size: 2.3 KB, free: 282.5 MB)
 74 18/02/23 01:47:21 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on slaver1:56633 (size: 13.9 KB, free: 282.5 MB)
 75 18/02/23 01:47:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on slaver2:36572 (size: 2.3 KB, free: 282.5 MB)
 76 18/02/23 01:47:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on slaver2:36572 (size: 13.9 KB, free: 282.5 MB)
 77 18/02/23 01:47:36 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 97388 ms on slaver1 (1/2)
 78 18/02/23 01:48:19 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:13) finished in 140.390 s
 79 18/02/23 01:48:19 INFO DAGScheduler: looking for newly runnable stages
 80 18/02/23 01:48:19 INFO DAGScheduler: running: Set()
 81 18/02/23 01:48:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 140387 ms on slaver2 (2/2)
 82 18/02/23 01:48:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
 83 18/02/23 01:48:19 INFO DAGScheduler: waiting: Set(ShuffleMapStage 1, ResultStage 2)
 84 18/02/23 01:48:19 INFO DAGScheduler: failed: Set()
 85 18/02/23 01:48:19 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[5] at sortBy at WordCount.scala:13), which has no missing parents
 86 18/02/23 01:48:19 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.5 KB, free 170.5 KB)
 87 18/02/23 01:48:19 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 172.5 KB)
 88 18/02/23 01:48:19 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.3.129:37128 (size: 2.0 KB, free: 517.4 MB)
 89 18/02/23 01:48:19 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
 90 18/02/23 01:48:19 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[5] at sortBy at WordCount.scala:13)
 91 18/02/23 01:48:19 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
 92 18/02/23 01:48:19 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, slaver2, partition 0,NODE_LOCAL, 1956 bytes)
 93 18/02/23 01:48:19 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on slaver2:36572 (size: 2.0 KB, free: 282.5 MB)
 94 18/02/23 01:48:19 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to slaver2:46908
 95 18/02/23 01:48:19 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 156 bytes
 96 18/02/23 01:48:20 INFO DAGScheduler: ShuffleMapStage 1 (sortBy at WordCount.scala:13) finished in 0.708 s
 97 18/02/23 01:48:20 INFO DAGScheduler: looking for newly runnable stages
 98 18/02/23 01:48:20 INFO DAGScheduler: running: Set()
 99 18/02/23 01:48:20 INFO DAGScheduler: waiting: Set(ResultStage 2)
100 18/02/23 01:48:20 INFO DAGScheduler: failed: Set()
101 18/02/23 01:48:20 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[8] at saveAsTextFile at WordCount.scala:13), which has no missing parents
102 18/02/23 01:48:20 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 711 ms on slaver2 (1/1)
103 18/02/23 01:48:20 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
104 18/02/23 01:48:20 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 65.1 KB, free 237.6 KB)
105 18/02/23 01:48:20 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 22.6 KB, free 260.2 KB)
106 18/02/23 01:48:20 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.3.129:37128 (size: 22.6 KB, free: 517.4 MB)
107 18/02/23 01:48:20 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
108 18/02/23 01:48:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[8] at saveAsTextFile at WordCount.scala:13)
109 18/02/23 01:48:20 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
110 18/02/23 01:48:20 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, slaver2, partition 0,NODE_LOCAL, 1967 bytes)
111 18/02/23 01:48:20 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on slaver2:36572 (size: 22.6 KB, free: 282.5 MB)
112 18/02/23 01:48:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slaver2:46908
113 18/02/23 01:48:20 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 136 bytes
114 18/02/23 01:48:25 INFO DAGScheduler: ResultStage 2 (saveAsTextFile at WordCount.scala:13) finished in 5.008 s
115 18/02/23 01:48:25 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 5012 ms on slaver2 (1/1)
116 18/02/23 01:48:25 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
117 18/02/23 01:48:25 INFO DAGScheduler: Job 0 finished: saveAsTextFile at WordCount.scala:13, took 147.737606 s
118 18/02/23 01:48:26 INFO SparkUI: Stopped Spark web UI at http://192.168.3.129:4040
119 18/02/23 01:48:26 INFO SparkDeploySchedulerBackend: Shutting down all executors
120 18/02/23 01:48:26 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
121 18/02/23 01:48:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
122 18/02/23 01:48:26 INFO MemoryStore: MemoryStore cleared
123 18/02/23 01:48:26 INFO BlockManager: BlockManager stopped
124 18/02/23 01:48:26 INFO BlockManagerMaster: BlockManagerMaster stopped
125 18/02/23 01:48:26 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
126 18/02/23 01:48:26 INFO SparkContext: Successfully stopped SparkContext
127 18/02/23 01:48:26 INFO ShutdownHookManager: Shutdown hook called
128 18/02/23 01:48:26 INFO ShutdownHookManager: Deleting directory /tmp/spark-99c897ab-ea17-4797-8695-3a5df89ed490
129 18/02/23 01:48:26 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
130 18/02/23 01:48:26 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
131 18/02/23 01:48:26 INFO ShutdownHookManager: Deleting directory /tmp/spark-99c897ab-ea17-4797-8695-3a5df89ed490/httpd-f346e1dd-642d-437d-8947-6190f2e83065
132 [root@master spark-1.6.1-bin-hadoop2.6]#

最后查看执行结果即可（由于第一次跑失败了，作为强迫症的我就把第一次的输出结果文件删除了）：

相关阅读:
Windows10安装Oracle19c数据库详细记录(图文详解)
构建 FTP 文件传输服务器
 构建 Samba 文件共享服务器
 Linux磁盘配额（xfs）
Linux配置磁盘配额（ext4）
Linux 制作ISO镜像
 Linux磁盘分区
 用户和文件权限管理命令的使用（实验）
CentOS7 配置yum源
 VMware CentOS 7 安装（vmware的版本15.5）
原文地址：https://www.cnblogs.com/biehongli/p/8462625.html