• Developing Spark Applications in Standalone Mode


    In this blog's earlier post "Quick Start Spark", I briefly showed how to use the API interactively from the Spark shell. This article shows how to use the API Spark provides to quickly develop standalone-mode applications. Spark applications can be written in three languages: Scala (built with SBT), Java (built with Maven), and Python. Below I implement the same program in Scala, Java, and Python.

    I. Scala version

    The program, which counts how many lines of README.md contain the letter "a" and how many contain "b", is as follows:

    package scala

    /**
     * User: 过往记忆 (iteblog)
     * Date: 14-6-10
     * Time: 11:37 PM
     * Blog of 过往记忆, focused on Hadoop, Hive, Spark, Shark and Flume
     * WeChat public account: iteblog_hadoop
     */
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf

    object Test {
      def main(args: Array[String]) {
        val logFile = "file:///spark-bin-0.9.1/README.md"
        val conf = new SparkConf().setAppName("Spark Application in Scala")
        val sc = new SparkContext(conf)
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }
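
    The program above does not call setMaster, so the master URL is supplied later on the spark-submit command line (see section IV). If you would rather run it straight from an IDE or a plain JVM launcher, a minimal sketch is to hard-code the master in the SparkConf; the object name TestLocal and the local[4] value below are my own assumptions for local testing, not part of the original program:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: the same line-counting job, runnable without spark-submit because the
    // master is set in code. "local[4]" means four worker threads on the local machine.
    object TestLocal {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setAppName("Spark Application in Scala")
          .setMaster("local[4]")   // assumption: local mode with 4 threads
        val sc = new SparkContext(conf)
        val logData = sc.textFile("file:///spark-bin-0.9.1/README.md", 2).cache()
        println("Lines with a: " + logData.filter(_.contains("a")).count())
        sc.stop()
      }
    }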

    To build this program you need an xxx.sbt file, which plays a role similar to a pom.xml file. Here we create one named scala.sbt with the following content:

    name := "Spark application in Scala"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
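
    For sbt/sbt package to pick up both files, scala.sbt sits at the project root and the source goes under sbt's default source directory. A layout like the following should work (the exact layout used by the original author is an assumption; sbt also compiles .scala files placed directly in the project root):

    ./scala.sbt
    ./src/main/scala/Test.scala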

    Build it:

    # sbt/sbt package
    [info] Done packaging.
    [success] Total time: 270 s, completed Jun 11, 2014 1:05:54 AM

    II. Java version

    /**
     * User: 过往记忆 (iteblog)
     * Date: 14-6-10
     * Time: 11:37 PM
     * Blog of 过往记忆, focused on Hadoop, Hive, Spark, Shark and Flume
     * WeChat public account: iteblog_hadoop
     */
    /* SimpleApp.java */
    import org.apache.spark.api.java.*;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function;

    public class SimpleApp {
        public static void main(String[] args) {
            String logFile = "file:///spark-bin-0.9.1/README.md";
            SparkConf conf = new SparkConf().setAppName("Spark Application in Java");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> logData = sc.textFile(logFile).cache();

            long numAs = logData.filter(new Function<String, Boolean>() {
                public Boolean call(String s) { return s.contains("a"); }
            }).count();

            long numBs = logData.filter(new Function<String, Boolean>() {
                public Boolean call(String s) { return s.contains("b"); }
            }).count();

            System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        }
    }

    This program counts the number of lines in README.md that contain "a" and the number that contain "b". The project's pom.xml is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                                 http://maven.apache.org/xsd/maven-4.0.0.xsd">

        <modelVersion>4.0.0</modelVersion>

        <groupId>spark</groupId>
        <artifactId>spark</artifactId>
        <version>1.0</version>

        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.10</artifactId>
                <version>1.0.0</version>
            </dependency>
        </dependencies>
    </project>
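
    Like the sbt project, the Maven project follows the standard layout, with pom.xml at the project root and the source under src/main/java (again, this layout is an assumption based on Maven's defaults rather than something stated in the original post):

    ./pom.xml
    ./src/main/java/SimpleApp.java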

    Build the project with Maven:

    # mvn install
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 5.815s
    [INFO] Finished at: Wed Jun 11 00:01:57 CST 2014
    [INFO] Final Memory: 13M/32M
    [INFO] ------------------------------------------------------------------------

    III. Python version

    #
    # User: 过往记忆 (iteblog)
    # Date: 14-6-10
    # Time: 11:37 PM
    # Blog: http://www.iteblog.com
    # Original post: http://www.iteblog.com/archives/1041
    # Blog of 过往记忆, focused on Hadoop, Hive, Spark, Shark and Flume
    # WeChat public account: iteblog_hadoop
    #
    from pyspark import SparkContext

    logFile = "file:///spark-bin-0.9.1/README.md"  # same README.md as the Scala and Java versions
    sc = SparkContext("local", "Spark Application in Python")
    logData = sc.textFile(logFile).cache()

    numAs = logData.filter(lambda s: 'a' in s).count()
    numBs = logData.filter(lambda s: 'b' in s).count()

    print "Lines with a: %i, lines with b: %i" % (numAs, numBs)

    IV. Testing and running

    The programs were run against Spark 1.0.0 in local (single-machine) mode. The tests are shown below.
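
    All three programs are launched through bin/spark-submit. For reference, its general form is roughly the following (a sketch; less common options are omitted, and for the Python version the .py script takes the place of the jar and --class is not needed):

    # bin/spark-submit --class <main-class> --master <master-url> <application-jar | script.py> [app arguments]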

    1. Testing the Scala program

    # bin/spark-submit --class "scala.Test" \
                       --master local[4]    \
                       target/scala-2.10/simple-project_2.10-1.0.jar

    14/06/11 01:07:53 INFO spark.SparkContext: Job finished:
    count at Test.scala:18, took 0.019705 s
    Lines with a: 62, Lines with b: 35

    2. Testing the Java program

    # bin/spark-submit --class "SimpleApp" \
                       --master local[4]   \
                       target/spark-1.0-SNAPSHOT.jar

    14/06/11 00:49:14 INFO spark.SparkContext: Job finished:
    count at SimpleApp.java:22, took 0.019374 s
    Lines with a: 62, lines with b: 35

    3. Testing the Python program

    # bin/spark-submit --master local[4] \
                       simple.py

    Lines with a: 62, lines with b: 35

    This article, "Developing Spark Applications in Standalone Mode" (http://www.iteblog.com/archives/1041), was originally published on 过往记忆 (http://www.iteblog.com/), a blog of original articles on Hadoop, Spark and related topics; unless otherwise stated, all posts there are original work.

    Please credit the original when reposting: reposted from 过往记忆 (http://www.iteblog.com/)
    Original link: "Developing Spark Applications in Standalone Mode" (http://www.iteblog.com/archives/1041)
    E-mail: wyphao.2007@163.com
