Step 1: Create the user uspark
root@hadoop1:~# adduser uspark
Adding user `uspark' ...
Adding new group `uspark' (1002) ...
Adding new user `uspark' (1002) with group `uspark' ...
Creating home directory `/home/uspark' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for uspark
Enter the new value, or press ENTER for the default
        Full Name []:
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
Is the information correct? [Y/n] y
root@hadoop1:~#
Step 2: Configure the Java environment variables
uspark@hadoop:~$ java -version
The program 'java' can be found in the following packages:
 * default-jre
 * gcj-4.8-jre-headless
 * openjdk-7-jre-headless
 * gcj-4.6-jre-headless
 * openjdk-6-jre-headless
Ask your administrator to install one of them
uspark@hadoop:~$ vi .bashrc
Append the following at the end of the .bashrc file:
#set Java Environment
export JAVA_HOME=/home/uspark/jdk1.8.0_60
export CLASSPATH=".:$JAVA_HOME/lib/rt.jar:$JAVA_HOME/lib/tools.jar:$CLASSPATH"
export PATH="$JAVA_HOME/bin:$PATH"
uspark@hadoop:~$ source .bashrc
uspark@hadoop:~$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) Server VM (build 25.60-b23, mixed mode)
uspark@hadoop:~$
Step 3: Download Spark
Copy the download link from the Spark downloads page.
When the download finishes, extract the archive:
tar xf spark-1.5.1-bin-hadoop2.6.tgz
uspark@liuhy:~$ cd spark-1.5.1-bin-hadoop2.6/
uspark@liuhy:~/spark-1.5.1-bin-hadoop2.6$ ll
total 1068
drwxr-xr-x 2 uspark uspark   4096 Oct  8 05:13 bin/
-rw-r--r-- 1 uspark uspark 960539 Oct  8 05:13 CHANGES.txt
drwxr-xr-x 2 uspark uspark   4096 Oct  8 05:13 conf/
drwxr-xr-x 3 uspark uspark   4096 Oct  8 05:12 data/
-rw-rw-r-- 1 uspark uspark    747 Oct  8 05:23 derby.log
drwxr-xr-x 3 uspark uspark   4096 Oct  8 05:12 ec2/
drwxr-xr-x 3 uspark uspark   4096 Oct  8 05:13 examples/
drwxr-xr-x 2 uspark uspark   4096 Oct  8 05:12 lib/
-rw-r--r-- 1 uspark uspark  50972 Oct  8 05:12 LICENSE
drwxrwxr-x 5 uspark uspark   4096 Oct  8 05:23 metastore_db/
-rw-r--r-- 1 uspark uspark  22559 Oct  8 05:12 NOTICE
drwxr-xr-x 6 uspark uspark   4096 Oct  8 05:12 python/
drwxr-xr-x 3 uspark uspark   4096 Oct  8 05:12 R/
-rw-r--r-- 1 uspark uspark   3593 Oct  8 05:12 README.md
-rw-r--r-- 1 uspark uspark    120 Oct  8 05:12 RELEASE
drwxr-xr-x 2 uspark uspark   4096 Oct  8 05:12 sbin/
uspark@liuhy:~/spark-1.5.1-bin-hadoop2.6$
Step 4: Interactive Analysis with the Spark Shell
uspark@liuhy:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) Client VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/08 05:23:26 WARN Utils: Your hostname, liuhy resolves to a loopback address: 127.0.1.1; using 192.168.1.112 instead (on interface eth0)
15/10/08 05:23:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/10/08 05:23:30 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
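The banner above suggests sc.setLogLevel for adjusting verbosity. As a small aside that is not part of the original session, the shell can be quieted before running anything; "WARN" here is just an example level:

// Raise the log threshold so only warnings and errors reach the console.
// Any standard log4j level name ("INFO", "WARN", "ERROR") works here.
sc.setLogLevel("WARN")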
scala> val tf = sc.textFile("README.md")
tf: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21
scala> tf.count
count   countApprox   countApproxDistinct   countByValue   countByValueApprox
scala> tf.count
res2: Long = 98
scala>
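Note that count is an action, so it actually runs a job; transformations such as map are lazy and only describe the computation. The following is a small illustrative sketch (assuming the same tf RDD from above; it is not part of the original transcript) that maps each line to its length and reduces to the longest line:

// map is lazy: nothing executes until the reduce action below is called.
val lineLengths = tf.map(line => line.length)                  // RDD[Int]
val longestLine = lineLengths.reduce((a, b) => math.max(a, b)) // length of the longest line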
scala> val lineWithSpark = tf.filter(_.contains("Spark"))
lineWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:23

scala> lineWithSpark.first
res5: String = # Apache Spark

scala> lineWithSpark.count
count   countApprox   countApproxDistinct   countByValue   countByValueApprox

scala> lineWithSpark.count
res6: Long = 18

scala> lineWithSpark.foreach
foreach   foreachPartition   foreachWith

scala> lineWithSpark.foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
and Spark Streaming for stream processing.
You can find the latest Spark documentation, including a programming
## Building Spark
Spark is built using [Apache Maven](http://maven.apache.org/).
To build Spark and its example programs, run:
["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).
The easiest way to start using Spark is through the Scala shell:
Spark also comes with several sample programs in the `examples` directory.
    ./bin/run-example SparkPi
    MASTER=spark://host:7077 ./bin/run-example SparkPi
Testing first requires [building Spark](#building-spark). Once Spark is built, tests
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
Hadoop, you must build Spark against the same version that your cluster runs.
for guidance on building a Spark application that works with a particular
in the online documentation for an overview on how to configure Spark.

scala>
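The same session extends naturally to the word-count pattern from the Spark quick start. Below is a minimal sketch (assuming the same tf RDD; the take(10) sample size is arbitrary), not output captured from this session:

// Split each line into words, pair each word with 1, and sum the counts per word.
val wordCounts = tf.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)

// Pull a small sample of (word, count) pairs back to the driver and print them.
wordCounts.take(10).foreach(println)

If an RDD such as lineWithSpark will be reused across several actions, calling lineWithSpark.cache() keeps it in memory after the first computation.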
over