With SparkR already configured, this example walks through running a SparkR job in Spark on YARN mode.
The Spark installation ships a sample script at examples/src/main/r/dataframe.R. By default it runs against local data only and does not touch HDFS; with a small modification it can be submitted in YARN mode.
Before submitting, upload ${SPARK_HOME}/examples/src/main/resources/people.json to HDFS. In this example it was uploaded to hdfs://data-mining-cluster/data/bigdata_mining/sparkr/people.json.
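A minimal sketch of that upload step using the standard HDFS CLI, assuming the hdfs client is on the PATH and SPARK_HOME points at the Spark installation (the target path is the one used above and will differ per cluster):

```bash
# Create the target directory and upload the sample JSON file
hdfs dfs -mkdir -p hdfs://data-mining-cluster/data/bigdata_mining/sparkr
hdfs dfs -put ${SPARK_HOME}/examples/src/main/resources/people.json \
    hdfs://data-mining-cluster/data/bigdata_mining/sparkr/people.json

# Confirm the file landed where the script expects it
hdfs dfs -ls hdfs://data-mining-cluster/data/bigdata_mining/sparkr/
```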
Contents of dataframe.R, with the JSON path switched from the local resources file to the copy on HDFS:

```r
library(SparkR)

# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName = "SparkR-DataFrame-example")
sqlContext <- sparkRSQL.init(sc)

# Create a simple local data.frame
localDF <- data.frame(name = c("John", "Smith", "Sarah"), age = c(19, 23, 18))

# Convert local data frame to a SparkR DataFrame
df <- createDataFrame(sqlContext, localDF)

# Print its schema
printSchema(df)
# root
#  |-- name: string (nullable = true)
#  |-- age: double (nullable = true)

# Create a DataFrame from a JSON file (original local path replaced by the HDFS path)
# path <- file.path(Sys.getenv("SPARK_HOME"), "examples/src/main/resources/people.json")
path <- file.path("hdfs://data-mining-cluster/data/bigdata_mining/sparkr/people.json")
peopleDF <- read.json(sqlContext, path)
printSchema(peopleDF)

# Register this DataFrame as a table
registerTempTable(peopleDF, "people")

# SQL statements can be run by using the sql methods provided by sqlContext
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")

# Call collect to get a local data.frame
teenagersLocalDF <- collect(teenagers)

# Print the teenagers in our dataset
print(teenagersLocalDF)

# Stop the SparkContext now
sparkR.stop()
```
Run `sparkR --master yarn-client dataframe.R` and the job is submitted to YARN.
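In yarn-client mode the driver output (including the printed teenagers data.frame) appears in the submitting terminal, while executor logs live on the cluster. A quick way to pull them afterwards with the standard YARN CLI, assuming log aggregation is enabled (the application id below is only a placeholder for whatever YARN assigned to the run):

```bash
# Find the application id of the finished run
yarn application -list -appStates FINISHED,FAILED,KILLED

# Fetch the aggregated container logs for that run
yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX
```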
In addition, some clusters throw the following error:
```
16/06/16 11:40:35 ERROR RBackendHandler: json on 15 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
        at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:185)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:198)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(
Calls: read.json -> callJMethod -> invokeJava
```
This is usually caused by LZO compression being enabled on the cluster while the LZO jar is not on the job's classpath. Adding the jar via --jars fixes it, for example: `sparkR --master yarn-client --jars /usr/local/Hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar dataframe.R`
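If every job on the cluster needs LZO, passing --jars each time gets tedious. A sketch of an alternative using Spark's standard classpath properties, assuming the hadoop-lzo jar is installed at the same local path on every node (adjust the path to your environment):

```bash
# Point driver and executors at the locally installed hadoop-lzo jar
sparkR --master yarn-client \
  --conf spark.driver.extraClassPath=/usr/local/Hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  --conf spark.executor.extraClassPath=/usr/local/Hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  dataframe.R
```

The same two properties can also be set once in conf/spark-defaults.conf so that they apply to every submission.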