Using Checkpoints
What the Checkpoint mechanism saves
- 1. Saves the state information of each batch, e.g., an accumulated sales total
- 2. Saves the offsets of the data read from each Kafka topic
- 3. Saves the DStream lineage: its source, its transformation functions, and its output functions
When to use Checkpoints
- Usage of stateful transformations - if updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, a checkpoint directory must be provided to allow periodic RDD checkpointing (see the sketch after this list).
- Recovery from failures of the driver running the application - metadata checkpoints are used to recover with progress information.
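As a minimal sketch of the first case (the host, port, and checkpoint path here are placeholders; the full runnable example follows below), updateStateByKey keeps a running count per key and fails at startup if no checkpoint directory has been set:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UpdateStateSketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/checkpoint-sketch") // required by updateStateByKey

    // Running word count: merge this batch's counts into the saved state.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .updateStateByKey((batch: Seq[Int], state: Option[Int]) =>
        Some(batch.sum + state.getOrElse(0)))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}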
The following example mainly demonstrates checkpoint-based stateful transformations and recovery with progress information.
Maven dependencies
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.0-cdh6.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.0-cdh6.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.0.0-cdh6.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.0.0-cdh6.3.0</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
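Note that these CDH-versioned artifacts are not published to Maven Central; assuming a standard CDH 6.3.0 setup, the Cloudera repository also needs to be declared in the POM:

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>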
- The cluster has Kerberos authentication enabled, so the code must authenticate against Kerberos first.
The code is as follows:
package util

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

class GetKerberosAuth {
  def getKerberosAuth(): Unit = {
    // Paths to the Kerberos config and keytab (backslashes must be escaped on Windows)
    val KRB5 = "D:\\sparkstreaming\\sparkstreaming_Test\\src\\main\\scala\\kerberos\\krb5.conf"
    val keytab = "D:\\sparkstreaming\\sparkstreaming_Test\\src\\main\\scala\\kerberos\\leayun.keytab"
    System.setProperty("java.security.krb5.conf", KRB5)
    // Tell the Hadoop client that the cluster uses Kerberos
    val conf = new Configuration()
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)
    // Log in with the principal and its keytab
    UserGroupInformation.loginUserFromKeytab("leayun@GREE.IO", keytab)
    println(UserGroupInformation.getLoginUser)
  }
}
The following example processes data read from a socket.
The code is as follows:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, MapWithStateDStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import util.GetKerberosAuth

object SparkStreamingSocket {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.OFF) // set the log level
    new GetKerberosAuth().getKerberosAuth() // Kerberos authentication
    // Recover the context from the checkpoint, or create it if no checkpoint exists yet
    val ssc = StreamingContext.getOrCreate("hdfs://cdh-master02:8020/user/leayun/260212/hdfstest", createFunc _)
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }

  def createFunc(): StreamingContext = {
    val sparkConf = new SparkConf().setAppName("SparkStreamingSocket1")
      .setMaster("local[*]")
    val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(5)) // set the batch interval
    // Store the checkpoint in HDFS; the path must match the one passed to getOrCreate
    ssc.checkpoint("hdfs://cdh-master02:8020/user/leayun/260212/hdfstest")
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("10.2.7.140", 6789)
    val result: DStream[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1))
    val wordCount: MapWithStateDStream[String, Int, Int, (String, Int)] =
      result.mapWithState(StateSpec.function(func)) // use the mapWithState operator to maintain state
    wordCount.print()
    ssc
  }

  /**
   * A custom function that merges each batch into the saved state.
   *
   * @param word   the key
   * @param option the value of the current batch
   * @param state  the state, i.e., the result saved from previous batches
   */
  val func = (word: String, option: Option[Int], state: State[Int]) => {
    println(state)
    val sum = option.getOrElse(0) + state.getOption().getOrElse(0)
    val output = (word, sum)
    state.update(sum)
    output
  }
}
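Since the Maven dependencies above include spark-streaming-kafka-0-10 and point 2 mentions saving Kafka offsets, here is a minimal sketch of the same getOrCreate recovery pattern with a Kafka direct stream (the broker address, topic name, group id, and checkpoint path are hypothetical). When the context is recovered from the checkpoint, consumption resumes from the saved offsets:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object SparkStreamingKafkaSketch {
  val checkpointDir = "hdfs://cdh-master02:8020/user/leayun/260212/kafkatest" // hypothetical path

  def createFunc(): StreamingContext = {
    val conf = new SparkConf().setAppName("SparkStreamingKafkaSketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-broker:9092",        // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "checkpoint-demo",                   // hypothetical group id
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean) // offsets are tracked via the checkpoint
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("test-topic"), kafkaParams))

    stream.map(record => (record.key, record.value)).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, the stream and its Kafka offsets are rebuilt from the checkpoint
    val ssc = StreamingContext.getOrCreate(checkpointDir, createFunc _)
    ssc.start()
    ssc.awaitTermination()
  }
}

One caveat worth remembering: checkpoint-based recovery only works across restarts of the same application code; after upgrading the application, the serialized checkpoint can no longer be deserialized and the checkpoint directory has to be cleared.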