

Spark Streaming: receiving Kafka data and storing it in HBase

This post is largely based on https://yq.aliyun.com/articles/60712, but that article appears to target Spark 1.x; here the code is adapted to the Spark 2.x API (a possible dependency setup is sketched below).
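For completeness, a minimal sbt dependency setup for the Spark 2.x version might look like the sketch below. The artifact names are the standard Spark and HBase coordinates, but the version numbers are only assumptions; match them to the Spark, Kafka, and HBase versions actually installed in your environment.

    // build.sbt -- a minimal sketch; the versions here are assumptions, adjust to your environment
    val sparkVersion = "2.3.0"   // assumed Spark 2.x version
    val hbaseVersion = "1.2.6"   // assumed HBase version

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                 % sparkVersion,
      "org.apache.spark" %% "spark-sql"                  % sparkVersion,
      "org.apache.spark" %% "spark-streaming"            % sparkVersion,
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,  // Kafka 0.10+ direct stream API
      "org.apache.hbase"  % "hbase-client"               % hbaseVersion,
      "org.apache.hbase"  % "hbase-server"               % hbaseVersion   // TableOutputFormat lives here in HBase 1.x
    )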

Producing data into Kafka

Without further ado, straight to the code:

    import java.util.{Properties, UUID}

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer

    import scala.util.Random

    object KafkaProducerTest {
      def main(args: Array[String]): Unit = {
        val rnd = new Random()
        // val topics = "world"
        val topics = "test"
        val brokers = "localhost:9092"
        val props = new Properties()
        props.put("delete.topic.enable", "true")
        props.put("key.serializer", classOf[StringSerializer])
        // props.put("value.serializer", "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.serializer", classOf[StringSerializer])
        props.put("bootstrap.servers", brokers)
        // props.put("batch.num.messages", "10")
        // props.put("queue.buffering.max.messages", "20")
        // linger.ms should be 0~100 ms
        props.put("linger.ms", "50")
        // props.put("block.on.buffer.full", "true")
        // props.put("max.block.ms", "100000")
        // batch.size and buffer.memory should be tuned to the message size and sending speed
        props.put("batch.size", "16384")
        props.put("buffer.memory", "1638400")

        props.put("queue.buffering.max.messages", "1000000")
        props.put("queue.enqueue.timeout.ms", "20000000")
        props.put("producer.type", "sync")

        val producer = new KafkaProducer[String, String](props)
        for (i <- 1001 to 2000) {
          val key = UUID.randomUUID().toString.substring(0, 5)
          val value = "fly_" + i + "_" + key
          producer.send(new ProducerRecord[String, String](topics, key, value)) //.get()
        }

        producer.flush()
        producer.close()
      }
    }

The produced records have the form (key, value) = (uuid, fly_i_key).
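As a concrete (made-up) example of one such record and of how the streaming job in the next section will parse it:

    // A quick illustration with a hypothetical record; the key is a 5-character UUID prefix.
    // (key, value) = ("a1b2c", "fly_1001_a1b2c")
    val fields = "fly_1001_a1b2c".split("_")   // Array("fly", "1001", "a1b2c")
    // The consumer below maps this to Person(name = "fly", age = 1001, key = "a1b2c")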

Reading from Kafka with Spark Streaming and saving to HBase

Once there is data in Kafka, read it with Spark Streaming and write it to HBase. Before running, make sure the target table already exists in HBase (the code below writes to table t1, which has a single column family cf1). Straight to the code:

    import java.util.UUID

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Mutation, Put}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf
    import org.apache.hadoop.mapreduce.OutputFormat
    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    /**
     * Spark Streaming to HBase:
     * read Kafka messages, process them with Spark SQL, and save the result to HBase.
     */
    object OBDSQL {
      case class Person(name: String, age: Int, key: String)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession
          .builder()
          .appName("sparkSql")
          .master("local[4]")
          .getOrCreate()

        val sc = spark.sparkContext

        val ssc = new StreamingContext(sc, Seconds(5))

        val topics = Array("test")
        val kafkaParams = Map(
          "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          // "group.id" -> "use_a_separate_group_id_for_each_stream",
          "group.id" -> "use_a_separate_group_id_for_each_stream_fly",
          // "auto.offset.reset" -> "latest",
          "auto.offset.reset" -> "earliest",
          // "auto.offset.reset" -> "none",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        val lines = KafkaUtils.createDirectStream[String, String](
          ssc,
          PreferConsistent,
          Subscribe[String, String](topics, kafkaParams)
        )

        // lines.map(record => (record.key, record.value)).print()
        // lines.map(record => record.value.split("_")).map(x => (x(0), x(1), x(2))).print()

        lines.foreachRDD((rdd: RDD[ConsumerRecord[String, String]]) => {
          import spark.implicits._
          if (!rdd.isEmpty()) {

            // register a temp table from the parsed records
            rdd.map(_.value.split("_")).map(p => Person(p(0), p(1).trim.toInt, p(2))).toDF.createOrReplaceTempView("temp")

            // use Spark SQL
            val rs = spark.sql("select * from temp")

            // create the HBase configuration
            val hconf = HBaseConfiguration.create
            hconf.set("hbase.zookeeper.quorum", "localhost") // ZooKeeper quorum
            hconf.set("hbase.zookeeper.property.clientPort", "2181")
            hconf.set("hbase.defaults.for.version.skip", "true")
            hconf.set(TableOutputFormat.OUTPUT_TABLE, "t1") // t1 is the table name; it has one column family, cf1
            hconf.setClass("mapreduce.job.outputformat.class", classOf[TableOutputFormat[String]], classOf[OutputFormat[String, Mutation]])
            val jobConf = new JobConf(hconf)

            // convert every row into an HBase Put and write the whole RDD out
            rs.rdd.map(line => {
              val put = new Put(Bytes.toBytes(UUID.randomUUID().toString.substring(0, 9)))
              put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes(line.get(0).toString))
              put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("age"), Bytes.toBytes(line.get(1).toString))
              put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("key"), Bytes.toBytes(line.get(2).toString))
              (new ImmutableBytesWritable, put)
            }).saveAsNewAPIHadoopDataset(jobConf)
          }
        })

        lines.map(record => record.value.split("_")).map(x => (x(0), x(1), x(2))).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }
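Once the job has run, the writes can be checked either with scan 't1' in the HBase shell or by reading the table back into Spark. The snippet below is a minimal sketch of the read-back approach using TableInputFormat, under the same localhost / t1 / cf1 assumptions as the job above; it assumes a spark-shell (or similar) session where a SparkContext named sc and the HBase client jars are available.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    // Point the scan at the same ZooKeeper quorum and table used above.
    val readConf = HBaseConfiguration.create()
    readConf.set("hbase.zookeeper.quorum", "localhost")
    readConf.set("hbase.zookeeper.property.clientPort", "2181")
    readConf.set(TableInputFormat.INPUT_TABLE, "t1")

    // Scan the table into an RDD of (rowkey, Result) pairs.
    val hbaseRDD = sc.newAPIHadoopRDD(
      readConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Pull the three columns written by the streaming job and print a sample.
    hbaseRDD.map { case (_, result) =>
      val name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")))
      val age  = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("age")))
      val key  = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("key")))
      (name, age, key)
    }.take(10).foreach(println)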
Original post: https://www.cnblogs.com/flyu6/p/10191488.html