• RDD repartitioning and sorting with repartitionAndSortWithinPartitions


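    Requirement: place students from the same class in the same partition of the RDD, and sort the records in each partition by score in descending order.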

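    The repartitionAndSortWithinPartitions operator used in this example is recommended by the official Spark documentation: if you need to sort after repartitioning, use repartitionAndSortWithinPartitions directly instead of calling repartition and then sorting. The operator pushes the sort down into the shuffle machinery, so records are sorted as part of the shuffle rather than in a separate pass. This usually performs better than shuffling first and sorting afterwards.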
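
To make the intended result concrete, here is a minimal Spark-free sketch on local collections. The object name RepartitionSortModel and the method repartitionAndSort are hypothetical names chosen for illustration, and the class id is used directly as the partition key. It models only the outcome (one bucket per class, scores descending inside each bucket), not how Spark performs the sort during the shuffle.

```scala
object RepartitionSortModel {
  // Local model of the operator's result: bucket records by partition key
  // (here simply the class id), then order each bucket by score, highest first.
  // Assumption: this mirrors the outcome only, not Spark's shuffle internals.
  def repartitionAndSort(records: Seq[(String, Int)]): Map[String, Seq[(String, Int)]] =
    records
      .groupBy { case (grade, _) => grade }
      .map { case (grade, rows) => grade -> rows.sortBy { case (_, score) => -score } }
}
```

Calling RepartitionSortModel.repartitionAndSort(Seq(("c001", 59), ("c002", 79), ("c001", 88), ("c002", 13))) gives one entry per class, with ("c001", 88) ahead of ("c001", 59) and ("c002", 79) ahead of ("c002", 13).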

    import org.apache.spark.{SparkContext, SparkConf}
    
    /**
     * Created by sunxufeng on 2016/6/18.
     */
    class Student {
    
    }
    
    
    // Key class: the composite key is (grade, score)
    case class StudentKey(grade:String,score:Int)
    //  extends Ordered[StudentKey]{
    //  def compare(that: StudentKey) : Int = {
    //    var result:Int = this.grade.compareTo(that.grade)
    //    if (result == 0){
    //      result = this.student.compareTo(that.student)
    //      if(result ==0){
    //        result = that.score.compareTo(this.score)
    //      }
    //    }
    //    result
    //  }
    //}
    
    object StudentKey {
      implicit def orderingByGradeStudentScore[A <: StudentKey] : Ordering[A] = {
    //    Ordering.by(fk => (fk.grade, fk.student, fk.score * -1))
        Ordering.by(fk => (fk.grade, fk.score * -1))
      }
    }
    
    object Student{
      def main(args: Array[String]): Unit = {
    
    
    
        // Field indexes within each comma-separated line of the HDFS file
        val grade_idx:Int=0
        val student_idx:Int=1
        val course_idx:Int=2
        val score_idx:Int=3
    
        // Conversion helper: values that cannot be parsed as Int fall back to 0
        def safeInt(s: String): Int = try { s.toInt } catch { case _: NumberFormatException => 0 }
    
        // Build the composite key from a parsed line
        def createKey(data: Array[String]):StudentKey={
          StudentKey(data(grade_idx),safeInt(data(score_idx)))
        }
    
        // Build the value (all four fields) from a parsed line
        def listData(data: Array[String]):List[String]={
          List(data(grade_idx),data(student_idx),data(course_idx),data(score_idx))
        }
        
        def createKeyValueTuple(data: Array[String]) :(StudentKey,List[String]) = {
          (createKey(data),listData(data))
        }
    
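        // Custom partitioner: routes records by grade only, so each class lands in one partition
        import org.apache.spark.Partitioner
        class StudentPartitioner(partitions: Int) extends Partitioner {
          require(partitions > 0, s"Number of partitions ($partitions) must be positive.")

          override def numPartitions: Int = partitions

          override def getPartition(key: Any): Int = {
            val k = key.asInstanceOf[StudentKey]
            // hashCode can be negative and % keeps the sign, so shift into [0, numPartitions)
            val mod = k.grade.hashCode() % numPartitions
            if (mod < 0) mod + numPartitions else mod
          }
        }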
    
        // Use a local master for debugging on a single machine
        val conf = new SparkConf().setAppName("Student_partition_sort").setMaster("local")
        val sc = new SparkContext(conf)
    
        // Student records, deliberately out of order
        val student_array =Array(
          "c001,n003,chinese,59",
          "c002,n004,english,79",
          "c002,n004,chinese,13",
          "c001,n001,english,88",
          "c001,n002,chinese,10",
          "c002,n006,chinese,29",
          "c001,n001,chinese,54",
          "c001,n002,english,32",
          "c001,n003,english,43",
          "c002,n005,english,80",
          "c002,n005,chinese,48",
          "c002,n006,english,69"
          )
        // Parallelize the student records into an RDD
        val student_rdd = sc.parallelize(student_array)
        // Build the key-value RDD
        val student_rdd2 = student_rdd.map(line => line.split(",")).map(createKeyValueTuple)
        // Partition by the grade in StudentKey and sort by score in descending order
        val student_rdd3 = student_rdd2.repartitionAndSortWithinPartitions(new StudentPartitioner(10))
        // Print the data
        student_rdd3.collect.foreach(println)
      }
    }
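
A note on hash-based partitioners of this kind: String.hashCode can be negative, and Scala's % keeps the sign of the dividend, while getPartition must return an index between 0 and numPartitions - 1. The sketch below shows one way to map any hash into that range. PartitionIndex and nonNegativeMod are hypothetical names used only for illustration, and the helper assumes numPartitions is positive.

```scala
object PartitionIndex {
  // Map any Int hash to an index in [0, numPartitions).
  // Assumption: numPartitions > 0 (a zero divisor would throw ArithmeticException).
  def nonNegativeMod(hash: Int, numPartitions: Int): Int = {
    val raw = hash % numPartitions // may be negative when hash is negative
    if (raw < 0) raw + numPartitions else raw
  }
}
```

For example, -7 % 10 evaluates to -7, whereas PartitionIndex.nonNegativeMod(-7, 10) returns 3. With 10 partitions, the sample class ids "c001" and "c002" hash to partitions 4 and 5 respectively.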

     Sorted data:

    (StudentKey(c001,88),List(c001, n001, english, 88))
    (StudentKey(c001,59),List(c001, n003, chinese, 59))
    (StudentKey(c001,54),List(c001, n001, chinese, 54))
    (StudentKey(c001,43),List(c001, n003, english, 43))
    (StudentKey(c001,32),List(c001, n002, english, 32))
    (StudentKey(c001,10),List(c001, n002, chinese, 10))
    (StudentKey(c002,80),List(c002, n005, english, 80))
    (StudentKey(c002,79),List(c002, n004, english, 79))
    (StudentKey(c002,69),List(c002, n006, english, 69))
    (StudentKey(c002,48),List(c002, n005, chinese, 48))
    (StudentKey(c002,29),List(c002, n006, chinese, 29))
    (StudentKey(c002,13),List(c002, n004, chinese, 13))
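
The ordering in the output above can be checked without a Spark cluster. The following is a minimal sketch that sorts the sample (grade, score) pairs locally with an implicit Ordering of the same shape as the one in the article (grade ascending, score descending). SecondarySortCheck is a hypothetical wrapper object, and this is a single local sort over all keys; Spark sorts per partition, which gives the same sequence here because each class sits in its own partition.

```scala
object SecondarySortCheck {
  // Same shape as the article's key: class id plus score.
  case class StudentKey(grade: String, score: Int)

  // Grade ascending, then score descending (negating the score flips its order).
  implicit val gradeThenScoreDesc: Ordering[StudentKey] =
    Ordering.by((k: StudentKey) => (k.grade, -k.score))

  // The (grade, score) pairs from the sample input, in input order.
  val keys: List[StudentKey] = List(
    ("c001", 59), ("c002", 79), ("c002", 13), ("c001", 88),
    ("c001", 10), ("c002", 29), ("c001", 54), ("c001", 32),
    ("c001", 43), ("c002", 80), ("c002", 48), ("c002", 69)
  ).map { case (g, s) => StudentKey(g, s) }

  // Sorting picks up the implicit Ordering defined above.
  val sorted: List[StudentKey] = keys.sorted
}
```

SecondarySortCheck.sorted lists the c001 keys with scores 88, 59, 54, 43, 32, 10, followed by the c002 keys with scores 80, 79, 69, 48, 29, 13, matching the key order printed by the Spark job.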

     

    Reference: http://codingjunkie.net/spark-secondary-sort/

  • Original article: https://www.cnblogs.com/suinlove/p/5594754.html