• Spark Study Notes 1


    Spark-core

    Spark Overview

    • What is Spark

      • Spark is a fast, general-purpose, scalable, in-memory engine for big-data analytics and computation
    • Spark and Hadoop

      • The fundamental difference between Spark and Hadoop is how data is passed between jobs: Spark passes data between jobs through memory, while Hadoop goes through disk

      • From the comparison above, Spark is indeed better suited than MapReduce for the vast majority of data-processing scenarios. However, Spark is memory-based, so in a real production environment a job may fail because memory resources are insufficient; in that case MapReduce is actually the better choice, which is why Spark cannot completely replace MR

    • Spark Core Modules

      Spark Quick Start

      • Add the dependencies (one import covers the dependencies of all the later modules)

        • <?xml version="1.0" encoding="UTF-8"?>
          <project xmlns="http://maven.apache.org/POM/4.0.0"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
              <modelVersion>4.0.0</modelVersion>
          
              <groupId>com.lotuslaw</groupId>
              <artifactId>spark-core</artifactId>
              <version>1.0-SNAPSHOT</version>
          
              <properties>
                  <maven.compiler.source>8</maven.compiler.source>
                  <maven.compiler.target>8</maven.compiler.target>
              </properties>
          
              <dependencies>
                  <dependency>
                      <groupId>org.apache.spark</groupId>
                      <artifactId>spark-core_2.12</artifactId>
                      <version>3.0.0</version>
                  </dependency>
          
                  <dependency>
                      <groupId>org.apache.spark</groupId>
                      <artifactId>spark-sql_2.12</artifactId>
                      <version>3.0.0</version>
                  </dependency>
          
                  <dependency>
                      <groupId>mysql</groupId>
                      <artifactId>mysql-connector-java</artifactId>
                      <version>5.1.27</version>
                  </dependency>
          
                  <dependency>
                      <groupId>org.apache.spark</groupId>
                      <artifactId>spark-hive_2.12</artifactId>
                      <version>3.0.0</version>
                  </dependency>
          
                  <dependency>
                      <groupId>org.apache.hive</groupId>
                      <artifactId>hive-exec</artifactId>
                      <version>1.2.1</version>
                  </dependency>
          
                  <dependency>
                      <groupId>org.apache.spark</groupId>
                      <artifactId>spark-streaming_2.12</artifactId>
                      <version>3.0.0</version>
                  </dependency>
          
                  <dependency>
                      <groupId>org.apache.spark</groupId>
                      <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
                      <version>3.0.0</version>
                  </dependency>
                  <dependency>
                      <groupId>com.fasterxml.jackson.core</groupId>
                      <artifactId>jackson-core</artifactId>
                      <version>2.10.1</version>
                  </dependency>
          
                  <dependency>
                      <groupId>com.alibaba</groupId>
                      <artifactId>druid</artifactId>
                      <version>1.1.10</version>
                  </dependency>
          
              </dependencies>
              <build>
                  <plugins>
                        <!-- This plugin compiles Scala code into class files -->
                      <plugin>
                          <groupId>net.alchim31.maven</groupId>
                          <artifactId>scala-maven-plugin</artifactId>
                          <version>3.2.2</version>
                          <executions>
                              <execution>
                                <!-- Declare the binding to Maven's compile phase -->
                                  <goals>
                                      <goal>testCompile</goal>
                                  </goals>
                              </execution>
                          </executions>
                      </plugin>
                      <plugin>
                          <groupId>org.apache.maven.plugins</groupId>
                          <artifactId>maven-assembly-plugin</artifactId>
                          <version>3.1.0</version>
                          <configuration>
                              <descriptorRefs>
                                  <descriptorRef>jar-with-dependencies</descriptorRef>
                              </descriptorRefs>
                          </configuration>
                          <executions>
                              <execution>
                                  <id>make-assembly</id>
                                  <phase>package</phase>
                                  <goals>
                                      <goal>single</goal>
                                  </goals>
                              </execution>
                          </executions>
                      </plugin>
                  </plugins>
              </build>
          </project>
          
      • log4j.properties file

        • log4j.rootCategory=ERROR, console
          log4j.appender.console=org.apache.log4j.ConsoleAppender
          log4j.appender.console.target=System.err
          log4j.appender.console.layout=org.apache.log4j.PatternLayout
          log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
          # Set the default spark-shell log level to ERROR. When running the spark-shell, the
          # log level for this class is used to overwrite the root logger's log level, so that
          # the user can have different defaults for the shell and regular Spark apps.
          log4j.logger.org.apache.spark.repl.Main=ERROR
          # Settings to quiet third party logs that are too verbose
          log4j.logger.org.spark_project.jetty=ERROR
          log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
          log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
          log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
          log4j.logger.org.apache.parquet=ERROR
          log4j.logger.parquet=ERROR
          # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
          log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
          log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
          
      • WordCount

        • package com.lotuslaw.spark.core.wc
          
          import org.apache.spark.{SparkConf, SparkContext}
          
          /**
           * @author: lotuslaw
           * @version: V1.0
           * @package: com.lotuslaw.spark.core.wc
           * @create: 2021-12-02 10:08
           * @description:
           */
          object Spark01_WordCount {
          
            def main(args: Array[String]): Unit = {
              // Create the Spark configuration object
              val sparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")

              // Create the Spark context (the connection object)
              val sc = new SparkContext(sparkConf)

              // Read the file data
              val fileRDD = sc.textFile("input/word.txt")

              // Split the file content into words
              val wordRDD = fileRDD.flatMap(_.split(" "))

              // Convert the data structure: word => (word, 1)
              val word2OneRDD = wordRDD.map((_, 1))

              // Group and aggregate the converted data by word
              val word2CountRDD = word2OneRDD.reduceByKey(_ + _)

              // Collect the aggregated result into driver memory
              val word2Count = word2CountRDD.collect()

              // Print the result
              word2Count.foreach(println)

              // Close the connection
              sc.stop()
          
            }
          
          }
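        • The same logic can also be written as one operator chain; the sketch below is an equivalent, more compact version of the program above (illustrative object name, same input/word.txt file assumed):

          package com.lotuslaw.spark.core.wc

          import org.apache.spark.{SparkConf, SparkContext}

          object Spark02_WordCount_Chained {

            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(
                new SparkConf().setMaster("local[*]").setAppName("WordCount"))

              // Read, split, map to (word, 1), aggregate by key and collect, all in one chain
              sc.textFile("input/word.txt")
                .flatMap(_.split(" "))
                .map((_, 1))
                .reduceByKey(_ + _)
                .collect()
                .foreach(println)

              sc.stop()
            }

          }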
          

    Spark Runtime Environments

    Standalone HA with ZooKeeper (edit conf/spark-env.sh):
    Comment out the following lines:
    #SPARK_MASTER_HOST=linux1
    #SPARK_MASTER_PORT=7077
    Add the following lines:
    # The Master web UI defaults to port 8080, which may conflict with Zookeeper, so change it to 8989
    # (or any other custom port); use this port when opening the UI monitoring page
    SPARK_MASTER_WEBUI_PORT=8989
    export SPARK_DAEMON_JAVA_OPTS="
    -Dspark.deploy.recoveryMode=ZOOKEEPER
    -Dspark.deploy.zookeeper.url=linux1,linux2,linux3
    -Dspark.deploy.zookeeper.dir=/spark"
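
    In this ZooKeeper-based HA mode an application can list every candidate Master in its master URL, so it fails over automatically when the active Master changes. A minimal sketch (illustrative object name, assuming hosts linux1/linux2 and the default master port 7077):

    import org.apache.spark.{SparkConf, SparkContext}

    object Spark_StandaloneHA_Client {

      def main(args: Array[String]): Unit = {
        // List every candidate Master; the client connects to whichever one is currently active
        val conf = new SparkConf()
          .setMaster("spark://linux1:7077,linux2:7077")
          .setAppName("HA-Demo")
        val sc = new SparkContext(conf)

        println(sc.makeRDD(1 to 10).sum())

        sc.stop()
      }

    }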
    

    Spark Runtime Architecture

    Spark Core Programming

    • Creating RDDs

      • Creating from a collection (in memory)

        • package com.lotuslaw.spark.core.rdd.builder
          
          import org.apache.spark.{SparkConf, SparkContext}
          
          /**
           * @author: lotuslaw
           * @version: V1.0
           * @package: com.lotuslaw.spark.core.rdd
           * @create: 2021-12-02 11:26
           * @description:
           */
          object Spark01_RDD_Memory {
          
            def main(args: Array[String]): Unit = {
          
              val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
          
              val sparkContext = new SparkContext(sparkConf)
          
              val rdd = sparkContext.makeRDD(List(1, 2, 3, 4))
              rdd.collect().foreach(println)
          
              sparkContext.stop()
            }
          
          }
          
      • Creating an RDD from external storage (files)

        • package com.lotuslaw.spark.core.rdd.builder
          
          import org.apache.spark.{SparkConf, SparkContext}
          
          /**
           * @author: lotuslaw
           * @version: V1.0
           * @package: com.lotuslaw.spark.core.rdd
           * @create: 2021-12-02 11:26
           * @description:
           */
          object Spark01_RDD_File {
          
            def main(args: Array[String]): Unit = {
          
              val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
          
              val sparkContext = new SparkContext(sparkConf)
          
              val rdd = sparkContext.textFile("input")
              rdd.collect().foreach(println)
          
              sparkContext.stop()
            }
          
          }
          
      • Creating from another RDD (a transformation on an existing RDD; see the sketch below)

      • Creating an RDD directly with new (mainly used inside Spark itself; see the sketch below)
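        • A short sketch covering the last two cases: a new RDD is normally derived from an existing RDD through a transformation, while constructing an RDD with new directly is mostly done inside the framework and rarely in user code (illustrative object name):

          package com.lotuslaw.spark.core.rdd.builder

          import org.apache.spark.{SparkConf, SparkContext}

          object Spark01_RDD_FromRDD {

            def main(args: Array[String]): Unit = {

              val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")

              val sparkContext = new SparkContext(sparkConf)

              val sourceRDD = sparkContext.makeRDD(List(1, 2, 3, 4))

              // Every transformation returns a new RDD derived from the existing one
              val derivedRDD = sourceRDD.map(_ * 2)
              derivedRDD.collect().foreach(println)

              sparkContext.stop()
            }

          }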

    • RDD parallelism and partitioning

      • package com.lotuslaw.spark.core.rdd.builder
        
        import org.apache.spark.{SparkConf, SparkContext}
        
        /**
         * @author: lotuslaw
         * @version: V1.0
         * @package: com.lotuslaw.spark.core.rdd
         * @create: 2021-12-02 11:26
         * @description:
         */
        object Spark01_RDD_Memory2 {
        
          def main(args: Array[String]): Unit = {
        
            val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
        
            sparkConf.set("spark.default.parallelism", "5")
        
            val sc = new SparkContext(sparkConf)
        
            // RDD parallelism & partitioning
            /*
            makeRDD accepts a second parameter that specifies the number of partitions.
            If it is omitted, makeRDD falls back to the default value: defaultParallelism, i.e.
            scheduler.conf.getInt("spark.default.parallelism", totalCores)
            By default Spark reads the configuration parameter spark.default.parallelism from the config object;
            if it is not set, Spark uses totalCores, the maximum number of cores available in the current environment.
             */
        
            val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
            rdd.saveAsTextFile("output")
        
            sc.stop()
          }
        
        }
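      • To verify the parallelism rules described in the comments above, the partition count of an RDD can be inspected directly; a small sketch (illustrative object name):

        package com.lotuslaw.spark.core.rdd.builder

        import org.apache.spark.{SparkConf, SparkContext}

        object Spark01_RDD_Memory_Partitions {

          def main(args: Array[String]): Unit = {

            val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
            sparkConf.set("spark.default.parallelism", "5")

            val sc = new SparkContext(sparkConf)

            // No partition count given => defaultParallelism is used (5 here, because of the setting above)
            val rdd1 = sc.makeRDD(List(1, 2, 3, 4))
            println(rdd1.getNumPartitions)

            // An explicit second argument takes precedence
            val rdd2 = sc.makeRDD(List(1, 2, 3, 4), 2)
            println(rdd2.getNumPartitions)

            sc.stop()
          }

        }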
        
      • package com.lotuslaw.spark.core.rdd.builder
        
        import org.apache.spark.{SparkConf, SparkContext}
        
        /**
         * @author: lotuslaw
         * @version: V1.0
         * @package: com.lotuslaw.spark.core.rdd
         * @create: 2021-12-02 11:40
         * @description:
         */
        object Spark01_RDD_Memory_File {
        
          def main(args: Array[String]): Unit = {
        
            val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
        
            val sc = new SparkContext(sparkConf)
        
            val dataRDD = sc.makeRDD(List(1, 2, 3, 4), 4)

            // textFile's second parameter is only the minimum number of partitions;
            // the actual number is decided by Hadoop's file-split rules, so it can be larger
            val fileRDD = sc.textFile("input", 2)
        
            fileRDD.collect().foreach(println)
        
            sc.stop()
          }
        
        }
        
    • RDD transformation operators

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Map {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val dataRDD = sc.makeRDD(List(1, 2, 3, 4))
        val dataRDD1 = dataRDD.map(
          num => {
            num * 2
          }
        )
        val dataRDD2 = dataRDD1.map(", " + _)
    
        dataRDD2.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    
    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Map_Test {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.textFile("input/apache.log")
    
        val mapRDD = rdd.map(
          line => {
            val datas = line.split(" ")
            datas(6)
          }
        )
    
        mapRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_MapPartitions {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val dataRDD = sc.makeRDD(List(1, 2, 3, 4))
    
        val data1RDD = dataRDD.mapPartitions(
          datas => {
            datas.filter(_ == 2)
          }
        )
    
        data1RDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
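
    Because mapPartitions receives a whole partition as a single iterator, per-partition aggregations are easy to express. A sketch (illustrative object name) that returns the maximum value of each partition; with the 2 partitions below the expected output is 2 and 4:

    package com.lotuslaw.spark.core.rdd.operator.transform

    import org.apache.spark.{SparkConf, SparkContext}

    object Spark_RDD_Operator_Transform_MapPartitions_Max {

      def main(args: Array[String]): Unit = {

        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")

        val sc = new SparkContext(sparkConf)

        val dataRDD = sc.makeRDD(List(1, 2, 3, 4), 2)

        // One value per partition: the iterator covers the whole partition at once
        val maxRDD = dataRDD.mapPartitions(
          iter => List(iter.max).iterator
        )

        maxRDD.collect().foreach(println)

        sc.stop()

      }

    }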
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_MapPartitionsWithIndex {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val dataRDD = sc.makeRDD(List(1, 2, 3, 4), 2)
    
        val dataRDD1 = dataRDD.mapPartitionsWithIndex(
          (index, iter) => {
            if (index == 1) {
              iter
            } else {
              Nil.iterator
            }
          }
        )
    
        dataRDD1.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_FlatMap {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val dataRDD = sc.makeRDD(List(
          List(1, 2), 3, List(3, 4)
        ))
    
        /*
        val dataRDD1 = dataRDD.flatMap(
          data => {
            data match {
              case list: List[_] => list
              case data_tmp => List(data_tmp)
            }
          }
        )
         */
    
        val dataRDD1 = dataRDD.flatMap {
          case list: List[_] => list
          case data_tmp => List(data_tmp)
        }
    
        println(dataRDD1.collect().toList.mkString(", "))
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Glom {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    
        val glomRDD = rdd.glom()
    
        glomRDD.collect().foreach(data => println(data.mkString(", ")))
    
        sc.stop()
    
      }
    
    }
    
    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Glom_Test {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    
        val glomRDD = rdd.glom()
    
        val maxRDD = glomRDD.map(
          array => {
            array.max
          }
        )
    
        println(maxRDD.collect().sum)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_GroupBy {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    
        val groupRDD = rdd.groupBy(_ % 2)
    
        groupRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    import java.text.SimpleDateFormat
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Filter {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val dataRDD = sc.makeRDD(List(1, 2, 3, 4), 1)
    
        val dataRDD1 = dataRDD.filter(_ % 2 == 0)
    
        dataRDD1.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    
    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    import java.text.SimpleDateFormat
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Filter_Test {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.textFile("input/apache.log")
    
        rdd.filter(
          line => {
            val datas = line.split(" ")
            val time = datas(3)
            time.startsWith("17/05/2015")
          }
        ).collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Sample {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    
        // Sampling without replacement: each element is kept with probability 0.4 (seed = 42)
        rdd.sample(false, 0.4, 42).collect().foreach(println)

        // Sampling with replacement: each element is drawn 2 times on average (seed = 42)
        rdd.sample(true, 2, 42).collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Distinct {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4, 1, 2, 3, 4))
    
        val rdd1 = rdd.distinct()
    
        rdd1.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Coalesce {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // By default coalesce does not shuffle the data when merging partitions,
        // so shrinking the partition count this way can leave the data unbalanced (data skew)
        // To rebalance the data, enable the shuffle (second argument = true)
        val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 3)
    //    val newRDD = rdd.coalesce(2)
        val newRDD = rdd.coalesce(2, true)
        newRDD.saveAsTextFile("output")
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Repartition {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // Shrinking partitions: coalesce; enable shuffle if the data should stay balanced
        // Increasing partitions: repartition; under the hood it calls coalesce with shuffle always enabled
        val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2)
    
        val newRDD = rdd.repartition(3)
    
        newRDD.saveAsTextFile("output")
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_SortBy {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(6, 3, 4, 2, 5, 1), 2)
    
        val newRDD = rdd.sortBy(num => num)
    
        newRDD.saveAsTextFile("output")
    
        sc.stop()
    
      }
    
    }
    
    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_SortBy2 {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // sortBy sorts the data by the given key function; ascending by default, and the second argument switches the order
        // sortBy keeps the number of partitions unchanged, but it does involve a shuffle
    
        val rdd = sc.makeRDD(List(("1", 1), ("11", 2), ("2", 3)), 2)
    
        val newRDD = rdd.sortBy(t => t._1.toInt, false)
    
        newRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Intersection_And {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // Intersection, union and subtract require the two data sources to have the same element type
        // zip allows the two data sources to have different element types
    
        val rdd1 = sc.makeRDD(List(1, 2, 3, 4))
        val rdd2 = sc.makeRDD(List(3, 4, 5, 6))
        val rdd3 = sc.makeRDD(List("1", "2", "3", "4"))
    
        val rdd4 = rdd1.intersection(rdd2)
        println(rdd4.collect().mkString(", "))
    
        val rdd5 = rdd1.union(rdd2)
        println(rdd5.collect().mkString(", "))
    
        val rdd6 = rdd1.subtract(rdd2)
        println(rdd6.collect().mkString(", "))
    
        val rdd7 = rdd1.zip(rdd3)
        println(rdd7.collect().mkString(", "))
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_PartitionBy {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    
        val mapRDD = rdd.map((_, 1))
    
        // RDD => PairRDDFunctions
        // implicit conversion (resolved by the compiler)
    
        val newRDD = mapRDD.partitionBy(new HashPartitioner(2))
    
        newRDD.saveAsTextFile("output")
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_ReduceByKey {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // reduceByKey: aggregates the values of entries that share the same key
        // In Scala, aggregation is generally pairwise (two values at a time); Spark is built on Scala, so its aggregation is pairwise as well
        val rdd = sc.makeRDD(List(
          ("a", 1), ("a", 2), ("a", 3), ("b", 4)
        ))
        val reduceRDD = rdd.reduceByKey((x, y) => {
          println(s"x = ${x}, y = ${y}")
          x + y
        })
    
        reduceRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_GroupByKey {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(
          ("a", 1), ("a", 2), ("a", 3), ("b", 4)
        ))
        val groupRDD = rdd.groupByKey()
    
        groupRDD.collect().foreach(println)
    
    
        sc.stop()
    
      }
    
    }
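
    For comparison with reduceByKey above: the same per-key sum can be computed with groupByKey followed by mapValues(_.sum), but groupByKey shuffles every value while reduceByKey pre-aggregates inside each partition before the shuffle, so reduceByKey is usually more efficient. A sketch (illustrative object name):

    package com.lotuslaw.spark.core.rdd.operator.transform

    import org.apache.spark.{SparkConf, SparkContext}

    object Spark_RDD_Operator_Transform_GroupByKey_Sum {

      def main(args: Array[String]): Unit = {

        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")

        val sc = new SparkContext(sparkConf)

        val rdd = sc.makeRDD(List(
          ("a", 1), ("a", 2), ("a", 3), ("b", 4)
        ))

        // Same result as reduceByKey(_ + _), but without map-side pre-aggregation
        val sumRDD = rdd.groupByKey().mapValues(_.sum)

        sumRDD.collect().foreach(println)

        sc.stop()

      }

    }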
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_AggregateByKey {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // aggregateByKey is curried and has two parameter lists
        // The first parameter list takes one argument: the initial (zero) value,
        //       used for the intra-partition computation when a key is seen for the first time
        // The second parameter list takes two arguments:
        //      the first is the intra-partition aggregation rule
        //      the second is the inter-partition aggregation rule
        val rdd = sc.makeRDD(List(
          ("a", 1), ("a", 2), ("a", 3), ("a", 4)
        ), 2)
    
        rdd.aggregateByKey(0)(
          (x, y) => math.max(x, y),
          (x, y) => x + y
        ).collect().foreach(println)
    
    
        sc.stop()
    
      }
    
    }
    
    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_AggregateByKey_Test {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(
          ("a", 1), ("a", 2), ("b", 3),
          ("b", 4), ("b", 5), ("a", 6)
        ), 2)
    
        val newRDD = rdd.aggregateByKey((0, 0))(
          (t, v) => {
            (t._1 + v, t._2 + 1)
          },
          (t1, t2) => {
            (t1._1 + t2._1, t1._2 + t2._2)
          }
        )
        val resRDD = newRDD.mapValues {
          case (num, cnt) => {
            num / cnt
          }
        }
    
        resRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_FoldByKey {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(
          ("a", 1), ("a", 2), ("b", 3),
          ("b", 4), ("b", 5), ("a", 6)
        ), 2)
    
        rdd.foldByKey(0)(_+_).collect().foreach(println)
    
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_CombineByKey {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // combineByKey takes three parameters:
        // the first converts the first value of each key into the desired structure
        // the second is the intra-partition aggregation rule
        // the third is the inter-partition aggregation rule
        val rdd = sc.makeRDD(List(
          ("a", 1), ("a", 2), ("b", 3),
          ("b", 4), ("b", 5), ("a", 6)
        ),2)
    
        val newRDD = rdd.combineByKey(
          v => (v, 1),
          (t: (Int, Int), v) => {
            (t._1 + v, t._2 + 1)
          },
          (t1: (Int, Int), t2: (Int, Int)) => {
            (t1._1 + t2._1, t1._2 + t2._2)
          }
        )
        newRDD.mapValues{
          case (num, cnt) => {
            num / cnt
          }
        }.collect().foreach(println)
    
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_SortByKey {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(("a", 1), ("b", 2), ("c", 0)))
        val sortRDD = rdd.sortByKey(false)
        sortRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Join {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // join: for two data sources, values that share the same key are connected into a tuple
        //       keys that have no match in the other data source do not appear in the result
        //       if a key occurs multiple times in both sources, every combination is matched; this can
        //       produce a Cartesian product, the data volume grows geometrically and performance drops
    
        val rdd1 = sc.makeRDD(List(
          ("a", 1), ("a", 2), ("c", 3)
        ))
    
        val rdd2 = sc.makeRDD(List(
          ("a", 5), ("c", 6), ("a", 4)
        ))
        val joinRDD = rdd1.join(rdd2)
        joinRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_LeftOuterJoin {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // leftOuterJoin: similar to a SQL left outer join
        //        every key of the left data source appears in the result; keys with no match
        //        in the right data source get None as the right-hand value

        val rdd1 = sc.makeRDD(List(
          ("a", 1), ("a", 2) //("c", 3)
        ))
    
        val rdd2 = sc.makeRDD(List(
          ("a", 5), ("c", 6), ("a", 4)
        ))
        val lojRDD = rdd1.leftOuterJoin(rdd2)
        lojRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_CoGroup {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // cogroup: connect + group (group by key first, then connect the groups)
        val rdd1 = sc.makeRDD(List(
          ("a", 1), ("b", 2)
        ))
    
        val rdd2 = sc.makeRDD(List(
          ("a", 4), ("b", 5), ("c", 6), ("c", 7)
        ))
    
        val cgRDD = rdd1.cogroup(rdd2)
    
        cgRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.transform
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Transform_Test {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val dataRDD = sc.textFile("input/agent.log")
    
        val mapRDD = dataRDD.map(
          line => {
            val datas = line.split(" ")
            ((datas(1), datas(4)), 1)
          }
        )
    
        val reduceRDD = mapRDD.reduceByKey(_ + _)
    
        val newMapRDD = reduceRDD.map {
          case ((prv, ad), sum) => {
            (prv, (ad, sum))
          }
        }
    
        val groupRDD = newMapRDD.groupByKey()
    
        val resRDD = groupRDD.mapValues(
          iter => {
            iter.toList.sortBy(_._2)(Ordering.Int.reverse).take(3)
          }
        )
    
        resRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_Reduce {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4))
    
        val reduceRes = rdd.reduce(_ + _)
    
        println(reduceRes)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_Collect {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4))
    
        val collectRes = rdd.collect()
    
        collectRes.foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_Count {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4))
    
        val countRes = rdd.count()
    
        println(countRes)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_First {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4))
    
        val firstRes = rdd.first()
    
        println(firstRes)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_Take {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4))
    
        val takeRes = rdd.take(3)
    
        println(takeRes.mkString(", "))
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_TakeOrdered {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 4, 3))
    
        val takeORes = rdd.takeOrdered(3)
    
        println(takeORes.mkString(", "))
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_Aggregate {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // aggregateByKey: the initial value only takes part in the intra-partition computation
        // aggregate: the initial value takes part in the intra-partition computation AND in the inter-partition computation

        val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

        // With 2 partitions: (10 + 1 + 2) + (10 + 3 + 4) + 10 = 40
        val aggRes = rdd.aggregate(10)(_ + _, _ + _)

        println(aggRes)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_fold {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // fold: a simplified aggregate for the case where the intra-partition and
        // inter-partition rules are the same; the initial value participates in both

        val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

        // With 2 partitions: (10 + 1 + 2) + (10 + 3 + 4) + 10 = 40
        val foldRes = rdd.fold(10)(_ + _)
    
        println(foldRes)
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_CountByKey {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(
          (1, "a"), (1, "a"), (1, "a"),
          (2, "b"), (3, "c"), (3, "c")
        ))
    
        val cbkRes = rdd.countByKey()
    
        println(cbkRes)
    
        sc.stop()
    
      }
    
    }
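
    A closely related action is countByValue, which counts how often each complete element (here each (key, value) pair) occurs; a sketch (illustrative object name):

    package com.lotuslaw.spark.core.rdd.operator.actor

    import org.apache.spark.{SparkConf, SparkContext}

    object Spark_RDD_Operator_Action_CountByValue {

      def main(args: Array[String]): Unit = {

        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")

        val sc = new SparkContext(sparkConf)

        val rdd = sc.makeRDD(List(
          (1, "a"), (1, "a"), (1, "a"),
          (2, "b"), (3, "c"), (3, "c")
        ))

        // countByValue treats the whole (key, value) pair as the element being counted
        val cbvRes = rdd.countByValue()

        println(cbvRes)

        sc.stop()

      }

    }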
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_Save {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        // each (1, "a") pair is treated as one value
        val rdd = sc.makeRDD(List(
          (1, "a"), (1, "a"), (1, "a"),
          (2, "b"), (3, "c"), (3, "c")
        ))
    
        rdd.saveAsTextFile("output")
        rdd.saveAsObjectFile("output")
        // saveAsSequenceFile requires the data to be in key-value format
        rdd.saveAsSequenceFile("output2")
    
        sc.stop()
    
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.operator.actor
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.operator.transform.action.transform
     * @create: 2021-12-02 11:46
     * @description:
     */
    object Spark_RDD_Operator_Action_ForEach {
    
      def main(args: Array[String]): Unit = {
    
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    
        val sc = new SparkContext(sparkConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4))
    
        // collect().foreach here is iteration over an in-memory collection on the Driver side
        rdd.collect().foreach(println)
        println("******************************")
        // rdd.foreach(println) prints the data on the Executor side
        rdd.foreach(println)
    
        sc.stop()
    
      }
    
    }
    

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object serializable02_function {
      def main(args: Array[String]): Unit = {
        // 1. Create a SparkConf and set the app name
        val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
        // 2. Create the SparkContext, the entry point for submitting a Spark app
        val sc: SparkContext = new SparkContext(conf)
        // 3. Create an RDD
        val rdd: RDD[String] = sc.makeRDD(Array("hello world", "hello spark", "hive", "atguigu"))
        // 3.1 Create a Search object
        val search = new Search("hello")
        // 3.2 Passing a method; without "extends Serializable" this prints: ERROR Task not serializable
        search.getMatch1(rdd).collect().foreach(println)
        // 3.3 Passing a field; without "extends Serializable" this prints: ERROR Task not serializable
        search.getMatch2(rdd).collect().foreach(println)
        // 4. Close the connection
        sc.stop()
      }
    }

    class Search(query: String) extends Serializable {
      def isMatch(s: String): Boolean = {
        s.contains(query)
      }

      // Method (function) serialization case
      def getMatch1(rdd: RDD[String]): RDD[String] = {
        //rdd.filter(this.isMatch)
        rdd.filter(isMatch)
      }

      // Field (property) serialization case
      def getMatch2(rdd: RDD[String]): RDD[String] = {
        //rdd.filter(x => x.contains(this.query))
        rdd.filter(x => x.contains(query))
        //val q = query
        //rdd.filter(x => x.contains(q))
      }
    }
    

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object serializable_Kryo {
      def main(args: Array[String]): Unit = {
        val conf: SparkConf = new SparkConf()
          .setAppName("SerDemo")
          .setMaster("local[*]")
          // Replace the default serialization mechanism
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // Register the custom classes that should be serialized with Kryo
          .registerKryoClasses(Array(classOf[Searcher]))
        val sc = new SparkContext(conf)
        val rdd: RDD[String] = sc.makeRDD(Array("hello world", "hello atguigu", "atguigu", "hahah"), 2)
        val searcher = new Searcher("hello")
        val result: RDD[String] = searcher.getMatchedRDD1(rdd)
        result.collect.foreach(println)
        sc.stop()
      }
    }

    case class Searcher(query: String) {
      def isMatch(s: String) = {
        s.contains(query)
      }

      def getMatchedRDD1(rdd: RDD[String]) = {
        rdd.filter(isMatch)
      }

      def getMatchedRDD2(rdd: RDD[String]) = {
        val q = query
        rdd.filter(_.contains(q))
      }
    }
    

    package com.lotuslaw.spark.core.rdd.dep
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.dep
     * @create: 2021-12-02 14:35
     * @description:
     */
    object Spark_RDD_Dependency {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("WordCount")
        val sc = new SparkContext(sparConf)
    
        val lines = sc.textFile("input/word.txt")
        println(lines.toDebugString)
        println("************************")
        val words = lines.flatMap(_.split(" "))
        println(words.toDebugString)
        println("************************")
        val wordToOne = words.map((_, 1))
        println(wordToOne.toDebugString)
        println("************************")
        val wordCount = wordToOne.reduceByKey(_ + _)
        println(wordCount.toDebugString)
        println("************************")
        val array = wordCount.collect()
        array.foreach(println)
    
        sc.stop()
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.dep
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.dep
     * @create: 2021-12-02 14:35
     * @description:
     */
    object Spark_RDD_Dependency2 {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("WordCount")
        val sc = new SparkContext(sparConf)
    
        val lines = sc.textFile("input/word.txt")
        println(lines.dependencies)
        println("************************")
        val words = lines.flatMap(_.split(" "))
        println(words.dependencies)
        println("************************")
        val wordToOne = words.map((_, 1))
        println(wordToOne.dependencies)
        println("************************")
        val wordCount = wordToOne.reduceByKey(_ + _)
        println(wordCount.dependencies)
        println("************************")
        val array = wordCount.collect()
        array.foreach(println)
    
        sc.stop()
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.persist
    
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.persist
     * @create: 2021-12-02 14:50
     * @description:
     */
    object Spark_RDD_Persist {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("Persist")
        val sc = new SparkContext(sparConf)
    
        val list = List("Hello Scala", "Hello Spark")
    
        val rdd = sc.makeRDD(list)
    
        val flatRDD = rdd.flatMap(_.split(" "))
    
        val mapRDD = flatRDD.map(word => {
          println("@@@@@@@@@@@@@@")
          (word, 1)
        })
    
        // cache persists to memory only by default; to save to disk files, change the storage level
        //mapRDD.cache()

        // Persistence actually happens only when an action operator is executed.
        mapRDD.persist(StorageLevel.DISK_ONLY)
    
        val reduceRDD = mapRDD.reduceByKey(_ + _)
        reduceRDD.collect().foreach(println)
        println("********************************")
        val groupRDD = mapRDD.groupByKey()
        groupRDD.collect().foreach(println)
    
        sc.stop()
      }
    
    }
    
    package com.lotuslaw.spark.core.rdd.persist
    
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.persist
     * @create: 2021-12-02 14:50
     * @description:
     */
    object Spark_RDD_Persist2 {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("Persist")
        val sc = new SparkContext(sparConf)
        sc.setCheckpointDir("cp")
    
        val list = List("Hello Scala", "Hello Spark")
    
        val rdd = sc.makeRDD(list)
    
        val flatRDD = rdd.flatMap(_.split(" "))
    
        val mapRDD = flatRDD.map(word => {
          println("@@@@@@@@@@@@@@")
          (word, 1)
        })
    
        // checkpoint writes to disk, so a checkpoint directory must be specified
        // the files saved in the checkpoint directory are NOT deleted when the job finishes
        // the directory is usually on a distributed file system such as HDFS
        mapRDD.checkpoint()
    
        val reduceRDD = mapRDD.reduceByKey(_ + _)
        reduceRDD.collect().foreach(println)
        println("********************************")
        val groupRDD = mapRDD.groupByKey()
        groupRDD.collect().foreach(println)
    
        sc.stop()
      }
    
    }
    
    package com.lotuslaw.spark.core.rdd.persist
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.persist
     * @create: 2021-12-02 14:50
     * @description:
     */
    object Spark_RDD_Persist3 {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("Persist")
        val sc = new SparkContext(sparConf)
        sc.setCheckpointDir("cp")
    
        val list = List("Hello Scala", "Hello Spark")
    
        val rdd = sc.makeRDD(list)
    
        val flatRDD = rdd.flatMap(_.split(" "))
    
        val mapRDD = flatRDD.map(word => {
          println("@@@@@@@@@@@@@@")
          (word, 1)
        })
    
        // cache : temporarily stores the data in memory for reuse
        //         it adds a new dependency to the lineage, so if something goes wrong the data can be recomputed from the start
        // persist : temporarily stores the data in disk files for reuse
        //           involves disk IO, so it is slower, but the data is safer
        //           the temporary files are lost once the job finishes
        // checkpoint : stores the data in disk files permanently for reuse
        //           involves disk IO, so it is slower, but the data is safe
        //           to guarantee data safety it normally runs as an independent job,
        //           so for efficiency it is usually combined with cache
        //           it cuts the existing lineage and establishes a new one:
        //           a checkpoint is effectively a change of data source
        mapRDD.cache()
        mapRDD.checkpoint()
        println(mapRDD.toDebugString)
    
        val reduceRDD = mapRDD.reduceByKey(_ + _)
        reduceRDD.collect().foreach(println)
        println("********************************")
        val groupRDD = mapRDD.groupByKey()
        groupRDD.collect().foreach(println)
    
        sc.stop()
      }
    
    }
    

    package com.lotuslaw.spark.core.rdd.partitioner
    
    import org.apache.spark.{Partitioner, SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.rdd.partitioner
     * @create: 2021-12-02 15:01
     * @description:
     */
    object Spark_RDD_Partitioner {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("WordCount")
        val sc = new SparkContext(sparConf)
    
        val rdd = sc.makeRDD(List(
          ("nba", "XXXXXXXXXXXXXX"),
          ("cba", "XXXXXXXXXXXXXX"),
          ("wnba", "XXXXXXXXXXXXXX"),
          ("nba", "XXXXXXXXXXXXXX")
        ), 3)
    
        val partRDD = rdd.partitionBy(new MyPartitioner)
    
        partRDD.saveAsTextFile("output")
    
        sc.stop()
      }
    
      /**
       * Custom partitioner
       * 1. Extend Partitioner
       * 2. Override its methods
       */
      class MyPartitioner extends Partitioner{
    
        // number of partitions
        override def numPartitions: Int = 3
    
        // return the partition index (starting from 0) for the given key
        override def getPartition(key: Any): Int = {
          key match {
            case "nba" => 0
            case "wnba" => 1
            case _ => 2
          }
        }
      }
    
    }
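
    A small verification sketch (my addition, assuming it is placed before sc.stop() in the object above): instead of inspecting the output directory, mapPartitionsWithIndex tags each record with the index of the partition it ended up in, so the effect of MyPartitioner.getPartition can be checked directly on the console.

        partRDD
          .mapPartitionsWithIndex((index, iter) => iter.map(kv => (index, kv)))
          .collect()
          .foreach(println)
        // expected: the "nba" records in partition 0, "wnba" in partition 1, "cba" in partition 2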
    

    package com.lotuslaw.spark.core.accbc
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.acc
     * @create: 2021-12-02 15:09
     * @description:
     */
    object Spark_Acc {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("Acc")
        val sc = new SparkContext(sparConf)
    
        val rdd = sc.makeRDD(List(1, 2, 3, 4))
    
        // Obtain a built-in accumulator
        // Spark ships with simple accumulators for basic numeric aggregation
        // if an accumulator is only updated inside a transformation and no action runs, nothing is accumulated
        val sumAcc = sc.longAccumulator("sum")
        //sc.doubleAccumulator
        //sc.collectionAccumulator
        rdd.foreach(
          num => {
            sumAcc.add(num)
          }
        )
    
        // read the accumulator's value
        println(sumAcc.value)
    
        val mapRDD = rdd.map(
          num => {
            sumAcc.add(num)
          }
        )
        println(sumAcc.value)
        mapRDD.collect()
        println(sumAcc.value)
    
        sc.stop()
      }
    
    }
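
    The comments above cover the under-counting case (no action, so nothing is added). The opposite pitfall is over-counting; a sketch I added, assuming it is placed before sc.stop() in the object above: when an RDD that updates an accumulator inside a transformation is evaluated by more than one action, the update runs again and the values are added twice. Caching the RDD, or updating accumulators only inside actions such as foreach, avoids this.

        val acc2 = sc.longAccumulator("sum2")
        val mapped = rdd.map(num => { acc2.add(num); num })
        mapped.collect()
        println(acc2.value)   // 10
        mapped.collect()
        println(acc2.value)   // 20 -- the map ran again, so every value was added a second time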
    

    package com.lotuslaw.spark.core.accbc
    
    import org.apache.spark.util.AccumulatorV2
    import org.apache.spark.{SparkConf, SparkContext}
    
    import scala.collection.mutable
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.acc
     * @create: 2021-12-02 15:09
     * @description:
     */
    object Spark_Acc_DIY {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("Acc")
        val sc = new SparkContext(sparConf)
    
        val rdd = sc.makeRDD(List("hello", "spark", "hello"))
    
        val wcAcc = new MyAccumulator()
    
        // register the accumulator with Spark
        sc.register(wcAcc, "wordCountAcc")
    
        rdd.foreach(
          word => {
            wcAcc.add(word)
          }
        )
    
        println(wcAcc.value)
    
        sc.stop()
      }
    
      /*
          Custom accumulator: WordCount

          1. Extend AccumulatorV2 and define the type parameters
             IN  : input type of the accumulator, String
             OUT : output type of the accumulator, mutable.Map[String, Long]

          2. Override its six methods
         */
      class MyAccumulator extends AccumulatorV2[String, mutable.Map[String, Long]] {
    
        private val wcMap = mutable.Map[String, Long]()
    
        // whether the accumulator is in its initial (zero) state
        override def isZero: Boolean = {
          wcMap.isEmpty
        }
    
        override def copy(): AccumulatorV2[String, mutable.Map[String, Long]] = {
          new MyAccumulator()
        }
    
        override def reset(): Unit = {
          wcMap.clear()
        }
    
        // accumulate one input value
        override def add(v: String): Unit = {
          val newCnt = wcMap.getOrElse(v, 0L) + 1
          wcMap.update(v, newCnt)
        }
    
        // the Driver merges the accumulators from multiple tasks
        override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit = {
          val map1 = this.wcMap
          val map2 = other.value
    
          map2.foreach{
            case (word, count) => {
              val newCount = map1.getOrElse(word, 0L) + count
              map1.update(word, newCount)
            }
          }
        }
    
        // the accumulator's result
        override def value: mutable.Map[String, Long] = {
          wcMap
        }
      }
    
    }
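
    For contrast, a minimal sketch (my addition, reusing rdd from the object above) of the same word count done with reduceByKey: it yields the same counts but goes through a shuffle, which is what the accumulator-based version avoids.

        val wcByShuffle = rdd.map(word => (word, 1L)).reduceByKey(_ + _).collect().toMap
        println(wcByShuffle)   // e.g. Map(hello -> 2, spark -> 1)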
    

    package com.lotuslaw.spark.core.accbc
    
    import org.apache.spark.util.AccumulatorV2
    import org.apache.spark.{SparkConf, SparkContext}
    
    import scala.collection.mutable
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.acc
     * @create: 2021-12-02 15:09
     * @description:
     */
    object Spark_Bc {
    
      def main(args: Array[String]): Unit = {
    
        val sparConf = new SparkConf().setMaster("local").setAppName("Acc")
        val sc = new SparkContext(sparConf)
    
        val rdd1 = sc.makeRDD(List(
          ("a", 1), ("b", 2), ("c", 3)
        ))
    
        val map = mutable.Map(("a", 4), ("b", 5), ("c", 6))
    
        // wrap the map in a broadcast variable
        val bc = sc.broadcast(map)
    
        rdd1.map{
          case (w, c) => {
            val l = bc.value.getOrElse(w, 0)
            (w, (c, l))
          }
        }.collect().foreach(println)
    
    
        sc.stop()
      }
    
    }
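
    For comparison, a sketch of the join-based alternative that the broadcast variable replaces (my addition, assuming it is placed before sc.stop() above): turning the small map into an RDD and joining forces a shuffle, whereas the broadcast version ships the map to each executor once and looks values up map-side.

        val rdd2 = sc.makeRDD(List(("a", 4), ("b", 5), ("c", 6)))
        rdd1.join(rdd2).collect().foreach(println)
        // (a,(1,4)), (b,(2,5)), (c,(3,6)) -- same result, but produced through a shuffle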
    

    Spark Case Studies in Practice

    package com.lotuslaw.spark.core.req
    
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.req
     * @create: 2021-12-02 15:53
     * @description:
     */
    object Spark_RDD_Req1 {
    
      def main(args: Array[String]): Unit = {
        // TODO: Top 10 popular categories
        val sparConf = new SparkConf().setMaster("local[*]").setAppName("HotCategoryTop10Analysis")
        val sc = new SparkContext(sparConf)
    
        // read the raw log data
        val actionRDD = sc.textFile("input/user_visit_action.txt")
    
        // count clicks per category: (category ID, click count)
        val clickActionRDD = actionRDD.filter(
          action => {
            val datas = action.split("_")
            datas(6) != "-1"
          }
        )
    
        val clickCountRDD = clickActionRDD.map(
          action => {
            val datas = action.split("_")
            (datas(6), 1)
          }
        ).reduceByKey(_ + _)
    
        // count orders per category
        val orderActionRDD = actionRDD.filter(
          action => {
            val datas = action.split("_")
            datas(8) != "null"
          }
        )
    
        val orderCountRDD = orderActionRDD.flatMap(
          action => {
            val datas = action.split("_")
            val cid = datas(8)
            val cids = cid.split(",")
            cids.map(id => (id, 1))
          }
        ).reduceByKey(_ + _)
    
        // count payments per category
        val payActionRDD = actionRDD.filter(
          action => {
            val datas = action.split("_")
            datas(10) != "null"
          }
        )
    
        val payCountRDD = payActionRDD.flatMap(
          action => {
            val datas = action.split("_")
            val cid = datas(10)
            val cids = cid.split(",")
            cids.map(id => (id, 1))
          }
        ).reduceByKey(_ + _)
    
        // sort the categories and take the top 10
        //    ordered by click count, then order count, then payment count
        //    tuple ordering: compare the first element, then the second, then the third, and so on
        //    ( category ID, ( click count, order count, payment count ) )
        val cogroupRDD = clickCountRDD.cogroup(orderCountRDD, payCountRDD)
    
        val analysisRDD = cogroupRDD.mapValues {
          case (clickIter, orderIter, payIter) => {
            var clickCnt = 0
            val iter1 = clickIter.iterator
            if (iter1.hasNext) {
              clickCnt = iter1.next()
            }
    
            var orderCnt = 0
            val iter2 = orderIter.iterator
            if (iter2.hasNext) {
              orderCnt = iter2.next()
            }
    
            var payCnt = 0
            val iter3 = payIter.iterator
            if (iter3.hasNext) {
              payCnt = iter3.next()
            }
            (clickCnt, orderCnt, payCnt)
          }
        }
    
        val resultRDD = analysisRDD.sortBy(_._2, false).take(10)
    
        // collect the result and print it to the console
        resultRDD.foreach(println)
    
        sc.stop()
      }
    
    }
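
    A tiny illustration of the tuple ordering that sortBy(_._2, false) relies on (my addition, plain Scala, no Spark needed): tuples are compared element by element, so the click count decides first, then the order count, then the payment count.

        val sample = List(("A", (10, 2, 1)), ("B", (10, 3, 0)), ("C", (9, 99, 99)))
        println(sample.sortBy(_._2)(Ordering[(Int, Int, Int)].reverse))
        // List((B,(10,3,0)), (A,(10,2,1)), (C,(9,99,99)))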
    
    package com.lotuslaw.spark.core.req
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.req
     * @create: 2021-12-02 15:53
     * @description:
     */
    object Spark_RDD_Req2 {
    
      def main(args: Array[String]): Unit = {
        // TODO: Top 10 popular categories
        val sparConf = new SparkConf().setMaster("local[*]").setAppName("HotCategoryTop10Analysis")
        val sc = new SparkContext(sparConf)
    
        // Issue 1: actionRDD is reused several times
        // Issue 2: cogroup may perform poorly
    
        // read the raw log data
        val actionRDD = sc.textFile("input/user_visit_action.txt")
        actionRDD.cache()
    
        // count clicks per category
        val clickActionRDD = actionRDD.filter(
          action => {
            val datas = action.split("_")
            datas(6) != "-1"
          }
        )
    
        val clickCountRDD = clickActionRDD.map(
          action => {
            val datas = action.split("_")
            (datas(6), 1)
          }
        ).reduceByKey(_ + _)
    
        // count orders per category
        val orderActionRDD = actionRDD.filter(
          action => {
            val datas = action.split("_")
            datas(8) != "null"
          }
        )
    
        val orderCountRDD = orderActionRDD.flatMap(
          action => {
            val datas = action.split("_")
            val cid = datas(8)
            val cids = cid.split(",")
            cids.map(id => (id, 1))
          }
        ).reduceByKey(_ + _)
    
        // count payments per category
        val payActionRDD = actionRDD.filter(
          action => {
            val datas = action.split("_")
            datas(10) != "null"
          }
        )
    
        val payCountRDD = payActionRDD.flatMap(
          action => {
            val datas = action.split("_")
            val cid = datas(10)
            val cids = cid.split(",")
            cids.map(id => (id, 1))
          }
        ).reduceByKey(_ + _)
    
        val rdd1 = clickCountRDD.map {
          case (cid, cnt) => {
            (cid, (cnt, 0, 0))
          }
        }
    
        val rdd2 = orderCountRDD.map {
          case (cid, cnt) => {
            (cid, (0, cnt, 0))
          }
        }
    
        val rdd3 = payCountRDD.map {
          case (cid, cnt) => {
            (cid, (0, 0, cnt))
          }
        }
    
        // union the three data sets and aggregate them in one pass
        val sourceRDD = rdd1.union(rdd2).union(rdd3)
        val analysisRDD = sourceRDD.reduceByKey(
          (t1, t2) => {
            (t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
          }
        )
    
        val resultRDD = analysisRDD.sortBy(_._2, false).take(10)
    
        resultRDD.foreach(println)
    
        sc.stop()
      }
    
    }
    
    package com.lotuslaw.spark.core.req
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.req
     * @create: 2021-12-02 15:53
     * @description:
     */
    object Spark_RDD_Req3 {
    
      def main(args: Array[String]): Unit = {
        // TODO: Top 10 popular categories
        val sparConf = new SparkConf().setMaster("local[*]").setAppName("HotCategoryTop10Analysis")
        val sc = new SparkContext(sparConf)
    
        // Issue: there are still many shuffle operations (reduceByKey)
    
        // read the raw log data
        val actionRDD = sc.textFile("input/user_visit_action.txt")
    
        val flatRDD = actionRDD.flatMap(
          action => {
            val datas = action.split("_")
            if (datas(6) != "-1") {
              // click record
              List((datas(6), (1, 0, 0)))
            } else if (datas(8) != "null") {
              // order record
              val ids = datas(8).split(",")
              ids.map(id => (id, (0, 1, 0)))
            } else if (datas(10) != "null") {
              // payment record
              val ids = datas(10).split(",")
              ids.map(id => (id, (0, 0, 1)))
            } else {
              Nil
            }
          }
        )
    
        // aggregate the records that share the same category ID
        val analysisRDD = flatRDD.reduceByKey(
          (t1, t2) => {
            (t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
          }
        )
    
        // sort the statistics in descending order and take the top 10
        val resultRDD = analysisRDD.sortBy(_._2, false).take(10)
    
        resultRDD.foreach(println)
    
        sc.stop()
      }
    
    }
    
    package com.lotuslaw.spark.core.req
    
    import org.apache.spark.util.AccumulatorV2
    import org.apache.spark.{SparkConf, SparkContext}
    
    import scala.collection.mutable
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.req
     * @create: 2021-12-02 15:53
     * @description:
     */
    object Spark_RDD_Req4 {
    
      def main(args: Array[String]): Unit = {
        // TODO: Top 10 popular categories
        val sparConf = new SparkConf().setMaster("local[*]").setAppName("HotCategoryTop10Analysis")
        val sc = new SparkContext(sparConf)
    
        // use a custom accumulator instead of reduceByKey
    
        // read the raw log data
        val actionRDD = sc.textFile("input/user_visit_action.txt")
    
        val acc = new HotCategoryAccumulator
        sc.register(acc, "hotCategory")
    
        // convert the data structure
        actionRDD.foreach(
          action => {
            val datas = action.split("_")
            if (datas(6) != "-1") {
              // click record
              acc.add((datas(6), "click"))
            } else if (datas(8) != "null"){
              // order record
              val ids = datas(8).split(",")
              ids.foreach(
                id => {
                  acc.add((id, "order"))
                }
              )
            } else if (datas(10) != "null") {
              // payment record
              val ids = datas(10).split(",")
              ids.foreach(
                id => {
                  acc.add((id, "pay"))
                }
              )
            }
          }
        )
    
        val accVal = acc.value
        val categories = accVal.values
    
        val sort = categories.toList.sortWith(
          (left, right) => {
            if (left.clickCnt > right.clickCnt) {
              true
            } else if (left.clickCnt == right.clickCnt) {
              if (left.orderCnt > right.orderCnt) {
                true
              } else if (left.orderCnt == right.orderCnt) {
                left.payCnt > right.payCnt
              } else {
                false
              }
            } else {
              false
            }
          }
        )
    
        // collect the result and print it to the console
        sort.take(10).foreach(println)
    
    
        sc.stop()
      }
    
      case class HotCategory(cid: String, var clickCnt: Int, var orderCnt: Int, var payCnt: Int)
    
      class HotCategoryAccumulator extends AccumulatorV2[(String, String), mutable.Map[String, HotCategory]] {
    
        private val hcMap = mutable.Map[String, HotCategory]()
    
        override def isZero: Boolean = {
          hcMap.isEmpty
        }
    
        override def copy(): AccumulatorV2[(String, String), mutable.Map[String, HotCategory]] = {
          new HotCategoryAccumulator()
        }
    
        override def reset(): Unit = {
          hcMap.clear()
        }
    
        override def add(v: (String, String)): Unit = {
          val cid = v._1
          val actionType = v._2
          val category: HotCategory = hcMap.getOrElse(cid, HotCategory(cid, 0, 0, 0))
          if (actionType == "click") {
            category.clickCnt += 1
          } else if (actionType == "order") {
            category.orderCnt += 1
          } else if (actionType == "pay") {
            category.payCnt += 1
          }
          hcMap.update(cid, category)
        }
    
        override def merge(other: AccumulatorV2[(String, String), mutable.Map[String, HotCategory]]): Unit = {
          val map1 = this.hcMap
          val map2 = other.value
    
          map2.foreach{
            case (cid, hc) => {
              val category: HotCategory = map1.getOrElse(cid, HotCategory(cid, 0, 0, 0))
              category.clickCnt += hc.clickCnt
              category.orderCnt += hc.orderCnt
              category.payCnt += hc.payCnt
              map1.update(cid, category)
            }
          }
        }
    
        override def value: mutable.Map[String, HotCategory] = hcMap
      }
    
    }
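
    The nested sortWith above can be written more compactly with the same tuple Ordering used in the earlier versions; a sketch I added, reusing categories from main above:

        val sorted = categories.toList
          .sortBy(hc => (hc.clickCnt, hc.orderCnt, hc.payCnt))(Ordering[(Int, Int, Int)].reverse)
        sorted.take(10).foreach(println)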
    
    package com.lotuslaw.spark.core.req
    
    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.AccumulatorV2
    import org.apache.spark.{SparkConf, SparkContext}
    
    import scala.collection.mutable
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.req
     * @create: 2021-12-02 15:53
     * @description:
     */
    object Spark_RDD_Req5 {
    
      def main(args: Array[String]): Unit = {
        // TODO: Top 10 popular categories
        val sparConf = new SparkConf().setMaster("local[*]").setAppName("HotCategoryTop10Analysis")
        val sc = new SparkContext(sparConf)
    
        // read the raw log data
        val actionRDD = sc.textFile("input/user_visit_action.txt")
        actionRDD.cache()
    
        val Top10Ids = top10Category(actionRDD)
    
        // filter the raw data, keeping only click records whose category is in the top 10
        val filterActionRDD = actionRDD.filter(
          action => {
            val datas = action.split("_")
            if (datas(6) != "-1") {
              Top10Ids.contains(datas(6))
            } else {
              false
            }
          }
        )
    
        val reduceRDD = filterActionRDD.map(
          action => {
            val datas = action.split("_")
            ((datas(6), datas(2)), 1)
          }
        ).reduceByKey(_ + _)
    
        // reshape the statistics
        val mapRDD = reduceRDD.map {
          case ((cid, sid), sum) => {
            (cid, (sid, sum))
          }
        }
    
        // group the records of the same category
        val groupRDD = mapRDD.groupByKey()
    
        // within each group, sort by click count and take the top 10
        val resultRDD = groupRDD.mapValues(
          iter => {
            iter.toList.sortBy(_._2)(Ordering.Int.reverse).take(10)
          }
        )
    
        resultRDD.collect().foreach(println)
    
        sc.stop()
    
      }
    
      def top10Category(actionRDD: RDD[String]) = {
        val flatRDD = actionRDD.flatMap(
          action => {
            val datas = action.split("_")
            if (datas(6) != "-1") {
              // click record
              List((datas(6), (1, 0, 0)))
            } else if (datas(8) != "null") {
              // order record
              val ids = datas(8).split(",")
              ids.map(id => (id, (0, 1, 0)))
            } else if (datas(10) != "null") {
              // payment record
              val ids = datas(10).split(",")
              ids.map(id => (id, (0, 0, 1)))
            } else {
              Nil
            }
          }
        )
    
        val analysisRDD = flatRDD.reduceByKey(
          (t1, t2) => {
            (t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
          }
        )
    
        analysisRDD.sortBy(_._2, false).take(10).map(_._1)
      }
    }
    
    package com.lotuslaw.spark.core.req
    
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author: lotuslaw
     * @version: V1.0
     * @package: com.lotuslaw.spark.core.req
     * @create: 2021-12-02 15:53
     * @description:
     */
    object Spark_RDD_Req6 {
    
      def main(args: Array[String]): Unit = {
        // TODO: Top 10 popular categories
        val sparConf = new SparkConf().setMaster("local[*]").setAppName("HotCategoryTop10Analysis")
        val sc = new SparkContext(sparConf)
    
        // read the raw log data
        val actionRDD = sc.textFile("input/user_visit_action.txt")
    
        val actionDataRDD = actionRDD.map(
          action => {
            val datas = action.split("_")
            UserVisitAction(
              datas(0),
              datas(1).toLong,
              datas(2),
              datas(3).toLong,
              datas(4),
              datas(5),
              datas(6).toLong,
              datas(7).toLong,
              datas(8),
              datas(9),
              datas(10),
              datas(11),
              datas(12).toLong
            )
          }
        )
    
        actionDataRDD.cache()
    
        // TODO: compute statistics for the specified consecutive page jumps
        val ids = List[Long](1, 2, 3, 4, 5, 6, 7)
        val okflowIds = ids.zip(ids.tail)
    
        // TODO: compute the denominators (total visits per page)
        val pageidToCountMap = actionDataRDD.filter(
          action => {
            ids.init.contains(action.page_id)
          }
        ).map(
          action => {
            (action.page_id, 1L)
          }
        ).reduceByKey(_ + _).collect().toMap
    
        // TODO: compute the numerators (counts of each single jump)
        // group by session
        val sessionRDD = actionDataRDD.groupBy(_.session_id)
    
        // within each session, sort by action time in ascending order
        val mvRDD = sessionRDD.mapValues(
          iter => {
            val sortList = iter.toList.sortBy(_.action_time)
            val flowIds = sortList.map(_.page_id)
            val pageflowIds = flowIds.zip(flowIds.tail)
            pageflowIds.filter(
              t => {
                okflowIds.contains(t)
              }
            ).map(
              t => {
                (t, 1)
              }
            )
          }
        )
    
        val flatRDD = mvRDD.map(_._2).flatMap(list => list)
    
        val dataRDD = flatRDD.reduceByKey(_ + _)
    
        // TODO: compute the single-jump conversion rate
        dataRDD.foreach{
          case ((pageid1, pageid2), sum) => {
            val lon = pageidToCountMap.getOrElse(pageid1, 0L)
    
            println(s"页面${pageid1}跳转页面${pageid2}单跳转换率为:" + (sum.toDouble / lon))
          }
        }
    
        sc.stop()
      }
    
      // user visit action record
      case class UserVisitAction(
                                date: String,
                                user_id: Long,
                                session_id: String,
                                page_id: Long,
                                action_time: String,
                                search_keyword: String,
                                click_category_id: Long,
                                click_product_id: Long,
                                order_category_ids: String,
                                order_product_ids: String,
                                pay_category_ids: String,
                                pay_product_ids: String,
                                city_id: Long
                                )
    }
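
    One small variation worth noting (my addition, assuming it replaces the foreach block inside main above): dataRDD.foreach(println) runs inside the executors, so on a real cluster the output lands in executor logs rather than on the driver. Collecting first prints the conversion rates at the driver, and the f interpolator makes them easier to read.

        dataRDD.collect().foreach {
          case ((pageid1, pageid2), sum) =>
            val total = pageidToCountMap.getOrElse(pageid1, 0L)
            println(f"conversion rate from page $pageid1 to page $pageid2: ${sum.toDouble / total}%.4f")
        }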
    