• Cassandra: a complete example of using the Spark Cassandra Connector in spark-shell


    1. Prepare Cassandra

    Start cqlsh:

    CQLSH_HOST=172.16.163.131 bin/cqlsh
    cqlsh>CREATE KEYSPACE productlogs WITH REPLICATION = { 'class' : 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2' };
    
    cqlsh>CREATE TABLE productlogs.logs (
        ids uuid,
        app_name text,
        app_version text,
        city text,
        client_time timestamp,
        country text,
        created_at timestamp,
        cs_count int,
        device_id text,
        id int,
        modle_name text,
        province text,
        remote_ip text,
        updated_at timestamp,
        PRIMARY KEY (ids)
    );

    2. The Spark Cassandra Connector jar

    Create a new empty sbt project, add the connector as a dependency, and package it as spark-cassandra-connector-full.jar. Add the following line to the *.sbt file:

    libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.5.0"

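    The point of this step: the official connector package does not bundle its dependencies, so using it directly means tracking those dependencies down yourself. The required packages and their versions also differ between releases, so for simplicity just build one full jar.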
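    One way to build such a full jar is the sbt-assembly plugin. The sketch below is an assumption, not the author's actual build: the plugin and its version, the Scala patch version, and the project name are chosen for illustration. Only the connector coordinates and the jar name come from the text above.

```scala
// project/plugins.sbt (assumed: sbt-assembly plugin, version chosen for illustration)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt
name := "spark-cassandra-connector-full"   // hypothetical project name

scalaVersion := "2.10.5"                    // assumed; must match the _2.10 artifact suffix

libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.5.0"

// Name the assembled jar as in the text above
assemblyJarName in assembly := "spark-cassandra-connector-full.jar"
```

    With this layout, running sbt assembly should write the jar under target/scala-2.10/. If the assembly step reports duplicate-file conflicts, an assemblyMergeStrategy setting may be needed; that is not shown here.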

    3. Start spark-shell

    /opt/db/spark-1.5.2-bin-hadoop2.6/bin/spark-shell --master spark://u1:7077  --jars ~/spark-cassandra-connector-full.jar

    The commands below are run inside spark-shell.

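    4. Prepare the data source

    // Most guides stop the current sc and start a new one. That is unnecessary here: just set the Cassandra parameter on the conf of the existing sc.
    // Note: SparkContext.getConf is documented as returning a copy of the configuration, so if the setting does not take effect, pass --conf spark.cassandra.connection.host=172.16.163.131 when launching spark-shell instead.
    scala>sc.getConf.set("spark.cassandra.connection.host", "172.16.163.131")
    // Read the source data from HDFS
    scala>val df = sc.textFile("/data/logs")
    // Import the required namespaces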
    scala>import org.apache.spark.sql._
    scala>import org.apache.spark.sql.types._
    scala>import com.datastax.spark.connector._
    scala>import java.util.UUID
    // Define the schema
    scala>val schema = StructType(
      StructField("ids", StringType, true) ::
        StructField("id", IntegerType, true) ::
        StructField("app_name", StringType, true) ::
        StructField("app_version", StringType, true) ::
        StructField("client_time", TimestampType, true) ::
        StructField("device_id", StringType, true) ::
        StructField("modle_name", StringType, true) ::
        StructField("cs_count", IntegerType, true) ::
        StructField("created_at", TimestampType, true) ::
        StructField("updated_at", TimestampType, true) ::
        StructField("remote_ip", StringType, true) ::
        StructField("country", StringType, true) ::
        StructField("province", StringType, true) ::
        StructField("city", StringType, true) :: Nil)
    // Build rows that match the schema (the input is tab-separated)
    scala>val rowRDD = df.map(_.split("\t")).map(p => Row(UUID.randomUUID().toString(), p(0).toInt, p(1), p(2), java.sql.Timestamp.valueOf(p(3)), p(4), p(5), p(6).toInt, java.sql.Timestamp.valueOf(p(7)), java.sql.Timestamp.valueOf(p(8)), p(9), p(10), p(11), p(12)))
    scala>val df= sqlContext.createDataFrame(rowRDD, schema)
    scala>df.registerTempTable("logs")
    // Check the result
    scala>sqlContext.sql("select * from logs limit 1").show

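    If you look closely, you may notice that for the ids column, whose Cassandra type is uuid, I pass a string: UUID.randomUUID().toString(). Why? The Spark Cassandra Connector converts it internally; see the type mapping in section 7, where connector.types.UUIDType maps to catalystTypes.StringType.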
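    The row-building step can be tried outside Spark to see what each of the 13 tab-separated fields becomes. This is only a sketch: LogLineExample, LogRecord, parseLine and the sample line are hypothetical names and data, and the field order is assumed to match the schema in step 4. It also checks that the string form of a UUID converts back without loss, which is what makes passing a string for a uuid column workable.

```scala
import java.sql.Timestamp
import java.util.UUID

object LogLineExample {
  // Hypothetical helper type mirroring the column order of the schema in step 4.
  case class LogRecord(
    ids: String, id: Int, appName: String, appVersion: String,
    clientTime: Timestamp, deviceId: String, modleName: String,
    csCount: Int, createdAt: Timestamp, updatedAt: Timestamp,
    remoteIp: String, country: String, province: String, city: String)

  // Split with limit -1 so trailing empty fields are kept rather than dropped.
  def parseLine(line: String): LogRecord = {
    val p = line.split("\t", -1)
    require(p.length == 13, s"expected 13 fields, got ${p.length}")
    LogRecord(
      UUID.randomUUID().toString, p(0).toInt, p(1), p(2),
      Timestamp.valueOf(p(3)), p(4), p(5), p(6).toInt,
      Timestamp.valueOf(p(7)), Timestamp.valueOf(p(8)),
      p(9), p(10), p(11), p(12))
  }

  // Made-up sample line in the assumed 13-column layout.
  val sample: String = Seq(
    "1", "demo-app", "1.0.0", "2016-04-24 10:00:00", "device-001", "model-x", "3",
    "2016-04-24 10:00:01", "2016-04-24 10:00:02", "10.0.0.1", "CN", "Zhejiang", "Hangzhou"
  ).mkString("\t")

  val record: LogRecord = parseLine(sample)

  // Parsing the generated string back into a UUID yields the same value.
  val roundTrips: Boolean = UUID.fromString(record.ids).toString == record.ids
}
```

    Unlike the one-liner in step 4, parseLine fails fast on a line with the wrong number of fields instead of throwing an index error partway through a job; whether that stricter behaviour is wanted depends on how clean the input is.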

    5. Save the data to Cassandra

    scala>import org.apache.spark.sql.cassandra._
    scala>df.write.format("org.apache.spark.sql.cassandra").options(Map("table" -> "logs", "keyspace" -> "productlogs")).save()

    6. Read back the data just saved

    scala>import org.apache.spark.sql.cassandra._
    scala>val cdf = sqlContext.read.
      format("org.apache.spark.sql.cassandra").
      options(Map("table" -> "logs", "keyspace" -> "productlogs")).
      load()
    scala>cdf.registerTempTable("logs_just_saved")
    scala>sqlContext.sql("select * from logs_just_saved limit 1").show
    

    7. Mapping between Cassandra and Spark SQL data types

    spark-cassandra-connector/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/DataTypeConverter.scala

      private[cassandra] val primitiveTypeMap = Map[connector.types.ColumnType[_], catalystTypes.DataType](
        connector.types.TextType       -> catalystTypes.StringType,
        connector.types.AsciiType      -> catalystTypes.StringType,
        connector.types.VarCharType    -> catalystTypes.StringType,
    
        connector.types.BooleanType    -> catalystTypes.BooleanType,
    
        connector.types.IntType        -> catalystTypes.IntegerType,
        connector.types.BigIntType     -> catalystTypes.LongType,
        connector.types.CounterType    -> catalystTypes.LongType,
        connector.types.FloatType      -> catalystTypes.FloatType,
        connector.types.DoubleType     -> catalystTypes.DoubleType,
        connector.types.SmallIntType   -> catalystTypes.ShortType,
        connector.types.TinyIntType    -> catalystTypes.ByteType,
    
        connector.types.VarIntType     -> catalystTypes.DecimalType(38, 0), // no native arbitrary-size integer type
        connector.types.DecimalType    -> catalystTypes.DecimalType(38, 18),
    
        connector.types.TimestampType  -> catalystTypes.TimestampType,
        connector.types.InetType       -> catalystTypes.StringType,
        connector.types.UUIDType       -> catalystTypes.StringType,
        connector.types.TimeUUIDType   -> catalystTypes.StringType,
        connector.types.BlobType       -> catalystTypes.BinaryType,
        connector.types.DateType       -> catalystTypes.DateType,
        connector.types.TimeType       -> catalystTypes.LongType
      )
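    As a quick check of this mapping against the logs table from step 1, the sketch below looks up the Spark SQL type for each column. It is not connector code: LogsTableTypes and its members are hypothetical names, and the map is a hand-copied subset of the listing above, keyed by CQL type name.

```scala
object LogsTableTypes {
  // Subset of the connector mapping quoted above, keyed by CQL type name.
  val cqlToSparkSql: Map[String, String] = Map(
    "text"      -> "StringType",
    "int"       -> "IntegerType",
    "timestamp" -> "TimestampType",
    "uuid"      -> "StringType")

  // Columns of productlogs.logs with their CQL types, as declared in step 1.
  val logsColumns: Seq[(String, String)] = Seq(
    "ids" -> "uuid", "app_name" -> "text", "app_version" -> "text",
    "city" -> "text", "client_time" -> "timestamp", "country" -> "text",
    "created_at" -> "timestamp", "cs_count" -> "int", "device_id" -> "text",
    "id" -> "int", "modle_name" -> "text", "province" -> "text",
    "remote_ip" -> "text", "updated_at" -> "timestamp")

  // Spark SQL type name for each column, in declaration order.
  val sparkTypes: Seq[(String, String)] =
    logsColumns.map { case (name, cql) => name -> cqlToSparkSql(cql) }
}
```

    Printing the pairs with LogsTableTypes.sparkTypes.foreach(println) lists the ids column as StringType, matching the StructField declared for it in step 4.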

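    Note: to use spark-cassandra-connector in spark-shell, the author relied on two main tricks:

    1. Create an empty project, add spark-cassandra-connector as a dependency, and bundle its dependencies into one jar.

    2. In spark-shell, get the conf directly and add the Cassandra connection parameter, so that the default SparkContext and sqlContext (a HiveContext) can be used without calling sc.stop first.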

  • Original post: https://www.cnblogs.com/piaolingzxh/p/5427568.html