How to read and write Cassandra data in Spark ---- Learning the distributed computing framework Spark, part 6


    Since all of the preprocessed data is stored in Cassandra, analyzing it with Spark means reading the data out of Cassandra and writing the analysis results back into Cassandra as well; so I needed to look into how Spark reads and writes Cassandra.

    Incidentally, this word gets tiring to type. We say "Spark", but really it comes down to whether your development language has a corresponding driver.

    Since Cassandra is DataStax's flagship product, the company also provides the corresponding driver for Spark (see here).

    Following its demo, I gave it a try in Scala.

    1. The code to run

    //CassandraTest.scala
    import org.apache.spark.{Logging, SparkContext, SparkConf}
    import com.datastax.spark.connector.cql.CassandraConnector

    object CassandraTestApp {
      def main(args: Array[String]) {
        // Spark and Cassandra addresses; both are local here
        val SparkMasterHost = "127.0.0.1"
        val CassandraHost = "127.0.0.1"

        // Tell Spark the address of one Cassandra node:
        val conf = new SparkConf(true)
          .set("spark.cassandra.connection.host", CassandraHost)
          .set("spark.cleaner.ttl", "3600")
          .setMaster("local[12]")
          .setAppName("CassandraTestApp")

        // Connect to the Spark cluster:
        lazy val sc = new SparkContext(conf)

        // Set-up statements, executed as soon as the session is opened
        CassandraConnector(conf).withSessionDo { session =>
          session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
          session.execute("CREATE TABLE IF NOT EXISTS test.key_value (key INT PRIMARY KEY, value VARCHAR)")
          session.execute("TRUNCATE test.key_value")
          session.execute("INSERT INTO test.key_value(key, value) VALUES (1, 'first row')")
          session.execute("INSERT INTO test.key_value(key, value) VALUES (2, 'second row')")
          session.execute("INSERT INTO test.key_value(key, value) VALUES (3, 'third row')")
        }

        // Bring the connector's implicits into scope
        import com.datastax.spark.connector._

        // Read table test.key_value and print its contents:
        val rdd = sc.cassandraTable("test", "key_value").select("key", "value")
        rdd.collect().foreach(row => println(s"Existing Data: $row"))

        // Write two new rows to the test.key_value table:
        val col = sc.parallelize(Seq((4, "fourth row"), (5, "fifth row")))
        col.saveToCassandra("test", "key_value", SomeColumns("key", "value"))

        // Assert the two new rows were stored in test.key_value:
        assert(col.collect().length == 2)
        col.collect().foreach(row => println(s"New Data: $row"))
        println(s"Work completed, stopping the Spark context.")
        sc.stop()
      }
    }
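    As a side note (not part of the original demo, and not tested here): the connector can also map rows to case classes by column name, instead of working with generic CassandraRow objects. A minimal sketch, assuming the same SparkContext `sc` and `import com.datastax.spark.connector._` as in the listing above:

    ```scala
    // Hypothetical fragment: reuses the SparkContext `sc` and the connector
    // implicits from the demo above; fields match columns by name (key, value).
    case class KeyValue(key: Int, value: String)

    // Read test.key_value rows directly as KeyValue instances:
    val typedRdd = sc.cassandraTable[KeyValue]("test", "key_value")
    typedRdd.collect().foreach(kv => println(s"${kv.key} -> ${kv.value}"))

    // Writing is symmetric: column names are derived from the field names.
    sc.parallelize(Seq(KeyValue(6, "sixth row"))).saveToCassandra("test", "key_value")
    ```

    This avoids pulling values out of CassandraRow by hand when the table has a stable schema.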

    2. Directory structure

    Since the build tool is sbt, the directory layout must follow sbt conventions. The parts that matter are build.sbt and the src directory; the other directories are generated automatically.

    qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $ll
    total 8
    drwxr-xr-x   6 qpzhang  staff  204 11 26 12:14 ./
    drwxr-xr-x  10 qpzhang  staff  340 11 25 17:30 ../
    -rw-r--r--   1 qpzhang  staff  460 11 26 10:11 build.sbt
    drwxr-xr-x   3 qpzhang  staff  102 11 25 17:42 project/
    drwxr-xr-x   3 qpzhang  staff  102 11 25 17:32 src/
    drwxr-xr-x   6 qpzhang  staff  204 11 25 17:55 target/
    qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $tree src/
    src/
    └── main
        └── scala
            └── CassandraTest.scala
    qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $cat build.sbt
    
    name := "CassandraTest"
    
    version := "1.0"
    
    scalaVersion := "2.10.4"
    
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2"
    
    assemblyMergeStrategy in assembly := {
      case PathList(ps @ _*) if ps.last endsWith ".properties" => MergeStrategy.first
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }
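
    A note on this build file: spark-core is marked "provided" because spark-submit already puts Spark (and the Scala runtime) on the classpath, so it must not be bundled into the fat jar. Along the same lines, the Scala library itself can be left out of the assembly; the fragment below is a sketch against sbt-assembly 0.14.x, not something from the original post:

    ```scala
    // build.sbt fragment (sketch, sbt-assembly 0.14.x): exclude the Scala
    // library from the fat jar, since spark-submit's runtime provides it.
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
    ```

    This shaves several megabytes off the jar and avoids any risk of a Scala version clash with the cluster.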

    Note that the sbt used here is 0.13, the latest version at the time, with the assembly plugin installed. (One gripe about sbt: it downloads heaps and heaps of jars, so a proxy is strongly recommended; otherwise the wait is very long.)

    qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $cat ~/.sbt/0.13/plugins/plugins.sbt
    addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
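
    Installing the plugins globally under ~/.sbt works, but makes the build less reproducible for anyone else checking out the project. An equivalent, standard sbt convention (a sketch, not how the original post was set up) is to declare them per-project:

    ```scala
    // project/plugins.sbt -- per-project alternative to the global
    // ~/.sbt/0.13/plugins/plugins.sbt file shown above.
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
    ```

    sbt merges global and per-project plugin definitions, so either location (or both) works.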

    3. Compiling and packaging with sbt

    Start sbt from the directory containing build.sbt.

    Then run compile to compile, and assembly to build the fat jar.

    Along the way I ran into the sbt-assembly deduplicate error; see here for reference.

    > compile
    [success] Total time: 0 s, completed 2015-11-26 10:11:50
    > assembly
    [info] Including from cache: slf4j-api-1.7.5.jar
    [info] Including from cache: metrics-core-3.0.2.jar
    [info] Including from cache: netty-codec-4.0.27.Final.jar
    [info] Including from cache: netty-handler-4.0.27.Final.jar
    [info] Including from cache: netty-common-4.0.27.Final.jar
    [info] Including from cache: joda-time-2.3.jar
    [info] Including from cache: netty-buffer-4.0.27.Final.jar
    [info] Including from cache: commons-lang3-3.3.2.jar
    [info] Including from cache: jsr166e-1.1.0.jar
    [info] Including from cache: cassandra-clientutil-2.1.5.jar
    [info] Including from cache: joda-convert-1.2.jar
    [info] Including from cache: netty-transport-4.0.27.Final.jar
    [info] Including from cache: guava-16.0.1.jar
    [info] Including from cache: spark-cassandra-connector_2.10-1.5.0-M2.jar
    [info] Including from cache: cassandra-driver-core-2.2.0-rc3.jar
    [info] Including from cache: scala-reflect-2.10.5.jar
    [info] Including from cache: scala-library-2.10.5.jar
    [info] Checking every *.class/*.jar file's SHA-1.
    [info] Merging files...
    [warn] Merging 'META-INF/INDEX.LIST' with strategy 'discard'
    [warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
    [warn] Merging 'META-INF/io.netty.versions.properties' with strategy 'first'
    [warn] Merging 'META-INF/maven/com.codahale.metrics/metrics-core/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/com.datastax.cassandra/cassandra-driver-core/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/com.google.guava/guava/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/com.twitter/jsr166e/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/io.netty/netty-buffer/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/io.netty/netty-codec/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/io.netty/netty-common/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/io.netty/netty-handler/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/io.netty/netty-transport/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/joda-time/joda-time/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/org.apache.commons/commons-lang3/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/org.joda/joda-convert/pom.xml' with strategy 'discard'
    [warn] Merging 'META-INF/maven/org.slf4j/slf4j-api/pom.xml' with strategy 'discard'
    [warn] Strategy 'discard' was applied to 15 files
    [warn] Strategy 'first' was applied to a file
    [info] SHA-1: d2cb403e090e6a3ae36b08c860b258c79120fc90
    [info] Packaging /Users/qpzhang/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar ...
    [info] Done packaging.
    [success] Total time: 19 s, completed 2015-11-26 10:12:22

    4. Submitting to Spark and the results

    qpzhang@qpzhangdeMac-mini:~/project/spark-1.5.2-bin-hadoop2.6 $./bin/spark-submit --class "CassandraTestApp" --master local[4] ~/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar
    //...........................
    15/11/26 11:40:23 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, NODE_LOCAL, 26660 bytes)
    15/11/26 11:40:23 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
    15/11/26 11:40:23 INFO Executor: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar with timestamp 1448509221160
    15/11/26 11:40:23 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
    15/11/26 11:40:23 INFO Utils: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar to /private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf-8489-4540-976e-e98eedf50412/userFiles-63085bda-aa04-4906-9621-c1cedd98c163/fetchFileTemp7487594894647111926.tmp
    15/11/26 11:40:23 INFO Executor: Adding file:/private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf-8489-4540-976e-e98eedf50412/userFiles-63085bda-aa04-4906-9621-c1cedd98c163/CassandraTest-assembly-1.0.jar to class loader
    15/11/26 11:40:24 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
    15/11/26 11:40:24 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
    15/11/26 11:40:25 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2676 bytes result sent to driver
    15/11/26 11:40:25 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2462 ms on localhost (1/1)
    15/11/26 11:40:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
    15/11/26 11:40:25 INFO DAGScheduler: ResultStage 0 (collect at CassandraTest.scala:32) finished in 2.481 s
    15/11/26 11:40:25 INFO DAGScheduler: Job 0 finished: collect at CassandraTest.scala:32, took 2.940601 s
    Existing Data: CassandraRow{key: 1, value: first row}
    Existing Data: CassandraRow{key: 2, value: second row}
    Existing Data: CassandraRow{key: 3, value: third row}
    //....................
    15/11/26 11:40:27 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
    15/11/26 11:40:27 INFO DAGScheduler: ResultStage 3 (collect at CassandraTest.scala:41) finished in 0.032 s
    15/11/26 11:40:27 INFO DAGScheduler: Job 3 finished: collect at CassandraTest.scala:41, took 0.046502 s
    New Data: (4,fourth row)
    New Data: (5,fifth row)
    Work completed, stopping the Spark context.

    The data in Cassandra:

    cqlsh:test> select * from key_value ;
    
     key | value
    -----+------------
       5 |  fifth row
       1 |  first row
       2 | second row
       4 | fourth row
       3 |  third row
    
    (5 rows)

    Up to this point things went fairly smoothly; apart from the duplicate-file issue during assembly, everything was OK.

Original article: https://www.cnblogs.com/zhangqingping/p/4997357.html