• Accessing data in Hadoop using dplyr and SQL


    If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyr package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase, and Spark.
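
    To make the "implicit SQL" idea concrete, here is a minimal sketch that needs no cluster at all. It assumes the dplyr and dbplyr packages are installed, and it uses dbplyr's lazy_frame() and simulate_hive() helpers to stand in for a real Hive connection. The table and its columns (carrier, dep_delay) are hypothetical names chosen for illustration.

    ```r
    library(dplyr)
    library(dbplyr)

    # A stand-in for a remote table: no data is queried and no connection is opened.
    # simulate_hive() asks dbplyr to use its Hive flavor of SQL translation.
    flights <- lazy_frame(carrier = "AA", dep_delay = 5, con = simulate_hive())

    # Ordinary dplyr verbs; nothing is executed, only a query is built up.
    query <- flights %>%
      filter(dep_delay > 0) %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

    # Print the SQL that dplyr would send to the data source.
    show_query(query)
    ```

    The printed statement should contain a WHERE clause for the filter and a GROUP BY clause for the grouping, which is the SQL you would otherwise have written by hand.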

    There are two methods for accessing data in Hadoop using dplyr and SQL.

    ODBC

    You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data-source-specific driver (e.g., for Hive, Impala, or HBase) installed on your desktop or your server. You will also need a few R packages. We recommend DBI, dplyr, and odbc. Note that dplyr relies on the dbplyr package to translate R into specific variants of SQL, so dbplyr should be installed as well. You can use the odbc package to create a connection with Hadoop and run queries:

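    library(DBI)   # dbConnect(), dbGetQuery(), dbDisconnect()
    library(dplyr) # tbl()
    library(odbc)

    # Replace each "<...>" placeholder with the values for your own cluster
    con <- dbConnect(odbc::odbc(), driver = "<driver>", host = "<host>", dbname = "<dbname>", user = "<user>", password = "<password>", port = 10000)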

    tbl(con, "mytable") # dplyr
    dbGetQuery(con, "SELECT * FROM mytable") # SQL

    dbDisconnect(con)
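
    The two access styles above can be tried without a Hadoop cluster. The sketch below swaps in an in-memory SQLite database as a stand-in for the ODBC connection, so it assumes the RSQLite and dbplyr packages are installed in addition to DBI and dplyr. The table contents and the value column are made up for illustration; only the connection line would differ against Hive or Impala.

    ```r
    library(DBI)
    library(dplyr)

    # Stand-in connection (assumption: SQLite instead of an ODBC driver for Hadoop)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")

    # Hypothetical sample data so that "mytable" exists
    dbWriteTable(con, "mytable", data.frame(id = 1:3, value = c(10, 20, 30)))

    # dplyr: the filter is translated to SQL and runs in the database;
    # collect() pulls the result into R as a local data frame
    via_dplyr <- tbl(con, "mytable") %>%
      filter(value > 10) %>%
      collect()

    # SQL: the same query written explicitly
    via_sql <- dbGetQuery(con, "SELECT * FROM mytable WHERE value > 10")

    dbDisconnect(con)
    ```

    Both via_dplyr and via_sql hold the same rows. With dplyr, data only moves into R when collect() is called, which matters when the remote table is large.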

    Spark

    If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connection with Spark and run queries:

    library(sparklyr)
    library(dplyr) # tbl()
    library(DBI)   # dbGetQuery()

    con <- spark_connect(master = "yarn-client")

    tbl(con, "mytable") # dplyr
    dbGetQuery(con, "SELECT * FROM mytable") # SQL

    spark_disconnect(con)


    Reposted from: https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
  • Original post: https://www.cnblogs.com/payton/p/8758893.html