spark scala 删除所有列全为空值的行

删除表中全部为NaN的行

df.na.drop("all")

删除表任一列中有NaN的行

df.na.drop("any")

示例:

scala> df.show
+----+-------+--------+-------------------+-----+----------+
|  id|zipcode|    type|               city|state|population|
+----+-------+--------+-------------------+-----+----------+
|   1|    704|STANDARD|               null|   PR|     30100|
|   2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|   3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|   4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|   5|  76177|STANDARD|               null|   TX|      null|
|null|   null|    null|               null| null|      null|
|   7|  76179|STANDARD|               null|   TX|      null|
+----+-------+--------+-------------------+-----+----------+


scala> df.na.drop("all").show()
+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
|  7|  76179|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+


scala> df.na.drop().show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode|  type|             city|state|population|
+---+-------+------+-----------------+-----+----------+
|  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
+---+-------+------+-----------------+-----+----------+


scala> df.na.drop("any").show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode|  type|             city|state|population|
+---+-------+------+-----------------+-----+----------+
|  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
+---+-------+------+-----------------+-----+----------+

删除给定列为Null的行:

val nameArray = sparkEnv.sc.textFile("/master/abc.txt").collect()
val df = df.na.drop("all", nameArray.toList.toArray)

df.na.drop(Seq("population","type"))

函数原型:

def drop(): DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values.

def drop(how: String): DataFrame
Returns a new DataFrame that drops rows containing null or NaN values.
If how is "any", then drop rows containing any null or NaN values. If how is "all", then drop rows only if every column is null or NaN for that row.

def drop(how: String, cols: Seq[String]): DataFrame
(Scala-specific) Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.

def drop(how: String, cols: Array[String]): DataFrame
Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.

def drop(cols: Seq[String]): DataFrame
(Scala-specific) Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.

def drop(cols: Array[String]): DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.

更多函数原型:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions

参考:
N多spark使用示例:https://sparkbyexamples.com/spark/spark-dataframe-drop-rows-with-null-values/
示例代码及数据集:https://github.com/spark-examples/spark-scala-examples csv路径:src/main/resources/small_zipcode.csv
https://www.jianshu.com/p/39852729736a

相关阅读:
Python Kivy 安装问题解决
 cisco asa5510 配置
 对于yum中没有的源的解决办法-EPEL
python安装scrapy小问题总结
 win10 清理winsxs文件夹
 centos(7.0) 上 crontab 计划任务
 CentOS — MySQL备份 Shell 脚本
 python 2,3版本自动识别导入
 segmenter.go
segment.go
原文地址：https://www.cnblogs.com/v5captain/p/14248636.html