根据Wikipedia的定义,几何平均值定义为n个数字乘积的n次根。根据同一页,几何平均值也可以表示为对数的算术平均值的指数。我将使用它来计算每一列的几何平均值。
计算几何平均值,方法是将c1
和c2
的列数据组合到一个名为value
的新列中,该列将源列名称存储在column
中。重新格式化数据后,通过按column
(c1
或c2
)分组并计算每个组的对数值的算术平均值的指数来确定几何平均值。在此计算中,NaN
值被忽略。
%spark.pyspark from pyspark.sql import functions as F df = sqlContext.range(0, 10) df = df.select(F.rand(seed=10).alias("c1"), F.randn(seed=27).alias("c2")) df_id = df.repartition(1).withColumn("id", F.monotonically_increasing_id()).repartition(2) print("df_id表:") df_id.show() kvp = F.explode(F.array([F.struct(F.lit(c).alias("column"), F.col(c).alias("value")) for c in df.columns])).alias("kvp") df_pivoted = df_id.select(['id'] + [kvp]).select(['id'] + ["kvp.column", "kvp.value"]) print("df_pivoted:") df_pivoted.show() df_geometric_mean = df_pivoted.groupBy(['column']).agg(F.exp(F.avg(F.log(df_pivoted.value)))) print("df_geometric_mean:") df_geometric_mean.show() df_geometric_mean.withColumnRenamed("EXP(avg(LOG(value)))", "geometric_mean").show() 结果: df_id表: +-------------------+--------------------+---+ | c1| c2| id| +-------------------+--------------------+---+ | 0.5996723933366402| -0.6950946460326453| 8| |0.16452185994603707| 0.8306551181802446| 3| | 0.3654625958161396| 0.16420866768809156| 1| | 0.4175019040792016| -1.0451987154313813| 2| |0.34084319330900115| -1.4298752463301871| 6| | 0.6396141227834357|-0.07706276399864868| 9| | 0.9697474945375325| 0.5991404703866096| 5| |0.18141810315190554| 1.3792681955381285| 4| |0.03422639313807285| 0.45800664187768786| 0| | 0.4725977369833597|-0.19668314148825305| 7| +-------------------+--------------------+---+ df_pivoted: +---+------+--------------------+ | id|column| value| +---+------+--------------------+ | 8| c1| 0.5996723933366402| | 8| c2| -0.6950946460326453| | 3| c1| 0.16452185994603707| | 3| c2| 0.8306551181802446| | 1| c1| 0.3654625958161396| | 1| c2| 0.16420866768809156| | 2| c1| 0.4175019040792016| | 2| c2| -1.0451987154313813| | 6| c1| 0.34084319330900115| | 6| c2| -1.4298752463301871| | 9| c1| 0.6396141227834357| | 9| c2|-0.07706276399864868| | 5| c1| 0.9697474945375325| | 5| c2| 0.5991404703866096| | 4| c1| 0.18141810315190554| | 4| c2| 1.3792681955381285| | 0| c1| 0.03422639313807285| | 0| c2| 0.45800664187768786| | 7| c1| 0.4725977369833597| | 7| c2|-0.19668314148825305| +---+------+--------------------+ df_geometric_mean: +------+-------------------+ |column|EXP(avg(ln(value)))| +------+-------------------+ | c1|0.31407283822800264| | c2| 0.5528065693289157| +------+-------------------+ +------+-------------------+ |column|EXP(avg(ln(value)))| +------+-------------------+ | c1|0.31407283822800264| | c2| 0.5528065693289157| +------+-------------------+