要实现一个类似于 SQL
中的 case when
功能,为用户打上标签。
例如:
select tj_month,
name,
online_time,
case when online_time < 12 then '(3,12]'
when online_time >= 12 and online_time < 24 then '[12,24)'
when online_time >= 24 and online_time < 36 then '[24,36)'
when online_time >= 36 and online_time < 48 then '[36,48)'
when online_time >= 48 and online_time < 60 then '[48,60)'
else '>60' end as online_time_cut
from table_name
where tj_month = '202106';
一、利用pandas.DataFrame.loc直接筛选
构造测试数据框。
import numpy as np
import pandas as pd
data = np.array([[np.nan, 0], [2, 0], [np.nan, 1]])
df = pd.DataFrame(data=data, columns=['a', 'b'])
'''
a b
0 NaN 0.0
1 2.0 0.0
2 NaN 1.0
'''
直接筛选符合条件数据进行打标。
# 此方法已不推荐 不支持 建议使用loc/iloc定位
df[(df['a'].isnull()) & (df['b'] == 0)]['c'] = 1
# loc定位
df['c'] = 0
df.loc[(df['a'].isnull()) & (df['b'] == 0), 'c'] = 1
'''
a b c
0 NaN 0.0 1.0
1 2.0 0.0 NaN
2 NaN 1.0 NaN
'''
二、利用np.where筛选
# 满足条件 输出x 否则输出y
np.where(condition, x, y)
np.where(df.isnull(), 100, 5)
'''
array([[100, 5],
[ 5, 5],
[100, 5]])
'''
# 打标签
df['c'] = np.where((df['a'].isnull()) & (df['b'] == 0), 1, 0)
One more嵌套判断的例子:
df['class'] = np.where(df['score'].between(0, 60, inclusive=False), '不及格',
np.where(df['score'].between(60, 80, inclusive=True), '良好', '优秀'))
三、利用np.select筛选
np.select
函数可以根据某些条件筛选某些元素,使用语法为:
np.select(condition_list, choice_list, default=0)
# 条件列表、执行操作列表、缺失值
# 返回列表
实操:
df['c'] = np.select([(df['a'].isnull()) & (df['b'] == 0),
(df['a'].isnull()) & (df['b'] == 1),
(df['a'] == 2) & (df['b'] == 0)],
['one', 'two', 'three'],
default = 'XXX')
'''
a b c
0 NaN 0.0 one
1 2.0 0.0 three
2 NaN 1.0 two
'''
四、利用apply函数与if语句
apply
应用在 dataframe
上,用于对行或者列进行计算。
- axis=1 指定按行计算
- lambda匿名函数判断满足条件为1,不满足为0
df['c'] = df.apply(lambda x: 1 if np.isnan(x[0]) and x[1] == 0 else 0, axis=1)
df
'''
a b c
0 NaN 0.0 1
1 2.0 0.0 0
2 NaN 1.0 0
'''
另外一个简单的例子:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,11,size=(1000000,5)), columns=('a','b','c','d','e'))
def func(a,b,c,d,e):
if e == 10:
return c*d
elif (e < 10) and (e >= 5):
return c+d
elif e < 5:
return a+b
df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
df
'''
a b c d e new
0 2 0 5 7 5 12
1 9 3 3 0 2 12
2 2 0 9 10 3 2
3 5 8 3 8 9 11
4 1 10 0 2 0 11
'''
# 例子
def function(x):
if x['数学'] != 0:
s = x['语文']/x['数学']
else:
s = 0
return s
data['result'] = data.apply(lambda x: function(x), axis=1)
data
参考链接:Pandas等价于创建新变量的SQL case when语句