第二期 Pandas数据处理
21.读取本地EXCEL数据
import pandas as pd
df = pd.read_excel('pandas120.xlsx')
22.查看df数据前5行
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503211846313-1560062416.png)
23.将salary列数据转换为最大值与最小值的平均值
#备注,在某些版本pandas中.ix方法可能失效,可使用.iloc,参考https://mp.weixin.qq.com/s/5xJ-VLaHCV9qX2AMNOLRtw
#为什么不能直接使用max,min函数,因为我们的数据中是20k-35k这种字符串,所以需要先用正则表达式提取数字
import re
# 方法一:apply + 自定义函数
def func(df):
lst = df['salary'].split('-')
smin = int(lst[0].strip('k'))
smax = int(lst[1].strip('k'))
df['salary'] = int((smin + smax) / 2 * 1000)
return df
df = df.apply(func,axis=1)
# 方法二:iterrows + 正则
import re
for index,row in df.iterrows():
nums = re.findall('d+',row[2])
df.iloc[index,2] = int(eval(f'({nums[0]} + {nums[1]}) / 2 * 1000'))
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503211942038-272822598.png)
24.将数据根据学历进行分组并计算平均薪资
print(df.groupby('education').mean())
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212138169-220466684.png)
25.将createTime列时间转换为月-日
#备注,在某些版本pandas中.ix方法可能失效,可使用.iloc,参考https://mp.weixin.qq.com/s/5xJ-VLaHCV9qX2AMNOLRtw
for i in range(len(df)):
df.ix[i,0] = df.ix[i,0].to_pydatetime().strftime("%m-%d")
df.head()
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212228737-751452161.png)
26.查看索引、数据类型和内存信息
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212259626-1628623474.png)
27.查看数值型列的汇总统计
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212330350-37419895.png)
28.新增一列根据salary将数据分为三组
bins = [0,5000, 20000, 50000]
group_names = ['低', '中', '高']
df['categories'] = pd.cut(df['salary'], bins, labels=group_names)
df
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212357150-1794736386.png)
29.按照salary列对数据降序排列
df.sort_values('salary', ascending=False)
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212613605-160412636.png)
30.取出第33行数据
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212650558-1503909402.png)
31.计算salary列的中位数
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212729414-984831056.png)
32.绘制薪资水平频率分布直方图
#执行两次
df.salary.plot(kind='hist')
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212802331-1834673723.png)
33.绘制薪资水平密度曲线
df.salary.plot(kind='kde',xlim=(0,80000))
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212828113-1786241696.png)
34.删除最后一列categories
del df['categories']
# 等价于
df.drop(columns=['categories'], inplace=True)
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503212901644-704665338.png)
35.将df的第一列与第二列合并为新的一列
df['test'] = df['education']+df['createTime']
df
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213047791-1073402520.png)
36.将education列与salary列合并为新的一列
#备注:salary为int类型,操作与35题有所不同
df["test1"] = df["salary"].map(str) + df['education']
df
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213113514-1901989197.png)
37.计算salary最大值与最小值之差
df[['salary']].apply(lambda x: x.max() - x.min())
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213142841-1172361843.png)
38.将第一行与最后一行拼接
pd.concat([df[:1], df[-2:-1]])
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213208653-1337466606.png)
39.将第8行数据添加至末尾
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213236581-1102533828.png)
40.查看每列的数据类型
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213453216-1471708496.png)
41.将createTime列设置为索引
df.set_index("createTime")
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213525343-999236634.png)
42.生成一个和df长度相同的随机数dataframe
df1 = pd.DataFrame(pd.Series(np.random.randint(1, 10, 135)))
df1
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213557166-417991643.png)
43.将上一题生成的dataframe与df合并
df= pd.concat([df,df1],axis=1)
df
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213629888-1448266567.png)
44.生成新的一列new为salary列减去之前生成随机数列
df["new"] = df["salary"] - df[0]
df
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213701137-684094272.png)
45.检查数据中是否含有任何缺失值
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503213947130-82954291.png)
46.将salary列类型转换为浮点数
df['salary'].astype(np.float64)
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503214016158-370497256.png)
47.计算salary大于10000的次数
len(df[df['salary']>10000])
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503214041313-1989964015.png)
48.查看每种学历出现的次数
df.education.value_counts()
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503214108744-881458221.png)
49.查看education列共有几种学历
df['education'].nunique()
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503214134117-117246528.png)
50.提取salary与new列的和大于60000的最后3行
df1 = df[['salary','new']]
rowsums = df1.apply(np.sum, axis=1)
res = df.iloc[np.where(rowsums > 60000)[0][-3:], :]
res
![](https://img2020.cnblogs.com/blog/1275415/202105/1275415-20210503214203407-446261420.png)