• Day03-基础篇(四)Pandas与数据清洗


    一、基本介绍

      1、Pandas之所以使用频率高的原因:

        (1)Pandas提供的基础数据结构DataFrame和json的契合度高,转换起来方便。

        (2)对于不复杂大数据清理工作采用Pandas可以快速对数据进行规整。

      2、Series(创建一维序列)

        (1)定义:定长的字典序列(相当于存储了两个ndarray)。

        (2)基本属性:index和values,index默认为0,1,2...递增的整数序列,我们也可以自己指定索引。

    import pandas as pd
    from pandas import Series,DataFrame
    #以默认index创建series
    x1 = Series([1,2,3,4])
    #以规定index创建series
    x2 = Series(data = [1,2,3,4],index = ['a','b','c','d'])
    print(x1)
    print(x2)

        (3)采用字典的方式创建Series

    import pandas as pd
    from pandas import Series,DataFrame
    #先创建数组,再用数组创建series
    d = {'a':1,'b':2,'c':3,'d':4}
    x3 = Series(d)
    print(x3)

      3、DataFrame(二维表结构)

        (1)定义:类似数据库表,包含行索引和列索引,可以看成是由相同索引的Series组成的字典类型。(一个索引对应多个数值)

    import pandas as pd
    from pandas import Series,DataFrame
    data = {'Chinese':[66,95,93,90,80],'English':[65,85,92,88,90],'Math':[30,98,96,77,90]}
    #以字典的数据为基础创建df1表结构
    df1 = DataFrame(data) #默认行索引为整数序列
    #column可以规定字典设置的列索引
    df2 = DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','HuangZhong','DianWei'],
                    columns=['English','Math','Chinese'])
    print(df1)
    print('======================================')
    print(df2)

      4、数据导入和输出

        Pandas允许从xlsx,csv等文件中导入数据并输出。

    import pandas as pd
    from pandas import Series,DataFrame
    score = DataFrame(pd.read_excel('D:Personal大一下学期计算机科研立项进入国赛人数.xlsx'))
    score.to_excel('D:Personal大一下学期计算机科研立项data.xlsx')
    print(score)
    score1 = DataFrame(pd.read_excel('D:Personal大一下学期计算机科研立项data.xlsx'))
    print('=============================')
    print(score1)

      5、数据清洗

         (1)删除DataFrame中的不必要的列或者行,采用drop()函数即可快速删除

    import pandas as pd
    from pandas import Series,DataFrame
    data = {'Chinese':[66,95,93,90,80],'English':[65,85,92,88,90],'Math':[30,98,96,77,90]}
    df2 = DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','HuangZhong','DianWei'],
                     columns=['English','Math','Chinese'])
    print(df2)
    #删除DataFrame中不必要的列或行
    df2 = df2.drop(columns=['Chinese']) #删除语文列
    df2 = df2.drop(index = ['ZhangFei']) #删除张飞行
    print(df2)

        (2)重命名行或列(rename)

    import pandas as pd
    from pandas import Series,DataFrame
    data = {'Chinese':[66,95,93,90,80],'English':[65,85,92,88,90],'Math':[30,98,96,77,90]}
    df2 = DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','HuangZhong','DianWei'],
                     columns=['English','Math','Chinese'])
    print(df2)
    #重命名列名
    df2.rename(columns={'Chinese':'YuWen','English':'YingYu'},inplace = True)
    print(df2)

        (3)去重复的值(drop_duplicates)

            自动去除重复的行

    df = df.drop_duplicates() #去除重复行

        (4)更改数据格式(astype)

    import numpy as np
    import pandas as pd
    from pandas import Series,DataFrame
    data = {'Chinese':[66,95,93,90,80],'English':[65,85,92,88,90],'Math':[30,98,96,77,90]}
    df2 = DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','HuangZhong','DianWei'],
                     columns=['English','Math','Chinese'])
    print(df2)
    #格式转换
    df2['Chinese'] = df2['Chinese'].astype('str')
    print(df2['Chinese'].dtype)
    df2['Chinese'] = df2['Chinese'].astype(np.int64)
    print(df2['Chinese'].dtype)

        (5)数据间空格的删除

            在把数据格式转化成str之后,是为了方便对数据进行操作,要删除数据间的空格,我们需要引入strip函数:(map的作用是将strip函数施加到Chinese字段的每一个数据上)

    import numpy as np
    import pandas as pd
    from pandas import Series,DataFrame
    data = {'Chinese':[66,95,93,90,80],'English':[65,85,92,88,90],'Math':[30,98,96,77,90]}
    df2 = DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','HuangZhong','DianWei'],
                     columns=['English','Math','Chinese'])
    print(df2)
    #格式转换
    df2['Chinese'] = df2['Chinese'].astype('str')
    print(df2['Chinese'].dtype)
    #删除左右两边的空格
    df2['Chinese'] = df2['Chinese'].map(str.strip)
    #删除左边空格
    df2['Chinese'] = df2['Chinese'].map(str.lstrip)
    #删除右边空格
    df2['Chinese'] = df2['Chinese'].map(str.rstrip)
    print(df2['Chinese'])

        (6)删除特殊符号的方法

    df2['Chinese']=df2['Chinese'].str.strip('$')

        (7)大小写转换

    import numpy as np
    import pandas as pd
    from pandas import Series,DataFrame
    data = {'Chinese':[66,95,93,90,80],'English':[65,85,92,88,90],'Math':[30,98,96,77,90]}
    df2 = DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','HuangZhong','DianWei'],
                     columns=['English','Math','Chinese'])
    print(df2)
    print('==================================')
    #全部大写
    df2.columns = df2.columns.str.upper()
    print(df2)
    print('==================================')
    #全部小写
    df2.columns = df2.columns.str.lower()
    print(df2)
    print('==================================')
    #首字母大写
    df2.columns = df2.columns.str.title()
    print(df2)
    print('==================================')

        (8)查找空值

            直接查看具体位置的空值采用isnull()方法,查看某列存在空值要采用isnull().any()方法

    import numpy as np
    import pandas as pd
    from pandas import Series,DataFrame
    data = {'Chinese':[66,95,93,90,80],'English':[65,85,92,88,90],'Math':[None,98,96,77,90]}
    df2 = DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','HuangZhong','DianWei'],
                     columns=['English','Chinese','Math'])
    print(df2)
    print('==================================')
    print(df2.isnull())
    print('==================================')
    print(df2.isnull().any())

      6、使用apply函数进行数据清洗。

         (1)对name列所有数值进行大写转化

    import pandas as pd
    from pandas import Series, DataFrame
    data = {'Chinese': ['up', 'up', 'mid', 'mid','low'],'English': [65, 85, 92, 88, 90],'Math': [30, 98, 96, 77, 90]}
    df= DataFrame(data)
    print(df)
    print('===================================')
    df['Chinese'] = df['Chinese'].apply(str.upper)
    print(df)

       

        (2)也可以在apply函数中使用自定义函数对数据进行处理

    import pandas as pd
    from pandas import Series, DataFrame
    data = {'语文': ['up', 'up', 'mid', 'mid','low'],'英语': [65, 85, 92, 88, 90],'数学': [30, 98, 96, 77, 90]}
    df= DataFrame(data)
    print(df)
    print('===================================')
    #定义一个可以将原数组*2的函数
    def double_df(x):
        return 2*x
    df['数学'] = df['数学'].apply(double_df)
    print(df)

        (3)对定义函数中对原def进行修改后返回再重新应用,axis=1代表按列进行操作,args为传递的参数,对应def中传入的两个参数,从而生成新的def

    import pandas as pd
    from pandas import Series, DataFrame
    data = {'语文': ['up', 'up', 'mid', 'mid','low'],'英语': [65, 85, 92, 88, 90],'数学': [30, 98, 96, 77, 90]}
    df= DataFrame(data)
    print(df)
    print('===================================')
    #定义一个可以增加新列的函数
    def plus(df,m,n):
        df['new1'] = (df['数学']+df['英语'])*m
        df['new2'] = (df['数学']+df['英语'])*n
        return df
    df = df.apply(plus,axis = 1,args = (2,3,))
    print(df)

      7、数据统计

        (1)统计函数部分有很多和numpy库已知,只做简单介绍,具体操作请查看numpy进行辅助学习即可。包含的函数有:

        (2)describe()函数快速了解数据

    import pandas as pd
    from pandas import Series, DataFrame
    
    df = DataFrame({'data1':range(5),'data2':range(5)})
    print(df)
    print('=============================')
    print (df.describe())

      8、数据表合并(merge()函数)

        (1)创建两个数据表

    import pandas as pd
    from pandas import Series, DataFrame
    df1 = DataFrame({'name':['a','b','c','d','e'],'data1':range(5)})
    df2 = DataFrame({'name':['A','B','C','D','E'],'data2':range(5)})

        (2)基于指定列进行连接

    import pandas as pd
    from pandas import Series, DataFrame
    df1 = DataFrame({'name':['a','b','c','d','e'],'data1':range(5)})
    df2 = DataFrame({'name':['a','b','c','D','E'],'data2':range(5)})
    #基于'name'进行链接
    df3 = pd.merge(df1,df2,on='name')
    print(df3)

        (2)inner内连接

            inner内链接使merge合并的默认情况,其就是键的交集,基于相同的键进行字段链接.

    import pandas as pd
    from pandas import Series, DataFrame
    df1 = DataFrame({'name':['a','b','c','d','e'],'data1':range(5)})
    df2 = DataFrame({'name':['a','b','c','D','E'],'data2':range(5)})
    df3 = pd.merge(df1, df2, how='inner')
    print(df3)

        (3)left左连接(右连接和左连接相反)

            左连接是以第一个DataFrame为主进行的连接

    import pandas as pd
    from pandas import Series, DataFrame
    df1 = DataFrame({'name':['a','b','c','d','e'],'data1':range(5)})
    df2 = DataFrame({'name':['a','b','c','D','E'],'data2':range(5)})
    df3 = pd.merge(df1, df2, how='left')
    print(df3)

        (4)outer外连接

          外连接相同于求两个DataFrame的并集

    import pandas as pd
    from pandas import Series, DataFrame
    df1 = DataFrame({'name':['a','b','c','d','e'],'data1':range(5)})
    df2 = DataFrame({'name':['a','b','c','D','E'],'data2':range(5)})
    df3 = pd.merge(df1, df2, how='outer')
    print(df3)

  • 相关阅读:
    快速幂取模算法详解
    牛客网小白月赛5I区间(差分数组)
    多重背包模板
    hdu5791(DP)
    CodeForces
    最长上升子序列LIS(51nod1134)
    POJ1088(记忆搜索加dp)
    最长公共子序列LCS(POJ1458)
    Gym 100971J-Robots at Warehouse
    模板
  • 原文地址:https://www.cnblogs.com/MurasameLory-chenyulong/p/14286368.html
Copyright © 2020-2023  润新知