• Handling Missing Data with pandas - 【老鱼学pandas】


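    Suppose our dataset contains missing values. How should we handle them?

    Dropping rows or columns that contain missing values

    First, we define a dataset that contains missing values: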

    import pandas as pd
    import numpy as np
    dates = pd.date_range("2017-01-08", periods=6)
    data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
    
    data.iloc[0, 1] = np.nan
    data.iloc[1, 2] = np.nan
    
    print("data:")
    print(data)
    

    Here the missing values are set with np.nan. The output is:

    data:
                 A     B     C   D
    2017-01-08   0   NaN   2.0   3
    2017-01-09   4   5.0   NaN   7
    2017-01-10   8   9.0  10.0  11
    2017-01-11  12  13.0  14.0  15
    2017-01-12  16  17.0  18.0  19
    2017-01-13  20  21.0  22.0  23
    

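    Dropping missing data

    The dropna function drops the rows or columns that contain missing values.
    Here we drop the rows that contain missing values as an example: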

    import pandas as pd
    import numpy as np
    dates = pd.date_range("2017-01-08", periods=6)
    data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
    
    data.iloc[0, 1] = np.nan
    data.iloc[1, 2] = np.nan
    
    print("data:")
    print(data)
    
    print("Result:")
    print(data.dropna(axis=0))
    

    The output is:

    data:
                 A     B     C   D
    2017-01-08   0   NaN   2.0   3
    2017-01-09   4   5.0   NaN   7
    2017-01-10   8   9.0  10.0  11
    2017-01-11  12  13.0  14.0  15
    2017-01-12  16  17.0  18.0  19
    2017-01-13  20  21.0  22.0  23
    Result:
                 A     B     C   D
    2017-01-10   8   9.0  10.0  11
    2017-01-11  12  13.0  14.0  15
    2017-01-12  16  17.0  18.0  19
    2017-01-13  20  21.0  22.0  23
    

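    This drops the rows for 2017-01-08 and 2017-01-09, which contain NaN. dropna returns a new DataFrame and leaves data itself unchanged.
    The parameters of dropna include:
    axis: 0 = drop rows (the default), 1 = drop columns
    how: 'any' = drop a row (or column) if it contains at least one NaN (the default), 'all' = drop it only if all of its values are NaN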
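
The two parameters above can be tried out with a minimal sketch like the following. It rebuilds the same dataset, with one assumption that differs from the original code: the array is created with dtype=float so that NaN can be stored without changing the column dtype. The variable names by_column and all_nan_rows are only illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: build the frame as float from the start so NaN fits without a dtype change.
dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=["A", "B", "C", "D"],
)
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan

# axis=1 drops every column that contains at least one NaN (here B and C).
by_column = data.dropna(axis=1)

# how="all" only drops rows in which every value is NaN; this dataset has no such row.
all_nan_rows = data.dropna(axis=0, how="all")

print(by_column.columns.tolist())  # ['A', 'D']
print(all_nan_rows.shape)          # (6, 4)
```

Dropping by column keeps all six dates but loses two of the four columns, so which axis to drop along depends on whether the rows or the columns matter more for the analysis.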

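    Replacing missing values with other values

    When handling missing values, we can also replace them with another value by using the fillna function.
    For example, to set the missing values to -1: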

    import pandas as pd
    import numpy as np
    dates = pd.date_range("2017-01-08", periods=6)
    data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
    
    data.iloc[0, 1] = np.nan
    data.iloc[1, 2] = np.nan
    
    print("data:")
    print(data)
    
    ret = data.fillna(-1)
    print("Result:")
    print(ret)
    

    The output is:

    data:
                 A     B     C   D
    2017-01-08   0   NaN   2.0   3
    2017-01-09   4   5.0   NaN   7
    2017-01-10   8   9.0  10.0  11
    2017-01-11  12  13.0  14.0  15
    2017-01-12  16  17.0  18.0  19
    2017-01-13  20  21.0  22.0  23
    Result:
                 A     B     C   D
    2017-01-08   0  -1.0   2.0   3
    2017-01-09   4   5.0  -1.0   7
    2017-01-10   8   9.0  10.0  11
    2017-01-11  12  13.0  14.0  15
    2017-01-12  16  17.0  18.0  19
    2017-01-13  20  21.0  22.0  23
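
fillna also accepts a dict that maps column names to replacement values, which is useful when a single constant such as -1 does not suit every column. The sketch below is only an illustration: replacing B with 0.0 and C with its own mean is an arbitrary choice, and the frame is built with dtype=float (an assumption that differs from the original code) so that NaN fits without a dtype change.

```python
import numpy as np
import pandas as pd

# Assumption: float data from the start, otherwise the same dataset as above.
dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=["A", "B", "C", "D"],
)
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan

# Per-column replacement values: a constant for B, the column mean for C.
# Series.mean() skips NaN by default, so the mean uses the five known values of C.
values = {"B": 0.0, "C": data["C"].mean()}
filled = data.fillna(value=values)

print(filled)
```

fillna returns a new DataFrame, so data still contains its NaN entries after this call.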
    
    

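    Checking whether missing data exists

    The isnull() function checks for missing values: it returns a DataFrame of the same shape in which every position that holds a missing value is True: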

    import pandas as pd
    import numpy as np
    dates = pd.date_range("2017-01-08", periods=6)
    data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
    
    data.iloc[0, 1] = np.nan
    data.iloc[1, 2] = np.nan
    
    print("data:")
    print(data)
    
    ret = data.isnull()
    print("Result:")
    print(ret)
    

    The output is:

    data:
                 A     B     C   D
    2017-01-08   0   NaN   2.0   3
    2017-01-09   4   5.0   NaN   7
    2017-01-10   8   9.0  10.0  11
    2017-01-11  12  13.0  14.0  15
    2017-01-12  16  17.0  18.0  19
    2017-01-13  20  21.0  22.0  23
    Result:
                    A      B      C      D
    2017-01-08  False   True  False  False
    2017-01-09  False  False   True  False
    2017-01-10  False  False  False  False
    2017-01-11  False  False  False  False
    2017-01-12  False  False  False  False
    2017-01-13  False  False  False  False
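
Because True counts as 1 when summed, the boolean frame returned by isnull() can also be used to count missing values. The following is a minimal sketch under the same assumption as before (the frame is built with dtype=float, unlike the original code); the names missing_per_column and total_missing are only illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: float data from the start, otherwise the same dataset as above.
dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=["A", "B", "C", "D"],
)
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan

# Summing the boolean mask column by column counts the NaN entries in each column.
missing_per_column = data.isnull().sum()

# Summing those counts gives the number of missing values in the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)  # 2
```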
    
    

    If we want to know whether the data as a whole contains any missing value, we can do the following:

    import pandas as pd
    import numpy as np
    dates = pd.date_range("2017-01-08", periods=6)
    data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
    
    data.iloc[0, 1] = np.nan
    data.iloc[1, 2] = np.nan
    
    print("data:")
    print(data)
    
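    # isnull() already returns booleans, so comparing with True is redundant;
    # .values.any() reduces the whole mask to a single True/False.
    ret = data.isnull().values.any()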
    
    print("Result:")
    print(ret)
    

    The output is:

    data:
                 A     B     C   D
    2017-01-08   0   NaN   2.0   3
    2017-01-09   4   5.0   NaN   7
    2017-01-10   8   9.0  10.0  11
    2017-01-11  12  13.0  14.0  15
    2017-01-12  16  17.0  18.0  19
    2017-01-13  20  21.0  22.0  23
    Result:
    True
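
A single True says that something is missing but not where. As a hedged sketch, the same boolean mask can be reduced along one axis only, to find the affected columns or to select the affected rows. The frame is again built with dtype=float (an assumption that differs from the original code), and the names has_missing_by_column and rows_with_missing are only illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: float data from the start, otherwise the same dataset as above.
dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=["A", "B", "C", "D"],
)
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan

# any() with the default axis=0 gives one boolean per column.
has_missing_by_column = data.isnull().any()

# any(axis=1) gives one boolean per row, which can be used to filter the rows.
rows_with_missing = data[data.isnull().any(axis=1)]

print(has_missing_by_column)
print(rows_with_missing)
```

Here rows_with_missing contains exactly the two rows that dropna(axis=0) would remove.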
    
  • Original article: https://www.cnblogs.com/dreampursuer/p/7843298.html