• Advanced Pandas Operations


    pandas data processing

    1. Deleting duplicate rows

    Use the duplicated() function to detect duplicate rows. It returns a Series of booleans, one element per row; an element is True if that row is not the first occurrence of its contents.

    • The keep parameter specifies which of the duplicated rows is kept

    • Create a DataFrame that contains duplicate rows

    import numpy as np
    import pandas as pd
    from pandas import DataFrame
    
    # create a DataFrame of random integers
    df = DataFrame(data=np.random.randint(0,100,size=(12,7)))
    df               
    
    0 1 2 3 4 5 6
    0 18 75 98 68 33 35 15
    1 48 71 36 34 41 17 46
    2 3 28 81 21 86 53 85
    3 15 35 35 90 63 83 14
    4 12 36 65 79 25 53 95
    5 98 63 4 58 35 64 80
    6 31 61 23 33 80 53 60
    7 52 47 60 58 54 35 17
    8 7 92 42 61 31 40 56
    9 76 45 30 42 74 83 53
    10 69 2 89 99 12 51 62
    11 17 86 1 76 40 34 41
    # manually set several rows of df to identical contents
    df.iloc[1] = [6,6,6,6,6,6,6]
    df.iloc[8] = [6,6,6,6,6,6,6]
    df.iloc[5] = [6,6,6,6,6,6,6]
    df
    
    0 1 2 3 4 5 6
    0 18 75 98 68 33 35 15
    1 6 6 6 6 6 6 6
    2 3 28 81 21 86 53 85
    3 15 35 35 90 63 83 14
    4 12 36 65 79 25 53 95
    5 6 6 6 6 6 6 6
    6 31 61 23 33 80 53 60
    7 52 47 60 58 54 35 17
    8 6 6 6 6 6 6 6
    9 76 45 30 42 74 83 53
    10 69 2 89 99 12 51 62
    11 17 86 1 76 40 34 41
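
    Before dropping anything, the result of duplicated() itself can be inspected. A minimal sketch on the df above (rows 1, 5 and 8 are identical), using the default keep='first':

    df.duplicated(keep='first')           # boolean Series: row 1 (first occurrence) is False, rows 5 and 8 are True
    df.loc[~df.duplicated(keep='first')]  # keep only rows that are not marked as duplicates
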
    • Use the drop_duplicates() function to delete duplicate rows
      • drop_duplicates(keep='first' / 'last' / False)
    df.drop_duplicates(keep='last')  # keep the last occurrence of each set of duplicates
    
    0 1 2 3 4 5 6
    0 26 9 31 11 38 46 22
    2 89 24 53 28 64 89 40
    3 7 80 43 91 32 95 6
    4 96 92 58 55 82 73 21
    6 43 1 13 54 24 34 43
    7 75 32 88 85 40 29 41
    8 6 6 6 6 6 6 6
    9 93 3 62 4 60 51 78
    10 58 11 63 42 62 30 10
    11 89 93 96 49 23 40 57
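
    For comparison, a sketch of the other keep options on the same df:

    df.drop_duplicates(keep='first')  # keep the first all-6 row (index 1), drop indexes 5 and 8
    df.drop_duplicates(keep=False)    # drop all three all-6 rows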

    2. Mapping

    1) The replace() function: replacing elements

    DataFrame replacement operations

    • Single-value replacement

      • Plain replacement: replace every matching element: to_replace=15, value='e'
      • Column-specific single-value replacement: to_replace={column label: value to match}, value='value'
    • Multi-value replacement

      • List replacement: to_replace=[...], value=[...]
      • Dictionary replacement (recommended): to_replace={old: new, old: new}
    df
    
    0 1 2 3 4 5 6
    0 26 9 31 11 38 46 22
    1 6 6 6 6 6 6 6
    2 89 24 53 28 64 89 40
    3 7 80 43 91 32 95 6
    4 96 92 58 55 82 73 21
    5 6 6 6 6 6 6 6
    6 43 1 13 54 24 34 43
    7 75 32 88 85 40 29 41
    8 6 6 6 6 6 6 6
    9 93 3 62 4 60 51 78
    10 58 11 63 42 62 30 10
    11 89 93 96 49 23 40 57

    Note: on a DataFrame, the method and limit parameters cannot be used

    df.replace(to_replace=6,value='six') # replace every 6 in the data with 'six'; the original DataFrame is not modified by default
    
    0 1 2 3 4 5 6
    0 18 75 98 68 33 35 15
    1 six six six six six six six
    2 3 28 81 21 86 53 85
    3 15 35 35 90 63 83 14
    4 12 36 65 79 25 53 95
    5 six six six six six six six
    6 31 61 23 33 80 53 60
    7 52 47 60 58 54 35 17
    8 six six six six six six six
    9 76 45 30 42 74 83 53
    10 69 2 89 99 12 51 62
    11 17 86 1 76 40 34 41
    df.replace(to_replace={6:'six'})  # same effect as above
    
    0 1 2 3 4 5 6
    0 26 9 31 11 38 46 22
    1 six six six six six six six
    2 89 24 53 28 64 89 40
    3 7 80 43 91 32 95 six
    4 96 92 58 55 82 73 21
    5 six six six six six six six
    6 43 1 13 54 24 34 43
    7 75 32 88 85 40 29 41
    8 six six six six six six six
    9 93 3 62 4 60 51 78
    10 58 11 63 42 62 30 10
    11 89 93 96 49 23 40 57
    df.replace(to_replace={5:6},value='six')  # replace the 6s in column 5 with 'six'
    
    0 1 2 3 4 5 6
    0 26 9 31 11 38 46 22
    1 6 6 6 6 6 six 6
    2 89 24 53 28 64 89 40
    3 7 80 43 91 32 95 6
    4 96 92 58 55 82 73 21
    5 6 6 6 6 6 six 6
    6 43 1 13 54 24 34 43
    7 75 32 88 85 40 29 41
    8 6 6 6 6 6 six 6
    9 93 3 62 4 60 51 78
    10 58 11 63 42 62 30 10
    11 89 93 96 49 23 40 57
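
    The multi-value forms listed above work the same way. A minimal sketch (the values 6 and 89 are chosen only for illustration):

    df.replace(to_replace=[6,89],value=['six','eighty-nine'])   # list replacement: 6 -> 'six', 89 -> 'eighty-nine'
    df.replace(to_replace={6:'six',89:'eighty-nine'})           # dictionary replacement (recommended), same result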

    2) The map() function

    Creating a new column. map() is not a DataFrame method but a Series method.

    • map() can be used to map out a new column of data

    • map() can take a lambda expression

    • map() can take a function, including a user-defined one

      eg: map({to_replace:value})

    • Note: aggregation functions such as sum, and for loops, cannot be used with map()

    • Add a new column: add a column to df whose values are the English names corresponding to the Chinese names

    dic = {
        'name':['张三','周杰伦','张三'],
        'salary':[20000,10000,20000]
    }
    df = DataFrame(data=dic)
    df
    
    name salary
    0 张三 20000
    1 周杰伦 10000
    2 张三 20000
    # mapping table
    dic = {
        '张三':'tom',
        '周杰伦':'jay'
    }
    df['e_name'] = df['name'].map(dic)
    df
    
    name salary e_name
    0 张三 20000 tom
    1 周杰伦 10000 jay
    2 张三 20000 tom

    map() can also be treated as a computation tool; which computation is performed is determined by the argument passed to map() (a lambda or a function)

    • Using a custom function
    def after_sal(s):
        return s - (s-3000)*0.5
    
    # the portion of the salary above 3000 is taxed at 50%
    df['after_sal'] = df['salary'].map(after_sal)
    df
    
    name salary e_name after_sal
    0 张三 20000 tom 11500.0
    1 周杰伦 10000 jay 6500.0
    2 张三 20000 tom 11500.0
    df['after_sal'] = df['salary'].apply(after_sal)  # apply is faster than map
    df
    
    name salary e_name after_sal
    0 张三 20000 tom 11500.0
    1 周杰伦 10000 jay 6500.0
    2 张三 20000 tom 11500.0
    • Using a lambda expression
    df['after_sal'] = df['salary'].apply(lambda x:x-(x-3000)*0.5)  # or map
    df
    
    name salary e_name after_sal
    0 张三 20000 tom 11500.0
    1 周杰伦 10000 jay 6500.0
    2 张三 20000 tom 11500.0

    Note: not every kind of function can be passed to map(). Only a function that takes a single argument and returns a value can be used as the argument of map().
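
    If a function needs extra parameters, it can be wrapped in a one-argument lambda. A minimal sketch; the rate parameter and after_sal2 are hypothetical and not part of the original example:

    def after_sal2(s, rate=0.5):
        # hypothetical variant of after_sal with a configurable tax rate
        return s - (s - 3000) * rate

    df['salary'].map(lambda x: after_sal2(x, rate=0.3))  # the lambda itself takes a single argument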

    3. Using aggregation to detect and filter outliers

    The df.std() function returns the standard deviation of each column of a DataFrame

    • Create a 1000-row, 3-column df with values in the range (0, 1) and compute the standard deviation of each column
    df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
    df
    
    A B C
    0 0.056365 0.080972 0.378327
    1 0.371930 0.007791 0.318345
    2 0.140999 0.921772 0.752930
    3 0.877110 0.447756 0.760049
    4 0.212178 0.143772 0.621486
    5 0.255404 0.195473 0.008677
    6 0.011568 0.308934 0.882607
    7 0.470868 0.080049 0.285998
    8 0.659013 0.794802 0.270541
    9 0.315826 0.814653 0.906056
    10 0.892474 0.301340 0.687254
    11 0.015484 0.567598 0.043682
    12 0.957620 0.967676 0.063608
    13 0.102506 0.490077 0.235902
    14 0.099083 0.778190 0.451824
    15 0.023148 0.074169 0.589411
    16 0.425894 0.772662 0.797658
    17 0.939475 0.773502 0.766101
    18 0.330299 0.984615 0.346554
    19 0.882735 0.237546 0.847036
    20 0.578589 0.730879 0.751632
    21 0.504627 0.716272 0.386102
    22 0.424879 0.231262 0.590047
    23 0.580738 0.675268 0.726104
    24 0.507248 0.136465 0.463764
    25 0.421517 0.814806 0.449040
    26 0.275373 0.935430 0.525679
    27 0.404031 0.221492 0.730966
    28 0.779142 0.063435 0.120807
    29 0.618392 0.535934 0.554632
    ... ... ... ...
    970 0.378107 0.687434 0.567923
    971 0.876770 0.443219 0.236627
    972 0.486757 0.416836 0.524889
    973 0.886021 0.203959 0.789022
    974 0.838247 0.279468 0.333581
    975 0.762230 0.352878 0.550439
    976 0.044568 0.680916 0.350743
    977 0.031232 0.029839 0.918445
    978 0.323142 0.686965 0.978349
    979 0.746471 0.081773 0.729567
    980 0.810169 0.793025 0.993532
    981 0.480849 0.321984 0.233431
    982 0.491794 0.056681 0.429988
    983 0.278019 0.105290 0.435492
    984 0.480974 0.098199 0.958667
    985 0.465396 0.806955 0.668972
    986 0.602675 0.966963 0.338542
    987 0.051971 0.105833 0.132917
    988 0.416362 0.861777 0.832573
    989 0.951651 0.002912 0.942564
    990 0.274033 0.071102 0.941272
    991 0.632913 0.807060 0.540686
    992 0.035006 0.526970 0.058584
    993 0.368957 0.395593 0.210440
    994 0.692847 0.655492 0.877564
    995 0.245593 0.003551 0.913750
    996 0.374804 0.311604 0.680521
    997 0.355928 0.924330 0.224949
    998 0.923060 0.834740 0.275359
    999 0.905336 0.482290 0.722851

    1000 rows × 3 columns

    Apply a filter condition to df to remove rows that deviate too much. Suppose the filter condition is: values in column C greater than twice the standard deviation of column C.

    twice_std = df['C'].std() * 2
    twice_std
    
    0.5714973528631762
    
    ~(df['C'] > twice_std)
    
    0       True
    1       True
    2      False
    3      False
    4      False
    5       True
    6      False
    7       True
    8       True
    9      False
    10     False
    11      True
    12      True
    13      True
    14      True
    15     False
    16     False
    17     False
    18      True
    19     False
    20     False
    21      True
    22     False
    23     False
    24      True
    25      True
    26      True
    27     False
    28      True
    29      True
           ...  
    970     True
    971     True
    972     True
    973    False
    974     True
    975     True
    976     True
    977    False
    978    False
    979    False
    980    False
    981     True
    982     True
    983     True
    984    False
    985    False
    986     True
    987     True
    988    False
    989    False
    990    False
    991     True
    992     True
    993     True
    994    False
    995    False
    996    False
    997     True
    998     True
    999    False
    Name: C, Length: 1000, dtype: bool
    
    df.loc[~(df['C'] > twice_std)]
    
    A B C
    0 0.056365 0.080972 0.378327
    1 0.371930 0.007791 0.318345
    5 0.255404 0.195473 0.008677
    7 0.470868 0.080049 0.285998
    8 0.659013 0.794802 0.270541
    11 0.015484 0.567598 0.043682
    12 0.957620 0.967676 0.063608
    13 0.102506 0.490077 0.235902
    14 0.099083 0.778190 0.451824
    18 0.330299 0.984615 0.346554
    21 0.504627 0.716272 0.386102
    24 0.507248 0.136465 0.463764
    25 0.421517 0.814806 0.449040
    26 0.275373 0.935430 0.525679
    28 0.779142 0.063435 0.120807
    29 0.618392 0.535934 0.554632
    31 0.616251 0.034984 0.342615
    33 0.009574 0.195987 0.221378
    35 0.721609 0.518311 0.561978
    36 0.316993 0.678054 0.163737
    37 0.494355 0.499986 0.560351
    39 0.584863 0.881738 0.481162
    43 0.160369 0.402388 0.208208
    45 0.002698 0.576528 0.070493
    47 0.764883 0.778927 0.494559
    48 0.868643 0.392903 0.109240
    49 0.058928 0.350504 0.497170
    50 0.373490 0.783554 0.335720
    55 0.638066 0.442382 0.173654
    56 0.837218 0.722685 0.454352
    ... ... ... ...
    943 0.322268 0.957496 0.108147
    944 0.384463 0.490386 0.245737
    945 0.382611 0.726888 0.345724
    947 0.713337 0.828064 0.364005
    948 0.818703 0.445825 0.281585
    951 0.968651 0.897188 0.368103
    952 0.136136 0.431300 0.543917
    954 0.846105 0.064527 0.200963
    955 0.708107 0.857570 0.475146
    957 0.595819 0.060763 0.294676
    958 0.268046 0.790128 0.342255
    959 0.116645 0.968789 0.493773
    967 0.073665 0.204168 0.286095
    968 0.205796 0.596242 0.468190
    970 0.378107 0.687434 0.567923
    971 0.876770 0.443219 0.236627
    972 0.486757 0.416836 0.524889
    974 0.838247 0.279468 0.333581
    975 0.762230 0.352878 0.550439
    976 0.044568 0.680916 0.350743
    981 0.480849 0.321984 0.233431
    982 0.491794 0.056681 0.429988
    983 0.278019 0.105290 0.435492
    986 0.602675 0.966963 0.338542
    987 0.051971 0.105833 0.132917
    991 0.632913 0.807060 0.540686
    992 0.035006 0.526970 0.058584
    993 0.368957 0.395593 0.210440
    997 0.355928 0.924330 0.224949
    998 0.923060 0.834740 0.275359

    559 rows × 3 columns
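
    A sketch extending the same idea to all three columns at once (drop a row if any of its values exceeds twice that column's standard deviation); df.std() is a per-column Series and is broadcast across the rows:

    df.loc[~((df > df.std() * 2).any(axis=1))]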

    • Detecting and filtering missing values
      • dropna
      • fillna
    • Detecting and filtering duplicate values
      • drop_duplicates(keep)
    • Detecting and filtering outliers (see the sketch after this list)
      • Work out the condition that identifies the outliers
      • Drop the rows that match that condition
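
    A minimal sketch of those last two steps, reusing the twice_std threshold computed above:

    outlier_index = df.loc[df['C'] > twice_std].index   # row labels where column C exceeds the threshold
    df_clean = df.drop(labels=outlier_index, axis=0)    # drop those rows; same result as the boolean filter above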

    4. Sorting

    Sorting with the .take() function

    • take() accepts a list of integer positions and reorders df according to the order of those positions
    • eg: df.take([1,3,4,2,5])

    np.random.permutation() can be used to shuffle the order randomly

    np.random.permutation(1000)  # a random permutation of the integers 0-999
    
    array([956, 614, 993, 437, 371, 215, 579, 282, 301, 646, 893,   7, 441,
           539, 953, 794, 155, 370, 154, 100, 753, 793, 412, 867, 941, 998,
           672, 590, 708,   1, 634, 899, 417, 242, 557, 122, 397, 850, 543,
           560, 389, 896, 903, 505, 685, 334, 665, 460, 768, 937, 522, 637,
           121, 605, 107, 130, 286, 532, 982, 563, 995,  89, 217, 218,  82,
           781, 951, 798, 200, 947, 790, 398, 538, 411,  15,  44, 784, 205,
           281, 314, 439, 132, 192, 238, 795, 470,  65, 842, 259, 426, 528,
           383, 682, 750, 119, 465, 503, 278, 715, 603, 544, 265, 239, 569,
           204, 616, 343, 710, 653, 256,   6, 873, 338,  27, 570, 707,  70,
            73, 233, 838, 799, 266, 859, 279, 136, 479, 724, 870, 611, 574,
           564, 655, 177,  39, 253, 148, 471, 317, 661, 851,  69, 523, 513,
           928, 650,  23, 582, 622, 814, 959, 723, 938, 612, 912, 865, 402,
           638,  80, 962, 214, 983, 194, 680, 758,  29,  74,  86, 102, 583,
           695, 580, 835, 931, 832, 454, 258, 493, 967, 670, 555, 494, 501,
           581, 591, 179, 354, 118, 671, 380, 732, 229, 719, 623, 874, 495,
           944, 900, 123, 250, 628, 244, 872, 731, 625, 586,  57, 752, 596,
           827, 775, 841, 163, 394, 833, 153, 669, 295, 826, 384, 890, 711,
            60, 141, 237, 198, 404, 463, 712, 960, 749, 510, 866, 609,  26,
           169, 372, 459, 365, 949, 124, 733,  12, 257, 668, 878, 487, 138,
           652, 300, 219, 413, 445, 193, 207, 337, 779,  77,  95, 693, 812,
           409,  33, 490, 992,   9, 167, 358, 743, 369,  99, 817, 542, 706,
           289, 589, 666, 927, 391, 761, 844, 452,  66, 830, 498, 968, 689,
           329, 508, 526, 335, 884, 129, 972, 507, 480, 274, 110, 425, 500,
           388, 418, 869, 769, 251, 863, 456, 112, 247, 304, 478, 481, 429,
           741, 241, 347,  37, 673, 427, 285, 415,  59, 853, 144, 822, 125,
           455,  64, 332,  71, 971, 763, 804,  19, 191, 918, 608,  61, 327,
           137, 116, 746, 482, 828, 766, 691, 424, 727, 468, 633, 302, 861,
           848, 134, 704, 491, 320, 280, 660, 375, 846, 359, 987, 511, 342,
           307, 399,  76, 825,  11,  28, 961, 485, 451, 675, 457, 618, 554,
           551, 885, 531, 880, 534, 160, 607, 367, 374, 797, 910, 970, 595,
           575, 756,  90, 897, 801,  49, 140, 985, 512, 577, 922, 168, 225,
           360, 315, 350, 919, 231, 911, 631,  31, 774, 103, 186, 892, 293,
           483, 149, 860, 887,  93, 340, 744, 908,  52, 196, 222, 955,   3,
           930, 571, 484, 156,  50, 843, 599, 506, 936, 703, 881, 273, 520,
            41,  85, 328, 223,  48, 492,  97,  56,  36, 974, 924, 656,  58,
           649,  92, 114,  62, 173, 984, 973, 346, 573, 996, 597, 990, 667,
           206, 917, 213, 272, 462, 686, 469, 472, 236, 643, 787, 224, 120,
           255,  24, 171,  94, 904, 771, 344, 556, 981, 593, 988, 271, 762,
           363, 254, 535, 361, 979, 303, 692, 964, 504, 150, 894, 349, 796,
           714, 525, 943, 785, 260, 145, 292, 718, 811, 234, 641, 403, 818,
           999, 461, 778, 802, 901, 352,  40, 515,  32, 877, 664, 323, 966,
           635, 905, 754, 940, 810, 182,  75, 442, 308, 262, 776, 592, 267,
           203, 294, 657,  34, 414, 405, 232, 151, 373, 601,  14, 807, 467,
           421,  43, 935, 430, 287, 313, 283, 152, 516, 530, 356, 559, 518,
           644, 889, 977, 521, 548, 381, 674, 929,   0, 916, 246, 540, 297,
            67, 980, 422, 117, 772,  53,  13,  91,  46, 423, 509,  21, 128,
           598, 115, 610, 679, 783, 264,  78, 270, 824, 311, 648, 220, 636,
           226, 658, 886, 227, 268, 773, 620, 529, 864, 502, 567, 713, 963,
           366, 210, 333, 249, 600, 701,   2, 640, 407, 745, 942, 113,  87,
           390, 159, 188, 948, 957, 488, 351, 288, 245, 431, 248, 164, 767,
           839, 702, 803, 792, 594, 837, 489, 934, 684, 386, 629, 519, 876,
            63, 448,  98, 858, 378, 298, 368, 453,  25, 868, 624,  79, 133,
           902, 906, 428, 401, 162, 157, 728, 950, 662, 190, 496, 568, 975,
           952, 627, 909, 994, 131, 780, 751, 883, 871, 319, 722, 199, 536,
           209, 821, 318, 290, 393,  35, 325, 187, 786, 681, 284, 514, 331,
           647, 855, 143, 989, 642,  96, 676, 986, 561, 602, 336,  20, 379,
           847, 735, 954, 645, 547, 357, 447, 435, 739, 228, 566, 305, 353,
           158, 755, 716, 730, 856, 127,  47, 392, 862, 809, 720, 760, 432,
           243, 932, 208, 382, 585, 747, 111, 836, 736, 700, 705, 615, 355,
            18, 330, 820,   8, 857, 184, 175, 221, 737, 524, 697, 436, 395,
           764, 939, 104, 759, 819, 240, 659, 147, 269, 387, 420, 621, 364,
           926, 201, 549, 165, 696, 742, 997, 181, 277, 726,  10, 683, 991,
           291,  81, 126,  68, 920, 808, 572, 740, 533, 699,  72, 146, 230,
           888,   5, 606, 466, 263, 458, 898, 604, 385, 805, 105, 211, 945,
           958, 721, 823, 376, 497, 545, 576, 738, 626, 852, 449, 541, 444,
           406, 976,  88, 815, 552, 166, 183, 178, 438, 553,  84,  83, 717,
           651, 782, 678, 324, 584,  42, 687, 517, 195, 106, 101, 933, 434,
           348, 440, 587, 310, 923, 663, 921, 499, 565, 296,  38, 891, 895,
           316,  30, 978, 677, 170, 322, 613, 546, 527, 630, 476, 174,  51,
           816, 845, 185, 108,  17, 321, 813, 806, 109, 882, 197, 550, 907,
           339, 698, 965, 362, 729, 914, 791, 694, 475, 879, 486, 309, 748,
           326, 688, 202, 410, 915, 690, 854, 377, 341, 788,  22, 777, 275,
           473, 261, 400,  45,  54, 135, 770, 189, 946, 562, 925, 537, 789,
           312, 829, 725, 252, 800, 578, 446,  55, 419, 396,   4, 558, 212,
           831, 450, 299, 161, 617, 345, 306, 757, 709, 180, 235, 433, 840,
           477, 913, 474, 734, 408, 443, 834, 654, 875, 172, 632, 416,  16,
           216, 464, 139, 619, 588, 969, 176, 276, 142, 639, 765, 849])
    
    # shuffle both the rows and the columns randomly
    df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)
    
    A C B
    810 0.056462 0.836914 0.105296
    2 0.140999 0.752930 0.921772
    721 0.941986 0.206568 0.283233
    803 0.302248 0.027969 0.946815
    46 0.576391 0.604795 0.199215
    224 0.091563 0.448896 0.460941
    682 0.081894 0.360009 0.174743
    894 0.758221 0.311932 0.054626
    389 0.951142 0.174418 0.764700
    441 0.283697 0.577370 0.698306
    350 0.623445 0.681211 0.547610
    53 0.186217 0.617344 0.339724
    467 0.231915 0.540558 0.972880
    962 0.543442 0.895628 0.444214
    598 0.516110 0.047393 0.670478
    337 0.022056 0.222698 0.010719
    481 0.182805 0.301250 0.652167
    277 0.127561 0.749532 0.170472
    162 0.767004 0.261541 0.381312
    250 0.847071 0.344957 0.539958
    416 0.369274 0.495600 0.393579
    425 0.228196 0.273655 0.114908
    843 0.394974 0.904397 0.875514
    893 0.451844 0.336345 0.787189
    492 0.516625 0.499929 0.350670
    453 0.218878 0.957251 0.308231
    186 0.611224 0.981765 0.809362
    243 0.092659 0.374212 0.658671
    522 0.773774 0.436375 0.037527
    961 0.172133 0.762221 0.800747
    ... ... ... ...
    624 0.587435 0.183552 0.831386
    675 0.636248 0.542904 0.918788
    861 0.519202 0.322943 0.315798
    989 0.951651 0.942564 0.002912
    136 0.940608 0.069835 0.504026
    950 0.294872 0.712361 0.821118
    529 0.648302 0.860493 0.626701
    833 0.783501 0.823326 0.357173
    173 0.181090 0.697154 0.906783
    615 0.177069 0.732558 0.275658
    182 0.091686 0.262477 0.340532
    913 0.069850 0.903723 0.102737
    417 0.353772 0.345310 0.618327
    487 0.697415 0.083422 0.921608
    345 0.331507 0.295755 0.995060
    978 0.323142 0.978349 0.686965
    197 0.947977 0.235533 0.295503
    133 0.428408 0.963203 0.485624
    214 0.861541 0.840486 0.435903
    640 0.453934 0.807253 0.940066
    977 0.031232 0.918445 0.029839
    698 0.780159 0.042282 0.127449
    427 0.326411 0.101616 0.915007
    898 0.768911 0.231629 0.451036
    77 0.718200 0.682757 0.986735
    865 0.553171 0.535761 0.088467
    513 0.203601 0.908238 0.116113
    711 0.655778 0.164941 0.472295
    685 0.012172 0.035356 0.501114
    801 0.891855 0.355426 0.682663

    1000 rows × 3 columns

    • np.random.permutation(x) generates a random permutation of the x integers from 0 to x-1
    df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)[0:5]
    
    B A C
    839 0.817163 0.346661 0.113644
    708 0.644456 0.327089 0.081710
    244 0.852833 0.366820 0.028656
    728 0.627186 0.850947 0.375577
    238 0.784179 0.764240 0.579280

    Random sampling

    When the DataFrame is large enough, np.random.permutation(x) combined with take() gives a simple way to draw a random sample.
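
    A minimal sketch of drawing 10 random rows (len(df) is 1000 here); pandas also ships a built-in equivalent, df.sample:

    df.take(np.random.permutation(len(df))[:10], axis=0)  # shuffle the row positions, keep the first 10
    df.sample(n=10)                                       # built-in random sampling, same idea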

    5. Grouped data processing [key topic]

    Data aggregation is the last step of data processing; it usually reduces each group of values to a single number.

    Grouped data processing:

    • Split: first divide the data into groups
    • Apply: apply functions to the different groups to transform their data
    • Combine: merge the results obtained from the different groups

    The core of grouped data processing:

    • the groupby() function
    • the groups attribute, which shows how the data was grouped
    • eg: df.groupby(by='item').groups

    Grouping

    df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                    'price':[4,3,3,2.5,4,2],
                   'color':['red','yellow','yellow','green','green','green'],
                   'weight':[12,20,50,30,20,44]})
    df
    
    color item price weight
    0 red Apple 4.0 12
    1 yellow Banana 3.0 20
    2 yellow Orange 3.0 50
    3 green Banana 2.5 30
    4 green Orange 4.0 20
    5 green Apple 2.0 44
    • Use groupby to perform the grouping
    df.groupby(by='item',axis=0)
    
    <pandas.core.groupby.DataFrameGroupBy object at 0x00000235AA6F6C18>
    
    • Use the groups attribute to view the grouping
    # groupby() groups the data, but the GroupBy object itself does not display the groups
    df.groupby(by='item',axis=0).groups
    
    {'Apple': Int64Index([0, 5], dtype='int64'),
     'Banana': Int64Index([1, 3], dtype='int64'),
     'Orange': Int64Index([2, 4], dtype='int64')}
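
    A GroupBy object can also be iterated over directly; a minimal sketch:

    for name, group in df.groupby(by='item'):
        print(name, len(group))   # each item is a (group key, sub-DataFrame) pair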
    
    • Aggregation after grouping: within each group, values that can take part in the computation are aggregated; values that cannot are left out
    # add a new column to df holding the average price of each fruit
    df.groupby(by='item',axis=0).mean()['price']
    
    item
    Apple     3.00
    Banana    2.75
    Orange    3.50
    Name: price, dtype: float64
    
    mean_price_series = df.groupby(by='item',axis=0)['price'].mean()
    mean_price_series
    
    item
    Apple     3.00
    Banana    2.75
    Orange    3.50
    Name: price, dtype: float64
    
    # mapping table
    dic = mean_price_series.to_dict()
    df['mean_price'] = df['item'].map(dic)
    df
    
    color item price weight mean_price
    0 red Apple 4.0 12 3.00
    1 yellow Banana 3.0 20 2.75
    2 yellow Orange 3.0 50 3.50
    3 green Banana 2.5 30 2.75
    4 green Orange 4.0 20 3.50
    5 green Apple 2.0 44 3.00

    Compute the average price of apples

    df
    
    color item price weight
    0 red Apple 4.0 12
    1 yellow Banana 3.0 20
    2 yellow Orange 3.0 50
    3 green Banana 2.5 30
    4 green Orange 4.0 20
    5 green Apple 2.0 44
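
    A minimal sketch of answering that question directly (the value 3.0 follows from the two apple prices, 4.0 and 2.0):

    df.groupby(by='item')['price'].mean()['Apple']   # 3.0
    df.loc[df['item'] == 'Apple']['price'].mean()    # equivalent without groupby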

    View the average price of the fruits of each color

    Merge: merge the per-color average prices back into df

    df['color_mean_price'] = df['color'].map(df.groupby(by='color')['price'].mean().to_dict())
    df
    
    color item price weight mean_price color_mean_price
    0 red Apple 4.0 12 3.00 4.000000
    1 yellow Banana 3.0 20 2.75 3.000000
    2 yellow Orange 3.0 50 3.50 3.000000
    3 green Banana 2.5 30 2.75 2.833333
    4 green Orange 4.0 20 3.50 2.833333
    5 green Apple 2.0 44 3.00 2.833333
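
    The same merge can be written more directly with transform (covered in the next section); a minimal sketch:

    df['color_mean_price'] = df.groupby(by='color')['price'].transform('mean')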

    6. Advanced data aggregation

    After grouping with groupby, transform and apply can also be given custom functions to perform further computations

    • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
    • transform and apply both carry out the computation; just pass the function to transform or apply
    • transform and apply can also take a lambda expression
    df.groupby(by='item')['price'].mean()
    
    item
    Apple     3.00
    Banana    2.75
    Orange    3.50
    Name: price, dtype: float64
    
    def my_mean(s):
        sum = 0
        for i in s:
            sum += i
        return sum/len(s)
    
    # use apply to compute the average price of each fruit
    df.groupby(by='item')['price'].apply(my_mean)
    
    item
    Apple     3.00
    Banana    2.75
    Orange    3.50
    Name: price, dtype: float64
    
    # use transform to compute the average price of each fruit
    df.groupby(by='item')['price'].transform(my_mean)
    
    0    3.00
    1    2.75
    2    3.50
    3    2.75
    4    3.50
    5    3.00
    Name: price, dtype: float64
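
    A sketch of the lambda form mentioned above; both lines compute the same group mean:

    df.groupby(by='item')['price'].apply(lambda s: s.mean())      # one value per group
    df.groupby(by='item')['price'].transform(lambda s: s.mean())  # one value per original row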