• Python基本数据统计(二)---- 数据选择 & 简单统计与处理


    1. 便捷数据获取

      1.2 网络数据获取:

        1.2.1 urllib, urllib2, httplib, httplib2和正则表达式(python3中为urllib.request, http.client)

    In [179]: import urllib2,re
    
    In [180]: dStr = urllib2.urlopen('https://hk.finance.yahoo.com/q/cp?s=%5EDJI').read()
    
    In [181]: m = re.findall('<tr><td class="yfnc_tabledata1"><b><a href=".*?">(.*?)</a></b></td><td class="yfnc_tabledata1">(.*?)</td>.*?<b>(.*?)</b>.*?</tr>',dStr)
    
    In [182]: m
    Out[182]: 
    [('AAPL', 'xe8x98x8bxe6x9ex9cxe5x85xacxe5x8fxb8', '120.000'),
     ('AXP', 'American Express Company', '76.200'),
     ('BA', 'The Boeing Company', '159.530'),
    ...
     ('XOM', 'Exxon Mobil Corporation', '85.890')]
    View Code

    2. 数据准备和整理

    3. 数据显示

    4. 数据选择

      4.1 选择行

        4.1.1 索引

    In [189]: quotesdf.ix['2016-12-30']
    Out[189]: 
    open      7.364590e+01
    close     7.376540e+01
    high      7.402429e+01
    low       7.352641e+01
    volume    3.378800e+06
    Name: 2016-12-30, dtype: float64
    quotesdf.ix['index']

        4.1.2 切片

    In [190]: quotesdf['2016-12-30':'2017-01-02']
    Out[190]: 
                     open      close       high       low     volume
    2016-12-30  73.645904  73.765397  74.024287  73.52641  3378800.0
    quotesdf['2016-12-30':'2017-01-02']

      4.2 选择列

    In [192]: djidf['code']
    Out[192]: 
    0     AAPL
    1      AXP
    2       BA
    ...
    29     XOM
    Name: code, dtype: object
    
    In [193]: djidf.code
    Out[193]: 
    0     AAPL
    1      AXP
    2       BA
    ...
    29     XOM
    Name: code, dtype: object
    djidf['code'] / djidf.code

      4.3 行、列  - 标签label ( loc )

    In [64]: djidf.loc[1:5,]
    Out[64]: 
       code                      name lasttrade
    1   AXP  American Express Company    76.200
    2    BA        The Boeing Company   159.530
    3   CAT          Caterpillar Inc.    94.580
    4  CSCO                    思科系?公司    30.100
    5   CVX       Chevron Corporation   115.600
    
    In [65]: djidf.loc[:,['code','lasttrade']]
    Out[65]: 
        code lasttrade
    0   AAPL   120.000
    1    AXP    76.200
    2     BA   159.530
    3    CAT    94.580
    ...
    29   XOM    85.890
    obj.loc[x : xx, ['y','yy'] ]

      4.4 行和列的区域  - 标签label ( loc 和 at )

    In [66]: djidf.loc[1:5,['code','lasttrade']]
    Out[66]: 
       code lasttrade
    1   AXP    76.200
    2    BA   159.530
    3   CAT    94.580
    4  CSCO    30.100
    5   CVX   115.600
    
    In [67]: djidf.loc[1,'lasttrade']
    Out[67]: '76.200'
    
    In [68]: djidf.at[1,'lasttrade']
    Out[68]: '76.200'
    obj.loc[x, 'y']

      4.5 行、列和区域 ( iloc 和 iat )

    In [69]: djidf.loc[1:5,['code','lasttrade']]
    Out[69]: 
       code lasttrade
    1   AXP    76.200
    2    BA   159.530
    3   CAT    94.580
    4  CSCO    30.100
    5   CVX   115.600
    
    In [70]: djidf.iloc[1:6,[0,2]]
    Out[70]: 
       code lasttrade
    1   AXP    76.200
    2    BA   159.530
    3   CAT    94.580
    4  CSCO    30.100
    5   CVX   115.600
    
    In [71]: djidf.loc[1,'lasttrade']
    Out[71]: '76.200'
    
    In [72]: djidf.at[1,'lasttrade']
    Out[72]: '76.200'
    
    In [73]: djidf.iloc[1,2]
    Out[73]: '76.200'
    
    In [74]: djidf.iat[1,2]
    Out[74]: '76.200'
    obj.iloc[ a:b, [c,d] ]

      4.5 条件筛选

    In [77]: quotesdf[quotesdf.index >= '2016-12-20']
    Out[77]: 
                     open      close       high        low     volume
    2016-12-20  74.681487  74.741230  75.179363  74.213482  3244900.0
    ...
    2017-01-20  75.989998  76.199997  76.910004  75.389999  8382000.0
    
    In [78]: quotesdf[(quotesdf.index >= '2016-12-20') & (quotesdf.close >=76)]
    Out[78]: 
                     open      close       high        low     volume
    2017-01-04  75.260002  76.260002  76.550003  75.059998  4635800.0
    ...
    2017-01-20  75.989998  76.199997  76.910004  75.389999  8382000.0
    quotesdf[(quotesdf.index >= '2016-12-20') & (quotesdf.close >=76)]

    5. 简单统计与处理

      5.1 平均值(high列的平均值)

    In [199]: quotesdf.high.mean()
    Out[199]: 64.246265703871586
    quotesdf.high.mean()

        # 待解决问题:通过正则表达式得到是数据是string类型,无法计算mean值

        2017/01/22 解决:djidf.lasttrade = djidf.lasttrade.astype(float)  # lasttrade列转为float类型,再赋给djidf

      5.2 简单筛选(筛选出开盘价open大于75的数据的收盘价close)

    In [204]: quotesdf[quotesdf.open>=75].close
    Out[204]: 
    2017-01-04    76.260002
    2017-01-05    75.320000
    ...
    2017-01-20    76.199997
    Name: close, dtype: float64
    quotesdf[quotesdf.open>=75].close

      5.3 简单统计(统计AXP公司近一年股票涨和跌的天数)

    In [206]: len(quotesdf[quotesdf.close > quotesdf.open])
    Out[206]: 135
    
    In [207]: len(quotesdf)-135
    Out[207]: 118
    len(quotesdf[quotesdf.close > quotesdf.open])

      5.4 简单统计(统计AXP公司近一年相邻两天收盘价的涨跌情况)

    In [211]: import numpy as np
    
    In [212]: np.diff(quotesdf.close)  # 相邻两天,后一天减前一天的差
    Out[212]: 
    array([-7.436398, -0.039243,  0.068674, -0.559202, -1.60893 ,  0.608254,
            ...
           -0.030007, -0.259994, -0.020005,  0.89    , -0.799996, -0.490005])
    
    In [213]: len(np.diff(quotesdf.close))
    Out[213]: 252
    
    In [214]: len(quotesdf)
    Out[214]: 253
    
    In [215]: status = np.sign(np.diff(quotesdf.close))  # 提取差值array的正负符号
    
    In [216]: status
    Out[216]: 
    array([-1., -1.,  1., -1., -1.,  1.,  1., -1.,  1.,  1., -1., -1.,  1.,
            ...
           -1., -1.,  1., -1., -1.])
    
    In [218]: status[np.where(status==1.)].size  # 差值符号为正的个数,即涨的天数
    Out[218]: 140
    
    In [219]: status[np.where(status==-1.)].size  # 差值符号为负的个数,即跌的天数
    Out[219]: 112
        
    np.sign(np.diff(quotesdf.close))

      5.5 排序 (按最近一次成交价对30只股票进行排序,列出前三名的code)

    In [313]: djidf.sort_values('lasttrade',axis=0,ascending=False)[:3].code
    Out[313]: 
    9      GS
    17    MMM
    11    IBM
    Name: code, dtype: object
    
    In [314]: djidf.sort_values('lasttrade',axis=0,ascending=False)[:3]
    Out[314]: 
       code                                         name  lasttrade
    9    GS                The Goldman Sachs Group, Inc.     232.20
    17  MMM                                   3M Company     178.49
    11  IBM  International Business Machines Corporation     170.55
    
    In [315]: djidf.sort_values('lasttrade',axis=0,ascending=False)
    Out[315]: 
        code                                         name  lasttrade
    9     GS                The Goldman Sachs Group, Inc.     232.20
    17   MMM                                   3M Company     178.49
    11   IBM  International Business Machines Corporation     170.55
    2     BA                           The Boeing Company     159.53
    ...
    4   CSCO                                       思科系統公司      30.10
    djidf.sort_values('lasttrade',axis=0,ascending=False)[:3].code

      5.6 计数统计 (统计某月开盘天数)

    In [319]: t = quotesdf[(quotesdf.index >= '2016-12-01') & (quotesdf.index < '2017-01-01')]
    
    In [320]: t
    Out[320]: 
                     open      close       high        low     volume
    2016-12-01  72.142314  72.221976  72.460955  71.475161  4298200.0
    2016-12-02  72.212016  71.554823  72.351429  71.126648  2959100.0
    ...
    2016-12-30  73.645904  73.765397  74.024287  73.526410  3378800.0
    
    In [321]: len(t)
    Out[321]: 21
    quotesdf[(quotesdf.index >= '2016-12-01') & (quotesdf.index < '2017-01-01')]

      5.7 计数统计 (统计近一年每个月的开盘天数)

    In [322]: import time
    
    In [323]: listtemp = []
    
    In [324]: for i in range(0,len(quotesdf)):
         ...:     temp = time.strptime(quotesdf.index[i],"%Y-%m-%d")  # 取出每行索引,转为日期格式
         ...:     listtemp.append(temp.tm_mon)  # 将日期的月份提取出来,追加到列表
         ...:     
    
    In [325]: print listtemp
    [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
    In [326]: tempdf = quotesdf.copy()  # 将quotesdf复制一份
    
    In [327]: tempdf['month'] = listtemp  # 将日期list添加为df的一列month
    
    In [328]: tempdf['month'].value_counts()  # 统计month列值的个数
    8     23
    6     22
    3     22
    12    21
    11    21
    10    21
    9     21
    5     21
    4     21
    7     20
    2     20
    1     20
    Name: month, dtype: int64
    tempdf['month'].value_counts()
    In [344]: print listtemp
    [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
    In [345]: adf = pd.DataFrame(listtemp)
    
    In [346]: adf
    Out[346]: 
          0
    0     1
    1     1
    2     1
    3     1
    4     1
    5     1
    6     1
    7     2
    8     2
    9     2
    10    2
    11    2
    12    2
    13    2
    14    2
    15    2
    16    2
    17    2
    18    2
    19    2
    20    2
    21    2
    22    2
    23    2
    24    2
    25    2
    26    2
    27    3
    28    3
    29    3
    ..   ..
    223  12
    224  12
    225  12
    226  12
    227  12
    228  12
    229  12
    230  12
    231  12
    232  12
    233  12
    234  12
    235  12
    236  12
    237  12
    238  12
    239  12
    240   1
    241   1
    242   1
    243   1
    244   1
    245   1
    246   1
    247   1
    248   1
    249   1
    250   1
    251   1
    252   1
    
    [253 rows x 1 columns]
    
    In [347]: adf[0].value_counts()
    Out[347]: 
    8     23
    6     22
    3     22
    12    21
    11    21
    10    21
    9     21
    5     21
    4     21
    7     20
    2     20
    1     20
    Name: 0, dtype: int64
    adf[0].value_counts()

    6. Grouping

    7. Merge

  • 相关阅读:
    (转)bash内置命令mapfile:读取文件内容到数组
    new 一个接口?
    Linq的一些操作符图表展示
    StreamReader 和文件乱码
    XSLT 的调试
    不一样的大小写转换
    一些可能没用过的调试窗口
    私人工具分享:博客下载工具
    简单的实例来理解WCF 数据服务
    谈谈char ,nchar,varchar,nvarchar 和Uniqueidentifier
  • 原文地址:https://www.cnblogs.com/wnzhong/p/6337474.html
Copyright © 2020-2023  润新知