Python基本数据统计（二）---- 数据选择 & 简单统计与处理

1. 便捷数据获取

　　1.2 网络数据获取：

　　　　1.2.1 urllib, urllib2, httplib, httplib2和正则表达式（python3中为urllib.request, http.client）

In [179]: import urllib2,re

In [180]: dStr = urllib2.urlopen('https://hk.finance.yahoo.com/q/cp?s=%5EDJI').read()

In [181]: m = re.findall('<tr><td class="yfnc_tabledata1"><b><a href=".*?">(.*?)</a></b></td><td class="yfnc_tabledata1">(.*?)</td>.*?<b>(.*?)</b>.*?</tr>',dStr)

In [182]: m
Out[182]: 
[('AAPL', 'xe8x98x8bxe6x9ex9cxe5x85xacxe5x8fxb8', '120.000'),
 ('AXP', 'American Express Company', '76.200'),
 ('BA', 'The Boeing Company', '159.530'),
...
 ('XOM', 'Exxon Mobil Corporation', '85.890')]

View Code

2. 数据准备和整理

3. 数据显示

4. 数据选择

　　4.1 选择行

　　　　4.1.1 索引

In [189]: quotesdf.ix['2016-12-30']
Out[189]: 
open      7.364590e+01
close     7.376540e+01
high      7.402429e+01
low       7.352641e+01
volume    3.378800e+06
Name: 2016-12-30, dtype: float64

quotesdf.ix['index']

　　　　4.1.2 切片

In [190]: quotesdf['2016-12-30':'2017-01-02']
Out[190]: 
                 open      close       high       low     volume
2016-12-30  73.645904  73.765397  74.024287  73.52641  3378800.0

quotesdf['2016-12-30':'2017-01-02']

　　4.2 选择列

In [192]: djidf['code']
Out[192]: 
0     AAPL
1      AXP
2       BA
...
29     XOM
Name: code, dtype: object

In [193]: djidf.code
Out[193]: 
0     AAPL
1      AXP
2       BA
...
29     XOM
Name: code, dtype: object

djidf['code'] / djidf.code

　　4.3 行、列 - 标签label ( loc )

In [64]: djidf.loc[1:5,]
Out[64]: 
   code                      name lasttrade
1   AXP  American Express Company    76.200
2    BA        The Boeing Company   159.530
3   CAT          Caterpillar Inc.    94.580
4  CSCO                    思科系?公司    30.100
5   CVX       Chevron Corporation   115.600

In [65]: djidf.loc[:,['code','lasttrade']]
Out[65]: 
    code lasttrade
0   AAPL   120.000
1    AXP    76.200
2     BA   159.530
3    CAT    94.580
...
29   XOM    85.890

obj.loc[x : xx, ['y','yy'] ]

　　4.4 行和列的区域 - 标签label ( loc 和 at )

In [66]: djidf.loc[1:5,['code','lasttrade']]
Out[66]: 
   code lasttrade
1   AXP    76.200
2    BA   159.530
3   CAT    94.580
4  CSCO    30.100
5   CVX   115.600

In [67]: djidf.loc[1,'lasttrade']
Out[67]: '76.200'

In [68]: djidf.at[1,'lasttrade']
Out[68]: '76.200'

obj.loc[x, 'y']

　　4.5 行、列和区域 ( iloc 和 iat )

In [69]: djidf.loc[1:5,['code','lasttrade']]
Out[69]: 
   code lasttrade
1   AXP    76.200
2    BA   159.530
3   CAT    94.580
4  CSCO    30.100
5   CVX   115.600

In [70]: djidf.iloc[1:6,[0,2]]
Out[70]: 
   code lasttrade
1   AXP    76.200
2    BA   159.530
3   CAT    94.580
4  CSCO    30.100
5   CVX   115.600

In [71]: djidf.loc[1,'lasttrade']
Out[71]: '76.200'

In [72]: djidf.at[1,'lasttrade']
Out[72]: '76.200'

In [73]: djidf.iloc[1,2]
Out[73]: '76.200'

In [74]: djidf.iat[1,2]
Out[74]: '76.200'

obj.iloc[ a:b, [c,d] ]

　　4.5 条件筛选

In [77]: quotesdf[quotesdf.index >= '2016-12-20']
Out[77]: 
                 open      close       high        low     volume
2016-12-20  74.681487  74.741230  75.179363  74.213482  3244900.0
...
2017-01-20  75.989998  76.199997  76.910004  75.389999  8382000.0

In [78]: quotesdf[(quotesdf.index >= '2016-12-20') & (quotesdf.close >=76)]
Out[78]: 
                 open      close       high        low     volume
2017-01-04  75.260002  76.260002  76.550003  75.059998  4635800.0
...
2017-01-20  75.989998  76.199997  76.910004  75.389999  8382000.0

quotesdf[(quotesdf.index >= '2016-12-20') & (quotesdf.close >=76)]

5. 简单统计与处理

　　5.1 平均值（high列的平均值）

In [199]: quotesdf.high.mean()
Out[199]: 64.246265703871586

quotesdf.high.mean()

　　　　# 待解决问题：通过正则表达式得到是数据是string类型，无法计算mean值

　　　　2017/01/22 解决：djidf.lasttrade = djidf.lasttrade.astype(float) # lasttrade列转为float类型，再赋给djidf

　　5.2 简单筛选（筛选出开盘价open大于75的数据的收盘价close）

In [204]: quotesdf[quotesdf.open>=75].close
Out[204]: 
2017-01-04    76.260002
2017-01-05    75.320000
...
2017-01-20    76.199997
Name: close, dtype: float64

quotesdf[quotesdf.open>=75].close

　　5.3 简单统计（统计AXP公司近一年股票涨和跌的天数）

In [206]: len(quotesdf[quotesdf.close > quotesdf.open])
Out[206]: 135

In [207]: len(quotesdf)-135
Out[207]: 118

len(quotesdf[quotesdf.close > quotesdf.open])

　　5.4 简单统计（统计AXP公司近一年相邻两天收盘价的涨跌情况）

In [211]: import numpy as np

In [212]: np.diff(quotesdf.close)  # 相邻两天，后一天减前一天的差
Out[212]: 
array([-7.436398, -0.039243,  0.068674, -0.559202, -1.60893 ,  0.608254,
        ...
       -0.030007, -0.259994, -0.020005,  0.89    , -0.799996, -0.490005])

In [213]: len(np.diff(quotesdf.close))
Out[213]: 252

In [214]: len(quotesdf)
Out[214]: 253

In [215]: status = np.sign(np.diff(quotesdf.close))  # 提取差值array的正负符号

In [216]: status
Out[216]: 
array([-1., -1.,  1., -1., -1.,  1.,  1., -1.,  1.,  1., -1., -1.,  1.,
        ...
       -1., -1.,  1., -1., -1.])

In [218]: status[np.where(status==1.)].size  # 差值符号为正的个数，即涨的天数
Out[218]: 140

In [219]: status[np.where(status==-1.)].size  # 差值符号为负的个数，即跌的天数
Out[219]: 112

np.sign(np.diff(quotesdf.close))

　　5.5 排序（按最近一次成交价对30只股票进行排序，列出前三名的code）

In [313]: djidf.sort_values('lasttrade',axis=0,ascending=False)[:3].code
Out[313]: 
9      GS
17    MMM
11    IBM
Name: code, dtype: object

In [314]: djidf.sort_values('lasttrade',axis=0,ascending=False)[:3]
Out[314]: 
   code                                         name  lasttrade
9    GS                The Goldman Sachs Group, Inc.     232.20
17  MMM                                   3M Company     178.49
11  IBM  International Business Machines Corporation     170.55

In [315]: djidf.sort_values('lasttrade',axis=0,ascending=False)
Out[315]: 
    code                                         name  lasttrade
9     GS                The Goldman Sachs Group, Inc.     232.20
17   MMM                                   3M Company     178.49
11   IBM  International Business Machines Corporation     170.55
2     BA                           The Boeing Company     159.53
...
4   CSCO                                       思科系統公司      30.10

djidf.sort_values('lasttrade',axis=0,ascending=False)[:3].code

　　5.6 计数统计（统计某月开盘天数）

In [319]: t = quotesdf[(quotesdf.index >= '2016-12-01') & (quotesdf.index < '2017-01-01')]

In [320]: t
Out[320]: 
                 open      close       high        low     volume
2016-12-01  72.142314  72.221976  72.460955  71.475161  4298200.0
2016-12-02  72.212016  71.554823  72.351429  71.126648  2959100.0
...
2016-12-30  73.645904  73.765397  74.024287  73.526410  3378800.0

In [321]: len(t)
Out[321]: 21

quotesdf[(quotesdf.index >= '2016-12-01') & (quotesdf.index < '2017-01-01')]

　　5.7 计数统计（统计近一年每个月的开盘天数）

In [322]: import time

In [323]: listtemp = []

In [324]: for i in range(0,len(quotesdf)):
     ...:     temp = time.strptime(quotesdf.index[i],"%Y-%m-%d")  # 取出每行索引，转为日期格式
     ...:     listtemp.append(temp.tm_mon)  # 将日期的月份提取出来，追加到列表
     ...:     

In [325]: print listtemp
[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [326]: tempdf = quotesdf.copy()  # 将quotesdf复制一份

In [327]: tempdf['month'] = listtemp  # 将日期list添加为df的一列month

In [328]: tempdf['month'].value_counts()  # 统计month列值的个数
8     23
6     22
3     22
12    21
11    21
10    21
9     21
5     21
4     21
7     20
2     20
1     20
Name: month, dtype: int64

tempdf['month'].value_counts()

In [344]: print listtemp
[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [345]: adf = pd.DataFrame(listtemp)

In [346]: adf
Out[346]: 
      0
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     2
8     2
9     2
10    2
11    2
12    2
13    2
14    2
15    2
16    2
17    2
18    2
19    2
20    2
21    2
22    2
23    2
24    2
25    2
26    2
27    3
28    3
29    3
..   ..
223  12
224  12
225  12
226  12
227  12
228  12
229  12
230  12
231  12
232  12
233  12
234  12
235  12
236  12
237  12
238  12
239  12
240   1
241   1
242   1
243   1
244   1
245   1
246   1
247   1
248   1
249   1
250   1
251   1
252   1

[253 rows x 1 columns]

In [347]: adf[0].value_counts()
Out[347]: 
8     23
6     22
3     22
12    21
11    21
10    21
9     21
5     21
4     21
7     20
2     20
1     20
Name: 0, dtype: int64

adf[0].value_counts()

6. Grouping

7. Merge

相关阅读:
（转）bash内置命令mapfile：读取文件内容到数组
 new 一个接口?
Linq的一些操作符图表展示
 StreamReader 和文件乱码
 XSLT 的调试
 不一样的大小写转换
 一些可能没用过的调试窗口
 私人工具分享:博客下载工具
 简单的实例来理解WCF 数据服务
 谈谈char ,nchar,varchar,nvarchar 和Uniqueidentifier
原文地址：https://www.cnblogs.com/wnzhong/p/6337474.html