1. 便捷数据获取
1.2 网络数据获取:
1.2.1 urllib, urllib2, httplib, httplib2和正则表达式(python3中为urllib.request, http.client)
In [179]: import urllib2,re In [180]: dStr = urllib2.urlopen('https://hk.finance.yahoo.com/q/cp?s=%5EDJI').read() In [181]: m = re.findall('<tr><td class="yfnc_tabledata1"><b><a href=".*?">(.*?)</a></b></td><td class="yfnc_tabledata1">(.*?)</td>.*?<b>(.*?)</b>.*?</tr>',dStr) In [182]: m Out[182]: [('AAPL', 'xe8x98x8bxe6x9ex9cxe5x85xacxe5x8fxb8', '120.000'), ('AXP', 'American Express Company', '76.200'), ('BA', 'The Boeing Company', '159.530'), ... ('XOM', 'Exxon Mobil Corporation', '85.890')]
2. 数据准备和整理
3. 数据显示
4. 数据选择
4.1 选择行
4.1.1 索引
In [189]: quotesdf.ix['2016-12-30'] Out[189]: open 7.364590e+01 close 7.376540e+01 high 7.402429e+01 low 7.352641e+01 volume 3.378800e+06 Name: 2016-12-30, dtype: float64
4.1.2 切片
In [190]: quotesdf['2016-12-30':'2017-01-02'] Out[190]: open close high low volume 2016-12-30 73.645904 73.765397 74.024287 73.52641 3378800.0
4.2 选择列
In [192]: djidf['code'] Out[192]: 0 AAPL 1 AXP 2 BA ... 29 XOM Name: code, dtype: object In [193]: djidf.code Out[193]: 0 AAPL 1 AXP 2 BA ... 29 XOM Name: code, dtype: object
4.3 行、列 - 标签label ( loc )
In [64]: djidf.loc[1:5,] Out[64]: code name lasttrade 1 AXP American Express Company 76.200 2 BA The Boeing Company 159.530 3 CAT Caterpillar Inc. 94.580 4 CSCO 思科系?公司 30.100 5 CVX Chevron Corporation 115.600 In [65]: djidf.loc[:,['code','lasttrade']] Out[65]: code lasttrade 0 AAPL 120.000 1 AXP 76.200 2 BA 159.530 3 CAT 94.580 ... 29 XOM 85.890
4.4 行和列的区域 - 标签label ( loc 和 at )
In [66]: djidf.loc[1:5,['code','lasttrade']] Out[66]: code lasttrade 1 AXP 76.200 2 BA 159.530 3 CAT 94.580 4 CSCO 30.100 5 CVX 115.600 In [67]: djidf.loc[1,'lasttrade'] Out[67]: '76.200' In [68]: djidf.at[1,'lasttrade'] Out[68]: '76.200'
4.5 行、列和区域 ( iloc 和 iat )
In [69]: djidf.loc[1:5,['code','lasttrade']] Out[69]: code lasttrade 1 AXP 76.200 2 BA 159.530 3 CAT 94.580 4 CSCO 30.100 5 CVX 115.600 In [70]: djidf.iloc[1:6,[0,2]] Out[70]: code lasttrade 1 AXP 76.200 2 BA 159.530 3 CAT 94.580 4 CSCO 30.100 5 CVX 115.600 In [71]: djidf.loc[1,'lasttrade'] Out[71]: '76.200' In [72]: djidf.at[1,'lasttrade'] Out[72]: '76.200' In [73]: djidf.iloc[1,2] Out[73]: '76.200' In [74]: djidf.iat[1,2] Out[74]: '76.200'
4.5 条件筛选
In [77]: quotesdf[quotesdf.index >= '2016-12-20'] Out[77]: open close high low volume 2016-12-20 74.681487 74.741230 75.179363 74.213482 3244900.0 ... 2017-01-20 75.989998 76.199997 76.910004 75.389999 8382000.0 In [78]: quotesdf[(quotesdf.index >= '2016-12-20') & (quotesdf.close >=76)] Out[78]: open close high low volume 2017-01-04 75.260002 76.260002 76.550003 75.059998 4635800.0 ... 2017-01-20 75.989998 76.199997 76.910004 75.389999 8382000.0
5. 简单统计与处理
5.1 平均值(high列的平均值)
In [199]: quotesdf.high.mean()
Out[199]: 64.246265703871586
# 待解决问题:通过正则表达式得到是数据是string类型,无法计算mean值
2017/01/22 解决:djidf.lasttrade = djidf.lasttrade.astype(float) # lasttrade列转为float类型,再赋给djidf
5.2 简单筛选(筛选出开盘价open大于75的数据的收盘价close)
In [204]: quotesdf[quotesdf.open>=75].close Out[204]: 2017-01-04 76.260002 2017-01-05 75.320000 ... 2017-01-20 76.199997 Name: close, dtype: float64
5.3 简单统计(统计AXP公司近一年股票涨和跌的天数)
In [206]: len(quotesdf[quotesdf.close > quotesdf.open]) Out[206]: 135 In [207]: len(quotesdf)-135 Out[207]: 118
5.4 简单统计(统计AXP公司近一年相邻两天收盘价的涨跌情况)
In [211]: import numpy as np In [212]: np.diff(quotesdf.close) # 相邻两天,后一天减前一天的差 Out[212]: array([-7.436398, -0.039243, 0.068674, -0.559202, -1.60893 , 0.608254, ... -0.030007, -0.259994, -0.020005, 0.89 , -0.799996, -0.490005]) In [213]: len(np.diff(quotesdf.close)) Out[213]: 252 In [214]: len(quotesdf) Out[214]: 253 In [215]: status = np.sign(np.diff(quotesdf.close)) # 提取差值array的正负符号 In [216]: status Out[216]: array([-1., -1., 1., -1., -1., 1., 1., -1., 1., 1., -1., -1., 1., ... -1., -1., 1., -1., -1.]) In [218]: status[np.where(status==1.)].size # 差值符号为正的个数,即涨的天数 Out[218]: 140 In [219]: status[np.where(status==-1.)].size # 差值符号为负的个数,即跌的天数 Out[219]: 112
5.5 排序 (按最近一次成交价对30只股票进行排序,列出前三名的code)
In [313]: djidf.sort_values('lasttrade',axis=0,ascending=False)[:3].code Out[313]: 9 GS 17 MMM 11 IBM Name: code, dtype: object In [314]: djidf.sort_values('lasttrade',axis=0,ascending=False)[:3] Out[314]: code name lasttrade 9 GS The Goldman Sachs Group, Inc. 232.20 17 MMM 3M Company 178.49 11 IBM International Business Machines Corporation 170.55 In [315]: djidf.sort_values('lasttrade',axis=0,ascending=False) Out[315]: code name lasttrade 9 GS The Goldman Sachs Group, Inc. 232.20 17 MMM 3M Company 178.49 11 IBM International Business Machines Corporation 170.55 2 BA The Boeing Company 159.53 ... 4 CSCO 思科系統公司 30.10
5.6 计数统计 (统计某月开盘天数)
In [319]: t = quotesdf[(quotesdf.index >= '2016-12-01') & (quotesdf.index < '2017-01-01')] In [320]: t Out[320]: open close high low volume 2016-12-01 72.142314 72.221976 72.460955 71.475161 4298200.0 2016-12-02 72.212016 71.554823 72.351429 71.126648 2959100.0 ... 2016-12-30 73.645904 73.765397 74.024287 73.526410 3378800.0 In [321]: len(t) Out[321]: 21
5.7 计数统计 (统计近一年每个月的开盘天数)
In [322]: import time In [323]: listtemp = [] In [324]: for i in range(0,len(quotesdf)): ...: temp = time.strptime(quotesdf.index[i],"%Y-%m-%d") # 取出每行索引,转为日期格式 ...: listtemp.append(temp.tm_mon) # 将日期的月份提取出来,追加到列表 ...: In [325]: print listtemp [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] In [326]: tempdf = quotesdf.copy() # 将quotesdf复制一份 In [327]: tempdf['month'] = listtemp # 将日期list添加为df的一列month In [328]: tempdf['month'].value_counts() # 统计month列值的个数 8 23 6 22 3 22 12 21 11 21 10 21 9 21 5 21 4 21 7 20 2 20 1 20 Name: month, dtype: int64
In [344]: print listtemp [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] In [345]: adf = pd.DataFrame(listtemp) In [346]: adf Out[346]: 0 0 1 1 1 2 1 3 1 4 1 5 1 6 1 7 2 8 2 9 2 10 2 11 2 12 2 13 2 14 2 15 2 16 2 17 2 18 2 19 2 20 2 21 2 22 2 23 2 24 2 25 2 26 2 27 3 28 3 29 3 .. .. 223 12 224 12 225 12 226 12 227 12 228 12 229 12 230 12 231 12 232 12 233 12 234 12 235 12 236 12 237 12 238 12 239 12 240 1 241 1 242 1 243 1 244 1 245 1 246 1 247 1 248 1 249 1 250 1 251 1 252 1 [253 rows x 1 columns] In [347]: adf[0].value_counts() Out[347]: 8 23 6 22 3 22 12 21 11 21 10 21 9 21 5 21 4 21 7 20 2 20 1 20 Name: 0, dtype: int64
6. Grouping
7. Merge