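Importing and exporting data
A DataFrame can be written out in many formats. The export methods are: to_csv, to_excel, to_hdf, to_sql, to_json, to_msgpack (removed in pandas 1.0), to_html, to_gbq, to_stata, to_clipboard and to_pickle.
See the IO Tools section of the pandas documentation for the full classification.
To write only selected columns, pass the columns argument, for example:
to_csv(filename, columns=["col1", "col2"], ......)
# Quote style does not matter here: Python treats single and double quotes identically,
# so columns=['col1', 'col2'] is equivalent. If single quotes seem not to work, the cause
# lies elsewhere (for example, full-width quote characters pasted into the source).
# Setting index=False and header=False leaves out the row labels (index) and the column header.
# The names can also be built with list(), which takes a single iterable rather than several arguments:
to_csv(filename, columns=list(("col1", "col2")), ......)
To save the data as ASCII text, use to_csv; parameters such as index control whether the index (row labels) is written.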
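The options above can be tried out without touching the file system. The following is a minimal sketch, assuming a small made-up DataFrame with columns col1, col2 and col3 (the data is illustrative only); it relies on to_csv returning the CSV text as a string when no path is given.

```python
import pandas as pd

# Hypothetical example data, not taken from the original post.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

# Write only two columns and skip the index; with no path, to_csv returns a str.
csv_text = df.to_csv(columns=["col1", "col2"], index=False)

# Also drop the header row, leaving plain data lines.
plain_text = df.to_csv(columns=["col1", "col2"], index=False, header=False)

# Single-quoted column names produce exactly the same result.
single_quoted = df.to_csv(columns=['col1', 'col2'], index=False)

print(csv_text)
print(plain_text)
```

Passing a file name as the first argument writes the same text to disk instead of returning it.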
Reordering columns
Suppose the original data looks like this:
In [6]: df
Out[6]:
0 1 2 3 4 mean
0 0.445598 0.173835 0.343415 0.682252 0.582616 0.445543
1 0.881592 0.696942 0.702232 0.696724 0.373551 0.670208
2 0.662527 0.955193 0.131016 0.609548 0.804694 0.632596
3 0.260919 0.783467 0.593433 0.033426 0.512019 0.436653
4 0.131842 0.799367 0.182828 0.683330 0.019485 0.363371
5 0.498784 0.873495 0.383811 0.699289 0.480447 0.587165
6 0.388771 0.395757 0.745237 0.628406 0.784473 0.588529
7 0.147986 0.459451 0.310961 0.706435 0.100914 0.345149
8 0.394947 0.863494 0.585030 0.565944 0.356561 0.553195
9 0.689260 0.865243 0.136481 0.386582 0.730399 0.561593
In [7]: cols = df.columns.tolist()
In [8]: cols
Out[8]: [0L, 1L, 2L, 3L, 4L, 'mean']
Reorder the list of column labels:
In [12]: cols = cols[-1:] + cols[:-1]
In [13]: cols
Out[13]: ['mean', 0L, 1L, 2L, 3L, 4L]
Indexing the DataFrame with the reordered list then gives the following result:
In [16]: df = df[cols] # OR df = df.loc[:, cols] (the original used df.ix[:, cols]; .ix was removed in pandas 1.0)
In [17]: df
Out[17]:
mean 0 1 2 3 4
0 0.445543 0.445598 0.173835 0.343415 0.682252 0.582616
1 0.670208 0.881592 0.696942 0.702232 0.696724 0.373551
2 0.632596 0.662527 0.955193 0.131016 0.609548 0.804694
3 0.436653 0.260919 0.783467 0.593433 0.033426 0.512019
4 0.363371 0.131842 0.799367 0.182828 0.683330 0.019485
5 0.587165 0.498784 0.873495 0.383811 0.699289 0.480447
6 0.588529 0.388771 0.395757 0.745237 0.628406 0.784473
7 0.345149 0.147986 0.459451 0.310961 0.706435 0.100914
8 0.553195 0.394947 0.863494 0.585030 0.565944 0.356561
9 0.561593 0.689260 0.865243 0.136481 0.386582 0.730399
(Reference source)
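The session above can be condensed into a short script. This is a sketch under assumed inputs: the random data, the seed and the 3x5 shape are placeholders chosen here, not the values shown in the session.

```python
import numpy as np
import pandas as pd

# Placeholder data: 3 rows, 5 numeric columns labelled 0..4.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((3, 5)))
df["mean"] = df.mean(axis=1)  # row-wise mean appended as the last column

# Move the last label to the front, then index with the new order.
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
reordered = df[cols]

print(reordered.columns.tolist())
```

Indexing with a list of labels returns a new DataFrame, so the original column order of df is left unchanged unless the result is assigned back to df.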
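Modifying the value at a given position in a pandas DataFrame:
df['one']['second'] = value
# Chained indexing: df['one'] may return a copy, so the assignment may not reach the original df,
# and pandas raises a SettingWithCopy warning.
df.loc['second', 'one'] = value
# A single .loc call (row label first, then column label) modifies the original df.
# For a DataFrame with MultiIndex columns:
dfmi.loc[:, ('one', 'second')] = value
See the SettingWithCopy section of the pandas documentation for details.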
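As a small illustration of the .loc form, here is a sketch that assumes two made-up frames: df with row labels first/second and columns one/two, and dfmi with two-level column labels. Only the .loc assignments are shown, since the outcome of chained assignment depends on the pandas version.

```python
import pandas as pd

# Hypothetical frame with labelled rows and columns.
df = pd.DataFrame(
    {"one": [1.0, 2.0], "two": [3.0, 4.0]},
    index=["first", "second"],
)

# One indexing operation: row label first, then column label.
df.loc["second", "one"] = 10.0

# Hypothetical frame with MultiIndex columns.
dfmi = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("one", "first"), ("one", "second")]),
)

# Select the whole column by its tuple label and overwrite it.
dfmi.loc[:, ("one", "second")] = 0

print(df)
print(dfmi)
```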
pandas DataFrame & Series: iterating over data (loop iterate on data)
DataFrame
import numpy as np
import pandas as pd
dates = pd.date_range("20150101", periods=3)
df = pd.DataFrame(np.random.randn(3, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df
Out[36]:
A B C D
2015-01-01 -0.888495 -0.983042 0.162524 -0.768370
2015-01-02 0.954982 0.777860 -0.635805 -0.271617
2015-01-03 1.778827 1.052819 0.090116 -1.822029
- DataFrame.iteritems(): Iterator over (column name, Series) pairs. (In current pandas the method is spelled items(); iteritems() was removed in pandas 2.0.)
for colName, colSeries in df.iteritems():
    print(colName)
    print(colSeries)
A
2015-01-01 -0.888495
2015-01-02 0.954982
2015-01-03 1.778827
Freq: D, Name: A, dtype: float64
B
2015-01-01 -0.983042
2015-01-02 0.777860
2015-01-03 1.052819
Freq: D, Name: B, dtype: float64
C
2015-01-01 0.162524
2015-01-02 -0.635805
2015-01-03 0.090116
Freq: D, Name: C, dtype: float64
D
2015-01-01 -0.768370
2015-01-02 -0.271617
2015-01-03 -1.822029
Freq: D, Name: D, dtype: float64
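Column-wise iteration is handy when something has to be computed per column. The sketch below uses DataFrame.items(), the spelling that works in current pandas, on a small made-up frame; the per-column sum is only an illustrative use.

```python
import pandas as pd

# Hypothetical two-column frame.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# Each iteration yields the column label and the column as a Series.
col_sums = {}
for col_name, col_series in df.items():
    col_sums[col_name] = col_series.sum()

print(col_sums)
```

For a plain aggregate like this, df.sum() does the same job without a loop; the explicit loop is mainly useful when each column needs different handling.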
- DataFrame.iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. dtypes are consistent per column, not per row, so packing a row into a Series may convert its values to a common dtype. To keep the original dtypes, prefer itertuples, which is also generally faster than iterrows.
for index, rowSeries in df.iterrows():
    print(index)
    print(rowSeries)
2015-01-01 00:00:00
A -0.888495
B -0.983042
C 0.162524
D -0.768370
Name: 2015-01-01 00:00:00, dtype: float64
2015-01-02 00:00:00
A 0.954982
B 0.777860
C -0.635805
D -0.271617
Name: 2015-01-02 00:00:00, dtype: float64
2015-01-03 00:00:00
A 1.778827
B 1.052819
C 0.090116
D -1.822029
Name: 2015-01-03 00:00:00, dtype: float64
- DataFrame.itertuples(index=True) :Iterate over the rows of DataFrame as tuples, with index value as first element of the tuple.
for rowTuple in df.itertuples():
    print(rowTuple[0])
    print(rowTuple[1:])
2015-01-01 00:00:00
(-0.88849501182393553, -0.98304167749573845, 0.1625244406175089, -0.76836987403165646)
2015-01-02 00:00:00
(0.95498214900986345, 0.77786021238601544, -0.635805031818656, -0.27161684716624435)
2015-01-03 00:00:00
(1.7788269763069902, 1.0528194112440166, 0.09011643978723563, -1.82202928954011)
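The dtype difference between iterrows and itertuples only shows up when columns have different dtypes, which the all-float frame above does not. A minimal sketch, assuming a made-up frame with one integer column n and one float column x:

```python
import pandas as pd

# Hypothetical frame mixing an int column and a float column.
df = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5]})

# iterrows packs each row into one Series, so the int is upcast to float64.
first_index, first_row = next(df.iterrows())

# itertuples yields namedtuples whose fields keep each column's own type.
first_tuple = next(df.itertuples())

print(first_row.dtype, type(first_row["n"]))
print(type(first_tuple.n), type(first_tuple.x))
```

Fields of the namedtuple can be read by name (first_tuple.n) as well as by position, with the index value stored in the first slot.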
Series
- Series.iteritems() :Lazily iterate over (index, value) tuples
In [51]: s = pd.Series(['a','b','c','d','e'])
s
Out[51]:
0 a
1 b
2 c
3 d
4 e
dtype: object
for index, value in s.iteritems():
    print(index, value)
0 a
1 b
2 c
3 d
4 e
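On current pandas the same loop is written with Series.items(), since iteritems() was removed in pandas 2.0. A short sketch using the same five-letter Series:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d", "e"])

# items() lazily yields (index, value) pairs.
pairs = list(s.items())

for index, value in s.items():
    print(index, value)
```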