Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
Conform Series/DataFrame to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
Parameters
- keywords for axes : array-like, optional
  New labels / index to conform to, should be specified using keywords. Preferably an Index object to avoid duplicating data.
- method : {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}
  Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
  - None (default): don't fill gaps
  - pad / ffill: Propagate last valid observation forward to next valid.
  - backfill / bfill: Use next valid observation to fill gap.
  - nearest: Use nearest valid observations to fill gap.
- copy : bool, default True
  Return a new object, even if the passed indexes are the same.
- level : int or name
  Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : scalar, default np.NaN
  Value to use for missing values. Defaults to NaN, but can be any "compatible" value.
- limit : int, default None
  Maximum number of consecutive elements to forward or backward fill.
- tolerance : optional
  Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
  Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
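The interaction of method, limit and tolerance is easier to see on a tiny example. The sketch below is not from the documentation; the frame `prices`, its integer index, and the `target` labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three observations on a sparse, increasing integer index.
prices = pd.DataFrame({"value": [10.0, 20.0, 30.0]}, index=[0, 10, 20])

# Hypothetical new labels; none of them is equidistant from two old labels.
target = [0, 1, 2, 9, 10, 25]

# method="nearest": every new label takes the value of the closest old label.
nearest = prices.reindex(target, method="nearest")

# tolerance=1: a new label is matched only if an old label lies within
# distance 1, i.e. abs(index[indexer] - target) <= 1; others stay NaN.
nearest_tol = prices.reindex(target, method="nearest", tolerance=1)

# limit=1 with forward fill: at most one consecutive new label is filled
# after each old label; further labels in the same gap stay NaN.
ffill_limited = prices.reindex(target, method="ffill", limit=1)

print(nearest)
print(nearest_tol)
print(ffill_limited)
```

With tolerance=1, labels 2 and 25 are left as NaN because no original label is within distance 1. With limit=1, labels 2 and 9 are left as NaN because label 1 already used up the single allowed fill after label 0.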
DataFrame.reindex supports two calling conventions:
- (index=index_labels, columns=column_labels, ...)
- (labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
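From the documentation: reindex takes an externally defined index and returns a new DataFrame conformed to it. Labels in the new index that have no counterpart in the old one can be given a default value.
Arguments can be passed in either of the two ways above; the first one (keywords) is recommended.
In the version I tested, the parameter col_level has been renamed to level.
The example code from the book follows. The method is mainly used to reset the index and to supply default values for the entries that are new in that index.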
In [123]: index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
     ...: df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
     ...:                    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
     ...:                   index=index)

In [124]: df
Out[124]:
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00
This defines a DataFrame df together with its index.
Next we define a new index and call reindex with the default parameters.
In [130]: new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
     ...:              'Chrome']

In [131]: df
Out[131]:
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00

In [132]: df.reindex(index=new_index)
Out[132]:
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
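This returns a new DataFrame; the rows for the labels that were added by the new index contain NaN.
We can also set the default value with the fill_value option.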
In [133]: df.reindex(index=new_index, fill_value='missing')
Out[133]:
              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02
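One side effect of the choice of fill_value is the resulting column dtype. The following sketch rebuilds the same df and new_index and compares three calls; the variable names `as_nan`, `as_text` and `as_zero` are only for illustration, and fill_value=0 is an arbitrary numeric choice.

```python
import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
                  index=index)
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']

# Default NaN fill: the integer column is upcast to float to hold NaN.
as_nan = df.reindex(index=new_index)

# String fill: numbers and strings end up mixed in object columns.
as_text = df.reindex(index=new_index, fill_value='missing')

# Numeric fill: the columns keep their numeric dtypes.
as_zero = df.reindex(index=new_index, fill_value=0)

print(as_nan.dtypes)
print(as_text.dtypes)
print(as_zero.dtypes)
```

If later steps do arithmetic on the columns, a numeric fill value (or the default NaN) is usually the more convenient choice than a string placeholder.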
The columns can be reindexed as well, in either of the following two ways.
In [134]: df.reindex(columns=['http_status', 'user_agent'])
Out[134]:
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN

In [135]: df.reindex(['http_status', 'user_agent'], axis="columns")
Out[135]:
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
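To illustrate reindex further, the next example uses the method parameter, which fills in values when the index is ordered (monotonically increasing or decreasing).
First, create a DataFrame with a date index.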
In [137]: date_index = pd.date_range('1/1/2010', periods=6, freq='D')
     ...: df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
     ...:                    index=date_index)
     ...:

In [138]: df2
Out[138]:
            prices
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
Then reindex it onto a longer date range, first without and then with the method parameter.
In [139]: date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')

In [140]: df2.reindex(index=date_index2)
Out[140]:
            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN

In [141]: df2.reindex(index=date_index2, method='bfill')
Out[141]:
            prices
2009-12-29   100.0
2009-12-30   100.0
2009-12-31   100.0
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN
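The output shows that without method the new labels default to NaN. With method='bfill', the new labels at the start are filled from the next valid observation, while 2010-01-07 stays NaN because there is no later observation to take a value from. The NaN that was already present under the original index (2010-01-03) is not modified either.
To fill those values as well, use the fillna method.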
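As a minimal sketch of that follow-up step, the code below rebuilds df2 and fills the remaining NaN values after reindexing. Filling with the constant 0, and chaining bfill() with ffill(), are illustrative choices rather than something the book prescribes.

```python
import numpy as np
import pandas as pd

date_index = pd.date_range('1/1/2010', periods=6, freq='D')
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
                   index=date_index)
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')

# method='bfill' only fills labels that are new in date_index2.
reindexed = df2.reindex(index=date_index2, method='bfill')

# Option 1: replace every remaining NaN with a constant.
filled_const = reindexed.fillna(0)

# Option 2: fill from neighbouring values. bfill() handles 2010-01-03,
# then ffill() handles 2010-01-07, which has no later observation.
filled = reindexed.bfill().ffill()

print(reindexed['prices'].isna().sum())  # NaN count before filling
print(filled)
```

Unlike the method argument of reindex, bfill() and ffill() operate on the values themselves, so they also fill gaps that existed in the original data.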