Time Series
import pandas as pd
import numpy as np
Date and Time data types and tools
from datetime import datetime
from datetime import date
from datetime import timedelta
now=datetime.now()
now
datetime.datetime(2020, 5, 5, 9, 51, 27, 686891)
now.year,now.month,now.day
(2020, 5, 5)
now.time()
datetime.time(9, 51, 27, 686891)
datetime
stores both the date and time down to the microsecond.timedelta
represents the temporal difference between two datetime
objects.
delta=datetime(2020,4,29)-datetime(2018,8,19)
delta
datetime.timedelta(619)
delta.days
619
delta.seconds
0
You can add(or subtract) a timedelta
or multiple thereof to a datetime
object to yield a new shifted object:
start=datetime(2020,4,30)
start+timedelta(100)
datetime.datetime(2020, 8, 8, 0, 0)
datetime.now()
datetime.datetime(2020, 5, 5, 9, 51, 27, 888352)
date(2008,8,8)-date(2000,8,8)
datetime.timedelta(2922)
Converting between string and datetime
You can format datetime
and pandas Timestamp
objects ,as string using str
or strftime
method,passing a format specification.
stamp=datetime(2011,1,3)
str(stamp)
'2011-01-03 00:00:00'
stamp.strftime('%Y--%m-!-%d') # can be meomorized by 'str from time', datetime object instance method
'2011--01-!-03'
description={'%Y':'four-digit year',
'%y':'two-digit year',
'%m':'two-digit month[01,12]',
'%d':'two-digit day[01,31]',
'%H':'Hour(24-hour clock)[00,23]',
'%I':'Hour(12-hour clock)[01,12]',
'%M':'two-digit minute[00,59]',
'%S':'second[00,61](seconds60,61 account for leap seconds)',
'%w':'weekday as integer[0(sunday),6]',
'%U':'week number of the year[00,53];sunday is considered the first day of the first day of the week,and days before the first sunday of the year are "week 0"',
'%W':'week number of the year[00,53];monday is considered the first day of the week and days before the first monday of the year are "week 0"',
'%z':'UTC time zone offset as +HHMM or -HHMM;empty if time zone naive',
'%F':'shortcut for %Y-%m-%d',
'%D':'shortcut for %m/%d/%y(e.g04/18/12)'}
pd.DataFrame(description,index=[0])
%Y | %y | %m | %d | %H | %I | %M | %S | %w | %U | %W | %z | %F | %D | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | four-digit year | two-digit year | two-digit month[01,12] | two-digit day[01,31] | Hour(24-hour clock)[00,23] | Hour(12-hour clock)[01,12] | two-digit minute[00,59] | second[00,61](seconds60,61 account for leap se... | weekday as integer[0(sunday),6] | week number of the year[00,53];sunday is consi... | week number of the year[00,53];monday is consi... | UTC time zone offset as +HHMM or -HHMM;empty i... | shortcut for %Y-%m-%d | shortcut for %m/%d/%y(e.g04/18/12) |
pd.DataFrame(description,index=[0]).stack()
0 %Y four-digit year
%y two-digit year
%m two-digit month[01,12]
%d two-digit day[01,31]
%H Hour(24-hour clock)[00,23]
%I Hour(12-hour clock)[01,12]
%M two-digit minute[00,59]
%S second[00,61](seconds60,61 account for leap se...
%w weekday as integer[0(sunday),6]
%U week number of the year[00,53];sunday is consi...
%W week number of the year[00,53];monday is consi...
%z UTC time zone offset as +HHMM or -HHMM;empty i...
%F shortcut for %Y-%m-%d
%D shortcut for %m/%d/%y(e.g04/18/12)
dtype: object
You can use these same format codes to convert strings to dates using datetime.strptime
:
value='2011-01-03'
datetime.strptime(value,'%Y-%m-%d') # can be memorized by 'str produce time' datetime model method
datetime.datetime(2011, 1, 3, 0, 0)
datestrs=['7/6/2011','8/6/2011']
[datetime.strptime(x,'%m/%d/%Y') for x in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
datetime.strptime
is a good way to parse a date with a known format.However,it can be a bit annoyning to have to write a format spec each time,especially for common date formats.In this case,you can use the parser.parse
method in the third-party dateutil
package(this is installed automatically when you install pandas):
from dateutil.parser import parse # notice that, it is `parse` not `parser`!
parse('2020--4--30')
datetime.datetime(2020, 4, 30, 0, 0)
dateutil
is capable of parsing most human-intelligible date representation:
parse('Jan 31,1997 10:45PM') #? what happened here?
datetime.datetime(2020, 1, 31, 22, 45)
In international locales,day appearing before month is very common,so you can pass dayfirst=True
to indicate this:
parse('6/12/2022',dayfirst=True)
datetime.datetime(2022, 12, 6, 0, 0)
pandas is generally oriented toward working with arrays of date,whether used as an axis index or a column in a DataFrame.The to_datetime
method parses many different kinds of date representations.Standard date formats like ISO 8601 can be parsed very quickly.
datestrs=['2011-07-16 12:00:00','2011-08-06 00:00:00']
pd.to_datetime(['2011-07-16 ','2011-08-06 '])
DatetimeIndex(['2011-07-16', '2011-08-06'], dtype='datetime64[ns]', freq=None)
pd.to_datetime(datestrs)
DatetimeIndex(['2011-07-16 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
It also handles values that should be considered missing (None,empty string,etc.):
idx=pd.to_datetime(datestrs+[None]);idx
DatetimeIndex(['2011-07-16 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
NaT
(not a time) is pandas's null value for timestamp data.
Notice
dateutil.parse
is a useful but imperfect tool.
Time series basics
A basic kind of time series object in pandas is a Series indexed by timestamps,which is often represented external to pandas as Python strings or datetime
objects.
dates=[datetime(2020,1,2),datetime(2020,1,5),datetime(2020,1,7),datetime(2020,1,8),datetime(2020,1,10),datetime(2020,1,12)]
ts=pd.Series(np.random.randn(6),index=dates);ts
2020-01-02 -1.140949
2020-01-05 -0.328999
2020-01-07 -0.046164
2020-01-08 -0.783714
2020-01-10 -0.126047
2020-01-12 -0.848602
dtype: float64
pd.Series(np.random.randn(3),index=idx)
2011-07-16 12:00:00 2.259805
2011-08-06 00:00:00 -0.877063
NaT -0.697678
dtype: float64
Under the hood,these datetime
objects have been put in a DatetimeIndex:
ts.index
DatetimeIndex(['2020-01-02', '2020-01-05', '2020-01-07', '2020-01-08',
'2020-01-10', '2020-01-12'],
dtype='datetime64[ns]', freq=None)
Like other Series,arithmetic operations between differently indexed time series automatically align on the dates:
ts[::2]
2020-01-02 -1.140949
2020-01-07 -0.046164
2020-01-10 -0.126047
dtype: float64
ts+ts[::2]
2020-01-02 -2.281898
2020-01-05 NaN
2020-01-07 -0.092328
2020-01-08 NaN
2020-01-10 -0.252095
2020-01-12 NaN
dtype: float64
pd.Series([1,2,3])+pd.Series([3,4])
0 4.0
1 6.0
2 NaN
dtype: float64
pandas stores timestamps using Numpy's datetime64
data type at the nanosecond resolution.
ts.index.dtype
dtype('<M8[ns]')
Scalar values from a DatetimeIndex
are pandas Timestamp
objects:
stamp=ts.index[0];stamp
Timestamp('2020-01-02 00:00:00')
A Timestamp
can be substituted anywhere you would use a datetime
object.Addtionally,it can store frequency information(if any) and understands how to do time zone conversions and other kinds of manipulations.
Indexing,selection,subsetting
ts
2020-01-02 -1.140949
2020-01-05 -0.328999
2020-01-07 -0.046164
2020-01-08 -0.783714
2020-01-10 -0.126047
2020-01-12 -0.848602
dtype: float64
Time series behaves like any other pandas.Series
when you are indexing and selecting data based on label:
stamp=ts.index[2]
ts[stamp]
-0.04616414830843706
As a convenience,you can also pass a string that is interpretable as a date:
ts['1/10/2020']
-0.12604738036158042
ts['2020-1-10']
-0.12604738036158042
For longer time series,a year or only a year and month can be passed to easily select slices of data:
help(pd.date_range)
Help on function date_range in module pandas.core.indexes.datetimes:
date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
Return a fixed frequency DatetimeIndex.
Parameters
----------
start : str or datetime-like, optional
Left bound for generating dates.
end : str or datetime-like, optional
Right bound for generating dates.
periods : integer, optional
Number of periods to generate.
freq : str or DateOffset, default 'D'
Frequency strings can have multiples, e.g. '5H'. See
:ref:`here <timeseries.offset_aliases>` for a list of
frequency aliases.
tz : str or tzinfo, optional
Time zone name for returning localized DatetimeIndex, for example
'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
timezone-naive.
normalize : bool, default False
Normalize start/end dates to midnight before generating date range.
name : str, default None
Name of the resulting DatetimeIndex.
closed : {None, 'left', 'right'}, optional
Make the interval closed with respect to the given frequency to
the 'left', 'right', or both sides (None, the default).
**kwargs
For compatibility. Has no effect on the result.
Returns
-------
rng : DatetimeIndex
See Also
--------
DatetimeIndex : An immutable container for datetimes.
timedelta_range : Return a fixed frequency TimedeltaIndex.
period_range : Return a fixed frequency PeriodIndex.
interval_range : Return a fixed frequency IntervalIndex.
Notes
-----
Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
exactly three must be specified. If ``freq`` is omitted, the resulting
``DatetimeIndex`` will have ``periods`` linearly spaced elements between
``start`` and ``end`` (closed on both sides).
To learn more about the frequency strings, please see `this link
<http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
Examples
--------
**Specifying the values**
The next four examples generate the same `DatetimeIndex`, but vary
the combination of `start`, `end` and `periods`.
Specify `start` and `end`, with the default daily frequency.
>>> pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
Specify `start` and `periods`, the number of periods (days).
>>> pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
Specify `end` and `periods`, the number of periods (days).
>>> pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
'2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
dtype='datetime64[ns]', freq='D')
Specify `start`, `end`, and `periods`; the frequency is generated
automatically (linearly spaced).
>>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
'2018-04-27 00:00:00'],
dtype='datetime64[ns]', freq=None)
**Other Parameters**
Changed the `freq` (frequency) to ``'M'`` (month end frequency).
>>> pd.date_range(start='1/1/2018', periods=5, freq='M')
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
'2018-05-31'],
dtype='datetime64[ns]', freq='M')
Multiples are allowed
>>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
'2019-01-31'],
dtype='datetime64[ns]', freq='3M')
`freq` can also be specified as an Offset object.
>>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
'2019-01-31'],
dtype='datetime64[ns]', freq='3M')
Specify `tz` to set the timezone.
>>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
'2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
'2018-01-05 00:00:00+09:00'],
dtype='datetime64[ns, Asia/Tokyo]', freq='D')
`closed` controls whether to include `start` and `end` that are on the
boundary. The default includes boundary points on either end.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
dtype='datetime64[ns]', freq='D')
Use ``closed='left'`` to exclude `end` if it falls on the boundary.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
dtype='datetime64[ns]', freq='D')
Use ``closed='right'`` to exclude `start` if it falls on the boundary.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
dtype='datetime64[ns]', freq='D')
pd.date_range('2020-05-01',periods=5)
DatetimeIndex(['2020-05-01', '2020-05-02', '2020-05-03', '2020-05-04',
'2020-05-05'],
dtype='datetime64[ns]', freq='D')
longer_ts=pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
longer_ts['2001']
2001-01-01 -2.081715
2001-01-02 1.425891
2001-01-03 0.314430
2001-01-04 0.153332
2001-01-05 0.282888
...
2001-12-27 -0.299994
2001-12-28 1.852834
2001-12-29 1.847192
2001-12-30 0.592563
2001-12-31 0.519122
Freq: D, Length: 365, dtype: float64
Here,the string '2001' is interpreted as a year and selects that time period.This also works if you specify the month:
longer_ts['2001-05']
2001-05-01 -0.408720
2001-05-02 -0.958682
2001-05-03 -0.424746
2001-05-04 0.771404
2001-05-05 1.959182
2001-05-06 0.287984
2001-05-07 -0.199789
2001-05-08 -0.369938
2001-05-09 0.309950
2001-05-10 -1.649661
2001-05-11 0.119676
2001-05-12 0.205413
2001-05-13 0.416938
2001-05-14 -0.305450
2001-05-15 -0.126385
2001-05-16 1.665036
2001-05-17 0.627492
2001-05-18 -1.317637
2001-05-19 -2.734170
2001-05-20 -0.163745
2001-05-21 -0.784528
2001-05-22 0.151304
2001-05-23 0.583916
2001-05-24 0.571195
2001-05-25 -1.498402
2001-05-26 -1.485187
2001-05-27 0.411882
2001-05-28 0.323999
2001-05-29 0.627545
2001-05-30 -2.054165
2001-05-31 -1.493494
Freq: D, dtype: float64
longer_ts['2000-05-01']
-0.08042675710861961
Slicing with datetime
objects works as well:
ts[datetime(2020,1,7)]
-0.04616414830843706
Because most time series data is ordered chronologically,you can slice with timestamps not contained in a time series to perform a range query:
ts
2020-01-02 -1.140949
2020-01-05 -0.328999
2020-01-07 -0.046164
2020-01-08 -0.783714
2020-01-10 -0.126047
2020-01-12 -0.848602
dtype: float64
ts['2020-1-6':'2020-1-14'] # Notice that,'2020-1-6' and '2020-1-14' are not in ts.
2020-01-07 -0.046164
2020-01-08 -0.783714
2020-01-10 -0.126047
2020-01-12 -0.848602
dtype: float64
As before,you can pass either a string date,datetime
,or timestamp.Remeber that slicing in this manner produces views on the source time series like slicing Numpy arrays.This means that no data is copied and modifications on the slice will be reflected in the original data.
There is an equivalent instance method,truncate
,that slices a Series between tow-dates:
ts.truncate(after='1/9/2020')
2020-01-02 -1.140949
2020-01-05 -0.328999
2020-01-07 -0.046164
2020-01-08 -0.783714
dtype: float64
All of this holds true for DataFrame as well,indexing on its rows:
dates=pd.date_range('1/1/2000',periods=100,freq='w-wed')
long_df=pd.DataFrame(np.random.randn(100,4),index=dates,columns=['Colorado','Texas','New York','Ohio']) # Attention,type disscussed above is Series,and here,changed to be DataFrame
long_df.loc['5-2001']
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2001-05-02 | -0.157680 | -0.398869 | 0.399008 | -0.109813 |
2001-05-09 | -0.475721 | -0.550544 | 0.406308 | 0.290822 |
2001-05-16 | -1.101315 | 2.469916 | -0.062604 | -0.409562 |
2001-05-23 | 0.972402 | 1.035692 | -0.594960 | 1.255631 |
2001-05-30 | -1.823161 | -0.327003 | -0.294791 | 0.795953 |
Time series with duplicate indices
In some applications,there may be multiple data observations failling on a particular timestamp.
dates=pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])
dup_ts=pd.Series(np.arange(5),index=dates)
dup_ts
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int32
dup_ts.index.is_unique
False
Indexing into this time series will now either produce scalar values or slices depending on whether a timestamp is duplicated:
dup_ts['1/3/2000']
4
dup_ts['1/2/2000']
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int32
Suppose you wanted to aggregate the data having non-unique timestamps.One way to do this is to use groupby
and pass level=0
:
grouped=dup_ts.groupby(level=0)
grouped.mean()
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int32
grouped.count()
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
Date ranges,frequencies,and shifting
Generic time series in pandas are assumed to be irregular;that is,thet have no fixed frequency.For many applications this is sufficient.However,it's often desirable to work relative to a fixed frequency,such as daily,monthly,or every 15 minutes,even if that means introducing missing values into a time series. Fortunately pandas has a full suite of standard time series frequencies and tools for resampling,inferring frequencies,and generating fixed-frequency date ranges.
ts
2020-01-02 -1.140949
2020-01-05 -0.328999
2020-01-07 -0.046164
2020-01-08 -0.783714
2020-01-10 -0.126047
2020-01-12 -0.848602
dtype: float64
resampler=ts.resample('D');resampler
<pandas.core.resample.DatetimeIndexResampler object at 0x0000014F20270320>
The string 'D' is interpreted as a daily frequency.Conversion between frequencies or resampling is a big enough topic to have its own section.
Generating date ranges
pd.date_range
is responsible for generating a DatetimeIndex
with an indicated length according to a particular frequency.
index=pd.date_range('2012-04-01','2012-06-01')
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
'2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
'2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
'2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
'2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
'2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
'2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
'2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
'2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
'2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
'2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
'2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
By default,date_range
generates daily timestamps.If you pass only a start or end date,you must pass a number of periods to generate:
pd.date_range(start='2012-04-01',periods=20)
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
dtype='datetime64[ns]', freq='D')
pd.date_range(end='2012-06-1',periods=20)
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
'2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
'2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
'2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
'2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
The start and end dates define strict boundaries for the generated date index.For example,if you wanted a date index containing the last business day of each month,you would pass the BM
frequency(business end of month) and only dates falling on or inside the date interval will be included:
pd.date_range('2020-04-01','2020-07-01',freq='BM') #'BM'-->BusinessMonthEnd
DatetimeIndex(['2020-04-30', '2020-05-29', '2020-06-30'], dtype='datetime64[ns]', freq='BM')
pd.date_range('2020-05-01',periods=10,freq='w-fri') # 'w-mon,w-tue...'--->Weekly on given day of week
DatetimeIndex(['2020-05-01', '2020-05-08', '2020-05-15', '2020-05-22',
'2020-05-29', '2020-06-05', '2020-06-12', '2020-06-19',
'2020-06-26', '2020-07-03'],
dtype='datetime64[ns]', freq='W-FRI')
pd.date_range('2020-05-01',periods=3,freq='b') # 'b'--> BusinessDay
DatetimeIndex(['2020-05-01', '2020-05-04', '2020-05-05'], dtype='datetime64[ns]', freq='B')
pd.date_range('2020-05-01',periods=4,freq='q-feb')# quarterly dates anchored on last calendar day of each month.
DatetimeIndex(['2020-05-31', '2020-08-31', '2020-11-30', '2021-02-28'], dtype='datetime64[ns]', freq='Q-FEB')
pd.date_range('2020-05-01',periods=4,freq='q-mar')# to-->march for the last item
DatetimeIndex(['2020-06-30', '2020-09-30', '2020-12-31', '2021-03-31'], dtype='datetime64[ns]', freq='Q-MAR')
date_range
by default preserves the time(if any) of the start or end timestamp.
pd.date_range('2012-05-02 12:56:31',periods=5)
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
'2012-05-04 12:56:31', '2012-05-05 12:56:31',
'2012-05-06 12:56:31'],
dtype='datetime64[ns]', freq='D')
Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention.To do this ,there is a normalize
option:
pd.date_range('2012-05-02 12:56:31',periods=5,normalize=True)
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
'2012-05-06'],
dtype='datetime64[ns]', freq='D')
Frequencies and date offsets
Frequencies in pandas are composed of a base frequency
and a multiplier.Base frequencies are typically referred to by a string alias,like M
for monthly or H
for hourly.For each base frequency, there is an object defined generally referred to as a date offset.
from pandas.tseries.offsets import Hour,Minute
hour=Hour()
hour
<Hour>
four_hour=Hour(4)
four_hour
<4 * Hours>
In most applications,you would never need to explicitly create one of these objects,instead using a string alias like 'H' or '4H'.Putting an integer before the base frequency creates a multiple:
pd.date_range('2000-01-01','2000-01-03 23:59',freq='4H')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
'2000-01-01 08:00:00', '2000-01-01 12:00:00',
'2000-01-01 16:00:00', '2000-01-01 20:00:00',
'2000-01-02 00:00:00', '2000-01-02 04:00:00',
'2000-01-02 08:00:00', '2000-01-02 12:00:00',
'2000-01-02 16:00:00', '2000-01-02 20:00:00',
'2000-01-03 00:00:00', '2000-01-03 04:00:00',
'2000-01-03 08:00:00', '2000-01-03 12:00:00',
'2000-01-03 16:00:00', '2000-01-03 20:00:00'],
dtype='datetime64[ns]', freq='4H')
Many offsets can be combined together by addition:
Hour(2)+Minute(30)
<150 * Minutes>
Similarly,you can pass frequency strings,like '1h30min',that will effectively be parsed to the same expression:
pd.date_range('2000-01-01',periods=10,freq='1h30min')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
'2000-01-01 03:00:00', '2000-01-01 04:30:00',
'2000-01-01 06:00:00', '2000-01-01 07:30:00',
'2000-01-01 09:00:00', '2000-01-01 10:30:00',
'2000-01-01 12:00:00', '2000-01-01 13:30:00'],
dtype='datetime64[ns]', freq='90T')
Some frequencies describe points in time that are not evenly spaced.For example,'M'(calendar month end) and 'BM'(last business/weekday of month) depend on the number of days in a month and, in the latter case ,whether the month ends on weekend or not.We refer to these as anchor offset.
week of month dates
One useful frequency class is 'week of month',starting with WOM
.This enables you to get dates like the third Fridy of each month.
rng=pd.date_range('2012-01-01','2012-09-01',freq='WOM-3FRI')
list(rng)
[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]
Shifting(Leading and lagging)data
Shifting means moving data backward and forward through time.Both Series and DataFrame have a shift
method for doing naive shifts forward,leaving the index unmodified:
ts=pd.Series(np.random.randn(4),index=pd.date_range('1/1/2020',periods=4,freq='M'))
help(ts.shift)
Help on method shift in module pandas.core.series:
shift(periods=1, freq=None, axis=0, fill_value=None) method of pandas.core.series.Series instance
Shift index by desired number of periods with an optional time `freq`.
When `freq` is not passed, shift the index without realigning the data.
If `freq` is passed (in this case, the index must be date or datetime,
or it will raise a `NotImplementedError`), the index will be
increased using the periods and the `freq`.
Parameters
----------
periods : int
Number of periods to shift. Can be positive or negative.
freq : DateOffset, tseries.offsets, timedelta, or str, optional
Offset to use from the tseries module or time rule (e.g. 'EOM').
If `freq` is specified then the index values are shifted but the
data is not realigned. That is, use `freq` if you would like to
extend the index when shifting and preserve the original data.
axis : {0 or 'index', 1 or 'columns', None}, default None
Shift direction.
fill_value : object, optional
The scalar value to use for newly introduced missing values.
the default depends on the dtype of `self`.
For numeric data, ``np.nan`` is used.
For datetime, timedelta, or period data, etc. :attr:`NaT` is used.
For extension dtypes, ``self.dtype.na_value`` is used.
.. versionchanged:: 0.24.0
Returns
-------
Series
Copy of input object, shifted.
See Also
--------
Index.shift : Shift values of Index.
DatetimeIndex.shift : Shift values of DatetimeIndex.
PeriodIndex.shift : Shift values of PeriodIndex.
tshift : Shift the time index, using the index's frequency if
available.
Examples
--------
>>> df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],
... 'Col2': [13, 23, 18, 33, 48],
... 'Col3': [17, 27, 22, 37, 52]})
>>> df.shift(periods=3)
Col1 Col2 Col3
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 10.0 13.0 17.0
4 20.0 23.0 27.0
>>> df.shift(periods=1, axis='columns')
Col1 Col2 Col3
0 NaN 10.0 13.0
1 NaN 20.0 23.0
2 NaN 15.0 18.0
3 NaN 30.0 33.0
4 NaN 45.0 48.0
>>> df.shift(periods=3, fill_value=0)
Col1 Col2 Col3
0 0 0 0
1 0 0 0
2 0 0 0
3 10 13 17
4 20 23 27
ts
2020-01-31 0.225376
2020-02-29 -0.024710
2020-03-31 0.117686
2020-04-30 1.513727
Freq: M, dtype: float64
ts.shift(2)
2020-01-31 NaN
2020-02-29 NaN
2020-03-31 0.225376
2020-04-30 -0.024710
Freq: M, dtype: float64
ts.shift(-2)
2020-01-31 0.117686
2020-02-29 1.513727
2020-03-31 NaN
2020-04-30 NaN
Freq: M, dtype: float64
When we shift like this,missing data is introduced either at the start or the end of the time Series.
A common use of shift
is computing percent changes in a time series or multiple time series as DataFrame columns.
Because naive shift leave the index unmodified,some data is discard.Thus if the frequency is known ,it can be passed to shift
to advance the timestamps instead of simply the data:
Shift index by desired number of periods with an optional time freq
.
When freq
is not passed, shift the index without realigning the data.
If freq
is passed (in this case, the index must be date or datetime,
or it will raise a NotImplementedError
), the index will be
increased using the periods and the freq
.
ts.shift(2,freq='M')
2020-03-31 0.225376
2020-04-30 -0.024710
2020-05-31 0.117686
2020-06-30 1.513727
Freq: M, dtype: float64
ts
2020-01-31 0.225376
2020-02-29 -0.024710
2020-03-31 0.117686
2020-04-30 1.513727
Freq: M, dtype: float64
Other frequencies can be passed too,giving you some flexibility in how to lead and lag the data:
ts.shift(3,freq='D') # shift every timestamp in ts forward 3 days
2020-02-03 0.225376
2020-03-03 -0.024710
2020-04-03 0.117686
2020-05-03 1.513727
dtype: float64
ts.shift(1,freq='90T') # shift every timestamp in ts forward '90T'
2020-01-31 01:30:00 0.225376
2020-02-29 01:30:00 -0.024710
2020-03-31 01:30:00 0.117686
2020-04-30 01:30:00 1.513727
Freq: M, dtype: float64
Shift dates with offsets
The pandas date offsets can also be used with datetime
or Timestamp
objects:
from pandas.tseries.offsets import Day,MonthEnd
now=datetime(2020,11,17)
now+3*Day()
Timestamp('2020-11-20 00:00:00')
now+MonthEnd(2)
Timestamp('2020-12-31 00:00:00')
Anchored offsets can explicitly 'roll' dates forward or backward by simply using their rollforward
and rollback
methods,respectively:
offset=MonthEnd()
offset.rollforward(now)
Timestamp('2020-11-30 00:00:00')
offset.rollback(now)
Timestamp('2020-10-31 00:00:00')
A creative use of date offset is to use these methods with groupby
:
ts=pd.Series(np.random.randn(20),index=pd.date_range('1/15/2000',periods=20,freq='4d'))
ts
list(ts.groupby(offset.rollforward))
[(Timestamp('2000-01-31 00:00:00'),
2000-01-15 -0.209800
2000-01-19 -2.189881
2000-01-23 -1.779681
2000-01-27 0.437441
2000-01-31 1.054685
Freq: 4D, dtype: float64),
(Timestamp('2000-02-29 00:00:00'),
2000-02-04 -0.506648
2000-02-08 0.484109
2000-02-12 -0.385587
2000-02-16 -0.732983
2000-02-20 -1.459167
2000-02-24 -1.133808
2000-02-28 0.097860
Freq: 4D, dtype: float64),
(Timestamp('2000-03-31 00:00:00'),
2000-03-03 0.480492
2000-03-07 1.040105
2000-03-11 0.634999
2000-03-15 0.621187
2000-03-19 -1.410100
2000-03-23 0.319765
2000-03-27 -1.079803
2000-03-31 -1.292514
Freq: 4D, dtype: float64)]
ts.groupby(offset.rollforward).mean()
2000-01-31 -0.537447
2000-02-29 -0.519461
2000-03-31 -0.085734
dtype: float64
Of course ,an easier and faster way to do this is using resample
ts.resample('M').mean()
2000-01-31 -0.537447
2000-02-29 -0.519461
2000-03-31 -0.085734
Freq: M, dtype: float64
Time zone handling
Working with time zone is generally considered one of the most unpleasant parts of time series manipulation.As a result,many time series users choose to work with time series in coordinated universal time or UTC,which is the successor to Greenwich Mean Time and is the current internaational standard.Time zones are expressed as offsets from UTC;
In Python,time zone information comes from the third-party pytz
library,which exposes the Olson database,a compilation of world time zone information. Pandas wraps pytz
's funtionality so you can ignore its API outside of the time zone names. Time zone names can be found interactively and in the docs.
import pytz
pytz.common_timezones[-5:]
['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
len(pytz.common_timezones)
440
To get a time zone object from pytz,use pytz.timezone
:
tz=pytz.timezone('America/New_York')
tz
<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
Time zone localization and conversion
By default,time series in pandas are time zone naive, for example,consider the following time series:
rng=pd.date_range('3/9/2020 9:00',periods=6,freq='D')
ts=pd.Series(np.random.randn(len(rng)),index=rng)
ts
2020-03-09 09:00:00 -0.384109
2020-03-10 09:00:00 -0.195272
2020-03-11 09:00:00 -0.473277
2020-03-12 09:00:00 1.430223
2020-03-13 09:00:00 -0.222399
2020-03-14 09:00:00 -0.844174
Freq: D, dtype: float64
print(ts.index.tz) # print None,indicating that ts.index is naive time zone
None
The index's field is None. Date ranges can be generated with a time zone set:
pd.date_range('3/9/2020 9:30',periods=10,freq='D',tz='UTC')
DatetimeIndex(['2020-03-09 09:30:00+00:00', '2020-03-10 09:30:00+00:00',
'2020-03-11 09:30:00+00:00', '2020-03-12 09:30:00+00:00',
'2020-03-13 09:30:00+00:00', '2020-03-14 09:30:00+00:00',
'2020-03-15 09:30:00+00:00', '2020-03-16 09:30:00+00:00',
'2020-03-17 09:30:00+00:00', '2020-03-18 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
Conversion from naive to localized is handled by tz_localize
:
ts
2020-03-09 09:00:00 -0.384109
2020-03-10 09:00:00 -0.195272
2020-03-11 09:00:00 -0.473277
2020-03-12 09:00:00 1.430223
2020-03-13 09:00:00 -0.222399
2020-03-14 09:00:00 -0.844174
Freq: D, dtype: float64
ts_utc=ts.tz_localize('utc')
ts_utc
2020-03-09 09:00:00+00:00 -0.384109
2020-03-10 09:00:00+00:00 -0.195272
2020-03-11 09:00:00+00:00 -0.473277
2020-03-12 09:00:00+00:00 1.430223
2020-03-13 09:00:00+00:00 -0.222399
2020-03-14 09:00:00+00:00 -0.844174
Freq: D, dtype: float64
ts_utc.index # dtype has been changed to be UTC
DatetimeIndex(['2020-03-09 09:00:00+00:00', '2020-03-10 09:00:00+00:00',
'2020-03-11 09:00:00+00:00', '2020-03-12 09:00:00+00:00',
'2020-03-13 09:00:00+00:00', '2020-03-14 09:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
Operations with Time zone --Aware Timestamp objects
Similar to time series and date ranges,individual Timestamp
objects similarly can be localized from naive to time-aware and converted from one time zone to another:
stamp=pd.Timestamp('2011-03-12 04:00')
print(stamp.tz)
None
stamp_utc=stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')
Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
ts.index.tz_localize('utc').tz_convert('America/New_York') # naive shall be localized and then can be converted to anoter time zone
DatetimeIndex(['2020-03-09 05:00:00-04:00', '2020-03-10 05:00:00-04:00',
'2020-03-11 05:00:00-04:00', '2020-03-12 05:00:00-04:00',
'2020-03-13 05:00:00-04:00', '2020-03-14 05:00:00-04:00'],
dtype='datetime64[ns, America/New_York]', freq='D')
ts.index.tz_localize('utc').tz
<UTC>
You can also pass a time zone when creating the Timestamp
:
stamp_mscow=pd.Timestamp('2011-03-12 04:00',tz='Europe/Moscow')
stamp_mscow.tz
<DstTzInfo 'Europe/Moscow' MSK+3:00:00 STD>
Time zone-aware Timestamp
objects internally store a UTC timestamp value as nano-seconds since the Unix-epoch(January 1,1970);this UTC value is invariant between time zone conversions:
stamp_mscow.value
1299891600000000000
stamp_mscow.tz_convert('America/New_York')
Timestamp('2011-03-11 20:00:00-0500', tz='America/New_York')
When performing time arithmetic using pandas's DateOffset
objects,pandas respects daylight saving time transitions where possible.Here we construct timestamps that occur right before DST transitions.First,30 minutes before transitioning to DST:
from pandas.tseries.offsets import Hour
stamp=pd.Timestamp('2012-03-12 01:30',tz='US/Eastern')
stamp
Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')
stamp+Hour()
Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
Then ,90 minutes before transitioning out of DST:
stamp=pd.Timestamp('2012-10-04 00:30',tz='US/Eastern')
stamp
Timestamp('2012-10-04 00:30:00-0400', tz='US/Eastern')
stamp+2*Hour() ### 02:30
Timestamp('2012-10-04 02:30:00-0400', tz='US/Eastern')
stamp=pd.Timestamp('2012-11-04 00:30',tz='US/Eastern')
stamp
Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
stamp+2*Hour() #01:30
Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')
Operations between different time zones
If two time series with different time zones are combined,the result will be UTC.Since the timestamps are stored under the hood in UTC,this is a straightforward operation and requires no conversion to happen:
rng=pd.date_range('3/7/2012 9:30',periods=10,freq='B')
ts=pd.Series(np.random.randn(len(rng)),index=rng)
ts1=ts[:7].tz_localize('Europe/London')
ts2=ts1[2:].tz_convert('Europe/Moscow')
result=ts1+ts2
result.index
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
'2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
'2012-03-15 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='B')
Periods and period arithmetic
Periods represent timespans,like days,months,quarters,or years.The Period
class represents this data type,requiring a string or integer.
p=pd.Period(2007,freq='A-DEC') #Annual dates anchored on last calendar day of given month,yearEnd
p
Period('2007', 'A-DEC')
In this case,the Period
object represents the full timespan from January1,2007,to December 31,2007,inclusive.Conveniently,adding integers from periods has the effect of shiftting by their frequency:
p+5
Period('2012', 'A-DEC')
p-2
Period('2005', 'A-DEC')
If two periods have the same frequency,their difference is the number of units between them:
pd.Period('2014','A-DEC')-p
<7 * YearEnds: month=12>
Regular ranges of periods can be constructed with the period_range
function:
rng=pd.period_range('2000-01-01','2000-06-30',freq='M')
rng
PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')
The PeriodIndex
class stores a sequence of periods and can serve as an axis index in any pandas data structure:
pd.Series(np.random.randn(6),index=rng)
2000-01 1.337147
2000-02 -0.201512
2000-03 -0.261829
2000-04 0.124229
2000-05 -0.723703
2000-06 -2.130917
Freq: M, dtype: float64
If you have an array of strings,you can also use the PeriodIndex
class:
values=['2001Q3','2002Q2','2003Q1']
index=pd.PeriodIndex(values,freq='Q-DEC') # Quarterly dates anchored on last calendar day of each month,for year ending in indicated month.
index
PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
Period frequency convension
Periods and PeriodIndex
objects can be converted to another frequency with their asfreq
method.As an example,suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year.This is fairly straightforward.
p=pd.Period('2007',freq='A-DEC') # Annual period,ending with the last day of 2007-12
p
Period('2007', 'A-DEC')
p.asfreq('M',how='start')
Period('2007-01', 'M')
p.asfreq('M',how='end')
Period('2007-12', 'M')
You can think of Period('2007','A-DEC')
as being a sort of cursor pointing to a span of time,subdivided by monthly periods.
p=pd.Period('2007','A-JUN') #Annual period,ending with the last day of 2007-06
p.asfreq('M','start')
Period('2006-07', 'M')
p.asfreq('M','end')
Period('2007-06', 'M')
Whole PeriodIndex
objects or time series can be similarly converted with the same semantics:
rng=pd.period_range('2006','2009',freq='A-DEC')
ts=pd.Series(np.random.randn(len(rng)),index=rng)
ts
2006 -0.297145
2007 -0.304496
2008 0.705818
2009 -1.829369
Freq: A-DEC, dtype: float64
ts.asfreq('M',how='start')
2006-01 -0.297145
2007-01 -0.304496
2008-01 0.705818
2009-01 -1.829369
Freq: M, dtype: float64
Here,the annual periods are replaced with monthly periods corresponding to the first month falling within each annual period.If we instead wanted the last business day of each year, we can use 'B' frequency and indicate that we want the end of the period:
ts.asfreq('B',how='end')
2006-12-29 -0.297145
2007-12-31 -0.304496
2008-12-31 0.705818
2009-12-31 -1.829369
Freq: B, dtype: float64
Quarterly period frequencies
Quarterly data is stanard in accounting,finance,and other fields.Much quarterly data is reported relative to a fiscal year end,typically the last calender or business day of one of the 12 months of the year.Thus,the period 2012Q4
has a different meaning depending on fiscal year end.pandas supports all 12 possible quarterly frequencies as Q-JAN
through Q-DEC
.
p=pd.Period('2012Q4',freq='Q-JAN')
p
Period('2012Q4', 'Q-JAN')
In the case of fiscal year ending in January. 2012Q4 runs from November through January,whcin you can check by converting to daily frequency.
p.asfreq('D','start') #check p by converting to daily frequency.
Period('2011-11-01', 'D')
p.asfreq('D','end')
Period('2012-01-31', 'D')
Thus,it's possible to do easy period arithmetic,for example,to get the timestamp at 4PM on the second-to-last business day of the quarter,
p4pm=(p.asfreq('B','e')-1).asfreq('T','s')+16*60 #'T' means 'Minute'
p4pm
Period('2012-01-30 16:00', 'T')
p4pm.to_timestamp()
Timestamp('2012-01-30 16:00:00')
You can generate quarterly ranges using period_range
.
rng=pd.period_range('2011Q3','2012Q4',freq='Q-JAN')
ts=pd.Series(np.arange(len(rng)),index=rng)
ts
2011Q3 0
2011Q4 1
2012Q1 2
2012Q2 3
2012Q3 4
2012Q4 5
Freq: Q-JAN, dtype: int32
new_rng=(rng.asfreq('B','e')-1).asfreq('T','s')+16*60
ts.index=new_rng.to_timestamp()
ts
2010-10-28 16:00:00 0
2011-01-28 16:00:00 1
2011-04-28 16:00:00 2
2011-07-28 16:00:00 3
2011-10-28 16:00:00 4
2012-01-30 16:00:00 5
dtype: int32
Converting Timestamps to Periods(and Back
Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period
method:
rng=pd.date_range('2000-01-01',periods=3,freq='M')
ts=pd.Series(np.random.randn(3),index=rng);ts
2000-01-31 -0.115457
2000-02-29 -0.318769
2000-03-31 0.166398
Freq: M, dtype: float64
pts=ts.to_period();pts
2000-01 -0.115457
2000-02 -0.318769
2000-03 0.166398
Freq: M, dtype: float64
type(pts)
pandas.core.series.Series
Since periods refer to non-overlapping timespans,a timestamp can only belong to a single period for a given frequency.While the frequency of the new PeriodIndex
is inferred from the timestamp by default,you can specify any frequency you want.
rng=pd.date_range('1/29/2000',periods=6,freq='D')
ts2=pd.Series(np.random.randn(6),index=rng)
ts2
2000-01-29 0.511537
2000-01-30 2.661260
2000-01-31 0.954388
2000-02-01 -0.903825
2000-02-02 -0.399345
2000-02-03 1.160727
Freq: D, dtype: float64
ts2.to_period('M')
2000-01 0.511537
2000-01 2.661260
2000-01 0.954388
2000-02 -0.903825
2000-02 -0.399345
2000-02 1.160727
Freq: M, dtype: float64
To convert back to timestamps,use to_timestamp
:
pts=ts2.to_period()
pts.to_timestamp(how='end')
2000-01-29 23:59:59.999999999 0.511537
2000-01-30 23:59:59.999999999 2.661260
2000-01-31 23:59:59.999999999 0.954388
2000-02-01 23:59:59.999999999 -0.903825
2000-02-02 23:59:59.999999999 -0.399345
2000-02-03 23:59:59.999999999 1.160727
Freq: D, dtype: float64
Creating a PeriodIndex from Arrays
data=pd.read_csv('.pydata-book-2nd-editionexamplesmacrodata.csv')
data.head(5)
year | quarter | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1959.0 | 1.0 | 2710.349 | 1707.4 | 286.898 | 470.045 | 1886.9 | 28.98 | 139.7 | 2.82 | 5.8 | 177.146 | 0.00 | 0.00 |
1 | 1959.0 | 2.0 | 2778.801 | 1733.7 | 310.859 | 481.301 | 1919.7 | 29.15 | 141.7 | 3.08 | 5.1 | 177.830 | 2.34 | 0.74 |
2 | 1959.0 | 3.0 | 2775.488 | 1751.8 | 289.226 | 491.260 | 1916.4 | 29.35 | 140.5 | 3.82 | 5.3 | 178.657 | 2.74 | 1.09 |
3 | 1959.0 | 4.0 | 2785.204 | 1753.7 | 299.356 | 484.052 | 1931.3 | 29.37 | 140.0 | 4.33 | 5.6 | 179.386 | 0.27 | 4.06 |
4 | 1960.0 | 1.0 | 2847.699 | 1770.5 | 331.722 | 462.199 | 1955.5 | 29.54 | 139.6 | 3.50 | 5.2 | 180.007 | 2.31 | 1.19 |
data.year
0 1959.0
1 1959.0
2 1959.0
3 1959.0
4 1960.0
...
198 2008.0
199 2008.0
200 2009.0
201 2009.0
202 2009.0
Name: year, Length: 203, dtype: float64
data.quarter
0 1.0
1 2.0
2 3.0
3 4.0
4 1.0
...
198 3.0
199 4.0
200 1.0
201 2.0
202 3.0
Name: quarter, Length: 203, dtype: float64
By passing these arrays to PeriodIndex
with a frequency,you can combine them to form an index for the DataFrame:
index=pd.PeriodIndex(year=data.year,quarter=data.quarter,freq='Q-DEC')
index
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
'1960Q3', '1960Q4', '1961Q1', '1961Q2',
...
'2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
'2008Q4', '2009Q1', '2009Q2', '2009Q3'],
dtype='period[Q-DEC]', length=203, freq='Q-DEC')
data.index=index
data.infl
1959Q1 0.00
1959Q2 2.34
1959Q3 2.74
1959Q4 0.27
1960Q1 2.31
...
2008Q3 -3.16
2008Q4 -8.79
2009Q1 0.94
2009Q2 3.37
2009Q3 3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
Resampling and frequency conversion
Resampling refers to the process of converting a time series from one frequency to another.Aggregating higher frequency data to lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories;
pandas objects are equipped with a resample method,which is the workhourse function for all frequency conversion.resample
has a similar API to groupby
;you can call resemple
to group the data,then call an aggregate functions:
rng=pd.date_range('2000-01-01',periods=5,freq='D')
ts=pd.Series(np.arange(5),index=rng)
ts.head()
2000-01-01 0
2000-01-02 1
2000-01-03 2
2000-01-04 3
2000-01-05 4
Freq: D, dtype: int32
ts.resample('M').mean()
2000-01-31 2
Freq: M, dtype: int32
ts.resample('M',kind='period').mean()
2000-01 2
Freq: M, dtype: int32
help(pd.Series.resample)
Help on function resample in module pandas.core.generic:
resample(self, rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)
Resample time-series data.
Convenience method for frequency conversion and resampling of time
series. Object must have a datetime-like index (`DatetimeIndex`,
`PeriodIndex`, or `TimedeltaIndex`), or pass datetime-like values
to the `on` or `level` keyword.
Parameters
----------
rule : DateOffset, Timedelta or str
The offset string or object representing target conversion.
how : str
Method for down/re-sampling, default to 'mean' for downsampling.
.. deprecated:: 0.18.0
The new syntax is ``.resample(...).mean()``, or
``.resample(...).apply(<func>)``
axis : {0 or 'index', 1 or 'columns'}, default 0
Which axis to use for up- or down-sampling. For `Series` this
will default to 0, i.e. along the rows. Must be
`DatetimeIndex`, `TimedeltaIndex` or `PeriodIndex`.
fill_method : str, default None
Filling method for upsampling.
.. deprecated:: 0.18.0
The new syntax is ``.resample(...).<func>()``,
e.g. ``.resample(...).pad()``
closed : {'right', 'left'}, default None
Which side of bin interval is closed. The default is 'left'
for all frequency offsets except for 'M', 'A', 'Q', 'BM',
'BA', 'BQ', and 'W' which all have a default of 'right'.
label : {'right', 'left'}, default None
Which bin edge label to label bucket with. The default is 'left'
for all frequency offsets except for 'M', 'A', 'Q', 'BM',
'BA', 'BQ', and 'W' which all have a default of 'right'.
convention : {'start', 'end', 's', 'e'}, default 'start'
For `PeriodIndex` only, controls whether to use the start or
end of `rule`.
kind : {'timestamp', 'period'}, optional, default None
Pass 'timestamp' to convert the resulting index to a
`DateTimeIndex` or 'period' to convert it to a `PeriodIndex`.
By default the input representation is retained.
loffset : timedelta, default None
Adjust the resampled time labels.
limit : int, default None
Maximum size gap when reindexing with `fill_method`.
.. deprecated:: 0.18.0
base : int, default 0
For frequencies that evenly subdivide 1 day, the "origin" of the
aggregated intervals. For example, for '5min' frequency, base could
range from 0 through 4. Defaults to 0.
on : str, optional
For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.
.. versionadded:: 0.19.0
level : str or int, optional
For a MultiIndex, level (name or number) to use for
resampling. `level` must be datetime-like.
.. versionadded:: 0.19.0
Returns
-------
Resampler object
See Also
--------
groupby : Group by mapping, function, label, or list of labels.
Series.resample : Resample a Series.
DataFrame.resample: Resample a DataFrame.
Notes
-----
See the `user guide
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling>`_
for more.
To learn more about the offset strings, please see `this link
<http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects>`__.
Examples
--------
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values
of the timestamps falling into a bin.
>>> series.resample('3T').sum()
2000-01-01 00:00:00 3
2000-01-01 00:03:00 12
2000-01-01 00:06:00 21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each
bin using the right edge instead of the left. Please note that the
value in the bucket used as the label is not included in the bucket,
which it labels. For example, in the original series the
bucket ``2000-01-01 00:03:00`` contains the value 3, but the summed
value in the resampled bucket with the label ``2000-01-01 00:03:00``
does not include 3 (if it did, the summed value would be 6, not 3).
To include this value close the right side of the bin interval as
illustrated in the example below this one.
>>> series.resample('3T', label='right').sum()
2000-01-01 00:03:00 3
2000-01-01 00:06:00 12
2000-01-01 00:09:00 21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right
side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00 0
2000-01-01 00:03:00 6
2000-01-01 00:06:00 15
2000-01-01 00:09:00 15
Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5] # Select first 5 rows
2000-01-01 00:00:00 0.0
2000-01-01 00:00:30 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:30 NaN
2000-01-01 00:02:00 2.0
Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the ``NaN``
values using the ``pad`` method.
>>> series.resample('30S').pad()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 0
2000-01-01 00:01:00 1
2000-01-01 00:01:30 1
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the
``NaN`` values using the ``bfill`` method.
>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 1
2000-01-01 00:01:00 1
2000-01-01 00:01:30 2
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
Pass a custom function via ``apply``
>>> def custom_resampler(array_like):
... return np.sum(array_like) + 5
...
>>> series.resample('3T').apply(custom_resampler)
2000-01-01 00:00:00 8
2000-01-01 00:03:00 17
2000-01-01 00:06:00 26
Freq: 3T, dtype: int64
For a Series with a PeriodIndex, the keyword `convention` can be
used to control whether to use the start or end of `rule`.
Resample a year by quarter using 'start' `convention`. Values are
assigned to the first quarter of the period.
>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
... freq='A',
... periods=2))
>>> s
2012 1
2013 2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()
2012Q1 1.0
2012Q2 NaN
2012Q3 NaN
2012Q4 NaN
2013Q1 2.0
2013Q2 NaN
2013Q3 NaN
2013Q4 NaN
Freq: Q-DEC, dtype: float64
Resample quarters by month using 'end' `convention`. Values are
assigned to the last month of the period.
>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
... freq='Q',
... periods=4))
>>> q
2018Q1 1
2018Q2 2
2018Q3 3
2018Q4 4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()
2018-03 1.0
2018-04 NaN
2018-05 NaN
2018-06 2.0
2018-07 NaN
2018-08 NaN
2018-09 3.0
2018-10 NaN
2018-11 NaN
2018-12 4.0
Freq: M, dtype: float64
For DataFrame objects, the keyword `on` can be used to specify the
column instead of the index for resampling.
>>> d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df = pd.DataFrame(d)
>>> df['week_starting'] = pd.date_range('01/01/2018',
... periods=8,
... freq='W')
>>> df
price volume week_starting
0 10 50 2018-01-07
1 11 60 2018-01-14
2 9 40 2018-01-21
3 13 100 2018-01-28
4 14 50 2018-02-04
5 18 100 2018-02-11
6 17 40 2018-02-18
7 19 50 2018-02-25
>>> df.resample('M', on='week_starting').mean()
price volume
week_starting
2018-01-31 10.75 62.5
2018-02-28 17.00 60.0
For a DataFrame with MultiIndex, the keyword `level` can be used to
specify on which level the resampling needs to take place.
>>> days = pd.date_range('1/1/2000', periods=4, freq='D')
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df2 = pd.DataFrame(d2,
... index=pd.MultiIndex.from_product([days,
... ['morning',
... 'afternoon']]
... ))
>>> df2
price volume
2000-01-01 morning 10 50
afternoon 11 60
2000-01-02 morning 9 40
afternoon 13 100
2000-01-03 morning 14 50
afternoon 18 100
2000-01-04 morning 17 40
afternoon 19 50
>>> df2.resample('D', level=0).sum()
price volume
2000-01-01 21 110
2000-01-02 22 140
2000-01-03 32 150
2000-01-04 36 90
help(pd.MultiIndex.from_product)
Help on method from_product in module pandas.core.indexes.multi:
from_product(iterables, sortorder=None, names=None) method of builtins.type instance
Make a MultiIndex from the cartesian product of multiple iterables.
Parameters
----------
iterables : list / sequence of iterables
Each iterable has unique labels for each level of the index.
sortorder : int or None
Level of sortedness (must be lexicographically sorted by that
level).
names : list / sequence of str, optional
Names for the levels in the index.
Returns
-------
index : MultiIndex
See Also
--------
MultiIndex.from_arrays : Convert list of arrays to MultiIndex.
MultiIndex.from_tuples : Convert list of tuples to MultiIndex.
MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
Examples
--------
>>> numbers = [0, 1, 2]
>>> colors = ['green', 'purple']
>>> pd.MultiIndex.from_product([numbers, colors],
... names=['number', 'color'])
MultiIndex([(0, 'green'),
(0, 'purple'),
(1, 'green'),
(1, 'purple'),
(2, 'green'),
(2, 'purple')],
names=['number', 'color'])
tngt=pd.date_range('1/1/2020',periods=9,freq='T')
tngt
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
'2020-01-01 00:02:00', '2020-01-01 00:03:00',
'2020-01-01 00:04:00', '2020-01-01 00:05:00',
'2020-01-01 00:06:00', '2020-01-01 00:07:00',
'2020-01-01 00:08:00'],
dtype='datetime64[ns]', freq='T')
sert=pd.Series(np.arange(9),index=tngt)
sert
2020-01-01 00:00:00 0
2020-01-01 00:01:00 1
2020-01-01 00:02:00 2
2020-01-01 00:03:00 3
2020-01-01 00:04:00 4
2020-01-01 00:05:00 5
2020-01-01 00:06:00 6
2020-01-01 00:07:00 7
2020-01-01 00:08:00 8
Freq: T, dtype: int32
sert.resample('30S')
<pandas.core.resample.DatetimeIndexResampler object at 0x0000014F20423860>
help(pd.core.resample.DatetimeIndexResampler.asfreq)
Help on function asfreq in module pandas.core.resample:
asfreq(self, fill_value=None)
Return the values at the new freq, essentially a reindex.
Parameters
----------
fill_value : scalar, optional
Value to use for missing values, applied during upsampling (note
this does not fill NaNs that already were present).
.. versionadded:: 0.20.0
Returns
-------
DataFrame or Series
Values at the specified freq.
See Also
--------
Series.asfreq
DataFrame.asfreq
sert.resample('30S').asfreq()
2020-01-01 00:00:00 0.0
2020-01-01 00:00:30 NaN
2020-01-01 00:01:00 1.0
2020-01-01 00:01:30 NaN
2020-01-01 00:02:00 2.0
2020-01-01 00:02:30 NaN
2020-01-01 00:03:00 3.0
2020-01-01 00:03:30 NaN
2020-01-01 00:04:00 4.0
2020-01-01 00:04:30 NaN
2020-01-01 00:05:00 5.0
2020-01-01 00:05:30 NaN
2020-01-01 00:06:00 6.0
2020-01-01 00:06:30 NaN
2020-01-01 00:07:00 7.0
2020-01-01 00:07:30 NaN
2020-01-01 00:08:00 8.0
Freq: 30S, dtype: float64
ts.resample('M',kind='period').mean()
2000-01 2
Freq: M, dtype: int32
ts
2000-01-01 0
2000-01-02 1
2000-01-03 2
2000-01-04 3
2000-01-05 4
Freq: D, dtype: int32
ts.resample('M',kind='timestamp').mean()
2000-01-31 2
Freq: M, dtype: int32
ts.resample('M').mean()
2000-01-31 2
Freq: M, dtype: int32
resample
is a flexible and high-performance method that can be used to process very large time series.
The argument and its meaning:
- freq: string or DateOffset indicating desired resampled frequency(e.g 'M','5min' ,or Second(15)
- axis : axis to resample on;default axis=0
- fill_method: how to interpolate when upsampling,as in 'ffill' or 'bfill'; by default does not interpolation
- closed: in downsampling,which end of each interval is closed(inclusive),'right' or 'left'
- label: in downsampling,how to label the aggregated result,with the 'right' or 'left' bin edge(e.g, the 9:30 to 9:35 five-minute interval could be labeled 9:30 or 9:35)
- loffset:Time adjustment to the bin labels,such as '-1s'/Second(-1) to shift the aggregate labels one second erlier
- limit:when forward or backward filling,the maxinum number of periods to fill
- kind:Aggregate to periods('period') or timestamps('timestamp');default to the type of index the time series has
- convention: when resampling periods,the convention('start' or 'end') for converting the low-frequency period to high frequency;defaults to 'end'.
Downsampling
Aggregating data to a regular,lower frequency is a pretty normal time series task.The data you are aggregating does not need to be fixed frequently;the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate.For example, to convert to monthly,'M' or 'BM',you need to chop up the data into one-month interval,and the union of the intervals must make up the whole time frame.There are a couple things to think about when using resemple
to downsample data:
- Which side of each interval is closed
- How to label each aggregated bin,either with the start of the interval or the end
rng=pd.date_range('2000-01-01',periods=12,freq='T')
ts=pd.Series(np.arange(12),index=rng)
ts
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T, dtype: int32
ts.resample('5min',closed='right').sum() # when closed='right',bin is the form of (],the index of ts is the right side of bin
#(23:55:00],(00:00,00:05],(00:05,00:10],(00:10,00:15]
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int32
ts.resample('5min',closed='right',label='right').sum() #The same as above,but showing labels
#(00:00,00:05],(00:05,00:10],(00:10,00:15]
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T, dtype: int32
ts.resample('5min',closed='left').sum()# [00:00,00:05),[00:05,00:10),[00:10,00:15), when closed='left',bin is the form of [),and
# the index of ts is the left side of bin.
2000-01-01 00:00:00 10
2000-01-01 00:05:00 35
2000-01-01 00:10:00 21
Freq: 5T, dtype: int32
ts.resample('5min',closed='left',label='right').sum()# The same as above,but showing labels.
2000-01-01 00:05:00 10
2000-01-01 00:10:00 35
2000-01-01 00:15:00 21
Freq: 5T, dtype: int32
Lastly,you might want to shift the result index by some amount,say subtracting one second from the right edge to make it more clear which interval the timestamp refers to.To do this,pass a string or date offset to loffset
:
ts.resample('5min',closed='right',label='right',loffset='-1s').sum()
1999-12-31 23:59:59 0
2000-01-01 00:04:59 15
2000-01-01 00:09:59 40
2000-01-01 00:14:59 11
Freq: 5T, dtype: int32
You also could have accomplished the effect of loffset
by calling the shift
method on the result without the loffset
.
ts.resample('5min',closed='right').sum().shift(1,'-1s')
1999-12-31 23:54:59 0
1999-12-31 23:59:59 15
2000-01-01 00:04:59 40
2000-01-01 00:09:59 11
Freq: 5T, dtype: int32
Open-High-Low-Close(OHLC) resampling
In finance,a popular way to aggregate a time series is to compute four values for each bucket:the first(open),last(close),maximum(high),minimal(low) values.By using the ohlc
aggregate function you will obtain a DataFrame having columns containing these four aggregates,which are effectively computed in a single sweep of the data:
ts.resample('5min').ohlc()
open | high | low | close | |
---|---|---|---|---|
2000-01-01 00:00:00 | 0 | 4 | 0 | 4 |
2000-01-01 00:05:00 | 5 | 9 | 5 | 9 |
2000-01-01 00:10:00 | 10 | 11 | 10 | 11 |
Upsampling and interpolation
When converting from a low frequency to a higher frequency,no aggregation is needed.
frame=pd.DataFrame(np.random.randn(2,4),index=pd.date_range('1/1/2000',periods=2,freq='W-WED'),columns=['Colorado','Texa','New York','Ohio'])
frame
Colorado | Texa | New York | Ohio | |
---|---|---|---|---|
2000-01-05 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-12 | -0.179902 | 2.531560 | -2.658008 | 0.946870 |
When you are using an aggregation function with this data,there is only one value per group,and missinig values result in the gaps.We use the
asfreq
mehtond to convert the higher frequency without any aggregation:
df_daily=frame.resample('D').asfreq()
df_daily
Colorado | Texa | New York | Ohio | |
---|---|---|---|---|
2000-01-05 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-06 | NaN | NaN | NaN | NaN |
2000-01-07 | NaN | NaN | NaN | NaN |
2000-01-08 | NaN | NaN | NaN | NaN |
2000-01-09 | NaN | NaN | NaN | NaN |
2000-01-10 | NaN | NaN | NaN | NaN |
2000-01-11 | NaN | NaN | NaN | NaN |
2000-01-12 | -0.179902 | 2.531560 | -2.658008 | 0.946870 |
Suppose you wanted to fill forward each weekly value on the non-Wednesdays.The same filling or interpolation methods available in the fillna
and reindex
methods are available for resampling:
frame.resample('D').ffill()
Colorado | Texa | New York | Ohio | |
---|---|---|---|---|
2000-01-05 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-06 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-07 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-08 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-09 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-10 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-11 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-12 | -0.179902 | 2.531560 | -2.658008 | 0.946870 |
You can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:
frame.resample('D').ffill(limit=2)
Colorado | Texa | New York | Ohio | |
---|---|---|---|---|
2000-01-05 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-06 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-07 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-08 | NaN | NaN | NaN | NaN |
2000-01-09 | NaN | NaN | NaN | NaN |
2000-01-10 | NaN | NaN | NaN | NaN |
2000-01-11 | NaN | NaN | NaN | NaN |
2000-01-12 | -0.179902 | 2.531560 | -2.658008 | 0.946870 |
Notably,the new date index need not overlap with the old one at all:
frame.resample('W-THU').ffill()
Colorado | Texa | New York | Ohio | |
---|---|---|---|---|
2000-01-06 | -0.739109 | 0.781223 | 0.570884 | -1.616556 |
2000-01-13 | -0.179902 | 2.531560 | -2.658008 | 0.946870 |
Resample with periods
Resampling data indexed by periods is similar to timestamps:
frame=pd.DataFrame(np.random.randn(24,4),index=pd.period_range('1-2000','12-2001',freq='M'),columns=['Colorado','Texas','New York','Ohio'])
frame[:5]
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2000-01 | -0.829993 | -1.129430 | 1.320036 | 0.275144 |
2000-02 | 2.511115 | -0.306541 | 0.472983 | 0.220395 |
2000-03 | -0.037656 | 0.776638 | 0.428096 | -0.274698 |
2000-04 | -1.116895 | -0.353303 | -0.642274 | 1.469136 |
2000-05 | 0.975105 | -1.160983 | 0.459956 | -0.834690 |
annual_frame=frame.resample('A-DEC').mean()
annual_frame
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2000 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2001 | 0.126581 | 0.275503 | -0.209550 | 0.241073 |
Upsampling is more nuance ,as you must make a decision about which end of the timespan in the new frequency to place the values before resampling,just like the asfreq
method.The convention
argument defaults to start
but can also be end
.
annual_frame.resample('Q-DEC').ffill()
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2000Q1 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2000Q2 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2000Q3 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2000Q4 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2001Q1 | 0.126581 | 0.275503 | -0.209550 | 0.241073 |
2001Q2 | 0.126581 | 0.275503 | -0.209550 | 0.241073 |
2001Q3 | 0.126581 | 0.275503 | -0.209550 | 0.241073 |
2001Q4 | 0.126581 | 0.275503 | -0.209550 | 0.241073 |
annual_frame.resample('Q-DEC',convention='end').ffill()
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2000Q4 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2001Q1 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2001Q2 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2001Q3 | 0.002307 | -0.277951 | -0.030337 | -0.273060 |
2001Q4 | 0.126581 | 0.275503 | -0.209550 | 0.241073 |
Moving window functions
An important class of array transformations used for time Series operations are statistics and other functinos evaluated over a sliding window or with exponentially decaying weights.This can be useful for smoothing noisy or gappy data.
close_px_all=pd.read_csv(r'./pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px_all.head()
AAPL | MSFT | XOM | SPX | |
---|---|---|---|---|
2003-01-02 | 7.40 | 21.11 | 29.22 | 909.03 |
2003-01-03 | 7.45 | 21.14 | 29.24 | 908.59 |
2003-01-06 | 7.45 | 21.52 | 29.96 | 929.01 |
2003-01-07 | 7.43 | 21.93 | 28.95 | 922.93 |
2003-01-08 | 7.28 | 21.31 | 28.83 | 909.93 |
close_px=close_px_all[['AAPL','MSFT','XOM']]
close_px=close_px.resample('B').ffill()
rolling
operator behaves similarly to resample
and groupby
.It can be called on a Series or DataFrame along with a window
:
close_px.AAPL.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x14f2085ccf8>
close_px.AAPL.rolling(300).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x14f20abec50>
close_px.AAPL.rolling(250).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x14f20b7fb00>
help(close_px.AAPL.rolling)
Help on method rolling in module pandas.core.generic:
rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None) method of pandas.core.series.Series instance
Provide rolling window calculations.
.. versionadded:: 0.18.0
Parameters
----------
window : int, or offset
Size of the moving window. This is the number of observations used for
calculating the statistic. Each window will be a fixed size.
If its an offset then this will be the time period of each window. Each
window will be a variable sized based on the observations included in
the time-period. This is only valid for datetimelike indexes. This is
new in 0.19.0
min_periods : int, default None
Minimum number of observations in window required to have a value
(otherwise result is NA). For a window that is specified by an offset,
`min_periods` will default to 1. Otherwise, `min_periods` will default
to the size of the window.
center : bool, default False
Set the labels at the center of the window.
win_type : str, default None
Provide a window type. If ``None``, all points are evenly weighted.
See the notes below for further information.
on : str, optional
For a DataFrame, a datetime-like column on which to calculate the rolling
window, rather than the DataFrame's index. Provided integer column is
ignored and excluded from result since an integer index is not used to
calculate the rolling window.
axis : int or str, default 0
closed : str, default None
Make the interval closed on the 'right', 'left', 'both' or
'neither' endpoints.
For offset-based windows, it defaults to 'right'.
For fixed windows, defaults to 'both'. Remaining cases not implemented
for fixed windows.
.. versionadded:: 0.20.0
Returns
-------
a Window or Rolling sub-classed for the particular operation
See Also
--------
expanding : Provides expanding transformations.
ewm : Provides exponential weighted functions.
Notes
-----
By default, the result is set to the right edge of the window. This can be
changed to the center of the window by setting ``center=True``.
To learn more about the offsets & frequency strings, please see `this link
<http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
The recognized win_types are:
* ``boxcar``
* ``triang``
* ``blackman``
* ``hamming``
* ``bartlett``
* ``parzen``
* ``bohman``
* ``blackmanharris``
* ``nuttall``
* ``barthann``
* ``kaiser`` (needs beta)
* ``gaussian`` (needs std)
* ``general_gaussian`` (needs power, width)
* ``slepian`` (needs width)
* ``exponential`` (needs tau), center is set to None.
If ``win_type=None`` all points are evenly weighted. To learn more about
different window types see `scipy.signal window functions
<https://docs.scipy.org/doc/scipy/reference/signal.html#window-functions>`__.
Examples
--------
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df
B
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
Rolling sum with a window length of 2, using the 'triang'
window type.
>>> df.rolling(2, win_type='triang').sum()
B
0 NaN
1 0.5
2 1.5
3 NaN
4 NaN
Rolling sum with a window length of 2, min_periods defaults
to the window length.
>>> df.rolling(2).sum()
B
0 NaN
1 1.0
2 3.0
3 NaN
4 NaN
Same as above, but explicitly set the min_periods
>>> df.rolling(2, min_periods=1).sum()
B
0 0.0
1 1.0
2 3.0
3 2.0
4 4.0
A ragged (meaning not-a-regular frequency), time-indexed DataFrame
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
... index = [pd.Timestamp('20130101 09:00:00'),
... pd.Timestamp('20130101 09:00:02'),
... pd.Timestamp('20130101 09:00:03'),
... pd.Timestamp('20130101 09:00:05'),
... pd.Timestamp('20130101 09:00:06')])
>>> df
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
Contrasting to an integer rolling window, this will roll a variable
length window corresponding to the time period.
The default for min_periods is 1.
>>> df.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
To illustrate the meaning of rolling
,look at the following example:
index=pd.date_range('2020-5-5',periods=20)
df=pd.DataFrame(np.arange(20),index=index,columns=['test']);df
test | |
---|---|
2020-05-05 | 0 |
2020-05-06 | 1 |
2020-05-07 | 2 |
2020-05-08 | 3 |
2020-05-09 | 4 |
2020-05-10 | 5 |
2020-05-11 | 6 |
2020-05-12 | 7 |
2020-05-13 | 8 |
2020-05-14 | 9 |
2020-05-15 | 10 |
2020-05-16 | 11 |
2020-05-17 | 12 |
2020-05-18 | 13 |
2020-05-19 | 14 |
2020-05-20 | 15 |
2020-05-21 | 16 |
2020-05-22 | 17 |
2020-05-23 | 18 |
2020-05-24 | 19 |
df['sum']=df.test.rolling(3).sum()
df['mean']=df.test.rolling(3).mean()
df['mean1'] = df.test.rolling(3,min_periods=2).mean()
df['expanding']=df.test.expanding().mean()
df
test | sum | mean | mean1 | expanding | |
---|---|---|---|---|---|
2020-05-05 | 0 | NaN | NaN | NaN | 0.0 |
2020-05-06 | 1 | NaN | NaN | 0.5 | 0.5 |
2020-05-07 | 2 | 3.0 | 1.0 | 1.0 | 1.0 |
2020-05-08 | 3 | 6.0 | 2.0 | 2.0 | 1.5 |
2020-05-09 | 4 | 9.0 | 3.0 | 3.0 | 2.0 |
2020-05-10 | 5 | 12.0 | 4.0 | 4.0 | 2.5 |
2020-05-11 | 6 | 15.0 | 5.0 | 5.0 | 3.0 |
2020-05-12 | 7 | 18.0 | 6.0 | 6.0 | 3.5 |
2020-05-13 | 8 | 21.0 | 7.0 | 7.0 | 4.0 |
2020-05-14 | 9 | 24.0 | 8.0 | 8.0 | 4.5 |
2020-05-15 | 10 | 27.0 | 9.0 | 9.0 | 5.0 |
2020-05-16 | 11 | 30.0 | 10.0 | 10.0 | 5.5 |
2020-05-17 | 12 | 33.0 | 11.0 | 11.0 | 6.0 |
2020-05-18 | 13 | 36.0 | 12.0 | 12.0 | 6.5 |
2020-05-19 | 14 | 39.0 | 13.0 | 13.0 | 7.0 |
2020-05-20 | 15 | 42.0 | 14.0 | 14.0 | 7.5 |
2020-05-21 | 16 | 45.0 | 15.0 | 15.0 | 8.0 |
2020-05-22 | 17 | 48.0 | 16.0 | 16.0 | 8.5 |
2020-05-23 | 18 | 51.0 | 17.0 | 17.0 | 9.0 |
2020-05-24 | 19 | 54.0 | 18.0 | 18.0 | 9.5 |
df['test'].rolling(2).agg([np.sum,np.mean]) #Using agg function to return several results.
sum | mean | |
---|---|---|
2020-05-05 | NaN | NaN |
2020-05-06 | 1.0 | 0.5 |
2020-05-07 | 3.0 | 1.5 |
2020-05-08 | 5.0 | 2.5 |
2020-05-09 | 7.0 | 3.5 |
2020-05-10 | 9.0 | 4.5 |
2020-05-11 | 11.0 | 5.5 |
2020-05-12 | 13.0 | 6.5 |
2020-05-13 | 15.0 | 7.5 |
2020-05-14 | 17.0 | 8.5 |
2020-05-15 | 19.0 | 9.5 |
2020-05-16 | 21.0 | 10.5 |
2020-05-17 | 23.0 | 11.5 |
2020-05-18 | 25.0 | 12.5 |
2020-05-19 | 27.0 | 13.5 |
2020-05-20 | 29.0 | 14.5 |
2020-05-21 | 31.0 | 15.5 |
2020-05-22 | 33.0 | 16.5 |
2020-05-23 | 35.0 | 17.5 |
2020-05-24 | 37.0 | 18.5 |
Look at the result above,since window
=3,so the first two datas are Nan.In terms of 'sum',3=0+1+2,6=1+2+3,9=2+3+4...
The expression rolling(250)
is similar in behaviour to groupby
,but instead of grouping it, it creates an object that grouping over a 250-day sliding window.By default rolling functions require all of the values in the window to be non-Na.This behaviour can be changed to account for missing and ,in particular ,the fact that you will have fewer than window
periods of data at the begining of the time series:
appl_std250=close_px.AAPL.rolling(250,min_periods=10).std()
appl_std250[5:12]
2003-01-09 NaN
2003-01-10 NaN
2003-01-13 NaN
2003-01-14 NaN
2003-01-15 0.077496
2003-01-16 0.074760
2003-01-17 0.112368
Freq: B, Name: AAPL, dtype: float64
appl_std250.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x14f23f70cf8>
In order to compute an expanding window mean,use the expanding
operator instead of rolling
.The expanding mean starts the time window from the begining of the time series and increases the size of the window until it encompasses the whole series.
To illustrate expanding
,look at the following example:
df.test
2020-05-05 0
2020-05-06 1
2020-05-07 2
2020-05-08 3
2020-05-09 4
2020-05-10 5
2020-05-11 6
2020-05-12 7
2020-05-13 8
2020-05-14 9
2020-05-15 10
2020-05-16 11
2020-05-17 12
2020-05-18 13
2020-05-19 14
2020-05-20 15
2020-05-21 16
2020-05-22 17
2020-05-23 18
2020-05-24 19
Freq: D, Name: test, dtype: int32
df.test.expanding().mean()
2020-05-05 0.0
2020-05-06 0.5
2020-05-07 1.0
2020-05-08 1.5
2020-05-09 2.0
2020-05-10 2.5
2020-05-11 3.0
2020-05-12 3.5
2020-05-13 4.0
2020-05-14 4.5
2020-05-15 5.0
2020-05-16 5.5
2020-05-17 6.0
2020-05-18 6.5
2020-05-19 7.0
2020-05-20 7.5
2020-05-21 8.0
2020-05-22 8.5
2020-05-23 9.0
2020-05-24 9.5
Freq: D, Name: test, dtype: float64
Compared with rolling
,expanding
's window is varialble like cumsum, and rolling
's window is fixed.
expanding_mean=appl_std250.expanding().mean()
expanding_mean
2003-01-02 NaN
2003-01-03 NaN
2003-01-06 NaN
2003-01-07 NaN
2003-01-08 NaN
...
2011-10-10 18.521201
2011-10-11 18.524272
2011-10-12 18.527385
2011-10-13 18.530554
2011-10-14 18.533823
Freq: B, Name: AAPL, Length: 2292, dtype: float64
close_px.rolling(60).mean().plot(logy=True)
<matplotlib.axes._subplots.AxesSubplot at 0x14f240f87f0>
The rolling
function also accepts a string indicating a fixed-size time offset rather than a set number of period.Using this notation can be useful for irregular time series.These are the same string that can be passed to resample.
ser=pd.Series(np.arange(6),pd.date_range('2020-05-05',periods=6));ser
2020-05-05 0
2020-05-06 1
2020-05-07 2
2020-05-08 3
2020-05-09 4
2020-05-10 5
Freq: D, dtype: int32
ser.rolling('2D').mean()
2020-05-05 0.0
2020-05-06 0.5
2020-05-07 1.5
2020-05-08 2.5
2020-05-09 3.5
2020-05-10 4.5
Freq: D, dtype: float64
ser.rolling('D').mean()
2020-05-05 0.0
2020-05-06 1.0
2020-05-07 2.0
2020-05-08 3.0
2020-05-09 4.0
2020-05-10 5.0
Freq: D, dtype: float64
close_px.rolling('20D').mean()
AAPL | MSFT | XOM | |
---|---|---|---|
2003-01-02 | 7.400000 | 21.110000 | 29.220000 |
2003-01-03 | 7.425000 | 21.125000 | 29.230000 |
2003-01-06 | 7.433333 | 21.256667 | 29.473333 |
2003-01-07 | 7.432500 | 21.425000 | 29.342500 |
2003-01-08 | 7.402000 | 21.402000 | 29.240000 |
... | ... | ... | ... |
2011-10-10 | 389.351429 | 25.602143 | 72.527857 |
2011-10-11 | 388.505000 | 25.674286 | 72.835000 |
2011-10-12 | 388.531429 | 25.810000 | 73.400714 |
2011-10-13 | 388.826429 | 25.961429 | 73.905000 |
2011-10-14 | 391.038000 | 26.048667 | 74.185333 |
2292 rows × 3 columns
Exponentially weighted functions
An alternative to using a static window size with equally weighted observations is to sepcify a constant decay factor to give more weight to more recent observations.There are a couple of ways to specify the decay factor. A popular one is using a span,which makes the result comparable to a simple moving window function with window size equal to the span.
Pandas has the ewm
operator to go along with rolling
and expanding
.Here is an example comparing a 60-day moving average of Apple's stock price with an EW moving average with span
=60:
appl_px=close_px.AAPL['2006':'2007']
ma60=appl_px.rolling(30,min_periods=20).mean()
ewma60=appl_px.ewm(span=30).mean()
ma60.plot(label='simple MA')
<matplotlib.axes._subplots.AxesSubplot at 0x14f24293390>
ewma60.plot(label='EW MA')
<matplotlib.axes._subplots.AxesSubplot at 0x14f24201128>
User-defined moving window functions
The apply
method on rolling
and related methods provides a means to apply an array function of your own devising over a moving window.The only requirement is that the funciton produce a single value(a reduction) from each piece of the array.For example,while we can compute sample quantiles using rolling(...)quantile(q)
,we might be interested in the percentile rank of a particular value over the sample.
from scipy.stats import percentileofscore
score_at_2percent=lambda x:percentileofscore(x,0.02)
results=close_px.pct_change()
result=results.AAPL.rolling(250).apply(score_at_2percent)
C:Users旺仔QQ糖AppDataRoamingPythonPython36site-packagesipykernel\__main__.py:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
if __name__ == '__main__':
result.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x14f2608ec88>