1. 初始化
(1)生成简单序列pd.Series
>>>s = pd.Series([1,3,5,np.nan,6,8]) >>>s 0 1.0 1 3.0 2 5.0 3 NaN #注意空 4 6.0 5 8.0 dtype: float64
(2)生成日期序列pd.date_range
>>>dates = pd.date_range('20130101', periods=6) >>> dates DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
(3)结构
>>>df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) # index 表示序号,columns表示列名称 >>> df A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 2013-01-05 -0.424972 0.567020 0.276232 -1.087401 2013-01-06 -0.673690 0.113648 -1.478427 0.524988
>>>: df2 = pd.DataFrame({ 'A' : 1., ....: 'B' : pd.Timestamp('20130102'), ....: 'C' : pd.Series(1,index=list(range(4)),dtype='float32'), ....: 'D' : np.array([3] * 4,dtype='int32'), ....: 'E' : pd.Categorical(["test","train","test","train"]), ....: 'F' : 'foo' }) ....: >>>: df2 A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo
2. 观察数据
(1)前n个(head),后n个(tail)
>>> df.head(2) A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 >>> df.tail(3) A B C D 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 2013-01-05 -0.424972 0.567020 0.276232 -1.087401 2013-01-06 -0.673690 0.113648 -1.478427 0.524988
(2)展示序号(index)、列号(columns)、值(values)
>>>df.index DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') >>> df.columns Index(['A', 'B', 'C', 'D'], dtype='object') >>> df.values array([[ 0.4691, -0.2829, -1.5091, -1.1356], [ 1.2121, -0.1732, 0.1192, -1.0442], [-0.8618, -2.1046, -0.4949, 1.0718], [ 0.7216, -0.7068, -1.0396, 0.2719], [-0.425 , 0.567 , 0.2762, -1.0874], [-0.6737, 0.1136, -1.4784, 0.525 ]])
(3)快速数据统计describe
>>>df.describe() A B C D count 6.000000 6.000000 6.000000 6.000000 mean 0.073711 -0.431125 -0.687758 -0.233103 std 0.843157 0.922818 0.779887 0.973118 min -0.861849 -2.104569 -1.509059 -1.135632 25% -0.611510 -0.600794 -1.368714 -1.076610 50% 0.022070 -0.228039 -0.767252 -0.386188 75% 0.658444 0.041933 -0.034326 0.461706 max 1.212112 0.567020 0.276232 1.071804
(4)转置df.T
(5)按轴排序
降序:ascending=False
升序:ascending=True
横轴: df.sort_index(axis=1, ascending=False)
纵轴: df.sort_index(axis=0, ascending=False)
>>>df.sort_index(axis=1, ascending=False) D C B A 2013-01-01 -1.135632 -1.509059 -0.282863 0.469112 2013-01-02 -1.044236 0.119209 -0.173215 1.212112 2013-01-03 1.071804 -0.494929 -2.104569 -0.861849 2013-01-04 0.271860 -1.039575 -0.706771 0.721555 2013-01-05 -1.087401 0.276232 0.567020 -0.424972 2013-01-06 0.524988 -1.478427 0.113648 -0.673690
(6)按值排序
>>> df.sort_values(by='B') A B C D 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-06 -0.673690 0.113648 -1.478427 0.524988 2013-01-05 -0.424972 0.567020 0.276232 -1.087401
3. 选择, 与matlab类似
选择某列( df.A ==
df['A'])
选择某个区间(df[0:3])
按标签选择(df.loc[dates[0]])
4. 数据缺失
用nan表示
舍去丢失数据的行 df.dropna(how='any')
补全丢失的数据 df.fillna(value=5)
判断是否缺失数据 pd.isna(df1)
5. 统计
求平均值 df.mean()
6. 使用函数
>>>df.apply(lambda x: x.max() - x.min()) A 2.073961 B 2.671590 C 1.785291 D 0.000000 F 4.000000 dtype: float64