• HDF5 files, the tqdm module, nunique, read_csv, sort_values, astype, fillna


    pandas.DataFrame.to_hdf(self, path_or_buf, key, **kwargs):

    Writes the data to an HDF5 file using Hierarchical Data Format (HDF). To add another DataFrame or Series to an existing HDF file, use append mode and a different key.

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
    df.to_hdf('data.h5', key='df', mode='w', format='table')
    # format : {'fixed', 'table'}, default 'fixed'
    # 'fixed': Fixed format. Fast writing/reading. Not appendable, nor searchable.
    # 'table': Table format. Written as a PyTables Table structure, which may perform worse but allows more flexible operations such as searching / selecting subsets of the data.

    # The default mode is 'a' (append), so a second object can be stored under a different key.
    s = pd.Series([1, 2, 3, 4])
    s.to_hdf('data.h5', key='s')

    pd.read_hdf('data.h5', 'df')
    pd.read_hdf('data.h5', 's')

    Showing a progress bar with the tqdm module:

    tqdm(self, iterable=None, desc=None, total=None, leave=True, file=None, ncols=None, mininterval=0.1, maxinterval=10.0, miniters=None, ascii=None, disable=False, unit='it', unit_scale=False, dynamic_ncols=False, smoothing=0.3, bar_format=None, initial=0, position=None, postfix=None, unit_divisor=1000, write_bytes=None, gui=False, **kwargs)

    iterable : iterable, optional. The iterable to decorate with a progress bar.

    total : int, optional. The number of expected iterations. If unspecified, len(iterable) is used if possible. 

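    import time
    from tqdm import tqdm

    wday, hour = [], []  # weekday and hour of each request
    for x in tqdm(train_df['request_timestamp'].values, total=len(train_df)):
        localtime = time.localtime(x)  # x is seconds since the epoch
        wday.append(localtime[6])      # tm_wday, 0 = Monday
        hour.append(localtime[3])      # tm_hour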
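    The `total` argument matters most when the iterable has no `len()`, for example a generator. The following is a minimal sketch of that case; the `squares` generator and the variable names are made up for illustration and are not from the original notes.

```python
from tqdm import tqdm


def squares(n):
    """Hypothetical generator: it has no len(), so tqdm cannot infer a total."""
    for i in range(n):
        yield i * i


n = 1000
results = []
# Passing total=n lets tqdm draw a full bar with percentage and ETA
# instead of only a running iteration counter.
for value in tqdm(squares(n), total=n, desc="squares"):
    results.append(value)
```

    Without `total`, the loop still runs, but tqdm can only report the count and rate.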

     https://lorexxar.cn/2016/07/21/python-tqdm/

    https://tqdm.github.io/docs/tqdm/

    pandas.DataFrame.nunique: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

    DataFrame.nunique(self, axis=0, dropna=True)

    Count distinct observations over requested axis. Return Series with number of distinct observations. Can ignore NaN values.

    >>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
    >>> df
       A  B
    0  1  1
    1  2  1
    2  3  1
    >>> df.nunique()
    A    3
    B    1
    dtype: int64
    >>> df.nunique(axis=1)
    0    1
    1    2
    2    2
    dtype: int64
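    The example above contains no missing values, so it does not show what `dropna` does. A small sketch with NaN in the data (the frame below is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 1.0], "B": [1.0, 2.0, np.nan]})

# Default dropna=True: NaN is not counted as a distinct value.
ignoring_nan = df.nunique()

# dropna=False: NaN is counted as one more distinct value per column.
counting_nan = df.nunique(dropna=False)

print(ignoring_nan)
print(counting_nan)
```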

    pandas.read_csv:

    Common parameters of pandas.read_csv(...):

    sep : str, default ','. Delimiter to use.

    header : int, list of int, default 'infer'. Row number(s) to use as the column names, and the start of the data. By default the column names are inferred: if no names are passed, the behavior is identical to header=0 and the column names are taken from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None.

    names : array-like, optional. List of column names to use. Duplicates in this list are not allowed.

    df = pd.read_csv('data/testA/totalExposureLog.out', sep='\t', names=['id', 'request_timestamp', 'position', 'uid', 'aid', 'imp_ad_size', 'bid', 'pctr', 'quality_ecpm', 'totalEcpm'])
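    To see how `header` and `names` interact without the real log file, here is a small sketch that reads tab-separated text from memory. The sample rows and the shortened column list are invented for illustration.

```python
import io

import pandas as pd

# File without a header row: passing names makes read_csv behave like
# header=None, so the first line is kept as data.
raw_no_header = "1\t1550000000\t10\n2\t1550000060\t20\n"
df_named = pd.read_csv(
    io.StringIO(raw_no_header),
    sep="\t",
    names=["id", "request_timestamp", "position"],
)

# File with a header row: with no names passed, header is inferred as 0
# and the first line becomes the column names.
raw_with_header = "id\tposition\n1\t10\n2\t20\n"
df_inferred = pd.read_csv(io.StringIO(raw_with_header), sep="\t")

print(df_named)
print(df_inferred)
```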

    pandas.DataFrame.sort_values:

    DataFrame.sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
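    # axis: default 0 ('index'), which sorts the rows; with axis=1 ('columns') the columns are reordered instead
    # by: a str or list of str naming the labels to sort by along the chosen axis
    # ascending: default True (ascending order); pass False to sort in descending order
    # kind: the sorting algorithm, default 'quicksort'; 'mergesort' and 'heapsort' are also accepted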
    
    df.sort_values(by='col1')
    df.sort_values(by=['col1', 'col2'])
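    A short sketch of the parameters described above, using a made-up frame: `ascending` may be a list with one entry per sort key, `na_position` controls where NaN rows go, and with the default `inplace=False` the original frame is left untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["b", "a", "b", "a"], "col2": [1, 2, 3, 4]})

# Sort by col1 ascending, then by col2 descending within each col1 group.
sorted_df = df.sort_values(by=["col1", "col2"], ascending=[True, False])

# NaN rows go last by default; na_position='first' puts them at the top.
df_nan = pd.DataFrame({"x": [2.0, np.nan, 1.0]})
nan_first = df_nan.sort_values(by="x", na_position="first")

print(sorted_df)
print(nan_first)
```

    The original row labels travel with the rows; call `reset_index(drop=True)` on the result if a fresh 0..n-1 index is wanted.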

    pandas.DataFrame.astype:

    DataFrame.astype(self, dtype, copy=True, errors='raise', **kwargs)
    # dtype : data type, or dict of column name -> data type
    # Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
    
    d = {'col1': [1, 2], 'col2': [3, 4]}
    df = pd.DataFrame(data=d)
    df.dtypes
    
    df.astype('int32').dtypes
    df.astype({'col1': 'int32'}).dtypes
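    A common practical use of the dict form is cleaning up columns that were read as strings or floats. The sketch below uses invented data and also shows that casting float to int truncates the fractional part rather than rounding.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["1", "2"], "col2": [3.7, 4.2]})

# Cast each column to its own target type; astype returns a new object.
converted = df.astype({"col1": "int64", "col2": "int64"})

print(converted.dtypes)
print(converted)
```

    If rounding is wanted instead of truncation, calling `round()` before `astype` is one option.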

    pandas.DataFrame.fillna: 

    DataFrame.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
    # fillna() fills NaN values and returns the filled result. To modify the original DataFrame in place, set inplace=True
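    A minimal sketch of the behavior described above, on an invented frame: a scalar `value` fills every NaN, a dict fills per column, and `inplace=True` modifies the original object instead of returning a copy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

# Scalar value: every NaN becomes 0; df itself is unchanged.
filled = df.fillna(0)

# Dict value: a different fill per column (here -1 for 'a', the mean for 'b').
per_col = df.fillna({"a": -1, "b": df["b"].mean()})

missing_before = int(df.isna().sum().sum())

# inplace=True modifies df directly and returns None.
df.fillna(0, inplace=True)
missing_after = int(df.isna().sum().sum())
```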
  • Original post: https://www.cnblogs.com/ljygoodgoodstudydaydayup/p/11420596.html