• 【DW·智慧海洋 (Fishing Operation Analysis) Check-in】 task03: Feature Engineering (reproducing the feature engineering of the top solutions: binning features, grid features, statistical features, and Embedding features)


    Open-source code on GitHub: https://github.com/datawhalechina/team-learning

    Learning goals

    1. Learn the basic concepts of feature engineering

    2. Study how the topline code constructs its features and build meaningful features yourself

    3. Complete the corresponding check-in task

    Contents

    1. Feature engineering overview

    2. Competition feature engineering

      • Domain features, built from prior (business) knowledge
    3. Binning features

      • Binning features for v, x, and y
      • Binning x and y and constructing grid regions
    4. DataFrame features

      • count values
      • shift offsets
      • statistical features
    5. Embedding features

      • Building word vectors with Word2vec
      • Extracting topic distributions of text with NMF
    6. Summary

    Feature engineering overview

    Feature engineering can be roughly divided into three parts: feature construction, feature extraction, and feature selection. (A minimal sketch of the common construction patterns follows the list below.)

    • Feature construction
      • Exploratory data analysis
      • Numerical features
      • Categorical features
      • Time features
      • Text features
    • Feature extraction and feature selection
      • Simplification
      • Better performance
      • Better generalization / lower risk of overfitting
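    A minimal, self-contained sketch of the numerical / categorical / time construction patterns listed above, on a hypothetical toy DataFrame (the column names price, city, ts are made up purely for illustration):

    import numpy as np
    import pandas as pd

    # Hypothetical toy data
    toy = pd.DataFrame({
        'price': [10.0, 25.5, 13.2, 40.0],
        'city':  ['A', 'B', 'A', 'C'],
        'ts':    pd.to_datetime(['2020-01-02 08:00', '2020-01-02 21:00',
                                 '2020-01-03 09:30', '2020-01-05 23:10']),
    })

    # Numerical feature: a simple nonlinear transform
    toy['log_price'] = np.log1p(toy['price'])

    # Categorical feature: frequency encoding
    toy['city_count'] = toy['city'].map(toy['city'].value_counts())

    # Time features: decompose the timestamp
    toy['hour'] = toy['ts'].dt.hour
    toy['weekday'] = toy['ts'].dt.weekday
    toy['is_night'] = (~toy['hour'].between(6, 19)).astype(int)

    print(toy)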
    import gc
    import multiprocessing as mp
    import os
    import pickle
    import time
    import warnings
    from collections import Counter
    from copy import deepcopy
    from datetime import datetime
    from functools import partial
    from glob import glob
    
    import geopandas as gpd
    import lightgbm as lgb
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    from gensim.models import FastText, Word2Vec
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from pyproj import Proj
    from scipy import sparse
    from scipy.sparse import csr_matrix
    from sklearn import metrics
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import NMF, TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics import f1_score, precision_recall_fscore_support
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import LabelEncoder
    from tqdm import tqdm
    
    os.environ['PYTHONHASHSEED'] = '0'
    warnings.filterwarnings('ignore')
    
    
    
    # Collect rows in a list instead of appending to a DataFrame row by row, which is much faster
    def get_data(file_path,max_lines = 2000):
        paths = os.listdir(file_path)
        tmp = []
        for t in tqdm(range(len(paths))):
            if len(tmp) > max_lines:break
    
            p = paths[t]
            with open('{}/{}'.format(file_path, p), encoding='utf-8') as f:
                next(f)
                for line in f.readlines():
                    tmp.append(line.strip().split(','))
                    if len(tmp) > max_lines:break
    
        tmp_df = pd.DataFrame(tmp)
        tmp_df.columns = ['渔船ID', 'x', 'y', '速度', '方向', 'time', 'type']
        return tmp_df
    
    TRAIN_PATH = "E:/competition-data/017_wisdomOcean/hy_round1_train_20200102/"
    
    # number of rows to sample
    max_lines = 2000
    df = get_data(TRAIN_PATH, max_lines=max_lines)
    
      0%|                                                                               | 6/7000 [00:00<00:02, 2999.86it/s]
    
    # Basic preprocessing
    label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}
    label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}
    name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label'}
    
    df.rename(columns = name_dict, inplace = True)
    df['label'] = df['label'].map(label_dict1)
    cols = ['x','y','v']
    for col in cols:
        df[col] = df[col].astype('float')
    df['dir'] = df['dir'].astype('int')
    df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')
    df['date'] = df['time'].dt.date
    df['hour'] = df['time'].dt.hour
    df['month'] = df['time'].dt.month
    df['weekday'] = df['time'].dt.weekday
    df.head()
    
    
    id x y v dir time label date hour month weekday
    0 0 6.152038e+06 5.124873e+06 2.59 102 1900-11-10 11:58:19 0 1900-11-10 11 11 5
    1 0 6.151230e+06 5.125218e+06 2.70 113 1900-11-10 11:48:19 0 1900-11-10 11 11 5
    2 0 6.150421e+06 5.125563e+06 2.70 116 1900-11-10 11:38:19 0 1900-11-10 11 11 5
    3 0 6.149612e+06 5.125907e+06 3.29 95 1900-11-10 11:28:19 0 1900-11-10 11 11 5
    4 0 6.148803e+06 5.126252e+06 3.18 108 1900-11-10 11:18:19 0 1900-11-10 11 11 5

    Field descriptions:

    - id: fishing vessel ID, integer
    - x: recorded position, x-coordinate, float
    - y: recorded position, y-coordinate, float
    - v: recorded speed, float
    - dir: recorded heading, integer
    - time: timestamp, text
    - label: target label to predict, integer
    

    Competition feature engineering

    Compute the distance from each point's (x, y) coordinates to the reference point (6165599, 5202660)

    df['x_dis_diff'] = (df['x'] - 6165599).abs()
    df['y_dis_diff'] = (df['y'] - 5202660).abs()
    df['base_dis_diff'] = ((df['x_dis_diff']**2)+(df['y_dis_diff']**2))**0.5    
    del df['x_dis_diff'],df['y_dis_diff'] 
    df['base_dis_diff'].head()
    
    0    78959.780945
    1    78763.845006
    2    78577.185266
    3    78399.867568
    4    78231.955018
    Name: base_dis_diff, dtype: float64
    

    Split the hour of day into daytime and nighttime: hours 6-19 are daytime (1), the rest are nighttime (0)

    df['day_nig'] = 0
    df.loc[(df['hour'] > 5) & (df['hour'] < 20),'day_nig'] = 1
    df['day_nig'].head()
    
    0    1
    1    1
    2    1
    3    1
    4    1
    Name: day_nig, dtype: int64
    

    Derive the quarter from the month

    # Quarter
    df['quarter'] = 0
    df.loc[(df['month'].isin([1, 2, 3])), 'quarter'] = 1
    df.loc[(df['month'].isin([4, 5, 6, ])), 'quarter'] = 2
    df.loc[(df['month'].isin([7, 8, 9])), 'quarter'] = 3
    df.loc[(df['month'].isin([10, 11, 12])), 'quarter'] = 4
    

    Features for dynamic speed, speed change, heading change, x/y similarity, etc., plus a 16-sector split of the heading

    temp = df.copy()
    temp.rename(columns={'id':'ship', 'dir':'d'},inplace=True)
    
    # Assign a speed level
    def v_cut(v):
        if v < 0.1:
            return 0
        elif v < 0.5:
            return 1
        elif v < 1:
            return 2
        elif v < 2.5:
            return 3
        elif v < 5:
            return 4
        elif v < 10:
            return 5
        elif v < 20:
            return 6   # the original code also returned 5 here, collapsing the 10-20 band into level 5
        else:
            return 7
    # Count the number of records at each speed level per ship
    def get_v_fea(df):
        df['v_cut'] = df['v'].apply(lambda x: v_cut(x))
        tmp = df.groupby(['ship', 'v_cut'], as_index=False)['v_cut'].agg({'v_cut_count': 'count'})
        # Build a pivot table
        tmp = tmp.pivot(index='ship', columns='v_cut', values='v_cut_count')
    
        new_col_nm = ['v_cut_' + str(col) for col in tmp.columns.tolist()]
        tmp.columns = new_col_nm
        tmp = tmp.reset_index()  # restore the index as a regular column
    
        return tmp
    
    c1 = get_v_fea(temp)
    

    Divide the heading into 16 equal sectors

    def add_direction(df):
        df['d16'] = df['d'].apply(lambda x: int((x / 22.5) + 0.5) % 16 if not np.isnan(x) else np.nan)
        return df
    def get_d_cut_count_fea(df):
        df = add_direction(df)
        tmp = df.groupby(['ship', 'd16'], as_index=False)['d16'].agg({'d16_count': 'count'})
        tmp = tmp.pivot(index='ship', columns='d16', values='d16_count')
        new_col_nm = ['d16_' + str(col) for col in tmp.columns.tolist()]
        tmp.columns = new_col_nm
        tmp = tmp.reset_index()
        return tmp
    
    c2 = get_d_cut_count_fea(temp)
    

    Count zero-speed records and compute statistics over the non-zero speeds

    def get_v0_fea(df):
        # number of zero-speed records, plus statistics of the non-zero speeds
        df_zero_count = df.query("v==0")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(
            {'num_zero_v': 'count'})
        df_not_zero_agg = df.query("v!=0")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(
            {'v_max_drop_0': 'max',
             'v_min_drop_0': 'min',
             'v_mean_drop_0': 'mean',
             'v_std_drop_0': 'std',
             'v_median_drop_0': 'median',
             'v_skew_drop_0': 'skew'})
        tmp = df_zero_count.merge(df_not_zero_agg, on='ship', how='left')
    
        return tmp
    
    c3 = get_v0_fea(temp)
    

    Percentile features

    def get_percentiles_fea(df_raw):
        key = ['x', 'y', 'v', 'd']
        temp = df_raw[['ship']].drop_duplicates('ship')
        for i in range(len(key)):
            # add the median and various percentiles of x, y, v, d
            tmp_dscb = df_raw.groupby('ship')[key[i]].describe(
                percentiles=[0.05] + [ii / 1000 for ii in range(125, 1000, 125)] + [0.95])
            raw_col_nm = tmp_dscb.columns.tolist()
            new_col_nm = [key[i] + '_' + col for col in raw_col_nm]
            tmp_dscb.columns = new_col_nm
            tmp_dscb = tmp_dscb.reset_index()
            # drop the redundant summary statistics
            tmp_dscb = tmp_dscb.drop([f'{key[i]}_count', f'{key[i]}_mean', f'{key[i]}_std',
                                      f'{key[i]}_min', f'{key[i]}_max'], axis=1)
    
            temp = temp.merge(tmp_dscb, on='ship', how='left')
        return temp
    
    c4 = get_percentiles_fea(temp)
    

    Compute the deltas between consecutive points (turning rate of the track and of the reported heading)

    def get_d_change_rate_fea(df):
        import math
        temp = df.copy()
        # Sort by ship and time
        temp.sort_values(['ship', 'time'], ascending=True, inplace=True)
        # Use shift to get neighbouring rows; note what .shift(-1) and .shift(1) mean
        temp['timenext'] = temp.groupby('ship')['time'].shift(-1)
        temp['ynext'] = temp.groupby('ship')['y'].shift(-1)
        temp['xnext'] = temp.groupby('ship')['x'].shift(-1)
        # Fill the NaNs produced by shift: the last record of each ship has no "next"
        # point to compare with, so shift(-1) leaves it as NaN
        temp['ynext'] = temp['ynext'].fillna(method='ffill')
        temp['xnext'] = temp['xnext'].fillna(method='ffill')
        # Slope to the next point: (ynext - y) / (xnext - x), converted to an angle in degrees below
        temp['angle_next'] = (temp['ynext'] - temp['y']) / (temp['xnext'] - temp['x'])
        temp['angle_next'] = np.arctan(temp['angle_next']) / math.pi * 180
        temp['angle_next_next'] = temp['angle_next'].shift(-1)
        temp['timediff'] = np.abs(temp['timenext'] - temp['time'])
        temp['timediff'] = temp['timediff'].fillna(method='ffill')
        temp['hc_xy'] = abs(temp['angle_next_next'] - temp['angle_next'])
        # For heading changes above 180 degrees, use 360 minus the value (take the smaller turning angle)
        temp.loc[temp['hc_xy'] > 180, 'hc_xy'] = (360 - temp.loc[temp['hc_xy'] > 180, 'hc_xy'])
        temp['hc_xy_s'] = temp.apply(lambda x: x['hc_xy'] / x['timediff'].total_seconds(), axis=1)
    
        temp['d_next'] = temp.groupby('ship')['d'].shift(-1)
        temp['hc_d'] = abs(temp['d_next'] - temp['d'])
        temp.loc[temp['hc_d'] > 180, 'hc_d'] = 360 - temp.loc[temp['hc_d'] > 180, 'hc_d']
        temp['hc_d_s'] = temp.apply(lambda x: x['hc_d'] / x['timediff'].total_seconds(), axis=1)
    
        temp1 = temp[['ship', 'hc_xy_s', 'hc_d_s']]
        xy_d_rate = temp1.groupby('ship')['hc_xy_s'].agg([('hc_xy_s_max', 'max')])
        xy_d_rate = xy_d_rate.reset_index()
        d_d_rate = temp1.groupby('ship')['hc_d_s'].agg([('hc_d_s_max', 'max')])
        d_d_rate = d_d_rate.reset_index()
    
        tmp = xy_d_rate.merge(d_d_rate, on='ship', how='left')
        return tmp
    
    c5 = get_d_change_rate_fea(temp)
    c5
    
    ship hc_xy_s_max hc_d_s_max
    0 0 0.183673 0.188020
    1 1 0.293241 0.340426
    2 10 0.223041 0.341176
    3 100 0.282311 0.300000
    4 1000 0.969970 0.717172
    5 1001 0.078424 0.270903
    f1 = temp.merge(c1, on='ship',how='left')
    f1 = f1.merge(c2, on='ship',how='left')
    f1 = f1.merge(c3, on='ship',how='left')
    f1 = f1.merge(c4, on='ship',how='left')
    f1 = f1.merge(c5, on='ship',how='left')
    f1
    
    ship x y v d time label date hour month ... d_12.5% d_25% d_37.5% d_50% d_62.5% d_75% d_87.5% d_95% hc_xy_s_max hc_d_s_max
    0 0 6.152038e+06 5.124873e+06 2.59 102 1900-11-10 11:58:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
    1 0 6.151230e+06 5.125218e+06 2.70 113 1900-11-10 11:48:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
    2 0 6.150421e+06 5.125563e+06 2.70 116 1900-11-10 11:38:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
    3 0 6.149612e+06 5.125907e+06 3.29 95 1900-11-10 11:28:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
    4 0 6.148803e+06 5.126252e+06 3.18 108 1900-11-10 11:18:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    1996 1001 6.246323e+06 5.241154e+06 0.11 0 1900-11-17 09:43:41 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903
    1997 1001 6.246323e+06 5.241154e+06 0.22 10 1900-11-17 09:34:10 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903
    1998 1001 6.246323e+06 5.241154e+06 0.11 0 1900-11-17 09:23:39 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903
    1999 1001 6.246323e+06 5.241154e+06 0.11 287 1900-11-17 09:13:40 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903
    2000 1001 6.246323e+06 5.241154e+06 0.32 271 1900-11-17 09:04:02 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903

    2001 rows × 83 columns

    Binning features

    Binning features for v, x, and y, plus statistics over the bins

    pre_cols = df.columns
    
    df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop')  # quantile-bin speed into 200 bins
    df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique()))))  # encode the bins as integers
    for f in ['x', 'y']:
        df[f + '_bin1'] = pd.qcut(df[f], 1000, duplicates='drop')  # quantile-bin x, y into 1000 bins
        df[f + '_bin1'] = df[f + '_bin1'].map(dict(zip(df[f + '_bin1'].unique(), range(df[f + '_bin1'].nunique()))))  # encode
        df[f + '_bin2'] = df[f] // 10000  # coarse bin by floor division
        df[f + '_bin1_count'] = df[f + '_bin1'].map(df[f + '_bin1'].value_counts())  # record count of each fine bin
        df[f + '_bin2_count'] = df[f + '_bin2'].map(df[f + '_bin2'].value_counts())  # record count of each coarse bin
        df[f + '_bin1_id_nunique'] = df.groupby(f + '_bin1')['id'].transform('nunique')  # distinct ships per fine bin
        df[f + '_bin2_id_nunique'] = df.groupby(f + '_bin2')['id'].transform('nunique')  # distinct ships per coarse bin
    for i in [1, 2]:
        # Cross x_bin and y_bin into a combined category, encode it, and map each category's count back onto the rows
        df['x_y_bin{}'.format(i)] = df['x_bin{}'.format(i)].astype('str') + '_' + df['y_bin{}'.format(i)].astype('str')
        df['x_y_bin{}'.format(i)] = df['x_y_bin{}'.format(i)].map(
            dict(zip(df['x_y_bin{}'.format(i)].unique(), range(df['x_y_bin{}'.format(i)].nunique())))
        )
        df['x_bin{}_y_bin{}_count'.format(i, i)] = df['x_y_bin{}'.format(i)].map(df['x_y_bin{}'.format(i)].value_counts())
    for stat in ['max', 'min']:
        # Offset of each point from the max/min y within its x_bin1, and from the max/min x within its y_bin1
        df['x_y_{}'.format(stat)] = df['y'] - df.groupby('x_bin1')['y'].transform(stat)
        df['y_x_{}'.format(stat)] = df['x'] - df.groupby('y_bin1')['x'].transform(stat)
    
    new_cols = [i for i in df.columns if i not in pre_cols]
    df[new_cols].head()
    
    v_bin x_bin1 x_bin2 x_bin1_count x_bin2_count x_bin1_id_nunique x_bin2_id_nunique y_bin1 y_bin2 y_bin1_count ... y_bin1_id_nunique y_bin2_id_nunique x_y_bin1 x_bin1_y_bin1_count x_y_bin2 x_bin2_y_bin2_count x_y_max y_x_max x_y_min y_x_min
    0 0.0 0 615.0 116 8 2 2 0 512.0 2 ... 2 1 0 1 0 3 -115954.675157 0.000000 0.000000 49790.106760
    1 0.0 1 615.0 2 8 2 2 1 512.0 2 ... 1 1 1 1 0 3 0.000000 0.000000 53070.048324 808.872353
    2 0.0 2 615.0 2 8 2 2 1 512.0 2 ... 1 1 2 1 0 3 0.000000 -808.872353 54707.512092 0.000000
    3 1.0 3 614.0 2 77 2 2 2 512.0 2 ... 1 1 3 1 1 8 0.000000 0.000000 52951.293120 808.787673
    4 2.0 4 614.0 2 77 2 2 2 512.0 2 ... 1 1 4 1 1 8 0.000000 -808.787673 55461.653028 0.000000

    5 rows × 21 columns

    Bin x and y and construct (grid) regions

    def traj_to_bin(traj=None, x_min=12031967.16239096, x_max=14226964.881853,
                    y_min=1623579.449434373, y_max=4689471.1780792,
                    row_bins=4380, col_bins=3136):
    
        # Establish bins on x direction and y direction
        x_bins = np.linspace(x_min, x_max, endpoint=True, num=col_bins + 1)
        y_bins = np.linspace(y_min, y_max, endpoint=True, num=row_bins + 1)
    
        # Determine which bin each x coordinate belongs to
        traj.sort_values(by='x', inplace=True)
        x_res = np.zeros((len(traj), ))
        j = 0
        for i in range(1, col_bins + 1):
            low, high = x_bins[i-1], x_bins[i]
            while( j < len(traj)):
                # low - 0.001 for numeric stable.
                if (traj["x"].iloc[j] <= high) & (traj["x"].iloc[j] > low - 0.001):
                    x_res[j] = i
                    j += 1
                else:
                    break
        traj["x_grid"] = x_res
        traj["x_grid"] = traj["x_grid"].astype(int)
        traj["x_grid"] = traj["x_grid"].apply(str)
    
        # Determine which bin each y coordinate belongs to
        traj.sort_values(by='y', inplace=True)
        y_res = np.zeros((len(traj), ))
        j = 0
        for i in range(1, row_bins + 1):
            low, high = y_bins[i-1], y_bins[i]
            while( j < len(traj)):
                # low - 0.001 for numeric stable.
                if (traj["y"].iloc[j] <= high) & (traj["y"].iloc[j] > low - 0.001):
                    y_res[j] = i
                    j += 1
                else:
                    break
        traj["y_grid"] = y_res
        traj["y_grid"] = traj["y_grid"].astype(int)
        traj["y_grid"] = traj["y_grid"].apply(str)
    
        # Determine which bin each coordinate belongs to.
        traj["no_bin"] = [i + "_" + j for i, j in zip(
            traj["x_grid"].values.tolist(), traj["y_grid"].values.tolist())]
        traj.sort_values(by='time', inplace=True)
        return traj
    
    bin_size = 800
    col_bins = int((14226964.881853 - 12031967.16239096) / bin_size)
    row_bins = int((4689471.1780792 - 1623579.449434373) / bin_size)
    
    pre_cols = df.columns
    # Adds the features x_grid, y_grid, no_bin.
    # Note: the default x/y range in traj_to_bin comes from the original topline solution (a different
    # coordinate range); the sampled data here lies outside it, so every point falls into bin 0_0.
    df = traj_to_bin(df, col_bins=col_bins, row_bins=row_bins)
    
    new_cols = [i for i in df.columns if i not in pre_cols]
    df[new_cols]
    
    x_grid y_grid no_bin
    1606 0 0 0_0
    1605 0 0 0_0
    1604 0 0 0_0
    1603 0 0 0_0
    1602 0 0 0_0
    ... ... ... ...
    1988 0 0 0_0
    1987 0 0 0_0
    1986 0 0 0_0
    1985 0 0 0_0
    1984 0 0 0_0

    2001 rows × 3 columns

    DataFrame features

    count values

    def find_save_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
        """Find and save the visit frequency of each bin."""
        visit_count_df = traj_data_df.groupby(["no_bin"]).count().reset_index()
        visit_count_df = visit_count_df[["no_bin", "x"]]
        visit_count_df.rename({"x":"visit_count"}, axis=1, inplace=True)
        return visit_count_df
    
    def find_save_unique_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
        """Find and save the unique boat visit count of each bin."""
        unique_boat_count_df = traj_data_df.groupby(["no_bin"])["id"].nunique().reset_index()
        unique_boat_count_df.rename({"id":"visit_boat_count"}, axis=1, inplace=True)
    
        unique_boat_count_df_save = pd.merge(bin_to_coord_df, unique_boat_count_df,
                                             on="no_bin", how="left")
        return unique_boat_count_df
    
    traj_df = df[["id","x", "y",'time',"no_bin"]]
    bin_to_coord_df = traj_df.groupby(["no_bin"]).median().reset_index()
    bin_to_coord_df
    
    no_bin x y
    0 0_0 6.124951e+06 5.130672e+06
    pre_cols = df.columns
    
    # DataFrame tmp for finding POIs
    visit_count_df = find_save_visit_count_table(
        traj_df, bin_to_coord_df)
    unique_boat_count_df = find_save_unique_visit_count_table(
        traj_df, bin_to_coord_df)
    
    # Features: 'visit_count', 'visit_boat_count'
    df = df.merge(visit_count_df,on='no_bin',how='left')
    df = df.merge(unique_boat_count_df,on='no_bin',how='left')
    
    new_cols = [i for i in df.columns if i not in pre_cols]
    df[new_cols].head()
    
    visit_count visit_boat_count
    0 2001 6
    1 2001 6
    2 2001 6
    3 2001 6
    4 2001 6

    shift offset features

    • Shift the x, y coordinates in time by +1 and -1
      • From the triangle formed by a point, its previous point (shift 1) and its next point (shift -1), compute the three pairwise distances
      • Equal-frequency binning (50 bins) of the distance to the previous point
      • Integer-encode the resulting bins
    pre_cols = df.columns
    
    g = df.groupby('id')
    for f in ['x', 'y']:
        # shift the x, y coordinates by +1 and -1 in time
        df[f + '_prev_diff'] = df[f] - g[f].shift(1)
        df[f + '_next_diff'] = df[f] - g[f].shift(-1)
        df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)
    # distances to the previous point, to the next point, and between previous and next
    df['dist_move_prev'] = np.sqrt(np.square(df['x_prev_diff']) + np.square(df['y_prev_diff']))
    df['dist_move_next'] = np.sqrt(np.square(df['x_next_diff']) + np.square(df['y_next_diff']))
    df['dist_move_prev_next'] = np.sqrt(np.square(df['x_prev_next_diff']) + np.square(df['y_prev_next_diff']))
    df['dist_move_prev_bin'] = pd.qcut(df['dist_move_prev'], 50, duplicates='drop')  # 50-bin equal-frequency binning
    df['dist_move_prev_bin'] = df['dist_move_prev_bin'].map(
        dict(zip(df['dist_move_prev_bin'].unique(), range(df['dist_move_prev_bin'].nunique())))
    )  # integer-encode the bins
    
    new_cols = [i for i in df.columns if i not in pre_cols]
    df[new_cols].head()
    
    x_prev_diff x_next_diff x_prev_next_diff y_prev_diff y_next_diff y_prev_next_diff dist_move_prev dist_move_next dist_move_prev_next dist_move_prev_bin
    0 NaN -911.903731 NaN NaN 455.919062 NaN NaN 1019.524696 NaN NaN
    1 911.903731 -911.965576 -1823.869307 -455.919062 455.831205 911.750267 1019.524696 1019.540730 2039.065423 1.0
    2 911.965576 -918.791508 -1830.757085 -455.831205 20.360332 476.191538 1019.540730 919.017072 1891.673831 1.0
    3 918.791508 -597.354368 -1516.145877 -20.360332 993.131365 1013.491697 919.017072 1158.940097 1823.695078 2.0
    4 597.354368 -910.468269 -1507.822637 -993.131365 564.435006 1557.566370 1158.940097 1071.232628 2167.842730 3.0

    Statistical features

    Basic usage of aggregation-based statistical features

    Supplement:

    Grouped aggregation with agg is very important, so code examples are given here; for details see:
    http://joyfulpandas.datawhale.club/Content/ch4.html

    • Pay attention to the use of {} and []

    The standard groupby pattern:

    df.groupby(grouping key)[columns to aggregate].operation

    First group, obtaining

    gb = df.groupby(['School', 'Grade'])

    • [a] Apply one or several functions

    gb.agg(['method name (e.g. a built-in aggregation)'])

    e.g. gb.agg(['sum'])

    • [b] Apply specific aggregation functions to specific columns

    gb.agg({'column': 'method'})

    e.g. gb.agg({'Height': ['mean', 'max'], 'Weight': 'count'})

    • [c] Use a custom function

    gb.agg(function name or lambda)

    e.g. gb.agg(lambda x: x.mean() - x.min())

    • [d] Rename the aggregation results

    gb.agg([
    ('new name', method (built-in or custom function))
    ])

    e.g. gb.agg([('range', lambda x: x.max() - x.min()), ('my_sum', 'sum')])

    • [e] Specific aggregation functions per column + renaming
      gb.agg({'col1': [('range', lambda x: x.max() - x.min()), ('my_sum', 'sum')]})

    Also note that when applying a single aggregation to one or more columns with renaming, the (name, function) pair must be wrapped in square brackets; otherwise pandas cannot tell whether the string is a new name or a mistyped built-in function name.

    • The code below mainly uses two forms:

    one is df.groupby('id').agg({'column': 'method'}), the other is df.groupby('id')['column'].agg({...})
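    Before the competition code, a combined toy illustration of patterns [a]-[e] above (a minimal sketch on a hypothetical table in the spirit of the Joyful Pandas example):

    import pandas as pd

    demo = pd.DataFrame({
        'School': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Grade':  ['Freshman', 'Freshman', 'Senior', 'Freshman', 'Senior', 'Senior'],
        'Height': [158.9, 160.2, 167.5, 171.0, 162.2, 158.0],
        'Weight': [46, 51, 63, 69, 54, 50],
    })
    gb = demo.groupby(['School', 'Grade'])

    print(gb.agg(['sum']))                                         # [a] built-in function(s)
    print(gb.agg({'Height': ['mean', 'max'], 'Weight': 'count'}))  # [b] specific functions per column
    print(gb.agg(lambda x: x.mean() - x.min()))                    # [c] custom function
    print(gb.agg([('range', lambda x: x.max() - x.min()), ('my_sum', 'sum')]))              # [d] renaming
    print(gb.agg({'Height': [('range', lambda x: x.max() - x.min()), ('my_sum', 'sum')]}))  # [e] per column + renaming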

    pre_cols = df.columns
    
    def start(x):
        try:
            return x[0]
        except:
            return None
    
    def end(x):
        try:
            return x[-1]
        except:
            return None
    
    
    def mode(x):
        try:
            return pd.Series(x).value_counts().index[0]
        except:
            return None
    
    for f in ['dist_move_prev_bin', 'v_bin']:
        # join each ship's sequence of distance-bin / speed-bin categories into one comma-separated string
        df[f + '_sen'] = df['id'].map(df.groupby('id')[f].agg(lambda x: ','.join(x.astype(str))))

    # a batch of basic statistics: each column gets its own set of aggregations
    g = df.groupby('id').agg({
        'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],
        'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],
        'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],
        'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],
        'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],
        'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],
        'x_bin1_y_bin1_count': ['mean', 'max', 'min'],
        'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],
        'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],
        'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],
    }).reset_index()
    g.columns = ['_'.join(col).strip() for col in g.columns]  # flatten the MultiIndex column names
    g.rename(columns={'id_': 'id'}, inplace=True)  # rename 'id_' back to 'id'
    cols = [f for f in g.keys() if f != 'id']  # collect the feature column names
    cols
    
    ['id_count',
     'x_bin1_mode',
     'y_bin1_mode',
     'x_bin2_mode',
     'y_bin2_mode',
     'x_y_bin1_mode',
     'x_mean',
     'x_max',
     'x_min',
     'x_std',
     'x_ptp',
     'x_start',
     'x_end',
     'y_mean',
     'y_max',
     'y_min',
     'y_std',
     'y_ptp',
     'y_start',
     'y_end',
     'v_mean',
     'v_max',
     'v_min',
     'v_std',
     'v_ptp',
     'dir_mean',
     'x_bin1_count_mean',
     'y_bin1_count_mean',
     'y_bin1_count_max',
     'y_bin1_count_min',
     'x_bin2_count_mean',
     'x_bin2_count_max',
     'x_bin2_count_min',
     'y_bin2_count_mean',
     'y_bin2_count_max',
     'y_bin2_count_min',
     'x_bin1_y_bin1_count_mean',
     'x_bin1_y_bin1_count_max',
     'x_bin1_y_bin1_count_min',
     'dist_move_prev_mean',
     'dist_move_prev_max',
     'dist_move_prev_std',
     'dist_move_prev_min',
     'dist_move_prev_sum',
     'x_y_min_mean',
     'x_y_min_min',
     'y_x_min_mean',
     'y_x_min_min',
     'x_y_max_mean',
     'x_y_max_min',
     'y_x_max_mean',
     'y_x_max_min']
    
    df = df.merge(g,on='id',how='left')
    
    new_cols = [i for i in df.columns if i not in pre_cols]
    df[new_cols].head()
    
    dist_move_prev_bin_sen v_bin_sen id_count x_bin1_mode y_bin1_mode x_bin2_mode y_bin2_mode x_y_bin1_mode x_mean x_max ... dist_move_prev_min dist_move_prev_sum x_y_min_mean x_y_min_min y_x_min_mean y_x_min_min x_y_max_mean x_y_max_min y_x_max_mean y_x_max_min
    0 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374
    1 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374
    2 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374
    3 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374
    4 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374

    5 rows × 54 columns

    Statistics after splitting the data

    Note: in the original topline code the group_feature helper broke the downstream features, because agg() left the aggregated columns named after the functions instead of '{target}_{agg}_{flag}'; the version below renames them explicitly.

    def group_feature(df, key, target, aggs, flag):
        """Group by `key`, aggregate `target` with the functions in `aggs`,
        and rename the result columns to '{target}_{agg}_{flag}'."""
        print('group_feature:  ', (key, target, aggs, flag))
        agg_dict = {'{}_{}_{}'.format(target, ag, flag): ag for ag in aggs}
        print(agg_dict)
        t = df.groupby(key)[target].agg(aggs).reset_index()
        # agg(aggs) names the columns after the functions ('max', 'mean', ...);
        # rename them so downstream code can find e.g. 'x_max_0', 'y_median_0'
        t.columns = [key] + ['{}_{}_{}'.format(target, ag, flag) for ag in aggs]
        return t
    
    def extract_feature(df, train, flag):
        '''
        Statistical features per split.
        Note how group_feature is used and what it returns.
        '''
        if (flag == 'on_night') or (flag == 'on_day'): 
            t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)
            train = pd.merge(train, t, on='ship', how='left')
            # return train
        
        
        if flag == "0":
            t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)
            train = pd.merge(train, t, on='ship', how='left')  
        elif flag == "1":
            t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)
            train = pd.merge(train, t, on='ship', how='left')
            t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)
            train = pd.merge(train, t, on='ship', how='left') 
            # .nunique().to_dict() turns the per-ship unique-value counts into a dict
            # to_dict() combined with map is a convenient way to build mapped statistics,
            # e.g. conversion-rate features in CTR (classification) problems
            # Question: how would you build a conversion-rate feature for train+test from the (0/1) labels
            # given only in the training set, noting that some ids appear in both train and test?
            hour_nunique = df.groupby('ship')['speed'].nunique().to_dict()
            train['speed_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)   
            hour_nunique = df.groupby('ship')['direction'].nunique().to_dict()
            train['direction_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)  
    
        t = group_feature(df, 'ship','x',['max','min','mean','median','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')
        t = group_feature(df, 'ship','y',['max','min','mean','median','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')
        t = group_feature(df, 'ship','base_dis_diff',['max','min','mean','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')
    
           
        train['x_max_x_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
        train['y_max_y_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
        train['y_max_x_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
        train['x_max_y_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
        train['slope_{}'.format(flag)] = train['y_max_y_min_{}'.format(flag)] / np.where(train['x_max_x_min_{}'.format(flag)]==0, 0.001, train['x_max_x_min_{}'.format(flag)])
        train['area_{}'.format(flag)] = train['x_max_x_min_{}'.format(flag)] * train['y_max_y_min_{}'.format(flag)]
    
        mode_hour = df.groupby('ship')['hour'].agg(lambda x:x.value_counts().index[0]).to_dict()
        train['mode_hour_{}'.format(flag)] = train['ship'].map(mode_hour)
        train['slope_median_{}'.format(flag)] = train['y_median_{}'.format(flag)] / np.where(train['x_median_{}'.format(flag)]==0, 0.001, train['x_median_{}'.format(flag)])
    
        return train
    
    
    
    data  = df.copy()
    data.rename(columns={
        'id':'ship',
        'v':'speed',
        'dir':'direction'
    },inplace=True)
    # Deduplicate: keep one row per ship
    data_label = data.drop_duplicates(['ship'],keep = 'first')
    
    data_1 = data[data['speed']==0]
    data_2 = data[data['speed']!=0]
    data_label = extract_feature(data_1, data_label,"0")
    data_label = extract_feature(data_2, data_label,"1")
    
    data_1 = data[data['day_nig'] == 0]
    data_2 = data[data['day_nig'] == 1]
    data_label = extract_feature(data_1, data_label,"on_night")
    data_label = extract_feature(data_2, data_label,"on_day")
    data_label.rename(columns={'ship':'id','speed':'v','direction':'dir'},inplace=True)
    data_label
    
    (The original group_feature, which did not rename its aggregated columns, made this cell fail with KeyError: 'y_median_0' at the slope_median line of extract_feature; with the renamed columns it runs through and data_label gains the per-split statistical features.)
    
    new_cols = [i for i in data_label.columns if i not in df.columns]
    df = df.merge(data_label[new_cols+['id']],on='id',how='left')
    
    df[new_cols].head()
    

    Concrete use of statistical features

    temp = df.copy()
    temp.rename(columns={'id':'ship','dir':'d'},inplace=True)
    
    def coefficient_of_variation(x):
        x = x.values
        if np.mean(x) == 0:
            return 0
        return np.std(x) / np.mean(x)
    
    def max_2(x):
        x = list(x.values)
        x.sort(reverse=True)
        return x[1]
    
    def max_3(x):
        x = list(x.values)
        x.sort(reverse=True)
        return x[2]
    
    def diff_abs_mean(x):  # mean absolute value of consecutive differences
        return np.mean(np.abs(np.diff(x)))
    
    f1 = pd.DataFrame()
    for col in ['x', 'y', 'v', 'd']:
        features = temp.groupby('ship', as_index=False)[col].agg({
            '{}_min'.format(col): 'min',
            '{}_max'.format(col): 'max',
            '{}_mean'.format(col): 'mean',
            '{}_median'.format(col): 'median',
            '{}_std'.format(col): 'std',
            '{}_skew'.format(col): 'skew',
            '{}_sum'.format(col): 'sum',
            '{}_diff_abs_mean'.format(col): diff_abs_mean,
            '{}_mode'.format(col): lambda x: x.value_counts().index[0],
            '{}_coefficient_of_variation'.format(col): coefficient_of_variation,
            '{}_max2'.format(col): max_2,
            '{}_max3'.format(col): max_3
        })
        if f1.shape[0] == 0:
            f1 = features
        else:
            f1 = f1.merge(features, on='ship', how='left')
    
    f1['x_max_x_min'] = f1['x_max'] - f1['x_min']
    f1['y_max_y_min'] = f1['y_max'] - f1['y_min']
    f1['y_max_x_min'] = f1['y_max'] - f1['x_min']
    f1['x_max_y_min'] = f1['x_max'] - f1['y_min']
    # slope between the extreme points
    f1['slope'] = f1['y_max_y_min'] / np.where(f1['x_max_x_min'] == 0, 0.001, f1['x_max_x_min'])
    # bounding-box area and the diagonal distance between the extreme points
    f1['area'] = f1['x_max_x_min'] * f1['y_max_y_min']
    f1['dis_max_min'] = (f1['x_max_x_min'] ** 2 + f1['y_max_y_min'] ** 2) ** 0.5
    
    # distance from the origin to the mean point (x_mean, y_mean)
    f1['dis_mean'] = (f1['x_mean'] ** 2 + f1['y_mean'] ** 2) ** 0.5
    f1['area_d_dis_max_min'] = f1['area'] / f1['dis_max_min']
    
    # "Acceleration": per-axis displacement divided by the time step (an estimate of the velocity components)
    temp.sort_values(['ship', 'time'], ascending=True, inplace=True)
    temp['ynext'] = temp.groupby('ship')['y'].shift(-1)
    temp['xnext'] = temp.groupby('ship')['x'].shift(-1)
    temp['ynext'] = temp['ynext'].fillna(method='ffill')
    temp['xnext'] = temp['xnext'].fillna(method='ffill')
    temp['timenext'] = temp.groupby('ship')['time'].shift(-1)
    temp['timediff'] = np.abs(temp['timenext'] - temp['time'])
    temp['a_y'] = temp.apply(lambda x: (x['ynext'] - x['y']) / x['timediff'].total_seconds(), axis=1)
    temp['a_x'] = temp.apply(lambda x: (x['xnext'] - x['x']) / x['timediff'].total_seconds(), axis=1)
    for col in ['a_y', 'a_x']:
        f2 = temp.groupby('ship', as_index=False)[col].agg({
            '{}_max'.format(col): 'max',
            '{}_mean'.format(col): 'mean',
            '{}_min'.format(col): 'min',
            '{}_median'.format(col): 'median',
            '{}_std'.format(col): 'std'})
        f1 = f1.merge(f2, on='ship', how='left')
    
    # Curvature proxy: path length through each point divided by the straight-line distance between its neighbours
    temp['y_pre'] = temp.groupby('ship')['y'].shift(1)
    temp['x_pre'] = temp.groupby('ship')['x'].shift(1)
    temp['y_pre'] = temp['y_pre'].fillna(method='bfill')
    temp['x_pre'] = temp['x_pre'].fillna(method='bfill')
    temp['d_pre'] = ((temp['x'] - temp['x_pre']) ** 2 + (temp['y'] - temp['y_pre']) ** 2) ** 0.5
    temp['d_next'] = ((temp['xnext'] - temp['x']) ** 2 + (temp['ynext'] - temp['y']) ** 2) ** 0.5
    temp['d_pre_next'] = ((temp['xnext'] - temp['x_pre']) ** 2 + (temp['ynext'] - temp['y_pre']) ** 2) ** 0.5
    temp['curvature'] = (temp['d_pre'] + temp['d_next']) / temp['d_pre_next']
    
    f2 = temp.groupby('ship', as_index=False)['curvature'].agg({
        'curvature_max': 'max',
        'curvature_mean': 'mean',
        'curvature_min': 'min',
        'curvature_median': 'median',
        'curvature_std': 'std'})
    f1 = f1.merge(f2, on='ship', how='left')
    f1
    

    Embedding features

    A brief note on word2vec

    • Question!
      Why, in data-mining competitions, do we use word2vec or NMF (there are many methods, but these two are common) to construct "word embedding" features?

    Answer: to boost the score!

    Boosting the score is the visible effect, but behind it lies a more holistic view of the data. The statistical and domain features above also look at the data as a whole, yet they tend to ignore the relationships between records. Take everyone's age as an example: if we only build statistics such as the mean and extremes, or domain features such as standard weight = weight / age, these are all hand-designed notions. If instead we treat such values as words, and regard all the data (or one group of it) as a document made of those words, we can uncover additional regularities, i.e. new features.

    My own understanding: use the relationships between features and groups of features as new features (Word2vec is essentially reconstructing the linguistic context of words).

    Use cases of word2vec

    Word embeddings are used for feature generation, document clustering, text classification, and other natural-language tasks, for example:

    Finding similar words: embeddings can be used to find words close to a given word.

    Building groups of related words: cluster words so that related ones end up together;

    Features for text classification: words cannot be fed directly into a machine-learning model, so we first project them into a vector space and then train the model on those vectors;

    Document clustering.

    The tasks above are text tasks, but embedding models have since been extended to many other areas. Typical examples:

    On Weibo, represent each user as a word and build an embedding per user; the similarity between users then reveals the most closely related people;

    In recommendation, embed each item from users' purchase histories; item similarities can then be computed and used for recommendations;

    In this Tianchi ocean competition, embedding the different ships that appear at the same coordinates gives a vector per ship, which reveals ships that frequently operate in particular areas;

    In short, word embeddings are a huge help for discovering relationships between objects, and embedding techniques now appear in nearly every data competition. Let us look at the most widely used model, Word2Vec.

    What does Word2Vec do?

    Word2vec represents words in a vector space: each word becomes a vector, words with similar meanings sit close together, and dissimilar words lie far apart. This is also called a semantic relationship.

    Neural networks do not understand text, only numbers; word embeddings provide a way to turn text into numeric vectors.

    Word2vec reconstructs the linguistic context of words. What is linguistic context? In everyday life, when we communicate by speaking or writing, others try to work out the intent of the sentence. For example, for "What is the temperature of India?", the context is that the user wants to know the temperature of India.

    In short, the main target of a sentence is its context. The words and sentences surrounding a piece of spoken or written language (the discourse) help determine what the context means. Word2vec learns vector representations of words from their context.

    • Reference

    [NLP] 秒懂词向量Word2vec的本质: https://zhuanlan.zhihu.com/p/26306795
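    To make the idea concrete before the competition code, here is a minimal, hedged sketch: treat each ship's sequence of grid-bin tokens as a sentence, train a small Word2Vec model, and query similar "words". The token values are invented for illustration, and the keyword arguments use the gensim 3.x API (size/iter) to match the pipeline code below (gensim 4 renames them to vector_size/epochs).

    from gensim.models import Word2Vec

    # Hypothetical "sentences": each list is one ship's sequence of grid-bin tokens
    sentences = [
        ['12_7', '12_8', '13_8', '13_8', '14_9'],
        ['12_7', '12_7', '12_8', '11_7', '11_6'],
        ['30_2', '30_3', '31_3', '31_4', '30_3'],
    ]

    # CBOW model (sg=0), gensim 3.x style keyword arguments
    w2v = Word2Vec(sentences, size=8, window=3, min_count=1, iter=50, sg=0, seed=2020)

    # Bins that co-occur in similar contexts get similar vectors
    print(w2v.wv['12_8'])                       # the 8-dimensional vector of one grid bin
    print(w2v.wv.most_similar('12_8', topn=2))  # grid bins most similar to '12_8'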

    Building word vectors with Word2vec

    def traj_cbow_embedding(traj_data_corpus=None, embedding_size=70,
                            iters=40, min_count=3, window_size=25,
                            seed=9012, num_runs=5, word_feat="no_bin"):
        """CBOW embedding for trajectory data."""
        boat_id = traj_data_corpus['id'].unique()
        sentences, embedding_df_list, embedding_model_list = [], [], []
        for i in boat_id:
            traj = traj_data_corpus[traj_data_corpus['id']==i]
            sentences.append(traj[word_feat].values.tolist())
    
        print("
    @Start CBOW word embedding at {}".format(datetime.now()))
        print("-------------------------------------------")
        for i in tqdm(range(num_runs)):
            model = Word2Vec(sentences, size=embedding_size,
                                      min_count=min_count,
                                      workers=mp.cpu_count(),
                                      window=window_size,
                                      seed=seed, iter=iters, sg=0)
    
            # Sentance vector
            embedding_vec = []
            for ind, seq in enumerate(sentences):
                seq_vec, word_count = 0, 0
                for word in seq:
                    if word not in model:
                        continue
                    else:
                        seq_vec += model[word]
                        word_count += 1
                if word_count == 0:
                    embedding_vec.append(embedding_size * [0])
                else:
                    embedding_vec.append(seq_vec / word_count)
            embedding_vec = np.array(embedding_vec)
            embedding_cbow_df = pd.DataFrame(embedding_vec, 
                columns=["embedding_cbow_{}_{}".format(word_feat, i) for i in range(embedding_size)])
            embedding_cbow_df["id"] = boat_id
            embedding_df_list.append(embedding_cbow_df)
            embedding_model_list.append(model)
        print("-------------------------------------------")
        print("@End CBOW word embedding at {}".format(datetime.now()))
        return embedding_df_list, embedding_model_list
    
    embedding_size=70
    iters=70
    min_count=3
    window_size=25
    num_runs=1
    
    df_list, model_list = traj_cbow_embedding(df,
                                              embedding_size=embedding_size,
                                              iters=iters, min_count=min_count,
                                              window_size=window_size,
                                              seed=9012,
                                              num_runs=num_runs,
                                              word_feat="no_bin")
    
    train_embedding_df_list = [d.reset_index(drop=True) for d in df_list]
    fea = train_embedding_df_list[0]
    fea = pd.DataFrame(fea)
    
    pre_cols = df.columns
    df = df.merge(fea,on='id',how='left')
    
    
    new_cols = [i for i in df.columns if i not in pre_cols]
    df[new_cols].head()
    
    boat_id = df['id'].unique()
    total_embedding = pd.DataFrame(boat_id, columns=["id"])
    traj_data = df[['v','dir','id']].rename(columns = {'v':'speed','dir':'direction'})
    
    # Step 1: Construct the words
    traj_data_corpus = []
    traj_data["speed_str"]     = traj_data["speed"].apply(lambda x: str(int(x*100)))
    traj_data["direction_str"] = traj_data["direction"].apply(str)
    traj_data["speed_dir_str"] = traj_data["speed_str"] + "_" + traj_data["direction_str"]
    traj_data_corpus = traj_data[["id", "speed_str",
                                      "direction_str", "speed_dir_str"]]
    print("
    @Round 2 speed embedding:")
    df_list, model_list = traj_cbow_embedding(traj_data_corpus,
                                              embedding_size=10,
                                              iters=40, min_count=3,
                                              window_size=25, seed=9102,
                                              num_runs=1, word_feat="speed_str")
    speed_embedding = df_list[0].reset_index(drop=True)
    total_embedding = pd.merge(total_embedding, speed_embedding,
                               on="id", how="left")
    
    
    print("
    @Round 2 direction embedding:")
    df_list, model_list = traj_cbow_embedding(traj_data_corpus,
                                              embedding_size=12,
                                              iters=70, min_count=3,
                                              window_size=25, seed=9102,
                                              num_runs=1, word_feat="speed_dir_str")
    speed_dir_embedding = df_list[0].reset_index(drop=True)
    total_embedding = pd.merge(total_embedding, speed_dir_embedding,
                               on="id", how="left")
    
    pre_cols = df.columns
    df = df.merge(total_embedding,on='id',how='left')
    
    new_cols = [i for i in df.columns if i not in pre_cols]
    df[new_cols].head()
    

    Extracting topic distributions of text with NMF
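    Before the full helper class below, a minimal sketch of the core pipeline it implements: join each ship's discretized tokens into a "document", vectorize with TF-IDF, then apply NMF to obtain a per-ship topic distribution that can be merged back as features. The toy token strings here are invented for illustration.

    import pandas as pd
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical per-ship "documents": space-separated discretized tokens
    docs = pd.Series({
        'ship_1': '615 615 614 614 613',
        'ship_2': '512 512 513 514 514',
        'ship_3': '615 614 512 513 615',
    })

    tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs.values)  # ships x vocabulary
    topics = NMF(n_components=2, random_state=2020).fit_transform(tfidf)    # ships x topics

    topic_df = pd.DataFrame(topics, columns=['topic_1', 'topic_2'], index=docs.index)
    print(topic_df)  # each ship's topic distribution, ready to merge back as features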

    class nmf_list(object):
        def __init__(self,data,by_name,to_list,nmf_n,top_n):
            self.data = data
            self.by_name = by_name
            self.to_list = to_list
            self.nmf_n = nmf_n
            self.top_n = top_n
    
        def run(self,tf_n):
            df_all = self.data.groupby(self.by_name)[self.to_list].apply(lambda x :'|'.join(x)).reset_index()
            self.data =df_all.copy()
    
            print('build word_fre')
            # Build the word-frequency dictionary
            def word_fre(x):
                word_dict = []
                x = x.split('|')
                docs = []
                for doc in x:
                    doc = doc.split()
                    docs.append(doc)
                    word_dict.extend(doc)
                word_dict = Counter(word_dict)
                new_word_dict = {}
                for key,value in word_dict.items():
                    new_word_dict[key] = [value,0]
                del word_dict  
                del x
                for doc in docs:
                    doc = Counter(doc)
                    for word in doc.keys():
                        new_word_dict[word][1] += 1
                return new_word_dict 
            self.data['word_fre'] = self.data[self.to_list].apply(word_fre)
    
            print('build top_' + str(self.top_n))
            # Keep the top_n most frequent words
            def top_100(word_dict):
                return sorted(word_dict.items(),key = lambda x:(x[1][1],x[1][0]),reverse = True)[:self.top_n]
            self.data['top_'+str(self.top_n)] = self.data['word_fre'].apply(top_100)
            def top_100_word(word_list):
                words = []
                for i in word_list:
                    i = list(i)
                    words.append(i[0])
                return words 
            self.data['top_'+str(self.top_n)+'_word'] = self.data['top_' + str(self.top_n)].apply(top_100_word)
            # print('top_'+str(self.top_n)+'_word的shape')
            print(self.data.shape)
    
            word_list = []
            for i in self.data['top_'+str(self.top_n)+'_word'].values:
                word_list.extend(i)
            word_list = Counter(word_list)
            word_list = sorted(word_list.items(),key = lambda x:x[1],reverse = True)
            user_fre = []
            for i in word_list:
                i = list(i)
                user_fre.append(i[1]/self.data[self.by_name].nunique())
            stop_words = []
            for i,j in zip(word_list,user_fre):
                if j>0.5:
                    i = list(i)
                    stop_words.append(i[0])
    
            print('start title_feature')
            # Treat the merged tag list as a single document for text processing
            self.data['title_feature'] = self.data[self.to_list].apply(lambda x: x.split('|'))
            self.data['title_feature'] = self.data['title_feature'].apply(lambda line: [w for w in line if w not in stop_words])
            self.data['title_feature'] = self.data['title_feature'].apply(lambda x: ' '.join(x))
    
            print('start NMF')
            # Vectorize the documents with TF-IDF
            tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n,tf_n))
            tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)
            # Use NMF to extract the topic distribution of each document
            text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)
    
    
            # Assemble and return the result table
            name = [str(tf_n) + self.to_list + '_' +str(x) for x in range(1,self.nmf_n+1)]
            tag_list = pd.DataFrame(text_nmf)
            print(tag_list.shape)
            tag_list.columns = name
            tag_list[self.by_name] = self.data[self.by_name]
            column_name = [self.by_name] + name
            tag_list = tag_list[column_name]
            return tag_list
    
    data = df.copy()
    data.rename(columns={'v':'speed','id':'ship'},inplace=True)
    for j in range(1,4):
        print('********* {} *******'.format(j))
        for i in ['speed','x','y']:
            data[i + '_str'] = data[i].astype(str)
            nmf = nmf_list(data,'ship',i + '_str',8,2)
            nmf_a = nmf.run(j)
            nmf_a.rename(columns={'ship':'id'},inplace=True)
            data_label = data_label.merge(nmf_a,on = 'id',how = 'left')
    
    new_cols = [i for i in data_label.columns if i not in df.columns]
    df = df.merge(data_label[new_cols+['id']],on='id',how='left')
    
    df[new_cols].head()
    

    Summary and reflections

    • Competition feature engineering: how do we build features that actually help for this problem?

        Hint: through EDA and by reading literature related to the competition topic, find and build features with real domain meaning.
      
    • Binning features: almost every topline solution constructs binning features. Why is binning so important and effective, and in what situations does it work well? (Why does this competition need binning features?) A small sketch contrasting the two binning tools used in this notebook follows this list.

        Hint: the principle of binning.
      
    • DataFrame features: the built-in methods of a pandas DataFrame can generate a large number of statistical features. Suggestion: compile your own library of statistical-feature constructors for tabular data.

        Hint: Datawhale's Joyful Pandas.
      
    • Embedding features: a score booster. Why does turning a sequence into an NLP-style sentence or document and vectorizing it work so well? And how should the parameters be tuned to build good word vectors for a given dataset?

        Hint: study Word2vec.
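    As mentioned in the binning bullet above, a minimal sketch (on invented speed values) contrasting pd.cut (equal-width bins) with pd.qcut (equal-frequency bins), the two binning tools used throughout this notebook:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    speed = pd.Series(rng.exponential(scale=3.0, size=1000))  # skewed, roughly like vessel speeds

    # Equal-width bins: identical numeric width, very uneven counts on skewed data
    width_bins = pd.cut(speed, bins=5)
    # Equal-frequency bins: roughly the same number of samples in every bin
    freq_bins = pd.qcut(speed, q=5, duplicates='drop')

    print(width_bins.value_counts().sort_index())
    print(freq_bins.value_counts().sort_index())

    # Integer-encode the bins, as done above for v_bin / x_bin1 / y_bin1
    encoded = freq_bins.map(dict(zip(freq_bins.unique(), range(freq_bins.nunique()))))
    print(encoded.head())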

    Appendix

    Learning sources

    1 Team: Pursuing the Past Youth
    Link:
    https://github.com/juzstu/TianChi_HaiYang

    2 Team: liu123的航空母舰队
    Link:
    https://github.com/MichaelYin1994/tianchi-trajectory-data-mining

    3 Team: 天才海神号
    Link:
    https://github.com/fengdu78/tianchi_haiyang?spm=5176.12282029.0.0.5b97301792pLch

    4 Team: 大白
    Link:
    https://github.com/Ai-Light/2020-zhihuihaiyang

    5 Team: 抗毒救灾
    Link:
    https://github.com/wudejian789/2020DCIC_A_Rank7_B_Rank12

    6 Team: 蜗牛坐车里团队
    Link:
    https://tianchi.aliyun.com/notebook-ai/detail?postId=114808

    7 Team: 用欧气驱散疫情
    Link:
    https://github.com/tudoulei/2020-Digital-China-Innovation-Competition

    Data

    The data used is hy_round1_train_20200102 (the preliminary-round data).

    How to run

    The cleaned-up runnable code for each team is in ipynb/*.ipynb;
    the numbering matches the list above.

    Outputs

    The output files are in result/*.csv

    Recommended learning materials

    Practice: topline code from well-known competitions, e.g. open-source solutions on Kaggle, Tianchi, and similar platforms.

    Books:

    +《阿里云天池大赛赛题解析》
       
       [The author also keeps study notes on this book: https://blog.csdn.net/qq_44574333/article/details/109611764]
       
    +《美团机器学习实战》
    

    Tutorials:

    + Joyful Pandas (highly recommended! fundamental and efficient)
    http://joyfulpandas.datawhale.club/