• Kaggle competition beginner notes


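    1. Bike Sharing Demand

    kaggle: https://www.kaggle.com/c/bike-sharing-demand

    Goal: predict the number of bike rentals from features such as date, time, weather, and temperature.

    Processing: 1. From the datetime field (year, month, day, hour, minute, second), extract the year, month, day of week, and hour.

               2. season and weather are categorical labels, so encode them as dummy variables.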
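
    The two processing steps above can be sketched with pandas roughly as follows. The three rows are made-up sample data shaped like the competition's train.csv, and the `season_` / `weather_` prefixes are illustrative choices rather than anything the original script uses.

```python
import pandas as pd

# Made-up rows that mimic the datetime/season/weather columns of train.csv
df = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-07-15 08:00:00', '2012-12-19 23:00:00'],
    'season': [1, 3, 4],
    'weather': [1, 2, 3],
})

# Step 1: split the timestamp into year, month, day of week, and hour
ts = pd.to_datetime(df['datetime'])
features = pd.DataFrame({
    'year': ts.dt.year,
    'month': ts.dt.month,
    'dayofweek': ts.dt.dayofweek,  # Monday = 0 ... Sunday = 6
    'hour': ts.dt.hour,
})

# Step 2: dummy-encode the categorical columns; the prefixes keep the
# season and weather columns from ending up with identical names
season = pd.get_dummies(df['season'], prefix='season')
weather = pd.get_dummies(df['weather'], prefix='weather')

X = pd.concat([features, season, weather], axis=1)
print(X.columns.tolist())
```

    Without a prefix, both `get_dummies` calls would name their columns after the raw category values, so the concatenated frame could contain duplicate column names.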

    Model selection:

    This is a regression problem: 1. RandomForestRegressor

                      2. GradientBoostingRegressor

    # -*- coding: utf-8 -*-
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

    train = pd.read_csv('data/train.csv')
    test = pd.read_csv('data/test.csv')

    # Target: rental count
    Y_train = train['count']

    # Columns copied into the feature matrix unchanged
    raw_features = ['holiday', 'workingday', 'temp', 'humidity', 'windspeed']


    def build_features(df):
        """Build the model inputs from a raw train/test frame."""
        dt = pd.DatetimeIndex(df['datetime'])
        # Dummy-encode the categorical columns; the prefixes keep the
        # season and weather dummy columns from sharing the same names
        features = pd.concat([pd.get_dummies(df['season'], prefix='season'),
                              pd.get_dummies(df['weather'], prefix='weather')], axis=1)
        # Date parts extracted from the timestamp
        features['year'] = dt.year
        features['month'] = dt.month
        features['day'] = dt.dayofweek  # Monday = 0
        features['hour'] = dt.hour
        for col in raw_features:
            features[col] = df[col]
        return features


    X_train = build_features(train)
    # Align the test columns with train in case a category is missing from one file
    X_test = build_features(test).reindex(columns=X_train.columns, fill_value=0)

    # Model 1: gradient boosting, fitted on log1p(count) so that
    # np.expm1 maps the predictions back to counts
    gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3)
    gbr.fit(X_train, np.log1p(Y_train))
    result = np.expm1(gbr.predict(X_test))

    df = pd.DataFrame({'datetime': test['datetime'], 'count': result})
    df.to_csv('results1.csv', index=False, columns=['datetime', 'count'])

    # Model 2: random forest, fitted on the raw counts
    rfr = RandomForestRegressor()
    rfr.fit(X_train, Y_train)
    y_predict = rfr.predict(X_test).astype(int)

    df = pd.DataFrame({'datetime': test['datetime'], 'count': y_predict})
    df.to_csv('result2.csv', index=False, columns=['datetime', 'count'])
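
    The script writes submission files without a local score. As a rough sketch of how a model could be checked before submitting, the example below holds out part of the training data and computes RMSLE, the metric this competition uses. The data here is synthetic (a made-up relationship between hour, temperature, and count), and the `rmsle` helper is a hypothetical name; fitting on log1p(count) and inverting with `np.expm1` is one common pairing with this metric, not necessarily the author's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Synthetic stand-in for the real feature matrix and target
rng = np.random.RandomState(0)
hour = rng.randint(0, 24, size=500)
temp = rng.uniform(0, 35, size=500)
X = np.column_stack([hour, temp])
y = np.round(20 + 5 * temp + 80 * np.sin(np.pi * hour / 24)
             + rng.uniform(0, 10, size=500))

# Keep 20% of the rows aside for validation
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the log-transformed target, then map predictions back to counts
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X_fit, np.log1p(y_fit))
val_pred = np.expm1(model.predict(X_val))

score = rmsle(y_val, val_pred)
print('validation RMSLE: %.4f' % score)
```

    Because the real data is a time series, a split by date may give a more honest estimate than the random split used in this sketch.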

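    2. Daily News for Stock Market Prediction

    Using historical data (the 25 most-clicked news items of each day and whether the stock market rose or fell that day), predict future stock market movement.

    Approach 1:

         1. Merge the 25 news items into a single document, preprocess each word (strip special characters, drop words containing digits, remove stop words, lowercase, stem), then extract features with TF-IDF and train an SVM.

         2. Extract features with word2vec.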
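
    The TF-IDF plus SVM step can be sketched with scikit-learn as below. This is a minimal illustration under stated assumptions, not the linked repository's code: the headlines and labels are made up, `clean` is a hypothetical helper, stop-word removal is delegated to `TfidfVectorizer`, and stemming is left out because it would need an extra library such as NLTK.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def clean(text):
    """Lowercase, drop tokens containing digits, strip special characters."""
    tokens = []
    for token in text.lower().split():
        if any(ch.isdigit() for ch in token):
            continue  # skip words that contain digits
        token = re.sub(r'[^a-z]', '', token)  # keep letters only
        if token:
            tokens.append(token)
    return ' '.join(tokens)


# Made-up headlines, one list per day (the real data has 25 per day)
daily_headlines = [
    ['Markets rally as growth beats forecasts!', 'Exports surge 12% in Q3'],
    ['Banks collapse amid crisis fears', 'Unemployment climbs; factories close'],
    ['Tech stocks rally on strong profit growth', 'Exports surge again'],
    ['Crisis deepens as banks close branches', 'Factories cut jobs'],
]
labels = [1, 0, 1, 0]  # 1 = market rose that day, 0 = market fell

# Merge each day's headlines into a single document, then clean it
docs = [clean(' '.join(day)) for day in daily_headlines]

# TF-IDF features (with English stop-word removal) feeding a linear SVM
model = make_pipeline(TfidfVectorizer(stop_words='english'), SVC(kernel='linear'))
model.fit(docs, labels)

print(model.predict([clean('Stocks rally as exports surge')]))
```

    Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned only from the training documents and reused unchanged at prediction time.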

    Implementation:

    https://github.com/yjfiejd/News_predict

    3.

  • Original article: https://www.cnblogs.com/zhaopAC/p/9197608.html