This article covers Section 2.2 (neighborhood models) of Koren's 2008 paper [1], "Factorization Meets the Neighborhood" (KDD '08); the remaining sections will be added over time.
Koren's experiments use the Netflix dataset, which is far too large to run in reasonable time on an ordinary PC. Since the goal of this article is to introduce and summarize the method, the MovieLens dataset is used instead.
Notation (other variables are covered in the related articles mentioned above):
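Briefly, following the paper's standard notation: r_{ui} is the rating of user u for item i; \mu is the global mean rating; b_u and b_i are the user and item biases; \bar{r}_i is the mean rating of item i; U_{ij} is the set of users who rated both items i and j, with n_{ij} = |U_{ij}|; and R(u) is the set of items rated by user u.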
The similarity between items i and j is measured with the Pearson correlation coefficient.
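It is computed over the set U_{ij} of users who rated both items, which is exactly what initial() in the code below does:

\rho_{ij} = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2 \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}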
The paper introduces a shrunk correlation coefficient: the Pearson correlation, shrunk according to the number of common raters, is used as the similarity between i and j. The experiments below confirm that shrinkage gives better results.
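The shrinkage damps correlations that are supported by only a few common raters:

s_{ij} = \frac{n_{ij}}{n_{ij} + \lambda} \, \rho_{ij}

where n_{ij} is the number of users who rated both items; the paper and the code below use \lambda = 100.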
Predicted rating:
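The model adds to the baseline estimate a similarity-weighted average of the user's residuals. Note that the code averages over all items j rated by u, rather than only the k most similar neighbors S^k(i;u) as in the paper:

\hat{r}_{ui} = \mu + b_u + b_i + \frac{\sum_{j \in R(u)} s_{ij} \, (r_{uj} - b_{uj})}{\sum_{j \in R(u)} s_{ij}}, \qquad b_{uj} = \mu + b_u + b_j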
Evaluation metrics: RMSE and MAE.
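Both are computed over the test set T:

\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2}, \qquad \mathrm{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} |\hat{r}_{ui} - r_{ui}|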
Evaluation uses 5-fold cross-validation (the MovieLens distribution already ships with the five train/test splits).
Note: the optimal user and item biases can also be trained with SGD; that part will be filled in later.
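Until then, initialBias() in the code uses the paper's decoupled, regularized estimates (the shrinkage constants 25 and 10 below match the code):

b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu)}{25 + |R(i)|}, \qquad b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{10 + |R(u)|}

The SGD alternative would instead loop over the training ratings and, for each one, compute the error e_{ui} = r_{ui} - \hat{r}_{ui} and update b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u) and b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i), with learning rate \gamma and regularizer \lambda.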
Full implementation:
'''
Created on Dec 16, 2012
@Author: Dennis Wu
@E-mail: hansel.zh@gmail.com
@Homepage: http://blog.csdn.net/wuzh670
@Weibo: http://weibo.com/hansel

Data set download from: http://www.grouplens.org/system/files/ml-100k.zip
'''
from math import sqrt, fabs


def load_data(filename_train, filename_test):
    """Read the MovieLens rating files into nested dicts: {userId: {itemId: rating}}."""
    train = {}
    test = {}
    for line in open(filename_train):
        # ml-100k files are tab-separated: user, item, rating, timestamp
        (userId, itemId, rating, timestamp) = line.strip().split('\t')
        train.setdefault(userId, {})
        train[userId][itemId] = float(rating)
    for line in open(filename_test):
        (userId, itemId, rating, timestamp) = line.strip().split('\t')
        test.setdefault(userId, {})
        test[userId][itemId] = float(rating)
    return train, test


def initialBias(train, userNum, movieNum, mean):
    """Estimate item biases bi and user biases bu with regularized means
    (shrinkage constants 25 for items, 10 for users)."""
    bu = {}
    bi = {}
    biNum = {}
    buNum = {}

    # accumulate per-item residuals (rating - global mean)
    u = 1
    while u < (userNum + 1):
        su = str(u)
        for i in train.get(su, {}).keys():
            bi.setdefault(i, 0)
            biNum.setdefault(i, 0)
            bi[i] += (train[su][i] - mean)
            biNum[i] += 1
        u += 1

    # shrink item biases toward zero: bi = sum / (count + 25)
    i = 1
    while i < (movieNum + 1):
        si = str(i)
        biNum.setdefault(si, 0)
        if biNum[si] >= 1:
            bi[si] = bi[si] * 1.0 / (biNum[si] + 25)
        else:
            bi[si] = 0.0
        i += 1

    # accumulate per-user residuals (rating - global mean - item bias)
    u = 1
    while u < (userNum + 1):
        su = str(u)
        for i in train.get(su, {}).keys():
            bu.setdefault(su, 0)
            buNum.setdefault(su, 0)
            bu[su] += (train[su][i] - mean - bi[i])
            buNum[su] += 1
        u += 1

    # shrink user biases toward zero: bu = sum / (count + 10)
    u = 1
    while u < (userNum + 1):
        su = str(u)
        buNum.setdefault(su, 0)
        if buNum[su] >= 1:
            bu[su] = bu[su] * 1.0 / (buNum[su] + 10)
        else:
            bu[su] = 0.0
        u += 1
    return bu, bi


def initial(train, userNum, movieNum):
    """Compute the global mean, per-item averages, shrunk item-item
    similarities, and the baseline biases."""
    average = {}
    Sij = {}      # Sij[i][j] = list of users who rated both i and j
    mean = 0
    num = 0
    N = {}
    for u in train.keys():
        for i in train[u].keys():
            mean += train[u][i]
            num += 1
            average.setdefault(i, 0)
            average[i] += train[u][i]
            N.setdefault(i, 0)
            N[i] += 1
            Sij.setdefault(i, {})
            for j in train[u].keys():
                if i == j:
                    continue
                Sij[i].setdefault(j, [])
                Sij[i][j].append(u)
    mean = mean / num
    for i in average.keys():
        average[i] = average[i] / N[i]

    # Pearson correlation over common raters, then shrink by n / (n + 100)
    pearson = {}
    itemSim = {}
    for i in Sij.keys():
        pearson.setdefault(i, {})
        itemSim.setdefault(i, {})
        for j in Sij[i].keys():
            pearson[i][j] = 1   # default when the numerator is zero
            part1 = 0
            part2 = 0
            part3 = 0
            for u in Sij[i][j]:
                part1 += (train[u][i] - average[i]) * (train[u][j] - average[j])
                part2 += pow(train[u][i] - average[i], 2)
                part3 += pow(train[u][j] - average[j], 2)
            if part1 != 0:
                pearson[i][j] = part1 / sqrt(part2 * part3)
            # shrunk similarity; fabs keeps the interpolation weights positive
            itemSim[i][j] = fabs(pearson[i][j] * len(Sij[i][j]) / (len(Sij[i][j]) + 100))

    # initialize user and item biases, respectively
    bu, bi = initialBias(train, userNum, movieNum, mean)
    return itemSim, mean, average, bu, bi


def neighborhoodModels(train, test, itemSim, mean, average, bu, bi):
    """Predict every test rating with the baseline plus the
    similarity-weighted average of the user's residuals; return RMSE and MAE."""
    pui = {}
    rmse = 0.0
    mae = 0.0
    num = 0
    for u in test.keys():
        pui.setdefault(u, {})
        for i in test[u].keys():
            # baseline estimate: global mean + user bias + item bias
            pui[u][i] = mean + bu[u] + bi[i]
            stat = 0
            stat2 = 0
            for j in train.get(u, {}).keys():
                if i in itemSim and j in itemSim[i]:
                    stat += (train[u][j] - mean - bu[u] - bi[j]) * itemSim[i][j]
                    stat2 += itemSim[i][j]
            if stat > 0:
                pui[u][i] += stat * 1.0 / stat2
            rmse += pow((pui[u][i] - test[u][i]), 2)
            mae += fabs(pui[u][i] - test[u][i])
            num += 1
    rmse = sqrt(rmse * 1.0 / num)
    mae = mae * 1.0 / num
    return rmse, mae


if __name__ == "__main__":
    i = 1
    sumRmse = 0.0
    sumMae = 0.0
    while i <= 5:
        # load one of the five predefined MovieLens splits
        filename_train = 'data/u' + str(i) + '.base'
        filename_test = 'data/u' + str(i) + '.test'
        train, test = load_data(filename_train, filename_test)

        # initialize similarities, means and biases (943 users, 1682 movies)
        itemSim, mean, average, bu, bi = initial(train, 943, 1682)

        # run the neighborhood model on the held-out fold
        rmse, mae = neighborhoodModels(train, test, itemSim, mean, average, bu, bi)
        print('cross-validation %d: rmse: %s mae: %s' % (i, rmse, mae))
        sumRmse += rmse
        sumMae += mae
        i += 1
    print('neighborhood models final results: Rmse: %s Mae: %s' % (sumRmse / 5, sumMae / 5))
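To run the script: download and unzip ml-100k from the URL in the header, place u1.base/u1.test through u5.base/u5.test in a data/ directory next to the script, and run it with Python (the listing above is written for Python 3).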
Experimental results:
Note: the first result was obtained with the plain (unshrunk) Pearson correlation; the second with the shrunk correlation coefficient.