(原创文章,转载请注明出处!)
一、问题
实现对电影的推荐,数据集中有约1600部电影,有约900个用户对这些电影进行了评价。设每个电影有10个特征,根据推荐系统(一)描述的算法,每个用户也相应的有10的参数,那么总的参数个数 ≈ 1600 * 10 + 900 * 10 ≈ 16000 + 9000 ≈25000 个。电影评价的数据集可以访问链接:http://grouplens.org/datasets/movielens/
二、代码实现
用R编程实现推荐系统(一)描述的算法。在实现中使用R提供的一个求解最优化的函数,optim。推荐算法的优化目标是无约束最优化问题,使用optim时,选择使用共轭梯度算法求解无约束最优化问题。optim的参数需要两个函数:目标函数和目标函数的一阶导函数。
1. 目标函数的代码如下:
1 JThetaXLamda <- function(thetaX) { 2 theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE) 3 X <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE) 4 5 return ( (1/2) * ( sum( (tcrossprod(X, theta) * R - Y)^2 ) 6 + lamda * ( sum(theta^2) + sum(X^2) ) ) ) 7 }
2. 目标函数的一阶导函数代码如下:
1 JThetaXGrLamda <- function(thetaX) { 2 theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE) # 943 is the number of user, each row is one user's parameter 3 X <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE) # 1682 is the number of movie 4 5 # gradient of theta 6 grTheta <- matrix(numeric(),ncol=nMovieFeature) 7 for (j in 1:nUser) { # jth user 8 idx <- which(R[,j]==1) # which movies are rated by jth user 9 tmpX <- X[idx,] 10 tmpY <- matrix(Y[idx,j], nrow=1) 11 grTheta <- rbind( grTheta, alpha * ( (tcrossprod(theta[j,], tmpX) - tmpY) %*% tmpX 12 + lamda * theta[j,] ) ) 13 } 14 15 # gradient of x 16 # calculating gradient of each movie only need the users who rate it. 17 grX <- matrix(numeric(),ncol=nMovieFeature) 18 for (i in 1:nMovie) { # ith movie 19 idx <- which(R[i,]==1) # exclude the user who haven't rated movie i 20 tmpTheta <- theta[idx,] 21 tmpY <- matrix(Y[i,idx], nrow=1) # tmpY is a vector including rating credit of movie i 22 grX <- rbind( grX, alpha * ( (tcrossprod(X[i,], tmpTheta) - tmpY) %*% tmpTheta 23 + lamda * X[i,]) ) 24 } 25 26 return ( c( t(grTheta), t(grX) ) ) 27 }
3. 训练:
上面两个函数使用的变量:nUser、nMovie、nMovieFeature、Y、R是自由变量,在训练代码中定义和赋值。
推荐系统(一)中还提到对数据需要做进行均值正规化(Mean Normalization), 代码如下:
1 ## normalizae the training data 2 ## Args : 3 ## Y - a matrix, rating of movies by users 4 ## R - a matrix, flags, which movies are rated by user 5 ## Returns : 6 ## a list contains two elements: Ynormalized - a matrix, normalized Y 7 ## Ymean - a vector contains mean of each movie 8 craEx8RS_meanNormalization <- function(Y, R) 9 { 10 numOfMovies <- dim(Y)[1] 11 l <- list( Ynormalized = Y, Ymean = numeric(length = numOfMovies) ) 12 for (i in 1:numOfMovies) { 13 idx <- which(R[i,] == 1) 14 meanOfOneMovie <- mean(Y[i,idx]) 15 l$Ynormalized[i,] <- Y[i,] - meanOfOneMovie 16 l$Ymean[i] <- meanOfOneMovie 17 } 18 return( l ) 19 }
训练代码如下:
1 ## do the training 2 Y <- read.table(file="Y.txt") 3 R <- read.table(file="R.txt") 4 Y <- as.matrix(Y) 5 R <- as.matrix(R) 6 l <- meanNormalization(Y, R) # normalize the data 7 Y <- l$Ynormalized 8 nMovie <- dim(Y)[1] 9 nUser <- dim(Y)[2] 10 nMovieFeature <- 10 11 lamda <- 0.01 12 alpha <- 0.01 13 14 15 ## random initialize theta, X 16 theta <- matrix(runif(nUser*nMovieFeature,min=0,max=0.1), nrow=nUser, ncol=nMovieFeature) # each row is parameter of one user 17 X <- matrix(runif(nMovie*nMovieFeature,min=0,max=0.1), nrow=nMovie, ncol=nMovieFeature) # each row is parameter of one movie 18 # convert the two initialize value matrix to vector 19 thetaX <- c( t(theta), t(X) ) 20 # using conjugate gradient algorithm to train the parameters 21 optimCG <- optim( par=thetaX, fn=JThetaX, gr = JThetaXGr, method = "CG", control = list(maxit=500000, abstol=1e-6) ) 22 optimCG$par # display the training reslut, i.e. best value has been found for parameters
4. 预测:
通过训练得到参数值后,就可以通过推荐系统(一)中提到的计算公式: θT*x + 平均值, 来计算每个用户对还没有评价电影的评分。通过比较计算得到评分,就能找到用户偏好的电影推荐给他。代码如下:
1 ## Find the top-n un-rating movies of the jth user 2 ## Args : 3 ## topn - a positive integer 4 ## theta_j - a vector, contains theta parameters of jth user 5 ## X - a matrix, contains X paramters 6 ## Y - a matrix, rating of movies by users 7 ## R - a matrix, flags, which movies are rated by user 8 ## j - a interger, jth user 9 ## Ymean - a vector, contains mean of all movies 10 ## isAddMean - flag, is the mean added to rating credit, or not? 11 ## TRUE - add the mean to rating credit 12 ## FALSE - the mean will not be added to the rating credit 13 ## Returns : 14 ## a list, contains the index of the top-n un-rating movies of jth user and the rating result 15 findTopN <- function(topn, theta_j, X, R, j, Ymean=NULL, isAddMean=FALSE) 16 { 17 idx <- which(R[,j]==0) 18 Yvec <- rep(-1,length(R[,j])) 19 20 if (isAddMean == TRUE) { 21 if (is.null(Ymean)) { 22 cat("Error: Ymean is NULL!!! ") 23 } 24 Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) ) + as.matrix( Ymean[idx], ncol=1) 25 } else { 26 Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) ) 27 } 28 29 sortTating <- sort(Yvec, method = "sh", decreasing = TRUE, index.return = TRUE) 30 objList <- list(ratingResult = Yvec, topnIndex = sortTating$ix[1:topn]) 31 return( objList ) 32 }
(寻找N个数中最大的K个数的方法,可参考链接:http://zhangzhibiao02005.blog.163.com/blog/static/3736782020112123353695/)