• 推荐系统(二)


     (原创文章,转载请注明出处!)

    一、问题

    实现对电影的推荐,数据集中有约1600部电影,有约900个用户对这些电影进行了评价。设每个电影有10个特征,根据推荐系统(一)描述的算法,每个用户也相应的有10的参数,那么总的参数个数 ≈ 1600 * 10 + 900 * 10 ≈ 16000 + 9000 ≈25000 个。电影评价的数据集可以访问链接:http://grouplens.org/datasets/movielens/

    二、代码实现

    用R编程实现推荐系统(一)描述的算法。在实现中使用R提供的一个求解最优化的函数,optim。推荐算法的优化目标是无约束最优化问题,使用optim时,选择使用共轭梯度算法求解无约束最优化问题。optim的参数需要两个函数:目标函数和目标函数的一阶导函数。

    1. 目标函数的代码如下:

    1 JThetaXLamda <- function(thetaX) {
    2     theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE)  
    3     X     <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE)
    4     
    5     return (  (1/2) * ( sum( (tcrossprod(X, theta) * R - Y)^2 )  
    6                         + lamda * ( sum(theta^2) + sum(X^2) )  ) )
    7 }

    2. 目标函数的一阶导函数代码如下:

     1 JThetaXGrLamda <- function(thetaX) {
     2     theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE)  # 943 is the number of user,  each row is one user's parameter
     3     X     <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE)  # 1682 is the number of movie
     4      
     5     # gradient of theta
     6     grTheta <- matrix(numeric(),ncol=nMovieFeature)
     7     for (j in 1:nUser) { # jth user
     8         idx <- which(R[,j]==1)  # which movies are rated by jth user
     9         tmpX <- X[idx,] 
    10         tmpY <- matrix(Y[idx,j], nrow=1)
    11         grTheta <- rbind( grTheta, alpha * ( (tcrossprod(theta[j,], tmpX) - tmpY) %*% tmpX 
    12                                                        + lamda * theta[j,] )  )
    13     }
    14 
    15     # gradient of x
    16     # calculating gradient of each movie only need the users who rate it.
    17     grX <- matrix(numeric(),ncol=nMovieFeature)
    18     for (i in 1:nMovie) { # ith movie
    19         idx <- which(R[i,]==1)  # exclude the user who haven't rated movie i
    20         tmpTheta <- theta[idx,] 
    21         tmpY <- matrix(Y[i,idx], nrow=1) # tmpY is a vector including rating credit of movie i
    22         grX <- rbind( grX, alpha * ( (tcrossprod(X[i,], tmpTheta) - tmpY) %*% tmpTheta 
    23                                                + lamda * X[i,]) )
    24     }
    25     
    26     return (   c( t(grTheta), t(grX) )   )
    27 }

    3. 训练:

    上面两个函数使用的变量:nUser、nMovie、nMovieFeature、Y、R是自由变量,在训练代码中定义和赋值。

    推荐系统(一)中还提到对数据需要做进行均值正规化(Mean Normalization), 代码如下:

     1 ## normalizae the training data
     2 ## Args :
     3 ##     Y  -  a matrix, rating of movies by users 
     4 ##     R  -  a matrix, flags, which movies are rated by user
     5 ## Returns :
     6 ##     a list contains two elements: Ynormalized - a matrix, normalized Y
     7 ##                                   Ymean - a vector contains mean of each movie
     8 craEx8RS_meanNormalization <- function(Y, R)
     9 {
    10     numOfMovies <- dim(Y)[1]
    11     l <- list( Ynormalized = Y, Ymean = numeric(length = numOfMovies) )
    12     for (i in 1:numOfMovies) {
    13         idx <- which(R[i,] == 1)
    14         meanOfOneMovie <- mean(Y[i,idx])
    15         l$Ynormalized[i,] <- Y[i,] - meanOfOneMovie
    16         l$Ymean[i] <- meanOfOneMovie
    17     }
    18     return( l )
    19 }

    训练代码如下:

     1 ## do the training
     2 Y <- read.table(file="Y.txt")
     3 R <- read.table(file="R.txt")
     4 Y <- as.matrix(Y)
     5 R <- as.matrix(R)
     6 l <- meanNormalization(Y, R)  # normalize the data
     7 Y <- l$Ynormalized
     8 nMovie <- dim(Y)[1]
     9 nUser  <- dim(Y)[2]
    10 nMovieFeature <- 10 
    11 lamda <- 0.01
    12 alpha <- 0.01
    13 
    14 
    15 ## random initialize theta, X
    16 theta <- matrix(runif(nUser*nMovieFeature,min=0,max=0.1), nrow=nUser, ncol=nMovieFeature)  # each row is parameter of one user
    17 X     <- matrix(runif(nMovie*nMovieFeature,min=0,max=0.1), nrow=nMovie, ncol=nMovieFeature)  # each row is parameter of one movie
    18 # convert the two initialize value matrix to vector
    19 thetaX <- c( t(theta), t(X) ) 
    20 # using conjugate gradient algorithm to train the parameters
    21 optimCG <- optim( par=thetaX, fn=JThetaX, gr = JThetaXGr, method = "CG", control = list(maxit=500000, abstol=1e-6) )
    22 optimCG$par # display the training reslut, i.e. best value has been found for parameters  

    4. 预测:
    通过训练得到参数值后,就可以通过推荐系统(一)中提到的计算公式:  θT*x + 平均值, 来计算每个用户对还没有评价电影的评分。通过比较计算得到评分,就能找到用户偏好的电影推荐给他。代码如下:

     1 ## Find the top-n un-rating movies of the jth user
     2 ## Args :
     3 ##     topn  -  a positive integer
     4 ##     theta_j  -  a vector, contains theta parameters of jth user
     5 ##     X  -  a matrix, contains X paramters
     6 ##     Y  -  a matrix, rating of movies by users 
     7 ##     R  -  a matrix, flags, which movies are rated by user
     8 ##     j  -  a interger, jth user
     9 ##     Ymean  -  a vector, contains mean of all movies
    10 ##     isAddMean  -  flag, is the mean added to rating credit, or not?
    11 ##                   TRUE  -  add the mean to rating credit
    12 ##                   FALSE  -  the mean will not be added to the rating credit
    13 ## Returns :
    14 ##     a list, contains the index of the top-n un-rating movies of jth user and the rating result
    15 findTopN <- function(topn, theta_j, X, R, j, Ymean=NULL, isAddMean=FALSE)
    16 {
    17     idx <- which(R[,j]==0)    
    18     Yvec <- rep(-1,length(R[,j]))
    19       
    20     if (isAddMean == TRUE) {
    21         if (is.null(Ymean)) {
    22             cat("Error: Ymean is NULL!!!
    ")
    23         }
    24         Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) )   +   as.matrix( Ymean[idx], ncol=1)
    25     } else {
    26         Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) )   
    27     }
    28     
    29     sortTating <- sort(Yvec, method = "sh", decreasing = TRUE, index.return = TRUE)
    30     objList <- list(ratingResult = Yvec, topnIndex = sortTating$ix[1:topn])
    31     return( objList )
    32 }

    (寻找N个数中最大的K个数的方法,可参考链接:http://zhangzhibiao02005.blog.163.com/blog/static/3736782020112123353695/)

  • 相关阅读:
    11 Apr 18 类与对象 对象的使用 一切皆对象
    10 Apr 18 项目构架 hashlib模块 subprocess模块 configparser模块 面向对象的编程 类与对象
    9 Apr 18 shelve模块 xml模块 re模块
    8 Apr 18 包的使用 常用模块
    定义一个Student类,包含名字一个数据成员,定义无参构造函数、有参构造函数、拷贝构造函数、析构函数及对于名字的封装函数,在main函数中实例化Student对象,并访问相关函数,观察运行结果。
    皮马印第安人糖尿病数据集免费下载分享
    机器学习之模型训练(二)皮马印第安人糖尿病数据集
    anaconda的虚拟环境sklearn中如何安装pandas
    人工智能之模型评估之参数选择
    人工智能之模型评估之数据分离
  • 原文地址:https://www.cnblogs.com/activeshj/p/3965438.html
Copyright © 2020-2023  润新知