• Collaborative filtering: 《推荐系统实践》 (Recommender System Practice), Chapter 2


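    Using user behavior data

    Introduction:

    The simplest form in which user behavior exists on a website is the log.

    raw log ------> session log --> impression log or click log

    User behavior generally falls into two categories:

    1. Explicit feedback: actions in which the user clearly states a preference for an item (small data volume)

    2. Implicit feedback: page views and the like (large data volume)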

    A unified representation of user behavior is as follows:

    [Figure: unified representation of user behavior]

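    Collaborative filtering and experiment design:

    This post follows the book 《推荐系统实践》, but the details differ slightly from the book: stitching the book's code fragments together ran into a few small problems, so I made some minor modifications. The result runs, and I am sharing it here.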

     
     
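    Experimental dataset:

    The MovieLens dataset provided by GroupLens is used. Download: http://www.grouplens.org/node/73. To keep the experiments fast, this post uses the smaller dataset, namely the u.data file in ml-100k. The other files are not used; interested readers can consult the README.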

     

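    About the experimental data

    u.data has four tab-separated columns: user ID, movie ID, rating, and timestamp. This experiment is about top-N recommendation, so it only cares whether a user has watched a movie, not about the rating or the viewing time. Only the first two columns are therefore used.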
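
    As a minimal sketch of that preprocessing step, assuming the tab-separated layout described above: the helper name `parse_pairs` and the sample lines are made up for illustration and are not part of the original script.

```python
def parse_pairs(lines):
    """Keep only (user_id, movie_id) from tab-separated rating lines."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:      # skip blank or malformed lines
            continue
        pairs.append((fields[0], fields[1]))
    return pairs

# Illustrative lines in the assumed layout: user, movie, rating, timestamp.
sample = ["196\t242\t3\t881250949\n", "186\t302\t3\t891717742\n", "\n"]
print(parse_pairs(sample))
```

    The IDs stay as strings, which is enough here because they are only used as dictionary keys and set members.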

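    The offline experiment is designed as follows: randomly split the user behavior dataset into M parts, take M-1 parts as the training set and 1 part as the test set. This post uses M=8. Code: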

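    import random

    def SplitData(data, M=8, k=3, seed=1):
        """Randomly split (user, item) records: about 1/M go to test, the rest to train."""
        test = {}
        train = {}
        random.seed(seed)
        for user, item in data:
            # randint is inclusive on both ends, so draw from 0..M-1 to get exactly M buckets
            if random.randint(0, M - 1) == k:
                if user not in test:
                    test[user] = set()
                test[user].add(item)
            else:
                if user not in train:
                    train[user] = set()
                train[user].add(item)
        return train, test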
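
    A quick sanity check on such a split, as a sketch only: `split_pairs` below is a compact stand-in for SplitData, and both its name and the synthetic data are assumptions for illustration. With M buckets drawn uniformly from 0..M-1, roughly 1/M of the records should land in the test set, and the same seed should reproduce the same split.

```python
import random

def split_pairs(data, M=8, k=3, seed=1):
    """Send about 1/M of the (user, item) records to test, the rest to train."""
    rng = random.Random(seed)          # private generator keeps the split reproducible
    train, test = {}, {}
    for user, item in data:
        target = test if rng.randrange(M) == k else train
        target.setdefault(user, set()).add(item)
    return train, test

# Synthetic records: 200 users x 40 items, purely illustrative.
data = [(u, i) for u in range(200) for i in range(40)]
train, test = split_pairs(data)
n_train = sum(len(items) for items in train.values())
n_test = sum(len(items) for items in test.values())
print(n_test / (n_train + n_test))     # should be close to 1/8
```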

     

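    Evaluation metrics:

    Recommend N items to user u, denoted R(u), and let T(u) be the set of items that user u likes in the test set. Then define

    Recall: $\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|}$

    Recall for the two algorithms (UserCF first, then ItemCF) is computed as follows: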

    def Recall(train, test, N):
        """Recall of the UserCF top-N recommendations."""
        hit = 0
        alls = 0
        W = UserSimilarity2(train)
        for user in train.keys():
            if user not in test:  # this user has no items in the test set
                continue
            te_user_item = test[user]
            recomRank = Recommend(user, train, W, N)
            for recom_item, w in recomRank:
                if recom_item in te_user_item:
                    hit += 1
            alls += len(te_user_item)
        return hit * 1.0 / alls

    # ItemCF recall
    def ItemRecall(train, test, N):
        """Recall of the ItemCF top-N recommendations."""
        hit = 0
        alls = 0
        W = ItemSimilarity(train)
        for user in train.keys():
            if user not in test:  # this user has no items in the test set
                continue
            te_user_item = test[user]
            recomRank = ItemRecommendation(user, train, W, N)
            for recom_item, w in recomRank:
                if recom_item in te_user_item:
                    hit += 1
            alls += len(te_user_item)
        return hit * 1.0 / alls

    Precision: $\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|}$

    def Precision(train, test, N):
        """Precision of the UserCF top-N recommendations."""
        hit = 0
        alls = 0
        W = UserSimilarity2(train)
        for user in train.keys():
            if user not in test:  # this user has no items in the test set
                continue
            te_user_item = test[user]
            recomRank = Recommend(user, train, W, N)
            for recom_item, w in recomRank:
                if recom_item in te_user_item:
                    hit += 1
            alls += N  # each user is counted as receiving N recommendations
        return hit * 1.0 / alls


    def ItemPrecision(train, test, N):
        """Precision of the ItemCF top-N recommendations."""
        hit = 0
        alls = 0
        W = ItemSimilarity(train)
        for user in train.keys():
            if user not in test:  # this user has no items in the test set
                continue
            te_user_item = test[user]
            recomRank = ItemRecommendation(user, train, W, N)
            for recom_item, w in recomRank:
                if recom_item in te_user_item:
                    hit += 1
            alls += N  # each user is counted as receiving N recommendations
        return hit * 1.0 / alls
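
    To make recall and precision concrete, here is a small worked sketch on made-up recommendation lists R(u) and test sets T(u). The names `recs`, `held_out` and `recall_precision` are illustrative and not part of the code above.

```python
def recall_precision(recs, held_out):
    """Return (recall, precision), with hits and totals summed over users."""
    hit = n_test = n_rec = 0
    for user, test_items in held_out.items():
        rec_items = recs.get(user, [])
        hit += len(set(rec_items) & test_items)
        n_test += len(test_items)
        n_rec += len(rec_items)
    return hit / n_test, hit / n_rec

recs = {"A": ["x", "y", "z"], "B": ["x", "w", "v"]}   # R(u), N = 3
held_out = {"A": {"x", "q"}, "B": {"w", "v", "p"}}    # T(u)
recall, precision = recall_precision(recs, held_out)
print(recall, precision)   # 3 hits / 5 test items, 3 hits / 6 recommendations
```

    The two metrics share the numerator and differ only in what the hits are divided by, which is why a longer list tends to raise recall while lowering precision.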

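    Coverage: reflects how well a recommendation algorithm uncovers the long tail. The higher the coverage, the more long-tail items the algorithm is able to recommend to users. A simple definition is the size of the union of all recommended items divided by the total number of items (the code below counts the items that appear in the training set):

    $\mathrm{Coverage} = \frac{|\bigcup_{u \in U} R(u)|}{|I|}$

    Coverage for the two algorithms is computed as follows: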

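    def Coverage(train, N):
        """Coverage of UserCF: share of training items recommended to at least one user."""
        recommend_items = set()
        all_items = set()
        W = UserSimilarity2(train)
        for user in train.keys():
            for item in train[user]:
                all_items.add(item)
            rank = Recommend(user, train, W, N)
            # rank is a list of (item, score) pairs; collect the items only
            for item, w in rank:
                recommend_items.add(item)
        return len(recommend_items) / (len(all_items) * 1.0)

    def ItemCoverage(train, N):
        """Coverage of ItemCF: share of training items recommended to at least one user."""
        recommend_items = set()
        all_items = set()
        W = ItemSimilarity(train)
        for user in train.keys():
            for item in train[user]:
                all_items.add(item)
            rank = ItemRecommendation(user, train, W, N)
            # rank is a list of (item, score) pairs; collect the items only
            for item, w in rank:
                recommend_items.add(item)
        return len(recommend_items) / (len(all_items) * 1.0)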
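
    A toy illustration of what coverage measures, under made-up data: a recommender that only ever suggests the same popular items scores low, while one that reaches into the rest of the catalogue scores high. The helper `coverage` and both recommendation tables are hypothetical.

```python
def coverage(recs, all_items):
    """Fraction of the catalogue that appears in at least one recommendation list."""
    recommended = set()
    for rec_items in recs.values():
        recommended.update(rec_items)
    return len(recommended & all_items) / len(all_items)

all_items = {"a", "b", "c", "d", "e"}
popular_only = {"u1": ["a", "b"], "u2": ["a", "b"], "u3": ["b", "a"]}
long_tail = {"u1": ["a", "c"], "u2": ["b", "d"], "u3": ["e", "a"]}
print(coverage(popular_only, all_items))  # 2 of 5 items
print(coverage(long_tail, all_items))     # 5 of 5 items
```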

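    User-based collaborative filtering, User_CF:

    Algorithm outline:

    1) Find the set of users whose interests are similar to the target user's.

    2) From the items liked by the users in that set, recommend those the target user has not yet heard of.

    Similarity is computed from N(u), the set of movies user u has watched. In the cosine form that the code in this post uses:

    $w_{uv} = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}$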
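
    As an illustration of a set-based user similarity, the sketch below applies cosine similarity to N(u), which is also the normalisation UserSimilarity2 applies in the code later in this post. The users, movies and the helper name `cosine` are made up.

```python
from math import sqrt

def cosine(nu, nv):
    """|N(u) & N(v)| / sqrt(|N(u)| * |N(v)|) for two item sets."""
    if not nu or not nv:
        return 0.0
    return len(nu & nv) / sqrt(len(nu) * len(nv))

watched = {
    "A": {"m1", "m2", "m4"},
    "B": {"m1", "m3"},
    "C": {"m5"},
}
print(cosine(watched["A"], watched["B"]))  # 1 / sqrt(3 * 2)
print(cosine(watched["A"], watched["C"]))  # no overlap
```

    Comparing every pair of users this way wastes time on pairs with no movie in common, which is why the code in this post first builds an item-to-users index and only visits pairs that co-occur.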

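    The following formula measures how interested user u is in item i under UserCF:

    $p(u,i) = \sum_{v \in S(u,K) \cap N(i)} w_{uv}\, r_{vi}$, where S(u,K) is the set of the K users most similar to u, N(i) is the set of users who have interacted with item i, and $r_{vi}$ is user v's interest in item i (fixed at 1 here, since only implicit feedback is used).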
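
    A small sketch of this scoring step, assuming the similarities are already known. The similarity values, the data and the function name `user_cf_scores` are hypothetical, chosen so the arithmetic is easy to follow.

```python
def user_cf_scores(user, watched, sim, K=2):
    """Score unseen items by summing the similarity of the K nearest users who watched them."""
    neighbours = sorted(sim[user].items(), key=lambda kv: kv[1], reverse=True)[:K]
    scores = {}
    for v, w_uv in neighbours:
        for item in watched[v]:
            if item not in watched[user]:
                scores[item] = scores.get(item, 0.0) + w_uv   # r_vi = 1 for implicit feedback
    return scores

watched = {"A": {"m1", "m2"}, "B": {"m1", "m3"}, "C": {"m2", "m3", "m4"}, "D": {"m5"}}
sim = {"A": {"B": 0.5, "C": 0.4, "D": 0.1}}   # hypothetical similarities w_Av
scores = user_cf_scores("A", watched, sim)
print(scores)   # m3 is backed by both B and C; m5 is dropped because D is outside the top K
```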

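    Item-based collaborative filtering, Item_CF

    Steps:

    1) Compute the similarity between items.

    2) Generate a recommendation list for the user from the item similarities and the user's behavior history.

    Let N(i) denote the set of users who like item i. The similarity between items i and j can then be written as: $w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}$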
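
    A compact sketch of item similarity computed from co-occurrence counts, on made-up data. It follows the cosine form used by ItemSimilarity in this post; the function name `item_similarity` and the use of `defaultdict` and `permutations` are choices made for this illustration.

```python
from collections import defaultdict
from itertools import permutations
from math import sqrt

def item_similarity(watched):
    """Return {(i, j): w_ij} with w_ij = co-occurrences / sqrt(|N(i)| * |N(j)|)."""
    co = defaultdict(int)   # co[(i, j)]: users who watched both i and j
    n = defaultdict(int)    # n[i]: users who watched i
    for items in watched.values():
        for i in items:
            n[i] += 1
        for i, j in permutations(items, 2):
            co[(i, j)] += 1
    return {pair: c / sqrt(n[pair[0]] * n[pair[1]]) for pair, c in co.items()}

watched = {"A": {"m1", "m2"}, "B": {"m1", "m2", "m3"}, "C": {"m2", "m3"}}
w = item_similarity(watched)
print(w[("m1", "m3")])   # 1 shared user / sqrt(2 * 2)
print(w[("m1", "m2")])   # 2 shared users / sqrt(2 * 3)
```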

    The code that computes the similarity matrices (user-user for USER_CF, item-item for ITEM_CF) is as follows:

    from math import log, sqrt

    def UserSimilarity2(train, flag=1):
        """User-user similarity. flag=0: plain cosine; flag=1: popular items are down-weighted."""
        # Build the inverted index: item -> set of users who watched it
        item_users = dict()
        for u, items in train.items():
            for i in items:
                if i not in item_users:
                    item_users[i] = set()
                item_users[i].add(u)

        C = dict()  # C[(u, v)]: (weighted) count of items watched by both u and v
        N = dict()  # N[u]: number of items (movies) user u has watched
        for item, users in item_users.items():
            for u in users:
                if u not in N:
                    N[u] = 1
                else:
                    N[u] += 1
                for v in users:
                    if u != v:
                        if flag == 0:  # plain co-occurrence count
                            if (u, v) not in C:
                                C[(u, v)] = 1
                            else:
                                C[(u, v)] += 1
                        elif flag == 1:  # items watched by many users count for less
                            if (u, v) not in C:
                                C[(u, v)] = 1 / log(1 + len(users))
                            else:
                                C[(u, v)] += 1 / log(1 + len(users))

        W = dict()
        for uv in C.keys():
            u = uv[0]
            v = uv[1]
            if u not in W:
                W[u] = set()
            # store (v, w_uv): a user v related to u, together with their weight
            W[u].add((v, C[uv] / sqrt(N[u] * N[v])))
        return W


    def ItemSimilarity(train):
        """Item-item cosine similarity based on co-occurrence in users' histories."""
        C = dict()  # C[(i, j)]: number of users who like both i and j, |N(i) ∩ N(j)|
        N = dict()  # N[i]: number of users who like item i, |N(i)|
        for u, items in train.items():
            for i in items:
                if i not in N:
                    N[i] = 1
                else:
                    N[i] += 1
                for j in items:
                    if i != j:
                        if (i, j) not in C:
                            C[(i, j)] = 1
                        else:
                            C[(i, j)] += 1

        # compute the final similarities
        W = dict()
        for ij, val in C.items():
            i = ij[0]  # item i
            j = ij[1]  # item j
            if i not in W:
                W[i] = set()
            W[i].add((j, val / sqrt(N[i] * N[j])))
        return W
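
    One detail of UserSimilarity2 that is easy to miss: with flag=1, each shared item contributes 1/log(1 + number of users who watched it) instead of 1, so agreeing on a very popular movie counts for less than agreeing on an obscure one. A minimal sketch of the effect, with made-up popularity counts and a hypothetical helper `user_sim`:

```python
from math import log, sqrt

def user_sim(nu, nv, item_popularity, penalise_popular=True):
    """Cosine-style similarity; optionally weight each shared item by 1/log(1 + popularity)."""
    shared = nu & nv
    if penalise_popular:
        overlap = sum(1 / log(1 + item_popularity[i]) for i in shared)
    else:
        overlap = len(shared)
    return overlap / sqrt(len(nu) * len(nv))

item_popularity = {"hit": 1000, "niche": 3, "other": 50}   # hypothetical viewer counts
u = {"hit", "niche"}
v_hit = {"hit", "other"}       # shares only the blockbuster with u
v_niche = {"niche", "other"}   # shares only the obscure movie with u

# Plain cosine cannot tell the two neighbours apart.
print(user_sim(u, v_hit, item_popularity, False), user_sim(u, v_niche, item_popularity, False))
# The weighted version ranks the neighbour with the shared niche movie higher.
print(user_sim(u, v_hit, item_popularity), user_sim(u, v_niche, item_popularity))
```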

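    The interest of user u in item j is computed as follows:

    $p(u,j) = \sum_{i \in N(u),\ j \in S(i,K)} w_{ij}\, r_{ui}$, where N(u) is the set of items user u has watched, S(i,K) is the set of the K items most similar to item i, and $r_{ui}$ is user u's interest in item i (fixed at 1 here).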
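
    A small sketch of the ItemCF scoring step under assumed inputs: for each movie the user has watched, look at its K most similar movies and add their similarities to the candidates' scores. The similarity table and the name `item_cf_scores` are hypothetical.

```python
def item_cf_scores(user_items, sim, K=2):
    """Accumulate, per unseen item j, the similarity w_ij over the user's items i."""
    scores = {}
    for i in user_items:
        nearest = sorted(sim.get(i, {}).items(), key=lambda kv: kv[1], reverse=True)[:K]
        for j, w_ij in nearest:
            if j not in user_items:
                scores[j] = scores.get(j, 0.0) + w_ij   # r_ui = 1 for implicit feedback
    return scores

sim = {   # hypothetical item-item similarities
    "m1": {"m2": 0.8, "m3": 0.5, "m4": 0.1},
    "m2": {"m1": 0.8, "m3": 0.6, "m4": 0.3},
}
scores = item_cf_scores({"m1", "m2"}, sim)
print(scores)   # only m3 is scored: m4 never makes a top-2 neighbour list
```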

    The top-N recommendation code for UserCF and ItemCF is as follows:

    def Recommend(user, train, W, N, K=10):
        """UserCF: top-N (item, score) pairs for `user`, based on the K most similar users."""
        rank = dict()  # rank[item] = accumulated interest score
        interacted_items = train[user]
        # a user with no neighbours simply gets an empty list
        neighbours = sorted(W.get(user, ()), key=lambda x: x[1], reverse=True)[0:K]
        for v, wuv in neighbours:
            for i in train[v]:  # movies watched by v
                if i not in interacted_items:  # skip movies the user has already watched
                    if i not in rank:
                        rank[i] = wuv * 1
                    else:
                        rank[i] += wuv * 1
        rank = sorted(rank.items(), key=lambda x: x[1], reverse=True)
        return rank[:N]


    def ItemRecommendation(user, train, W, N, K=10):
        """ItemCF: top-N (item, score) pairs for `user`, based on each watched item's K nearest items."""
        rank = dict()  # rank[item] = accumulated interest score
        user_items = train[user]
        for i in user_items:
            for j, wij in sorted(W.get(i, ()), key=lambda x: x[1], reverse=True)[0:K]:
                if j not in user_items:
                    if j not in rank:
                        rank[j] = wij * 1
                    else:
                        rank[j] += wij * 1
        rank = sorted(rank.items(), key=lambda x: x[1], reverse=True)
        return rank[:N]

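    With M=8, N=10 and K=10, the run reported the following output:

    ItemCoverage:  0.601796407186
    ItemRecall: 0.172728085068
    ItemPrecision: 0.208972972973
    Recall  0.165132695916
    Precision  0.199783783784
    Coverage 0.698203592814

    Other results can be obtained by adjusting the parameters.

    The full code is as follows: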

    # -*- coding: utf-8 -*-
    '''
    Created on 2014-04-16

    @author: Administrator
    '''
    import random
    from math import log, sqrt


    def SplitData(data, M=8, k=3, seed=1):
        """Randomly split (user, item) records: about 1/M go to test, the rest to train."""
        test = {}
        train = {}
        random.seed(seed)
        for user, item in data:
            # randint is inclusive on both ends, so draw from 0..M-1 to get exactly M buckets
            if random.randint(0, M - 1) == k:
                if user not in test:
                    test[user] = set()
                test[user].add(item)
            else:
                if user not in train:
                    train[user] = set()
                train[user].add(item)
        return train, test

    # UserCF recall
    def Recall(train, test, N):
        hit = 0
        alls = 0
        W = UserSimilarity2(train)
        for user in train.keys():
            if user not in test:  # this user has no items in the test set
                continue
            te_user_item = test[user]
            recomRank = Recommend(user, train, W, N)
            for recom_item, w in recomRank:
                if recom_item in te_user_item:
                    hit += 1
            alls += len(te_user_item)
        return hit * 1.0 / alls

    # ItemCF recall
    def ItemRecall(train, test, N):
        hit = 0
        alls = 0
        W = ItemSimilarity(train)
        for user in train.keys():
            if user not in test:  # this user has no items in the test set
                continue
            te_user_item = test[user]
            recomRank = ItemRecommendation(user, train, W, N)
            for recom_item, w in recomRank:
                if recom_item in te_user_item:
                    hit += 1
            alls += len(te_user_item)
        return hit * 1.0 / alls

    # UserCF precision
    def Precision(train, test, N):
        hit = 0
        alls = 0
        W = UserSimilarity2(train)
        for user in train.keys():
            if user not in test:  # this user has no items in the test set
                continue
            te_user_item = test[user]
            recomRank = Recommend(user, train, W, N)
            for recom_item, w in recomRank:
                if recom_item in te_user_item:
                    hit += 1
            alls += N  # each user is counted as receiving N recommendations
        return hit * 1.0 / alls

    # ItemCF precision
    def ItemPrecision(train, test, N):
        hit = 0
        alls = 0
        W = ItemSimilarity(train)
        for user in train.keys():
            if user not in test:  # this user has no items in the test set
                continue
            te_user_item = test[user]
            recomRank = ItemRecommendation(user, train, W, N)
            for recom_item, w in recomRank:
                if recom_item in te_user_item:
                    hit += 1
            alls += N  # each user is counted as receiving N recommendations
        return hit * 1.0 / alls
        
    # Coverage
    # UserCF
    def Coverage(train, N):
        recommend_items = set()
        all_items = set()
        W = UserSimilarity2(train)
        for user in train.keys():
            for item in train[user]:
                all_items.add(item)
            rank = Recommend(user, train, W, N)
            # rank is a list of (item, score) pairs; collect the items only
            for item, w in rank:
                recommend_items.add(item)
        return len(recommend_items) / (len(all_items) * 1.0)

    # ItemCF
    def ItemCoverage(train, N):
        recommend_items = set()
        all_items = set()
        W = ItemSimilarity(train)
        for user in train.keys():
            for item in train[user]:
                all_items.add(item)
            rank = ItemRecommendation(user, train, W, N)
            # rank is a list of (item, score) pairs; collect the items only
            for item, w in rank:
                recommend_items.add(item)
        return len(recommend_items) / (len(all_items) * 1.0)
    
    def UserSimilarity2(train, flag=1):
        # Second way of computing W. flag=0: plain cosine; flag=1: popular items are down-weighted.
        # Build the inverted index: item -> set of users who watched it
        item_users = dict()
        for u, items in train.items():
            for i in items:
                if i not in item_users:
                    item_users[i] = set()
                item_users[i].add(u)

        C = dict()  # C[(u, v)]: (weighted) count of items watched by both u and v
        N = dict()  # N[u]: number of items (movies) user u has watched
        for item, users in item_users.items():
            for u in users:
                if u not in N:
                    N[u] = 1
                else:
                    N[u] += 1
                for v in users:
                    if u != v:
                        if flag == 0:  # plain co-occurrence count
                            if (u, v) not in C:
                                C[(u, v)] = 1
                            else:
                                C[(u, v)] += 1
                        elif flag == 1:  # items watched by many users count for less
                            if (u, v) not in C:
                                C[(u, v)] = 1 / log(1 + len(users))
                            else:
                                C[(u, v)] += 1 / log(1 + len(users))

        W = dict()
        for uv in C.keys():
            u = uv[0]
            v = uv[1]
            if u not in W:
                W[u] = set()
            # store (v, w_uv): a user v related to u, together with their weight
            W[u].add((v, C[uv] / sqrt(N[u] * N[v])))
        return W


    def ItemSimilarity(train):
        C = dict()  # C[(i, j)]: number of users who like both i and j, |N(i) ∩ N(j)|
        N = dict()  # N[i]: number of users who like item i, |N(i)|
        for u, items in train.items():
            for i in items:
                if i not in N:
                    N[i] = 1
                else:
                    N[i] += 1
                for j in items:
                    if i != j:
                        if (i, j) not in C:
                            C[(i, j)] = 1
                        else:
                            C[(i, j)] += 1

        # compute the final similarities
        W = dict()
        for ij, val in C.items():
            i = ij[0]  # item i
            j = ij[1]  # item j
            if i not in W:
                W[i] = set()
            W[i].add((j, val / sqrt(N[i] * N[j])))
        return W
    
    # Collect the items to recommend in rank,
    # a dict with rank[item] = recommendation strength,
    # and return the top N as (item, score) pairs.
    def Recommend(user, train, W, N, K=10):
        rank = dict()
        interacted_items = train[user]
        # a user with no neighbours simply gets an empty list
        neighbours = sorted(W.get(user, ()), key=lambda x: x[1], reverse=True)[0:K]
        for v, wuv in neighbours:
            for i in train[v]:  # movies watched by v
                if i not in interacted_items:  # skip movies the user has already watched
                    if i not in rank:
                        rank[i] = wuv * 1
                    else:
                        rank[i] += wuv * 1
        rank = sorted(rank.items(), key=lambda x: x[1], reverse=True)
        return rank[:N]


    def ItemRecommendation(user, train, W, N, K=10):
        rank = dict()
        user_items = train[user]
        for i in user_items:
            for j, wij in sorted(W.get(i, ()), key=lambda x: x[1], reverse=True)[0:K]:
                if j not in user_items:
                    if j not in rank:
                        rank[j] = wij * 1
                    else:
                        rank[j] += wij * 1
        rank = sorted(rank.items(), key=lambda x: x[1], reverse=True)
        return rank[:N]
    
    
    
    # Load the (user, movie) pairs from the tab-separated u.data file
    data = []
    with open('u.data') as f:
        for line in f:
            data.append(line.split('\t')[:2])

    train, test = SplitData(data)

    print('ItemCoverage:  %s' % ItemCoverage(train, 10))
    print('ItemRecall: %s' % ItemRecall(train, test, 10))
    print('ItemPrecision: %s' % ItemPrecision(train, test, 10))

    print('Recall  %s' % Recall(train, test, 10))
    print('Precision  %s' % Precision(train, test, 10))
    print('Coverage %s' % Coverage(train, 10))

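    Summary:

    I am a bit slow and was not quite sure how to put the code from 《推荐系统实践》 together, so I adapted it myself. I hope it can serve as a reference for everyone. PS: the dataset is not that small, so a run takes a while.

    Reference: 推荐系统实践

    When reposting, please cite: http://www.cnblogs.com/Dzhouqi/p/3668919.html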
