• Python implementations of three Bandit algorithms


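    I have recently been reading about recommender systems and came across several basic approaches to the Bandit problem. Since I could not find a good code implementation online, I implemented the three classic algorithms described in the posts below.

    I won't go into the details of the algorithms here; see the following blog posts: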

    https://blog.csdn.net/z1185196212/article/details/53374194

    https://blog.csdn.net/dengxing1234/article/details/73188731

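    e_greedy algorithm:

    With probability 1-epsilon, choose the arm with the highest estimated reward so far (exploit); with probability epsilon, choose an arm uniformly at random (explore).

    import numpy as np
    
    T = 1000                                  # number of rounds
    N = 10                                    # number of arms
    true_award = np.random.uniform(0, 1, N)   # true payout probability of each arm
    estimated_award = np.zeros(N)             # running mean reward of each arm
    item_count = np.zeros(N)                  # number of times each arm was pulled
    epsilon = 0.1                             # exploration probability
    
    def e_greedy():
        explore = np.random.binomial(n=1, p=epsilon)
        if explore:
            # explore: pick an arm uniformly at random (a scalar index, not an array)
            item = np.random.randint(N)
        else:
            # exploit: pick the arm with the highest estimated mean reward
            item = np.argmax(estimated_award)
        award = np.random.binomial(n=1, p=true_award[item])
        return item, award
    
    total_award = 0
    for t in range(T):
        item, award = e_greedy()
        total_award += award
        item_count[item] += 1
        # incremental mean update, so argmax always compares means rather than sums
        estimated_award[item] += (award - estimated_award[item]) / item_count[item]
    
    print(true_award)
    print(estimated_award)
    print(total_award)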
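
    A per-arm estimate can be kept as a running mean that is updated after every pull, which avoids storing reward sums and dividing at the end. A minimal sketch, assuming a hypothetical helper `update_mean` (not part of the original code) and made-up rewards, that checks the incremental update against `np.mean`:

```python
import numpy as np

def update_mean(mean, count, reward):
    """Return the new (mean, count) after observing one more reward.

    Hypothetical helper for illustration:
    new_mean = mean + (reward - mean) / new_count
    """
    count += 1
    mean += (reward - mean) / count
    return mean, count

rewards = [1, 0, 0, 1, 1, 0, 1, 1]   # made-up 0/1 rewards for a single arm
mean, count = 0.0, 0
for r in rewards:
    mean, count = update_mean(mean, count, r)

# The running mean agrees with the batch mean of all observed rewards.
print(mean, np.mean(rewards))
```

    Because the estimate is a mean at every step, it can be compared across arms directly, regardless of how often each arm has been pulled.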

    Thompson Sampling algorithm:

    For each arm, draw a random number from beta(win[arm]+1, lose[arm]+1), and choose the arm with the largest draw as this round's arm.

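    import numpy as np
    
    T = 1000                                  # number of rounds
    N = 10                                    # number of arms
    true_award = np.random.uniform(0, 1, N)   # true payout probability of each arm
    win = np.zeros(N)                         # number of rewards of 1 per arm
    lose = np.zeros(N)                        # number of rewards of 0 per arm
    estimated_award = np.zeros(N)
    
    
    def Thompson_sampling():
        # draw one sample per arm from its Beta posterior (the +1 gives a uniform prior)
        arm_prob = [np.random.beta(win[i]+1, lose[i]+1) for i in range(N)]
        item = np.argmax(arm_prob)
        reward = np.random.binomial(n=1, p=true_award[item])
        return item, reward
    
    total_reward = 0
    for t in range(T):
        item, reward = Thompson_sampling()
        if reward == 1:
            win[item] += 1
        else:
            lose[item] += 1
        total_reward += reward
    
    for i in range(N):
        # empirical win rate; an arm that was never pulled keeps the estimate 0
        if win[i] + lose[i] > 0:
            estimated_award[i] = win[i] / (win[i] + lose[i])
    print(true_award)
    print(estimated_award)
    print(total_reward)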
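
    Thompson Sampling explores on its own because an arm with few observations has a wide Beta posterior, so its sample occasionally comes out largest. A small sketch of this, assuming the Beta(win+1, lose+1) posterior used above, made-up counts, and a hypothetical helper `beta_mean_std` (not part of the original code) built on the standard closed-form mean and variance of a Beta distribution:

```python
import numpy as np

def beta_mean_std(win, lose):
    """Mean and standard deviation of Beta(win + 1, lose + 1)."""
    a, b = win + 1.0, lose + 1.0
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, np.sqrt(var)

# Same observed win rate (60%), different amounts of data (made-up counts).
for win, lose in [(3, 2), (30, 20), (300, 200)]:
    mean, std = beta_mean_std(win, lose)
    print(f"win={win:3d} lose={lose:3d} mean={mean:.3f} std={std:.3f}")
```

    The mean moves toward 0.6 while the standard deviation shrinks as the counts grow, so a well-observed arm is sampled close to its empirical rate and a rarely pulled arm still gets occasional chances.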

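    UCB algorithm:

    Keep adjusting the probability estimate: the observed probability p' plus an error term delta is used as an upper estimate of the true probability p, and each round the arm with the largest p' + delta is chosen.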

    import numpy as np
    
    T = 1000   # number of rounds
    N = 10     # number of arms
    
    # true payout probability of each arm
    true_award = np.random.uniform(low=0, high=1, size=N)
    
    estimated_award = np.zeros(N)   # running mean reward of each arm
    choose_count = np.zeros(N)      # number of times each arm was pulled
    
    total_award = 0
    
    def cal_delta(t, item):
        if choose_count[item] == 0:
            # an arm that has never been pulled gets an infinite bonus, so it is tried first
            return np.inf
        else:
            return np.sqrt(2*np.log(t) / choose_count[item])
    
    def UCB(t, N):
        # upper confidence bound of each arm: estimated mean plus exploration bonus
        upper_bound_probs = [estimated_award[item] + cal_delta(t, item) for item in range(N)]
        item = np.argmax(upper_bound_probs)
        reward = np.random.binomial(n=1, p=true_award[item])
        return item, reward
    
    
    for t in range(1, T+1):
        item, reward = UCB(t, N)
        total_award += reward
    
        # update the running mean of the chosen arm
        estimated_award[item] = (choose_count[item]*estimated_award[item] + reward) / (choose_count[item]+1)
        choose_count[item] += 1
    
    print(true_award)
    print(estimated_award)
    print(total_award)
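
    To see how the error term delta steers exploration, it helps to tabulate it. The sketch below assumes the same bonus sqrt(2*ln(t)/n) as `cal_delta` above, wrapped in a hypothetical standalone function `ucb_bonus` (not part of the original code):

```python
import numpy as np

def ucb_bonus(t, n):
    """Exploration bonus sqrt(2 * ln(t) / n) for an arm pulled n >= 1 times by round t."""
    return np.sqrt(2 * np.log(t) / n)

t = 1000
for n in [1, 10, 100, 1000]:
    # The bonus shrinks as an arm is pulled more often.
    print(f"t={t} n={n:4d} bonus={ucb_bonus(t, n):.3f}")

for t in [10, 100, 1000]:
    # For a fixed pull count the bonus grows slowly with t,
    # so an arm that has been neglected for a long time is eventually retried.
    print(f"t={t:4d} n=10 bonus={ucb_bonus(t, 10):.3f}")
```

    The bonus falls quickly with the pull count n and rises only logarithmically with the round t, which is why exploration tapers off over time without stopping completely.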
  • Original article: https://www.cnblogs.com/liyinggang/p/14004216.html