• click through rate prediction


    The topics covered are shown in the figure below:

    Estimating the confidence level of a confidence interval by direct estimation:

    1. Direct estimation using the binomial distribution

    $P(0.04 \leq \hat{p} \leq 0.06) = \sum_{0.04n \leq k \leq 0.06n} \binom{n}{k}\, 0.05^{k}\, 0.95^{n-k}$

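    function prob = adPredict(n)  % name taken from the call in the plotting script
    low = ceil(n*0.04);    % round up
    high = floor(n*0.06);  % round down
    prob = 0;
    for i = low:1:high
        % binomial probability of exactly i clicks in n impressions
        prob = prob + nchoosek(n,i)*(0.05^i)*(0.95^(n-i));
    end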
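The same sum can be sketched in Python using only the standard library. This is a minimal illustration, not the author's code: the function name `binom_interval_prob` and its default arguments are assumptions, and each term is evaluated in log space with `math.lgamma` so that large binomial coefficients do not lose precision.

```python
import math

def binom_interval_prob(n, p=0.05, lo=0.04, hi=0.06):
    """P(lo <= k/n <= hi) for k ~ Binomial(n, p). Hypothetical helper name."""
    k_min = math.ceil(n * lo)    # smallest click count inside the interval
    k_max = math.floor(n * hi)   # largest click count inside the interval
    total = 0.0
    for k in range(k_min, k_max + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k)
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_term)
    return total

if __name__ == "__main__":
    for n in (500, 1000, 1500):
        print(n, binom_interval_prob(n))
```

As with the `ceil` and `floor` calls above, the interval boundaries are computed in floating point, so a product such as `n * lo` that should be an exact integer may occasionally round to the wrong side.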
    

    2. Normal approximation

    $\mu = p = 0.05,\quad \sigma^2 = \frac{p(1-p)}{n} = \frac{0.05 \times 0.95}{n}$
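For readers without MATLAB, the approximation can be sketched with the Python standard library. The helper names `normal_cdf` and `normal_interval_prob` are assumptions made for this illustration; the CDF is built from `math.erf`.

```python
import math

def normal_cdf(x, mu, sd):
    """CDF of a normal distribution with mean mu and standard deviation sd."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def normal_interval_prob(n, p=0.05, lo=0.04, hi=0.06):
    """Normal approximation to P(lo < p_hat < hi) with n samples."""
    sd = math.sqrt(p * (1 - p) / n)   # standard deviation of the estimate p_hat
    return normal_cdf(hi, p, sd) - normal_cdf(lo, p, sd)

if __name__ == "__main__":
    for n in (500, 1000, 1500):
        print(n, normal_interval_prob(n))
```

No continuity correction is applied here, so the value can sit slightly below the exact binomial sum for the same n.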

    normcdf(0.06,0.05,sigma/x(i)^0.5) - normcdf(0.04,0.05,sigma/x(i)^0.5)
    
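    warning off all;             % suppress warnings (nchoosek warns when the result is large)
    clear all; clc; close all;
    x = 500:1:1500;              % candidate sample sizes n
    y = zeros(1,size(x,2));      % exact binomial probability
    y2 = zeros(1,size(x,2));     % normal approximation
    sigma = sqrt(0.05*0.95);     % standard deviation of a single Bernoulli trial
    for i = 1:size(x,2)
        y(i) = adPredict(x(i));
        y2(i) = normcdf(0.06,0.05,sigma/x(i)^0.5) - normcdf(0.04,0.05,sigma/x(i)^0.5);
    end

    plot(x,y,'b-'); hold on;     % blue: binomial
    plot(x,y2,'r-');             % red: normal approximation
    x1 = [500 1500];
    y1 = [0.85 0.85];
    plot(x1,y1,'y-');            % yellow: 0.85 reference line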
    

    Plotting the curves shows that at about n = 1000 the confidence level reaches roughly 0.85.
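The value read off the plot can be cross-checked by inverting the normal approximation: the half-width 0.01 must equal z times the standard deviation of the estimate, which gives n = z^2 p(1-p) / 0.01^2. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the function name `min_n_for_confidence` is hypothetical.

```python
import math
from statistics import NormalDist

def min_n_for_confidence(conf=0.85, p=0.05, half_width=0.01):
    """Smallest n whose normal-approximation interval p +/- half_width has level conf."""
    # two-sided interval: put (1 - conf) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + conf / 2.0)
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)

if __name__ == "__main__":
    print(min_n_for_confidence())
```

Under these assumptions the result lands just under 1000, in line with the plotted curves.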

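    AUC: concept and computation:

    sklearn code: sklearn provides ready-made methods. Compute a set of TPR and FPR values and plot them; the AUC can also be obtained with a direct method call.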

    import numpy as np
    import matplotlib.pyplot as plt
    
    from sklearn.linear_model import LogisticRegression
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import roc_curve
    
    digits = datasets.load_digits()
    
    X, y = digits.data, digits.target
    X = StandardScaler().fit_transform(X)
    
    # classify small against large digits
    y = (y > 4).astype(int)
    X_train = X[:-400]
    y_train = y[:-400]
    
    X_test = X[-400:]
    y_test = y[-400:]
    
    # the L1 penalty needs a solver that supports it, such as liblinear
    lrg = LogisticRegression(penalty='l1', solver='liblinear')
    lrg.fit(X_train, y_train)
    
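    y_test_prob = lrg.predict_proba(X_test)  # column 1 holds P(y = 1)
    P = np.where(y_test == 1)[0].shape[0]    # number of positive test samples
    N = np.where(y_test == 0)[0].shape[0]    # number of negative test samples

    dt = 10001  # number of thresholds, evenly spaced over [0, 1]
    TPR = np.zeros((dt, 1))
    FPR = np.zeros((dt, 1))
    for i in range(dt):
        # predict positive when P(y = 1) is at least the current threshold
        y_test_p = y_test_prob[:, 1] >= i * (1.0 / (dt - 1))
        TP = np.where((y_test == 1) & y_test_p)[0].shape[0]
        FN = P - TP
        FP = np.where((y_test == 0) & y_test_p)[0].shape[0]
        TN = N - FP
        TPR[i] = TP * 1.0 / P
        FPR[i] = FP * 1.0 / N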
    
    
    
    plt.plot(FPR,TPR,color='black')
    plt.plot(np.array([[0],[1]]),np.array([[0],[1]]),color='red')
    plt.show()
    
    # alternatively, use sklearn's roc_curve:
    # fpr, tpr, thresholds = roc_curve(y_test,y_test_prob[:,1],pos_label=1)
    # plt.plot(fpr,tpr,color='black')
    # plt.plot(np.array([[0],[1]]),np.array([[0],[1]]),color='red')
    # plt.show()
    
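    # rank-based AUC: ranks are 1-based, ascending in predicted probability
    # (tied scores get arbitrary ranks, so this can differ slightly from roc_auc_score)
    rank = y_test_prob[:, 1].argsort()
    rank = rank.argsort() + 1
    auc = (sum(rank[np.where(y_test == 1)[0]]) - (P * 1.0 * (P + 1) / 2)) / (P * N)
    print(auc)
    print(roc_auc_score(y_test, y_test_prob[:, 1]))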
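The rank formula at the end of the script is the Mann-Whitney rank-sum form of the AUC: sum the ranks of the positive samples, subtract P(P+1)/2, and divide by P*N. A dependency-free sketch follows. It is an illustration under stated assumptions rather than the author's code: the name `rank_auc` is hypothetical, and tied scores are given their average rank, which the `argsort` version does not do.

```python
def rank_auc(labels, scores):
    """AUC from the rank sum of the positives, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # extend j to the end of the group of equal scores starting at i
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0   # 1-based average rank of the group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

if __name__ == "__main__":
    print(rank_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The function expects at least one positive and one negative label; otherwise the final division is by zero.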
  • Original article: https://www.cnblogs.com/porco/p/4533805.html