• Mahout推荐算法ItemBased


    Mahout推荐的ItemBased

    一、   算法原理

    (一)    基本的

    下面的例子,参见图评分矩阵:表现user,归类为item.

    图(1)


        该算法的原理:

    1.  计算Item之间的相似度。

    2.  对用户U做推荐


    公式(一)

    Map tmp ;

    Map tmp1 ;

    for(item a  in userRatedItems){

      rate  =userforItemRate(a)

      ListsimItem =getSimItem(a);

      For(Jin simItem){

        Item b =j;

        Simab=sim(a,b);

        Tmp.add(b,Tmp .get(b)+simab*rate)

    tmp1.add(b, tmp1.get(b)+simab)

    }

    }

    Maptmp2=temp/temp1

    Sortbyval(tmp2)

    return topK(tmp2,k)

     

    (二)    相似度计算

    1.  Cos相似度


    公式(二)

    2.  皮尔逊相似度


    公式(三)

    3.  调整的cos相似度


    公式(四)

    (三)    採样

    计算全量的itemPair之间的相似度耗费大量的时间。也是没有必要的,所以须要採样,减小计算量。

    二、   单机模式实现

    (一)    候选Item搜索

    计算全部Item Pair之间的相似度在单机模式下是不现实的,须要在海量的候选集中搜索出一部分最有可能的候选集用于计算。

    Mahout提供了4中候选Item选择策略。

    1.  AllSimilarItemsCandidateItemsStrategy

    @Override

      FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

        FastIDSet candidateItemIDs = new FastIDSet();

        for (long itemID : preferredItemIDs) {

          candidateItemIDs.addAll(similarity.allSimilarItemIDs(itemID));

        }

        candidateItemIDs.removeAll(preferredItemIDs);

        return candidateItemIDs;

      }

    2.  AllUnknownItemsCandidateItemsStrategy

    @Override

      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

        FastIDSet possibleItemIDs = new FastIDSet(dataModel.getNumItems());

        LongPrimitiveIterator allItemIDs = dataModel.getItemIDs();

        while (allItemIDs.hasNext()) {

          possibleItemIDs.add(allItemIDs.nextLong());

        }

        possibleItemIDs.removeAll(preferredItemIDs);

        return possibleItemIDs;

      }

    3.  PreferredItemsNeighborhoodCandidateItemsStrategy

      @Override

      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

        FastIDSet possibleItemsIDs = new FastIDSet();

        for (long itemID : preferredItemIDs) {

          PreferenceArray itemPreferences = dataModel.getPreferencesForItem(itemID);

          int numUsersPreferringItem = itemPreferences.length();

          for (int index = 0; index < numUsersPreferringItem; index++) {

            possibleItemsIDs.addAll(dataModel.getItemIDsFromUser(itemPreferences.getUserID(index)));

          }

        }

        possibleItemsIDs.removeAll(preferredItemIDs);

        return possibleItemsIDs;

      }

    4.  SamplingCandidateItemsStrategy

    private static int computeMaxFrom(int factor, int numThings) {

        if (factor == NO_LIMIT_FACTOR) {

          return MAX_LIMIT;

        }

        long max = (long) (factor * (1.0 + Math.log(numThings) / LOG2));

        return max > MAX_LIMIT ? MAX_LIMIT : (int) max;

      }

     

      @Override

      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

        LongPrimitiveIterator preferredItemIDsIterator = new LongPrimitiveArrayIterator(preferredItemIDs);

        if (preferredItemIDs.length > maxItems) {

          double samplingRate = (double) maxItems / preferredItemIDs.length;

    //      log.info("preferredItemIDs.length {}, samplingRate {}", preferredItemIDs.length, samplingRate);

          preferredItemIDsIterator =

              new SamplingLongPrimitiveIterator(preferredItemIDsIterator, samplingRate);

        }

        FastIDSet possibleItemsIDs = new FastIDSet();

        while (preferredItemIDsIterator.hasNext()) {

          long itemID = preferredItemIDsIterator.nextLong();

          PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);

          int prefsLength = prefs.length();

          if (prefsLength > maxUsersPerItem) {

            Iterator<Preference> sampledPrefs =

                new FixedSizeSamplingIterator<Preference>(maxUsersPerItem, prefs.iterator());

            while (sampledPrefs.hasNext()) {

              addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(sampledPrefs.next().getUserID()));

            }

          } else {

            for (int i = 0; i < prefsLength; i++) {

              addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(prefs.getUserID(i)));

            }

          }

        }

        possibleItemsIDs.removeAll(preferredItemIDs);

        return possibleItemsIDs;

      }

     

      private void addSomeOf(FastIDSet possibleItemIDs, FastIDSet itemIDs) {

        if (itemIDs.size() > maxItemsPerUser) {

          LongPrimitiveIterator it =

              new SamplingLongPrimitiveIterator(itemIDs.iterator(), (double) maxItemsPerUser / itemIDs.size());

          while (it.hasNext()) {

            possibleItemIDs.add(it.nextLong());

          }

        } else {

          possibleItemIDs.addAll(itemIDs);

        }

      }

    (二)    估值

    protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)

        throws TasteException {

        double preference = 0.0;

        double totalSimilarity = 0.0;

        int count = 0;

        double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());

        for (int i = 0; i < similarities.length; i++) {

          double theSimilarity = similarities[i];

          if (!Double.isNaN(theSimilarity)) {

            // Weights can be negative!

            preference += theSimilarity * preferencesFromUser.getValue(i);

            totalSimilarity += theSimilarity;

            count++;

          }

        }

        // Throw out the estimate if it was based on no data points, of course, but also if based on

        // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.

        // The reason is that in this case the estimate is, simply, the user's rating for one item

        // that happened to have a defined similarity. The similarity score doesn't matter, and that

        // seems like a bad situation.

        if (count <= 1) {

          return Float.NaN;

        }

        float estimate = (float) (preference / totalSimilarity);

        if (capper != null) {

          estimate = capper.capEstimate(estimate);

        }

        return estimate;

      }

    (三)    推荐

    1.  依据历史评分列表推荐

    这样的推荐方式依据用户之前产生过评分的item做推荐。推荐结果依照预计值的大小排序。

    @Override

      public List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer) throws TasteException {

        Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");

        log.debug("Recommending items for user ID '{}'", userID);

        PreferenceArray preferencesFromUser = getDataModel().getPreferencesFromUser(userID);

        if (preferencesFromUser.length() == 0) {

          return Collections.emptyList();

        }

        FastIDSet possibleItemIDs = getAllOtherItems(userID, preferencesFromUser);

        TopItems.Estimator<Long> estimator = new Estimator(userID, preferencesFromUser);

        List<RecommendedItem> topItems = TopItems.getTopItems(howMany, possibleItemIDs.iterator(), rescorer,

          estimator);

        log.debug("Recommendations are: {}", topItems);

        return topItems;

      }

    2.  Because推荐

    这样的推荐方式用于实时推荐。

    @Override

      public List<RecommendedItem> recommendedBecause(long userID, long itemID, int howMany) throws TasteException {

        Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");

        DataModel model = getDataModel();

        TopItems.Estimator<Long> estimator = new RecommendedBecauseEstimator(userID, itemID);

        PreferenceArray prefs = model.getPreferencesFromUser(userID);

        int size = prefs.length();

        FastIDSet allUserItems = new FastIDSet(size);

        for (int i = 0; i < size; i++) {

          allUserItems.add(prefs.getItemID(i));

        }

        allUserItems.remove(itemID);

        return TopItems.getTopItems(howMany, allUserItems.iterator(), null, estimator);

      }

     

    //估值方法

    @Override

    public double estimate(Long itemID) throws TasteException {

          Float pref = getDataModel().getPreferenceValue(userID, itemID);

          if (pref == null) {

            return Float.NaN;

          }

          double similarityValue = similarity.itemSimilarity(recommendedItemID, itemID);

          return (1.0 + similarityValue) * pref;

        }

    三、   MapReduce模式实现

    (一)    将偏好文件转换成评分矩阵(PreparePreferenceMatrixJob)

    (二)    计算共现矩阵相似度(RowSimilarityJob)

    (三)    挑选最相似的K个Item

    (四)    用户偏好向量和相似降维后的共现矩阵做乘法

    (五)    过滤制定的user item

    (六)    生成终于的推荐结果

    四、   实例演示

    1.  单机模式

    1)  批量推荐

    DataModel  dataModel = new FileDataModel(new File("p/pereference"));

     

    ItemSimilarity  similarity  = new PearsonCorrelationSimilarity(dataModel);

     

    ItemBasedRecommender  recommender = new GenericItemBasedRecommender(dataModel,similarity );

     

    System.out.println(recommender.recommend(10, 10));

    2)  Because推荐

    DataModel  dataModel = new FileDataModel(new File("p/pereference"));

     

    ItemSimilarity  similarity  = new PearsonCorrelationSimilarity(dataModel);

     

    ItemBasedRecommender  recommender = new GenericItemBasedRecommender(dataModel,similarity );

     

    System.out.println(recommender.recommendedBecause(10, 10328, 100));

    2.  MapReduce模式

    API

    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args)

    --input

    偏好数据路径,文本文件。

    格式 userid itemid preference

    --output

    推荐结果路径

    -- numRecommendations

    推荐个数

    --usersFile

    须要做出推荐的user,默认所有做推荐

    --itemsFile

    须要做出推荐的item,默认所有做推荐

    --filterFile

    文件格式文本。useriditemid 。目的是给userid的用户不要推荐itemid的item

    --booleanData

    是否是布尔数据

    --maxPrefsPerUser

    最大偏好值

    --minPrefsPerUser

    最小偏好值

    --maxSimilaritiesPerItem

    给每个Item计算最多的相似item数目

    --maxPrefsPerUserInItemSimilarity

    ItemSimilarity预计item相似度时,对每个user最多偏好数目

    --similarityClassname

    SIMILARITY_PEARSON_CORRELATION、SIMILARITY_COOCCURRENCE、SIMILARITY_LOGLIKELIHOOD、SIMILARITY_TANIMOTO_COEFFICIENT、SIMILARITY_CITY_BLOCK、SIMILARITY_COSINE、SIMILARITY_EUCLIDEAN_DISTANCE

    --threshold

    删除低于该阈值的item对

    --outputPathForSimilarityMatrix

    指定生成的item相似矩阵路径,文本文件,格式为 itemA itemB 相似值

        实例

    String  [] args ={"--input","p",

    "--output","recommender",

    "--numRecommendations","10",

    "--outputPathForSimilarityMatrix","simMatrix",

    "--similarityClassname","SIMILARITY_PEARSON_CORRELATION"}

    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args);

    五、   參考文献

    1.  M.Deshpandeand G. Karypis. Item-based top-n recommendation algorithms.

    2.  B.M.Sarwar, G. Karypis, J.A. Konstan, and J. Reidl. Item-based collaborativefiltering recommendation algorithms.

    3.  Item-based collaborative filtering

    4.  Accuratelycomputing running variance

    版权声明:本文博主原创文章,博客,未经同意不得转载。

  • 相关阅读:
    jQuery$命名冲突问题解决方法
    微信小程序开发工具 ubuntu linux版本
    阿里云Https通配符证书购买
    vs2017安装
    规范与标准
    Jvm远程监控
    Bash笔记
    Html5前端笔记
    Php7 开发笔记
    webpack笔记
  • 原文地址:https://www.cnblogs.com/blfshiye/p/4845971.html
Copyright © 2020-2023  润新知