[Gandalf] Running the Distributed ItemCF Recommendation Algorithm with Mahout 0.9 + CDH 5.2

Environment:

hadoop-2.5.0-cdh5.2.0

mahout-0.9-cdh5.2.0

Introduction

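Mahout has announced that it will stop developing on MapReduce and move to Spark. In practice, however, our company's cluster does not have enough memory to feed Spark, a beast that eats memory for breakfast. Add project schedule pressure and the current skill set of our developers, and we have no choice but to keep using Mahout for a while.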

This post records how to run ItemCF on Hadoop from the command line.

History

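I had previously read a number of articles about programming Mahout ItemCF on Hadoop. They all describe how to implement ItemCF on Hadoop by writing code against Mahout. Since I had no time to look into it myself, I simply followed that programmatic approach, for example the code below, which appears on blog after blog: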

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class ItemCFHadoop {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ItemCFHadoop.class);
        // Let Hadoop consume its generic options and keep the job-specific arguments.
        GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
        String[] remainingArgs = optionParser.getRemainingArgs();
        if (remainingArgs.length != 5) {
            System.out.println("args length: "+remainingArgs.length);
            System.err.println("Usage: hadoop jar <jarname> <package>.ItemCFHadoop <inputpath> <outputpath> <tmppath> <booleanData> <similarityClassname>");
            System.exit(2);
        }
        System.out.println("input : "+remainingArgs[0]);
        System.out.println("output : "+remainingArgs[1]);
        System.out.println("tempdir : "+remainingArgs[2]);
        System.out.println("booleanData : "+remainingArgs[3]);
        System.out.println("similarityClassname : "+remainingArgs[4]);

        // Reassemble the positional arguments into the options RecommenderJob expects.
        StringBuilder sb = new StringBuilder();
        sb.append("--input ").append(remainingArgs[0]);
        sb.append(" --output ").append(remainingArgs[1]);
        sb.append(" --tempDir ").append(remainingArgs[2]);
        sb.append(" --booleanData ").append(remainingArgs[3]);
        sb.append(" --similarityClassname ").append(remainingArgs[4]);
        conf.setJobName("ItemCFHadoop");
        RecommenderJob job = new RecommenderJob();
        job.setConf(conf);
        // Hand the option string, split back into tokens, to the job.
        job.run(sb.toString().split(" "));
    }
}

The code above does run: as long as the correct arguments are passed on the command line, the ItemCF on Hadoop job completes successfully.
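
One fragile detail in the code above: the arguments are joined into a single string and then split on spaces again, so a path containing a space would be broken into two arguments. Below is a minimal sketch of building the argument array directly. The class RecommenderArgs and its build method are hypothetical names introduced for illustration, and the sketch deliberately has no Hadoop or Mahout dependency.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: builds the argument array
// for RecommenderJob without joining and re-splitting a string.
final class RecommenderArgs {

    static String[] build(String input, String output, String tempDir,
                          boolean booleanData, String similarityClassname) {
        List<String> args = new ArrayList<>();
        args.add("--input");
        args.add(input);
        args.add("--output");
        args.add(output);
        args.add("--tempDir");
        args.add(tempDir);
        args.add("--booleanData");
        args.add(String.valueOf(booleanData));
        args.add("--similarityClassname");
        args.add(similarityClassname);
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // The input path here is made up and contains a space on purpose.
        String[] args = build("/User Preference", "/CFOutput", "/tmp", false,
                "org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.LoglikelihoodSimilarity");
        // Each option and each value stays a separate array element.
        for (String a : args) {
            System.out.println(a);
        }
    }
}
```

The array returned by build could be passed to job.run(...) in place of sb.toString().split(" ").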

With this logic, however, the Java code is really just doing the command line's job. Why not run the job directly from the command line?

Official documentation

Those earlier articles pointed the way: the ItemCF on Hadoop job is implemented by the class org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.

The official Javadoc (https://builds.apache.org/job/Mahout-Quality/javadoc/) describes the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class as follows:

Runs a completely distributed recommender job as a series of mapreduces.

Preferences in the input file should look like userID, itemID[, preferencevalue]

Preference value is optional to accommodate applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference).

The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs.

Command line arguments specific to this class are:

--input(path): Directory containing one or more text files with the preference data

--output(path): output path where recommender output should go

--tempDir (path): Specifies a directory where the job may place temp files (default "temp")

--similarityClassname (classname): Name of vector similarity class to instantiate or a predefined similarity from VectorSimilarityMeasure

--usersFile (path): only compute recommendations for user IDs contained in this file (optional)

--itemsFile (path): only include item IDs from this file in the recommendations (optional)

--filterFile (path): file containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user (optional)

--numRecommendations (integer): Number of recommendations to compute per user (10)

--booleanData (boolean): Treat input data as having no pref values (false)

--maxPrefsPerUser (integer): Maximum number of preferences considered per user in final recommendation phase (10)

--maxSimilaritiesPerItem (integer): Maximum number of similarities considered per item (100)

--minPrefsPerUser (integer): ignore users with less preferences than this in the similarity computation (1)

--maxPrefsPerUserInItemSimilarity (integer): max number of preferences to consider per user in the item similarity computation phase, users with more preferences will be sampled down (1000)

--threshold (double): discard item pairs with a similarity value below this

The original text is kept above for reference. My own reading of it, with a few notes added, follows:

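Runs a fully distributed recommendation job, implemented as a series of MapReduce jobs.

The preference data in the input files has the format: userID, itemID[, preferencevalue]. The preferencevalue is optional.

userID and itemID are parsed as long; preferencevalue is parsed as double.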

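The class accepts the following command-line arguments:

--input (path): directory holding the user preference data; it may contain one or more text files of preference data

--output (path): output directory for the computed results

--tempDir (path): directory for temporary files

--similarityClassname (classname): the vector similarity class. The available similarity implementations include CityBlockSimilarity, CooccurrenceCountSimilarity, CosineSimilarity, CountbasedMeasure, EuclideanDistanceSimilarity, LoglikelihoodSimilarity, PearsonCorrelationSimilarity and TanimotoCoefficientSimilarity. Note that a class name must be given together with its package name.

--usersFile (path): path to one or more files containing userIDs; recommendations are computed only for the userIDs in those files (optional)

--itemsFile (path): path to one or more files containing itemIDs; only the itemIDs in those files are included in the recommendations (optional)

--filterFile (path): path to a file of comma-separated userID,itemID pairs; an item listed in a pair is never recommended to the user in that pair (optional)

--numRecommendations (integer): number of items to recommend per user, default 10

--booleanData (boolean): set to true if the input data carries no preference values, default false

--maxPrefsPerUser (integer): maximum number of preferences used per user in the final recommendation phase, default 10

--maxSimilaritiesPerItem (integer): maximum number of similarities kept per item, default 100

--minPrefsPerUser (integer): users with fewer preferences than this are ignored in the similarity computation, default 1

--maxPrefsPerUserInItemSimilarity (integer): maximum number of preferences considered per user in the item similarity phase; users with more preferences are sampled down, default 1000

--threshold (double): item pairs whose similarity is below this threshold are discarded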
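
To give a feel for what a similarity measure computes, here is a small sketch of the Tanimoto coefficient: the number of users who have both items divided by the number of users who have either. This is my own illustration with made-up user sets, not Mahout's TanimotoCoefficientSimilarity code, and the class name TanimotoSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: Tanimoto coefficient between two items, each item
// represented by the set of users who expressed a preference for it.
final class TanimotoSketch {

    static double tanimoto(Set<Long> usersOfA, Set<Long> usersOfB) {
        Set<Long> both = new HashSet<>(usersOfA);
        both.retainAll(usersOfB); // users who have both items
        int union = usersOfA.size() + usersOfB.size() - both.size();
        return union == 0 ? 0.0 : (double) both.size() / union;
    }

    public static void main(String[] args) {
        // Made-up example: item A is liked by users 1..5, item B by users 2..5.
        Set<Long> a = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L));
        Set<Long> b = new HashSet<>(Arrays.asList(2L, 3L, 4L, 5L));
        // 4 shared users out of 5 distinct users.
        System.out.println(tanimoto(a, b));
    }
}
```
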
Running from the command line

User preference data used for testing [userID, itemID, preferencevalue]:

1,101,2
1,102,5
1,103,1
2,101,1
2,102,3
2,103,2
2,104,6
3,101,1
3,104,1
3,105,1
3,107,2
4,101,2
4,103,2
4,104,5
4,106,3
5,101,3
5,102,5
5,103,6
5,104,8
5,105,1
5,106,1
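
The format above is simple enough to check before handing the file to the job. The following is a rough sketch of a parser for a single line, treating the third field as optional. PreferenceLine is a hypothetical class written for this post, not a Mahout API, and it assumes comma-separated fields.

```java
// Hypothetical parser for lines of the form userID,itemID[,preferencevalue].
final class PreferenceLine {
    final long userId;
    final long itemId;
    final Double value; // null when the line carries no preference value

    PreferenceLine(long userId, long itemId, Double value) {
        this.userId = userId;
        this.itemId = itemId;
        this.value = value;
    }

    static PreferenceLine parse(String line) {
        String[] parts = line.trim().split("\\s*,\\s*");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException(
                    "Expected userID,itemID[,preferencevalue]: " + line);
        }
        long user = Long.parseLong(parts[0]);   // IDs are parsed as long
        long item = Long.parseLong(parts[1]);
        Double value = parts.length == 3 ? Double.valueOf(parts[2]) : null;
        return new PreferenceLine(user, item, value);
    }

    public static void main(String[] args) {
        // The last sample has no preference value and is made up for illustration.
        String[] samples = {"1,101,2", "5,104,8", "3,107"};
        for (String s : samples) {
            PreferenceLine p = parse(s);
            System.out.println(p.userId + " " + p.itemId + " " + p.value);
        }
    }
}
```

A malformed ID or value surfaces as a NumberFormatException, which is usually what you want when validating input.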

Once the underlying environment is properly configured, run the following command to perform the ItemCF on Hadoop recommendation computation:

hadoop jar $MAHOUT_HOME/mahout-core-0.9-cdh5.2.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input /UserPreference --output /CFOutput --tempDir /tmp --similarityClassname org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.LoglikelihoodSimilarity

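Note: only the most important parameters are used here. Using and tuning the remaining parameters requires testing against the actual project.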

Results [userID [itemID1:score1,itemID2:score2,...]]:

1 [104:3.4706533,106:1.7326527,105:1.5989419]
2 [106:3.8991857,105:3.691359]
3 [106:1.0,103:1.0,102:1.0]
4 [105:3.2909648,102:3.2909648]
5 [107:3.2898135]
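
If the results are to be consumed by another program, each output line can be split into a user ID and an ordered list of item:score pairs. Here is a minimal sketch, assuming the line layout shown above. RecommendationLine is a hypothetical helper, not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for lines such as: 1 [104:3.47,106:1.73]
final class RecommendationLine {

    static long parseUserId(String line) {
        int open = line.indexOf('[');
        if (open < 0) {
            throw new IllegalArgumentException("No '[' in: " + line);
        }
        // trim() drops the space or tab between the user ID and the bracket.
        return Long.parseLong(line.substring(0, open).trim());
    }

    // Returns itemID -> score, keeping the order in which items are listed.
    static Map<Long, Double> parseItems(String line) {
        int open = line.indexOf('[');
        int close = line.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("No [...] list in: " + line);
        }
        Map<Long, Double> items = new LinkedHashMap<>();
        String body = line.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return items;
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.trim().split(":");
            if (kv.length != 2) {
                throw new IllegalArgumentException("Bad item:score pair: " + pair);
            }
            items.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1].trim()));
        }
        return items;
    }

    public static void main(String[] args) {
        String line = "1 [104:3.4706533,106:1.7326527,105:1.5989419]";
        System.out.println("user " + parseUserId(line) + " -> " + parseItems(line));
    }
}
```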

Copyright notice: this is an original blog article and may not be reproduced without permission.
Original article: https://www.cnblogs.com/gcczhongduan/p/4747170.html