(Repost) Getting Started with Mahout


1. Introduction

Mahout is an open source project under the Apache Software Foundation (ASF) that provides implementations of classic scalable machine learning algorithms, aiming to help developers create intelligent applications more easily and quickly. The Apache Mahout project is in its third year and has had three public releases. Mahout includes many implementations, covering clustering, classification, recommendation filtering, and frequent itemset mining. In addition, by building on the Apache Hadoop library, Mahout can scale effectively into the cloud.

2. Download and Preparation

Software downloads

Download Hadoop: from http://labs.renren.com/apache-mirror/hadoop/common/ download a package of a suitable version (this article uses the stable release hadoop-0.20.203.0rc1.tar.gz).

Download Mahout: http://labs.renren.com/apache-mirror/mahout/

(This article uses mahout-distribution-0.5.tar.gz.)

For additional functionality you may also need to download Maven and mahout-collections.

Data download

Data source: http://kdd.ics.uci.edu/databases/ hosts a large collection of classic datasets for download.

(This article uses the synthetic_control dataset, synthetic_control.tar.gz.)
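To fetch the dataset from the shell, something like the following should work (a hedged sketch; the exact path under the archive may differ, so browse the site if the URL does not resolve):

    wget http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data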

3. Installation and Deployment

To avoid polluting the Linux root environment, this article installs the software under the personal home directory; the program directory is $HOME/local.

The packages have already been downloaded to $HOME/Downloads; extract them with tar:

    tar zxvf hadoop-0.20.203.0rc1.tar.gz -C ~/local/

    cd ~/local

    mv hadoop-0.20.203.0 hadoop

     

    tar zxvf mahout-distribution-0.5.tar.gz -C ~/local/

    cd ~/local

    mv mahout-distribution-0.5 mahout

     

Edit .bash_profile / .bashrc:

export HADOOP_HOME=$HOME/local/hadoop

export HADOOP_CONF_DIR=$HADOOP_HOME/conf

export MAHOUT_HOME=$HOME/local/mahout

For convenient access to the commands, add each program's bin directory to $PATH, or simply define aliases:

#Alias for apps

alias mahout='$HOME/local/mahout/bin/mahout'

alias hdp='$HOME/local/hadoop/bin/hadoop'
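One more prerequisite worth checking: Hadoop's scripts refuse to run without JAVA_HOME set. A minimal sketch, assuming a JDK at /usr/lib/jvm/java-6-sun (a hypothetical location; point it at your actual JDK):

    # In $HADOOP_HOME/conf/hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-6-sun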

     

Testing

Run the command: mahout

Expected output:

    Running on hadoop, using HADOOP_HOME=/home/username/local/hadoop

    HADOOP_CONF_DIR=/home/username/local/hadoop/conf

    An example program must be given as the first argument.

    Valid program names are:

      arff.vector: : Generate Vectors from an ARFF file or directory

      canopy: : Canopy clustering

      cat: : Print a file or resource as the logistic regression models would see it

      cleansvd: : Cleanup and verification of SVD output

      clusterdump: : Dump cluster output to text

      dirichlet: : Dirichlet Clustering

      eigencuts: : Eigencuts spectral clustering

      evaluateFactorization: : compute RMSE of a rating matrix factorization against probes in memory

      evaluateFactorizationParallel: : compute RMSE of a rating matrix factorization against probes

      fkmeans: : Fuzzy K-means clustering

      fpg: : Frequent Pattern Growth

      itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering

      kmeans: : K-means clustering

      lda: : Latent Dirchlet Allocation

      ldatopics: : LDA Print Topics

      lucene.vector: : Generate Vectors from a Lucene index

      matrixmult: : Take the product of two matrices

      meanshift: : Mean Shift clustering

      parallelALS: : ALS-WR factorization of a rating matrix

      predictFromFactorization: : predict preferences from a factorization of a rating matrix

      prepare20newsgroups: : Reformat 20 newsgroups data

      recommenditembased: : Compute recommendations using item-based collaborative filtering

      rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}

      rowsimilarity: : Compute the pairwise similarities of the rows of a matrix

      runlogistic: : Run a logistic regression model against CSV data

      seq2sparse: : Sparse Vector generation from Text sequence files

      seqdirectory: : Generate sequence files (of Text) from a directory

      seqdumper: : Generic Sequence File dumper

      seqwiki: : Wikipedia xml dump to sequence file

      spectralkmeans: : Spectral k-means clustering

      splitDataset: : split a rating dataset into training and probe parts

      ssvd: : Stochastic SVD

      svd: : Lanczos Singular Value Decomposition

      testclassifier: : Test Bayes Classifier

      trainclassifier: : Train Bayes Classifier

      trainlogistic: : Train a logistic regression using stochastic gradient descent

      transpose: : Take the transpose of a matrix

      vectordump: : Dump vectors from a sequence file to text

      wikipediaDataSetCreator: : Splits data set of wikipedia wrt feature like country

  wikipediaXMLSplitter: : Reads wikipedia data and creates chunks

Run the command: hdp

Expected output:

    Usage: hadoop [--config confdir] COMMAND

    where COMMAND is one of:

      namenode -format     format the DFS filesystem

      secondarynamenode    run the DFS secondary namenode

      namenode             run the DFS namenode

      datanode             run a DFS datanode

      dfsadmin             run a DFS admin client

      mradmin              run a Map-Reduce admin client

      fsck                 run a DFS filesystem checking utility

      fs                   run a generic filesystem user client

      balancer             run a cluster balancing utility

      fetchdt              fetch a delegation token from the NameNode

      jobtracker           run the MapReduce job Tracker node

      pipes                run a Pipes job

      tasktracker          run a MapReduce task Tracker node

      historyserver        run job history servers as a standalone daemon

      job                  manipulate MapReduce jobs

      queue                get information regarding JobQueues

      version              print the version

      jar <jar>            run a jar file

      distcp <srcurl> <desturl> copy file or directories recursively

      archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive

      classpath            prints the class path needed to get the

                           Hadoop jar and the required libraries

      daemonlog            get/set the log level for each daemon

     or

      CLASSNAME            run the class named CLASSNAME

    Most commands print help when invoked w/o parameters.

     

4. Running

Step 1:

This command shows which algorithms Mahout provides, and how to use them:

    mahout --help

     

mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters /user/hive/warehouse/tmp_data/clusters -k 5 --output /user/hadoopuser/output
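Note that --clusters names an HDFS path holding the initial cluster centers (when -k is given, k random points are sampled into that path first), and --output is a directory, not a single text file. A fuller invocation might look like the following, a hedged sketch against the Mahout 0.5 CLI with illustrative paths:

    mahout kmeans \
      --input /user/hive/warehouse/tmp_data/complex.seq \
      --clusters /user/hive/warehouse/tmp_data/clusters \
      -k 5 \
      --output /user/hadoopuser/output \
      --maxIter 10 \
      --clustering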

     

Files processed by Mahout must be in SequenceFile format, so plain text files need to be converted to SequenceFiles first. SequenceFile is a Hadoop class that allows us to write binary key-value pairs to a file; for a detailed introduction see eyjian's post at http://www.hadoopor.com/viewthread.php?tid=144&highlight=sequencefile
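Once converted, you can peek inside a SequenceFile from the shell: Hadoop's fs -text subcommand decodes it to readable text (a hedged example using the hdp alias defined earlier; the chunk path is illustrative):

    hdp fs -text /mahout/seq/chunk-0 | head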

     

Mahout provides a way to convert the files under a specified directory into SequenceFiles.

    You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.

Usage:

     

$MAHOUT_HOME/bin/mahout seqdirectory \

    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \

    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \

    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \

    <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>

     

For example:

    mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
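To verify the conversion, list the output directory; seqdirectory writes its output as one or more chunk files (a hedged check; the chunk-N naming is Mahout's convention for this tool):

    hdp fs -ls /mahout/seq/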

     

Step 2:

A simple example of running k-means:

     

1. Put the sample dataset into the designated directory in HDFS; it should go under the testdata folder:

$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata

For example:

hdp fs -put ~/dataset/synthetic_control/synthetic_control.data testdata/
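Verify that the upload landed where the example jobs expect it, i.e. under testdata in your HDFS home directory:

    hdp fs -ls testdata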

     

2. Run the k-means algorithm:

    hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

For example:

hdp jar ~/local/mahout/mahout-examples-0.5-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
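While the job runs, you can watch its progress from another terminal (hdp is the alias defined earlier):

    hdp job -list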

     

3. Run the Canopy algorithm:

    hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

For example:

hdp jar ~/local/mahout/mahout-examples-0.5-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job

     

4. Run the Dirichlet algorithm:

hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

     

5. Run the mean-shift algorithm:

    hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

     

6. Take a look at the results:

    mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000

This prints the results directly to the console.

     

You can also look around HDFS to see what the data looks like.

Most of the examples above use testdata as the input folder and output as the output folder.

Use hdp fs -lsr to browse all of the output.

     

The KMeans output is in output/points.

The Canopy and MeanShift results are placed in output/clustered-points.
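Beyond vectordump, the clusterdump utility summarizes the clusters themselves. A hedged sketch against the Mahout 0.5 CLI (directory names vary by version and by how many iterations actually ran, so check hdp fs -lsr output first; clusters-10 is an assumption):

    mahout clusterdump --seqFileDir output/clusters-10 --pointsDir output/points --output clusteranalyze.txt

This writes a plain-text summary of each cluster's center and member points to the local file clusteranalyze.txt.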

Original source: https://www.cnblogs.com/end/p/2872034.html