Learning Mahout (Part 4)

In Learning Mahout (Part 3) I posted the example code, which contains this call for generating the vector file:
```
InputDriver.runJob(input, directoryContainingConvertedInput, "org.apache.mahout.math.RandomAccessSparseVector");
```
InputDriver simply launches a MapReduce job. The mapper lives in InputMapper.java; the job has a map phase only, and its output is the vector file. The mapper code:
```
protected void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {

    String[] numbers = SPACE.split(values.toString());
    // sometimes there are multiple separator spaces
    Collection<Double> doubles = Lists.newArrayList();
    for (String value : numbers) {
      if (!value.isEmpty()) {
        doubles.add(Double.valueOf(value));
      }
    }
    // ignore empty lines in data file
    if (!doubles.isEmpty()) {
      try {
        // The vector's cardinality is the number of values found on this line.
        Vector result = (Vector) constructor.newInstance(doubles.size());
        int index = 0;
        for (Double d : doubles) {
          result.set(index++, d);
        }
        VectorWritable vectorWritable = new VectorWritable(result);
        context.write(new Text(String.valueOf(index)), vectorWritable);

      } catch (InstantiationException e) {
        throw new IllegalStateException(e);
      } catch (IllegalAccessException e) {
        throw new IllegalStateException(e);
      } catch (InvocationTargetException e) {
        throw new IllegalStateException(e);
      }
    }
  }
```
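The line that calls constructor.newInstance(doubles.size()) (marked in red in the original post) imposes a hard requirement on the input: every row must have the same number of values. Take this data, for example: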
```
1 2 3
4 5 6 7
```
Input like this does not go through. The data has to be
```
1 2 3 0
4 5 6 7
```
Only then is it accepted.
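
To see where the mismatch comes from, here is a minimal, dependency-free sketch (plain Java, not Mahout code; the class name LineCardinality and the method cardinalityOf are made up for illustration). It repeats the mapper's tokenizing step and reports the size that would be passed to constructor.newInstance(...) for each line. The two sample rows end up with different sizes, so the resulting vectors would not share a cardinality, which is presumably what the later steps object to.

```java
import java.util.regex.Pattern;

// Illustration only: mimics how InputMapper derives a vector size from a line.
class LineCardinality {
  private static final Pattern SPACE = Pattern.compile(" ");

  // Counts the non-empty tokens on a line. Empty tokens appear when the
  // line contains several consecutive spaces, and are skipped.
  static int cardinalityOf(String line) {
    int count = 0;
    for (String token : SPACE.split(line)) {
      if (!token.isEmpty()) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    String[] lines = {"1 2 3", "4 5 6 7"};
    for (String line : lines) {
      System.out.println("\"" + line + "\" -> size " + cardinalityOf(line));
    }
  }
}
```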

But when there are many dimensions, filling in the missing ones by hand is rather silly.

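Mahout's built-in seq2encoded tool, however, tolerates the missing values: data with missing dimensions still converts to a vector file successfully. Looking into it, the reason turns out to be that the vector size is hard-coded: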
```
Vector result = (Vector) constructor.newInstance(5000);
```
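It creates the vector with a very large fixed size, so that no index goes out of bounds as long as a row has no more than 5000 values. This way the input converts even when rows have different lengths.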
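
The effect of a fixed size can be sketched without Mahout at all. The following is an illustration under stated assumptions, not the library's code: FixedCardinalityParser and parse are hypothetical names, and a plain double[] stands in for Mahout's Vector. Every line is parsed into an array of the same length, and positions that receive no value simply stay at 0.

```java
import java.util.regex.Pattern;

// Illustration only: a double[] stands in for Mahout's Vector.
class FixedCardinalityParser {
  private static final Pattern SPACE = Pattern.compile(" ");

  // Parses a space-separated line into an array of exactly `cardinality`
  // slots. Slots beyond the values present on the line keep the default 0.0.
  static double[] parse(String line, int cardinality) {
    double[] result = new double[cardinality];
    int index = 0;
    for (String token : SPACE.split(line)) {
      if (token.isEmpty()) {
        continue; // several consecutive spaces produce empty tokens
      }
      if (index >= cardinality) {
        // A fixed size only helps while rows are no longer than that size.
        throw new IllegalArgumentException(
            "line has more than " + cardinality + " values");
      }
      result[index++] = Double.parseDouble(token);
    }
    return result;
  }

  public static void main(String[] args) {
    int cardinality = 5000; // the size used in the post
    double[] a = parse("1 2 3", cardinality);
    double[] b = parse("4 5 6 7", cardinality);
    System.out.println("same length: " + (a.length == b.length));
    System.out.println("fourth values: " + a[3] + " and " + b[3]);
  }
}
```

With a dense array this costs 5000 slots per row. A sparse implementation such as RandomAccessSparseVector is designed to store only the entries that are actually set, so an oversized cardinality should be cheap there.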

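With the fix known, all we need to do is locate the InputMapper.java source, create a file named InputMapperLocal.java modeled on it, and change the constructor call to constructor.newInstance(5000);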

Likewise, locate the InputDriver.java source, create InputDriverLocal.java modeled on it, and pass InputMapperLocal.class at the point where the job sets its mapper class.

Of course, the code that generates the vectors must then call InputDriverLocal.runJob instead of InputDriver.runJob.

Appendix:

Source path of InputMapper.java: ${Mahout_Source_Home}/integration/src/main/java/org/apache/mahout/clustering/conversion/InputMapper.java

Source path of InputDriver.java: ${Mahout_Source_Home}/integration/src/main/java/org/apache/mahout/clustering/conversion/InputDriver.java

Mahout version: 0.9
Original post: https://www.cnblogs.com/chenfool/p/3854778.html