The Various Writable Types (Text, BytesWritable, NullWritable, ObjectWritable, GenericWritable, ArrayWritable, MapWritable, SortedMapWritable) [repost]

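Java primitive types

Every Java primitive type except char has a corresponding Writable class, and each one exposes get() and set() methods for reading and changing the wrapped value.

IntWritable and LongWritable also have variable-length counterparts, VIntWritable and VLongWritable.

Choosing between the fixed-length and variable-length encodings is much like choosing between char and varchar in a database.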
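To make the trade-off concrete, here is a minimal sketch that estimates how many bytes a value would take under a Hadoop-style variable-length encoding. The class name VIntSizeSketch is hypothetical, and the size rule (one byte for values from -112 to 127, otherwise one header byte plus only as many bytes as the magnitude needs) is my reading of Hadoop's WritableUtils, reimplemented with the JDK alone. Treat it as an assumption, not as the library's code.

```java
// Hypothetical helper, JDK only. The size rule is an assumption modeled on
// Hadoop's variable-length integer encoding; it is not the library itself.
final class VIntSizeSketch {

    /** Bytes needed by the assumed variable-length encoding. */
    static int variableSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1;                 // small values fit in a single byte
        }
        if (value < 0) {
            value = ~value;           // negatives are sized by their one's complement
        }
        int significantBits = Long.SIZE - Long.numberOfLeadingZeros(value);
        return (significantBits + 7) / 8 + 1;   // payload bytes plus one header byte
    }

    public static void main(String[] args) {
        long[] samples = {1, 127, 128, 65_536, Integer.MAX_VALUE};
        for (long v : samples) {
            System.out.println(v + ": fixed int = 4 bytes, variable = "
                    + variableSize(v) + " byte(s)");
        }
    }
}
```

Under this rule small values cost one byte instead of four, while values near Integer.MAX_VALUE cost five, so the variable-length form pays off mainly when most values are small.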
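The Text type

Text stores its length as a variable-length int, so a single Text value can hold at most 2 GB.

Text uses standard UTF-8 encoding, which makes it easy to exchange data with other tools that understand UTF-8. The price is that Text behaves quite differently from Java's String.

Differences in indexing

Text's charAt() takes a byte offset and returns an int holding the Unicode code point found there, rather than a char as String's charAt() does. It returns -1 when the offset is out of bounds.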
@Test
public void testTextIndex() {
    Text text = new Text("hadoop");
    Assert.assertEquals(text.getLength(), 6);
    Assert.assertEquals(text.getBytes().length, 6);
    Assert.assertEquals(text.charAt(2), (int) 'd');
    Assert.assertEquals("Out of bounds", text.charAt(100), -1);
}

Text also has a find() method, similar to String's indexOf():
@Test
public void testTextFind() {
    Text text = new Text("hadoop");
    Assert.assertEquals("find a substring", text.find("do"), 2);
    Assert.assertEquals("Find first 'o'", text.find("o"), 3);
    Assert.assertEquals("Find 'o' from position 4 or later", text.find("o", 4), 4);
    Assert.assertEquals("No match", text.find("pig"), -1);
}

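Differences with Unicode

The differences between Text and String become clearer once a character needs more than one byte in UTF-8, because String counts in Java chars (UTF-16 code units) while Text counts in bytes.

Consider four Unicode characters whose UTF-8 encodings take one to four bytes: U+0041, U+00DF, U+6771 and U+10400.

They occupy 1, 2, 3 and 4 bytes respectively in UTF-8. In a Java String, U+10400 takes two chars (a surrogate pair), while each of the first three takes a single char.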
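These byte and char counts can be checked with the JDK alone. The sketch below is only an illustration; the class name Utf8WidthSketch and its helper are hypothetical and not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, JDK only: compares UTF-8 byte counts with Java char counts.
final class Utf8WidthSketch {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The four sample characters; the last one is written as a surrogate pair.
        String[] samples = {"\u0041", "\u00DF", "\u6771", "\uD801\uDC00"};
        for (String s : samples) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d Java char(s)%n",
                    s.codePointAt(0), utf8Length(s), s.length());
        }
    }
}
```

The byte count is what Text works with, and the char count is what String works with.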

Let's compare String and Text in code:

@Test
public void string() throws UnsupportedEncodingException {
    String str = "\u0041\u00DF\u6771\uD801\uDC00";
    Assert.assertEquals(str.length(), 5);
    Assert.assertEquals(str.getBytes("UTF-8").length, 10);

    Assert.assertEquals(str.indexOf("\u0041"), 0);
    Assert.assertEquals(str.indexOf("\u00DF"), 1);
    Assert.assertEquals(str.indexOf("\u6771"), 2);
    Assert.assertEquals(str.indexOf("\uD801\uDC00"), 3);

    Assert.assertEquals(str.charAt(0), '\u0041');
    Assert.assertEquals(str.charAt(1), '\u00DF');
    Assert.assertEquals(str.charAt(2), '\u6771');
    Assert.assertEquals(str.charAt(3), '\uD801');
    Assert.assertEquals(str.charAt(4), '\uDC00');

    Assert.assertEquals(str.codePointAt(0), 0x0041);
    Assert.assertEquals(str.codePointAt(1), 0x00DF);
    Assert.assertEquals(str.codePointAt(2), 0x6771);
    Assert.assertEquals(str.codePointAt(3), 0x10400);
}

@Test
public void text() {
    Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
    Assert.assertEquals(text.getLength(), 10);

    Assert.assertEquals(text.find("\u0041"), 0);
    Assert.assertEquals(text.find("\u00DF"), 1);
    Assert.assertEquals(text.find("\u6771"), 3);
    Assert.assertEquals(text.find("\uD801\uDC00"), 6);

    Assert.assertEquals(text.charAt(0), 0x0041);
    Assert.assertEquals(text.charAt(1), 0x00DF);
    Assert.assertEquals(text.charAt(3), 0x6771);
    Assert.assertEquals(text.charAt(6), 0x10400);
}

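The comparison makes the differences plain:

1. String's length() returns the number of chars; Text's getLength() returns the number of bytes.

2. String's indexOf() returns an offset measured in chars; Text's find() returns an offset measured in bytes.

3. String's charAt() returns a single Java char, which is not necessarily a whole Unicode character.

4. String's codePointAt() is the closest match to Text's charAt(), but the former is indexed by char offset and the latter by byte offset.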
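When the same data is handled as both a String and a Text, it can help to translate between the two kinds of offset. The following is a minimal JDK-only sketch; the class OffsetSketch and its method are hypothetical names introduced for illustration, and the conversion assumes the char index does not fall in the middle of a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, JDK only: maps a String char index to a UTF-8 byte offset.
final class OffsetSketch {

    /**
     * Returns the UTF-8 byte offset that corresponds to charIndex.
     * Assumes charIndex does not split a surrogate pair.
     */
    static int byteOffset(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        int charIndex = s.indexOf("\uD801\uDC00");   // offset counted in chars
        System.out.println("char index " + charIndex
                + " -> byte offset " + byteOffset(s, charIndex));
    }
}
```

For the sample string, char index 3 maps to byte offset 6, which matches the values that indexOf() and find() report in the tests above.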

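Iterating over Text

Iterating over the Unicode characters in a Text is more involved, because each character may occupy a different number of bytes and you cannot simply increment an index. First wrap the Text's bytes in a ByteBuffer, then call the static method Text.bytesToCodePoint() repeatedly on the buffer; each call returns the next code point and advances the buffer's position. Example: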

package com.sweetop.styhadoop;

import org.apache.hadoop.io.Text;

import java.nio.ByteBuffer;

/**
 * User: lastsweetop
 * Date: 13-7-9
 * Time: 5:00 PM
 */
public class TextIterator {
    public static void main(String[] args) {
        Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        // Wrap only the valid bytes; the backing array may be longer than getLength().
        ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(), 0, text.getLength());
        int cp;
        // Each call decodes one code point and advances the buffer.
        while (buffer.hasRemaining() && (cp = Text.bytesToCodePoint(buffer)) != -1) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
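For comparison, the same walk can be sketched with the JDK alone by decoding the valid bytes into a String first and then streaming its code points. This is not how Text does it internally; CodePointWalkSketch is a hypothetical name, and the approach costs one extra copy of the data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, JDK only: lists the code points held in a UTF-8 byte array.
final class CodePointWalkSketch {

    /** Decodes the first validLength bytes as UTF-8 and returns the code points. */
    static int[] codePoints(byte[] utf8, int validLength) {
        return new String(utf8, 0, validLength, StandardCharsets.UTF_8)
                .codePoints()
                .toArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u0041\u00DF\u6771\uD801\uDC00".getBytes(StandardCharsets.UTF_8);
        for (int cp : codePoints(utf8, utf8.length)) {
            System.out.println(Integer.toHexString(cp));   // 41, df, 6771, 10400
        }
    }
}
```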

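Mutating Text

Every Writable is mutable except NullWritable, which cannot be changed. A Text instance can therefore be reused by changing its value with one of its set() methods: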

@Test
public void testTextMutability() {
    Text text = new Text("hadoop");
    text.set("pig");
    Assert.assertEquals(text.getLength(), 3);
    Assert.assertEquals(text.getBytes().length, 3);
}

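Be aware, though, that in some situations the byte array returned by getBytes() is longer than the length returned by getLength(). Whenever you call getBytes(), call getLength() as well so that you know how many bytes of the array are actually valid: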

@Test
public void testTextMutability2() {
    Text text = new Text("hadoop");
    text.set(new Text("pig"));
    Assert.assertEquals(text.getLength(), 3);
    Assert.assertEquals(text.getBytes().length, 6);
}
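The practical consequence is that a reader must stop at the valid length. The sketch below imitates the situation with a plain byte array; ValidBytesSketch is a hypothetical name, and the array contents ("pig" followed by leftover bytes from "hadoop") are an assumed picture of the reused buffer, not something read out of a real Text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, JDK only: shows why the valid length matters for a reused buffer.
final class ValidBytesSketch {

    /** Copies only the valid prefix of a reused backing array. */
    static byte[] validBytes(byte[] backing, int validLength) {
        return Arrays.copyOf(backing, validLength);
    }

    public static void main(String[] args) {
        // Assumed state after overwriting "hadoop" with "pig" in place:
        // the first 3 bytes are current, the last 3 are leftovers.
        byte[] backing = "pigoop".getBytes(StandardCharsets.UTF_8);
        int validLength = 3;

        String wrong = new String(backing, StandardCharsets.UTF_8);
        String right = new String(validBytes(backing, validLength), StandardCharsets.UTF_8);
        System.out.println(wrong + " vs " + right);   // pigoop vs pig
    }
}
```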

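The BytesWritable type

BytesWritable is a wrapper for an array of binary data. Its serialized form starts with a 4-byte integer giving the number of bytes (unlike Text, which starts with a variable-length int), followed by the bytes themselves. For example: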

@Test
public void testBytesWritableSerializedFormat() throws IOException {
    BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5});
    // SerializeUtils is the author's own helper class; its source is not shown in this post.
    byte[] bytes = SerializeUtils.serialize(bytesWritable);
    Assert.assertEquals(StringUtils.byteToHexString(bytes), "000000020305");
}
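The layout itself is easy to reproduce without Hadoop. The sketch below writes a 4-byte big-endian length followed by the payload using java.nio.ByteBuffer; LengthPrefixedSketch and its methods are hypothetical names, and the code only mimics the format described above.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, JDK only: a 4-byte length prefix followed by the raw bytes.
final class LengthPrefixedSketch {

    static byte[] serialize(byte[] payload) {
        // ByteBuffer is big-endian by default, like Java's DataOutput.
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(serialize(new byte[]{3, 5})));   // 000000020305
    }
}
```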

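Like Text, BytesWritable can be changed with its set() method. getLength() returns the real size of the data, whereas the length of the array returned by getBytes() reflects the capacity and may be larger: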

bytesWritable.setCapacity(11);
bytesWritable.setSize(4);
Assert.assertEquals(4, bytesWritable.getLength());
Assert.assertEquals(11, bytesWritable.getBytes().length);

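The NullWritable type

NullWritable is a very special Writable: its serialized form contains no bytes at all, so it acts purely as a placeholder. In MapReduce, a key or value that you do not need can be declared as NullWritable.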

package com.sweetop.styhadoop;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * User: lastsweetop
 * Date: 13-7-16
 * Time: 9:23 PM
 */
public class TestNullWritable {
    public static void main(String[] args) throws IOException {
        NullWritable nullWritable = NullWritable.get();
        // Prints an empty line, because nothing is written during serialization.
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(nullWritable)));
    }
}

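The ObjectWritable type

ObjectWritable is a general-purpose wrapper for Java primitives, String, enum, Writable, null, or arrays of any of these. It is useful when a field can hold values of several types. Its drawback is size: because it has to record the type of the wrapped value, it takes a lot of space even when you store a single letter. Example: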

package com.sweetop.styhadoop;

import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * User: lastsweetop
 * Date: 13-7-17
 * Time: 9:14 AM
 */
public class TestObjectWritable {
    public static void main(String[] args) throws IOException {
        Text text = new Text("\u0041");
        ObjectWritable objectWritable = new ObjectWritable(text);
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(objectWritable)));
    }
}

Only a single letter is stored, so let's see what the serialized result looks like:

00196f72672e6170616368652e6861646f6f702e696f2e5465787400196f72672e6170616368652e6861646f6f702e696f2e546578740141
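To see where the bytes go, the dump can be split by hand. The sketch below parses the hex string above with the JDK alone. ObjectDumpSketch is a hypothetical name, and the reading of the layout (a class name stored twice, each as a 2-byte length plus characters, followed by the payload of the wrapped Text) is my interpretation of the output, not a statement of the ObjectWritable specification.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, JDK only: splits the hex dump into its apparent fields.
final class ObjectDumpSketch {

    // One length-prefixed class name as it appears in the dump (0x0019 = 25 characters).
    static final String CLASS_NAME_HEX = "00196f72672e6170616368652e6861646f6f702e696f2e54657874";
    // The full dump: the class name twice, then the two payload bytes.
    static final String DUMP = CLASS_NAME_HEX + CLASS_NAME_HEX + "0141";

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Reads a string stored as a 2-byte length followed by that many bytes. */
    static String readPrefixedString(ByteBuffer in) {
        byte[] raw = new byte[in.getShort()];
        in.get(raw);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.wrap(fromHex(DUMP));
        int total = in.remaining();
        System.out.println("first class name:  " + readPrefixedString(in));
        System.out.println("second class name: " + readPrefixedString(in));
        System.out.println("payload bytes: " + in.remaining() + " of " + total);
    }
}
```

Of the 56 bytes in the dump, only the last two (a length byte and the letter itself) carry the value; the rest is type information.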

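That wastes a lot of space. The set of types is usually small and known in advance, so there is a more compact alternative, covered in the next section.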

The GenericWritable type

To use GenericWritable, subclass it and override getTypes() to declare which types should be supported. Usage:

package com.sweetop.styhadoop;

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

class MyWritable extends GenericWritable {

    MyWritable(Writable writable) {
        set(writable);
    }

    // The supported types; a type's position in this array is its index.
    public static Class<? extends Writable>[] CLASSES = null;

    static {
        CLASSES = (Class<? extends Writable>[]) new Class[]{
                Text.class
        };
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }
}

Then print the serialized results:

package com.sweetop.styhadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * User: lastsweetop
 * Date: 13-7-17
 * Time: 9:51 AM
 */
public class TestGenericWritable {

    public static void main(String[] args) throws IOException {
        Text text = new Text("\u0041\u0071");
        MyWritable myWritable = new MyWritable(text);
        // First the bare Text, then the same Text wrapped in MyWritable.
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(text)));
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(myWritable)));
    }
}

The result is:

024171

00024171

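GenericWritable's serialization simply puts the index of the type within the types array in front of the value. That saves a great deal of space compared with ObjectWritable, so GenericWritable is the recommended choice.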
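The idea of writing a type index instead of a type name can be sketched without Hadoop. In the example below, TypeIndexSketch, its TYPES array and the payload formats are all hypothetical choices made for illustration; only the principle (one tag byte, then the value) reflects the description above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, JDK only: a one-byte type index followed by the payload.
final class TypeIndexSketch {

    // A type's position in this array is the tag byte written before its payload.
    static final Class<?>[] TYPES = {String.class, Integer.class};

    static byte[] serialize(Object value) {
        ByteBuffer out = ByteBuffer.allocate(129);   // tag + length byte + up to 127 bytes
        for (int i = 0; i < TYPES.length; i++) {
            if (TYPES[i] == value.getClass()) {
                out.put((byte) i);                   // the type index
                writePayload(out, value);
                return Arrays.copyOf(out.array(), out.position());
            }
        }
        throw new IllegalArgumentException("unsupported type: " + value.getClass());
    }

    private static void writePayload(ByteBuffer out, Object value) {
        if (value instanceof String) {
            byte[] utf8 = ((String) value).getBytes(StandardCharsets.UTF_8);
            out.put((byte) utf8.length);             // assumes fewer than 128 bytes
            out.put(utf8);
        } else {
            out.putInt((Integer) value);
        }
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(serialize("\u0041\u0071")));   // 00024171
    }
}
```

The tag costs one byte regardless of how long the class name is, which is where the saving over a name-based format comes from.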

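Writable collection types

ArrayWritable and TwoDArrayWritable

ArrayWritable and TwoDArrayWritable are the Writable types for arrays and two-dimensional arrays. There are two ways to specify the element type: pass it to the constructor, or subclass ArrayWritable. The same holds for TwoDArrayWritable.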

package com.sweetop.styhadoop;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * User: lastsweetop
 * Date: 13-7-17
 * Time: 11:14 AM
 */
public class TestArrayWritable {
    public static void main(String[] args) throws IOException {
        ArrayWritable arrayWritable = new ArrayWritable(Text.class);
        arrayWritable.set(new Writable[]{new Text("\u0071"), new Text("\u0041")});
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(arrayWritable)));
    }
}

The output:

0000000201710141

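As the output shows, ArrayWritable starts with an integer holding the array length, followed by the elements one after another.

ArrayPrimitiveWritable is similar but wraps arrays of Java primitives, and it does not require a subclass to set the element type.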
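The ArrayWritable layout described above (a 4-byte count, then each element in turn) can be reproduced with the JDK alone. ArrayLayoutSketch is a hypothetical name, and each element is written as a one-byte length plus UTF-8 bytes, which is an assumption that only holds for short strings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, JDK only: a 4-byte element count followed by the elements.
final class ArrayLayoutSketch {

    static byte[] serialize(String... elements) {
        ByteBuffer out = ByteBuffer.allocate(4 + 128 * elements.length);
        out.putInt(elements.length);             // element count, big-endian
        for (String element : elements) {
            byte[] utf8 = element.getBytes(StandardCharsets.UTF_8);
            out.put((byte) utf8.length);         // assumes fewer than 128 bytes
            out.put(utf8);
        }
        return Arrays.copyOf(out.array(), out.position());
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(serialize("\u0071", "\u0041")));   // 0000000201710141
    }
}
```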

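MapWritable and SortedMapWritable

MapWritable corresponds to Map and SortedMapWritable to SortedMap. The serialized form starts with 4 bytes holding the number of entries. Each element is then preceded by a single byte holding the index of its type (similar to GenericWritable, which is why the total number of distinct types can only go up to 127), followed by the element itself, key first and then value, one pair after another.

These two Writables are used heavily throughout Hadoop and will come up often later, so no example is given here.

You may have noticed that there are no Writable types for sets or lists; both can be emulated. Use MapWritable in place of a set and SortedMapWritable in place of a sorted set, with the values set to NullWritable, which takes no space. A list of elements of a single type can be represented with ArrayWritable. For a list of mixed types, wrap the elements in a GenericWritable subclass and put those in an ArrayWritable. MapWritable can also stand in for a list, with the index as the key and the list element as the value.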
Original article: https://www.cnblogs.com/xuepei/p/3665463.html
