Fixing garbled Chinese text from gbk pages in a node crawler

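I ran into garbled Chinese text when writing a crawler in node a while ago and never resolved it, so today I am writing up notes for future reference. (PS: some of the solutions found online no longer work.)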

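The garbling occurs specifically when node requests a gbk-encoded page: the Chinese text in the page cannot be read correctly unless it is transcoded. Both conditions, "gbk" and "Chinese text in the page", are required. Chinese text in a utf-8 page is read correctly, and so are English letters, digits, and the like in a gbk page.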
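
To make the two conditions concrete, here is a minimal sketch that uses only node's built-in Buffer. The byte values are the gbk encoding of "中文" (D6 D0 CE C4), hard-coded here as an illustrative assumption rather than fetched from a real page.

```javascript
// gbk bytes for "中文" (hard-coded for illustration) and some plain ASCII bytes.
const gbkChinese = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);
const asciiOnly = Buffer.from('abc123', 'latin1');

// node decodes as utf-8 by default. These gbk bytes are not valid utf-8,
// so invalid sequences are replaced with U+FFFD (the replacement character).
const garbled = gbkChinese.toString('utf8');

// ASCII bytes are identical in gbk and utf-8, so they come through untouched.
const intact = asciiOnly.toString('utf8');

console.log(JSON.stringify(garbled), intact);
```

This is why English letters and digits on a gbk page look fine while the Chinese text does not.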

A simple example: fetch the username of the top-ranked solution at http://acm.hdu.edu.cn/statistic.php?pid=1000, which is "极光炫影". I quickly wrote the following code:
```
var cheerio = require('cheerio')
  , superagent = require('superagent')
  , express = require('express');

var url = 'http://acm.hdu.edu.cn/statistic.php?pid=1000';
var app = express();

app.get('/', function (req, res, next) {

  superagent.get(url)
    .end(function (err, sres) {
      if (err) return next(err);  // bail out on request errors before touching sres
      var html = sres.text;
      var $ = cheerio.load(html, {decodeEntities: false});
      var ans = $('.table_text td a').eq(0).html();
      res.send(ans);
    });
});

app.listen(3000, function () {
  console.log('app is listening at port 3000');
});
```
The result is garbled:
```
������Ӱ
```
How do we get the correct Chinese text? Here are a few quick fixes (stopgaps that do not go into the underlying principles).

Method 1:

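Use the superagent-charset module. (Update 2016-08-26: if this fails, use version 0.1.1, installed with npm install superagent-charset@0.1.1 --save or by pinning the version in package.json. Later versions changed the API, and the code below targets the 0.1.1 API. I will update it when I have time.)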
```
var cheerio = require('cheerio')
  , superagent = require('superagent-charset')
  , express = require('express');

var url = 'http://acm.hdu.edu.cn/statistic.php?pid=1000';
var app = express();

app.get('/', function (req, res, next) {

  superagent.get(url)
    .charset('gbk')
    .end(function (err, sres) {
      if (err) return next(err);  // bail out on request errors before touching sres
      var html = sres.text;
      var $ = cheerio.load(html, {decodeEntities: false});
      var ans = $('.table_text td a').eq(0).html();
      res.send(ans);
    });

});

app.listen(3000, function () {
  console.log('app is listening at port 3000');
});
```
Usage is simple: require the superagent-charset module and add a charset call to the request chain. superagent-charset bundles the superagent module and the iconv-lite module. The source code is available on GitHub.
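
For releases newer than 0.1.1, the following is a hedged sketch of the same request. It assumes the newer API, in which superagent-charset exports a function that patches a separately installed superagent rather than replacing it. The wrapper name fetchFirstUsername is hypothetical, and the exact behavior should be checked against the README of the installed version.

```javascript
// Sketch only: assumes a superagent-charset release newer than 0.1.1, where the
// module exports a function that adds .charset() to superagent's requests.
// fetchFirstUsername is a hypothetical helper name used for illustration.
function fetchFirstUsername(url, callback) {
  // Required lazily so the packages are only needed when the function runs.
  const superagent = require('superagent');
  const cheerio = require('cheerio');
  require('superagent-charset')(superagent); // patch superagent in place

  superagent.get(url)
    .charset('gbk') // decode the response body as gbk
    .end(function (err, sres) {
      if (err) return callback(err);
      const $ = cheerio.load(sres.text, { decodeEntities: false });
      callback(null, $('.table_text td a').eq(0).html());
    });
}
```

Inside the route handler it could be called as fetchFirstUsername(url, function (err, ans) { ... }), sending ans with res.send on success.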

Method 2:

Transcode directly with the iconv-lite module.

iconv-lite is a module for converting between character encodings (node's default encoding is utf-8). The data passed to decode must be a Buffer.
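
One practical consequence when reading a response stream by hand: chunks should stay as Buffers and be joined before decoding, because a multi-byte character can be split across two chunks. Below is a minimal sketch using only node's built-in Buffer. utf-8 bytes stand in for gbk here (an assumption made so that no extra module is needed), and the two-chunk split is simulated.

```javascript
// utf-8 is used only so the sketch needs no extra modules; the same
// boundary problem applies to gbk's two-byte characters.
const whole = Buffer.from('中文', 'utf8'); // 6 bytes, 3 per character

// Simulate a response arriving in two chunks that split the first character.
const chunks = [whole.subarray(0, 2), whole.subarray(2)];

// Wrong: decoding each chunk separately corrupts the split character.
const perChunk = chunks.map((c) => c.toString('utf8')).join('');

// Right: keep raw Buffers, join them once, then decode the full byte sequence.
const joined = Buffer.concat(chunks).toString('utf8');

console.log(perChunk === '中文', joined === '中文'); // false true
```
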
- Using the http module:
```
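  // Inside the route handler; assumes http = require('http') and iconv = require('iconv-lite')
  http.get(url, function(sres) {
    var chunks = [];

    // Collect the raw chunks as Buffers; do not convert them to strings yet
    sres.on('data', function(chunk) {
      chunks.push(chunk);
    });

    sres.on('end', function() {
      // Decode the raw gb2312 bytes into a JavaScript string
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
      var $ = cheerio.load(html, {decodeEntities: false});
      var ans = $('.table_text td a').eq(0).html();
      res.send(ans);
    });
  }).on('error', next);  // pass connection errors to express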
```
- Using the request module:
```
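  // Inside the route handler; assumes request = require('request') and iconv = require('iconv-lite')
  request({
    url: url,
    encoding: null  // key line: return body as a Buffer instead of a string
  }, function (err, sres, body) {
    if (err) return next(err);
    var html = iconv.decode(body, 'gb2312');
    var $ = cheerio.load(html, {decodeEntities: false});
    var ans = $('.table_text td a').eq(0).html();
    res.send(ans);
  });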
```
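  The argument passed to iconv's decode must be a Buffer, which is why encoding: null matters here. The request documentation describes the option as follows:
  
  encoding - Encoding to be used on setEncoding of response data. If null, the body is returned as a Buffer. Anything else (including the default value of undefined) will be passed as the encoding parameter to toString() (meaning this is effectively utf8 by default). (Note: if you expect binary data, you should set encoding: null.)

iconv-lite works with the http module and the request module, but it cannot be used directly with superagent. superagent reads the response as utf8, so running iconv on the result afterwards does not help. The page is gbk-encoded, but sres.text is already the output of a decode step, meaning it has already been converted to a utf8 string. Bytes that were not valid utf8 were replaced during that step, so converting the string back into a Buffer cannot restore the original bytes, and the result is necessarily wrong.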
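
The point that an already-decoded string cannot be turned back into the original bytes can be checked with a small sketch that uses only node's built-in Buffer. The gbk bytes for "中文" are hard-coded as an illustrative assumption; the toString call stands in for what a utf8-only client does before handing over the text.

```javascript
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]); // "中文" in gbk

// Stand-in for a client that decodes the body as utf8 before returning it.
const text = gbkBytes.toString('utf8');

// Attempt to recover the raw bytes from the already-decoded string.
const recovered = Buffer.from(text, 'utf8');

console.log(gbkBytes.toString('hex'));
console.log(recovered.toString('hex'));
console.log(recovered.equals(gbkBytes)); // false
```

The recovered Buffer holds the utf8 encoding of the replacement characters instead of the page's bytes, so passing it to iconv's decode has nothing correct to work with.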

Original article: https://www.cnblogs.com/lessfish/p/5157887.html