The Scrapy framework:
spider sends requests ==> engine ==> scheduler ==> Downloader fetches the response ==> spider parses the data ==> item ==> item pipeline.
Create a project (scrapy startproject xxx): scaffold a new crawler project
Define the target (edit items.py): declare the fields to extract
Write the spider (spiders/xxspider.py): implement the spider and start crawling
Store the results (pipelines.py): design pipelines to persist the scraped data
Running the project:
From the command line: scrapy crawl myspider
From PyCharm: from scrapy import cmdline
              cmdline.execute('scrapy crawl myspider'.split())
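Why the `.split()`: `cmdline.execute` takes the command as an argv-style list rather than one string, so the string is split on whitespace first. A quick stdlib-only illustration:

```python
# cmdline.execute expects an argv-style list, so the command string is split:
cmd = 'scrapy crawl myspider'.split()
print(cmd)  # ['scrapy', 'crawl', 'myspider']
```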
Pipelines:
First, enable the pipeline in settings.py:
ITEM_PIPELINES = {
    # 'mySpider.pipelines.mySpiderPipelines': 100,  # lower number = runs earlier
    'mySpider.pipelines.MyspiderPipeline': 300,
}
Then, in pipelines.py:
import json

class MyspiderPipeline(object):
    def __init__(self):
        self.filename = open('teacher.json', 'w', encoding='utf8')

    # process each item
    def process_item(self, item, spider):
        jsontxt = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(jsontxt)
        return item  # pass the item on to any later pipeline

    # called when the spider finishes
    def close_spider(self, spider):
        self.filename.close()
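What process_item writes per item can be sketched with the stdlib alone (no Scrapy needed); the field names below are made up for illustration:

```python
import json

# a plain dict standing in for a scraped Item (hypothetical fields)
item = {'name': '张三', 'info': 'teacher'}

# ensure_ascii=False keeps the Chinese characters readable in the output file
line = json.dumps(item, ensure_ascii=False) + '\n'
print(line, end='')  # {"name": "张三", "info": "teacher"}
```

One JSON object per line gives a JSON-lines file, which is easy to append to while the spider is still running.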
Requesting the next page via a callback (myspider.py, placed after the for loop):
    # re-enqueue the request with the scheduler; it is handed to the downloader
    yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
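The offset bookkeeping behind that yield can be sketched without Scrapy; the base URL and page size of 20 below are hypothetical stand-ins for self.url / self.offset:

```python
# hypothetical values; in the spider these would be self.url and self.offset
url = 'http://example.com/teachers?start='
offset = 0

# build the URLs the spider would request for the first three pages
pages = []
for _ in range(3):
    pages.append(url + str(offset))
    offset += 20  # step to the next page, mirroring self.offset += 20

print(pages)
```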
Setting default headers (settings.py):
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}
Setting a download delay (settings.py):
DOWNLOAD_DELAY = 3  # wait 3 seconds between requests; commented out by default
Enabling pipelines (settings.py):
ITEM_PIPELINES = {
    # 'mySpider.pipelines.mySpiderPipelines': 100,
    'mySpider.pipelines.MyspiderPipeline': 300,
}
Pipeline for writing text: use the MyspiderPipeline shown above (it writes each item as one JSON line).
Pipeline for downloading images:
import os
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings

class MyImagesPipeline(ImagesPipeline):  # don't shadow the imported class name
    # read the storage path configured in settings.py
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

    def get_media_requests(self, item, info):
        image_url = item['imagelink']
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_path = [x['path'] for ok, x in results if ok]
        # rename the downloaded file to <nickname>.jpg
        os.rename(self.IMAGES_STORE + '/' + image_path[0],
                  self.IMAGES_STORE + '/' + item['nickname'] + '.jpg')
        item['imagePath'] = self.IMAGES_STORE + '/' + item['nickname'] + '.jpg'
        return item
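The renaming step in item_completed can be exercised with plain stdlib calls; a sketch using a temporary directory in place of IMAGES_STORE and a made-up nickname:

```python
import os
import tempfile

store = tempfile.mkdtemp()  # stands in for IMAGES_STORE

# simulate the file the images pipeline saves under 'full/<hash>.jpg'
os.makedirs(os.path.join(store, 'full'))
hashed = os.path.join(store, 'full', 'a1b2c3.jpg')
open(hashed, 'wb').close()

# rename it to <nickname>.jpg, as the pipeline above does
nickname = 'teacher01'  # hypothetical item['nickname']
target = os.path.join(store, nickname + '.jpg')
os.rename(hashed, target)
print(os.path.exists(target))  # True
```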
Points to note:
url = re.sub(r'\d+', str(page), response.url)
re.sub(s1, s2, s3) replaces every match of pattern s1 in string s3 with s2 (note the raw string: the pattern needs the backslash in \d+)
content = json.dumps(dict(item), ensure_ascii=False)  # ensure_ascii=False keeps Chinese characters as-is instead of escaping them to \uXXXX
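A quick check of that re.sub call, with a hypothetical URL for illustration:

```python
import re

# replace the page number in a URL with the next page, as the note describes
url = 'http://www.example.com/page/1'
page = 2
next_url = re.sub(r'\d+', str(page), url)
print(next_url)  # http://www.example.com/page/2
```

This only behaves as intended when the page number is the sole run of digits in the URL; otherwise the pattern needs to be more specific.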