• The Scrapy framework


    The Scrapy framework:

    Spiders send requests ==> Engine ==> Scheduler ==> Downloader; the downloaded response ==> Spiders ==> data processing: Item, Pipeline.

    Create the project (scrapy startproject xxx): create a new crawler project

    Define the target (edit items.py): define what you want to scrape (a minimal sketch follows after this list)

    Write the spider (spiders/xxspider.py): write the spider and start crawling

    Store the content (pipelines.py): design a pipeline to store the scraped data
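    A minimal items.py sketch for the "define the target" step (the field names name and info are hypothetical, chosen for illustration):

    import scrapy

    class MyspiderItem(scrapy.Item):
        # one Field per value you plan to scrape
        name = scrapy.Field()
        info = scrapy.Field()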

    Running the crawler project:

    From the command line: scrapy crawl myspider

    From PyCharm:

    from scrapy import cmdline
    cmdline.execute('scrapy crawl myspider'.split())

    Pipelines:

    First, in settings.py:

    ITEM_PIPELINES = {
        # 'mySpider.pipelines.mySpiderPipelines': 100,
        'mySpider.pipelines.MyspiderPipeline': 300,
    }

    The number (0-1000) is the pipeline's priority; items flow through lower-numbered pipelines first.

    Then, in pipelines.py:

    import json

    class MyspiderPipeline(object):
        def __init__(self):
            self.filename = open('teacher.json', 'w', encoding='utf8')

        # process each item
        def process_item(self, item, spider):
            jsontxt = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(jsontxt)
            return item  # return the item so later pipelines still receive it

        # called when the spider finishes
        def close_spider(self, spider):
            self.filename.close()
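    (For simple cases the same result needs no custom pipeline at all: Scrapy's built-in feed export can write the items directly, e.g. scrapy crawl myspider -o teacher.json.)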

    Callback to the next page (in myspider.py, placed outside the for loop):

    # re-enqueue the request with the scheduler, to be fetched by the downloader

    yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
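    A minimal sketch of how this fits into a spider (the base URL, selectors, and paging step are hypothetical, for illustration only):

    import scrapy
    from mySpider.items import MyspiderItem

    class MySpider(scrapy.Spider):
        name = 'myspider'
        url = 'http://example.com/list?start='   # hypothetical base URL
        offset = 0
        start_urls = [url + str(offset)]

        def parse(self, response):
            for node in response.xpath('//div[@class="item"]'):   # hypothetical selector
                item = MyspiderItem()
                item['name'] = node.xpath('./h3/text()').get()
                yield item

            # outside the for loop: request the next page until a stop condition
            if self.offset < 100:
                self.offset += 10
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)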

    Setting default request headers:

    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        # 'Accept-Language': 'en',
    }

    Setting the download delay (uncomment it in settings.py):

    DOWNLOAD_DELAY = 3
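    Settings that often accompany the delay (the values here are illustrative, not from the original post):

    RANDOMIZE_DOWNLOAD_DELAY = True      # wait 0.5x to 1.5x of DOWNLOAD_DELAY
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap parallel requests per domain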


    A pipeline that handles images:

    import os

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.utils.project import get_project_settings

    class MyImagesPipeline(ImagesPipeline):   # renamed so it does not shadow the imported ImagesPipeline
        # read the variable set in settings.py
        IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

        def get_media_requests(self, item, info):
            image_url = item["imagelink"]
            yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            image_path = [x["path"] for ok, x in results if ok]

            # rename the downloaded file to <nickname>.jpg
            os.rename(self.IMAGES_STORE + "/" + image_path[0],
                      self.IMAGES_STORE + "/" + item["nickname"] + ".jpg")

            item["imagePath"] = self.IMAGES_STORE + "/" + item["nickname"] + ".jpg"

            return item
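    For this pipeline to run, settings.py also needs the storage path and the pipeline registered (the path and class name here follow the sketch above):

    IMAGES_STORE = './images'
    ITEM_PIPELINES = {
        'mySpider.pipelines.MyImagesPipeline': 300,
    }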

    Points to note:

    url = re.sub(r'\d+', str(page), response.url)

    re.sub(s1, s2, s3) replaces every match of pattern s1 in s3 with s2

    content = json.dumps(dict(item), ensure_ascii=False) -- with ensure_ascii=False the Chinese characters are written as-is instead of being escaped to \uXXXX sequences
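    A quick worked example of both calls (values are illustrative):

    import re
    import json

    re.sub(r'\d+', '2', 'http://example.com/page/1')   # -> 'http://example.com/page/2'
    json.dumps({'name': '老师'}, ensure_ascii=False)     # -> '{"name": "老师"}'
    json.dumps({'name': '老师'})                         # -> '{"name": "\u8001\u5e08"}'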

  • Related reading:
    Summary of ASR-related techniques
    SLURM basic usage tutorial
    FSMN and its variants: cFSMN, DFSMN, pyramidal-FSMN
    Root mean square error (RMSE), mean absolute error (MAE), and standard deviation
    Text encoding conversion on Linux and subtitle processing
    The PyTorch-Kaldi speech recognition toolkit
    SRILM n-gram discounting and smoothing algorithms
    Calling shell commands from awk and passing arguments
    Learning linear regression with scikit-learn and pandas
    Logistic regression with worked examples
  • Original post: https://www.cnblogs.com/xuezhihao/p/11636153.html