• Creating a crawler project with the scrapy tool


    1. Create the crawler project with scrapy: scrapy startproject scrape_project_name

    >scrapy startproject books_scrape
    New Scrapy project 'books_scrape', using template directory 's:\users\jiangshan\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    D:\Workspace\ScrapyTest\books_scrape

    You can start your first spider with:
    cd books_scrape
    scrapy genspider example example.com

    2. Change into the project directory: >cd books_scrape

    3. View the directory structure: >tree /F

    >tree /F
    Folder PATH listing for volume DATA1
    Volume serial number is 3A2E-EB05
    D:.
    │  scrapy.cfg
    │
    └─books_scrape
        │  items.py
        │  middlewares.py
        │  pipelines.py
        │  settings.py
        │  __init__.py
        │
        ├─spiders
        │  │  __init__.py
        │  │
        │  └─__pycache__
        └─__pycache__
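
As a cross-platform alternative to reading the `tree /F` output by eye, the layout can be checked with a few lines of Python. This is only a minimal sketch: the list of expected files is copied from the listing above, and the helper name `missing_project_files` is a hypothetical name chosen for illustration, not part of Scrapy.

```python
from pathlib import Path

# Files that the tree listing above shows in a freshly created project.
EXPECTED_FILES = [
    "scrapy.cfg",
    "books_scrape/__init__.py",
    "books_scrape/items.py",
    "books_scrape/middlewares.py",
    "books_scrape/pipelines.py",
    "books_scrape/settings.py",
    "books_scrape/spiders/__init__.py",
]


def missing_project_files(root):
    """Return the expected project files that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the directory that contains scrapy.cfg.
    missing = missing_project_files(".")
    print("missing files:", missing if missing else "none")
```

Running the script from the project root should report no missing files if `scrapy startproject` completed normally.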

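    4. Use the scrapy genspider <SPIDER_NAME> <DOMAIN> command to generate (from a template) the Spider file and the Spider class. The command takes two arguments: the name of the Spider and the domain (website) to crawl.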

    >scrapy genspider books books.toscrape.com
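
To make "generate from a template" concrete, the sketch below imitates the idea with Python's standard `string.Template`. It is not Scrapy's own code: the template text is an approximation of the basic spider template (the exact text differs between Scrapy versions), and `render_spider` is a hypothetical helper introduced only for this illustration.

```python
from string import Template

# Approximation of a basic spider template; the real template shipped with
# Scrapy varies by version, so treat this text as an assumption.
SPIDER_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = "${name}"
    allowed_domains = ["${domain}"]
    start_urls = ["http://${domain}/"]

    def parse(self, response):
        pass
''')


def render_spider(name, domain):
    """Return source text similar to what a genspider-like tool would write."""
    # Build a class name such as "BooksSpider" from the spider name "books".
    classname = "".join(part.capitalize() for part in name.split("_")) + "Spider"
    return SPIDER_TEMPLATE.substitute(classname=classname, name=name, domain=domain)


# The same two arguments as in the command above: spider name and domain.
books_source = render_spider("books", "books.toscrape.com")
print(books_source)
```

The printed text gives a rough picture of what to expect in spiders/books.py: a Spider subclass whose `name` is the first argument and whose `allowed_domains` holds the second, with an empty `parse` method to fill in.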

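    5. View the directory structure again (ignore the entries marked in blue for now; they are there because I debug on a remote server):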

    >tree /F

    D:.
    │  scrapy.cfg
    │
    └─books_scrape
        │  items.py
        │  middlewares.py
        │  pipelines.py
        │  run.py
        │  settings.py
        │  __init__.py
        │
        ├─.idea
        │      books_scrape.iml
        │      deployment.xml
        │      misc.xml
        │      modules.xml
        │      remote-mappings.xml
        │      workspace.xml
        │
        ├─spiders
        │  │  books.py
        │  │  __init__.py
        │  │
        │  └─__pycache__
        │          __init__.cpython-37.pyc
        │
        └─__pycache__
                settings.cpython-37.pyc
                __init__.cpython-37.pyc

    6. Open PyCharm and open the books_scrape project you created, using the location of the scrapy.cfg configuration file as the reference point.

    7. In the same directory as the spiders folder, create a run.py file with the following content:

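    from scrapy import cmdline

    # cmdline.execute() exits the process when the crawl finishes, so only one
    # call can run per script. Keep one line active and comment out the rest.
    cmdline.execute('scrapy crawl books'.split())

    # Alternatives that also export the scraped items to a feed file:
    # cmdline.execute('scrapy crawl books -o %(name)s%(time)s.csv'.split())
    # cmdline.execute('scrapy crawl books -o books.csv'.split())
    # cmdline.execute('scrapy crawl books -o books.xml'.split())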
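
The command strings in run.py do two things that are easy to miss: `.split()` turns the string into the argument list that `cmdline.execute` expects, and the `%(name)s%(time)s` placeholders in the output file name are filled in with the spider name and a timestamp. The sketch below uses only the standard library to illustrate both. The helper `expand_feed_uri` is hypothetical, and the timestamp format is an assumption modelled on an ISO-style time with colons replaced by hyphens.

```python
from datetime import datetime

command = "scrapy crawl books -o %(name)s%(time)s.csv"

# Splitting on whitespace yields the argv-style list passed to cmdline.execute().
argv = command.split()


def expand_feed_uri(template, spider_name, when):
    """Fill %(name)s and %(time)s using ordinary Python %-formatting."""
    # Assumed format: ISO timestamp without microseconds, ':' replaced by '-'.
    timestamp = when.replace(microsecond=0).isoformat().replace(":", "-")
    return template % {"name": spider_name, "time": timestamp}


# The last element of argv is the output file template.
example = expand_feed_uri(argv[-1], "books", datetime(2019, 6, 29, 8, 30, 0))
print(argv)     # ['scrapy', 'crawl', 'books', '-o', '%(name)s%(time)s.csv']
print(example)  # books2019-06-29T08-30-00.csv
```

Because the timestamp changes on every run, the templated file name avoids adding to an existing books.csv when the crawl is repeated.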


  • Original article: https://www.cnblogs.com/jeshy/p/11105766.html