• Python Crawler Series: Installing and Using Scrapy


    This post covers installing the Scrapy framework and using it to build a simple crawler.

    Installing Scrapy

    From the command line, change into the C:\Anaconda2\Scripts directory and run: conda install Scrapy

    Creating a Scrapy Project

    1) Change into the directory where you want the project to live and run scrapy startproject <project_name> to create it

    The new project's directory layout:

    demo/
        scrapy.cfg
        demo/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                ...

    These files are:

    • scrapy.cfg: the project's configuration file
    • demo/: the project's Python module.
    • demo/items.py: the project's items file, which defines the fields you intend to scrape.
    • demo/pipelines.py: the project's pipelines file, which defines how scraped data is stored.
    • demo/settings.py: the project's settings file, which configures Scrapy's components; it is fairly involved and can mostly be left at defaults for now.
    • demo/spiders/: the directory for spider code, which implements the actual crawling.

    Defining the Crawler Files

    1) Define the Item

    # items.py
    import scrapy

    class DoubanItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        url = scrapy.Field()
        rate = scrapy.Field()
        tag = scrapy.Field()

    2) Define the spider

    # coding: utf-8
    import json
    import re
    from urllib.parse import unquote

    import scrapy
    from douban.items import DoubanItem


    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["douban.com"]
        start_urls = [
            "https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=1000&page_start=0",
        ]

        def start_requests(self):
            reqs = []
            tags = [u'热门', u'最新', u'经典', u'豆瓣高分', u'冷门佳片', u'华语', u'欧美',
                    u'韩国', u'日本', u'动作', u'喜剧', u'爱情', u'科幻', u'悬疑', u'恐怖', u'文艺']

            # build one request per tag
            for tag in tags:
                url = ('https://movie.douban.com/j/search_subjects?type=movie&tag='
                       + tag + '&sort=recommend&page_limit=1000&page_start=0')
                reqs.append(scrapy.Request(url))

            return reqs

        def parse(self, response):
            # the endpoint returns JSON, not HTML; recover the tag from the URL
            url = response.url
            tag = unquote(re.findall(r'tag=(.*?)&', url)[0])

            data = json.loads(response.body)
            items = []
            for subject in data['subjects']:
                item = DoubanItem()
                item['url'] = subject['url']
                item['title'] = subject['title']
                item['rate'] = subject['rate']
                item['tag'] = tag
                items.append(item)

            return items
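
    The parse step can be exercised without Scrapy: the tag is pulled out of the request URL with a regular expression and percent-decoded, and the response body is plain JSON of the form {"subjects": [...]}. A minimal sketch, using a made-up sample payload with the same shape as the real endpoint's response:

    ```python
    import json
    import re
    from urllib.parse import quote, unquote

    # A request URL as built in start_requests, with the tag percent-encoded
    url = ('https://movie.douban.com/j/search_subjects?type=movie&tag='
           + quote('热门') + '&sort=recommend&page_limit=1000&page_start=0')

    # Recover the human-readable tag, as parse() does
    tag = unquote(re.findall(r'tag=(.*?)&', url)[0])
    print(tag)  # 热门

    # Invented sample body illustrating the fields the spider reads
    body = ('{"subjects": [{"title": "Example", '
            '"url": "https://movie.douban.com/subject/1/", "rate": "8.5"}]}')
    for subject in json.loads(body)['subjects']:
        record = {'title': subject['title'], 'url': subject['url'],
                  'rate': subject['rate'], 'tag': tag}
        print(record['title'], record['rate'])
    ```

    Building the URL with quote() (or urllib.parse.urlencode) is safer than pasting the raw Chinese tag into the query string.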

    3) Define the pipeline

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import pymongo

    class DoubanPipeline(object):

        def open_spider(self, spider):
            # open the connection once, not on every item
            settings = spider.settings
            self.client = pymongo.MongoClient(settings.get('MONGO_HOST'),
                                              settings.get('MONGO_PORT'))
            self.coll = self.client[settings.get('MONGO_DB')][settings.get('MONGO_COLL')]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.coll.insert_one(dict(item))
            return item
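
    The pipeline's storage step can be sanity-checked without a running MongoDB by substituting a stub for the collection. FakeCollection and DoubanPipelineSketch below are hypothetical stand-ins for illustration, not part of pymongo or the project:

    ```python
    # A stub that records inserts, standing in for a pymongo collection
    class FakeCollection:
        def __init__(self):
            self.docs = []

        def insert_one(self, doc):
            self.docs.append(doc)

    # Mirrors the pipeline's process_item step against any collection-like object
    class DoubanPipelineSketch:
        def __init__(self, coll):
            self.coll = coll

        def process_item(self, item, spider):
            self.coll.insert_one(dict(item))
            return item

    coll = FakeCollection()
    pipeline = DoubanPipelineSketch(coll)
    item = {'title': 'Example', 'url': 'https://movie.douban.com/subject/1/',
            'rate': '8.5', 'tag': '热门'}
    pipeline.process_item(item, spider=None)
    print(len(coll.docs))  # 1
    ```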

    4) Define the settings

    MONGO_HOST = "127.0.0.1"  # host IP
    MONGO_PORT = 27017  # port number
    MONGO_DB = "Spider"  # database name
    MONGO_COLL = "douban"  # collection name
    # MONGO_USER = "Ryana"
    # MONGO_PSW = "123456"
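
    Note that the pipeline must also be enabled in settings.py via ITEM_PIPELINES, or its process_item will never be called. A minimal sketch, assuming the project module is named douban as in the spider's imports (300 is the conventional default priority; lower numbers run earlier):

    ```python
    # settings.py (continued): register the pipeline so Scrapy actually runs it.
    # Keys are dotted paths to pipeline classes, values are run-order priorities.
    ITEM_PIPELINES = {
        'douban.pipelines.DoubanPipeline': 300,
    }
    ```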

    Running the Spider

    Change into the project directory and run scrapy crawl dmoz (the value of the spider's name attribute). Robomongo, a MongoDB GUI, is also worth recommending for inspecting the stored results.


  • Original post: https://www.cnblogs.com/Ryana/p/6147782.html