• Scrapy in Practice (Part 4): Scraping Bra Product Reviews from JD.com


    Create the Scrapy project

    scrapy startproject jingdong
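
    The command generates the standard project skeleton (the exact files can vary slightly across Scrapy versions); the files edited in the rest of this post all live inside the inner jingdong package:

    jingdong/
        scrapy.cfg
        jingdong/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py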

    Fill in items.py

    This is where the fields we want to store are defined.

    import scrapy
    
    class JingdongItem(scrapy.Item):
        content = scrapy.Field()
        creationTime = scrapy.Field()
        productColor = scrapy.Field()
        productSize = scrapy.Field()
        userClientShow = scrapy.Field()
        userLevelName = scrapy.Field()
    class IdItem(scrapy.Item):
        id = scrapy.Field()

    Fill in middlewares.py

    The middleware's job is to attach a random User-Agent header to every request.

    import random
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    
    
    class RandomUserAgent(UserAgentMiddleware):
        def __init__(self, agents):
            self.agents = agents
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.getlist("USER_AGENTS"))
    
        def process_request(self, request, spider):
            request.headers.setdefault('User-Agent', random.choice(self.agents))

    Fill in pipelines.py

    Here the scraped results are stored in MongoDB.

    from pymongo import MongoClient
    
    class JingdongPipeline(object):
    
        collection = 'jingdong_cup'
    
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DB')
            )
    
        # Called automatically when the spider is opened
        def open_spider(self, spider):
            self.client = MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
        
        # Called when the spider is closed
        def close_spider(self, spider):
            self.client.close()
    
        def process_item(self, item, spider):
            table = self.db[self.collection]
            data = dict(item)
            table.insert_one(data)
            # return the item so that any later pipelines can also process it
            return item
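
    To verify that documents actually reach MongoDB after a crawl, a short pymongo check can be run. This is only a sketch: it assumes a local MongoDB on the default port and reuses the MONGO_DB name and collection name configured in this project.

    from pymongo import MongoClient

    # connect with the same URI configured in settings.py
    client = MongoClient('mongodb://localhost:27017')
    collection = client['JD']['jingdong_cup']

    # count the stored review documents and show one sample
    print(collection.count_documents({}))
    print(collection.find_one())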

    Configure settings.py

    Only the settings below need to be changed; everything else can be left at its defaults.

    BOT_NAME = 'jingdong'
    SPIDER_MODULES = ['jingdong.spiders']
    NEWSPIDER_MODULE = 'jingdong.spiders'
    ROBOTSTXT_OBEY = False
    DOWNLOAD_DELAY = 2
    COOKIES_ENABLED = False
    USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    
    ]
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'jingdong.middlewares.RandomUserAgent': 400
    }
    ITEM_PIPELINES = {
       'jingdong.pipelines.JingdongPipeline': 300,
    }
    
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DB = 'JD'

    Finally, create jingdong_spider.py to implement the crawling logic.

    The overall flow is this: after searching for the product on JD, the first step is to scrape the product id of every item on each result page. A product id is a string of digits; plugging it into the comment URL gives that product's review pages, whose data comes back as JSON wrapped in a JSONP callback, and we also need to page through all of the review pages. A quick sketch of stripping the JSONP wrapper is shown below, followed by the full spider code.
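
    The payload string in this sketch is invented purely for illustration; the real responses use the same fetchJSON_comment98vv callback name and contain the maxPage and comments keys used by the spider.

    import json
    import re

    # example JSONP body (illustrative only, not real JD data)
    body = 'fetchJSON_comment98vv6({"maxPage": 3, "comments": []});'

    # strip the callback wrapper, keeping only the JSON between the parentheses
    match = re.match(r'^fetchJSON_comment98vv\d*\((.*)\);', body)
    data = json.loads(match.group(1))

    print(data['maxPage'])   # -> 3
    print(data['comments'])  # -> []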

    from scrapy import Spider,Request
    from jingdong.items import JingdongItem,IdItem
    import json
    import re
    
    
    class JingdongSpider(Spider):
        name = 'jingdong'
        allowed_domains = []
    
        
        def start_requests(self):
            # search result pages for the URL-encoded keyword 文胸; the page parameter is stepped by 2 to match JD's odd page numbering
            start_urls = ['https://search.jd.com/Search?keyword=%E6%96%87%E8%83%B8&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&suggest=1.his.0.0&page={}&s=1&click=0'.format(str(i)) for i in range(1, 150, 2)]
            for url in start_urls:
                yield Request(url=url, callback=self.parse)
        
        # Collect the product id of every item on a search result page
        def parse(self, response):  
            selector = response.xpath('//ul[@class="gl-warp clearfix"]/li')
            id_list = []
            for info in selector:
                id = info.xpath('@data-sku').extract_first()
                # skip entries without a data-sku attribute and de-duplicate ids
                if id and id not in id_list:
                    id_list.append(id)
                    item = IdItem()
                    item['id'] = id
                    comment_url = 'https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv6&productId={}&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'.format(str(id))
                    # the random User-Agent header is added by the downloader middleware
                    yield Request(url=comment_url, meta={'item': item}, callback=self.parseurl)
        # From the first page of reviews, read the total page count, then request every page
        def parseurl(self, response):
            t = re.findall(r'^fetchJSON_comment98vv\d*\((.*)\);', response.text)
            json_data = json.loads(t[0])  # parse the extracted JSON string into a dict
            page = json_data['maxPage']
            item = response.meta['item']
            id = item['id']
            urls = ['https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv6&productId={}&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1'.format(str(id), str(i)) for i in range(0, int(page))]
        
            for path in urls:
                yield Request(url=path, callback=self.parsebody)
        
        # Parse the review fields from each page of comments
        def parsebody(self, response):
            t = re.findall(r'^fetchJSON_comment98vv\d*\((.*)\);', response.text)  # strip the JSONP callback wrapper, leaving plain JSON
            json_data = json.loads(t[0])

            for comment in json_data['comments']:  # a list of dicts, one per review
                item = JingdongItem()
                try:
                    item['content'] = comment['content']
                    item['creationTime'] = comment['creationTime']
                    item['productColor'] = comment['productColor']
                    item['productSize'] = comment['productSize']
                    item['userClientShow'] = comment['userClientShow']
                    item['userLevelName'] = comment['userLevelName']
                    yield item
                except KeyError:  # skip reviews that are missing any of these fields
                    continue
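
    With everything in place, start the crawl from the project root (the directory containing scrapy.cfg):

    scrapy crawl jingdong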
    

    The complete code can be downloaded from GitHub: https://github.com/cnkai/jingdong-cup

  • Original article: https://www.cnblogs.com/lxbmaomao/p/10363434.html