Python Crawler [Hands-On]: Scraping a Recruitment Site with the Scrapy Framework and Storing the Data in MongoDB


    Create the project

    scrapy startproject zhaoping

    Create the spider

    cd zhaoping
    scrapy genspider hr zhaopingwang.com

    Directory structure
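
    A typical layout after the two commands above (Scrapy's default scaffolding; only the three files edited below matter here):

    zhaoping/
        scrapy.cfg
        zhaoping/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                hr.py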

    items.py

    import scrapy

    class TencentItem(scrapy.Item):
        title = scrapy.Field()
        position = scrapy.Field()
        publish_date = scrapy.Field()

    pipelines.py

    from pymongo import MongoClient

    mongoclient = MongoClient(host='192.168.226.150', port=27017)
    collection = mongoclient['zhaoping']['hr']

    class TencentPipeline(object):
        def process_item(self, item, spider):
            print(item)
            # the Item must be converted to a plain dict before writing;
            # insert_one() replaces the insert() method removed in PyMongo 4
            collection.insert_one(dict(item))
            return item
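
    One step the post skips: the pipeline only runs once it is enabled in settings.py. A minimal sketch, assuming the default project name zhaoping created above:

    # settings.py
    ITEM_PIPELINES = {
        'zhaoping.pipelines.TencentPipeline': 300,
    }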

    spiders/hr.py

    import scrapy

    from zhaoping.items import TencentItem

    class HrSpider(scrapy.Spider):
        name = 'hr'
        # the spider actually crawls hr.tencent.com, so the placeholder
        # domain passed to genspider above is adjusted here
        allowed_domains = ['hr.tencent.com']
        # assumed entry URL for the old Tencent HR listing page
        start_urls = ['https://hr.tencent.com/position.php']

        def parse(self, response):
            # skip the first row (table header) and the last row (pager)
            tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
            for tr in tr_list:
                item = TencentItem()
                # XPath indices count from 1
                item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
                item["position"] = tr.xpath("./td[2]/text()").extract_first()
                item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
                yield item

            next_url = response.xpath("//a[@id='next']/@href").extract_first()
            # build the absolute URL for the next page
            if next_url != "javascript:;":
                print(next_url)
                next_url = "https://hr.tencent.com/" + next_url
                yield scrapy.Request(url=next_url, callback=self.parse)
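
    Concatenating the host onto the relative href works for this site, but Scrapy's built-in response.urljoin is the more robust choice if a link ever comes back absolute or protocol-relative:

    next_url = response.urljoin(next_url)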

    That is all it takes to get the data.
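
    Run the spider from the project root:

    scrapy crawl hr

    Then a quick sanity check on the stored data (a sketch, assuming the Mongo host from pipelines.py is reachable):

    from pymongo import MongoClient

    client = MongoClient(host='192.168.226.150', port=27017)
    # how many postings made it in, and what one record looks like
    print(client['zhaoping']['hr'].count_documents({}))
    print(client['zhaoping']['hr'].find_one())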

Original post: https://www.cnblogs.com/tangkaishou/p/10264628.html