Python Crawler [Hands-On]: Scraping a Recruitment Site with the Scrapy Framework and Storing the Data in MongoDB


    Create the project

    scrapy startproject zhaoping

    Create the spider

    cd zhaoping
    scrapy genspider hr zhaopingwang.com

    Directory structure
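
    A typical layout after the two commands above (Scrapy's default scaffolding; only the three files edited below matter here):

    zhaoping/
        scrapy.cfg
        zhaoping/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                hr.py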

    items.py

    import scrapy

    class TencentItem(scrapy.Item):
        title = scrapy.Field()
        position = scrapy.Field()
        publish_date = scrapy.Field()

    pipelines.py

    from pymongo import MongoClient

    mongoclient = MongoClient(host='192.168.226.150', port=27017)
    collection = mongoclient['zhaoping']['hr']

    class TencentPipeline(object):
        def process_item(self, item, spider):
            print(item)
            # the Item must be converted to a plain dict before writing;
            # insert_one() replaces the insert() method removed in PyMongo 4
            collection.insert_one(dict(item))
            return item
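
    One step the post skips: the pipeline only runs once it is enabled in settings.py. A minimal sketch, assuming the default project name zhaoping created above:

    # settings.py
    ITEM_PIPELINES = {
        'zhaoping.pipelines.TencentPipeline': 300,
    }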

    spiders/hr.py

    import scrapy

    from zhaoping.items import TencentItem

    class HrSpider(scrapy.Spider):
        name = 'hr'
        # the spider actually crawls hr.tencent.com, so the placeholder
        # domain passed to genspider above is adjusted here
        allowed_domains = ['hr.tencent.com']
        # assumed entry URL for the old Tencent HR listing page
        start_urls = ['https://hr.tencent.com/position.php']

        def parse(self, response):
            # skip the first row (table header) and the last row (pager)
            tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
            for tr in tr_list:
                item = TencentItem()
                # XPath indices count from 1
                item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
                item["position"] = tr.xpath("./td[2]/text()").extract_first()
                item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
                yield item

            next_url = response.xpath("//a[@id='next']/@href").extract_first()
            # build the absolute URL for the next page
            if next_url != "javascript:;":
                print(next_url)
                next_url = "https://hr.tencent.com/" + next_url
                yield scrapy.Request(url=next_url, callback=self.parse)
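
    Concatenating the host onto the relative href works for this site, but Scrapy's built-in response.urljoin is the more robust choice if a link ever comes back absolute or protocol-relative:

    next_url = response.urljoin(next_url)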

    That is all it takes to get the data.
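
    Run the spider from the project root:

    scrapy crawl hr

    Then a quick sanity check on the stored data (a sketch, assuming the Mongo host from pipelines.py is reachable):

    from pymongo import MongoClient

    client = MongoClient(host='192.168.226.150', port=27017)
    # how many postings made it in, and what one record looks like
    print(client['zhaoping']['hr'].count_documents({}))
    print(client['zhaoping']['hr'].find_one())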

Original post: https://www.cnblogs.com/tangkaishou/p/10264628.html