使用scrapy-crawlSpider 爬取tencent 招聘

Tencent 招聘信息网站

创建项目

scrapy startproject Tencent

创建爬虫

scrapy  genspider -t crawl tencent

1. 起始url start_url = 'https://hr.tencent.com/position.php'

在起始页面，需要获取该也页面上的每个职位的详情页的url,同时需要提取下一页的url地址，做同样的操作。

因此起始页url地址的提取，分为两类：

　　1. 每个职位详情页的url地址的提取

　　2. 下一页url地址的提取，并且得到的页面做的操作和起始页的操作一样。

url地址的提取

1. 提取详情页url，详情页的url地址如下：

提取规则详情页的规则：

rules = (
        # 提取详情页的url地址  ，详情页url地址对应的响应，需要进行数据提取，所有需要有回调函数，用来解析数据
        
        Rule(LinkExtractor(restrict_xpaths=("//table[@class='tablelist']//td[@class='l square']")), callback='parse_item')

    )

提取下一页的htmlj所在的位置：

2 获取下一页的url 规则：

rules = (
        # 提取详情页的url地址
        # Rule(LinkExtractor(allow=r'position_detail.php?id=d+&keywords=&tid=0&lid=0'), callback='parse_item'), # 这个表达式有错，这里不用正则
        Rule(LinkExtractor(restrict_xpaths=("//table[@class='tablelist']//td[@class='l square']")), callback='parse_item'),
        # 翻页
        Rule(LinkExtractor(restrict_xpaths=("//a[@id='next']")), follow=True),
    )

获取详情页数据

1.详情数据提取(爬虫逻辑)

1.获取标题

xpath:

item['title'] = response.xpath('//td[@id="sharetitle"]/text()').extract_first()

2. 获取工作地点，职位，招聘人数

xpath:

 item['addr'] = response.xpath('//tr[@class="c bottomline"]/td[1]//text()').extract()[1]

 item['position'] = response.xpath('//tr[@class="c bottomline"]/td[2]//text()').extract()[1]

 item['num'] = response.xpath('//tr[@class="c bottomline"]/td[3]//text()').extract()[1]

3.工作要求抓取

xpath:

item['skill'] =response.xpath('//ul[@class="squareli"]/li/text()').extract()

爬虫的代码：

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import TencentItem

class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    rules = (
        # 提取详情页的url地址
        # Rule(LinkExtractor(allow=r'position_detail.php?id=d+&keywords=&tid=0&lid=0'), callback='parse_item'), # 这个表达式有错
        Rule(LinkExtractor(restrict_xpaths=("//table[@class='tablelist']//td[@class='l square']")), callback='parse_item'),
        # 翻页
        Rule(LinkExtractor(restrict_xpaths=("//a[@id='next']")), follow=True),
    )

    def parse_item(self, response):

        item = TencentItem()


        item['title'] = response.xpath('//td[@id="sharetitle"]/text()').extract_first()

        item['addr'] = response.xpath('//tr[@class="c bottomline"]/td[1]//text()').extract()[0]

        item['position'] = response.xpath('//tr[@class="c bottomline"]/td[2]//text()').extract()[0]

        item['num'] = response.xpath('//tr[@class="c bottomline"]/td[3]//text()').extract()[0]

        item['skill'] =response.xpath('//ul[@class="squareli"]/li/text()').extract()

        print(dict(item))

        return item

tencent.py

2. 数据存储

1.settings.py 配置文件，配置如下信息

ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
ITEM_PIPELINES = {
   'jd.pipelines.TencentPipeline': 300,

}

2. items.py 中：

import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    addr = scrapy.Field()
    position = scrapy.Field()
    num = scrapy.Field()
    skill = scrapy.Field()

3. pipeline.py中：

import  pymongo

class TencentPipeline(object):


    def open_spider(self,spider):
        # 爬虫开启是连接数据库
        client = pymongo.MongoClient()

        collention = client.tencent.ten

        self.client =client

        self.collention = collention


        pass
    def process_item(self, item, spider):

        # 数据保存在mongodb 中


        self.collention.insert(dict(item))


        return item


    def colse_spdier(self,spider):

        # 爬虫结束，关闭数据库

        self.client.close()

启动项目

1.先将MongoDB数据库跑起来。

2.执行爬虫命令：

scrapy  crawl  tencent

3. 执行程序后的效果：

相关阅读:
启动django报错
 celery简单使用
 git简单使用
 selinux干扰mysql启动
 python操作xml文件时，带有^M符号
 获取服务器内网地址
 WebStorm激活
 linux nohup python 后台运行无输出问题
 安装FTP
sql server还原数据库代码
原文地址：https://www.cnblogs.com/knighterrant/p/10714234.html

最新文章
组件的生命周期
 state-表单
 state
事件处理
 props
组件/组件属性/复合组件
 创建React工程
 UISearchBar 自定义处理
 CoreML的学习
 ARKit的使用