Scrapy basics: CrawlSpider (crawling Tencent campus recruitment listings):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tencent.items import TencentItem

class TencentSpider(CrawlSpider):
    name = "Tencent"
    allowed_domains = ["tencent.com"]
    # url="http://hr.tencent.com/position.php?&start="
    # offset=0
    start_urls = [ "http://hr.tencent.com/position.php?&start=0#a"]

    # Follow pagination links whose URL contains "start=" followed by digits.
    page_link = LinkExtractor(allow=(r"start=\d+",))

    rules=[
            Rule(page_link,callback = "parseContent",follow=True)
    ]

    def parseContent(self, response):
        # Each job posting is a table row with class "even" or "odd".
        rows = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for infos in rows:
            item = TencentItem()
            item['positionname'] = infos.xpath("./td[1]/a/text()").extract()[0]
            item['positionlink'] = infos.xpath("./td[1]/a/@href").extract()[0]
            # The category cell may be empty, so fall back to "" instead of indexing [0].
            item['positionType'] = infos.xpath("./td[2]/text()").extract_first(default="")
            item['positionNum'] = infos.xpath("./td[3]/text()").extract()[0]
            item['positionLocation'] = infos.xpath("./td[4]/text()").extract()[0]
            item['publishTime'] = infos.xpath("./td[5]/text()").extract()[0]

            yield item
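
The `allow` argument of `LinkExtractor` is treated as a regular expression that is searched against each absolute URL, so the backslash in `\d+` matters. The following is a minimal sketch using only the standard `re` module. The sample URLs and the `matches` helper are made up for illustration, and this is not how Scrapy itself is invoked.

```python
import re

# Hypothetical sample links, modeled on the pagination URLs of the start page.
SAMPLE_URLS = [
    "http://hr.tencent.com/position.php?&start=10#a",
    "http://hr.tencent.com/position.php?&start=20#a",
    "http://hr.tencent.com/about.php",
]


def matches(pattern, urls):
    """Return the URLs in which the pattern is found anywhere (re.search)."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


# With the backslash, \d+ matches one or more digits after "start=".
with_escape = matches(r"start=\d+", SAMPLE_URLS)

# Without the backslash, d+ matches the literal letter "d" instead of digits.
without_escape = matches(r"start=d+", SAMPLE_URLS)

print(with_escape)
print(without_escape)
```

Comparing the two printed lists shows which pattern actually selects the pagination links.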


Run: scrapy crawl Tencent
#Note: never name the callback parse. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will fail to run.
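
The reason behind this note can be illustrated with a toy model in plain Python. This is only a sketch of the method-override problem: `ToyCrawlSpider`, `GoodSpider`, `BrokenSpider` and the substring-based rules are hypothetical and are not Scrapy's actual implementation.

```python
class ToyCrawlSpider:
    """Hypothetical stand-in for a crawl-style base class (not Scrapy's code)."""

    rules = []  # list of (substring to look for in a link, callback name)

    def parse(self, links):
        # The base class relies on parse() to apply the rules to every link.
        results = []
        for link in links:
            for marker, callback_name in self.rules:
                if marker in link:
                    results.append(getattr(self, callback_name)(link))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = [("start=", "parse_content")]

    def parse_content(self, link):
        # A separately named callback leaves the base-class parse() intact.
        return f"scraped {link}"


class BrokenSpider(ToyCrawlSpider):
    rules = [("start=", "parse")]

    def parse(self, link):
        # This replaces the base-class parse(), so the rules are never applied.
        return f"scraped {link}"


links = ["position.php?start=10", "about.php"]
good = GoodSpider().parse(links)      # rules applied link by link
broken = BrokenSpider().parse(links)  # whole list handed to the callback
print(good)
print(broken)
```

In the toy model, `GoodSpider` filters links through its rules, while `BrokenSpider` receives the unfiltered input directly because its `parse` has displaced the dispatch logic.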

Original article: https://www.cnblogs.com/huwei934/p/6971251.html