# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DgSpider(CrawlSpider):
    """Crawl the dygod.net movie listing and scrape each movie detail page."""

    name = 'dg'

    # Must be bare domain names, NOT URLs: with the 'https://' scheme included
    # the offsite middleware filters every request ("Filtered offsite request").
    allowed_domains = ['dygod.net']

    start_urls = ['https://www.dygod.net/html/gndy/dyzz/index.html']

    # Export scraped text as UTF-8 instead of \uXXXX ASCII escape sequences.
    custom_settings = {'FEED_EXPORT_ENCODING': 'utf-8'}

    rules = (
        # Pagination pages: index_2.html, index_3.html, ...
        # The original patterns used a literal 'd+' instead of '\d+' and
        # unescaped '.', so they never matched the intended URLs.
        Rule(LinkExtractor(allow=r'https://www\.dygod\.net/html/gndy/dyzz/index_\d+\.html')),
        # Detail pages: /html/gndy/dyzz/<year>/<id>.html
        Rule(LinkExtractor(allow=r'https://www\.dygod\.net/html/gndy/dyzz/\d+/\d+\.html'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        """Extract the movie name from one detail page.

        :param response: scrapy Response for a single movie page.
        :returns: dict with key ``name`` (None when the selector matches nothing).
        """
        item = {}
        item['name'] = response.css('div[id*=Zoom] p:nth-child(3)::text').get()
        return item
刚开始报错,因为 start_urls 的 https://www.dygod.net/html/gndy/dyzz/index.html 最后多写了一个 /。
后来继续报错 "Filtered offsite request to 'www.dygod.net'",根本原因是 allowed_domains 写成了带 https:// 前缀的完整 URL;当时没搞清楚就直接把 allowed_domains 注释掉了,正确做法应该是改成 allowed_domains = ['dygod.net']。
但是扒下来的汉字都是 \u25ce\u7247\u3000\u3000\u540d\u3000 这样的 Unicode 转义序列——在 settings 里加上 FEED_EXPORT_ENCODING = 'utf-8' 即可正常输出中文。