• Scraping job postings from Lagou with Scrapy


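After studying the Scrapy crawling framework for a while, I wrote a few spiders modeled on other people's. Recently, while writing one to scrape the postings for a particular job from Lagou (拉勾网), I ran into a few small problems. Unlike the static pages I had scraped before, part of the information I needed this time is generated by JavaScript. I am writing it down here for future reference.

The file structure of the whole project is as follows: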

    ├── lagou_python
    │   ├── __init__.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       ├── lagou_spider.py
    │       ├── middleware.py
    │       └── rotate_useragent.py
    └── scrapy.cfg

Let's go through them one by one.

First, look at the page to be scraped. The URL of the listing page (city: Shanghai, job keyword: Python) is:

    http://www.lagou.com/jobs/list_Python?kd=Python&spc=1&pl=&gj=&xl=&yx=&gx=&st=&labelWords=&lc=&workAddress=&city=%E4%B8%8A%E6%B5%B7&requestId=&pn=1

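Comparing pages shows that the subsequent URLs differ only in the value of the pn parameter, so changing pn is enough to fetch the listing pages one after another. How many pages are there in total? Inspecting the page source in Firebug gives the information shown in the screenshot below:

[Screenshot: pagination markup in Firebug, showing the last page number]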
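
As a minimal sketch of the "only pn changes" observation, the helper below rewrites the pn query parameter of the first-page URL and leaves everything else alone. It is written for Python 3 with the standard library only (the project code in this post targets Python 2), and the names `FIRST_PAGE` and `page_url` are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# First listing page, copied from the URL above.
FIRST_PAGE = (
    "http://www.lagou.com/jobs/list_Python?kd=Python&spc=1&pl=&gj=&xl=&yx=&gx=&st="
    "&labelWords=&lc=&workAddress=&city=%E4%B8%8A%E6%B5%B7&requestId=&pn=1"
)


def page_url(first_page, page_number):
    """Return first_page with its pn query parameter set to page_number."""
    parts = urlsplit(first_page)
    # keep_blank_values=True preserves the empty parameters such as "pl=".
    query = parse_qsl(parts.query, keep_blank_values=True)
    query = [(key, str(page_number) if key == "pn" else value) for key, value in query]
    return urlunsplit(parts._replace(query=urlencode(query)))


third_page = page_url(FIRST_PAGE, 3)
```

The spider further down takes the simpler route of appending the page number to a fixed prefix, which works because pn happens to be the last parameter.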

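The pagination is implemented in JavaScript. The developers seem to have customized it, since it does not look much like the usual jQuery pagination code. I only know a little JavaScript, so I did not analyze it in detail. I had just come across a blog post on how Scrapy can handle JavaScript-driven pages, so I tried combining Scrapy with python-webkit. My virtual machine runs Ubuntu 14.04.2 Server, and a few packages have to be installed first: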

    apt-get install python-webkit
    apt-get install python-jswebkit
    apt-get install xvfb

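Once these are installed, Scrapy can be combined with python-webkit in the project to handle JavaScript-driven pages. My rough understanding is that WebKit loads the page and runs its JavaScript, and Scrapy then scrapes the rendered content. In my case, WebKit renders the page so that the pagination has already been built. After that I only need an XPath expression that locates the last page number in the rendered pager, construct the URLs of the remaining listing pages myself, and let Scrapy keep fetching them.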
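
To make the "read the last page number from the rendered pager" step concrete, here is a small Python 3 sketch that uses only the standard library instead of Scrapy's selectors. It approximates the XPath used later in the spider (`//div[@class="Pagination myself"]/a[@href="#"][last()]/@title`). The `SAMPLE` markup and its page count are invented for illustration; only the class name, the `href="#"` links and the `title` attribute come from that XPath.

```python
from html.parser import HTMLParser


class LastPageParser(HTMLParser):
    """Collect the title of each <a href="#"> directly inside the pager div."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth, counted from the pager div
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth or attrs.get("class") == "Pagination myself":
                self.depth += 1
        elif tag == "a" and self.depth == 1 and attrs.get("href") == "#":
            self.titles.append(attrs.get("title"))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def last_page(html):
    """Return the last page number, or 1 if no pager link with a title is found."""
    parser = LastPageParser()
    parser.feed(html)
    if parser.titles and parser.titles[-1]:
        return int(parser.titles[-1])
    return 1


# Hypothetical rendered pager; the real markup may differ.
SAMPLE = """
<div class="Pagination myself">
  <a href="#" title="1">1</a>
  <a href="#" title="2">2</a>
  <a href="#" title="30">last</a>
</div>
"""
pages = last_page(SAMPLE)
```

The fallback to 1 is a design choice of this sketch: if the pager was not rendered, only the first page is treated as available.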

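On the Scrapy side, this requires a custom downloader middleware class. I created a file named middleware.py in the project's spiders directory and later registered its class in settings.py. The contents of middleware.py are: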

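    import gtk
    import webkit
    import jswebkit
    from scrapy.http import FormRequest, HtmlResponse

    from lagou_python import settings


    class WebkitDownloader(object):
        """Downloader middleware that renders pages in WebKit before Scrapy parses them."""

        def process_request(self, request, spider):
            # Only render for spiders listed in settings.WEBKIT_DOWNLOADER,
            # and leave form submissions to the normal downloader.
            if spider.name in settings.WEBKIT_DOWNLOADER:
                if type(request) is not FormRequest:
                    webview = webkit.WebView()
                    # Leave the GTK main loop once the page has finished loading.
                    webview.connect('load-finished', lambda v, f: gtk.main_quit())
                    webview.load_uri(request.url)
                    gtk.main()
                    # Read the DOM after the page's JavaScript has run.
                    js = jswebkit.JSContext(webview.get_main_frame().get_global_context())
                    renderedBody = str(js.EvaluateScript('document.body.innerHTML'))
                    # Returning a response here makes Scrapy skip its own download.
                    return HtmlResponse(request.url, body=renderedBody)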
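
The control flow of this middleware relies on the `process_request` contract: returning a response short-circuits the download, while returning None lets the request continue to the normal downloader. The sketch below shows just that decision with the browser replaced by a plain callable, so it runs without GTK or Scrapy. The class name, the `render` callable and the simplified method signature are all hypothetical stand-ins, not part of the project.

```python
class RenderingMiddlewareSketch(object):
    """Stand-in for WebkitDownloader with the browser replaced by a callable."""

    def __init__(self, render, enabled_spiders):
        self.render = render                      # hypothetical: url -> rendered HTML
        self.enabled_spiders = set(enabled_spiders)

    def process_request(self, url, spider_name, is_form_request=False):
        # Not our spider, or a form submission: fall through to the normal downloader.
        if spider_name not in self.enabled_spiders or is_form_request:
            return None
        # Otherwise hand back the rendered page ourselves.
        return self.render(url)


def fake_render(url):
    """Pretend to render a page; a real implementation would drive a browser."""
    return "<body>rendered %s</body>" % url


mw = RenderingMiddlewareSketch(fake_render, ["lagou"])
rendered = mw.process_request("http://example.com/?pn=1", "lagou")
skipped = mw.process_request("http://example.com/?pn=1", "other")
form = mw.process_request("http://example.com/login", "lagou", is_form_request=True)
```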

The contents of settings.py are:

    # -*- coding: utf-8 -*-

    # Scrapy settings for lagou_python project
    #
    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    #
    #     http://doc.scrapy.org/en/latest/topics/settings.html
    #

    BOT_NAME = 'lagou_python'

    SPIDER_MODULES = ['lagou_python.spiders']
    NEWSPIDER_MODULE = 'lagou_python.spiders'

    WEBKIT_DOWNLOADER = ['lagou']

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'lagou_python (+http://www.yourdomain.com)'

    ITEM_PIPELINES = {
        'lagou_python.pipelines.LagouPythonPipeline': 300
    }

    COOKIES_ENABLED = False

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'TBBKAnalysis (+http://www.yourdomain.com)'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'lagou_python.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
        'lagou_python.spiders.middleware.WebkitDownloader': 543
    }

    LOG_LEVEL = 'DEBUG'

The custom downloader class has now been registered.

To save the scraped postings locally in a file named lagou_python.json, pipelines.py looks like this:
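
The numbers in DOWNLOADER_MIDDLEWARES decide the order in which Scrapy calls each middleware's `process_request` (lowest first), and a value of None switches a middleware off. The sketch below only illustrates that reading of the dictionary. It ignores the built-in middlewares that Scrapy merges in, and the helper name `request_order` is made up.

```python
# Same entries as DOWNLOADER_MIDDLEWARES in the settings.py shown in this post.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'lagou_python.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
    'lagou_python.spiders.middleware.WebkitDownloader': 543,
}


def request_order(middlewares):
    """Return the enabled middleware paths in the order process_request runs."""
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


order = request_order(DOWNLOADER_MIDDLEWARES)
```

With these values the User-Agent rotation (400) sees each request before the WebKit downloader (543), and the stock UserAgentMiddleware is disabled.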

    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import json
    import codecs


    class LagouPythonPipeline(object):
        def __init__(self):
            self.file = codecs.open("lagou_python.json", encoding="utf-8", mode="wb")

        def process_item(self, item, spider):
            # One JSON object per line. Decoding the \uXXXX escapes keeps the
            # Chinese text readable in the output file.
            line = json.dumps(dict(item)) + '\n'
            self.file.write(line.decode("unicode_escape"))
            return item
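
One caveat with decoding via "unicode_escape" is that it also un-escapes sequences such as `\"`, so a value containing a double quote would no longer produce valid JSON. A possible alternative, sketched here for Python 3 and not taken from the original project, is to ask `json.dumps` to keep non-ASCII characters as they are with `ensure_ascii=False`. The helper name `write_item` and the sample item values are invented for illustration.

```python
import io
import json


def write_item(fp, item):
    """Write one item as a single JSON line, keeping non-ASCII text unescaped."""
    fp.write(json.dumps(item, ensure_ascii=False) + "\n")


# An in-memory text buffer stands in for the output file.
buf = io.StringIO()
sample_item = {"company": ["某公司"], "note": ['say "hi"']}   # made-up values
write_item(buf, sample_item)
line = buf.getvalue()
```

Each line written this way can be read back with `json.loads`, including values that contain quotes.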

To avoid getting banned, I added rotate_useragent.py, which switches the User-Agent automatically. The source is:

    import random

    from scrapy import log
    from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


    class RotateUserAgentMiddleware(UserAgentMiddleware):

        def __init__(self, user_agent=''):
            self.user_agent = user_agent

        def process_request(self, request, spider):
            # Pick a random User-Agent for every outgoing request.
            ua = random.choice(self.user_agent_list)
            if ua:
                log.msg('Current UserAgent: ' + ua, level=log.INFO)
                request.headers.setdefault('User-Agent', ua)

        # The list below holds Chrome user agent strings for several platforms.
        # More user agent strings can be found at
        # http://www.useragentstring.com/pages/useragentstring.php
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
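
A detail worth noticing in this middleware is `setdefault`: the random User-Agent is only applied when the request does not already carry one. The Python 3 sketch below shows that behaviour with a plain dict standing in for the request headers. The function name `apply_user_agent` and the shortened `USER_AGENTS` list are illustrative assumptions.

```python
import random

# Two entries taken from the list above, joined onto single lines.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def apply_user_agent(headers):
    """Set a random User-Agent unless the headers already contain one."""
    headers.setdefault("User-Agent", random.choice(USER_AGENTS))
    return headers


fresh = apply_user_agent({})
preset = apply_user_agent({"User-Agent": "custom-agent"})
```

In the project this works as intended because settings.py disables the stock UserAgentMiddleware, so no other middleware in that configuration fills in the header first.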

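Now the most important part, the spider itself. I won't explain how each piece of content is located in the page. The URLs of the following listing pages are built from the last page number: xrange produces the numbers from page 2 up to the last page, and each one is appended to the base URL. The source is: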

    __author__ = 'sniper.geek'

    from scrapy.selector import Selector
    from scrapy.contrib.spiders import CrawlSpider,Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
    from lagou_python.items import LagouPythonItem
    from scrapy import log
    from scrapy.http import Request


    class LagouSpider(CrawlSpider):
        name = "lagou"
        download_delay = 2
        allowed_domains = ["lagou.com"]
        start_urls = [
            "http://www.lagou.com/jobs/list_Python?kd=Python&spc=1&pl=&gj=&xl=&yx=&gx=&st=&labelWords=&lc=&workAddress=&city=%E4%B8%8A%E6%B5%B7&requestId=&pn=1"
        ]

        # The rules are disabled, so overriding parse() is safe here. With active
        # rules a CrawlSpider must not override parse().
        #rules = [
        #    Rule(sle(allow=("l/jobs/list_Python?kd=Python&spc=1&pl=&gj=&xl=&yx=&gx=&st=&labelWords=&lc=&workAddress=&city=%E4%B8%8A%E6%B5%B7&requestId=&pn=d{1}")),follow=True,callback='parse_item')
        #    Rule(sle(),follow=True,callback='parse_item')
        #
        #]

        def parse(self, response):
            log.msg("Fetch page: %s" % response.url)
            sel = Selector(response)
            # One <li> per job posting.
            sites = sel.xpath('//div[@class="content"]/ul[@class="hot_pos reset"]/li')
            # The title of the last pager link holds the number of the last page.
            total_page_value = sel.xpath('//div[@class="Pagination myself"]/a[@href="#"][last()]/@title').extract()
            log.msg("total_page_value: %s" % total_page_value)
            # Fall back to a single page if the pager was not found.
            total_page = int(total_page_value[0]) if total_page_value else 1
            log.msg("page_number: %s" % total_page)

            for site in sites:
                item = LagouPythonItem()
                item['salary'] = site.xpath('div[@class="hot_pos_l"]/span[1]/text()').extract()
                item['experience'] = site.xpath('div[@class="hot_pos_l"]/span[2]/text()').extract()
                item['education'] = site.xpath('div[@class="hot_pos_l"]/span[3]/text()').extract()
                item['occupation_temptation'] = site.xpath('div[@class="hot_pos_l"]/span[4]/text()').extract()
                # Three spans: no founder field. Otherwise a founder span is
                # assumed in second position.
                if len(site.xpath('div[@class="hot_pos_r"]/span')) == 3:
                    item['job_fields'] = site.xpath('div[@class="hot_pos_r"]/span[1]/text()').extract()
                    item['stage'] = site.xpath('div[@class="hot_pos_r"]/span[2]/text()').extract()
                    item['scale'] = site.xpath('div[@class="hot_pos_r"]/span[3]/text()').extract()
                    item['founder'] = []
                else:
                    item['job_fields'] = site.xpath('div[@class="hot_pos_r"]/span[1]/text()').extract()
                    item['founder'] = site.xpath('div[@class="hot_pos_r"]/span[2]/text()').extract()
                    item['stage'] = site.xpath('div[@class="hot_pos_r"]/span[3]/text()').extract()
                    item['scale'] = site.xpath('div[@class="hot_pos_r"]/span[4]/text()').extract()
                item['company'] = site.xpath('div[@class="hot_pos_r"]/div[@class="mb10"]/a/text()').extract()
                item['url'] = site.xpath('div[@class="hot_pos_r"]/div[@class="mb10"]/a/@href').extract()
                yield item

            # Queue pages 2..total_page. Every page queues the same URLs again,
            # and Scrapy's duplicate filter drops the repeats.
            base_url = "http://www.lagou.com/jobs/list_Python?kd=Python&spc=1&pl=&gj=&xl=&yx=&gx=&st=&labelWords=&lc=&workAddress=&city=%E4%B8%8A%E6%B5%B7&requestId=&pn="
            for k in xrange(2, total_page + 1):
                yield Request(base_url + str(k), callback=self.parse)

Run the crawl and write the log to a local file named log:

    scrapy crawl lagou --logfile=log

Part of the scraped output:

[Screenshot: excerpt of the scraped results]

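I have uploaded the source files of the whole project to Baidu Cloud for anyone who is interested. I am still not fluent with Scrapy, and my Python is only average, so there is a long way to go. Onward.

Link: http://pan.baidu.com/s/1gdjbKUR  Password: hnrk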

     

  • Original post: https://www.cnblogs.com/hiccup/p/4543726.html