• 2: Crawling multiple pages whose URLs follow a regular pattern


    Example site: http://www.luoo.net/music/<issue number>

    e.g. http://www.luoo.net/music/760

    Fields to scrape: title (e.g. "Hello World"), pic (the cover image), and desc (e.g. "本期音乐为......《8-bit Love》。").

    Steps:

    1) Create the project

      In a shell, in the directory of your choice: scrapy startproject luoo

      Open the luoo folder in PyCharm
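
      startproject generates the standard Scrapy skeleton; roughly the following (the exact file list varies a little across Scrapy versions):

      luoo/
          scrapy.cfg          # deploy/config file
          luoo/               # the project's Python package
              __init__.py
              items.py        # item definitions (step 2)
              pipelines.py    # item pipelines (untouched here)
              settings.py     # project settings
              spiders/        # spiders go here (step 3)
                  __init__.py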

    2) Write items.py

      import scrapy

      class LuooItem(scrapy.Item):
          # one field per piece of data we want to collect
          url = scrapy.Field()
          title = scrapy.Field()
          pic = scrapy.Field()
          desc = scrapy.Field()


    3) Write the spider
      Create luoospider.py under the spiders folder

      
      import scrapy
      from luoo.items import LuooItem

      class LuooSpider(scrapy.Spider):
          name = "luoo"
          allowed_domains = ["luoo.net"]
          # build the start URLs for issues 750-762 (range's end is exclusive)
          start_urls = []
          for i in range(750, 763):
              url = 'http://www.luoo.net/music/%s' % str(i)
              start_urls.append(url)

          def parse(self, response):
              # called once per downloaded issue page
              item = LuooItem()
              item['url'] = response.url
              item['title'] = response.xpath('//span[@class="vol-title"]/text()').extract()
              item['pic'] = response.xpath('//img[@class="vol-cover"]/@src').extract()
              item['desc'] = response.xpath('//div[@class="vol-desc"]/text()').extract()
              return item
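
      Each issue page is parsed into one item. Since extract() returns the list of all XPath matches, every field value comes back wrapped in a list; a sketch of what one scraped row might look like for issue 760 (the pic URL here is purely illustrative, not the real one):

      {'url': 'http://www.luoo.net/music/760',
       'title': ['Hello World'],
       'pic': ['http://img.luoo.net/pics/vol/760.jpg'],   # illustrative URL
       'desc': ['本期音乐为......《8-bit Love》。']}
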
    4) Leave pipelines.py unchanged
    5) In a command window, cd into the luoo directory
      scrapy list (lists the available spiders: luoo)
      scrapy crawl luoo -o result.csv (runs the spider and saves result.csv to the current directory)
    6) Open result.csv in Notepad++, change the encoding to ANSI and save; Excel will then open it without garbled characters
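
      Alternatively, a settings.py sketch, assuming a Scrapy version new enough to support the FEED_EXPORT_ENCODING setting: exporting the CSV with a UTF-8 BOM lets Excel detect the encoding by itself, skipping the Notepad++ detour.

      # settings.py
      FEED_EXPORT_ENCODING = 'utf-8-sig'   # the BOM makes Excel read the CSV as UTF-8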

    * Remaining to-dos:
    1) Consider migrating the data into a MySQL database later (see the first sketch below)
    2) Save the cover images separately into their own image folder (see the second sketch below)
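
    For to-do 1), a minimal pipeline sketch. It assumes the pymysql package and a local MySQL database "luoo" with a matching "vol" table; none of these exist in the project yet, so treat every connection parameter below as a placeholder:

      import pymysql

      class MysqlPipeline(object):
          def open_spider(self, spider):
              # placeholder credentials; adjust to your own MySQL setup
              self.conn = pymysql.connect(host='localhost', user='root', password='',
                                          db='luoo', charset='utf8mb4')
              self.cursor = self.conn.cursor()

          def process_item(self, item, spider):
              # extract() gave us lists, so join them into plain strings first
              self.cursor.execute(
                  'INSERT INTO vol (url, title, pic, `desc`) VALUES (%s, %s, %s, %s)',
                  (item['url'], ''.join(item['title']),
                   ''.join(item['pic']), ''.join(item['desc'])))
              self.conn.commit()
              return item

          def close_spider(self, spider):
              self.conn.close()

    It would be switched on in settings.py with ITEM_PIPELINES = {'luoo.pipelines.MysqlPipeline': 300}.

    For to-do 2), Scrapy 1.0+ ships a built-in ImagesPipeline that downloads image URLs to disk (it needs Pillow installed). A settings.py sketch; the store path is an assumption, and IMAGES_URLS_FIELD points the pipeline at the existing pic field:

      # settings.py
      ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
      IMAGES_STORE = 'D:/luoo_pics'    # assumed local folder for the downloaded covers
      IMAGES_URLS_FIELD = 'pic'        # reuse the item's pic field as the image URL source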


    Memory: for the record, here is the code that implemented the same thing with the urllib library two months ago (Python 3.4).
         Looking at it now, Scrapy really is far more convenient, to say nothing of its seriously powerful extensibility:
     import urllib.request
     import re
     import time

     def openurl(urls):
         # fetch each URL with a browser User-Agent, pausing between requests
         htmls = []
         for url in urls:
             req = urllib.request.Request(url)
             req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36')
             # alternative UA: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0
             response = urllib.request.urlopen(req)   # pass the Request so the header is actually sent
             htmls.append(response.read())
             time.sleep(5)   # be polite: wait 5s between requests
         return htmls

     def jiexi(htmls):
         # "jiexi" = parse: pull the cover pic, title and description out of each page with regexes
         pics = []
         titles = []
         contents = []
         for html in htmls:
             html = html.decode('utf-8')
             pics.append(re.findall('<div class="player-wrapper".*?>.*?<img.*?src="(.*?).jp.*?".*?alt=".*"', html, re.S))
             titles.append(re.findall('class="vol-title">(.*?)</span>', html, re.S))
             contents.append(re.findall('<div.*?class="vol-desc">.*?(.*?)</div>', html, re.S))

         i = len(titles)
         with open(r'C:\Users\Administrator\Desktop\test.txt', 'w') as f:   # raw string: a bare '\U' breaks the literal
             for x in range(i):
                 print("正在下载期刊:%d" % (657 + x))   # "downloading issue ..."; matches the range crawled below
                 f.write("期刊名:" + str(titles[x])[2:-2] + "\n")       # issue title
                 f.write("图片链接:" + str(pics[x])[2:-2] + ".jpg\n")   # cover image link
                 content = str(contents[x])[4:-2]
                 content = content.strip()                  # str methods return new strings: reassign
                 print(content.count("<br>\n"))
                 content = content.replace("<br>\n", "#")
                 f.write("配诗:" + content + "\n\n\n")      # the issue's poem/description

     yur = 'http://www.luoo.net/music/'
     urls = []
     for i in range(657, 659):
         urls.append(yur + str(i))

     htmls = openurl(urls)
     jiexi(htmls)   # jiexi writes test.txt itself and returns nothing