When using XPath in Scrapy, XPath syntax alone will not always get you the data you want.
Take the following HTML source as an example:
<div class="db_contout">
  <div class="db_cont">
    <div class="details_nav">
      <a href="http://movie.mtime.com/79055/addimage.html" class="db_addpic" target="_blank"> <strong class="px16">+</strong> 添加图片</a>
      <ul id="imageNavUl">
        <li><i> </i><a href="http://movie.mtime.com/79055/posters_and_images/">全部图片</a></li>
        <li><i> </i><a href="#">剧照</a></li>
        <li><i> </i><a href="#">海报</a></li>
        <li><i> </i><a href="#">工作照</a></li>
        <li><i> </i><a href="#">新闻图片</a></li>
        <li><i> </i><a href="#">桌面</a></li>
        <li><i> </i><a href="#">封套</a></li>
      </ul>
    </div>
    <div class="db_pictypeout">
      <div class="pictypenav clearfix">
        <ul id="imageSubNavUl" class="fl mt3"> </ul>
        <div id="filters" class="db_selbox fr"> </div>
      </div>
      <dl id="imagesDiv" class="db_pictypelist clearfix"> </dl>
      <div id="pageDiv"> </div>
    </div>
  </div>
</div>
<div id="M13_B_DB_Movie_FooterTopTG"></div>
<script type="text/javascript">
var imageList = [{"stagepicture":[{"officialstageimage":[
{"id":1059362,"title":"官方剧照 #16","type":6,"subType":6001,"status":1,"img_220":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_220X220.jpg","img_1000":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_1000X1000.jpg","width":3233,"height":2000,"fileSize":5472,"enterTime":"2009-07-09","enterNickName":"jackali","description":"","commentCount":0,"imgDetailUrl":"http://movie.mtime.com/79055/posters_and_images/1059362/","topNum":4,"newIndex":37,"typeHotIndex":0,"typeNewIndex":37,"img_235":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_235X235.jpg"},
{"id":829271,"title":"官方剧照 #06","type":6,"subType":6001,"status":1,"img_220":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_220X220.jpg","img_1000":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_1000X1000.jpg","width":842,"height":477,"fileSize":74,"enterTime":"2008-12-17","enterNickName":"边界","description":"","commentCount":0,"imgDetailUrl":"http://movie.mtime.com/79055/posters_and_images/829271/","topNum":0,"newIndex":51,"typeHotIndex":1,"typeNewIndex":51,"img_235":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_235X235.jpg"},
{"id":625583,"title":"官方剧照
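The goal is to get the picture source path that follows each img_1000 key. I could not find a way to reach it directly with XPath syntax: the URLs sit inside the JavaScript text of the <script> element, and XPath can select that element but cannot address fields inside its text. As a workaround, following http://www.cnblogs.com/Garvey/p/6697162.html, use re to extract the needed content.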
import random

import scrapy

# Assumption: the item class is defined in the project's items.py
# (standard Scrapy project layout); adjust the import path to your project.
from ..items import S0819MtimeTiantangItem


class MtimeSpider(scrapy.Spider):
    name = "mtime"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["mtime.com"]
    start_urls = (
        'http://movie.mtime.com/79055/posters_and_images/posters/hot.html',
    )

    def parse(self, response):
        # Select the <script> elements with XPath, then pull every
        # img_1000 URL out of their text with a regular expression
        allpics = response.xpath("//script[@type='text/javascript']").re('"img_1000":"(.+?jpg)"')
        print(len(allpics))
        nameList = []
        for i, pic in enumerate(allpics, start=1):
            item = S0819MtimeTiantangItem()
            # Draw a random number that has not been seen yet; it is only
            # recorded in nameList, the item name itself is the running index i
            while True:
                itemName = str(random.randint(0, 1000) * 3)
                if itemName not in nameList:
                    name = str(i)
                    nameList.append(itemName)
                    print("-----")
                    break
            addr = pic
            item['name'] = name
            item['addr'] = addr
            print("+++++" + addr)
            print("+++++" + name)
            yield item
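The regular expression used in the spider above can be tried outside Scrapy with nothing but the standard library. The following is a minimal sketch, not the original author's code: `script_text` is an abridged copy of the script content from the page source, closed off by hand so it stands alone, and the names `IMG_1000_PATTERN` and `extract_img_1000` are introduced here purely for illustration.

```python
import re

# Abridged sample of the <script> text from the page source above.
# Assumption: the closing brackets were added by hand for this example.
script_text = (
    'var imageList = [{"stagepicture":[{"officialstageimage":['
    '{"id":1059362,'
    '"img_220":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_220X220.jpg",'
    '"img_1000":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_1000X1000.jpg"},'
    '{"id":829271,'
    '"img_220":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_220X220.jpg",'
    '"img_1000":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_1000X1000.jpg"}'
    ']}]}];'
)

# Same pattern as in the spider: the lazy (.+?) stops at the first
# 'jpg' that is immediately followed by a closing double quote.
IMG_1000_PATTERN = re.compile(r'"img_1000":"(.+?jpg)"')


def extract_img_1000(text):
    """Return every img_1000 URL found in the given script text."""
    return IMG_1000_PATTERN.findall(text)


urls = extract_img_1000(script_text)
for url in urls:
    print(url)
```

Because the key name is part of the pattern, the img_220 entries are skipped and only the 1000X1000 URLs are returned. The lazy quantifier matters here: a greedy `.+` would run on to the last `jpg"` in the text and return one long, wrong match.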