I. Summary
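The page loads asynchronously: the URL of each image only appears as the page is scrolled, so Selenium is the recommended tool. Selenium also has to execute JS code to produce the scrolling effect, namely the window.scrollTo() method.
When using the Scrapy framework, not every request needs to go through Selenium. Only the first page of a given comic chapter needs Selenium to fetch the data and return a response. Implement this requirement in a downloader middleware and guard it with a condition check.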
For details, see: https://jiayi.space/post/scrapy-phantomjs-seleniumdong-tai-pa-chong
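The conditional downloader middleware described above could be sketched roughly as follows. This is a minimal sketch and not the code from the linked post: the class name SeleniumScrollMiddleware, the helper needs_browser, and the URL pattern CHAPTER_URL are hypothetical names chosen for illustration, and the pattern assumes that chapter pages look like the ac.qq.com URL used in this post. Requests that do not match are passed through to Scrapy's normal downloader.

```python
import re
import time

# Assumption: chapter reader pages follow the same shape as the URL in this post.
CHAPTER_URL = re.compile(r"^https?://ac\.qq\.com/ComicView/index/id/\d+/cid/\d+")


def needs_browser(url):
    """Return True only for chapter pages whose images are lazy-loaded."""
    return CHAPTER_URL.match(url) is not None


class SeleniumScrollMiddleware:
    """Hypothetical downloader middleware: render chapter pages in a browser."""

    def __init__(self, steps=24, step_px=1000, pause=1.0):
        self.steps = steps      # number of scroll steps
        self.step_px = step_px  # pixels scrolled per step
        self.pause = pause      # seconds to wait after each step
        self.driver = None      # created lazily, on the first chapter page

    def process_request(self, request, spider):
        if not needs_browser(request.url):
            return None  # let Scrapy's normal downloader handle this request

        # Imported here so the module still loads where no browser is set up.
        from scrapy.http import HtmlResponse
        from selenium import webdriver

        if self.driver is None:
            self.driver = webdriver.Chrome()
        self.driver.get(request.url)
        # Scroll step by step so the lazy-loaded image URLs get filled in.
        for i in range(1, self.steps + 1):
            self.driver.execute_script(
                "window.scrollTo(0, {})".format(i * self.step_px)
            )
            time.sleep(self.pause)

        # Returning a Response here makes Scrapy skip its own download.
        return HtmlResponse(
            url=request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in the project's settings.py, using whatever module path the project actually has. The sketch leaves out closing the browser when the spider finishes.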
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

START_URL = 'http://ac.qq.com/ComicView/index/id/505430/cid/920'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}
# JS snippet that scrolls the window to a given vertical offset
js = 'window.scrollTo(0,{})'

driver = webdriver.Chrome()
driver.get(START_URL)
# Scroll down in 1000px steps so the lazy-loaded image URLs get filled in
for i in range(1, 25):
    driver.execute_script(js.format(i * 1000))
    time.sleep(1)
content = driver.page_source
driver.quit()

soup = BeautifulSoup(content, 'lxml')
lis = soup.select('ul#comicContain > li')
i = 1
for li in lis:
    # Some list items may hold no image, or an image without a src
    img = li.select_one('img')
    url = img.get('src') if img else None
    if url and url.startswith('http://ac.tc.qq.com/store_file_download?'):
        # Send the browser-like User-Agent defined above with each download
        r = requests.get(url, headers=headers)
        con = r.content
        with open('page{}.jpg'.format(i), 'wb') as f:
            f.write(con)
        i += 1