https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-21&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null
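The URL above opens normally when pasted into Chrome or other browsers such as QQ Browser. Only in a Chrome instance launched by selenium does the page change: the flight information for 2021-10-21 is not displayed.
My first suspicion was that selenium did not support my current Chrome version. (Update, 2021-10-17 10:47:22: I later downgraded Chrome to the selenium-supported 92.x.x.107 version and downgraded chromedriver to match. It still did not work.)
The symptom: Chrome started directly can search Qunar and shows next-day flights from Shanghai to Beijing, whereas Chrome started by selenium from a Python program simply will not load the page. Copying the address out into any other browser, including a Chrome not controlled by selenium, opens the page fine.
The exact cause is still to be investigated.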
Why use selenium
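First, the flight page is rendered dynamically: the data arrives through XHR and is inserted by JavaScript, so the page source fetched directly contains no flight information. With selenium, what you see is what you get.
Some will say: then simulate the XHR request.
I simulated the XHR request in Postman.
The request did succeed, but the response contained no data.
The same request made from the browser returns flight information.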
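Update:
Later, after filling in the complete headers, I found that the data is returned. Below are the completed headers.
Below are the GET parameters, that is, the ones appended directly to the URL.
The returned data is:
It also includes the data for the later pages, i.e. what the site shows only after clicking page 2 or "next page".
Also, once a day has passed, you have to reopen the browser and update the pre parameter in the headers. This pre appears to be some kind of verification mechanism: if it is not updated, the earlier request can no longer fetch flight data a day later.
So can selenium, being what-you-see-is-what-you-get, obtain the data that the JS dynamic rendering only shows on the next page?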
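To make the header-completion step concrete, here is a minimal sketch of replaying such an XHR from Python using only the standard library. The endpoint, the parameter names, and every header value are placeholders or assumptions, not the real Qunar API: the actual URL, query parameters, cookie, and pre value have to be copied from the browser's Network panel, and the pre value has to be copied again once requests stop returning data.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: replace with the XHR URL seen in the DevTools Network tab.
XHR_ENDPOINT = "https://example.com/flight/search"


def build_xhr_request(endpoint, params, headers):
    """Build a GET request whose query string and headers mirror the browser's XHR."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(f"{endpoint}?{query}", headers=headers, method="GET")


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body (performs real network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Illustrative parameter names borrowed from the page URL; the real XHR may differ.
params = {"fromCode": "SHA", "toCode": "BJS", "searchDepartureTime": "2021-10-18"}

# Values in angle brackets must be copied from a live browser session.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://flight.qunar.com/",
    "Cookie": "<cookie copied from the browser>",
    "pre": "<pre value copied from the browser; refresh it when data stops coming back>",
}

request = build_xhr_request(XHR_ENDPOINT, params, headers)
print(request.full_url)
# data = fetch_json(request)  # uncomment once the endpoint and headers are real
```

Keeping the request construction separate from the network call makes it easy to compare the generated URL and headers against what the browser sends before any traffic goes out.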
Not only Chrome: Firefox hits the same problem
import time
import json

import requests
from selenium.webdriver import Firefox
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.firefox.options import Options

# Note: this is the Selenium 3 style API (firefox_binary, capabilities,
# executable_path, find_element(s)_by_class_name); later Selenium 4 releases
# removed some of these.
options = Options()
# Raw strings keep the backslashes in Windows paths from being read as escapes
# (for example "\f" in "\firefox.exe" would otherwise become a form feed).
# options.binary_location = r"C:\Users\xiaojie\AppData\Local\Google\Chrome\Application\chrome.exe"
options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"
binary = FirefoxBinary(r"C:\Program Files\Mozilla Firefox\firefox.exe")
caps = DesiredCapabilities.FIREFOX.copy()
caps['marionette'] = True
# options.add_experimental_option('excludeSwitches', ['enable-automation'])
# options.add_argument('--incognito')
# options.add_argument('disable-infobars')
# options.add_argument('log-level=3')
driver = Firefox(firefox_binary=binary, capabilities=caps, executable_path="geckodriver.exe")

url = "https://www.qunar.com/"
# url = "https://diannao.jd.com/"
driver.get(url)
time.sleep(1)
url = "https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-22&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null"
# Search for flights
# Two approaches
driver.get(url)
# Save the rendered page source for inspection
with open('items.jl', 'w', encoding='UTF-8') as file:
    file.write(driver.page_source)

url = "https://diannao.jd.com/"
# Set the request headers
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400",
    "Referer": "http://m.611.com/Match/Index",
    "Connection": "keep-alive",
    "Cookie": "__jdu=1961246440; shshshfpa=9d816ce7-9076-04a7-ea7d-8018bc6eadf2-1623898299; shshshfpb=sdus9TfEminz9dWA7KKGYgw%3D%3D; pinId=pjKhXUE59i7LjgxtUlkd_A; pin=zyj183247166; unick=jdzyj183; _tp=LdLet8T0koyg96E1dqQafA%3D%3D; _pst=zyj183247166; areaId=2; TrackID=1CmX15GEs1MOTf99XeZl5eebqfftNFiLEiMsS8vvKBTMwCAfRlKqXu7YCjQn__C2-mqlg-FJxPlEwiA79snSf04SU1xTtsZOoj5aQk_Cb5mu1XN52nGptNsMI-kJjYqCV; user-key=03e76034-3bcd-4674-a834-cf34a4b960a6; ipLocation=%u4e0a%u6d77; cn=76; ipLoc-djd=2-2824-61056-0.3405425761; unpl=V2_ZzNtbUVQQhV1DhJXLxhfV2IFFV0RAkoSJltEAHsRWwc1AUdbclRCFnUURlVnGVsUZgsZXkJcQxFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsRWwVkBhVYR1ZzJXI4dmR%2fHV4BbwciXHJWc1chVEBTeRBdByoDGlpCVEYScA1HZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_660117e2e02c4761bd86bb3e1963c3d7|1634347276899; PCSYCityID=CN_310000_310100_310101; shshshfp=26c4ea20d915727cb0bd98f196346b4d; __jdc=122270672; wlfstk_smdl=7bqz560g1rixn5kg0i84v5y6xvvrqkhx; __jda=122270672.1961246440.1623769658.1634347277.1634382193.23; __jdb=122270672.1.1961246440|23.1634382193; o2-webp=true; 3AB9D23F7A4B3C9B=TJ6FIT6PTK3N32QTINQHHUBRA4J4MDPCZWEHCIIXVS6J5H3LSD75C3RMTC2RIBLHQDLJOCOMXWMJ2LSD6IEMJMV66M"
}
# response = requests.get("http://m.611.com/Match/Index", headers=header)
response = requests.get(url, headers=header)
if response.status_code == 200:
    print(response.text)
    with open('items2.jl', 'w', encoding='UTF-8') as file:
        file.write(response.text)
    # data = json.loads(response.text)
    # token = data["Data"]
time.sleep(4)
# Product info
item__info = driver.find_elements_by_class_name('goods-item')
for item in item__info:
    name = item.find_element_by_class_name('goods-item__info').text
    price = item.find_element_by_class_name('goods-item__price').text
    print(name)
    print(price)
    print("-----------")
driver.delete_all_cookies()
# driver.quit()
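Launching Firefox through selenium to open the flight information hangs on the same page.
The likely cause is that Qunar's flight site has anti-scraping measures aimed at the webdriver automation tool, which blocks the scrape. The same link loads in Firefox or Chrome started directly; only after launching through selenium does the page fail to load.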
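In the end, experiments showed the cause was anti-scraping aimed at selenium
The mitmproxy proxy approach described online is fairly complicated.
In the end I added two arguments that hide selenium's fingerprint, after which scraping with selenium worked normally.
The complete code is as follows: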
import time
import json

import requests
# from selenium.webdriver import Firefox
# from selenium.webdriver.firefox.options import Options
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
# from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
# from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Note: executable_path and find_element(s)_by_class_name are the Selenium 3
# style API; later Selenium 4 releases removed them.
options = Options()
# Raw strings are required here: "\U" in a normal string is a syntax error in Python 3.
options.binary_location = r"C:\Users\xiaojie\AppData\Local\Google\Chrome\Application\chrome.exe"
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_argument('--incognito')
options.add_argument('disable-infobars')
options.add_argument('log-level=3')
# The two added arguments that hide selenium's fingerprint
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
# options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"
# binary = FirefoxBinary(r"C:\Program Files\Mozilla Firefox\firefox.exe")
# caps = DesiredCapabilities.FIREFOX.copy()
# caps['marionette'] = True
# driver = Firefox(firefox_binary=binary, capabilities=caps, executable_path="geckodriver.exe")
driver = Chrome(options=options, executable_path=r"D:\webdriver\chromedriver_win32\chromedriver.exe")
# script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
# driver.execute_script(script)

url = "https://www.qunar.com/"
# url = "https://diannao.jd.com/"
driver.get(url)
time.sleep(1)
url = "https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-22&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null"
# Search for flights
# Two approaches
# script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
# driver.execute_script(script)
driver.get(url)
# script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
# driver.execute_script(script)
# Save the rendered page source for inspection
with open('items.jl', 'w', encoding='UTF-8') as file:
    file.write(driver.page_source)

url = "https://diannao.jd.com/"
# Set the request headers
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400",
    "Referer": "http://m.611.com/Match/Index",
    "Connection": "keep-alive",
    "Cookie": "__jdu=1961246440; shshshfpa=9d816ce7-9076-04a7-ea7d-8018bc6eadf2-1623898299; shshshfpb=sdus9TfEminz9dWA7KKGYgw%3D%3D; pinId=pjKhXUE59i7LjgxtUlkd_A; pin=zyj183247166; unick=jdzyj183; _tp=LdLet8T0koyg96E1dqQafA%3D%3D; _pst=zyj183247166; areaId=2; TrackID=1CmX15GEs1MOTf99XeZl5eebqfftNFiLEiMsS8vvKBTMwCAfRlKqXu7YCjQn__C2-mqlg-FJxPlEwiA79snSf04SU1xTtsZOoj5aQk_Cb5mu1XN52nGptNsMI-kJjYqCV; user-key=03e76034-3bcd-4674-a834-cf34a4b960a6; ipLocation=%u4e0a%u6d77; cn=76; ipLoc-djd=2-2824-61056-0.3405425761; unpl=V2_ZzNtbUVQQhV1DhJXLxhfV2IFFV0RAkoSJltEAHsRWwc1AUdbclRCFnUURlVnGVsUZgsZXkJcQxFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsRWwVkBhVYR1ZzJXI4dmR%2fHV4BbwciXHJWc1chVEBTeRBdByoDGlpCVEYScA1HZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_660117e2e02c4761bd86bb3e1963c3d7|1634347276899; PCSYCityID=CN_310000_310100_310101; shshshfp=26c4ea20d915727cb0bd98f196346b4d; __jdc=122270672; wlfstk_smdl=7bqz560g1rixn5kg0i84v5y6xvvrqkhx; __jda=122270672.1961246440.1623769658.1634347277.1634382193.23; __jdb=122270672.1.1961246440|23.1634382193; o2-webp=true; 3AB9D23F7A4B3C9B=TJ6FIT6PTK3N32QTINQHHUBRA4J4MDPCZWEHCIIXVS6J5H3LSD75C3RMTC2RIBLHQDLJOCOMXWMJ2LSD6IEMJMV66M"
}
# response = requests.get("http://m.611.com/Match/Index", headers=header)
response = requests.get(url, headers=header)
if response.status_code == 200:
    print(response.text)
    with open('items2.jl', 'w', encoding='UTF-8') as file:
        file.write(response.text)
    # data = json.loads(response.text)
    # token = data["Data"]
time.sleep(4)
# Product info
item__info = driver.find_elements_by_class_name('goods-item')
for item in item__info:
    name = item.find_element_by_class_name('goods-item__info').text
    price = item.find_element_by_class_name('goods-item__price').text
    print(name)
    print(price)
    print("-----------")
driver.delete_all_cookies()
# driver.quit()
Two statements were added:
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
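One way to check whether the switch took effect is to read navigator.webdriver from the loaded page. The sketch below is an assumption-laden illustration, not the author's code: the helper names are hypothetical, and it assumes the site's detection relies on that property, which the post does not confirm. In a selenium-launched Chrome without the switch the property is normally true; with the switch it should come back falsy.

```python
# The switch the post adds to hide the automation marker.
STEALTH_ARGUMENT = "--disable-blink-features=AutomationControlled"


def build_chrome_options():
    """Return Chrome options carrying the stealth switch (requires selenium)."""
    # Imported lazily so the rest of this module can be used without selenium.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument(STEALTH_ARGUMENT)
    return options


def is_flagged_as_automated(driver):
    """Read navigator.webdriver from the page; a truthy value exposes automation."""
    return bool(driver.execute_script("return navigator.webdriver"))


# Usage, assuming chromedriver is installed and discoverable:
# from selenium.webdriver import Chrome
# driver = Chrome(options=build_chrome_options())
# driver.get("https://www.qunar.com/")
# print(is_flagged_as_automated(driver))
# driver.quit()
```

Running the check once with and once without the switch gives a quick before/after comparison without having to guess from whether the flight list renders.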
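I also found that selenium can only capture the data shown on the current page; unlike the simulated XHR request described earlier, it cannot return all of the flight data.
As for the text obfuscation the site applies, that has to be solved by other means.
However, simulating the XHR request and taking the data returned directly from the server, before it is rendered into the page, is the best way to deal with the text obfuscation.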