• Chrome can open Qunar's flight page, but Chrome launched by Python Selenium cannot


    https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-21&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null

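    The URL above opens fine when pasted into Chrome or into other browsers such as QQ Browser. Only in a Chrome instance launched by Selenium does it turn into a different page that cannot display the flight information for 2021-10-21.

    My first guess was that Selenium did not support my current Chrome version. (Update, 2021-10-17 10:47:22: I later downgraded Chrome to version 92.x.x.107, which Selenium supports, and downgraded ChromeDriver as well. It still did not work.)

    The symptom: Chrome started directly can search Qunar and show next-day flights from Shanghai to Beijing, but Chrome started by Selenium from a Python program cannot open the page no matter what. Copying the address out into any browser, including a Chrome that is not controlled by Selenium, opens the page fine.

    The exact cause remains to be investigated.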

    Why use Selenium

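    First, the flight page is rendered dynamically: JavaScript fetches the data through XHR, so the raw page source contains no flight information. Only with Selenium is what you see what you get.

    Some will say: then just simulate the XHR request.

    I simulated the XHR request in Postman.

    The request did succeed, but the response contained no data.

    The same request made from the browser, however, returned flight information.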

     

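    Update:

    Later, after filling in the complete headers, the request did return data. The completed headers are below.

    Below are the GET parameters passed, i.e. the ones appended directly to the URL.

    The returned data is:

    It also includes the data for later pages, i.e. data that the page only shows after clicking page 2 or "next page".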
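    The idea of replaying the XHR call with the browser's headers and GET parameters can be sketched roughly as follows. This is only an illustration using the standard library: the endpoint URL is a placeholder, the parameter names are borrowed from the page URL above rather than from the real XHR call, and the header values must be copied from the browser's network panel. The helper name `build_xhr_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (assumption): copy the real XHR URL from the browser's network panel.
XHR_ENDPOINT = "https://example.com/flight/search"


def build_xhr_request(params, headers, endpoint=XHR_ENDPOINT):
    """Return a GET request carrying the query string and the browser's headers."""
    url = endpoint + "?" + urlencode(params)
    return Request(url, headers=headers, method="GET")


# Illustrative values only; the real XHR call may use different parameter names.
params = {
    "fromCode": "SHA",
    "toCode": "BJS",
    "searchDepartureTime": "2021-10-18",
}
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://flight.qunar.com/",
    "pre": "<value copied from the browser>",
}

request = build_xhr_request(params, headers)
print(request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```

    Building the request separately from sending it makes it easy to compare the URL and headers with what the browser sends before making any call.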

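    Also, after a day has passed, I had to reopen the browser and update the pre parameter in the headers. This pre value appears to be some kind of verification mechanism: if it is not updated, the earlier request can no longer fetch flight data a day later.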
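    Since the captured `pre` value seems to stop working after about a day, one option is to record when the headers were copied from the browser and refuse to reuse them once they are too old. The sketch below assumes a one-day lifetime based only on the observation above; the real expiry rule is unknown, and `headers_are_stale` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Assumption from the observation above: a captured `pre` value lasts roughly one day.
MAX_HEADER_AGE = timedelta(days=1)


def headers_are_stale(captured_at, now=None, max_age=MAX_HEADER_AGE):
    """Return True when the captured headers are old enough that `pre` has probably expired."""
    now = now or datetime.now()
    return now - captured_at >= max_age


captured_at = datetime(2021, 10, 17, 10, 47)
print(headers_are_stale(captured_at, now=datetime(2021, 10, 17, 20, 0)))  # False
print(headers_are_stale(captured_at, now=datetime(2021, 10, 18, 11, 0)))  # True
```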

    So can Selenium, which only gets what you see, obtain the data that the JavaScript rendering shows only on the next page?

    It is not just Chrome; Firefox runs into the same problem

    import time
    from selenium.webdriver import Firefox
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    import requests,json
    options = Options()
    # options.binary_location = r"C:\Users\xiaojie\AppData\Local\Google\Chrome\Application\chrome.exe"
    # Raw strings keep the backslashes in Windows paths from being read as escapes such as \f
    options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"
    binary= FirefoxBinary(r"C:\Program Files\Mozilla Firefox\firefox.exe")
    caps = DesiredCapabilities.FIREFOX.copy()
    caps['marionette'] = True
    # options.add_experimental_option('excludeSwitches', ['enable-automation'])
    # options.add_argument('--incognito')
    # options.add_argument('disable-infobars')
    # options.add_argument('log-level=3')
    driver =Firefox(firefox_binary=binary,capabilities=caps, executable_path="geckodriver.exe")
    url="https://www.qunar.com/"
    # url = "https://diannao.jd.com/"
    
    driver.get(url)
    time.sleep(1)
    
    url="https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-22&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null"
    # Search for flights
    # Two approaches
    driver.get(url)
    with open('items.jl','w',encoding='UTF-8') as file:
        file.write(driver.page_source)
    
    
    url = "https://diannao.jd.com/"
    # Set the header fields
    header={
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400"
        ,"Referer":"http://m.611.com/Match/Index"
        ,"Connection":"keep-alive"
        ,"Cookie":"__jdu=1961246440; shshshfpa=9d816ce7-9076-04a7-ea7d-8018bc6eadf2-1623898299; shshshfpb=sdus9TfEminz9dWA7KKGYgw%3D%3D; pinId=pjKhXUE59i7LjgxtUlkd_A; pin=zyj183247166; unick=jdzyj183; _tp=LdLet8T0koyg96E1dqQafA%3D%3D; _pst=zyj183247166; areaId=2; TrackID=1CmX15GEs1MOTf99XeZl5eebqfftNFiLEiMsS8vvKBTMwCAfRlKqXu7YCjQn__C2-mqlg-FJxPlEwiA79snSf04SU1xTtsZOoj5aQk_Cb5mu1XN52nGptNsMI-kJjYqCV; user-key=03e76034-3bcd-4674-a834-cf34a4b960a6; ipLocation=%u4e0a%u6d77; cn=76; ipLoc-djd=2-2824-61056-0.3405425761; unpl=V2_ZzNtbUVQQhV1DhJXLxhfV2IFFV0RAkoSJltEAHsRWwc1AUdbclRCFnUURlVnGVsUZgsZXkJcQxFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsRWwVkBhVYR1ZzJXI4dmR%2fHV4BbwciXHJWc1chVEBTeRBdByoDGlpCVEYScA1HZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_660117e2e02c4761bd86bb3e1963c3d7|1634347276899; PCSYCityID=CN_310000_310100_310101; shshshfp=26c4ea20d915727cb0bd98f196346b4d; __jdc=122270672; wlfstk_smdl=7bqz560g1rixn5kg0i84v5y6xvvrqkhx; __jda=122270672.1961246440.1623769658.1634347277.1634382193.23; __jdb=122270672.1.1961246440|23.1634382193; o2-webp=true; 3AB9D23F7A4B3C9B=TJ6FIT6PTK3N32QTINQHHUBRA4J4MDPCZWEHCIIXVS6J5H3LSD75C3RMTC2RIBLHQDLJOCOMXWMJ2LSD6IEMJMV66M"
    
        }
    #response = requests.get("http://m.611.com/Match/Index",headers=header)
    response = requests.get(url,headers=header)
    if response.status_code == 200:
        print(response.text)
        with open('items2.jl','w',encoding='UTF-8') as file:
            file.write(response.text)
        # data = json.loads(response.text)
        # token = data["Data"]
    time.sleep(4)
    # Product information
    item__info=driver.find_elements_by_class_name('goods-item')
    for item in item__info:
        name=item.find_element_by_class_name('goods-item__info').text
        price=item.find_element_by_class_name('goods-item__price').text
        print(name)
        print(price)
        print("-----------")
    driver.delete_all_cookies()
    # driver.quit()

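    Launching Firefox through Selenium to access the flight information freezes on the page in the same way.

    The likely cause is that Qunar's flight site has anti-crawling measures aimed at the WebDriver automation tool, which blocks the crawl. The same link works in Firefox or Chrome started directly; only when the browser is started through Selenium can the page not be accessed.

    Experiments finally confirmed that the cause is anti-crawling measures targeting Selenium.

    The mitmproxy proxy approach suggested online is fairly complicated.

    In the end, adding two arguments that hide Selenium's fingerprint made crawling with Selenium work normally.

    The complete code is as follows: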

    import time
    # from selenium.webdriver import Firefox
    # from selenium.webdriver.firefox.options import Options
    
    from selenium.webdriver import Chrome
    from selenium.webdriver.chrome.options import Options
    
    # from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
    # from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    import requests,json
    options = Options()
    # Raw string: a plain "C:\Users..." is a SyntaxError because \U starts a unicode escape
    options.binary_location = r"C:\Users\xiaojie\AppData\Local\Google\Chrome\Application\chrome.exe"
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_argument('--incognito')
    options.add_argument('disable-infobars')
    options.add_argument('log-level=3')
    options.add_argument("--disable-blink-features")
    options.add_argument("--disable-blink-features=AutomationControlled")
    
    # options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"
    # binary= FirefoxBinary(r"C:\Program Files\Mozilla Firefox\firefox.exe")
    # caps = DesiredCapabilities.FIREFOX.copy()
    # caps['marionette'] = True
    
    # driver =Firefox(firefox_binary=binary,capabilities=caps, executable_path="geckodriver.exe")
    driver = Chrome(options=options, executable_path=r"D:\webdriver\chromedriver_win32\chromedriver.exe")
    # script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
    # driver.execute_script(script)
    url="https://www.qunar.com/"
    # url = "https://diannao.jd.com/"
    
    driver.get(url)
    time.sleep(1)
    
    url="https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-22&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null"
    # Search for flights
    # Two approaches
    # script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
    # driver.execute_script(script)
    driver.get(url) 
    # script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
    # driver.execute_script(script)
    with open('items.jl','w',encoding='UTF-8') as file:
        file.write(driver.page_source)
    
    
    url = "https://diannao.jd.com/"
    # Set the header fields
    header={
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400"
        ,"Referer":"http://m.611.com/Match/Index"
        ,"Connection":"keep-alive"
        ,"Cookie":"__jdu=1961246440; shshshfpa=9d816ce7-9076-04a7-ea7d-8018bc6eadf2-1623898299; shshshfpb=sdus9TfEminz9dWA7KKGYgw%3D%3D; pinId=pjKhXUE59i7LjgxtUlkd_A; pin=zyj183247166; unick=jdzyj183; _tp=LdLet8T0koyg96E1dqQafA%3D%3D; _pst=zyj183247166; areaId=2; TrackID=1CmX15GEs1MOTf99XeZl5eebqfftNFiLEiMsS8vvKBTMwCAfRlKqXu7YCjQn__C2-mqlg-FJxPlEwiA79snSf04SU1xTtsZOoj5aQk_Cb5mu1XN52nGptNsMI-kJjYqCV; user-key=03e76034-3bcd-4674-a834-cf34a4b960a6; ipLocation=%u4e0a%u6d77; cn=76; ipLoc-djd=2-2824-61056-0.3405425761; unpl=V2_ZzNtbUVQQhV1DhJXLxhfV2IFFV0RAkoSJltEAHsRWwc1AUdbclRCFnUURlVnGVsUZgsZXkJcQxFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsRWwVkBhVYR1ZzJXI4dmR%2fHV4BbwciXHJWc1chVEBTeRBdByoDGlpCVEYScA1HZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_660117e2e02c4761bd86bb3e1963c3d7|1634347276899; PCSYCityID=CN_310000_310100_310101; shshshfp=26c4ea20d915727cb0bd98f196346b4d; __jdc=122270672; wlfstk_smdl=7bqz560g1rixn5kg0i84v5y6xvvrqkhx; __jda=122270672.1961246440.1623769658.1634347277.1634382193.23; __jdb=122270672.1.1961246440|23.1634382193; o2-webp=true; 3AB9D23F7A4B3C9B=TJ6FIT6PTK3N32QTINQHHUBRA4J4MDPCZWEHCIIXVS6J5H3LSD75C3RMTC2RIBLHQDLJOCOMXWMJ2LSD6IEMJMV66M"
    
        }
    #response = requests.get("http://m.611.com/Match/Index",headers=header)
    response = requests.get(url,headers=header)
    if response.status_code == 200:
        print(response.text)
        with open('items2.jl','w',encoding='UTF-8') as file:
            file.write(response.text)
        # data = json.loads(response.text)
        # token = data["Data"]
    time.sleep(4)
    # Product information
    item__info=driver.find_elements_by_class_name('goods-item')
    for item in item__info:
        name=item.find_element_by_class_name('goods-item__info').text
        price=item.find_element_by_class_name('goods-item__price').text
        print(name)
        print(price)
        print("-----------")
    driver.delete_all_cookies()
    # driver.quit()

    Two lines were added to it:

    options.add_argument("--disable-blink-features")
    options.add_argument("--disable-blink-features=AutomationControlled")
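    One way to check whether these arguments took effect is to read `navigator.webdriver` from the loaded page, the same property that the commented-out `Object.defineProperty` script in the code above tries to override. The sketch below is a minimal illustration: `automation_flag_visible` is a hypothetical helper, `FakeDriver` exists only so the helper can be exercised without a browser, and whether Qunar actually checks this property is an assumption.

```python
def automation_flag_visible(driver):
    """Return True if page scripts can still see navigator.webdriver as truthy."""
    return bool(driver.execute_script("return navigator.webdriver"))


class FakeDriver:
    """Stand-in used only to exercise the helper without launching a browser."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        # A real Selenium driver would run the script in the page and return its result.
        return self.flag


print(automation_flag_visible(FakeDriver(True)))   # True: the automation flag is exposed
print(automation_flag_visible(FakeDriver(None)))   # False: the flag is hidden
```

    With a real Selenium driver, the helper would be called after `driver.get(url)`, passing the `driver` object created in the code above.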

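    I also found that Selenium can only capture the data shown on the current page. Unlike the simulated XHR request described earlier, it cannot return all the flight data.

    As for the text obfuscation the site applies, that has to be solved by other means.

    However, simulating the XHR request and taking the data straight from the server response, before it is rendered into the page, is the best way to deal with text obfuscation.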

     


    You never know what the future holds, so make the most of the present. Technology changes the world; feel free to get in touch.
  • Original article: https://www.cnblogs.com/xiaojieshisilang/p/15415136.html