  • 9 Chrome can open Qunar's flight page, but Chrome launched by Python Selenium cannot 2



    https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-21&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null

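    Pasted into Chrome or another browser such as QQ Browser, the URL above opens fine. Only in a Chrome instance launched from Selenium does it turn into a page that cannot show the flight information for 2021-10-21.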

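    My first suspicion was that Selenium did not support my current Chrome version. (Update, 2021-10-17 10:47:22: I later downgraded Chrome to version 92.x.x.107, which Selenium supports, and downgraded chromedriver as well. It still did not work.)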
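chromedriver is built for a specific Chrome major version, so a mismatch is worth ruling out before looking elsewhere. A minimal sketch, assuming the session reports the W3C-style `browserVersion` capability and the `chrome.chromedriverVersion` capability (older non-W3C sessions may use other keys). The helper name `major_versions_match` and the sample version strings are made up for illustration:

```python
def major_versions_match(capabilities):
    """Return True when Chrome and chromedriver report the same major version.

    `capabilities` is the dict a Selenium Chrome session exposes as
    `driver.capabilities` (assumed keys: browserVersion and
    chrome.chromedriverVersion).
    """
    browser = capabilities["browserVersion"]
    driver = capabilities["chrome"]["chromedriverVersion"]
    # Only the part before the first dot (the major version) has to agree.
    return browser.split(".")[0] == driver.split(".")[0]


# Hand-written sample values instead of a live browser session.
sample = {
    "browserVersion": "92.1.2.3",
    "chrome": {"chromedriverVersion": "92.1.2.3 (build hash)"},
}
print(major_versions_match(sample))
```

With a real session the call would be `major_versions_match(driver.capabilities)`. In this post the versions were eventually aligned and the page still failed, which is what pointed away from a version problem.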

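    The symptom: when I launch Chrome directly, I can search Qunar and see the next-day flights from Shanghai to Beijing. In a Chrome instance launched by Selenium from a Python program, the page will not load no matter what. If I then copy the address into any other browser, including a Chrome that Selenium does not control, the page opens.

    The exact cause is still to be investigated.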

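    Why use Selenium

    First, the flight data on the page is fetched through XHR and rendered dynamically by JavaScript, so the raw page source contains no flight information at all. Only with Selenium is what you see what you get.

    Some will say: then just simulate the XHR request.

    I simulated the XHR request in Postman.

    The request did come back as successful, but with no data at all.

    The same request made from the browser returns data that contains the flight information.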

     

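    Update:

    After filling in the complete set of request headers, the simulated request did return data. The GET parameters are passed directly in the URL, after the path.

    The returned data also includes the data for the later pages of the listing, that is, the entries that only appear after clicking page 2 or "next page".

    One more thing: after a day has passed, I have to reopen the browser and update the pre parameter in the headers. This pre value appears to be some kind of verification mechanism. If pre is not updated, a request that worked the day before no longer returns any flight data.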
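The header-completion step can be sketched with the standard library. The endpoint, the parameter names, and every header value below are placeholders, not Qunar's real API; only the existence of a `pre` header that has to be refreshed comes from the text above. The request is built but not sent here:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_xhr_request(endpoint, params, pre_token, cookie, referer):
    """Build (but do not send) a GET request that mimics the browser's XHR."""
    # GET parameters are appended directly to the URL.
    url = endpoint + "?" + urlencode(params)
    headers = {
        "User-Agent": "Mozilla/5.0",  # placeholder; copy the real value from the browser
        "Referer": referer,
        "Cookie": cookie,
        "pre": pre_token,  # has to be refreshed from the browser after about a day
    }
    return Request(url, headers=headers, method="GET")


req = build_xhr_request(
    endpoint="https://example.com/flight/search",  # hypothetical endpoint
    params={"depCity": "SHA", "arrCity": "BJS"},   # hypothetical parameter names
    pre_token="copied-from-browser",
    cookie="copied-from-browser",
    referer="https://flight.qunar.com/",
)
print(req.full_url)
```

Sending it would then be a matter of `urllib.request.urlopen(req)`. When a request that worked yesterday comes back empty, the first thing to try is a fresh `pre_token` copied from a newly opened browser session.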

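    So can Selenium, being what-you-see-is-what-you-get, retrieve the JavaScript-rendered data that only appears on the next page?

    Not only Chrome: Firefox runs into the same problem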

    import time
    from selenium.webdriver import Firefox
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    import requests,json
    options = Options()
    # Raw strings stop the backslashes in Windows paths from being read as escapes such as \f.
    # options.binary_location = r"C:\Users\xiaojie\AppData\Local\Google\Chrome\Application\chrome.exe"
    options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"
    binary = FirefoxBinary(r"C:\Program Files\Mozilla Firefox\firefox.exe")
    caps = DesiredCapabilities.FIREFOX.copy()
    caps['marionette'] = True
    # options.add_experimental_option('excludeSwitches', ['enable-automation'])
    # options.add_argument('--incognito')
    # options.add_argument('disable-infobars')
    # options.add_argument('log-level=3')
    driver =Firefox(firefox_binary=binary,capabilities=caps, executable_path="geckodriver.exe")
    url="https://www.qunar.com/"
    # url = "https://diannao.jd.com/"
    
    driver.get(url)
    time.sleep(1)
    
    url="https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-22&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null"
    # Search for flights
    # Two approaches
    driver.get(url)
    with open('items.jl','w',encoding='UTF-8') as file:
        file.write(driver.page_source)
    
    
    url = "https://diannao.jd.com/"
    # Set the request headers
    header={
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400"
        ,"Referer":"http://m.611.com/Match/Index"
        ,"Connection":"keep-alive"
        ,"Cookie":"__jdu=1961246440; shshshfpa=9d816ce7-9076-04a7-ea7d-8018bc6eadf2-1623898299; shshshfpb=sdus9TfEminz9dWA7KKGYgw%3D%3D; pinId=pjKhXUE59i7LjgxtUlkd_A; pin=zyj183247166; unick=jdzyj183; _tp=LdLet8T0koyg96E1dqQafA%3D%3D; _pst=zyj183247166; areaId=2; TrackID=1CmX15GEs1MOTf99XeZl5eebqfftNFiLEiMsS8vvKBTMwCAfRlKqXu7YCjQn__C2-mqlg-FJxPlEwiA79snSf04SU1xTtsZOoj5aQk_Cb5mu1XN52nGptNsMI-kJjYqCV; user-key=03e76034-3bcd-4674-a834-cf34a4b960a6; ipLocation=%u4e0a%u6d77; cn=76; ipLoc-djd=2-2824-61056-0.3405425761; unpl=V2_ZzNtbUVQQhV1DhJXLxhfV2IFFV0RAkoSJltEAHsRWwc1AUdbclRCFnUURlVnGVsUZgsZXkJcQxFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsRWwVkBhVYR1ZzJXI4dmR%2fHV4BbwciXHJWc1chVEBTeRBdByoDGlpCVEYScA1HZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_660117e2e02c4761bd86bb3e1963c3d7|1634347276899; PCSYCityID=CN_310000_310100_310101; shshshfp=26c4ea20d915727cb0bd98f196346b4d; __jdc=122270672; wlfstk_smdl=7bqz560g1rixn5kg0i84v5y6xvvrqkhx; __jda=122270672.1961246440.1623769658.1634347277.1634382193.23; __jdb=122270672.1.1961246440|23.1634382193; o2-webp=true; 3AB9D23F7A4B3C9B=TJ6FIT6PTK3N32QTINQHHUBRA4J4MDPCZWEHCIIXVS6J5H3LSD75C3RMTC2RIBLHQDLJOCOMXWMJ2LSD6IEMJMV66M"
    
        }
    #response = requests.get("http://m.611.com/Match/Index",headers=header)
    response = requests.get(url,headers=header)
    if response.status_code == 200:
        print(response.text)
        with open('items2.jl','w',encoding='UTF-8') as file:
            file.write(response.text)
        # data = json.loads(response.text)
        # token = data["Data"]
    time.sleep(4)
    # Product information
    item__info=driver.find_elements_by_class_name('goods-item')
    for item in item__info:
        name=item.find_element_by_class_name('goods-item__info').text
        price=item.find_element_by_class_name('goods-item__price').text
        print(name)
        print(price)
        print("-----------")
    driver.delete_all_cookies()
    # driver.quit()

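    Launching Firefox through Selenium to open the flight page gives the same result: the page hangs.

    The likely cause is that Qunar's flight site has anti-scraping measures aimed at the WebDriver automation tooling, which blocks scraping. With the same link, a directly launched Firefox or Chrome loads the page. Only a browser launched through Selenium cannot.

    In the end, experiments confirmed that the cause is anti-scraping aimed at Selenium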
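One signal such checks commonly read is `navigator.webdriver`, which the WebDriver specification has automated browsers set to `true`. Whether Qunar checks exactly this flag is an assumption; the sketch below only shows how to see what a page would see. `exposes_webdriver_flag` is a hypothetical helper, and `FakeDriver` stands in for a real Selenium driver so the snippet runs without a browser:

```python
def exposes_webdriver_flag(driver):
    """Return True if the current page can see navigator.webdriver === true."""
    # execute_script runs JavaScript in the current page and returns the result.
    return driver.execute_script("return navigator.webdriver") is True


class FakeDriver:
    """Stand-in for a Selenium driver; returns a canned value for any script."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        return self.flag


print(exposes_webdriver_flag(FakeDriver(True)))   # session that exposes the flag
print(exposes_webdriver_flag(FakeDriver(None)))   # flag hidden or absent
```

With a real session, calling `exposes_webdriver_flag(driver)` before and after changing the launch options gives a quick way to tell whether a given option actually hides the flag.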

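    The approach suggested online, routing traffic through a mitmproxy proxy, is rather complicated.

    In the end I added two arguments that hide Selenium's fingerprints, and after that Selenium could scrape the page normally.

    The complete code is as follows: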

    import time
    # from selenium.webdriver import Firefox
    # from selenium.webdriver.firefox.options import Options
    
    from selenium.webdriver import Chrome
    from selenium.webdriver.chrome.options import Options
    
    # from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
    # from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    import requests,json
    options = Options()
    # Raw string: in a normal string literal, "\U" starts a Unicode escape and is a syntax error.
    options.binary_location = r"C:\Users\xiaojie\AppData\Local\Google\Chrome\Application\chrome.exe"
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_argument('--incognito')
    options.add_argument('disable-infobars')
    options.add_argument('log-level=3')
    options.add_argument("--disable-blink-features")
    options.add_argument("--disable-blink-features=AutomationControlled")
    
    # options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"
    # binary = FirefoxBinary(r"C:\Program Files\Mozilla Firefox\firefox.exe")
    # caps = DesiredCapabilities.FIREFOX.copy()
    # caps['marionette'] = True
    
    # driver =Firefox(firefox_binary=binary,capabilities=caps, executable_path="geckodriver.exe")
    driver = Chrome(options=options, executable_path=r"D:\webdriver\chromedriver_win32\chromedriver.exe")
    # script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
    # driver.execute_script(script)
    url="https://www.qunar.com/"
    # url = "https://diannao.jd.com/"
    
    driver.get(url)
    time.sleep(1)
    
    url="https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-22&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null"
    # Search for flights
    # Two approaches
    # script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
    # driver.execute_script(script)
    driver.get(url) 
    # script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
    # driver.execute_script(script)
    with open('items.jl','w',encoding='UTF-8') as file:
        file.write(driver.page_source)
    
    
    url = "https://diannao.jd.com/"
    # Set the request headers
    header={
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400"
        ,"Referer":"http://m.611.com/Match/Index"
        ,"Connection":"keep-alive"
        ,"Cookie":"__jdu=1961246440; shshshfpa=9d816ce7-9076-04a7-ea7d-8018bc6eadf2-1623898299; shshshfpb=sdus9TfEminz9dWA7KKGYgw%3D%3D; pinId=pjKhXUE59i7LjgxtUlkd_A; pin=zyj183247166; unick=jdzyj183; _tp=LdLet8T0koyg96E1dqQafA%3D%3D; _pst=zyj183247166; areaId=2; TrackID=1CmX15GEs1MOTf99XeZl5eebqfftNFiLEiMsS8vvKBTMwCAfRlKqXu7YCjQn__C2-mqlg-FJxPlEwiA79snSf04SU1xTtsZOoj5aQk_Cb5mu1XN52nGptNsMI-kJjYqCV; user-key=03e76034-3bcd-4674-a834-cf34a4b960a6; ipLocation=%u4e0a%u6d77; cn=76; ipLoc-djd=2-2824-61056-0.3405425761; unpl=V2_ZzNtbUVQQhV1DhJXLxhfV2IFFV0RAkoSJltEAHsRWwc1AUdbclRCFnUURlVnGVsUZgsZXkJcQxFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsRWwVkBhVYR1ZzJXI4dmR%2fHV4BbwciXHJWc1chVEBTeRBdByoDGlpCVEYScA1HZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_660117e2e02c4761bd86bb3e1963c3d7|1634347276899; PCSYCityID=CN_310000_310100_310101; shshshfp=26c4ea20d915727cb0bd98f196346b4d; __jdc=122270672; wlfstk_smdl=7bqz560g1rixn5kg0i84v5y6xvvrqkhx; __jda=122270672.1961246440.1623769658.1634347277.1634382193.23; __jdb=122270672.1.1961246440|23.1634382193; o2-webp=true; 3AB9D23F7A4B3C9B=TJ6FIT6PTK3N32QTINQHHUBRA4J4MDPCZWEHCIIXVS6J5H3LSD75C3RMTC2RIBLHQDLJOCOMXWMJ2LSD6IEMJMV66M"
    
        }
    #response = requests.get("http://m.611.com/Match/Index",headers=header)
    response = requests.get(url,headers=header)
    if response.status_code == 200:
        print(response.text)
        with open('items2.jl','w',encoding='UTF-8') as file:
            file.write(response.text)
        # data = json.loads(response.text)
        # token = data["Data"]
    time.sleep(4)
    # Product information
    item__info=driver.find_elements_by_class_name('goods-item')
    for item in item__info:
        name=item.find_element_by_class_name('goods-item__info').text
        price=item.find_element_by_class_name('goods-item__price').text
        print(name)
        print(price)
        print("-----------")
    driver.delete_all_cookies()
    # driver.quit()

    Two lines were added to it:

    options.add_argument("--disable-blink-features")
    options.add_argument("--disable-blink-features=AutomationControlled")
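The commented-out `Object.defineProperty(navigator,"webdriver",...)` lines in the script are another way to hide the same flag, but when run through `execute_script` they only take effect after the page's own scripts have already run. A sketch of injecting that override before any page script instead, assuming Chrome and Selenium's `execute_cdp_cmd` with the DevTools command `Page.addScriptToEvaluateOnNewDocument`. `hide_webdriver_flag` and `RecordingDriver` are made up for illustration, and this variant was not tested against Qunar in this post:

```python
HIDE_WEBDRIVER_JS = 'Object.defineProperty(navigator, "webdriver", {get: () => false});'


def hide_webdriver_flag(driver):
    """Ask Chrome to run the override at the start of every new document."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )


class RecordingDriver:
    """Stand-in for selenium.webdriver.Chrome that records CDP calls."""

    def __init__(self):
        self.calls = []

    def execute_cdp_cmd(self, cmd, cmd_args):
        self.calls.append((cmd, cmd_args))
        return {}


fake = RecordingDriver()
hide_webdriver_flag(fake)
print(fake.calls[0][0])
```

With a real driver, `hide_webdriver_flag(driver)` would be called once after creating the driver and before the first `driver.get(url)`.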

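    I also found that Selenium can only capture the data shown on the current page. Unlike the simulated XHR request described earlier, it cannot return all of the flight data at once.

    As for the text obfuscation the site applies, that has to be dealt with by other means.

    Still, simulating the XHR request and taking the data straight from the server, before it is rendered into the page, is the best way to deal with text obfuscation.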

     


    You never know what the future holds, so do the present well. Technology changes the world. Feedback and discussion are welcome.
  • Original post: https://www.cnblogs.com/xiaojieshisilang/p/15415136.html