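When scraping dynamic pages with Python, plain requests or urllib is not enough: they only fetch the raw HTML and do not execute the JavaScript that builds the page. This is where Selenium comes in.
Selenium is a tool for testing web applications. Selenium tests run directly in a browser, just as a real user would operate it. That makes it convenient for scraping: you simulate the steps a person would take to reach the page, and the browser takes care of Cookies and Session handling.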
Prerequisites:
Python
PyCharm
Chrome browser
chromedriver (the browser driver for Chrome)
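Code walkthrough:
Use Selenium's webdriver module to open a Chrome browser window. The executable_path argument used here is the Selenium 3 way of passing the chromedriver location.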
from selenium import webdriver

# Path to the chromedriver executable
chrome_driver = r"C:\Users\13439\Desktop\chromedriver.exe"
driver = webdriver.Chrome(executable_path=chrome_driver)
Next, open the web page:
driver.get("https://fh.dujia.qunar.com/?tf=package")
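Waiting for the page requires three imports: By, which specifies how a DOM element in the HTML is located; WebDriverWait, which waits for the page to finish loading; and expected_conditions, which defines the condition that ends the wait.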
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Keep the driver waiting until the element with id="depCity" is present in the page, for at most 10 s:
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "depCity")))
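To make the waiting step more concrete, here is a minimal pure-Python sketch of the polling idea behind an explicit wait. It is a simplification, not Selenium's actual implementation: the names wait_until and WaitTimeout are hypothetical, and the stand-in condition replaces a real element lookup. The 0.5 s default mirrors WebDriverWait's default poll frequency.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Usage: a stand-in condition that only succeeds on the third check,
# imitating an element that appears after the page has finished loading.
attempts = {"count": 0}


def element_present():
    attempts["count"] += 1
    return "depCity" if attempts["count"] >= 3 else None


found = wait_until(element_present, timeout=2.0, poll_interval=0.01)
```

The real WebDriverWait.until follows the same pattern, but passes the driver to the condition and raises Selenium's own TimeoutException when the time limit is reached.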
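Use the find_element_by_xpath method of the webdriver.Chrome instance to locate the departure input box, clear its contents, and type in the chosen departure city. Fill in the destination box the same way, then click the "开始定制" (Start customizing) button on the page. On the results page, look up the page-number buttons; if none are found, the search returned no results.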
driver.find_element_by_xpath("//*[@id='depCity']").clear()
driver.find_element_by_xpath("//*[@id='depCity']").send_keys(dep)
driver.find_element_by_xpath("//*[@id='arrCity']").send_keys(query["query"])
driver.find_element_by_xpath("/html/body/div[2]/div[1]/div[2]/div[3]/div/div[2]/div/a").click()
pageBtns = driver.find_elements_by_xpath("html/body/div[2]/div[2]/div[8]")
# An empty list means the page-number buttons are missing, i.e. no search results
if pageBtns == []:
    break
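The find_element_by_xpath calls above use the Selenium 3 spelling. As a hedged sketch, the same form-filling steps can be wrapped in one helper written in the Selenium 4 call style, find_element(by, value). The helper name fill_search_form and the constant names are assumptions made for illustration; the XPath strings are copied from the snippet above. The string "xpath" is the value of By.XPATH, used here so the sketch does not need Selenium imported.

```python
XPATH = "xpath"  # same string value as selenium.webdriver.common.by.By.XPATH

# XPath expressions taken from the walkthrough above
DEP_INPUT = "//*[@id='depCity']"
ARR_INPUT = "//*[@id='arrCity']"
START_BUTTON = "/html/body/div[2]/div[1]/div[2]/div[3]/div/div[2]/div/a"


def fill_search_form(driver, dep, arr):
    """Clear the departure box, type both cities, then click the start button.

    `driver` is expected to be a Selenium WebDriver (or any object that
    offers a compatible find_element(by, value) method).
    """
    dep_box = driver.find_element(XPATH, DEP_INPUT)
    dep_box.clear()
    dep_box.send_keys(dep)
    driver.find_element(XPATH, ARR_INPUT).send_keys(arr)
    driver.find_element(XPATH, START_BUTTON).click()
```

With a live driver this would be called as fill_search_form(driver, dep, query["query"]). Looking up the departure box once and reusing it also saves one DOM query compared with the original two lookups.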
Locate the route data, then extract it one block at a time:
routes = driver.find_elements_by_xpath("html/body/div[2]/div[2]/div[7]/div[2]/div")
for route in routes:
    result = {
        'date': time.strftime('%Y-%m-%d', time.localtime(time.time())),
        'dep': dep,
        'arrive': query['query'],
        'result': route.text
    }
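The record built in the loop above can also be produced by a small function, which makes its shape easy to check without a browser. This is only a sketch: the function name build_result, the optional now parameter, and the sample values ("三亚" and the route text) are hypothetical and not part of the original script.

```python
import time


def build_result(dep, arrive, route_text, now=None):
    """Return one scraped record with the same four keys as the loop above.

    `now` is a hypothetical hook: seconds since the epoch, or None for
    the current time.
    """
    timestamp = time.localtime(now)
    return {
        "date": time.strftime("%Y-%m-%d", timestamp),
        "dep": dep,
        "arrive": arrive,
        "result": route_text,
    }


# Usage with made-up sample values
record = build_result("北京", "三亚", "sample route text")
```

Inside the scraping loop, the equivalent call would be build_result(dep, query["query"], route.text).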
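Handle page numbers and pagination: on every page except the tenth, look for the "下一页" (Next page) link and click it. The loop exits early when the page-number buttons cannot be found.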
if i < 9:
    btns = driver.find_elements_by_xpath("html/body/div[2]/div[2]/div[8]/div/div/a")
    for a in btns:
        if a.text == u"下一页":
            a.click()
            break
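The search for the next-page link can be isolated in a helper that reports whether a click happened. The sketch below assumes only that each button exposes a text attribute and a click() method, as Selenium elements do; the name click_next_page is hypothetical.

```python
NEXT_PAGE_LABEL = "下一页"  # link text matched in the snippet above


def click_next_page(buttons, label=NEXT_PAGE_LABEL):
    """Click the first button whose text equals `label`.

    Returns True if a button was clicked, False if no button matched.
    """
    for button in buttons:
        if button.text == label:
            button.click()
            return True
    return False
```

A caller could use the return value to stop paging on the last page, for example by breaking out of the page loop when click_next_page(btns) returns False. The original snippet does not do this; it simply continues to the next iteration.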
The complete code:
import requests
import urllib.request
import time
import random
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def get_url(url):
    # Pause for one second before each request
    time.sleep(1)
    return requests.get(url)


if __name__ == "__main__":
    chrome_driver = r"C:\Users\13439\Desktop\chromedriver.exe"
    driver = webdriver.Chrome(executable_path=chrome_driver)
    # driver = selenium.webdriver.Chrome()  # initialize a browser object
    # dcap = dict(DesiredCapabilities.PHANTOMJS)
    # dcap["phantomjs.page.settings.loadImages"] = False
    # driver = webdriver.PhantomJS(desired_capabilities=dcap)
    dep_cities = ["北京", "上海", "广州", "深圳", "天津", "杭州", "南京", "济南", "重庆", "青岛",
                  "大连", "宁波", "厦门", "成都", "武汉", "哈尔滨", "沈阳", "西安", "长春", "长沙",
                  "福州", "郑州", "石家庄", "苏州", "佛山", "烟台", "合肥", "昆明", "唐山", "乌鲁木齐",
                  "兰州", "呼和浩特", "南通", "潍坊", "绍兴", "邯郸", "东营", "嘉兴", "泰州", "江阴",
                  "金华", "鞍山", "襄阳", "南阳", "岳阳", "漳州", "淮安", "湛江", "柳州", "绵阳"]
    for dep in dep_cities:
        # Fetch the recommended destinations for this departure city
        strhtml = get_url('https://m.dujia.qunar.com/golfz/sight/arriveRecommend?dep='
                          + urllib.request.quote(dep)
                          + '&exclude=&extensionImg=255,175')
        arrive_dict = strhtml.json()
        for arr_item in arrive_dict['data']:
            for arr_item_1 in arr_item['subModules']:
                for query in arr_item_1['items']:
                    driver.get("https://fh.dujia.qunar.com/?tf=package")
                    # Wait up to 10 s for the departure input box to load
                    WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.ID, "depCity")))
                    # Clear the departure box, enter the departure city and the
                    # destination, then click the "开始定制" (Start customizing) button
                    driver.find_element_by_xpath("//*[@id='depCity']").clear()
                    driver.find_element_by_xpath("//*[@id='depCity']").send_keys(dep)
                    driver.find_element_by_xpath("//*[@id='arrCity']").send_keys(query["query"])
                    driver.find_element_by_xpath(
                        "/html/body/div[2]/div[1]/div[2]/div[3]/div/div[2]/div/a").click()
                    print("dep:%s arr:%s" % (dep, query["query"]))
                    for i in range(10):
                        time.sleep(random.uniform(5, 6))
                        # If the page-number buttons cannot be located,
                        # the search returned no results
                        pageBtns = driver.find_elements_by_xpath("html/body/div[2]/div[2]/div[8]")
                        if pageBtns == []:
                            break
                        # Find all DOM elements that hold route information
                        routes = driver.find_elements_by_xpath(
                            "html/body/div[2]/div[2]/div[7]/div[2]/div")
                        for route in routes:
                            result = {
                                'date': time.strftime('%Y-%m-%d', time.localtime(time.time())),
                                'dep': dep,
                                'arrive': query['query'],
                                'result': route.text
                            }
                            print(result)
                        if i < 9:
                            # Find the "下一页" (Next page) button and click it
                            btns = driver.find_elements_by_xpath(
                                "html/body/div[2]/div[2]/div[8]/div/div/a")
                            for a in btns:
                                if a.text == u"下一页":
                                    a.click()
                                    break
    driver.close()
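The full listing builds the recommendation URL by percent-encoding the departure city and then walks three nested levels of the JSON response (data, subModules, items) to reach each destination's query value. A small sketch of those two steps follows. The function names and the sample payload are hypothetical; the payload only imitates the key structure that the script reads, not a real response.

```python
from urllib.parse import quote

RECOMMEND_URL = "https://m.dujia.qunar.com/golfz/sight/arriveRecommend"


def build_recommend_url(dep):
    """Return the recommendation URL with the city name percent-encoded as UTF-8."""
    return f"{RECOMMEND_URL}?dep={quote(dep)}&exclude=&extensionImg=255,175"


def iter_destinations(payload):
    """Yield every item's 'query' value from the nested response structure."""
    for module in payload["data"]:
        for sub_module in module["subModules"]:
            for item in sub_module["items"]:
                yield item["query"]


# Made-up payload with the same nesting the script expects
sample_payload = {
    "data": [
        {"subModules": [{"items": [{"query": "三亚"}, {"query": "丽江"}]}]},
        {"subModules": [{"items": [{"query": "厦门"}]}, {"items": []}]},
    ]
}

url = build_recommend_url("北京")
destinations = list(iter_destinations(sample_payload))
```

Flattening the response into a plain list first keeps the browser-driving loop one level deep instead of four.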