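Preface
Sometimes an element on a page is hard to locate through its attributes. In that case the information you want can be extracted from the page source instead. selenium's page_source property returns the source of the current page.
One use for scraping the page source: extract every URL on the page, then request them in bulk to see whether any return 404 or other errors.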
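As a rough illustration of the batch check mentioned above, here is a minimal sketch using only the standard library. The names check_urls and fetch_status are hypothetical and do not come from the original post. It also assumes the server answers HEAD requests, which some sites reject, so treat it as a starting point rather than a finished checker.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is on the exception.
        return err.code
    except (urllib.error.URLError, ValueError, OSError):
        # Unknown scheme, DNS failure, refused connection, timeout, and similar.
        return None


def check_urls(urls, fetch=fetch_status):
    """Return {url: status} for every distinct URL that failed."""
    failures = {}
    for url in dict.fromkeys(urls):  # de-duplicate while keeping order
        status = fetch(url)
        if status is None or status >= 400:
            failures[url] = status
    return failures
```

Calling check_urls(url_list) returns only the URLs that failed, so an empty dict means every link responded with a status below 400. The fetch parameter exists so the request function can be swapped out, for example with a stub during testing.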
I. page_source
1. selenium's page_source property returns the page source directly, as a string.
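II. Non-greedy matching with re
1. The re module has to be imported first.
2. Match with a regular expression in non-greedy mode: in href="(.+?)" the .+? stops at the first closing quote instead of running on to the last one.
3. The findall method returns a list.
4. Some of the matches turn out not to be URL links, so they can be filtered afterwards.
findall finds every substring of the string that matches the regular expression and returns them as a list. If nothing matches, it returns an empty list.
Syntax: re.findall(pattern, string, flags=0)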
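To see why the non-greedy quantifier matters, here is a small sketch on a made-up HTML fragment. The fragment is an assumption for illustration and is not taken from the real page.

```python
import re

# Made-up HTML fragment with two links on one line.
html = '<a href="https://example.com/a">A</a> <a href="/b">B</a>'

# Greedy: .+ runs to the LAST double quote that still lets the pattern match.
greedy = re.findall('href="(.+)"', html)

# Non-greedy: .+? stops at the FIRST closing double quote.
non_greedy = re.findall('href="(.+?)"', html)

# No match at all gives an empty list, not None.
nothing = re.findall('href="(.+?)"', "no links here")

print(greedy)      # ['https://example.com/a">A</a> <a href="/b']
print(non_greedy)  # ['https://example.com/a', '/b']
print(nothing)     # []
```

Because the pattern contains exactly one capture group, findall returns the captured text rather than the whole match, which is why the href=" prefix does not appear in the results.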
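Reference code:
from selenium import webdriver
import re

driver = webdriver.Chrome()
driver.get("https://www.cnblogs.com/canglongdao")
# print(type(driver.page_source))
rs = driver.page_source.encode("utf-8")
# decode() turns the bytes back into text; str(rs) would give the b'...' repr instead
print(type(rs), type(rs.decode("utf-8")))
# Non-greedy match: capture everything between href=" and the next double quote
aurl = re.findall('href="(.+?)"', rs.decode("utf-8"))
print(aurl)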
Output:
<class 'bytes'> <class 'str'> ['//common.cnblogs.com/favicon.ico?v=20200522', '/css/blog-common.min.css?v=7Pwqzj5EBy4dBv4DJNI181rFKP8_OF0hT7jO3o8jAa0', '/skins/book/bundle-book-2.min.css', '/skins/book/bundle-book-mobile.min.css?v=XFoR99E4sMNWcYA_LxWBPY7uXp4-8NCPb1RnsUN1Mwo', 'https://www.cnblogs.com/canglongdao/rss', 'https://www.cnblogs.com/canglongdao/rsd.xml', 'https://www.cnblogs.com/canglongdao/wlwmanifest.xml', 'https://www.cnblogs.com/canglongdao/', 'https://www.cnblogs.com/canglongdao/archive/2020/09/01.html', 'https://www.cnblogs.com/canglongdao/p/13595372.html', 'https://www.cnblogs.com/canglongdao/p/13595372.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13595372', 'https://www.cnblogs.com/canglongdao/p/13594914.html', 'https://www.cnblogs.com/canglongdao/p/13594914.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13594914', 'https://www.cnblogs.com/canglongdao/p/13594459.html', 'https://www.cnblogs.com/canglongdao/p/13594459.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13594459', 'https://www.cnblogs.com/canglongdao/p/13590722.html', 'https://www.cnblogs.com/canglongdao/p/13590722.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13590722', 'https://www.cnblogs.com/canglongdao/archive/2020/08/31.html', 'https://www.cnblogs.com/canglongdao/p/13590348.html', 'https://www.cnblogs.com/canglongdao/p/13590348.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13590348', 'https://www.cnblogs.com/canglongdao/p/13589720.html', 'https://www.cnblogs.com/canglongdao/p/13589720.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13589720', 'https://www.cnblogs.com/canglongdao/p/13587969.html', 'https://www.cnblogs.com/canglongdao/p/13587969.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13587969', 'https://www.cnblogs.com/canglongdao/archive/2020/08/30.html', 'https://www.cnblogs.com/canglongdao/p/13587061.html', 'https://www.cnblogs.com/canglongdao/p/13587061.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13587061', 
'https://www.cnblogs.com/canglongdao/p/13586938.html', 'https://www.cnblogs.com/canglongdao/p/13586938.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13586938', 'https://www.cnblogs.com/canglongdao/p/13585477.html', 'https://www.cnblogs.com/canglongdao/p/13585477.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13585477', 'https://www.cnblogs.com/canglongdao/default.html?page=2', 'https://www.cnblogs.com/', 'javascript:void(0);', 'javascript:void(0);', 'https://www.cnblogs.com/canglongdao/archive/2020/09/01.html', 'https://www.cnblogs.com/', 'https://www.cnblogs.com/canglongdao/', 'https://i.cnblogs.com/EditPosts.aspx?opt=1', 'https://msg.cnblogs.com/send/%E6%98%9F%E7%A9%BA6', 'javascript:void(0)', 'https://www.cnblogs.com/canglongdao/rss/', 'https://i.cnblogs.com/', 'https://home.cnblogs.com/u/canglongdao/', 'https://home.cnblogs.com/u/canglongdao/', 'https://home.cnblogs.com/u/canglongdao/followers/', 'https://home.cnblogs.com/u/canglongdao/followees/', 'javascript:void(0)', 'https://www.cnblogs.com/canglongdao/p/', 'https://www.cnblogs.com/canglongdao/MyComments.html', 'https://www.cnblogs.com/canglongdao/OtherPosts.html', 'https://www.cnblogs.com/canglongdao/RecentComments.html', 'https://www.cnblogs.com/canglongdao/tag/', 'https://www.cnblogs.com/canglongdao/category/1593317.html', 'https://www.cnblogs.com/canglongdao/category/1694849.html', 'https://www.cnblogs.com/canglongdao/category/1633461.html', 'https://www.cnblogs.com/canglongdao/category/1616592.html', 'https://www.cnblogs.com/canglongdao/category/1609028.html', 'https://www.cnblogs.com/canglongdao/category/1633189.html', 'https://www.cnblogs.com/canglongdao/category/1750002.html', 'https://www.cnblogs.com/canglongdao/category/1566249.html', 'https://www.cnblogs.com/canglongdao/category/1606140.html', 'https://www.cnblogs.com/canglongdao/category/1629226.html', 'https://www.cnblogs.com/canglongdao/category/1588735.html', 'https://www.cnblogs.com/canglongdao/category/1815562.html', 
'https://www.cnblogs.com/canglongdao/category/1588084.html', 'https://www.cnblogs.com/canglongdao/category/1589277.html', 'https://www.cnblogs.com/canglongdao/category/1834572.html', 'https://www.cnblogs.com/canglongdao/category/1611757.html', 'https://www.cnblogs.com/canglongdao/category/1589392.html', 'https://www.cnblogs.com/canglongdao/category/1627263.html', 'https://www.cnblogs.com/canglongdao/category/1619655.html', 'https://www.cnblogs.com/canglongdao/category/1657195.html', 'https://www.cnblogs.com/canglongdao/category/1612257.html', 'https://www.cnblogs.com/canglongdao/category/1769926.html', 'https://www.cnblogs.com/canglongdao/category/1635972.html', 'https://www.cnblogs.com/canglongdao/category/1630667.html', 'https://www.cnblogs.com/canglongdao/archive/2020/09.html', 'https://www.cnblogs.com/canglongdao/archive/2020/08.html', 'https://www.cnblogs.com/canglongdao/archive/2020/07.html', 'https://www.cnblogs.com/canglongdao/archive/2020/06.html', 'https://www.cnblogs.com/canglongdao/archive/2020/05.html', 'https://www.cnblogs.com/canglongdao/archive/2020/04.html', 'https://www.cnblogs.com/canglongdao/archive/2020/03.html', 'https://www.cnblogs.com/canglongdao/archive/2020/02.html', 'https://www.cnblogs.com/canglongdao/archive/2020/01.html', 'https://www.cnblogs.com/canglongdao/archive/2019/12.html', 'https://www.cnblogs.com/canglongdao/archive/2019/11.html', 'https://www.cnblogs.com/canglongdao/archive/2019/10.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/11973931.html', 'https://www.cnblogs.com/canglongdao/p/12013291.html', 'https://www.cnblogs.com/canglongdao/p/12722846.html', 'https://www.cnblogs.com/canglongdao/p/12606952.html', 'https://www.cnblogs.com/canglongdao/p/12019714.html', 'https://www.cnblogs.com/canglongdao/p/12436272.html', 'https://www.cnblogs.com/canglongdao/p/12726642.html', 
'https://www.cnblogs.com/canglongdao/p/11973931.html', 'https://www.cnblogs.com/canglongdao/p/12013291.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/12067902.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/12601894.html', 'https://www.cnblogs.com/canglongdao/p/13414829.html']
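III. Filtering out the URLs
1. Add an if check: a match that contains 'http' is treated as a regular URL.
2. Collect all of those URLs in one list, which is the result we want.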
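The filter described in the steps above can be tried without a browser on a small hard-coded list. The sample values below are illustrative, shaped like the findall result. The second half is a hypothetical stricter variant that is not part of the original post: it resolves relative links with urljoin, keeps only http(s) results, and drops duplicates.

```python
from urllib.parse import urljoin

# Hard-coded sample in the shape of the findall result (illustrative values).
aurl = [
    "//common.cnblogs.com/favicon.ico",
    "/css/blog-common.min.css",
    "https://www.cnblogs.com/canglongdao/",
    "javascript:void(0);",
    "https://www.cnblogs.com/canglongdao/",
]

# Filter as described: keep entries that contain 'http'.
url = [i for i in aurl if "http" in i]

# Hypothetical stricter variant: resolve relative links against the page URL,
# keep only http(s) results, and drop duplicates while preserving order.
base = "https://www.cnblogs.com/canglongdao"
resolved = [urljoin(base, i) for i in aurl]
strict = list(
    dict.fromkeys(u for u in resolved if u.startswith(("http://", "https://")))
)

print(len(url), url)
print(len(strict), strict)
```

The simple 'http' in i check drops relative links such as /css/... even though they point at real resources, and it would keep any string that merely contains "http" somewhere. Which behaviour is preferable depends on whether those resources should be checked too.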
Reference code:
# coding:utf-8
from selenium import webdriver
import re

driver = webdriver.Chrome()
driver.get("https://www.cnblogs.com/canglongdao")
# print(type(driver.page_source))
rs = driver.page_source.encode("utf-8")
# print(type(rs), type(rs.decode("utf-8")))
# decode() turns the bytes back into text before matching
aurl = re.findall('href="(.+?)"', rs.decode("utf-8"))
print(aurl)
url = []
for i in aurl:
    # Keep only matches that contain 'http'
    if 'http' in i:
        url.append(i)
# The final list of URLs
print(len(url), url)
Output: