• 【2022.05.20】Adaptive crawling of announcement content from an entire captcha-free page (1)


    Topics covered

    XPath, plus Python string replacement

    Adaptive URL joining with urljoin, since many sites publish incomplete href values (see the sketch right after this list)

    Using Selenium to scrape dynamically rendered pages (see the sketch after the reference links)
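
    Since urljoin comes up repeatedly below, here is a minimal illustration of why it handles "incomplete" hrefs: a relative href is resolved against the page it was found on, while an absolute href passes through unchanged. The paths in this snippet are made up for illustration.

    from urllib.parse import urljoin

    page = "http://www.lswz.gov.cn/html/ywpd/lstk/tj-sgsj.shtml"

    # A relative href (common in announcement lists) is resolved
    # against the directory of the page it appeared on.
    print(urljoin(page, "./202205/notice.shtml"))
    # -> http://www.lswz.gov.cn/html/ywpd/lstk/202205/notice.shtml

    # An href that is already absolute is returned as-is.
    print(urljoin(page, "http://example.com/notice.shtml"))
    # -> http://example.com/notice.shtml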

    Preface

    The goal this time is to take a URL plus a set of XPath expressions and scrape the content of every announcement linked from that page.

    Code

    Source

    import requests
    from lxml import etree
    import json
    import time
    from urllib.parse import urljoin

    # Global configuration
    with open('config.json', 'r', encoding='utf-8') as f:
        JsonFile = json.load(f)

    base_url = JsonFile['url']
    notice_title_href_xpath = JsonFile['notice_title_href_xpath']
    notice_title_xpath = JsonFile['notice_title_xpath']
    notice_content_xpath = JsonFile['notice_content_xpath']
    search = JsonFile['search']
    # '替换' is a placeholder token in the configured XPath;
    # substitute the actual search keyword for it.
    notice_content_xpath = notice_content_xpath.replace("替换", search)
    
    
    # Return the link to the next page (this part is still unfinished)
    def get_next_page_url(current_url):
        current_html = requests.get(current_url)
        current_html.encoding = "utf-8"
        selector = etree.HTML(current_html.text)
        # Match any element whose text looks like a "next page" link;
        # the Chinese literals cover the common label variants.
        next_page_url = selector.xpath("""//*[contains(text(),'下一页') or contains(text(),'下页') or contains(text(),'next') or contains(text(),'Next')]/@href""")
        print("Next page link:", next_page_url)
        return next_page_url
    
    if __name__ == '__main__':
        current_url = base_url

        current_html = requests.get(current_url)
        current_html.encoding = "utf-8"
        selector = etree.HTML(current_html.text)
        notice = selector.xpath(notice_title_href_xpath)
        print(notice)

        # print(current_html)
        # next_page_url_xpath = """//*[@href = 'tj-sgsj_2.shtml']/@href"""
        # next_page_url = selector.xpath(next_page_url_xpath)
        # print("Next page link:", next_page_url)

        # Fetch every announcement on the current page
        for result in notice:
            # Build an absolute URL; the href is often relative
            result_url = urljoin(current_url, result)
            print("URL: ", result_url)
            result_html = requests.get(result_url)
            result_html.encoding = "utf-8"
            result_detail = etree.HTML(result_html.text)
            result_title = result_detail.xpath(notice_title_xpath)
            print("Title: ", result_title)

            result_content = result_detail.xpath(notice_content_xpath)
            print("Content: ")
            for result_print in result_content:
                print(result_print)
            print("\n")
            time.sleep(1)  # be polite: pause between requests
    
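    The next-page helper above is explicitly unfinished. As a hypothetical sketch (not the author's code) of how it could eventually drive a multi-page crawl: treat an empty XPath result as the last page, otherwise take the first matched href and resolve it with urljoin. process_page is a made-up stand-in for the per-page scraping loop above.

    # Hypothetical pagination loop built on get_next_page_url;
    # assumes the imports and functions from the script above.
    def crawl_all_pages(start_url):
        current_url = start_url
        seen = set()  # guard against pages that link back to themselves
        while current_url and current_url not in seen:
            seen.add(current_url)
            process_page(current_url)  # hypothetical per-page handler
            hrefs = get_next_page_url(current_url)
            # An empty list means no "next page" link was found;
            # otherwise assume the first match is the real one.
            current_url = urljoin(current_url, hrefs[0]) if hrefs else None
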

    Configuration file

    {
        "?url": "[editable] the page to search",
        "url": "http://www.lswz.gov.cn/html/ywpd/lstk/tj-sgsj.shtml",
        "?search": "[editable] the announcement keyword to look for",
        "search": "玉米",
        "?notice_title_href_xpath": "[do not edit] XPath locating each announcement's href",
        "notice_title_href_xpath": "//*[@class='lists diylist']/li/a/@href",
        "?notice_title_xpath": "[do not edit] XPath locating each announcement's title",
        "notice_title_xpath": "//div[@class='pub-det-title']/text()",
        "?notice_content_xpath": "[do not edit] XPath locating each announcement's content",
        "notice_content_xpath": "//*[contains(text(),'替换') and @style]/text()"
    }
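
    The "替换" token in notice_content_xpath is a deliberate placeholder: the script substitutes the configured search keyword into the XPath before querying, so only text nodes containing that keyword match. With the values above:

    notice_content_xpath = "//*[contains(text(),'替换') and @style]/text()"
    search = "玉米"

    # Swap the placeholder for the real keyword before running the query.
    print(notice_content_xpath.replace("替换", search))
    # -> //*[contains(text(),'玉米') and @style]/text()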
    

    Reference links

    Summary of XPath text() and string() usage - Jock2018's blog (CSDN)

    Ultimate fix when a Python crawler using requests cannot retrieve page elements - 咖啡少女不加糖。 - 博客园 (cnblogs.com)

    requests cannot get the full page source - SegmentFault 思否

    Python JS anti-scraping: HTML fetched with requests is incomplete, please help - Python 技术论坛 (learnku.com)
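
    The last three links all describe the same failure mode: requests returns only the raw HTML, so content injected by JavaScript never shows up in the response. The learning list names Selenium as the workaround. A minimal sketch of that approach, reusing the config values loaded above; it assumes Selenium is installed and a chromedriver is on the PATH:

    from lxml import etree
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without a visible browser window
    driver = webdriver.Chrome(options=options)  # assumes chromedriver on PATH

    try:
        driver.get(base_url)  # base_url from config.json, as in the script above
        # page_source holds the DOM *after* JavaScript has run, so the
        # same XPath expressions from the config work unchanged.
        selector = etree.HTML(driver.page_source)
        print(selector.xpath(notice_title_href_xpath))
    finally:
        driver.quit()

    For pages that load content asynchronously, an explicit wait (WebDriverWait from selenium.webdriver.support.ui) may be needed before reading page_source.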
