• 【2022.05.20】Adaptive crawling of announcement content from an entire captcha-free page (1)


    Topics covered

    XPath, plus Python string replacement

    Adaptive URL joining with urljoin, since many sites publish incomplete href values (see the sketch right after this list)

    Using Selenium to scrape dynamically rendered pages (see the sketch after the reference links)
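
    Since urljoin comes up repeatedly below, here is a minimal illustration of why it handles "incomplete" hrefs: a relative href is resolved against the page it was found on, while an absolute href passes through unchanged. The paths in this snippet are made up for illustration.

    from urllib.parse import urljoin

    page = "http://www.lswz.gov.cn/html/ywpd/lstk/tj-sgsj.shtml"

    # A relative href (common in announcement lists) is resolved
    # against the directory of the page it appeared on.
    print(urljoin(page, "./202205/notice.shtml"))
    # -> http://www.lswz.gov.cn/html/ywpd/lstk/202205/notice.shtml

    # An href that is already absolute is returned as-is.
    print(urljoin(page, "http://example.com/notice.shtml"))
    # -> http://example.com/notice.shtml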

    Preface

    The goal this time is to take a URL plus a set of XPath expressions and scrape the content of every announcement linked from that page.

    Code

    Source

    import requests
    from lxml import etree
    import json
    import time
    from urllib.parse import urljoin

    # Global configuration
    with open('config.json', 'r', encoding='utf-8') as f:
        JsonFile = json.load(f)

    base_url = JsonFile['url']
    notice_title_href_xpath = JsonFile['notice_title_href_xpath']
    notice_title_xpath = JsonFile['notice_title_xpath']
    notice_content_xpath = JsonFile['notice_content_xpath']
    search = JsonFile['search']
    # '替换' is a placeholder token in the configured XPath;
    # substitute the actual search keyword for it.
    notice_content_xpath = notice_content_xpath.replace("替换", search)
    
    
    # Return the link to the next page (this part is still unfinished)
    def get_next_page_url(current_url):
        current_html = requests.get(current_url)
        current_html.encoding = "utf-8"
        selector = etree.HTML(current_html.text)
        # Match any element whose text looks like a "next page" link;
        # the Chinese literals cover the common label variants.
        next_page_url = selector.xpath("""//*[contains(text(),'下一页') or contains(text(),'下页') or contains(text(),'next') or contains(text(),'Next')]/@href""")
        print("Next page link:", next_page_url)
        return next_page_url
    
    if __name__ == '__main__':
        current_url = base_url

        current_html = requests.get(current_url)
        current_html.encoding = "utf-8"
        selector = etree.HTML(current_html.text)
        notice = selector.xpath(notice_title_href_xpath)
        print(notice)

        # print(current_html)
        # next_page_url_xpath = """//*[@href = 'tj-sgsj_2.shtml']/@href"""
        # next_page_url = selector.xpath(next_page_url_xpath)
        # print("Next page link:", next_page_url)

        # Fetch every announcement on the current page
        for result in notice:
            # Build an absolute URL; the href is often relative
            result_url = urljoin(current_url, result)
            print("URL: ", result_url)
            result_html = requests.get(result_url)
            result_html.encoding = "utf-8"
            result_detail = etree.HTML(result_html.text)
            result_title = result_detail.xpath(notice_title_xpath)
            print("Title: ", result_title)

            result_content = result_detail.xpath(notice_content_xpath)
            print("Content: ")
            for result_print in result_content:
                print(result_print)
            print("\n")
            time.sleep(1)  # be polite: pause between requests
    
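    The next-page helper above is explicitly unfinished. As a hypothetical sketch (not the author's code) of how it could eventually drive a multi-page crawl: treat an empty XPath result as the last page, otherwise take the first matched href and resolve it with urljoin. process_page is a made-up stand-in for the per-page scraping loop above.

    # Hypothetical pagination loop built on get_next_page_url;
    # assumes the imports and functions from the script above.
    def crawl_all_pages(start_url):
        current_url = start_url
        seen = set()  # guard against pages that link back to themselves
        while current_url and current_url not in seen:
            seen.add(current_url)
            process_page(current_url)  # hypothetical per-page handler
            hrefs = get_next_page_url(current_url)
            # An empty list means no "next page" link was found;
            # otherwise assume the first match is the real one.
            current_url = urljoin(current_url, hrefs[0]) if hrefs else None
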

    Configuration file

    {
        "?url": "[editable] the page to search",
        "url": "http://www.lswz.gov.cn/html/ywpd/lstk/tj-sgsj.shtml",
        "?search": "[editable] the announcement keyword to look for",
        "search": "玉米",
        "?notice_title_href_xpath": "[do not edit] XPath locating each announcement's href",
        "notice_title_href_xpath": "//*[@class='lists diylist']/li/a/@href",
        "?notice_title_xpath": "[do not edit] XPath locating each announcement's title",
        "notice_title_xpath": "//div[@class='pub-det-title']/text()",
        "?notice_content_xpath": "[do not edit] XPath locating each announcement's content",
        "notice_content_xpath": "//*[contains(text(),'替换') and @style]/text()"
    }
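
    The "替换" token in notice_content_xpath is a deliberate placeholder: the script substitutes the configured search keyword into the XPath before querying, so only text nodes containing that keyword match. With the values above:

    notice_content_xpath = "//*[contains(text(),'替换') and @style]/text()"
    search = "玉米"

    # Swap the placeholder for the real keyword before running the query.
    print(notice_content_xpath.replace("替换", search))
    # -> //*[contains(text(),'玉米') and @style]/text()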
    

    Reference links

    Summary of XPath text() and string() usage - Jock2018's blog (CSDN)

    Ultimate fix when a Python crawler using requests cannot retrieve page elements - 咖啡少女不加糖。 - 博客园 (cnblogs.com)

    requests cannot get the full page source - SegmentFault 思否

    Python JS anti-scraping: HTML fetched with requests is incomplete, please help - Python 技术论坛 (learnku.com)
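
    The last three links all describe the same failure mode: requests returns only the raw HTML, so content injected by JavaScript never shows up in the response. The learning list names Selenium as the workaround. A minimal sketch of that approach, reusing the config values loaded above; it assumes Selenium is installed and a chromedriver is on the PATH:

    from lxml import etree
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without a visible browser window
    driver = webdriver.Chrome(options=options)  # assumes chromedriver on PATH

    try:
        driver.get(base_url)  # base_url from config.json, as in the script above
        # page_source holds the DOM *after* JavaScript has run, so the
        # same XPath expressions from the config work unchanged.
        selector = etree.HTML(driver.page_source)
        print(selector.xpath(notice_title_href_xpath))
    finally:
        driver.quit()

    For pages that load content asynchronously, an explicit wait (WebDriverWait from selenium.webdriver.support.ui) may be needed before reading page_source.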
