• Scraping daily current-affairs news from 中公网 (zgsydw.com)


    1. What it does

    Fetches the daily current-affairs news from 中公网, crawling in bulk by constructing the paginated list URLs.

    For each daily article it sends a request and extracts the content at the XPath-matched positions.

    import requests
    import time
    from lxml import etree
    import re
    
    def write_info(info):
        # append scraped text to the output file; the with-block
        # closes the file automatically, no explicit close needed
        with open('时政2.txt', 'a', encoding='utf-8') as f:
            f.write(info)
    
    
    url_temp = 'http://gd.zgsydw.com/ziliao/shizheng/{}.html'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
    url_f = []
    for k in range(11, 34):
        url_f.append(url_temp.format(str(k)))   # build the list-page URLs
        
    try:
        for url in url_f:    # iterate over the list pages
            response = requests.get(url, headers=headers)
            html = etree.HTML(response.content.decode('gbk'))
            item = html.xpath('//*[@class="whole clearfix zg_list_one"]/div[@class="zg_list_lf"]/div[@class="zpxx_nr_change1"]/div/ul/li')
            page = re.findall(r'http://gd.zgsydw.com/ziliao/shizheng/(.*?).html', url)
            for i in item:
                title = i.xpath('.//a/span/b/text()')[0]    # article title
                href = i.xpath('.//@href')[0]   # link to the daily article
                print('Now on page {}'.format(page[0]) + '\n', title, href)
                write_info(title)
                res = requests.get(href, headers=headers)
                html2 = etree.HTML(res.content.decode('gbk'))
                item1 = html2.xpath('//*[@class="whole clearfix zg_show_one"]/div[@class="zg_show_lf"]/div[@class="show_con"]')
                for box in item1:
                    # the first news item has no <p> of its own, so grab
                    # the text directly under the div first ...
                    one = box.xpath('.//div[@class="show_con_box"]/text()')[0]
                    write_info(one)
                    # ... then the <p>-wrapped paragraphs
                    two = box.xpath('.//div[@class="show_con_box"]/p/text()')
                    for j in two[5:]:
                        write_info(j)
                        write_info('\n')
                    time.sleep(3)    # throttle requests
    except Exception as e:
        print(e)
    Scraper code
    import re
    
    
    with open('时政2.txt', 'r', encoding='utf-8') as f:
        s = f.read()
    
    # strip the boilerplate sentence injected into every article
    research = re.sub(r'和各位考生探讨考试中的疑惑,以下为正文内容详情:', '', s)
    
    with open('时政3.txt', 'w', encoding='utf-8') as f:
        f.write(research)
    Initial cleanup of boilerplate text

    2. Known issues

    Encoding: the pages declare charset="GB2312", but some pages fail to decode under that codec; decoding everything as gbk (a superset of GB2312) resolves this.
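A minimal sketch of why the gbk workaround is safe: GBK is a superset of GB2312, so any valid GB2312 bytes decode under gbk, and a fallback with errors='replace' (an extra safety net not in the original script) keeps the scraper alive on genuinely malformed bytes. The helper name `decode_page` is hypothetical.

```python
def decode_page(raw: bytes) -> str:
    """Decode bytes from a page declared as GB2312.

    GBK is a superset of GB2312, so gbk also handles characters
    outside GB2312; errors='replace' survives malformed bytes.
    """
    try:
        return raw.decode('gbk')
    except UnicodeDecodeError:
        return raw.decode('gbk', errors='replace')

# Valid GB2312 bytes decode cleanly under gbk
sample = '时政新闻'.encode('gb2312')
print(decode_page(sample))   # → 时政新闻
```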

    XPath positions: the first news item in each article has no <p> tag of its own, so its text is picked up directly under the div with xpath('.//div[@class="show_con_box"]/text()')[0] and combined with the <p>-wrapped paragraphs.
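The distinction between bare text and <p>-wrapped text can be seen on a small self-contained snippet (the HTML below is a simplified stand-in for the real page, not its actual markup):

```python
from lxml import etree

# Simplified stand-in for a show_con_box block on the real site
html = etree.HTML('''
<div class="show_con_box">
  Lead sentence directly in the div.
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
''')

# text() selects text nodes directly under the div -- this is how
# the first, <p>-less news item gets captured ...
direct = html.xpath('//div[@class="show_con_box"]/text()')[0].strip()
# ... while p/text() selects the <p>-wrapped paragraphs
paras = html.xpath('//div[@class="show_con_box"]/p/text()')
print(direct)   # → Lead sentence directly in the div.
print(paras)    # → ['First paragraph.', 'Second paragraph.']
```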

    Pages from around 2019 use a different tag layout, so they cannot be matched completely with these selectors.
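One way to cope with layout drift across years is to try several candidate XPaths in order and take the first that matches anything. This is a sketch of that idea, not the post's actual fix; the helper `first_match` and both selectors are hypothetical.

```python
from lxml import etree

def first_match(node, xpaths):
    """Return results from the first candidate XPath that matches
    anything, or [] if none do. (Helper name is hypothetical.)"""
    for xp in xpaths:
        hits = node.xpath(xp)
        if hits:
            return hits
    return []

# Simulate a layout change: suppose older pages used class="old"
# and newer ones class="new" (assumed class names for illustration).
doc = etree.HTML('<ul><li class="new">2020-09-01 headline</li></ul>')
hits = first_match(doc, [
    '//li[@class="old"]/text()',   # selector for the older layout
    '//li[@class="new"]/text()',   # selector for the current layout
])
print(hits)   # → ['2020-09-01 headline']
```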

    3. References

    Encoding

    https://blog.csdn.net/lxdcyh/article/details/4018054

    lxml

    https://blog.csdn.net/mouday/article/details/105376949

    List comprehensions

    https://blog.csdn.net/lexi3555/article/details/80633441

  • Original post: https://www.cnblogs.com/ybxw/p/13600745.html