Scraping financial news from a news website

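The news to scrape is the listing on the channel page https://www.thepaper.cn/channel_25951.
Press Ctrl+U in the browser to view the page's HTML source, which makes it easier to analyze the data structure.
Data structure: inside each element with class news_li, the a link under the h2 tag points to that news item's detail page.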
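To check the structure described above without requesting the live site, here is a minimal sketch that uses only Python's standard html.parser instead of pyquery. The sample markup and the NewsLinkParser class are made up for illustration; the real page may contain more nesting and attributes.

```python
from html.parser import HTMLParser

# Hypothetical sample markup that mimics the structure described above.
SAMPLE_HTML = """
<div class="news_li">
  <h2><a href="newsDetail_forward_1">First headline</a></h2>
</div>
<div class="news_li">
  <h2><a href="newsDetail_forward_2">Second headline</a></h2>
</div>
<div class="other"><h2><a href="ignored">Not a news item</a></h2></div>
"""


class NewsLinkParser(HTMLParser):
    """Collect (href, title) for every <a> matching '.news_li h2 a'."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags as (tag, has_class_news_li)
        self.links = []    # collected (href, title) pairs
        self._href = None  # href of the matching <a> being read, if any
        self._text = []    # text fragments inside that <a>

    def _matches(self):
        # True when the ancestors contain .news_li with an h2 somewhere below it
        ancestors = self.stack[:-1]
        for i, (_, is_news) in enumerate(ancestors):
            if is_news:
                return any(tag == "h2" for tag, _ in ancestors[i + 1:])
        return False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        self.stack.append((tag, "news_li" in classes))
        if tag == "a" and self._matches():
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        # Pop back to the matching open tag; unclosed tags above it go too
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


parser = NewsLinkParser()
parser.feed(SAMPLE_HTML)
for href, title in parser.links:
    print(href, title)
```

The link inside the div with class other is skipped because none of its ancestors carries the news_li class, which is the same filtering the CSS selector ".news_li h2 a" performs.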
The complete code is as follows:

#encoding:utf-8
import requests
from pyquery import PyQuery as pq
import os
import datetime

header = {
        "Referer": "https://www.thepaper.cn/channel_25951",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
base_url = "https://www.thepaper.cn/"
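def index():
    url = base_url + "channel_25951"
    # Pass the headers by keyword: the second positional argument of requests.get is params
    response = requests.get(url, headers=header).text
    # Parse the HTML
    doc = pq(response)
    # Extract data with a CSS selector (class news_li -> h2 tag -> a tag)
    a = doc(".news_li h2 a").items()
    for x in a:
        # Build the link to the news detail page
        href = base_url + x.attr("href")
        # Extract the headline text
        title = x.text()
        content(href, title)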

def content(href, title):
    # Pass the headers by keyword here as well
    response = requests.get(href, headers=header).text
    doc = pq(response)
    news = doc(".news_txt").items()
    for x in news:
        new = x.text()
        # One directory per day, named like 2020-07-26
        date = datetime.datetime.now().strftime("%Y-%m-%d")
        os.makedirs(date, exist_ok=True)
        # Append the article text to <date>/<title>.txt
        with open(os.path.join(date, "{}.txt".format(title)), "a", encoding="utf-8") as f:
            f.write(new)

index()
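
The script above builds detail links by string concatenation and uses the raw headline as a file name. Both can fail on real data, for example when an href starts with a slash or a headline contains a character such as "?" or "/". The following is a minimal sketch of two helpers that could be used instead; absolute_link and safe_filename are hypothetical names, not part of the original script, and the set of replaced characters is an assumption based on common file system rules.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.thepaper.cn/"


def absolute_link(href):
    """Resolve an href against BASE_URL, whether it is relative or already absolute."""
    return urljoin(BASE_URL, href)


def safe_filename(title, max_length=100):
    """Replace characters that are commonly invalid in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title).strip(" ._")
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or "untitled")[:max_length]


print(absolute_link("newsDetail_forward_123"))
# https://www.thepaper.cn/newsDetail_forward_123
print(absolute_link("/newsDetail_forward_123"))
# https://www.thepaper.cn/newsDetail_forward_123
print(safe_filename("a/b?c"))
# a_b_c
```

In the script, href = absolute_link(x.attr("href")) and "{}.txt".format(safe_filename(title)) would be the places to apply them.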

Original article: https://www.cnblogs.com/zhouzetian/p/13380546.html