• Crawler case: blog post list


    Blog example:

    Crawl the article list from a cnblogs.com blog; assume the page URL is https://www.cnblogs.com/loaderman

    Requirements:

    1. Fetch the page with requests, and extract the data with XPath / re

    2. For each post, extract the title, description, link, date, etc.

    3. Save the results to a JSON file

    Code:

    # -*- coding:utf-8 -*-

    import json

    import requests
    from lxml import etree

    url = "https://www.cnblogs.com/loaderman/"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

    # Fetch the page with requests (requirement 1)
    html = requests.get(url, headers=headers).text

    # The response body is a string; parse it into an HTML DOM tree
    text = etree.HTML(html)

    # Select every post node; contains() does a fuzzy match: the first
    # argument is the attribute to test, the second a substring it must contain
    node_list = text.xpath('//div[contains(@class, "post")]')

    for each in node_list:
        title = each.xpath(".//h2/a[@class='postTitle2']/text()")[0]
        detailUrl = each.xpath(".//a[@class='postTitle2']/@href")[0]
        content = each.xpath(".//div[@class='c_b_p_desc']/text()")[0]
        date = each.xpath(".//p[@class='postfoot']/text()")[0]

        items = {
            "title": title,
            "detailUrl": detailUrl,
            "content": content,
            "date": date,
        }

        # Append one JSON object per line (requirement 3)
        with open("loaderman.json", "a", encoding="utf-8") as f:
            f.write(json.dumps(items, ensure_ascii=False) + "\n")

    Result:
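    With the append-mode write above, loaderman.json ends up holding one JSON object per line. A line would look roughly like this (placeholder values, not real scraped data):

    {"title": "Sample post title", "detailUrl": "https://www.cnblogs.com/loaderman/p/xxxxxxx.html", "content": "First sentences of the post...", "date": "posted @ 2019-10-29 ..."}

    To load the records back, parse the file line by line:

    import json

    with open("loaderman.json", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]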

  • Original post: https://www.cnblogs.com/loaderman/p/11759854.html