• Crawler case: blog post list


    Blog example:

    Crawl the article list from a cnblogs.com blog; assume the page URL is https://www.cnblogs.com/loaderman

    Requirements:

    1. Fetch the page with requests, and extract the data with XPath / re

    2. For each post, extract the title, description, link, date, etc.

    3. Save the results to a JSON file

    Code:

    # -*- coding:utf-8 -*-

    import json

    import requests
    from lxml import etree

    url = "https://www.cnblogs.com/loaderman/"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

    # Fetch the page with requests (requirement 1)
    html = requests.get(url, headers=headers).text

    # The response body is a string; parse it into an HTML DOM tree
    text = etree.HTML(html)

    # Select every post node; contains() does a fuzzy match: the first
    # argument is the attribute to test, the second a substring it must contain
    node_list = text.xpath('//div[contains(@class, "post")]')

    for each in node_list:
        title = each.xpath(".//h2/a[@class='postTitle2']/text()")[0]
        detailUrl = each.xpath(".//a[@class='postTitle2']/@href")[0]
        content = each.xpath(".//div[@class='c_b_p_desc']/text()")[0]
        date = each.xpath(".//p[@class='postfoot']/text()")[0]

        items = {
            "title": title,
            "detailUrl": detailUrl,
            "content": content,
            "date": date,
        }

        # Append one JSON object per line (requirement 3)
        with open("loaderman.json", "a", encoding="utf-8") as f:
            f.write(json.dumps(items, ensure_ascii=False) + "\n")

    Result:
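    With the append-mode write above, loaderman.json ends up holding one JSON object per line. A line would look roughly like this (placeholder values, not real scraped data):

    {"title": "Sample post title", "detailUrl": "https://www.cnblogs.com/loaderman/p/xxxxxxx.html", "content": "First sentences of the post...", "date": "posted @ 2019-10-29 ..."}

    To load the records back, parse the file line by line:

    import json

    with open("loaderman.json", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]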

  • Original post: https://www.cnblogs.com/loaderman/p/11759854.html