Scraping CSDN Articles

First published on: the personal blog of 喜欢沧月的二福君

title: Scraping CSDN Articles
date: 2019-06-09 13:17:26
tags: CSDN, python
category: Tech

Plan

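I recently set up a personal blog and wanted to migrate my CSDN posts to it. CSDN's one-click migration feature did not work for me, so I decided to scrape the posts directly and republish them.
Time budget: 3 hours
Expected result: blog posts saved to local disk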

Implementation steps

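Find the article list page, scrape it, and extract each article's URL.

Parse each article page and extract its content.

Save the content locally.

Try to preserve the article's styling.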
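For the last step, one option is to save the article body as HTML instead of plain text, so that headings, code blocks, and images keep their markup. The sketch below assumes the body fragment has already been extracted as a string (for example with pyquery's `.html()` on the content element). `wrap_as_html` is a hypothetical helper and is not part of the original script.

```python
import html


def wrap_as_html(title, body_html):
    """Wrap an extracted HTML fragment in a minimal standalone page.

    body_html is assumed to be markup taken from the article itself,
    so it is inserted unchanged; only the title is escaped.
    """
    safe_title = html.escape(title)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{safe_title}</title>\n"
        "</head>\n"
        "<body>\n"
        f"<h1>{safe_title}</h1>\n"
        f"{body_html}\n"
        "</body>\n"
        "</html>\n"
    )


page = wrap_as_html("A <b>title</b>", "<p>Hello</p>")
```

This keeps only the markup. The site's own stylesheet is not included, so the saved page will look plainer than the original.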

Technologies used

The scraper is written in Python: requests fetches the pages and pyquery parses them.

Coding

After inspecting the article page, the content can be extracted with the following code:

    article = doc('.blog-content-box')
    # Article title
    title = article('.title-article').text()
    # Article content
    content = article('.article_content')

Saving the article

    path = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
    with open(path, 'a', encoding='utf-8') as file:
        file.write(title + '\n' + content.text())
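Article titles often contain characters such as `/`, `:` or `?`, which Windows does not allow in file names, so `open` can fail for those articles. A minimal sketch of a workaround is to clean the title before building the path. `safe_filename` is a hypothetical helper, not part of the original script.

```python
import re


def safe_filename(title, replacement="_"):
    """Replace characters that Windows does not allow in file names."""
    # \ / : * ? " < > | are reserved; control whitespace is replaced as well
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title)
    # Leading or trailing spaces and dots also cause trouble on Windows
    cleaned = cleaned.strip(" .")
    return cleaned or "untitled"


print(safe_filename('Python: requests/pyquery?'))
```

The path would then be built from `safe_filename(title)` instead of the raw `title`.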

Extracting the article URLs

    urls = doc('.article-list .content a')
    return urls
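The selector may match links that are empty or that appear more than once. The sketch below filters the collected `href` values before they are fetched. It assumes that article pages contain `/article/details/` in their URL, which should be checked against the actual blog. `unique_article_urls` is a hypothetical helper.

```python
def unique_article_urls(hrefs):
    """Keep article links only, dropping blanks and duplicates in order."""
    seen = set()
    result = []
    for href in hrefs:
        # Assumption: article pages contain '/article/details/' in their URL.
        if not href or '/article/details/' not in href:
            continue
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result


sample = [
    'https://example.com/user/article/details/1',
    None,
    'https://example.com/user/article/details/1',
    'https://example.com/about',
    'https://example.com/user/article/details/2',
]
print(unique_article_urls(sample))
```

With pyquery, the input could be produced with `(a.attr('href') for a in urls.items())`.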

Paginated crawling

    for i in range(3):
        print(i)
        main(offset = i+1)
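Since `range(3)` yields 0, 1 and 2, the loop requests list pages 1 to 3. A slightly more general sketch takes the page count as a parameter and pauses between pages so the site is not hit in a tight loop. `crawl_pages` and its parameters are hypothetical names; the `sleep` argument exists only so the pause can be swapped out when testing.

```python
import time


def crawl_pages(crawl_page, pages=3, delay=1.0, sleep=time.sleep):
    """Call crawl_page(1), crawl_page(2), ... pausing between pages."""
    for page in range(1, pages + 1):
        crawl_page(page)
        # No pause is needed after the last page
        if page < pages:
            sleep(delay)


# Demonstration with stand-ins that only record what would happen
visited = []
pauses = []
crawl_pages(visited.append, pages=3, delay=0.5, sleep=pauses.append)
print(visited, pauses)
```

In the script this would be called as `crawl_pages(lambda n: main(offset=n))`.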

Putting it together

Complete code

#!/usr/bin/env python
# _*_coding:utf-8 _*_
#@Time    :2019/6/8 0008 11:00 PM
#@Author  :喜欢二福的沧月君（necydcy@gmail.com）
#@FileName: CSDN.py

#@Software: PyCharm

import requests
from pyquery import PyQuery as pq

def find_html_content(url):
    # Send a browser-like User-Agent with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    html = requests.get(url, headers=headers).text
    return html

def read_and_write_blog(html):
    doc = pq(html)

    article = doc('.blog-content-box')
    # Article title
    title = article('.title-article').text()
    # Article content
    content = article('.article_content')

    try:
        path = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
        with open(path, 'a', encoding='utf-8') as file:
            file.write(title + '\n' + content.text())
    except Exception as exc:
        print("Save failed:", exc)


def geturls(url):
    # Fetch one list page and return the matched article links
    content = find_html_content(url)
    doc = pq(content)
    urls = doc('.article-list .content a')
    return urls

def main(offset):
    # Replace the placeholder with the blog's list URL; the page number is appended
    url = '<blog list URL goes here>' + str(offset)
    urls = geturls(url)
    for a in urls.items():
        a_url = a.attr('href')
        print(a_url)
        html = find_html_content(a_url)
        read_and_write_blog(html)

if __name__ == '__main__':
    # Crawl list pages 1 to 3
    for i in range(3):
        print(i)
        main(offset=i + 1)


Original post: https://www.cnblogs.com/miria-486/p/10993272.html