• Scraping novel text


    Page to scrape: http://www.shicimingju.com/book/sanguoyanyi.html

    import os

    import bs4
    import requests  # the 'lxml' parser used below also needs the lxml package installed

    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
    }
    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url=url, headers=headers).text
    soup = bs4.BeautifulSoup(page_text, 'lxml')
    book = soup.select('.card > h1')[0].string  # novel title, used as the output directory
    os.makedirs(book, exist_ok=True)  # do not fail if the directory already exists

    a_list = soup.select('.book-mulu > ul > li > a')  # all chapter links in the table of contents
    for a in a_list:
        title = a.string  # chapter title
        url_detail = 'http://www.shicimingju.com' + a['href']
        page_text_detail = requests.get(url=url_detail, headers=headers).text
        detail_soup = bs4.BeautifulSoup(page_text_detail, 'lxml')
        content = detail_soup.find('div', attrs={'class': 'chapter_content'}).text
        # os.path.join builds a portable path; 'with' closes the file after writing
        with open(os.path.join(book, title), 'w', encoding='utf-8') as f:
            f.write(content)
        print(title, 'downloaded successfully')
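
The script above uses each chapter title directly as a file name, which can fail if a title contains characters the file system rejects. A minimal sketch of one way to guard against that follows. The helper name `chapter_path`, the `.txt` extension, and the set of characters treated as illegal are assumptions made for illustration, not part of the original post.

```python
import os
import re

# Characters that Windows rejects in file names (assumed set for this sketch).
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def chapter_path(book_dir, title, ext=".txt"):
    """Return a path for one chapter file inside book_dir (hypothetical helper)."""
    # Trim the title and replace each illegal character with an underscore.
    safe = _ILLEGAL.sub("_", title.strip())
    # Collapse any run of whitespace into a single space.
    safe = re.sub(r"\s+", " ", safe)
    return os.path.join(book_dir, safe + ext)
```

In the loop, `open(chapter_path(book, title), 'w', encoding='utf-8')` could then replace the hand-built path.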









  • Original post: https://www.cnblogs.com/KingOfCattle/p/12907968.html