• bs4 in practice: scraping Romance of the Three Kingdoms



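    # Goal: scrape the chapter titles and chapter text of the novel
    # "Romance of the Three Kingdoms" from http://www.shicimingju.com/book/sanguoyanyi.html
    import requests
    from bs4 import BeautifulSoup

    if __name__ == "__main__":
        # Fetch the index (table of contents) page
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        }  # spoof a browser User-Agent
        url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
        page_text = requests.get(url=url, headers=headers).text

        # Parse the chapter titles and detail-page URLs out of the index page.
        # Step 1: instantiate a BeautifulSoup object loaded with the page source
        soup = BeautifulSoup(page_text, 'lxml')
        # Step 2: each <li> under .book-mulu > ul is one chapter entry
        li_list = soup.select('.book-mulu > ul > li')

        # "with" closes the file even if a request fails partway through
        with open("./sanguo.txt", 'w', encoding='utf-8') as fp:
            for li in li_list:
                title = li.a.string  # chapter title: the text of the <a> tag
                detail_url = 'http://www.shicimingju.com' + li.a['href']
                # Request the detail page and parse the chapter text out of it
                detail_page_text = requests.get(url=detail_url, headers=headers).text
                detail_soup = BeautifulSoup(detail_page_text, 'lxml')
                div_tag = detail_soup.find('div', class_='chapter_content')
                # .text is a property, not a method: calling div_tag.text() raises TypeError
                content = div_tag.text
                # End each chapter with a newline so the chapters do not run together
                fp.write(title + ':' + content + '\n')
                print(title, "scraped successfully")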
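
The parsing steps above can be tried without touching the network. The sketch below is only an illustration under stated assumptions: `INDEX_HTML` and `DETAIL_HTML` are hand-written snippets that mimic the structure the selectors expect (they are not copied from the real site), `parse_index` and `parse_chapter` are hypothetical helper names, and the built-in `html.parser` is swapped in for `lxml` so that no extra parser needs to be installed. It also uses `urljoin` in place of string concatenation to build the detail-page URL, and guards against a missing `chapter_content` container.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the layout assumed by the selectors.
INDEX_HTML = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
  </ul>
</div>
"""

DETAIL_HTML = """
<div class="chapter_content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

BASE_URL = 'http://www.shicimingju.com/book/sanguoyanyi.html'


def parse_index(html, base_url):
    """Return a list of (title, absolute_url) pairs, one per chapter <li>."""
    soup = BeautifulSoup(html, 'html.parser')
    chapters = []
    for li in soup.select('.book-mulu > ul > li'):
        # urljoin resolves the relative href against the index page URL
        chapters.append((li.a.string, urljoin(base_url, li.a['href'])))
    return chapters


def parse_chapter(html):
    """Return the chapter text, or an empty string if the container is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    div_tag = soup.find('div', class_='chapter_content')
    if div_tag is None:
        return ''
    # get_text joins every nested string; strip=True trims each piece
    # and drops the whitespace-only ones
    return div_tag.get_text('\n', strip=True)


if __name__ == "__main__":
    print(parse_index(INDEX_HTML, BASE_URL))
    print(parse_chapter(DETAIL_HTML))
```

Separating parsing from fetching this way makes it easier to check the selectors against a saved copy of a page before running the full crawl.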


  • Original article: https://www.cnblogs.com/huahuawang/p/12692354.html