• Python Beautiful Soup: scraping IT book PDFs, free downloads


    http://www.allitebooks.org/
    is the most generous site I have come across: every book on it is free to download.
    Bored over the weekend, I tried to scrape all of the site's PDF books.

    Technologies used

    • Python 3.5
    • Beautiful Soup
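
    Beautiful Soup lives in the beautifulsoup4 package on PyPI, and requests is used alongside it. A minimal smoke test, assuming both were installed with pip, might look like this:

    # Hypothetical check -- not part of the original script.
    # Install first:  pip install beautifulsoup4 requests
    from bs4 import BeautifulSoup

    html = "<ul><li><a href='/book-1/'>Book One</a></li></ul>"
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select("li a"):
        print(a.get_text(), a.get("href"))  # -> Book One /book-1/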

    The code

    This is about the simplest crawler possible, with little to no error handling. If you try it yourself, please be gentle and don't bring this generous site down; a slightly politer variant is sketched after the listing.

    # www.qingmiaokeji.cn 30
    from bs4 import BeautifulSoup
    import requests
    
    siteUrl = 'http://www.allitebooks.org/'
    
    
    def category():
        # Collect the name and URL of every category from the site's sub-menu.
        response = requests.get(siteUrl)
        soup = BeautifulSoup(response.text, "html.parser")
        categoryurl = []
        for a in soup.select('.sub-menu li a'):
            categoryurl.append({'name': a.get_text(), 'href': a.get("href")})
        return categoryurl
    
    
    def bookUrlList(url):
        # Read the page count from the "Last Page" link, then walk every page
        # of the category.
        response = requests.get(url['href'])
        soup = BeautifulSoup(response.text, "html.parser")
        nums = 0
        for e in soup.select(".pagination a[title='Last Page →']"):
            nums = int(e.get_text())
        for i in range(1, nums + 1):
            bookList(url['href'] + "page/" + str(i))
    
    
    def bookList(url):
        # Extract the detail-page URL of every book on one listing page.
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.select(".main-content-inner article .entry-title a"):
            getBookDetail(a.get("href"))
    
    
    def getBookDetail(url):
        # Pull the title, cover image and PDF link from a book's detail page
        # and append them to a local text file as a Markdown table row.
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.select(".single-title")[0].text
        imgurl = soup.select(".entry-body-thumbnail .attachment-post-thumbnail")[0].get("src")
        downLoadPdfUrl = soup.select(".download-links a")[0].get("href")
        with open('d:/booklist.txt', 'a+', encoding='utf-8') as f:
            f.write(title + " | ![](" + imgurl + ") | " + downLoadPdfUrl + "\n")
    
    
    if __name__ == '__main__':
        for url in category():
            bookUrlList(url)
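
    The listing above fires requests back to back. A minimal, hypothetical way to be politer, assuming the same page structure, is to reuse a single requests.Session, pause between requests, and skip failed pages instead of crashing:

    # Hypothetical helper -- not part of the original script.
    import time

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    session.headers.update({"User-Agent": "friendly-book-crawler"})

    def get_soup(url, delay=1.0, timeout=10):
        """Fetch a page gently: wait a bit, set a timeout, swallow failures."""
        time.sleep(delay)  # don't hammer the site
        try:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()
        except requests.RequestException as e:
            print("skipping", url, e)
            return None
        return BeautifulSoup(response.text, "html.parser")

    Each requests.get(...) call in the functions above could then go through get_soup(url), with a None check before parsing.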
    
  • Original post: https://www.cnblogs.com/qingmiaokeji/p/10988906.html