• Python crawler ----> Python projects on GitHub


    This post crawls some high-star Python projects on GitHub as a way to learn how to use BeautifulSoup and pymysql. I always thought the mountain was the water's story, the cloud was the wind's story, and you were my story; yet I never knew whether I am yours.

    A Python crawler for GitHub

    Crawler requirement: scrape high-quality Python-related projects on GitHub. What follows is a test case; it does not scrape much data.

    1. A crawler version with basic functionality

    This example demonstrates batch inserts with pymysql, parsing HTML with BeautifulSoup, and issuing GET requests with the requests library. For more on pymysql, see the blog post: python框架---->pymysql的使用.

    import requests
    import pymysql.cursors
    from bs4 import BeautifulSoup
    
    def get_effect_data(data):
        # Parse the search results HTML and collect one record per repository.
        results = list()
        soup = BeautifulSoup(data, 'html.parser')
        projects = soup.find_all('div', class_='repo-list-item')
        for project in projects:
            writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
            project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
            project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
            update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()
    
            result = (writer_project.split('/')[1], writer_project.split('/')[2], project_language, project_starts, update_desc)
            results.append(result)
        return results
    
    
    def get_response_data(page):
        # Fetch one page of GitHub search results for 'python', sorted by stars descending.
        request_url = 'https://github.com/search'
        params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': page}
        resp = requests.get(request_url, params=params)  # params is sent as the query string
        return resp.text
    
    
    def insert_datas(data):
        # Batch-insert the scraped records into the project_info table.
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='root',
                                     db='test',
                                     charset='utf8mb4',
                                     cursorclass=pymysql.cursors.DictCursor)
        try:
            with connection.cursor() as cursor:
                sql = 'insert into project_info(project_writer, project_name, project_language, project_starts, update_desc) VALUES (%s, %s, %s, %s, %s)'
                cursor.executemany(sql, data)  # one batched round trip for all rows
            connection.commit()
        finally:
            connection.close()  # release the connection whether or not the insert succeeded
    
    
    if __name__ == '__main__':
        total_page = 2  # total number of pages to scrape
        datas = list()
        for page in range(total_page):
            res_data = get_response_data(page + 1)
            data = get_effect_data(res_data)
            datas += data
        insert_datas(datas)
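
    The script assumes a project_info table already exists in the test database. The sketch below creates one whose columns match the insert statement; the column types and lengths are my assumptions, not taken from the original post.

    import pymysql

    # Hypothetical schema for the project_info table used by insert_datas above.
    # Types and lengths are guesses; adjust them to fit your data.
    create_sql = '''
        create table if not exists project_info (
            id int primary key auto_increment,
            project_writer varchar(100),
            project_name varchar(200),
            project_language varchar(50),
            project_starts varchar(20),
            update_desc varchar(200)
        ) default charset = utf8mb4
    '''

    connection = pymysql.connect(host='localhost', user='root', password='root',
                                 db='test', charset='utf8mb4')
    try:
        with connection.cursor() as cursor:
            cursor.execute(create_sql)
        connection.commit()
    finally:
        connection.close()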

    After the script finishes, you can see data like the following in the database (columns: row id, project writer, project name, language, stars, last updated):

    11 tensorflow tensorflow C++ 78.7k Updated Nov 22, 2017
    12 robbyrussell oh-my-zsh Shell 62.2k Updated Nov 21, 2017
    13 vinta awesome-python Python 41.4k Updated Nov 20, 2017
    14 jakubroztocil httpie Python 32.7k Updated Nov 18, 2017
    15 nvbn thefuck Python 32.2k Updated Nov 17, 2017
    16 pallets flask Python 31.1k Updated Nov 15, 2017
    17 django django Python 29.8k Updated Nov 22, 2017
    18 requests requests Python 28.7k Updated Nov 21, 2017
    19 blueimp jQuery-File-Upload JavaScript 27.9k Updated Nov 20, 2017
    20 ansible ansible Python 26.8k Updated Nov 22, 2017
    21 justjavac free-programming-books-zh_CN JavaScript 24.7k Updated Nov 16, 2017
    22 scrapy scrapy Python 24k Updated Nov 22, 2017
    23 scikit-learn scikit-learn Python 23.1k Updated Nov 22, 2017
    24 fchollet keras Python 22k Updated Nov 21, 2017
    25 donnemartin system-design-primer Python 21k Updated Nov 20, 2017
    26 certbot certbot Python 20.1k Updated Nov 20, 2017
    27 aymericdamien TensorFlow-Examples Jupyter Notebook 18.1k Updated Nov 8, 2017
    28 tornadoweb tornado Python 14.6k Updated Nov 17, 2017
    29 python cpython Python 14.4k Updated Nov 22, 2017
    30 reddit reddit Python 14.2k Updated Oct 17, 2017
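
    Note that project_starts is stored exactly as GitHub displays it ('78.7k', '24k'), so it will not sort numerically in SQL. If you need to sort or filter by stars, one option is to normalize the string to an integer before inserting. A small sketch; the helper name is mine, not part of the original script:

    def parse_star_count(text):
        # Convert GitHub's abbreviated star strings to integers.
        # '78.7k' -> 78700, '24k' -> 24000, '987' -> 987
        text = text.strip().lower().replace(',', '')
        if text.endswith('k'):
            return int(float(text[:-1]) * 1000)
        return int(text)

    You could call this on project_starts inside get_effect_data and change the column to an integer type.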

• Original post: https://www.cnblogs.com/huhx/p/usepythongithubspider.html