• 爬虫综合大作业


    作业要求来自https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3159

    可以用pandas读出之前保存的数据:见上次博客爬取全部的校园新闻并保存csv

    一.把爬取的内容保存到数据库sqlite3

    import sqlite3
    with sqlite3.connect('gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnews',con = db)

    with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df2 = pd.read_sql_query('SELECT * FROM gzccnews',con=db)

    保存到MySQL数据库

      • import pandas as pd
      • import pymysql
      • from sqlalchemy import create_engine
      • conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
      • engine = create_engine(conInfo,encoding='utf-8')
      • df = pd.DataFrame(allnews)
      • df.to_sql(name = ‘news', con = engine, if_exists = 'append', index = False)
      • 二.爬虫综合大作业
        1. 选择一个热点或者你感兴趣的主题。
        2. 选择爬取的对象与范围。
        3. 了解爬取对象的限制与约束。
        4. 爬取相应内容。
        5. 做数据分析与文本分析。
        6. 形成一篇文章,有说明、技术要点、有数据、有数据分析图形化展示与说明、文本分析图形化展示与说明。
        7. 文章公开发布。
          1. 我爬取的主题是虎扑英雄联盟专区,作为一名jr,平时比较喜欢在虎扑这个软件。
          2. 主要代码:

          3. 生成爬虫的函数
            
            def creat_bs(url):
                result = requests.get(url)
                e=chardet.detect(result.content)['encoding']
                #set the code of request object to the webpage's code
                result.encoding=e
                c = result.content
                soup =BeautifulSoup(c,'lxml')
                return soup
            
            接着构建所要获取网页的集合函数:
            
            def build_urls(prefix,suffix):
                urls=[]
                for item in suffix:
                    url=prefix+item
                    urls.append(url)
                return urls
            爬取函数:
            
            
            def find_title_link(soup):
                titles=[]
                links=[]
                try:
                    contanier=soup.find('div',{'class':'container_padd'})
                    ajaxtable=contanier.find('form',{'id':'ajaxtable'})
                    page_list=ajaxtable.find_all('li')
                    for page in page_list:
                        titlelink=page.find('a',{'class':'truetit'})
                        if titlelink.text==None:
                            title=titlelink.find('b').text
                        else:
                            title=titlelink.text
                        if np.random.uniform(0,1)>0.90:
                            link=titlelink.get('href')
                            titles.append(title)
                            links.append(link)
                except:
                    print('have no value')
                return titles,links
            数据并保存:
            
            
            wordlist=str()
            for title in title_group:
                wordlist+=title
            
            for reply in reply_group:
                wordlist+=reply
            
            def savetxt(wordlist):
                f=open('wordlist.txt','wb')
                f.write(wordlist.encode('utf8'))
                f.close()
            savetxt(wordlist)
            词云图的制作:
            
            
            import jieba
            jieba.load_userdict('user_dict.txt')
            wordlist_af_jieba=jieba.cut_for_search(wordlist)
            wl_space_split=' '.join(wordlist_af_jieba)
            
            from wordcloud import WordCloud,STOPWORDS
            import matplotlib.pyplot as plt
            stopwords=set(STOPWORDS)
            fstop=open('stopwords.txt','r')
            for eachWord in fstop:
                stopwords.add(eachWord.decode('utf-8'))
            
            wc=WordCloud(font_path=r'C:WindowsFontsSTHUPO.ttf', background_color='black',max_words=200,width=700,height=1000,stopwords=stopwords,max_font_size=100,random_state=30)
            wc.generate(wl_space_split)
            wc.to_file('hupu_pubg2.png')
            plt.imshow(wc,interpolation='bilinear')
            plt.axis('off')

            爬取的帖子截图:

            词云图:

  • 相关阅读:
    二级联动选择框的实现
    vimperator
    Ipan笔记-2
    git的一些补充点
    联想云部署的笔记心得
    关于vim的折叠
    ipan笔记
    php中浮点数计算问题
    Chrome 控制台报错Unchecked runtime.lastError: The message port closed before a response was received
    PHP-redis中文文档
  • 原文地址:https://www.cnblogs.com/lenkay/p/10836278.html
Copyright © 2020-2023  润新知