• Scraping Douban movie reviews -- Maze Runner 3 (移动迷宫3)


    Preparation:

    1. Log in to Douban, find Maze Runner 3 (移动迷宫3), and grab the comments URL: https://movie.douban.com/subject/26004132/comments?status=P

    2. While logged in, open the browser's developer tools and copy the User-Agent

    agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
    headers = {
        "Host": "www.douban.com",
        "Referer": "https://www.douban.com/",
        'User-Agent': agent,
    }

    3. Fill in your account credentials and POST them to the server to complete the login (a sketch of the request follows the post_data below)

        post_data = {
            "source": "index_nav",
            'form_email': acount,
            'form_password': secret
        }
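
    A minimal sketch of sending that POST, assuming acount and secret hold your e-mail and password, that headers and post_data are defined as above, and that the old form-login endpoint https://accounts.douban.com/login is used with a requests.Session (the endpoint and the Session are my assumptions, not shown in the original):

        import requests

        session = requests.Session()   # the Session keeps the login cookies for the later comment requests
        login_url = 'https://accounts.douban.com/login'   # assumed form-login endpoint
        r = session.post(login_url, data=post_data, headers=headers)
        print(r.status_code)   # 200 means the request went through; Douban may still ask for a captcha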

    Writing the code:

    1. Scrape the comments

     Scrape every comment on the current page (soup is the BeautifulSoup parse of the page, f an open output file; the excerpt runs inside the page-fetching loop sketched after it):
            result = soup.find_all('div', {'class': 'comment'})  # every short-comment block on this page
            for item in result:
                s = str(item)
                count2 = s.find('<p class="">')   # the comment text sits between <p class=""> and </p>
                count3 = s.find('</p>')
                s2 = s[count2 + 12:count3]        # slice out the comment text (12 = len('<p class="">'))
                if 'class' not in s2:             # skip slices that still contain markup
                    f.write(s2)

            # Get the link to the next page
            next_url = soup.find_all('div', {'id': 'paginator'})
            pattern3 = r'href="(.*?)">后页'   # regex for the href of the 后页 (next page) link
            if len(next_url) == 0:
                break
            next_url = re.findall(pattern3, str(next_url[0]))
            if len(next_url) == 0:   # no next-page link: last page reached, exit the loop
                break
            next_url = next_url[0]
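
    For context, a sketch of the fetch loop the excerpt above runs inside; the session from the login step, the comment.txt output file, the urljoin handling of the relative 后页 link, and the 2-second delay are my assumptions rather than code from the original post:

            import re
            import time
            from urllib.parse import urljoin
            from bs4 import BeautifulSoup

            start_url = 'https://movie.douban.com/subject/26004132/comments?status=P'
            next_url = start_url
            with open('comment.txt', 'w', encoding='utf-8') as f:      # assumed output file name
                while True:
                    html = session.get(next_url, headers={'User-Agent': agent}).text
                    soup = BeautifulSoup(html, 'html.parser')

                    # same extraction as the excerpt above
                    for item in soup.find_all('div', {'class': 'comment'}):
                        s = str(item)
                        count2 = s.find('<p class="">')
                        count3 = s.find('</p>')
                        s2 = s[count2 + 12:count3]
                        if 'class' not in s2:
                            f.write(s2 + '\n')

                    # follow the 后页 (next page) link until there is none
                    paginator = soup.find_all('div', {'id': 'paginator'})
                    if len(paginator) == 0:
                        break
                    nxt = re.findall(r'href="(.*?)">后页', str(paginator[0]))
                    if len(nxt) == 0:
                        break
                    next_url = urljoin(start_url, nxt[0].replace('&amp;', '&'))   # the href is relative to the comments URL
                    time.sleep(2)   # assumed: small pause so the crawler is polite to the server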

    2. Save the scraped comments to a txt file (only a sample of the comments is shown); a quick check of the saved file is sketched below
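
    Assuming the comments were saved to comment.txt as in the sketch above, a quick way to eyeball the first few lines:

        with open('comment.txt', encoding='utf-8') as f:
            for _ in range(3):
                print(f.readline().strip())   # print the first few saved comments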

    3. Use jieba to segment the text into words:

     import codecs
     import jieba

     # dirs is the path to the comments txt file saved in step 2
     with codecs.open(dirs, encoding='utf-8') as f:
         comment_text = f.read()
     cut_text = " ".join(jieba.cut(comment_text))   # space-separated segmented words
     with codecs.open('pjl_jieba.txt', 'w', encoding='utf-8') as f:
         f.write(cut_text)

    4. Count how often each keyword appears:

        import codecs
        import jieba

        # file_name is the path to the comments txt file
        word_lists = []
        with codecs.open(file_name, 'r', encoding='utf-8') as f:
            Lists = f.readlines()
            for li in Lists:
                cut_list = list(jieba.cut(li))
                for word in cut_list:
                    word_lists.append(word)

        word_lists_set = list(set(word_lists))   # deduplicated words

        sort_count = []
        for w in word_lists_set:
            # one "word:count" entry per line
            sort_count.append(w + u':' + str(word_lists.count(w)) + u'\n')

        with codecs.open('count_word.txt', 'w', encoding='utf-8') as f:
            f.writelines(sort_count)
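
    As a side note, the same counts can be produced with collections.Counter, which also sorts them by frequency (something the sort_count list above never actually does); this is an alternative sketch, not the original author's code:

        import codecs
        import collections

        import jieba

        with codecs.open(file_name, 'r', encoding='utf-8') as f:
            counter = collections.Counter(word for line in f for word in jieba.cut(line))

        with codecs.open('count_word.txt', 'w', encoding='utf-8') as f:
            # most_common() yields (word, count) pairs sorted by descending count
            f.writelines(u'{}:{}\n'.format(word, count) for word, count in counter.most_common())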

    Some of the word counts are shown below:

    The arduous process of getting the crawler to actually crawl:

    With a Python 3.7 runtime installed (not sure whether the problem was macOS or Python itself), scipy refused to install and reported:

    Command "python setup.py egg_info" failed with error code 1 in /...

    (Reinstalling with Python 3.6 solved the problem.)

    Results:

     

  • Original post: https://www.cnblogs.com/wollen-zwt/p/8877908.html