• Scraping and Analyzing Baidu Tieba Data (1): Scraping Post Information from a Specified Forum


    This tutorial uses the BeautifulSoup library to scrape post information from a specified Tieba forum.

    The code for this tutorial is hosted on GitHub: https://github.com/w392807287/spider_baidu_bar

    For the data analysis part, please see:

    Python version: 3.5.2

    Fetching page information with BeautifulSoup

    Import the required libraries:

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    from urllib.error import HTTPError
    import json, datetime, csv, threading  # used by later snippets in this tutorial
    

    We'll use the Python bar as the example. Its home page is http://tieba.baidu.com/f?ie=utf-8&kw=python&fr=search, or, trimmed down, http://tieba.baidu.com/f?kw=python
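
    As an aside, if the bar name contains non-ASCII characters (a Chinese bar name, say), it has to be percent-encoded to form a valid URL. A minimal sketch using the standard library (the helper name bar_home_url is made up for illustration):

    from urllib.parse import quote

    def bar_home_url(bar_name):
        # Percent-encode the bar name so non-ASCII names produce a valid URL
        return "http://tieba.baidu.com/f?kw=" + quote(bar_name)

    print(bar_home_url("python"))  # -> http://tieba.baidu.com/f?kw=python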

    Get a BeautifulSoup object:

    url = "http://tieba.baidu.com/f?kw=python"
    html = urlopen(url).read()
    bsObj = BeautifulSoup(html, "lxml")
    

     Let's wrap this snippet into a function that takes a URL and returns a BeautifulSoup object:

    def get_bsObj(url):
        '''
        Return a BeautifulSoup object for the given url
        :param url: target URL
        :return: BeautifulSoup object
        '''
        try:
            html = urlopen(url).read()
            bsObj = BeautifulSoup(html, "lxml")
            return bsObj
        except HTTPError as e:
            print(e)
            return None
    

     This function takes a URL and returns a BeautifulSoup object; if an HTTPError occurs it prints the error and returns None.
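
     One caveat: some sites, Tieba included, may serve different markup to clients without a browser-like User-Agent. If plain urlopen gives you unexpected pages, a hedged variant that sends a custom header might look like this (the header string is only an example):

    from urllib.request import Request, urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    def get_bsObj_with_ua(url):
        '''Like get_bsObj, but sends a browser-like User-Agent header.'''
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        try:
            html = urlopen(req).read()
            return BeautifulSoup(html, "lxml")
        except HTTPError as e:
            print(e)
            return None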

     Processing the forum home page

    The forum's home page carries its overview statistics, such as the follower count, the number of topics, and the number of posts. We'll collect these into a file.

    Get the home page's BeautifulSoup object:

    bsObj_mainpage = get_bsObj(url)
    

     Get the total number of pages:

    last_page = int(bsObj_mainpage.find("a",{"class":"last pagination-item "})['href'].split("=")[-1])
    

     We grab the last page's number so that the later crawl does not run past the end. This uses BeautifulSoup's find method.
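
    To make the bounds check concrete: Tieba paginates with a pn query parameter that advances in steps of 50, so the page URLs can be generated like this (a sketch, assuming last_page holds the pn value extracted above):

    # pn advances by 50 per page; stop once we pass the last page's offset
    for pn in range(0, last_page + 1, 50):
        page_url = "http://tieba.baidu.com/f?kw=python&pn=" + str(pn)
        print(page_url)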

    Extract the information we need and write it to a file:

    red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
    subject_sum = red_text[0].get_text()  # number of topics
    post_sum = red_text[1].get_text()     # number of posts
    follow_sum = red_text[2].get_text()   # number of followers

    with open('main_info.txt','w+') as f:
        f.writelines("Collected at: " + str(datetime.datetime.now()) + "\n")
        f.writelines("Topics:    " + subject_sum + "\n")
        f.writelines("Posts:     " + post_sum + "\n")
        f.writelines("Followers: " + follow_sum + "\n")
    

     Finally, wrap these steps into a function that takes the home page URL, writes the info to a file, and returns the last page number:

    def del_mainPage(url):
        '''
        Process the main page; return the last page number
        :param url: home page URL
        :return: last page number as an int
        '''
        bsObj_mainpage = get_bsObj(url)
        last_page = int(bsObj_mainpage.find("a",{"class":"last pagination-item "})['href'].split("=")[-1])
        try:
            red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
            subject_sum = red_text[0].get_text()  # number of topics
            post_sum = red_text[1].get_text()     # number of posts
            follow_sum = red_text[2].get_text()   # number of followers
        except (AttributeError, IndexError) as e:
            print("Error: " + str(e) + " at " + str(datetime.datetime.now()))
            return None
        with open('main_info.txt','w+') as f:
            f.writelines("Collected at: " + str(datetime.datetime.now()) + "\n")
            f.writelines("Topics:    " + subject_sum + "\n")
            f.writelines("Posts:     " + post_sum + "\n")
            f.writelines("Followers: " + follow_sum + "\n")
        return last_page
    

     The result:

    Collected at: 2016-10-07 15:14:19.642933
    Topics:    25083
    Posts:     414831
    Followers: 76511
    

    Getting post URLs from a page

    To get a post's details we have to open the post itself, so we need its URL, for example: http://tieba.baidu.com/p/4700788764

    In this URL, http://tieba.baidu.com is the server address, /p is presumably the route for posts, and /4700788764 is the post's id.
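
    As a quick sketch (assuming post URLs always have this /p/<id> shape), the id can be recovered from the URL itself:

    url = "http://tieba.baidu.com/p/4700788764"
    post_id = url.rstrip("/").split("/")[-1]  # -> '4700788764'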

    Looking at the forum's front page, each post sits in its own block. Pressing F12 in the browser to inspect the markup shows that each post corresponds to one <li> tag, and each page holds 50 posts (the front page may differ because of ad posts).

    A typical <li> tag looks roughly like this:

    <li class=" j_thread_list clearfix" data-field="{"id":4700788764,"author_name":"u6768u5175507","first_post_id":95008842757,"reply_num":2671,"is_bakan":null,"vid":"","is_good":null,"is_top":null,"is_protal":null,"is_membertop":null,"frs_tpoint":null}">
                <div class="t_con cleafix">
            
                        <div class="col2_left j_threadlist_li_left">
                    <span class="threadlist_rep_num center_text" title="回复">2671</span>
                </div>
                    <div class="col2_right j_threadlist_li_right ">
                <div class="threadlist_lz clearfix">
                    <div class="threadlist_title pull_left j_th_tit ">
        
        
        <a href="/p/4700788764" title="恭喜《从零开始学Python》进入百度阅读平台【首页】新书推荐榜单" target="_blank" class="j_th_tit ">恭喜《从零开始学Python》进入百度阅读平台【首页】新书推荐榜单</a>
    </div><div class="threadlist_author pull_right">
        <span class="tb_icon_author " title="主题作者: 杨兵507" data-field="{"user_id":625543823}"><i class="icon_author"></i><span class="frs-author-name-wrap"><a data-field="{"un":"u6768u5175507"}" class="frs-author-name j_user_card " href="/home/main/?un=%E6%9D%A8%E5%85%B5507&ie=utf-8&fr=frs" target="_blank">杨兵507</a></span><span class="icon_wrap  icon_wrap_theme1 frs_bright_icons "></span>    </span>
        <span class="pull-right is_show_create_time" title="创建时间">7-29</span>
    </div>
                </div>
                                <div class="threadlist_detail clearfix">
                        <div class="threadlist_text pull_left">
                                    <div class="threadlist_abs threadlist_abs_onlyline ">
                http://yuedu.baidu.com/ebook/ec1aa9f7b90d6c85ec3ac6d7?fr=index
            </div>
    
                        </div>
    
                        
    <div class="threadlist_author pull_right">
            <span class="tb_icon_author_rely j_replyer" title="最后回复人: 杨兵小童鞋">
                <i class="icon_replyer"></i>
                <a data-field="{"un":"u6768u5175u5c0fu7ae5u978b"}" class="frs-author-name j_user_card " href="/home/main/?un=%E6%9D%A8%E5%85%B5%E5%B0%8F%E7%AB%A5%E9%9E%8B&ie=utf-8&fr=frs" target="_blank">杨兵小童鞋</a>        </span>
            <span class="threadlist_reply_date pull_right j_reply_data" title="最后回复时间">
                14:17        </span>
    </div>
                    </div>
                        </div>
        </div>
    </li>
    

    Note the <li> tag's attributes: class=" j_thread_list clearfix" data-field="{&quot;id&quot;:4700788764,&quot;author_name&quot;:&quot;u6768u5175507&quot;,&quot;first_post_id&quot;:95008842757,&quot;reply_num&quot;:2671,&quot;is_bakan&quot;:null,&quot;vid&quot;:&quot;&quot;,&quot;is_good&quot;:null,&quot;is_top&quot;:null,&quot;is_protal&quot;:null,&quot;is_membertop&quot;:null,&quot;frs_tpoint&quot;:null}"

    Using the class attribute we can find all posts on a single page:

    posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
    

     From the data-field attribute we can read the post ID, the author's name, the reply count, whether it is a featured post, and so on. The post ID alone would let us build the post's URL, though the <a> tag below supplies it directly.
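
    For example, the data-field JSON can be read straight off each <li> (a sketch; BeautifulSoup has already unescaped the &quot; entities by the time we see the attribute):

    import json

    for post in posts:
        field = json.loads(post.attrs["data-field"])
        print(field["id"], field["reply_num"])  # post id and reply count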

    We grab the link and append it to a list:

    post_info = post.find("a", {"class": "j_th_tit "})
    urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
    

     Packaging the code above: given a single page's URL, return the URLs of all posts on it:

    def get_url_from_page(page_url):
        '''
        Process the given page; return the URLs of the posts it contains
        :param page_url: page URL
        :return: URLs of all posts on the page
        '''
        bsObj_page = get_bsObj(page_url)
        urls = []
        try:
            posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
        except AttributeError as e:
            print("Error: " + str(e) + " at " + str(datetime.datetime.now()))
            return urls  # return the empty list rather than crashing below
        for post in posts:
            post_info = post.find("a", {"class": "j_th_tit "})
            urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
        return urls
    

    Processing the information in each post

    Above we obtained the post URLs from each page; next we process the information inside each post. We need to find the useful fields in the post page and store them in a CSV file.

    Again using this URL as an example: http://tieba.baidu.com/p/4700788764

    Opening the link, we immediately see: the post title, the original poster's name, the posting time, the reply count, and so on.

    Let's look at this page's source:

    <div class="l_post j_l_post l_post_bright noborder " data-field="{"author":{"user_id":625543823,"user_name":"u6768u5175507","name_u":"%E6%9D%A8%E5%85%B5507&ie=utf-8","user_sex":2,"portrait":"8f0ae69da8e585b53530374925","is_like":1,"level_id":7,"level_name":"u8d21u58eb","cur_score":445,"bawu":0,"props":null},"content":{"post_id":95008842757,"is_anonym":false,"open_id":"tieba","open_type":"","date":"2016-07-29 19:10","vote_crypt":"","post_no":1,"type":"0","comment_num":0,"ptype":"0","is_saveface":false,"props":null,"post_index":0,"pb_tpoint":null}}">
    

     The same two attributes, class and data-field, appear again; data-field holds most of the post's information: poster id, poster nickname, gender, level id, level name, open_id, open_type, posting date, and more.

    First we define a post class whose attributes hold the post's information and whose method writes it to a CSV file:

    class PostInfo:
        def __init__(self,post_id,post_title,post_url,reply_num,post_date,open_id,open_type,
                     user_name,user_sex,level_id,level_name):
            self.post_id = post_id
            self.post_title = post_title
            self.post_url = post_url
            self.reply_num = reply_num
            self.post_date = post_date
            self.open_id = open_id
            self.open_type = open_type
            self.user_name = user_name
            self.user_sex = user_sex
            self.level_id = level_id
            self.level_name = level_name
    
        def dump_to_csv(self,filename):
            csvFile = open(filename, "a+", newline='')  # newline='' avoids blank rows on Windows
            try:
                writer = csv.writer(csvFile)
                writer.writerow((self.post_id,self.post_title,self.post_url,self.reply_num,self.post_date,self.open_id,
                                 self.open_type,self.user_name,self.user_sex,self.level_id,self.level_name))
            finally:
                csvFile.close()
    

     Then we locate the corresponding information with find:

    obj1 = json.loads(
        bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
    reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
    post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
    
    post_id = obj1.get('content').get('post_id')
    post_url = url
    post_date = obj1.get('content').get('date')
    open_id = obj1.get('content').get('open_id')
    open_type = obj1.get('content').get('open_type')
    user_name = obj1.get('author').get('user_name')
    user_sex = obj1.get('author').get('user_sex')
    level_id = obj1.get('author').get('level_id')
    level_name = obj1.get('author').get('level_name')
    

     Create the object and save it:

    postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,user_sex, level_id, level_name)
    postinfo.dump_to_csv('post_info2.csv')
    

     Strictly speaking you don't need a class to save the data; it's just a personal preference.
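
    For instance, a class-free sketch that appends the same row with the csv module directly (the field order is assumed to match the tuple used by PostInfo):

    import csv

    row = (post_id, post_title, post_url, reply_num, post_date, open_id,
           open_type, user_name, user_sex, level_id, level_name)
    with open('post_info2.csv', 'a+', newline='') as csvFile:
        csv.writer(csvFile).writerow(row)  # one post per call, appended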

    Wrap the code above into a function that processes each post:

    def del_post(urls):
        '''
        Process the posts at the given URLs
        :param urls: list of post URLs
        :return:
        '''
        for url in urls:
            bsObj = get_bsObj(url)
            try:
                obj1 = json.loads(
                    bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
                reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
                post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
            except:
                print("Error at " + str(datetime.datetime.now()) + ": " + url)
                with open('error.txt', 'a+') as f:
                    f.writelines("Error at " + str(datetime.datetime.now()) + ": " + url + "\n")
                continue  # skip this post instead of aborting the rest of the list
            post_id = obj1.get('content').get('post_id')
            post_url = url
            post_date = obj1.get('content').get('date')
            open_id = obj1.get('content').get('open_id')
            open_type = obj1.get('content').get('open_type')
            user_name = obj1.get('author').get('user_name')
            user_sex = obj1.get('author').get('user_sex')
            level_id = obj1.get('author').get('level_id')
            level_name = obj1.get('author').get('level_name')
            postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,
                                user_sex, level_id, level_name)
            postinfo.dump_to_csv('post_info2.csv')
            del postinfo
    

    The result looks something like:

    98773024983,【轰动Python界】的学习速成高效大法,http://tieba.baidu.com/p/4811129571,2,2016-10-06 20:32,tieba,,openlabczx,0,7,贡士
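
    Since the CSV is opened in append mode it never gets a header row. If you want one, a sketch that writes it once before the crawl starts (column names chosen here to mirror PostInfo's fields):

    import csv

    with open('post_info2.csv', 'w', newline='') as f:
        csv.writer(f).writerow(('post_id', 'post_title', 'post_url', 'reply_num',
                                'post_date', 'open_id', 'open_type', 'user_name',
                                'user_sex', 'level_id', 'level_name'))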
    

    Combining the functions above

    First, have the user enter the home page URL of the forum to crawl:

    home_page_url = input("Enter the home page URL of the forum to process: ")
    

     Parse the URL:

    bar_name = home_page_url.split("=")[1].split("&")[0]
    pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="      # page-URL prefix, without the page number
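
    The split-based parsing above assumes kw is the first query parameter. A slightly more robust sketch with the standard library's urllib.parse:

    from urllib.parse import urlparse, parse_qs

    query = parse_qs(urlparse(home_page_url).query)
    bar_name = query["kw"][0]  # works regardless of parameter order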
    

     Process the home page:

    all_post_num = del_mainPage(home_page_url)      # pn offset of the last page, roughly the forum's total post count
    

     Have the user enter how many posts to crawl:

    del_post_num = int(input("How many posts (from the top) should be processed: "))     # number of posts to process
    

     Finally:

    if del_post_num > all_post_num:
        print("Requested post count exceeds the forum's total!")
    else:
        threads = []
        for page in range(0,del_post_num,50):
            print("It's processing page : " + str(page))
            page_url = pre_page_url+str(page)
            urls = get_url_from_page(page_url)
            t = threading.Thread(target=del_post,args=(urls,))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()    # wait for every worker thread, not just the last one
    

    The main block:

    if __name__ == '__main__':
        #home_page_url = input("Enter the home page URL of the forum to process: ")
        home_page_url = test_url
        bar_name = home_page_url.split("=")[1].split("&")[0]
        pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="      # page-URL prefix, without the page number
        all_post_num = del_mainPage(home_page_url)      # pn offset of the last page, roughly the forum's total post count
        del_post_num = int(input("How many posts (from the top) should be processed: "))     # number of posts to process
        if del_post_num > all_post_num:
            print("Requested post count exceeds the forum's total!")
        else:
            threads = []
            for page in range(0,del_post_num,50):
                print("It's processing page : " + str(page))
                page_url = pre_page_url+str(page)
                urls = get_url_from_page(page_url)
                t = threading.Thread(target=del_post,args=(urls,))
                t.start()
                threads.append(t)
            for t in threads:
                t.join()    # wait for every worker thread, not just the last one
    

     The complete code:

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    from urllib.error import HTTPError
    import json
    import datetime
    import csv
    import threading
    
    class PostInfo:
        def __init__(self,post_id,post_title,post_url,reply_num,post_date,open_id,open_type,
                     user_name,user_sex,level_id,level_name):
            self.post_id = post_id
            self.post_title = post_title
            self.post_url = post_url
            self.reply_num = reply_num
            self.post_date = post_date
            self.open_id = open_id
            self.open_type = open_type
            self.user_name = user_name
            self.user_sex = user_sex
            self.level_id = level_id
            self.level_name = level_name
    
        def dump_to_csv(self,filename):
            csvFile = open(filename, "a+", newline='')  # newline='' avoids blank rows on Windows
            try:
                writer = csv.writer(csvFile)
                writer.writerow((self.post_id,self.post_title,self.post_url,self.reply_num,self.post_date,self.open_id,
                                 self.open_type,self.user_name,self.user_sex,self.level_id,self.level_name))
            finally:
                csvFile.close()
    
    def get_bsObj(url):
        '''
        Return a BeautifulSoup object for the given url
        :param url: target URL
        :return: BeautifulSoup object
        '''
        try:
            html = urlopen(url).read()
            bsObj = BeautifulSoup(html, "lxml")
            return bsObj
        except HTTPError as e:
            print(e)
            return None
    
    
    def del_mainPage(url):
        '''
        Process the main page; return the last page number
        :param url: home page URL
        :return: last page number as an int
        '''
        bsObj_mainpage = get_bsObj(url)
        last_page = int(bsObj_mainpage.find("a",{"class":"last pagination-item "})['href'].split("=")[-1])
        try:
            red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
            subject_sum = red_text[0].get_text()  # number of topics
            post_sum = red_text[1].get_text()     # number of posts
            follow_sum = red_text[2].get_text()   # number of followers
        except (AttributeError, IndexError) as e:
            print("Error: " + str(e) + " at " + str(datetime.datetime.now()))
            return None
        with open('main_info.txt','w+') as f:
            f.writelines("Collected at: " + str(datetime.datetime.now()) + "\n")
            f.writelines("Topics:    " + subject_sum + "\n")
            f.writelines("Posts:     " + post_sum + "\n")
            f.writelines("Followers: " + follow_sum + "\n")
        return last_page
    
    def get_url_from_page(page_url):
        '''
        Process the given page; return the URLs of the posts it contains
        :param page_url: page URL
        :return: URLs of all posts on the page
        '''
        bsObj_page = get_bsObj(page_url)
        urls = []
        try:
            posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
        except AttributeError as e:
            print("Error: " + str(e) + " at " + str(datetime.datetime.now()))
            return urls  # return the empty list rather than crashing below
        for post in posts:
            post_info = post.find("a", {"class": "j_th_tit "})
            urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
        return urls
    
    def del_post(urls):
        '''
        Process the posts at the given URLs
        :param urls: list of post URLs
        :return:
        '''
        for url in urls:
            bsObj = get_bsObj(url)
            try:
                obj1 = json.loads(
                    bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
                reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
                post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
            except:
                print("Error at " + str(datetime.datetime.now()) + ": " + url)
                with open('error.txt', 'a+') as f:
                    f.writelines("Error at " + str(datetime.datetime.now()) + ": " + url + "\n")
                continue  # skip this post instead of aborting the rest of the list
            post_id = obj1.get('content').get('post_id')
            post_url = url
            post_date = obj1.get('content').get('date')
            open_id = obj1.get('content').get('open_id')
            open_type = obj1.get('content').get('open_type')
            user_name = obj1.get('author').get('user_name')
            user_sex = obj1.get('author').get('user_sex')
            level_id = obj1.get('author').get('level_id')
            level_name = obj1.get('author').get('level_name')
            postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,
                                user_sex, level_id, level_name)
            postinfo.dump_to_csv('post_info2.csv')
            # t = threading.Thread(target=postinfo.dump_to_csv,args=('post_info2.csv',))
            # t.start()
            del postinfo
    
    test_url = "http://tieba.baidu.com/f?kw=python&ie=utf-8"
    
    if __name__ == '__main__':
        #home_page_url = input("Enter the home page URL of the forum to process: ")
        home_page_url = test_url
        bar_name = home_page_url.split("=")[1].split("&")[0]
        pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="      # page-URL prefix, without the page number
        all_post_num = del_mainPage(home_page_url)      # pn offset of the last page, roughly the forum's total post count
        del_post_num = int(input("How many posts (from the top) should be processed: "))     # number of posts to process
        if del_post_num > all_post_num:
            print("Requested post count exceeds the forum's total!")
        else:
            threads = []
            for page in range(0,del_post_num,50):
                print("It's processing page : " + str(page))
                page_url = pre_page_url+str(page)
                urls = get_url_from_page(page_url)
                t = threading.Thread(target=del_post,args=(urls,))
                t.start()
                threads.append(t)
                #del_post(urls)    # synchronous alternative to the thread above
            for t in threads:
                t.join()    # wait for every worker thread, not just the last one
    

    That's all.

    Feel free to visit my blog: http://liqiongyu.com/blog
