【Python】Lixiang Forum Post Crawler 1.06


    Yesterday I realized that launching a hundred callbacks/threads at once, as in the previous installment, made the program crash and the results untrustworthy.

    So I decided to do it single-threaded in Python: in theory, running everything on the main thread cannot hit that problem, it just takes longer.

    After writing the program I tested it over a whole midday run: 210 main posts, 11,974 posts, and all 11,974 files were generated.

    Of course, the program did not work on the first attempt. The original version, which put no constraints on the requests settings, crashed after at most about four thousand records. Only after consulting https://blog.csdn.net/shi_weihappy/article/details/51009602 did I learn to set requests.session.keep_alive = False so that resources actually get released. I also added a little retry-on-exception code so that no data is lost. Both changes live in the saveTopicDetail function.
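
    As a rough sketch of those two fixes combined (the helper fetch_with_retry and its parameters are my own illustration, not part of the original script): disable keep-alive as the referenced post suggests, and retry with a pause when a request fails. Sending a Connection: close header is another common way to keep connections from being pooled.

    import time
    import requests

    # Hypothetical helper, not from the original script: fetch a URL with keep-alive
    # disabled and retry a few times when the request raises an exception.
    def fetch_with_retry(url, headers, retries=3, pause=5):
        session = requests.session()
        session.keep_alive = False                 # setting suggested by the referenced post
        session.headers['Connection'] = 'close'    # also ask the server not to hold the connection open
        for attempt in range(retries):
            try:
                return session.get(url, headers=headers, timeout=10)
            except requests.RequestException as e:
                print("attempt", attempt + 1, "failed:", e)
                time.sleep(pause)                  # rest a few seconds before trying again
        return None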

    The code is as follows:

    #------------------------------------------------------------------------------------
    # Lixiang Forum crawler 1.06: crawl the main posts, then their reply posts; save the
    # data to files, which insertDB.py later reads and inserts into the DB
    # April 26, 2018
    #------------------------------------------------------------------------------------
    from bs4 import BeautifulSoup
    import requests
    import threading
    import re
    import time
    import datetime
    import os
    import json
    import colorama
    from colorama import Fore, Back, Style
    colorama.init()
    
    user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    headers={'User-Agent':user_agent}
    
    # Directory where the data files are stored
    folder=""
    
    # Array of main posts
    topics=[]
    
    #------------------------------------
    # Find the main posts in a forum listing page
    # pageUrl: URL of the forum listing page
    #------------------------------------
    def findTopics(pageUrl):
        print("
    开始读取页面"+pageUrl+"的帖子");
    
        try:
            rsp=requests.get(pageUrl,headers=headers)
            soup= BeautifulSoup(rsp.text,'html.parser',from_encoding='gb2312')
    
            for tbodys in soup.find_all('tbody'):
                pageCount=1
                url='none'
                title='none'
    
                for spans in tbodys.find_all('span',class_="forumdisplay"):
                    for link in spans.find_all('a'):
                        if link and link.get("href"): 
                           url="http://www.55188.com/"+link.get("href")
                           title=link.text
                            
                for spans in tbodys.find_all('span',class_="threadpages"):
                    for link in spans.find_all('a'):
                        pageCount=link.text
                    
                if url!='none' and title!='none':
                    topic={'pageCount':pageCount,'url':url,'title':title}
                    topics.append(topic)
    
            #print("读取页面"+pageUrl+"的帖子完毕");
        except Exception as e:
            log("findTopics出现异常:"+str(e),'red')
    
    #------------------------------------
    # Print text to the console in a given color
    # text: the text to print, color: 'red', 'green', or anything else for plain output
    #------------------------------------
    def log(text,color):
        if color=='red':
            print(Fore.RED + text+ Style.RESET_ALL)
        elif color=='green':
            print(Fore.GREEN + text+ Style.RESET_ALL)
        else:
            print(text)
    
    #------------------------------------
    # Find and save the details of a post
    # index: sequence number, url: post URL, title: post title
    #------------------------------------
    def saveTopicDetail(index,url,title):
        infos=[]    # information about the reply posts found
    
        while(len(infos)==0):
            try:
                rsp=requests.get(url,headers=headers)                       
                soup= BeautifulSoup(rsp.text,'html.parser',from_encoding='gb2312')
                # Workaround described above: disable keep-alive so that connections get released
                session = requests.session()
                session.keep_alive = False
                
                for divs in soup.find_all('div',class_="postinfo"):
                    # Use a regex to collapse runs of whitespace into a single space
                    RE = re.compile(r'(\s+)')
                    line=RE.sub(" ",divs.text)
                    arr=line.split(' ')
                    arrLength=len(arr)
    
                    if arrLength==7:
                        info={'楼层':arr[1],
                              '作者':arr[2].replace('只看:',''),
                              '日期':arr[4],
                              '时间':arr[5],'title':title,'url':url}
                        infos.append(info);
                    elif arrLength==8:
                        info={'楼层':arr[1],
                              '作者':arr[2].replace('只看:',''),
                              '日期':arr[5],
                              '时间':arr[6],'title':title,'url':url}
                        infos.append(info);
    
                # Save to file
                filename=folder+"/"+str(index)+'.json'
                with open(filename,'w',encoding='utf-8') as fObj:
                    json.dump(infos,fObj)
            except Exception as e:
                log("findTopicDetail访问"+url+"时出现异常:"+str(e),'red')
                time.sleep(5); # 如果出现异常,休息五秒后再试
                continue;
    
    #------------------------------------
    # Entry function
    # start: first listing page, end: last listing page
    #------------------------------------
    def main(start,end):
        # Create the output directory
        currTime=time.strftime('%H_%M_%S',time.localtime(time.time()))
        global folder
        folder="./"+currTime
        os.makedirs(folder)
        print("目录"+folder+"创建完成")
    
        # Fetch the main posts
        print('\nMain posts will be fetched from the following pages:')
        for i in range(start,end+1):
            pageUrl='http://www.55188.com/forum-8-'+str(i)+'.html' # forum listing page: page 1, page 2, etc.
            findTopics(pageUrl);
    
        n=len(topics)
        log("共读取到:"+str(n)+"个主贴",'green')
    
        # Main posts and their reply pages
        finalTopics=[]
        index=0
        for topic in topics:
            end=int(topic['pageCount'])+1
            title=topic['title']
    
            for i in range(1,end):
            pattern=r'-(\d+)-(\d+)-(\d+)'
                newUrl=re.sub(pattern,lambda m:'-'+m.group(1)+'-'+str(i)+'-'+m.group(3),topic['url'])
                #print(newUrl)
    
                newTopic={'index':index,'url':newUrl,'title':title}
                finalTopics.append(newTopic)
    
                index=index+1
    
        n=len(finalTopics)
        log("共读取到:"+str(n)+"个帖子",'green')
        
        # Iterate over finalTopics
        for newTopic in finalTopics:
            saveTopicDetail(newTopic['index'],newTopic['url'],newTopic['title']);
    
    # Start
    main(1,3)

    The console output was as follows:

    C:\Users\horn1\Desktop\python26>python lixiang.py
    Directory ./11_12_31 created
    
    Main posts will be fetched from the following pages:
    
    Reading posts from page http://www.55188.com/forum-8-1.html
    C:\Users\horn1\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py:146: UserWarning: You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.
      warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")
    
    Reading posts from page http://www.55188.com/forum-8-2.html
    
    Reading posts from page http://www.55188.com/forum-8-3.html
    Read 210 main posts in total
    Read 11974 posts in total
    findTopicDetail: exception while fetching http://www.55188.com/thread-6009271-105-2.html: HTTPConnectionPool(host='www.55188.com', port=80): Max retries exceeded with url: /thread-6009271-105-2.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000016DE6EBA278>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
    findTopicDetail: exception while fetching http://www.55188.com/thread-5938774-339-2.html: ('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond', None, 10060, None))
    findTopicDetail: exception while fetching http://www.55188.com/thread-5938774-583-2.html: HTTPConnectionPool(host='www.55188.com', port=80): Max retries exceeded with url: /thread-5938774-583-2.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000016DE6E448D0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
    findTopicDetail: exception while fetching http://www.55188.com/thread-7065242-77-2.html: HTTPConnectionPool(host='www.55188.com', port=80): Max retries exceeded with url: /thread-7065242-77-2.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000016DE75B49B0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
    
    C:\Users\horn1\Desktop\python26>

    Screenshot of the generated files:

    This time the goal of downloading the data for the specified pages was achieved; the final data set contains 237,496 records.

    The insertDB.py script responsible for inserting the data is as follows:

    # Read the data generated by Lixiang Forum crawler 1.06 and write it into the DB
    import pymysql
    import time
    import datetime
    import os
    import json
    
    # Insert the statements into the database
    def insertDB(sqls):
        conn=pymysql.connect(host='127.0.0.1',user='root',passwd='12345678',db='test',charset='utf8')
        
        sum=0  # total number of successful inserts
    
        
        for sql in sqls:        
            try:
                count=conn.query(sql)    # rows affected by this single statement
            except Exception as e:
                print("sql '"+sql+"' raised an exception: "+str(e))
                continue;
    
            if count==0:
                print(sql+' failed to insert a record')
                
            sum+=count
        
        conn.commit()
        conn.close()
    
        return sum
    
    # Entry function
    def main(folder):
        allinfos=[]
        for filename in os.listdir(folder):
            filePathname=folder+"/"+filename
    
            with open(filePathname,'r',encoding='utf-8') as fObj:
                infos=json.load(fObj)
                allinfos.extend(infos)
    
        sqls=[]
        for info in allinfos:
            sql="insert into test.topic14(floor,author,tdate,ttime,addtime,url,title) values ('"+info['楼层']+"','"+info['作者']+"','"+info['日期']+"','"+info['时间']+"',"+"now(),'"+info['url']+"','"+info['title']+"' "+" )"
            sqls.append(sql)
    
        print("将向数据库插入"+str(len(sqls))+"条记录")
        retval=insertDB(sqls)
        print("已向数据库插入"+str(retval)+"条记录")
    
    # Start
    main("./11_12_31")

    The first run raised an exception, so I added the exception handling shown above. It turned out that one row of garbled text contained a single quote, which broke the concatenated SQL; skipping that statement fixed it. The output is as follows (a parameterized-insert sketch that avoids this quoting problem appears after the output):

    C:\Users\horn1\Desktop\python26>python insertDB.py
    About to insert 237496 records into the database
    sql 'insert into test.topic14(floor,author,tdate,ttime,addtime,url,title) values ('7155Â¥','Ö»¿´£ºÒ»¾ÅÎÞÏÞ'','2011-4-1','21:53',now(),'http://www.55188.com/thread-3980834-358-3.html',' Ïþ¸çµÄͶ×ÊÉúÑÄ:³¬³¤ÏßʵÅÌÌù '  )' raised an exception: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '2011-4-1','21:53',now(),'http://www.55188.com/thread-3980834-358-3.html',' Ïþ' at line 1")
    Inserted 237495 records into the database
    
    C:\Users\horn1\Desktop\python26>
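
    Since the failure came from a quote inside a concatenated string, a parameterized insert through a cursor would sidestep the problem entirely: the driver escapes the values itself. A minimal sketch under that assumption (only the topic14 column list is taken from the SQL above; the helper insert_infos is illustrative):

    import pymysql

    # Sketch: insert with placeholders instead of string concatenation, so a stray
    # single quote in the scraped text cannot break the SQL statement.
    def insert_infos(infos):
        conn = pymysql.connect(host='127.0.0.1', user='root', passwd='12345678',
                               db='test', charset='utf8')
        sql = ("insert into test.topic14(floor,author,tdate,ttime,addtime,url,title) "
               "values (%s,%s,%s,%s,now(),%s,%s)")
        with conn.cursor() as cursor:
            for info in infos:
                # pymysql escapes every value, quotes included, before sending the query
                cursor.execute(sql, (info['楼层'], info['作者'], info['日期'],
                                     info['时间'], info['url'], info['title']))
        conn.commit()
        conn.close()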

    The database also ended up with the data:

    At this point the cause of the program crashes has been identified: control the concurrency and make sure sessions are released, and the program can run to completion stably.

    But the garbled-text problem remains, so the Lixiang Forum crawler still needs more refinement.
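
    One possible direction for the mojibake (an untested assumption on my part, not something verified in this run): set the response encoding to GBK explicitly before parsing, instead of passing from_encoding, which the warning in the console output says is ignored for Unicode markup. A minimal sketch:

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}
    url = 'http://www.55188.com/forum-8-1.html'    # any forum page from the crawler

    rsp = requests.get(url, headers=headers)
    rsp.encoding = 'gbk'                           # decode explicitly; gb2312 pages decode fine as GBK
    soup = BeautifulSoup(rsp.text, 'html.parser')  # rsp.text is now proper Unicode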

    April 26, 2018, 14:00
