Errors encountered while scraping with Python, and how to fix them:
must be str, not ReadTimeout
must be str, not ConnectionError
429 Too Many Requests
Garbled text (gb2312)
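A note on the "must be str, not ..." messages listed above: in Python 3.6 this is the wording of the TypeError raised when a non-string object is concatenated to a str. A likely cause, assuming the scraper builds its log line with string concatenation (the original code is not shown), is an except block that does something like "error: " + e, which hides the real exception behind a second one. The sketch below formats the exception instead; describe_failure and the flight number passed to it are illustrative only, and the built-in TimeoutError stands in for a real network timeout.

```python
def describe_failure(flight, exc):
    """Build a log line for a failed flight (hypothetical helper)."""
    # "..." + exc would raise TypeError, because exc is not a str;
    # format() converts it to text instead.
    return "Error scraping flight {}: {}: {}".format(
        flight, type(exc).__name__, exc)


try:
    # Stand-in for a request that times out.
    raise TimeoutError("read timed out")
except Exception as exc:
    message = describe_failure("MF1930", exc)

print(message)
```

Logging the exception type and text this way keeps the original cause (timeout, proxy failure, connection reset) visible in the log.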
Error message:
Error scraping flight AS1084
must be str, not ProxyError (the proxy error was not handled)
Solution:
Wrap the request in try/except: print (log) the failed flight, then pass to skip the error and keep scraping.

Error message:
Error scraping flight CA3767
local variable 'ok' referenced before assignment (the variable was read before it was assigned)
Solution:
Declare the variable as global: global ok

Error message:
Flight MF1930 scraped successfully!
must be str, not ReadTimeout (timed out while fetching the page)
content = requests.get(
    'http://happiness.variflight.com/info/detail?fnum',
    proxies=proxies, timeout=30).text
Solution:
On timeout, catch the exception (except: pass) and request the page again.

Error message:
Flight NS8185 scraped successfully!
must be str, not ConnectionError (database connection error)
Solution:
Reconnect to the database, log the flight, and pass to skip this flight record.

Error message:
429 Too Many Requests (error page)
403
502
Solution:
These pages are returned when the site is requested too frequently. Check that the response is a normal page, and scrape it only if it is.

Solution:
unc = stringa.decode("gb2312")  # decode from gb2312 first
print(unc.encode("utf-8"))  # then re-encode as UTF-8
Garbled HTML; this page is encoded as gb2312:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=gb2312">
<TITLE>′í?ó£o?ú?ù???óμ?í??·£¨URL£??T·¨??è?</TITLE>
<STYLE type="text/css"><!--BODY{background-color:#ffffff;font-family:verdana,sans-serif}PRE{font-family:sans-serif}--></STYLE>
</HEAD><BODY>
<H1>′í?ó</H1>
<H2>?ú?ù???óμ?í??·£¨URL£??T·¨??è?</H2>
<HR noshade size="1px">
<P>
μ±3¢ê??áè?ò???í??·£¨URL£?ê±£o
<A HREF="http://happiness.variflight.com/info/detail?fnum=CZ3134&dep=TSN&arr=CAN&date=2017-12-28&type=1">http://happiness.variflight.com/info/detail?fnum=CZ3134&dep=TSN&arr=CAN&date=2017-12-28&type=1</A>
<P>
·¢éúá???áDμ?′í?ó£o
<UL>
<LI>
<STRONG>
Read Error
<BR>
?áè?′í?ó
</STRONG>
</UL>

<P>
?μí3??ó|£o
<PRE><I> (104) Connection reset by peer</I></PRE>

<P>
An error condition occurred while reading data from the network. Please
retry your request.
<BR>
?y?úí¨1yí????áè?êy?Yê±·¢éúá?′í?ó£?????D?3¢ê??£
</P>
<P>±??o′?·t???÷1üàí?±£o<A HREF="mailto:support@chinacache.com">support@chinacache.com</A>
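
The timeout, connection, status-code, and encoding fixes above can be combined into one retry loop. What follows is a minimal sketch under stated assumptions, not the original scraper: fetch_with_retry, fake_fetch, and the retry and delay values are hypothetical. The fetch argument is assumed to be a callable returning (status_code, body_bytes); with requests it could wrap requests.get(url, proxies=proxies, timeout=30) and return (r.status_code, r.content). Catching OSError is enough for that case because requests' exceptions (ReadTimeout, ConnectionError, ProxyError) derive from IOError, which is OSError in Python 3.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=1.0, encoding="gb2312"):
    """Call fetch() until it returns a normal page or retries run out.

    fetch is a hypothetical callable returning (status_code, body_bytes).
    Returns the decoded page text, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            status, body = fetch()
        except OSError as exc:
            # Timeouts, proxy failures, and connection resets end up here.
            print("attempt {} failed: {}: {}".format(
                attempt, type(exc).__name__, exc))
        else:
            if status == 200:
                # Decode the raw bytes with the charset the page declares.
                return body.decode(encoding, errors="replace")
            # 429, 403, 502, ...: not a normal page, so do not scrape it.
            print("attempt {} got HTTP {}".format(attempt, status))
        if attempt < retries:
            time.sleep(delay * attempt)  # back off a little more each time
    return None


# Simulated responses: a timeout, a 429, then a gb2312-encoded page.
responses = iter([
    TimeoutError("read timed out"),
    (429, b"Too Many Requests"),
    (200, "错误".encode("gb2312")),
])


def fake_fetch():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


text = fetch_with_retry(fake_fetch, retries=3, delay=0)
print(text)
```

Decoding the raw bytes explicitly, rather than relying on a guessed encoding, is what avoids the garbled output shown above; a flight that still returns None after all retries can be logged and skipped.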