• A bug in Python's SGMLParser??


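    First, some context: I'm using Python 2.7 and am just learning the language. Today I wanted to scrape some images to use as wallpapers, but when I used SGMLParser to parse the <img> tags, the part I wanted was not picked up at all. I tried repairing the page with lxml, but the tags still were not parsed. Yet when I pulled that fragment out on its own, the tags in it parsed fine. I really can't solve this one, so I'm leaving the question here and will match with regular expressions for now. If anyone more experienced has run into this problem, please let me know. My email is 781512880@qq.com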

    This is the source site: http://mcyacg.com/m60948/

    <div class="quote"><blockquote><font size="5"><font color="Pink">P站引言:</font></font>正好可以放在手心里的红色果实——“苹果”。红色果实切开后看起来像心形的苹果,在白雪公主、亚当与夏娃等故事中都是作为诱惑的象征登场。艳丽的红色外皮包裹着香甜多汁的清脆果实,或许有着一种不可思议的魅力吧。<br/>
    今天,就为大家送上描绘了“苹果”的插画作品特辑。敬请欣赏这些仿佛能听见咬下新鲜苹果时的清脆声音的插画作品。</blockquote></div><br/>
    <a href="http://pan.baidu.com/s/1mhJ4ti4" target="_blank"><font size="5">下载</font></a><br/>
    <img id="aimg_IPKD8" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/08859a09d120090cfff30152010130c7.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_Sv2k2" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/db3a060b649a422a701dd47982f9cbe5.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_Ob8zD" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/8fd5b9b5f4706b17e71c00939c75f648.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_MWOuz" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/7b5f4a94fff33ea1a7cac45131f2ba41.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_nG9jr" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/4c0a17365342ef700c68c4e4caada0e0.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_J790D" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/a1f2a3486ce679f007abea46782a33b7.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_v6mTz" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/8d37e5d40f34c03180080135e8757bc8.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_wzFQq" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/7feed5d205b6811d5d1366dd495a0760.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_xWlS5" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/413f8d116e31174451032abfa72c1246.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_O8T03" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/f654b77544edfeee9cda4d069e704c90.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_PfGhH" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/e6fb8b6eafd5ef5dea2a13a284ba8309.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_tZEBu" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/46d344876774e7dcb059ef84e2fc70f7.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_PnP6y" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i1.qlshw.org/9c6ab03cffb678a0945dcb0da127ea63.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_H01fi" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i1.qlshw.org/d0bf9d03f427a730b40a29bfebc9697a.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_j1pqX" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/36074139da9039d1d4c0d1042f6b1b8c.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_xaHP0" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/a40f503868cdd657531cbea34adf55e6.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_wE44O" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i1.qlshw.org/0a5d24a51f5c4ad0041d8feec5b5fe9a.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_T50cd" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/b9f666b571221894bd4d922d369fce5d.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_C4o7y" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/114b3eff9da458cfbfb52d08160fd30a.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <img id="aimg_m2c3s" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/ba3265f867650ef35891f6cb09a7a196.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/>
    <br/>

    This is the part I want to extract.
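    The regular-expression fallback mentioned above could look roughly like the sketch below. It is only an illustration: the helper name `extract_image_urls` and the pattern are assumptions, the two sample tags are shortened copies of the ones listed above, and the pattern relies on the image URL sitting in a double-quoted `file` attribute, as it does in this page.

```python
import re

# Assumed pattern: an <img ...> tag whose attributes include file="...".
# [^>]*? keeps the match inside a single tag; \b avoids matching e.g. profile=.
IMG_FILE_RE = re.compile(r'<img\b[^>]*?\bfile="([^"]+)"', re.IGNORECASE)


def extract_image_urls(html):
    """Return the file attribute of every <img> tag found in html."""
    return IMG_FILE_RE.findall(html)


# Shortened copies of two tags from the page (some attributes dropped).
sample = (
    '<img id="aimg_IPKD8" class="zoom" '
    'file="http://i4.qlshw.org/08859a09d120090cfff30152010130c7.jpg" '
    'lazyloadthumb="1" border="0" alt=""/><br/>\n'
    '<img id="aimg_Sv2k2" class="zoom" '
    'file="http://i4.qlshw.org/db3a060b649a422a701dd47982f9cbe5.jpg" '
    'lazyloadthumb="1" border="0" alt=""/><br/>'
)

urls = extract_image_urls(sample)
for url in urls:
    print(url)
```

    A regex like this ignores the surrounding <div> structure entirely, so it would also pick up any other <img> on the page that carries a `file` attribute.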

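    from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3


    class LableParse(SGMLParser):
        def reset(self):
            SGMLParser.reset(self)
            self.level = 0        # div nesting depth, counted from the target div
            self.flag = False     # True while inside <div class="pct">
            self.picturesrc = []

        def start_div(self, attrs):
            if self.flag:  # entering a child div: level + 1
                self.level += 1
            elif ('class', 'pct') in attrs:
                self.flag = True
                self.level = 1  # count the target div itself

        def end_div(self):
            if self.flag:  # leaving a div inside the target: level - 1
                self.level -= 1
                if self.level == 0:  # the target div itself has just closed
                    self.flag = False

        def start_img(self, attrs):
            # if self.flag:  # check disabled while debugging
            for k, v in attrs:
                print('{%s : %s}' % (k, v))


    if __name__ == '__main__':
        lp = LableParse()
        with open('source.txt') as f:
            lp.feed(f.read())
        lp.close()  # flush any data still buffered in the parser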

    This is my class, which subclasses SGMLParser.
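    For comparison, the same div-depth tracking can be sketched with `html.parser.HTMLParser` from the Python 3 standard library, where sgmllib is no longer available. This is a sketch under assumptions, not a confirmed fix: the class name `ImgFileCollector` and the sample markup are made up for illustration, and it has not been run against the real page, so it may or may not hit the same problem there.

```python
from html.parser import HTMLParser  # Python 3 standard library


class ImgFileCollector(HTMLParser):
    """Collect the file attribute of <img> tags inside <div class="pct">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0   # div nesting depth, counted from the target div
        self.files = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also arrive here, because the
        # default handle_startendtag calls handle_starttag and handle_endtag.
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth > 0 or attrs.get('class') == 'pct':
                self.depth += 1
        elif tag == 'img' and self.depth > 0 and attrs.get('file'):
            self.files.append(attrs['file'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1


# Made-up sample: one image inside the target div, one outside it.
sample = (
    '<div class="pct">'
    '<div class="quote"><blockquote>intro</blockquote></div><br/>'
    '<img id="aimg_IPKD8" class="zoom" '
    'file="http://i4.qlshw.org/08859a09d120090cfff30152010130c7.jpg" alt=""/><br/>'
    '</div>'
    '<img file="http://example.com/outside.jpg"/>'
)

collector = ImgFileCollector()
collector.feed(sample)
collector.close()
print(collector.files)
```

    Only the image nested in the `pct` div is collected here; the depth counter drops back to zero when that div closes, so the trailing image is skipped.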

  • Original post: https://www.cnblogs.com/liyinggang/p/6106951.html