• How do you parse IP addresses out of a complex (obfuscated) page?


    import re
    
    import lxml.html
    
    test_data = """
    <head><style>
    .HRAR{display:none}
    .QMMO{display:none}
    .DYWL{display:inline}
    .KZGR{display:inline}
    </style></head>
    <body>Scrape the 10 IP addresses below<br>
    <span style="display:none">128</span>
    <div style="display:none">54</div>
    <span style="display:none">38</span>
    <span class="QMMO">220</span>
    <span style="display:none">.</span>
    <span class="QMMO">107</span>
    12
    <span style="display:none">.</span>
    <div style="display:none">99</div>
    <span style="display:none">75</span>
    <div style="display:none">.</div>
    <span style="display:none">79</span>
    <span class="QMMO">.</span>
    <span style="display:none">.</span>
    .
    <span style="display:inline">82</span>
    <div style="display:none">196</div>
    <span style="display:inline">.</span>
    <span style="display:none">74</span>
    <span class="QMMO">179</span>
    141
    <span style="display:none">.</span>
    .
    <span style="display:none">.</span>
    <span style="display:none">180</span>
    <div style="display:none">162</div>
    <span class="NUMH">45</span>
    <br>
    <div style="display:none">196</div>
    <span class="HRAR">.</span>
    <div style="display:none">119</div>
    <div style="display:none">157</div>
    <span class="QMMO">188</span>
    <span class="HRAR">222</span>
    <span class="HRAR">.</span>
    <span class="QMMO">37</span>
    <div style="display:none">.</div>
    <span class="NUMH">165</span>
    <span style="display:none">25</span>
    <span class="HRAR">79</span>
    <div style="display:none">154</div>
    <span style="display:none">.</span>
    <div style="display:none">11</div>
    <span class="HRAR">61</span>
    .
    <span class="EIQR">239</span>
    <div style="display:none">102</div>
    <span style="display:none">.</span>
    <span style="display:none">.</span>
    <div style="display:none">41</div>
    <div style="display:none">193</div>
    <span style="display:inline">.</span>
    233
    .
    110
    <br>
    <span class="QMMO">.</span>
    <span style="display:none">3</span>
    <span style="display:none">.</span>
    140
    <span class="QMMO">127</span>
    .
    <span style="display:none">.</span>
    <div style="display:none">202</div>
    <span class="DYWL">7</span>
    <span style="display:none">148</span>
    <span class="HRAR">219</span>
    <div style="display:none">.</div>
    <span class="QMMO">.</span>
    <div style="display:none">.</div>
    <div style="display:none">.</div>
    <div style="display:none">136</div>
    <span class="QMMO">230</span>
    <div style="display:none">183</div>
    .
    <span style="display:none">242</span>
    <span style="display:none">.</span>
    <span class="QMMO">57</span>
    <span style="display:none">.</span>
    <span style="display:none">.</span>
    <span style="display:inline">190</span>
    <span class="EIQR">.</span>
    5
    <br>
    <div style="display:none">.</div>
    <span class="HRAR">250</span>
    <div style="display:none">179</div>
    <div style="display:none">106</div>
    <span style="display:none">18</span>
    <span class="YMXL">151</span>
    <span style="display:none">.</span>
    <div style="display:none">73</div>
    <span class="HRAR">91</span>
    <span class="DYWL">.</span>
    <span class="HRAR">201</span>
    <span style="display:none">.</span>
    <span class="QMMO">.</span>
    <span style="display:none">.</span>
    <div style="display:none">86</div>
    <span style="display:inline">39</span>
    <span style="display:none">.</span>
    <span style="display:none">.</span>
    <span class="HRAR">85</span>
    <span class="QMMO">215</span>
    <span class="QMMO">.</span>
    <span class="HRAR">232</span>
    <span class="YMXL">.</span>
    <div style="display:none">234</div>
    <span style="display:inline">243</span>
    <span style="display:inline">.</span>
    <span style="display:inline">210</span>
    <br>
    <span style="display:none">.</span>
    <span class="HRAR">.</span>
    <span class="QMMO">185</span>
    <div style="display:none">119</div>
    <span class="HRAR">51</span>
    <span class="HRAR">90</span>
    <span class="QMMO">229</span>
    <span class="BOZY">64</span>
    <span style="display:none">256</span>
    <span class="HRAR">.</span>
    <span class="HRAR">207</span>
    <span class="HRAR">99</span>
    <span style="display:none">177</span>
    <span class="HRAR">161</span>
    <div style="display:none">55</div>
    <span style="display:none">.</span>
    <span class="HRAR">252</span>
    <div style="display:none">.</div>
    <div style="display:none">106</div>
    <span class="HRAR">189</span>
    <span class="HRAR">12</span>
    .
    96
    .
    <span class="GNTR">36</span>
    <span style="display:inline">.</span>
    <span class="GNTR">50</span>
    <br>
    <span style="display:none">211</span>
    52
    <span style="display:none">158</span>
    <span class="HRAR">.</span>
    <span class="HRAR">167</span>
    <span style="display:none">209</span>
    <span style="display:none">57</span>
    <span class="HRAR">24</span>
    <span style="display:none">.</span>
    <span class="QMMO">143</span>
    .
    <span style="display:none">57</span>
    <div style="display:none">.</div>
    <span class="HRAR">23</span>
    <div style="display:none">.</div>
    156
    <span style="display:none">29</span>
    <span class="GNTR">.</span>
    <div style="display:none">80</div>
    <span class="QMMO">.</span>
    <span style="display:inline">142</span>
    <span class="HRAR">.</span>
    <div style="display:none">.</div>
    <span class="HRAR">248</span>
    <span style="display:none">.</span>
    .
    <span class="DYWL">254</span>
    <br>
    <span style="display:none">.</span>
    <span style="display:none">26</span>
    <div style="display:none">164</div>
    <div style="display:none">.</div>
    <span style="display:none">.</span>
    <div style="display:none">102</div>
    <span style="display:none">.</span>
    <span style="display:none">96</span>
    <span class="QGZL">153</span>
    <span class="HRAR">229</span>
    <span class="QMMO">85</span>
    <span style="display:none">130</span>
    <div style="display:none">114</div>
    .
    <span style="display:inline">4</span>
    <span style="display:inline">.</span>
    <span class="YMXL">162</span>
    <span style="display:none">.</span>
    <span class="HRAR">232</span>
    <div style="display:none">226</div>
    <span class="QMMO">.</span>
    <span class="HRAR">.</span>
    <span style="display:none">142</span>
    <div style="display:none">46</div>
    .
    52
    <span class="HRAR">203</span>
    <br>
    <span style="display:none">.</span>
    <span class="QMMO">33</span>
    <div style="display:none">29</div>
    232
    <span class="QMMO">.</span>
    <div style="display:none">85</div>
    <span class="QMMO">69</span>
    <span style="display:none">245</span>
    <span class="HRAR">.</span>
    <div style="display:none">169</div>
    <span style="display:none">199</span>
    <span class="HRAR">23</span>
    <span style="display:none">.</span>
    <span class="QMMO">.</span>
    <div style="display:none">88</div>
    <span style="display:none">10</span>
    <span class="QGZL">.</span>
    <span class="QMMO">.</span>
    <div style="display:none">.</div>
    <div style="display:none">.</div>
    <div style="display:none">240</div>
    <span style="display:none">245</span>
    <span class="YMXL">10</span>
    .
    <span style="display:inline">72</span>
    <span class="BOZY">.</span>
    <span class="KZGR">169</span>
    <br>
    <div style="display:none">206</div>
    <div style="display:none">239</div>
    <span class="HRAR">218</span>
    <div style="display:none">97</div>
    <span class="HRAR">106</span>
    <span class="QMMO">.</span>
    <span class="QMMO">140</span>
    <span style="display:none">144</span>
    <span class="HRAR">126</span>
    <div style="display:none">.</div>
    127
    <div style="display:none">.</div>
    <span style="display:none">120</span>
    <span style="display:none">209</span>
    <span class="BOZY">.</span>
    <span style="display:none">179</span>
    <span class="HRAR">.</span>
    <span style="display:inline">3</span>
    <div style="display:none">.</div>
    <span class="QMMO">198</span>
    <div style="display:none">169</div>
    <span style="display:none">.</span>
    <span style="display:none">37</span>
    <span class="EIQR">.</span>
    <span style="display:inline">31</span>
    <span style="display:inline">.</span>
    61
    <br>
    <div style="display:none">37</div>
    76
    <span style="display:none">94</span>
    <span style="display:none">.</span>
    .
    <span class="HRAR">109</span>
    <span style="display:inline">17</span>
    <div style="display:none">.</div>
    <span class="HRAR">232</span>
    <span class="HRAR">247</span>
    <span class="HRAR">136</span>
    <span class="HRAR">67</span>
    <span class="HRAR">49</span>
    <div style="display:none">194</div>
    <span class="QGZL">.</span>
    <span class="QMMO">159</span>
    <span class="QMMO">.</span>
    <div style="display:none">81</div>
    <span style="display:inline">39</span>
    <span style="display:none">29</span>
    <span style="display:inline">.</span>
    <span style="display:none">202</span>
    30
    <div style="display:none">89</div>
    <span class="HRAR">242</span>
    <span style="display:none">138</span>
    <span class="HRAR">62</span><body>
    
            """
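    '''
    XPath quick reference:
    /   select from the root node
    //  select matching nodes anywhere below the current node, at any depth
    *   wildcard, matches any element
    //div/book[1]/title  the title of the first book element under each div
    //div/book/title[@lang="zh"]  title elements whose lang attribute equals "zh"
    //div/book/title, //book/title and //title return the same nodes when every title sits under div/book
    //book/title/@*  all attribute values of title
    //book/title/text()  the text content of title, via the text() node test
    //a[@href="link1.html" and @id="places_neighbours_row"]  a elements matching both attribute conditions
    //div/book[last()]/title/text()  the title text of the last book element
    //div/book[price>39]/title/text()  title text of books whose price child is greater than 39
    //li[starts-with(@class,'item')]/a/text()  text of a elements under li elements whose class starts with 'item'
    '''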
    
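    # Parsing
    def analysis_content(test_data):
        """
        Parse the page source and extract the visible IP addresses.
        :param test_data: HTML source of the page as a string
        :return: None; the IP list and its length are printed

        Typical steps when using the re module:
        1. Compile the regular expression string into a Pattern object with compile().
        2. Use the Pattern object's methods to search the text; match() and search() return a Match object.
        3. Use the Match object's attributes and methods to read the result and process it further.

        Commonly used Pattern methods:
        match: match once, only at the start of the string
        search: match once, scanning from any position
        findall: find all matches, returned as a list
        finditer: find all matches, returned as an iterator
        split: split the string, returned as a list
        sub: replace matches
        """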
        # Collect the class names that the <style> block hides, e.g. ['HRAR', 'QMMO']
        pattern = re.compile(r'\.([A-Z]+)\{display:none\}')
        class_none_list = pattern.findall(test_data)  # all matches, returned as a list
        # Drop every <div ...> line (all divs in this page are hidden)
        pattern_div = re.compile(r'<div\s.*')
        t = pattern_div.sub("", test_data)
        # Drop spans hidden by an inline style
        pattern_span_none = re.compile(r'<span\sstyle="display:none">.*?</span>')
        t1 = pattern_span_none.sub("", t)
        # Drop spans whose class is one of the two hidden classes
        pattern_class_none1 = re.compile(r'<span\sclass="' + class_none_list[0] + r'">.*</span>')
        t2 = pattern_class_none1.sub("", t1)
        pattern_class_none2 = re.compile(r'<span\sclass="' + class_none_list[1] + r'">.*</span>')
        t3 = pattern_class_none2.sub("", t2)
        # Join everything onto one line before parsing
        html = lxml.html.fromstring(t3.replace("\n", ""))
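        # All text nodes under <body>, in document order
        html_data = html.xpath('//body/descendant-or-self::text()')
        tt = ""
        lt = []
        # Skip the first text node (the heading line), then glue the fragments together
        for i in html_data[1:]:
            # Three dots and no trailing dot mean the current address is complete
            if tt.count('.') == 3 and tt[-1] != '.':
                lt.append(tt)
                tt = ""
            tt = tt + i
        lt.append(tt)
        print(lt)  # print the IP addresses
        print(len(lt))  # print the number of addresses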
    
    
    analysis_content(test_data)
    Output: ['12.82.141.45', '165.239.233.110', '140.7.190.5', '151.39.243.210', '64.96.36.50', '52.156.142.254', '153.4.162.52', '232.10.72.169', '127.3.31.61', '76.17.39.30        ']
    10
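
The regex substitutions above assume exactly two hidden classes and one tag per line. As a hedged alternative, not the original author's method, the sketch below swaps lxml and `re.sub` for the standard library's `html.parser`. It walks the tags, skips anything inside a hidden `span`/`div`, and starts a new candidate address at every `<br>`. The names `VisibleTextParser`, `extract_ips` and `sample` are hypothetical, and it only recognizes `display:none` set by an inline style or by a simple one-property class rule.

```python
import re
from html.parser import HTMLParser

# Class names hidden by simple rules such as ".HRAR{display:none}" (assumed rule shape).
HIDDEN_CLASS_RE = re.compile(r'\.([A-Za-z0-9_-]+)\s*\{\s*display\s*:\s*none\s*\}')
IP_RE = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}')


class VisibleTextParser(HTMLParser):
    """Collects visible text, one entry per <br>-separated line."""

    TRACKED = ('span', 'div', 'style')

    def __init__(self, hidden_classes):
        super().__init__()
        self.hidden_classes = set(hidden_classes)
        self.stack = []    # one flag per open tracked tag: True if it hides its content
        self.lines = ['']  # visible text accumulated for each line

    def handle_starttag(self, tag, attrs):
        if tag == 'br':
            self.lines.append('')
        elif tag in self.TRACKED:
            attrs = dict(attrs)
            style = (attrs.get('style') or '').replace(' ', '')
            classes = (attrs.get('class') or '').split()
            hidden = (tag == 'style'
                      or 'display:none' in style
                      or any(c in self.hidden_classes for c in classes))
            self.stack.append(hidden)

    def handle_endtag(self, tag):
        # Assumes tracked tags are properly nested, as in the sample page.
        if tag in self.TRACKED and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not any(self.stack):
            # Remove all whitespace so fragments join into one address.
            self.lines[-1] += ''.join(data.split())


def extract_ips(html_text):
    parser = VisibleTextParser(HIDDEN_CLASS_RE.findall(html_text))
    parser.feed(html_text)
    parser.close()
    # Keep only lines that are exactly one dotted quad.
    return [line for line in parser.lines if IP_RE.fullmatch(line)]


sample = """<head><style>
.HRAR{display:none}
.DYWL{display:inline}
</style></head>
<body>demo<br>
<span style="display:none">128</span>
<div style="display:none">54</div>
<span class="HRAR">220</span>
12
<span style="display:none">.</span>
.
<span style="display:inline">82</span>
<span class="DYWL">.</span>
141
.
<span class="NUMH">45</span>
<br>
<span class="HRAR">.</span>
165
<div style="display:none">.</div>
.
<span class="EIQR">239</span>
.
233.110
</body>"""

print(extract_ips(sample))  # ['12.82.141.45', '165.239.233.110']
```

Tracking hidden state per open tag means the number of hidden classes no longer matters, and stripping whitespace inside `handle_data` avoids the trailing spaces seen in the last address above. The same function could be given `test_data` from the listing.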
  • Original article: https://www.cnblogs.com/liangliangzz/p/10172241.html