• Anti-anti-scraping: cracking font encryption


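    While scraping rental listings on 58同城 today, I found that the prices appear as garbled characters in the HTML source, even though the page itself displays them correctly.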

    Font encryption is one of the more troublesome problems you run into when scraping web pages.

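    Font encryption usually means the site replaces the default character-to-glyph mapping. The page loads the site's own font file as the font style, so the browser draws the correct digits. The source contains the same code points, but read without the custom font they are rendered by the default font as unrelated characters.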
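
    As a small illustration of the point above, the snippet below prints the code points of the garbled price string used later in this post. It only shows that the source holds ordinary CJK code points rather than digits. Which digit each one draws depends entirely on the custom font.

```python
# The garbled price text as it appears in the HTML source (taken from this post).
price_code = '餼龒麣麣'

# Each character is a normal Unicode code point in the CJK block, not a digit.
code_points = [ord(ch) for ch in price_code]
for ch, cp in zip(price_code, code_points):
    print(ch, hex(cp))

# Identical characters share a code point, so they must stand for the same digit.
distinct = len(set(code_points))
print(distinct)
```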

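    The general solution is to locate the font file and work out the mapping it contains. The font is normally applied as a style on the encrypted parts of the page.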

    From the styles it is fairly clear that fangchan-secret is the encrypted font.

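    In the 58同城 source, the font file is base64-encoded (an encoding rather than real encryption) and embedded in a style tag inside head. The string is regenerated on every page refresh, and the mapping inside it changes with it.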

    When scraping the page, the string can be extracted with a regular expression.
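
    A minimal sketch of that extraction, assuming the font is embedded as a data: URI of the form url('data:...;base64,<payload>') inside the style tag. The sample HTML, the function name extract_font_base64 and the exact pattern are illustrative assumptions, not 58同城's actual markup.

```python
import re

# Matches the base64 payload of a data: URI; the character class is the
# standard base64 alphabet plus '=' padding.
FONT_B64_RE = re.compile(r"base64,([A-Za-z0-9+/=]+)")


def extract_font_base64(html: str):
    """Return the first base64 payload found in the page, or None."""
    match = FONT_B64_RE.search(html)
    return match.group(1) if match else None


# Hypothetical, shortened page fragment used only for demonstration.
sample_html = """
<head><style>
@font-face{font-family:'fangchan-secret';
src:url('data:application/font-ttf;charset=utf-8;base64,AAEAAAALAIAAAwAw') format('truetype')}
</style></head>
"""

font_b64 = extract_font_base64(sample_html)
print(font_b64)
```

    Because the mapping changes on every refresh, the font string and the garbled prices have to come from the same response.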

    Without refreshing the page, take the random font string from the style tag together with the garbled text that corresponds to 4300:

    price_code = '餼龒麣麣'
    base64_srt = 'AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8Z/YQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQUNAcHAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOpxi4RfDzz1AAsIAAAAAADYVl37AAAAANhWXfsAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAACQAGAAUAAwAKAAEACAAEAAIABwAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAJAACVjwAAlY8AAAAGAACZPAAAmTwAAAAFAACaSwAAmksAAAADAACeOgAAnjoAAAAKAACeowAAnqMAAAABAACfZAAAn2QAAAAIAACfkgAAn5IAAAAEAACfpAAAn6QAAAACAACfpQAAn6UAAAAHAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiU
VaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA'
    

    First, base64-decode the string to get the font as binary data:

    import base64
    def make_font_file(base64_string: str):
        bin_data = base64.decodebytes(base64_string.encode())
        # Save the decoded font data to a file
        with open('text.otf','wb') as f:
            f.write(bin_data)
        return bin_data
    

    Convert the font data to an XML file:

    from io import BytesIO
    from fontTools.ttLib import TTFont
    def convert_font_to_xml(bin_data):
        # TTFont takes a file path or a file-like object, not raw bytes
        font = TTFont(BytesIO(bin_data))
        font.saveXML("text.xml")
    # Get the mapping between code points and glyph names
    font = TTFont(BytesIO(make_font_file(base64_srt)))
    unilist = font.getGlyphOrder()       # glyph names in glyph-ID order
    c = font['cmap'].tables[0].cmap      # {code point: glyph name}
    

    The font's XML file shows that glyph00000 carries no meaning, glyph00001 corresponds to 0, glyph00002 corresponds to 1, and so on.

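    Take the Unicode code point of each garbled character scraped from the page, look up its glyph name, and from that derive the digit it stands for.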

    def get_num(string):
        ret_list = []
        for char in string:
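            decode_num = ord(char)      # code point of the garbled character
            num = c[decode_num]         # glyph name such as 'glyph00001'
            num = int(num[-2:])-1       # glyph000NN stands for digit NN-1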
            ret_list.append(num)
    
        return ret_list
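
    The same lookup can also be written as a translation table, so a whole string is decoded in one call. This is a sketch under the naming rule described above (glyph000NN stands for digit NN-1). build_digit_table and the small sample_cmap dict are illustrative stand-ins for the real cmap read from the font.

```python
def build_digit_table(cmap: dict) -> dict:
    """Map each code point to its digit, assuming 'glyph000NN' means NN - 1."""
    table = {}
    for code_point, glyph_name in cmap.items():
        digit = int(glyph_name[-2:]) - 1
        if 0 <= digit <= 9:  # skip glyphs that do not stand for a digit
            table[code_point] = str(digit)
    return table


# Hand-written sample consistent with the 4300 example in this post;
# a real cmap would come from the decoded font file.
sample_cmap = {
    ord('餼'): 'glyph00005',
    ord('龒'): 'glyph00004',
    ord('麣'): 'glyph00001',
}

table = build_digit_table(sample_cmap)
decoded = '餼龒麣麣'.translate(table)
print(decoded)  # 4300
```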
    

    The cleaned-up code:

    price_code = '餼龒麣麣'
    base64_srt = 'AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TL...'
    import base64
    from io import BytesIO
    from fontTools.ttLib import TTFont
    
    
    class ParseNum:
        def __init__(self,base64_srt):
            font = TTFont(BytesIO(self.make_font_file(base64_srt)))
            # unilist = font.getGlyphOrder()
            # cmap of the first subtable: {code point: glyph name}
            self.c = font['cmap'].tables[0].cmap
    
        def make_font_file(self,base64_string: str):
            bin_data = base64.decodebytes(base64_string.encode())
    
            # Return the font file data as bytes
            return bin_data
    
        def get_num(self,string):
            ret_str = ''
            for char in string:
                decode_num = ord(char)
                num = self.c[decode_num]
                num = int(num[-2:]) - 1
                ret_str += str(num)
    
            return ret_str
    
    parse = ParseNum(base64_srt)
    money = parse.get_num(price_code)
    print(money)
    
  • Original post: https://www.cnblogs.com/wualin/p/10226549.html