• python-magic module: file type identification


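    Preface: I came across the magic module because of a file-type detection requirement at work. A quick search turned up the standard-library mimetypes module and the third-party filetype package. Compared with mimetypes, the magic package is the more reliable approach.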
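    To make the comparison concrete, here is a minimal sketch of why mimetypes alone is weak for this job: it looks only at the file name and never opens the file. The helper name guess_by_name is made up for illustration.

```python
import mimetypes


def guess_by_name(filename):
    """Return the MIME type that mimetypes infers from the file name alone."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# The file is never opened, so only the last extension matters.
print(guess_by_name("report.pdf"))      # application/pdf
print(guess_by_name("report.pdf.txt"))  # text/plain
print(guess_by_name("no_extension"))    # None
```

    Renaming a file is therefore enough to change the answer, which is the gap that content-based detection is meant to close.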

    magic

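    magic is a wrapper around the libmagic file identification library. libmagic identifies file types from the file header, so it can determine a file's type from its content. In Django, it can also be used to check that the detected MIME type matches UploadedFile.content_type.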
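    The idea of identifying a file from its header can be sketched in a few lines of plain Python. This is only an illustration, not how libmagic is implemented: the four signatures below are well-known magic numbers, while libmagic's real database is far larger and rule-based. The names SIGNATURES, sniff and matches_claimed_type are hypothetical.

```python
# Illustrative signatures only; libmagic's real database is far larger.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\x1f\x8b", "application/gzip"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff(header):
    """Return a MIME type if the header starts with a known signature, else None."""
    for magic_bytes, mime_type in SIGNATURES:
        if header.startswith(magic_bytes):
            return mime_type
    return None


def matches_claimed_type(header, claimed_type):
    """Compare the sniffed type with a declared one, such as UploadedFile.content_type."""
    detected = sniff(header)
    return detected is not None and detected == claimed_type


print(sniff(b"%PDF-1.2\n"))                                    # application/pdf
print(matches_claimed_type(b"\x1f\x8b\x08", "image/png"))      # False
```

    In a Django view, the header would come from the first bytes of the uploaded file and the claimed type from its content_type attribute; that wiring is left out here.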

    libmagic

    • Usage:
      • import magic
        
        detected = magic.detect_from_filename('magic.py')
        print('Detected MIME type: {}'.format(detected.mime_type))
        print('Detected encoding: {}'.format(detected.encoding))
        print('Detected file type name: {}'.format(detected.name))

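    To determine a file's MIME type, the tool of choice is simply called file, and its back end is called libmagic. You need this tool if you want to use any of the libmagic bindings with Python, and it already ships with its own Python bindings, called file-magic. If both file and python-magic are installed, the Python module name magic refers to the former.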
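    Because both bindings are imported as magic, it can help to check which one you actually got. The sketch below is an assumption-based heuristic: it relies on python-magic exposing from_file and from_buffer, and on file-magic exposing detect_from_filename. The function identify_binding is hypothetical, and the stand-in objects exist only so the check can be tried without either package installed.

```python
from types import SimpleNamespace


def identify_binding(module):
    """Guess which libmagic binding a module object is, based on its attributes."""
    # Check python-magic first: some of its releases also carry a
    # compatibility layer with file-magic style names.
    if hasattr(module, "from_file") and hasattr(module, "from_buffer"):
        return "python-magic"
    if hasattr(module, "detect_from_filename"):
        return "file-magic"
    return "unknown"


# Stand-ins that mimic the attribute layout of each binding.
fake_python_magic = SimpleNamespace(from_file=None, from_buffer=None)
fake_file_magic = SimpleNamespace(detect_from_filename=None)

print(identify_binding(fake_python_magic))  # python-magic
print(identify_binding(fake_file_magic))    # file-magic
```

    In real use you would call identify_binding(magic) right after import magic.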

    python-magic

    • module name: magic
    • pypi: python-magic
    • source: https://github.com/ahupp/python-magic
    • install:
      • pip install python-magic  # on Windows this depends on python-magic-bin: pip install python-magic-bin
    • usage:
      • >>> import magic
        >>> magic.from_file("testdata/test.pdf")
        'PDF document, version 1.2'
        # recommend using at least the first 2048 bytes, as less can produce incorrect identification
        >>> magic.from_buffer(open("testdata/test.pdf", "rb").read(2048))
        'PDF document, version 1.2'
        >>> magic.from_file("testdata/test.pdf", mime=True)
        'application/pdf'
        >>> f = magic.Magic(uncompress=True)
        >>> f.from_file('testdata/test.gz')
        'ASCII text (gzip compressed data, was "test", last modified: Sat Jun 28
        21:32:52 2008, from Unix)'
        >>> f = magic.Magic(mime=True, uncompress=True)
        >>> f.from_file('testdata/test.gz')
        'text/plain'
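    The 2048-byte recommendation and the need for binary mode can be wrapped in a small helper. This is a sketch, assuming python-magic and the libmagic shared library are installed. If the import fails, or the installed magic module has no from_buffer, the helper returns None instead of raising. The names read_header and mime_from_path are hypothetical.

```python
import os
import tempfile

try:
    import magic  # python-magic; also needs the libmagic shared library
except ImportError:
    magic = None

HEADER_SIZE = 2048  # the minimum the usage notes above recommend


def read_header(path, size=HEADER_SIZE):
    """Read the leading bytes of a file in binary mode."""
    with open(path, "rb") as handle:
        return handle.read(size)


def mime_from_path(path):
    """Return the MIME type detected from the file header, or None if unavailable."""
    if magic is None or not hasattr(magic, "from_buffer"):
        return None
    return magic.from_buffer(read_header(path), mime=True)


# Demo on a throwaway file that starts with a PDF signature.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(b"%PDF-1.2\n" + b"\0" * 5000)
    demo_path = tmp.name

header = read_header(demo_path)
detected = mime_from_path(demo_path)
os.remove(demo_path)
print(len(header), detected)
```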

    filemagic

          This library has some similarities with file-magic, which ships with libmagic.

     python-magic example

    import magic
    
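    # 1: identify from the first 2048 bytes, read in binary mode
    description = magic.from_buffer(open("file_types/Bs.tar.gz", 'rb').read(2048))
    # 2: or return the MIME type instead of a description
    mime_type = magic.from_file("file_types/Bs.tar.gz", mime=True)
    
    # 3: look inside the compressed data
    f = magic.Magic(uncompress=True)
    inner = f.from_file('file_types/Bs.tar.gz')
    
    print(description)  # gzip compressed data, last modified: Tue Dec 10 08:46:57 2019, from Unix
    print(mime_type, inner)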
    

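       I prefer the Magic class. Magic is a wrapper around the libmagic C library: it is more powerful and more direct, gives access to magic's database methods, and can also detect the MIME encoding.

            However, others have kindly pointed out that it is not recommended for general use. In particular, a Magic instance is not safe to share across threads and will fail if you try. I have not dug into this yet, but we can start by looking at how the magic module calls Magic: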

    def _get_magic_type(mime):
        i = _instances.get(mime)
        if i is None:
            i = _instances[mime] = Magic(mime=mime)
        return i
    

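      As you can see, the module-level functions look up a cached Magic instance for the given mime flag and create one if none exists yet, so they go through Magic as well. That leaves a question mark over safety and feasibility. I am still a beginner, so I will come back and dig into this once I have more experience.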
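    One general way to avoid sharing an instance across threads is to keep one instance per thread with threading.local. The sketch below shows only that pattern. Whether it is sufficient for Magic has not been verified here and should be treated as an assumption. The helper get_thread_detector is hypothetical, and the demo uses plain object() instances as stand-ins so that it runs without libmagic.

```python
import threading

_thread_state = threading.local()


def get_thread_detector(factory):
    """Return this thread's detector, creating it with factory() on first use."""
    detector = getattr(_thread_state, "detector", None)
    if detector is None:
        detector = _thread_state.detector = factory()
    return detector


def _demo():
    """Show that each thread reuses its own instance and never another thread's."""
    seen = {}

    def worker(name):
        first = get_thread_detector(object)
        second = get_thread_detector(object)
        seen[name] = (first, second)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen


seen = _demo()
print(seen["a"][0] is seen["a"][1])  # True: reused within one thread
print(seen["a"][0] is seen["b"][0])  # False: not shared between threads
```

    With python-magic the factory could be something like lambda: magic.Magic(mime=True), so each thread would build and keep its own instance.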

    # Open a libmagic handle through the file-magic style API
    try:
        ms = magic.open(magic.MAGIC_NONE)
        ms.load()
    except Exception:
        ms = None
    

      Some of magic's constants:

    MAGIC_NONE = 0x000000 # No flags
    MAGIC_DEBUG = 0x000001 # Turn on debugging
    MAGIC_SYMLINK = 0x000002 # Follow symlinks
    MAGIC_COMPRESS = 0x000004 # Check inside compressed files
    MAGIC_DEVICES = 0x000008 # Look at the contents of devices
    MAGIC_MIME = 0x000010 # Return a mime string
    MAGIC_MIME_ENCODING = 0x000400 # Return the MIME encoding
    MAGIC_CONTINUE = 0x000020 # Return all matches
    MAGIC_CHECK = 0x000040 # Print warnings to stderr
    MAGIC_PRESERVE_ATIME = 0x000080 # Restore access time on exit
    MAGIC_RAW = 0x000100 # Don't translate unprintable chars
    MAGIC_ERROR = 0x000200 # Handle ENOENT etc as real errors
    
    MAGIC_NO_CHECK_COMPRESS = 0x001000 # Don't check for compressed files
    MAGIC_NO_CHECK_TAR = 0x002000 # Don't check for tar files
    MAGIC_NO_CHECK_SOFT = 0x004000 # Don't check magic entries
    MAGIC_NO_CHECK_APPTYPE = 0x008000 # Don't check application type
    MAGIC_NO_CHECK_ELF = 0x010000 # Don't check for elf details
    MAGIC_NO_CHECK_ASCII = 0x020000 # Don't check for ascii files
    MAGIC_NO_CHECK_TROFF = 0x040000 # Don't check ascii/troff
    MAGIC_NO_CHECK_FORTRAN = 0x080000 # Don't check ascii/fortran
    MAGIC_NO_CHECK_TOKENS = 0x100000 # Don't check ascii/tokens
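    These constants are bit flags, so several of them can be combined with a bitwise OR and tested with a bitwise AND. A minimal sketch, using values copied from the list above; the helpers combine and has_flag are hypothetical:

```python
# Values copied from the constant list above.
MAGIC_NONE = 0x000000
MAGIC_COMPRESS = 0x000004
MAGIC_MIME = 0x000010
MAGIC_MIME_ENCODING = 0x000400


def combine(*flags):
    """OR any number of flags into a single integer."""
    result = MAGIC_NONE
    for flag in flags:
        result |= flag
    return result


def has_flag(flags, flag):
    """Return True if every bit of flag is set in flags."""
    return flags & flag == flag


flags = combine(MAGIC_MIME, MAGIC_COMPRESS)
print(hex(flags))                            # 0x14
print(has_flag(flags, MAGIC_COMPRESS))       # True
print(has_flag(flags, MAGIC_MIME_ENCODING))  # False
```

    A combined value like this is what gets passed where a single flag such as magic.MAGIC_NONE appears in the magic.open call shown earlier.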

    Related article:

    File identification in cuckoo: https://www.cnblogs.com/viwilla/p/5051896.html

    The code that determines the file format:

    def _get_filetype(self, data):
        """Gets filetype, uses libmagic if available.
        @param data: data to be analyzed.
        @return: file type or None.
        """
        if not HAVE_MAGIC:
            return None
    
        try:
            ms = magic.open(magic.MAGIC_NONE)
            ms.load()
            file_type = ms.buffer(data)
        except:
            try:
                file_type = magic.from_buffer(data)
            except Exception:
                return None
        finally:
            try:
                ms.close()
            except:
                pass
    
        return file_type
    

     A nice example I came across: https://www.cnblogs.com/17bdw/p/10042549.html

    Reference: https://stackoverflow.com/questions/43580/how-to-find-the-mime-type-of-a-file-in-python

  • Original article: https://www.cnblogs.com/chenyuting/p/12085337.html