• Structured data representation in Python


    Structured data

    https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html

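    Structured data: a schema is defined over the data, e.g. a relational database.

    Unstructured data: free-form data with no constraints, e.g. newspaper articles.

    Semi-structured data: there is no global schema, but each record carries its own schema definition, e.g. a document database.

    Python applications often need to define structured data to manage business data. This post summarizes several ways to represent structured data.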

    Structured data

    Structured data sources define a schema on the data. With this extra bit of information about the underlying data, structured data sources provide efficient storage and performance. For example, columnar formats such as Parquet and ORC make it much easier to extract values from a subset of columns. Reading each record row by row first, then extracting the values from the specific columns of interest can read much more data than what is necessary when a query is only interested in a small fraction of the columns. A row-based storage format such as Avro efficiently serializes and stores data providing storage benefits. However, these advantages often come at the cost of flexibility. For example, because of rigidity in structure, evolving a schema can be challenging.
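
    The difference between row-based and columnar layouts can be sketched in plain Python. This is only an illustration under assumed sample data (the `id`, `name` and `score` fields are hypothetical); it is not how Parquet, ORC or Avro are implemented.

```python
# Hypothetical sample data: three records with the same three fields.
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob", "score": 75},
    {"id": 3, "name": "carol", "score": 82},
]

# Row-oriented access: every whole record is visited to read one field.
scores_from_rows = [row["score"] for row in rows]

# Column-oriented layout: one list per field, built once from the rows.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Reading one column touches only that column's values.
scores_from_columns = columns["score"]

print(scores_from_rows)
print(scores_from_columns)
```

    Both approaches return the same values; the columnar layout simply avoids walking through the fields the query does not need.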

    Unstructured data

    By contrast, unstructured data sources are generally free-form text or binary objects that contain no markup, or metadata (e.g., commas in CSV files), to define the organization of data. Newspaper articles, medical records, image blobs, application logs are often treated as unstructured data. These sorts of sources generally require context around the data to be parseable. That is, you need to know that the file is an image or is a newspaper article. Most sources of data are unstructured. The cost of having unstructured formats is that it becomes cumbersome to extract value out of these data sources as many transformations and feature extraction techniques are required to interpret these datasets.

    Semi-structured data

    Semi-structured data sources are structured per record but don’t necessarily have a well-defined global schema spanning all records. As a result, each data record is augmented with its schema information. JSON and XML are popular examples. The benefits of semi-structured data formats are that they provide the most flexibility in expressing your data as each record is self-describing. These formats are very common across many applications as many lightweight parsers exist for dealing with these records, and they also have the benefit of being human readable. However, the main drawback for these formats is that they incur extra parsing overheads, and are not particularly built for ad-hoc querying.
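
    As a rough sketch of what "self-describing" means in practice, the snippet below parses a few JSON records with the standard `json` module. The records and their field names are made up for illustration.

```python
import json

# Hypothetical records: each one names its own fields, and the field sets differ.
raw_records = [
    '{"id": 1, "name": "alice"}',
    '{"id": 2, "name": "bob", "email": "bob@example.com"}',
    '{"id": 3, "tags": ["a", "b"]}',
]

records = [json.loads(line) for line in raw_records]

# There is no global schema, so the set of fields must be discovered per record.
fields_per_record = [sorted(record) for record in records]
all_fields = sorted({key for record in records for key in record})

# Optional fields need a fallback at the point of use.
emails = [record.get("email") for record in records]

print(fields_per_record)
print(all_fields)
print(emails)
```

    Every record has to be parsed before its fields are known, which is one source of the parsing overhead mentioned above.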

    Dict

    https://docs.python.org/3/tutorial/datastructures.html#dictionaries

    A dict has no real schema definition; the developer has to spell out each field as needed when using it.

    >>> tel = {'jack': 4098, 'sape': 4139}
    >>> tel['guido'] = 4127
    >>> tel
    {'jack': 4098, 'sape': 4139, 'guido': 4127}
    >>> tel['jack']
    4098
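
    The lack of a schema shows up as soon as a key is misspelled or missing. The sketch below uses a hypothetical `user` record to illustrate this; the field names are assumptions.

```python
# Hypothetical business record kept in a plain dict.
user = {"name": "jack", "age": 30}

# Nothing rejects a misspelled key: this silently adds a new field.
user["agee"] = 31

# A missing field only surfaces when it is read.
try:
    user["email"]
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

# dict.get supplies a fallback instead of raising.
email = user.get("email", "unknown")

# Values are not type-checked either.
user["age"] = "thirty"

print(user)
print(missing_key_raised, email)
```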

    namedtuple

    https://medium.com/swlh/structures-in-python-ed199411b3e1

    A named tuple gives each position in the tuple a name, so elements can be accessed by name as well as by index.

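    from collections import namedtuple

    # Basic definition: both fields are required
    Point = namedtuple('Point', ['x', 'y'])

    # Same definition with default values (Python 3.7+); this rebinds Point
    Point = namedtuple('Point', ['x', 'y'], defaults=[0, 0])

    # Positional and keyword arguments can be mixed
    ntpt = Point(3, y=6)

    # Access by field name
    ntpt.x + ntpt.y      # 9

    # Access by position
    ntpt[0] + ntpt[1]    # 9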
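
    A few more namedtuple behaviors are worth seeing next to the example above: instances are immutable, and the helper methods `_replace` and `_asdict` cover updating and conversion. This is a minimal sketch; `Point` is redefined so the block runs on its own.

```python
from collections import namedtuple

# Same shape as the Point above, redefined here so the block is self-contained.
Point = namedtuple("Point", ["x", "y"], defaults=[0, 0])

p = Point(3)  # y falls back to its default of 0

# Fields are read-only: assignment raises AttributeError.
try:
    p.x = 10
    assignment_failed = False
except AttributeError:
    assignment_failed = True

# _replace returns a new tuple with the given fields changed.
q = p._replace(y=6)

# _asdict converts to a dict keyed by field name.
as_dict = q._asdict()

# A namedtuple still unpacks like a regular tuple.
x, y = q

print(p, q, as_dict, assignment_failed)
```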

    class

    https://docs.python.org/3/tutorial/classes.html#class-objects

    Use a class to manage compound data attributes.

    >>> class Complex:
    ...     def __init__(self, realpart, imagpart):
    ...         self.r = realpart
    ...         self.i = imagpart
    ...
    >>> x = Complex(3.0, -4.5)
    >>> x.r, x.i
    (3.0, -4.5)
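
    A plain class stores the data but gives no data-oriented behavior for free. The sketch below reuses the `Complex` class from the tutorial example to show the defaults; the `label` attribute is a hypothetical addition for illustration.

```python
class Complex:
    def __init__(self, realpart, imagpart):
        self.r = realpart
        self.i = imagpart


a = Complex(3.0, -4.5)
b = Complex(3.0, -4.5)

# Without __eq__, equality falls back to identity, so equal-valued objects differ.
same_values_equal = (a == b)

# Without __repr__, the default repr shows the type and address, not the fields.
default_repr = repr(a)

# Attributes are not fixed by the class: new ones can be attached at any time.
a.label = "added later"

# vars() exposes the instance attributes as a dict.
fields = vars(a)

print(same_values_equal)
print(fields)
```

    Writing `__repr__` and `__eq__` by hand fixes the first two points, at the cost of boilerplate per class.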

    dataclass

    https://www.geeksforgeeks.org/understanding-python-dataclasses/

    dataclass builds on class with enhancements aimed specifically at data storage, including generated initialization, printing, and comparison.

    Dataclasses were added in Python 3.7 as a utility for storing data. The dataclasses module provides a decorator and functions for automatically adding generated special methods such as __init__(), __repr__() and __eq__() to user-defined classes.

    # Default field example
    from dataclasses import dataclass, field


    # A class for holding an employee's data
    @dataclass
    class Employee:

        # Attribute declarations using type hints
        name: str
        emp_id: str
        age: int

        # Field with a default value;
        # equivalent to: city: str = "patna"
        city: str = field(default="patna")


    emp = Employee("Satyam", "ksatyam858", 21)
    print(emp)
    # Employee(name='Satyam', emp_id='ksatyam858', age=21, city='patna')
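
    The generated methods can be checked directly. The following sketch defines its own `Employee` class so it runs standalone; the `skills` field is a hypothetical addition to show `default_factory`. It also shows that a dataclass does not validate types at runtime.

```python
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Employee:
    name: str
    emp_id: str
    age: int
    city: str = "patna"
    # Hypothetical extra field: mutable defaults must go through default_factory.
    skills: List[str] = field(default_factory=list)


a = Employee("Satyam", "ksatyam858", 21)
b = Employee("Satyam", "ksatyam858", 21)

# Generated __eq__ compares field by field.
equal = (a == b)

# Generated __repr__ lists every field.
text = repr(a)

# asdict converts the instance to a plain dict.
data = asdict(a)

# Type hints are not enforced at runtime: this does not raise.
c = Employee("Ann", "ann01", "not a number")

print(equal)
print(text)
print(data)
```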

    pydantic

    https://pydantic-docs.helpmanual.io/

    On top of schema definition, pydantic adds some features:

    Data validation

    Type error reporting at runtime

    Data validation and settings management using python type annotations.

    pydantic enforces type hints at runtime, and provides user friendly errors when data is invalid.

    Define how data should be in pure, canonical python; validate it with pydantic.

    from datetime import datetime
    from typing import List, Optional
    from pydantic import BaseModel


    class User(BaseModel):
        id: int
        # Annotated so the field is also accepted by pydantic v2
        name: str = 'John Doe'
        signup_ts: Optional[datetime] = None
        friends: List[int] = []


    external_data = {
        'id': '123',
        'signup_ts': '2019-06-01 12:22',
        'friends': [1, 2, '3'],
    }
    # Compatible values are converted to the declared types
    user = User(**external_data)
    print(user.id)
    #> 123
    print(repr(user.signup_ts))
    #> datetime.datetime(2019, 6, 1, 12, 22)
    print(user.friends)
    #> [1, 2, 3]
    # In pydantic v2, .dict() is deprecated in favor of .model_dump()
    print(user.dict())
    """
    {
        'id': 123,
        'name': 'John Doe',
        'signup_ts': datetime.datetime(2019, 6, 1, 12, 22),
        'friends': [1, 2, 3],
    }
    """
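
    The example above only covers valid input. What happens with invalid data can be sketched as follows, assuming pydantic is installed. The `bad_data` values are hypothetical, and the exact wording of the error message differs between pydantic versions, so it is printed rather than quoted here.

```python
from typing import List
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = 'John Doe'
    friends: List[int] = []


# Hypothetical bad input: neither value can be converted to an int.
bad_data = {'id': 'not-a-number', 'friends': [1, 'two']}

try:
    User(**bad_data)
    error_locations = []
except ValidationError as exc:
    # errors() returns one dict per failed value; 'loc' gives the path to it.
    error_locations = [err['loc'] for err in exc.errors()]
    print(exc)

print(error_locations)
```

    Each entry in `error_locations` is a tuple path into the input, so a failure inside a list carries the index of the offending element.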
  • Original article: https://www.cnblogs.com/lightsong/p/15950181.html