• Adventures with Testing BI/DW Application


    How testing a data/BI product differs from testing other products:

    Users make decisions based on accurate, timely data analysis;

    Data is integrated and retrieved far more often than it is written;

    Data must be timely and accurate;

    Large volumes of historical data must be maintained;

    Retrieval performance matters;

    So does data security.

    Obstacles to testing data products:


    Performance problems, stale data, functional defects, and scalability limits;

    The business focuses mainly on end reports; heavy integration work and a lack of professional documentation leave important business logic hidden in a complex architecture; the importance of white-box testing is ignored; testers are not involved during the design phase; combined with a lack of knowledge sharing and process immaturity, this leads to customer attrition.

    Data volume and complexity keep changing;

    Changes in upstream data directly affect the integration process, forcing changes to existing models and transformation logic;

    Quality problems in the upstream data;

    Real-time analysis requires up-to-date data.

    Recommendations for the testing process:


    Requirement & Analysis (analyze the data sources), Design & Coding (acquire the data, implement the business logic and dimensional model, build and populate the multidimensional data sets), QA & Deployment (deliver the reports).

    Designing the testing process:


    a) Requirements Review & Inspection:
    1. Validating the data required and the availability of the data sources from which it can be acquired.
    2. Data profiling:
      • Understanding the Data: this exercise helps the test team understand the nature of the data, which is critical for assessing the design choices.
      • Finding Issues Early: discovering data issues and anomalies early avoids late-project surprises; finding data problems early in the project considerably reduces the cost of fixing them late in the cycle.
      • Identifying Realistic Boundary Value Conditions: current data trends can be used to determine minimum and maximum values for the important business fields, yielding realistic, high-value test scenarios.
      • Redundancy Analysis: identifies overlapping values between tables. Example: redundancy analysis could tell the analyst that the ZIP field in table A contains the same values as the ZIP_CODE field in table B 80% of the time.
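      The profiling checks above can be sketched in a few lines. This is a minimal, illustrative sketch: the `orders`/`customers` tables, their field names, and the join key are all assumptions, not part of the original text.

```python
def boundary_values(rows, field):
    """Observed (min, max) for a field -- candidates for boundary-value tests."""
    values = [r[field] for r in rows if r[field] is not None]
    return min(values), max(values)

def redundancy(rows_a, rows_b, field_a, field_b, key):
    """Percentage of shared keys where field_a in A equals field_b in B."""
    b_by_key = {r[key]: r[field_b] for r in rows_b}
    shared = [r for r in rows_a if r[key] in b_by_key]
    if not shared:
        return 0.0
    matches = sum(1 for r in shared if r[field_a] == b_by_key[r[key]])
    return 100.0 * matches / len(shared)

# Hypothetical sample data standing in for two source tables.
orders = [{"id": 1, "amount": 120.0, "ZIP": "98052"},
          {"id": 2, "amount": 7.5,   "ZIP": "10001"}]
customers = [{"id": 1, "ZIP_CODE": "98052"},
             {"id": 2, "ZIP_CODE": "10002"}]

print(boundary_values(orders, "amount"))                       # (7.5, 120.0)
print(redundancy(orders, customers, "ZIP", "ZIP_CODE", "id"))  # 50.0
```

      In practice a profiling tool would run such checks over every mapped column pair; the point is that both boundary values and redundancy are cheap, mechanical measurements.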
    3. Data Quality and Performance Acceptance Criteria:
      • Data quality attributes (completeness, accuracy, validity, consistency, etc.), e.g. a customer expects at least 90% data accuracy and 85% data consistency.
      • Performance benchmarking & SLAs (Service Level Agreements), e.g. a report should render in at most 30 seconds.
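      Acceptance criteria like these are only useful if they are checked mechanically. A minimal sketch, assuming the 90%/85% thresholds from the example above and hypothetical counts of verified records:

```python
def pct(passed, total):
    """Percentage of records passing a quality check."""
    return 100.0 * passed / total if total else 0.0

records_checked = 1000   # hypothetical sample size
accurate = 930           # rows whose values match the source of truth
consistent = 870         # rows consistent across systems

accuracy = pct(accurate, records_checked)
consistency = pct(consistent, records_checked)

# Fail the test run if either acceptance criterion is missed.
assert accuracy >= 90.0, f"accuracy {accuracy:.1f}% below the 90% criterion"
assert consistency >= 85.0, f"consistency {consistency:.1f}% below the 85% criterion"
```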
    4. Validation of Business Transformation Rules:

      A realistic example is acquiring only the last 5 years of a company's product sales data from the United States (this rule should be applied while designing the system, since it makes no sense to acquire all the data when the customer only wants reports covering the last 5 years of United States data).
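      That acquisition rule can be sketched as a simple predicate; the field names and the fixed "today" date are illustrative assumptions:

```python
from datetime import date

def in_scope(row, today=date(2015, 3, 1)):
    """Keep only US sales from the last 5 years, per the transformation rule."""
    cutoff = today.replace(year=today.year - 5)
    return row["country"] == "United States" and row["sale_date"] >= cutoff

sales = [
    {"country": "United States", "sale_date": date(2014, 6, 1)},
    {"country": "United States", "sale_date": date(2008, 6, 1)},  # too old
    {"country": "Germany",       "sale_date": date(2014, 6, 1)},  # out of scope
]
print([in_scope(r) for r in sales])  # [True, False, False]
```

      A tester validates the rule from both sides: rows inside the window must arrive, and rows outside it must never be pulled.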

    5. Test Planning
      Every time data moves, the results have to be tested against the expected results. For every ETL process, the test conditions for the data are defined before or during the design and development phase itself.
      Key areas to focus on:
      • Scope of testing: functional & non-functional requirements such as performance testing, security testing, etc.
      • Testing techniques and Testing Types to be used.
      • Test Data Preparation: Sampling of data from data sources or data generation
    b) Design & Code Review / Inspection
    1. Reviewing Data dictionary

      Verifying metadata, including constraints such as nulls, default values, PKs, check constraints, referential integrity (PK-FK relationships), surrogate vs. natural keys, cardinality (1:1, m:n), etc.
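      These constraint checks reduce to small SQL probes. A sketch using an in-memory SQLite database; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE fact_sales (sale_id INTEGER, customer_key INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO fact_sales VALUES (10, 1, 99.0), (11, 3, 10.0);  -- key 3 is an orphan
""")

# Nulls in a NOT NULL business field.
nulls = conn.execute(
    "SELECT COUNT(*) FROM dim_customer WHERE name IS NULL").fetchone()[0]
# Duplicate surrogate keys.
dups = conn.execute("""SELECT COUNT(*) FROM (SELECT customer_key FROM dim_customer
                       GROUP BY customer_key HAVING COUNT(*) > 1)""").fetchone()[0]
# Referential integrity: fact rows with no matching dimension row.
orphans = conn.execute("""SELECT COUNT(*) FROM fact_sales f
                          LEFT JOIN dim_customer d USING (customer_key)
                          WHERE d.customer_key IS NULL""").fetchone()[0]
print(nulls, dups, orphans)  # 0 0 1
```

      The orphan-key probe is the one that most often fires in practice, since warehouses frequently relax FK enforcement for load performance.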

    2. Validating Source to Target Mapping (STM)

      Ensuring the traceability from: Data Sources -> Staging -> Data Warehouse -> Data Marts -> Cube -> Reports
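      A crude but effective traceability check is reconciling row counts (or, better, checksums) at every hop named in the STM. The stage names and the "expected rejects" figure below are illustrative assumptions:

```python
# Row counts observed at each hop, in pipeline order.
counts = {"source": 1000, "staging": 1000, "warehouse": 998, "mart": 998}
# Drops that are legitimate, e.g. 2 rows rejected by a cleansing rule.
expected_drops = {"warehouse": 2}

prev_stage, prev = None, None
for stage, n in counts.items():
    if prev is not None:
        allowed = expected_drops.get(stage, 0)
        assert prev - n == allowed, (
            f"{prev_stage} -> {stage}: lost {prev - n} rows, expected {allowed}")
    prev_stage, prev = stage, n
```

      Any unexplained delta between two adjacent stages points at the ETL step between them, which narrows the defect search considerably.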

    3. Validation & Selection of Data Model (Dimensional vs. Normalized)
      1. Dimensional Modelling:

        The dimensional approach enables a relational database to emulate the analytical functionality of a multidimensional database and makes the data warehouse easier for the user to understand and use. Retrieval of data from the data warehouse also tends to be very fast. In the dimensional approach, transaction data are partitioned into either "facts" or "dimensions".
        For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.

        • Star Schema:
          • Dimension tables have a simple primary key, while fact tables have a compound primary key consisting of the aggregate of relevant dimension keys.
          • Another reason for using a star schema is its simplicity from the users' point of view: queries are never complex because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies to other tables that are possible in a better normalized snowflake schema.
        • Snowflake schema
          • The snowflake schema is a variation of the star schema, featuring normalization of dimension tables.
          • Closely related to the star schema, the snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.
      2. Normalized Approach:

        In the normalized approach, the data in the data warehouse are stored as per the database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.) The main advantage of this approach is that it is straightforward to add information into the database.

    4. Validation of  BI/ DW Architecture:
      Ensuring that design is scalable, robust and as per the requirements. Choosing the best approach for designing the system:
      • Bottom-up:
        Data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data. These data marts can eventually be used together to create a comprehensive data warehouse.
      • Top-down:
        Data warehouse is defined as a centralized repository for the entire enterprise and suggests an approach in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse.
    5. Archival / Purge Strategy
      Deciding on the appropriate archival and purge policy based on the business needs e.g. maintaining data history of last 5 yrs etc.
    6. Error Logging / Exception Handling / Recoverability
      Ensuring appropriate data failure tracking & prevention (schema changes, source unavailability etc), as well as the ability to resume from the point of failure.
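      Resume-from-failure is typically implemented with a checkpoint recording the last batch loaded successfully. A minimal sketch; the checkpoint file name and batch handling are assumptions:

```python
import json, os

CHECKPOINT = "etl_checkpoint.json"

def load_checkpoint():
    """Index of the last successfully loaded batch, or -1 on a fresh run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_batch"]
    return -1

def save_checkpoint(batch_no):
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_batch": batch_no}, f)

def process(batch):
    print("loaded", batch)  # placeholder for real extract/transform/load work

def run_etl(batches):
    start = load_checkpoint() + 1        # resume just after the last good batch
    for i in range(start, len(batches)):
        process(batches[i])              # a failure here leaves the checkpoint untouched
        save_checkpoint(i)               # record progress only after success

run_etl(["batch-0", "batch-1", "batch-2"])
os.remove(CHECKPOINT)                    # clean up the demo checkpoint
```

      The key property to test is that a crash between `process` and `save_checkpoint` causes that batch to be re-run, never skipped, which also implies the load step must be idempotent.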
    7. Parallel Execution & Precedence

      Data warehousing procedures can subdivide an ETL process into smaller pieces running sequentially or in parallel in a specific order.  The opted path can have a direct impact on the performance and scalability of the system
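      The sequential-vs-parallel trade-off can be sketched with a thread pool: independent pieces run concurrently, while precedence (here, dimensions before facts, a common convention) is enforced by ordering. Piece names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def load(piece):
    # Placeholder for loading one ETL piece.
    return f"{piece} loaded"

dimension_loads = ["dim_date", "dim_product", "dim_customer"]
fact_loads = ["fact_sales"]

with ThreadPoolExecutor() as pool:
    # Dimensions have no mutual dependencies, so they can run in parallel.
    results = list(pool.map(load, dimension_loads))
# Facts reference dimension keys, so they run only after all dimensions finish.
results += [load(p) for p in fact_loads]
print(results)
```

      Testing should cover both orderings the design allows, since a race between two "independent" pieces that secretly share a table is a classic warehouse defect.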

    8. ETL Pull Logic – Full / Incremental (a.k.a. Delta pull)

      Entire data can be pulled from the source every time or only the delta since the last run can be considered to reduce the network movement of huge amount of data for each run.
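      A delta pull is usually driven by a high-water mark remembered from the previous run. A sketch under the assumption that source rows carry a `modified_at` timestamp:

```python
from datetime import datetime

def delta_pull(rows, last_watermark):
    """Rows changed since the last run, plus the new high-water mark."""
    delta = [r for r in rows if r["modified_at"] > last_watermark]
    new_mark = max((r["modified_at"] for r in delta), default=last_watermark)
    return delta, new_mark

rows = [
    {"id": 1, "modified_at": datetime(2015, 3, 1, 8)},
    {"id": 2, "modified_at": datetime(2015, 3, 2, 9)},
]
delta, mark = delta_pull(rows, datetime(2015, 3, 1, 12))
print([r["id"] for r in delta], mark)  # [2] 2015-03-02 09:00:00
```

      Test cases for delta logic should include rows modified exactly at the watermark, late-arriving rows, and a full pull after the watermark is reset.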

    c) BI/DW Testing

    1. Test Data Preparation
      • Test Data Selection
        Identifying a subset of production data to be used as test data (ensure that the customer's confidential data is not used for this purpose). The selection can be made:
        • By percentage, fixed row count, time window, etc.
      • Generate new test data from scratch
        • Identify the source tables, the constraints and dependencies
        • Understand the range of possible values for various fields (Include boundary values)
        • Use data generation tools to generate data keeping above rules in mind
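      The generation steps above can be sketched as a small deterministic generator that honors constraints and forces the boundary values in. The field name and value range are assumptions standing in for profiling results:

```python
import random

AMOUNT_MIN, AMOUNT_MAX = 0.01, 99999.99   # boundaries discovered via profiling

def gen_rows(n, seed=42):
    """Generate n test rows, always including both boundary values."""
    rng = random.Random(seed)             # fixed seed => repeatable test data
    rows = [{"id": 1, "amount": AMOUNT_MIN},
            {"id": 2, "amount": AMOUNT_MAX}]
    for i in range(3, n + 1):
        rows.append({"id": i,
                     "amount": round(rng.uniform(AMOUNT_MIN, AMOUNT_MAX), 2)})
    return rows

rows = gen_rows(100)
assert len(rows) == 100
assert all(AMOUNT_MIN <= r["amount"] <= AMOUNT_MAX for r in rows)
```

      Seeding the generator is the important design choice: a failing run can then be reproduced with exactly the same data.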
    2. Test Case Design & Execution
      1. ETL Testing:
        • Validate Data Extraction Logic
        • Validate Data Transformation Logic (including testing of Dimensional Model – Facts, Dimensions, Views etc)
        • Validate Data Loading
        • Some data warehouses may overwrite existing information with cumulative, updated data every week, while other DW (or even other parts of the same DW) may add new data in an incremental form, for example, hourly.
        • Data Validation
        • Test end to end data flow from source to mart to reports (including calculation logics and business rules)
        • Data Quality Validation
        • Check for accuracy, completeness (missing data, invalid data) and inconsistencies.
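      A common way to validate transformation logic is to recompute the rule independently of the ETL code and diff against what was loaded. A sketch with an invented trim-and-uppercase rule and hypothetical source/target rows:

```python
def transform(source_row):
    """Independent re-implementation of the expected business rule."""
    return {"id": source_row["id"], "name": source_row["name"].strip().upper()}

source = [{"id": 1, "name": " acme "}, {"id": 2, "name": "globex"}]
target = [{"id": 1, "name": "ACME"},  {"id": 2, "name": "GLOBEX"}]  # as loaded

expected = [transform(r) for r in source]
mismatches = [(e, t) for e, t in zip(expected, target) if e != t]
assert not mismatches, f"transformation drift: {mismatches}"
```

      Because the oracle is written separately from the ETL code, a shared misunderstanding of the rule is the main remaining blind spot; reviewing the oracle against the STM closes it.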
      2. OLAP & Cube Testing:
        • Check whether the data from the data warehouse/data mart is mapped & designed correctly in the OLAP Cube or reports.
        • Validate all the measures and measure groups (including derived measures, aggregations)
        • Validate all the dimensions (including Slowly Changing Dimension), attribute hierarchy etc
        • Many OLAP tools provide on-the-fly computation features and support for customized SQL, both of which can be error-prone.
      3. Reports Testing (Drill Down/Drill Through)
        • Verification of the layout format against the design mock-up: style sheets, prompts and filters, attributes and metrics on the report.
        • Verification of drilling, sorting and export functions of the reports in the Web environment.
        • Verification of reports containing derived metrics (Special focus should be paid to any subtotals or aggregates)
        • Reports with "non-aggregate-able" metrics (e.g., inventory at hand) also need special attention to the subtotal row. It should not, for example, add up the inventory for each week and show the inventory of the month.
        • The test team should target the lowest granularity that is present in the data warehouse if it is higher than the transaction grain at the OLTP.
        • Understand each report & the linkages of every field displayed in the report with the star schema and trace its origin back to the source System.
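      The inventory rule above is the classic semi-additive case: the measure must not be summed across time, and the monthly subtotal is the closing weekly value. A minimal sketch with invented weekly figures:

```python
# Weekly inventory-at-hand, in chronological order.
weekly_inventory = {"2015-W09": 120, "2015-W10": 95,
                    "2015-W11": 110, "2015-W12": 100}

wrong_subtotal = sum(weekly_inventory.values())            # 425 -- misleading
month_end_inventory = list(weekly_inventory.values())[-1]  # 100 -- correct

print(wrong_subtotal, month_end_inventory)  # 425 100
```

      A report test for such metrics asserts the subtotal row equals the last (or an average) period value, never the sum.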
  • Original post: https://www.cnblogs.com/stay-sober/p/4387322.html