• The implementation of work_mem and of temporary (spill) files in PostgreSQL


      In PostgreSQL, when work_mem is insufficient (for materialization, hash joins, and certain table functions such as generate_series), the code that spills to temporary files lives in src/backend/utils/sort/tuplestore.c. It is essentially a simplified version of tuplesort.c (which uses logtape.c to implement multi-way merge support), and src/backend/utils/sort/sharedtuplestore.c is its parallel-aware counterpart.
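    To make the calling pattern concrete, below is a rough sketch of how executor code typically feeds and re-reads a tuplestore that may spill to a temporary file. It relies only on the API declared in src/include/utils/tuplestore.h; the function materialize_and_rescan and its arguments are invented for illustration, and this is not a standalone program:

    #include "postgres.h"

    #include "executor/executor.h"
    #include "miscadmin.h"           /* work_mem */
    #include "utils/tuplestore.h"

    static void
    materialize_and_rescan(PlanState *sourcenode, TupleTableSlot *outslot)
    {
        /* randomAccess = true keeps backward scans possible; interXact = false
         * ties the temp file to the current transaction; work_mem (in KB)
         * caps the in-memory footprint */
        Tuplestorestate *store = tuplestore_begin_heap(true, false, work_mem);

        for (;;)
        {
            TupleTableSlot *slot = ExecProcNode(sourcenode);

            if (TupIsNull(slot))
                break;
            /* once memory use exceeds work_mem, the store switches to
             * TSS_WRITEFILE and further tuples go to a BufFile temp file */
            tuplestore_puttupleslot(store, slot);
        }

        /* read everything back, transparently from memory or the temp file */
        while (tuplestore_gettupleslot(store, true, false, outslot))
        {
            /* ... consume outslot ... */
        }

        tuplestore_end(store);   /* frees memory and removes the temp file */
    }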

    The process of creating the temporary file and writing tuples into it:

    static void
    tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
    {
        TSReadPointer *readptr;
        int            i;
        ResourceOwner oldowner;
    
        state->tuples++;
    
        switch (state->status)
        {
            case TSS_INMEM:   // initial state for the first writes; the status later switches to TSS_WRITEFILE
    
                /*
                 * Update read pointers as needed; see API spec above.
                 */
                readptr = state->readptrs;
                for (i = 0; i < state->readptrcount; readptr++, i++)
                {
                    if (readptr->eof_reached && i != state->activeptr)
                    {
                        readptr->eof_reached = false;
                        readptr->current = state->memtupcount;
                    }
                }
    
                /*
                 * Grow the array as needed.  Note that we try to grow the array
                 * when there is still one free slot remaining --- if we fail,
                 * there'll still be room to store the incoming tuple, and then
                 * we'll switch to tape-based operation.
                 */
                if (state->memtupcount >= state->memtupsize - 1)
                {
                    // Grow the array if the tuple count has reached its allocated
                    // size (default 1024, initialized in tuplesort_begin_common);
                    // it will not grow beyond roughly work_mem/sizeof(MinTuple).
                    (void) grow_memtuples(state);
                    Assert(state->memtupcount < state->memtupsize);
                }

                /* Stash the tuple in the in-memory array */
                state->memtuples[state->memtupcount++] = tuple;

                /*
                 * Done if we still fit in available memory and have array slots.
                 */
                if (state->memtupcount < state->memtupsize && !LACKMEM(state))
                    return;

                /*
                 * Nope; time to switch to tape-based operation. Make sure that
                 * the temp file(s) are created in suitable temp tablespaces.
                 */
                PrepareTempTablespaces();

                /* associate the file with the store's resource owner */
                oldowner = CurrentResourceOwner;
                CurrentResourceOwner = state->resowner;

                // create the temp file; the lightdb variant adds extra debug hooks
                state->myfile = lt_BufFileCreateTempForWorkMemory(state->interXact);

                CurrentResourceOwner = oldowner;

                /*
                 * Freeze the decision about whether trailing length words will be
                 * used. We can't change this choice once data is on tape, even
                 * though callers might drop the requirement.
                 */
                state->backward = (state->eflags & EXEC_FLAG_BACKWARD) != 0;

                state->status = TSS_WRITEFILE;

                // write the tuples out to the file; internally the in-memory
                // tuple pointers are converted into file offsets
                dumptuples(state);
                break;

            case TSS_WRITEFILE:   // subsequent tuples are written here

    Writing tuples to the temporary file (call stack captured in gdb):

    >    writetup_heap
         dumptuples
         tuplestore_puttuple_common
         tuplestore_putvalues
         ExecMakeTableFunctionResult    // writes the tuples in a loop
         FunctionNext                   // the first access to the table function triggers materialization
         ExecScanFetch
         ExecScan                       // performs filtering and projection
         ExecScanFetch
         ExecScan
         ExecProcNode
         ExecModifyTable
         ExecProcNode
         ExecutePlan
         standard_ExecutorRun
         pgss_ExecutorRun
         ltsa_ExecutorRun
         pgsp_ExecutorRun
         ProcessQuery
         PortalRunMulti
         PortalRun
         exec_simple_query
         PostgresMain
         BackendRun
         BackendStartup
         ServerLoop
         PostmasterMain
         main

    However, the actual records do not quite line up with what is stored in the temporary file 137609_ltsql_tmp137609.rar. For example:

    zjh@postgres=# insert into t select id,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' from generate_series(1,200000) id;
    INSERT 0 200000

    The id column can be matched up (4 bytes), but the 36 'a' characters cannot: only 10 bytes are used for storage, so it does not look like a one-to-one copy of the row. Internally the standard data type machinery still applies, so the string 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' is stored as a varattrib_4b, i.e. in compressed (packed) form.

    In fact, what the records store is the output of generate_series, i.e. a single column, not the SELECT target list, so only the integer is stored; besides the 4 bytes of data itself there are another 10 bytes of overhead (whereas the MinimalTuple header is 16 bytes: (unsigned long)(&(((MinimalTuple)0)->t_bits)) = 15 + 1).

    In the MinimalTuple, the value of the t_hoff field does not look right (presumably because, for minimal tuples, heap_form_minimal_tuple sets t_hoff to hoff + MINIMAL_TUPLE_OFFSET, i.e. as if the tuple still carried a full HeapTupleHeader).

        uint8        t_hoff;            /* sizeof header incl. bitmap, padding */
    t_hoff  uint8  1 byte  -- offset to user data

    MINIMAL_TUPLE_DATA_OFFSET == (unsigned long)&((MinimalTuple) 0)->t_infomask2   ==  10
    
    char       *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
    unsigned int tupbodylen = tuple->t_len - MINIMAL_TUPLE_DATA_OFFSET;  == 10
    unsigned int tuplen = tupbodylen + sizeof(int);  == 14
    
    BufFileWrite(state->myfile, (void *) &tuplen, sizeof(tuplen));   === writes the 4-byte total length word, value 14
    BufFileWrite(state->myfile, (void *) tupbody, tupbodylen);       === writes 10 bytes: the 6-byte tail of the MinimalTuple header plus the value (an int4 in this example)

    So the total is exactly 14 bytes per tuple.
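    A minimal standalone check of this arithmetic, using a local copy of the MinimalTupleData layout from htup_details.h (assuming a 64-bit build where MAXIMUM_ALIGNOF is 8, so mt_padding is 6 bytes):

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* local mirror of MinimalTupleData, for illustration only */
    typedef struct MinimalTupleData
    {
        uint32_t    t_len;          /* actual length of minimal tuple */
        char        mt_padding[6];  /* MINIMAL_TUPLE_PADDING */
        uint16_t    t_infomask2;    /* number of attributes + flags */
        uint16_t    t_infomask;     /* flag bits */
        uint8_t     t_hoff;         /* header size incl. bitmap, padding */
        uint8_t     t_bits[];       /* null bitmap, absent for NOT NULL rows */
    } MinimalTupleData;

    #define MAXALIGN(LEN)              (((LEN) + 7) & ~((size_t) 7))
    #define MINIMAL_TUPLE_DATA_OFFSET  offsetof(MinimalTupleData, t_infomask2)

    int
    main(void)
    {
        /* one NOT NULL int4 column: the generate_series output */
        size_t hoff       = MAXALIGN(offsetof(MinimalTupleData, t_bits)); /* 16 */
        size_t t_len      = hoff + 4;                                     /* 20 */
        size_t tupbodylen = t_len - MINIMAL_TUPLE_DATA_OFFSET;            /* 10 */
        size_t tuplen     = tupbodylen + sizeof(int);                     /* 14 */

        printf("MINIMAL_TUPLE_DATA_OFFSET = %zu\n", (size_t) MINIMAL_TUPLE_DATA_OFFSET); /* 10 */
        printf("bytes written per tuple   = %zu\n", tuplen);                             /* 14 */
        printf("200000 tuples             = %zu bytes\n", 200000 * tuplen);              /* 2800000 */
        return 0;
    }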

     Second, the size of the temporary file during execution does not match the size shown on disk after the run completes:

    [zjh@hs-10-20-30-193 keep]$ ll
    total 2728
    -rw------- 2 zjh zjh 2793472 Sep 11 15:51 137609_ltsql_tmp137609.10   -- during execution; the value equals file.curPos, i.e. the size written so far

    [zjh@hs-10-20-30-193 keep]$ ll
    total 2736
    -rw------- 1 zjh zjh 2800000 Sep 11 15:34 137609_ltsql_tmp137609.6  -- after completion

    Even after ExecMakeTableFunctionResult has finished and execution reaches the fetch in ExecScanFetch, the file size is still 2793472, although materialization is already complete. This is because in the PostgreSQL implementation flushing is done in units of BLCKSZ: a buffer dump only happens once a full BLCKSZ has accumulated. The trailing records do not fill a whole block (2800000 - 2793472 = 6528 bytes), so during execution the size is not exactly 2800000.
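    A quick standalone sketch of that flush arithmetic, assuming the default BLCKSZ of 8192:

    #include <stdio.h>

    #define BLCKSZ 8192   /* PostgreSQL's default block size */

    int
    main(void)
    {
        long logical = 2800000;                      /* 200000 tuples * 14 bytes */
        long flushed = (logical / BLCKSZ) * BLCKSZ;  /* only full blocks are dumped */

        printf("full blocks flushed : %ld bytes\n", flushed);           /* 341 * 8192 = 2793472 */
        printf("still in the buffer : %ld bytes\n", logical - flushed); /* 6528 */
        return 0;
    }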

    Reckoning with 2800000 bytes total and 14 bytes per row (i.e. the physical storage does not include the MinimalTuple header prefix), that is roughly 200000 rows; and judging from the file contents, 200000 rows do indeed take 2800000 bytes.

    With work_mem = 4MB, everything should seemingly fit in memory, so why is a disk file needed? Because in memory each tuple does include the MinimalTuple header and is not in compressed form; each row needs roughly 56 bytes (41 bytes of which are the data itself: 4 + 1 (short varlena header) + 36, stored in packed form as shown below), so only about 74897 rows fit, hence the temporary file.

        else if (VARLENA_ATT_IS_PACKABLE(att) &&
                 VARATT_CAN_MAKE_SHORT(val))
        {
            /* convert to short varlena -- no alignment */
            data_length = VARATT_CONVERTED_SHORT_SIZE(val);
            // VARATT_CONVERTED_SHORT_SIZE(val) == VARSIZE(val) - VARHDRSZ + VARHDRSZ_SHORT,
            // where VARSIZE(val) reads ((((varattrib_4b *) (val))->va_4byte.va_header >> 2) & 0x3FFFFFFF)
            SET_VARSIZE_SHORT(data, data_length);
            memcpy(data + 1, VARDATA(val), data_length - 1);
            // VARDATA(val) == ((varattrib_4b *) (val))->va_4byte.va_data
        }

    So what I am still curious about is: where does the compression actually show up? When tuplestore_gettupleslot reads the tuples back from the temporary file, the size is not 41 either.

    https://15721.courses.cs.cmu.edu/spring2020/slides/18-sortmergejoins.pdf

    https://www.cs.cornell.edu/courses/cs432/2003fa/slides/ExternalSorting.pdf

    https://wiki.postgresql.org/wiki/Parallel_External_Sort

  • Original post: https://www.cnblogs.com/zhjh256/p/16685139.html