In PostgreSQL, when work_mem is insufficient, spilling to temporary files for materialization and for table functions such as generate_series is implemented in src/backend/utils/sort/tuplestore.c (hash join batching spills through the same underlying BufFile layer). tuplestore.c is essentially a simplified version of tuplesort.c (which relies on logtape.c for its multi-way merge support), and src/backend/utils/sort/sharedtuplestore.c is the parallel-aware variant.
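For orientation, here is a minimal sketch of the tuplestore API that FunctionScan/Materialize drive. This is a hypothetical demo function, not code from the source tree; it assumes a backend extension context where the PostgreSQL internal headers and the work_mem GUC are available:

#include "postgres.h"
#include "miscadmin.h"			/* work_mem */
#include "utils/tuplestore.h"

/*
 * Feed 200000 int4 rows into a tuplestore capped at work_mem, then read
 * them back. Once the in-memory array exceeds the budget, the store
 * transparently switches to a BufFile temp file (TSS_WRITEFILE).
 */
static void
tuplestore_spill_demo(TupleDesc tupdesc, TupleTableSlot *slot)
{
	Datum		values[1];
	bool		isnull[1] = {false};
	/* randomAccess = false, interXact = false, budget = work_mem KB */
	Tuplestorestate *ts = tuplestore_begin_heap(false, false, work_mem);

	for (int i = 1; i <= 200000; i++)
	{
		values[0] = Int32GetDatum(i);
		tuplestore_putvalues(ts, tupdesc, values, isnull);
	}

	/* forward scan, no copy; tuples may now come back from the temp file */
	while (tuplestore_gettupleslot(ts, true, false, slot))
		;						/* process the slot */

	tuplestore_end(ts);			/* releases memory and deletes the temp file */
}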
The process of creating the temp file and writing tuples into it:
static void
tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
{
	TSReadPointer *readptr;
	int			i;
	ResourceOwner oldowner;

	state->tuples++;

	switch (state->status)
	{
		case TSS_INMEM:			/* initial state; switches to TSS_WRITEFILE below */

			/*
			 * Update read pointers as needed; see API spec above.
			 */
			readptr = state->readptrs;
			for (i = 0; i < state->readptrcount; readptr++, i++)
			{
				if (readptr->eof_reached && i != state->activeptr)
				{
					readptr->eof_reached = false;
					readptr->current = state->memtupcount;
				}
			}

			/*
			 * Grow the array as needed. Note that we try to grow the array
			 * when there is still one free slot remaining --- if we fail,
			 * there'll still be room to store the incoming tuple, and then
			 * we'll switch to tape-based operation.
			 */
			if (state->memtupcount >= state->memtupsize - 1)
			{
				/*
				 * If the tuple count exceeds the allocated array size
				 * (default 1024, initialized in tuplestore_begin_common),
				 * grow the array -- but it will never exceed roughly
				 * work_mem / sizeof(MinimalTuple) entries.
				 */
				(void) grow_memtuples(state);
				Assert(state->memtupcount < state->memtupsize);
			}

			/* Stash the tuple in the in-memory array */
			state->memtuples[state->memtupcount++] = tuple;

			/*
			 * Done if we still fit in available memory and have array slots.
			 */
			if (state->memtupcount < state->memtupsize && !LACKMEM(state))
				return;

			/*
			 * Nope; time to switch to tape-based operation. Make sure that
			 * the temp file(s) are created in suitable temp tablespaces.
			 */
			PrepareTempTablespaces();

			/* associate the file with the store's resource owner */
			oldowner = CurrentResourceOwner;
			CurrentResourceOwner = state->resowner;

			/*
			 * Create the temp file; the lt_ variant is a LightDB extension
			 * that adds extra debugging support.
			 */
			state->myfile = lt_BufFileCreateTempForWorkMemory(state->interXact);

			CurrentResourceOwner = oldowner;

			/*
			 * Freeze the decision about whether trailing length words will be
			 * used. We can't change this choice once data is on tape, even
			 * though callers might drop the requirement.
			 */
			state->backward = (state->eflags & EXEC_FLAG_BACKWARD) != 0;
			state->status = TSS_WRITEFILE;

			/*
			 * Dump the in-memory tuples to the file; internally, tuple
			 * pointer offsets become file offsets.
			 */
			dumptuples(state);
			break;

		case TSS_WRITEFILE:		/* subsequent tuples are written directly */
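The LACKMEM test that gates the switch is plain byte accounting. Below is a runnable standalone sketch reusing the same three macros as tuplestore.c; the 56-byte per-tuple cost is the rough estimate discussed later in this note (in the real code, GetMemoryChunkSpace(tuple) supplies the exact figure):

#include <stdio.h>

/* the three accounting macros, as defined in tuplestore.c */
#define LACKMEM(state)		((state)->availMem < 0)
#define USEMEM(state, amt)	((state)->availMem -= (amt))
#define FREEMEM(state, amt)	((state)->availMem += (amt))

typedef struct
{
	long		availMem;
	long		allowedMem;
} FakeState;

int
main(void)
{
	FakeState	st;
	long		per_tuple = 56;	/* rough chunk space per stored row */
	long		n = 0;

	/* tuplestore_begin_common: budget = maxKBytes (work_mem) in bytes */
	st.allowedMem = 4096 * 1024L;	/* work_mem = 4MB */
	st.availMem = st.allowedMem;

	/* each stored tuple is charged its palloc chunk size */
	while (!LACKMEM(&st))
	{
		USEMEM(&st, per_tuple);
		n++;
	}
	printf("would switch to TSS_WRITEFILE after ~%ld tuples\n", n);
	return 0;
}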
Writing the tuples to the temp file:
> writetup_heap
dumptuples
tuplestore_puttuple_common
tuplestore_putvalues
ExecMakeTableFunctionResult   // writes tuples in a loop
FunctionNext                  // the first access to the table function triggers materialization
ExecScanFetch
ExecScan                      // applies filter and projection
ExecScanFetch
ExecScan
ExecProcNode
ExecModifyTable
ExecProcNode
ExecutePlan
standard_ExecutorRun
pgss_ExecutorRun
ltsa_ExecutorRun
pgsp_ExecutorRun
ProcessQuery
PortalRunMulti
PortalRun
exec_simple_query
PostgresMain
BackendRun
BackendStartup
ServerLoop
PostmasterMain
main
However, the actual records do not line up byte-for-byte with what is stored in the temp file (137609_ltsql_tmp137609.rar). For example:
zjh@postgres=# insert into t select id,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' from generate_series(1,200000) id;
INSERT 0 200000
The id column matches up (4 bytes), but the 36 'a's do not: each record occupies only 10 bytes of storage, so it does not look like a one-to-one copy of the row. My first thought was that, since the tuplestore internally still follows the standard data type system (values keep their types), the string 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' is stored as a varattrib_4b varlena, i.e. in a compressed/packed format.
In fact, what is stored is the output of generate_series, i.e. a single column, not the SELECT target list, so only the integer is stored. Besides the 4 bytes of the value itself, there are 10 bytes of per-tuple overhead (whereas the MinimalTuple header is 16 bytes: (unsigned long) &(((MinimalTuple) 0)->t_bits) = 15, plus 1 byte of alignment padding = 16).
In the MinimalTuple, the value of the t_hoff field looks wrong at first glance. This is expected: heap_form_minimal_tuple sets t_hoff = hoff + MINIMAL_TUPLE_OFFSET, i.e. the offset the user data would have if the tuple still carried a full HeapTupleHeader, not its offset within the minimal tuple itself.
uint8 t_hoff; /* sizeof header incl. bitmap, padding */
t_hoff: uint8, 1 byte -- offset to user data
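A quick standalone check of what t_hoff should read as here (my own sketch; the constants are hard-coded for a 64-bit build with MAXIMUM_ALIGNOF = 8):

#include <stdio.h>

#define MAXALIGN(LEN)	(((unsigned long) (LEN) + 7) & ~((unsigned long) 7))
/* offsetof(MinimalTupleData, t_bits) */
#define SizeofMinimalTupleHeader	15
/* bytes a full HeapTupleHeader has before t_infomask2, MAXALIGN'd down */
#define MINIMAL_TUPLE_OFFSET		8

int
main(void)
{
	/* no NULLs in this example, so no null bitmap is added */
	unsigned long hoff = MAXALIGN(SizeofMinimalTupleHeader);	/* 16 */

	/* heap_form_minimal_tuple: tuple->t_hoff = hoff + MINIMAL_TUPLE_OFFSET */
	printf("t_hoff = %lu\n", hoff + MINIMAL_TUPLE_OFFSET);		/* 24 */
	return 0;
}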
MINIMAL_TUPLE_DATA_OFFSET == (unsigned long) &((MinimalTuple) 0)->t_infomask2 == 10

char *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
unsigned int tupbodylen = tuple->t_len - MINIMAL_TUPLE_DATA_OFFSET;   // == 10
unsigned int tuplen = tupbodylen + sizeof(int);                       // == 14

BufFileWrite(state->myfile, (void *) &tuplen, sizeof(tuplen));   // writes the 4-byte total length, value 14
BufFileWrite(state->myfile, (void *) tupbody, tupbodylen);       // writes the 10-byte body: the trailing 6 header bytes of the MinimalTuple + the value (an int4 here)
So the total adds up to exactly 14 bytes.
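To make the 4 + 6 + 4 layout concrete, here is a standalone sketch (mock byte values of my own, little-endian and MAXIMUM_ALIGNOF = 8 assumed) that rebuilds a MinimalTuple from one such 14-byte record the way readtup_heap does:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* offsetof(MinimalTupleData, t_infomask2): the first 10 bytes of a
 * MinimalTuple (t_len + alignment padding) are never written to disk */
#define MINIMAL_TUPLE_DATA_OFFSET	10
#define MINIMAL_TUPLE_OFFSET		8

int
main(void)
{
	/* one 14-byte record as it appears in the temp file */
	uint32_t	tuplen = 14;			/* the leading length word */
	uint8_t		body[10] = {
		0x01, 0x00,						/* t_infomask2: natts = 1 */
		0x00, 0x00,						/* t_infomask: no flags */
		0x18,							/* t_hoff = 24 (full-header offset) */
		0x00,							/* padding up to MAXALIGN(15) = 16 */
		0x2a, 0x00, 0x00, 0x00			/* the int4 payload, here 42 */
	};

	/* readtup_heap: strip the length word, re-add the unstored prefix */
	uint32_t	tupbodylen = tuplen - sizeof(uint32_t);			/* 10 */
	uint32_t	t_len = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;	/* 20 */
	uint8_t	   *tuple = calloc(1, t_len);
	int32_t		value;

	memcpy(tuple, &t_len, sizeof(t_len));						/* t_len */
	memcpy(tuple + MINIMAL_TUPLE_DATA_OFFSET, body, tupbodylen);

	/* data lives at t_hoff - MINIMAL_TUPLE_OFFSET = 16 inside the tuple */
	memcpy(&value, tuple + 0x18 - MINIMAL_TUPLE_OFFSET, sizeof(value));
	printf("t_len = %u, value = %d\n", t_len, value);
	free(tuple);
	return 0;
}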
Second, the temp file size observed while the query is running does not match the size shown on disk after it completes:
[zjh@hs-10-20-30-193 keep]$ ll
total 2728
-rw------- 2 zjh zjh 2793472 Sep 11 15:51 137609_ltsql_tmp137609.10  -- during execution; this is the value of file.curPos, i.e. the actual size written so far.
[zjh@hs-10-20-30-193 keep]$ ll
total 2736
-rw------- 1 zjh zjh 2800000 Sep 11 15:34 137609_ltsql_tmp137609.6  -- after completion
Even after ExecMakeTableFunctionResult has finished and execution has reached the ExecScanFetch fetch function, the size is still 2793472, although materialization is already complete. This is because in the PostgreSQL implementation, BufFile flushes in BLCKSZ units: the buffer is only dumped once a full BLCKSZ has accumulated. The last batch of records does not fill a whole block (2800000 - 2793472 = 6528 bytes), so during execution the on-disk size is not the exact 2800000.
==== At 2800000 bytes total and 14 bytes per row (i.e. the physical storage does not include the leading part of the MinimalTuple header), that is about 200000 rows; and from the file contents, 200000 rows do indeed require exactly 2800000 bytes.
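The two listings are consistent once the BLCKSZ-aligned flushing is accounted for; a quick standalone check of the arithmetic:

#include <stdio.h>

int
main(void)
{
	long	rows = 200000;
	long	rowlen = 14;		/* 4-byte length word + 10-byte body */
	long	blcksz = 8192;		/* PostgreSQL's default BLCKSZ */
	long	total = rows * rowlen;					/* 2800000 */
	long	flushed = (total / blcksz) * blcksz;	/* 341 full blocks */

	/* only whole blocks hit disk while the query runs; the tail stays
	 * in the BufFile buffer until the file is flushed or closed */
	printf("total = %ld, flushed = %ld, buffered tail = %ld\n",
		   total, flushed, total - flushed);	/* 2800000, 2793472, 6528 */
	return 0;
}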
With work_mem = 4MB, this should fit in memory, so why is a disk file needed at all? Because in memory each tuple keeps its MinimalTuple header and is not in the trimmed on-disk format; each row needs roughly 56 bytes (the data itself being 41 bytes = 4 + 1 (short varlena header) + 36, stored in packed form, as the code below shows), so only about 74897 rows fit, hence the temp file.
else if (VARLENA_ATT_IS_PACKABLE(att) &&
         VARATT_CAN_MAKE_SHORT(val))
{
    /* convert to short varlena -- no alignment */
    data_length = VARATT_CONVERTED_SHORT_SIZE(val);
    // VARATT_CONVERTED_SHORT_SIZE(val) == VARSIZE(val) - VARHDRSZ + VARHDRSZ_SHORT,
    // where VARSIZE(val) == ((((varattrib_4b *) (val))->va_4byte.va_header >> 2) & 0x3FFFFFFF)
    SET_VARSIZE_SHORT(data, data_length);
    memcpy(data + 1, VARDATA(val), data_length - 1);
    // VARDATA(val) == ((varattrib_4b *) (val))->va_4byte.va_data
}
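Putting numbers on the estimate above (standalone sketch; the 56-byte per-row figure is the rough in-memory cost cited above, covering data plus header plus allocator overhead):

#include <stdio.h>

int
main(void)
{
	long	VARHDRSZ = 4;				/* 4-byte varlena header */
	long	VARHDRSZ_SHORT = 1;			/* 1-byte short varlena header */
	long	varsize = VARHDRSZ + 36;	/* 4B-header form of the 36 a's: 40 */
	/* VARATT_CONVERTED_SHORT_SIZE: drop the 4B header, add the 1B one */
	long	short_size = varsize - VARHDRSZ + VARHDRSZ_SHORT;	/* 37 */
	long	data_len = 4 + short_size;	/* 41 = int4 + packed text */
	long	per_row = 56;				/* data + header + palloc overhead */
	long	work_mem = 4L * 1024 * 1024;

	/* about 75k rows, matching the ~74897 estimate above */
	printf("packed varlena = %ld, row data = %ld, rows fitting = %ld\n",
		   short_size, data_len, work_mem / per_row);
	return 0;
}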
So what I am still curious about is where the "compression" actually shows up: when tuplestore_gettupleslot reads tuples back from the temp file, the size is not 41 either.