• postgresql 发生 oom 的分析之一


    os:centos 7.4
    postgresql:10.4

    第一个窗口
    创建表,向表中插入大量数据

    $ psql
    psql (10.4)
    Type "help" for help.
    
    postgres=# create table test01(id integer, val char(1000)); 
    CREATE TABLE
    postgres=# insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));
    
    

    第二个窗口,观察os层面postgres用户的进程内存占用

    # watch "ps -aux |head -1;ps -aux|grep -i postgres;free -m"
    
    Every 2.0s: ps -aux |grep -i postgres                                                                                        Mon Aug 20 23:20:43 2018
    
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root      1570  0.0  0.0 220660    96 pts/0    S    22:53   0:00 su - postgres
    postgres  1571  0.0  0.0 116720    68 pts/0    S    22:53   0:00 -bash
    postgres  1629  0.0  1.3 729832 27040 pts/0    S    22:53   0:00 /usr/pgsql-10/bin/postgres -D /var/lib/pgsql/10/data
    postgres  1630  0.0  0.0 170624   212 ?        Ss   22:53   0:00 postgres: peiyb: logger process
    postgres  1632  0.0 18.2 730124 372980 ?       Ss   22:53   0:00 postgres: peiyb: checkpointer process
    postgres  1633  0.0  0.1 729968  2604 ?        Ss   22:53   0:00 postgres: peiyb: writer process
    postgres  1634  0.0  0.8 729832 16592 ?        Ds   22:53   0:00 postgres: peiyb: wal writer process
    postgres  1635  0.0  0.0 730396   540 ?        Ss   22:53   0:00 postgres: peiyb: autovacuum launcher process
    postgres  1636  0.0  0.0 172884   272 ?        Ss   22:53   0:00 postgres: peiyb: stats collector process
    postgres  1637  0.0  0.0 730252   240 ?        Ss   22:53   0:00 postgres: peiyb: bgworker: logical replication launcher
    postgres  1649  0.0  0.0 156728   136 pts/0    S+   22:54   0:00 ./bin/psql
    postgres  1650  0.4 90.5 2051040 1854772 ?     Ss   22:54   0:07 postgres: peiyb: postgres postgres [local] INSERT
    root      2178  0.0  0.0 126428   480 pts/1    S+   23:04   0:00 vi postgresql.conf
    root      3169  0.0  0.0 153272  1276 pts/2    S+   23:17   0:00 watch ps -aux |grep -i postgres
    root      3718  0.0  0.0 153268   752 pts/2    S+   23:20   0:00 watch ps -aux |grep -i postgres
    root      3720  0.0  0.0 113176  1212 pts/2    S+   23:20   0:00 sh -c ps -aux |grep -i postgres
    root      3722  0.0  0.0 112708   952 pts/2    R+   23:20   0:00 grep -i postgres
                  total        used        free  shared  buff/cache   available
    Mem:           2000        1401          15         506         583          42
    Swap:          2047        2047           0

    注意 1650 进程。

    当swap用尽时,第一个窗口就出现错误

    server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    The connection to the server was lost. Attempting reset: Failed.
    !> 

    查看postgresql 日志

    2018-08-20 23:19:47.362 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,28,"idle",2018-08-20 22:54:02 CST,3/15,0,LOG,00000,"statement: create table test01(id integer, val char(1000));",,,,,,,,,"psql"
    2018-08-20 23:19:49.074 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,29,"CREATE TABLE",2018-08-20 22:54:02 CST,3/0,0,LOG,00000,"duration: 1711.734 ms",,,,,,,,,"psql"
    2018-08-20 23:20:00.404 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,30,"idle",2018-08-20 22:54:02 CST,3/16,0,LOG,00000,"statement: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,,,,,,,"psql"
    2018-08-20 23:20:18.603 CST,,,1632,,5b7ad5fd.660,5,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
    2018-08-20 23:20:34.211 CST,,,1632,,5b7ad5fd.660,6,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 23014 buffers (70.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=13.311 s, sync=1.077 s, total=17.344 s; sync files=22, longest=0.250 s, average=0.029 s; distance=366925 kB, estimate=366925 kB",,,,,,,,,""
    2018-08-20 23:20:50.308 CST,,,1632,,5b7ad5fd.660,7,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
    2018-08-20 23:21:31.226 CST,,,1632,,5b7ad5fd.660,8,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 24394 buffers (74.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=38.683 s, sync=0.932 s, total=40.819 s; sync files=2, longest=0.894 s, average=0.466 s; distance=389308 kB, estimate=389308 kB",,,,,,,,,""
    2018-08-20 23:21:57.251 CST,,,1632,,5b7ad5fd.660,9,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
    2018-08-20 23:22:57.569 CST,,,1632,,5b7ad5fd.660,10,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 23210 buffers (70.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=57.202 s, sync=0.265 s, total=62.142 s; sync files=2, longest=0.208 s, average=0.132 s; distance=396971 kB, estimate=396971 kB",,,,,,,,,""
    2018-08-20 23:23:30.360 CST,,,1632,,5b7ad5fd.660,11,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
    2018-08-20 23:23:55.686 CST,,,1629,,5b7ad5fd.65d,3,,2018-08-20 22:53:49 CST,,0,LOG,00000,"server process (PID 1650) was terminated by signal 9: Killed","Failed process was running: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,,,,,,""
    2018-08-20 23:23:55.752 CST,,,1629,,5b7ad5fd.65d,4,,2018-08-20 22:53:49 CST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
    2018-08-20 23:23:56.236 CST,,,1635,,5b7ad5fd.663,1,,2018-08-20 22:53:49 CST,2/0,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
    2018-08-20 23:23:56.969 CST,,,4102,,5b7add0b.1006,1,,2018-08-20 23:23:55 CST,,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
    2018-08-20 23:23:56.970 CST,,,4105,"[local]",5b7add0c.1009,1,"",2018-08-20 23:23:56 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
    2018-08-20 23:23:56.978 CST,"postgres","postgres",4105,"[local]",5b7add0c.1009,2,"",2018-08-20 23:23:56 CST,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,""
    2018-08-20 23:23:56.978 CST,,,1629,,5b7ad5fd.65d,5,,2018-08-20 22:53:49 CST,,0,LOG,00000,"all server processes terminated; reinitializing",,,,,,,,,""
    2018-08-20 23:23:57.120 CST,,,4106,,5b7add0d.100a,1,,2018-08-20 23:23:57 CST,,0,LOG,00000,"database system was interrupted; last known up at 2018-08-20 23:22:57 CST",,,,,,,,,""
    2018-08-20 23:23:58.183 CST,,,4106,,5b7add0d.100a,2,,2018-08-20 23:23:57 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
    2018-08-20 23:23:58.202 CST,,,4106,,5b7add0d.100a,3,,2018-08-20 23:23:57 CST,,0,LOG,00000,"redo starts at 0/A51158C0",,,,,,,,,""
    2018-08-20 23:24:08.049 CST,,,4106,,5b7add0d.100a,4,,2018-08-20 23:23:57 CST,,0,LOG,00000,"redo done at 0/C67FB9F8",,,,,,,,,""
    2018-08-20 23:24:08.062 CST,,,4106,,5b7add0d.100a,5,,2018-08-20 23:23:57 CST,,0,LOG,00000,"checkpoint starting: end-of-recovery immediate",,,,,,,,,""
    2018-08-20 23:24:17.215 CST,,,4106,,5b7add0d.100a,6,,2018-08-20 23:23:57 CST,,0,LOG,00000,"checkpoint complete: wrote 32768 buffers (100.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=6.935 s, sync=2.161 s, total=9.154 s; sync files=2, longest=2.160 s, average=1.080 s; distance=547737 kB, estimate=547737 kB",,,,,,,,,""
    2018-08-20 23:24:17.246 CST,,,1629,,5b7ad5fd.65d,6,,2018-08-20 22:53:49 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""

    注意这条日志
    2018-08-20 23:23:55.686 CST,,,1629,,5b7ad5fd.65d,3,,2018-08-20 22:53:49 CST,,0,LOG,00000,”server process (PID 1650) was terminated by signal 9: Killed“,”Failed process was running: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));”,,,,,,,,”“

    查看os日志

    # dmesg
    
    [ 1930.514760] Out of memory: Kill process 1650 (postgres) score 920 or sacrifice child
    [ 1930.514762] Killed process 1650 (postgres) total-vm:3993492kB, anon-rss:1323268kB, file-rss:588kB, shmem-rss:545252kB

    1650 进程确实是被os 的OOM killer干掉了。
    查看这几个参数

    vm.overcommit_memory = 0
    vm.overcommit_ratio = 50
    vm.swappiness = 30

    试着调整为

    vm.overcommit_memory = 2
    vm.overcommit_ratio = 80
    vm.swappiness = 1

    插入数据时就提示了错误,没有被os oom killer:

    postgres=# insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));
    ERROR:  out of memory
    DETAIL:  Failed on request of size 1028.

    对应的log

    2018-08-21 22:52:41.093 CST,"postgres","postgres",2256,"[local]",5b7c26ed.8d0,8,"INSERT",2018-08-21 22:51:25 CST,3/4,647,ERROR,53200,"out of memory","Failed on request of size 1028.",,,,,"insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,"psql"

    申请的报错的 1028 受这两个参数的影响:

    # cat /proc/meminfo |grep -i commit
    
    CommitLimit:     5334540 kB
    Committed_AS:    3366504 kB

    CommitLimit:最大能分配的内存(个人理解仅仅在vm.overcommit_memory=2时候生效),具体的值是
    SWAP内存大小 + 物理内存 * overcommit_ratio / 100

    Committed_AS:当前已经分配的内存大小

    调整 vm.overcommit_ratio = 90

    # cat /proc/meminfo |grep -i commit
    CommitLimit:     5739216 kB
    Committed_AS:    3397904 kB

    从这个上面来看,貌似 postgresql 如果碰到大事务的数据库,session 级别发生 oom 是必然。

    于是上网搜了搜,貌似可以使用 cgroups来控制内存使用。下一篇blog试下。

  • 相关阅读:
    Ui设计哪里有好的素材
    android重写view和viewgroup的区别
    笔记本电脑连接wifi有时候会自动断网提示有限的访问权限解决办法
    ADT 怎么删除logcat过滤规则
    Android开发在使用第三方推送的时候出现INSTALL_FAILED_VERSION_DOWNGRADE
    评论的延迟加载(转载)
    让Entity Framework支持MySql数据库(转载)
    Net4.0---AspNet中URL重写的改进(转载)
    Entity Framework 使用
    深入浅出JSONP--解决ajax跨域问题
  • 原文地址:https://www.cnblogs.com/ctypyb2002/p/9792898.html
Copyright © 2020-2023  润新知