• DBA不可不知的操作系统内核参数


    数据库关心的OS内核参数

    512GB 内存为例

    1.

    参数

    fs.aio-max-nr  
    

    支持系统

    CentOS 6, 7       
    

    参数解释

    aio-nr & aio-max-nr:    
    .  
    aio-nr is the running total of the number of events specified on the    
    io_setup system call for all currently active aio contexts.    
    .  
    If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN.    
    .  
    Note that raising aio-max-nr does not result in the pre-allocation or re-sizing    
    of any kernel data structures.    
    .  
    aio-nr & aio-max-nr:    
    .  
    aio-nr shows the current system-wide number of asynchronous io requests.    
    .  
    aio-max-nr allows you to change the maximum value aio-nr can grow to.    
    

    推荐设置

    fs.aio-max-nr = 1xxxxxx  
    .  
    PostgreSQL, Greenplum 均未使用io_setup创建aio contexts. 无需设置。    
    如果Oracle数据库,要使用aio的话,需要设置它。    
    设置它也没什么坏处,如果将来需要适应异步IO,可以不需要重新修改这个设置。   
    

    2.

    参数

    fs.file-max  
    

    支持系统

    CentOS 6, 7       
    

    参数解释

    file-max & file-nr:    
    .  
    The value in file-max denotes the maximum number of file handles that the Linux kernel will allocate.   
    .  
    When you get lots of error messages about running out of file handles,   
    you might want to increase this limit.    
    .  
    Historically, the kernel was able to allocate file handles dynamically,   
    but not to free them again.     
    .  
    The three values in file-nr denote :      
    the number of allocated file handles ,     
    the number of allocated but unused file handles ,     
    the maximum number of file handles.     
    .  
    Linux 2.6 always reports 0 as the number of free    
    file handles -- this is not an error, it just means that the    
    number of allocated file handles exactly matches the number of    
    used file handles.    
    .  
    Attempts to allocate more file descriptors than file-max are reported with printk,   
    look for "VFS: file-max limit <number> reached".    
    

    推荐设置

    fs.file-max = 7xxxxxxx  
    .  
    PostgreSQL 有一套自己管理的VFS,真正打开的FD与内核管理的文件打开关闭有一套映射的机制,所以真实情况不需要使用那么多的file handlers。     
    max_files_per_process 参数。     
    假设1GB内存支撑100个连接,每个连接打开1000个文件,那么一个PG实例需要打开10万个文件,一台机器按512G内存来算可以跑500个PG实例,则需要5000万个file handler。     
    以上设置绰绰有余。     
    

    3.

    参数

    kernel.core_pattern  
    

    支持系统

    CentOS 6, 7       
    

    参数解释

    core_pattern:    
    .  
    core_pattern is used to specify a core dumpfile pattern name.    
    . max length 128 characters; default value is "core"    
    . core_pattern is used as a pattern template for the output filename;    
      certain string patterns (beginning with '%') are substituted with    
      their actual values.    
    . backward compatibility with core_uses_pid:    
            If core_pattern does not include "%p" (default does not)    
            and core_uses_pid is set, then .PID will be appended to    
            the filename.    
    . corename format specifiers:    
            %<NUL>  '%' is dropped    
            %%      output one '%'    
            %p      pid    
            %P      global pid (init PID namespace)    
            %i      tid    
            %I      global tid (init PID namespace)    
            %u      uid    
            %g      gid    
            %d      dump mode, matches PR_SET_DUMPABLE and    
                    /proc/sys/fs/suid_dumpable    
            %s      signal number    
            %t      UNIX time of dump    
            %h      hostname    
            %e      executable filename (may be shortened)    
            %E      executable path    
            %<OTHER> both are dropped    
    . If the first character of the pattern is a '|', the kernel will treat    
      the rest of the pattern as a command to run.  The core dump will be    
      written to the standard input of that program instead of to a file.    
    

    推荐设置

    kernel.core_pattern = /xxx/core_%e_%u_%t_%s.%p    
    .  
    这个目录要777的权限,如果它是个软链,则真实目录需要777的权限  
    mkdir /xxx  
    chmod 777 /xxx  
    留足够的空间  
    

    4.

    参数

    kernel.sem   
    

    支持系统

    CentOS 6, 7       
    

    参数解释

    kernel.sem = 4096 2147483647 2147483646 512000    
    .  
    4096 每组多少信号量 (>=17, PostgreSQL 每16个进程一组, 每组需要17个信号量) ,     
    2147483647 总共多少信号量 (2^31-1 , 且大于4096*512000 ) ,     
    2147483646 每个semop()调用支持多少操作 (2^31-1),     
    512000 多少组信号量 (假设每GB支持100个连接, 512GB支持51200个连接, 加上其他进程, > 51200*2/16 绰绰有余)     
    .  
    # sysctl -w kernel.sem="4096 2147483647 2147483646 512000"    
    .  
    # ipcs -s -l    
      ------ Semaphore Limits --------    
    max number of arrays = 512000    
    max semaphores per array = 4096    
    max semaphores system wide = 2147483647    
    max ops per semop call = 2147483646    
    semaphore max value = 32767    
    

    推荐设置

    kernel.sem = 4096 2147483647 2147483646 512000    
    .  
    4096可能能够适合更多的场景, 所以大点无妨,关键是512000 arrays也够了。    
    

    5.

    参数

    kernel.shmall = 107374182    
    kernel.shmmax = 274877906944    
    kernel.shmmni = 819200    
    

    支持系统

    CentOS 6, 7        
    

    参数解释

    假设主机内存 512GB    
    .  
    shmmax 单个共享内存段最大 256GB (主机内存的一半,单位字节)      
    shmall 所有共享内存段加起来最大 (主机内存的80%,单位PAGE)      
    shmmni 一共允许创建819200个共享内存段 (每个数据库启动需要2个共享内存段。  将来允许动态创建共享内存段,可能需求量更大)     
    .  
    # getconf PAGE_SIZE    
    4096    
    

    推荐设置

    kernel.shmall = 107374182    
    kernel.shmmax = 274877906944    
    kernel.shmmni = 819200    
    .  
    9.2以及以前的版本,数据库启动时,对共享内存段的内存需求非常大,需要考虑以下几点  
    Connections:	(1800 + 270 * max_locks_per_transaction) * max_connections  
    Autovacuum workers:	(1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers  
    Prepared transactions:	(770 + 270 * max_locks_per_transaction) * max_prepared_transactions  
    Shared disk buffers:	(block_size + 208) * shared_buffers  
    WAL buffers:	(wal_block_size + 8) * wal_buffers  
    Fixed space requirements:	770 kB  
    .  
    以上建议参数根据9.2以前的版本设置,后期的版本同样适用。  
    

    6.

    参数

    net.core.netdev_max_backlog  
    

    支持系统

    CentOS 6, 7     
    

    参数解释

    netdev_max_backlog    
      ------------------    
    Maximum number  of  packets,  queued  on  the  INPUT  side,    
    when the interface receives packets faster than kernel can process them.    
    

    推荐设置

    net.core.netdev_max_backlog=1xxxx    
    .  
    INPUT链表越长,处理耗费越大,如果用了iptables管理的话,需要加大这个值。    
    

    7.

    参数

    net.core.rmem_default  
    net.core.rmem_max  
    net.core.wmem_default  
    net.core.wmem_max  
    

    支持系统

    CentOS 6, 7     
    

    参数解释

    rmem_default    
      ------------    
    The default setting of the socket receive buffer in bytes.    
    .  
    rmem_max    
      --------    
    The maximum receive socket buffer size in bytes.    
    .  
    wmem_default    
      ------------    
    The default setting (in bytes) of the socket send buffer.    
    .  
    wmem_max    
      --------    
    The maximum send socket buffer size in bytes.    
    

    推荐设置

    net.core.rmem_default = 262144    
    net.core.rmem_max = 4194304    
    net.core.wmem_default = 262144    
    net.core.wmem_max = 4194304    
    

    8.

    参数

    net.core.somaxconn   
    

    支持系统

    CentOS 6, 7        
    

    参数解释

    somaxconn - INTEGER    
            Limit of socket listen() backlog, known in userspace as SOMAXCONN.    
            Defaults to 128.    
    	See also tcp_max_syn_backlog for additional tuning for TCP sockets.    
    

    推荐设置

    net.core.somaxconn=4xxx    
    

    9.

    参数

    net.ipv4.tcp_max_syn_backlog  
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    tcp_max_syn_backlog - INTEGER    
            Maximal number of remembered connection requests, which have not    
            received an acknowledgment from connecting client.    
            The minimal value is 128 for low memory machines, and it will    
            increase in proportion to the memory of machine.    
            If server suffers from overload, try increasing this number.    
    

    推荐设置

    net.ipv4.tcp_max_syn_backlog=4xxx    
    pgpool-II 使用了这个值,用于将超过num_init_child以外的连接queue。     
    所以这个值决定了有多少连接可以在队列里面等待。    
    

    10.

    参数

    net.ipv4.tcp_keepalive_intvl=20    
    net.ipv4.tcp_keepalive_probes=3    
    net.ipv4.tcp_keepalive_time=60     
    

    支持系统

    CentOS 6, 7        
    

    参数解释

    tcp_keepalive_time - INTEGER    
            How often TCP sends out keepalive messages when keepalive is enabled.    
            Default: 2hours.    
    .  
    tcp_keepalive_probes - INTEGER    
            How many keepalive probes TCP sends out, until it decides that the    
            connection is broken. Default value: 9.    
    .  
    tcp_keepalive_intvl - INTEGER    
            How frequently the probes are send out. Multiplied by    
            tcp_keepalive_probes it is time to kill not responding connection,    
            after probes started. Default value: 75sec i.e. connection    
            will be aborted after ~11 minutes of retries.    
    

    推荐设置

    net.ipv4.tcp_keepalive_intvl=20    
    net.ipv4.tcp_keepalive_probes=3    
    net.ipv4.tcp_keepalive_time=60    
    .  
    连接空闲60秒后, 每隔20秒发心跳包, 尝试3次心跳包没有响应,关闭连接。 从开始空闲,到关闭连接总共历时120秒。    
    

    11.

    参数

    net.ipv4.tcp_mem=8388608 12582912 16777216    
    

    支持系统

    CentOS 6, 7    
    

    参数解释

    tcp_mem - vector of 3 INTEGERs: min, pressure, max    
    单位 page    
            min: below this number of pages TCP is not bothered about its    
            memory appetite.    
    .  
            pressure: when amount of memory allocated by TCP exceeds this number    
            of pages, TCP moderates its memory consumption and enters memory    
            pressure mode, which is exited when memory consumption falls    
            under "min".    
    .  
            max: number of pages allowed for queueing by all TCP sockets.    
    .  
            Defaults are calculated at boot time from amount of available    
            memory.    
    64GB 内存,自动计算的值是这样的    
    net.ipv4.tcp_mem = 1539615      2052821 3079230    
    .  
    512GB 内存,自动计算得到的值是这样的    
    net.ipv4.tcp_mem = 49621632     66162176        99243264    
    .  
    这个参数让操作系统启动时自动计算,问题也不大  
    

    推荐设置

    net.ipv4.tcp_mem=8388608 12582912 16777216    
    .  
    这个参数让操作系统启动时自动计算,问题也不大  
    

    12.

    参数

    net.ipv4.tcp_fin_timeout  
    

    支持系统

    CentOS 6, 7        
    

    参数解释

    tcp_fin_timeout - INTEGER    
            The length of time an orphaned (no longer referenced by any    
            application) connection will remain in the FIN_WAIT_2 state    
            before it is aborted at the local end.  While a perfectly    
            valid "receive only" state for an un-orphaned connection, an    
            orphaned connection in FIN_WAIT_2 state could otherwise wait    
            forever for the remote to close its end of the connection.    
            Cf. tcp_max_orphans    
            Default: 60 seconds    
    

    推荐设置

    net.ipv4.tcp_fin_timeout=5    
    .  
    加快僵尸连接回收速度   
    

    13.

    参数

    net.ipv4.tcp_synack_retries  
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    tcp_synack_retries - INTEGER    
            Number of times SYNACKs for a passive TCP connection attempt will    
            be retransmitted. Should not be higher than 255. Default value    
            is 5, which corresponds to 31seconds till the last retransmission    
            with the current initial RTO of 1second. With this the final timeout    
            for a passive TCP connection will happen after 63seconds.    
    

    推荐设置

    net.ipv4.tcp_synack_retries=2    
    .  
    缩短tcp syncack超时时间  
    

    14.

    参数

    net.ipv4.tcp_syncookies  
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    tcp_syncookies - BOOLEAN    
            Only valid when the kernel was compiled with CONFIG_SYN_COOKIES    
            Send out syncookies when the syn backlog queue of a socket    
            overflows. This is to prevent against the common 'SYN flood attack'    
            Default: 1    
    .  
            Note, that syncookies is fallback facility.    
            It MUST NOT be used to help highly loaded servers to stand    
            against legal connection rate. If you see SYN flood warnings    
            in your logs, but investigation shows that they occur    
            because of overload with legal connections, you should tune    
            another parameters until this warning disappear.    
            See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.    
    .  
            syncookies seriously violate TCP protocol, do not allow    
            to use TCP extensions, can result in serious degradation    
            of some services (f.e. SMTP relaying), visible not by you,    
            but your clients and relays, contacting you. While you see    
            SYN flood warnings in logs not being really flooded, your server    
            is seriously misconfigured.    
    .  
            If you want to test which effects syncookies have to your    
            network connections you can set this knob to 2 to enable    
            unconditionally generation of syncookies.    
    

    推荐设置

    net.ipv4.tcp_syncookies=1    
    .  
    防止syn flood攻击   
    

    15.

    参数

    net.ipv4.tcp_timestamps  
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    tcp_timestamps - BOOLEAN    
            Enable timestamps as defined in RFC1323.    
    

    推荐设置

    net.ipv4.tcp_timestamps=1    
    .  
    tcp_timestamps 是 tcp 协议中的一个扩展项,通过时间戳的方式来检测过来的包以防止 PAWS(Protect Against Wrapped  Sequence numbers),可以提高 tcp 的性能。  
    

    16.

    参数

    net.ipv4.tcp_tw_recycle  
    net.ipv4.tcp_tw_reuse  
    net.ipv4.tcp_max_tw_buckets  
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    tcp_tw_recycle - BOOLEAN    
            Enable fast recycling TIME-WAIT sockets. Default value is 0.    
            It should not be changed without advice/request of technical    
            experts.    
    .  
    tcp_tw_reuse - BOOLEAN    
            Allow to reuse TIME-WAIT sockets for new connections when it is    
            safe from protocol viewpoint. Default value is 0.    
            It should not be changed without advice/request of technical    
            experts.    
    .  
    tcp_max_tw_buckets - INTEGER  
            Maximal number of timewait sockets held by system simultaneously.  
            If this number is exceeded time-wait socket is immediately destroyed  
            and warning is printed.   
    	This limit exists only to prevent simple DoS attacks,   
    	you _must_ not lower the limit artificially,   
            but rather increase it (probably, after increasing installed memory),    
            if network conditions require more than default value.   
    

    推荐设置

    net.ipv4.tcp_tw_recycle=0    
    net.ipv4.tcp_tw_reuse=1    
    net.ipv4.tcp_max_tw_buckets = 2xxxxx    
    .  
    net.ipv4.tcp_tw_recycle和net.ipv4.tcp_timestamps不建议同时开启    
    

    17.

    参数

    net.ipv4.tcp_rmem  
    net.ipv4.tcp_wmem  
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    tcp_wmem - vector of 3 INTEGERs: min, default, max    
            min: Amount of memory reserved for send buffers for TCP sockets.    
            Each TCP socket has rights to use it due to fact of its birth.    
            Default: 1 page    
    .  
            default: initial size of send buffer used by TCP sockets.  This    
            value overrides net.core.wmem_default used by other protocols.    
            It is usually lower than net.core.wmem_default.    
            Default: 16K    
    .  
            max: Maximal amount of memory allowed for automatically tuned    
            send buffers for TCP sockets. This value does not override    
            net.core.wmem_max.  Calling setsockopt() with SO_SNDBUF disables    
            automatic tuning of that socket's send buffer size, in which case    
            this value is ignored.    
            Default: between 64K and 4MB, depending on RAM size.    
    .  
    tcp_rmem - vector of 3 INTEGERs: min, default, max    
            min: Minimal size of receive buffer used by TCP sockets.    
            It is guaranteed to each TCP socket, even under moderate memory    
            pressure.    
            Default: 1 page    
    .  
            default: initial size of receive buffer used by TCP sockets.    
            This value overrides net.core.rmem_default used by other protocols.    
            Default: 87380 bytes. This value results in window of 65535 with    
            default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit    
            less for default tcp_app_win. See below about these variables.    
    .  
            max: maximal size of receive buffer allowed for automatically    
            selected receiver buffers for TCP socket. This value does not override    
            net.core.rmem_max.  Calling setsockopt() with SO_RCVBUF disables    
            automatic tuning of that socket's receive buffer size, in which    
            case this value is ignored.    
            Default: between 87380B and 6MB, depending on RAM size.    
    

    推荐设置

    net.ipv4.tcp_rmem=8192 87380 16777216    
    net.ipv4.tcp_wmem=8192 65536 16777216    
    .  
    许多数据库的推荐设置,提高网络性能  
    

    18.

    参数

    net.nf_conntrack_max  
    net.netfilter.nf_conntrack_max  
    

    支持系统

    CentOS 6    
    

    参数解释

    nf_conntrack_max - INTEGER    
            Size of connection tracking table.    
    	Default value is nf_conntrack_buckets value * 4.    
    

    推荐设置

    net.nf_conntrack_max=1xxxxxx    
    net.netfilter.nf_conntrack_max=1xxxxxx    
    

    19.

    参数

    vm.dirty_background_bytes   
    vm.dirty_expire_centisecs   
    vm.dirty_ratio   
    vm.dirty_writeback_centisecs   
    

    支持系统

    CentOS 6, 7        
    

    参数解释

    ==============================================================    
    .  
    dirty_background_bytes    
    .  
    Contains the amount of dirty memory at which the background kernel    
    flusher threads will start writeback.    
    .  
    Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only    
    one of them may be specified at a time. When one sysctl is written it is    
    immediately taken into account to evaluate the dirty memory limits and the    
    other appears as 0 when read.    
    .  
    ==============================================================    
    .  
    dirty_background_ratio    
    .  
    Contains, as a percentage of total system memory, the number of pages at which    
    the background kernel flusher threads will start writing out dirty data.    
    .  
    ==============================================================    
    .  
    dirty_bytes    
    .  
    Contains the amount of dirty memory at which a process generating disk writes    
    will itself start writeback.    
    .  
    Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be    
    specified at a time. When one sysctl is written it is immediately taken into    
    account to evaluate the dirty memory limits and the other appears as 0 when    
    read.    
    .  
    Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any    
    value lower than this limit will be ignored and the old configuration will be    
    retained.    
    .  
    ==============================================================    
    .  
    dirty_expire_centisecs    
    .  
    This tunable is used to define when dirty data is old enough to be eligible    
    for writeout by the kernel flusher threads.  It is expressed in 100'ths    
    of a second.  Data which has been dirty in-memory for longer than this    
    interval will be written out next time a flusher thread wakes up.    
    .  
    ==============================================================    
    .  
    dirty_ratio    
    .  
    Contains, as a percentage of total system memory, the number of pages at which    
    a process which is generating disk writes will itself start writing out dirty    
    data.    
    .  
    ==============================================================    
    .  
    dirty_writeback_centisecs    
    .  
    The kernel flusher threads will periodically wake up and write `old' data    
    out to disk.  This tunable expresses the interval between those wakeups, in    
    100'ths of a second.    
    .  
    Setting this to zero disables periodic writeback altogether.    
    .  
    ==============================================================    
    

    推荐设置

    vm.dirty_background_bytes = 4096000000    
    vm.dirty_expire_centisecs = 6000    
    vm.dirty_ratio = 80    
    vm.dirty_writeback_centisecs = 50    
    .  
    减少数据库进程刷脏页的频率,dirty_background_bytes根据实际IOPS能力以及内存大小设置    
    

    20.

    参数

    vm.extra_free_kbytes  
    

    支持系统

    CentOS 6    
    

    参数解释

    extra_free_kbytes    
    .  
    This parameter tells the VM to keep extra free memory   
    between the threshold where background reclaim (kswapd) kicks in,   
    and the threshold where direct reclaim (by allocating processes) kicks in.    
    .  
    This is useful for workloads that require low latency memory allocations    
    and have a bounded burstiness in memory allocations,   
    for example a realtime application that receives and transmits network traffic    
    (causing in-kernel memory allocations) with a maximum total message burst    
    size of 200MB may need 200MB of extra free memory to avoid direct reclaim    
    related latencies.    
    .  
    目标是尽量让后台进程回收内存,比用户进程提早多少kbytes回收,因此用户进程可以快速分配内存。    
    

    推荐设置

    vm.extra_free_kbytes=4xxxxxx    
    

    21.

    参数

    vm.min_free_kbytes  
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    min_free_kbytes:    
    .  
    This is used to force the Linux VM to keep a minimum number    
    of kilobytes free.  The VM uses this number to compute a    
    watermark[WMARK_MIN] value for each lowmem zone in the system.    
    Each lowmem zone gets a number of reserved free pages based    
    proportionally on its size.    
    .  
    Some minimal amount of memory is needed to satisfy PF_MEMALLOC    
    allocations; if you set this to lower than 1024KB, your system will    
    become subtly broken, and prone to deadlock under high loads.    
    .  
    Setting this too high will OOM your machine instantly.    
    

    推荐设置

    vm.min_free_kbytes = 2xxxxxx    
    .  
    防止在高负载时系统无响应,减少内存分配死锁概率。    
    

    22.

    参数

    vm.mmap_min_addr  
    

    支持系统

    CentOS 6, 7       
    

    参数解释

    mmap_min_addr    
    .  
    This file indicates the amount of address space  which a user process will    
    be restricted from mmapping.  Since kernel null dereference bugs could    
    accidentally operate based on the information in the first couple of pages    
    of memory userspace processes should not be allowed to write to them.  By    
    default this value is set to 0 and no protections will be enforced by the    
    security module.  Setting this value to something like 64k will allow the    
    vast majority of applications to work correctly and provide defense in depth    
    against future potential kernel bugs.    
    

    推荐设置

    vm.mmap_min_addr=6xxxx    
    .  
    防止内核隐藏的BUG导致的问题  
    

    23.

    参数

    vm.overcommit_memory   
    vm.overcommit_ratio   
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    ==============================================================    
    .  
    overcommit_kbytes:    
    .  
    When overcommit_memory is set to 2, the committed address space is not    
    permitted to exceed swap plus this amount of physical RAM. See below.    
    .  
    Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one    
    of them may be specified at a time. Setting one disables the other (which    
    then appears as 0 when read).    
    .  
    ==============================================================    
    .  
    overcommit_memory:    
    .  
    This value contains a flag that enables memory overcommitment.    
    .  
    When this flag is 0,   
    the kernel attempts to estimate the amount    
    of free memory left when userspace requests more memory.    
    .  
    When this flag is 1,   
    the kernel pretends there is always enough memory until it actually runs out.    
    .  
    When this flag is 2,   
    the kernel uses a "never overcommit"    
    policy that attempts to prevent any overcommit of memory.    
    Note that user_reserve_kbytes affects this policy.    
    .  
    This feature can be very useful because there are a lot of    
    programs that malloc() huge amounts of memory "just-in-case"    
    and don't use much of it.    
    .  
    The default value is 0.    
    .  
    See Documentation/vm/overcommit-accounting and    
    security/commoncap.c::cap_vm_enough_memory() for more information.    
    .  
    ==============================================================    
    .  
    overcommit_ratio:    
    .  
    When overcommit_memory is set to 2,   
    the committed address space is not permitted to exceed   
          swap + this percentage of physical RAM.    
    See above.    
    .  
    ==============================================================    
    

    推荐设置

    vm.overcommit_memory = 0    
    vm.overcommit_ratio = 90    
    .  
    vm.overcommit_memory = 0 时 vm.overcommit_ratio可以不设置   
    

    24.

    参数

    vm.swappiness   
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    swappiness    
    .  
    This control is used to define how aggressive the kernel will swap    
    memory pages.    
    Higher values will increase agressiveness, lower values    
    decrease the amount of swap.    
    .  
    The default value is 60.    
    

    推荐设置

    vm.swappiness = 0    
    

    25.

    参数

    vm.zone_reclaim_mode   
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    zone_reclaim_mode:    
    .  
    Zone_reclaim_mode allows someone to set more or less aggressive approaches to    
    reclaim memory when a zone runs out of memory. If it is set to zero then no    
    zone reclaim occurs. Allocations will be satisfied from other zones / nodes    
    in the system.    
    .  
    This is value ORed together of    
    .  
    1       = Zone reclaim on    
    2       = Zone reclaim writes dirty pages out    
    4       = Zone reclaim swaps pages    
    .  
    zone_reclaim_mode is disabled by default.  For file servers or workloads    
    that benefit from having their data cached, zone_reclaim_mode should be    
    left disabled as the caching effect is likely to be more important than    
    data locality.    
    .  
    zone_reclaim may be enabled if it's known that the workload is partitioned    
    such that each partition fits within a NUMA node and that accessing remote    
    memory would cause a measurable performance reduction.  The page allocator    
    will then reclaim easily reusable pages (those page cache pages that are    
    currently not used) before allocating off node pages.    
    .  
    Allowing zone reclaim to write out pages stops processes that are    
    writing large amounts of data from dirtying pages on other nodes. Zone    
    reclaim will write out dirty pages if a zone fills up and so effectively    
    throttle the process. This may decrease the performance of a single process    
    since it cannot use all of system memory to buffer the outgoing writes    
    anymore but it preserve the memory on other nodes so that the performance    
    of other processes running on other nodes will not be affected.    
    .  
    Allowing regular swap effectively restricts allocations to the local    
    node unless explicitly overridden by memory policies or cpuset    
    configurations.    
    

    推荐设置

    vm.zone_reclaim_mode=0    
    .  
    不使用NUMA  
    

    26.

    参数

    net.ipv4.ip_local_port_range  
    

    支持系统

    CentOS 6, 7         
    

    参数解释

    ip_local_port_range - 2 INTEGERS  
            Defines the local port range that is used by TCP and UDP to  
            choose the local port. The first number is the first, the  
            second the last local port number. The default values are  
            32768 and 61000 respectively.  
    .  
    ip_local_reserved_ports - list of comma separated ranges  
            Specify the ports which are reserved for known third-party  
            applications. These ports will not be used by automatic port  
            assignments (e.g. when calling connect() or bind() with port  
            number 0). Explicit port allocation behavior is unchanged.  
    .  
            The format used for both input and output is a comma separated  
            list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and  
            10). Writing to the file will clear all previously reserved  
            ports and update the current list with the one given in the  
            input.  
    .  
            Note that ip_local_port_range and ip_local_reserved_ports  
            settings are independent and both are considered by the kernel  
            when determining which ports are available for automatic port  
            assignments.  
    .  
            You can reserve ports which are not in the current  
            ip_local_port_range, e.g.:  
    .  
            $ cat /proc/sys/net/ipv4/ip_local_port_range  
            32000   61000  
            $ cat /proc/sys/net/ipv4/ip_local_reserved_ports  
            8080,9148  
    .  
            although this is redundant. However such a setting is useful  
            if later the port range is changed to a value that will  
            include the reserved ports.  
    .  
            Default: Empty  
    

    推荐设置

    net.ipv4.ip_local_port_range=40000 65535    
    .  
    限制本地动态端口分配范围,防止占用监听端口。  
    

    27.

    参数

      vm.nr_hugepages  
    

    支持系统

    CentOS 6, 7  
    

    参数解释

    ==============================================================  
    nr_hugepages  
    Change the minimum size of the hugepage pool.  
    See Documentation/vm/hugetlbpage.txt  
    ==============================================================  
    nr_overcommit_hugepages  
    Change the maximum size of the hugepage pool. The maximum is  
    nr_hugepages + nr_overcommit_hugepages.  
    See Documentation/vm/hugetlbpage.txt  
    .  
    The output of "cat /proc/meminfo" will include lines like:  
    ......  
    HugePages_Total: vvv  
    HugePages_Free:  www  
    HugePages_Rsvd:  xxx  
    HugePages_Surp:  yyy  
    Hugepagesize:    zzz kB  
    .  
    where:  
    HugePages_Total is the size of the pool of huge pages.  
    HugePages_Free  is the number of huge pages in the pool that are not yet  
                    allocated.  
    HugePages_Rsvd  is short for "reserved," and is the number of huge pages for  
                    which a commitment to allocate from the pool has been made,  
                    but no allocation has yet been made.  Reserved huge pages  
                    guarantee that an application will be able to allocate a  
                    huge page from the pool of huge pages at fault time.  
    HugePages_Surp  is short for "surplus," and is the number of huge pages in  
                    the pool above the value in /proc/sys/vm/nr_hugepages. The  
                    maximum number of surplus huge pages is controlled by  
                    /proc/sys/vm/nr_overcommit_hugepages.  
    .  
    /proc/filesystems should also show a filesystem of type "hugetlbfs" configured  
    in the kernel.  
    .  
    /proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge  
    pages in the kernel's huge page pool.  "Persistent" huge pages will be  
    returned to the huge page pool when freed by a task.  A user with root  
    privileges can dynamically allocate more or free some persistent huge pages  
    by increasing or decreasing the value of 'nr_hugepages'.  
    

    推荐设置

    如果要使用PostgreSQL的huge page,建议设置它。    
    大于数据库需要的共享内存即可。    
    

    28.

    参数

      fs.nr_open
    

    支持系统

    CentOS 6, 7
    

    参数解释

    nr_open:
    
    This denotes the maximum number of file-handles a process can
    allocate. Default value is 1024*1024 (1048576) which should be
    enough for most machines. Actual limit depends on RLIMIT_NOFILE
    resource limit.
    
    它还影响security/limits.conf 的文件句柄限制,单个进程的打开句柄不能大于fs.nr_open,所以要加大文件句柄限制,首先要加大nr_open
    

    推荐设置

    对于有很多对象(表、视图、索引、序列、物化视图等)的PostgreSQL数据库,建议设置为2000万,
    例如fs.nr_open=20480000
    

    数据库关心的资源限制

    1. 通过/etc/security/limits.conf设置,或者ulimit设置

    2. 通过/proc/$pid/limits查看当前进程的设置

    #        - core - limits the core file size (KB)  
    #        - memlock - max locked-in-memory address space (KB)  
    #        - nofile - max number of open files  建议设置为1000万 , 但是必须设置sysctl, fs.nr_open大于它,否则会导致系统无法登陆。
    #        - nproc - max number of processes  
    以上四个是非常关心的配置  
    ....  
    #        - data - max data size (KB)  
    #        - fsize - maximum filesize (KB)  
    #        - rss - max resident set size (KB)  
    #        - stack - max stack size (KB)  
    #        - cpu - max CPU time (MIN)  
    #        - as - address space limit (KB)  
    #        - maxlogins - max number of logins for this user  
    #        - maxsyslogins - max number of logins on the system  
    #        - priority - the priority to run user process with  
    #        - locks - max number of file locks the user can hold  
    #        - sigpending - max number of pending signals  
    #        - msgqueue - max memory used by POSIX message queues (bytes)  
    #        - nice - max nice priority allowed to raise to values: [-20, 19]  
    #        - rtprio - max realtime priority  
    

    数据库关心的IO调度规则

    1. 目前操作系统支持的IO调度策略包括cfq, deadline, noop 等。

    /kernel-doc-xxx/Documentation/block  
    -r--r--r-- 1 root root   674 Apr  8 16:33 00-INDEX  
    -r--r--r-- 1 root root 55006 Apr  8 16:33 biodoc.txt  
    -r--r--r-- 1 root root   618 Apr  8 16:33 capability.txt  
    -r--r--r-- 1 root root 12791 Apr  8 16:33 cfq-iosched.txt  
    -r--r--r-- 1 root root 13815 Apr  8 16:33 data-integrity.txt  
    -r--r--r-- 1 root root  2841 Apr  8 16:33 deadline-iosched.txt  
    -r--r--r-- 1 root root  4713 Apr  8 16:33 ioprio.txt  
    -r--r--r-- 1 root root  2535 Apr  8 16:33 null_blk.txt  
    -r--r--r-- 1 root root  4896 Apr  8 16:33 queue-sysfs.txt  
    -r--r--r-- 1 root root  2075 Apr  8 16:33 request.txt  
    -r--r--r-- 1 root root  3272 Apr  8 16:33 stat.txt  
    -r--r--r-- 1 root root  1414 Apr  8 16:33 switching-sched.txt  
    -r--r--r-- 1 root root  3916 Apr  8 16:33 writeback_cache_control.txt  
    

    如果你要详细了解这些调度策略的规则,可以查看WIKI或者看内核文档。

    从这里可以看到它的调度策略

    cat /sys/block/vdb/queue/scheduler   
    noop [deadline] cfq   
    

    修改

    echo deadline > /sys/block/hda/queue/scheduler  
    

    或者修改启动参数

    grub.conf  
    elevator=deadline  
    

    从很多测试结果来看,数据库使用deadline调度,性能会更稳定一些。

    其他

    1. 关闭透明大页

    2. 禁用NUMA

    3. SSD的对齐

  • 相关阅读:
    基于云的平台利用新技术来改变商店式购物营销
    在云上战斗:游戏设计师推出 Windows Azure 上的全球在线游戏
    use Visual studio2012 development kernel to hidden process on Windows8
    Mobile Services更新:增加了新的 HTML5/JS SDK 并对 Windows Phone 7.5 进行支持
    [转载]30个Oracle语句优化法例详解(3)
    [转载]Informix4gl FORM:直立控制录入两遍一概信息的设置
    [转载]Oracle数据库异构数据结合详解(1)
    [转载]informix onbar规复饬令用法
    [转载]30个Oracle语句优化划定端正详解(4)
    [转载]同一台效力器上搭建HDR实例
  • 原文地址:https://www.cnblogs.com/lykops/p/8263106.html
Copyright © 2020-2023  润新知