Redis监控工具,命令和调优

1.图形化监控

由于要对Redis做性能測试，发现了GitHub上有个python写的RedisLive监控工具评价不错。结果鼓捣了半天，最后发现其主页中引用了Google的jsapi脚本，必须在线连接谷歌的服务。Stackoverflow上说把js脚本下载到本地也没法解决这个问题，坑爹！

正要放弃时发现了一个从RedisLive fork出去的项目redis-monitor，应该是国人改的吧，去掉了对谷歌jsapi的依赖，并完好了多Redis实例的管理，最终最终看到了久违的曲线图。

首先要保证安装了python。

之后下载下列python包安装。能够手动下载tar.gz解压后运行python setup.py install逐一安装，或直接用pip下载：

tornado：一个python的web框架
redis.py：python的redis客户端
python-dateutil
backports.ssl_match_hostname
argparse
setuptools
six

之后从GitHub上下载解压redis-monitor-master，改动src/redis_live.conf。

必须配置一个单独的Redis实例存储监控数据，同一时候能够配置多个要监控的Redis实例。

之后启动redis-monitor有些麻烦，须要启动两个前台进程和两个后台进程：

#in src/script/redis-monitor.sh add redis-monitor as a startup service

#start web with port 8888
$ python redis_live.py

# start info collector
$ python redis_monitor.py

#start daemon
$ python redis_live_daemon.py
$ python redis_monitor_daemon.py

2.命令行监控

前面能够看到，虽然图形化监控Redis比較美观、直接。可是安装起来比較麻烦。

假设仅仅是想简单看一下Redis的负载情况的话，全然能够用它提供的一些命令来完毕。

2.1 吞吐量

Redis提供的INFO命令不仅能够查看实时的吞吐量(ops/sec)，还能看到一些实用的运行时信息。以下用grep过滤出一些比較重要的实时信息，比方已连接的和在堵塞的客户端、已用内存、拒绝连接、实时的tps和数据流量等：

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1 info | grep -e "connected_clients" -e "blocked_clients" -e "used_memory_human" -e "used_memory_peak_human" -e "rejected_connections" -e "evicted_keys" -e "instantaneous"

connected_clients:1
blocked_clients:0
used_memory_human:799.66K
used_memory_peak_human:852.35K
instantaneous_ops_per_sec:0
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
evicted_keys:0

2.2 延迟

2.2.1 客户端PING

从客户端能够监控Redis的延迟，利用Redis提供的PING命令，不断PING服务端，记录服务端响应PONG的时间。

以下开两个终端，一个监控延迟。一个监视服务端收到的命令：

[root@vm redis-3.0.3]# src/redis-cli --latency -h 127.0.0.1
min: 0, max: 1, avg: 0.08

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1
127.0.0.1:6379> monitor
OK
1439361594.867170 [0 127.0.0.1:59737] "PING"
1439361594.877413 [0 127.0.0.1:59737] "PING"
1439361594.887643 [0 127.0.0.1:59737] "PING"
1439361594.897858 [0 127.0.0.1:59737] "PING"
1439361594.908063 [0 127.0.0.1:59737] "PING"
1439361594.918277 [0 127.0.0.1:59737] "PING"
1439361594.928469 [0 127.0.0.1:59737] "PING"
1439361594.938693 [0 127.0.0.1:59737] "PING"
1439361594.948899 [0 127.0.0.1:59737] "PING"
1439361594.959110 [0 127.0.0.1:59737] "PING"

假设我们有益用DEBUG命令制造延迟，就能看到一些输出上的变化：

[root@vm redis-3.0.3]# src/redis-cli --latency -h 127.0.0.1
min: 0, max: 1995, avg: 1.60 (2361 samples)

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1
127.0.0.1:6379> debug sleep 1
OK
(1.00s)
127.0.0.1:6379> debug sleep .15
OK
127.0.0.1:6379> debug sleep .5
OK
(0.50s)
127.0.0.1:6379> debug sleep 2
OK
(2.00s)

2.2.2 服务端内部机制

服务端内部的延迟监控略微麻烦一些。由于延迟记录的默认阈值是0。虽然空间和时间耗费很小。Redis为了高性能还是默认关闭了它。

所以首先我们要开启它，设置一个合理的阈值。比如以下命令中设置的100ms：

127.0.0.1:6379> CONFIG SET latency-monitor-threshold 100
OK

由于Redis运行命令很快，所以我们用DEBUG命令人为制造一些慢运行命令：

127.0.0.1:6379> debug sleep 2
OK
(2.00s)
127.0.0.1:6379> debug sleep .15
OK
127.0.0.1:6379> debug sleep .5
OK

以下就用LATENCY的各种子命令来查看延迟记录：

LATEST：四列分别表示事件名、近期延迟的Unix时间戳、近期的延迟、最大延迟。
HISTORY：延迟的时间序列。
可用来产生图形化显示或报表。
GRAPH：以图形化的方式显示。最以下以竖行显示的是指延迟在多久曾经发生。
RESET：清除延迟记录。

127.0.0.1:6379> latency latest
1) 1) "command"
   2) (integer) 1439358778
   3) (integer) 500
   4) (integer) 2000

127.0.0.1:6379> latency history command
1) 1) (integer) 1439358773
   2) (integer) 2000
2) 1) (integer) 1439358776
   2) (integer) 150
3) 1) (integer) 1439358778
   2) (integer) 500

127.0.0.1:6379> latency graph command
command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
#  
|  
|  
|_#

666
mmm

在运行一条DEBUG命令会发现GRAPH图的变化，多出一条新的柱状线，以下的时间2s就是指延迟刚发生两秒钟：

127.0.0.1:6379> debug sleep 1.5
OK
(1.50s)
127.0.0.1:6379> latency graph command
command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
#   
|  #
|  |
|_#|

2222
333s
mmm

另一个有趣的子命令DOCTOR，它能列出一些指导建议。比如开启慢日志进一步追查问题原因，查看是否有大对象被踢出或过期。以及操作系统的配置建议等。

127.0.0.1:6379> latency doctor
Dave, I have observed latency spikes in this Redis instance. You don't mind talking about it, do you Dave?

1. command: 3 latency spikes (average 883ms, mean deviation 744ms, period 210.00 sec). Worst all time event 2000ms.

I have a few advices for you:

- Check your Slow Log to understand what are the commands you are running which are too slow to execute. Please check http://redis.io/commands/slowlog for more information.
- Deleting, expiring or evicting (because of maxmemory policy) large objects is a blocking operation. If you have very large objects that are often deleted, expired, or evicted, try to fragment those objects into multiple smaller objects.
- I detected a non zero amount of anonymous huge pages used by your process. This creates very serious latency events in different conditions, especially when Redis is persisting on disk. To disable THP support use the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled', make sure to also add it into /etc/rc.local so that the command will be executed again after a reboot. Note that even if you have already disabled THP, you still need to restart the Redis process to get rid of the huge pages already created.

2.2.3 度量延迟Baseline

延迟中的一部分是来自环境的，比方操作系统内核、虚拟化环境等等。Redis提供了让我们度量这一部分延迟基线(Baseline)的方法：

[root@vm redis-3.0.3]# src/redis-cli --intrinsic-latency 100 -h 127.0.0.1
Max latency so far: 2 microseconds.
Max latency so far: 3 microseconds.
Max latency so far: 26 microseconds.
Max latency so far: 37 microseconds.
Max latency so far: 1179 microseconds.
Max latency so far: 1623 microseconds.
Max latency so far: 1795 microseconds.
Max latency so far: 2142 microseconds.

35818026 total runs (avg latency: 2.7919 microseconds / 27918.90 nanoseconds per run).
Worst run took 767x longer than the average latency.

–intrinsic-latency后面是測试的时长(秒)，一般100秒足够了。

2.3 持续实时监控

Unix的WATCH命令是一个很实用的工具，它能够实时监视随意命令的输出结果。

比方上面我们提到的命令，稍加改造就能变成持续地实时监控工具：

[root@vm redis-3.0.3]# watch -n 1 -d "src/redis-cli -h 127.0.0.1 info | grep -e "connected_clients" -e "blocked_clients" -e "used_memory_human" -e "used_memory_peak_human" -e "rejected_connections" -e "evicted_keys" -e "instantaneous""

Every 1.0s: src/redis-cli -h 127.0.0.1 info | grep -e...  Wed Aug 12 14:30:40 2015

connected_clients:1
blocked_clients:0
used_memory_human:799.66K
used_memory_peak_human:852.35K
instantaneous_ops_per_sec:0
instantaneous_input_kbps:0.01
instantaneous_output_kbps:1.23
rejected_connections:0
evicted_keys:0

[root@vm redis-3.0.3]# watch -n 1 -d "src/redis-cli -h 127.0.0.1 latency graph command"

Every 1.0s: src/redis-cli -h 127.0.0.1 latency graph command                                                                                                               Wed Aug 12 14:33:25 2015

command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
#
|  #
|  |
|_#|

4441
0006
mmmm

2.4 慢操作日志

像SORT、LREM、SUNION等操作在大对象上会很耗时。使用时要注意參照官方API上每一个命令的算法复杂度。用前面介绍过的慢操作日志监控操作的运行时间。就像主流数据库提供的慢SQL日志一样，Redis也提供了记录慢操作的日志。

注意这部分日志仅仅会计算纯粹的操作耗时。

slowlog-log-slower-than设置慢操作的阈值，slowlog-max-len设置保存个数，由于慢操作日志与延迟记录一样，都是保存在内存中的：

127.0.0.1:6379> config set slowlog-log-slower-than 500
OK 
127.0.0.1:6379> debug sleep 1
OK
(0.50s)
127.0.0.1:6379> debug sleep .6
OK
127.0.0.1:6379> slowlog get 10
1) 1) (integer) 2
   2) (integer) 1439369937
   3) (integer) 473178
   4) 1) "debug"
      2) "sleep"
      3) ".6"
2) 1) (integer) 1
   2) (integer) 1439369821
   3) (integer) 499357
   4) 1) "debug"
      2) "sleep"
      3) "1"
3) 1) (integer) 0
   2) (integer) 1439365058
   3) (integer) 417846
   4) 1) "debug"
      2) "sleep"
      3) "1"

输出的四列的含义各自是：记录的自增ID、命令运行时的时间戳、命令的运行耗时(ms)、命令的内容。

注意上面的DEBUG命令并没有包括休眠时间。而仅仅是命令的处理时间。

3.官方优化建议

3.1 网络延迟

客户端能够通过TCP/IP或Unix域Socket连接到Redis。

通常在千兆网络环境中。TCP/IP网络延迟是200us(微秒)，Unix域Socket能够低到30us。

关于Unix域Socket(Unix Domain Socket)还是比較经常使用的技术。详细请參考Nginx+PHP-FPM的域Socket配置方法。

什么是域Socket？
维基百科：“Unix domain socket 或者 IPCsocket 是一种终端，能够使同一台操作系统上的两个或多个进程进行数据通信。与管道相比。Unix domain sockets 既能够使用字节流数和数据队列，而管道通信则仅仅能通过字节流。U**nix domain sockets的接口和Internet socket很像，但它不使用网络底层协议来通信。Unix domain socket的功能是POSIX操作系统里的一种组件。Unix domain sockets使用系统文件的地址来作为自己的身份。它能够被系统进程引用。
所以两个进程能够同一时候打开一个Unix domain sockets来进行通信。
只是这样的通信方式是发生在系统内核里而不会在网络里传播**。
”

网络方面我们能做的就是降低在网络往返时间RTT(Round-Trip Time)。官方提供了以下一些建议：

长连接：不要频繁连接/断开到服务器的连接，尽可能保持长连接(Jedis如今就是这样做的)。
域Socket：假设客户端与Redis服务端在同一台机器上的话。使用Unix域Socket。
多參数命令：相比管道，优先使用多參数命令，如mset/mget/hmset/hmget等。
管道化：其次使用管道降低RTT。
LUA脚本：对于有数据依赖而无法使用管道的命令，能够考虑在Redis服务端运行LUA脚本。

3.2 磁盘I/O

3.2.1 写磁盘

虽然Redis也是基于多路I/O复用的单线程机制。可是却没有像Nginx一样提供CPU Affinity的设置，避免fork出的子进程也跑在Redis主进程依附的CPU内核上。导致后台进程影响主进程。所以还是让操作系统自己去调度Redis主进程和后台进程吧。但反过来，假设不开启持久化机制的话，为Redis设置亲和性能否进一步提升性能呢？

3.2.2 操作系统Swap

假设系统内存不足，可能会将Redis相应的某些页从内存swap到磁盘文件上。能够通过/proc目录中的smaps文件查看是否有数据页被swap。

假设发现大量页被swap。则能够用vmstat和iostat进一步追查原因：

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1 info | grep process_id
process_id:24191

[root@vm redis-3.0.3]# cat /proc/24191/smaps | grep "Swap"
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
            ...
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB

3.3 其它因素

3.3.1 Fork子进程

写RDB文件和rewrite AOF文件都须要fork出一个后台进程，fork操作的主要消耗在于页表的拷贝，不同系统的耗时会有些差异。当中，Xen问题比較严重。

3.3.2 Transparent Huge Page

此外。假设Linux开启了THP(Transparent Huge Page)功能的话，会极大地影响延迟。

3.3.3 Key过期

Redis同一时候使用主动和被动两种方式剔除已经过期的Key：

被动：当客户端訪问到Key时，发现已经过期。则剔除
主动：每100ms剔除一批Key。假如过期Key超过25%则重复运行

所以，要避免同一时间超过25%的Key过期导致的Redis堵塞。设置过期时间时能够略微随机化一些。

4.最后一招：WatchDog

官方说法提供的最后一招(last resort)就是WatchDog。它能够将慢操作的整个函数运行栈打印到Redis日志中。由于它与前面介绍过的将记录保存在内存中的延迟和满操作记录不同。所以记得使用前要在redis.conf中配置logfile日志路径：

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1
127.0.0.1:6379> CONFIG SET watchdog-period 500
OK
127.0.0.1:6379> debug sleep 1
OK

[root@vm redis-3.0.3]# tailf redis.log 
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

51091:M 12 Aug 15:36:53.337 # Server started, Redis version 3.0.3
51091:M 12 Aug 15:36:53.338 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
51091:M 12 Aug 15:36:53.338 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
51091:M 12 Aug 15:36:53.343 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
51091:M 12 Aug 15:36:53.343 * DB loaded from disk: 0.000 seconds
51091:M 12 Aug 15:36:53.343 * The server is now ready to accept connections on port 6379

51091:signal-handler (1439365058) 
--- WATCHDOG TIMER EXPIRED ---
src/redis-server 127.0.0.1:6379(logStackTrace+0x43)[0x450363]
/lib64/libpthread.so.0(__nanosleep+0x2d)[0x3c0740ef3d]
/lib64/libpthread.so.0[0x3c0740f710]
/lib64/libpthread.so.0[0x3c0740f710]
/lib64/libpthread.so.0(__nanosleep+0x2d)[0x3c0740ef3d]
src/redis-server 127.0.0.1:6379(debugCommand+0x58d)[0x45180d]
src/redis-server 127.0.0.1:6379(call+0x72)[0x4201b2]
src/redis-server 127.0.0.1:6379(processCommand+0x3e5)[0x4207d5]
src/redis-server 127.0.0.1:6379(processInputBuffer+0x4f)[0x42c66f]
src/redis-server 127.0.0.1:6379(readQueryFromClient+0xc2)[0x42c7b2]
src/redis-server 127.0.0.1:6379(aeProcessEvents+0x13c)[0x41a52c]
src/redis-server 127.0.0.1:6379(aeMain+0x2b)[0x41a7eb]
src/redis-server 127.0.0.1:6379(main+0x2cd)[0x423c8d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3c0701ed5d]
src/redis-server 127.0.0.1:6379[0x419b49]
51091:signal-handler (51091:signal-handler (1439365058) ) --------

附：參考资料

不得不说，Redis的官方文档写得很不错！

从中能学到许多不仅仅是Redis。还有系统方面的知识。前面推荐大家细致阅读官方站点上的每一个主题。

原文地址：https://www.cnblogs.com/blfbuaa/p/7091215.html

相关阅读:
从头学pytorch(二十一):全连接网络dense net
Linux环境实现python远程可视编程
 centos7安装Anaconda3
sql语句中包含引号处理方法
 syslog 日志
 python 判断是否为中文
 numpy简介
 django之模板显示静态文件
 Linux(Ubuntu)安装libpcap
Bug预防体系
原文地址：https://www.cnblogs.com/jpfss/p/11015014.html

热门文章
7.聚宽数据
 tflite量化
 tensorflow杂记
 docker学习笔记
 点云3d检测模型pointpillar
车道线检测LaneNet
autoware杂记
 数据集
 Deeplab
全卷积网络FCN