We first ran into this problem three years ago, but with the resources we had back then we never managed to pin down the root cause. Technical debt always comes due: the problem resurfaced recently, though this time it was quick to resolve.
Checking /var/log/messages, we found plenty of entries like these:
Mar 6 15:00:11 lvs-1 Keepalived_healthcheckers[107592]: TCP connection to [192.168.90.5]:80 failed !!!
Mar 6 15:00:11 lvs-1 Keepalived_healthcheckers[107592]: Removing service [192.168.90.5]:80 from VS [111.111.111.4]:80
Mar 6 15:00:17 lvs-1 Keepalived_healthcheckers[107592]: TCP connection to [192.168.90.5]:80 success.
Mar 6 15:00:17 lvs-1 Keepalived_healthcheckers[107592]: Adding service [192.168.90.5]:80 to VS [111.111.111.4]:80
Mar 6 20:00:16 lvs-1 Keepalived_healthcheckers[107592]: TCP connection to [192.168.90.5]:80 failed !!!
Mar 6 20:00:16 lvs-1 Keepalived_healthcheckers[107592]: Removing service [192.168.90.5]:80 from VS [111.111.111.4]:80
Mar 6 20:00:22 lvs-1 Keepalived_healthcheckers[107592]: TCP connection to [192.168.90.5]:80 success.
Mar 6 20:00:22 lvs-1 Keepalived_healthcheckers[107592]: Adding service [192.168.90.5]:80 to VS [111.111.111.4]:80
Cross-referencing the monitoring graphs, the failures always happened exactly on the hour, and system load was indeed noticeably higher at those moments. Pulling up the historical sar data showed IO utilization hitting 100% at the top of every hour. The only thing that fires on the hour is logrotate, rotating the logs and then compressing them. We simulated that and reproduced the problem; all that remained was to fix it.
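Besides taming the rotation itself, one mitigation is to make the health check more tolerant of a brief IO stall, so a single slow accept() does not eject the RS. A sketch of the relevant keepalived.conf section, using the VIP and RS from the logs above (the values are illustrative, and the retry / delay_before_retry knobs for TCP_CHECK only exist in newer keepalived releases; older builds have no per-check retry at all):

```
virtual_server 111.111.111.4 80 {
    delay_loop 6
    real_server 192.168.90.5 80 {
        TCP_CHECK {
            connect_timeout 10     # give an IO-starved RS more time to accept()
            retry 3                # keepalived >= 2.x; older builds may lack this
            delay_before_retry 2
        }
    }
}
```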
Besides the case above, we also saw similar log entries at times that were not on the hour:
Mar 10 17:10:23 lvs-2 sshd[54526]: Accepted publickey for jaseywang from 192.168.10.254 port 56532 ssh2
Mar 10 17:15:19 lvs-2 Keepalived_healthcheckers[34943]: MD5 digest error to [192.168.90.5]:80 url[/], MD5SUM [f9943d27e3420479271984e01dfe85dd].
Mar 10 17:15:19 lvs-2 Keepalived_healthcheckers[34943]: Removing service [192.168.90.5]:80 from VS [111.111.111.4]:80
Mar 10 17:15:22 lvs-2 Keepalived_healthcheckers[34943]: MD5 digest success to [192.168.90.5]:80 url(1).
Mar 10 17:15:28 lvs-2 Keepalived_healthcheckers[34943]: Remote Web server [192.168.90.5]:80 succeed on service.
A packet capture showed LVS sending a GET / HTTP/1.0 request to the RS, and the RS replying with HTTP/1.1 500 Internal Server Error, a non-200 status code, so removing those RS was perfectly correct behavior on LVS's part:
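The probe is easy to reproduce by hand. A small Python sketch that sends the same plain GET / HTTP/1.0 request the checker does and returns only the status line (the address in the comment is the RS from the logs; run it from the LVS box):

```python
import socket

def probe(host, port=80, timeout=5):
    """Send the same 'GET / HTTP/1.0' request the keepalived checker sends
    and return just the status line of the response."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host.encode())
        return s.recv(4096).split(b"\r\n", 1)[0].decode()

# During the incident, probe("192.168.90.5") returned
# "HTTP/1.1 500 Internal Server Error" instead of a 200.
```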
The monitoring also showed that during the incident window the constant flapping left the frontend almost unusable; in the window just after 5 o'clock the error log grew to more than 100 times its normal size, reaching 500 MB:
Starting from that, we classified every error-log entry from the incident window and found a striking answer: over 99% of the errors were "too many open file". One look at /proc confirmed that Nginx's open-file limit had indeed dropped to 1024:
# cat /proc/3148/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             514947               514947               processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       514947               514947               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
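Since /proc/<pid>/limits is plain text, it is easy to watch this value and alert before the error log explodes. A minimal sketch in Python (the 512000 threshold and the function names are our own, not from any standard tool):

```python
import re

# One data row of /proc/<pid>/limits, e.g. "Max open files  1024  4096  files"
_ROW = re.compile(r"^(Max .+?)\s+(unlimited|\d+)\s+(unlimited|\d+)\s*(\S*)\s*$")

def parse_proc_limits(text):
    """Parse /proc/<pid>/limits output into {limit name: (soft, hard)}.
    Numeric values become ints; 'unlimited' is kept as a string."""
    conv = lambda v: v if v == "unlimited" else int(v)
    return {m.group(1): (conv(m.group(2)), conv(m.group(3)))
            for m in map(_ROW.match, text.splitlines()) if m}

def nofile_too_low(pid, threshold=512000):
    """True if the process's soft 'Max open files' has fallen below threshold."""
    with open("/proc/%d/limits" % pid) as f:
        soft, _hard = parse_proc_limits(f.read())["Max open files"]
    return soft != "unlimited" and soft < threshold
```

Running nofile_too_low() periodically against the Nginx worker pids would have flagged the drop to 1024 immediately.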
The root user's limits, by contrast, were fine:
# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514947
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 512000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 512000
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Why the limit suddenly dropped from the system-wide value to 1024 is genuinely puzzling. We hit the same thing on one of our Nginx machines a long time ago; as I recall, restarting Nginx had no effect and we had no choice but to reboot the box to recover. Other users have run into this Nginx limit-drops-to-1024 problem as well:
* http://serverfault.com/questions/516802/too-many-open-files-with-nginx-cant-seem-to-raise-limit
* http://anothersysadmin.wordpress.com/2013/08/09/nginx-and-the-too-many-open-files-limit/
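For the record, a reboot (or even an Nginx restart) should not be necessary to recover: on Linux the limits of a running process can be changed in place, either with util-linux's prlimit tool or via the prlimit() syscall. A sketch using Python's resource module (Linux only, Python >= 3.4; raising another process's hard limit additionally requires root or CAP_SYS_RESOURCE):

```python
import os
import resource

def raise_nofile_to_hard(pid):
    """Raise a running process's soft RLIMIT_NOFILE up to its hard limit,
    in place and without a restart -- equivalent to
    `prlimit --pid PID --nofile=HARD:HARD` from util-linux."""
    soft, hard = resource.prlimit(pid, resource.RLIMIT_NOFILE)
    resource.prlimit(pid, resource.RLIMIT_NOFILE, (hard, hard))
    return soft, hard

# Example against our own process; for a stuck Nginx worker, pass its pid instead.
print(raise_nofile_to_hard(os.getpid()))
```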
We still don't know the root cause, though we've found part of the picture. Avoiding the problem, at least, is simple: explicitly set worker_rlimit_nofile. To be safe, we also put the important files under /proc/xxx/ on monitoring.
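The workaround in nginx.conf looks like this (the numbers are illustrative; worker_rlimit_nofile sets each worker's RLIMIT_NOFILE at startup regardless of what limit the master process inherited, and worker_connections should stay comfortably below it):

```
# nginx.conf -- values are illustrative
worker_rlimit_nofile 512000;

events {
    worker_connections 65536;
}
```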