在我记忆中,CentOS 从很早开始,就没有和各种的问题挂钩过。得益于它保守的升级策略,它自己几乎从来不是问题专业户,我对它的印象更多是因为太过保守导致的库升级不及时问题。

可是最近,CentOS7 某个版本发行之后,偶尔的,会出现 supervisor 拉 redis 失败的问题。从日志的内容来看,全是 Permission denied 相关的问题,bind socket 的时候权限不够,write log 的时候权限不够,dump snapshot 的时候权限不够,只要跟写磁盘相关的,都有这个问题。

单独执行 /usr/bin/redis-server /etc/redis.conf,又一点儿问题没有,命令本身没有问题,一切正常。

关键,supervisor 是以 root 去拉 redis 的啊!表示很无辜 (o´・ェ・`o)

supervisor 的配置如下:

[program:redis]
command = /usr/bin/redis-server /etc/redis.conf
;process_name=%(program_name)s ; process_name expr (default %(program_name)s)
numprocs=1                    ; number of processes copies to start (def 1)
directory=/data                ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
;priority=999                  ; the relative start priority (default 999)
autostart=true                ; start at supervisord start (default: true)
autorestart=true        ; whether/when to restart (default: unexpected)
startsecs=1                   ; number of secs prog must stay running (def. 1)
startretries=30                ; max # of serial start failures (default 3)
exitcodes=0,2                 ; 'expected' exit codes for process (default 0,2)
stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
;stopasgroup=false             ; send stop signal to the UNIX process group (default false)
;killasgroup=false             ; SIGKILL the UNIX process group (def false)
user=root                   ; setuid to this UNIX account to run the program
;redirect_stderr=true          ; redirect proc stderr to stdout (default false)
stdout_logfile=/data/log/supervisor/redis.stdout.log        ; stdout log path, NONE for none; default AUTO
stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
stdout_events_enabled=false   ; emit events on stdout writes (default false)
stderr_logfile=/data/log/supervisor/redis.stderr.log        ; stderr log path, NONE for none; default AUTO
stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A=1,B=2           ; process environment additions (def no adds)
;serverurl=AUTO                ; override serverurl computation (childutils)

redis 的配置中,相关的配置选项如下:

daemonize no
# 实际上,在出现问题的服务器上,pid 貌似一直没有落地成功,但所幸 redis 没有报错
pidfile /data/log/redis/redis.pid
port 6379
tcp-backlog 128
bind 127.0.0.1
# 只要开启 socket 监听就一定报错。临时解决办法:不使用 socket 方式监听
unixsocket /dev/shm/redis.sock
unixsocketperm 777
timeout 0
tcp-keepalive 0
loglevel notice
# 只要开启 log 就一定报错。临时解决办法:日志路径改成黑洞 /dev/null
logfile /data/log/redis/redis.log
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
# 出现问题时,dump 文件也没有落地成功,不过不影响 redis 继续执行
dbfilename dump.rdb
dir /data/log/redis/

因为 redis 的报错是通过 redis.stdout.log 来暴露,我们只用关注这个日志文件就可以了。下面我截几段。

10640:M 01 Mar 18:04:10.932 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
10640:M 01 Mar 18:04:10.932 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
10640:M 01 Mar 18:04:10.932 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
10640:M 01 Mar 18:04:10.932 # Opening Unix socket: bind: Permission denied
10641:M 01 Mar 18:04:41.022 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
10641:M 01 Mar 18:04:41.023 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
10641:M 01 Mar 18:04:41.023 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
10641:M 01 Mar 18:04:41.023 # Opening Unix socket: bind: Permission denied
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis 3.0.7 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 12967
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           http://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'

11560:M 01 Mar 23:55:35.650 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
11560:M 01 Mar 23:55:35.650 # Server started, Redis version 3.0.7
11560:M 01 Mar 23:55:35.650 # Fatal error loading the DB: Permission denied. Exiting.
11580:M 01 Mar 23:55:44.691 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
11580:M 01 Mar 23:55:44.691 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
11580:M 01 Mar 23:55:44.692 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
11580:M 01 Mar 23:55:44.692 # Opening Unix socket (/dev/shm/redis.sock) and uid/gid (0/0): bind: Permission denied
3321:M 04 Mar 10:49:40.192 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
3321:M 04 Mar 10:49:40.192 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
3321:M 04 Mar 10:49:40.192 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
3321:M 04 Mar 10:49:40.193 # Opening Unix socket (/dev/shm/redis.sock) and uid/gid (0/0): bind: Permission denied. | uid: 0 | gid: 0 | sock family: 1 | sock path: /dev/shm/redis.sock | eno: 101
3322:M 04 Mar 10:50:10.264 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
3322:M 04 Mar 10:50:10.264 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
3322:M 04 Mar 10:50:10.264 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
3322:M 04 Mar 10:50:10.264 # Opening Unix socket (/dev/shm/redis.sock) and uid/gid (0/0): bind: Permission denied. | uid: 0 | gid: 0 | sock family: 1 | sock path: /dev/shm/redis.sock | eno: 101

第二段日志中,socket 具体的路径和 uid/gid 是我修改源码之后打印出来的,用来证实 supervisor 确实是在指定的位置拉起 redis socket,并且确实使用的是指定的用户(root)身份。

实际上,无论是用 getuid 还是 geteuid 都是一样的结果,确定 redis 的用户 ID 是 root(uid == 0)。

因为不是每台电脑都能重现这个问题,而且有时候重做系统,两台服务器同时重做,一台 OK ,另外一台能重现此问题。只要重现一次,就能反复重现,100% 概率。

此外还有一点需要注意的是,第二段和第三段都提示说需要用 ulimit -n 重新设置 file descriptors limit,可实际情况是,/etc/profile 里面已经设置了该值到 102400,而且肯定是已经生效的。

ulimit -n  # 1024000
  • supervisord 是 root 身份启动的,而且的确是以 root 身份拉 redis;
  • redis 本身没有问题,shell 里面启动是 OK 的;
  • supervisord + redis 就会在磁盘落地时报权限问题;
  • redis 报错时,似乎还取到不合实际情况的系统变量;
  • 并不是每个 CentOS7 环境都报错,但报错的都是 CentOS7。服务器的差异点还没找到,有时候两台同时重做的服务器,一台不行一台行;

直到现在还没找到问题根源以及解决办法。

大写的囧……

(UPDATED) 2017-08-17: 还没找到原因。不过相信大概率和 supervisor 的配置有关。 supervisor 拉起来的进程,系统环境不再是 euid 对应的环境,这点已经确认。 至于这样的变化是怎样导致 redis 的 socket 无法读写的,确实就不知道了。