Strace is a tool that should be in a toolbox of every system administrator. Not only that it can help in troubleshooting simple problems (ie. missing libraries in newly created chroot, which ldd mysteriously misses to report) but it also helps in debugging very complex system problems and performance issues.
Recently I experienced a very strange problem with one of the RHEL 3 servers we’ve got. Problem manifested in a very strange way, SSH and su logins hanged, other daemons were also hanging during the startup, only way to reboot or shutdown the server was to physically press the restart/power off button, etc. All this could have been caused by problems on both software and hardware level. First suspicious was bad RAID controller, but after tests this proved to be a mislead. After more tests and brainstorms hardware problems were definitely excluded, so problem has to be on the software side. But what could be the problem?
After few more misleading steps I tried to trace system calls created by su command and found very interesting results.
$ strace -f -s 1024 -o /tmp/su.strace.out su -
[-- cut --]
3138 open(“/dev/audit”, O_RDWR) = 3
3138 fcntl64(3, F_GETFD) = 0
3138 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
3138 ioctl(3, 0x801c406f
And this is where the strace output ends and su command hangs. Audit device file is opened (file descriptor 3) and as soon as the first request is dispatched to this device (ioctl system call to file descriptor 3) command freezes. According to this I should just disable audit on the server and the problem will be gone.
As a test, audit daemon was temporarily stopped and I tried to switch to another user and the problem was indeed gone.
After searching for similar problems with audit daemon I found an article in Red Hat knowledge base regarding the exactly same
From the article:
When the free space in the filesystem holding the audit logs is less than 20%, the above notify command will error out and auditd will enter suspend mode. This causes all system calls to block.
So this behavior is not a bug but actual feature of the software. :o) From security point of view this is expected behaviour – attacker could fill up filesystem where audit logs are stored before the attack and audit will be disabled, meaning no logs of his activity, so better not to allow ANY activity on the system if audit is not able to write to its logs. But still, this kind of behaviour renders the system completely useless to legitimate users.
The topic of this post is not audit, so I will stop here. Important thing is that strace led us directly to the main source of the problem. Resolution of issues like this would be much more complex and time consuming without this great little tool. :)