The Out of Disk Space Problem

Here's a very interesting situation I've run into several times when troubleshooting a full disk on a Linux server that, in theory, shouldn't be running out of disk space at all.

What makes it even more puzzling at first is that disk space tools like df don't show any out-of-space condition. This can quickly throw a user off the troubleshooting course and into some odd decision making, especially under time pressure.

Below is a real-life scenario of a fairly average web server hitting a wall because of exactly this kind of disk space issue.

A server in this state exhibits the usual out-of-disk-space symptoms while everything looks fine on the surface. The most common ones are: logging stops, new SSH sessions may show I/O errors, and SNMP reporting stops.

A quick test to confirm a full disk is to try to touch a new file as a non-root user (ext filesystems reserve a percentage of blocks for root, so non-root writes are the first to fail).
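For example, using a throwaway path (any writable location will do); if the filesystem really is full, touch fails with a "No space left on device" error:

touch /tmp/disk-space-test
rm -f /tmp/disk-space-test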

Partition Size

First things first, let's check the total size of configured partitions with fdisk.

In this case the / root partition is an LVM logical volume, 16.1 GB in size. Now I have a size baseline.

fdisk:

Disk /dev/mapper/vg_bigoldserver-root: 16.1 GB, 16106127360 bytes
255 heads, 63 sectors/track, 1958 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
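
Since / lives on LVM, the LVM tools can confirm the same size directly. Assuming the volume group is called vg_bigoldserver, as the device-mapper name above suggests:

lvs vg_bigoldserver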

Disk Space Usage Report

Checking disk space usage with df shows 90% used on the / root partition. With 10% still free, it technically doesn't look like a disk space problem at all, and this is exactly what can throw troubleshooting efforts off.

df -h:

[09/10_10:42 root@bigoldserver /]$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_bigoldserver-root  15G   13G  1.5G  90% /
tmpfs                             7.8G     0  7.8G   0% /dev/shm
/dev/sdb1                         477M   96M  356M  22% /boot

Find Large Files

In addition to checking the disk space with df, let's search for the largest directories and files. This may give a few more clues as to the source of the disk space issue.

One great tool for this is ncdu, but any other method will work too as long as it produces a list of the largest files and directories (see the du alternative after the output below).

Right away, something doesn't add up. The largest directory, /var, is only 1.5 GiB; most of that is probably log files under /var/log, which is worth digging into to verify. But even so, everything visible on the root filesystem adds up to a bit over 4 GiB, nowhere near the 13 GB that df reports as used.

ncdu -x /

--- / ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    1.5 GiB [##########] /var
    1.2 GiB [#######   ] /usr
  519.9 MiB [###       ] /root
  497.5 MiB [###       ] /lib
  275.2 MiB [#         ] /opt
  123.0 MiB [          ] /home
   34.5 MiB [          ] /etc
   23.3 MiB [          ] /lib64
   14.9 MiB [          ] /sbin
    6.8 MiB [          ] /bin
  100.0 KiB [          ] /tmp
   40.0 KiB [          ]  err_pid.log
e  16.0 KiB [          ] /lost+found
e   4.0 KiB [          ] /srv
e   4.0 KiB [          ] /mnt
e   4.0 KiB [          ] /media
e   4.0 KiB [          ] /cgroup
>   0.0   B [          ] /sys
>   0.0   B [          ] /selinux
>   0.0   B [          ] /proc
>   0.0   B [          ] /dev
>   0.0   B [          ] /boot
    0.0   B [          ]  .autofsck
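
If ncdu isn't installed, a plain du pipeline gives a similar overview. This is a rough equivalent: -x keeps it on the root filesystem just like ncdu -x, and sort -h needs a reasonably recent coreutils:

du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -20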

Check Running Services

This is where the troubleshooting branches off into the specific problem of finding what is writing to disk in a way that isn't easy to identify.

Let's check which services are enabled on the server. This example is on CentOS 6, so I'm using the chkconfig tool.

The httpd service catches my attention, as it tends to write a lot of logs.

chkconfig --list | grep ":on"

[09/10_10:44 root@bigoldserver /]$ chkconfig --list | grep ":on"
acpid           0:off   1:off   2:on    3:on    4:on    5:on    6:off
atd             0:off   1:off   2:off   3:on    4:on    5:on    6:off
auditd          0:off   1:off   2:on    3:on    4:on    5:on    6:off
blk-availability    0:off   1:on    2:on    3:on    4:on    5:on    6:off
crond           0:off   1:off   2:on    3:on    4:on    5:on    6:off
httpd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
iptables        0:off   1:off   2:on    3:on    4:on    5:on    6:off
irqbalance      0:off   1:off   2:off   3:on    4:on    5:on    6:off
lvm2-monitor    0:off   1:on    2:on    3:on    4:on    5:on    6:off
mdmonitor       0:off   1:off   2:on    3:on    4:on    5:on    6:off
netfs           0:off   1:off   2:off   3:on    4:on    5:on    6:off
network         0:off   1:off   2:on    3:on    4:on    5:on    6:off
nfslock         0:off   1:off   2:off   3:on    4:on    5:on    6:off
postfix         0:off   1:off   2:on    3:on    4:on    5:on    6:off
rpcbind         0:off   1:off   2:on    3:on    4:on    5:on    6:off
rpcgssd         0:off   1:off   2:off   3:on    4:on    5:on    6:off
rpcidmapd       0:off   1:off   2:off   3:on    4:on    5:on    6:off
rsyslog         0:off   1:off   2:on    3:on    4:on    5:on    6:off
snmpd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
sshd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
sysstat         0:off   1:on    2:on    3:on    4:on    5:on    6:off
udev-post       0:off   1:on    2:on    3:on    4:on    5:on    6:off
xinetd          0:off   1:off   2:off   3:on    4:on    5:on    6:off

Find Apache Process ID

To dig deeper, let's find Apache's process ID so we can use it to see what the process is doing on disk.

Using ps, I can see that the parent Apache process has PID 2197.

ps -ewf | grep httpd:

[09/10_10:45 root@bigoldserver /]$ ps -ewf | grep httpd
root      2197     1  0  2017 ?        00:08:37 /usr/sbin/httpd
apache    8902  2197  0 Aug20 ?        00:02:33 /usr/sbin/httpd
apache    9991  2197  0 00:03 ?        00:04:06 /usr/sbin/httpd
apache   11063  2197  1 08:56 ?        00:01:19 /usr/sbin/httpd
apache   11493  2197  1 09:03 ?        00:01:25 /usr/sbin/httpd
apache   11494  2197  1 09:03 ?        00:01:20 /usr/sbin/httpd
apache   11495  2197  1 09:03 ?        00:01:23 /usr/sbin/httpd
apache   11506  2197  1 09:03 ?        00:01:23 /usr/sbin/httpd
apache   15472  2197  1 10:07 ?        00:00:37 /usr/sbin/httpd
apache   15762  2197  1 10:12 ?        00:00:28 /usr/sbin/httpd
apache   17532  2197  0 02:03 ?        00:03:23 /usr/sbin/httpd
root     18226 17114  0 10:46 pts/0    00:00:00 grep httpd

List Open Files

Using the lsof tool and Apache's PID, I can list all files currently open by that process. Going through the list, which may be (and most likely will be) quite long, I find several lines of interest right at the end.

Correlating that list with the actual contents of the /var/log/httpd directory, as reported by ls, answers the question of what's taking up the disk space even though none of it shows up in the directory tree.

Someone must have manually removed these log files at some point, without letting them rotate properly or restarting the service afterward. That deletes the directory entries, but the inodes and their data blocks stay allocated for as long as the service holds the files open, so httpd kept writing data to disk through its still-open file descriptors.

lsof -p 2197:

httpd   2197 root    7w   REG  253,0      67182    133950 /var/log/httpd/httpd-ssl-error_log-20180614 (deleted)
httpd   2197 root    8w   REG  253,0    3661727    133947 /var/log/httpd/httpd-http-error_log-20180614 (deleted)
httpd   2197 root    9w   REG  253,0          0    169240 /var/log/httpd/access_log (deleted)
httpd   2197 root   10w   REG  253,0   18710132    134560 /var/log/httpd/httpd-ssl-access_log-20180614 (deleted)
httpd   2197 root   11w   REG  253,0   82175672    134561 /var/log/httpd/httpd-ssl-error.log-20180614 (deleted)
httpd   2197 root   12w   REG  253,0 1657015856    134558 /var/log/httpd/httpd-http-access_log-20180614 (deleted)
httpd   2197 root   13w   REG  253,0 7181035399    134559 /var/log/httpd/httpd-http-error.log-20180614 (deleted)
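
To see how much space these deleted-but-open files are still holding, the size column can be summed up. A quick sketch, with the field number assuming the default lsof column layout shown above:

lsof -p 2197 | awk '/\(deleted\)$/ { sum += $7 } END { printf "%.1f GB\n", sum / 1e9 }'

For the files listed above that works out to roughly 8.9 GB, which largely accounts for the gap between the ~4 GiB of visible files and the 13 GB that df reports as used.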

Check for Other Deleted Files

Finally, hunt down any more of those zombie files system-wide, just to make sure nothing else has been affected.

lsof | grep "deleted"
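
Where the installed lsof supports it, the +L1 option does the same thing more directly by listing only open files with a link count below one, i.e. deleted but still held open:

lsof +L1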

Solution

Restart httpd to properly release those files: this closes the open file descriptors, stops any further writes, and frees up the disk space.
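
On CentOS 6 that's a standard init-script restart, followed by df to confirm the space has actually been released:

service httpd restart
df -h /

If a full restart isn't an option, a graceful restart (apachectl graceful) also makes Apache reopen its log files without dropping in-flight requests.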

