A few Bytes too many
I have one XFS volume on a VDO (Legacy) partition; this volume hosts all of my Docker containers and resides on a physical RAID device. The Docker containers themselves are quite happy with this arrangement, until…
One Docker container in particular doesn’t trim or compress its logs. It’s the application output log, so it’s fairly useless in the long run, but great for debugging the application in general.
This log can, under a variety of circumstances, grow to be VERY VERY large. It will reach terabytes in size if it isn’t trimmed, deleted, or compressed, and the application itself doesn’t do any of that. In this case it caused havoc on the system: the filesystem wouldn’t remount, so it had to be recovered. Except xfs_repair wouldn’t work on it.
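As an aside: if that log goes through Docker’s logging driver (rather than being a file the app writes straight into a volume), the json-file driver can cap it. A minimal sketch, with an illustrative image name and illustrative size limits; for a file written directly to disk, logrotate is the usual answer instead.

# Cap the container's stdout/stderr log at roughly 5 x 100MB (values are illustrative)
docker run -d \
  --log-driver json-file \
  --log-opt max-size=100m \
  --log-opt max-file=5 \
  my-noisy-app   # placeholder image name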
XFS is particularly bad about repairing filesystems when something is amiss with the underlying storage. Bad blocks aside, it copes especially poorly when the underlying storage fills up and the volume happens to be thin-provisioned.
The layout
Under normal circumstances, the filesystem is mounted at /mnt/local
Here’s how it’s stacked:
- XFS filesystem
- Virtual Data Optimizer (VDO, legacy, in a partition presented as 16TB with deduplication and compression enabled)
- 6 x 1TB SAS drives on a PERC H330 RAID device
Under normal circumstances, the compression and deduplication should cover 99% of the use cases here, and the stack holds up well unless something goes wrong.
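For reference, a stack like this is easy to inspect with standard tools. A quick sketch, assuming the RAID device shows up as /dev/sda and the VDO volume is named vdo (the names used later in this post):

# Show the block-device stack from the RAID device down to the XFS mountpoint
lsblk -f /dev/sda

# Legacy VDO manager: configuration and state of the volume
vdo status --name=vdo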
The first sign of trouble
A few days ago, all of my containers went offline, and I got an alert saying that my proxy wasn’t responding. This was on a Friday night, so I didn’t look at it until early Sunday.
To my surprise, all of my containers were indeed offline, and docker ps --all showed them all stopped. My initial thought was that the service had simply crashed, so I tried to restart Portainer, planning to bring up the rest of the containers from there. Wrong.
The container would come up but had no config. As weird as that is, I assumed the container had just failed (Portainer is really easy to rebuild). Also wrong.
A simple ls /mnt/local revealed that the docker directory was empty, save for the default files placed there by Portainer. That was incredibly odd, and now I had a mystery.
Diagnosis
Running journalctl -xe revealed that the filesystem was unmounted due to a write error.
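If you want to dig for the exact error, the kernel log is where XFS and VDO complain. One way to filter it, nothing more than a sketch:

# Kernel messages only, filtered for the relevant subsystems
journalctl -k | grep -iE 'xfs|vdo|I/O error'

# Or scroll through everything around the failure with full context
journalctl -xe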
What do you do except run xfs_repair? That didn’t yield anything, though, since it was complaining about a dirty log.
Then I followed the instructions and ran xfs_repair -L, which zeroes the log and throws away any metadata updates still sitting in it. OK, so don’t do that… Hindsight is 20/20.
That operation caused what I can only assume, at this point, is my current situation.
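For the record, the usual advice before reaching for -L is to let XFS replay its own log by mounting and cleanly unmounting the filesystem, then repairing. A rough sketch, assuming the VDO volume appears at /dev/mapper/vdo:

# Mounting replays the dirty log; a clean unmount leaves nothing for repair to complain about
mount /dev/mapper/vdo /mnt/local
umount /mnt/local

# Dry-run first to see what repair would do, then run it for real
xfs_repair -n /dev/mapper/vdo
xfs_repair /dev/mapper/vdo

In my case, of course, the filesystem wouldn’t mount in the first place because the backing store was full, which turned out to be the real problem.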
The first step after this was to run vdostats --human-readable, and that answered my question of “what is going on here”: the underlying disk was full.
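The --human-readable output is worth reading carefully: per VDO device it reports (roughly) the physical size, used and available space, a use percentage, and the space saving from dedup and compression, so a full backing store shows up here long before XFS tells you anything useful. For more detail there’s also:

# Raw block-level counters, handy for scripting or alerting on usage
vdostats --verbose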
Recovery Attempts
First and Foremost: DO NOT WRITE ANY DATA TO A VOLUME YOU ARE TRYING TO RECOVER
Second: photorec and testdisk DO NOT SUPPORT XFS. They will, however, happily recover files if their headers are known and the files are contiguous.
First things first:
Now that we know what the situation is (full storage), let’s copy the volume off to another device.
I won’t go into detail about the two failed attempts here, but in short I also tried to:
- Use dd to copy the VDO volume to a file on my file server, but since the filesystem on the target was ext4, the write failed at 16TB, which is ext4’s maximum file size (a workaround is sketched just after this list)
- Expand the volume using the RAID card by adding another disk, which got me into all sorts of problems (more for another blog post). See rule #1 above.
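For what it’s worth, one way around the ext4 file-size ceiling (a sketch, with hypothetical file names) is to stream the image through split so that no single file ever reaches 16TB:

# Image the raw device into 2TB chunks on the ext4 file server
dd if=/dev/sda bs=64M status=progress | split -b 2T - vdo-backup.img.

# Reassemble later by concatenating the chunks back onto a target device (placeholder /dev/sdX)
cat vdo-backup.img.* | dd of=/dev/sdX bs=64M status=progress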
So what did I do?
I did a dd from the underlying disk (after booting into CentOS 9) to a USB drive: dd if=/dev/sda of=/dev/sdb
So that worked.
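If I were doing it again I’d give dd a few extra options so a stalled copy or a bad sector doesn’t go unnoticed. A sketch of the same copy with more defensive flags (ddrescue is arguably the better tool for a suspect disk, but this is the minimal version):

# Same device-to-device copy with a larger block size, progress output,
# and keep-going-on-read-error behaviour (unreadable blocks are zero-padded)
dd if=/dev/sda of=/dev/sdb bs=64M status=progress conv=noerror,sync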
Since the system I was working on now has CentOS 9 on it (again, another blog post), I couldn’t use the legacy vdo tools.
So I did what any rational human would do and installed VMware on my desktop, so I could boot something that still ships the legacy vdo tooling. (Why is it so slow on Windows 11?)
First off, I needed to import the volume:
vdo import /dev/sda1 -n vdo
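Assuming the import succeeds (and the volume is started, vdo start --name=vdo if it isn’t), the VDO device should show up under /dev/mapper using the name given with -n, and that mapper device is what mount and xfs_repair operate on. A quick sanity check:

# The imported volume should be listed by the legacy manager...
vdo list

# ...and be visible as a device-mapper target
ls -l /dev/mapper/vdo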
Then I realized part one of the issue: the VDO volume lives inside a partition, and a partition can’t grow on its own (it’s not LVM).
So we first have to grow the partition:
parted /dev/sda
(parted) resizepart 1
End? 10000GB
and the partition expanded.
Now you can run:
vdo growPhysical -n vdo
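Before retrying the repair, it’s worth confirming that VDO actually sees the extra physical space; same names as above:

# Physical size and free space should now reflect the larger partition
vdostats --human-readable
vdo status --name=vdo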
Now there’s a bunch of free space underneath the VDO volume, and that should allow the repair to complete, since the lack of free space has been solved. Right? Wrong.
xfs_repair still errors out, and I’m stuck with the same issue during the repair.
Like and Follow for Part 2. 🙂