Programster's Blog

Tutorials focusing on Linux, programming, and open source

Recover BTRFS Array After Device Failure

Today, one of the drives in my BTRFS RAID 10 array failed. I am documenting how I handled the situation, both for others and in case it ever comes up again.

Symptoms

I run non-critical VirtualBox instances over an NFS share backed by my BTRFS RAID array, and had been doing so for over a year without any issues. However, I noticed that the instances had become unresponsive and incredibly slow, so slow that it was impossible not to notice. At this point I checked on the array with sudo btrfs fi show and everything came back as fine. I tried running a defragmentation operation to see if that would speed things up, but it didn't. It wasn't until I performed a reboot, and the server failed to come back online, that it became clear what had happened. In future, if the filesystem suddenly becomes slow, I recommend unmounting and re-mounting it to check for issues.
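In hindsight, a quicker diagnostic would have been to check the per-device error counters. The sketch below assumes the btrfs-progs tools and the /raid10 mount point used later in this article; substitute your own mount point.

```shell
# "btrfs device stats" prints per-device error counters (read/write/flush/
# corruption/generation). Non-zero values point at a failing drive even when
# "btrfs fi show" still reports everything as fine.
sudo btrfs device stats /raid10

# Remounting also forces BTRFS to re-open each device, which can surface a
# failure immediately rather than after the next reboot.
sudo umount /raid10
sudo mount /raid10
```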

Recovery Steps

Firstly, you need to mount the array in "degraded" mode. This tells the system to mount the filesystem even though it is no longer fully redundant. By refusing to mount normally, BTRFS forces you, the user, to acknowledge that there is a problem before carrying on.

Pick one of the drives in your RAID array and mount it like so:

sudo mount -o degraded /dev/sd[x] /path/to/mount/point

I've only tested this by specifying one of my remaining working drives; I have not tested specifying the failed device.

If you are unsure what drive letter to use, you can look up the devices with sudo blkid which will list the drive letters with TYPE="btrfs" like in the example below:

/dev/sda: UUID="7c95ba2a-deb8-4163-aa3c-299667bfcb43" UUID_SUB="8d0f5b96-2f93-4afe-b602-c3a8e0497111" TYPE="btrfs"
/dev/sdb1: UUID="D1AC-53DB" TYPE="vfat"
/dev/sdb2: UUID="69cdbe20-9773-4036-9e84-d6a48faf4c4b" TYPE="swap"
/dev/sdb3: UUID="3e81667f-9ed3-417f-816d-d64dd11f2a69" TYPE="ext4"
/dev/sdc1: UUID="7c95ba2a-deb8-4163-aa3c-299667bfcb43" UUID_SUB="b9371f1c-33ea-4ceb-9fc8-fe374cf9fc8f" TYPE="btrfs"

Do not list all the devices in the array as shown below. This will not work:

sudo mount -o degraded -t btrfs \
/dev/sda \
/dev/sdb \
/dev/sdc1 \
/dev/sdd \
/dev/sde \
/dev/sdf \
/raid10

Now we can automatically remove the missing device from the array. Execute the command below in a session management tool such as screen, tmux, or byobu because it may not return for many hours.

sudo btrfs device delete missing /path/to/btrfs/mount
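For example, using screen (the session name is arbitrary; the point is that the long-running delete survives an SSH disconnect):

```shell
# Start a named screen session
screen -S btrfs-recovery

# Inside the session, kick off the removal of the missing device
sudo btrfs device delete missing /path/to/btrfs/mount

# Detach with Ctrl-a d; later, reattach to check on it with:
screen -r btrfs-recovery
```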

You cannot go below the minimum number of devices required for your RAID level (2 for RAID 1, 4 for RAID 10, etc.). Also, even if you still meet the minimum number of drives, you need enough free space on the remaining drives to hold all the data at the same redundancy level. If you fail either of those criteria, you will need to add a new device before you can remove one. This means you either need a spare drive lying around at all times, or you need to keep plenty of spare capacity in your array to absorb one failed drive. Adding a drive is as easy as sudo btrfs device add /dev/sd[x] /path/to/mount. If you don't have any spare slots, I hope you labelled your drives so that you can remove the failed one and put the new one in its place.
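Putting that together, a replacement under those constraints might look like the following sketch. The drive letter /dev/sdg and the /raid10 mount point are illustrative; adjust for your system.

```shell
# With the array mounted in degraded mode, first add the new drive...
sudo btrfs device add /dev/sdg /raid10

# ...and only then remove the missing (failed) device, which rebalances
# the data across the remaining drives, including the new one.
sudo btrfs device delete missing /raid10
```

On reasonably recent kernels, sudo btrfs replace start can swap a failed device for a new one in a single step and is generally faster, but I have not tested that here.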

Make sure to use missing as the parameter and let BTRFS take care of the rest; this prevents you from accidentally specifying the wrong drive.

Unfortunately, this will trigger a complete rebalance of blocks across all of the remaining drives. This means that if any of the remaining drives fails during the rebalance, during which they all have to work hard for many hours, you may lose the array, much like rebuilding a traditional RAID 5.

My CPU was hard at work as shown in the picture below.

Grey lines represent time the CPU was waiting on the drives, whilst green represents calculation time. One thread was always running at max capacity, whilst two threads were eaten by disk wait time. However, I think there was possibly another issue going on, as the output of top showed the Xorg process at 100% utilization.

To Be Finished

I am guessing that after the rebalance is finished, I should be able to unmount and remount my array without the degraded flag. I've been waiting over 10 hours for the balance to finish, so I thought I'd release the tutorial now and update this section later.
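Assuming the delete completes cleanly, the check I intend to run is roughly the following sketch (again using the /raid10 mount point):

```shell
# Confirm the failed device no longer appears in the array listing
sudo btrfs fi show

# Then remount without the degraded option; if this succeeds, the array
# should be back to normal, fully redundant operation
sudo umount /raid10
sudo mount /raid10
```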

Update

Eventually, I got an IO error from the device delete command. Running sudo btrfs fi show revealed that all the data had been moved off device 2, but the device was refusing to be "removed" from the array.

Label: none  uuid: 58bd01a7-f160-4fea-aed3-c378c2332699
        Total devices 6 FS bytes used 6.76TiB
        devid    1 size 2.73TiB used 2.50TiB path /dev/sda
        devid    2 size 0.00 used 2.50TiB path 
        devid    3 size 3.64TiB used 2.50TiB path /dev/sdf
        devid    4 size 2.73TiB used 2.50TiB path /dev/sdb
        devid    5 size 2.73TiB used 2.50TiB path /dev/sdc1
        devid    6 size 3.64TiB used 2.50TiB path /dev/sde

I tried unmounting and remounting the array with no success, even in degraded mode. At this point my heart sank, and in a last-ditch attempt I unplugged each drive one by one and re-plugged it back in before booting the server up again. This is where it gets weird. The server booted without complaint and the array was mounted without me having to manually mount it in degraded mode. At this point I decided to test it by unmounting and remounting it. This failed, and I had to mount in degraded mode. I then tried deleting /dev/sdd again, which failed. I then unmounted the array and ran btrfsck on it, which spat out a load of warnings/errors (see appendix). I then mounted the array again, which only worked in degraded mode. I am now running a btrfs scrub, which is finding and fixing lots of issues.

scrub status for 58bd01a7-f160-4fea-aed3-c378c2332699
        scrub started at Sun Feb  7 21:13:52 2016 and was aborted after 4135 seconds
        total bytes scrubbed: 1.57TiB with 238991 errors
        error details: verify=2272 csum=236719
        corrected errors: 238988, uncorrectable errors: 0, unverified errors: 0

Unfortunately, this is where I am now: mid-way through a scrub. I cannot tell whether running this scrub will "fix" the array or allow me to remove /dev/sdd. Running sudo btrfs scrub status -d /raid10 outputs the following:

scrub device /dev/sda (id 1) status
        scrub started at Sun Feb  7 21:13:52 2016, running for 4711 seconds
        total bytes scrubbed: 457.60GiB with 0 errors
scrub device /dev/sde (id 2) status
        scrub started at Sun Feb  7 21:13:52 2016, running for 4711 seconds
        total bytes scrubbed: 396.75GiB with 286399 errors
        error details: verify=2272 csum=284127
        corrected errors: 286399, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdg (id 3) status
        scrub started at Sun Feb  7 21:13:52 2016, running for 4711 seconds
        total bytes scrubbed: 224.18GiB with 0 errors
scrub device /dev/sdb (id 4) status
        scrub started at Sun Feb  7 21:13:52 2016, running for 4711 seconds
        total bytes scrubbed: 289.20GiB with 0 errors
scrub device /dev/sdd1 (id 5) history
        scrub started at Sun Feb  7 21:13:52 2016 and was aborted after 0 seconds
        total bytes scrubbed: 0.00 with 0 errors
scrub device /dev/sdf (id 6) status
        scrub started at Sun Feb  7 21:13:52 2016, running for 4711 seconds
        total bytes scrubbed: 475.63GiB with 0 errors

This shows that there might also have been an issue with /dev/sde. I'm hoping that /dev/sdd shows no scrubbing because we successfully moved its data off, and not because it has simply failed outright. I'm not sure whether /dev/sdd has actually failed, as after one of the reboots it appeared to be operating. I've run SMART scans on all the drives and they all appear healthy, except that the scan of /dev/sdd gets stuck with 10% remaining. At that point the server became unresponsive and I had to perform a hard reboot. When I mounted the array in degraded mode and tried to resume the scrub, the elapsed time was not increasing. Cancelling the scrub would fail, saying that one wasn't running, whilst starting it would say that one was already running. I used the line below from Marc's blog to force a cancellation.

perl -pi -e 's/finished:0/finished:1/' /var/lib/btrfs/*

After which, I was able to start the scrub again. So far it hasn't found any issues.
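If you find yourself babysitting a scrub like this, it can be polled on an interval rather than re-run by hand. A minimal sketch, again assuming the /raid10 mount point:

```shell
# Start the scrub in the background (it returns immediately)
sudo btrfs scrub start /raid10

# Poll its progress and error counts every 60 seconds
watch -n 60 sudo btrfs scrub status /raid10
```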

Debugging

If you do not mount the array in degraded mode before trying to remove the failed device, you will get an error like the one below:

ERROR: error removing the device '/dev/sdb' - Inappropriate ioctl for device

When you're in a panic, seeing this message can cause you to need your brown pants. I'm hoping this tutorial will come up for others Googling this error message.

Appendix

Output of btrfsck

sudo btrfsck /dev/sdd1
No valid Btrfs found on /dev/sdd1

sudo btrfsck /dev/sde                                                                                      
warning, device 5 is missing
warning devid 5 not found already
Checking filesystem on /dev/sde
UUID: 58bd01a7-f160-4fea-aed3-c378c2332699
checking extents
checking free space cache
Error reading 36824793235456, -1
failed to load free space cache for block group 25965986381824
free space inode generation (0) did not match free space cache generation (263494)
free space inode generation (0) did not match free space cache generation (263494)
free space inode generation (0) did not match free space cache generation (263494)
free space inode generation (0) did not match free space cache generation (263494)
free space inode generation (0) did not match free space cache generation (263494)
free space inode generation (0) did not match free space cache generation (263494)
free space inode generation (0) did not match free space cache generation (263494)
free space inode generation (0) did not match free space cache generation (268613)
free space inode generation (0) did not match free space cache generation (268612)
free space inode generation (0) did not match free space cache generation (315822)
free space inode generation (0) did not match free space cache generation (315826)
free space inode generation (0) did not match free space cache generation (271362)
free space inode generation (0) did not match free space cache generation (278248)
free space inode generation (0) did not match free space cache generation (285754)
free space inode generation (0) did not match free space cache generation (271506)
free space inode generation (0) did not match free space cache generation (271362)
free space inode generation (0) did not match free space cache generation (279813)
free space inode generation (0) did not match free space cache generation (285766)
free space inode generation (0) did not match free space cache generation (285697)
free space inode generation (0) did not match free space cache generation (285758)
free space inode generation (0) did not match free space cache generation (285754)
free space inode generation (0) did not match free space cache generation (285697)
free space inode generation (0) did not match free space cache generation (271481)
free space inode generation (0) did not match free space cache generation (279813)
free space inode generation (0) did not match free space cache generation (279813)
free space inode generation (0) did not match free space cache generation (285754)
free space inode generation (0) did not match free space cache generation (285697)
free space inode generation (0) did not match free space cache generation (279813)
free space inode generation (0) did not match free space cache generation (279813)
free space inode generation (0) did not match free space cache generation (279813)
free space inode generation (0) did not match free space cache generation (271475)
free space inode generation (0) did not match free space cache generation (279813)
free space inode generation (0) did not match free space cache generation (284204)
free space inode generation (0) did not match free space cache generation (279814)
free space inode generation (0) did not match free space cache generation (279814)
free space inode generation (0) did not match free space cache generation (286850)
free space inode generation (0) did not match free space cache generation (271475)
free space inode generation (0) did not match free space cache generation (271744)
free space inode generation (0) did not match free space cache generation (315822)
free space inode generation (0) did not match free space cache generation (279558)
free space inode generation (0) did not match free space cache generation (270848)
free space inode generation (0) did not match free space cache generation (315876)
checking fs roots
checking csums
checking root refs
found 5419802658561 bytes used err is 0
total csum bytes: 7247450076
total tree bytes: 8555118592
total fs tree bytes: 241958912
total extent tree bytes: 291864576
btree space waste bytes: 702146162
file data blocks allocated: 19521148178432
 referenced 7291670536192
Btrfs v3.12