
I have a simple 5x1TB RAIDz1 configuration (tank? pool? vdev?) with a global spare assigned to it. One of the 5 drives in the array is listed in a FAULTED state (corrupted data), the spare is listed as AVAIL, and the array shows as DEGRADED. Evidently the array has not gracefully failed over to the spare on its own, so how do I force the failover?
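
My understanding so far is that on ZFS on Linux the automatic kick-in of a hot spare is handled by the ZFS Event Daemon (zed), and that forcing the failover by hand should just be a replace that names the spare as the new device. A rough sketch of what I expected to work (pool and device names are placeholders; the zfs-zed service name is as packaged by ZFS on Linux):

# confirm the event daemon that activates spares automatically is running
systemctl status zfs-zed

# manually pull the spare in: replace the faulted member (by name or GUID) with the spare
zpool replace <pool> <faulted-device-or-guid> <spare-device>

# watch the resilver onto the spare
zpool status -v <pool>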

I have read many forum posts in many places discussing detaching the drive, replacing the drive with the spare, physically removing the drive, moving the spare to the same slot, etc.

The replace command tells me it cannot replace the drive because the spare is in a spare or replacing config and to try detach.

The detach command tells me it is only compatible with mirrors and vdev replacement.

There is no indication that the spare is being used to rebuild the array.
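
(For clarity on how I am checking this: as far as I know, a spare that has actually kicked in is shown as INUSE, a spare-N vdev appears inside the raidz group wrapping the failed member and the spare, and the scan line of zpool status reports a resilver. None of that is present, as the full output below shows.)

zpool status -v <pool>   # scan: would read "resilver in progress"; the spare would show INUSE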

I am not keen to start physically moving drives around, neither the current array member nor the functioning hot spare - I'd prefer not to interrupt anything.

I'd also prefer not to bring the array down, restart the server, etc. The system is designed to recover transparently without any of this, and I want to learn how. The data is backed up, so I have free rein.

Linux Kernel: 3.10.0-1160

ZFS Version: 5

Update:

Output from replace function:

[root@localhost ~]# zpool replace <name> 4896358983234274072 ata-WDC_WD10EFRX-68PJCN0_WD-<serial>
cannot replace 4896358983234274072 with ata-WDC_WD10EFRX-68PJCN0_WD-<serial>: already in replacing/spare config; wait for completion or use 'zpool detach'

Output from detach function:

[root@localhost ~]# zpool detach <name> 4896358983234274072
cannot detach 4896358983234274072: only applicable to mirror and replacing vdevs
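
For completeness, the other remedies I keep seeing suggested in forum threads map to roughly these commands (sketch only; I have not verified that any of them applies to this particular fault):

zpool clear <name> 4896358983234274072     # clear the error state so the faulted member can retry
zpool offline <name> 4896358983234274072   # explicitly take the faulted member out before replacing it
zpool remove <name> <spare-device>         # drop the unused spare from the pool so it can be handed to 'zpool replace' as an ordinary disk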

ZFS version:

[root@localhost ~]# zfs upgrade
This system is currently running ZFS filesystem version 5.

All filesystems are formatted with the current version.

[root@localhost ~]# modinfo zfs | grep version
version:        0.8.2-1
rhelversion:    7.9
srcversion:     29C160FF878154256C93164
vermagic:       3.10.0-1160.49.1.el7.x86_64 SMP mod_unload modversions

zpool status:

[root@localhost ~]# zpool status <name>
  pool: <name>
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h18m with 0 errors on Mon Apr  4 13:29:39 2022
config:
    NAME                                               STATE     READ WRITE CKSUM
    <name>                                             DEGRADED     0     0     0
      raidz1-0                                         DEGRADED     0     0     0
        pci-0000:01:00.0-sas-0x443322110c000000-lun-0  ONLINE       0     0     0
        ata-WDC_WD10EFRX-68FYTN0_WD-<serial>           ONLINE       0     0     0
        pci-0000:01:00.0-sas-0x4433221109000000-lun-0  ONLINE       0     0     0
        4896358983234274072                            FAULTED      0     0     0  corrupted data
        pci-0000:01:00.0-sas-0x443322110b000000-lun-0  ONLINE       0     0     0
    spares
      ata-WDC_WD10EFRX-68PJCN0_WD-<serial>             AVAIL

Update 2:

Restarting the server allowed the replace operation to be carried out without any further issue. I am now looking into updating ZFS and potentially the kernel, and want to make sure that is safe to do with an existing array built under the older versions.
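
Before committing to that, my plan is to rely on what the pool itself reports, on the understanding that newer OpenZFS releases can import pools created by older ones, and that the upgrade only becomes one-way once the new feature flags are explicitly enabled (pool name is a placeholder):

zpool upgrade          # lists pools whose on-disk features are behind the running ZFS version
zpool status <name>    # the status/action fields note when a pool can be upgraded
# only once the new ZFS + kernel combination is proven stable:
zpool upgrade <name>   # enables the new feature flags; older ZFS releases can no longer import the pool afterwards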

J Collins
  • If you have a hot spare in a pool, all you need to do is detach the failed drive. Can you add the output of zpool status to your question? And the output (including any error messages) of the commands you ran. Also, where did you get ZFS version 5 from? The latest version of ZFS on Linux is 2.1.4 – cas Apr 06 '22 at 01:13
  • That's odd, the detach should have worked. Anyway, maybe try detaching the spare and then replacing the failed drive with it (commands sketched after these comments). Re: version - the on-disk format is version 5, while the version of the zfs driver is 0.8.2. zfs 0.8.2 is very old now, but not all that surprising since you're running an ancient kernel. There have been a lot of improvements and bug-fixes since then... if you weren't running RHEL, I'd recommend upgrading both the kernel and the zfs driver. – cas Apr 06 '22 at 08:54
  • Why would RHEL prevent upgrading kernel or zfs version? I have a plan to set up a new OS for this server (no clear upgrade path, so need to do it by lobotomy). I am looking at CentOS 9/Stream, though others are suggesting Debian or FreeBSD as better alternatives. Is there a good reason to move away from RHEL/EPEL/CentOS? If I upgrade zfs in the process (I will), will there be a safe upgrade path for all the filesystems I run? – J Collins Apr 06 '22 at 09:00
  • RHEL doesn't prevent upgrading the kernel. But upgrading the kernel from 3.10.0 to, say, 5.16.x, may require a cascade of other upgrades, starting with libc. Eventually, you'd end up with something that wasn't even remotely similar to RHEL (or, at least, require upgrading to RHEL 8.5...which still only has official packages for kernel 4.18. The pre-release RHEL 9 beta has kernel 5.14), which kind of defeats the purpose of using it instead of something like Debian or Ubuntu. – cas Apr 06 '22 at 10:28
  • OTOH, the release notes for ZFS 2.1.4 say it's compatible with kernels from 3.10 to 5.17, so you should be able to just upgrade ZFS without upgrading the kernel. – cas Apr 06 '22 at 10:29
  • @cas with the repos I'm subscribed to, I'm not sure there is an update stream to that version. Could you point to some guidance on an upgrade and any foreseen issues? I think you're my new best friend when it comes to ZFS.. – J Collins Apr 06 '22 at 14:46
  • I don't really know how to do that on RHEL, I don't use it - I don't need or want commercial support so it has no appeal to me. Looking at https://openzfs.github.io/openzfs-docs/Getting%20Started/RHEL-based%20distro/index.html, it seems that ZoL only provides zfs 0.8.x binaries for RHEL 7. On Debian, I just install or upgrade the zfs-dkms package (and the kernel-header package for my current kernel) and DKMS compiles the zfs driver for that kernel. The URL above also mentions DKMS packages for RHEL but I don't know what versions they provide for each RHEL release. – cas Apr 06 '22 at 22:46
  • Given that the ZoL site mentions that 2.1.4 is compatible with kernel 3.10, I'm sure it must be possible - but you may have to compile and install it yourself, or hunt for a third-party repo that provides either binary or dkms packages. IMO, if you can find zfs dkms packages for your RHEL release, that would be best. It still ends up compiling the module (rather than just installing a pre-compiled binary) but dkms automates the entire process. 10+ years ago, when I first started using ZFS, I used to compile it myself. When dkms packages became available for Debian, it became much easier. – cas Apr 06 '22 at 22:51
  • Did detaching the spare and replacing the failed drive work? – cas Apr 08 '22 at 00:45
  • @cas I went through a period where dkms would recompile ZFS each time and completely hose it, requiring a half day of fault finding and fixing. I might have PTSD from that process..! Nonetheless I believe I am using the RHEL (EPEL for CentOS) repo, and that seems to be satisfied with the 0.8 release. And ultimately I've had success rebuilding after a full server restart, so that might start pointing at the issue as well. – J Collins Apr 08 '22 at 09:03
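
For anyone following the "detach the spare, then replace the failed drive with it" suggestion above, the commands involved are roughly the following (note that, as far as I know, an unused spare is taken out of the pool with zpool remove; zpool detach only applies once a spare is actively attached):

zpool remove <name> ata-WDC_WD10EFRX-68PJCN0_WD-<serial>                        # drop the AVAIL spare from the pool
zpool replace <name> 4896358983234274072 ata-WDC_WD10EFRX-68PJCN0_WD-<serial>   # use the former spare as an ordinary replacement disk
zpool status <name>                                                             # resilver progress appears on the scan: line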
