Replacing a Failing NVMe Drive in a Hetzner Dedicated Server
One of my Hetzner dedicated servers has two Samsung 1 TB NVMe drives in a software RAID-1 array. This is a walkthrough of how I replaced a failing drive without data loss.
1. The alert
It started with an email from smartd running on the server itself:
Subject: SMART error (Health) detected on host
Device: /dev/nvme1, Critical Warning (0x04): Reliability
Device info:
SAMSUNG MZVL21T0HCLR-00B00, S/N:S676NF0R518195, FW:GXA7601Q, 1.02 TB
Critical Warning 0x04 means the NVMe subsystem’s reliability has been degraded - the drive is reporting it can no longer guarantee data integrity. Time to act.
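The Critical Warning value is a bitmask, so 0x04 is simply bit 2 set. As a rough sketch (the function name decode_nvme_warning is my own, and the bit meanings are paraphrased from the NVMe SMART / Health Information log rather than taken from the alert), it can be decoded like this:

```shell
# Decode an NVMe "Critical Warning" byte into the conditions it flags.
# Accepts hex (0x04) or decimal (4). The function name is hypothetical;
# the bit meanings are paraphrased from the NVMe SMART / Health log.
decode_nvme_warning() {
    value=$(( $1 ))
    if [ "$value" -eq 0 ]; then
        echo "no warnings"
        return 0
    fi
    if [ $(( value & 1 )) -ne 0 ]; then echo "available spare below threshold"; fi
    if [ $(( value & 2 )) -ne 0 ]; then echo "temperature outside threshold"; fi
    if [ $(( value & 4 )) -ne 0 ]; then echo "NVM subsystem reliability degraded"; fi
    if [ $(( value & 8 )) -ne 0 ]; then echo "media in read-only mode"; fi
    if [ $(( value & 16 )) -ne 0 ]; then echo "volatile memory backup device failed"; fi
    return 0
}
```

Calling decode_nvme_warning 0x04 prints "NVM subsystem reliability degraded". Several bits can be set at once, in which case one line is printed per bit.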
2. Confirming which drive was failing
I ran smartctl against both drives to see their health status.
/dev/nvme0n1 was healthy:
SMART overall-health self-assessment test result: PASSED
Percentage Used: 81%
Available Spare: 100%
Critical Warning: 0x00
/dev/nvme1n1 was not:
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
Percentage Used: 100%
Available Spare: 100%
Critical Warning: 0x04
Percentage Used: 100% confirms the drive has exhausted its rated write endurance. The serial number (S676NF0R518195) matched the one in the alert email - /dev/nvme1n1 was the bad drive.
3. Checking the RAID array
# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
33520640 blocks super 1.2 [2/2] [UU]
md2 : active raid1 nvme1n1p3[1] nvme0n1p3[0]
965467456 blocks super 1.2 [2/2] [UU]
bitmap: 7/8 pages [28KB], 65536KB chunk
md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
1046528 blocks super 1.2 [2/2] [UU]
unused devices: <none>
[2/2] [UU] means all arrays are fully intact - both drives active, no degradation at the RAID level. The SMART warning was about the drive’s internal health, not a failure that had caused it to drop out of the array yet. This is actually the best case: the drive was flagging trouble early enough to replace it before RAID noticed anything wrong.
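Reading the [UU] markers by eye is fine for three arrays, but the same check is easy to script. The following is a minimal sketch, assuming the usual /proc/mdstat layout shown above. The function name degraded_arrays is hypothetical. It reads mdstat text on stdin and prints the name of every array whose status contains an underscore:

```shell
# Print the name of each md array whose member status (e.g. [U_]) shows
# a missing member. Reads /proc/mdstat-style text on stdin.
degraded_arrays() {
    awk '
        # Remember which array the following status line belongs to.
        /^md[0-9]+ :/ { name = $1 }
        # The status block looks like [UU] or [U_]; member slots such as
        # [0] or [2/2] contain digits and therefore do not match.
        name != "" && match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) print name
            name = ""
        }
    '
}
```

On the server it would be run as degraded_arrays < /proc/mdstat. With the healthy output above it prints nothing.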
4. Checking the partition table type
Before doing anything, I confirmed whether the disks used MBR or GPT:
# parted -l
Model: SAMSUNG MZVL21T0HCLR-00B00 (nvme)
Disk /dev/nvme0n1: 1024GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 1049kB 34.4GB 34.4GB primary raid
2 34.4GB 35.4GB 1074MB primary raid
3 35.4GB 1024GB 989GB primary raid
Model: Linux Software RAID Array (md)
Disk /dev/md2: 989GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:
Number Start End Size File system Flags
1 0.00B 989GB 989GB ext4
Model: Linux Software RAID Array (md)
Disk /dev/md0: 34.3GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:
Number Start End Size File system Flags
1 0.00B 34.3GB 34.3GB linux-swap(v1)
Model: SAMSUNG MZVL21T0HCLR-00B00 (nvme)
Disk /dev/nvme1n1: 1024GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 1049kB 34.4GB 34.4GB primary raid
2 34.4GB 35.4GB 1074MB primary raid
3 35.4GB 1024GB 989GB primary raid
Model: Linux Software RAID Array (md)
Disk /dev/md1: 1072MB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:
Number Start End Size File system Flags
1 0.00B 1072MB 1072MB ext3
Partition Table: msdos on both NVMe drives confirms MBR. This matters because the method to back up and restore the partition layout differs between MBR and GPT. The output also gives a useful overview of the full setup: three partitions per drive (swap, boot, root), all marked as raid and active in their respective md arrays.
5. Backing up the partition table
With the partition layout confirmed as MBR, I dumped it from the healthy drive as a precaution:
# sfdisk --dump /dev/nvme0n1 > nvme0n1_parttable_mbr.bak
This saves the exact partition boundaries. If anything goes wrong during restoration on the new drive, this file is the reference.
6. Removing the failing drive from the RAID
Since the RAID arrays were still fully healthy ([UU]), mdadm won’t let you remove an active member directly - you have to explicitly mark each partition as failed first, then remove it:
# mdadm /dev/md0 --fail /dev/nvme1n1p1 && mdadm /dev/md0 -r /dev/nvme1n1p1
mdadm: set /dev/nvme1n1p1 faulty in /dev/md0
mdadm: hot removed /dev/nvme1n1p1 from /dev/md0
# mdadm /dev/md1 --fail /dev/nvme1n1p2 && mdadm /dev/md1 -r /dev/nvme1n1p2
mdadm: set /dev/nvme1n1p2 faulty in /dev/md1
mdadm: hot removed /dev/nvme1n1p2 from /dev/md1
# mdadm /dev/md2 --fail /dev/nvme1n1p3 && mdadm /dev/md2 -r /dev/nvme1n1p3
mdadm: set /dev/nvme1n1p3 faulty in /dev/md2
mdadm: hot removed /dev/nvme1n1p3 from /dev/md2
--fail marks the device as faulty in the array, which allows -r to remove it. Skipping --fail on a healthy array would return an error. Afterwards /proc/mdstat shows each array degraded to a single member, ready for the drive to be physically swapped:
# cat /proc/mdstat
md0 : active raid1 nvme0n1p1[0]
33520640 blocks super 1.2 [2/1] [U_]
md2 : active raid1 nvme0n1p3[0]
965467456 blocks super 1.2 [2/1] [U_]
bitmap: 8/8 pages [32KB], 65536KB chunk
md1 : active raid1 nvme0n1p2[0]
1046528 blocks super 1.2 [2/1] [U_]
unused devices: <none>
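The three fail-then-remove pairs in this step follow one pattern, so they can be generated from /proc/mdstat instead of typed by hand. This is only a sketch: plan_removal is a name I made up, it assumes NVMe-style partition names (disk name followed by "p" and a number), and it deliberately prints the commands rather than running them so they can be reviewed first.

```shell
# Print (do not run) the fail-then-remove command pair for every md member
# that lives on the given disk. Reads /proc/mdstat-style text on stdin.
# Assumes NVMe-style naming: partitions are "<disk>p<N>", e.g. nvme1n1p1.
plan_removal() {
    disk=$1
    awk -v disk="$disk" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                # nvme1n1p1[1] -> nvme1n1p1 (also drops flags such as (F))
                sub(/\[[0-9]+\].*$/, "", member)
                if (index(member, disk "p") == 1)
                    printf "mdadm /dev/%s --fail /dev/%s && mdadm /dev/%s -r /dev/%s\n", $1, member, $1, member
            }
        }
    '
}
```

Running plan_removal nvme1n1 < /proc/mdstat before the removal would print one line per array, matching the commands used above.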
7. Requesting the replacement
I opened a support ticket in Hetzner Robot requesting a drive swap, including the serial number of the failing drive (S676NF0R518195). Hetzner handles the physical replacement in their datacenter.
8. Hetzner’s response
A few hours later I received confirmation that the drive had been replaced and the server had been booted into Hetzner’s rescue system with a temporary root password. The rescue system is a minimal Linux environment that runs from RAM - useful for operations like this where you need to work on the disks without the OS running on them.
9. Verifying the new drive
After logging into the rescue system, I ran smartctl on the new drive to confirm it was healthy:
# smartctl -a /dev/nvme1n1
Serial Number: S676NF0R554066
Firmware Version: GXA7801Q
SMART overall-health self-assessment test result: PASSED
Percentage Used: 2%
Available Spare: 100%
Critical Warning: 0x00
New serial number, a newer firmware revision (GXA7801Q instead of GXA7601Q), Percentage Used: 2%, and a clean bill of health. Good to proceed.
10. Restoring the partition table
I copied the partition layout from the healthy drive directly onto the new one:
# sfdisk -d /dev/nvme0n1 | sfdisk /dev/nvme1n1
Checking that no-one is using this disk right now ... OK
>>> Created a new DOS (MBR) disklabel with disk identifier 0x4cdbd4cd.
/dev/nvme1n1p1: Created a new partition 1 of type 'Linux raid autodetect' and of size 32 GiB.
/dev/nvme1n1p2: Created a new partition 2 of type 'Linux raid autodetect' and of size 1 GiB.
/dev/nvme1n1p3: Created a new partition 3 of type 'Linux raid autodetect' and of size 920.9 GiB.
/dev/nvme1n1p4: Done.
New situation:
Device Boot Start End Sectors Size Id Type
/dev/nvme1n1p1 2048 67110911 67108864 32G fd Linux raid autodetect
/dev/nvme1n1p2 67110912 69208063 2097152 1G fd Linux raid autodetect
/dev/nvme1n1p3 69208064 2000407215 1931199152 920.9G fd Linux raid autodetect
The partition table has been altered.
Syncing disks.
sfdisk -d dumps the partition table in a format that sfdisk can read back. Piping it directly avoids any manual transcription errors. The output confirms three partitions with matching sizes, types (Linux raid autodetect), and the same disk identifier as the original.
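To double-check that the two layouts really match, the dumps from both drives can be compared once the device names are stripped out. A minimal sketch, assuming each drive's sfdisk -d output has been saved to a file first. The function names normalize_sfdisk_dump and same_layout are hypothetical:

```shell
# Strip device-specific parts from `sfdisk -d` output (read on stdin) so
# that dumps taken from two different disks can be compared directly.
normalize_sfdisk_dump() {
    sed -e '/^device:/d' \
        -e 's|^/dev/[^ ]*p\([0-9][0-9]*\) :|part\1 :|'
}

# Succeed if the two dump files describe the same layout, including the
# label type and label-id (disk identifier).
same_layout() {
    a=$(normalize_sfdisk_dump < "$1")
    b=$(normalize_sfdisk_dump < "$2")
    [ "$a" = "$b" ]
}
```

For example: sfdisk -d /dev/nvme0n1 > old.dump; sfdisk -d /dev/nvme1n1 > new.dump; same_layout old.dump new.dump && echo "layouts match".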
11. Adding the new drive to the RAID arrays
First I confirmed the arrays were still degraded from the rescue system’s perspective:
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nvme0n1p3[0]
965467456 blocks super 1.2 [2/1] [U_]
bitmap: 8/8 pages [32KB], 65536KB chunk
md1 : active raid1 nvme0n1p2[0]
1046528 blocks super 1.2 [2/1] [U_]
md0 : active raid1 nvme0n1p1[0]
33520640 blocks super 1.2 [2/1] [U_]
unused devices: <none>
Then I added the new drive’s partitions:
# mdadm /dev/md0 -a /dev/nvme1n1p1
mdadm: added /dev/nvme1n1p1
# mdadm /dev/md1 -a /dev/nvme1n1p2
mdadm: added /dev/nvme1n1p2
# mdadm /dev/md2 -a /dev/nvme1n1p3
mdadm: added /dev/nvme1n1p3
mdadm -a triggers the rebuild immediately. Checking /proc/mdstat right after shows md0 already syncing while md1 and md2 are queued (resync=DELAYED):
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nvme1n1p3[2] nvme0n1p3[0]
965467456 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 8/8 pages [32KB], 65536KB chunk
md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
1046528 blocks super 1.2 [2/1] [U_]
resync=DELAYED
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]
33520640 blocks super 1.2 [2/1] [U_]
[=>...................] recovery = 8.3% (2801920/33520640) finish=2.3min speed=215532K/sec
unused devices: <none>
The 32 GB swap partition finished in a couple of minutes. The 921 GB data partition took longer. Once done, all arrays showed [UU] - both members healthy:
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0] [2/2] [UU]
md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0] [2/2] [UU]
md2 : active raid1 nvme1n1p3[2] nvme0n1p3[0] [2/2] [UU]
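Rather than re-running cat /proc/mdstat until the rebuild is over, the wait can be scripted. This is a sketch under the assumption that an unfinished rebuild always shows up as a "recovery =" or "resync=" line, as in the output above. The function names and the 30-second default interval are my own choices:

```shell
# Succeed if the mdstat text on stdin shows a running or queued rebuild.
rebuild_in_progress() {
    grep -Eq '(recovery|resync) *='
}

# Block until no array is rebuilding. $1 is the mdstat file (defaults to
# /proc/mdstat), $2 the polling interval in seconds (defaults to 30).
wait_for_rebuild() {
    mdstat=${1:-/proc/mdstat}
    interval=${2:-30}
    while rebuild_in_progress < "$mdstat"; do
        sleep "$interval"
    done
}
```

Something like wait_for_rebuild && cat /proc/mdstat then prints the final state once all three arrays have finished syncing.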
12. Installing the bootloader on the new drive
RAID sync only copies data - it doesn’t install a bootloader. If the original drive fails before a bootloader has been installed on the new one, the server won’t boot. So I chrooted into the system and ran grub-install on the new drive.
Watch out when following Hetzner’s official guide.
The guide lists grub-mkdevicemap and grub-install commands near the top of the section, but only mentions at the very bottom that these must be run inside a chroot of your installed OS - not directly in the rescue system. Since Hetzner always boots the server into rescue mode after a drive swap, this applies every time.
It’s easy to run grub-install /dev/nvme1n1 in the rescue environment before noticing that footnote, which installs GRUB in the wrong place and leaves the server unbootable. Always do the mounts and chroot first, then run any GRUB commands.
A quick blkid shows what each array actually contains:
# blkid | grep md
/dev/md2: UUID="02219d86-c796-4a92-b20e-273c35928815" BLOCK_SIZE="4096" TYPE="ext4"
/dev/md0: UUID="e1fd1a91-eab0-4ff3-aa66-aa2ccb48fc86" TYPE="swap"
/dev/md1: UUID="8d8068ab-8ff5-4c70-8b06-14c862b31c26" SEC_TYPE="ext2" BLOCK_SIZE="4096" TYPE="ext3"
md2 - root filesystem (ext4), mounted at /mnt
md1 - boot partition (ext3), mounted at /mnt/boot
md0 - swap, not mounted - swap isn’t needed for a chroot
# mount /dev/md2 /mnt
# mount /dev/md1 /mnt/boot
# mount --bind /dev /mnt/dev
# mount --bind /proc /mnt/proc
# mount --bind /sys /mnt/sys
# chroot /mnt
# grub-mkdevicemap -n
# grub-install /dev/nvme1n1
Installing for i386-pc platform.
Installation finished. No error reported.
grub-mkdevicemap -n regenerates GRUB’s device map - a file that maps GRUB’s device names (like (hd0)) to the actual block devices. The -n flag (--no-floppy) tells it not to probe for floppy drives. Running it before grub-install ensures GRUB has an accurate view of the current disk layout inside the chroot.
The --bind mounts give the chroot access to the running kernel’s device tree, proc filesystem, and sysfs - necessary for grub-install to work correctly.
After exiting the chroot, I rebooted the server out of rescue mode. It came back up cleanly on the restored RAID-1 array.
Was the replacement actually necessary?
After I opened the ticket, Hetzner support replied before doing anything, explaining the situation:
The message ‘Critical Warning: 0x04’ is caused by “Percentage Used” being above 100%. This only means that the drive’s warranty from the manufacturer is over. But as long as ‘Available Spare’ is greater than ‘Available Spare Threshold’, you can safely ignore this message.
Unfortunately, tools like smartctl will report the disk as failed, so you might need some custom filters for your monitoring tool.
We have investigated and analyzed this topic with the manufacturers for a very long time. Unfortunately, it is not possible to disable this warning for our use case.
So technically, the drive was not failing - it had just exceeded its rated write endurance and the manufacturer warranty was up. Available Spare was still at 100%, well above the 10% threshold, meaning the drive had no bad blocks and was still operating normally.
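The rule Hetzner describes (ignore the warning while Available Spare stays above Available Spare Threshold) could be the basis for the kind of custom monitoring filter they mention. A minimal sketch, assuming the field labels that smartctl -a prints for NVMe drives. The function name spare_above_threshold is hypothetical:

```shell
# Read `smartctl -a` output on stdin. Exit 0 if Available Spare is above
# Available Spare Threshold, 1 if it is not, 2 if either field is missing.
spare_above_threshold() {
    awk -F: '
        /^Available Spare:/           { gsub(/[^0-9]/, "", $2); spare = $2 }
        /^Available Spare Threshold:/ { gsub(/[^0-9]/, "", $2); threshold = $2 }
        END {
            if (spare == "" || threshold == "") exit 2
            exit ((spare + 0 > threshold + 0) ? 0 : 1)
        }
    '
}
```

On the server this would be used as smartctl -a /dev/nvme1n1 | spare_above_threshold. It only encodes the spare-versus-threshold comparison and says nothing about the other Critical Warning bits.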
I decided to proceed anyway. The warning isn’t going away and monitoring will keep alerting on it. The second drive was already at 81% used, and the serial numbers - S676NF0R518195 and S676NF0R518196, one digit apart - strongly suggest both drives came from the same manufacturing batch. Drives from the same batch tend to fail around the same time. With a RAID-1, the whole point is to survive a single drive failure - but if both drives degrade together, that safety net disappears. Getting ahead of it now made more sense than waiting.