Replacing an NVMe Drive in a Hetzner Dedicated Server

One of my Hetzner dedicated servers has two Samsung 1 TB NVMe drives in a software RAID-1 array. This is a walkthrough of how I replaced a drive that had exceeded its rated write endurance, without data loss.

1. The alert

It started with an email from smartd running on the server itself:

Subject: SMART error (Health) detected on host

Device: /dev/nvme1, Critical Warning (0x04): Reliability

Device info:
SAMSUNG MZVL21T0HCLR-00B00, S/N:S676NF0R518195, FW:GXA7601Q, 1.02 TB

Critical Warning 0x04 means the drive is reporting that the reliability of its NVM subsystem has been degraded - taken at face value, it can no longer vouch for the data it holds. Time to act.

2. Confirming which drive had the warning

I ran smartctl against both drives to see their health status.

/dev/nvme0n1 was healthy:

SMART overall-health self-assessment test result: PASSED

Percentage Used:  81%
Available Spare:  100%
Critical Warning: 0x00

/dev/nvme1n1 was not:

SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

Percentage Used:  100%
Available Spare:  100%
Critical Warning: 0x04

Percentage Used: 100% confirms the drive has exhausted its rated write endurance. The serial number (S676NF0R518195) matched the one in the alert email - /dev/nvme1n1 was the bad drive.
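
For reference, the fields quoted above can be pulled out of the full smartctl report with a small filter. This is only a sketch: nvme_health_fields is a helper name I made up, and it assumes the field labels that smartctl -a prints for NVMe devices, as in the excerpts above.

```shell
# Keep only the health fields discussed above from `smartctl -a` output on stdin.
# "Available Spare Threshold" is deliberately not matched: the colon must
# directly follow "Available Spare".
nvme_health_fields() {
    grep -E '^(Serial Number|Critical Warning|Available Spare|Percentage Used):'
}

# Usage on a live system (needs root and smartmontools):
#   for dev in /dev/nvme0n1 /dev/nvme1n1; do
#       echo "== $dev"
#       smartctl -a "$dev" | nvme_health_fields
#   done
```

Printing the serial number next to the health fields makes it easy to match a device node against the serial in the alert email.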

3. Checking the RAID array

# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      33520640 blocks super 1.2 [2/2] [UU]

md2 : active raid1 nvme1n1p3[1] nvme0n1p3[0]
      965467456 blocks super 1.2 [2/2] [UU]
      bitmap: 7/8 pages [28KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/2] [UU]

unused devices: <none>

[2/2] [UU] means all arrays are fully intact - both drives active, no degradation at the RAID level. The SMART warning was about the drive’s internal health, not a failure that had caused it to drop out of the array yet. This is actually the best case: the drive was flagging trouble early enough to replace it before RAID noticed anything wrong.
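
Reading [UU] versus [U_] by eye is fine for three arrays, but the same check is easy to script. A minimal sketch, assuming the /proc/mdstat layout shown above; degraded_arrays is a hypothetical helper name, not part of mdadm.

```shell
# List md arrays whose member map (e.g. [U_]) shows a missing member.
# Reads /proc/mdstat-formatted text on stdin; prints nothing if all are healthy.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                        # remember the current array
        $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }    # status line ending in [U_], [_U], ...
    '
}

# Usage:
#   degraded_arrays < /proc/mdstat
```

On the output above this prints nothing, which matches the "no degradation at the RAID level" reading.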

4. Checking the partition table type

Before doing anything, I confirmed whether the disks used MBR or GPT:

# parted -l
Model: SAMSUNG MZVL21T0HCLR-00B00 (nvme)
Disk /dev/nvme0n1: 1024GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number  Start   End     Size    Type     File system  Flags
 1      1049kB  34.4GB  34.4GB  primary               raid
 2      34.4GB  35.4GB  1074MB  primary               raid
 3      35.4GB  1024GB  989GB   primary               raid

Model: Linux Software RAID Array (md)
Disk /dev/md2: 989GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:
Number  Start  End    Size   File system  Flags
 1      0.00B  989GB  989GB  ext4

Model: Linux Software RAID Array (md)
Disk /dev/md0: 34.3GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:
Number  Start  End     Size    File system     Flags
 1      0.00B  34.3GB  34.3GB  linux-swap(v1)

Model: SAMSUNG MZVL21T0HCLR-00B00 (nvme)
Disk /dev/nvme1n1: 1024GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number  Start   End     Size    Type     File system  Flags
 1      1049kB  34.4GB  34.4GB  primary               raid
 2      34.4GB  35.4GB  1074MB  primary               raid
 3      35.4GB  1024GB  989GB   primary               raid

Model: Linux Software RAID Array (md)
Disk /dev/md1: 1072MB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:
Number  Start  End     Size    File system  Flags
 1      0.00B  1072MB  1072MB  ext3

Partition Table: msdos on both NVMe drives confirms MBR. This matters because the method to back up and restore the partition layout differs between MBR and GPT. The output also gives a useful overview of the full setup: three partitions per drive (swap, boot, root), all marked as raid and active in their respective md arrays.

5. Backing up the partition table

With the partition layout confirmed as MBR, I dumped it from the healthy drive as a precaution:

# sfdisk --dump /dev/nvme0n1 > nvme0n1_parttable_mbr.bak

This saves the exact partition boundaries. If anything goes wrong during restoration on the new drive, this file is the reference.
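
Since the backup command depends on the partition table type, here is a small sketch that picks the right one. The helper name parttable_backup_cmd is hypothetical, and the GPT branch is an assumption on my part: it uses sgdisk --backup from the gdisk package, which I did not need on this server. The function only prints the command, it does not run it.

```shell
# Print the partition-table backup command for a disk, given the table type
# as reported by parted ("msdos" for MBR, or "gpt"). Nothing is executed.
parttable_backup_cmd() {
    disk=$1
    pttype=$2
    name=${disk##*/}            # /dev/nvme0n1 -> nvme0n1
    case "$pttype" in
        msdos|dos) echo "sfdisk --dump $disk > ${name}_parttable_mbr.bak" ;;
        gpt)       echo "sgdisk --backup=${name}_parttable_gpt.bak $disk" ;;
        *)         echo "unknown partition table type: $pttype" >&2; return 1 ;;
    esac
}

# Usage:
#   parttable_backup_cmd /dev/nvme0n1 msdos
```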

6. Removing the failing drive from the RAID

Since the RAID arrays were still fully healthy ([UU]), mdadm would not let me remove an active member directly - each partition has to be explicitly marked as failed first, then removed:

# mdadm /dev/md0 --fail /dev/nvme1n1p1 && mdadm /dev/md0 -r /dev/nvme1n1p1
mdadm: set /dev/nvme1n1p1 faulty in /dev/md0
mdadm: hot removed /dev/nvme1n1p1 from /dev/md0
# mdadm /dev/md1 --fail /dev/nvme1n1p2 && mdadm /dev/md1 -r /dev/nvme1n1p2
mdadm: set /dev/nvme1n1p2 faulty in /dev/md1
mdadm: hot removed /dev/nvme1n1p2 from /dev/md1
# mdadm /dev/md2 --fail /dev/nvme1n1p3 && mdadm /dev/md2 -r /dev/nvme1n1p3
mdadm: set /dev/nvme1n1p3 faulty in /dev/md2
mdadm: hot removed /dev/nvme1n1p3 from /dev/md2

--fail marks the device as faulty in the array, which allows -r to remove it. Skipping --fail on a healthy array would return an error. Afterwards /proc/mdstat shows each array degraded to a single member, ready for the drive to be physically swapped:

# cat /proc/mdstat
md0 : active raid1 nvme0n1p1[0]
      33520640 blocks super 1.2 [2/1] [U_]
md2 : active raid1 nvme0n1p3[0]
      965467456 blocks super 1.2 [2/1] [U_]
      bitmap: 8/8 pages [32KB], 65536KB chunk
md1 : active raid1 nvme0n1p2[0]
      1046528 blocks super 1.2 [2/1] [U_]
unused devices: <none>
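
The three fail-and-remove commands in this step follow one pattern, so they can be wrapped in a loop. A minimal sketch: remove_disk_from_arrays and the DRY_RUN switch are names I made up, and the md0/p1, md1/p2, md2/p3 mapping is specific to this server's layout. By default it only prints what it would run.

```shell
# Fail and remove every partition of one disk from its md array.
# The array-to-partition mapping is hard-coded for this server.
# With DRY_RUN=1 (the default) the mdadm commands are printed, not executed.
remove_disk_from_arrays() {
    disk=$1                                  # e.g. /dev/nvme1n1
    for pair in md0:p1 md1:p2 md2:p3; do
        md=/dev/${pair%%:*}                  # part before the colon
        part=$disk${pair##*:}                # disk plus part after the colon
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "mdadm $md --fail $part && mdadm $md -r $part"
        else
            mdadm "$md" --fail "$part" && mdadm "$md" -r "$part" || return 1
        fi
    done
}

# Usage:
#   remove_disk_from_arrays /dev/nvme1n1             # print only
#   DRY_RUN=0 remove_disk_from_arrays /dev/nvme1n1   # actually run (as root)
```

Defaulting to a dry run is deliberate: pointing this at the wrong disk would degrade the arrays on the healthy drive instead.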

7. Requesting the replacement

I opened a support ticket in Hetzner Robot requesting a drive swap, including the serial number of the failing drive (S676NF0R518195). Hetzner handles the physical replacement in their datacenter.

8. Hetzner’s response

A few hours later I received confirmation that the drive had been replaced and the server had been booted into Hetzner’s rescue system with a temporary root password. The rescue system is a minimal Linux environment that runs from RAM - useful for operations like this where you need to work on the disks without the OS running on them.

9. Verifying the new drive

After logging into the rescue system, I ran smartctl on the new drive to confirm it was healthy:

# smartctl -a /dev/nvme1n1

Serial Number: S676NF0R554066
Firmware Version: GXA7801Q

SMART overall-health self-assessment test result: PASSED

Percentage Used:   2%
Available Spare:   100%
Critical Warning:  0x00

New serial number, a newer firmware revision (GXA7801Q, versus GXA7601Q on the old drive), Percentage Used: 2%, and a clean bill of health. Good to proceed.

10. Restoring the partition table

I copied the partition layout from the healthy drive directly onto the new one:

# sfdisk -d /dev/nvme0n1 | sfdisk /dev/nvme1n1
Checking that no-one is using this disk right now ... OK

>>> Created a new DOS (MBR) disklabel with disk identifier 0x4cdbd4cd.
/dev/nvme1n1p1: Created a new partition 1 of type 'Linux raid autodetect' and of size 32 GiB.
/dev/nvme1n1p2: Created a new partition 2 of type 'Linux raid autodetect' and of size 1 GiB.
/dev/nvme1n1p3: Created a new partition 3 of type 'Linux raid autodetect' and of size 920.9 GiB.
/dev/nvme1n1p4: Done.

New situation:
Device         Boot    Start        End    Sectors   Size Id Type
/dev/nvme1n1p1          2048   67110911   67108864    32G fd Linux raid autodetect
/dev/nvme1n1p2      67110912   69208063    2097152     1G fd Linux raid autodetect
/dev/nvme1n1p3      69208064 2000407215 1931199152 920.9G fd Linux raid autodetect

The partition table has been altered.
Syncing disks.

sfdisk -d dumps the partition table in a format that sfdisk can read back. Piping it directly avoids any manual transcription errors. The output confirms three partitions with the expected sizes and type (Linux raid autodetect). Because the dump also carries the label ID, the new drive ends up with the same disk identifier as the original.
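
To double-check that both drives now carry the same layout, the two dumps can be compared once the device names are stripped out. A sketch under the assumption that the disks are named /dev/nvmeXnY as on this server; normalize_dump is a hypothetical helper.

```shell
# Remove the disk-specific parts of an `sfdisk -d` dump on stdin so that
# dumps from two disks can be compared: drop the "device:" header line and
# rewrite partition paths such as /dev/nvme0n1p1 to DISKp1.
normalize_dump() {
    sed -E -e '/^device:/d' -e 's#^/dev/nvme[0-9]+n[0-9]+#DISK#'
}

# Usage:
#   sfdisk -d /dev/nvme0n1 | normalize_dump > /tmp/a
#   sfdisk -d /dev/nvme1n1 | normalize_dump > /tmp/b
#   diff /tmp/a /tmp/b && echo "layouts match"
```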

11. Adding the new drive to the RAID arrays

First, I confirmed the arrays were still degraded from the rescue system’s perspective:

# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nvme0n1p3[0]
      965467456 blocks super 1.2 [2/1] [U_]
      bitmap: 8/8 pages [32KB], 65536KB chunk
md1 : active raid1 nvme0n1p2[0]
      1046528 blocks super 1.2 [2/1] [U_]
md0 : active raid1 nvme0n1p1[0]
      33520640 blocks super 1.2 [2/1] [U_]
unused devices: <none>

Then I added the new drive’s partitions:

# mdadm /dev/md0 -a /dev/nvme1n1p1
mdadm: added /dev/nvme1n1p1
# mdadm /dev/md1 -a /dev/nvme1n1p2
mdadm: added /dev/nvme1n1p2
# mdadm /dev/md2 -a /dev/nvme1n1p3
mdadm: added /dev/nvme1n1p3

mdadm -a triggers the rebuild immediately. Checking /proc/mdstat right after shows md0 already syncing while md1 and md2 are queued (resync=DELAYED):

# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nvme1n1p3[2] nvme0n1p3[0]
      965467456 blocks super 1.2 [2/1] [U_]
        resync=DELAYED
      bitmap: 8/8 pages [32KB], 65536KB chunk
md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/1] [U_]
        resync=DELAYED
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]
      33520640 blocks super 1.2 [2/1] [U_]
      [=>...................]  recovery =  8.3% (2801920/33520640) finish=2.3min speed=215532K/sec
unused devices: <none>

The 32 GiB swap partition finished in a couple of minutes. The 921 GiB root partition took longer. Once done, all arrays showed [UU] - both members healthy:

md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]  [2/2] [UU]
md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]  [2/2] [UU]
md2 : active raid1 nvme1n1p3[2] nvme0n1p3[0]  [2/2] [UU]
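
Instead of re-running cat /proc/mdstat by hand, the rebuild state can be condensed to one line per array. A minimal sketch, assuming the recovery and resync=DELAYED line formats shown above; rebuild_status is a name I made up.

```shell
# Summarise rebuild state per array from /proc/mdstat-formatted text on stdin.
# Prints "<array> <percent>" for a running recovery or resync and
# "<array> DELAYED" for a queued one. Idle arrays print nothing.
rebuild_status() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                          # current array
        /(recovery|resync) *= *[0-9.]+%/ {                   # progress line
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9.]+%$/) print name, $i
        }
        /resync=DELAYED/ { print name, "DELAYED" }           # queued rebuild
    '
}

# Usage: poll until nothing is rebuilding any more.
#   while [ -n "$(rebuild_status < /proc/mdstat)" ]; do
#       rebuild_status < /proc/mdstat
#       sleep 30
#   done
```

As far as I know, mdadm --wait /dev/md0 /dev/md1 /dev/md2 blocks until the rebuilds finish as well, if no progress output is needed.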

12. Installing the bootloader on the new drive

RAID sync only copies the contents of the md arrays - it doesn’t install a bootloader on the new drive. If the original drive (nvme0n1) fails before that is done, the server won’t boot. So I chrooted into the system and ran grub-install on the new drive.

Watch out when following Hetzner’s official guide.

The guide lists grub-mkdevicemap and grub-install commands near the top of the section, but only mentions at the very bottom that these must be run inside a chroot of your installed OS - not directly in the rescue system. Since Hetzner always boots the server into rescue mode after a drive swap, this applies every time.

It’s easy to run grub-install /dev/nvme1n1 in the rescue environment before noticing that footnote. That sets GRUB up from the rescue system rather than from the installed OS and can leave the server unbootable. Always do the mounts and chroot first, then run any GRUB commands.

A quick blkid shows what each array actually contains:

# blkid | grep md
/dev/md2: UUID="02219d86-c796-4a92-b20e-273c35928815" BLOCK_SIZE="4096" TYPE="ext4"
/dev/md0: UUID="e1fd1a91-eab0-4ff3-aa66-aa2ccb48fc86" TYPE="swap"
/dev/md1: UUID="8d8068ab-8ff5-4c70-8b06-14c862b31c26" SEC_TYPE="ext2" BLOCK_SIZE="4096" TYPE="ext3"

md2 - root filesystem (ext4), mounted at /mnt
md1 - boot partition (ext3), mounted at /mnt/boot
md0 - swap, not mounted - swap isn’t needed for a chroot

# mount /dev/md2 /mnt
# mount /dev/md1 /mnt/boot
# mount --bind /dev /mnt/dev
# mount --bind /proc /mnt/proc
# mount --bind /sys /mnt/sys
# chroot /mnt
# grub-mkdevicemap -n
# grub-install /dev/nvme1n1
Installing for i386-pc platform.
Installation finished. No error reported.

grub-mkdevicemap -n regenerates GRUB’s device map - a file that maps GRUB’s device names (like (hd0)) to the actual block devices. The -n flag (--no-floppy) tells it not to probe for floppy drives. Running it before grub-install ensures GRUB has an accurate view of the current disk layout inside the chroot.

The --bind mounts give the chroot access to the running kernel’s device nodes (/dev), proc filesystem, and sysfs - necessary for grub-install to work correctly.
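
To make it harder to run the GRUB commands outside the chroot by accident, the whole sequence can be wrapped in one function that always goes through chroot. This is a sketch, not what I actually typed: run, install_grub_in_chroot and DRY_RUN are hypothetical names, the md devices and mount points are the ones from this server, and the final umount -R is my addition for cleanup. It prints the commands by default.

```shell
# Print a command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Mount the installed system, run the GRUB commands inside the chroot,
# then unmount everything again. Stops at the first failing step; in that
# case anything already mounted under /mnt is left mounted for inspection.
install_grub_in_chroot() {
    target=$1                                   # disk that should receive GRUB
    run mount /dev/md2 /mnt &&
    run mount /dev/md1 /mnt/boot &&
    run mount --bind /dev /mnt/dev &&
    run mount --bind /proc /mnt/proc &&
    run mount --bind /sys /mnt/sys &&
    run chroot /mnt grub-mkdevicemap -n &&
    run chroot /mnt grub-install "$target" &&
    run umount -R /mnt
}

# Usage:
#   install_grub_in_chroot /dev/nvme1n1             # print only
#   DRY_RUN=0 install_grub_in_chroot /dev/nvme1n1   # actually run (as root)
```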

After exiting the chroot, I rebooted the server out of rescue mode. It came back up cleanly on the restored RAID-1 array.

Was the replacement actually necessary?

After opening the ticket, Hetzner support replied before doing anything, explaining the situation:

The message ‘Critical Warning: 0x04’ is caused by “Percentage Used” being above 100%. This only means that the drive’s warranty from the manufacturer is over. But as long as ‘Available Spare’ is greater than ‘Available Spare Threshold’, you can safely ignore this message.

Unfortunately, tools like smartctl will report the disk as failed, so you might need some custom filters for your monitoring tool.

We have investigated and analyzed this topic with the manufacturers for a very long time. Unfortunately, it is not possible to disable this warning for our use case.

So technically, the drive was not failing - it had just exceeded its rated write endurance and the manufacturer warranty was up. Available Spare was still at 100%, well above the 10% threshold, meaning the drive had not yet had to dip into its spare blocks and was still operating normally.
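
The rule from Hetzner’s reply is simple enough to turn into the kind of custom monitoring filter they mention. A minimal sketch, assuming the field labels from smartctl -a on NVMe devices; classify_nvme_warning and the four result words are my own invention, not anything smartmontools or Hetzner provide.

```shell
# Classify `smartctl -a` NVMe output on stdin using the rule from Hetzner's
# reply: a 0x04 warning can be ignored while Available Spare is greater
# than Available Spare Threshold.
# Prints one of: ok, endurance-only, replace, unknown.
classify_nvme_warning() {
    awk -F': *' '
        /^Critical Warning:/          { warn = $2; sub(/ +$/, "", warn) }
        /^Available Spare:/           { spare = $2 + 0; have_spare = 1 }
        /^Available Spare Threshold:/ { thresh = $2 + 0; have_thresh = 1 }
        END {
            if (warn == "" || !have_spare || !have_thresh) print "unknown"
            else if (warn == "0x00")                        print "ok"
            else if (warn == "0x04" && spare > thresh)      print "endurance-only"
            else                                            print "replace"
        }
    '
}

# Usage:
#   smartctl -a /dev/nvme1n1 | classify_nvme_warning
```

Anything other than a plain 0x04 with healthy spare is deliberately treated as "replace", so the filter only silences the one case Hetzner describes.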

I decided to proceed anyway. The warning isn’t going away and monitoring will keep alerting on it. The second drive was already at 81% used, and the serial numbers - S676NF0R518195 and S676NF0R518196, one digit apart - strongly suggest both drives came from the same manufacturing batch. Drives from the same batch tend to fail around the same time. With a RAID-1, the whole point is to survive a single drive failure - but if both drives degrade together, that safety net disappears. Getting ahead of it now made more sense than waiting.