Raw send corruption bug finally closed!
This morning, Brian Behlendorf closed the long-standing bug reporting occasional corruption when replicating encrypted datasets using raw send.
This bug’s final dissection and fix were the result of a coordinated community effort, and I’m proud of our own community’s part in that.
Cheers everybody!
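For anyone unfamiliar with the feature involved: a "raw" send (`zfs send -w`) transmits an encrypted dataset's blocks exactly as stored, so the receiving side never needs the encryption key. A minimal sketch of the replication pattern affected by this bug, with hypothetical pool, dataset, and host names:

```shell
# Hypothetical names throughout. Snapshot the encrypted dataset, then
# replicate it raw (-w) so ciphertext travels as-is to the target pool.
zfs snapshot rpool/crypt/data@backup-2021-05-08
zfs send -w rpool/crypt/data@backup-2021-05-08 | \
    ssh backuphost zfs receive -u backuppool/data   # -u: do not mount on receive
```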
github.com/openzfs/zfs
ZFS corruption related to snapshots post-2.0.x upgrade
opened 01:30PM - 08 May 21 UTC
closed 04:55PM - 19 May 25 UTC
jgoerzen
Type: Defect
Component: Encryption
Status: Triage Needed
### System information
Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | Buster
Linux Kernel | 5.10.0-0.bpo.5-amd64
Architecture | amd64
ZFS Version | 2.0.3-1~bpo10+1
SPL Version | 2.0.3-1~bpo10+1
### Describe the problem you're observing
Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:
```
zpool status -v
pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May 3 16:58:33 2021
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme0n1p7 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
<0xeb51>:<0x0>
```
Of note, the `<0xeb51>` is sometimes a snapshot name; if I `zfs destroy` the snapshot, it is replaced by this tag.
Bug #11688 implies that running `zfs destroy` on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub **without rebooting** after seeing this kind of `zpool status` output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:
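For reference, the workaround suggested in #11688 amounts to something like the following; the dataset and snapshot names here are placeholders, substituted from whatever `zpool status -v` reports:

```shell
# Placeholder names; use the snapshot actually listed under "Permanent errors".
zfs destroy rpool/crypt/data@corrupt-snapshot  # remove the snapshot holding the errored block
zpool scrub rpool                              # scrub so stale entries can age out of the error log
zpool status -v rpool                          # a second clean scrub may be needed before the list clears
```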
```
[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P U OE 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140] dump_stack+0x6d/0x88
[393801.328149] spl_panic+0xd3/0xfb [spl]
[393801.328153] ? __wake_up_common_lock+0x87/0xc0
[393801.328221] ? zei_add_range+0x130/0x130 [zfs]
[393801.328225] ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275] ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302] arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331] arc_read_done+0x24d/0x490 [zfs]
[393801.328388] zio_done+0x43d/0x1020 [zfs]
[393801.328445] ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502] zio_execute+0x90/0xf0 [zfs]
[393801.328508] taskq_thread+0x2e7/0x530 [spl]
[393801.328512] ? wake_up_q+0xa0/0xa0
[393801.328569] ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574] ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576] kthread+0x116/0x130
[393801.328578] ? kthread_park+0x80/0x80
[393801.328581] ret_from_fork+0x22/0x30
```
However I want to stress that this backtrace is not the original **cause** of the problem, and it only appears if I do a scrub without first rebooting.
After that panic, the scrub stalled -- and a second error appeared:
```
zpool status -v
pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub in progress since Sat May 8 08:11:07 2021
152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
0B repaired, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme0n1p7 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
<0xeb51>:<0x0>
rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>
```
I have found the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?
- It is a laptop
- It uses ZFS crypto (the others use LUKS)
I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.
### Describe how to reproduce the problem
I can't at will. I have to wait for a spell.
### Include any warning/errors/backtraces from the system logs
See above
### Potentially related bugs
- I already mentioned #11688 which seems similar, but a scrub doesn't immediately resolve the issue here
- A quite similar backtrace also involving `arc_buf_destroy` is in #11443. The behavior described there has some parallels to what I observe. I am uncertain from the discussion what that means for this.
- In #10697 there are some similar symptoms, but it looks like a different issue to me
#zfs