Thursday, April 27, 2017
Recorder updates
Did a few more commits to the flight recorder to repair what had been broken by yesterday's update.
For the past two days, my blog index was showing a very empty page, which annoyed me. This turned out to be a bug in blogmax.el, which I spent a little time fixing. It took more time than I wished, because the Emacs Lisp debugger is quite primitive. The cause was that I had passed a macro-like form as a macro argument, so I now guard against that scenario better.
BTRFS debugging
No response to my e-mail on the btrfs mailing list.
Updated btrfs-progs (Red Hat Bugzilla 1435567) with the adjusted recorder, tweaked the instrumentation, and ran the tool again.
This updated instrumentation shows that we do insert a max_size of 0 in add_extent_rec_nolookup:
78657 [1.837787:0x44f023] MAX_SIZE_ZERO: Set max size 0 for rec 0x138a030 from tmpl 0x7fffffffcf70
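For reference, the instrumentation in add_extent_rec_nolookup is just a guarded RECORD call around the assignment; roughly this, reconstructed from the trace above (the exact condition and placement are my own, not necessarily the committed code):

	/* Sketch of the instrumentation in add_extent_rec_nolookup */
	rec->max_size = tmpl->max_size;
	if (rec->max_size == 0)
		RECORD(MAX_SIZE_ZERO, "Set max size %lu for rec %p from tmpl %p",
		       rec->max_size, rec, tmpl);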
This is indeed the failing rec in check_extent_refs:
#6  0x000000000045c529 in check_extent_refs (extent_cache=0x7fffffffdb30, root=0x777570) at cmds-check.c:9322
warning: Source file is more recent than executable.
9322        RECORD(EXTENTS, "Container extent record %p for cache %p start %lu max_size %lu", rec, cache, rec->start, rec->max_size);
(gdb) p rec
$1 = (struct extent_record *) 0x138a030
Digging a bit deeper: we set rec->max_size = 0 in add_extent_rec_nolookup, called from add_tree_backref, where the extent_record tmpl is cleared with a memset, so max_size is indeed 0. However, immediately after that we do a lookup_cache_extent with a size of 1. So I wonder whether, at that stage, we should set max_size to 1 for the newly created extent record.
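That is what I tried. A minimal sketch of the change, assuming the template setup in add_tree_backref looks roughly like this (simplified from memory, not a definitive patch):

	struct extent_record tmpl;

	memset(&tmpl, 0, sizeof(tmpl));
	tmpl.start = bytenr;
	tmpl.nr = 1;
	tmpl.max_size = 1;	/* experiment: match the size-1 lookup below */
	tmpl.metadata = 1;

	ret = add_extent_rec_nolookup(extent_cache, &tmpl);
	if (ret)
		return ret;

	/* The record is immediately looked up again with a size of 1 */
	cache = lookup_cache_extent(extent_cache, bytenr, 1);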
That goes slightly further, but then I get:
*** Error in `btrfs check': double free or corruption (!prev): 0x0000000007d9c430 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7925b)[0x7ffff6feb25b]
/lib64/libc.so.6(+0x828ea)[0x7ffff6ff48ea]
/lib64/libc.so.6(cfree+0x4c)[0x7ffff6ff831c]
btrfs check[0x44d784]
btrfs check[0x4531ac]
btrfs check[0x45b24b]
btrfs check[0x45b743]
btrfs check[0x45c2b1]
btrfs check(cmd_check+0xcad)[0x4602d4]
btrfs check(main+0x8b)[0x40b7fb]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7ffff6f92401]
btrfs check(_start+0x2a)[0x40b4ba]
======= Memory map: ========
00400000-004a4000 r-xp 00000000 08:35 25167142 /home/ddd/Work/btrfs-progs/btrfs
006a4000-006a8000 r--p 000a4000 08:35 25167142 /home/ddd/Work/btrfs-progs/btrfs
006a8000-006fb000 rw-p 000a8000 08:35 25167142 /home/ddd/Work/btrfs-progs/btrfs
006fb000-316a6000 rw-p 00000000 00:00 0 [heap]
7ffff0000000-7ffff0021000 rw-p 00000000 00:00 0
7ffff0021000-7ffff4000000 ---p 00000000 00:00 0
7ffff6d5b000-7ffff6d71000 r-xp 00000000 08:33 3156890 /usr/lib64/libgcc_s-6.3.1-20161221.so.1
7ffff6d71000-7ffff6f70000 ---p 00016000 08:33 3156890 /usr/lib64/libgcc_s-6.3.1-20161221.so.1
7ffff6f70000-7ffff6f71000 r--p 00015000 08:33 3156890 /usr/lib64/libgcc_s-6.3.1-20161221.so.1
7ffff6f71000-7ffff6f72000 rw-p 00016000 08:33 3156890 /usr/lib64/libgcc_s-6.3.1-20161221.so.1
7ffff6f72000-7ffff712f000 r-xp 00000000 08:33 3154711 /usr/lib64/libc-2.24.so
7ffff712f000-7ffff732e000 ---p 001bd000 08:33 3154711 /usr/lib64/libc-2.24.so
7ffff732e000-7ffff7332000 r--p 001bc000 08:33 3154711 /usr/lib64/libc-2.24.so
7ffff7332000-7ffff7334000 rw-p 001c0000 08:33 3154711 /usr/lib64/libc-2.24.so
7ffff7334000-7ffff7338000 rw-p 00000000 00:00 0
7ffff7338000-7ffff7350000 r-xp 00000000 08:33 3155302 /usr/lib64/libpthread-2.24.so
7ffff7350000-7ffff7550000 ---p 00018000 08:33 3155302 /usr/lib64/libpthread-2.24.so
7ffff7550000-7ffff7551000 r--p 00018000 08:33 3155302 /usr/lib64/libpthread-2.24.so
7ffff7551000-7ffff7552000 rw-p 00019000 08:33 3155302 /usr/lib64/libpthread-2.24.so
7ffff7552000-7ffff7556000 rw-p 00000000 00:00 0
7ffff7556000-7ffff7578000 r-xp 00000000 08:33 3155132 /usr/lib64/liblzo2.so.2.0.0
7ffff7578000-7ffff7777000 ---p 00022000 08:33 3155132 /usr/lib64/liblzo2.so.2.0.0
7ffff7777000-7ffff7778000 r--p 00021000 08:33 3155132 /usr/lib64/liblzo2.so.2.0.0
7ffff7778000-7ffff7779000 rw-p 00000000 00:00 0
7ffff7779000-7ffff778e000 r-xp 00000000 08:33 3155608 /usr/lib64/libz.so.1.2.8
7ffff778e000-7ffff798d000 ---p 00015000 08:33 3155608 /usr/lib64/libz.so.1.2.8
7ffff798d000-7ffff798e000 r--p 00014000 08:33 3155608 /usr/lib64/libz.so.1.2.8
7ffff798e000-7ffff798f000 rw-p 00015000 08:33 3155608 /usr/lib64/libz.so.1.2.8
7ffff798f000-7ffff79cc000 r-xp 00000000 08:33 3153511 /usr/lib64/libblkid.so.1.1.0
7ffff79cc000-7ffff7bcc000 ---p 0003d000 08:33 3153511 /usr/lib64/libblkid.so.1.1.0
7ffff7bcc000-7ffff7bd0000 r--p 0003d000 08:33 3153511 /usr/lib64/libblkid.so.1.1.0
7ffff7bd0000-7ffff7bd1000 rw-p 00041000 08:33 3153511 /usr/lib64/libblkid.so.1.1.0
7ffff7bd1000-7ffff7bd2000 rw-p 00000000 00:00 0
7ffff7bd2000-7ffff7bd6000 r-xp 00000000 08:33 3154270 /usr/lib64/libuuid.so.1.3.0
7ffff7bd6000-7ffff7dd5000 ---p 00004000 08:33 3154270 /usr/lib64/libuuid.so.1.3.0
7ffff7dd5000-7ffff7dd6000 r--p 00003000 08:33 3154270 /usr/lib64/libuuid.so.1.3.0
7ffff7dd6000-7ffff7dd7000 rw-p 00000000 00:00 0
7ffff7dd7000-7ffff7dfc000 r-xp 00000000 08:33 3154536 /usr/lib64/ld-2.24.so
7ffff7fdb000-7ffff7fe0000 rw-p 00000000 00:00 0
7ffff7ff5000-7ffff7ff8000 rw-p 00000000 00:00 0
7ffff7ff8000-7ffff7ffa000 r--p 00000000 00:00 0 [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0 [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00025000 08:33 3154536 /usr/lib64/ld-2.24.so
7ffff7ffd000-7ffff7ffe000 rw-p 00026000 08:33 3154536 /usr/lib64/ld-2.24.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Another pass, and I get:
Ok we have overlapping extents that aren't completely covered by each other, this is going to require more careful thought. The extents are [52428800-16384] and [52432896-16384]
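Reading the bracketed numbers as [start-length], the overlap is easy to confirm: 52428800 + 16384 = 52445184, which is past the start of the second extent at 52432896, so the two 16 KiB tree blocks overlap by 12288 bytes.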
This is yet another abort. But the first number seems to match the range we always had trouble with (from two days ago):
#5  0x0000555555584cb6 in set_extent_bits (tree=0x7fffffffdba0, start=52428800, end=52428799, bits=bits@entry=1) at extent_io.c:310
310             err = insert_state(tree, prealloc, start, end, bits);
Trying to figure out why I always get the message:
ERROR: add_tree_backref failed: File exists
Added disambiguation to the message, which is emitted from several places; now I get:
ERROR: add_tree_backref failed (tree block): File exists
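The change itself is trivial; each call site now tags the message with what kind of backref it was adding, along these lines (illustrative, not the exact diff; error() is the btrfs-progs helper that adds the ERROR: prefix):

	/* One of several call sites in cmds-check.c; the "(tree block)" tag is new */
	ret = add_tree_backref(extent_cache, bytenr, 0, root, 0);
	if (ret < 0)
		error("add_tree_backref failed (tree block): %s",
		      strerror(-ret));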
Next run is the list deletion crash:
Program received signal SIGSEGV, Segmentation fault.
list_del (entry=0xe445420) at /home/ddd/Work/btrfs-progs/kernel-lib/list.h:125
125             __list_del(entry->prev, entry->next);
(gdb) up
#1  free_all_extent_backrefs (rec=rec@entry=0xe445350) at cmds-check.c:5388
5388                    list_del(cur);
(gdb) p cur
$1 = (struct list_head *) 0xe445420
(gdb) p *cur
$2 = {next = 0x0, prev = 0xe645350}
The list is screwed up:
(gdb) p rec->backrefs
$5 = {next = 0xe445420, prev = 0xe645350}
(gdb) p rec->backrefs->next
$6 = (struct list_head *) 0xe445420
(gdb) p *rec->backrefs->next
$7 = {next = 0x0, prev = 0xe645350}
(gdb) p *rec->backrefs->prev
$8 = {next = 0xe445350, prev = 0x60}
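To catch this kind of corruption closer to where it happens, one option would be a checked variant of list_del, in the spirit of the kernel's CONFIG_DEBUG_LIST. This is a sketch only, nothing like it is in the tree; it would sit next to list_del in kernel-lib/list.h and needs stdio.h and stdlib.h for the reporting:

	/* Validate both neighbours before unlinking; abort early on corruption */
	static inline void list_del_checked(struct list_head *entry)
	{
		if (!entry->next || !entry->prev ||
		    entry->next->prev != entry || entry->prev->next != entry) {
			fprintf(stderr, "list corruption: entry %p next %p prev %p\n",
				(void *)entry, (void *)entry->next, (void *)entry->prev);
			abort();
		}
		list_del(entry);
	}

With something like that in place, the crash would move from a NULL pointer dereference inside __list_del to a clear diagnostic at the first bad unlink.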
We find our friend 52432896 again, at the root of the "File exists" message:
Breakpoint 1, add_tree_backref (extent_cache=extent_cache@entry=0x7fffffffdb30, bytenr=bytenr@entry=52445184, parent=parent@entry=0, root=2, found_ref=found_ref@entry=0) at cmds-check.c:6240
6240            return -EEXIST;
(gdb) p rec
$7 = (struct extent_record *) 0x889cd0
(gdb) p rec->start
$8 = 52432896
(gdb) p rec->bytenr
There is no member named bytenr.
(gdb) p bytenr
$9 = 52445184
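Note that the bytenr we fail to add, 52445184, is exactly 52428800 + 16384, the end of the first of the two overlapping extents above. The size-1 lookup at that address finds the existing record starting at 52432896, and since rec->start does not match bytenr, add_tree_backref presumably bails out with -EEXIST, hence "File exists".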
Not too surprisingly, 52432896 also shows up in the original dmesg log for the disk crash:
[525761.195108] btree_readpage_end_io_hook: 2 callbacks suppressed
[525761.195114] BTRFS error (device sda4): bad tree block start 11022913622198796681 52432896
[525761.203698] BTRFS error (device sda4): bad tree block start 11022913622198796681 52432896
(some stuff removed here)
[525970.013825] BTRFS error (device sda4): unable to find ref byte nr 52432896 parent 0 root 2 owner 0 offset 0