Toolset improvements

Thursday, April 27, 2017

Recorder updates

Did a few more commits to the flight recorder to repair what had been broken by yesteday's update.

For the past two days, my blog index was showing a very empty page. I was annoyed by that. This was a bug in blogmax.el which I spent a little time to fix. It took more time than I wished, because the Emacs Lisp debugger is quite primitive . This was caused by me having a macro-like thing as a macro-argument, so I protected this scenario better.

BTRFS debugging

No response to my e-mail on the btrfs mailing list.

Updated Red Hat Bugzilla 1435567">btrfs-progs with the adjusted recorder. Also adjusted the instrumentation and ran the tool again.

This updated instrumentation shows that we do insert a max_size of 0 in add_extent_rec_nolookup:

78657 [1.837787:0x44f023] MAX_SIZE_ZERO: Set max size 0 for rec 0x138a030 from tmpl 0x7fffffffcf70

This is indeed the failing rec in check_extent_refs:

#6  0x000000000045c529 in check_extent_refs (extent_cache=0x7fffffffdb30, root=0x777570) at cmds-check.c:9322
warning: Source file is more recent than executable.
9322	                        RECORD(EXTENTS, "Container extent record %p for cache %p start %lu max_size %lu", rec, cache, rec->start, rec->max_size);
(gdb) p rec
$1 = (struct extent_record *) 0x138a030

I dig a bit deeper. We set rec->max_size = 0 in add_extent_rec_nolookup called from add_tree_backref, where we cleared the extent_record tmpl with a memset, so indeed, max_size is 0. However, we immediately after that do a lookup_cache_extent with a size of 1. So I wonder if at that stage, we should not set max_size to 1 for the newly created extent record.

That goes slightly farther, but then I get

*** Error in `btrfs check': double free or corruption (!prev): 0x0000000007d9c430 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7925b)[0x7ffff6feb25b]
/lib64/libc.so.6(+0x828ea)[0x7ffff6ff48ea]
/lib64/libc.so.6(cfree+0x4c)[0x7ffff6ff831c]
btrfs check[0x44d784]
btrfs check[0x4531ac]
btrfs check[0x45b24b]
btrfs check[0x45b743]
btrfs check[0x45c2b1]
btrfs check(cmd_check+0xcad)[0x4602d4]
btrfs check(main+0x8b)[0x40b7fb]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7ffff6f92401]
btrfs check(_start+0x2a)[0x40b4ba]
======= Memory map: ========
00400000-004a4000 r-xp 00000000 08:35 25167142                           /home/ddd/Work/btrfs-progs/btrfs
006a4000-006a8000 r--p 000a4000 08:35 25167142                           /home/ddd/Work/btrfs-progs/btrfs
006a8000-006fb000 rw-p 000a8000 08:35 25167142                           /home/ddd/Work/btrfs-progs/btrfs
006fb000-316a6000 rw-p 00000000 00:00 0                                  [heap]
7ffff0000000-7ffff0021000 rw-p 00000000 00:00 0 
7ffff0021000-7ffff4000000 ---p 00000000 00:00 0 
7ffff6d5b000-7ffff6d71000 r-xp 00000000 08:33 3156890                    /usr/lib64/libgcc_s-6.3.1-20161221.so.1
7ffff6d71000-7ffff6f70000 ---p 00016000 08:33 3156890                    /usr/lib64/libgcc_s-6.3.1-20161221.so.1
7ffff6f70000-7ffff6f71000 r--p 00015000 08:33 3156890                    /usr/lib64/libgcc_s-6.3.1-20161221.so.1
7ffff6f71000-7ffff6f72000 rw-p 00016000 08:33 3156890                    /usr/lib64/libgcc_s-6.3.1-20161221.so.1
7ffff6f72000-7ffff712f000 r-xp 00000000 08:33 3154711                    /usr/lib64/libc-2.24.so
7ffff712f000-7ffff732e000 ---p 001bd000 08:33 3154711                    /usr/lib64/libc-2.24.so
7ffff732e000-7ffff7332000 r--p 001bc000 08:33 3154711                    /usr/lib64/libc-2.24.so
7ffff7332000-7ffff7334000 rw-p 001c0000 08:33 3154711                    /usr/lib64/libc-2.24.so
7ffff7334000-7ffff7338000 rw-p 00000000 00:00 0 
7ffff7338000-7ffff7350000 r-xp 00000000 08:33 3155302                    /usr/lib64/libpthread-2.24.so
7ffff7350000-7ffff7550000 ---p 00018000 08:33 3155302                    /usr/lib64/libpthread-2.24.so
7ffff7550000-7ffff7551000 r--p 00018000 08:33 3155302                    /usr/lib64/libpthread-2.24.so
7ffff7551000-7ffff7552000 rw-p 00019000 08:33 3155302                    /usr/lib64/libpthread-2.24.so
7ffff7552000-7ffff7556000 rw-p 00000000 00:00 0 
7ffff7556000-7ffff7578000 r-xp 00000000 08:33 3155132                    /usr/lib64/liblzo2.so.2.0.0
7ffff7578000-7ffff7777000 ---p 00022000 08:33 3155132                    /usr/lib64/liblzo2.so.2.0.0
7ffff7777000-7ffff7778000 r--p 00021000 08:33 3155132                    /usr/lib64/liblzo2.so.2.0.0
7ffff7778000-7ffff7779000 rw-p 00000000 00:00 0 
7ffff7779000-7ffff778e000 r-xp 00000000 08:33 3155608                    /usr/lib64/libz.so.1.2.8
7ffff778e000-7ffff798d000 ---p 00015000 08:33 3155608                    /usr/lib64/libz.so.1.2.8
7ffff798d000-7ffff798e000 r--p 00014000 08:33 3155608                    /usr/lib64/libz.so.1.2.8
7ffff798e000-7ffff798f000 rw-p 00015000 08:33 3155608                    /usr/lib64/libz.so.1.2.8
7ffff798f000-7ffff79cc000 r-xp 00000000 08:33 3153511                    /usr/lib64/libblkid.so.1.1.0
7ffff79cc000-7ffff7bcc000 ---p 0003d000 08:33 3153511                    /usr/lib64/libblkid.so.1.1.0
7ffff7bcc000-7ffff7bd0000 r--p 0003d000 08:33 3153511                    /usr/lib64/libblkid.so.1.1.0
7ffff7bd0000-7ffff7bd1000 rw-p 00041000 08:33 3153511                    /usr/lib64/libblkid.so.1.1.0
7ffff7bd1000-7ffff7bd2000 rw-p 00000000 00:00 0 
7ffff7bd2000-7ffff7bd6000 r-xp 00000000 08:33 3154270                    /usr/lib64/libuuid.so.1.3.0
7ffff7bd6000-7ffff7dd5000 ---p 00004000 08:33 3154270                    /usr/lib64/libuuid.so.1.3.0
7ffff7dd5000-7ffff7dd6000 r--p 00003000 08:33 3154270                    /usr/lib64/libuuid.so.1.3.0
7ffff7dd6000-7ffff7dd7000 rw-p 00000000 00:00 0 
7ffff7dd7000-7ffff7dfc000 r-xp 00000000 08:33 3154536                    /usr/lib64/ld-2.24.so
7ffff7fdb000-7ffff7fe0000 rw-p 00000000 00:00 0 
7ffff7ff5000-7ffff7ff8000 rw-p 00000000 00:00 0 
7ffff7ff8000-7ffff7ffa000 r--p 00000000 00:00 0                          [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0                          [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00025000 08:33 3154536                    /usr/lib64/ld-2.24.so
7ffff7ffd000-7ffff7ffe000 rw-p 00026000 08:33 3154536                    /usr/lib64/ld-2.24.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0 
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Another pass, and I get:

Ok we have overlapping extents that aren't completely covered by each
other, this is going to require more careful thought.  The extents are
[52428800-16384] and [52432896-16384]

This is yet another abort. But the first number seems to match the range we always had trouble with (from two days ago):

#5  0x0000555555584cb6 in set_extent_bits (tree=0x7fffffffdba0, 
    start=52428800, end=52428799, bits=bits@entry=1) at extent_io.c:310
310			err = insert_state(tree, prealloc, start, end, bits);

Trying to figure out why I always have the message:

ERROR: add_tree_backref failed: File exists

Added disambiguation to the message, which exists in several places, now I get:

ERROR: add_tree_backref failed (tree block): File exists

Next run is the list deletion crash:

Program received signal SIGSEGV, Segmentation fault.
list_del (entry=0xe445420) at /home/ddd/Work/btrfs-progs/kernel-lib/list.h:125
125		__list_del(entry->prev, entry->next);
(gdb) up
#1  free_all_extent_backrefs (rec=rec@entry=0xe445350) at cmds-check.c:5388
5388			list_del(cur);
(gdb) p cur
$1 = (struct list_head *) 0xe445420
(gdb) p *cur
$2 = {next = 0x0, prev = 0xe645350}

The list is screwed up:

p rec->backrefs
$5 = {
  next = 0xe445420, 
  prev = 0xe645350
}
(gdb) p *rec->backrefs->next
$6 = (struct list_head *) 0xe445420
(gdb) p *rec->backrefs->next
$7 = {
  next = 0x0, 
  prev = 0xe645350
}
(gdb) p *rec->backrefs->prev
$8 = {
  next = 0xe445350, 
  prev = 0x60
}

We find our friend 52432896 again as the root cause for the message "file exists":

Breakpoint 1, add_tree_backref (extent_cache=extent_cache@entry=0x7fffffffdb30, bytenr=bytenr@entry=52445184, parent=parent@entry=0, root=2, found_ref=found_ref@entry=0) at cmds-check.c:6240
6240			return -EEXIST;
(gdb) p rec
$7 = (struct extent_record *) 0x889cd0
(gdb) p rec->start
$8 = 52432896
(gdb) p rec->bytenr
There is no member named bytenr.
(gdb) p bytenr
$9 = 52445184

Not too surprisingly, it's also in the original dmesg log for the disk crash:

[525761.195108] btree_readpage_end_io_hook: 2 callbacks suppressed
[525761.195114] BTRFS error (device sda4): bad tree block start 11022913622198796681 52432896
[525761.203698] BTRFS error (device sda4): bad tree block start 11022913622198796681 52432896
(some stuff removed here)
[525970.013825] BTRFS error (device sda4): unable to find ref byte nr 52432896 parent 0 root 2  owner 0 offset 0