Tuesday, April 25, 2017
Having tried all the official BTRFS tools and gone as far as I could with a boot from a Live Fedora25 USB stick, I went ahead and reinstalled Fedora 25 on the old Shuttle boot disk. Why I had to reinstall, I don't know, but apparently the bootlader that was installed on that disk is not "universal", and Muse won't boot from it.
After reinstallation, rebuilt btrfs-progs-4.10.1-1.fc25.x86_64.rpm from RPM sources. The first run gets me:
enabling repair mode Checking filesystem on /dev/sda4 UUID: 26a0c84c-d2ac-4da8-b880-684f2ea48a22 checking extents ERROR: add_tree_backref failed: File exists extent_io.c:143: insert_state: BUG_ON `end < start` triggered, value 1 btrfs check(+0x305a6)[0x556c478505a6] btrfs check(+0x307fd)[0x556c478507fd] btrfs check(set_extent_bits+0x256)[0x556c47850cc6] btrfs check(+0x6329b)[0x556c4788329b] btrfs check(cmd_check+0xb43)[0x556c47885143] btrfs check(main+0x80)[0x556c47837c80] /lib64/libc.so.6(__libc_start_main+0xf1)[0x7f4ed220a401] btrfs check(_start+0x2a)[0x556c47837d8a] Aborted (core dumped)
Same as before. Running under gdb without the --repair option gives:
root 258 inode 285735 errors 2001, no inode item, link count wrong unresolved ref dir 362 index 44 namelen 11 name fedora.json filetype 1 errors 4, no inode ref root 258 inode 285736 errors 2001, no inode item, link count wrong unresolved ref dir 472 index 42 namelen 10 name gnome.json filetype 1 errors 4, no inode ref root 258 inode 285737 errors 2001, no inode item, link count wrong unresolved ref dir 481 index 41 namelen 18 name extensions-web.xml filetype 1 errors 4, no inode ref root 258 inode 285738 errors 2001, no inode item, link count wrong unresolved ref dir 389 index 44 namelen 9 name odrs.json filetype 1 errors 4, no inode ref root 258 inode 285759 errors 2001, no inode item, link count wrong unresolved ref dir 281 index 648 namelen 13 name notifications filetype 1 errors 4, no inode ref root 258 inode 285768 errors 2001, no inode item, link count wrong unresolved ref dir 257 index 366 namelen 11 name .Xauthority filetype 1 errors 4, no inode ref root 258 inode 287091 errors 2001, no inode item, link count wrong unresolved ref dir 521 index 38 namelen 30 name .saves-7375-muse.dinechin.org~ filetype 1 errors 4, no inode ref root 258 inode 287119 errors 2001, no inode item, link count wrong unresolved ref dir 257 index 368 namelen 8 name .lesshst filetype 1 errors 4, no inode ref root 258 inode 289021 errors 2001, no inode item, link count wrong unresolved ref dir 266 index 76 namelen 13 name gstreamer-1.0 filetype 2 errors 4, no inode ref root 258 inode 289997 errors 2001, no inode item, link count wrong unresolved ref dir 3058 index 16 namelen 10 name cscope.out filetype 1 errors 4, no inode ref ERROR: errors found in fs roots found 1124075982849 bytes used, error(s) found total csum bytes: 1094304808 total tree bytes: 2285043712 total fs tree bytes: 797573120 total extent tree bytes: 258555904 btree space waste bytes: 294399420 file data blocks allocated: 72350215241728 referenced 1037659140096
With the --repair option, I get the following stack trace information:
gdb) r --repair /dev/sda4 Starting program: /usr/sbin/btrfsck --repair /dev/sda4 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". enabling repair mode Checking filesystem on /dev/sda4 UUID: 26a0c84c-d2ac-4da8-b880-684f2ea48a22 checking extents ERROR: add_tree_backref failed: File exists extent_io.c:143: insert_state: BUG_ON `end < start` triggered, value 1 btrfs check(+0x30596)[0x555555584596] btrfs check(+0x307ed)[0x5555555847ed] btrfs check(set_extent_bits+0x256)[0x555555584cb6] btrfs check(+0x6328b)[0x5555555b728b] btrfs check(cmd_check+0xb43)[0x5555555b9133] btrfs check(main+0x80)[0x55555556bc70] /lib64/libc.so.6(__libc_start_main+0xf1)[0x7ffff6f92401] btrfs check(_start+0x2a)[0x55555556bd7a]Program received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58 58 } Missing separate debuginfos, use: dnf debuginfo-install libgcc-6.3.1-1.fc25.x86_64 (gdb) up #1 0x00007ffff6fa952a in __GI_abort () at abort.c:89 89 raise (SIGABRT); (gdb) #2 0x00005555555845aa in bugon_trace ( assertion=assertion@entry=0x5555555d5593 "end < start", func=func@entry=0x5555555d56f8 <__func__.9791> "insert_state", line=line@entry=143, val=val@entry=1, filename=0x5555555d5577 "extent_io.c") at kerncompat.h:115 115 abort(); (gdb) #3 0x00005555555847ed in bugon_trace (val=1, line=143, func=
, filename=0x5555555d5577 "extent_io.c", assertion=0x5555555d5593 "end < start") at extent_io.c:152 152 } (gdb) #4 insert_state (tree=tree@entry=0x7fffffffdba0, state=state@entry=0x55557eb12010, start=start@entry=52428800, end=end@entry=52428799, bits=bits@entry=1) at extent_io.c:143 143 BUG_ON(end < start); (gdb) #5 0x0000555555584cb6 in set_extent_bits (tree=0x7fffffffdba0, start=52428800, end=52428799, bits=bits@entry=1) at extent_io.c:310 310 err = insert_state(tree, prealloc, start, end, bits); (gdb) #6 0x0000555555584d9a in set_extent_dirty (tree= , start= , end= ) at extent_io.c:414 414 return set_extent_bits(tree, start, end, EXTENT_DIRTY); (gdb) #7 0x00005555555b728b in check_extent_refs (extent_cache=0x7fffffffda60, root=0x55555587d570) at cmds-check.c:9255 9255 set_extent_dirty(root->fs_info->excluded_extents, (gdb) #8 check_chunks_and_extents (root=root@entry=0x55555587d570) at cmds-check.c:9945 9945 ret = check_extent_refs(root, &extent_cache); (gdb) #9 0x00005555555b9133 in cmd_check (argc= , argv= ) at cmds-check.c:12961 12961 ret = check_chunks_and_extents(root); (gdb) #10 0x000055555556bc70 in main (argc=3, argv=0x7fffffffe328) at btrfs.c:246 246 ret = cmd->fn(argc, argv); (gdb)
Second run under debugger gives me a slightly different result:
Program received signal SIGSEGV, Segmentation fault. list_del (entry=0x555555db0420) at /usr/src/debug/btrfs-progs-v4.10.1/kernel-lib/list.h:125 125 __list_del(entry->prev, entry->next); (gdb) bt #0 list_del (entry=0x555555db0420) at /usr/src/debug/btrfs-progs-v4.10.1/kernel-lib/list.h:125 #1 free_all_extent_backrefs (rec=0x555555db0350) at cmds-check.c:5386 #2 maybe_free_extent_rec (extent_cache=0x7fffffffd990, rec=0x555555db0350) at cmds-check.c:5417 #3 0x00005555555b308f in check_block (flags=, buf=0x55557b87cdf0, extent_cache=0x7fffffffd990, root=0x55555587d570) at cmds-check.c:5851 #4 run_next_block (root=root@entry=0x55555587d570, bits=bits@entry=0x555555884170, last=last@entry=0x7fffffffd738, pending=pending@entry=0x7fffffffd9b0, seen=seen@entry=0x7fffffffd9a0, reada=reada@entry=0x7fffffffd9c0, nodes=0x7fffffffd9d0, extent_cache=0x7fffffffd990, chunk_cache=0x7fffffffd980, dev_cache=0x7fffffffd970, block_group_cache=0x7fffffffda80, dev_extent_cache=0x7fffffffdaa0, ri=0x55555587d820, bits_nr=1024) at cmds-check.c:7690 #5 0x00005555555b4b6d in deal_root_from_list (list=list@entry=0x7fffffffda00, root=root@entry=0x55555587d570, bits=bits@entry=0x555555884170, pending=pending@entry=0x7fffffffd9b0, seen=seen@entry=0x7fffffffd9a0, reada=reada@entry=0x7fffffffd9c0, nodes=0x7fffffffd9d0, extent_cache=0x7fffffffd990, chunk_cache=0x7fffffffd980, dev_cache=0x7fffffffd970, block_group_cache=0x7fffffffda80, dev_extent_cache=0x7fffffffdaa0, bits_nr=1024) at cmds-check.c:9751 #6 0x00005555555b5805 in check_chunks_and_extents (root=root@entry=0x55555587d570) at cmds-check.c:9918 #7 0x00005555555b9133 in cmd_check (argc= , argv= ) at cmds-check.c:12961 #8 0x000055555556bc70 in main (argc=3, argv=0x7fffffffe258) at btrfs.c:246
Not sure why this run is different. The etnry look readable:
$2 = {next = 0x0, prev = 0x555555fb0350}
I don't think it would fail just because next is null. Well, actually, according to the source code it does:
static inline void __list_del(struct list_head * prev, struct list_head * next) { next->prev = prev; prev->next = next; }
So we should never have null as next. Maybe time to build with CONFIG_DEBUG_LIST???
The problem comes from this code in cmds-check.c in function check_extent_refs around L9260:
while(cache) { rec = container_of(cache, struct extent_record, cache); set_extent_dirty(root->fs_info->excluded_extents, rec->start, rec->start + rec->max_size - 1); cache = next_cache_extent(cache); }
In my case, rec->max_size is 0. The content of the corresponding record shows an inconsistency between size and max_size:
(gdb) p *rec $10 = { backrefs = { next = 0x555556490100, prev = 0x555556490100 }, dups = { next = 0x555556490040, prev = 0x555556490040 }, list = { next = 0x555556490050, prev = 0x555556490050 }, cache = { rb_node = { __rb_parent_color = 93824996670721, rb_right = 0x0, rb_left = 0x0 }, objectid = 0, start = 52428800, size = 1 }, parent_key = { objectid = 0, type = 0 '000', offset = 0 }, start = 52428800, max_size = 0, nr = 1, refs = 0, extent_item_refs = 0, generation = 0, parent_generation = 0, info_objectid = 0, num_duplicates = 0, info_level = 0 '000', flag_block_full_backref = 2, found_rec = 0, content_checked = 0, owner_ref_checked = 0, is_root = 0, metadata = 1, bad_full_backref = 0, crossing_stripes = 0, wrong_chunk_type = 0 }
The problem also occurs on current master branch.
Added the Red Hat Bugzilla 1435567"> the recorder to btrfs tools.
Recorder entries before the crash look like this:
5542557 [73.057880:0x45db8b] MAX_SIZE: Add 229/231 max_size 16384 for extent cache 0x7fffffffdb30 5542558 [73.057882:0x45db8b] MAX_SIZE: Add 230/231 max_size 16384 for extent cache 0x7fffffffdb30 5542559 [82.656875:0x45de02] MAX_SIZE: Add root to pending max_size 16384 for extent cache 0x7fffffffdb30 5542560 [82.660978:0x45de02] MAX_SIZE: Add root to pending max_size 16384 for extent cache 0x7fffffffdb30 5542561 [82.661416:0x45ead4] CHECK: Check extent root 0x77c180 cache 0x7fffffffdb30 5542562 [82.661417:0x45ec30] REPAIR: Search cache 0x138ec80 start 52428800 size 1 5542563 [82.661417:0x45ec58] EXTENTS: Container extent record 0x138ec50 for cache 0x138ec80 start 52428800 max_size 0
Added instrumentation to specifically record if max_size is ever set to 0. It trips, but this time I have the list SIGSEGV again.
78655 [1.792241:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139eb80 max_size 16384 tmpl 0x7fffffffd12078657 [1.792242:0x451cb8] MAX_SIZE_ZERO: Set max size 0 for rec 0x139ec50 from tmpl 0x7fffffffcf80 78660 [1.792244:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139ed50 max_size 16384 tmpl 0x7fffffffd120