Friday the 13th

Friday, April 13, 2018

Recorder integration in SPICE

Ran into three snags yesterday.

In order to support the recorder on older variants of MinGW, I added an autoconfiguration check for sigaction. I assumed that a similar check existed for autoconf, but it does not. The closest I found is information about restartable system calls. I will need to change that check in the recorder for a check for the sigaction function, which is easier to write with autoconf.
Another problem is that linking recorder exported symbols from Spice-GTK fails on Linux. I thought that I had resolved that, and the problem does not exist at the top of the branch, but apparently there is a necessary commit in the branch that I did not pull back in the integration branch.
I noticed that the spice-common self-test for aborting on warnings is now failing. This is apparently due to changes I made in the way the recorder builds the linked list of recorders during initialization, which breaks the call to recorder_set_traces done in a constructor in log.c. There is an order dependency between constructors there, which is not good.

The first topic made me realize that for my Meson/Ninja comparison with make-it-quick, I had largely underestimated the cost of autoconf maintenance, by ignoring all the custom m4 macros that SPICE requires.

Regarding the order of initialization, tracing shows:

spice_log_init called
Recorder for spice_warning is 0
Recorder spice_info activated
Recorder spice_debug activated
Recorder spice_warning activated
Recorder spice_error activated
Recorder spice_critical activated
Recorder recorders activated
Recorder recorders_warning activated
Recorder recorders_error activated
Recorder signals activated
Recorder recorder_traces activated

As a result, the call to recorder_traces_set from spice_log_init has no effect. Hmmm.

This shows there is something bogus in the way constructor are managed. In theory, if library a calls libary b, then the constructors for b should run before the constructors for a, not the other way round (that is pretty much guaranteed in languages such as Ada). Not sure if that was really designed, or if it's just some "worse is better" implementation that just stayed that way...

The reason the top-of-branch works is a reordering of the recorder declarations. Integrated that. After that, the tests are less broken, but tests that depend on the function name being present fail because the recorder no longer records it. Repairing that without adding a new field to recorder entries took a little "dirty trick", which is to append the __FILE__ and __LINE__ locations to the message, and reserving the where field for the function alone. The formatting function can then skip the location information if the recorder_location tweak is not set. It's a bit strange do do extra work in that case, but it's on a slow path (rendering) anyway.

With that, all the spice-common tests pass again, with a minor change in the test suite. The issue is that the test suite uses g_setenv to set the environment, but some of the initialization testing is done in constructors. If the g_setenv is placed before forking the child, it works. But if the g_setenv is inside the child, then it fails, since the value of the environment variable in the constructor was incorrect.

Issue with the Terra Master disk array

My Terra Master disk bay regularly disconnects on a semi-regular basis. I have messages like:

[21444.774238] fuse init (API version 7.26)
[45769.678835] xhci_hcd 0000:00:14.0: Cannot set link state.
[45769.678920] usb usb4-port6: cannot disable (err = -32)
[45769.678997] usb 4-6: USB disconnect, device number 2
[45769.690860] sd 10:0:0:0: [sdb] Synchronizing SCSI cache
[45769.810762] sd 10:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[45769.890472] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[45769.890999] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[45769.891411] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[45769.891808] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[45769.892194] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[45769.892555] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
[45769.892845] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
[45769.893128] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0
[45769.893405] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 14, rd 0, flush 0, corrupt 0, gen 0
[45769.893679] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 15, rd 0, flush 0, corrupt 0, gen 0
[45769.894006] BTRFS: error (device sdb4) in btrfs_commit_transaction:2263: errno=-5 IO failure (Error while writing out transaction)
[45769.894285] BTRFS info (device sdb4): forced readonly
[45769.894445] BTRFS warning (device sdb4): Skipping commit of aborted transaction.
[45769.894715] BTRFS: error (device sdb4) in cleanup_transaction:1876: errno=-5 IO failure
[45769.894991] BTRFS info (device sdb4): delayed_refs has NO entry
[45769.895158] BTRFS error (device sdb4): commit super ret -5
[45769.895392] BTRFS error (device sdb4): cleaner transaction attach returned -30
[45773.174877] usb 4-6: new SuperSpeed USB device number 3 using xhci_hcd
[45773.187823] usb 4-6: New USB device found, idVendor=152d, idProduct=0567
[45773.188038] usb 4-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[45773.188220] usb 4-6: Product: TerraMaster_DAS
[45773.188389] usb 4-6: Manufacturer: JMicron
[45773.188560] usb 4-6: SerialNumber: 201609040218
[45773.190965] scsi host11: uas
[45773.191811] scsi 11:0:0:0: Direct-Access     JMicron  H/W RAID5        0103 PQ: 0 ANSI: 6
[45773.223151] sd 11:0:0:0: Attached scsi generic sg2 type 0
[45773.223465] sd 11:0:0:0: [sdc] Spinning up disk...
[45774.270717] .
[45774.270809] ready
[45774.271778] sd 11:0:0:0: [sdc] 23441702912 512-byte logical blocks: (12.0 TB/10.9 TiB)
[45774.272074] sd 11:0:0:0: [sdc] 4096-byte physical blocks
[45774.272436] sd 11:0:0:0: [sdc] Write Protect is off
[45774.272618] sd 11:0:0:0: [sdc] Mode Sense: 53 00 00 08
[45774.272959] sd 11:0:0:0: [sdc] Disabling FUA
[45774.273139] sd 11:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[45774.338116]  sdc: sdc1 sdc2 sdc3 sdc4
[45774.339997] sd 11:0:0:0: [sdc] Attached SCSI disk

So apparently, while nothing happens (the above was at night, I was not active), I get this USB disconnect:

[45769.678835] xhci_hcd 0000:00:14.0: Cannot set link state.
[45769.678920] usb usb4-port6: cannot disable (err = -32)
[45769.678997] usb 4-6: USB disconnect, device number 2

immediately followed by BTRFS complaining its volume is gone:

[45769.690860] sd 10:0:0:0: [sdb] Synchronizing SCSI cache
[45769.810762] sd 10:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[45769.890472] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[45769.890999] BTRFS error (device sdb4): bdev /dev/sdb4 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0

Then about 4 seconds later, the device "reappears":

45773.174877] usb 4-6: new SuperSpeed USB device number 3 using xhci_hcd
[45773.187823] usb 4-6: New USB device found, idVendor=152d, idProduct=0567
[45773.188038] usb 4-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[45773.188220] usb 4-6: Product: TerraMaster_DAS
[45773.188389] usb 4-6: Manufacturer: JMicron
[45773.188560] usb 4-6: SerialNumber: 201609040218
[45773.190965] scsi host11: uas

but unfortunately, it does not reconnect properly, it moves to /dev/sdc instead of /dev/sdb and does not remount.

This is annoying, since this is where I hold all my "pet" VMs. I can somewhat mitigate the problem by using an UUID in /etc/fstab, but I'm clearly at risk of some data corruption.

I have three hypotheses to explain that behavior:

The Terra Master is slowly failing, and this is the first sign
# Since it's raining a lot, my power is not as stable as it should, and the Terra Master is more sensitive to power fluctuations than the PC. # The latest Fedora update I did caused it. I don't think that's the case because if I remember correclty, I saw that behavior before (I attributed it to the cat liking to jump on the warm PC. But I addressed the cat issue by making it impractical for her to get there.)

Regression in the Gitlab CI for the recorder

The Gitlab CI fails for the recorder on the struct sigaction configuration check that I had added for MinGW support.

[BEGIN]              opt linux in [top]
make[1]: *** No rule to make target '/builds/c3d/recorder/.objects/linux/opt/CFG_HAVE_struct_sigaction.h', needed by 'config.h'.  Stop.
make[1]: Leaving directory '/builds/c3d/recorder'

Works on my Linux machines. Weird.

Ah, apparently the installed include files were selected over the local ones. Need to check that behavior in the long run.

Switching recorder to HAVE_SIGACTION

Switching from HAVE_STRUCT_SIGACTION to HAVE_SIGACTION is easy in the recorder, but then I have the same build bug as in the GitLab CI. So there is a real issue I need to understand there.

A first problem is that I tried to simplify a rule that was looking at two different locations for the check code. That is likely to be the problem. Ooops, typo in where the configuration files are looked up by default if make-it-quick has not been installed. Easy fix.

Linker issue on Linux

Another issue is that if the installation directory does not exist, using the -rpath option for the linker fails. This is annoying. After checking with a "regular" build, the same option is sent to the linker but does not fault in that case:

gcc -shared  -fPIC -DPIC  .libs/spice-util.o .libs/spice-gtk-session.o .libs/spice-widget.o .libs/vncdisplaykeymap.o .libs/spice-grabsequence.o .libs/desktop-integration.o .libs/usb-device-widget.o .libs/spice-widget-cairo.o .libs/spice-widget-egl.o .libs/spice-widget-enums.o .libs/spice-marshal.o   -Wl,-rpath -Wl,/home/ddd/Work/spice.old/spice-gtk/src/.libs -Wl,-rpath -Wl,/usr/local/lib ./.libs/libspice-client-glib-2.0.so -lcelt051 -lgthread-2.0 -lopus -ljpeg -lz -llz4 -lpixman-1 -lssl -lcrypto -lpulse-mainloop-glib -lpulse -lgstaudio-1.0 -lgstapp-1.0 -lgstvideo-1.0 -lgstbase-1.0 -lgstreamer-1.0 -lsasl2 -lcacard -lusb-1.0 -lusbredirhost -lusbredirparser -lphodav-2.0 -lsoup-2.4 -lgtk-3 -lgdk-3 -lpangocairo-1.0 -lpango-1.0 -latk-1.0 -lcairo-gobject -lgdk_pixbuf-2.0 -lgio-2.0 -lgobject-2.0 -lglib-2.0 -lepoxy -lcairo -lX11 -lm  -g -O2 -Wl,--version-script=./map-file   -pthread -Wl,-soname -Wl,libspice-client-gtk-3.0.so.5 -o .libs/libspice-client-gtk-3.0.so.5.0.0

The problem is a missing -Wl, for the -rpath argument. This repairs the Linux build and the Gitlab CI.

Warnings for type-punning

Something I had observed for a long time: GCC (but not clang, curiously) complains about type punning:

recorder.c:2009:13: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]

The reason is that the current model does not declare the full type for a recorder in RECORDER_DECLARE. You only give the name, and you have an incomplete type. The RECORDER_INFO has to make a cast to recorder_info, which is the first field in all the recorder structs. This is a poor-man's inheritance, and would be safe in C++ (casting to base class is OK). But I'm not sure how to write that in C.

Going through an intermediate pointer when accessing the info solves the problem without breaking type safety nor signficantly changing the performance:

Without indirection (variations of ~10ns on each depending on machine's temperature):

Testing recorder version 1.02
Launching 1 normal recorder thread
Normal recorder testing in progress, please wait about 25s
Normal recorder testing completed, stopping threads
Normal test: all threads have stopped, 400314451 iterations
Recorder test analysis (Normal version):
  Iterations            = 400314451
  Iterations / ms       =    16012
  Duration per record   =       62ns
  Number of threads     =        1
Recorder test complete (Normal version), 1 threads.
  Iterations      =  400314451
  Iterations / ms =      16012
  Record cost     =         62ns
Launching 1 fast recorder thread
Fast recorder testing in progress, please wait about 25s
Fast recorder testing completed, stopping threads
Fast test: all threads have stopped, 609663579 iterations
Recorder test analysis (Fast version):
  Iterations            = 609663579
  Iterations / ms       =    24386
  Duration per record   =       41ns
  Number of threads     =        1
Recorder test complete (Fast version), 1 threads.
  Iterations      =  609663579
  Iterations / ms =      24386
  Record cost     =         41ns

With indirection:

Recorder test analysis (Normal version):
  Iterations            = 367791695
  Iterations / ms       =    14711
  Duration per record   =       67ns
  Number of threads     =        1
Recorder test complete (Normal version), 1 threads.
  Iterations      =  367791695
  Iterations / ms =      14711
  Record cost     =         67ns
Launching 1 fast recorder thread
Fast recorder testing in progress, please wait about 25s
Fast recorder testing completed, stopping threads
Fast test: all threads have stopped, 561034403 iterations
Recorder test analysis (Fast version):
  Iterations            = 561034403
  Iterations / ms       =    22441
  Duration per record   =       44ns
  Number of threads     =        1
Recorder test complete (Fast version), 1 threads.
  Iterations      =  561034403
  Iterations / ms =      22441
  Record cost     =         44ns

So really in the same ballpark.

Restart Jenkins

Restarted Jenkins, which went down after the power outage last week. Took a look, a lot of things turned red. There will be a bit of work to cleanup that mess.

Reading and learning

Interesting documentation about Fuchsia, the new OS from Google.