JMicron (Terramaster) bug with recent kernels?

Saturday, May 12, 2018

Had another I/O error on the Terramaster. It looks a bit like this one: bz1315013

Just before it went berserk, I had run a zpool status and gotten:

  pool: data
 state: ONLINE

status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     2
            sdd     ONLINE       1     0     1
            sde     ONLINE       0     0     1
            sdf     ONLINE       0     0     1

errors: No known data errors
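To spot the disks with non-zero counters without reading the whole table, something like the following could be used. This is only a sketch: the function name nonzero_errors is made up for illustration, and it assumes the usual `zpool status` layout (a NAME/STATE/READ/WRITE/CKSUM header followed by one device per line).

```shell
# Hypothetical helper: print every pool/vdev/disk line whose READ, WRITE
# or CKSUM counter is not zero. Reads `zpool status` output on stdin.
nonzero_errors() {
    awk '
        # The column header marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line or the trailing "errors:" line ends the table.
        in_table && (NF == 0 || $1 == "errors:") { in_table = 0 }
        # Inside the table: fields 3, 4, 5 are READ, WRITE, CKSUM.
        in_table && NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}
```

Typical use would be `zpool status data | nonzero_errors`, which prints nothing when all counters are zero.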

When things went south, instead, I got:

  pool: data
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE      12    32     0
          raidz1-0  ONLINE      32    65     0
            sdb     ONLINE      10    49     0
            sdc     ONLINE      39    45     2
            sdd     ONLINE      13    31     1
            sde     ONLINE      36    15     1
            sdf     ONLINE      32    35     1

errors: List of errors unavailable: pool I/O is currently suspended

errors: 30 data errors, use '-v' for a list

The problem is that the device names change after shutting down the enclosure and restarting it, so it's a bit hard to reconnect the zpool.

Replaced the names of the disks with persistent ones, using a trick found here:

zpool export data
zpool import -d /dev/disk/by-id -aN

Now zpool status shows:

  pool: data
 state: ONLINE
  scan: resilvered 92K in 0h0m with 0 errors on Sat May 12 16:21:21 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        data                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            usb-ST3000DM_001-1CH166_201609040218-0:0    ONLINE       0     0     0
            usb-WDC_WD30_EFRX-68EUZN0_201609040218-0:1  ONLINE       0     0     0
            usb-ST3000DM_001-1CH166_201609040218-0:2    ONLINE       0     0     0
            usb-WDC_WD30_EFRX-68EUZN0_201609040218-0:3  ONLINE       0     0     0
            usb-WDC_WD30_EFRX-68EUZN0_201609040218-0:4  ONLINE       0     0     0

errors: No known data errors

Hopefully, this means my system will now survive when switching the array off/on because of the USB reset...

Turns out that /dev/sdd is not the third disk in the array (I had expected the disks to be numbered in array order). It seems to be disk #4, the one that had been blinking red regularly. It's one of the two usb-ST3000DM_001-1CH166_201609040218-0 disks in the system.
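To check which persistent name currently points at which sdX, the symlinks under /dev/disk/by-id can be listed next to their targets. A minimal sketch, assuming GNU `readlink -f` is available; the function name list_by_id is hypothetical, and the optional directory argument is there only so it can be tried on a scratch directory.

```shell
# Hypothetical helper: list each persistent name in a by-id directory
# next to the kernel device it currently points to (e.g. sdd).
# The directory argument is optional and defaults to /dev/disk/by-id.
list_by_id() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        # Resolve the symlink to the device node it points at.
        target=$(readlink -f "$link")
        printf '%s -> %s\n' "$(basename "$link")" "$(basename "$target")"
    done
}
```

Running `list_by_id | grep sdd` should then show which by-id name /dev/sdd has at the moment. The mapping can change after the enclosure is power-cycled, so it is worth re-checking each time.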