#1327 Issue closed: Stalled recovery due to mkfs interactive prompt

Labels: enhancement, bug, cleanup, fixed / solved / done

deacomm opened issue at 2017-04-24 12:51:

During a restore, depending on what condition the target disk is in, wipefs sometimes does not wipe the existing filesystem from the block device, and the subsequent mkfs then stalls the whole process by interactively asking for confirmation:

2017-04-20 14:36:14 Creating filesystem of type ext3 with mount point /boot on /dev/sda1.
+++ Print 'Creating filesystem of type ext3 with mount point /boot on /dev/sda1.'
+++ test 1
+++ echo -e 'Creating filesystem of type ext3 with mount point /boot on /dev/sda1.'
+++ wipefs -a /dev/sda1
wipefs: /dev/sda1: ignoring nested "dos" partition table on non-whole disk device
wipefs: Use the --force option to force erase.
+++ mkfs -t ext3 -b 4096 -i 16384 -U 5dc25119-fc7c-4d93-93fb-2b26a6916036 /dev/sda1
mke2fs 1.42.11 (09-Jul-2014)
Found a dos partition table in /dev/sda1
Proceed anyway? (y,n)

This prompt is visible in the log file, but not on the system's console, where there is just an empty line with a blinking cursor. After I hit 'y' on the console, the restore process continues.

This was on SLES12SP1, with rear 2.00 package from OBS:

Name : rear Relocations: (not relocatable)
Version : 2.00 Vendor: obs://build.opensuse.org/Archiving
Release : 1 Build Date: Fri Jan 6 12:24:20 2017
Install Date: (not installed) Build Host: lamb10
Group : Applications/File Source RPM: rear-2.00-1.src.rpm
Size : 1184748 License: GPLv3
Signature : DSA/SHA1, Fri Jan 6 12:24:31 2017, Key ID 6b7485db725a0c43

schlomo commented at 2017-04-24 13:10:

We could add --force to wipefs. Just need to find out if the older distros support that parameter.

We already use --force with many other tools, so this would be a good thing for wipefs too.

gozora commented at 2017-04-24 13:15:

I'm afraid that will not be universal enough :-(
On my Ubuntu Trusty:

# wipefs --force
wipefs: unrecognized option '--force'

Maybe --force is available only in newer versions?

V.

schlomo commented at 2017-04-24 13:16:

Or the traditional wipefs .... <<<y :-)

gozora commented at 2017-04-24 13:18:

You just don't like pipes, do you? :-)

jsmeix commented at 2017-04-24 13:21:

In general FYI:
The whole current 'wipefs' implementation is likely insufficient,
see https://github.com/rear/rear/issues/799

schlomo commented at 2017-04-24 13:21:

I find y | wipefs ... indeed less readable. And it spawns another shell for literally nothing. I prefer to use Bash with everything it can do for us.

jsmeix commented at 2017-04-24 13:24:

@schlomo @gozora
I think the issue is not that wipefs needs user input;
the issue is that mkfs stops and needs user input
because wipefs did not fully clean up the disk beforehand.

jsmeix commented at 2017-04-24 13:29:

@deacomm FYI:
wipefs does not wipe whole existing filesystems.
All wipefs does is wipe some well-known areas
of partitions (like /dev/sda2) or disks (like /dev/sda).
Depending on what condition the target disk is in,
wipefs may fail to wipe this or that particular remainder.
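
For example, calling wipefs without options only lists the known signature areas on a device, while '-a' erases them, so one can check what (if anything) is left behind:

wipefs /dev/sda1     # only list the known filesystem/RAID/partition-table signatures
wipefs -a /dev/sda1  # erase all of those known signature areas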

jsmeix commented at 2017-04-24 13:31:

Regarding
"prompt is visible in the log file, but not on system's console"
see https://github.com/rear/rear/issues/885
which is about the other way round.

jsmeix commented at 2017-04-24 13:42:

I know this mke2fs behaviour

Found a dos partition table in /dev/sda1
Proceed anyway? (y,n)

very well from my experiments with
"Generic system installation with the plain SUSE installation system"
cf. https://en.opensuse.org/SDB:Disaster_Recovery

I do not remember exactly, but I think something like

wipefs -a -f /dev/sda1
wipefs -a -f /dev/sda

helped.

@gozora
regarding whether or not wipefs supports '-f',
remember how we implemented things like 'mkfs -U' in
layout/prepare/GNU/Linux/130_include_filesystem_code.sh:
just try the preferred modern way and, if that fails,
fall back to the traditional way:

wipefs -a -f /dev/sda1 || wipefs -a /dev/sda1
wipefs -a -f /dev/sda || wipefs -a /dev/sda

On my SLES12 system:

# man wipefs
...
  -f, --force
    Force erasure, even if the filesystem is mounted.
    This is required in order to erase a partition-table
    signature on a block device.

versus on my SLES11 system where "man wipefs"
does not mention '-f, --force' at all.

deacomm commented at 2017-04-24 13:43:

@jsmeix Either wipefs has to reliably prepare the ground for the subsequent mkfs, or the script calling mkfs has to account for the possibility of wipefs not doing its thing, and act accordingly. As it is now, between the wipefs and mkfs calls the system is in an indeterminate state.

One partial workaround would be to print the mkfs output to the console, so that the user at least knows why the recovery is stalled and that they can just press 'y' to continue.

jsmeix commented at 2017-04-24 13:49:

FYI only,
here are some excerpts from my newest generic installation script
(my quick and dirty hack to install a system with LVM)
where I use wipefs as planned
in https://github.com/rear/rear/issues/799
plus proper waiting for block device nodes according
to https://github.com/rear/rear/issues/791

# Partitioning:
harddisk_disklabel="msdos"
harddisk_devices="/dev/sda /dev/sdb /dev/sdc"
partition1_begin_percentage="0"
partition1_end_percentage="40"
partition2_begin_percentage="$partition1_end_percentage"
partition2_end_percentage="100"
# LVM:
lvm_PVs="/dev/sda1 /dev/sda2 /dev/sdb1 /dev/sdb2 /dev/sdc1 /dev/sdc2"
lvm_VG="system"
lvm_LV_swap_size="2g"
lvm_LV_root_size="4g"
lvm_LV_home_size="3g"
# Filesystems:
LV_swap_prepare_command="mkswap -f"
LV_filesystem="ext4"
LV_make_filesystem_command="mkfs.$LV_filesystem -F"
...
# First of all clean up possibly already existing partitions:
for partition in $lvm_PVs
do test -b $partition && wipefs -a -f $partition || true
done
for harddisk_device in $harddisk_devices
do test -b $harddisk_device && wipefs -a -f $harddisk_device || true
done
# Make partitions on the harddisk devices:
for harddisk_device in $harddisk_devices
do # Wait until the harddisk device node exists:
   until test -b $harddisk_device
   do echo "Waiting until $harddisk_device exists (retrying in 1 second)"
      sleep 1
   done
   # Erase filesystem, raid or partition-table signatures (magic strings) to clean up a used disk before making filesystems:
   wipefs -a -f $harddisk_device
   # Create new disk label. The new disk label will have no partitions:
   parted -s $harddisk_device mklabel $harddisk_disklabel
   # Make partitions on a harddisk device:
   # Use hardcoded parted fs-type "ext2" as dummy for now regardless what filesystem will be actually created there later:
   parted -s --align=optimal $harddisk_device unit % mkpart primary ext2 $partition1_begin_percentage $partition1_end_percentage
   parted -s $harddisk_device set 1 lvm on type 0x8e
   parted -s --align=optimal $harddisk_device unit % mkpart primary ext2 $partition2_begin_percentage $partition2_end_percentage
   parted -s $harddisk_device set 2 lvm on type 0x8e
done
# Enable boot flag on /dev/sda1:
parted -s /dev/sda set 1 boot on
# Report what is actually set up by parted:
for harddisk_device in $harddisk_devices
do parted -s $harddisk_device unit GiB print
done
# Wait until all partition device nodes exist:
for partition_device in $lvm_PVs
do until test -b $partition_device
   do echo "Waiting until $partition_device exists (retrying in 1 second)"
      sleep 1
   done
done
# LVM setup:
# LVM metadata cannot be backed up (to /etc/lvm) because /etc/lvm is read-only in the SUSE installation system
# and any backups that are stored inside the SUSE installation system are useless
# so that automated metadata backup is disabled with '--autobackup n'.
# Do not use lvmetad in the SUSE installation system to avoid many confusing warning messages from LVM tools.
# Because /etc/lvm is read-only it is moved away and a writable copy is created:
mv /etc/lvm /etc/lvm.orig
mkdir /etc/lvm
set +f
cp -a /etc/lvm.orig/* /etc/lvm
set -f
grep 'use_lvmetad = 1' /etc/lvm/lvm.conf && sed -i -e 's/use_lvmetad = 1/use_lvmetad = 0/' /etc/lvm/lvm.conf
# Initialize all partitions for use by LVM as PVs:
pvcreate --verbose --yes -ff $lvm_PVs
# Create a single volume group of all PVs (i.e. of all partitions):
vgcreate --verbose --yes --autobackup n $lvm_VG $lvm_PVs
# Create logical volumes in the existing volume group:
lvcreate --verbose --yes --autobackup n --size $lvm_LV_swap_size --name swap $lvm_VG
swap_LV_device="/dev/$lvm_VG/swap"
lvcreate --verbose --yes --autobackup n --size $lvm_LV_root_size --name root $lvm_VG
root_LV_device="/dev/$lvm_VG/root"
lvcreate --verbose --yes --autobackup n --size $lvm_LV_home_size --name home $lvm_VG
home_LV_device="/dev/$lvm_VG/home"
# Set up swap:
# Wait until the swap LV device node exists:
until test -b $swap_LV_device
do echo "Waiting until $swap_LV_device exists (retrying in 1 second)"
   sleep 1
done
# Erase filesystem, raid or partition-table signatures (magic strings) to clean up a used disk before making filesystems:
wipefs -a -f $swap_LV_device
# Prepare swap LV:
$LV_swap_prepare_command $swap_LV_device
# Use the swap LV:
swapon --fixpgsz $swap_LV_device
# Set up root filesystem:
# Wait until the root LV device node exists:
until test -b $root_LV_device
do echo "Waiting until $root_LV_device exists (retrying in 1 second)"
   sleep 1
done
# Erase filesystem, raid or partition-table signatures (magic strings) to clean up a used disk before making filesystems:
wipefs -a -f $root_LV_device
# Make root LV filesystem:
$LV_make_filesystem_command $root_LV_device
# Set up home filesystem:
# Wait until the home LV device node exists:
until test -b $home_LV_device
do echo "Waiting until $home_LV_device exists (retrying in 1 second)"
   sleep 1
done
# Erase filesystem, raid or partition-table signatures (magic strings) to clean up a used disk before making filesystems:
wipefs -a -f $home_LV_device
# Make home LV filesystem:
$LV_make_filesystem_command $home_LV_device
# Probably useless but to be on the safe side wait until udev has finished:
udevadm settle --timeout=20

jsmeix commented at 2017-04-24 13:56:

I fixed my
https://github.com/rear/rear/issues/1327#issuecomment-296672094

On an already used disk with partitions on it
one must first 'wipefs' each partition block device
and finally one can 'wipefs' the whole disk, cf.
https://github.com/rear/rear/issues/799#issue-141001306

The other way round, after a 'wipefs' of the whole disk
there are probably no longer any partition block devices,
so one cannot 'wipefs' them, which may still leave
this or that unwanted remainder on the disk that may
re-appear after the same partitions have been recreated.
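
A minimal sketch of that order (assuming /dev/sda with two partitions):

# First wipe the partition block devices while their nodes still exist:
wipefs -a -f /dev/sda1
wipefs -a -f /dev/sda2
# Only afterwards wipe the whole disk device:
wipefs -a -f /dev/sda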

jsmeix commented at 2017-04-24 14:03:

@deacomm
I think on SLE12 the easiest way is
to just adapt wipefs_command
or perhaps even simply run
yes | rear ...

gdha commented at 2017-04-24 14:26:

@jsmeix is it OK if we assign this to you as SME?

jsmeix commented at 2017-04-24 14:43:

One cannot run

yes | rear recover

because the recovery system does not have 'yes'
but one can run

echo -e 'y\ny\ny\ny\ny\ny\n' | rear recover

;-)
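
(Or, as a sketch that does not require counting the prompts, a plain bash loop can stand in for 'yes':

while echo y ; do : ; done | rear recover

It keeps printing 'y' until the pipe is closed.)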

jsmeix commented at 2017-04-24 14:50:

@deacomm
when you replace in
usr/share/rear/layout/prepare/GNU/Linux/130_include_filesystem_code.sh

  wipefs_command="wipefs -a $device"

with

  wipefs_command="wipefs -a -f $device || wipefs -a $device"

does it then also work for you?
(For me it works on SLE12.)

schlomo commented at 2017-04-24 18:06:

Careful. Trying one and then the other might also fail for other reasons. Better to auto-detect, e.g. like this:

wipefs_command="wipefs --all $(wipefs --help 2>&1 | grep --quiet force && echo --force) $device"

(I prefer to spell out arguments for better readability)

jsmeix commented at 2017-04-25 09:51:

@schlomo
I do not understand what could go wrong with things like

try_something || try_something_else || do_fallback

Could you explain and provide an example
why this kind of code is bad?

Of course in

try_something || try_something_else || do_fallback

each one could fail for its own specific and different reasons,
but compared to what we do currently, which is only plain

try_something

without any failure or error handling

try_something || try_something_else || do_fallback

should be an improvement.

deacomm commented at 2017-04-25 11:38:

Schlomo's wipefs_command works: wipefs is always called with --all --force, and the recovery process finishes.

Thinking about it, I prefer the simplicity of try_something || try_something_else, especially since you can't tell what the wipefs --help text will say in the future. Naively checking for the presence of the string "force" might not be the safest, although it is very improbable to cause issues in this particular case.

It's a matter of taste, really.

schlomo commented at 2017-04-25 12:28:

@jsmeix:

  1. command1 || command2 || failed will always fall through to command2 if anything goes wrong with command1. You won't know why it failed, and the log file will be spammed with error messages that we consciously ignore. We cannot hide the STDERR of the first wipefs because we might need its errors in case it fails for another reason (not the missing support for --force).
  2. How long does command1 take? If the first variant takes a long time to fail, then the user has to wait a long time for nothing. Of course, in this case of wipefs that is less likely.
  3. Being more explicit is always better for people who read the code later. Failing the wipefs command because of missing parameter support and then retrying without the problematic parameter means that we don't explicitly test for --force support but implicitly fail if it is not there (or if anything else goes wrong).

@deacomm IMHO checking the help output for whether --force is mentioned is a very safe way to decide that we can use the --force option. The only case where this would fail is if the wipefs developers remove that option and leave a text like "--force is no longer supported", which I also find extremely unlikely for wipefs.

BTW, several years ago we had a similar discussion in relation to the different syslinux versions found on different distros, and in the end @dagwieers introduced the concept of feature flags (see set_syslinux_features()). The rationale there was also to favor explicit checks over implicit failing.
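
Transferred to wipefs, such a feature flag could look roughly like this (only a sketch, the function and variable names are made up for illustration):

# Probe the locally available wipefs once and remember its capabilities:
function set_wipefs_features () {
    if wipefs --help 2>&1 | grep -q -e '--force' ; then
        FEATURE_WIPEFS_FORCE="y"
    else
        FEATURE_WIPEFS_FORCE=""
    fi
}
# Later, wherever wipefs is called:
set_wipefs_features
wipefs --all ${FEATURE_WIPEFS_FORCE:+--force} $device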

jsmeix commented at 2017-04-25 13:01:

Anyone who likes may do an appropriate GitHub pull request.

jsmeix commented at 2017-04-26 08:10:

I hit the wrong button. This issue should stay open
until an appropriate GitHub pull request has been merged.

jsmeix commented at 2017-04-28 10:26:

@schlomo because of your
https://github.com/rear/rear/issues/700#issuecomment-297960017

I would like to mention that

wipefs_command="wipefs --all $(wipefs --help 2>&1 | grep --quiet force && echo --force) $device"

would let ReaR fail on SLE11 if 'set -eu' was set in
layout/prepare/GNU/Linux/130_include_filesystem_code.sh
Here is what happens on my SLE11 system
(where wipefs does not support '--force'):

# ( set -e ; device=/dev/sdXn ; wipefs_command="wipefs --all $(wipefs --help 2>&1 | grep --quiet force && echo --force) $device" ; echo $wipefs_command )
[no output]

This is because here 'grep' returns a non-zero exit code, so the whole command substitution (and with it the assignment) fails, which aborts under 'set -e'.

An additional side note:

wipefs_command="wipefs --all $(wipefs --help 2>&1 | grep --quiet force && echo --force) $device"

would call 'wipefs --all --force' whenever any 'force' (sub)string
appears in 'wipefs --help' (not necessarily the option '--force').
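
Both issues could be avoided while keeping the one-liner, e.g. (again only a sketch):

wipefs_command="wipefs --all $(wipefs --help 2>&1 | grep -q -e '--force' && echo '--force' || true) $device"

The trailing '|| true' keeps the command substitution's exit code at zero under 'set -e', and the pattern '--force' only matches the actual option text.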

jsmeix commented at 2017-04-28 14:10:

In https://github.com/rear/rear/pull/1336
I implemented basically my above proposal in
https://github.com/rear/rear/issues/1327#issuecomment-296672094
and additionally dd is now used as a generic fallback in any case,
see my comments in the new code.

It seems to work o.k. for me on SLE12 with and without wipefs.
Further enhancements are welcome from whoever is willing
to also maintain their enhancements in the future ;-)

jsmeix commented at 2017-04-28 14:45:

With https://github.com/rear/rear/pull/1336 merged
this issue should no longer happen in practice,
even though https://github.com/rear/rear/pull/1336
is not yet the ultimate solution, cf.
https://github.com/rear/rear/pull/1336#issuecomment-298011438

jsmeix commented at 2017-05-04 08:15:

With https://github.com/rear/rear/pull/1341 merged
the wipefs code is now cleaned up as described in
https://github.com/rear/rear/pull/1336#issuecomment-298016104

It is still not yet the ultimate solution, i.e.
https://github.com/rear/rear/pull/1336#issuecomment-298011438
still applies.

schlomo commented at 2017-06-19 09:05:

I just realized that the title was somewhat misleading. We started from mkfs needing a confirmation and ended up with a better implementation of wipefs which removes most mkfs confirmation requests.

However, mkfs will still ask for confirmation on whole disks without partition tables, and this fix does not solve that use case.

Do you see any problem with adding the appropriate force option to the various mkfs calls?

jsmeix commented at 2017-06-19 11:28:

In general I think that "rear recover" should enforce
recreating filesystems, because when the admin
runs "rear recover" he knows and wants
all existing stuff on that machine to be replaced.

FYI some details:

Currently we use '-f' in case of xfs and btrfs cf.
layout/prepare/GNU/Linux/130_include_filesystem_code.sh

But as far as I see we do not use '-f' for reiserfs.

For ext2/ext3/ext4 we do not use '-F' as far as I see.

For vfat it seems there is no force option
according to "man mkfs.vfat" (except '-I' which also
seems to be for whole disks without partition table).

For the fallback case we cannot use a force option
because the various mkfs.filesystem commands use
different option characters (like '-f' versus '-F')
or do not provide a force option at all.
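
A rough sketch of what that could look like (illustrative only, not the actual ReaR code; fs_type, force_option and device are made-up names):

case $fs_type in
    (xfs|btrfs)
        force_option="-f"
        ;;
    (ext2|ext3|ext4)
        force_option="-F"
        ;;
    (*)
        # e.g. vfat: no usable force option, so none is added
        force_option=""
        ;;
esac
mkfs -t $fs_type $force_option $device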

deacomm commented at 2017-06-26 10:09:

FWIW, the title was correct. The interactive prompt came from mkfs. Wipefs was involved because it failed to wipe the previous filesystem signature before the mkfs call. I'm not sure whether wipefs even has an interactive prompt of any kind.

This is just splitting hairs, of course. :)

jsmeix commented at 2017-06-26 11:29:

@deacomm
thanks for mentioning it.
I think it is not just splitting hairs to have a correct title,
because otherwise things are misleading and might
get fixed in the wrong way.
Actually this issue is two issues:
The one that I noticed matches your
https://github.com/rear/rear/issues/1327#issuecomment-311018185
but that alone is not sufficient as @schlomo found out in his
https://github.com/rear/rear/issues/1327#issuecomment-309381866
so the full fix for issues like this is:
better cleanup of disks to avoid most mkfs confirmation requests
(that is what I did)
plus
calling mkfs with a 'force' option as far as possible for each
particular filesystem (which is what @schlomo does).


[Export of Github issue for rear/rear.]