#2059 Issue closed: BUG: The disk layout recreation script failed

Labels: bug, support / question, no-issue-activity

dcz01 opened issue at 2019-02-28 15:59:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version ("/usr/sbin/rear -V"):
Relax-and-Recover 2.4 / 2018-06-21
  • OS version ("cat /etc/rear/os.conf" or "lsb_release -a" or "cat /etc/os-release"):
OS_VENDOR=RedHatEnterpriseServer
OS_VERSION=6.10
# The following information was added automatically by the mkrescue workflow:
ARCH='Linux-i386'
OS='GNU/Linux'
OS_VERSION='6.10'
OS_VENDOR='RedHatEnterpriseServer'
OS_VENDOR_VERSION='RedHatEnterpriseServer/6.10'
OS_VENDOR_ARCH='RedHatEnterpriseServer/i386'
# End of what was added automatically by the mkrescue workflow.
  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
OUTPUT=ISO
OUTPUT_URL=file:///root/rear
BACKUP=NETFS
#BACKUP=TSM
BACKUP_PROG=tar
BACKUP_PROG_CRYPT_ENABLED=1
BACKUP_PROG_CRYPT_KEY=""
BACKUP_PROG_CRYPT_OPTIONS="/usr/bin/openssl aes256 -salt -k"
BACKUP_PROG_DECRYPT_OPTIONS="/usr/bin/openssl aes256 -d -k"
#BACKUP_URL=nfs://<IP address or DNS name>/<share path>
BACKUP_URL=cifs://NotesRechte/BRP-Backup
BACKUP_OPTIONS="cred=/etc/rear/cifs,sec=ntlmsspi"
BACKUP_TYPE=incremental
FULLBACKUPDAY="Sat"
BACKUP_PROG_EXCLUDE=( '/tmp/*' '/dev/shm/*' $VAR_DIR/output/\* '/opt/tivoli/tsm/rear/*' '/mnt/*' '/media/*' '/var/lib/pgsql/*/data/base/*' '/var/lib/pgsql/*/data/global/*' '/var/lib/pgsql/*/data/pg*/*' )
SSH_ROOT_PASSWORD='$1$HGjk3XUV$lid3Nd3k01Kht1mpMscLw1'
NON_FATAL_BINARIES_WITH_MISSING_LIBRARY='/opt/tivoli/tsm/client/ba/bin/libvixMntapi.so.1.1.0'
  • Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PowerVM LPAR):
    Test virtual machine on VMware
  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
    x86_64
  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
    BIOS and GRUB
  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
    Local disk (one disk)
  • Description of the issue (ideally so that others can reproduce it):
    Problem in the recovery environment with recreating the disk layout. I found an invalid parameter (--force) in the log file.
  • Workaround, if any:
    none
  • Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):
    rear-FBD01PSS.log

dcz01 commented at 2019-02-28 16:03:

I checked the same process with BACKUP=TSM and there was no problem.
I've tested the case where a client of ours could use newer hardware with a bigger hard drive or a RAID system.

jsmeix commented at 2019-02-28 18:27:

As far as I can see at first glance in your
https://github.com/rear/rear/files/2915792/rear-FBD01PSS.log
it is neither the wipefs --force nor any other command in

cleanup_command='wipefs --all --force /dev/sda3 || wipefs --all /dev/sda3 || dd if=/dev/zero of=/dev/sda3 bs=512 count=1 || true'

that makes it fail, because the || true at the end lets the whole command chain succeed in any case.
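
A minimal shell illustration of that point (just a demonstration, not taken from the log):

false || false || true ; echo $?
0

i.e. a command chain that ends in '|| true' always exits with status 0,
so a failing wipefs or dd inside the cleanup_command cannot abort diskrestore.sh.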

Where diskrestore.sh actually fails is at

+++ mkfs -t ext4 -b 4096 -i 16346 -F /dev/sda3
mke2fs 1.41.12 (17-May-2010)
mkfs.ext4: No such device or address while trying to determine filesystem size
++ ((  1 == 0  ))
++ true
+++ UserInput -I LAYOUT_CODE_RUN -p 'The disk layout recreation script failed' -D 'Rerun disk recreation script (/var/lib/rear/layout/diskrestore.sh)' ...

That needs to be investigated why that fails.
It seems something with your /dev/sda3 is not right...

dcz01 commented at 2019-03-01 08:05:

@jsmeix Yes, there is something with the /dev/sda3.
The recreation script has to expand the volume/partition a little, because the new disk is bigger than the old one, as I mentioned above.

jsmeix commented at 2019-03-01 09:14:

It does increase the last partition /dev/sda3
as it should, according to what is in your
https://github.com/rear/rear/files/2915792/rear-FBD01PSS.log

Including layout/prepare/default/420_autoresize_last_partitions.sh
Examining /dev/sda to automatically resize its last active partition
Checking /dev/sda1 if it is the last partition on /dev/sda
Checking /dev/sda2 if it is the last partition on /dev/sda
Checking /dev/sda3 if it is the last partition on /dev/sda
Found 'primary' partition /dev/sda3 as last partition on /dev/sda
Determining if last partition /dev/sda3 is resizeable
Determining new size for last partition /dev/sda3
Determining if last partition /dev/sda3 actually needs to be increased or shrinked
New /dev/sda is 53687091200 bigger than old disk
Increasing last partition /dev/sda3 up to end of disk (new disk at least 10% bigger)
Changed last partition /dev/sda3 size from 156308078592 to 209995169792 bytes

which results in diskrestore.sh this parted call

+++ parted -s /dev/sda mkpart ''\''primary'\''' 4753195008B 214748364799B

where 214748364799 - ( 4753195008 - 1 ) = 209995169792 is as it should be
(4753195008 is the first byte of that partition and 214748364799 is the last one,
so both belong "inside" the partition).
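
One can double-check that with plain shell arithmetic, e.g.

echo $(( 214748364799 - ( 4753195008 - 1 ) ))
209995169792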

These are all the parted calls that run while diskrestore.sh is executed:

+++ parted -s /dev/sda mklabel msdos
...
+++ parted -s /dev/sda mkpart ''\''primary'\''' 1048576B 525336575B
...
+++ parted -s /dev/sda set 1 boot on
...
+++ parted -s /dev/sda mkpart ''\''primary'\''' 525336576B 4753195007B
...
+++ parted -s /dev/sda mkpart ''\''primary'\''' 4753195008B 214748364799B
...
+++ partprobe -s /dev/sda
/dev/sda: msdos partitions 1 2 3

and all run without an error, so that afterwards there should be /dev/sda
with the partition device nodes /dev/sda1 /dev/sda2 /dev/sda3,
but somehow something is wrong with /dev/sda3, because then

+++ Log 'Creating filesystem of type ext4 with mount point / on /dev/sda3.'
...
+++ wipefs --all /dev/sda3
wipefs: error: /dev/sda3: probing initialization failed
+++ dd if=/dev/zero of=/dev/sda3 bs=512 count=1
dd: opening `/dev/sda3': No such device or address
...
+++ mkfs -t ext4 -b 4096 -i 16346 -U fd2aaed0-d3c1-4bf4-99cb-e12f3da6a5c4 -F /dev/sda3
mke2fs 1.41.12 (17-May-2010)
mkfs.ext4: No such device or address while trying to determine filesystem size
+++ mkfs -t ext4 -b 4096 -i 16346 -F /dev/sda3
mke2fs 1.41.12 (17-May-2010)
mkfs.ext4: No such device or address while trying to determine filesystem size

so it seems in contrast to what partprobe had reported before

+++ partprobe -s /dev/sda
/dev/sda: msdos partitions 1 2 3

there is no partition device node /dev/sda3 on your particular system.

From my current point of view this issue looks like just one more instance
of the various annoyances with asynchronously created device nodes,
where one cannot know when a device node will actually appear
and - even worse - whether an existing device node will stay there stably.

See
https://github.com/rear/rear/issues/791
and follow the links therein.

A blind shot in the dark as to what might help:

In general, when device nodes that are expected to appear are late,
it helps to sit and wait until the system (hardware, kernel, udev, whatever...)
has settled its various activities, so that at the

Confirm or edit the disk recreation script
1) Confirm disk recreation script and continue 'rear recover'
2) Edit disk recreation script (/var/lib/rear/layout/diskrestore.sh)
3) View disk recreation script (/var/lib/rear/layout/diskrestore.sh)
4) View original disk space usage (/var/lib/rear/layout/config/df.txt)
5) Use Relax-and-Recover shell and return back to here
6) Abort 'rear recover'

user dialog you should choose

2) Edit disk recreation script (/var/lib/rear/layout/diskrestore.sh)

and insert a simple sleep 10 command before the commands
that create the filesystems or even better add some
"wait for the actually needed thing" code like

for countdown in $( seq 60 -1 0 ) ; do
    test -b /dev/sda3 && break
    LogPrint "waiting for /dev/sda3 ($countdown)"
    sleep 1
done

cf.
https://github.com/rear/rear/issues/791#issue-138931029
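
An alternative to a fixed sleep (only a suggestion, assuming udevadm is available in the recovery system) is to let udev finish its pending event processing and then check for the device node:

# wait (up to 60 seconds) until udev has processed all pending events
udevadm settle --timeout=60
# then check that the partition device node is really there
test -b /dev/sda3 && echo "/dev/sda3 is there" || echo "/dev/sda3 is still missing"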

dcz01 commented at 2019-03-01 09:38:

@jsmeix Ok, the same error occurs with TSM...
Here is the same error in log:
rear-FBD01PSS.log
But with the rear -D recover executed a second time it worked...
rear-FBD01PSS.log

jsmeix commented at 2019-03-01 14:10:

With probability one when

with the 'rear -D recover' executed a second time it worked

the root cause is the kind of timing issue that I described in
https://github.com/rear/rear/issues/2059#issuecomment-468597405

I.e. try to make the diskrestore.sh script wait a bit before it creates
the filesystems so that the needed device nodes will be there.

dcz01 commented at 2019-03-01 15:25:

@jsmeix That would be great if it is possible.
Thanks very much :-)
Have a nice weekend (A schens Wochenend wünsch i da - dialect for "I wish you a nice weekend").

jsmeix commented at 2019-03-04 11:15:

@dcz01
I think you misunderstood me - I meant that you change things in your ReaR
as you need it for your use case - at least as a workaround - because
currently I don't know how to correctly wait for the real thing, cf.
https://github.com/rear/rear/issues/791

I think in your particular case where wipefs --force does not work
you may change in your
usr/share/rear/layout/prepare/GNU/Linux/130_include_filesystem_code.sh
the value of the cleanup_command to something like

cleanup_command="sleep 5 && wipefs --all $device || dd if=/dev/zero of=$device bs=512 count=1 || true"

so that it does a dumb wait of 5 seconds to let (hopefully) /dev/sda3 appear
before it tries to clean it up and afterwards makes a filesystem on it.
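
A variant of that workaround (only a sketch of the same idea, not tested - it waits only as long as needed instead of sleeping a fixed time, and keeps the fallback chain unchanged) could be:

cleanup_command="for i in 1 2 3 4 5 6 7 8 9 10 ; do test -b $device && break ; sleep 1 ; done ; wipefs --all $device || dd if=/dev/zero of=$device bs=512 count=1 || true"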

dcz01 commented at 2019-03-04 11:36:

@jsmeix Ah well, OK, now I understand you. But for a future release it would be great to include this change for all ReaR users by default, right?

jsmeix commented at 2019-03-04 11:58:

@dcz01
this totally depends on how enthusiastic your feedback is
if it really works for you this way... ;-)

If it works with sleep 5, please test whether it also works reliably with sleep 1,
because I would much prefer not to sleep too much:
on systems with many filesystems it would sleep that hardcoded
amount of time for each one of the filesystems, cf.
https://github.com/rear/rear/issues/791#issuecomment-224231100

dcz01 commented at 2019-03-04 14:50:

@jsmeix When I edit your solution into that file /usr/share/rear/layout/prepare/GNU/Linux/130_include_filesystem_code.sh in the rescue system, I still get the same error with sleep 5 or sleep 1.
Or am I editing the wrong file in rescue mode?

jsmeix commented at 2019-03-05 09:48:

@dcz01
provided it really sleeps, it is disappointing that this does not help.
I was sure it would help to wait and then /dev/sda3 would finally appear.

Did you try to verify that sleeping won't help at all in your case
by letting it sleep very long, e.g. sleep 60 or even sleep 300?

If waiting does not help at all here, this case could become
just another frustrating example of how unpredictable and hopeless
things have become since device nodes appear and disappear
asynchronously and dynamically, so that in practice it behaves randomly.

Check your ReaR log file in debug mode (i.e. rear -D recover)
to verify that it actually sleeps.

In the running recovery system use find / | grep include_filesystem_code
to find where that script actually is in your particular recovery system.

Normally it is in /usr/share/rear/... but when you run e.g. our current ReaR
GitHub master code from within a separate directory as described at
https://github.com/rear/rear/issues/2019#issuecomment-469219216
then the scripts are in the recovery system at a different place,
e.g. on my test systems where I run the ReaR GitHub master code
this way (and I always use KEEP_BUILD_DIR="yes" cf. default.conf)
I have the scripts (on the original system where I run rear -D mkrescue):

# find /tmp/rear.UK5Jx27aSLvwZ62/rootfs/ | grep include_filesystem_code
/tmp/rear.UK5Jx27aSLvwZ62/rootfs/root/rear.github.master/usr/share/rear/layout/prepare/GNU/Linux/130_include_filesystem_code.sh

i.e. I would get the scripts in the running recovery system under

/root/rear.github.master/usr/share/rear/...

and in this case /usr/share/rear would be a symlink
that points to /root/rear.github.master/usr/share/rear
so that in the running recovery system the ReaR scripts
can always be found under /usr/share/rear/...
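
A quick way to check where the scripts actually live (just a suggestion, assuming readlink is available in the recovery system):

# show whether /usr/share/rear is a plain directory or a symlink, and where it points
ls -ld /usr/share/rear
readlink -f /usr/share/rear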

I verified on my test system that in
.../layout/prepare/GNU/Linux/130_include_filesystem_code.sh

cleanup_command="sleep 15 && wipefs ...

actually lets it sleep 15 seconds.

dcz01 commented at 2019-03-05 11:34:

@jsmeix Actually on my system this doesn't work... I don't know why.
Here is the actual log from the test with 300 seconds (also tested with 5, 30 and 60) and a screenshot.
rear-FBD01PSS.log

(screenshot attached)

jsmeix commented at 2019-03-06 16:04:

@dcz01
I had no time today, so only a short note on how you may proceed
(which means how you could do some debugging legwork):

After you have logged in to the recovery system,
do manually what the diskrestore.sh script would do
to see how things behave in your case step by step.

Use a recent "rear -D recover" log file to see what
the diskrestore.sh script would do, or even better
copy the diskrestore.sh script out of the recovery system,
preferably also attach it here as a reference, so that you can
inspect directly what the diskrestore.sh script would do.

Before the first 'parted' command and after each
subsequent command that you do, run commands like

# parted -s /dev/sda unit MiB print

# ls -l /dev/sda*

# find /dev/disk/by-* -ls | grep '/sda'

# lsblk -i -p -o NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT

to see at each step what actually happens regarding your sda
and how that changes for each parted, wipefs, and mkfs call
that you do on your particular system.

dcz01 commented at 2019-03-07 11:31:

@jsmeix
Here is the diskrestore.sh from the rescue system for you.
diskrestore.zip

jsmeix commented at 2019-03-07 14:08:

@dcz01
according to your https://github.com/rear/rear/files/2941002/diskrestore.zip
what I mean you should run manually in a newly booted recovery system
(without having done any "rear recover" attempt before)
is something like

# parted -s /dev/sda unit MiB print
# ls -l /dev/sda*

# parted -s /dev/sda mklabel msdos
# parted -s /dev/sda unit MiB print
# ls -l /dev/sda*

# parted -s /dev/sda mkpart primary 1048576B 525336575B
# parted -s /dev/sda unit MiB print
# ls -l /dev/sda*

# parted -s /dev/sda set 1 boot on
# parted -s /dev/sda unit MiB print

# parted -s /dev/sda mkpart primary 525336576B 4753195007B
# parted -s /dev/sda unit MiB print
# ls -l /dev/sda*

# parted -s /dev/sda mkpart primary 4753195008B 214748364799B
# parted -s /dev/sda unit MiB print
# ls -l /dev/sda*

# mkfs -t ext4 -b 4096 -i 16346 -U fd2aaed0-d3c1-4bf4-99cb-e12f3da6a5c4 -F /dev/sda3
# tune2fs  -m 3 -c -1 -i 0d -o user_xattr,acl /dev/sda3
# lsblk -i -p -o NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT

# mkdir -p /mnt/local/
# mount -o rw /dev/sda3 /mnt/local/
# lsblk -i -p -o NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT

and see how far you can get with that...

If the disk is not new, i.e. if there is already some old stuff on it, see
https://github.com/rear/rear/issues/799
for how to use wipefs in the recovery system directly after logging in as root
to first and foremost clean up the used disk before anything else is done
with that disk (i.e. before one runs parted and things like that).
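
For example (only an illustration - adapt the device name, and be aware this wipes all signatures from the whole disk, so do it only on the disk that will be recreated anyway):

wipefs --all /dev/sda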

dcz01 commented at 2019-03-08 10:13:

@jsmeix
I also got some information that a program wasn't found...

(two screenshots attached)

jsmeix commented at 2019-03-11 14:42:

@dcz01
so we have a "heisenbug" here,
cf. https://en.wikipedia.org/wiki/Heisenbug
because it seems everything "just works" (there is not any error message)
when you do the actual setup commands manually.

Currently I have no idea how to find out why things fail when
those setup commands are called by the diskrestore.sh script.

Don't worry that you don't have lsblk in your recovery system.
It is not needed to set anything up.
It would only show the results in a nice way.
You can add it to your recovery system via

PROGS=( "${PROGS[@]}" lsblk )

Since
https://github.com/rear/rear/commit/b4fd97fd47d8c15ca1383dd911ec018780dce77b
(which happened after the ReaR 2.4 release in June 2018)
lsblk is included in the recovery system by default.

jsmeix commented at 2019-03-11 14:48:

@rmetrich
could you have a look - perhaps you have an idea of what goes wrong here.

It is RHEL 6.10 where basically things fail when the partition and
filesystem setup commands are called by the diskrestore.sh script
but things work when the same commands are called manually
in the recovery system.

Perhaps - only a blind guess - the additional my_udevtrigger and
my_udevsettle calls (cf. usr/share/rear/lib/linux-functions.sh)
in diskrestore.sh do more harm than good in this particular case?
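
A possible way to test that guess (just a suggestion, assuming udevadm is available in the recovery system) would be to trigger and settle udev manually between the parted calls and the mkfs calls, e.g.

# ask the kernel/udev to (re)create the partition device nodes and wait for them
udevadm trigger
udevadm settle --timeout=20
ls -l /dev/sda*

and see whether /dev/sda3 then appears reliably.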

github-actions commented at 2020-06-28 01:33:

Stale issue message


[Export of Github issue for rear/rear.]