#1724 Issue `closed`: On PPC64 manual IP config fails with "SIOCSIFFLAGS: Invalid argument"¶

Labels: support / question, fixed / solved / done, special hardware or VM

schubiduuu opened issue at 2018-02-06 09:04:¶

Relax-and-Recover (ReaR) Issue Template¶

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

rear version (/usr/sbin/rear -V): rear-2.3-49.git.0.4e40a7b.unknown.changed
OS version (cat /etc/rear/os.conf or lsb_release -a):

SB Version:    core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-ppc64:core-3.2-ppc64:core-4.0-ppc64:desktop-4.0-noarch:desktop-4.0-ppc32:desktop-4.0-ppc64:graphics-2.0-noarch:graphics-2.0-ppc32:graphics-2.0-ppc64:graphics-3.2-noarch:graphics-3.2-ppc32:graphics-3.2-ppc64:graphics-4.0-noarch:graphics-4.0-ppc32:graphics-4.0-ppc64
Distributor ID: SUSE LINUX
Description:    SUSE Linux Enterprise Server 11 (ppc64)
Release:        11
Codename:       n/a

rear configuration files (cat /etc/rear/site.conf or cat /etc/rear/local.conf):

AUTOEXCLUDE_MULTIPATH=n
BOOT_OVER_SAN=y

REAR_INITRD_COMPRESSION="lzma"
OUTPUT=ISO
BACKUP=TSM
OUTPUT_URL=file:///iso/

COPY_AS_IS=( "${COPY_AS_IS[@]}" /etc/sysconfig/network/config /etc/sysconfig/network/scripts/functions )
MODULES=( 'all_modules' )
FIRMWARE_FILES=( 'yes' )
MODULES_LOAD=( scsi_mod scsi_tgt scsi_transport_srp ibmvscsic dm_mod dm_snapshot scsi_dh scsi_dh_alua scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw dm_multipath dm_round_robin dm_queue_length dm_least_pending dm_service_time crc_t10dif sd_mod linear dm_log dm_region_hash dm_mirror mbcache jbd ext3 ibmveth nx_crypto sg ipv6_lib ipv6 loop xfs fuse sunrpc nfs_acl auth_rpcgss lockd fscache nfs joydev cdrom ide_cd_mod sr_mod st )

EXCLUDE_VG=( vgHANA-data-HC2 vgHANA-data-HC3 vgHANA-log-HC2 vgHANA-log-HC3 vgHANA-shared-HC2 vgHANA-hared-HC3 )
BACKUP_PROG_EXCLUDE=( "${BACKUP_PROG_EXCLUDE[@]}" '/hana/*' )
COPY_AS_IS_TSM=( /etc/adsm/TSM.PWD /opt/tivoli/tsm/client/ba/bin/dsmc /opt/tivoli/tsm/client/ba/bin/inclexcl /opt/tivoli/tsm/client/ba/bin/dsm.sys /opt/tivoli/tsm/client/ba/bin/dsm.opt /opt/tivoli/tsm/client/api/bin64/libgpfs.so /opt/tivoli/tsm/client/api/bin64/libdmapi.so /opt/tivoli/tsm/client/ba/bin/EN_US/dsmclientV3.cat /usr/local/ibm/gsk8* )

Are you using legacy BIOS or UEFI boot? BIOS
Brief description of the issue:

After boot into recovery environment I tried to change the IP address or change the operational state of eth0 it keeps telling me: SIOCSIFFLAGS: Invalid argument.

Commands that I used were ifconfig and ip.

Work-around, if any: no

jsmeix commented at 2018-02-06 12:57:¶

This is a follow-up of another issue starting at
https://github.com/rear/rear/issues/1701#issuecomment-361954934

I guess (but I am not at all a sufficient networking expert
to make an authoritative decision here) that this issue
depends on "special hardware" - probably not on the ppc64
architecture because @schabrolles tested ReaR a lot
on POWER architecture - but perhaps more likely on special
networking hardware (e.g. special network interface card).

@schabrolles
if time permits, could you have a look here?

jsmeix commented at 2018-02-06 13:06:¶

@schubiduuu
according to your
https://github.com/rear/rear/issues/1701#issuecomment-363384811
you already use in your etc/rear/local.conf

MODULES=( 'all_modules' )
FIRMWARE_FILES=( 'yes' )
REAR_INITRD_COMPRESSION="lzma"

but that alone would not make all those kernel modules loaded
in the recovery system that were loaded on your original system, cf.
https://github.com/rear/rear/issues/1701#issuecomment-362537902

If the issue depends on the kernel driver for your network interface card
and/or on other kernel modules related to networking, you may also try
if it helps to get all modules that were loaded on your original system
also loaded in the recovery system via

MODULES_LOAD=( first_module second_module ... )

where you could generate the list of needed modules in the right ordering
on the original system e.g. by a command like

cat /proc/modules | cut -d ' ' -f 1 | tac

under the assumption that /proc/modules lists last loaded modules topmost.

schubiduuu commented at 2018-02-06 13:48:¶

I included every module listed in the output of "cat /proc/modules | cut -d ' ' -f 1 | tac".

The issue still persists. The strange thing about this issue is that we were able to use REAR with SLES11 and the same HW in the past. We didn't change anything regarding devices, drivers, OS etc.

Local.conf of the initial post is updated to the current one.

jsmeix commented at 2018-02-06 14:53:¶

@schubiduuu
if you were able to use an older version of ReaR
with SLES11 and the same HW in the past
it would be of interest which older version had worked for you
(ideally up to which ReaR version it had worked for you).

The major rewite of 310_network_devices.sh
is only available in the current ReaR GitHub master code since
https://github.com/rear/rear/commit/02937854b6ea2060a978d69355581ba93adb9e7e

The last released ReaR version 2.3 still has the
"traditional" 310_network_devices.sh (plus separated 360_teaming.sh).

@schubiduuu
could you verify if it works again when you downgrade your
current ReaR GitHub master code to the released ReaR version 2.3.

To download RPM packages of ReaR version 2.3 see
http://relax-and-recover.org/download/

Hmmm...
by the way I noticed that for SLE11 it is not built for POWER architecture
only for openSUSE_Factory_PowerPC it gets built for ppc64 and ppc64le
so that RPM packages of ReaR version 2.3 for POWER architecture
are currently only available at
http://download.opensuse.org/repositories/Archiving:/Backup:/Rear/openSUSE_Factory_PowerPC/
but it should not matter for which SUSE product you use ReaR
because ReaR is always the same bash scripts.

To downgrade a "git clone" of a ReaR GitHub master code
to a particular git commit, for example to the git commit
that matches theReaR version 2.3 release you can do the following:

Basically "git clone" it into a separated directory and then
find and checkout the desired git commit and finally
configure and run ReaR from within that directory like:

# git clone https://github.com/rear/rear.git

# mv rear rear.github.master

# cd rear.github.master

# git log | grep -B5 'ReaR 2.3 release'

commit 37bfa9c8591a69c83ea48bae73b6b0286ac36e8c
Author: Gratien D'haese 
Date:   Wed Dec 20 16:22:57 2017 +0100

    ReaR 2.3 release

# git reset --hard 37bfa9c8591a69c83ea48bae73b6b0286ac36e8c
HEAD is now at 37bfa9c8 ReaR 2.3 release

# vi etc/rear/local.conf

# usr/sbin/rear -D mkbackup

Note the relative paths "etc/rear/" and "usr/sbin/".

To run git on SLE11 you need the RPM package "git-core" for ppc64 on SLE11.
Currently I don't know wherefrom you could download it if you do not yet have it.

schubiduuu commented at 2018-02-06 15:21:¶

The file "310_network_devices.sh" was modified to incoporate the latest changes #1701 .
The version which worked for a while is: rear-2.2-106.git.0.39e3103.unknown.changed.
We had to test newer REAR version due to some issues we encountered during some tests in the past.

I don't use the version for SLES11 on x86 but the openSUSE_Factory_PowerPC version.

jsmeix commented at 2018-02-07 13:09:¶

@schubiduuu
I have no good news because I cannot reproduce
any such issue on my SLES11 SP4 x86_64 system
with current ReaR GitHub master code.

On my SLES11 SP4 x86_64 system everything "just works" for me
for DHCP and
for static IP address setup (same as the one of the original system) and
for manual network interface setup via

NETWORKING_PREPARATION_COMMANDS=( 'ip link set dev eth0 up' 'ip addr add 10.160.4.255/16 dev eth0' 'return' )

and for manual network interface setup via recovery system kernel command line parameters

ip=10.160.4.255 nm=255.255.0.0 gw=10.160.255.245 netdev=eth0

I found some minor issues related to manual network interface setup
so that I did https://github.com/rear/rear/pull/1725
but nothing really serious - even without that changes I always got a
usable network interface eth0 so that both remote access via SSH
and "rear recover" worked well for me.

I use this local.conf:

OUTPUT=ISO
BACKUP=NETFS
BACKUP_OPTIONS="nfsvers=3,nolock"
BACKUP_URL=nfs://10.160.4.244/nfs
SSH_ROOT_PASSWORD="rear"
USE_STATIC_NETWORKING="yes"
NETWORKING_PREPARATION_COMMANDS=( 'ip link set dev eth0 up' 'ip addr add 10.160.4.255/16 dev eth0' 'return' )
FIRMWARE_FILES=( 'no' )
MODULES=( 'loaded_modules' )

jsmeix commented at 2018-02-07 13:14:¶

@schubiduuu
does this issue and https://github.com/rear/rear/issues/1701
appear only on one single system or do they appear on several systems?

In case of the latter do they appear only on same kind of hardware
(e.g. only with one particular kind of network interface card)
or do the issues appear on various different ppc64 systems?

schubiduuu commented at 2018-02-07 13:32:¶

This issue appears on every SLES11 system (ppc) we have.

jsmeix commented at 2018-02-07 13:41:¶

Is "every SLES11 system (ppc)" same kind of hardware
(in particular do all have same networking hardware)
or do you have various different ppc64 systems?

I assume it only happens for SLES11 ppc64 systems
but with SLES12 ppc64 systems it works
or do you not have SLES12 ppc64 systems?

schubiduuu commented at 2018-02-07 13:48:¶

Yes we are speaking about the same hardware as each system is virtual running on the same host.
We have issues for SLES11 ppc64 but not for SLES12.

jsmeix commented at 2018-02-07 13:56:¶

@schabrolles
I dare to assign this issue to you - even if you may not have time right now.

I will still have a look but as non-POWER user I can neither really help
with POWER architecture (PPC64) specific issues nor can I do
a meaningful analysis what the root cause could be.

schabrolles commented at 2018-02-08 08:26:¶

@schubiduuu I'm gonna try to replicate it on my power in my lab .... I did not remember having already face this issue.

schabrolles commented at 2018-02-08 08:37:¶

@schubiduuu, I got one idea,

I think your issue could be related to a TOO BIG initrd
This is mostly due to the size of TSM client included into the rescue image.

Try the following to reduce the amount of data to copy into the rescue image:

COPY_AS_IS_TSM=( /etc/adsm/TSM.PWD /opt/tivoli/tsm/client/ba/bin/dsmc /opt/tivoli/tsm/client/ba/bin/tsmbench_inclexcl /opt/tivoli/tsm/client/ba/bin/dsm.sys /opt/tivoli/tsm/client/ba/bin/dsm.opt /opt/tivoli/tsm/client/api/bin64/libgpfs.so /opt/tivoli/tsm/client/api/bin64/libdmapi.so /opt/tivoli/tsm/client/ba/bin/EN_US/dsmclientV3.cat /usr/local/ibm/gsk8* )

replace /opt/tivoli/tsm/client/ba/bin/tsmbench_inclexcl with your TSM include/exclude file if you have one.

You normally don't need to copy the firmware files or all modules (try by removing those lines in your local.conf)

jsmeix commented at 2018-02-08 09:55:¶

@schabrolles
many thanks for having a look!

FYI
why @schubiduuu has currently all kernel modules
and firmware in his recovery system, see
https://github.com/rear/rear/issues/1701
in particular starting at
https://github.com/rear/rear/issues/1701#issuecomment-361954934
and the subsequent comments, in particular
https://github.com/rear/rear/issues/1701#issuecomment-363364586

I did some googling for "SIOCSIFFLAGS: Invalid argument"
and got the impression that this error message is usually related
to low-level issues like unsupported networking hardware,
misbehaving networking hardware kernel driver,
no MAC address (or a MAC address 00:00:00:00:00:00),
and things like that.

From that I assumed that needed kernel modules or even firmware
might be missing in the recovery system.
Therefore I suggested to try things like MODULES=( 'all_modules' )
FIRMWARE_FILES=( 'yes' ) and MODULES_LOAD=( ...), cf.
https://github.com/rear/rear/issues/1724#issuecomment-363416571

schabrolles commented at 2018-02-08 10:33:¶

@jsmeix,

I remember I had an issue with SLES11 and too BIG initrd (with TSM).
The symptoms were: rescue image can boot (even with too BIG initrd) but I got network ERRORS and FiberChannel failures.

I got a Power LPAR with SLES11 and just tested it with current master branch. it always boot without having to specify :
MODULES=( 'all_modules' )
FIRMWARE_FILES=( 'yes' )
=> I recommend to not use them, because I want to have the smallest initrd as possible.

A good think would be to have the latest kernel updated for SLES11 SP4. (https://www.suse.com/support/update/announcement/2017/suse-su-20173265-1/)

 powerpc/prom: Increase minimum RMA size to 512MB (bsc#984530,
bsc#1052370).

jsmeix commented at 2018-02-08 10:57:¶

@schabrolles
in general it would be good when during "rear mkrescue" it could autodetect
as many issues as possible that would later lead to failures during "rear recover"
cf. https://github.com/rear/rear/issues/1718#issuecomment-361871332

Do you think it is possible to autodetect during "rear mkrescue"
that the ReaR initrd is too big and error out in this case?

schubiduuu commented at 2018-02-08 13:13:¶

@schabrolles We followed you advice to include less files to reduce the size of the recovery boot media and voilà we are able to use the network interfaces and we are currently running a restore. Thanks so far for this incredible helpful advice. 👍

jsmeix commented at 2018-02-08 13:28:¶

@schabrolles
many thanks for your help here!
Without you as POWER magician we (i.e. the other ReaR guys)
would probably never had been able to see the root cause.

jsmeix commented at 2018-02-08 13:35:¶

@schubiduuu
also many thanks to you because we (i.e. the ReaR developers)
need venturous users like you who try out our "latest greatest"
ReaR GitHub master code, report issues, and even stay patient
when things do no longer "just work".

jsmeix commented at 2018-02-08 13:38:¶

@schabrolles
I set this "support question" issue to "fixed/solved/done"
but I think it revealed a needed enhancement in ReaR
to avoid a too big initrd in certain POWER environments, cf.
https://github.com/rear/rear/issues/1724#issuecomment-364076807

jsmeix commented at 2018-07-12 14:26:¶

@schabrolles
regarding my above
https://github.com/rear/rear/issues/1724#issuecomment-364113716
see my https://github.com/rear/rear/issues/1859
which intends to make such issues more obvious to the user
provided corrupted or missing files in the recovery system
are the root cause of this issue here.

schabrolles commented at 2018-07-16 07:22:¶

@jsmeix, You're right, I think #1859 could really help here. If initrd cannot be uncompressed totally due to lack of space, we should detect it with md5sums (file missing or with different size).

jsmeix commented at 2018-07-16 09:29:¶

@schabrolles
could you explain the reason behind why in particular on POWER architecture
it might happen that the "initrd cannot be uncompressed totally due to lack of space"
because I would think in particular on POWER systems there is plenty of "space"
(I assume that "space" is main memory).

schabrolles commented at 2018-07-16 10:03:¶

@jsmeix, It is specific to the PowerVM (Firmware based virtualization).
During LPAR (aka Logical Partition which could be seen as a VM) boot, the Power Hardware hypervisor allocates private space for the bootstrap (256MB). I don't have all the detail (and knowledge) about this very specific phase (virtual firmware of the LPAR before LPAR kernel bootstrap), but this 256MB should store some Firmware information, kernel and initrd (compressed). It seems that the 256MB is hardcoded in Linux kernel for Power ( variable ppc64_rma_size ). Recent Redhat (since rhel7.4) and Suse (since sles12sp2) should have 512MB RMA size.

Finding a rule to know what is the maximum size allowed is a bit difficult as it depends on:

Power FW version
Kernel version (256MB or 512MB)
number of real and virtual adapters (will increase space needed by the FW part)

So, having something that tell us that something is wrong about initrd data in the recovery system is a good start even if it would be better to have it during backup or rear-initrd creation (rear mkrescue).

jsmeix commented at 2018-07-16 10:53:¶

@schabrolles
many thanks for your explanation.
Now I understand the reason behind.

jsmeix commented at 2018-07-17 11:57:¶

With https://github.com/rear/rear/pull/1864 merged
corrupt files in the recovery system should be detected
and reported in an obvious way.

jsmeix commented at 2018-07-18 10:11:¶

I assume this isssue is meanwhile sufficiently answered
and handled so that I can close it hereby.

[Export of Github issue for rear/rear.]

#1724 Issue closed: On PPC64 manual IP config fails with "SIOCSIFFLAGS: Invalid argument"¶