#1926 Issue closed: Restore fails when teaming runner is configured as LACP

Labels: enhancement, support / question, fixed / solved / done

MogiePete opened issue at 2018-10-10 22:16:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version: 2.4-1.el7

  • OS version: CentOS 7.5

  • ReaR configuration files:
    OUTPUT=ISO
    BACKUP=TSM

  • Hardware: Cisco C240 M4

  • System architecture: x86_64

  • Firmware: BIOS

  • Storage: Local Disk

  • Description of the issue (ideally so that others can reproduce it):

In issue #655 @rg-fine has an issue with rear working with teamed interfaces in RHEL and provided work around in PR #662.

The resolution was to simplify the function that all ip-addresses and routing-entries of a teaming-interface will be attached to its first available teaming-member according to teamdctl.

While this works well in a team configured for broadcast, roundrobin, and activebackup it does not work when configured for LACP. (I can't confirm for loadbalance as I haven't tested it).

As stated in #662 all ip-addresses and routing-entries are configured on one interface but LACP is not operational. This results in the host not being able to communicate with the TSM server and the backup failing.

jsmeix commented at 2018-10-11 07:07:

@MogiePete
in ReaR version 2.4

Network setup was completely reworked to support
bonding, bridges, vlans and teaming. There is a full
rewrite of the 310_network_devices.sh script generating
network interfaces for use during ReaR rescue/recovery
system networking setup via the 60-network-devices.sh
script. It also handles corner cases/odd setups that can
be found from time to time, typically when the administrator
uses bonding plus bridges plus vlans as well as teaming.

see
http://relax-and-recover.org/documentation/release-notes-2-4

@rmetrich
could you have a look here because it is about
teamed network interfaces which could be related to your code in
usr/share/rear/rescue/GNU/Linux/310_network_devices.sh

jsmeix commented at 2018-10-11 07:12:

@MogiePete
in general when ReaR's automated setup of the networking stuff
in the ReaR recovery system does not work as you need it
in your particular case, you can as a workaround manually
specify any networking setup comands as you need it via the
NETWORKING_PREPARATION_COMMANDS array, see
usr/share/rear/conf/default.conf

MogiePete commented at 2018-10-11 15:25:

@jsmeix

I looked at the NETWORKING_PREPARATION_COMMANDS but I don't think that will resolve the issue. Bonding is done with a kernel module while teaming is done in user space so I don't believe I can configure the interfaces and they would work unless I configured them as a bond which defeats the purpose of using teaming.

At this point, until a solution is found, I am having to revert all of my physical hosts running EL7 to bonding.

rmetrich commented at 2018-10-11 15:39:

Hello,

If I remember well, the teaming is always simplified.
Could you hence submit the generated file (/etc/scripts/system-setup.d/60-network-devices.sh)?

Also a debug log of the rear -dD mkrescue may help there.

rmetrich commented at 2018-10-11 15:46:

@MogiePete
Would it work with LACP if an interface is taken administratively down?
What happens on the OS if you set carrier to 0 on one of the physical interface?
The idea is to find a solution without integrating the whole teamdctl stack.

MogiePete commented at 2018-10-11 20:30:

@rmetrich

I have attached the 60-network-devices.sh renamed to .txt as requrested along with the debug log.

In regards to your question about setting the carrier to 0 I do not know how to do that with teaming and rear but it would be worth a try.

Also, we did work with our network team and tried with the second port set to admin down but they were not seeing any LACP traffic. Additionally, when they disabled the Port-Channel then traffic was flowing but this is not the ideal path to take as doing a restore when the OS is configured for bonding does not require any modification of the network configuration.

60-network-devices.sh.txt
rear-mariadb01.log

rmetrich commented at 2018-10-12 06:48:

Looking closer, I think we really need to simplify the Teaming configuration because including Teaming is very complicated: it requires d-bus and other stuff we probably don't want in the rescue.
Could your network team describe the switch configuration of the ports?
With same switch configuration, is it possible to configure the OS in bonding instead? If so, can you describe the required bonding configuration?
Asking because maybe we could just map teaming to bonding while in rescue.

MogiePete commented at 2018-10-12 14:24:

@rmetrich

Here is the configuration of the Port-Channel and one of the port configurations.

interface Port-channel276
switchport
switchport access vlan 417
switchport mode access
no ip address
!
interface GigabitEthernet3/14
switchport
switchport access vlan 417
switchport mode access
no ip address
channel-group 276 mode on
!

The system works with bonding and teaming. I actually reconfigured the network for bonding, created a new rescue disk, and did the restore without any issues.

This is the bonding configuration on the OS:

BONDING_OPTS="downdelay=0 miimon=1 mode=802.3ad updelay=0"

I think the idea of mapping teaming to bonding would work perfectly. Same network configuration will work with both.

rmetrich commented at 2018-10-15 09:12:

Hello @MogiePete ,
I've started implementing the teaming code inside recovery. I believe this is doable without including D-Bus.
Please allow me some time as this is not my primary responsibility at Red Hat.

My understanding is that with LACP, using network simplification will not work, even for Bonding. Hence, I would also like to harden the bond code so that in case of LACP, no network simplification happens and a message gets printed to the user.

Once code is implemented, please help us testing, because I don't have access to the required hardware.

jsmeix commented at 2018-10-15 10:21:

@rmetrich
as a networking noob I have a beginners question:

I had a quick look at
https://en.wikipedia.org/wiki/Link_aggregation
which also talks about redundancy in case one of the links fails.

If one of two links fails, there is only one left and then
things should still work with the one left network interface.

Now I wonder why it is not possible to set up only one single
network interface "as usual" and it still works to access
the server where the backup is stored?

I.e. why is the whole link aggregation setup required in the recovery system
to be able to access the backup server via a single network interface?

Or in other words:
Why is the whole link aggregation setup required in a network
where "teaming with LACP" is used when a particular host
has only one single NIC?

rmetrich commented at 2018-10-15 11:31:

@jsmeix LACP works by sending frames to negociate link state between switch and host. If we just plumb a single network interface, then no frame will be sent, preventing any network traffic to occur.

MogiePete commented at 2018-10-16 03:31:

@rmetrich

Thank you for working on this. I am sure many will benefit.

Once the code is committed I will test and provide feedback.

Thanks again.

rmetrich commented at 2018-10-16 12:25:

@MogiePete

PR is #1934, please give it a try on real hardware.
Please test the following scenarios with LACP enabled on the switch

  • Bonding
  • Bonding with SIMPLIFY_BONDING=y in ReaR configuration
  • Teaming
  • Teaming with SIMPLIFY_TEAMING=y in ReaR configuration

Due to LACP, SIMPLIFY_XXX shouldn't do anything: Bonding and Teaming shall persist in recovery.
The teamdctl binary is automatically added to the ISO image for convenience, when a Team exists and is not simplified.

MogiePete commented at 2018-10-16 21:11:

@rmetrich

I was able to test this using bonding and teaming after building the rpm from your pull request.

On one of the two nodes I had an issue with the system booting properly after the TSM restore. The UUID of /dev/sda1 which holds /boot was not updated in the fstab which resulted in the system failing to load the operating system properly. A quick edit after booting into recovery on the CentOS iso resolved the issue. The two systems were identical in terms of hardware and OS configuration so I am not sure what the issue was.

Other than that one issue I can report that the additions to the networking code are working properly.

Thank you so much for the quick turn around on this issue.

jsmeix commented at 2018-10-17 08:16:

@MogiePete
can you open a new separated issue for

The UUID of /dev/sda1 which holds /boot
was not updated in the fstab which resulted in the
system failing to load the operating system properly

We need separated GitHub issue reports for separated issues
because otherwise different things could mess up until havoc.

Normally UUIDs in the restored etc/fstab get automatically adjusted via
usr/share/rear/finalize/GNU/Linux/280_migrate_uuid_tags.sh
that reads a FS_UUID_MAP file var/lib/rear/layout/fs_uuid_mapping
which gets created during "rear recover" by the script
usr/share/rear/layout/prepare/GNU/Linux/130_include_filesystem_code.sh

You may also have a look at item (5)
in https://github.com/rear/rear/issues/1854#issue-338886055
and https://github.com/rear/rear/pull/1758
and https://github.com/rear/rear/issues/1857

To have a chance to understand what went wrong in your case
I need a debug log of "rear -D recover" (the recovery log gets
copied into the recreated system so that you should have it)
plus - if possible - your old etc/fstab from the backup and the
new adapted one that works.

jsmeix commented at 2018-10-18 11:43:

With https://github.com/rear/rear/pull/1934 merged
this issue should be fixed.

jsmeix commented at 2018-10-23 08:15:

@MogiePete
thank you for your follow up issue https://github.com/rear/rear/issues/1938
regarding https://github.com/rear/rear/issues/1926#issuecomment-430533773

pcahyna commented at 2019-03-04 10:23:

@MogiePete "I was able to test this using bonding and teaming after building the rpm from your pull request."
Did you test teaming with multiple interfaces in the team? For us this did not work because teaming seems to have a problem analogous to #1954, which was fixed for bonding, but not for teaming.

MogiePete commented at 2019-03-04 13:18:

@pcahyna

Yes this was tested with 2 interfaces configured for LACP.

pcahyna commented at 2019-03-04 13:43:

@MogiePete thanks for the confirmation. I don't see how this could have worked though. In our testing, teamd is invoked as

teamd -d -c '{"device": "int-team", "ports": {"em1": {"link_watch": {"name": "ethtool"}}, "em1": {"link_watch": {"name": "ethtool"}}}, "runner": {"name": "lacp", "tx_hash": ["eth", "ipv4", "ipv6"]}}'

which fails with this message:

Failed to parse config: duplicate object key near '"em1"' on line 1, column 82
Failed to load config.

@rmetrich any idea?

rmetrich commented at 2019-03-04 14:06:

@pcahyna You have "em1" twice, is that expected?

pcahyna commented at 2019-03-04 14:11:

@rmetrich, yes, that's the point. That's why it fails. Since the MAC address is duplicated, the sed command modified the script so that it contains a duplicate interface name.
My question was, how it is possible that it worked for @MogiePete because I don't see in which case this problem could have been avoided.

MogiePete commented at 2019-03-04 14:50:

@pcahyna @rmetrich

This is the version I had installed:

Relax-and-Recover 2.4-git.3115.507837a.unknown / 2018-10-16


[Export of Github issue for rear/rear.]