#564 Issue closed: Exit codes

Labels: enhancement, discuss / RFC

clausa opened issue at 2015-03-11 14:22:

We are running "/usr/sbin/rear checklayout || /usr/sbin/rear mkrescue" from cron.

If EXIT_CODE gets set to a non-zero value anywhere in the code, /usr/sbin/rear writes "WARNING rc=$EXIT_CODE" to syslog. That triggers a logcheck script to notify our monitoring system - so someone can look into this.

However, if there is a change in the layout, EXIT_CODE is set to 1, in order to trigger mkrescue to run.

Unfortunately that also result in a WARNING in syslog, which creates a false alarm.

Any suggestions, on how to "fix" this?

gdha commented at 2015-03-19 12:14:

EXIT_CODE=1 must be kept for the cron entry so that mkrescue is getting triggered.
We could have different exit codes. e.g. EXIT_CODE=2 for serious errors...
Proposal:

  • EXIT_CODE=0 : no issue found
  • EXIT_CODE=1 : layout changed (needed to trigger mkrescue). I would like to see this logged as well in syslog, but how to formulate this? NOTICE: rear layout changed - trigger mkrescue or something like that?
  • EXIT_CODE>1 : serious error that should get logged in syslog

jsmeix commented at 2015-03-19 12:46:

@gdha
why must it be "EXIT_CODE=1" for cron?

I think in general the exit code value '1' is used for "anything wrong"
and not to indicate a specific outcome.

In rear-1.17.0 in usr/sbin/rear I don't see a reason why the EXIT_CODE value for cron cannot be something else.

In particular in usr/sbin/rear it seems "EXIT_CODE=1" is already used for "ERROR: The specified command '$WORKFLOW' does not exist !"

If my understanding is right I would like to suggest something like:

EXIT_CODE=0 : no issue found

EXIT_CODE=1 : default value for "anything wrong"

EXIT_CODE=2 ... EXIT_CODE=9 : reserved for specific fatal errors.

EXIT_CODE>9 : reserved to indicate a specific outcome

in particular currently used:

EXIT_CODE=11 : layout changed

For background info what I have in mind see the exit codes of CUPS backends and see CUPS filters how messages are prefixed there, see
"Exit Codes" at
http://cups.org/documentation.php/doc-1.7/man-backend.html
and "Log Messages" at
http://cups.org/documentation.php/doc-1.7/man-filter.html

schlomo commented at 2015-03-19 13:04:

I am actually against using exit codes as an official interface. My favourite bad example are the rsync exit codes Where 24 means that all is fine but some source files disappeared between scanning and transferring. The result is that whereever I use rsync in a script I must check the exit code and add code to handle 24 like 0.

Therefore I suggest to use exit codes > 0 only as a signal to tell the outside world that we could not do our job as expected. More advanced stuff should be done via an explicit interface so that you can write something like this

res=$(rear something) || deal_with_error
if [[ "$res" == *need-new-rescue-image* ]] ; then
    do_something_else
fi

IMHO that is much more legible than for example

rear something
res=$?
if (( res == 11 )) ; then
    do_something
elif (( res > 0 )) ; then
    deal_with_error
fi

For the case of creating a cron job to run ReaR if the layout changed I would always opt to add stuff to rear to keep the conditionals out of the cron job. Within ReaR we have much more detail and context infos available to make decisions.

In this specific example I would rather extend ReaR to support something like

rear mkrescue --if-layout-changed

so that you can rely on ReaR to do the right thing. And if the exit code is >0 then you know for sure that there was a problem.

jsmeix commented at 2015-03-19 13:25:

@schlomo

when exit codes are used as an interface I agree that the ranges of error exit codes and non-error exit codes must be strictly separated.

If I understand you correctly you propose some special stdout (or perhaps also stderr) messages as an official interface.

Currently I don't have an opinion if this is good or bad.

But couldn't that conflict with other output when e.g. "rear -v" is used?
For example accidentally "rear -v" might output something like "No layout change -> skipping need-new-rescue-image". What I mean is: If some special messages are an official interface every contributor to rear must carefully avoid to output something that matches the special messages by accident.

jsmeix commented at 2015-03-19 13:42:

Only a quick idea:

What about using specific prefixed lines on stdout as official interface like:

$ rear --very-verbose something
Hello,
my name is rear!
How are you?
INFO: I do now 'something' for you:
DEBUG: Init 'something'
Hello, my name is something!
DEBUG: Cheking whether or not need-new-rescue-image
INFO: Found new device /dev/sdX mounted as /dev/sdX9 at /foo
disk layout has changed => need-new-rescue-image
INTERFACE: need-new-rescue-image
I am fine!
ERROR: failed to read /etc/fstab
Oops!
I am no longer fine!
I abort now... 
$ echo $?
1

schlomo commented at 2015-03-20 09:44:

I would rather think about defining call backs. That means that you have to specify a program that will be called if ReaR has to tell something. Could be something as simple es echo >special_file or something that talks to a central ReaR management service.

jsmeix commented at 2015-03-20 11:53:

@schlomo

Do I understand your "defining call backs" correctly that this is another different interface how rear could communicate with "the outer world"
because I think "defining call backs" does not match what you wrote in your example above "res=$(rear something)".

If I understand it correctly there are now three proposals for an interface how rear could communicate with "the outer world":

  • Using exit codes
  • Using stdout/stderr messages
  • Using call backs

I think this particular issue has revealed a general issue in rear that currently it is not yet well defined how rear would communicate with "the outer world".

Perhaps a new separated issue should be created for the general issue how rear would communicate with "the outer world" and this issue here is only about how rear could tell when the disk layout has changed (or when in general a new rescue image is needed, e.g. also when kernel drivers have changed).

schlomo commented at 2015-03-20 11:59:

You are right. This is a topic which should be discussed in greater detail. What I meant is that if to go beyond a simple single reply on stdout then I would rather opt for a call back system.

So far we use echo and read to communicate with the user. I think that this should be moved to a library that could then be customized to use a call back.

To go back to the original issue as opened by @clausa, I think that it would be enough to simply add a rear mkrescue --only-if-changed feature or something similar. @clausa, could you please comment what would be the most simple solution to your problem?

clausa commented at 2015-03-20 20:37:

@schlomo being able to just call something like rear mkrescue --only-if-changed would definitely solve our 'problem'. So +1 from here.

schlomo commented at 2015-03-21 19:52:

I just had a look. May way to implement this would be to take a checksum of the layout and gracefully exit if the checksum did not change after saving the layout.

jsmeix commented at 2015-03-25 16:32:

@schlomo
regarding a "rear mkrescue --only-if-changed" feature:

Perhaps I am overanxious and I do not have sufficient overview knowledge of the whole rear but I fear such a feature could become a nightmare to implement and to keep it working correctly.

I think "rear mkrescue --only-if-changed" promises the user that he will get a new rear recovery system image when anything has changed that requires a new recovery system.

This means when any such change is by accident not recognized and later recovery fails because the recovery system is outdated, then the user rightfully complains that rear is guilty.

Without such a feature it is left to the user to decide if he needs a new recovery system (i.e. in case of doubt he would just make it anew).

In contrast a specific feature like "rear mkrescue --if-disklayout-changed" is perfectly o.k. from my point of view because it can be documented that this means any change in the disklayout.conf file (nothing more, nothing less).

When several such specific "rear mkrescue --if-this-changed", "rear mkrescue --if-that-changed" features are implemented, a general "rear mkrescue --if-changed" feature could be just a simpler way to call "rear mkrescue --if-this-changed --if-that-changed" but this would not mean the user gets a new recovery system image when anything has changed that requires a new recovery system.

schlomo commented at 2015-03-25 17:02:

So your concern is mostly about the wording? No problem.

In any case, if I would implement this then I would not change any existing behaviour but extend mkrescue (and possibly mkbackup) to drop out quickly if nothing relevant changed.

In any case there would be some remaining risk involved. Imagine a server beeing set up and getting OS updated over 5 years. In many cases the disk layout would not change in those 5 years. But the ReaR version and the kernel would change.

To successfully recover this system on new hardware you would really need the recent ReaR version and OS kernel from all those updates. But since you never created a new Rescue Image you will try to recover your system with a Linux and ReaR that is 5 years old.

I actually believe that this example covers a majority of use cases in a data center.

Coming back to the original issue of @clausa I can think of the following:

  • Why don't you create a new rescue image every day, week or month? Why try to optimize this point? How much money or time does this optimization actually save you?
  • Provide a Pull request to optionally disable logging to syslog
  • Provide a Pull request that makes rear not log a warning if there was no Error (Maybe it is indeed a bug that we log a warning there)

Looking at 7820c992 where that behaviour was added I see a reference to OVO which was apparently used somewhere to collect infos about ReaR. @gdha, could you please try to remember what was the background there? Is it really necessary to log any exit code != 0 to syslog as a warning? Or maybe it would be enough to log the actual Errors?

Bottom line, after looking at everything I start to believe that it would be more "logical" that an exit code != 0 should be logged to syslog as info and not as warning. After all, it can be part of regular operations.

gdha commented at 2015-03-25 19:09:

@schlomo Indeed it was added on request of an important customer (who actually pay for writing code) for picking it up via OVO as there are thousands of Linux systems and nobody seemed to know how many failures there where on a weekly basis. So, what @clausa does not want to see, is not what my customer wanted... To be honest I believe it is good practice if you see a warning to always do a quick check of the log file. Better one check too many then a failed recovery afterwards...

schlomo commented at 2015-03-25 19:24:

@gdha, are you 100% sure that your OVO customer actually needs the WARNING line in syslog for all cases where the exit code was != 0? Or do they more care about catching the ERROR lines which are generated by our Error function?

Could you talk to the OVO customer about the cases where a non-0 exit code is actually the desired behaviour? Maybe now they have some filters in their monitoring to remove those "false positives" and maybe they could also benefit from improving this topic?

gdha commented at 2015-03-25 19:42:

@schlomo I can ask them next week when I'm there again on what they scan. I cannot tell for the moment - only that there problem has reduced significantly.

schlomo commented at 2015-03-25 19:44:

@gdha That sounds great! IMHO we should drop the WARNING line altogether. A WARNING is in my experience totally useless. Either I treat it as an error and live with a lot of false positives or I ignore it and only care about errors.

I would rather like to improve ReaR to throw an Error if we think that a human should check and not bother humans in all other cases.

jsmeix commented at 2015-03-26 09:22:

@schlomo
interesting information! On the one hand I fully agree with your reasoning why warning messages are useless but on the other hand I would like to be able to show warnings as a programmer when there are issues where I can continue the program in a reasonable way but with an uncertain feeling if this is what the user actually wanted.

For example:

# /bin/showsomething --output-mode=color
WARNING: output-mode=color unsupported on this terminal, using fallback output-mode=ASCII

in contrast to using the fallback silently or abort.

Perhaps for rear it could be a special case that there is nothing in between success and failure for a disaster recovery tool?
If yes, would this mean that rear must abort for any kind of issue?

schlomo commented at 2015-03-26 10:25:

In my experience the wish for "warnings" is actually an expression of not trusting the software. As a software program, emitting a "warning" means that you want to shed the responsibility to actually decide if this is an error or not.

But the only thing you achieve as a program is to make your human waste time on false positives.

That is why I am against warnings. A program should call my attention to something only if I must take action as a result.

Your example with the color mode is perfectly showing my point: There is actually no action the user can take right now, the user can only accept it as it is. The only real "action" the user could possibly take is to use the program in a different way (from a color terminal). But since the user happens to use the program from an ASCII terminal, he probably has a reason for doing so and the warning does not help at all.

Therefore in your example the warning is a false positive because there is nothing one can do about it. I would really prefer your showsomething program to either issue an info message or - even better - to just shut up about this. If I miss the color output I will start myself to look into it. For that purpose a verbose or debug message would be much better suited.

Or put different: Warnings are only useful for people who want to run their environment manually. For those who build automation all kind of warnings are a curse.

jsmeix commented at 2015-03-26 11:14:

Many thanks for your great explanation!

In particular your "emitting a 'warning' means that you want to shed the responsibility to actually decide if this is an error or not" describes exactly what at least my motivation as programmer is in such cases - as I wrote "uncertain feeling".

Your "issue an info message" is also an interesting point that makes things very clear.

The issue is not that a program must run either silently successfully or abort with an ERROR message,
the issue is only that WARNING messages are useless.

If I understand it correctly the following would be o.k.:

# /bin/showsomething --output-mode=color
INFO: output-mode=color unsupported on this terminal, using fallback output-mode=ASCII

# echo $?
0

# /bin/showsomething --output-mode=color 2>/dev/null

# echo $?
0

# /bin/showsomething --debug --output-mode=color
DEBUG: TERM=ibmmono
INFO: output-mode=color unsupported on this terminal, using fallback output-mode=ASCII

# echo $?
0

schlomo commented at 2015-03-26 11:20:

@jsmeix exactly that is what I mean. Only bother humans if they are really supposed to do something about that single run of the program.

gdha commented at 2015-05-31 15:42:

@rear/contributors Are we fine with this issue? Can we close it?

schlomo commented at 2015-05-31 15:45:

@gdha I am fine with this, thanks a lot!

clausa commented at 2015-05-31 19:40:

Thanks all :-)

jsmeix commented at 2016-12-16 10:15:

@clausa
for ReaR 2.0 there is again a change what rear
reports to syslog when it finishes, see
https://github.com/rear/rear/pull/1130 and
https://github.com/rear/rear/issues/1089

clausa commented at 2016-12-19 12:53:

@jsmeix

Thanks for the heads-up. I'm pretty sure this change will bring the issue raised in this ticket. (as "failed" in logfiles will also raise an alarm in our systems).

I might be wrong, but IIRC the issues is merely related to how non-zero exit codes are being used. A change in the disklayout is IMHO not a failure, but it will be logged as "rear checklayout failed with exit code 1" as RC has been set to 1 in order to notify/trigger "mkrescue"

jsmeix commented at 2016-12-19 13:36:

As far as I see the root cause that
usr/share/rear/layout/compare/default/500_compare_layout.sh
and
usr/share/rear/layout/compare/default/510_compare_files.sh
set EXIT_CODE=1
if disk layout or config files have changed
is still not solved.

Therefore one cannot distinguish between a real failure
where a non zero exit code is right and the misuse of
a non zero exit code to signal something.

With the new syslog messages you see at least the workflow
so that you can implement special "failure" handling when
the checklayout workflow ends with non zero exit code.

jsmeix commented at 2016-12-19 13:57:

I think I cannot solve it in a backward compatible way.
Therefore I think I will "just add another special case hack"
to usr/sbin/rear when the checklayout workflow ends.

jsmeix commented at 2016-12-20 10:34:

In
https://github.com/rear/rear/pull/1133
I implemented special case syslog message
when the checklayout workflow exits with exit code 1.
Now in /var/log/messages it look like

rear[13332]: 12759: rear checklayout finished with zero exit code

versus (when the checklayout workflow exits with exit code 1)

rear[11460]: 10888: rear checklayout finished with exit code 1 (layout or config changed)

versus (when checklayout exits with another non-zero exit code)

rear[12046]: 11473: rear checklayout failed with exit code 2

jsmeix commented at 2016-12-20 10:43:

With https://github.com/rear/rear/pull/1133 merged
I consider this issue to be again fixed.

The root cause "misuse of non-zero exit codes"
is not yet solved and will not be solved for ReaR 2.0

jsmeix commented at 2016-12-20 11:47:

To deal with the root cause I submited
https://github.com/rear/rear/issues/1134


[Export of Github issue for rear/rear.]