#1418 PR merged
: Make function get_disk_size () be more tolerant when querying /sys/...¶
Labels: enhancement
, fixed / solved / done
, minor bug
gozora opened issue at 2017-07-16 15:16:¶
Hello guys,
After thinking a bit about this issue a bit, I've decided to introduce new function retry_command () in layout-functions.sh, as it might be useful for other situations in the future as well.
As always any comments are mote than welcome.
V.
gozora commented at 2017-07-17 11:09:¶
Just a small check point on what we have so far (because I'm having trouble to keep attention :-) )
try_and_verify ()
{
command="$*";
local retry=0;
for pass in trial verification;
do
until command_stdout=$(eval $command); do
sleep $REAR_SLEEP_DELAY;
retry=$(( retry + 1 ));
if [[ $retry -ge $REAR_MAX_RETRIES ]]; then
Error "Could not successfully finish command: '$command'";
fi;
done;
done
echo $command_stdout
}
jsmeix commented at 2017-07-17 11:32:¶
Since your
https://github.com/rear/rear/pull/1418#commitcomment-23133281
I understand the reason behind and I like it!
Your try_and_verify in your
https://github.com/rear/rear/pull/1418#issuecomment-315726470
works well for me:
# REAR_SLEEP_DELAY=1 # REAR_MAX_RETRIES=3 # filename="/etc/issue" # out=$( try_and_verify cat qqq $filename ) cat: qqq: No such file or directory cat: qqq: No such file or directory cat: qqq: No such file or directory Could not successfully finish command: 'cat qqq /etc/issue' # echo "'$out'" '' # out=$( try_and_verify cat qqq $filename '||' echo Hello ) cat: qqq: No such file or directory cat: qqq: No such file or directory # echo "'$out'" 'Welcome to SUSE Linux Enterprise Desktop 11 SP3 (i586) - Kernel \r (\l). Hello'
Note the appended 'Hello' after the /etc/issue output.
jsmeix commented at 2017-07-17 11:35:¶
You must use
echo "$command_stdout"
to get spaces and newlines in the output:
# out=$( try_and_verify cat qqq $filename '||' echo Hello ) cat: qqq: No such file or directory cat: qqq: No such file or directory # echo "'$out'" ' Welcome to SUSE Linux Enterprise Desktop 11 SP3 (i586) - Kernel \r (\l). Hello'
schlomo commented at 2017-07-17 11:37:¶
Please also test for files with blanks like this:
out=$( function cat "/etc/file with blank" "||" wc -l )
It is very important that we make ReaR safe internally so that we won't fail when users provide such paths.
gozora commented at 2017-07-17 11:51:¶
@schlomo
Following syntax works fine for me:
nr_blocks=$(try_and_verify 'cat "/tmp/disk size"')
gozora commented at 2017-07-17 11:53:¶
@schlomo
Even this one:
nr_blocks=$(try_and_verify 'cat "/tmp/disk size" | wc -l')
schlomo commented at 2017-07-17 11:58:¶
@gozora I personally find such double quotings really ugly. Hence IMHO it is much better to use '$@' which avoids that.
gozora commented at 2017-07-17 12:04:¶
@schlomo
Do you mean replacing command="$*"
with command="$@"
?
schlomo commented at 2017-07-17 12:09:¶
Yes and no. Don't copy the command into your own variable but use "$@"
as is. Did you see my in-line remarks on your PR?
schlomo commented at 2017-07-17 12:10:¶
And yes, you can use eval "$@"
to support even more complex things.
gozora commented at 2017-07-17 12:36:¶
@schlomo
Did you see my in-line remarks on your PR?
I can see inline remarks from @gdha and @jsmeix but not a single one from you ...
This is a new version of function:
try_and_verify ()
{
local retry=0
for pass in trial verification; do
until command_stdout=$(eval "$@"); do
sleep $REAR_SLEEP_DELAY
retry=$(( retry + 1 ))
if [[ $retry -ge $REAR_MAX_RETRIES ]]; then
Error "Could not successfully finish command '$@'"
fi
done
done
echo "$command_stdout"
}
Is it OK for everybody ?
V.
schlomo commented at 2017-07-17 12:39:¶
I didn't realize I need to send my review remarks. Sorry.
schlomo commented at 2017-07-17 12:46:¶
To be honest, I don't really get the point of the trial and verification approach. If I read the code correctly it will actually do the command twice but not compare between the two outputs.
What is your use case where you need to run the command twice?
IMHO you should choose one of the following approaches:
- Run the command till two consecutive executions yield the same result.
- Run the command till it is successful. No need to repeat it.
- The combination: Run till two consecutive executions yield the same result and are successful.
About the numeric evaluation: I strongly suggest to be consistent there
and use either (( ))`` or
letand not
[[
]]. In our [coding style](https://github.com/rear/rear/wiki/Coding-Style#test---) we mention to not use
[[`
for numerical checks, maybe we should expand that also to be more clear
about assignments.
gozora commented at 2017-07-17 13:02:¶
@schlomo
What is your use case where you need to run the command twice?
Please check https://github.com/rear/rear/issues/1370#issuecomment-303046386
- Run the command till two consecutive executions yield the same result.
- The combination: Run till two consecutive executions yield the same result and are successful.
This smells by endless loop, so I don't think it is a good idea.
About the numeric evaluation: I strongly suggest to be consistent ...
Which part of code you have on your mind?
V.
schlomo commented at 2017-07-17 13:15:¶
@gozora Now I understand the background story. After reading it I have a strong opinion that the "try and validate" approach is overengineering that makes ReaR more complex. It tries to work around the fact that we use the wrong tools for a job (as it is the case of using partprobe to "force" a block device to appear).
I am all in favor of implementing a wait logic in ReaR but would like to
keep all of that really simple. That means that in my opinion your
function should just do that: Retry several times till something works.
So your other until ... do
example would fit the job much better.
And yes, we need to deal with udev properly and stop kicking it. But this is another problem.
About numeric evaluation: Please use ((
or let
when you deal with
variables that contain numbers.
gozora commented at 2017-07-17 13:30:¶
@schlomo
After reading it I have a strong opinion that the "try and validate" approach is overengineering that makes ReaR more complex.
I basically don't have opinion here. Maybe you guys who are with ReaR much longer, should run some vote or something ;-)
About numeric evaluation: Please use (( or let when you deal with variables that contain numbers.
I guess that is what I already did, or am I overlooking something?
retry=$(( retry + 1 ))
schlomo commented at 2017-07-17 13:44:¶
The numerical assignment should be
let retry++
(( retry++ ))
and the test should be also
if (( retry > REAR_MAX_RETRIES )) ; then
And the error messages in case of a failure should not only explain what went wrong (the actual command) but rather explain why this is an issue and what the user should do now.
gozora commented at 2017-07-17 13:55:¶
And the error messages in case of a failure should not only explain what went wrong (the actual command) but rather explain why this is an issue and what the user should do now.
Again a small example, to check If I get your idea right:
You would like to change:
# ls /tmp1
ls: cannot access /tmp1: No such file or directory
to
# ls /tmp1
ls: cannot access /tmp1: No such file or directory
ls: Because you've made a typo.
?
V.
schlomo commented at 2017-07-17 14:00:¶
Yes, although I would turn the order around. First print the ReaR error message and then the command and the output. That way the reading order (top-bottom) helps the user get the full context.
Example:
... other rear output ...
ERROR: Could not get block device for freshly created partitions
Timed out after 5 seconds waiting for command 'stat -t /dev/sda4' to be successful:
stat: cannot stat '/dev/sda4': No such file or directory
jsmeix commented at 2017-07-17 14:03:¶
I think on first glance the following could be assumed:
The try_and_verify function looks primarily meant for
things as in
https://github.com/rear/rear/issues/791
where the verify part is needed because of
https://github.com/rear/rear/issues/791#issuecomment-211318620
But for the issue in
https://github.com/rear/rear/issues/1370
it looks as if a plain retry_command function
(without additional verify) is the right thing.
But according to
https://github.com/rear/rear/issues/1370#issuecomment-303046386
from my point of view the issue in
https://github.com/rear/rear/issues/1370
is actually the same as
https://github.com/rear/rear/issues/791
Because of that a "try and validate" approach is needed
because according to what I put together in
https://github.com/rear/rear/issues/791
that is how things behave.
@schlomo
regarding using partprobe to "force" a block device to appear:
Yes, that is (as far as I understand it) the wrong tool.
But from that does not follow ReaR can avoid the
"try and validate" approach because "try and validate"
results from how udev and the kernel behave
at least sometimes.
Because udev and kernel behave sometimes unreliably
we need a "try and validate" approach that makes ReaR
more complex to compensate those bugs elsewhere,
cf. "Dirty hacks welcome" in
https://github.com/rear/rear/wiki/Coding-Style
In my personal opinion the real bug is in tools like parted
that return to the user 'successfully' without having their
job actually completely done.
E.g. after
parted /dev/sda unit MiB mkpart primary ext2 8 2048
returned successfully the new /dev/sda1 should be there
rock solid ready to use so that an immediate subsequent
mkfs.ext2 /dev/sda1
should always "just work". i.e.
parted /dev/sda unit MiB mkpart primary ext2 8 2048 && mkfs.ext2 /dev/sda1
should "just work".
But this is not true so that ReaR has to work around.
schlomo commented at 2017-07-17 14:11:¶
@jsmeix I understood you in the other issue so that we simply should
wait for the block device after using parted
instead of kicking udev.
Is that wrong? For that we just need a simple retry_command
function.
I am really against this generic retry function that does stuff twice
after it is successful. It requires the developer to be very much aware
of the fact. Imagine you use that on something that does more than just
read something? Like creating a filesystem? It will fail till the
block device appears, work exactly once, and then fail again (unless you
have --force
specified). In such a case the use of the try and
validate approach is wrong because the validation and the trying should
use different commands.
AFAIKT for the reason that @gozora wrote his function a plain retry is enough and that is why I would like to keep it that simple here and see where that helps us. If you need the try and validate approach then please call the retry function twice to make it very transparent that you need to have two successful calls of that command.
Again, keeping it simple reduces both code lines and complexity and also increases readability for those cases where you need this special behavior.
If we generalize each and every problem then ReaR will become too complex and we will have a very hard time to debug it when those things cause problems.
jsmeix commented at 2017-07-17 14:31:¶
@schlomo
it would not work reliably if ReaR waited for a block device
(e.g. before parted and before mkfs) with a simple
retry_command function because nowadays at least sometimes
block devices can disappear and reappear out of a sudden, see
https://github.com/rear/rear/issues/791#issuecomment-211318620
so that a verify part is needed to be fail-safe.
A simple retry_command function that is called two times
should be sufficient like in
# wait for /dev/sda retry_command test -b /dev/sda # verify /dev/sda is there stable retry_command test -b /dev/sda # make /dev/sda1 parted ... mkpart ... /dev/sda # wait for /dev/sda1 retry_command test -b /dev/sda1 # verify /dev/sda1 is there stable retry_command test -b /dev/sda1 # make /dev/sda1 ext2 mkfs.ext2 /dev/sda1
Looks ugly?
YES!
That is the ugliness of how udev/systemd/kernel/whatever behave :-(
Accordingly a try_and_verify function
that calls retry_command two times
hides some of the ugliness:
# wait for /dev/sda and verify it is there stable try_and_verify test -b /dev/sda # make /dev/sda1 parted ... mkpart ... /dev/sda # wait for /dev/sda1 and verify it is there stable try_and_verify test -b /dev/sda1 # make /dev/sda1 ext2 mkfs.ext2 /dev/sda1
For the fun of weirdness:
If device nodes could disappear and reappear several times
until they stay there stable, a try_and_verify function would
have to call retry_command as many times as disappear and
reappear could happen in the worst case to be really fail-safe ;-)
schlomo commented at 2017-07-17 14:36:¶
@jsmeix in that case I would prefer to have a retry_command
function
and another function retry_and_verify
that calls the retry_command
function twice. You get the same result and I get the keep it simple
approach. :-)
jsmeix commented at 2017-07-17 14:37:¶
@schlomo
I agree that the current try_and_verify function cannot be used
with arbitrary commands - it particular not with commands that
write something (unless that writing is idempotent), cf.
https://github.com/rear/rear/pull/1418#commitcomment-23133068
jsmeix commented at 2017-07-17 14:51:¶
I think the name 'retry_and_verify' still does not make it
absolutely clear that the command is really called twice
because 'verify' may look like a safe read-only test.
I think there should be a plain 'retry_command' function
plus a 'retry_command_twice' function that calls
retry_command two times.
Or any other name that makes it obvious that command
is called at least two times e.g. 'retry_and_retry_command'
(or whatever you like).
Now when users look at plain code like
retry_command_twice test -b /dev/sda parted ... mkpart ... /dev/sda retry_command_twice test -b /dev/sda1 mkfs.ext2 /dev/sda1
they may get somewhat alerted by the strange function name
because strange things do actually happen here.
schlomo commented at 2017-07-17 14:53:¶
@jsmeix fine with me, the more names explain what they do the better.
jsmeix commented at 2017-07-18 09:09:¶
After sleeping over it I think I know what went wrong:
I think the root cause is that the function errors out.
I think normally a function should not end the whole program
(except things like an explicit 'exit_program' function).
I think normally a function should only return a non-zero exit code
when it failed and leave it to its caller what to do in this case.
In particular assume the use case where a command should
be retried but it is not fatal when it fails like:
retry_command CMD1 || retry_command CMD2 || Error "..."
Currently this is not possible.
Accordingly I like to suggest to remove the Error and
replace it with a 'return 1' and a simple generic
Log or LogPrint message i.e. no longer a specific error
message as (somewhat overcomplicated additional)
function argument.
I assume this would - by the way - also solve the
obscure issue with the clean exit (which indicates
that things had become needlessly overcomplicated).
With such a simple and straightforward 'retry_command' function
that only does what its name tells and nothing more
code would become more clear because error messages
are then in the caller and not indirectly in the function.
Cf. RFC 1925 6a:
"It is always possible to add another level of indirection."
E.g. my above example would then look like
# wait for /dev/sda retry_command test -b /dev/sda || Error "/dev/sda did not appear" # verify that /dev/sda stays there stable retry_command test -b /dev/sda || Error "/dev/sda disappeared" # make /dev/sda1 parted ... mkpart ... /dev/sda || Error "making /dev/sda1 failed" # wait for /dev/sda1 retry_command test -b /dev/sda1 || Error "/dev/sda1 did not appear" # verify that /dev/sda1 stays there stable retry_command test -b /dev/sda1 || Error "/dev/sda1 disappeared" # make ext2 on /dev/sda1 mkfs.ext2 /dev/sda1 || Error "making ext2 on /dev/sda1 failed"
In contrast to what I wrote above in
https://github.com/rear/rear/pull/1418#issuecomment-315773035
I think meanwhile that the two subsequent explicit calls of
'retry_command test -b ...' plus the specific different error
messages make the code more clear and better understandable
compared to "obscure hidden magic" in a try_and_verify function.
schlomo commented at 2017-07-18 09:11:¶
@jsmeix very good thoughts. Indeed it is very bad programming practice to die from within function calls. In languages with exceptions (Java, Python ...) functions only report errors but don't decide what to do.
jsmeix commented at 2017-07-18 09:20:¶
@gozora
I think the command that actually should be retried
according to the original code is
test -r /sys/block/$disk_name/size
because the original code errors out after that comand.
Accordingly I think with a simple 'retry_command' function
the code should be something like this:
retry_command test -r /sys/block/$disk_name/size || Error "..."
gozora commented at 2017-07-18 13:07:¶
@schlomo @jsmeix any ideas for commit messages are more than welcome :-)
schlomo commented at 2017-07-18 13:24:¶
@gozora I mean this file (first green line in screenshot):
gozora commented at 2017-07-18 13:30:¶
@schlomo I know what you mean ...
But get_disk_size () is called by following code in
200_partition_layout.sh
devsize=$(get_disk_size ${disk#/sys/block/})
Afterwards (in get_disk_size) you call retry_command ().
If I would omit $(
and )
then devsize=123456
becomes
devsize='
123456'
Does it give a sense?
schlomo commented at 2017-07-18 13:45:¶
@gozora I checked out your branch to understand what is happening.
The get_disk_size
function is defined as this
get_disk_size() {
local disk_name=$1
local block_size=$(get_block_size ${disk_name%/*})
$(retry_command test -r /sys/block/$disk_name/size) || Error "Could not determine size of disk $disk_name"
local nr_blocks=$( < /sys/block/$disk_name/size)
local disk_size=$(( nr_blocks * block_size ))
### Make sure we always return a number
echo $(( disk_size ))
}
The $( ... ) || Error
line is IMHO problematic because it will try to
execute whatever is returned by the subshell. In your specific case in
most situations the subshell returns nothing, so it seems OK.
But, this whole story reveals a bug in your retry_function
: If the
command generates no output the retry_function
will nevertheless
output an empty line. To trace such things I use
sudo usr/sbin/rear shell
in your branch:
# set -x
set -x
# retry_command :
retry_command :
+ retry_command :
+ local retry=0
++ eval :
+++ :
+ command_stdout=
+ echo ''
Here we clearly see that the echo ''
is the culprit here. If you
replace it with echo -n
then everything will be fine and then you can
remove the $( ... )
from the get_disk_size
function.
gozora commented at 2017-07-18 13:58:¶
Hello @schlomo,
I need a bit cool down time from this issue ...
I start to feel a bit angry and I don't want to do programming like
this.
So if you want, feel free to complete this pull request, or postpone it for v2.3.
Sorry for that.
V.
schlomo commented at 2017-07-18 14:16:¶
@gozora sure thing and sorry for upsetting you with my nitpicking.
jsmeix commented at 2017-07-18 14:30:¶
@schlomo
I changed it hopefully right (not yet tested)
according to how I think what you mean.
Could you please have a look and verify if it works?
By the way regarding
https://github.com/rear/rear/pull/1418#issuecomment-316068241
To trace ... things ... use usr/sbin/rear shell
Wow!
Now I understand for what "rear shell" is meant to be used.
schlomo commented at 2017-07-18 14:36:¶
I think it looks very good now. Does anybody actually have a test case for this fix? If not then we'll have to merge it like this.
jsmeix commented at 2017-07-18 14:39:¶
I am brave and merge it right now as is and then I test it
and if issues appear I fix them...
schlomo commented at 2017-07-18 14:43:¶
Thanks a lot. This is real team work!
jsmeix commented at 2017-07-18 15:20:¶
I tested it by adding to init/default/030_update_recovery_system.sh
(I use an 'init' script because those are run for any workflow):
disk_name=sda disk_size=$( get_disk_size $disk_name) LogPrint "get_disk_size $disk_name results $disk_size" disk_name=qqq disk_size=$( get_disk_size $disk_name)
and it results
e205:~/rear.master # usr/sbin/rear -d -D mkrescue Relax-and-Recover 2.1 / Git Using log file: /root/rear.master/var/log/rear/rear-e205.log get_disk_size sda results 21474836480 ERROR: Could not determine size of disk qqq Aborting due to an error, check /root/rear.master/var/log/rear/rear-e205.log for details Terminated
with this in the log file (excerpts):
+ source /root/rear.master/usr/share/rear/init/default/030_update_recovery_system.sh ++ disk_name=sda +++ get_disk_size sda +++ local disk_name=sda ++++ get_block_size sda ++++ '[' -r /sys/block/sda/queue/logical_block_size ']' ++++ echo 512 +++ local block_size=512 +++ retry_command test -r /sys/block/sda/size +++ local retry=0 ++++ eval test -r /sys/block/sda/size +++++ test -r /sys/block/sda/size +++ command_stdout= +++ echo -n '' +++ local nr_blocks=41943040 +++ local disk_size=21474836480 +++ echo 21474836480 ++ disk_size=21474836480 ++ LogPrint 'get_disk_size sda results 21474836480' ++ Log 'get_disk_size sda results 21474836480' +++ date '+%Y-%m-%d %H:%M:%S.%N ' ++ local 'timestamp=2017-07-18 17:17:01.006874366 ' ++ test 1 -gt 0 ++ echo '2017-07-18 17:17:01.006874366 get_disk_size sda results 21474836480' 2017-07-18 17:17:01.006874366 get_disk_size sda results 21474836480 ++ Print 'get_disk_size sda results 21474836480' ++ test 1 ++ echo -e 'get_disk_size sda results 21474836480' ++ disk_name=qqq +++ get_disk_size qqq +++ local disk_name=qqq ++++ get_block_size qqq ++++ '[' -r /sys/block/qqq/queue/logical_block_size ']' ++++ echo 512 +++ local block_size=512 +++ retry_command test -r /sys/block/qqq/size +++ local retry=0 ++++ eval test -r /sys/block/qqq/size +++++ test -r /sys/block/qqq/size +++ command_stdout= +++ sleep 1 +++ let retry++ +++ (( retry >= REAR_MAX_RETRIES )) ++++ eval test -r /sys/block/qqq/size +++++ test -r /sys/block/qqq/size +++ command_stdout= +++ sleep 1 +++ let retry++ +++ (( retry >= REAR_MAX_RETRIES )) ++++ eval test -r /sys/block/qqq/size +++++ test -r /sys/block/qqq/size +++ command_stdout= +++ sleep 1 +++ let retry++ +++ (( retry >= REAR_MAX_RETRIES )) ++++ eval test -r /sys/block/qqq/size +++++ test -r /sys/block/qqq/size +++ command_stdout= +++ sleep 1 +++ let retry++ +++ (( retry >= REAR_MAX_RETRIES )) ++++ eval test -r /sys/block/qqq/size +++++ test -r /sys/block/qqq/size +++ command_stdout= +++ sleep 1 +++ let retry++ +++ (( retry >= REAR_MAX_RETRIES )) +++ Log 'retry_command '\''test -r /sys/block/qqq/size'\'' failed' ++++ date '+%Y-%m-%d %H:%M:%S.%N ' +++ local 'timestamp=2017-07-18 17:17:06.020432290 ' +++ test 1 -gt 0 +++ echo '2017-07-18 17:17:06.020432290 retry_command '\''test -r /sys/block/qqq/size'\'' failed' 2017-07-18 17:17:06.020432290 retry_command 'test -r /sys/block/qqq/size' failed +++ return 1 +++ Error 'Could not determine size of disk qqq' +++ LogPrintError 'ERROR: Could not determine size of disk qqq' +++ Log 'ERROR: Could not determine size of disk qqq'
[Export of Github issue for rear/rear.]