#1418 PR merged: Make function get_disk_size () be more tolerant when querying /sys/...

Labels: enhancement, fixed / solved / done, minor bug

gozora opened issue at 2017-07-16 15:16:

Hello guys,

After thinking a bit about this issue a bit, I've decided to introduce new function retry_command () in layout-functions.sh, as it might be useful for other situations in the future as well.

As always any comments are mote than welcome.

V.

gozora commented at 2017-07-17 11:09:

Just a small check point on what we have so far (because I'm having trouble to keep attention :-) )

try_and_verify ()
{
    command="$*";
    local retry=0;

    for pass in trial verification;
    do
        until command_stdout=$(eval $command); do
            sleep $REAR_SLEEP_DELAY;
            retry=$(( retry + 1 ));
            if [[ $retry -ge $REAR_MAX_RETRIES ]]; then
                Error "Could not successfully finish command: '$command'";
            fi;
        done;
    done

    echo $command_stdout
}

jsmeix commented at 2017-07-17 11:32:

Since your
https://github.com/rear/rear/pull/1418#commitcomment-23133281
I understand the reason behind and I like it!

Your try_and_verify in your
https://github.com/rear/rear/pull/1418#issuecomment-315726470
works well for me:

# REAR_SLEEP_DELAY=1
# REAR_MAX_RETRIES=3

# filename="/etc/issue"

# out=$( try_and_verify cat qqq $filename )
cat: qqq: No such file or directory
cat: qqq: No such file or directory
cat: qqq: No such file or directory
Could not successfully finish command: 'cat qqq /etc/issue'

# echo "'$out'"
''

# out=$( try_and_verify cat qqq $filename '||' echo Hello )
cat: qqq: No such file or directory
cat: qqq: No such file or directory

# echo "'$out'"
'Welcome to SUSE Linux Enterprise Desktop 11 SP3 (i586) - Kernel \r (\l). Hello'

Note the appended 'Hello' after the /etc/issue output.

jsmeix commented at 2017-07-17 11:35:

You must use

    echo "$command_stdout"

to get spaces and newlines in the output:

# out=$( try_and_verify cat qqq $filename '||' echo Hello )
cat: qqq: No such file or directory
cat: qqq: No such file or directory

# echo "'$out'"
'
Welcome to SUSE Linux Enterprise Desktop 11 SP3  (i586) - Kernel \r (\l).


Hello'

schlomo commented at 2017-07-17 11:37:

Please also test for files with blanks like this:

out=$( function cat "/etc/file with blank" "||" wc -l )

It is very important that we make ReaR safe internally so that we won't fail when users provide such paths.

gozora commented at 2017-07-17 11:51:

@schlomo
Following syntax works fine for me:

nr_blocks=$(try_and_verify 'cat "/tmp/disk size"')

gozora commented at 2017-07-17 11:53:

@schlomo
Even this one:

nr_blocks=$(try_and_verify 'cat "/tmp/disk size" | wc -l')

schlomo commented at 2017-07-17 11:58:

@gozora I personally find such double quotings really ugly. Hence IMHO it is much better to use '$@' which avoids that.

gozora commented at 2017-07-17 12:04:

@schlomo
Do you mean replacing command="$*" with command="$@"?

schlomo commented at 2017-07-17 12:09:

Yes and no. Don't copy the command into your own variable but use "$@" as is. Did you see my in-line remarks on your PR?

schlomo commented at 2017-07-17 12:10:

And yes, you can use eval "$@" to support even more complex things.

gozora commented at 2017-07-17 12:36:

@schlomo

Did you see my in-line remarks on your PR?

I can see inline remarks from @gdha and @jsmeix but not a single one from you ...

This is a new version of function:

try_and_verify ()
{
    local retry=0

    for pass in trial verification; do
        until command_stdout=$(eval "$@"); do
            sleep $REAR_SLEEP_DELAY
            retry=$(( retry + 1 ))
            if [[ $retry -ge $REAR_MAX_RETRIES ]]; then
                Error "Could not successfully finish command '$@'"
            fi
        done
    done

    echo "$command_stdout"
}

Is it OK for everybody ?

V.

schlomo commented at 2017-07-17 12:39:

I didn't realize I need to send my review remarks. Sorry.

schlomo commented at 2017-07-17 12:46:

To be honest, I don't really get the point of the trial and verification approach. If I read the code correctly it will actually do the command twice but not compare between the two outputs.

What is your use case where you need to run the command twice?

IMHO you should choose one of the following approaches:

  • Run the command till two consecutive executions yield the same result.
  • Run the command till it is successful. No need to repeat it.
  • The combination: Run till two consecutive executions yield the same result and are successful.

About the numeric evaluation: I strongly suggest to be consistent there and use either (( ))`` orletand not[[ ]]. In our [coding style](https://github.com/rear/rear/wiki/Coding-Style#test---) we mention to not use[[` for numerical checks, maybe we should expand that also to be more clear about assignments.

gozora commented at 2017-07-17 13:02:

@schlomo

What is your use case where you need to run the command twice?

Please check https://github.com/rear/rear/issues/1370#issuecomment-303046386

  • Run the command till two consecutive executions yield the same result.
  • The combination: Run till two consecutive executions yield the same result and are successful.

This smells by endless loop, so I don't think it is a good idea.

About the numeric evaluation: I strongly suggest to be consistent ...

Which part of code you have on your mind?

V.

schlomo commented at 2017-07-17 13:15:

@gozora Now I understand the background story. After reading it I have a strong opinion that the "try and validate" approach is overengineering that makes ReaR more complex. It tries to work around the fact that we use the wrong tools for a job (as it is the case of using partprobe to "force" a block device to appear).

I am all in favor of implementing a wait logic in ReaR but would like to keep all of that really simple. That means that in my opinion your function should just do that: Retry several times till something works. So your other until ... do example would fit the job much better.

And yes, we need to deal with udev properly and stop kicking it. But this is another problem.

About numeric evaluation: Please use (( or let when you deal with variables that contain numbers.

gozora commented at 2017-07-17 13:30:

@schlomo

After reading it I have a strong opinion that the "try and validate" approach is overengineering that makes ReaR more complex.

I basically don't have opinion here. Maybe you guys who are with ReaR much longer, should run some vote or something ;-)

About numeric evaluation: Please use (( or let when you deal with variables that contain numbers.

I guess that is what I already did, or am I overlooking something?

retry=$(( retry + 1 ))

schlomo commented at 2017-07-17 13:44:

The numerical assignment should be

let retry++
(( retry++ ))

and the test should be also

if (( retry > REAR_MAX_RETRIES )) ; then

And the error messages in case of a failure should not only explain what went wrong (the actual command) but rather explain why this is an issue and what the user should do now.

gozora commented at 2017-07-17 13:55:

And the error messages in case of a failure should not only explain what went wrong (the actual command) but rather explain why this is an issue and what the user should do now.

Again a small example, to check If I get your idea right:

You would like to change:

# ls /tmp1
ls: cannot access /tmp1: No such file or directory

to

# ls /tmp1
ls: cannot access /tmp1: No such file or directory
ls: Because you've made a typo.

?

V.

schlomo commented at 2017-07-17 14:00:

Yes, although I would turn the order around. First print the ReaR error message and then the command and the output. That way the reading order (top-bottom) helps the user get the full context.

Example:

... other rear output ...
ERROR: Could not get block device for freshly created partitions
Timed out after 5 seconds waiting for command 'stat -t /dev/sda4' to be successful:
stat: cannot stat '/dev/sda4': No such file or directory

jsmeix commented at 2017-07-17 14:03:

I think on first glance the following could be assumed:

The try_and_verify function looks primarily meant for
things as in https://github.com/rear/rear/issues/791
where the verify part is needed because of
https://github.com/rear/rear/issues/791#issuecomment-211318620

But for the issue in
https://github.com/rear/rear/issues/1370
it looks as if a plain retry_command function
(without additional verify) is the right thing.

But according to
https://github.com/rear/rear/issues/1370#issuecomment-303046386
from my point of view the issue in
https://github.com/rear/rear/issues/1370
is actually the same as
https://github.com/rear/rear/issues/791

Because of that a "try and validate" approach is needed
because according to what I put together in
https://github.com/rear/rear/issues/791
that is how things behave.

@schlomo
regarding using partprobe to "force" a block device to appear:
Yes, that is (as far as I understand it) the wrong tool.
But from that does not follow ReaR can avoid the
"try and validate" approach because "try and validate"
results from how udev and the kernel behave
at least sometimes.

Because udev and kernel behave sometimes unreliably
we need a "try and validate" approach that makes ReaR
more complex to compensate those bugs elsewhere,
cf. "Dirty hacks welcome" in
https://github.com/rear/rear/wiki/Coding-Style

In my personal opinion the real bug is in tools like parted
that return to the user 'successfully' without having their
job actually completely done.

E.g. after

parted /dev/sda unit MiB mkpart primary ext2 8 2048

returned successfully the new /dev/sda1 should be there
rock solid ready to use so that an immediate subsequent

mkfs.ext2 /dev/sda1

should always "just work". i.e.

parted /dev/sda unit MiB mkpart primary ext2 8 2048 && mkfs.ext2 /dev/sda1

should "just work".
But this is not true so that ReaR has to work around.

schlomo commented at 2017-07-17 14:11:

@jsmeix I understood you in the other issue so that we simply should wait for the block device after using parted instead of kicking udev. Is that wrong? For that we just need a simple retry_command function.

I am really against this generic retry function that does stuff twice after it is successful. It requires the developer to be very much aware of the fact. Imagine you use that on something that does more than just read something? Like creating a filesystem? It will fail till the block device appears, work exactly once, and then fail again (unless you have --force specified). In such a case the use of the try and validate approach is wrong because the validation and the trying should use different commands.

AFAIKT for the reason that @gozora wrote his function a plain retry is enough and that is why I would like to keep it that simple here and see where that helps us. If you need the try and validate approach then please call the retry function twice to make it very transparent that you need to have two successful calls of that command.

Again, keeping it simple reduces both code lines and complexity and also increases readability for those cases where you need this special behavior.

If we generalize each and every problem then ReaR will become too complex and we will have a very hard time to debug it when those things cause problems.

jsmeix commented at 2017-07-17 14:31:

@schlomo
it would not work reliably if ReaR waited for a block device
(e.g. before parted and before mkfs) with a simple
retry_command function because nowadays at least sometimes
block devices can disappear and reappear out of a sudden, see
https://github.com/rear/rear/issues/791#issuecomment-211318620
so that a verify part is needed to be fail-safe.

A simple retry_command function that is called two times
should be sufficient like in

# wait for /dev/sda
retry_command test -b /dev/sda
# verify /dev/sda is there stable
retry_command test -b /dev/sda
# make /dev/sda1
parted ... mkpart ... /dev/sda
# wait for /dev/sda1
retry_command test -b /dev/sda1
# verify /dev/sda1 is there stable
retry_command test -b /dev/sda1
# make /dev/sda1 ext2
mkfs.ext2 /dev/sda1

Looks ugly?
YES!
That is the ugliness of how udev/systemd/kernel/whatever behave :-(

Accordingly a try_and_verify function
that calls retry_command two times
hides some of the ugliness:

# wait for /dev/sda and verify it is there stable
try_and_verify test -b /dev/sda
# make /dev/sda1
parted ... mkpart ... /dev/sda
# wait for /dev/sda1 and verify it is there stable
try_and_verify test -b /dev/sda1
# make /dev/sda1 ext2
mkfs.ext2 /dev/sda1

For the fun of weirdness:
If device nodes could disappear and reappear several times
until they stay there stable, a try_and_verify function would
have to call retry_command as many times as disappear and
reappear could happen in the worst case to be really fail-safe ;-)

schlomo commented at 2017-07-17 14:36:

@jsmeix in that case I would prefer to have a retry_command function and another function retry_and_verify that calls the retry_command function twice. You get the same result and I get the keep it simple approach. :-)

jsmeix commented at 2017-07-17 14:37:

@schlomo
I agree that the current try_and_verify function cannot be used
with arbitrary commands - it particular not with commands that
write something (unless that writing is idempotent), cf.
https://github.com/rear/rear/pull/1418#commitcomment-23133068

jsmeix commented at 2017-07-17 14:51:

I think the name 'retry_and_verify' still does not make it
absolutely clear that the command is really called twice
because 'verify' may look like a safe read-only test.

I think there should be a plain 'retry_command' function
plus a 'retry_command_twice' function that calls
retry_command two times.
Or any other name that makes it obvious that command
is called at least two times e.g. 'retry_and_retry_command'
(or whatever you like).

Now when users look at plain code like

retry_command_twice test -b /dev/sda
parted ... mkpart ... /dev/sda
retry_command_twice test -b /dev/sda1
mkfs.ext2 /dev/sda1

they may get somewhat alerted by the strange function name
because strange things do actually happen here.

schlomo commented at 2017-07-17 14:53:

@jsmeix fine with me, the more names explain what they do the better.

jsmeix commented at 2017-07-18 09:09:

After sleeping over it I think I know what went wrong:

I think the root cause is that the function errors out.
I think normally a function should not end the whole program
(except things like an explicit 'exit_program' function).
I think normally a function should only return a non-zero exit code
when it failed and leave it to its caller what to do in this case.

In particular assume the use case where a command should
be retried but it is not fatal when it fails like:

retry_command CMD1 || retry_command CMD2 || Error "..."

Currently this is not possible.

Accordingly I like to suggest to remove the Error and
replace it with a 'return 1' and a simple generic
Log or LogPrint message i.e. no longer a specific error
message as (somewhat overcomplicated additional)
function argument.

I assume this would - by the way - also solve the
obscure issue with the clean exit (which indicates
that things had become needlessly overcomplicated).

With such a simple and straightforward 'retry_command' function
that only does what its name tells and nothing more
code would become more clear because error messages
are then in the caller and not indirectly in the function.
Cf. RFC 1925 6a:
"It is always possible to add another level of indirection."

E.g. my above example would then look like

# wait for /dev/sda
retry_command test -b /dev/sda || Error "/dev/sda did not appear"
# verify that /dev/sda stays there stable
retry_command test -b /dev/sda || Error "/dev/sda disappeared"
# make /dev/sda1
parted ... mkpart ... /dev/sda || Error "making /dev/sda1 failed"
# wait for /dev/sda1
retry_command test -b /dev/sda1 || Error "/dev/sda1 did not appear"
# verify that /dev/sda1 stays there stable
retry_command test -b /dev/sda1 || Error "/dev/sda1 disappeared"
# make ext2 on /dev/sda1
mkfs.ext2 /dev/sda1 || Error "making ext2 on /dev/sda1 failed"

In contrast to what I wrote above in
https://github.com/rear/rear/pull/1418#issuecomment-315773035
I think meanwhile that the two subsequent explicit calls of
'retry_command test -b ...' plus the specific different error
messages make the code more clear and better understandable
compared to "obscure hidden magic" in a try_and_verify function.

schlomo commented at 2017-07-18 09:11:

@jsmeix very good thoughts. Indeed it is very bad programming practice to die from within function calls. In languages with exceptions (Java, Python ...) functions only report errors but don't decide what to do.

jsmeix commented at 2017-07-18 09:20:

@gozora

I think the command that actually should be retried
according to the original code is

test -r /sys/block/$disk_name/size

because the original code errors out after that comand.

Accordingly I think with a simple 'retry_command' function
the code should be something like this:

retry_command test -r /sys/block/$disk_name/size || Error "..."

gozora commented at 2017-07-18 13:07:

@schlomo @jsmeix any ideas for commit messages are more than welcome :-)

schlomo commented at 2017-07-18 13:24:

@gozora I mean this file (first green line in screenshot):
image

gozora commented at 2017-07-18 13:30:

@schlomo I know what you mean ...
But get_disk_size () is called by following code in 200_partition_layout.sh

devsize=$(get_disk_size ${disk#/sys/block/})

Afterwards (in get_disk_size) you call retry_command ().
If I would omit $( and ) then devsize=123456 becomes

devsize='
123456'

Does it give a sense?

schlomo commented at 2017-07-18 13:45:

@gozora I checked out your branch to understand what is happening.

The get_disk_size function is defined as this

get_disk_size() {
    local disk_name=$1

    local block_size=$(get_block_size ${disk_name%/*})

    $(retry_command test -r /sys/block/$disk_name/size) || Error "Could not determine size of disk $disk_name"

    local nr_blocks=$( < /sys/block/$disk_name/size)
    local disk_size=$(( nr_blocks * block_size ))

    ### Make sure we always return a number
    echo $(( disk_size ))
}

The $( ... ) || Error line is IMHO problematic because it will try to execute whatever is returned by the subshell. In your specific case in most situations the subshell returns nothing, so it seems OK.

But, this whole story reveals a bug in your retry_function: If the command generates no output the retry_function will nevertheless output an empty line. To trace such things I use sudo usr/sbin/rear shell in your branch:

# set -x 
set -x 
# retry_command :
retry_command :
+ retry_command :
+ local retry=0
++ eval :
+++ :
+ command_stdout=
+ echo ''

Here we clearly see that the echo '' is the culprit here. If you replace it with echo -n then everything will be fine and then you can remove the $( ... ) from the get_disk_size function.

gozora commented at 2017-07-18 13:58:

Hello @schlomo,

I need a bit cool down time from this issue ...
I start to feel a bit angry and I don't want to do programming like this.

So if you want, feel free to complete this pull request, or postpone it for v2.3.

Sorry for that.

V.

schlomo commented at 2017-07-18 14:16:

@gozora sure thing and sorry for upsetting you with my nitpicking.

jsmeix commented at 2017-07-18 14:30:

@schlomo
I changed it hopefully right (not yet tested)
according to how I think what you mean.
Could you please have a look and verify if it works?

By the way regarding
https://github.com/rear/rear/pull/1418#issuecomment-316068241

To trace ... things ... use
  usr/sbin/rear shell

Wow!
Now I understand for what "rear shell" is meant to be used.

schlomo commented at 2017-07-18 14:36:

I think it looks very good now. Does anybody actually have a test case for this fix? If not then we'll have to merge it like this.

jsmeix commented at 2017-07-18 14:39:

I am brave and merge it right now as is and then I test it
and if issues appear I fix them...

schlomo commented at 2017-07-18 14:43:

Thanks a lot. This is real team work!

jsmeix commented at 2017-07-18 15:20:

I tested it by adding to init/default/030_update_recovery_system.sh
(I use an 'init' script because those are run for any workflow):

disk_name=sda
disk_size=$( get_disk_size $disk_name)
LogPrint "get_disk_size $disk_name results $disk_size"
disk_name=qqq
disk_size=$( get_disk_size $disk_name)

and it results

e205:~/rear.master # usr/sbin/rear -d -D mkrescue
Relax-and-Recover 2.1 / Git
Using log file: /root/rear.master/var/log/rear/rear-e205.log
get_disk_size sda results 21474836480
ERROR: Could not determine size of disk qqq
Aborting due to an error, check /root/rear.master/var/log/rear/rear-e205.log for details
Terminated

with this in the log file (excerpts):

+ source /root/rear.master/usr/share/rear/init/default/030_update_recovery_system.sh
++ disk_name=sda
+++ get_disk_size sda
+++ local disk_name=sda
++++ get_block_size sda
++++ '[' -r /sys/block/sda/queue/logical_block_size ']'
++++ echo 512
+++ local block_size=512
+++ retry_command test -r /sys/block/sda/size
+++ local retry=0
++++ eval test -r /sys/block/sda/size
+++++ test -r /sys/block/sda/size
+++ command_stdout=
+++ echo -n ''
+++ local nr_blocks=41943040
+++ local disk_size=21474836480
+++ echo 21474836480
++ disk_size=21474836480
++ LogPrint 'get_disk_size sda results 21474836480'
++ Log 'get_disk_size sda results 21474836480'
+++ date '+%Y-%m-%d %H:%M:%S.%N '
++ local 'timestamp=2017-07-18 17:17:01.006874366 '
++ test 1 -gt 0
++ echo '2017-07-18 17:17:01.006874366 get_disk_size sda results 21474836480'
2017-07-18 17:17:01.006874366 get_disk_size sda results 21474836480
++ Print 'get_disk_size sda results 21474836480'
++ test 1
++ echo -e 'get_disk_size sda results 21474836480'
++ disk_name=qqq
+++ get_disk_size qqq
+++ local disk_name=qqq
++++ get_block_size qqq
++++ '[' -r /sys/block/qqq/queue/logical_block_size ']'
++++ echo 512
+++ local block_size=512
+++ retry_command test -r /sys/block/qqq/size
+++ local retry=0
++++ eval test -r /sys/block/qqq/size
+++++ test -r /sys/block/qqq/size
+++ command_stdout=
+++ sleep 1
+++ let retry++
+++ ((  retry >= REAR_MAX_RETRIES  ))
++++ eval test -r /sys/block/qqq/size
+++++ test -r /sys/block/qqq/size
+++ command_stdout=
+++ sleep 1
+++ let retry++
+++ ((  retry >= REAR_MAX_RETRIES  ))
++++ eval test -r /sys/block/qqq/size
+++++ test -r /sys/block/qqq/size
+++ command_stdout=
+++ sleep 1
+++ let retry++
+++ ((  retry >= REAR_MAX_RETRIES  ))
++++ eval test -r /sys/block/qqq/size
+++++ test -r /sys/block/qqq/size
+++ command_stdout=
+++ sleep 1
+++ let retry++
+++ ((  retry >= REAR_MAX_RETRIES  ))
++++ eval test -r /sys/block/qqq/size
+++++ test -r /sys/block/qqq/size
+++ command_stdout=
+++ sleep 1
+++ let retry++
+++ ((  retry >= REAR_MAX_RETRIES  ))
+++ Log 'retry_command '\''test -r /sys/block/qqq/size'\'' failed'
++++ date '+%Y-%m-%d %H:%M:%S.%N '
+++ local 'timestamp=2017-07-18 17:17:06.020432290 '
+++ test 1 -gt 0
+++ echo '2017-07-18 17:17:06.020432290 retry_command '\''test -r /sys/block/qqq/size'\'' failed'
2017-07-18 17:17:06.020432290 retry_command 'test -r /sys/block/qqq/size' failed
+++ return 1
+++ Error 'Could not determine size of disk qqq'
+++ LogPrintError 'ERROR: Could not determine size of disk qqq'
+++ Log 'ERROR: Could not determine size of disk qqq'

[Export of Github issue for rear/rear.]