#1370 Issue closed: ERROR: BUG BUG BUG! Could not determine size of disk sdd/sdd1

Labels: discuss / RFC, fixed / solved / done, minor bug

mtquattro opened issue at 2017-05-20 02:28:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • rear version (/usr/sbin/rear -V): Relax-and-Recover 1.17.2 / Git

  • OS version (cat /etc/rear/os.conf or lsb_release -a): CentOS Linux release 7.3.1611 (Core)

  • rear configuration files (cat /etc/rear/site.conf or cat /etc/rear/local.conf):
    OUTPUT=USB
    BACKUP=NETFS
    BACKUP_URL=usb:///dev/disk/by-label/REAR-000
    USB_DEVICE=/dev/disk/by-label/REAR-000

  • Are you using legacy BIOS or UEFI boot? NO

  • Brief description of the issue:
    Problem occurs while running rear -v rescue. Reprodusable.

+++ StopIfError 'Partition sdd1 is numbered '''1'''. More than 128 partitions is not supported.'
+++ (( 0 != 0 ))
+++ echo 1
++ partition_nr=1
++ partition_prefix=sdd
+++ get_disk_size sdd/sdd1
+++ local disk_name=sdd/sdd1
++++ get_block_size sdd
++++ '[' -r /sys/block/sdd/queue/logical_block_size ']'
++++ echo 512
+++ local block_size=512
+++ '[' -r /sys/block/sdd/sdd1/size ']'
+++ BugIfError 'Could not determine size of disk sdd/sdd1, please file a bug.'
+++ (( 1 != 0 ))
+++ BugError 'Could not determine size of disk sdd/sdd1, please file a bug.'
+++ '[' Could not determine size of disk sdd/sdd1, please file a bug. -eq Could not determine size of disk sdd/sdd1, please file a bug. ']'
+++ EXIT_CODE=1
+++ Error 'BUG BUG BUG! ' 'Could not determine size of disk sdd/sdd1, please file a bug.' '
=== Issue report ===
Please report this unexpected issue at: https://github.com/rear/rear/issues
Also include the relevant bits from /var/log/rear/rear-dnxlab-300-2.log

  • Work-around, if any:

gozora commented at 2017-05-20 09:32:

Hi @mtquattro,

I'm observing this behavior from time to time when I test ReaR backup on USB. I did not examine exact reason (maybe udev doing something with disk links), but what helps me to avoid such situation is following:

locate get_disk_size () in /usr/share/rear/lib/layout-functions.sh and add sleep 5 before local block_size=$(get_block_size ${disk_name%/*}) so the final result looks something like this:

get_disk_size() {
    local disk_name=$1

    sleep 5
    local block_size=$(get_block_size ${disk_name%/*})

    [ -r /sys/block/$disk_name/size ]
    BugIfError "Could not determine size of disk $disk_name, please file a bug."

    local nr_blocks=$( < /sys/block/$disk_name/size)
    local disk_size=$(( nr_blocks * block_size ))

    ### Make sure we always return a number
    echo $(( disk_size ))
}

You can download/check my copy of ReaR in https://github.com/gozora/rear for more details.

I'll not create pull request as this workaround is very naive and can largely increase backup time of "enterprise" systems with lots of disks.

I would appreciate if you could test if it resolves your problem and let us know. Then we can maybe find some smarter way how to do the implementation. (Maybe a loop with counter, or apply sleep time only for OUTPUT=USB?)

V.

mtquattro commented at 2017-05-21 21:30:

That worked.Thanks.

-------- Original Message --------
Subject: Re: [rear/rear] ERROR: BUG BUG BUG! Could not determine size of
disk sdd/sdd1 (#1370)
From: Vladimir Gozora notifications@github.com
Date: Sat, May 20, 2017 2:32 am
To: rear/rear rear@noreply.github.com
Cc: mtquattro misha@tomushev.com, Mention
mention@noreply.github.com

Hi @mtquattro, I'm observing this behavior from time to time when I test ReaR backup on USB. I did not examine exact reason (maybe udev doing something with disk links), but what helps me to avoid such situation is following: locate get_disk_size () in /usr/share/rear/lib/layout-functions.sh and add sleep 5 before local block_size=$(get_block_size ${disk_name%/*}) so the final result looks something like this: get_disk_size() {
local disk_name=$1

sleep 5
local block_size=$(get_block_size ${disk_name%/*})

[ -r /sys/block/$disk_name/size ]
BugIfError "Could not determine size of disk $disk_name, please file a bug."

local nr_blocks=$( < /sys/block/$disk_name/size)
local disk_size=$(( nr_blocks * block_size ))

### Make sure we always return a number
echo $(( disk_size ))

}
You can download/check my copy of ReaR in https://github.com/gozora/rear for more details. I'll not create pull request as this workaround is very naive and can largely increase backup time of "enterprise" systems with lots of disks. I would appreciate if you could test if it resolves your problem and let us know. Then we can maybe find some smarter way how to do the implementation. (Maybe a loop with counter, or apply sleep time only for BACKUP=USB?) V. —You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.

gozora commented at 2017-05-22 06:21:

@gdha, @jsmeix

What would you say if we update code as follows:

- local block_size=$(get_block_size ${disk_name%/*})
+ local block_size=$(get_block_size ${disk_name%/*} || sleep 5; get_block_size ${disk_name%/*})

?

V.

schlomo commented at 2017-05-22 06:37:

IIRC || binds closer than ; so that in this form you will call get_block_size always a second time. IMHO this is a good place to write a proper if clause instead.

gozora commented at 2017-05-22 06:41:

Thanks @schlomo!

gozora commented at 2017-05-22 08:40:

Another way that could be even more efficient could be:

+ my_udevsettle
local block_size=$(get_block_size ${disk_name%/*})
...

But I need to test if this works sufficiently well ...

V.

jsmeix commented at 2017-05-22 09:20:

I didn't check the details but in general
when a 'sleep' before a device node access avoids errors
it indicates that some udev related weirdness happens, see
https://github.com/rear/rear/issues/791
and see my currently used code in
https://github.com/rear/rear/issues/1327#issuecomment-296674131

Accordingly I think "the ultimate really right fix" ;-)
could be something like

# Wait until the block device node exists:
for pass in trial verification ; do
    until test -b $block_device ; do
        LogPrint "Waiting until $block_device exists (retrying in 1 second)"
        sleep 1
    done
done

Why two passes "trial" plus "verification" ?
See "disappear and reappear" in
https://github.com/rear/rear/issues/791#issuecomment-211318620

Alternatively the easiest and perhaps in the end the best what
ReaR could do is to play dumb and retry the actual command
until it succeeds:

until COMMAND $block_device ; do
    LogPrint "Retrying 'COMMAND $block_device' in 1 second"
    sleep 1
done

In particular when all what ReaR could do is to error out
when 'COMMAND $block_device' fails which would not help
the user in any useful way in particular during "rear recover"
I think it is better when ReaR retries endlessly and verbosely.
The advantage is that while ReaR is endlessly retrying
the user could run any commands that may solve it.
E.g. try yourself:

# until cat /etc/issue /tmp/qqq ; do sleep 1 ; done

which retries endlessly until you create /tmp/qqq.

FYI:
I asked a colleague what the best way is how do deal with
udev's magically appearing (and disappearing) device nodes.
He replied:
All what one can do is to wait for the device node.
For background information and reasoning read
https://github.com/rear/rear/issues/791
I am not at all a udev expert I only collected what I noticed.

gozora commented at 2017-05-22 10:17:

Thanks @jsmeix,

I'm a bit concerned about possible endless loop though ...
What about breaking the loop after e.g. 10 retries ?

V.

gozora commented at 2017-05-22 10:20:

@gdha regarding the milestone I've set, I think that this is not a big deal bug so it can wait until 2.2 release.

jsmeix commented at 2017-05-22 12:28:

@gozora
what would you suggest that ReaR should do
after e.g. 10 (or any finite number of) retries?

gozora commented at 2017-05-22 12:46:

@jsmeix

I'd let it continue in execution of code (hence let it crash):

 [ -r /sys/block/$disk_name/size ]
    BugIfError "Could not determine size of disk $disk_name, please file a bug."

but change BugIfError to StopIfError ...

V.

jsmeix commented at 2017-05-22 13:03:

@gozora
what is the advantage in practice for the user of "let it crash"
compared to endless and verbose retries?

I think the user can at any time press [Ctrl]+[C] to abort ReaR
(this is of course a required precondition for endless retries).

I am not against error abort but I wonder why it is better
compared to endless retries.

Both are not the intended behaviour but perhaps endless retries
could behave better (i.e. "less fatal") in practice?

Or in other words:
With endless retries the decision to abort is at the user
and as long as it retries the user could do something
that makes it continue.

Of course endless retries only make sense when there is
a chance that it could later continue (which is the case here).

gozora commented at 2017-05-22 13:46:

@jsmeix
First and foremost normally you don't start rear mkbackup from shell but using Cron.
Now imagine you have rear scheduled in Cron on hundreds of server and you magically hit some some bug and end in endless loop ;-).
Letting it crash with reasonable message can be detected by monitoring SW and appropriate action can be taken.
If you let rear running endlessly, you even don't have to realize that something is wrong ...

V.

jsmeix commented at 2017-05-23 09:22:

@gozora
thanks for your info - now I understand!

I had overlooked that get_disk_size() is called both in
scripts in layout/prepare (i.e. during "rear recover") and
also in scripts in layout/save (i.e. during "rear mkrescue/mkbackup").

I had the wrong assumption that udev related delay issues
could happen only during "rear recover" while block device
nodes are created.

Now I learned that udev related delay issues could also
happen during "rear mkrescue/mkbackup".

Now I understand and agree that there must be an upper limit
how often and/or how long ReaR retires
until it gives up and errors out.

This leads to the problem what the right value for such
an upper limit should be.

Assume 10 retries is hardcoded then there is the question
what is different after 10 retries to error out instead of
doing an 11th retry (because users on Fubar-Linux
need at least 11 retries ;-)

Therefore there must be config variables
(with reasonable default settings)
so that the user can - if needed - specify
the upper limits as he needs - something like
REAR_TIMEOUT=60
and
REAR_MAX_RETRIES=10
probably also a
REAR_SLEEP_DELAY=1
to specify how fast ReaR retries.

gozora commented at 2017-05-23 10:30:

Hello @jsmeix,

Yes I was thinking about similar variable introduction as well.
I would however recommend to use only REAR_MAX_RETRIES and REAR_SLEEP_DELAY. REAR_TIMEOUT looks a bit useless as ti can be calculated as REAR_MAX_RETRIES * REAR_SLEEP_DELAY = REAR_TIMEOUT.

What do you think?

V.

jsmeix commented at 2017-05-23 11:10:

I mentioned REAR_TIMEOUT only for completeness.
For this particular issue it is not needed.

In general REAR_TIMEOUT could be used for
any command that may not return after a reasonable time
i.e. for any command that may actually hang up
or process longer than one can wait,
cf. things like https://github.com/rear/rear/issues/1283

FYI
in general regarding how one could timeout any command
you may have a look at my test_remote_* scripts in
https://github.com/yast/yast-printer/tree/master/tools
that implement it via different kind of "time bombs".

For example time bombs with fixed waiting time in any case
as in
https://github.com/yast/yast-printer/blob/master/tools/test_remote_socket

if ( ( echo -n '' >/dev/tcp/$HOST/$PORT ) & ECHO_PID=$! ; sleep 2s ; kill $ECHO_PID &>/dev/null ; wait $ECHO_PID )
then # Test whether port on host accepts data:
     if ( ( echo -en '\r' >/dev/tcp/$HOST/$PORT ) & ECHO_PID=$! ; sleep 2s ; kill $ECHO_PID &>/dev/null ; wait $ECHO_PID )
     then echo -en "\nPort '$PORT' on host '$HOST' accepts data\n"
          exit 0
     fi
     echo -en "\nConnection possible to port '$PORT' on host '$HOST' but does not accept data\n"
     exit 4
fi

Or a time bomb that only waits when really needed as in
https://github.com/yast/yast-printer/blob/master/tools/test_remote_lpd

$NETCAT -w $TIMEOUT -p $PORT $HOST 515 <$NETCAT_IN >$NETCAT_OUT 2>/dev/null &
NETCAT_PID=$!
{ sleep ${TIMEOUT}s ; kill $NETCAT_PID &>/dev/null ; } &

The former example uses hardcoded 'sleep 2s'
where ReaR should use 'sleep $REAR_SLEEP_DELAY'.

The latter example uses TIMEOUT in a way
how REAR_TIMEOUT could be used in general.

jsmeix commented at 2017-05-23 11:22:

Hmmm...

With the generic REAR_TIMEOUT idea in mind
there could be perhaps even a global timeout like
REAR_TOTAL_TIMEOUT=3600
that lets a whole "rear mkbackup/mkrescue/recover" time out
by a global timebomb that kills the main script.

The use case and reasoning behind would be automated
usage like running "rear mkbackup/mkrescue" by cron
or an automated "rear recover" to get a failure/crash result
after a maximum overall time instead of a possibly endlessly
running 'rear ...' call,
cf. https://github.com/rear/rear/issues/1370#issuecomment-303104688

jsmeix commented at 2017-05-24 07:44:

@gozora
FYI in general regarding "timeout"
see also https://github.com/rear/rear/issues/1366

For ReaR 2.2 we should provide some generically working
basic support for various kind of "timeout" stuff.

jsmeix commented at 2017-07-14 10:48:

@gozora
could you have a look if this issue could be solved for ReaR v 2.2
or if it should be postponed to ReaR v 2.3 cf.
https://github.com/rear/rear/issues/1398#issuecomment-315318878

gozora commented at 2017-07-14 12:14:

Hello @jsmeix,

I guess I can make fix for this issue in v2.2.

Just for summary, from what I've read in communication of this issue, preferred solution is following:

# Wait until the block device node exists:
for pass in trial verification ; do
    until test -b $block_device ; do
        LogPrint "Waiting until $block_device exists (retrying in 1 second)"
        sleep 1
    done
done

Combined with loop break if timeout (REAR_MAX_RETRIES * REAR_SLEEP_DELAY) is reached, is that correct?

V.

jsmeix commented at 2017-07-18 14:48:

With https://github.com/rear/rear/pull/1418 merged
this issue should be fixed.

jsmeix commented at 2017-07-18 15:13:

Only FYI:
Perhaps this issue is not yet properly fixed because
the current get_disk_size function calls get_block_size
before it tests /sys/block/$disk_name/size
but get_block_size also aceesses /sys/block/$disk_name
namely /sys/block/$disk_name/queue/logical_block_size
and I assume when udev is not yet ready the whole
/sys/block/$disk_name stuff is not yet there but
fortunately get_block_size has a fallback when
/sys/block/$disk_name/queue/logical_block_size
is not accessible so that in most cases things
should work by better luck now...


[Export of Github issue for rear/rear.]