#3177 PR merged: New unique_unsorted() function

Labels: enhancement, fixed / solved / done

jsmeix opened issue at 2024-03-08 15:11:

  • Type: Enhancement

  • Impact: High
    High impact where needed.
    No impact esewhere.

  • Reference to related issue (URL):

Triggred by my experiments in
https://github.com/rear/rear/pull/3175
how 'tar' behaves when exact same files or directories
are provided as 'tar' arguments to be archived.

  • How was this pull request tested?

See
https://github.com/rear/rear/pull/3175#issuecomment-1985907542
But I did not yet test "rear recover".

  • Description of the changes in this pull request:

In lib/global-functions.sh added a
new unique_unsorted() function
that outputs lines in STDIN or in a file
without subsequent duplicate lines
which keeps the ordering of the lines.

pcahyna commented at 2024-03-08 15:18:

@jsmeix there is sort -u added in #3175 , isn't that enough? Should be, unless we require preserving the order - do we? To us (me and @lzaoral ) it seemed that the order is quite unpredictable / random anyway.

jsmeix commented at 2024-03-08 15:46:

I want to preserve the ordering to be on the safe side.

I have no real-world use-case why the ordering is important.

It is just my gut feeling that in some special cases
the ordering could be crucial because for 'tar'
the ordering of the arguments does make a difference
(what is stored where in the archive) and in some special
cases that could be crucial.

Only what comes to my mind offhandedly right now:

# tar -czf fullbackup.tgz / /database

might behave rather different during restore compared to

# tar -czf fullbackup.tgz /database /

(under the assumption that /database is on its own filesystem)
i.e.
first the basic system gets restored (some Gigabytes),
then the database gets restored (some Terabytes)
versus
first the the database gets restored (some Terabytes),
then the basic system gets restored (some Gigabytes)

As an exercise for the reader:
For both cases estimate the probability
that restore of the basic system fails
;-)

jsmeix commented at 2024-03-08 15:52:

Sigh!
I guess it is ReaR's careless code why
the order is quite unpredictable / random anyway
when one leaves such things to ReaR's automatism
and that is one reason behind why I implemented
BACKUP_ONLY_INCLUDE and BACKUP_ONLY_EXCLUDE
to provide "final power to the user".

So

BACKUP_ONLY_INCLUDE="yes"
BACKUP_PROG_INCLUDE=( / /database )

versus

BACKUP_ONLY_INCLUDE="yes"
BACKUP_PROG_INCLUDE=( /database / )

(under the assumption that /database is on its own filesystem)
should behave rather different during restore
but I did not test it - something to do for next week.

jsmeix commented at 2024-03-08 16:11:

I avoid false expectations about this pull request:
This one does not help when when a filesystem/btrfs subvolume
is mounted more than once at different mountpoint directories
because this one only skips exactly same duplicated arguments
what 'tar' and 'rsync' should archive so this pull request
is meant only to help when by accident exactly same arguments
are provided more than once to 'tar' and 'rsync'.

I think it can never be needed that same arguments
are provided more than once to 'tar' and 'rsync'
so I think it is (hopefully) safe to always ignore
exactly same duplicated arguments.

pcahyna commented at 2024-03-08 16:15:

@jsmeix I would characterize it as a way to avoid sort -u in https://github.com/rear/rear/pull/3175 and thus preserve the order of tar arguments in cases where the order matters ( e.g. explicit BACKUP_PROG_INCLUDE setting ). Correct me if I am wrong.

pcahyna commented at 2024-03-08 16:32:

Great @schlomo I had independently exactly the same idea for the function's name!

jsmeix commented at 2024-03-11 06:57:

Thank you for the better function name!
I will use uniq_unsorted

I was thinking about a shorter function name
that still clearly tells what that function does
but had no better idea than the long verbose name.

jsmeix commented at 2024-04-09 12:42:

I am wondering if an even better function name
could be unique_unsorted with spelled out 'unique'
to make it more clear that this function
is something "on its own" which is in particular
not a wrapper or some enhanced reimplementation
of the uniq program (with all its options)?

And in general I don't like old *nix style
crppld wrds lk 'uniq_usrtd' ;-)

jsmeix commented at 2024-04-11 09:23:

For now I kept the if check for a filename argument
because in the STDIN case I like to timeout after 5 seconds
when there is nothing at STDIN or when STDIN does not get closed
to avoid that 'awk' waits endlessly for (further) input
which would make ReaR behave as if it hung up, cf.
https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness

@schlomo
I assume all this can be implemented in a single case
but I would need time to find out how to do that properly.
If a proper implementation in a single case
is obvious to you, please show it to me here.

jsmeix commented at 2024-04-12 12:35:

@rear/contributors
if there are no objections I will merge it today
a bit after 15:00 CEST (13:00 UTC).

schlomo commented at 2024-04-12 12:46:

Looks good to me.

Maybe silly question:

awk "..." "$@" doesn't work? It would save you the if clause

jsmeix commented at 2024-04-12 13:31:

@schlomo
I merged it "as is" because I had tested that at least a bit
and - at least currently - I don't find time to test a rather
different implementation.

Furthermore I would have to think about
if a check if stdin is a tty makes sense, cf.
https://github.com/rear/rear/pull/3177#discussion_r1562435947
and if yes if then it helps to have the stdin case
well separated from the "input from file" case,
cf. RFC 1925 item (5)

It is always possible to aglutenate [sic] multiple separate problems
into a single complex interdependent solution. In most cases
this is a bad idea.

[Export of Github issue for rear/rear.]