#2377 Issue closed: Remove duplicates in COPY_AS_IS PROGS REQUIRED_PROGS LIBS before processing?

Labels: enhancement, cleanup, fixed / solved / done

jsmeix opened issue at 2020-04-22 06:56:

Currently the elements in the arrays
COPY_AS_IS PROGS REQUIRED_PROGS LIBS
are processed one by one "as is" by scripts like
build/GNU/Linux/100_copy_as_is.sh
build/GNU/Linux/390_copy_binaries_libraries.sh
so that duplicate array elements are processed as often as they appear.

This way nothing goes wrong; it only takes more time if there are duplicates.
In particular the ordering of the elements in the arrays is kept.
Perhaps the ordering in COPY_AS_IS might be important.

I wonder if it is worth the effort to remove duplicates,
in particular to filter out duplicates in a way that keeps the ordering of the elements.

I will experiment a bit with that...
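
For reference, one possible way to do such order-preserving
filtering in plain bash could be a loop with an associative array.
This is only a sketch (it assumes bash 4, the variable names are
made up here and it is not necessarily what gets implemented):

# Keep only the first occurrence of each COPY_AS_IS element
# (an associative array remembers what was already seen):
declare -A seen=()
deduplicated_copy_as_is=()
for element in "${COPY_AS_IS[@]}" ; do
    if ! test "${seen[$element]:-}" ; then
        deduplicated_copy_as_is+=( "$element" )
        seen["$element"]=1
    fi
done
COPY_AS_IS=( "${deduplicated_copy_as_is[@]}" )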

jsmeix commented at 2020-04-22 09:27:

At least for my test case the COPY_AS_IS array
results in only a few duplicates:

# sort /tmp/rear.RX0ACBwlXC2bRiS/tmp/copy-as-is-filelist | uniq -cd
      2 /etc/localtime
      2 /etc/modules-load.d/
      2 /usr/share/kbd/keymaps/legacy/i386/qwerty/defkeymap.map.gz
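
For comparison, the COPY_AS_IS array itself can be checked
the same way (a hedged aside that assumes it is run in a place
where COPY_AS_IS is already fully populated by the rear workflow):

# printf '%s\n' "${COPY_AS_IS[@]}" | sort | uniq -cd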

jsmeix commented at 2020-04-22 12:50:

For PROGS and REQUIRED_PROGS duplicates are already removed
in build/GNU/Linux/390_copy_binaries_libraries.sh
via the sort -u in this code:

local all_binaries=( $( for bin in "${PROGS[@]}" "${REQUIRED_PROGS[@]}" ; do
                            bin_path="$( get_path "$bin" )"
                            if test -x "$bin_path" ; then
                                echo $bin_path
                                Log "Found binary $bin_path"
                            fi
                        done 2>>/dev/$DISPENSABLE_OUTPUT_DEV | sort -u ) )
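
The ordering of the binaries does not seem to matter there,
so sort -u is sufficient.
As an aside (not ReaR code), sort -u removes duplicates
but does not keep the original ordering, for example:

# printf '%s\n' zsh bash zsh awk | sort -u
awk
bash
zsh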

jsmeix commented at 2020-04-22 12:57:

For LIBS duplicates are already removed in the following way:

In build/GNU/Linux/390_copy_binaries_libraries.sh there is

local all_libs=( "${LIBS[@]}" $( RequiredSharedObjects "${all_binaries[@]}" "${LIBS[@]}" ) )

and the RequiredSharedObjects function in lib/linux-functions.sh

function RequiredSharedObjects () {
    ...
    for file_for_ldd in $@ ; do
        ...
        ldd $file_for_ldd
        ...
    done 2>>/dev/$DISPENSABLE_OUTPUT_DEV | awk ' /^\t.+ => not found/ { print "Shared object " $1 " not found" > "/dev/stderr" }
                                                 /^\t.+ => \// { print $3 }
                                                 /^\t\// && !/ => / { print $1 } ' | sort -u
}

has a sort -u at the end of its output pipeline.
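
As a minimal standalone illustration (not ReaR code) of that
awk | sort -u pipeline, listing the same binary twice
still prints each required shared object only once:

# ldd /bin/bash /bin/bash | awk '/^\t.+ => \// { print $3 } /^\t\// && !/ => / { print $1 }' | sort -u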

jsmeix commented at 2020-04-23 09:48:

Filtering out duplicates and keeping the ordering via

printf '%s\n' "${ARRAY[@]}" | awk '!seen[$0]++'

seems to run rather fast even for huge arrays:

# unset arr

# for i in $( seq 1000 9999 ) ; do arr+=( "element $i with \ * ' ? special chars" ) ; done

# for i in $( seq 9999 -1 1000 ) ; do arr+=( "element $i with \ * ' ? special chars" ) ; done

# printf '%s\n' "${arr[@]}" | wc -l
18000

# printf '%s\n' "${arr[@]}" | head -n3
element 1000 with \ * ' ? special chars
element 1001 with \ * ' ? special chars
element 1002 with \ * ' ? special chars

# printf '%s\n' "${arr[@]}" | tail -n3
element 1002 with \ * ' ? special chars
element 1001 with \ * ' ? special chars
element 1000 with \ * ' ? special chars

# time printf '%s\n' "${arr[@]}" | awk '!seen[$0]++' | wc -l
9000

real    0m0.160s
user    0m0.167s
sys     0m0.028s

# printf '%s\n' "${arr[@]}" | awk '!seen[$0]++' | head -n3
element 1000 with \ * ' ? special chars
element 1001 with \ * ' ? special chars
element 1002 with \ * ' ? special chars

# printf '%s\n' "${arr[@]}" | awk '!seen[$0]++' | tail -n3
element 9997 with \ * ' ? special chars
element 9998 with \ * ' ? special chars
element 9999 with \ * ' ? special chars

For me COPY_AS_IS has about 130 elements,
so here is a more realistic test with an array that has 200 duplicate elements:

# unset arr

# for i in $( seq 100 299 ) ; do arr+=( "element $i with \ * ' ? special chars" ) ; done
# for i in $( seq 299 -1 100 ) ; do arr+=( "element $i with \ * ' ? special chars" ) ; done

# printf '%s\n' "${arr[@]}" | wc -l
400

# time printf '%s\n' "${arr[@]}" | awk '!seen[$0]++' | wc -l
200

real    0m0.012s
user    0m0.011s
sys     0m0.007s

I used a not so fast Intel Core i3-4000M CPU @ 2.40GHz for those tests,
cf. https://github.com/rear/rear/issues/2364#issuecomment-616567193

So filtering out duplicates while keeping the ordering in COPY_AS_IS,
which needs to run only once in build/GNU/Linux/100_copy_as_is.sh,
could be a reasonable thing to do to get cleaner working code there.
From my personal point of view our code looks sloppy and careless
when we let tar needlessly copy duplicated things several times.
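
A possible way to apply that in build/GNU/Linux/100_copy_as_is.sh
could look like the following sketch (hedged: readarray -t requires
bash 4, elements must not contain newlines, and it is not necessarily
how the actual implementation will look):

# Rebuild COPY_AS_IS with duplicates removed while keeping the ordering
# (the first occurrence of each element wins):
readarray -t COPY_AS_IS < <( printf '%s\n' "${COPY_AS_IS[@]}" | awk '!seen[$0]++' )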

jsmeix commented at 2020-04-23 15:48:

Via https://github.com/rear/rear/pull/2378
I implemented filtering out duplicates in COPY_AS_IS
and also removing duplicates in the copy_as_is_filelist_file.

jsmeix commented at 2020-04-27 13:29:

With https://github.com/rear/rear/pull/2378 merged
this issue is done.


[Export of Github issue for rear/rear.]