#2377 Issue closed: Remove duplicates in COPY_AS_IS PROGS REQUIRED_PROGS LIBS before processing?¶
Labels: enhancement, cleanup, fixed / solved / done
jsmeix opened issue at 2020-04-22 06:56:¶
Currently the elements in the arrays
COPY_AS_IS PROGS REQUIRED_PROGS LIBS
are processed one by one "as is" by scripts like
build/GNU/Linux/100_copy_as_is.sh
build/GNU/Linux/390_copy_binaries_libraries.sh
so that duplicate array elements are processed as often as they appear.
Nothing goes wrong this way; it only takes more time when there are
duplicates.
In particular the ordering of the elements in the arrays is kept.
Perhaps the ordering in COPY_AS_IS might be important.
I wonder if it is worth the effort to remove duplicates,
in particular to filter out duplicates in a way that keeps the ordering
of the elements.
I will experiment a bit with that...
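As an illustration of what "filter out duplicates while keeping the ordering"
could mean, here is a minimal pure-Bash sketch (not actual ReaR code; it needs
Bash 4 associative arrays, and the 'seen' and 'deduped' names are only illustrative):
declare -A seen
deduped=()
for element in "${COPY_AS_IS[@]}" ; do
    # keep only the first occurrence of each element, in original order
    test "${seen["$element"]}" && continue
    seen["$element"]=1
    deduped+=( "$element" )
done
COPY_AS_IS=( "${deduped[@]}" )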
jsmeix commented at 2020-04-22 09:27:¶
At least for my test case the COPY_AS_IS array
results in only a few duplicates:
# sort /tmp/rear.RX0ACBwlXC2bRiS/tmp/copy-as-is-filelist | uniq -cd
2 /etc/localtime
2 /etc/modules-load.d/
2 /usr/share/kbd/keymaps/legacy/i386/qwerty/defkeymap.map.gz
jsmeix commented at 2020-04-22 12:50:¶
For PROGS and REQUIRED_PROGS duplicates are already removed
in build/GNU/Linux/390_copy_binaries_libraries.sh
via the sort -u
in this code
local all_binaries=( $( for bin in "${PROGS[@]}" "${REQUIRED_PROGS[@]}" ; do
                            bin_path="$( get_path "$bin" )"
                            if test -x "$bin_path" ; then
                                echo $bin_path
                                Log "Found binary $bin_path"
                            fi
                        done 2>>/dev/$DISPENSABLE_OUTPUT_DEV | sort -u ) )
jsmeix commented at 2020-04-22 12:57:¶
For LIBS duplicates are already removed in the following way:
In build/GNU/Linux/390_copy_binaries_libraries.sh there is
local all_libs=( "${LIBS[@]}" $( RequiredSharedObjects "${all_binaries[@]}" "${LIBS[@]}" ) )
and the RequiredSharedObjects
function in lib/linux-functions.sh
function RequiredSharedObjects () {
    ...
    for file_for_ldd in $@ ; do
        ...
        ldd $file_for_ldd
        ...
    done 2>>/dev/$DISPENSABLE_OUTPUT_DEV | awk ' /^\t.+ => not found/ { print "Shared object " $1 " not found" > "/dev/stderr" }
                                                 /^\t.+ => \// { print $3 }
                                                 /^\t\// && !/ => / { print $1 } ' | sort -u
}
has a sort -u
jsmeix commented at 2020-04-23 09:48:¶
Filtering out duplicates and keeping the ordering via
printf '%s\n' "${ARRAY[@]}" | awk '!seen[$0]++'
(awk prints each line only on its first occurrence, when seen[$0] is still zero,
and the ++ afterwards ensures that later duplicates are suppressed)
seems to run rather fast even for huge arrays:
# unset arr
# for i in $( seq 1000 9999 ) ; do arr+=( "element $i with \ * ' ? special chars" ) ; done
# for i in $( seq 9999 -1 1000 ) ; do arr+=( "element $i with \ * ' ? special chars" ) ; done
# printf '%s\n' "${arr[@]}" | wc -l
18000
# printf '%s\n' "${arr[@]}" | head -n3
element 1000 with \ * ' ? special chars
element 1001 with \ * ' ? special chars
element 1002 with \ * ' ? special chars
# printf '%s\n' "${arr[@]}" | tail -n3
element 1002 with \ * ' ? special chars
element 1001 with \ * ' ? special chars
element 1000 with \ * ' ? special chars
# time printf '%s\n' "${arr[@]}" | awk '!seen[$0]++' | wc -l
9000
real 0m0.160s
user 0m0.167s
sys 0m0.028s
# printf '%s\n' "${arr[@]}" | awk '!seen[$0]++' | head -n3
element 1000 with \ * ' ? special chars
element 1001 with \ * ' ? special chars
element 1002 with \ * ' ? special chars
# printf '%s\n' "${arr[@]}" | awk '!seen[$0]++' | tail -n3
element 9997 with \ * ' ? special chars
element 9998 with \ * ' ? special chars
element 9999 with \ * ' ? special chars
For me COPY_AS_IS has about 130 elements,
so here is a more realistic test with an array where each of 200 distinct elements appears twice:
# unset arr
# for i in $( seq 100 299 ) ; do arr+=( "element $i with \ * ' ? special chars" ) ; done
# for i in $( seq 299 -1 100 ) ; do arr+=( "element $i with \ * ' ? special chars" ) ; done
# printf '%s\n' "${arr[@]}" | wc -l
400
# time printf '%s\n' "${arr[@]}" | awk '!seen[$0]++' | wc -l
200
real 0m0.012s
user 0m0.011s
sys 0m0.007s
I used a not so fast Intel Core i3-4000M CPU @ 2.40GHz for those tests,
cf.
https://github.com/rear/rear/issues/2364#issuecomment-616567193
So filtering out duplicates while keeping the ordering in COPY_AS_IS,
which needs to run only once in build/GNU/Linux/100_copy_as_is.sh,
could be a reasonable way to get cleaner working code there.
From my personal point of view our code looks sloppy and careless
when we let tar
needlessly copy duplicated things several times.
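Applied to COPY_AS_IS, that awk based filtering could look like the following
minimal sketch (an illustration of the idea under the assumption that no array
element contains a newline, which holds for the path names in COPY_AS_IS;
it is not necessarily the exact code of the implementation):
mapfile -t COPY_AS_IS < <( printf '%s\n' "${COPY_AS_IS[@]}" | awk '!seen[$0]++' )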
jsmeix commented at 2020-04-23 15:48:¶
Via
https://github.com/rear/rear/pull/2378
I implemented filtering out duplicates in COPY_AS_IS
and also removing duplicates in the copy_as_is_filelist_file.
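For the filelist the same awk filtering could be applied to the file itself,
for example via a sketch like this (the temporary file name is only illustrative):
awk '!seen[$0]++' "$copy_as_is_filelist_file" > "$copy_as_is_filelist_file.dedup"
mv -f "$copy_as_is_filelist_file.dedup" "$copy_as_is_filelist_file"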
jsmeix commented at 2020-04-27 13:29:¶
With
https://github.com/rear/rear/pull/2378
merged
this issue is done.
[Export of Github issue for rear/rear.]