All Implemented Interfaces:
StatsCollector
public class DistinctBinarySize
extends Object
implements StatsCollector
Collects the number and size of distinct binaries.
We keep references to large binaries in a fixed-size set, so that for large
binaries we have accurate data. For smaller binaries we only have approximate
data, managed in a Bloom filter (also of fixed size). If the set of large
binaries grows beyond its capacity, entries are removed from it and added to
the Bloom filter instead. The size threshold that decides whether a binary
goes to the set or to the Bloom filter is adjusted dynamically. The Bloom
filter can hold about 1 million entries per MB at a false-positive rate of 1%.
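The following is a minimal, illustrative sketch of this strategy, not the actual implementation or API of this class: the names TwoTierCollector and BinaryRef are hypothetical, Guava's BloomFilter stands in for the approximate tier, and the eviction policy (dropping the smallest tracked binary and raising the threshold) is just one plausible reading of the dynamic threshold described above.

    import java.nio.charset.StandardCharsets;
    import java.util.TreeSet;

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    // Hypothetical sketch of the two-tier collection strategy described above.
    class TwoTierCollector {

        // A distinct binary: an identifier (e.g. a blob id) plus its length in bytes.
        record BinaryRef(long size, String id) implements Comparable<BinaryRef> {
            @Override
            public int compareTo(BinaryRef o) {
                int c = Long.compare(size, o.size);
                return c != 0 ? c : id.compareTo(o.id);
            }
        }

        private final int maxLargeEntries;
        private final TreeSet<BinaryRef> largeBinaries = new TreeSet<>();
        private final BloomFilter<CharSequence> smallBinaries;

        private long sizeThreshold;          // binaries at or above this go to the set
        private long smallCount, smallSize;  // approximate totals from the Bloom filter tier
        private long presumedDuplicateSize;  // withheld size, for false-positive compensation

        TwoTierCollector(int maxLargeEntries, long expectedSmallEntries, double fpp) {
            this.maxLargeEntries = maxLargeEntries;
            this.smallBinaries = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), expectedSmallEntries, fpp);
        }

        void add(String id, long size) {
            if (maxLargeEntries > 0 && size >= sizeThreshold) {
                boolean added = largeBinaries.add(new BinaryRef(size, id));
                if (added && largeBinaries.size() > maxLargeEntries) {
                    // The set is full: evict the smallest entry into the Bloom filter
                    // and raise the threshold so binaries of that size bypass the set.
                    BinaryRef evicted = largeBinaries.pollFirst();
                    sizeThreshold = evicted.size() + 1;
                    addSmall(evicted.id(), evicted.size());
                }
            } else {
                addSmall(id, size);
            }
        }

        private void addSmall(String id, long size) {
            if (smallBinaries.put(id)) {
                // Definitely not seen before: count it and accumulate its size.
                smallCount++;
                smallSize += size;
            } else {
                // Presumed duplicate (possibly a false positive): withhold the size,
                // but remember it so the estimate can be corrected at the end.
                presumedDuplicateSize += size;
            }
        }
    }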
With this scheme, for large binaries we have the exact de-duplicated count and
size. For smaller binaries we only have an approximation, due to the nature of
the Bloom filter. To compensate for false positives, the total size of the
presumed duplicates (whose size was not accumulated when adding to the Bloom
filter) is summed up and, at the end of the collection process, multiplied by
the false-positive rate of the Bloom filter.
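As a continuation of the hypothetical sketch above, the final correction might look like this: roughly a false-positive-rate fraction of the presumed duplicates were in fact new binaries whose size was wrongly withheld, so that share is added back.

    // Continuing the sketch: scale the withheld size by the false-positive rate
    // and add it back to the approximate total for small binaries.
    long estimatedSmallBinarySize(double fpp) {
        return smallSize + Math.round(presumedDuplicateSize * fpp);
    }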
Experiments show that for 4 million references with a total size of 17 TB, the
approximation error is nearly zero when allocating 16 MB for the large set and
16 MB for the Bloom filter. When allocating no memory for the large binaries
and only 1 MB for the Bloom filter, the estimation error is 5%. In general,
giving more memory to the Bloom filter seems to be more efficient than using
it to keep exact data for very large binaries, unless the size distribution of
binaries is very skewed (in which case an error on a single very large binary
would have a big effect).