Class DistinctBinarySize

  • All Implemented Interfaces:
    StatsCollector

    public class DistinctBinarySize
    extends Object
    implements StatsCollector
    Collects the number and size of distinct binaries. We keep references to large binaries in a fixed-size set, so that for large binaries, we have accurate data. For smaller binaries, we have approximate data only, managed in a Bloom filter (also with a fixed size). If the set of large binaries grows too large, the entries are removed and instead added to a Bloom filter. The size threshold (which binaries go to the set and which binaries go to the Bloom filter) is changed dynamically. The Bloom filter can collect about 1 million entries per MB with a false-positive rate of 1%. That means that for large binaries, we have the exact de-duplicated count and size. For smaller binaries, we only have an approximation, due to the nature of the Bloom filter. To compensate for false positives, the total size of the presumed duplicates (where the size was not accumulated when adding to the Bloom filter) is summed up, and at the end of the collection process, multiplied with the false-positive rate of the Bloom filter. Experiments show that for 4 million references, total size 17 TB, the approximation error is less than nearly zero when allocating 16 MB for the large set, and 16 MB for the Bloom filter. When not allocating any memory for the large binaries, and only 1 MB for the Bloom filter, the estimation error is 5%. In general it seems that giving more memory to the Bloom filter is more efficient than giving the memory to have accurate data for very large binaries, unless if the size distribution of binaries is very skewed (and so, having an error for a very large binary would have a big effect).
    • Constructor Detail

      • DistinctBinarySize

        public DistinctBinarySize​(long largeBinariesMB,
                                  long bloomFilterMB)