Class DistinctBinarySize

  • All Implemented Interfaces:
    StatsCollector

    public class DistinctBinarySize
    extends java.lang.Object
    implements StatsCollector
    Collects the number and total size of distinct binaries.

    References to large binaries are kept in a fixed-size set, so the data for large binaries is exact. Smaller binaries are tracked only approximately, in a Bloom filter (also of fixed size). If the set of large binaries grows too large, entries are removed from the set and added to the Bloom filter instead. The size threshold that decides whether a binary goes to the set or to the Bloom filter is adjusted dynamically. The Bloom filter can hold about 1 million entries per MB with a false-positive rate of 1%.

    As a result, for large binaries the de-duplicated count and size are exact. For smaller binaries they are only an approximation, due to the nature of the Bloom filter. To compensate for false positives, the total size of the presumed duplicates (binaries whose size was not accumulated when they were added to the Bloom filter) is summed up and, at the end of the collection process, multiplied by the false-positive rate of the Bloom filter.

    Experiments show that for 4 million references with a total size of 17 TB, the approximation error is nearly zero when allocating 16 MB for the set of large binaries and 16 MB for the Bloom filter. When allocating no memory for the large binaries and only 1 MB for the Bloom filter, the estimation error is 5%. In general, it seems that giving more memory to the Bloom filter is more efficient than spending it on exact data for very large binaries, unless the size distribution of binaries is very skewed (in which case an error for a single very large binary would have a big effect).
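
    The approach described above (an exact set for large binaries, a Bloom filter for small ones, and a correction for false positives) can be illustrated with a minimal sketch. This is not the code of this class. The class name DistinctSizeSketch, its methods, the fixed threshold, the unbounded set, the hash mixing, and the way the false-positive rate is estimated from the filter's fill ratio are all assumptions made for illustration. The real class adjusts the threshold dynamically and bounds the set by memory.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; this is NOT the DistinctBinarySize implementation.
final class DistinctSizeSketch {
    private final long threshold;   // assumption: fixed; the real class adjusts it dynamically
    private final Set<String> largeIds = new HashSet<>(); // exact tracking (unbounded here)
    private final BitSet bits;      // Bloom filter bit array
    private final int bitCount;
    private final int hashCount;
    private long exactSize;              // total size of distinct large binaries
    private long approximateSize;        // total size of small binaries that looked new
    private long presumedDuplicateSize;  // total size of small binaries that looked like duplicates

    DistinctSizeSketch(long threshold, int bitCount, int hashCount) {
        this.threshold = threshold;
        this.bitCount = bitCount;
        this.hashCount = hashCount;
        this.bits = new BitSet(bitCount);
    }

    void add(String binaryId, long size) {
        if (size >= threshold) {
            if (largeIds.add(binaryId)) {
                exactSize += size;           // first time this large binary is seen
            }
        } else if (addToBloomFilter(binaryId)) {
            approximateSize += size;         // definitely not seen before
        } else {
            presumedDuplicateSize += size;   // seen before, or a false positive
        }
    }

    // Sets the entry's bits; returns true if at least one bit was newly set.
    private boolean addToBloomFilter(String binaryId) {
        long h = mix(binaryId.hashCode());
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1;       // odd step for double hashing
        boolean isNew = false;
        for (int i = 0; i < hashCount; i++) {
            int index = Math.floorMod(h1 + i * h2, bitCount);
            if (!bits.get(index)) {
                bits.set(index);
                isNew = true;
            }
        }
        return isNew;
    }

    // 64-bit mixing step (the splitmix64 finalizer), used to spread the hash bits.
    private static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    // Assumption: estimate the false-positive rate as (fraction of set bits)^hashCount.
    double falsePositiveRate() {
        double fill = (double) bits.cardinality() / bitCount;
        return Math.pow(fill, hashCount);
    }

    // Adds back the share of presumed duplicates that were likely false positives.
    long estimatedDistinctSize() {
        return exactSize + approximateSize
                + Math.round(presumedDuplicateSize * falsePositiveRate());
    }

    long exactLargeSize() {
        return exactSize;
    }
}
```

    In this sketch, adding the same large binary twice counts its size once, while a repeated small binary contributes its size to the presumed-duplicate total, which is then scaled by the estimated false-positive rate in estimatedDistinctSize().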
    • Constructor Summary

      Constructors 
      Constructor Description
      DistinctBinarySize(long largeBinariesMB, long bloomFilterMB)
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void add(NodeData node)
      Collect data for this node.
      void end()
      End collection.
      java.util.List<java.lang.String> getRecords()
      Get the statistics in the form of a list of records.
      java.lang.String toString()  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • DistinctBinarySize

        public DistinctBinarySize(long largeBinariesMB,
                                  long bloomFilterMB)
    • Method Detail

      • add

        public void add(NodeData node)
        Description copied from interface: StatsCollector
        Collect data for this node.
        Specified by:
        add in interface StatsCollector
        Parameters:
        node - the node
      • getRecords

        public java.util.List<java.lang.String> getRecords()
        Description copied from interface: StatsCollector
        Get the statistics in the form of a list of records.
        Specified by:
        getRecords in interface StatsCollector
        Returns:
        the results
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object