All Implemented Interfaces:
StatsCollector
public class DistinctBinarySize
extends Object
implements StatsCollector
Collects the number and size of distinct binaries.
We keep references to large binaries in a fixed-size set, so that for large
binaries we have accurate data. For smaller binaries we only have approximate
data, managed in a Bloom filter (also of fixed size). If the set of large
binaries grows beyond its capacity, entries are removed from it and added to
the Bloom filter instead. The size threshold that decides whether a binary
goes to the set or to the Bloom filter is adjusted dynamically. The Bloom
filter can hold about 1 million entries per MB at a false-positive rate of 1%.
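The following is a minimal, illustrative sketch of this strategy, not the actual implementation or API of this class: the names TwoTierCollector and BinaryRef are hypothetical, Guava's BloomFilter stands in for the approximate tier, and the eviction policy (dropping the smallest tracked binary and raising the threshold) is just one plausible reading of the dynamic threshold described above.

    import java.nio.charset.StandardCharsets;
    import java.util.TreeSet;

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    // Hypothetical sketch of the two-tier collection strategy described above.
    class TwoTierCollector {

        // A distinct binary: an identifier (e.g. a blob id) plus its length in bytes.
        record BinaryRef(long size, String id) implements Comparable<BinaryRef> {
            @Override
            public int compareTo(BinaryRef o) {
                int c = Long.compare(size, o.size);
                return c != 0 ? c : id.compareTo(o.id);
            }
        }

        private final int maxLargeEntries;
        private final TreeSet<BinaryRef> largeBinaries = new TreeSet<>();
        private final BloomFilter<CharSequence> smallBinaries;

        private long sizeThreshold;          // binaries at or above this go to the set
        private long smallCount, smallSize;  // approximate totals from the Bloom filter tier
        private long presumedDuplicateSize;  // withheld size, for false-positive compensation

        TwoTierCollector(int maxLargeEntries, long expectedSmallEntries, double fpp) {
            this.maxLargeEntries = maxLargeEntries;
            this.smallBinaries = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), expectedSmallEntries, fpp);
        }

        void add(String id, long size) {
            if (maxLargeEntries > 0 && size >= sizeThreshold) {
                boolean added = largeBinaries.add(new BinaryRef(size, id));
                if (added && largeBinaries.size() > maxLargeEntries) {
                    // The set is full: evict the smallest entry into the Bloom filter
                    // and raise the threshold so binaries of that size bypass the set.
                    BinaryRef evicted = largeBinaries.pollFirst();
                    sizeThreshold = evicted.size() + 1;
                    addSmall(evicted.id(), evicted.size());
                }
            } else {
                addSmall(id, size);
            }
        }

        private void addSmall(String id, long size) {
            if (smallBinaries.put(id)) {
                // Definitely not seen before: count it and accumulate its size.
                smallCount++;
                smallSize += size;
            } else {
                // Presumed duplicate (possibly a false positive): withhold the size,
                // but remember it so the estimate can be corrected at the end.
                presumedDuplicateSize += size;
            }
        }
    }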
With this scheme, for large binaries we have the exact de-duplicated count and
size. For smaller binaries we only have an approximation, due to the nature of
the Bloom filter. To compensate for false positives, the total size of the
presumed duplicates (whose size was not accumulated when adding to the Bloom
filter) is summed up and, at the end of the collection process, multiplied by
the false-positive rate of the Bloom filter.
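As a continuation of the hypothetical sketch above, the final correction might look like this: roughly a false-positive-rate fraction of the presumed duplicates were in fact new binaries whose size was wrongly withheld, so that share is added back.

    // Continuing the sketch: scale the withheld size by the false-positive rate
    // and add it back to the approximate total for small binaries.
    long estimatedSmallBinarySize(double fpp) {
        return smallSize + Math.round(presumedDuplicateSize * fpp);
    }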
Experiments show that for 4 million references with a total size of 17 TB, the
approximation error is nearly zero when allocating 16 MB for the large set and
16 MB for the Bloom filter. When allocating no memory for the large binaries
and only 1 MB for the Bloom filter, the estimation error is 5%. In general,
giving more memory to the Bloom filter seems to be more efficient than using
it to keep exact data for very large binaries, unless the size distribution of
binaries is very skewed (in which case an error on a single very large binary
would have a big effect).