Class BytesRefHash

java.lang.Object
org.apache.lucene.util.BytesRefHash

public final class BytesRefHash extends Object
BytesRefHash is a special-purpose, hash-map-like data structure optimized for BytesRef instances. It maintains mappings from byte arrays to ids (conceptually Map<BytesRef,int>), storing the hashed bytes efficiently in contiguous storage. The mapping to the id is encapsulated inside BytesRefHash, and the id is guaranteed to increase for each added BytesRef.

Note: A BytesRef instance passed to add(BytesRef) must not be longer than ByteBlockPool.BYTE_BLOCK_SIZE-2 bytes. The internal storage is limited to 2GB of total byte storage.

  • Field Details

  • Constructor Details

  • Method Details

    • size

      public int size()
      Returns the number of BytesRef values in this BytesRefHash.
      Returns:
      the number of BytesRef values in this BytesRefHash.
    • get

      public BytesRef get(int bytesID, BytesRef ref)
      Populates and returns a BytesRef with the bytes for the given bytesID.

      Note: the given bytesID must be a non-negative integer less than the current size (size()).

      Parameters:
      bytesID - the id
      ref - the BytesRef to populate
      Returns:
      the given BytesRef instance populated with the bytes for the given bytesID
    • sort

      public int[] sort(Comparator<BytesRef> comp)
      Returns the array of ids, sorted by the byte values they reference.

      Note: This is a destructive operation. clear() must be called in order to reuse this BytesRefHash instance.

      Parameters:
      comp - the Comparator used for sorting
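      The idea of sorting ids by the bytes they reference can be sketched without Lucene. The class below is a toy model, not the library's implementation: SortIdsSketch, sortedIds and UNSIGNED_ORDER are hypothetical names, the values are plain byte[] arrays indexed by id, and unlike sort(Comparator) it is not destructive.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy model (hypothetical, not Lucene code): order ids by the byte values they reference. */
public class SortIdsSketch {

    /** Unsigned lexicographic byte order; requires Java 9+ for Arrays.compareUnsigned. */
    public static final Comparator<byte[]> UNSIGNED_ORDER =
            (x, y) -> Arrays.compareUnsigned(x, y);

    /**
     * Returns the ids 0..valuesById.length-1, ordered so that the referenced
     * values are ascending according to comp. The input array is left untouched.
     */
    public static int[] sortedIds(byte[][] valuesById, Comparator<byte[]> comp) {
        Integer[] boxed = new Integer[valuesById.length];
        for (int i = 0; i < boxed.length; i++) {
            boxed[i] = i;
        }
        // Compare ids indirectly through the values they point at.
        Arrays.sort(boxed, (a, b) -> comp.compare(valuesById[a], valuesById[b]));

        int[] ids = new int[boxed.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = boxed[i];
        }
        return ids;
    }
}
```

      With values {3}, {1}, {2} stored under ids 0, 1 and 2, the sketch returns the ids in the order 1, 2, 0.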
    • clear

      public void clear(boolean resetPool)
      Clears this BytesRefHash, removing all entries. If resetPool is true, the underlying ByteBlockPool is reset as well.
    • clear

      public void clear()
      Clears this BytesRefHash; equivalent to clear(true).
    • close

      public void close()
      Closes the BytesRefHash and releases all internally used memory
    • add

      public int add(BytesRef bytes)
      Adds a new BytesRef
      Parameters:
      bytes - the bytes to hash
      Returns:
      the id the given bytes are hashed to if there was no mapping for the given bytes, otherwise (-(id)-1). This guarantees that the return value is always >= 0 if the given bytes have not been hashed before, and negative otherwise.
      Throws:
      BytesRefHash.MaxBytesLengthExceededException - if the given bytes are longer than ByteBlockPool.BYTE_BLOCK_SIZE - 2
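      The return-value contract of add and find is easy to misread, so here is a minimal sketch of it. ToyBytesHash is a hypothetical stand-in built on java.util.HashMap; it mimics only the id and (-(id)-1) convention described above and says nothing about how BytesRefHash stores bytes internally.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in (hypothetical, not Lucene code) that mimics the add/find return contract. */
public class ToyBytesHash {
    // ByteBuffer compares by content, so it can serve as a map key for byte arrays.
    private final Map<ByteBuffer, Integer> ids = new HashMap<>();

    /** Returns a new id (>= 0) for unseen bytes, or -(id)-1 if the bytes were already added. */
    public int add(byte[] bytes) {
        ByteBuffer key = ByteBuffer.wrap(bytes.clone());
        Integer existing = ids.get(key);
        if (existing != null) {
            return -existing - 1;
        }
        int id = ids.size(); // ids are handed out in increasing order: 0, 1, 2, ...
        ids.put(key, id);
        return id;
    }

    /** Returns the id of the given bytes, or -1 if there is no mapping. */
    public int find(byte[] bytes) {
        Integer existing = ids.get(ByteBuffer.wrap(bytes));
        return existing == null ? -1 : existing;
    }

    public int size() {
        return ids.size();
    }

    /** Recovers the id from an add() result, whether the bytes were new or already present. */
    public static int toId(int addResult) {
        return addResult >= 0 ? addResult : -addResult - 1;
    }
}
```

      A caller that only needs the id, regardless of whether the value was new, can normalize the result with a helper such as toId above; a caller that needs to know whether the value was new checks the sign first.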
    • add

      public int add(BytesRef bytes, int code)
      Adds a new BytesRef with a pre-calculated hash code.
      Parameters:
      bytes - the bytes to hash
      code - the bytes hash code

      Hashcode is defined as:

       int hash = 0;
       for (int i = offset; i < offset + length; i++) {
         hash = 31 * hash + bytes[i];
       }
       
      Returns:
      the id the given bytes are hashed to if there was no mapping for the given bytes, otherwise (-(id)-1). This guarantees that the return value is always >= 0 if the given bytes have not been hashed before, and negative otherwise.
      Throws:
      BytesRefHash.MaxBytesLengthExceededException - if the given bytes are longer than ByteBlockPool.BYTE_BLOCK_SIZE - 2
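      The hash code definition given for the code parameter can be wrapped in a small helper. This is a sketch of that stated formula only; BytesHashCode and hash are hypothetical names, and the parameters mirror the bytes, offset and length fields of a BytesRef.

```java
/** Sketch of the hash code formula stated above (hypothetical helper, not Lucene code). */
public class BytesHashCode {

    /**
     * Computes hash = 31 * hash + bytes[i] over bytes[offset .. offset + length - 1].
     * Java bytes are signed, so values above 0x7F contribute as negative numbers,
     * and int overflow wraps around silently.
     */
    public static int hash(byte[] bytes, int offset, int length) {
        int hash = 0;
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + bytes[i];
        }
        return hash;
    }
}
```

      For the bytes {1, 2, 3} with offset 0 and length 3 this yields 31 * (31 * 1 + 2) + 3 = 1026. A value computed this way is what the code argument of add(BytesRef, int) and find(BytesRef, int) is documented to expect.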
    • find

      public int find(BytesRef bytes)
      Returns the id of the given BytesRef, or -1 if there is no mapping for the given bytes.
    • find

      public int find(BytesRef bytes, int code)
      Returns the id of the given BytesRef with a pre-calculated hash code.
      Parameters:
      bytes - the bytes to look for
      code - the bytes hash code
      Returns:
      the id of the given bytes, or -1 if there is no mapping for the given bytes.
    • addByPoolOffset

      public int addByPoolOffset(int offset)
      Adds an "arbitrary" int offset instead of a BytesRef term. This is used in the indexer to hold the hash for term vectors, because they do not redundantly store the byte[] term directly and instead reference the byte[] term already stored by the postings BytesRefHash. See add(int textStart) in TermsHashPerField.
    • reinit

      public void reinit()
      Reinitializes the BytesRefHash after a previous clear() call. If clear() has not been called previously, this method has no effect.
    • byteStart

      public int byteStart(int bytesID)
      Returns the bytesStart offset into the internally used ByteBlockPool for the given bytesID
      Parameters:
      bytesID - the id to look up
      Returns:
      the bytesStart offset into the internally used ByteBlockPool for the given id