Class DocTermOrds

java.lang.Object
org.apache.lucene.index.DocTermOrds

public class DocTermOrds extends Object
This class enables fast access to multiple term ords for a specified field across all docIDs. Like FieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike FieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the getOrdTermsEnum(org.apache.lucene.index.AtomicReader) method and then seek by ord to get the term's bytes.

While term ords are normally of type long, in this API they are int because the internal representation cannot address more than Integer.MAX_VALUE unique terms. This class is typically used on fields with relatively few unique terms compared to the number of documents. In addition, there is an internal limit (16 MB) on how many bytes each chunk of documents may consume; if you exceed this limit, an IllegalStateException is thrown.

Deleted documents are skipped during uninversion, and looking them up returns 0 ords.

The returned per-document ords do not retain their original order in the document. Instead, they are returned in sorted order (by ord, i.e. the term's BytesRef comparator). They are also de-duplicated (i.e. if a document has the same term more than once in this field, you get that ord back only once).

This class tests whether the provided reader is able to retrieve terms by ord (i.e. it is single-segment and uses an ord-capable terms index). If not, this class creates its own term index internally, which lets it build a wrapped TermsEnum that supports ord. The getOrdTermsEnum(org.apache.lucene.index.AtomicReader) method then returns this wrapped enum, if necessary.

The RAM consumption of this class can be high!
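The per-document behavior described above (ords sorted, de-duplicated, and empty for deleted documents) can be illustrated with a toy model. The following is a minimal sketch in plain Java, not Lucene's packed implementation: the class `UninvertSketch`, its `uninvert` method, and the use of `String` terms in a `TreeMap` are all assumptions made for illustration. Java string order stands in for the BytesRef comparator, which matches only for ASCII terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model of uninversion; not part of Lucene.
class UninvertSketch {

    /**
     * docs[doc] holds the field's terms in document order (duplicates allowed).
     * liveDocs[doc] is false for deleted documents.
     * Returns, for each document, its term ords.
     */
    static int[][] uninvert(String[][] docs, boolean[] liveDocs) {
        // Step 1: build a toy inverted index: sorted term -> sorted set of docIDs.
        // A set per term means a repeated term in one document is recorded once.
        TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < docs.length; doc++) {
            for (String term : docs[doc]) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(doc);
            }
        }

        // Step 2: uninvert. Terms are visited in sorted order, so the ords
        // appended to each document's list are already sorted by ord.
        List<List<Integer>> perDoc = new ArrayList<>();
        for (int doc = 0; doc < docs.length; doc++) {
            perDoc.add(new ArrayList<>());
        }
        int ord = 0;
        for (Map.Entry<String, TreeSet<Integer>> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (liveDocs[doc]) {          // deleted documents are skipped
                    perDoc.get(doc).add(ord);
                }
            }
            ord++;                            // every term gets an ord, even if unused
        }

        int[][] result = new int[docs.length][];
        for (int doc = 0; doc < docs.length; doc++) {
            result[doc] = perDoc.get(doc).stream().mapToInt(Integer::intValue).toArray();
        }
        return result;
    }
}
```

In this model a document containing the terms b, a, b yields the ords for a and b once each, in that order, regardless of how the terms appeared in the document.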
  • Field Details

    • DEFAULT_INDEX_INTERVAL_BITS

      public static final int DEFAULT_INDEX_INTERVAL_BITS
      Every 128th term is indexed, by default.
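
The interval is expressed in bits, so the default of every 128th term corresponds to 7 bits (2^7 = 128). A small sketch of that arithmetic follows; the class `IndexIntervalSketch` and its methods are hypothetical helpers, and the value 7 is inferred from the description rather than read from the constant itself.

```java
// Hypothetical helper illustrating interval-bits arithmetic; not part of Lucene.
class IndexIntervalSketch {

    // Assumed value: 7 bits gives an interval of 2^7 = 128 terms.
    static final int ASSUMED_DEFAULT_BITS = 7;

    /** Number of terms between two indexed terms. */
    static int interval(int bits) {
        return 1 << bits;
    }

    /** True if the term with this number falls on an index boundary. */
    static boolean isIndexed(int termNum, int bits) {
        int mask = (1 << bits) - 1;
        return (termNum & mask) == 0;
    }

    /** How many terms end up in the index: ceil(numTerms / interval). */
    static int indexedTermCount(int numTerms, int bits) {
        return (numTerms + (1 << bits) - 1) >>> bits;
    }
}
```

A larger bit count keeps fewer terms in RAM at the cost of a longer scan per lookup.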
    • maxTermDocFreq

      protected final int maxTermDocFreq
      Don't uninvert terms that exceed this count.
    • field

      protected final String field
      Field we are uninverting.
    • numTermsInField

      protected int numTermsInField
      Number of terms in the field.
    • termInstances

      protected long termInstances
      Total number of references to term numbers.
    • total_time

      protected int total_time
      Total time to uninvert the field.
    • phase1_time

      protected int phase1_time
      Time for phase1 of the uninvert process.
    • index

      protected int[] index
      Holds the per-document ords or a pointer to the ords.
    • tnums

      protected byte[][] tnums
      Holds term ords for documents.
    • sizeOfIndexedStrings

      protected long sizeOfIndexedStrings
      Total bytes (sum of term lengths) for all indexed terms.
    • indexedTermsArray

      protected BytesRef[] indexedTermsArray
      Holds the indexed (by default every 128th) terms.
    • prefix

      protected BytesRef prefix
      If non-null, only terms matching this prefix were indexed.
    • ordBase

      protected int ordBase
      Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
    • docsEnum

      protected DocsEnum docsEnum
      Used while uninverting.
  • Constructor Details

    • DocTermOrds

      public DocTermOrds(AtomicReader reader, Bits liveDocs, String field) throws IOException
      Uninverts all terms in the field.
      Throws:
      IOException
    • DocTermOrds

      public DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix) throws IOException
      Uninverts only terms starting with the given prefix.
      Throws:
      IOException
    • DocTermOrds

      public DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix, int maxTermDocFreq) throws IOException
      Uninverts only terms that start with the given prefix and whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
      Throws:
      IOException
    • DocTermOrds

      public DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits) throws IOException
      Uninverts only terms that start with the given prefix and whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
      Throws:
      IOException
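
The two filters described for these constructors (term prefix and maxTermDocFreq) can be sketched as a simple selection over the sorted term list. This is a toy model with assumed names (`TermFilterSketch`, `termsToUninvert`), using `String` terms and a parallel `docFreqs` array in place of a real TermsEnum; as in the description, the docFreq values are taken as-is and are not adjusted for deletions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the prefix / maxTermDocFreq filters; not part of Lucene.
class TermFilterSketch {

    /**
     * Returns the terms that would be uninverted: those starting with prefix
     * (all terms if prefix is null) whose docFreq is <= maxTermDocFreq.
     */
    static List<String> termsToUninvert(String[] sortedTerms, int[] docFreqs,
                                        String prefix, int maxTermDocFreq) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < sortedTerms.length; i++) {
            if (prefix != null && !sortedTerms[i].startsWith(prefix)) {
                continue; // outside the requested prefix
            }
            if (docFreqs[i] > maxTermDocFreq) {
                continue; // too frequent: not uninverted
            }
            kept.add(sortedTerms[i]);
        }
        return kept;
    }
}
```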
    • DocTermOrds

      protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits)
      Subclasses initialize with this constructor, but must then call uninvert exactly once.
  • Method Details

    • ramUsedInBytes

      public long ramUsedInBytes()
      Returns total bytes used.
    • getOrdTermsEnum

      public TermsEnum getOrdTermsEnum(AtomicReader reader) throws IOException
      Returns a TermsEnum that implements ord. If the provided reader supports ord, we just return its TermsEnum; if it does not, we build a "private" terms index internally (WARNING: consumes RAM) and use that index to implement ord. This also enables ord on top of a composite reader. The returned TermsEnum is unpositioned. This returns null if there are no terms.

      NOTE: you must pass the same reader that was used when creating this instance.

      Throws:
      IOException
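
To illustrate how a sparse in-RAM terms index can implement lookup by ord, here is a minimal sketch under stated assumptions: the class `SparseOrdIndexSketch` is hypothetical, a sorted `String[]` stands in for the reader's term dictionary, and stepping through that array stands in for advancing a TermsEnum. It is not the wrapped enum this class actually builds.

```java
// Hypothetical model of ord lookup through a sparse terms index; not part of Lucene.
class SparseOrdIndexSketch {
    private final String[] sortedTerms;   // stands in for the full term dictionary
    private final String[] indexedTerms;  // every 2^bits-th term, the part held in RAM
    private final int bits;

    SparseOrdIndexSketch(String[] sortedTerms, int bits) {
        this.sortedTerms = sortedTerms;
        this.bits = bits;
        int interval = 1 << bits;
        int count = (sortedTerms.length + interval - 1) >>> bits;
        this.indexedTerms = new String[count];
        for (int i = 0; i < count; i++) {
            indexedTerms[i] = sortedTerms[i << bits];
        }
    }

    /** Returns the term with the given ord. */
    String lookupTerm(int ord) {
        if (ord < 0 || ord >= sortedTerms.length) {
            throw new IndexOutOfBoundsException("ord=" + ord);
        }
        // Jump to the nearest indexed term at or before ord...
        int slot = ord >>> bits;
        String term = indexedTerms[slot];
        int pos = slot << bits;
        // ...then step forward term by term, at most 2^bits - 1 times.
        while (pos < ord) {
            pos++;
            term = sortedTerms[pos];
        }
        return term;
    }

    /** Number of terms held in the sparse index. */
    int indexedTermCount() {
        return indexedTerms.length;
    }
}
```

The trade-off in this model is the one the warning above points at: the more terms the index holds, the more RAM it consumes and the shorter each forward scan becomes.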
    • numTerms

      public int numTerms()
      Returns the number of terms in this field.
    • isEmpty

      public boolean isEmpty()
      Returns true if no terms were indexed.
    • visitTerm

      protected void visitTerm(TermsEnum te, int termNum) throws IOException
      Subclasses can override this; it is called for each term visited during uninversion.
      Throws:
      IOException
    • setActualDocFreq

      protected void setActualDocFreq(int termNum, int df) throws IOException
      Invoked during uninvert(AtomicReader,Bits,BytesRef) to record the document frequency for each uninverted term.
      Throws:
      IOException
    • uninvert

      protected void uninvert(AtomicReader reader, Bits liveDocs, BytesRef termPrefix) throws IOException
      If you subclass, call this exactly once.
      Throws:
      IOException
    • lookupTerm

      public BytesRef lookupTerm(TermsEnum termsEnum, int ord) throws IOException
      Returns the term (BytesRef) corresponding to the provided ordinal.
      Throws:
      IOException
    • iterator

      public SortedSetDocValues iterator(AtomicReader reader) throws IOException
      Returns a SortedSetDocValues view of this instance.
      Throws:
      IOException