Package org.apache.lucene.index
Class DocTermOrds
java.lang.Object
org.apache.lucene.index.DocTermOrds
This class enables fast access to multiple term ords for
a specified field across all docIDs.
Like FieldCache, it uninverts the index and holds a
packed data structure in RAM to enable fast access.
Unlike FieldCache, it can handle multi-valued fields,
and it does not hold the term bytes in RAM. Instead, you
must obtain a TermsEnum from the
getOrdTermsEnum(org.apache.lucene.index.AtomicReader)
method, and then seek-by-ord to get the term's bytes.
While term ords are normally of type long, in this API they are
int, as the internal representation here cannot address
more than Integer.MAX_VALUE unique terms. Also, typically this
class is used on fields with relatively few unique terms
vs the number of documents. In addition, there is an
internal limit (16 MB) on how many bytes each chunk of
documents may consume. If you trip this limit you'll hit
an IllegalStateException.
Deleted documents are skipped during uninversion, and if
you look them up you'll get 0 ords.
The returned per-document ords do not retain their
original order in the document. Instead they are returned
in sorted order (by ord, i.e. by the term's BytesRef comparator). They
are also de-duplicated (i.e. if a doc has the same term more than once
in this field, you'll only get that ord back once).
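To make the uninversion and the sorted, de-duplicated result concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class UninvertSketch, its uninvert method, the TreeMap standing in for the terms dictionary, and the BitSet standing in for liveDocs are all assumptions made for illustration. String order matches byte order here only because the sample terms are ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: plain collections stand in for the index. Not Lucene code.
final class UninvertSketch {

    /**
     * Converts term -> docIDs postings into docID -> ords.
     * The TreeMap iterates terms in sorted order, so the N-th term gets ord N.
     */
    static int[][] uninvert(TreeMap<String, int[]> postings, int maxDoc, BitSet liveDocs) {
        List<List<Integer>> perDoc = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            perDoc.add(new ArrayList<Integer>());
        }
        int ord = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (!liveDocs.get(doc)) {
                    continue; // deleted documents are skipped, so they end up with 0 ords
                }
                List<Integer> ords = perDoc.get(doc);
                // Record each ord once per document, even if the term repeats.
                if (ords.isEmpty() || ords.get(ords.size() - 1) != ord) {
                    ords.add(ord);
                }
            }
            ord++;
        }
        int[][] result = new int[maxDoc][];
        for (int doc = 0; doc < maxDoc; doc++) {
            List<Integer> ords = perDoc.get(doc);
            result[doc] = new int[ords.size()];
            for (int i = 0; i < ords.size(); i++) {
                result[doc][i] = ords.get(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("cherry", new int[] {0, 0, 3}); // doc 0 contains "cherry" twice
        postings.put("apple", new int[] {0, 2});
        postings.put("banana", new int[] {1, 2});
        BitSet liveDocs = new BitSet();
        liveDocs.set(0, 3); // docs 0..2 are live, doc 3 is deleted
        System.out.println(Arrays.deepToString(uninvert(postings, 4, liveDocs)));
    }
}
```

In this model doc 0 ends up with ords 0 and 2 ("apple" and "cherry") no matter where those terms appeared in the document, and the deleted doc 3 gets an empty array.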
This class tests whether the provided reader is able to
retrieve terms by ord (i.e., it's a single segment, and it
uses an ord-capable terms index). If not, this class
creates its own term index internally, which lets it
build a wrapped TermsEnum that can handle ord. The
getOrdTermsEnum(org.apache.lucene.index.AtomicReader)
method then returns this
wrapped enum, if necessary.
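The idea of an internal term index that makes ord lookups possible can be sketched as follows. This is a simplified model under stated assumptions, not the class's actual code: SparseOrdIndexSketch and its methods are hypothetical names, a sorted String array stands in for the terms dictionary, and a binary search stands in for seeking the dictionary by term. Only every (1 << bits)-th term is held in the index, which is how 7 bits gives the default of every 128th term.

```java
import java.util.Arrays;

// Toy model of a sparse term index used to resolve ord -> term. Not Lucene code.
final class SparseOrdIndexSketch {
    private final String[] sortedTerms;  // stand-in for the terms dictionary (sorted, unique)
    private final String[] indexedTerms; // every (1 << bits)-th term, kept in RAM
    private final int bits;

    SparseOrdIndexSketch(String[] sortedTerms, int bits) {
        this.sortedTerms = sortedTerms;
        this.bits = bits;
        int interval = 1 << bits;
        int count = (sortedTerms.length + interval - 1) / interval;
        this.indexedTerms = new String[count];
        for (int i = 0; i < count; i++) {
            indexedTerms[i] = sortedTerms[i << bits];
        }
    }

    int indexedCount() {
        return indexedTerms.length;
    }

    String termForOrd(int ord) {
        if (ord < 0 || ord >= sortedTerms.length) {
            throw new IndexOutOfBoundsException("ord=" + ord);
        }
        // 1. Pick the closest indexed term at or before the requested ord.
        String anchor = indexedTerms[ord >>> bits];
        // 2. Seek the dictionary to that term (modeled here as a binary search).
        int pos = Arrays.binarySearch(sortedTerms, anchor);
        // 3. Step forward the remaining distance, as repeated next() calls would.
        int steps = ord & ((1 << bits) - 1);
        return sortedTerms[pos + steps];
    }

    public static void main(String[] args) {
        String[] terms = new String[300];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = String.format("t%03d", i);
        }
        SparseOrdIndexSketch index = new SparseOrdIndexSketch(terms, 7);
        System.out.println(index.indexedCount() + " indexed terms for " + terms.length + " terms");
        System.out.println(index.termForOrd(299));
    }
}
```

The trade-off is the usual one for sparse indexes: a smaller interval costs more RAM but needs fewer forward steps per lookup.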
The RAM consumption of this class can be high!
Field Summary
Fields
Modifier and Type, Field: Description
static final int DEFAULT_INDEX_INTERVAL_BITS: Every 128th term is indexed, by default.
protected DocsEnum docsEnum: Used while uninverting.
protected final String field: Field we are uninverting.
protected int[] index: Holds the per-document ords or a pointer to the ords.
protected BytesRef[] indexedTermsArray: Holds the indexed (by default every 128th) terms.
protected final int maxTermDocFreq: Don't uninvert terms that exceed this count.
protected int numTermsInField: Number of terms in the field.
protected int ordBase: Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
protected int phase1_time: Time for phase1 of the uninvert process.
protected BytesRef prefix: If non-null, only terms matching this prefix were indexed.
protected long sizeOfIndexedStrings: Total bytes (sum of term lengths) for all indexed terms.
protected long termInstances: Total number of references to term numbers.
protected byte[][] tnums: Holds term ords for documents.
protected int total_time: Total time to uninvert the field.
Constructor Summary
Constructors
Modifier, Constructor: Description
protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits): Subclasses initialize with this constructor; be sure to then call uninvert, and only once.
DocTermOrds(AtomicReader reader, Bits liveDocs, String field): Inverts all terms.
DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix): Inverts only terms starting with the prefix.
DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix, int maxTermDocFreq): Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits): Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
Method Summary
Modifier and Type, Method: Description
TermsEnum getOrdTermsEnum(AtomicReader reader): Returns a TermsEnum that implements ord.
boolean isEmpty(): Returns true if no terms were indexed.
SortedSetDocValues iterator(AtomicReader reader): Returns a SortedSetDocValues view of this instance.
BytesRef lookupTerm(TermsEnum termsEnum, int ord): Returns the term (BytesRef) corresponding to the provided ordinal.
int numTerms(): Returns the number of terms in this field.
long ramUsedInBytes(): Returns total bytes used.
protected void setActualDocFreq(int termNum, int df): Invoked during uninvert(AtomicReader,Bits,BytesRef) to record the document frequency for each uninverted term.
protected void uninvert(AtomicReader reader, Bits liveDocs, BytesRef termPrefix): Call this only once (if you subclass!).
protected void visitTerm: Subclass can override this.
Field Details

DEFAULT_INDEX_INTERVAL_BITS
public static final int DEFAULT_INDEX_INTERVAL_BITS
Every 128th term is indexed, by default.

maxTermDocFreq
protected final int maxTermDocFreq
Don't uninvert terms that exceed this count.

field
protected final String field
Field we are uninverting.

numTermsInField
protected int numTermsInField
Number of terms in the field.

termInstances
protected long termInstances
Total number of references to term numbers.

total_time
protected int total_time
Total time to uninvert the field.

phase1_time
protected int phase1_time
Time for phase1 of the uninvert process.

index
protected int[] index
Holds the per-document ords or a pointer to the ords.

tnums
protected byte[][] tnums
Holds term ords for documents.

sizeOfIndexedStrings
protected long sizeOfIndexedStrings
Total bytes (sum of term lengths) for all indexed terms.

indexedTermsArray
protected BytesRef[] indexedTermsArray
Holds the indexed (by default every 128th) terms.

prefix
protected BytesRef prefix
If non-null, only terms matching this prefix were indexed.

ordBase
protected int ordBase
Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().

docsEnum
protected DocsEnum docsEnum
Used while uninverting.
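The index and tnums fields hold ords in packed form, but the field descriptions do not spell out the byte layout. The following is therefore only a sketch of one common way to pack a sorted ord list (delta encoding followed by variable-length bytes), offered to show why sorted ords pack well. It is not a statement of what this class does internally. DeltaVIntSketch, encode, and decode are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative packing of sorted ords: store gaps between ords, 7 bits per byte.
final class DeltaVIntSketch {

    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int ord : sortedOrds) {
            int delta = ord - previous; // non-negative because the input is sorted
            previous = ord;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // high bit set: more bytes follow
                delta >>>= 7;
            }
            out.write(delta); // final byte has the high bit clear
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) {
        List<Integer> ords = new ArrayList<>();
        int previous = 0;
        int pos = 0;
        while (pos < bytes.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta; // undo the delta encoding
            ords.add(previous);
        }
        int[] result = new int[ords.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ords.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ords = {3, 5, 300, 70000};
        byte[] packed = encode(ords);
        System.out.println(packed.length + " bytes packed vs " + (4 * ords.length) + " bytes as int[]");
        System.out.println(Arrays.toString(decode(packed)));
    }
}
```

Because per-document ords are sorted and de-duplicated, the gaps are small positive numbers, and small numbers need few bytes in this kind of encoding.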
Constructor Details

DocTermOrds
public DocTermOrds(AtomicReader reader, Bits liveDocs, String field) throws IOException
Inverts all terms.
Throws: IOException

DocTermOrds
public DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix) throws IOException
Inverts only terms starting with the prefix.
Throws: IOException

DocTermOrds
public DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix, int maxTermDocFreq) throws IOException
Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
Throws: IOException

DocTermOrds
public DocTermOrds(AtomicReader reader, Bits liveDocs, String field, BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits) throws IOException
Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
Throws: IOException

DocTermOrds
protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits)
Subclasses initialize with this constructor; be sure to then call uninvert, and only once.
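The effect of the termPrefix and maxTermDocFreq arguments can be pictured as a filter over the field's terms. The sketch below is a plain-Java model under assumptions, not the constructor's real logic: TermFilterSketch and termsToUninvert are hypothetical names, and a TreeMap from term to document frequency stands in for the terms dictionary. It only shows which terms pass the two conditions stated above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the prefix and docFreq conditions. Not Lucene code.
final class TermFilterSketch {

    /** Returns the terms that match the prefix (if any) and have docFreq <= maxTermDocFreq. */
    static List<String> termsToUninvert(TreeMap<String, Integer> docFreqs,
                                        String prefix,
                                        int maxTermDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            String term = entry.getKey();
            if (prefix != null && !term.startsWith(prefix)) {
                continue; // outside the requested prefix
            }
            if (entry.getValue() > maxTermDocFreq) {
                continue; // too frequent: such terms are not uninverted
            }
            kept.add(term);
        }
        return kept;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> docFreqs = new TreeMap<>();
        docFreqs.put("cat:a", 5);
        docFreqs.put("cat:b", 50);
        docFreqs.put("tag:x", 1);
        System.out.println(termsToUninvert(docFreqs, "cat:", 10));
        System.out.println(termsToUninvert(docFreqs, null, Integer.MAX_VALUE));
    }
}
```

Capping the document frequency keeps very common terms out of the per-document structure, which is one way to limit the RAM cost the class description warns about.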
Method Details

ramUsedInBytes
public long ramUsedInBytes()
Returns total bytes used.

getOrdTermsEnum
public TermsEnum getOrdTermsEnum(AtomicReader reader) throws IOException
Returns a TermsEnum that implements ord. If the provided reader supports ord, we just return its TermsEnum; if it does not, we build a "private" terms index internally (WARNING: consumes RAM) and use that index to implement ord. This also enables ord on top of a composite reader. The returned TermsEnum is unpositioned. This returns null if there are no terms.
NOTE: you must pass the same reader that was used when creating this class.
Throws: IOException

numTerms
public int numTerms()
Returns the number of terms in this field.

isEmpty
public boolean isEmpty()
Returns true if no terms were indexed.

visitTerm
Subclass can override this.
Throws: IOException

setActualDocFreq
protected void setActualDocFreq(int termNum, int df) throws IOException
Invoked during uninvert(AtomicReader,Bits,BytesRef) to record the document frequency for each uninverted term.
Throws: IOException

uninvert
protected void uninvert(AtomicReader reader, Bits liveDocs, BytesRef termPrefix) throws IOException
Call this only once (if you subclass!).
Throws: IOException

lookupTerm
public BytesRef lookupTerm(TermsEnum termsEnum, int ord) throws IOException
Returns the term (BytesRef) corresponding to the provided ordinal.
Throws: IOException

iterator
public SortedSetDocValues iterator(AtomicReader reader) throws IOException
Returns a SortedSetDocValues view of this instance.
Throws: IOException
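The way a caller typically walks per-document ords and then resolves each ord to its term can be sketched with a small stand-in class. OrdsViewSketch below is hypothetical and uses no Lucene types. Its setDocument, nextOrd, and NO_MORE_ORDS members are modeled on the general shape of a SortedSetDocValues-style view, and its lookupTerm plays the role of resolving an ord to a term. Treat the details as assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an ord-iterating view over uninverted data. Not Lucene code.
final class OrdsViewSketch {
    static final long NO_MORE_ORDS = -1;

    private final int[][] ordsByDoc;    // sorted, de-duplicated ords per document
    private final String[] sortedTerms; // ord -> term
    private int[] current = new int[0];
    private int upto;

    OrdsViewSketch(int[][] ordsByDoc, String[] sortedTerms) {
        this.ordsByDoc = ordsByDoc;
        this.sortedTerms = sortedTerms;
    }

    /** Positions the view on a document; iteration restarts from its first ord. */
    void setDocument(int docID) {
        current = ordsByDoc[docID];
        upto = 0;
    }

    /** Returns the next ord for the current document, or NO_MORE_ORDS when exhausted. */
    long nextOrd() {
        return upto < current.length ? current[upto++] : NO_MORE_ORDS;
    }

    /** Resolves an ord to its term. */
    String lookupTerm(int ord) {
        return sortedTerms[ord];
    }

    /** Collects the terms of one document by iterating its ords. */
    static List<String> termsOf(OrdsViewSketch view, int docID) {
        List<String> terms = new ArrayList<>();
        view.setDocument(docID);
        for (long ord = view.nextOrd(); ord != NO_MORE_ORDS; ord = view.nextOrd()) {
            terms.add(view.lookupTerm((int) ord));
        }
        return terms;
    }

    public static void main(String[] args) {
        int[][] ordsByDoc = {{0, 2}, {1}, {}};
        String[] terms = {"apple", "banana", "cherry"};
        OrdsViewSketch view = new OrdsViewSketch(ordsByDoc, terms);
        for (int doc = 0; doc < ordsByDoc.length; doc++) {
            System.out.println("doc " + doc + ": " + termsOf(view, doc));
        }
    }
}
```

A document with no ords (for example a deleted one) simply yields NO_MORE_ORDS on the first call, so the loop body never runs.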