Class BlockTreeTermsWriter
- All Implemented Interfaces:
Closeable, AutoCloseable
Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.
Files:
- .tim: Term Dictionary
- .tip: Term Index
Term Dictionary
The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).
The .tim file is arranged in blocks. Each block contains a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.
NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections.
- TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, FieldSummary, DirOffset
- NodeBlock --> (OuterNode | InnerNode)
- OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, <TermStats>^EntryCount, MetaLength, <TermMetadata>^EntryCount
- InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, <TermStats?>^EntryCount, MetaLength, <TermMetadata?>^EntryCount
- TermStats --> DocFreq, TotalTermFreq
- FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte^RootCodeLength, SumTotalTermFreq?, SumDocFreq, DocCount>^NumFields
- Header --> CodecHeader
- DirOffset --> Uint64
- EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength, NumFields, FieldNumber, RootCodeLength, DocCount --> VInt
- TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq --> VLong
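The VInt and VLong types in the grammar above are Lucene's variable-length integers: seven payload bits per byte, low-order group first, with the high bit set on every byte except the last. The following is a minimal standalone sketch of that byte layout. The names `VIntSketch`, `encode`, and `decode` are illustrative and not part of Lucene (whose own implementation lives in `DataOutput.writeVInt` and `DataOutput.writeVLong`), and the sketch rejects negative values for simplicity.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative only: mirrors the VInt/VLong byte layout, not Lucene's own classes. */
final class VIntSketch {

    /** Encodes a non-negative value using 7 bits per byte, low-order bits first. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // more bytes follow: high bit set
            value >>>= 7;
        }
        out.write((int) value); // final byte: high bit clear
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
        }
        return value;
    }

    public static void main(String[] args) {
        for (long v : new long[] {0, 127, 128, 16_384, 1L << 40}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

Small values such as typical DocFreq counts fit in a single byte under this scheme, which is why most of the per-entry fields in the grammar use it.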
Notes:
- Header is a CodecHeader storing the version information for the BlockTree implementation.
- DirOffset is a pointer to the FieldSummary section.
- DocFreq is the count of documents which contain the term.
- TotalTermFreq is the total number of occurrences of the term. This is encoded as the difference between the total number of occurrences and the DocFreq.
- FieldNumber is the field's number from FieldInfos (.fnm).
- NumTerms is the number of unique terms for the field.
- RootCode points to the root block for the field.
- SumDocFreq is the total number of postings, that is, the number of term-document pairs across the entire field.
- DocCount is the number of documents that have at least one posting for this field.
- PostingsHeader and TermMetadata are plugged into by the specific postings implementation: these contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
- For inner nodes of the tree, every entry steals one bit to mark whether it points to a child node (sub-block). If so, the corresponding TermStats and TermMetadata are omitted.
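Two of the notes above describe small encoding tricks: TotalTermFreq is stored as its difference from DocFreq, and each inner-node entry gives up one bit to flag a sub-block. The sketch below illustrates both with plain integers. It assumes the flag occupies the low bit of the suffix-length value, which the `SuffixLength[,Sub?]` notation suggests but this page does not spell out. `TermEntrySketch` and its method names are hypothetical.

```java
/** Illustrative only: plain-integer versions of the two encodings noted above. */
final class TermEntrySketch {

    /** Stored form of TotalTermFreq: its excess over DocFreq. */
    static long encodeTotalTermFreq(int docFreq, long totalTermFreq) {
        if (totalTermFreq < docFreq) {
            // Every document containing the term contributes at least one occurrence.
            throw new IllegalArgumentException("totalTermFreq must be >= docFreq");
        }
        return totalTermFreq - docFreq;
    }

    /** Inverse of encodeTotalTermFreq. */
    static long decodeTotalTermFreq(int docFreq, long stored) {
        return docFreq + stored;
    }

    /** ASSUMPTION: the stolen bit is the low bit of the suffix-length value. */
    static int packSuffixLength(int suffixLength, boolean isSubBlock) {
        return (suffixLength << 1) | (isSubBlock ? 1 : 0);
    }

    static int unpackSuffixLength(int packed) {
        return packed >>> 1;
    }

    static boolean isSubBlock(int packed) {
        return (packed & 1) != 0;
    }

    public static void main(String[] args) {
        long stored = encodeTotalTermFreq(3, 7);
        System.out.println("stored delta = " + stored);
        int packed = packSuffixLength(5, true);
        System.out.println("suffix length = " + unpackSuffixLength(packed)
            + ", sub-block = " + isSubBlock(packed));
    }
}
```

Because the difference is never negative and is zero for any term that occurs once per document, it tends to encode as a short VLong.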
Term Index
The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.
- TermsIndex (.tip) --> Header, FSTIndex^NumFields, <IndexStartFP>^NumFields, DirOffset
- Header --> CodecHeader
- DirOffset --> Uint64
- IndexStartFP --> VLong
- FSTIndex --> FST<byte[]>
Notes:
- The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
- DirOffset is a pointer to the start of the IndexStartFPs for all fields.
- An on-disk block may end up with more terms than the allowed maximum (default: 48). When this happens, the block is subdivided into new blocks (called "floor blocks"), and the FST output for the block's prefix encodes the leading byte of each sub-block along with its file pointer.
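The prefix-to-block mapping and the floor-block subdivision described in these notes can be modelled without an FST. The sketch below is a simplification under several assumptions: a `HashMap` stands in for the FST, Java strings of ASCII characters stand in for term bytes, file pointers are made-up numbers, and the first floor block of a prefix is registered with lead byte 0. `BlockIndexSketch` and its methods are hypothetical names, not Lucene API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

/** Illustrative only: a map-based stand-in for the per-field FST index. */
final class BlockIndexSketch {

    // prefix -> (lead byte of each floor block -> hypothetical .tim file pointer)
    private final Map<String, TreeMap<Integer, Long>> blocksByPrefix = new HashMap<>();

    /** Registers a floor block; a block that was never split has one entry with leadByte 0. */
    void addFloorBlock(String prefix, int leadByte, long filePointer) {
        blocksByPrefix.computeIfAbsent(prefix, p -> new TreeMap<>()).put(leadByte, filePointer);
    }

    /** Returns the pointer of the block that would hold the term, or empty if no prefix is indexed. */
    OptionalLong lookup(String term) {
        // Try the longest prefix first: deeper blocks hold the terms that share more bytes.
        for (int len = term.length(); len >= 0; len--) {
            TreeMap<Integer, Long> floors = blocksByPrefix.get(term.substring(0, len));
            if (floors == null) {
                continue;
            }
            // The byte right after the prefix selects the floor block (0 if the term ends here).
            int next = len < term.length() ? term.charAt(len) : 0;
            Map.Entry<Integer, Long> floor = floors.floorEntry(next);
            if (floor == null) {
                floor = floors.firstEntry(); // fall back to the first floor block
            }
            return OptionalLong.of(floor.getValue());
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        BlockIndexSketch index = new BlockIndexSketch();
        index.addFloorBlock("ab", 0, 100L);   // first floor block for prefix "ab"
        index.addFloorBlock("ab", 'm', 250L); // second floor block starts at lead byte 'm'
        System.out.println("abc -> " + index.lookup("abc"));
        System.out.println("abz -> " + index.lookup("abz"));
        System.out.println("xy  -> " + index.lookup("xy"));
    }
}
```

In this toy index a term with no registered prefix yields an empty result. A real field also has a root block (see RootCode), which the sketch only models if the empty prefix is registered explicitly.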
-
Field Summary
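Fields (modifier and type, field, description):
- static final int DEFAULT_MAX_BLOCK_SIZE: Suggested default value for the maxItemsInBlock parameter to BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
- static final int DEFAULT_MIN_BLOCK_SIZE: Suggested default value for the minItemsInBlock parameter to BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
- static final int TERMS_INDEX_VERSION_APPEND_ONLY: Append-only
- static final int TERMS_INDEX_VERSION_CURRENT: Current index format.
- static final int TERMS_INDEX_VERSION_META_ARRAY: Meta data as array
- static final int TERMS_INDEX_VERSION_START: Initial index format.
- static final int TERMS_VERSION_APPEND_ONLY: Append-only
- static final int TERMS_VERSION_CURRENT: Current terms format.
- static final int TERMS_VERSION_META_ARRAY: Meta data as array
- static final int TERMS_VERSION_START: Initial terms format.
-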
Constructor Summary
Constructors (constructor, description):
- BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock): Create a new writer.
-
Method Summary
Methods (modifier and type, method, description):
- addField: Add a new field.
- void close(): Called when we are done adding everything.
- protected void writeHeader(IndexOutput out): Writes the terms file header.
- protected void writeIndexHeader: Writes the index file header.
- protected void writeIndexTrailer(IndexOutput indexOut, long dirStart): Writes the index file trailer.
- protected void writeTrailer(IndexOutput out, long dirStart): Writes the terms file trailer.
Methods inherited from class org.apache.lucene.codecs.FieldsConsumer
merge
-
Field Details
-
DEFAULT_MIN_BLOCK_SIZE
public static final int DEFAULT_MIN_BLOCK_SIZE
Suggested default value for the minItemsInBlock parameter to BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
-
DEFAULT_MAX_BLOCK_SIZE
public static final int DEFAULT_MAX_BLOCK_SIZE
Suggested default value for the maxItemsInBlock parameter to BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
-
TERMS_VERSION_START
public static final int TERMS_VERSION_START
Initial terms format.
-
TERMS_VERSION_APPEND_ONLY
public static final int TERMS_VERSION_APPEND_ONLY
Append-only
-
TERMS_VERSION_META_ARRAY
public static final int TERMS_VERSION_META_ARRAY
Meta data as array
-
TERMS_VERSION_CURRENT
public static final int TERMS_VERSION_CURRENT
Current terms format.
-
TERMS_INDEX_VERSION_START
public static final int TERMS_INDEX_VERSION_START
Initial index format.
-
TERMS_INDEX_VERSION_APPEND_ONLY
public static final int TERMS_INDEX_VERSION_APPEND_ONLY
Append-only
-
TERMS_INDEX_VERSION_META_ARRAY
public static final int TERMS_INDEX_VERSION_META_ARRAY
Meta data as array
-
TERMS_INDEX_VERSION_CURRENT
public static final int TERMS_INDEX_VERSION_CURRENT
Current index format.
-
-
Constructor Details
-
BlockTreeTermsWriter
public BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock) throws IOException
Create a new writer. The number of items (terms or sub-blocks) per block will aim to be between minItemsInBlock and maxItemsInBlock, though in some cases a block may hold fewer than minItemsInBlock.
- Throws:
IOException
-
-
Method Details
-
writeHeader
Writes the terms file header.
- Throws:
IOException
-
writeIndexHeader
Writes the index file header.
- Throws:
IOException
-
writeTrailer
Writes the terms file trailer.
- Throws:
IOException
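The dirStart value passed to the trailer methods ties back to DirOffset in the file grammar: the directory (FieldSummary for .tim) is written near the end of the file, and its start offset is then written last as a fixed-width 64-bit value, so a reader can locate the directory from the final eight bytes. The sketch below models that layout with `java.nio.ByteBuffer` rather than Lucene's `IndexOutput`. `TrailerSketch`, `write`, and `readDirStart` are hypothetical names, and the codec header is omitted.

```java
import java.nio.ByteBuffer;

/** Illustrative only: body, then directory, then the directory's offset as the last 8 bytes. */
final class TrailerSketch {

    /** Lays out body + directory + trailing 64-bit pointer to where the directory starts. */
    static byte[] write(byte[] body, byte[] directory) {
        long dirStart = body.length; // the directory begins right after the body
        ByteBuffer buf = ByteBuffer.allocate(body.length + directory.length + Long.BYTES);
        buf.put(body).put(directory).putLong(dirStart); // big-endian by default
        return buf.array();
    }

    /** Recovers the directory start by reading the final 8 bytes of the file. */
    static long readDirStart(byte[] file) {
        return ByteBuffer.wrap(file, file.length - Long.BYTES, Long.BYTES).getLong();
    }

    public static void main(String[] args) {
        byte[] file = write(new byte[] {1, 2, 3}, new byte[] {9, 9});
        System.out.println("file length = " + file.length);
        System.out.println("directory starts at " + readDirStart(file));
    }
}
```

Writing the pointer last means nothing earlier in the file has to be revisited once the directory position is known, which suits an append-only output.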
-
writeIndexTrailer
Writes the index file trailer.
- Throws:
IOException
-
addField
Description copied from class: FieldsConsumer
Add a new field.
- Specified by:
addField in class FieldsConsumer
- Throws:
IOException
-
close
Description copied from class: FieldsConsumer
Called when we are done adding everything.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in class FieldsConsumer
- Throws:
IOException
-