Package org.apache.lucene.index
Postings APIs
Fields
Fields is the initial entry point into the postings APIs; it can be obtained in several ways:
// access indexed fields for an index segment
Fields fields = reader.fields();

// access term vector fields for a specified document
Fields fields = reader.getTermVectors(docid);

Fields implements Java's Iterable interface, so it's easy to enumerate the list of fields:
// enumerate list of fields
for (String field : fields) {
  // access the terms for this field
  Terms terms = fields.terms(field);
}
Terms
Terms represents the collection of terms within a field. It exposes some metadata and statistics, and an API for enumeration.
// metadata about the field
System.out.println("positions? " + terms.hasPositions());
System.out.println("offsets? " + terms.hasOffsets());
System.out.println("payloads? " + terms.hasPayloads());

// iterate through terms
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
while ((term = termsEnum.next()) != null) {
  doSomethingWith(termsEnum.term());
}
TermsEnum
provides an iterator over the list
of terms within a field, along with some statistics about each term
and methods to access the term's documents and
positions.
// seek to a specific term
boolean found = termsEnum.seekExact(new BytesRef("foobar"));
if (found) {
  // get the document frequency
  System.out.println(termsEnum.docFreq());
  // enumerate through documents
  DocsEnum docs = termsEnum.docs(null, null);
  // enumerate through documents and positions
  DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null);
}
Documents
DocsEnum
is an extension of
DocIdSetIterator
that iterates over the list of
documents for a term, along with the term frequency within that document.
int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  System.out.println(docsEnum.freq());
}
Positions
DocsAndPositionsEnum
is an extension of
DocsEnum
that additionally allows iteration over
the positions at which a term occurred within the document, and any
per-position information (offsets and payloads).
int docid;
while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  int freq = docsAndPositionsEnum.freq();
  for (int i = 0; i < freq; i++) {
    System.out.println(docsAndPositionsEnum.nextPosition());
    System.out.println(docsAndPositionsEnum.startOffset());
    System.out.println(docsAndPositionsEnum.endOffset());
    System.out.println(docsAndPositionsEnum.getPayload());
  }
}
Index Statistics
Term statistics
TermsEnum.docFreq(): Returns the number of documents that contain at least one occurrence of the term. This statistic is always available for an indexed term. Note that it will also count deleted documents; when segments are merged the statistic is updated as those deleted documents are merged away.

TermsEnum.totalTermFreq(): Returns the number of occurrences of this term across all documents. Note that this statistic is unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field. Like docFreq(), it will also count occurrences that appear in deleted documents.
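Both statistics can be read together after seeking a TermsEnum to a term of interest. A minimal sketch, assuming an open IndexReader named reader and a field name "body" (both names are illustrative):

```java
// Assumes "reader" is an open IndexReader and "body" is an indexed field
// (both names are hypothetical).
Terms terms = MultiFields.getTerms(reader, "body");
if (terms != null) {
  TermsEnum termsEnum = terms.iterator(null);
  if (termsEnum.seekExact(new BytesRef("lucene"))) {
    // number of documents containing the term (deleted documents included)
    System.out.println("docFreq: " + termsEnum.docFreq());
    // total occurrences across all documents, or -1 if the field was DOCS_ONLY
    System.out.println("totalTermFreq: " + termsEnum.totalTermFreq());
  }
}
```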
Field statistics
Terms.size(): Returns the number of unique terms in the field. This statistic may be unavailable (returns -1) for some Terms implementations such as MultiTerms, where it cannot be efficiently computed. Note that this count also includes terms that appear only in deleted documents; when segments are merged such terms are also merged away and the statistic is then updated.

Terms.getDocCount(): Returns the number of documents that contain at least one occurrence of any term for this field. This can be thought of as a field-level docFreq(). Like docFreq() it will also count deleted documents.

Terms.getSumDocFreq(): Returns the number of postings (term-document mappings in the inverted index) for the field. This can be thought of as the sum of TermsEnum.docFreq() across all terms in the field; like docFreq() it will also count postings that appear in deleted documents.

Terms.getSumTotalTermFreq(): Returns the number of tokens for the field. This can be thought of as the sum of TermsEnum.totalTermFreq() across all terms in the field; like totalTermFreq() it will also count occurrences that appear in deleted documents, and will be unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field.
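These four statistics are all exposed on the Terms instance for a field. A minimal sketch, again assuming an open IndexReader named reader and an indexed field named "body":

```java
// Assumes "reader" is an open IndexReader and "body" is an indexed field
// (both names are hypothetical).
Terms terms = MultiFields.getTerms(reader, "body");
if (terms != null) {
  System.out.println("unique terms: " + terms.size());          // -1 if unavailable
  System.out.println("docs with field: " + terms.getDocCount());
  System.out.println("postings: " + terms.getSumDocFreq());
  System.out.println("tokens: " + terms.getSumTotalTermFreq()); // -1 if DOCS_ONLY
}
```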
Segment statistics
IndexReader.maxDoc(): Returns the number of documents (including deleted documents) in the index.

IndexReader.numDocs(): Returns the number of live documents (excluding deleted documents) in the index.

IndexReader.numDeletedDocs(): Returns the number of deleted documents in the index.

Fields.size(): Returns the number of indexed fields.

Fields.getUniqueTermCount(): Returns the number of indexed terms: the sum of Terms.size() across all fields.
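The document counts are straightforward to read off any open reader; note that the three values are related by a simple identity:

```java
// Assumes "reader" is an open IndexReader.
System.out.println("maxDoc: " + reader.maxDoc());                 // includes deletions
System.out.println("numDocs: " + reader.numDocs());               // live documents only
System.out.println("numDeletedDocs: " + reader.numDeletedDocs()); // deletions only
// The three values satisfy: maxDoc() == numDocs() + numDeletedDocs()
```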
Document statistics
Document statistics are available during the indexing process for an indexed field: typically
a Similarity implementation will store some of these values (possibly in a lossy way) into
the normalization value for the document in
its Similarity.computeNorm(org.apache.lucene.index.FieldInvertState)
method.
FieldInvertState.getLength(): Returns the number of tokens for this field in the document. Note that this is just the number of times that TokenStream.incrementToken() returned true, and is unrelated to the values in PositionIncrementAttribute.

FieldInvertState.getNumOverlap(): Returns the number of tokens for this field in the document that had a position increment of zero. This can be used to compute a document length that discounts artificial tokens such as synonyms.

FieldInvertState.getPosition(): Returns the accumulated position value for this field in the document: computed from the values of PositionIncrementAttribute and including Analyzer.getPositionIncrementGap(java.lang.String)s across multivalued fields.

FieldInvertState.getOffset(): Returns the total character offset value for this field in the document: computed from the values of OffsetAttribute returned by TokenStream.end(), and including Analyzer.getOffsetGap(java.lang.String)s across multivalued fields.

FieldInvertState.getUniqueTermCount(): Returns the number of unique terms encountered for this field in the document.

FieldInvertState.getMaxTermFrequency(): Returns the maximum frequency across all unique terms encountered for this field in the document.
Additional user-supplied statistics can be added to the document as DocValues fields and
accessed via AtomicReader.getNumericDocValues(java.lang.String).
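Such a user-supplied statistic can be read back per document from the reader. A hypothetical sketch, assuming the application added a NumericDocValuesField named "length" to each document at index time (the field name is illustrative, not part of Lucene):

```java
// Assumes "atomicReader" is an AtomicReader, and that a NumericDocValuesField
// named "length" (hypothetical) was added to each document at index time.
NumericDocValues lengths = atomicReader.getNumericDocValues("length");
if (lengths != null) {
  for (int docid = 0; docid < atomicReader.maxDoc(); docid++) {
    System.out.println("doc " + docid + ": length=" + lengths.get(docid));
  }
}
```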