Package org.apache.lucene.util.fst
Class Builder<T>
java.lang.Object
org.apache.lucene.util.fst.Builder<T>
Builds a minimal FST (maps an IntsRef term to an arbitrary
output) from pre-sorted terms with outputs. The FST
becomes an FSA if you use NoOutputs. The FST is written
on-the-fly into a compact serialized format byte array, which can
be saved to / loaded from a Directory or used directly
for traversal. The FST is always finite (no cycles).
NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the
subclasses of Outputs
.
FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic class
Expert: holds a pending (seen but not yet serialized) arc.static class
Expert: this is invoked by Builder whenever a suffix is serialized.static final class
Expert: holds a pending (seen but not yet serialized) Node. -
Constructor Summary
ConstructorsConstructorDescriptionBuilder
(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, Builder.FreezeTail<T> freezeTail, boolean doPackFST, float acceptableOverheadRatio, boolean allowArrayArcs, int bytesPageBits) Instantiates an FST/FSA builder with all the possible tuning and construction tweaks.Builder
(FST.INPUT_TYPE inputType, Outputs<T> outputs) Instantiates an FST/FSA builder without any pruning. -
Method Summary
Modifier and TypeMethodDescriptionvoid
It's OK to add the same input twice in a row with different outputs, as long as outputs impls the merge method.finish()
Returns final FST.long
long
long
long
-
Constructor Details
-
Builder
Instantiates an FST/FSA builder without any pruning. A shortcut toBuilder(FST.INPUT_TYPE, int, int, boolean, boolean, int, Outputs, FreezeTail, boolean, float, boolean, int)
with pruning options turned off. -
Builder
public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, Builder.FreezeTail<T> freezeTail, boolean doPackFST, float acceptableOverheadRatio, boolean allowArrayArcs, int bytesPageBits) Instantiates an FST/FSA builder with all the possible tuning and construction tweaks. Read parameter documentation carefully.- Parameters:
inputType
- The input type (transition labels). Can be anything fromFST.INPUT_TYPE
enumeration. Shorter types will consume less memory. Strings (character sequences) are represented asFST.INPUT_TYPE.BYTE4
(full unicode codepoints).minSuffixCount1
- If pruning the input graph during construction, this threshold is used for telling if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node is kept.minSuffixCount2
- (Note: only Mike McCandless knows what this one is really doing...)doShareSuffix
- Iftrue
, the shared suffixes will be compacted into unique paths. This requires an additional RAM-intensive hash map for lookups in memory. Setting this parameter tofalse
creates a single suffix path for all input sequences. This will result in a larger FST, but requires substantially less memory and CPU during building.doShareNonSingletonNodes
- Only used if doShareSuffix is true. Set this to true to ensure FST is fully minimal, at cost of more CPU and more RAM during building.shareMaxTailLength
- Only used if doShareSuffix is true. Set this to Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more CPU and more RAM during building.outputs
- The output type for each input sequence. Applies only if building an FST. For FSA, useNoOutputs.getSingleton()
andNoOutputs.getNoOutput()
as the singleton output object.doPackFST
- Pass true to create a packed FST.acceptableOverheadRatio
- How to trade speed for space when building the FST. This option is only relevant when doPackFST is true. @see PackedInts#getMutable(int, int, float)allowArrayArcs
- Pass false to disable the array arc optimization while building the FST; this will make the resulting FST smaller but slower to traverse.bytesPageBits
- How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be large then make this larger. For example 15 bits = 32768 byte pages.
-
-
Method Details
-
getTotStateCount
public long getTotStateCount() -
getTermCount
public long getTermCount() -
getMappedStateCount
public long getMappedStateCount() -
add
It's OK to add the same input twice in a row with different outputs, as long as outputs impls the merge method. Note that input is fully consumed after this method is returned (so caller is free to reuse), but output is not. So if your outputs are changeable (egByteSequenceOutputs
orIntSequenceOutputs
) then you cannot reuse across calls.- Throws:
IOException
-
finish
Returns final FST. NOTE: this will return null if nothing is accepted by the FST.- Throws:
IOException
-
fstSizeInBytes
public long fstSizeInBytes()
-