Package org.apache.lucene.analysis
Class TokenStreamToAutomaton
java.lang.Object
org.apache.lucene.analysis.TokenStreamToAutomaton
Consumes a TokenStream and creates an Automaton where the transition labels are UTF-8 bytes (or Unicode code points if unicodeArcs is true) from the TermToBytesRefAttribute. Between tokens we insert POS_SEP, and for holes we insert HOLE.
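The labelling scheme described above can be sketched for the simplest case, a linear token stream with no overlapping tokens. This is an illustration only, not the Lucene implementation: the class `LabelPathSketch` and its `labels` method are hypothetical, the constant values are assumptions (consult the POS_SEP and HOLE fields for the real ones), and position increments are passed in as a plain array rather than read from token attributes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the Lucene implementation. */
final class LabelPathSketch {
    // Assumed label values, chosen for illustration.
    static final int POS_SEP = 0x001f;
    static final int HOLE = 0x001e;

    /**
     * Builds the label sequence of the single path for a linear token stream.
     * posIncs[i] is the position increment of tokens[i]; a value above 1
     * means (posIncs[i] - 1) positions were skipped before that token.
     */
    static List<Integer> labels(String[] tokens, int[] posIncs) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // One HOLE label per skipped position.
            for (int skipped = 1; skipped < posIncs[i]; skipped++) {
                separate(out);
                out.add(HOLE);
            }
            separate(out);
            // The token itself contributes its UTF-8 bytes as unsigned labels.
            for (byte b : tokens[i].getBytes(StandardCharsets.UTF_8)) {
                out.add(b & 0xFF);
            }
        }
        return out;
    }

    // Adds POS_SEP between adjacent positions, but not before the first one.
    private static void separate(List<Integer> out) {
        if (!out.isEmpty()) {
            out.add(POS_SEP);
        }
    }

    public static void main(String[] args) {
        // "ab", then one skipped position, then "c".
        System.out.println(labels(new String[] {"ab", "c"}, new int[] {1, 2}));
        // prints [97, 98, 31, 30, 31, 99]
    }
}
```

The real class builds a graph rather than a flat list, so tokens that span several positions or share a position produce branching paths that this sketch does not model.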
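Field Summary
Fields
Modifier and Type | Field | Description
static final int | HOLE | We add this arc to represent a hole.
static final int | POS_SEP | We create this transition between two adjacent tokens.
Constructor Summary
Constructors
Constructor | Description
TokenStreamToAutomaton() | Sole constructor.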
Method Summary
Modifier and Type | Method | Description
protected BytesRef | changeToken(BytesRef in) | Subclass and override this if you need to change the token (such as escaping certain bytes) before it is turned into a graph.
void | setPreservePositionIncrements(boolean enablePositionIncrements) | Whether to generate holes in the automaton for missing positions; true by default.
void | setUnicodeArcs(boolean unicodeArcs) | Whether to make transition labels Unicode code points instead of UTF-8 bytes; false by default.
Automaton | toAutomaton(TokenStream in) | Pulls the graph (including PositionLengthAttribute) from the provided TokenStream, and creates the corresponding automaton where arcs are bytes (or Unicode code points if unicodeArcs = true) from each term.
Field Details
POS_SEP
public static final int POS_SEP
We create this transition between two adjacent tokens.
HOLE
public static final int HOLE
We add this arc to represent a hole.
Constructor Details
TokenStreamToAutomaton
public TokenStreamToAutomaton()
Sole constructor.
Method Details
setPreservePositionIncrements
public void setPreservePositionIncrements(boolean enablePositionIncrements)
Whether to generate holes in the automaton for missing positions; true by default.
setUnicodeArcs
public void setUnicodeArcs(boolean unicodeArcs)
Whether to make transition labels Unicode code points instead of UTF-8 bytes; false by default.
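To make the difference between the two labelling modes concrete, the following sketch computes both label sequences for one term using only the JDK. The class `ArcLabelSketch` and its methods are hypothetical helpers for illustration and are not part of Lucene; they show what "UTF-8 bytes" versus "Unicode code points" means for a non-ASCII term, not how the automaton itself is built.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch only; not part of Lucene. */
final class ArcLabelSketch {
    // Labels when arcs are UTF-8 bytes: one unsigned byte per label.
    static int[] utf8Labels(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            labels[i] = bytes[i] & 0xFF;
        }
        return labels;
    }

    // Labels when arcs are Unicode code points: one code point per label.
    static int[] codePointLabels(String term) {
        return term.codePoints().toArray();
    }

    public static void main(String[] args) {
        String term = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(Arrays.toString(utf8Labels(term)));      // prints [195, 169]
        System.out.println(Arrays.toString(codePointLabels(term))); // prints [233]
    }
}
```

With byte labels a single character may span several transitions, so code point labels are the more natural fit when the resulting automaton is matched against characters rather than encoded bytes.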
changeToken
protected BytesRef changeToken(BytesRef in)
Subclass and override this if you need to change the token (such as escaping certain bytes) before it is turned into a graph.
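One reason to escape bytes is to keep token content from colliding with the reserved labels. The sketch below shows one possible escaping scheme on a plain byte array so that it stays self-contained; in an actual subclass the same logic would sit inside an override of changeToken and operate on the BytesRef it receives. The class `EscapeSketch`, the method `escapeReserved`, the escape byte, and the reserved label values are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative sketch only; a hypothetical escaping scheme, not Lucene's. */
final class EscapeSketch {
    static final int POS_SEP = 0x1f; // assumed reserved label
    static final int HOLE = 0x1e;    // assumed reserved label
    static final int ESC = 0x5c;     // hypothetical escape byte ('\')

    /**
     * Rewrites the input so that the reserved bytes never appear in the
     * output: each reserved byte (and ESC itself) becomes ESC followed by
     * the original byte with bit 0x40 flipped.
     */
    static byte[] escapeReserved(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 4);
        for (byte raw : in) {
            int b = raw & 0xFF;
            if (b == POS_SEP || b == HOLE || b == ESC) {
                out.write(ESC);
                out.write(b ^ 0x40); // 0x1f -> 0x5f, 0x1e -> 0x5e, 0x5c -> 0x1c
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

Flipping a bit instead of merely prefixing the byte means the escaped token contains neither reserved value, and the transformation can be reversed by anyone who knows the escape byte.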
toAutomaton
public Automaton toAutomaton(TokenStream in) throws IOException
Pulls the graph (including PositionLengthAttribute) from the provided TokenStream, and creates the corresponding automaton where arcs are bytes (or Unicode code points if unicodeArcs = true) from each term.
Throws:
IOException