Package org.apache.lucene.analysis.morph
Class Viterbi<T extends Token,U extends Viterbi.Position>
java.lang.Object
org.apache.lucene.analysis.morph.Viterbi<T,U>
- Type Parameters:
T - output token class
U - position class
- Direct Known Subclasses:
ViterbiNBest
Performs the Viterbi algorithm for morphological tokenizers, which split text using a Hidden Markov Model or Conditional Random Fields.
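As a rough illustration of the minimum-cost path idea behind this class, the sketch below segments a string with a toy dictionary. It is not Lucene code: `ViterbiSketch`, `demoCosts`, `segment`, the word costs, and the flat `CONNECTION_COST` (a stand-in for a real connection-cost lookup) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy minimum-cost segmentation. Not Lucene code; every name and cost is illustrative. */
final class ViterbiSketch {
  /** Flat stand-in for a connection-cost lookup between adjacent tokens (assumption). */
  static final int CONNECTION_COST = 1;

  /** Hypothetical dictionary: surface form -> word cost. */
  static Map<String, Integer> demoCosts() {
    Map<String, Integer> costs = new HashMap<>();
    costs.put("to", 3);
    costs.put("kyo", 3);
    costs.put("tokyo", 4);
    costs.put("kyoto", 6);
    return costs;
  }

  /** Returns the cheapest segmentation of text, or an empty list if none exists. */
  static List<String> segment(String text, Map<String, Integer> wordCosts) {
    int n = text.length();
    int[] best = new int[n + 1];      // best[i]: cheapest cost of reaching offset i
    int[] backStart = new int[n + 1]; // backStart[i]: start offset of the best token ending at i
    Arrays.fill(best, Integer.MAX_VALUE);
    Arrays.fill(backStart, -1);
    best[0] = 0;

    // Forward pass: extend every reachable offset with each dictionary word that matches there.
    for (int start = 0; start < n; start++) {
      if (best[start] == Integer.MAX_VALUE) {
        continue; // offset not reachable
      }
      for (int end = start + 1; end <= n; end++) {
        Integer wordCost = wordCosts.get(text.substring(start, end));
        if (wordCost == null) {
          continue;
        }
        int cost = best[start] + wordCost + CONNECTION_COST;
        if (cost < best[end]) {
          best[end] = cost;
          backStart[end] = start;
        }
      }
    }

    List<String> tokens = new ArrayList<>();
    if (best[n] == Integer.MAX_VALUE) {
      return tokens; // no path covers the whole input
    }
    // Backward pass: follow back pointers from the end, then restore text order.
    for (int end = n; end > 0; end = backStart[end]) {
      tokens.add(text.substring(backStart[end], end));
    }
    Collections.reverse(tokens);
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(segment("tokyoto", demoCosts()));
  }
}
```

With these toy costs, "tokyo" followed by "to" is cheaper than "to" followed by "kyoto", so the first path wins. A real tokenizer would look up a connection cost per pair of adjacent tokens rather than use a constant.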
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | Viterbi.Position | Holds all back pointers arriving to this position.
static final class | Viterbi.WrappedPositionArray<U> | Holds partial graph (array of positions) for calculating the minimum cost path
-
Field Summary
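Fields
Modifier and Type | Field | Description
protected final RollingCharBuffer | buffer |
protected final ConnectionCosts | costs |
protected boolean | enableSpacePenaltyFactor |
protected boolean | end |
protected int | lastBackTracePos |
protected static final int | MAX_UNKNOWN_WORD_LENGTH |
protected boolean | outputLongestUserEntryOnly |
protected boolean | outputNBest |
protected int | pos |
protected final Viterbi.WrappedPositionArray<U> | positions |
protected static final boolean | VERBOSE |
protected final IntsRef | wordIdRef |
-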
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | Viterbi(TokenInfoFST fst, FST.BytesReader fstReader, BinaryDictionary<? extends MorphData> dictionary, TokenInfoFST userFST, FST.BytesReader userFSTReader, Dictionary<? extends MorphData> userDictionary, ConnectionCosts costs, Class<U> positionImpl) |
-
Method Summary
Modifier and Type | Method | Description
protected final void | add(MorphData morphData, Viterbi.Position fromPosData, int wordPos, int endPos, int wordID, TokenType type, boolean addPenalty) | Add a token on the minimum cost path to the pending token list.
protected abstract void | backtrace(Viterbi.Position endPosData, int fromIDX) | Backtrace from the provided position, back to the last time we back-traced, accumulating the resulting tokens to the pending list.
protected void | backtraceNBest(Viterbi.Position endPosData, boolean useEOS) | Backtrace the n-best path.
protected int | computePenalty(int pos, int length) | Returns the penalty for a specific input region
protected int | computeSpacePenalty(MorphData morphData, int wordID, int numSpaces) | Returns the space penalty.
protected void | fixupPendingList() | Remove duplicated tokens from the pending list; this is needed because backtrace(Position, int) and backtraceNBest(Position, boolean) can add same tokens to the list.
final void | forward() | Incrementally parse some more characters.
int | getPos() |
boolean | isEnd() |
boolean | isOutputNBest() |
protected abstract int | processUnknownWord(boolean anyMatches, Viterbi.Position posData) | Add unknown words to the position graph.
void | resetBuffer(Reader reader) |
void | resetState() |
protected boolean | shouldSkipProcessUnknownWord(int unknownWordEndIndex, Viterbi.Position posData) |
-
Field Details
-
VERBOSE
protected static final boolean VERBOSE
-
MAX_UNKNOWN_WORD_LENGTH
protected static final int MAX_UNKNOWN_WORD_LENGTH
-
costs
-
wordIdRef
-
buffer
-
positions
-
end
protected boolean end -
lastBackTracePos
protected int lastBackTracePos -
pos
protected int pos -
pending
-
outputNBest
protected boolean outputNBest -
enableSpacePenaltyFactor
protected boolean enableSpacePenaltyFactor -
outputLongestUserEntryOnly
protected boolean outputLongestUserEntryOnly
-
-
Constructor Details
-
Viterbi
protected Viterbi(TokenInfoFST fst, FST.BytesReader fstReader, BinaryDictionary<? extends MorphData> dictionary, TokenInfoFST userFST, FST.BytesReader userFSTReader, Dictionary<? extends MorphData> userDictionary, ConnectionCosts costs, Class<U> positionImpl)
-
-
Method Details
-
forward
Incrementally parse some more characters. This runs the Viterbi search forward "enough" to generate some more tokens. How far forward depends on the incoming characters, since some characters can cause longer-lasting ambiguity in the parse. Once the ambiguity is resolved, we backtrace, produce the pending tokens, and return.
- Throws:
IOException
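One way to picture "once the ambiguity is resolved" is to look for offsets that some candidate token ends at and that no candidate token spans across. The sketch below is a simplified, offline reading of that condition, not the check that `forward()` performs. `FrontierSketch`, `safeBacktracePoints`, and the `{start, end}` span layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy frontier detection over candidate token spans. Not Lucene code. */
final class FrontierSketch {
  /**
   * Returns every offset p in (0, length] such that at least one span ends at p
   * and no span strictly contains p. Each span is a hypothetical {start, end} pair.
   */
  static List<Integer> safeBacktracePoints(int[][] spans, int length) {
    List<Integer> points = new ArrayList<>();
    for (int p = 1; p <= length; p++) {
      boolean reached = false; // some candidate token ends exactly here
      boolean crossed = false; // some candidate token is still open here
      for (int[] span : spans) {
        if (span[1] == p) {
          reached = true;
        }
        if (span[0] < p && p < span[1]) {
          crossed = true;
        }
      }
      if (reached && !crossed) {
        points.add(p);
      }
    }
    return points;
  }

  public static void main(String[] args) {
    // Candidate spans over "tokyoto": to, kyo, tokyo, kyoto, to.
    int[][] spans = {{0, 2}, {2, 5}, {0, 5}, {2, 7}, {5, 7}};
    System.out.println(safeBacktracePoints(spans, 7));
  }
}
```

In the sample input the span for "kyoto" stays open across offset 5, so the only safe point is the end of the text. Removing that span makes offset 5 safe as well, which matches the intuition that longer overlapping candidates delay the backtrace.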
-
shouldSkipProcessUnknownWord
-
processUnknownWord
protected abstract int processUnknownWord(boolean anyMatches, Viterbi.Position posData) throws IOException
Add unknown words to the position graph.
- Returns:
- word length
- Throws:
IOException
-
backtrace
Backtrace from the provided position back to the last time we backtraced, accumulating the resulting tokens into the pending list. The pending list is then in reverse order (the last token in the list should be returned first).
- Throws:
IOException
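The reversed pending list can be sketched as follows. This is a minimal illustration, not the abstract method's contract: `BacktraceSketch`, the `backStart` array layout, and `drain` are hypothetical, and the sketch assumes the back pointers form a valid chain that lands exactly on `lastBackTracePos`.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy backtrace that fills a pending list in reverse text order. Not Lucene code. */
final class BacktraceSketch {
  /**
   * Walks back pointers from endPos down to lastBackTracePos.
   * backStart[end] is the start offset of the best token ending at end (assumed layout).
   */
  static List<String> backtrace(String text, int[] backStart, int endPos, int lastBackTracePos) {
    List<String> pending = new ArrayList<>();
    int end = endPos;
    while (end > lastBackTracePos) {
      int start = backStart[end];
      pending.add(text.substring(start, end)); // rightmost token is added first
      end = start;
    }
    return pending;
  }

  /** Consumes the pending list from its tail, which yields tokens in text order. */
  static List<String> drain(List<String> pending) {
    List<String> out = new ArrayList<>();
    while (!pending.isEmpty()) {
      out.add(pending.remove(pending.size() - 1));
    }
    return out;
  }

  public static void main(String[] args) {
    int[] backStart = {-1, -1, 0, -1, -1, 0, -1, 5};
    List<String> pending = backtrace("tokyoto", backStart, 7, 0);
    System.out.println(pending);
    System.out.println(drain(pending));
  }
}
```

Appending while walking backwards and then removing from the tail avoids reversing the list or shifting its elements on every token.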
-
backtraceNBest
Backtrace the n-best path. Subclasses that support n-best paths should implement this method.
- Throws:
IOException
-
fixupPendingList
protected void fixupPendingList()
Remove duplicated tokens from the pending list; this is needed because backtrace(Position, int) and backtraceNBest(Position, boolean) can add the same tokens to the list. Subclasses that support n-best paths should implement this method.
-
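A subclass might remove such duplicates along the following lines. This is a sketch under stated assumptions, not the behavior of any Lucene subclass: `PendingDedupSketch`, `Tok`, and `fixup` are hypothetical, and two tokens are treated as duplicates when they share a start offset and a length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy duplicate removal for a pending token list. Not Lucene code. */
final class PendingDedupSketch {
  /** Hypothetical token carrying only a start offset and a length. */
  static final class Tok {
    final int offset;
    final int length;

    Tok(int offset, int length) {
      this.offset = offset;
      this.length = length;
    }

    @Override
    public String toString() {
      return "(" + offset + "," + length + ")";
    }
  }

  /** Sorts by offset then length, both descending, and drops adjacent duplicates in place. */
  static void fixup(List<Tok> pending) {
    Comparator<Tok> byOffsetThenLength =
        Comparator.comparingInt((Tok t) -> t.offset).thenComparingInt(t -> t.length);
    // Descending order keeps the list reversed, so the tail still holds the earliest token.
    pending.sort(byOffsetThenLength.reversed());
    int i = 1;
    while (i < pending.size()) {
      Tok prev = pending.get(i - 1);
      Tok cur = pending.get(i);
      if (prev.offset == cur.offset && prev.length == cur.length) {
        pending.remove(i); // same span as its neighbour
      } else {
        i++;
      }
    }
  }

  public static void main(String[] args) {
    List<Tok> pending = new ArrayList<>(
        Arrays.asList(new Tok(5, 2), new Tok(0, 5), new Tok(5, 2), new Tok(0, 2)));
    fixup(pending);
    System.out.println(pending);
  }
}
```

Sorting first means duplicates end up next to each other, so a single linear pass is enough to remove them.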
add
protected final void add(MorphData morphData, Viterbi.Position fromPosData, int wordPos, int endPos, int wordID, TokenType type, boolean addPenalty) throws IOException
Add a token on the minimum cost path to the pending token list.
- Throws:
IOException
-
computeSpacePenalty
Returns the space penalty. -
computePenalty
Returns the penalty for a specific input region.
- Throws:
IOException
-
getPos
public int getPos() -
isEnd
public boolean isEnd() -
getPending
-
isOutputNBest
public boolean isOutputNBest() -
resetBuffer
-
resetState
public void resetState()
-