Class DefaultSimilarity
Encodes norm values as a single byte before they are stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value.
This encoding/decoding reduces index size at the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.875.
Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms - for all documents - are maintained in memory.
The rationale for such lossy compression of norm values is that, given how difficult it is for users to express their true information need in a query (and how inaccurate such queries are), only big differences matter.
Finally, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search, because norms are computed and stored at index time.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
Similarity.SimScorer, Similarity.SimWeight
Field Summary
Fields
Modifier and Type | Field | Description
protected boolean | discountOverlaps | True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
Constructor Summary
Constructors
Constructor | Description
DefaultSimilarity() | Sole constructor: parameter-free
Method Summary
Modifier and Type | Method | Description
float | coord(int overlap, int maxOverlap) | Implemented as overlap / maxOverlap.
final float | decodeNormValue(long norm) | Decodes the norm value, assuming it is a single byte.
final long | encodeNormValue(float f) | Encodes a normalization factor for storage in an index.
boolean | getDiscountOverlaps() | Returns true if overlap tokens are discounted from the document's length.
float | idf(long docFreq, long numDocs) | Implemented as log(numDocs/(docFreq+1)) + 1.
float | lengthNorm(FieldInvertState state) | Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap().
float | queryNorm(float sumOfSquaredWeights) | Implemented as 1/sqrt(sumOfSquaredWeights).
float | scorePayload(int doc, int start, int end, BytesRef payload) | The default implementation returns 1
void | setDiscountOverlaps(boolean v) | Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm.
float | sloppyFreq(int distance) | Implemented as 1 / (distance + 1).
float | tf(float freq) | Implemented as sqrt(freq).
String | toString() |
Methods inherited from class org.apache.lucene.search.similarities.TFIDFSimilarity
computeNorm, computeWeight, idfExplain, idfExplain, simScorer
-
Field Details
-
discountOverlaps
protected boolean discountOverlaps
True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
-
-
Constructor Details
-
DefaultSimilarity
public DefaultSimilarity()
Sole constructor: parameter-free
-
-
Method Details
-
coord
public float coord(int overlap, int maxOverlap)
Implemented as overlap / maxOverlap.
- Specified by:
coord in class TFIDFSimilarity
- Parameters:
overlap - the number of query terms matched in the document
maxOverlap - the total number of terms in the query
- Returns:
a score factor based on term overlap with the query
-
queryNorm
public float queryNorm(float sumOfSquaredWeights)
Implemented as 1/sqrt(sumOfSquaredWeights).
- Specified by:
queryNorm in class TFIDFSimilarity
- Parameters:
sumOfSquaredWeights - the sum of the squares of query term weights
- Returns:
a normalization factor for query weights
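The two query-level factors above can be illustrated with a small standalone sketch. It only mirrors the documented formulas and does not call Lucene; the class name `QueryFactorsSketch` is hypothetical, and the use of floating-point division in `coord` is an assumption, since the formula as written does not say how the integers are divided.

```java
// Standalone sketch: mirrors the documented formulas, does not call Lucene.
final class QueryFactorsSketch {

    /** Fraction of query terms matched in the document: overlap / maxOverlap. */
    static float coord(int overlap, int maxOverlap) {
        // Assumption: floating-point division, so 2 of 4 terms gives 0.5 rather than 0.
        return overlap / (float) maxOverlap;
    }

    /** Query normalization factor: 1 / sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        System.out.println(coord(2, 4));      // 2 of 4 query terms matched
        System.out.println(queryNorm(4.0f));  // 1 / sqrt(4)
    }
}
```

Under these assumptions, a document matching two of four query terms gets a coord factor of 0.5, and a sum of squared weights of 4 also yields a query norm of 0.5.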
-
encodeNormValue
public final long encodeNormValue(float f)
Encodes a normalization factor for storage in an index.
The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.
- Specified by:
encodeNormValue in class TFIDFSimilarity
-
decodeNormValue
public final float decodeNormValue(long norm)
Decodes the norm value, assuming it is a single byte.
- Specified by:
decodeNormValue in class TFIDFSimilarity
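To make the lossy round trip concrete, here is a minimal standalone sketch of a one-byte float encoding with the rounding behaviour described for encodeNormValue (negatives to zero, overflow to the largest value, underflow to the smallest positive value). It does not call Lucene. The class name `NormByteSketch`, the helper names `encode` and `decode`, the constant `FZERO`, and the exact bit layout are assumptions made for illustration.

```java
// Standalone sketch of a lossy float <-> byte encoding; does not call Lucene.
// The bit layout below is an assumption based on the description above.
final class NormByteSketch {

    // Offset mapping the float exponent range onto the byte's range
    // (zero-exponent point at 15, exponent placed above the three low bits).
    private static final int FZERO = (63 - 15) << 3;

    /** Encodes a float into a single byte, losing precision. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);          // drop the low-order mantissa bits
        if (small <= FZERO) {
            return (bits <= 0) ? (byte) 0      // zero and negative inputs map to 0
                               : (byte) 1;     // underflow maps to the smallest positive value
        }
        if (small >= FZERO + 0x100) {
            return (byte) -1;                  // overflow maps to the largest value (0xFF)
        }
        return (byte) (small - FZERO);
    }

    /** Decodes a byte produced by encode back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);     // restore the kept bits to their float position
        bits += (63 - 15) << 24;               // undo the exponent offset
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(1.0f)));    // exactly representable
        System.out.println(decode(encode(0.447f)));  // rounded down on the way in
    }
}
```

In this sketch, decode(encode(1.0f)) returns 1.0, while decode(encode(0.447f)) returns 0.4375, so the round trip does not preserve the input in general.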
-
lengthNorm
public float lengthNorm(FieldInvertState state)
Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap().
- Specified by:
lengthNorm in class TFIDFSimilarity
- Parameters:
state - statistics of the current field (such as length, boost, etc)
- Returns:
an index-time normalization value
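The effect of discounting overlap tokens can be sketched as follows. This is a standalone illustration that does not use FieldInvertState. The class `LengthNormSketch` and its parameters are hypothetical stand-ins for the field statistics, and the inner lengthNorm(numTerms) is assumed to be 1/sqrt(numTerms), which the text above does not spell out.

```java
// Standalone sketch: plain parameters stand in for FieldInvertState; does not call Lucene.
final class LengthNormSketch {

    /**
     * boost            - stand-in for state.getBoost()
     * length           - stand-in for FieldInvertState.getLength()
     * numOverlap       - stand-in for FieldInvertState.getNumOverlap()
     * discountOverlaps - whether overlap tokens are excluded from the length
     */
    static float lengthNorm(float boost, int length, int numOverlap, boolean discountOverlaps) {
        int numTerms = discountOverlaps ? length - numOverlap : length;
        // Assumption: lengthNorm(numTerms) is 1 / sqrt(numTerms).
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A field of 5 tokens, one of which is an overlap token (position increment 0).
        System.out.println(lengthNorm(1.0f, 5, 1, true));   // norm based on 4 terms
        System.out.println(lengthNorm(1.0f, 5, 1, false));  // norm based on 5 terms
    }
}
```

With discounting enabled, the overlap token does not lengthen the field, so the norm is slightly higher than with discounting disabled.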
-
tf
public float tf(float freq)
Implemented as sqrt(freq).
- Specified by:
tf in class TFIDFSimilarity
- Parameters:
freq - the frequency of a term within a document
- Returns:
a score factor based on a term's within-document frequency
-
sloppyFreq
public float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
- Specified by:
sloppyFreq in class TFIDFSimilarity
- Parameters:
distance - the edit distance of this sloppy phrase match
- Returns:
the frequency increment for this match
-
scorePayload
public float scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1
- Specified by:
scorePayload in class TFIDFSimilarity
- Parameters:
doc - The docId currently being scored.
start - The start position of the payload
end - The end position of the payload
payload - The payload byte array to be scored
- Returns:
An implementation dependent float to be used as a scoring factor
-
idf
public float idf(long docFreq, long numDocs)
Implemented as log(numDocs/(docFreq+1)) + 1.
- Specified by:
idf in class TFIDFSimilarity
- Parameters:
docFreq - the number of documents which contain the term
numDocs - the total number of documents in the collection
- Returns:
a score factor based on the term's document frequency
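The tf and idf formulas documented above can be tried out with a short standalone sketch. It does not call Lucene; the class name `TfIdfSketch` is hypothetical, and two details are assumptions: `log` is taken to be the natural logarithm, and the division is done in floating point.

```java
// Standalone sketch: mirrors the documented tf and idf formulas, does not call Lucene.
final class TfIdfSketch {

    /** Term frequency factor: sqrt(freq). */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Inverse document frequency factor: log(numDocs / (docFreq + 1)) + 1. */
    static float idf(long docFreq, long numDocs) {
        // Assumptions: natural logarithm and floating-point division.
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(tf(4.0f));        // sqrt(4)
        System.out.println(idf(9, 10));      // log(10 / 10) + 1
        System.out.println(idf(0, 1000));    // rare term, larger factor
        System.out.println(idf(499, 1000));  // common term, smaller factor
    }
}
```

Under these assumptions, tf grows sublinearly with the raw frequency, and a term that occurs in few documents receives a larger idf factor than one that occurs in many.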
-
setDiscountOverlaps
public void setDiscountOverlaps(boolean v)
Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
getDiscountOverlaps
public boolean getDiscountOverlaps()
Returns true if overlap tokens are discounted from the document's length.
-
toString
-