Class BM25Similarity
java.lang.Object
org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.BM25Similarity
BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker,
 Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3.
 In Proceedings of the Third Text REtrieval Conference (TREC 1994).
 Gaithersburg, USA, November 1994.
- 
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
Similarity.SimScorer, Similarity.SimWeight
- 
Field Summary
Fields
- protected boolean discountOverlaps: True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
- 
Constructor Summary
Constructors
- BM25Similarity(): BM25 with these default values: k1 = 1.2, b = 0.75.
- BM25Similarity(float k1, float b): BM25 with the supplied parameter values.
- 
Method Summary
Methods
- protected float avgFieldLength(CollectionStatistics collectionStats): The default implementation computes the average as sumTotalTermFreq / maxDoc, or returns 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
- final long computeNorm(FieldInvertState state): Computes the normalization value for a field, given the accumulated state of term processing for this field (see FieldInvertState).
- final Similarity.SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats): Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
- protected float decodeNormValue(byte b)
- protected byte encodeNormValue(float boost, int fieldLength): The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float).
- float getB(): Returns the b parameter.
- boolean getDiscountOverlaps(): Returns true if overlap tokens are discounted from the document's length.
- float getK1(): Returns the k1 parameter.
- protected float idf(long docFreq, long numDocs): Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
- Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats): Computes a score factor for a simple term and returns an explanation for that score factor.
- Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics[] termStats): Computes a score factor for a phrase.
- protected float scorePayload(int doc, int start, int end, BytesRef payload): The default implementation returns 1.
- void setDiscountOverlaps(boolean v): Sets whether overlap tokens (tokens with 0 position increment) are ignored when computing norm.
- final Similarity.SimScorer simScorer(Similarity.SimWeight stats, AtomicReaderContext context): Creates a new Similarity.SimScorer to score matching documents from a segment of the inverted index.
- protected float sloppyFreq(int distance): Implemented as 1 / (distance + 1).
- String toString()
Methods inherited from class org.apache.lucene.search.similarities.Similarity
coord, queryNorm
- 
Field Details
discountOverlaps
protected boolean discountOverlaps
True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
 
- 
- 
Constructor Details
BM25Similarity
public BM25Similarity(float k1, float b)
BM25 with the supplied parameter values.
- Parameters:
- k1 - Controls non-linear term frequency normalization (saturation).
- b - Controls to what degree document length normalizes tf values.
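To make the roles of k1 and b concrete, here is a minimal sketch using the textbook BM25 term-frequency factor, tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)). This formula is an assumption taken from the general BM25 literature, not from this page. The class `Bm25TfSketch` and its method are hypothetical names, and the scorer in this class works on encoded norms, so its arithmetic may differ in detail.

```java
/** Hypothetical sketch of the textbook BM25 term-frequency factor; not Lucene code. */
final class Bm25TfSketch {
    /**
     * @param tf        term frequency within the document
     * @param docLen    field length of the document, in tokens
     * @param avgDocLen average field length across the collection
     * @param k1        saturation parameter
     * @param b         length-normalization parameter, usually in [0, 1]
     */
    static float tfFactor(float tf, float docLen, float avgDocLen, float k1, float b) {
        // b = 0 ignores document length; b = 1 normalizes fully by relative length.
        float lengthNorm = 1 - b + b * (docLen / avgDocLen);
        // The factor grows with tf but never exceeds k1 + 1 (saturation).
        return tf * (k1 + 1) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        float k1 = 1.2f, b = 0.75f; // the documented defaults
        for (int tf = 1; tf <= 16; tf *= 2) {
            System.out.println("tf=" + tf + " factor=" + tfFactor(tf, 100, 100, k1, b));
        }
    }
}
```

Under these assumptions, a larger k1 delays saturation, and a larger b penalizes documents that are longer than average more strongly.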
 
- 
BM25Similarity
public BM25Similarity()
BM25 with these default values:
- k1 = 1.2,
- b = 0.75.
 
 
- 
- 
Method Details
idf
protected float idf(long docFreq, long numDocs)
Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
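The idf formula above can be reproduced with plain Java arithmetic. The sketch below is a standalone illustration; the class name `IdfSketch` is hypothetical and nothing here calls into Lucene.

```java
/** Hypothetical standalone sketch of the documented idf formula. */
final class IdfSketch {
    /** log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)), computed in double and narrowed to float. */
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        // Rare terms receive a larger idf than common ones.
        System.out.println("rare term:   " + idf(1, numDocs));
        System.out.println("common term: " + idf(900, numDocs));
    }
}
```

Because of the leading `1 +` inside the logarithm, the result stays positive even when a term occurs in every document.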
- 
sloppyFreq
protected float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
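As a quick illustration of how this weight falls off with distance, the following standalone sketch evaluates the same expression. The class name `SloppyFreqSketch` is hypothetical.

```java
/** Hypothetical standalone sketch of the documented sloppy-frequency formula. */
final class SloppyFreqSketch {
    /** 1 / (distance + 1), using float division so the result is not truncated to an integer. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println("distance=" + distance + " weight=" + sloppyFreq(distance));
        }
    }
}
```

An exact match (distance 0) contributes a full occurrence, and each extra position of slop reduces the contribution.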
- 
scorePayload
protected float scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1.
- 
avgFieldLength
protected float avgFieldLength(CollectionStatistics collectionStats)
The default implementation computes the average as sumTotalTermFreq / maxDoc, or returns 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
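The fallback behaviour can be sketched in a few lines. This is an illustration only: the class `AvgFieldLengthSketch` is hypothetical, and it assumes that a missing sumTotalTermFreq is reported as a non-positive value such as -1.

```java
/** Hypothetical standalone sketch of the documented average-field-length rule. */
final class AvgFieldLengthSketch {
    /**
     * @param sumTotalTermFreq total number of tokens in the field across the index;
     *                         assumed to be non-positive (e.g. -1) when not stored
     * @param maxDoc           number of documents in the index
     */
    static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (sumTotalTermFreq <= 0) {
            return 1f; // statistic unavailable: fall back to 1 as documented
        }
        return (float) (sumTotalTermFreq / (double) maxDoc);
    }

    public static void main(String[] args) {
        System.out.println(avgFieldLength(1000, 10)); // statistic present
        System.out.println(avgFieldLength(-1, 10));   // statistic missing
    }
}
```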
- 
encodeNormValue
protected byte encodeNormValue(float boost, int fieldLength)
The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, then you should change decodeNormValue(byte) to match.
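The value that gets encoded is easy to compute by hand. The sketch below shows only the floating-point step, boost / sqrt(length). The lossy one-byte encoding done by SmallFloat.floatToByte315(float) is deliberately left out, and the class name `NormValueSketch` is hypothetical.

```java
/** Hypothetical standalone sketch of the pre-encoding norm value; the byte encoding is omitted. */
final class NormValueSketch {
    /** boost / sqrt(fieldLength); callers are assumed to pass fieldLength > 0. */
    static float normValue(float boost, int fieldLength) {
        return boost / (float) Math.sqrt(fieldLength);
    }

    public static void main(String[] args) {
        // Longer fields produce smaller values for the same boost.
        System.out.println(normValue(1f, 4));
        System.out.println(normValue(1f, 100));
    }
}
```

Since the stored norm is a single byte, many distinct field lengths end up sharing the same encoded value, so a decoded norm is only an approximation of this number.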
- 
decodeNormValue
protected float decodeNormValue(byte b)
- 
setDiscountOverlaps
public void setDiscountOverlaps(boolean v)
Sets whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
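The effect of this flag on the length used for the norm can be sketched as follows. The class `OverlapLengthSketch` and its parameters are hypothetical; the sketch assumes the raw length counts every token, including overlaps, and that the number of overlap tokens is tracked separately.

```java
/** Hypothetical standalone sketch of how discounting overlaps changes the field length. */
final class OverlapLengthSketch {
    /**
     * @param length           total number of tokens in the field, overlaps included (assumption)
     * @param numOverlap       number of tokens with a position increment of zero
     * @param discountOverlaps whether overlap tokens are excluded from the length
     */
    static int effectiveLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }

    public static void main(String[] args) {
        // A field with 10 tokens, 3 of which are synonyms injected at the same position.
        System.out.println(effectiveLength(10, 3, true));
        System.out.println(effectiveLength(10, 3, false));
    }
}
```

With the default of true, injecting synonyms at the same position does not make a field look longer, so it does not lower the scores of matches in that field.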
- 
getDiscountOverlaps
public boolean getDiscountOverlaps()
Returns true if overlap tokens are discounted from the document's length.
 
- 
computeNorm
public final long computeNorm(FieldInvertState state)
Description copied from class: Similarity
Computes the normalization value for a field, given the accumulated state of term processing for this field (see FieldInvertState).
Matches in longer fields are less precise, so implementations of this method usually set smaller values when state.getLength() is large, and larger values when state.getLength() is small.
- Specified by:
- computeNorm in class Similarity
- Parameters:
- state - current processing state for this field
- Returns:
- computed norm value
 
- 
idfExplain
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses: idf(docFreq, searcher.maxDoc());
Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.
- Parameters:
- collectionStats - collection-level statistics
- termStats - term-level statistics for the term
- Returns:
- an Explain object that includes both an idf score factor and an explanation for the term.
 
- 
idfExplain
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics[] termStats)
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
- Parameters:
- collectionStats - collection-level statistics
- termStats - term-level statistics for the terms in the phrase
- Returns:
- an Explain object that includes both an idf score factor for the phrase and an explanation for each term.
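The summing behaviour for phrases can be illustrated without Lucene. In the sketch below, the class `PhraseIdfSketch` and the `phraseIdf` helper are hypothetical; the per-term idf uses the formula documented for idf(long, long), and the document count stands in for CollectionStatistics.maxDoc().

```java
/** Hypothetical standalone sketch: a phrase's idf as the sum of its terms' idf values. */
final class PhraseIdfSketch {
    /** Per-term idf, as documented: log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)). */
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Sums the idf of every term in the phrase; an empty phrase yields 0. */
    static float phraseIdf(long[] docFreqs, long numDocs) {
        float sum = 0f;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum;
    }

    public static void main(String[] args) {
        long numDocs = 100;
        long[] docFreqs = {5, 20}; // document frequencies of the two phrase terms
        System.out.println("phrase idf = " + phraseIdf(docFreqs, numDocs));
    }
}
```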
 
- 
computeWeight
public final Similarity.SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats)
Description copied from class: Similarity
Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
- Specified by:
- computeWeight in class Similarity
- Parameters:
- queryBoost - the query-time boost.
- collectionStats - collection-level statistics, such as the number of tokens in the collection.
- termStats - term-level statistics, such as the document frequency of a term across the collection.
- Returns:
- SimWeight object with the information this Similarity needs to score a query.
 
- 
simScorer
public final Similarity.SimScorer simScorer(Similarity.SimWeight stats, AtomicReaderContext context) throws IOException
Description copied from class: Similarity
Creates a new Similarity.SimScorer to score matching documents from a segment of the inverted index.
- Specified by:
- simScorer in class Similarity
- Parameters:
- stats - collection information from Similarity.computeWeight(float, CollectionStatistics, TermStatistics...)
- context - segment of the inverted index to be scored.
- Returns:
- SimScorer for scoring documents across context
- Throws:
- IOException - if there is a low-level I/O error
 
- 
toString
public String toString()
- 
getK1
public float getK1()
Returns the k1 parameter.
 
- 
getB
public float getB()
Returns the b parameter.
 
 
-