Apache Jackrabbit : Search

Note: the following applies to Jackrabbit 2.x and does not apply to Jackrabbit Oak.

Search
Features
Search Configuration
Proprietary Features
Fulltext Indexing of Chinese, Japanese and Korean
Rebuilding the Index
Analyzing Query Performance
SQL-2
Further Development

Features

Node names and property values are indexed as soon as the data is saved or as soon as the transaction is committed.

Text extraction is done asynchronously in a background thread. That means changed or added text is not searchable immediately, but only after a short delay. The exact behavior can be configured using the extractor* settings (extractorPoolSize, extractorTimeout and extractorBackLogSize).
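As a rough illustration of the extractor* settings, the fragment below sets the three parameters explicitly on the search index element. The values are arbitrary examples chosen for this sketch, not recommendations, and the fragment assumes the default Lucene-based implementation configured as a SearchIndex element.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Example values only: up to 2 background extraction threads -->
  <param name="extractorPoolSize" value="2"/>
  <!-- Milliseconds to wait before extraction continues in the background -->
  <param name="extractorTimeout" value="100"/>
  <!-- Queue length before extraction falls back to the current thread -->
  <param name="extractorBackLogSize" value="100"/>
</SearchIndex>
```

With extractorPoolSize set to zero, extraction runs in the current thread and extractorTimeout has no effect, as noted in the parameter table.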

Search Configuration

The search index in Jackrabbit is pluggable and has a default implementation based on Apache Lucene. Once a workspace has been created, its index is configured in that workspace's workspace.xml file. For new workspaces, the configuration in the file repository.xml is used as a template.

To disable the search index, comment out the index configuration in repository.xml and in each workspace.xml file.
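A minimal sketch of a disabled index configuration, assuming the default Lucene-based SearchIndex element, might look like this. The path value is only the example location mentioned in the parameter table.

```xml
<!-- Search index disabled: the whole element is commented out.
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
</SearchIndex>
-->
```

Because repository.xml only serves as a template for new workspaces, the same change is needed in the workspace.xml of every workspace that already exists.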

This default implementation has the following options:

Parameter	Default Value	Description	Since
path	none	The location of the index directory. This parameter is mandatory. A reasonable value is `${wsp.home}/index`.	1.0
useCompoundFile	true	Advises Lucene to use compound files for the index files.	1.0
minMergeDocs	100	Minimum number of nodes in an index until segments are merged	1.0
volatileIdleTime	3	Idle time in seconds until the volatile index part is moved to a persistent index even though minMergeDocs is not reached.	1.0
maxMergeDocs	100000, >=1.4: 2147483647	Maximum number of nodes in segments that will be merged. The default value changed in Jackrabbit 1.4 to Integer.MAX_VALUE.	1.0
mergeFactor	10	Determines how often segment indices are merged.	1.0
maxFieldLength	10000	The number of words that are fulltext indexed at most per property.	1.1
bufferSize	10	Maximum number of documents that are held in a pending queue until added to the index	1.0
cacheSize	1000	Size of the document number cache. This cache maps UUIDs to Lucene document numbers.	1.0
forceConsistencyCheck	false	Runs a consistency check on every startup. If false, a consistency check is only performed when the search index detects a prior forced shutdown.	1.0
autoRepair	true	Errors detected by a consistency check are automatically repaired. If false, errors are only written to the log.	1.0
analyzer	`org.apache.lucene.analysis.standard.StandardAnalyzer`	Class name of a Lucene analyzer to use for fulltext indexing of text.	1.0
queryClass	`org.apache.jackrabbit.core.query.QueryImpl`	Class name that implements the `javax.jcr.query.Query` interface. This class must also extend from the class: `org.apache.jackrabbit.core.query.AbstractQueryImpl`	1.0
respectDocumentOrder	true, >=1.5: false	If true and the query does not contain an 'order by' clause, result nodes are returned in document order. For better performance when queries return many nodes, set this to 'false' (the default since 1.5).	1.0
textFilterClasses	`org.apache.jackrabbit.core.query.lucene.TextPlainTextFilter`	Sets the list of text filters (and text extractors) to use for extracting text content from binary properties. The list must be comma (or whitespace) separated, and contain fully qualified class names of the `TextFilter` (and since 1.3 `TextExtractor` ) classes to be used. The configured classes must all have a public default constructor.	1.0
resultFetchSize	2147483647	The number of results the query handler should initially fetch when a query is executed. Default value: Integer.MAX_VALUE (-> all)	1.2.1
extractorPoolSize	0, >=1.5: twice #ofAvailProcessors	Defines the maximum number of background threads that are used to extract text from binary properties. If set to zero no background threads are allocated and text extractors run in the current thread.	1.3
extractorTimeout	100	A text extractor is executed using a background thread if it doesn't finish within this timeout defined in milliseconds. This parameter has no effect if extractorPoolSize is zero.	1.3
extractorBackLogSize	100, >=1.6: 2147483647	The size of the extractor pool backlog. If all threads in the pool are busy, incoming work is put into a wait queue. If the wait queue reaches the backlog size, incoming extractor work is no longer queued but is executed in the current thread.	1.3
excerptProviderClass	1.3: `org.apache.jackrabbit.core.query.lucene.DefaultXMLExcerpt`, >=1.4: `org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt`	The name of the class that implements `org.apache.jackrabbit.core.query.lucene.ExcerptProvider` and should be used for the rep:excerpt() function in a query.	1.3
supportHighlighting	false	If set to `true` additional information is stored in the index to support highlighting using the rep:excerpt() function.	1.3
synonymProviderClass	none	The name of a class that implements `org.apache.jackrabbit.core.query.lucene.SynonymProvider`. The default value is null (-> not set).	1.4
synonymProviderConfigPath	none	The path to the synonym provider configuration file. This path is interpreted relative to the `path` parameter. If there is a `FileSystem` element inside the `SearchIndex` element, then this path is interpreted relative to the root path of the `FileSystem`. Whether this parameter is mandatory depends on the synonym provider implementation. The default value is null (-> not set).	1.4
indexingConfiguration	none	The path to the indexing configuration file. See also IndexingConfiguration	1.4
indexingConfigurationClass	`org.apache.jackrabbit.core.query.lucene.IndexingConfigurationImpl`	The name of the class that implements `org.apache.jackrabbit.core.query.lucene.IndexingConfiguration`. See also IndexingConfiguration.	1.4
enableConsistencyCheck	false	If set to `true` a consistency check is performed depending on the parameter forceConsistencyCheck. If set to `false` no consistency check is performed on startup, even if a redo log had been applied.	1.4
spellCheckerClass	none	The name of a class that implements `org.apache.jackrabbit.core.query.lucene.SpellChecker`. See also SpellChecker	1.4
similarityClass	Depends on what `Similarity.getDefault()` returns	The name of a class that extends `org.apache.lucene.search.Similarity`.	1.5
maxVolatileIndexSize	1048576	The maximum volatile index size in bytes until it is written to disk. The default value is 1MB.	1.6
initializeHierarchyCache	true	With the default value of `true` the hierarchy cache is initialized on startup and control is only given back when the initialization has completed. When set to `false` the cache is populated during regular use.	1.6

Note: all parameters (except path) have default values and can be omitted to use the default.
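For orientation, a configuration that sets only the mandatory path parameter and overrides two defaults could be sketched as follows. The overridden values are illustrative assumptions; every parameter that is omitted keeps the default listed in the table.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Mandatory: location of the index directory -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- Optional overrides; the values here are examples only -->
  <param name="respectDocumentOrder" value="false"/>
  <param name="resultFetchSize" value="100"/>
</SearchIndex>
```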

Proprietary Features

Jackrabbit supports some advanced features, which are not specified in JSR 170:

Extract text from binary content: TextExtractor; TextExtractorExamples
Get a text excerpt with highlighted words that matched the query: ExcerptProvider
Search for a term and its synonyms: SynonymSearch
Search for similar nodes: SimilaritySearch
Define index aggregates, rules and scores: IndexingConfiguration
Check spelling of a fulltext query statement: SpellChecker
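Several of these features are switched on through the parameters in the configuration table. As one hedged example, excerpts with highlighting might be enabled like this, using the excerpt provider class named in the table as the default for 1.4 and later.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Store the extra index information needed by rep:excerpt() -->
  <param name="supportHighlighting" value="true"/>
  <!-- Excerpt provider; this class is the listed default since 1.4 -->
  <param name="excerptProviderClass"
         value="org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt"/>
</SearchIndex>
```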

Fulltext Indexing of Chinese, Japanese and Korean

To index documents written in Chinese, Japanese or Korean, use the analyzer org.apache.lucene.analysis.cjk.CJKAnalyzer. Due to a limitation of PDFBox, some PDF files may be indexed incorrectly or not at all. If this is the case, a warning message is written to the log file ("Failed to extract PDF text content").
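The analyzer is selected with the analyzer parameter from the configuration table. A minimal sketch, assuming the default SearchIndex element and the example index path:

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Replace the default StandardAnalyzer with the CJK analyzer -->
  <param name="analyzer"
         value="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</SearchIndex>
```

Content that was indexed before the analyzer was changed was presumably processed with the previous analyzer, so a rebuild of the index is likely needed for the change to affect existing content.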

Rebuilding the Index

After a power outage or after killing the process, the index may become inconsistent. To rebuild the index, stop Jackrabbit, delete the index directories, and start Jackrabbit again. The index is then rebuilt automatically. There is one index directory for each workspace at <repositoryHome>/<workspaceName>/index, plus one index directory for the version store at <repositoryHome>/repository/index.
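The consistency-check parameters in the configuration table address the same kind of failure. The fragment below is only a sketch of how they could be combined and is not a substitute for the rebuild procedure above. Whether a consistency check is sufficient in a given case is an assumption that has to be verified against the log output.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Allow consistency checks at all (default: false) -->
  <param name="enableConsistencyCheck" value="true"/>
  <!-- Run the check on every startup, not only after a forced shutdown -->
  <param name="forceConsistencyCheck" value="true"/>
  <!-- Repair detected errors instead of only logging them -->
  <param name="autoRepair" value="true"/>
</SearchIndex>
```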

Analyzing Query Performance

To get query statements and timings, set the following log level in log4j.xml:

<logger name="org.apache.jackrabbit.core.query.QueryImpl">
    <level value="debug"/>
</logger>

SQL-2

The default query language for JCR 2.0 is SQL-2.

Further Development

ReduceMemOfSharedFieldCache