Apache Jackrabbit : IndexingConfiguration

Starting with version 1.4, the default search index implementation in Jackrabbit allows you to control which properties of a node are indexed and how much they will affect the jcr:score value of that node in the result.

The configuration parameter is called indexingConfiguration and is not set by default. This means all properties of a node are indexed.

If you wish to configure the indexing behaviour, you need to add a parameter to the SearchIndex element in your workspace.xml and repository.xml files. The following example uses a configuration file located in the same folder as the workspace.xml file:

<param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>
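
For orientation, here is a sketch of how the parameter might sit inside the SearchIndex element. The class name and the path parameter are assumed typical defaults and do not come from this page; keep whatever your existing configuration uses and only add the indexingConfiguration line.

```xml
<!-- Sketch only: the class and path values are assumed defaults -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- points to the indexing configuration file next to workspace.xml -->
  <param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>
</SearchIndex>
```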

Index rules

To optimize the index size you can index only certain properties of a node type.

With the configuration below, only properties named Text are indexed for nodes of type nt:unstructured. This also applies to nodes whose type extends nt:unstructured.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

Please note that every namespace prefix you use anywhere in the XML file has to be declared in the configuration element!
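
As an illustration, here is a sketch of a rule that refers to a property in a custom namespace. The prefix my, its URI http://example.com/my/1.0 and the property name my:summary are hypothetical; the point is only that each prefix used in the file is declared on the configuration element.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<!-- "my" and its URI are hypothetical; use the prefix and URI registered in your repository -->
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:my="http://example.com/my/1.0">
  <index-rule nodeType="nt:unstructured">
    <!-- the my prefix resolves because it is declared on the root element -->
    <property>my:summary</property>
  </index-rule>
</configuration>
```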

It is also possible to configure a boost value for the nodes that match the index rule. The default boost value is 1.0. Higher boost values (a reasonable range is 1.0 - 5.0) yield a higher score value, so matching nodes appear as more relevant.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0">
    <property>Text</property>
  </index-rule>
</configuration>

If you do not wish to boost the complete node but only certain properties, you can also provide a boost value for the listed properties:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property boost="3.0">Title</property>
    <property boost="1.5">Text</property>
  </index-rule>
</configuration>

You may also add a condition to the index rule and have multiple rules with the same nodeType. The first index rule that matches will apply and all remaining ones are ignored:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

In the above example, the first index rule only applies if the nt:unstructured node has a priority property with the value 'high'. The condition only supports the equals operator and a string literal.

You may also reference properties in the condition that are not on the current node:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="ancestor::*/@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured"
              boost="0.5"
              condition="parent::foo/@priority = 'low'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured"
              boost="1.5"
              condition="bar/@priority = 'medium'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

The indexing configuration also allows you to specify the type of a node in the condition. Please note however that the type match must be exact. It does not consider sub types of the specified node type.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="element(*, nt:unstructured)/@priority = 'high'">
    <property>Text</property>
  </index-rule>
</configuration>

By default, the configured properties are fulltext indexed if they are of type STRING and included in the node scope index. That is, you can do a jcr:contains(., 'foo') and it will return nodes that have a string property that contains the word foo. This behaviour can be disabled:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property nodeScopeIndex="false">Text</property>
  </index-rule>
</configuration>

Starting with Jackrabbit version 1.5 you may also use a regular expression for the local name of a property. A regular expression for the namespace prefix is not supported in versions 1.5 and 1.6! Please make sure you use the correct DTD (the 1.1 version):

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.1.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property isRegexp="true">.*Text</property>
  </index-rule>
</configuration>

As of Jackrabbit 2.0 you can also use the match-all regexp for the namespace prefix part of a property name. However, that is currently the only regular expression supported for the prefix. The following configuration includes all properties that end with 'Text' in the node scope fulltext index; all other properties are also indexed, but not used in the node scope index. Please note the colon in the second regular expression, with the .* expression for the namespace prefix. This matches all namespaces, even the empty namespace, that is, properties without a prefix. In contrast, a simple .* expression only matches property names without a prefix, but none with a prefix.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.1.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property isRegexp="true">.*Text</property>
    <property isRegexp="true" nodeScopeIndex="false">.*:.*</property>
  </index-rule>
</configuration>
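
For contrast, here is a sketch of the simpler pattern mentioned above. Going by that description, a bare .* only matches property names without a prefix, so this rule would not cover a prefixed property such as jcr:title (a name chosen here purely for illustration).

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.1.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <!-- matches unprefixed names only, e.g. Text or Title, but not jcr:title -->
    <property isRegexp="true">.*</property>
  </index-rule>
</configuration>
```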

Another new feature in Jackrabbit version 1.5 is a new attribute that controls whether the value of a property should be used to create an excerpt. The value of the property is still full-text indexed when set to false, but it will never show up in an excerpt for its parent node.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.1.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property useInExcerpt="false">LastModifiedBy</property>
    <property isRegexp="true">.*Text</property>
  </index-rule>
</configuration>

For backward compatibility, the default value of the new useInExcerpt attribute is true.

Index Aggregates

Sometimes it is useful to include the contents of descendant nodes in a single node, which makes it easier to search content that is scattered across multiple nodes.

Jackrabbit allows you to define index aggregates based on relative path patterns and primary node types.

Changes to aggregated items cause the main item to be reindexed, even if it was not modified. The indexer observes the workspace for any relevant changes, and reindexes affected items automatically.

The following example creates an index aggregate on nt:file that includes the content of the jcr:content node:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
</configuration>

You can also restrict the included nodes to a certain type:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">jcr:content</include>
  </aggregate>
</configuration>

You may also use the * to match all child nodes:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">*</include>
  </aggregate>
</configuration>

If you wish to include nodes up to a certain depth below the current node you can add multiple include elements. E.g. the nt:file node may contain an exploded XML document under jcr:content:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>*</include>
    <include>*/*</include>
    <include>*/*/*</include>
  </aggregate>
</configuration>

As of Jackrabbit 1.6 you can define includes on a property level. Properties that match the path against the root of an indexing aggregate are included in the aggregated node index. Aggregated properties are also used to speed up sorting of query results when the order by clause references a property with a relative path. Please make sure you use the correct DTD (the 1.2 version):

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.2.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
    <include-property>jcr:content/jcr:lastModified</include-property>
  </aggregate>
</configuration>

As of Jackrabbit 2.3.2, JCR-2989 allows for recursive aggregates. This makes it possible to define a powerful hierarchy of aggregates, but it can also very easily get out of hand [0].

This feature can be enabled by setting the recursive flag to true.

<aggregate primaryType="nt:folder" recursive="true" recursiveLimit="10">

The recursiveLimit setting controls how many levels up the hierarchy the indexing should include for nodes of the same type. The default value is 100; setting it to 0 gives you full inclusion without any upper bound.
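
A fuller sketch of a recursive aggregate might look as follows. The include pattern and the choice of the 1.2 DTD are assumptions made for illustration; check which DTD version in your Jackrabbit release declares the recursive attributes.

```xml
<?xml version="1.0"?>
<!-- Assumption: the 1.2 DTD is the one that declares recursive and recursiveLimit -->
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.2.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <!-- recursion across nt:folder levels is capped at 10 -->
  <aggregate primaryType="nt:folder" recursive="true" recursiveLimit="10">
    <!-- hypothetical include pattern: all direct child nodes -->
    <include>*</include>
  </aggregate>
</configuration>
```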

[0] see this comment for an example of a scenario where indexing aggregates can hurt performance

Index Analyzers

With this configuration part, you define how a property should be analysed. If a property has an analyzer configured, this analyzer is used for indexing and searching this property. For example:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers>
    <analyzer class="org.apache.lucene.analysis.KeywordAnalyzer">
      <property>mytext</property>
    </analyzer>
    <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer">
      <property>mytext2</property>
    </analyzer>
  </analyzers>
</configuration>

The configuration above means that the property "mytext" is indexed (and searched) with the Lucene KeywordAnalyzer for the entire workspace, and the property "mytext2" with the WhitespaceAnalyzer. Using different analyzers is particularly useful for content in different languages.
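
As a sketch of the per-language use case, the following assumes two hypothetical properties, text_de and text_fr, and the German and French analyzers from the Lucene analyzers package. Whether these classes are available and can be instantiated depends on the Lucene version bundled with your Jackrabbit release, so treat the class names as assumptions to verify.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers>
    <!-- text_de and text_fr are hypothetical property names -->
    <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
      <property>text_de</property>
    </analyzer>
    <analyzer class="org.apache.lucene.analysis.fr.FrenchAnalyzer">
      <property>text_fr</property>
    </analyzer>
  </analyzers>
</configuration>
```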

However, when using analyzers you may see unexpected behavior when searching within a property compared to searching within the node scope. Suppose your query is, for example:

xpath = "//*[jcr:contains(mytext,'analyzer')]"

and the property "mytext" contains the text "testing my analyzers".

If no analyzer is configured for the property "mytext" (and the default analyzer in SearchIndex has not been changed), this xpath does not return a hit for the node with the property above. Likewise, xpath = "//*[jcr:contains(.,'analyzer')]" does not give a hit. Keep in mind that you can only set specific analyzers on a node property, and that node scope indexing and analyzing is always done with the globally defined analyzer in the SearchIndex element. Now suppose the analyzer used to index the "mytext" property above is changed to

<analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
     <property>mytext</property>
</analyzer>

and the same search is run again. Then xpath = "//*[jcr:contains(mytext,'analyzer')]" finds a hit because of stemming! The other search, xpath = "//*[jcr:contains(.,'analyzer')]", still does not give a result, since the node scope is indexed with the global analyzer, which in this case does not do stemming.

So keep in mind that, when using analyzers for specific properties, a search text may produce a hit in a property but no hit in the node scope of the node that holds that property!
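
One way to reduce this mismatch is to configure the same analyzer globally, so that the node scope is analyzed like the property. The sketch below assumes the SearchIndex element accepts an analyzer parameter and that the class name matches your Lucene version; the class and path values are likewise assumed defaults, so check all of them against your Jackrabbit release.

```xml
<!-- Sketch only: parameter names and class values are assumptions to verify -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- global analyzer, which is also the one used for the node scope index -->
  <param name="analyzer" value="org.apache.lucene.analysis.de.GermanAnalyzer"/>
  <param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>
</SearchIndex>
```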

Important note: Both index rules and index aggregates influence how content is indexed in Jackrabbit. If you change the configuration, the existing content is not automatically re-indexed according to the new rules. You therefore have to manually re-index the content when you change the configuration!