Oak Run Indexing

@since Oak 1.7.0

Work in progress. Not to be used on production setups.

Oak 1.7 adds indexing tooling to oak-run in the form of the index command. The operations supported by this command are described below.

The index command can connect to different NodeStores via various options which are documented here. The examples below assume a setup consisting of SegmentNodeStore and FileDataStore. Use the connection options appropriate for your setup.

By default the tool writes its output files to the directory indexing-result, which is referred to below as the output directory.

Unless specified otherwise, all operations connect to the repository in read only mode.

Common Options

All the commands support the following common options:

--index-paths - Comma separated list of index paths for which the selected operations need to be performed. If not specified, the operation is performed against all indexes.

Refer to the help output, available via the -h option, for the other options.

Generate Index Info

java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --index-info

Generates a report consisting of various stats related to indexes present in the given repository. The generated report is stored by default in <output dir>/index-info.txt

Supported for all index types

Dump Index Definitions

java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --index-definitions

The --index-definitions operation dumps the index definitions in JSON format to the file <output dir>/index-definitions.json. The JSON file contains the index definitions keyed by their index paths.

Supported for all index types
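
The dumped file can be inspected with a few lines of scripting. The sketch below assumes the file has the structure described above (a top-level JSON object whose keys are index paths and whose values are index definitions, each normally carrying a type property). The function name summarize_definitions and the inline sample data are illustrative and not part of oak-run.

```python
import json


def summarize_definitions(definitions):
    """Return sorted (index path, index type) pairs from parsed index definitions."""
    return sorted(
        (path, definition.get("type", "unknown"))
        for path, definition in definitions.items()
    )


# Inline sample standing in for <output dir>/index-definitions.json.
sample = json.loads("""
{
  "/oak:index/lucene": {"type": "lucene", "async": "async"},
  "/oak:index/uuid": {"type": "property"}
}
""")

for path, index_type in summarize_definitions(sample):
    print(f"{path}\t{index_type}")
```

For a real run, the sample would be replaced by the result of json.load on the index-definitions.json file in the output directory.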

Dump Index Data

java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --index-dump

The --index-dump operation dumps the index content to the output directory. The output directory contains one folder for each index. Each folder has a property file index-details.txt which contains the indexPath.

Supported for only Lucene indexes.
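
To map a dump folder back to its index, the indexPath entry can be read from index-details.txt. The helper below is only a sketch: it assumes the file uses Java properties syntax (key=value lines, with characters such as ':' possibly escaped by a backslash), which is not spelled out above. The function name read_index_path and the sample text are hypothetical.

```python
import re


def read_index_path(properties_text):
    """Return the indexPath value from properties-style text, or None if absent."""
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (properties files use '#' or '!').
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "indexPath":
            # Undo backslash escapes (assumed properties-style escaping).
            return re.sub(r"\\(.)", r"\1", value.strip())
    return None


# Assumed file content; the real file may carry further entries.
sample = "#index info\nindexPath=/oak\\:index/lucene\n"
print(read_index_path(sample))
```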

Index Consistency Check

java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --index-consistency-check

The --index-consistency-check operation performs a consistency check against the selected indexes. It supports 2 levels:

Level 1 - Specified as --index-consistency-check=1. Performs a basic check to determine if all blobs referred to in the index are valid
Level 2 - Specified as --index-consistency-check=2. Performs a more thorough check to determine if all index files are valid and no corruption has happened. This check is slower

It generates a report in <output dir>/index-consistency-check-report.txt

Supported for only Lucene indexes.

Reindex

The reindex operation supports 2 modes of indexing:

Out-of-band indexing - Here oak-run would connect to repository in read only mode. It would require certain manual steps
Online Indexing - Here oak-run would connect to repository in --read-write mode

Supported for only Lucene indexes.

If the indexes being reindexed have fulltext indexing enabled then refer to Tika Setup for steps on how to adapt the command to include Tika support for text extraction.

A - out-of-band indexing

Out-of-band indexing has the following phases:

Get a checkpoint issued
Perform indexing with a read only connection to the NodeStore up to the checkpoint state
Import the generated indexes
Complete the incremental indexing from the checkpoint state to the current head

Step 1 - Text PreExtraction

If the index being reindexed involves fulltext indexing and the repository has binary content then it is recommended to perform text pre-extraction first. This ensures that the costly text extraction is done before the actual indexing, so that the actual indexing does not perform text extraction in its critical path.

Step 2 - Create Checkpoint

Go to CheckpointMBean and create a checkpoint with a long enough lifetime like 10 days. For this invoke CheckpointMBean#createCheckpoint with 864000000 as argument for lifetime
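
The lifetime value can be double-checked with a short calculation. The sketch assumes, consistently with the figure quoted above, that the lifetime argument is expressed in milliseconds. The helper name checkpoint_lifetime_millis is illustrative.

```python
from datetime import timedelta


def checkpoint_lifetime_millis(days):
    """Convert a lifetime in days to milliseconds using integer arithmetic."""
    return timedelta(days=days) // timedelta(milliseconds=1)


# 10 days * 24 h * 60 min * 60 s * 1000 ms = 864000000
print(checkpoint_lifetime_millis(10))
```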

Step 3 - Perform Reindex

In this step we perform the actual indexing via oak-run where it connects to repository in read only mode.

java -jar oak-run*.jar index --reindex \
    --index-paths=/oak:index/indexName \
    --checkpoint=0fd2a388-de87-47d3-8f30-e86b1cf0a081 \
    --fds-path=/path/to/datastore  /path/to/segmentstore/

The following options can be used here:

--pre-extracted-text-dir - Directory path containing pre extracted text generated via step #1 (optional)
--index-paths - This command requires an explicit set of index paths which need to be indexed (required)
--checkpoint - The checkpoint up to which the index is updated, when indexing in read only mode. For testing purpose, it can be set to ‘head’ to indicate that the head state should be used. (required)
--index-definitions-file - Path of a JSON file which contains updated index definitions

If the index does not support fulltext indexing then you can omit providing BlobStore details
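
When the reindex command is scripted, the required and optional arguments listed above can be assembled programmatically. The following is a minimal sketch, not an oak-run API: the function name build_reindex_command and its parameters are hypothetical, and only options documented in this section are emitted. It assumes the jar path is given explicitly, since a glob such as oak-run*.jar is not expanded when the arguments are passed as a list.

```python
def build_reindex_command(jar, segment_store, index_paths, checkpoint,
                          fds_path=None, pre_extracted_text_dir=None,
                          index_definitions_file=None):
    """Assemble the out-of-band reindex command as an argument list."""
    if not index_paths:
        raise ValueError("--index-paths is required for reindexing")
    if not checkpoint:
        raise ValueError("--checkpoint is required in read only mode")
    command = [
        "java", "-jar", jar, "index", "--reindex",
        "--index-paths=" + ",".join(index_paths),
        "--checkpoint=" + checkpoint,
    ]
    # Optional arguments are only added when a value was supplied.
    if pre_extracted_text_dir:
        command.append("--pre-extracted-text-dir=" + pre_extracted_text_dir)
    if index_definitions_file:
        command.append("--index-definitions-file=" + index_definitions_file)
    if fds_path:
        command.append("--fds-path=" + fds_path)
    command.append(segment_store)
    return command


command = build_reindex_command(
    "oak-run.jar",
    "/path/to/segmentstore/",
    ["/oak:index/indexName"],
    "0fd2a388-de87-47d3-8f30-e86b1cf0a081",
    fds_path="/path/to/datastore",
)
print(" ".join(command))
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting issues with the index paths.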

Step 4 - Import the index

As a last step we need to import the index back in the repository. This can be done in one of the following ways

4.1 - Via oak-run

In this mode we import the index using oak-run

java -jar oak-run*.jar index --index-import --read-write \
    --index-import-dir=<index dir>  \
    --fds-path=/path/to/datastore /path/to/segmentstore

Here “index dir” is the directory which contains the index files created in step #3. Check the logs from previous command for the directory path.

This mode should only be used when repository is from Oak version 1.7+ as oak-run connects to the repository in read-write mode.

4.2 - Via IndexerMBean

In this mode we import the index using JMX. Look for IndexerMBean and then import the index directory using the importIndex operation.

4.3 - Via script

TODO - Provide a way to import the data on older setup using some script

B - Online indexing

Online indexing automates some of the manual steps which are required for out-of-band indexing.

This mode should only be used when repository is from Oak version 1.7+ as oak-run connects to the repository in read-write mode.

Step 1 - Text PreExtraction

This is same as in out-of-band indexing

Step 2 - Perform reindexing

In this step we configure oak-run to connect to the repository in read-write mode and let it perform all other steps, i.e. checkpoint creation, indexing and import.

java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --read-write --fds-path=/path/to/datastore /path/to/segmentstore

Updating or Adding New Index Definitions

@since Oak 1.7.5

The index tooling supports updating existing index definitions and adding new ones to existing setups. This is done by passing the path of a JSON file which contains the index definitions.

java -jar oak-run*.jar index --reindex --index-paths=/oak:index/newAssetIndex \
--index-definitions-file=index-definitions.json \
--fds-path=/path/to/datastore /path/to/segmentstore

Where index-definitions.json has the following structure

{
  "/oak:index/newAssetIndex": {
    "evaluatePathRestrictions": true,
    "compatVersion": 2,
    "type": "lucene",
    "async": "async",
    "jcr:primaryType": "oak:QueryIndexDefinition",
    "indexRules": {
      "jcr:primaryType": "nt:unstructured",
      "dam:Asset": {
        "jcr:primaryType": "nt:unstructured",
        "properties": {
          "jcr:primaryType": "nt:unstructured",
          "valid": {
            "name": "valid",
            "propertyIndex": true,
            "jcr:primaryType": "nt:unstructured",
            "notNullCheckEnabled": true
          },
          "mimetype": {
            "name": "mimetype",
            "analyzed": true,
            "jcr:primaryType": "nt:unstructured"
          }
        }
      }
    }
  }
}

Some points to note about this json file

Each key of top level object refers to the index path
The value of each such key refers to complete index definition
If the index path is not present in existing repository then it would result in a new index being created
For a new index, the parent path structure must already exist in the repository. So if a new index is being created at /content/en/oak:index/contentIndex then the path up to /content/en/oak:index should already exist in the repository
If this option is used with online indexing then do ensure that oak-run version matches with the Oak version used by target repository
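
The structural points above can be checked before the file is passed to oak-run. The sketch below is a hypothetical pre-flight helper (required_parent_paths is not part of Oak): it verifies that every top-level key looks like an absolute index path with an object as value, and reports the parent path that must already exist for each index. Whether that parent actually exists in the repository still has to be checked separately.

```python
import posixpath


def required_parent_paths(definitions):
    """Map each index path in a definitions object to the parent path it needs."""
    parents = {}
    for index_path, definition in definitions.items():
        if not index_path.startswith("/"):
            raise ValueError(f"not an absolute index path: {index_path!r}")
        if not isinstance(definition, dict):
            raise ValueError(f"definition for {index_path} must be a JSON object")
        parents[index_path] = posixpath.dirname(index_path)
    return parents


definitions = {"/content/en/oak:index/contentIndex": {"type": "lucene"}}
print(required_parent_paths(definitions))
```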

You can also use the JSON file generated by Oakutils. It needs to be modified to conform to the above structure, i.e. the whole definition must be enclosed under the intended index path key.
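
Enclosing a bare definition under its index path key is a one-step transformation. The sketch below assumes the generated definition is available as a single JSON object; the name wrap_definition and the abbreviated sample definition are illustrative only.

```python
import json


def wrap_definition(index_path, definition):
    """Enclose a bare index definition under its intended index path key."""
    return {index_path: definition}


# Abbreviated stand-in for a definition produced by an external tool.
bare = {
    "jcr:primaryType": "oak:QueryIndexDefinition",
    "type": "lucene",
    "async": "async",
    "compatVersion": 2,
}
wrapped = wrap_definition("/oak:index/newAssetIndex", bare)
print(json.dumps(wrapped, indent=2))
```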

In general the index definitions do not need any special encoding of values, as index definitions in Oak mostly use only String, Long and Double types. However, if the index refers to binary config such as a Tika config then the binary data needs to be encoded. Refer to the next section for more details.

This option is supported in both online and out-of-band indexing.

For more details refer to OAK-6471

JSON File Format

Some of the standard types used in Oak, such as names and blobs, are not directly supported by JSON. These need to be encoded in a specific format.

Below are the encoding rules

LONG: No encoding required; "compatVersion": 2
BOOLEAN: No encoding required; "propertyIndex": true
DOUBLE: No encoding required; "weight": 1.5
STRING: Prefix the value with str:; generally the value need not be encoded. Encoding is only required if the string starts with 3 letters followed by a colon; "pathPropertyName": "str:jcr:path"
DATE: Prefix the value with dat:; the value is an ISO8601 formatted date string; "created": "dat:2017-07-20T13:23:21.196+05:30"
NAME: Prefix the value with nam:; for jcr:primaryType and jcr:mixins no encoding is required. Any property with these names would be converted to NAME type; "nodetype": "nam:nt:base"
PATH: Prefix the value with pat:; "imagePath": "pat:/content/assets/book.jpg"
URI: Prefix the value with uri:; "serverURI": "uri:http://foo.example.com"
BINARY: By default binary values are encoded as a Base64 string if the binary is smaller than 1 MB. The encoded value is prefixed with :blobId:; "jcr:data": ":blobId:axygz"
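
The rules above can be applied mechanically when a definitions file is generated by a script. The following is a minimal sketch under stated assumptions: encode_value and its kind parameter are hypothetical names, "3 letters followed by a colon" is read as three ASCII letters, and binary values are left out. Properties named jcr:primaryType or jcr:mixins would be written without a prefix, as noted in the NAME rule.

```python
import re
from datetime import datetime, timedelta, timezone

# Strings that look like "abc:..." would be mistaken for an encoded value.
_PREFIX_LIKE = re.compile(r"^[A-Za-z]{3}:")
_TYPE_PREFIXES = {"name": "nam:", "path": "pat:", "uri": "uri:"}


def encode_value(value, kind="string"):
    """Encode a Python value for the definitions file following the rules above."""
    # LONG, BOOLEAN and DOUBLE map directly onto JSON types.
    if isinstance(value, (bool, int, float)):
        return value
    # DATE: ISO8601 string with millisecond precision and a dat: prefix.
    if isinstance(value, datetime):
        return "dat:" + value.isoformat(timespec="milliseconds")
    # NAME, PATH and URI always carry their type prefix.
    if kind in _TYPE_PREFIXES:
        return _TYPE_PREFIXES[kind] + value
    if kind != "string":
        raise ValueError(f"unsupported kind: {kind}")
    # STRING: only prefixed when it could be confused with an encoded value.
    return "str:" + value if _PREFIX_LIKE.match(value) else value


created = datetime(2017, 7, 20, 13, 23, 21, 196000,
                   tzinfo=timezone(timedelta(hours=5, minutes=30)))
print(encode_value("jcr:path"))
print(encode_value("nt:base", kind="name"))
print(encode_value(created))
```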

Tika Setup

If the indexes being reindexed have fulltext indexing enabled then you need to include the Tika library in the classpath. This is required even if pre-extraction is used, to ensure that any new binary added after pre-extraction was done can still be indexed.

First download the tika-app jar from Tika downloads. You should be able to use 1.15 version with Oak 1.7.4 jar.

Then modify the index command as shown below. The rest of the arguments remain the same as documented before.

java -cp oak-run.jar:tika-app-1.15.jar org.apache.jackrabbit.oak.run.Main index
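
When the command is launched from a script, the classpath form can be assembled as an argument list. This is a sketch only: build_tika_index_command is a hypothetical helper, the jar file names are taken from the example above, and os.pathsep is used on the assumption that the script runs on the same platform as the JVM (':' on Unix-like systems, ';' on Windows).

```python
import os


def build_tika_index_command(oak_run_jar, tika_jar, index_args):
    """Assemble the classpath-based index command as an argument list."""
    classpath = os.pathsep.join([oak_run_jar, tika_jar])
    return ["java", "-cp", classpath,
            "org.apache.jackrabbit.oak.run.Main", "index", *index_args]


command = build_tika_index_command(
    "oak-run.jar",
    "tika-app-1.15.jar",
    ["--reindex", "--index-paths=/oak:index/lucene"],
)
print(" ".join(command))
```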