Oak 1.6 added support for the Lucene Hybrid Index (OAK-4412), which enables near real time (NRT) support for Lucene based indexes. It also had limited support for sync indexes. This feature aims to take that further and enable support for sync property indexes.
Synchronous Index Use Cases
Synchronous indexes are required in the following use cases
For unique indexes like the uuid index and the principal name index, it must be ensured that the indexed value is unique across the whole repository at commit time. If the indexed value already exists, e.g. a principal with the same name already exists, then the commit should fail. To meet this requirement we need a synchronous index which gets updated as part of the commit itself.
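The commit-time uniqueness check can be illustrated with a small in-memory model. This is only a sketch of the requirement, not Oak code: the class `SyncUniqueIndex`, the exception `UniqueIndexViolation`, and the example paths are hypothetical names introduced for illustration.

```python
class UniqueIndexViolation(Exception):
    """Raised when a commit would introduce a duplicate indexed value."""


class SyncUniqueIndex:
    """Toy in-memory unique index that is updated as part of the commit."""

    def __init__(self):
        self._value_to_path = {}

    def commit(self, changes):
        """Apply {path: value} atomically, or fail without touching the index."""
        staged = dict(self._value_to_path)  # work on a copy so a failure leaves no trace
        for path, value in changes.items():
            owner = staged.get(value)
            if owner is not None and owner != path:
                raise UniqueIndexViolation(
                    f"{value!r} is already indexed for {owner}"
                )
            staged[value] = path
        self._value_to_path = staged  # index update becomes visible with the commit

    def lookup(self, value):
        """Return the path that owns the value, or None."""
        return self._value_to_path.get(value)
```

Because the index is checked and updated inside `commit`, a second principal named `alice` is rejected immediately, and nothing else from the failing commit becomes visible. An asynchronously updated index could not give this guarantee, since the duplicate would only be noticed after the commit had already succeeded.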
Depending on application requirements the query results may be
- Eventually Consistent - Any changes done get eventually reflected in query results.
- Consistent - Any change done gets immediately reflected in query results.
For most cases, like user-driven search, eventually consistent search results work fine and hence async indexes can be used. With the recent support for NRT indexes (OAK-4412) the user experience gets better, as changes done by a user are reflected in search results very soon.
However, for some cases we need to support fully consistent search results. For example, assume there is a component which maintains a cache for nodes of type app:Component and uses an observation listener to listen for changes in nodes of type app:Component. Upon finding any change it rebuilds the cache by querying for all such nodes. For this cache to be correct, query results must be consistent with respect to the session state associated with the listener. Otherwise the cache may miss a new component, and a later request to the cache for that component would fail.
For such use cases it is required that indexes are synchronous and that the results provided by the index are consistent.
Drawbacks of current property indexes
Oak currently supports synchronous property indexes, which are used to meet the above use cases. However, the current implementation has certain drawbacks
- Perform poorly over remote storage - The property indexes are stored as normal NodeState and hence reading them over remote storage like Mongo performs poorly.
- Prone to conflicts - The content mirror store strategy is prone to conflicts if the index content is volatile.
- Storage overhead - The storage overhead is large, especially for remote storage, as each NodeState is mapped to one Document.
To overcome these drawbacks and still meet the synchronous index requirements, we can implement a hybrid index where the index content is stored using both the property index (for recent entries) and Lucene indexes (for older entries). At a high level the flow would be
- Store recently added index content as a normal property index
- As part of the async indexer run, index the same content in the Lucene index
- Later prune the property index content which has since been indexed in the Lucene index
- Any query would result in a union of query results from both the property index and the Lucene indexes (with some caveats)
The synchronous index support would need to be enabled via index definition
- async - This needs to have an entry sync
- sync - Set to true for each property definition which needs to be indexed in a sync way
- unique - For unique indexes, set unique to true
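As a rough illustration, an index definition using these settings could be modelled as the nested structure below. Only the `async`, `sync` and `unique` settings come from the description above. The surrounding layout (`indexRules`, `nt:base`, `properties`, `propertyIndex`), the value `["async", "sync"]`, and the property names are assumptions made for this sketch, and `sync_properties` is a hypothetical helper, not an Oak API.

```python
# Hypothetical index definition modelled as a nested dict (layout is an assumption).
index_definition = {
    "type": "lucene",
    "async": ["async", "sync"],  # assumed form of "an entry sync"
    "indexRules": {
        "nt:base": {
            "properties": {
                "uuid": {
                    "name": "jcr:uuid",
                    "propertyIndex": True,
                    "sync": True,
                    "unique": True,  # unique index: duplicates must fail the commit
                },
                "componentType": {
                    "name": "app:type",
                    "propertyIndex": True,
                    "sync": True,  # indexed synchronously, not unique
                },
                "title": {
                    "name": "jcr:title",
                    "propertyIndex": True,  # async only
                },
            }
        }
    },
}


def sync_properties(definition):
    """Return {property name: is_unique} for every property definition marked sync."""
    result = {}
    for rule in definition.get("indexRules", {}).values():
        for prop in rule.get("properties", {}).values():
            if prop.get("sync", False):
                result[prop["name"]] = prop.get("unique", False)
    return result
```

Calling `sync_properties(index_definition)` picks out `jcr:uuid` (unique) and `app:type` (not unique), and leaves `jcr:title` to the async indexer alone.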
The property index content would be stored as hidden nodes under the index definition nodes. The storage structure would be similar to the existing format for property indexes, with some changes
- :property-index - hidden node under which property indexes would be stored for the various properties which are marked as sync
- For unique indexes, each entry would also have a timestamp which would later be used for pruning
Here we create new buckets of index values, which simplifies pruning. New buckets would get created after each successful async indexer run and older buckets would get removed. Each bucket would in turn have a structure similar to the content mirror store strategy.
For each property being indexed, keep a head property which refers to the current active bucket. This would be changed by IndexPruner. In addition there would be a previous bucket to refer to the last active bucket.
On each run of IndexPruner
- Check if IndexStatsMBean#LastIndexedTime has changed from the last known time
- If changed then
  - Create a new bucket by incrementing the current head value
  - Set previous to current head
  - Set head to new head value
  - Update the time recorded on the previous bucket to now
- Remove all other buckets
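The bucket rotation can be sketched with a small in-memory model. This is an illustration under stated assumptions, not the Oak implementation: `BucketedPropertyIndex` and its methods are hypothetical names, buckets are plain dicts rather than hidden nodes, and the last indexed time is passed in as a number instead of being read from IndexStatsMBean.

```python
class BucketedPropertyIndex:
    """Toy model of a property index split into numbered buckets."""

    def __init__(self):
        self.head = 1               # id of the current active bucket
        self.previous = None        # id of the last active bucket, if any
        self.buckets = {1: {}}      # bucket id -> {indexed value: set of paths}
        self._last_indexed_time = None

    def add(self, value, path):
        """New entries always go into the head bucket."""
        self.buckets[self.head].setdefault(value, set()).add(path)

    def prune(self, last_indexed_time):
        """One pruner run. Returns True if the buckets were rotated."""
        if last_indexed_time == self._last_indexed_time:
            return False            # async indexer has not advanced, nothing to do
        self._last_indexed_time = last_indexed_time
        new_head = self.head + 1    # new bucket id = current head + 1
        self.buckets[new_head] = {}
        self.previous = self.head
        self.head = new_head
        keep = {self.head, self.previous}
        for bucket_id in list(self.buckets):
            if bucket_id not in keep:
                del self.buckets[bucket_id]  # remove all other buckets
        return True

    def query(self, value):
        """Union of matches from the head and previous buckets."""
        paths = set()
        for bucket_id in (self.head, self.previous):
            if bucket_id is not None:
                paths |= self.buckets[bucket_id].get(value, set())
        return paths
```

An entry therefore survives one rotation in the previous bucket and is dropped on the next one, so pruning never needs to inspect individual entries; it only deletes whole buckets.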
We require both head and previous bucket as there would be some overlap between content in
Index Pruner is a periodic task which would be responsible for pruning the index content. It would make use of IndexStatsMBean#LastIndexedTime to determine up to which time the async indexer has indexed the repository, and then remove entries from the property index which are older than that time
- Property index - Here pruning would be done by creating a new bucket and then removing the older bucket.
- Unique index - Here pruning would be done by iterating over the current indexed content and removing the older entries.
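For the unique index case, timestamp-based pruning could look like the sketch below. The function name `prune_unique_index` and the entry layout `{value: (path, created)}` are assumptions made for illustration. Entries strictly older than the last indexed time are dropped; how an entry created exactly at that time should be treated is not stated above, so keeping it is a conservative choice of this sketch.

```python
def prune_unique_index(entries, last_indexed_time):
    """Return the entries that the async indexer may not have covered yet.

    entries: {indexed value: (path, created timestamp)}
    Entries older than last_indexed_time are assumed to be in the Lucene
    index already and are removed from the property index.
    """
    return {
        value: (path, created)
        for value, (path, created) in entries.items()
        if created >= last_indexed_time
    }


# Example: the async indexer has indexed the repository up to time 100.
entries = {
    "alice": ("/home/users/a", 90),
    "bob": ("/home/users/b", 100),
    "carol": ("/home/users/c", 150),
}
remaining = prune_unique_index(entries, 100)
```

Here only `alice` is pruned; `bob` and `carol` stay in the property index until a later async indexer run has covered them.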
On the query side we would perform a union query over the two index types. A union cursor would be created which would consist of
- LucenePathCursor - Primary cursor backed by Lucene index
- PropertyIndexCursor - A union of path cursors from the current head and previous buckets
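A minimal sketch of such a union over path cursors follows. It models cursors as plain iterables of paths and assumes that a path present in both indexes should be returned only once; `union_cursor` is a hypothetical helper, not the Oak cursor API, and it ignores the caveats mentioned earlier.

```python
from itertools import chain


def union_cursor(lucene_paths, property_index_paths):
    """Yield each matching path once, Lucene results first."""
    seen = set()
    for path in chain(lucene_paths, property_index_paths):
        if path not in seen:  # skip paths already returned by the other index
            seen.add(path)
            yield path


# Example: "/b" was indexed by the async run but has not been pruned yet.
results = list(union_cursor(["/a", "/b"], ["/b", "/c"]))
```

De-duplication matters because an entry can be present in both indexes between an async indexer run and the next pruner run.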
If there are multiple property definitions in the Lucene index marked with sync and the query involves constraints on more than one of them, then which property index should be picked?