Where and When

March 19th - 23rd 2018
Location: Adobe Basel (Meeting Room Rhein, 5th floor), guests please register at reception on 2nd floor.

Attendees

Who	When
Julian Reschke	19. and 22./23. (remotely)
Marcel Reutegger	19. - 22.
Matt Ryan	19. - 23.
Robert Munteanu	20. - 22.
Thomas Mueller	20. - 23.

Topics/Discussions/Goals

Title	Summary	Effort	Participants	Proposed by
Avoid reads from MongoDB primary	Work on a prototype using MongoDB 3.6 client sessions with causal consistency. See OAK-6087.	2-3 days		Marcel Reutegger
CompositeDataStore - How to map blob IDs to delegates	Revisit discussion on how to map blob IDs to CompositeDataStore delegates. Proposals include: maintaining MPHF tables or Bloom filters to map IDs to delegates, rebuilding them every time data store GC is run or on some other schedule; encoding a delegate identifier into the blob ID; ???	2 hours to discuss; we could also try to prototype the solution (2-3 days).	Matt Ryan, Amit Jain, Thomas Mueller, Chetan Mehrotra	Matt Ryan
CompositeDataStore - review pull request	The first version of the CompositeDataStore is in pull request and ready to review - can we go through it and resolve the issues?	2 hours	Matt Ryan, Amit Jain, Chetan Mehrotra, Thomas Mueller	Matt Ryan
DocumentMK Roadmap	Development focus for 2018	0.5h		Marcel Reutegger
DataStore Roadmap	Changes needed for better container support	0.5h		Matt Ryan
Modularization	Where do we stand for modularization and where to go next	1h		Robert Munteanu

Agenda Proposal

Mon
9:00-12:30	9:00 Setup; 9:30 5-10 mins overview per topic; 10:00 DocumentMK Roadmap
13:30-17:00

Tue
9:00-12:30	9:30 TarMK Roadmap; 10:30 CompositeDataStore BlobID mapping
13:30-17:00	13:30 - 15:30 Disposable Publishers

Wed
9:00-12:30	9:30 Indexing Roadmap
13:30-17:00

Thu
9:00-12:30	9:30 Modularization
13:30-17:00

Fri
9:00-12:30
13:30-17:00

Prep Work

Notes from the Oakathon

Add documentation on how to run integration tests on different backends, e.g. Postgres.

DocumentMK Roadmap

Cloud deployments
Areas of improvement
Roundtrips
- Need to pay for each call
- Review and reduce communication with service
Monitoring
- More metrics
- Document existing metrics (meaning, thresholds)
DevOps
- Avoid manual tasks
- Upgrade/rollback
- Compatibility
- Tooling (machine readable output, versions, compatibility)
Deployment
- Container ready (local disk?, persistent cache)
- Provisioned DocumentStore ops (something in OSGi?)
- Independent maintenance tasks (serverless)
Resilience
- Lease timeout behaviour
- Incorrect/erratic clock
- Unit tests / integration tests / gap! / longevity tests
- Analyze reported issues
Alternative back-ends
- Cosmos DB

CompositeDataStore Blob ID Mapping

Since the CompositeDataStore contains multiple delegate data stores inside it, an obvious question is: How does it know which delegate to choose for any particular record?

There are two sub-questions:

Is it required to implement a blob ID mapping in order to accept the current CompositeDataStore pull requests (1, 2), or can we accept them into Oak without this capability in order to get testing started using CompositeDataStore, and add the mapping later?
What technique should we use to map blob IDs to delegates? Some of the ideas proposed include:
- Encoding the delegate into the blob ID.
  - Challenges:
    - Other CompositeDataStore scenarios like storage tiering (automatically moving blobs to cheaper storage) would then require re-encoding the blob ID each time the data is moved from one delegate to another.
- Maintaining a mapping data structure.
  - Proposed ideas include MPHF or Bloom filters (both proposed by @Thomas)
    - Challenges:
      - Initial creation of table
        
        Build anew each startup or save it somewhere and reload?
      - Resizing the table
        
        Adjust as needed or wait until a certain event (e.g. data store GC times)?
        
        What if user never runs data store GC?
        
        How to maintain state in the meantime?
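
To make the mapping-structure idea more concrete, here is a minimal sketch of the Bloom filter variant, assuming one filter per delegate that is updated on every write. The class names (BlobIdBloomFilter, DelegateLocator), the hashing scheme and the sizing parameters are all hypothetical and are not part of Oak or of the pull requests; the sketch also leaves open the questions above (persisting, rebuilding and resizing the filters).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these classes exist in Oak.
class BlobIdBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BlobIdBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String blobId) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(blobId, i));
        }
    }

    // false means "definitely not written here"; true means "possibly here".
    boolean mightContain(String blobId) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(blobId, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th bit index is derived from two base hashes.
    private int index(String blobId, int i) {
        int h1 = 0x811C9DC5; // FNV-1a offset basis
        for (byte b : blobId.getBytes(StandardCharsets.UTF_8)) {
            h1 ^= (b & 0xFF);
            h1 *= 0x01000193; // FNV-1a prime
        }
        int h2 = blobId.hashCode() | 1; // force an odd step
        return Math.floorMod(h1 + i * h2, size);
    }
}

// Keeps one filter per delegate and narrows down where a blob may live.
class DelegateLocator {
    private final Map<String, BlobIdBloomFilter> filters = new LinkedHashMap<>();

    void register(String delegateName, BlobIdBloomFilter filter) {
        filters.put(delegateName, filter);
    }

    void recordWrite(String delegateName, String blobId) {
        BlobIdBloomFilter filter = filters.get(delegateName);
        if (filter == null) {
            throw new IllegalArgumentException("unknown delegate: " + delegateName);
        }
        filter.add(blobId);
    }

    // Delegates that may hold the blob, in registration order.
    List<String> candidates(String blobId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, BlobIdBloomFilter> e : filters.entrySet()) {
            if (e.getValue().mightContain(blobId)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A Bloom filter can report false positives but never false negatives, so the candidate list may include a delegate that does not hold the blob and the caller still has to handle a miss. Because entries cannot be removed from a plain Bloom filter, deleted blobs would keep showing up as candidates until the filter is rebuilt, which is one reason a rebuild tied to data store GC was discussed.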

Decisions that came out of the session:

Some minimal amount of testing needs to be done to quantify the performance impact of accepting the current pull requests without doing some sort of mapping. This needs to be known before we can make a decision on accepting the pull request.
To address the blob ID mapping question, the decision was to first try to implement an encoding of a data store identifier into the blob ID. Since we don't have any current requirements to support moving of blobs from one data store delegate to another (and since in the last Oakathon we made a similar decision that this was beyond the responsibilities of Oak anyway), we won't use that as a reason to not encode the data store ID into the blob ID. If subsequently we need to move beyond those capabilities we can additionally add an in-memory mapping of some sort that would be used first, before looking at the blob ID, as an override for the data store location.
The implementation and use case need to be more fully documented within oak-doc.
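
The encoding decision, together with the optional in-memory override, could look roughly like the sketch below. This is an illustration under assumptions, not the agreed implementation: the `@` separator, the class names DelegateBlobIds and DelegateResolver, and the shape of the encoded ID are all hypothetical, and the real blob ID format in Oak may need a different scheme.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the ID format and class names are assumptions.
class DelegateBlobIds {
    static final char SEPARATOR = '@'; // assumed not to occur in delegate IDs

    // Prefixes the raw blob ID with the identifier of the delegate that stores it.
    static String encode(String delegateId, String rawBlobId) {
        if (delegateId.isEmpty() || delegateId.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("invalid delegate id: " + delegateId);
        }
        return delegateId + SEPARATOR + rawBlobId;
    }

    // Returns the delegate encoded in the ID, or empty for IDs without a prefix.
    static Optional<String> delegateOf(String encodedBlobId) {
        int pos = encodedBlobId.indexOf(SEPARATOR);
        return pos > 0 ? Optional.of(encodedBlobId.substring(0, pos)) : Optional.empty();
    }

    // Strips the delegate prefix, if any.
    static String rawIdOf(String encodedBlobId) {
        int pos = encodedBlobId.indexOf(SEPARATOR);
        return pos > 0 ? encodedBlobId.substring(pos + 1) : encodedBlobId;
    }
}

// Consults the in-memory override first, then falls back to the encoded ID.
class DelegateResolver {
    private final Map<String, String> overrides = new ConcurrentHashMap<>();

    // Records that a blob now lives in a different delegate than its ID says.
    void override(String encodedBlobId, String delegateId) {
        overrides.put(encodedBlobId, delegateId);
    }

    Optional<String> resolve(String encodedBlobId) {
        String overridden = overrides.get(encodedBlobId);
        if (overridden != null) {
            return Optional.of(overridden);
        }
        return DelegateBlobIds.delegateOf(encodedBlobId);
    }
}
```

With this shape the blob ID itself never changes when a blob is moved; only the override entry is added. IDs without a prefix resolve to an empty result, so the caller would still need a fallback (for example, asking each delegate in turn) for blobs written before the encoding was introduced.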

Modularisation

We discussed what the end goal is for the Oak modularisation. For 1.8 we split out many modules, which are now more cohesive, and we have more fine-grained dependencies. Aiming for independent module releases presents other challenges we have not discussed yet.

What is an Oak release?

Up till now we have released all modules in lock-step, which means that there is a very clear definition of what an Oak release is. With modules released independently it's not clear what modules should be used by consumers. This basically affects:

consumers using Oak in their product (Sling/AEM)
oak-run, as the tool that embeds many Oak modules

Consumers already ask which dependency of oak-run goes with which Oak version; this will get more complicated with modular releases.

A number of ideas were raised for simplifying oak-run management:

specify dependencies (using CLI/files) - downside is that it's tedious and requires network access
using jar files from an existing Oak/Sling/AEM installation - this can be done, but it requires reading from an exploded Felix container bundle state, which is not very stable. There is also no guarantee that all oak-run dependencies are present (e.g. CLI parsing code is usually not in Sling/AEM)
generate a new version of oak-run (fat jar or script) whenever a specific Oak bundle is started - a cleaner version of the above, with the same note that some dependencies might not be present. We can add the missing dependencies to the deployment, given that they don't have a large disk footprint.

ASF release note: if needed, we can release multiple modules in one vote, e.g. oak-segment-tar and oak-run.

How do we deliver Oak updates?

Currently, consumers update all Oak bundles together. Upgrading only some of the bundles can lead to confusion, e.g. "What Oak version am I on?".

Where do we keep modules?

Once a module is extracted, we need to find a proper place for it. The options we discussed were:

https://svn.apache.org/repos/asf/jackrabbit/${module}
https://svn.apache.org/repos/asf/jackrabbit/modules/${module}
a separate git repository

We did not reach a conclusion, so modules will live in the same SVN location, just removed from the main reactor POM (see also Next steps below).

Next steps

To minimise risk and get a better understanding of the implications of independent releases, we will start off with isolated modules and give them independent release cycles. Good candidates would:

be leaves in the dependency tree - no other modules depend on them
have no SNAPSHOT dependencies
have few dependencies
have only versioned OSGi imports
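
As a rough illustration of how these criteria could be checked mechanically, the sketch below filters a set of module descriptions down to release candidates. It is only a sketch under assumptions: the ReleaseCandidates class and its Module type are hypothetical, the dependency data would have to be extracted from the POMs by some other means, "few dependencies" is reduced to an arbitrary threshold, and the OSGi import check is not modelled at all.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not part of the Oak build.
class ReleaseCandidates {

    static final class Module {
        final String name;
        final Map<String, String> dependencies; // dependency name -> version

        Module(String name, Map<String, String> dependencies) {
            this.name = name;
            this.dependencies = dependencies;
        }
    }

    // Returns the names of modules that are leaves in the dependency tree,
    // have no SNAPSHOT dependencies and at most maxDependencies dependencies.
    static List<String> find(Collection<Module> modules, int maxDependencies) {
        Set<String> dependedUpon = new HashSet<>();
        for (Module m : modules) {
            dependedUpon.addAll(m.dependencies.keySet());
        }
        List<String> result = new ArrayList<>();
        for (Module m : modules) {
            boolean leaf = !dependedUpon.contains(m.name);
            boolean noSnapshots = m.dependencies.values().stream()
                    .noneMatch(version -> version.endsWith("-SNAPSHOT"));
            boolean fewDependencies = m.dependencies.size() <= maxDependencies;
            if (leaf && noSnapshots && fewDependencies) {
                result.add(m.name);
            }
        }
        return result;
    }
}
```

The leaf check only sees the modules passed in, so it would need the full reactor as input to be meaningful.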

One clear candidate is the oak-blob-cloud-azure module.

The simplest way to cut it from the release is to remove it from the reactor POM and keep it in the same location in SVN. The project will be built and imported into IDEs independently.

Apache Jackrabbit : Oakathon March 2018