Apache Jackrabbit : Direct Binary Access

Oak Direct Binary Access

Summary

Oak Direct Binary Access lets an authenticated client upload or download blobs directly to or from the blob store, provided the authenticated user has the appropriate permission. The primary value of this feature is that the I/O of transferring large binary files to or from the blob store can be offloaded entirely from Oak and performed directly between a client application and the blob store. The availability of this feature depends on certain capabilities of the DataStore implementation as well as of the service provider underlying that implementation.

Javadocs

Motivation

The transfer of binary data through Oak has a significant impact on the overall performance of the system. Every time a binary is uploaded or downloaded the data flows from the storage location through Oak and on to the client. Handling this I/O impacts Oak's performance. This new capability offloads the transfer of data so it takes place directly between client and storage, while still relying upon Oak to manage the content hierarchy and enforce permissions.

Use cases UC6 and UC13 from JCR Binary Usecase are related to this capability and can be at least partially addressed by it.

Features Overview

This capability offers two basic features: Direct Binary Download and Direct Binary Upload.

Direct Binary Download

This feature allows a client application to download a binary directly from a supported storage location via a URL. The application user must be successfully authenticated to the system and must have permission to read the binary in question in order to obtain a Binary object. The client may then attempt to cast this object to a BinaryDownload. If the cast succeeds, the client may call a new JCR API, getURI(BinaryDownloadOptions), which returns a URL that the client may then use to read the binary directly.
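Once the client holds the download URL, reading the binary is an ordinary HTTP transfer that does not pass through Oak. The following is a minimal sketch of that last step only, using nothing but the JDK. The class name DirectDownload, the method download and the target path are hypothetical names chosen for illustration. The URI is assumed to be the value returned by getURI(BinaryDownloadOptions).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the Oak or Jackrabbit API.
final class DirectDownload {

    /**
     * Streams the content behind the given URI into the target file and
     * returns the number of bytes written. The content is copied as a
     * stream, so a large binary is never held in memory as a whole.
     */
    static long download(URI downloadUri, Path target) throws IOException {
        // For http(s) URIs, openStream() issues a GET request and throws an
        // IOException if the server answers with an error status.
        try (InputStream in = downloadUri.toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Because the URL is expected to be short-lived, a client following this sketch would request it immediately before starting the transfer rather than storing it for later use.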

This block diagram shows the main parties involved for downloading (oak-direct-binary-download-block-diagram.gliffy):

Direct Binary Upload

This feature allows a client application to upload a binary directly to a supported storage location via one or more URLs. This is done in a three-step process.

First, the application obtains a reference to the active ValueFactory. If this is an instance of JackrabbitValueFactory, the application calls a new JCR API, initiateBinaryUpload(long, int), passing two arguments: the maximum expected or known size of the binary, and the maximum number of upload URLs that the client can support. The call returns a BinaryUpload object containing instructions for completing the upload, including one or more URLs to use and an upload token that is required later.

Second, the client application uses one or more of the provided URLs, obtained by calling getUploadURIs() on the returned BinaryUpload object, to upload the binary directly to the storage. Multi-part uploads are supported and handled automatically if desired: all the client need do is upload chunks of the binary using the provided URLs in sequence.

Third, when all the uploads are complete, the client uses the JackrabbitValueFactory to call a new JCR API, completeBinaryUpload(String), passing the signed upload token obtained by calling getUploadToken() on the BinaryUpload object returned from the original call to initiateBinaryUpload(long, int). This call returns a Binary. Calling completeBinaryUpload(String) notifies the storage provider that the uploaded parts are to be assembled into a single binary object. It also means that the corresponding property need not be created until after the binary upload is complete, at which point the returned Binary can be associated with the property.
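Uploading "chunks of the binary using the provided URLs in sequence" requires the client to decide how large each chunk is. The following is a minimal sketch of one way to do that, not the algorithm mandated by Oak. The class UploadPartPlanner and the method planParts are hypothetical, and the minPartSize and maxPartSize limits are assumed inputs standing in for whatever part-size limits the storage provider imposes. They are not described in this document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Oak or Jackrabbit API.
final class UploadPartPlanner {

    /**
     * Returns the size in bytes of each part, in upload order. Part i is
     * meant to be sent to the i-th upload URL. Only the last part may be
     * smaller than the others (and smaller than minPartSize).
     */
    static List<Long> planParts(long binarySize, int uriCount,
                                long minPartSize, long maxPartSize) {
        if (binarySize <= 0 || uriCount <= 0
                || minPartSize <= 0 || maxPartSize < minPartSize) {
            throw new IllegalArgumentException("invalid size or limit");
        }
        // Spread the binary evenly over the available URLs (ceiling
        // division), then clamp to the assumed provider limits.
        long partSize = (binarySize + uriCount - 1) / uriCount;
        partSize = Math.max(minPartSize, Math.min(maxPartSize, partSize));

        long partsNeeded = (binarySize + partSize - 1) / partSize;
        if (partsNeeded > uriCount) {
            throw new IllegalArgumentException(
                    "binary does not fit into " + uriCount + " parts");
        }

        List<Long> parts = new ArrayList<>();
        for (long remaining = binarySize; remaining > 0; remaining -= partSize) {
            parts.add(Math.min(partSize, remaining));
        }
        return parts;
    }
}
```

For example, planParts(25, 3, 5, 10) yields parts of 9, 9 and 7 bytes. When clamping to the minimum leaves fewer parts than URLs, the surplus URLs simply go unused.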

This block diagram shows the main parties involved for uploading (oak-direct-binary-upload-block-diagram.gliffy):

The following sequence diagram shows the detailed steps. Note that the "Oak" agent aggregates the JCR, Root/Tree, NodeStore and BlobStore layers. (sources for the diagram)

Security

There are a number of security-related aspects to be considered for this capability.

URLs

This new capability allows clients to interact directly with storage via URLs. Obviously this increases the security exposure of the storage in question. However, the risk can be managed via signed URLs with short TTLs. Thus to improve security it is strongly recommended to use a storage service provider that:

  • Can provide URLs for both storing and retrieving content directly to/from the storage provider
  • Enforces TTLs on provided URLs and disallows their use after the TTL has expired
  • Signs URLs and includes a signature in the URL, which is used to verify both the origin of the URL and that the contents of the URL have not been altered since approval
  • Enforces permissions on provided URLs to only perform the action(s) originally requested when the URL was obtained

Both Amazon AWS S3 and Microsoft Azure Blob Storage meet these requirements, and thus both S3DataStore and AzureDataStore can be configured for this purpose. Note that it will still be incumbent upon administrators to avoid configuring the system in insecure ways (for example, configuring a TTL that is far too large to be reasonably secure).
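To make the requirements above concrete, the following sketch shows how a signature can bind a URL to one action, one path and one expiry time. It is an illustration of the general technique using HMAC-SHA256 from the JDK, and is not the signing scheme used by S3, Azure or Oak. The class SignedUrlSketch, the query parameter names expires and sig, and the shared secret key are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration, not the scheme of any real storage provider.
final class SignedUrlSketch {

    private static String hmac(byte[] key, String message) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] raw = mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Signs the HTTP method, the path and the expiry time together. */
    static String sign(byte[] key, String method, String path, long expiresEpochSec) {
        String sig = hmac(key, method + "\n" + path + "\n" + expiresEpochSec);
        return path + "?expires=" + expiresEpochSec + "&sig=" + sig;
    }

    /** Accepts a request only if it is unexpired and the signature matches. */
    static boolean verify(byte[] key, String method, String path,
                          long expiresEpochSec, String sig, long nowEpochSec) {
        if (nowEpochSec > expiresEpochSec) {
            return false; // TTL has passed
        }
        String expected = hmac(key, method + "\n" + path + "\n" + expiresEpochSec);
        // Constant-time comparison avoids leaking how many characters matched.
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                sig.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because the method, path and expiry are all part of the signed message, a URL issued for a GET cannot be reused for a PUT, pointed at a different object, or have its expiry extended without invalidating the signature.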

Authentication

To interact with this capability programmatically via the JCR API, a client needs to establish a valid session which requires valid user credentials to authenticate.

Authorization

To obtain a download URL, the client must first obtain a valid Binary object that is an instance of BinaryImpl. This can be done via an authenticated session calling into the JCR API to request a Binary for an existing object. Because the binary must first be obtained from the JCR, it has already been established that the user represented by the session has permission to read the binary, so giving the URL to this user is acceptable.

Once a binary is uploaded (direct or otherwise), an authenticated session must have the correct permission to associate the Binary with a binary property in the JCR. Thus it is not possible to add a binary to the repository by using direct binary upload unless sufficient permissions exist to add the property. Note that it may be possible to upload the binary to blob storage this way without such permission; however, this is also possible using the traditional upload mechanism. Such unreferenced binaries are not considered part of the repository and will be garbage collected. If a client wishes to avoid uploading a binary when no permission exists to add the binary to a property, the client should first check for adequate permission at the property path and then perform the upload.

Other Approaches

Direct Binary Download (Sept / Oct 2017)

A proposal was submitted in September / October of 2017 to evaluate a form of direct binary download support for Oak. This proposal was discussed in OAK-6575 as well as in this thread and this thread on oak-dev and in this thread and this thread on sling-dev. A primary difference between this proposal and the one from 2017 is that the 2017 proposal was to support an ability to convert a Binary directly to a signed download URL using newly added code, whereas this proposal takes the approach of a client requesting a URL for a Binary directly via API.

Each proposal takes a somewhat different approach to a similar problem, although the scope of this proposal is broader, as it covers upload as well as download. It is important to take the 2017 conversations about this issue into account as we move forward.