Apache Jackrabbit : JCR Binary Usecase

NOTE - Any proposed solution should be added to section Solution Proposals

Below are a few usecases seen in the past which cannot be met with the current Oak Binary support. All of them aim to improve performance by reducing IO where possible; feature wise, the current stream based approach supported by JCR Binary meets all requirements. The objective of this document is to capture such usecases and then come up with ways/solutions to meet them. Which of the requirements below we should try to meet is still to be discussed; the goal here is to collect usecases to initiate that discussion.

Some implementation details of the current design:

  1. S3DataStore
    1. For any read, the stream from the S3Object is first copied to a local cache and a FileInputStream is then provided from that copy
    2. Due to the above, even if code only needs to read the initial few bytes (say video metadata), the whole file is first spooled to a file in the local cache before a stream is opened on it
  2. FileDataStore
    1. Files are stored in a directory structure like /xx/yy/zz/<contenthash>, where xx, yy, zz are the initial characters of the hex encoded content hash (see the sketch below this list)
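For illustration, a minimal sketch (in Java) of how such a content-addressed path can be derived from the hex encoded hash; the datastore root and the hash value are placeholders, and the exact digest algorithm is a deployment detail of the DataStore:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Derive the FileDataStore style path /xx/yy/zz/<contenthash> from the hex encoded hash.
    String id = "0a1b2c3d4e5f...";                 // hex encoded content hash (placeholder)
    Path blobPath = Paths.get("/path/to/datastore", // hypothetical datastore root
            id.substring(0, 2),                     // xx
            id.substring(2, 4),                     // yy
            id.substring(4, 6),                     // zz
            id);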

Usecases

UC1 - processing a binary in JCR with a native library that only has access to the file system

Need access to the absolute path of the file which backs a JCR Binary when using FileDataStore, for processing by a native program

There are deployments where lots of images get uploaded to the repository and some conversions (rendition generation) are performed by OS specific native executables. Such programs work directly on a file handle.

Without this change we currently need to first spool the file content into some temporary location and then pass that to the other program. This adds unnecessary overhead which could be avoided when the DataStore already keeps the binary content as a file on the file system, as FileDataStore does.
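For context, a hedged sketch of the workaround described above: the binary is spooled to a temporary file purely so that a native tool (the command name here is made up) can be pointed at a path:

    import java.io.File;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.StandardCopyOption;
    import javax.jcr.Binary;

    // Current workaround: spool the JCR Binary to a temp file so a native tool can read it.
    Binary bin = node.getNode("jcr:content").getProperty("jcr:data").getBinary();
    File tmp = File.createTempFile("oak-binary", ".tmp");
    try (InputStream in = bin.getStream()) {
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
    // "render-tool" is a hypothetical native executable working on file handles
    new ProcessBuilder("render-tool", tmp.getAbsolutePath()).start().waitFor();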

UC2 - Efficient replication across regions in S3

For binary less replication between multiple Sling instances in a non shared DataStore setup across multiple regions, we need access to the S3Object ID backing the blob so that it can be efficiently copied to a bucket in a different region via the S3 Copy command.

DataStore - S3DataStore

This is for setups running on Oak with S3DataStore. Consider a global deployment where a Sling based app (running on Oak with S3DataStore) runs in one region and binary content has to be distributed to publish instances running in other regions. The DataStore is huge, say 100TB, and for efficient operation we need to use binary less replication. In most cases only a very small subset of the binary content would need to be present in other regions. The current way to support this (via a shared DataStore) would involve synchronizing the S3 bucket across all such regions, which would increase the storage cost considerably.

Instead, the plan is to replicate the specific assets via an S3 copy operation. This ensures that big assets can be copied efficiently at the S3 level.
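A hedged sketch of such a server side copy with the AWS SDK for Java (v1); bucket names, keys and region are placeholders, and the actual key layout is whatever the S3DataStore uses internally:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CopyObjectRequest;

    // Server side copy: the bytes stay inside S3 instead of being streamed through the JVM.
    AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("eu-west-1").build();
    s3.copyObject(new CopyObjectRequest(
            "source-bucket", "datastore/0a1b2c...",   // blob record backing the JCR Binary
            "target-bucket", "datastore/0a1b2c..."));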

Note that such a case can also exist for other DataStores, where binary content can be retrieved from a source DataStore and added to a target DataStore in an optimal way (copying the binary from one repository to another).

UC3 - Text Extraction without temporary File with Tika

Avoid creation of temporary files where possible

While performing text extraction with Tika, a temporary file is often created because many parsers need random access to the binary. When using a BlobStore implementation where the binary already exists as a file, we could use a TikaInputStream backed by that file, which would avoid creating such a temporary file and thus speed up text extraction.
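A minimal sketch of what this could look like, assuming a reasonably recent Tika version and a hypothetical getBackingFile accessor that exposes the file behind the JCR Binary:

    import java.io.File;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    // A TikaInputStream created from a file gives parsers random access
    // without Tika having to spool the stream to its own temporary file.
    File backing = getBackingFile(binary);   // hypothetical accessor on the JCR Binary
    try (TikaInputStream tis = TikaInputStream.get(backing.toPath())) {
        new AutoDetectParser().parse(tis, new BodyContentHandler(), new Metadata());
    }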

Going forward, if we make use of out of process text extraction, this aspect would be useful there as well.

UC4 - Spooling the binary content to socket output via NIO

Enable use of NIO based zero copy file transfers

DataStore - S3DataStore, FileDataStore

For some time Jetty has had support for async IO and zero copy file transfer. This allows transferring file content to the HTTP socket without it passing through the JVM and should thus improve throughput.
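A sketch of the underlying mechanism with plain java.nio; FileChannel.transferTo can delegate to sendfile(2) where the platform supports it, so the bytes never enter the JVM heap. The path is a placeholder and targetChannel stands for whatever WritableByteChannel the container provides:

    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Zero copy transfer of a datastore file to a channel provided by the container.
    try (FileChannel src = FileChannel.open(
            Paths.get("/path/to/datastore/0a/1b/2c/0a1b2c..."), StandardOpenOption.READ)) {
        long pos = 0, size = src.size();
        while (pos < size) {
            pos += src.transferTo(pos, size - pos, targetChannel);
        }
    }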

The key aspect here is that we should be able to avoid IO where possible. Also have a look at the Kafka design, which tries to make use of the OS page cache as much as possible and avoids IO through the JVM where it can, thus providing much better throughput.

UC5 - Transferring the file to FileDataStore with minimal overhead

Need a way to construct a JCR Binary via a File reference where "ownership" of the File instance is transferred, say via rename, without spooling its content again

DataStore - FileDataStore

In some deployments a customer typically uploads lots of files to an FTP folder and from there the files are transferred to Oak. As mentioned in 2b above, with NAS based storage this results in the file being copied twice. To avoid the extra overhead it would be helpful if one could create the file directly on the NFS following the FileDataStore structure (content hash, split into 3 directory levels) and then add the Binary via the ReferenceBinary approach.
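A hedged sketch of the ReferenceBinary part of this idea, assuming the content has already been placed into the datastore layout out of band and a secure reference string for it has been obtained; SimpleReferenceBinary (from jackrabbit-jcr-commons) or any other ReferenceBinary implementation could serve here:

    import javax.jcr.Value;
    import javax.jcr.ValueFactory;
    import org.apache.jackrabbit.commons.jackrabbit.SimpleReferenceBinary;

    // Create a JCR value from a binary reference instead of re-streaming the content.
    // referenceString is assumed to identify content already present in the datastore.
    ValueFactory vf = session.getValueFactory();
    Value value = vf.createValue(new SimpleReferenceBinary(referenceString));
    node.setProperty("jcr:data", value);
    session.save();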

UC6 - S3 import

This is somewhat similar to the previous case but is more about S3 support.

The usecase here is that a customer has lots of existing binaries that need to be imported into the Oak repository. The binaries might already exist on S3 or on their existing systems. S3 has lots of tooling to import large data sets efficiently, so it is faster to bulk upload such binaries to an S3 bucket and then somehow hand them over to Oak for further management.

The problem, though: how to efficiently get them into the S3DataStore, ideally without moving them.

UC7 - Random write access in binaries

Think of a video file exposed to the desktop via WebDAV. Desktop tools would do random writes in that file. How can we cover this use case without uploading/downloading the large file? (Essentially: random write access in binaries.)

UC8 - X-SendFile

X-SendFile is an Apache module which enables spooling file content directly via Apache internals, including all optimizations like caching headers and sendfile or mmap if configured. So if the file is present on a filesystem Apache can access, it is spooled in a much more efficient way, avoiding additional load on the JVM.

For this feature to work, the web layer (e.g. Sling) needs to know the path to the binary. Note that the path is not disclosed to the client.
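A minimal sketch of the hand-off on the Sling/servlet side, assuming a hypothetical resolveDataStorePath helper that maps the requested resource to the blob's local path:

    import javax.servlet.http.HttpServletResponse;

    // Instead of streaming the bytes, only a header pointing at the file is sent;
    // Apache (mod_xsendfile) then serves the file content itself.
    response.setContentType("video/mp4");
    response.setHeader("X-Sendfile", resolveDataStorePath(resource));  // hypothetical helper
    // no response body is written by the application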

To an extent this feature is similar to UC1, however the scope here is broader.

NB: although mod_xsendfile in Apache needs a path on the file system local to the Apache instance to get the binary, that path is just a pointer. It does not need to be the same path that Oak uses internally, provided it is presented as a path on the file system. Other variants of X-Sendfile (https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/) allow that pointer to be resolved to an HTTP location for streaming.

UC9 - S3 datastore in a cluster

Currently, each cluster node connects to S3 and has a "local cache" on the file system. Binaries are uploaded asynchronously to S3, which means they are first written to the local cache. If a binary is added on one cluster node and async upload is enabled, it is not immediately available on S3 to be read from another cluster node. See also OAK-4903.

UC10 - Efficiently concatenate / split / modify binaries

Multiple (some large) TIFF files are concatenated into a PTIFF file; when reading, in some cases only a subset of the PTIFF is read, sometimes the whole file. The metadata of an existing (large) TIFF is changed.

UC11 - Text extraction during upload

Thomas One possible option would be to do text extraction while uploading. It will not scale if all uploads happen on the same machine however.

UC12 - Text extraction stored "next to" binary

Thomas Extracted text (independent of who created it) could be stored in the datastore, possibly other metadata as well.

UC13 - adopt-a-binary

Mentioned by Bertrand on oak-dev. The idea here is that files already exist somewhere and are hence not "managed" by Oak. The binary content just needs to be made accessible via the JCR Binary API. This could possibly be implemented with hierarchical BlobStore support, where one of the BlobStores resolves such "external" blobIds.

UC14 - Hierarchical BlobStore

The idea here is to have blobs stored in different BlobStores depending on type/usage/path, while reads happen via the same Binary API. Something like this existed in JR2 as MultiDataStore; also see the JR2 Wiki.
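As an illustration only (none of these types exist in Oak), routing could look like a composite that picks a delegate per blobId prefix while the Binary API on top stays unchanged:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical composite BlobStore: reads are dispatched by blobId prefix,
    // e.g. "ext:" for adopted external binaries (UC13), everything else to the default store.
    class CompositeBlobStoreSketch {
        interface Delegate {
            InputStream read(String blobId) throws IOException;
        }

        private final Map<String, Delegate> delegates = new LinkedHashMap<>();

        InputStream getInputStream(String blobId) throws IOException {
            for (Map.Entry<String, Delegate> e : delegates.entrySet()) {
                if (blobId.startsWith(e.getKey())) {
                    return e.getValue().read(blobId);
                }
            }
            throw new IOException("No BlobStore registered for " + blobId);
        }
    }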

Solution Proposals

UC1 - processing a binary in JCR with a native library that only has access to the file system

Thomas

I assume the native library only requires read-only access and will not delete or move the file; could we somehow enforce this? Maybe using a symbolic link to the real file in a read-only directory?

How do we ensure Oak GC doesn't delete the binary too early? One solution is that if the native library reads the file (or knows it will need to read the file soon), it updates the last modified time; this should already work. Another solution might be to add the file name to a text file (a read log), but that would probably be more complicated and probably wouldn't improve performance much.

Ian This use case needs to cover file systems and other storage mechanisms like S3. Controlling access is outside the scope of what Oak can control and depends on deployment teams.

UC3 - Text Extraction without temporary File with Tika

Thomas

We could add random access features to the binary, and possibly change Tika so it doesn't require a java.io.File but maybe a java.nio.channels.SeekableByteChannel, or just FileChannel, or ByteBuffer, or similar. We might still need to write a wrapper for Tika, but maybe not.
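A hypothetical sketch of such an API addition (the SeekableBinary name is made up; nothing like it exists in Oak today):

    import java.nio.channels.SeekableByteChannel;
    import javax.jcr.Binary;
    import javax.jcr.RepositoryException;

    // Optional extension of javax.jcr.Binary exposing random access for consumers like Tika.
    public interface SeekableBinary extends Binary {
        /** Returns a read-only channel positioned at the start of the binary. */
        SeekableByteChannel getChannel() throws RepositoryException;
    }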

UC4 - Spooling the binary content to socket output via NIO

Thomas Similar to Tika, we could extend the binary and add the missing / required features (for example get a ByteBuffer, which might be backed by a memory mapped file).

Ian Transfers should drill down to the underlying stream to see if it supports NIO and use it if present. For example, the DataStore may be a file system DataStore which supports NIO, as does S3, and the Jetty stream also supports NIO, so it should be possible for a servlet to get hold of both streams and connect the channels. This requires that the streams are available directly, which in turn requires the rest of the implementation to be efficient enough not to require local caching and copies of the files. There are a number of issues in Sling that need addressing first, some of which are being worked on: streaming uploads, streamed downloads to IE11, etc. I don't think adding NIO capabilities to streams that don't natively support NIO is the right solution; it will only hide a more fundamental issue. The biggest issue (imvho) is that JCR Binary doesn't provide an OutputStream and an InputStream directly connected to the raw underlying storage, blocking the client from performing the zero cost transfers available to most other stacks.

UC5 - Transferring the file to FileDataStore with minimal overhead

Thomas

Or provide a way to create a JCR binary from a temp file. Oak might then move (File.renameTo or otherwise) the file or copy the content if needed. That way we don't expose the implementation details (hash algorithm, file name format).
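A hypothetical sketch of what such an entry point could look like (name and signature are purely illustrative):

    import java.io.File;
    import java.io.IOException;
    import javax.jcr.Binary;
    import javax.jcr.RepositoryException;

    // Hand a temp file to the repository and let it take ownership: the implementation
    // may rename/move the file into the data store, or fall back to copying the content.
    public interface TempFileBinaryFactory {
        Binary createBinaryFromTempFile(File tempFile) throws RepositoryException, IOException;
    }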

UC6 - S3 import

Thomas Or provide a way to create a JCR binary from an S3 binary. Moving from S3 to S3 might not be a problem.

UC7 + UC10 - Random write access in binaries

Thomas

The Oak BlobStore chunks binaries, so chunks could be shared among binaries. Random writes could then copy just the references to the chunks where possible. That would make random writes relatively efficient, but binaries would still be immutable. We would need to add the required API. Please note this only works when using the BlobStore, not the current FileDataStore and S3DataStore as is (at least a wrapper around the FileDataStore / S3DataStore would be needed). This includes efficiently cutting away some bytes in the middle of a binary, or inserting some bytes. Typical file systems don't support this case efficiently, however with the BlobStore it is possible.
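An illustrative sketch (not an existing API) of how a random write could be expressed on top of a chunked BlobStore: unchanged chunks are reused by reference and only modified data is stored. writeChunks and storeChunkList are hypothetical helpers.

    import java.util.ArrayList;
    import java.util.List;

    // Build the chunk list of the new (immutable) binary from the old one.
    List<String> newChunkIds = new ArrayList<>();
    newChunkIds.addAll(oldChunkIds.subList(0, firstChangedChunk));           // reuse prefix by reference
    newChunkIds.addAll(writeChunks(modifiedBytes));                          // store only the changed data
    newChunkIds.addAll(oldChunkIds.subList(lastChangedChunk + 1, oldChunkIds.size())); // reuse suffix
    String newBlobId = storeChunkList(newChunkIds);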

UC8 - X-SendFile

Ian The aim of X-Sendfile is to offload the streaming of large binaries from an expensive server capable of performing complex authorizations to a less expensive farm of servers capable of streaming data to a large number of clients. The X-Sendfile header provides a location where the upstream proxy can find the response. That location has to be resolvable by the upstream server, it may contain authZ for the response, and it may not divulge the structure of the store of neighbouring resources. What can be achieved depends on the implementation of the upstream server's X-Sendfile capability. The Apache mod_xsendfile module only supports mapping the location to the filesystem, so the DataStore would have to be mounted. Other X-Sendfile implementations, like nginx's X-Accel-Redirect which created the concept, support mapping the location through to any URI, including HTTP locations. This would allow the X-Accel-Redirect location to be mapped through to an HTTP location capable of serving >C10K requests, all streaming. In AWS, ELBs support signed URLs, so if an S3 store needed to be exposed, the X-Accel-Redirect location could be an S3 bucket location fronted by an ELB configured to only allow access to signed requests conforming to an ACL policy, that policy including token expiry. Other variants of this are possible, including requiring signed URLs and hosting the content behind an elastic farm of Node.js/Jetty or any C10K capable server, each one validating the signature and token on every request from the nginx front end. To achieve this, Oak or Sling would need to expose the pointer to the binary and document a signing structure giving access to that binary. If the identifier of the Binary is already exposed via JCR properties, this may already be possible, with knowledge of the DataStore, without any changes to Oak.

Documentation on AWS ELB signed URLs: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-signed-urls.html. Documentation on nginx's original concept: https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/

UC9 - S3 datastore in a cluster

Thomas Possible solutions are: (1) disable async upload; (2) use a shared local cache (NFS for example), though it is not clear whether that works correctly. Other solutions (which would probably need more work) could be to send / request the binary using the broadcasting cache mechanism.

Ian This has been partially addressed with streaming uploads in Sling Engine 2.4.6 and Sling Servlets Post 2.3.14. When the async cache is disabled, session.save() connects the request InputStream for the upload directly to the S3 OutputStream, performing the transfer with no local disk IO, using a byte[]. As noted under UC4, this should be done over NIO wherever possible. Downloads of the binary also need to be streamed in a similar way. Local disk IO is reported to be an expensive commodity by those who deploy Sling/AEM at scale.