Apache Jackrabbit: Goals and non-goals for Jackrabbit 3

Design principle

Best effort: anything might be corrupt at any time, including:

  • node types
  • child node existence

Clients may not make any consistency assumptions.

Goals

  • Pass the TCK, although the TCK might be adapted for invalid or edge cases.
  • Node type consistency on save and on set type (including mixins). Inconsistencies that occur due to write skew or degradation effects are acceptable, though.
  • Scalability:
    • Read throughput: no degradation from current Jackrabbit 2, repeated reads are not slow, random reads take advantage of locality. TODO: Needs further clarification
    • High write throughput across cluster nodes.
    • Big lists of direct child nodes (10M)
    • Concurrent writes within a single cluster node. TODO: Needs further clarification: concurrency itself might not be the goal but the means to reach high single-user throughput
    • Big transactions (> 100k nodes at 1kB each)
    • Start up time < 1s
    • Number of nodes in repository: 100M
    • Number of nodes in shared cloud: 10T
    • 1G binaries with 2MB per binary => 2PB Repository size
  • Simple/Fast queries (i.e. through specialized indexes) (3ms)
  • Partitioning of observation. TODO: Needs further clarification
    • Handling of recursive deletes: large number of NODE_REMOVED events vs. delete event for specific properties in subtree.
  • Number of users: 200M / 20M per group
  • Full versioning model
  • Flexible durability (depending on durability guarantees of back end)
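
The size targets above can be cross-checked with a little arithmetic. The sketch below assumes decimal (SI) units, which the list does not state explicitly, and the variable names are illustrative only.

```python
# Assumption: decimal (SI) units; the list does not say whether it means
# decimal or binary prefixes.
KB, MB, PB = 10**3, 10**6, 10**15

# "1G binaries with 2MB per binary"
binaries = 10**9
binary_size = 2 * MB
repo_bytes = binaries * binary_size

# "Big transactions (> 100k nodes at 1kB each)"
tx_nodes = 100_000
node_size = 1 * KB
tx_bytes = tx_nodes * node_size

print(repo_bytes / PB)  # 2.0   (petabytes of binary content)
print(tx_bytes / MB)    # 100.0 (megabytes at the lower bound of a big transaction)
```

Under these assumptions the binaries alone account for the 2PB figure, and a "big" transaction starts at roughly 100MB of node data.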

Non-goals

  • Node type consistency when node type definition changes
  • Consistency guarantees
  • Scalability:
    • Big property list
    • Same name siblings
    • Namespace remapping
  • Query index completeness
  • Fast move
  • JCR lock support (best effort only)
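
Since consistency guarantees are listed as a non-goal and the goals accept inconsistencies caused by write skew, a small illustration of write skew may help. The sketch below is not Jackrabbit code: the in-memory `repo` dictionary, the `Session` class, and the invariant "a node with the mixin must have a title" are all hypothetical, chosen only to show how two sessions that each validate against their own snapshot can jointly break a constraint.

```python
import copy

# Hypothetical in-memory "repository" holding a single node's state.
repo = {"mixin": False, "title": "draft"}

def is_consistent(state):
    """Assumed invariant: a node with the mixin must carry a title."""
    return (not state["mixin"]) or (state["title"] is not None)

class Session:
    """Snapshot-style session: reads a private copy and records its own writes."""

    def __init__(self, repository):
        self.snapshot = copy.deepcopy(repository)
        self.writes = {}

    def set(self, key, value):
        self.writes[key] = value

    def save(self, repository):
        # Validation only sees this session's snapshot plus its own writes,
        # not changes saved concurrently by other sessions.
        view = {**self.snapshot, **self.writes}
        if not is_consistent(view):
            raise ValueError("constraint violation")
        repository.update(self.writes)

a = Session(repo)
b = Session(repo)
a.set("mixin", True)   # valid in A's view: the title is still present
b.set("title", None)   # valid in B's view: the mixin is not set
a.save(repo)
b.save(repo)           # write sets are disjoint, so nothing conflicts

print(repo)                 # {'mixin': True, 'title': None}
print(is_consistent(repo))  # False
```

Each save passes its own check, yet the combined result violates the invariant. Preventing this would require validating against the latest state at commit time, which is the kind of guarantee this list places out of scope.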

Maybe

  • Scalability:
    • Large number of values for multi valued properties
  • Sharable nodes
  • Fast delete

TBD

  • Everything is content: search index, configuration, workspaces
    • At what level (i.e. JCR, SPI, Microkernel, persistence store)?
  • Microkernel portable to C:
    • Or maybe better "language agnostic API"
  • Flexible persistence layer (RDBMS, Cassandra, ...)
  • Small and embeddable
    • How small?
    • Embeddable into what?
  • Characteristics of clustering (partitioning, replication, merging, consistency)
  • Tunable consistency (e.g. when clustered)