Class Lucene45DocValuesFormat

  • All Implemented Interfaces:
    NamedSPILoader.NamedSPI

    public final class Lucene45DocValuesFormat
    extends DocValuesFormat
    Lucene 4.5 DocValues format.

    Encodes the four per-document value types (Numeric,Binary,Sorted,SortedSet) with these strategies:

    NUMERIC:

    • Delta-compressed: per-document integers written in blocks of 16k. For each block the minimum value in that block is encoded, and each entry is a delta from that minimum value. Each block of deltas is compressed with bitpacking. For more information, see BlockPackedWriter.
    • Table-compressed: when the number of unique values is very small (< 256), and when there are unused "gaps" in the range of values used (such as SmallFloat), a lookup table is written instead. Each per-document entry is instead the ordinal to this table, and those ordinals are compressed with bitpacking (PackedInts).
    • GCD-compressed: when all numbers share a common divisor, such as dates, the greatest common divisor (GCD) is computed, and quotients are stored using Delta-compressed Numerics.
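    The GCD and delta strategies above can be illustrated with a minimal sketch. It is not the Lucene implementation: the class and method names are hypothetical, a single block is assumed, the GCD is taken over the differences from the minimum (an assumption consistent with the MinValue and GCD fields of the metadata), and arithmetic overflow is ignored.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the Lucene API.
final class GcdDeltaSketch {
    // Greatest common divisor of two non-negative longs (Euclid's algorithm).
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Returns {min, gcd, q0, q1, ...} such that values[i] == min + gcd * q[i].
    static long[] encode(long[] values) {
        long min = Arrays.stream(values).min().getAsLong();
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);
        }
        if (g == 0) {
            g = 1; // all values are equal, so every quotient is 0
        }
        long[] out = new long[values.length + 2];
        out[0] = min;
        out[1] = g;
        for (int i = 0; i < values.length; i++) {
            out[i + 2] = (values[i] - min) / g;
        }
        return out;
    }

    // Inverse of encode.
    static long[] decode(long[] encoded) {
        long[] values = new long[encoded.length - 2];
        for (int i = 0; i < values.length; i++) {
            values[i] = encoded[0] + encoded[1] * encoded[i + 2];
        }
        return values;
    }

    // Number of bits needed to bit-pack non-negative quotients up to max.
    static int bitsRequired(long max) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    public static void main(String[] args) {
        // Timestamps that are whole days expressed in milliseconds.
        long[] days = {259200000L, 432000000L, 864000000L};
        System.out.println(Arrays.toString(encode(days)));
    }
}
```

    For day-granularity timestamps the quotients are small integers, so each one fits in a few bits instead of 64.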

    BINARY:

    • Fixed-width Binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed directly with multiplication (docID * length).
    • Variable-width Binary: one large concatenated byte[] is written, along with end addresses for each document. The addresses are written in blocks of 16k, with the current absolute start for the block, and the average (expected) delta per entry. For each document the deviation from the delta (actual - expected) is written.
    • Prefix-compressed Binary: values are written in chunks of 16, with the first value written completely and other values sharing prefixes. Chunk addresses are written in blocks of 16k, with the current absolute start for the block, and the average (expected) delta per entry. For each chunk the deviation from the delta (actual - expected) is written.
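    The address encoding used by the variable-width and prefix-compressed strategies (block start, average delta, per-entry deviation) can be sketched as follows. This is an illustration under assumptions, not the Lucene code: the names are hypothetical, one block is assumed, and the average is held as a float so that encoder and decoder compute the same expected value.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the Lucene API.
final class MonotonicAddressSketch {
    // Average (expected) delta between consecutive addresses in one block.
    static float averageDelta(long[] addresses) {
        int n = addresses.length;
        return n == 1 ? 0f : (float) (addresses[n - 1] - addresses[0]) / (n - 1);
    }

    // Expected address of entry i, given the block start and average delta.
    static long expected(long start, float avg, int i) {
        return start + (long) (avg * i);
    }

    // Deviation (actual - expected) for every entry in the block.
    static long[] deviations(long[] addresses) {
        float avg = averageDelta(addresses);
        long[] dev = new long[addresses.length];
        for (int i = 0; i < addresses.length; i++) {
            dev[i] = addresses[i] - expected(addresses[0], avg, i);
        }
        return dev;
    }

    // Rebuilds the absolute addresses from start, average delta, and deviations.
    static long[] restore(long start, float avg, long[] dev) {
        long[] addresses = new long[dev.length];
        for (int i = 0; i < dev.length; i++) {
            addresses[i] = expected(start, avg, i) + dev[i];
        }
        return addresses;
    }

    public static void main(String[] args) {
        long[] addresses = {5, 9, 16, 20, 25};
        System.out.println(Arrays.toString(deviations(addresses)));
    }
}
```

    When value lengths are roughly uniform the deviations stay close to zero, which is what makes them cheap to bit-pack.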

    SORTED:

    • Sorted: a mapping of ordinals to deduplicated terms is written as Prefix-Compressed Binary, along with the per-document ordinals written using one of the numeric strategies above.

    SORTED_SET:

    • SortedSet: a mapping of ordinals to deduplicated terms is written as Prefix-Compressed Binary, and an ordinal list and a per-document index into this list are written using the numeric strategies above.
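    As a rough sketch of the SortedSet layout, the three pieces (term dictionary, ordinal list, per-document index) could be built like this. The class is hypothetical; it uses String terms in natural order rather than byte[] values, and it assumes the per-document index holds the end offset of each document's ordinals in the list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Lucene API.
final class SortedSetSketch {
    final List<String> terms;  // ordinal -> deduplicated term, in sorted order
    final long[] ordList;      // ordinals of all documents, concatenated
    final long[] endOffsets;   // per document: end offset into ordList (assumed layout)

    SortedSetSketch(List<List<String>> docs) {
        TreeSet<String> unique = new TreeSet<>();
        for (List<String> doc : docs) {
            unique.addAll(doc);
        }
        terms = new ArrayList<>(unique);
        Map<String, Integer> ordOf = new HashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            ordOf.put(terms.get(i), i);
        }
        List<Long> ords = new ArrayList<>();
        endOffsets = new long[docs.size()];
        for (int d = 0; d < docs.size(); d++) {
            // Each document's ordinals are deduplicated and increasing.
            TreeSet<Integer> docOrds = new TreeSet<>();
            for (String term : docs.get(d)) {
                docOrds.add(ordOf.get(term));
            }
            for (int ord : docOrds) {
                ords.add((long) ord);
            }
            endOffsets[d] = ords.size();
        }
        ordList = ords.stream().mapToLong(Long::longValue).toArray();
    }

    // Looks up the terms of one document through the index and ordinal list.
    List<String> termsOf(int doc) {
        int start = doc == 0 ? 0 : (int) endOffsets[doc - 1];
        int end = (int) endOffsets[doc];
        List<String> out = new ArrayList<>();
        for (int i = start; i < end; i++) {
            out.add(terms.get((int) ordList[i]));
        }
        return out;
    }
}
```

    Both ordList and endOffsets are plain sequences of non-negative integers, which is why they can reuse the numeric strategies.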

    Files:

    1. .dvd: DocValues data
    2. .dvm: DocValues metadata

    1. The DocValues metadata or .dvm file.

      For each DocValues field, this stores metadata, such as the offset into the DocValues data (.dvd).

      DocValues metadata (.dvm) --> Header,<Entry>^NumFields

      • Entry --> NumericEntry | BinaryEntry | SortedEntry | SortedSetEntry
      • NumericEntry --> GCDNumericEntry | TableNumericEntry | DeltaNumericEntry
      • GCDNumericEntry --> NumericHeader,MinValue,GCD
      • TableNumericEntry --> NumericHeader,TableSize,Int64^TableSize
      • DeltaNumericEntry --> NumericHeader
      • NumericHeader --> FieldNumber,EntryType,NumericType,MissingOffset,PackedVersion,DataOffset,Count,BlockSize
      • BinaryEntry --> FixedBinaryEntry | VariableBinaryEntry | PrefixBinaryEntry
      • FixedBinaryEntry --> BinaryHeader
      • VariableBinaryEntry --> BinaryHeader,AddressOffset,PackedVersion,BlockSize
      • PrefixBinaryEntry --> BinaryHeader,AddressInterval,AddressOffset,PackedVersion,BlockSize
      • BinaryHeader --> FieldNumber,EntryType,BinaryType,MissingOffset,MinLength,MaxLength,DataOffset
      • SortedEntry --> FieldNumber,EntryType,BinaryEntry,NumericEntry
      • SortedSetEntry --> EntryType,BinaryEntry,NumericEntry,NumericEntry
      • FieldNumber,PackedVersion,MinLength,MaxLength,BlockSize,ValueCount --> VInt
      • EntryType,CompressionType --> Byte
      • Header --> CodecHeader
      • MinValue,GCD,MissingOffset,AddressOffset,DataOffset --> Int64
      • TableSize --> VInt

      Sorted fields have two entries: a BinaryEntry with the value metadata, and an ordinary NumericEntry for the document-to-ord metadata.

      SortedSet fields have three entries: a BinaryEntry with the value metadata, and two NumericEntries for the document-to-ord-index and ordinal list metadata.

      FieldNumber of -1 indicates the end of metadata.

      EntryType is 0 (NumericEntry) or 1 (BinaryEntry).

      DataOffset is the pointer to the start of the data in the DocValues data (.dvd)

      NumericType indicates how Numeric values will be compressed:

      • 0 --> delta-compressed. For each block of 16k integers, every integer is delta-encoded from the minimum value within the block.
      • 1 --> gcd-compressed. When all integers share a common divisor, only quotients are stored using blocks of delta-encoded ints.
      • 2 --> table-compressed. When the number of unique numeric values is small and it would save space, a lookup table of unique values is written, followed by the ordinal for each document.
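      Table compression can be sketched as below. The names are hypothetical and the sorted table order is a choice made for this example, not something the format description states; the 256-entry limit follows the "< 256" threshold mentioned for this strategy.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the Lucene API.
final class TableCompressionSketch {
    // Lookup table of the unique values (sorted here so ordinals can be found by binary search).
    static long[] buildTable(long[] values) {
        long[] table = Arrays.stream(values).distinct().sorted().toArray();
        if (table.length >= 256) {
            throw new IllegalArgumentException("too many unique values for a table");
        }
        return table;
    }

    // Replaces every per-document value by its ordinal in the table.
    static int[] toOrdinals(long[] values, long[] table) {
        int[] ords = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            ords[i] = Arrays.binarySearch(table, values[i]);
        }
        return ords;
    }

    // Bits needed to bit-pack one ordinal for a table of the given size.
    static int bitsPerOrdinal(int tableSize) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(tableSize - 1));
    }

    public static void main(String[] args) {
        long[] values = {100, 7, 100, 7, 42};
        long[] table = buildTable(values);
        System.out.println(Arrays.toString(toOrdinals(values, table)));
    }
}
```

      With three unique values each document costs 2 bits, regardless of how large or how sparse the original values are.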

      BinaryType indicates how Binary values will be stored:

      • 0 --> fixed-width. All values have the same length, addressing by multiplication.
      • 1 --> variable-width. An address for each value is stored.
      • 2 --> prefix-compressed. An address to the start of every interval'th value is stored.
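      The prefix-compressed layout can be approximated with the sketch below. It is a simplification under assumptions: values are Strings rather than byte[], each entry is rendered as "sharedPrefixLength:suffix" for readability, and the shared prefix is computed against the previous value, with the first value of every interval written in full. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Lucene API.
final class PrefixChunkSketch {
    // Length of the common prefix of a and b.
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Encodes sorted values; every interval'th value starts a chunk and is written completely.
    static List<String> encode(List<String> sorted, int interval) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            String value = sorted.get(i);
            int shared = (i % interval == 0) ? 0 : sharedPrefix(sorted.get(i - 1), value);
            out.add(shared + ":" + value.substring(shared));
        }
        return out;
    }

    // Rebuilds the values by reusing the prefix of the previously decoded value.
    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (String entry : encoded) {
            int colon = entry.indexOf(':');
            int shared = Integer.parseInt(entry.substring(0, colon));
            String value = previous.substring(0, shared) + entry.substring(colon + 1);
            out.add(value);
            previous = value;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "apply", "apricot", "banana");
        System.out.println(encode(terms, 16));
    }
}
```

      Because a chunk's first value is complete, a reader can jump to the stored address of any chunk and decode forward from there without seeing earlier chunks.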

      MinLength and MaxLength represent the min and max byte[] value lengths for Binary values. If they are equal, then all values are of a fixed size, and can be addressed as DataOffset + (docID * length). Otherwise, the binary values are of variable size, and packed integer metadata (PackedVersion,BlockSize) is written for the addresses.
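      The fixed-width addressing rule, DataOffset + (docID * length), amounts to a direct slice of the data. A minimal sketch, assuming the whole data file is available as an in-memory byte[] (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of the Lucene API.
final class FixedWidthSketch {
    // Returns the value of docID when every value has the same length.
    static byte[] valueOf(byte[] data, long dataOffset, int length, int docID) {
        int start = Math.toIntExact(dataOffset + (long) docID * length);
        return Arrays.copyOfRange(data, start, start + length);
    }

    public static void main(String[] args) {
        // Two unrelated leading bytes, then three values of length 3.
        byte[] data = "XXaaabbbccc".getBytes(StandardCharsets.US_ASCII);
        byte[] value = valueOf(data, 2, 3, 1);
        System.out.println(new String(value, StandardCharsets.US_ASCII));
    }
}
```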

      MissingOffset points to a byte[] containing a bitset of all documents that had a value for the field. If it is -1, then there are no missing values.
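      A bitset of this kind could be built and queried as in the sketch below. The bit order (document d stored in bit d & 7 of byte d >> 3) is an assumption made for the example, and the names are hypothetical.

```java
// Hypothetical illustration only; not part of the Lucene API.
final class MissingBitsetSketch {
    // Builds one bit per document; a set bit means the document has a value.
    static byte[] build(boolean[] hasValue) {
        byte[] bits = new byte[(hasValue.length + 7) >>> 3];
        for (int doc = 0; doc < hasValue.length; doc++) {
            if (hasValue[doc]) {
                bits[doc >> 3] |= (byte) (1 << (doc & 7));
            }
        }
        return bits;
    }

    // A MissingOffset of -1 means no bitset was written and every document has a value.
    static boolean hasValue(long missingOffset, byte[] bits, int doc) {
        if (missingOffset == -1) {
            return true;
        }
        return (bits[doc >> 3] & (1 << (doc & 7))) != 0;
    }

    public static void main(String[] args) {
        boolean[] present = {true, false, true, false, false, false, false, true, false, true};
        byte[] bits = build(present);
        System.out.println(hasValue(0L, bits, 7) + " " + hasValue(0L, bits, 8));
    }
}
```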

    2. The DocValues data or .dvd file.

      For each DocValues field, this stores the actual per-document data (the heavy lifting).

      DocValues data (.dvd) --> Header,<NumericData | BinaryData | SortedData>^NumFields

      SortedSet entries store the list of ordinals in their BinaryData as a sequence of increasing vLongs, delta-encoded.
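      Delta-encoding an increasing ordinal sequence as vLongs can be sketched as follows. The variable-length layout assumed here is the usual one of seven payload bits per byte, low-order group first, with the high bit marking that more bytes follow; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Lucene API.
final class DeltaVLongSketch {
    // Writes a non-negative long using 7 bits per byte, low-order bits first.
    static void writeVLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes increasing ordinals as the vLong of each difference from the previous one.
    static byte[] encode(long[] increasingOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long ord : increasingOrds) {
            writeVLong(out, ord - previous);
            previous = ord;
        }
        return out.toByteArray();
    }

    // Reads vLong deltas until the input is exhausted and sums them back into ordinals.
    static long[] decode(byte[] bytes) {
        List<Long> ords = new ArrayList<>();
        int pos = 0;
        long previous = 0;
        while (pos < bytes.length) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            ords.add(previous);
        }
        return ords.stream().mapToLong(Long::longValue).toArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new long[] {3, 300});
        System.out.println(Arrays.toString(decode(encoded)));
    }
}
```

      Because the ordinals within a document are increasing, the deltas are small and most of them fit in a single byte.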