COCO’s LEB128-like Compression#

Run-length encoding already compresses masks by storing run lengths instead of pixels. But how do we store the run lengths themselves?

The naive approach uses fixed-width integers. A uint32 per run gives us 4 bytes × number of runs. For a mask with 100 runs, that’s 400 bytes. But most run lengths are small—a few hundred pixels at most. We’re wasting bits on leading zeros.

Variable-length encoding saves space by using fewer bytes for smaller numbers.

LEB128 background#

LEB128 (Little Endian Base 128) is a standard variable-length integer encoding used in DWARF debug info, WebAssembly, and Protocol Buffers.

The idea: use 7 bits per byte for data, and the high bit as a continuation flag. If the high bit is 1, more bytes follow. If 0, this is the last byte.

Example: encoding 300 (binary: 100101100)

  • Split into 7-bit chunks from the right: 0000010, 0101100

  • Reverse order (little endian): 0101100, 0000010

  • Add continuation bits: 10101100, 00000010

  • Result: two bytes, 0xAC 0x02

Small numbers (0-127) fit in one byte. Larger numbers grow as needed.

COCO’s modification#

COCO stores masks in JSON. Binary bytes would need base64 encoding or escaping, adding overhead and complexity.

COCO’s solution: use 6 bits per character instead of 8 bits per byte. 5 data bits + 1 continuation bit = 6 bits, which maps to 64 values. Add 48 to get ASCII codes 48-111, which are the characters ‘0’ through ‘o’.

All printable, all JSON-safe, no escaping needed. The mask becomes a readable string like "oW3a0f2N" in the JSON file.

The tradeoff: 5 data bits per character vs 7 data bits per byte means roughly 40% more characters than a binary encoding would need bytes. But JSON overhead for binary data (base64) would be 33% anyway, and this stays human-inspectable.

Delta encoding#

COCO adds another layer: delta encoding. Instead of storing raw run lengths, store the difference from the previous run of the same type.

Runs alternate between 0s and 1s. Consecutive runs of 0s (indices 0, 2, 4, …) often have similar lengths. Same for runs of 1s (indices 1, 3, 5, …).

So we encode:

encoded[i] = cnts[i] - cnts[i-2]    (for i >= 2)
encoded[i] = cnts[i]                (for i < 2)

For typical blob-like objects, this works well: RLE scans column by column, and a continuous shape has similar vertical extent in neighboring columns. Each column’s foreground run length changes gradually, so deltas cluster near zero. Small deltas need fewer characters.

Signed representation#

Deltas can be negative. The encoding handles this with sign extension.

Each 5-bit chunk has a sign bit (the 5th bit, value 0x10). When decoding the last chunk, if this bit is set, we fill all higher bits with 1s, giving a negative two’s complement number.

Example: encoding -3

  • -3 in two’s complement (infinite width): …11111101

  • Take last 5 bits: 11101

  • High bit (0x10) is set, so this signals negative

  • No more chunks needed (remaining bits are all 1s)

  • Add continuation bit (0): 011101

  • Add 48: ASCII 77, which is ‘M’

Decoding ‘M’:

  • ‘M’ is ASCII 77, subtract 48: 29 = 0b011101

  • Continuation bit (bit 5, 0x20): 0, so this is the last chunk

  • Data bits (bits 0-4, 0x1f): 0b11101 = 29

  • Sign bit (bit 4, 0x10): set, so sign-extend with 1s

  • Result: …11111 11101 = -3

Encoding walkthrough#

Let’s encode the run lengths [8, 12, 6, 15].

First, compute deltas:

cnts[0] = 8   → delta = 8   (no previous run of same type)
cnts[1] = 12  → delta = 12  (no previous run of same type)
cnts[2] = 6   → delta = 6 - 8 = -2
cnts[3] = 15  → delta = 15 - 12 = 3

Now encode each delta:

8: binary 01000

  • Fits in 5 bits, sign bit (0x10) is 0, matches sign (positive)

  • No continuation needed

  • Chunk: 01000 = 8, add 48 → ASCII 56 = ‘8’

12: binary 01100

  • Fits in 5 bits, sign bit is 0, matches sign

  • Chunk: 01100 = 12, add 48 → ASCII 60 = ‘<’

-2: binary …111110

  • Last 5 bits: 11110

  • Sign bit is 1, remaining bits are all 1s, so we’re done

  • Chunk: 11110 = 30, add 48 → ASCII 78 = ‘N’

3: binary 00011

  • Fits in 5 bits, sign bit is 0, matches sign

  • Chunk: 00011 = 3, add 48 → ASCII 51 = ‘3’

Result: "8<N3"

Decoding walkthrough#

Decode "8<N3" back to run lengths.

‘8’: 56 - 48 = 8 = 0b01000

  • Continuation bit: 0, last chunk

  • Data: 8, sign bit 0, no extension

  • Delta: 8, cnts[0] = 8

‘<’: 60 - 48 = 12 = 0b01100

  • Continuation bit: 0, last chunk

  • Data: 12, sign bit 0, no extension

  • Delta: 12, cnts[1] = 12

‘N’: 78 - 48 = 30 = 0b11110

  • Continuation bit: 0, last chunk

  • Data: 30 & 0x1f = 30, sign bit 1, extend with 1s

  • Delta: -2, cnts[2] = cnts[0] + (-2) = 8 - 2 = 6

‘3’: 51 - 48 = 3 = 0b00011

  • Continuation bit: 0, last chunk

  • Data: 3, sign bit 0, no extension

  • Delta: 3, cnts[3] = cnts[1] + 3 = 12 + 3 = 15

Result: [8, 12, 6, 15]

Compression ratio#

For typical segmentation masks:

  • Average ~1.5 characters per run (empirically measured on human pose datasets)

  • Compare to 4 bytes per uint32: roughly 2.7× compression

  • 99th percentile is still under 2 characters per run

Optional gzip compression (stored as zcounts instead of counts) adds another ~40% reduction for masks with repetitive structure.

The code#

Encoding in rleToString:

for each run i:
    x = cnts[i]
    if i > 2:
        x -= cnts[i-2]  # delta from same-type run

    do:
        c = x & 0x1f           # low 5 bits
        x >>= 5                # arithmetic shift
        more = (c & 0x10) ? (x != -1) : (x != 0)
        if more:
            c |= 0x20          # set continuation bit
        output(c + 48)
    while more

Decoding in rleFrString:

for each run:
    x = 0
    k = 0
    do:
        c = input() - 48
        x |= (c & 0x1f) << k
        more = c & 0x20
        k += 5
    while more

    if c & 0x10:               # sign extend
        x |= -1 << k

    if i > 2:
        x += cnts[i-2]         # undo delta
    cnts[i] = x