Random-Access Compressed String Store (RACSS)
Cut storage costs for large static datasets with zero decompression latency
RACSS is a lookup-optimized string and attribute store: it provides direct random access to individual records inside heavily compressed data, with no streaming state to rebuild. It is not a streaming compressor.
What RACSS actually provides
- instant random access to compressed data with zero decoding latency;
- significant storage size reduction for string- and attribute-heavy datasets;
- minimal RAM footprint (no full-block or full-file decompression);
- tiny and simple decoder (~200 lines of C), easy to integrate and audit;
- substantial I/O throughput gain due to reduced data transfers;
- real-time performance on resource-limited hardware;
- flash- and mmap-friendly access pattern (no staging buffers, no working set expansion).
Core concept
Strings are stored using a recursive dictionary representation:
- dictionary entries may reference other entries;
- references are compact and efficiently encoded;
- decoding requires no auxiliary tables or external state.
In real-world datasets, recursion depth is very shallow, even for large and highly redundant inputs.
This makes RACSS suitable for systems where:
- stack depth is limited,
- memory usage must be predictable,
- decompression logic must remain simple and inspectable.
Example: how RACSS tokenizes real text
Below is a minimal real-world example showing how RACSS tokenizes text.
Input text (8 lines):
Jingle Bells, Jingle Bells
Jingle all the way
Oh what fun it is to ride in a
One horse open sleigh
Jingle Bells, Jingle Bells
Jingle all the way
Oh what fun it is to ride in a one
Horse open sleigh
After compression, RACSS produces two kinds of entries:
- L - logical input lines (stored as token sequences)
- D - dictionary entries (reusable substrings)
Tokens in square brackets ([n]) refer to dictionary entries by index.
Logical lines
L 1 :"[7]"
L 2 :"[3]"
L 3 :"[2]"
L 4 :"On[6]h[4]"
L 5 :"[7]"
L 6 :"[3]"
L 7 :"[2] one"
L 8 :"H[4]"
Each logical line is stored independently as a sequence of literals and dictionary references.
Dictionary entries
D 1 :"[5]Bells"
D 2 :"Oh what fu[8]it is to rid[6]i[8]a"
D 3 :"[5]all th[6]way"
D 4 :"ors[6]ope[8]sleigh"
D 5 :"Jingl[6]"
D 6 :"e "
D 7 :"[1], [1]"
D 8 :"n "
The dictionary is self-referential: dictionary entries may reference other dictionary entries. This allows RACSS to represent recurring substrings compactly without flattening them into a single global stream.
What this demonstrates
- No global stream dependency — each line is decoded independently.
- Fine-grained reuse — common substrings like "Jingle", "ells", "e ", "n " are factored out once and reused everywhere.
- Recursive structure, not linear history — RACSS builds a small DAG of substrings instead of a sliding window.
- True random access — any logical line is reconstructed directly, without decoding unrelated data.
How RACSS differs from gzip / LZ / zstd
Aspect                  gzip/LZ    RACSS
Data model              Stream     String collection
Random access           -          +
Partial decompression   -          +
Runtime state           Complex    Minimal
Embedded suitability    Limited    High
RACSS targets a different problem than gzip/zstd/LZ: random-access lookup in compressed data with predictable latency. General-purpose compressors optimize sequential streams; RACSS optimizes retrieval.
RACSS is particularly useful for game localization files, dictionaries, navigation databases, and embedded systems, where compact storage and efficient random access are critical.
Sizes are in bytes; percentages are compressed size as a share of raw size.

File                        Raw size   RACSS             GZIP
"Wonderful World" lyrics    596        444 (74.5%)       338 (56.6%)
"Let My People Go" lyrics   730        359 (49.2%)       280 (38.3%)
10,000 chemical names       1245737    346364 (27.8%)    334000 (26.8%)
Multilingual article        2445       1234 (50.5%)      1067 (43.6%)
Typical application areas
LLM / AI infrastructure
- static side-data around models (tokenizers, dictionaries, ID maps),
- metadata catalogs, lookup tables and reference data,
- read-mostly indices replicated across many nodes,
- situations where NVMe/RAM per node limits scaling and latency spikes are unacceptable.
Game engines and consoles
- localization string tables,
- dialogs and UI text,
- tight storage constraints.
Large static structured datasets
- name tables and labels,
- object catalogs and reference databases,
- read-only metadata collections.
Embedded and legacy devices
- firmware with fixed flash size,
- security updates on old hardware,
- environments with strict stack and RAM limits.
Offline reference data
- dictionaries,
- help systems,
- static text databases.
Demo retrieval tool
The distribution includes a minimal demo retrieval program (`rfetch`), implemented in approximately 200 lines of C.
Its purpose is to:
- demonstrate that random access decompression works in practice;
- show that no additional in-memory data structures are required;
- allow performance measurement and inspection of decoding logic.
This program is not a full API or a reference specification. It is intentionally minimal.
Supported modes:
Usage:
  ./rfetch <file.rc>                      - unpack all lines (default)
  ./rfetch <file.rc> N                    - unpack line N (1-based)
  ./rfetch <file.rc> 0                    - print raw lines and dictionary in debug format
  ./rfetch <file.rc> -N                   - unpack dict entry N (1-based)
  ./rfetch <file.rc> <out-of-range num>   - print valid range and header
What is published
This release provides:
- an example of the RACSS binary format as produced by the toolchain;
- a minimal, self-contained decompression demo;
- sample datasets for comparison with gzip.
It intentionally does not include:
- a formal format specification;
- a full reference decoder API;
- dictionary construction algorithms.
The goal is to demonstrate feasibility, performance, and simplicity, not to disclose the full compression pipeline.
Why this is commercially relevant
RACSS enables:
- extending the lifetime of legacy hardware;
- reducing storage footprint without architectural changes;
- deterministic, low-risk runtime behavior;
- easy integration into existing embedded and firmware projects.
It targets products where storage size, predictability, and backward compatibility matter more than peak compression ratio.
In practical terms, RACSS is both a cost reducer and a capacity unlocker:
- reduce required flash/NVMe footprint for static assets and catalogs;
- keep datasets compressed without paying block- or file-level decompression CPU at read time;
- avoid architectural workarounds around monolithic compression (file splitting, pre-expansion, RAM buffering);
- enable predictable tail latency for random reads in constrained or high-scale environments.
If you can share one representative dataset (or we use a public proxy dataset), we can produce a short benchmark report: compression ratio vs gzip/zstd (including index overhead), random-access latency distribution (p50/p99), and peak memory usage during retrieval.
This directory contains the decompressor source code and examples of compressed data.
Contact us to discuss the technology or to request a technical evaluation. We’ll reply by email.