Random-Access Compressed String Store (RACSS)
Cut storage costs for large static datasets with zero decompression latency
RACSS is a lookup-optimized string and attribute store: it provides direct random access to individual records inside heavily compressed data, with no streaming state to rebuild. It is not a streaming compressor.
What RACSS actually provides
- instant random access to compressed data with zero decoding latency;
- significant storage size reduction for string- and attribute-heavy datasets;
- minimal RAM footprint (no full-block or full-file decompression);
- tiny and simple decoder (~200 lines of C), easy to integrate and audit;
- substantial I/O throughput gain due to reduced data transfers;
- real-time performance on resource-limited hardware;
- flash- and mmap-friendly access pattern (no staging buffers, no working set expansion).
Core concept
Strings are stored using a recursive dictionary representation:
- dictionary entries may reference other entries;
- references are compact and efficiently encoded;
- decoding requires no auxiliary tables or external state.
In real-world datasets, recursion depth is very shallow, even for large and highly redundant inputs.
This makes RACSS suitable for systems where:
- stack depth is limited,
- memory usage must be predictable,
- decompression logic must remain simple and inspectable.
Example: how RACSS tokenizes real text
Below is a minimal real-world example showing how RACSS tokenizes text.
Input text (8 lines):
Jingle Bells, Jingle Bells
Jingle all the way
Oh what fun it is to ride in a
One horse open sleigh
Jingle Bells, Jingle Bells
Jingle all the way
Oh what fun it is to ride in a one
Horse open sleigh
After compression, RACSS produces two kinds of entries:
- L - logical input lines (stored as token sequences)
- D - dictionary entries (reusable substrings)
Tokens in square brackets ([n]) refer to dictionary entries by index.
Logical lines
L 1 :"[7]"
L 2 :"[3]"
L 3 :"[2]"
L 4 :"On[6]h[4]"
L 5 :"[7]"
L 6 :"[3]"
L 7 :"[2] one"
L 8 :"H[4]"
Each logical line is stored independently as a sequence of literals and dictionary references.
Dictionary entries
D 1 :"[5]Bells"
D 2 :"Oh what fu[8]it is to rid[6]i[8]a"
D 3 :"[5]all th[6]way"
D 4 :"ors[6]ope[8]sleigh"
D 5 :"Jingl[6]"
D 6 :"e "
D 7 :"[1], [1]"
D 8 :"n "
The dictionary is self-referential: dictionary entries may reference other dictionary entries. This allows RACSS to represent recurring substrings compactly without flattening them into a single global stream.
What this demonstrates
- No global stream dependency — each line is decoded independently.
- Fine-grained reuse — common substrings like "Jingle", "ells", "e ", "n " are factored out once and reused everywhere.
- Recursive structure, not linear history — RACSS builds a small DAG of substrings instead of a sliding window.
- True random access — any logical line is reconstructed directly, without decoding unrelated data.
How RACSS differs from gzip / LZ / zstd
Aspect                  gzip/LZ    RACSS
Data model              Stream     String collection
Random access           -          +
Partial decompression   -          +
Runtime state           Complex    Minimal
Embedded suitability    Limited    High
RACSS targets a different problem than gzip/zstd/LZ: random-access lookup in compressed data with predictable latency. General-purpose compressors optimize sequential streams; RACSS optimizes retrieval.
RACSS is particularly useful for game localization files, dictionaries, navigation databases, and embedded systems, where compact storage and efficient random access are critical.
Sizes are in bytes; percentages are compressed size as a share of raw size.

File                        Raw size   RACSS             GZIP
"Wonderful World" lyrics    596        444 (74.5%)       338 (56.6%)
"Let My People Go" lyrics   730        359 (49.2%)       280 (38.3%)
10,000 chemical names       1245737    346364 (27.8%)    334000 (26.8%)
Multilingual article        2445       1234 (50.5%)      1067 (43.6%)
Typical application areas
LLM / AI infrastructure
- static side-data around models (tokenizers, dictionaries, ID maps),
- metadata catalogs, lookup tables and reference data,
- read-mostly indices replicated across many nodes,
- situations where NVMe/RAM per node limits scaling and latency spikes are unacceptable.
Game engines and consoles
- localization string tables,
- dialogs and UI text,
- tight storage constraints.
Large static structured datasets
- name tables and labels,
- object catalogs and reference databases,
- read-only metadata collections.
Embedded and legacy devices
- firmware with fixed flash size,
- security updates on old hardware,
- environments with strict stack and RAM limits.
Offline reference data
- dictionaries,
- help systems,
- static text databases.
Demo retrieval tool
The distribution includes a minimal demo retrieval program (`rfetch`), implemented in approximately 200 lines of C.
Its purpose is to:
- demonstrate that random access decompression works in practice;
- show that no additional in-memory data structures are required;
- allow performance measurement and inspection of decoding logic.
This program is not a full API or a reference specification. It is intentionally minimal.
Supported modes:
Usage:
  ./rfetch <file.rc>                      - unpack all lines (default)
  ./rfetch <file.rc> N                    - unpack line N (1-based)
  ./rfetch <file.rc> 0                    - print raw lines and dictionary in debug format
  ./rfetch <file.rc> -N                   - unpack dict entry N (1-based)
  ./rfetch <file.rc> <out-of-range num>   - print valid range and header
What is published
This release provides:
- an example of the RACSS binary format as produced by the toolchain;
- a minimal, self-contained decompression demo;
- sample datasets for comparison with gzip.
It intentionally does not include:
- a formal format specification;
- a full reference decoder API;
- dictionary construction algorithms.
The goal is to demonstrate feasibility, performance, and simplicity, not to disclose the full compression pipeline.
Why this is commercially relevant
RACSS enables:
- extending the lifetime of legacy hardware;
- reducing storage footprint without architectural changes;
- deterministic, low-risk runtime behavior;
- easy integration into existing embedded and firmware projects.
It targets products where storage size, predictability, and backward compatibility matter more than peak compression ratio.
In practical terms, RACSS is both a cost reducer and a capacity unlocker:
- reduce required flash/NVMe footprint for static assets and catalogs;
- keep datasets compressed without paying block- or file-level decompression CPU at read time;
- avoid architectural workarounds around monolithic compression (file splitting, pre-expansion, RAM buffering);
- enable predictable tail latency for random reads in constrained or high-scale environments.
If you can share one representative dataset (or we use a public proxy dataset), we can produce a short benchmark report: compression ratio vs gzip/zstd (including index overhead), random-access latency distribution (p50/p99), and peak memory usage during retrieval.
This directory contains the decompressor source code and examples of compressed data.
Contact us to discuss the technology or to request a technical evaluation. We’ll reply by email.