Abstract
Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques—such as deduplication and compression—are either LLM-oblivious or not compatible with each other, limiting data reduction effectiveness.
Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication offers strong synergy with model-aware compressors.
Building on these insights, we present BitX, an effective, fast, lossless delta compression algorithm that compresses XORed redundancy between fine-tuned and base LLMs. We build ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, ZipLLM reduces model storage consumption by 49.5%, over 20% more than state-of-the-art deduplication and compression designs.
Motivation
Exponential Growth of LLM Storage
Large language models (LLMs) have become foundational tools in modern artificial intelligence (AI), with millions of models now publicly available through model hubs like Hugging Face and TensorFlow Hub.
Hugging Face alone hosts over 10 petabytes (PB) of models, with storage volume growing exponentially. By the end of 2025, projections suggest that total model storage will surpass 1 exabyte (EB)—three orders of magnitude more than in 2023.
Two observations underscore this challenge:
- Fine-tuned LLMs vastly outnumber base models and contribute disproportionately to overall storage footprint
- LLM storage is dominated by two floating-point formats: BF16 and FP32
Hugging Face's model count and storage consumption grow at an exponential rate
Key Insights
Element-wise weight deltas are small and structured
Fine-tuned models derived from the same base exhibit tiny differences, making them highly suitable for lossless delta compression.
Bitwise similarity enables LLM clustering
Bit distance, a new metric based on bitwise Hamming distance, serves as a lightweight, robust signal for grouping LLMs by family, supporting applications like model provenance, duplicate detection, and clustering.
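The exact definition is not spelled out in this summary, so the snippet below is a minimal sketch: it treats bit distance as the Hamming distance over the raw 16-bit payloads of two aligned tensors, normalized by the total bit count. The function name and the uint16 representation are illustrative assumptions.

```python
import numpy as np

def bit_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized bitwise Hamming distance between two aligned tensors.

    `a` and `b` are raw 16-bit payloads of matching tensors (e.g. BF16
    weights read as uint16). Returns a value in [0, 1]; models fine-tuned
    from the same base score close to 0, unrelated models noticeably higher.
    """
    assert a.shape == b.shape and a.dtype.itemsize == 2
    xor = np.bitwise_xor(a.view(np.uint16), b.view(np.uint16))
    differing = int(np.unpackbits(xor.view(np.uint8)).sum())
    return differing / (a.size * 16)
```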
Chunk-based deduplication is LLM-oblivious and suboptimal
Chunk-level deduplication operates on raw byte streams without awareness of LLM structure, discarding the structural information that model-aware compressors need. It is also computationally expensive and scales poorly with storage capacity.
Tensor-level deduplication is synergistic with LLM-aware compressors
Deduplicating directly at tensor granularity achieves deduplication ratios comparable to content-defined chunking (CDC) with significantly lower metadata overhead, while preserving the tensor structure that model-aware compressors rely on (a sketch follows below).
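As a concrete illustration, the sketch below hashes each tensor's raw bytes and keeps one copy per unique hash, recording a per-file manifest for reconstruction. It assumes models are stored as safetensors files; the function name, the `store` dictionary, and the manifest layout are illustrative, not ZipLLM's actual metadata format.

```python
import hashlib
from safetensors.numpy import load_file  # assumes the `safetensors` package

def dedup_tensors(paths, store):
    """Content-hash every tensor; keep one copy of each unique tensor in `store`.

    Returns a per-file manifest mapping tensor names to the stored hashes
    needed to reconstruct the file.
    """
    manifests = {}
    for path in paths:
        tensors = load_file(path)                        # {name: np.ndarray}
        manifest = {}
        for name, tensor in tensors.items():
            digest = hashlib.sha256(tensor.tobytes()).hexdigest()
            store.setdefault(digest, tensor.tobytes())   # store first occurrence only
            manifest[name] = digest
        manifests[path] = manifest
    return manifests
```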
BitX: Lossless Delta Compression
BitX Compression Algorithm
BitX is a highly effective, fast, lossless delta compression algorithm that compresses LLM variants using XOR-based deltas.
Key features:
- Operates on the bit level to capture subtle differences between related models
- Uses XOR operation to identify exact bit-level changes between models
- Applies a generic lossless compression algorithm (zstd) to further reduce storage
- Preserves exact bit-level fidelity for model weights
BitX workflow:
- Align matching tensors from base and fine-tuned models
- Perform bitwise XOR on corresponding values
- Compress the resulting bit patterns using zstd
- Store reference to base model and compressed delta
BitX compression workflow. The example uses BF16, but BitX supports all floating-point types.
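A minimal sketch of this workflow, assuming matching tensors are already aligned and available as 16-bit payloads (e.g. BF16 read as uint16), and using the `zstandard` Python bindings for zstd; the real BitX implementation may stream tensors and lay out the XORed bits differently before compression.

```python
import numpy as np
import zstandard  # assumes the `zstandard` bindings for zstd

def bitx_compress(base: np.ndarray, finetuned: np.ndarray, level: int = 3) -> bytes:
    """XOR the raw bits of an aligned tensor pair, then zstd-compress the delta."""
    assert base.shape == finetuned.shape
    delta = np.bitwise_xor(base.view(np.uint16), finetuned.view(np.uint16))
    return zstandard.ZstdCompressor(level=level).compress(delta.tobytes())

def bitx_decompress(base: np.ndarray, blob: bytes) -> np.ndarray:
    """Decompress the delta and XOR it back onto the base: bit-exact recovery."""
    delta_bytes = zstandard.ZstdDecompressor().decompress(blob)
    delta = np.frombuffer(delta_bytes, dtype=np.uint16).reshape(base.shape)
    return np.bitwise_xor(base.view(np.uint16), delta)
```

Because XOR is its own inverse, decompression reproduces the fine-tuned weights bit for bit, which is what makes the scheme lossless; the closer a fine-tune is to its base, the sparser the XORed bit pattern and the better zstd compresses it.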
ZipLLM System
System Architecture
ZipLLM is a model storage reduction pipeline that synergizes tensor-level deduplication and lossless BitX compression. It combines multiple strategies to address both exact and approximate redundancy in LLM storage.
Overview of the ZipLLM storage reduction workflow.
File Deduplication
Eliminates exact file duplicates using hash-based comparison.
Tensor-level Deduplication
Operates directly at the tensor level, leveraging model structure explicitly exposed in LLM formats.
LLM Clustering
Uses the bit distance metric to identify model families and group related models.
BitX Compression
Compresses XORed redundancy between fine-tuned and base LLMs.
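Putting the stages together, a hypothetical in-memory ingestion flow could look like the sketch below. It reuses `bit_distance` and `bitx_compress` from the earlier sketches; the threshold, data structures, and function names are illustrative assumptions, and the exact file-level deduplication step (a whole-file hash check) is omitted for brevity.

```python
import hashlib

def ingest(model, base_models, tensor_store, threshold=0.1):
    """Hypothetical ingestion flow: cluster, then dedup tensors and BitX-compress.

    `model` and every entry of `base_models` map tensor names to uint16 arrays;
    `tensor_store` maps content hashes to stored bytes. Returns a manifest that
    records how to reconstruct each tensor.
    """
    def model_distance(base):
        shared = [n for n in model if n in base and model[n].shape == base[n].shape]
        return (sum(bit_distance(model[n], base[n]) for n in shared) / len(shared)
                if shared else 1.0)

    # LLM clustering: pick the closest base by average bit distance.
    base_id, base = min(base_models.items(), key=lambda kv: model_distance(kv[1]),
                        default=(None, {}))
    use_base = base_id is not None and model_distance(base) < threshold

    manifest = {}
    for name, tensor in model.items():
        if use_base and name in base and base[name].shape == tensor.shape:
            payload, kind = bitx_compress(base[name], tensor), ("bitx", base_id)
        else:
            payload, kind = tensor.tobytes(), ("raw", None)
        digest = hashlib.sha256(payload).hexdigest()
        tensor_store.setdefault(digest, payload)      # tensor-level deduplication
        manifest[name] = (kind, digest)
    return manifest
```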
Results
Performance Evaluation
ZipLLM was evaluated on 1,742 randomly sampled LLMs from Hugging Face, with the following results:
- 49.5% reduction in storage size
- 20% higher reduction ratio than state-of-the-art designs
- 2× higher ingestion throughput
These results demonstrate that ZipLLM significantly outperforms existing solutions while maintaining lossless compression and high performance.
ZipLLM achieves both a high data reduction ratio and throughput.