ZipLLM

Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression

Abstract

Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques—such as deduplication and compression—are either LLM-oblivious or not compatible with each other, limiting data reduction effectiveness.

Our large-scale characterization study across all publicly-available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication offers strong synergy with model-aware compressors.

Building on these insights, we present BitX, an effective, fast, lossless delta compression algorithm that compresses XORed redundancy between fine-tuned and base LLMs. We build ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, ZipLLM reduces model storage consumption by 49.5%, over 20% more than state-of-the-art deduplication and compression designs.

Motivation

Exponential Growth of LLM Storage

Large language models (LLMs) have become foundational tools in modern artificial intelligence (AI), with millions of models now publicly available through model hubs like Hugging Face and TensorFlow Hub.

Hugging Face alone hosts over 10 petabytes (PB) of models, with storage volume growing exponentially. By the end of 2025, projections suggest that total model storage will surpass 1 exabyte (EB)—three orders of magnitude more than in 2023.

Two observations underscore this challenge:

  • Fine-tuned LLMs vastly outnumber base models and contribute disproportionately to overall storage footprint
  • LLM storage is dominated by two floating-point formats: BF16 and FP32

Growth of model repositories at Hugging Face

Hugging Face's model count and storage consumption grow at an exponential rate.

Key Insights

Element-wise weight deltas are small and structured

Fine-tuned models derived from the same base exhibit tiny differences, making them highly suitable for lossless delta compression.
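
To make this concrete, the sparsity of these deltas can be checked directly by comparing corresponding tensors of a base checkpoint and one of its fine-tunes. The sketch below is a minimal illustration that assumes both models are stored as safetensors files with matching tensor names and loads them with PyTorch; the file paths are placeholders.

    # Sketch: quantify element-wise deltas between a base model and a fine-tune.
    # Paths and tensor layout are illustrative; assumes safetensors + PyTorch.
    from safetensors import safe_open

    def delta_stats(base_path, ft_path):
        equal, total, abs_delta = 0, 0, 0.0
        with safe_open(base_path, framework="pt") as base, \
             safe_open(ft_path, framework="pt") as ft:
            for name in base.keys():
                if name not in ft.keys():
                    continue  # tensor only exists in the base model
                b, f = base.get_tensor(name), ft.get_tensor(name)
                if b.shape != f.shape:
                    continue  # tensor was reshaped or replaced; not a simple delta
                equal += (b == f).sum().item()
                total += b.numel()
                abs_delta += (b.float() - f.float()).abs().sum().item()
        return equal / total, abs_delta / total  # fraction identical, mean |delta|

    print(delta_stats("base/model.safetensors", "finetune/model.safetensors"))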

Bitwise similarity enables LLM clustering

Bit distance, a new metric based on bitwise Hamming distance, serves as a lightweight, robust signal for grouping LLMs by family, supporting applications like model provenance, duplicate detection, and clustering.
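
As a rough illustration, a Hamming-style bit distance between two equally shaped BF16 tensors can be computed by XORing their raw 16-bit patterns and counting the differing bits. The sketch below normalizes by the total bit count; the exact definition and normalization used in the paper may differ.

    # Sketch: fraction of raw bits that differ between two same-shape BF16 tensors.
    import numpy as np
    import torch

    def bit_distance(a: torch.Tensor, b: torch.Tensor) -> float:
        assert a.shape == b.shape and a.dtype == b.dtype == torch.bfloat16
        # Reinterpret each BF16 value as a 16-bit integer; XOR exposes differing bits.
        xor = (a.contiguous().view(torch.int16) ^ b.contiguous().view(torch.int16)).numpy()
        diff_bits = int(np.unpackbits(xor.view(np.uint8)).sum())
        return diff_bits / (a.numel() * 16)

Averaged over the tensors two models share, this score can then serve as the grouping signal described above.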

Chunk-based deduplication is LLM-oblivious and suboptimal

Chunk-based deduplication, such as content-defined chunking (CDC), operates on raw byte streams with no awareness of LLM structure, discarding the information that model-aware compressors need to be effective. It is also computationally expensive and scales poorly with storage capacity.

Tensor-level deduplication is synergistic with LLM-aware compressors

Operating directly at tensor granularity achieves deduplication ratios comparable to CDC, but with significantly lower metadata overhead. At the same time, it preserves the model structure that LLM-aware compressors rely on, making the two techniques synergistic.
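
One way to picture this: each tensor in a safetensors file occupies a contiguous byte range recorded in the file's JSON header, so tensors can be fingerprinted without deserializing them. The sketch below is illustrative only; the hash function (SHA-256) and the in-memory index are assumptions, not ZipLLM's actual design.

    # Sketch: tensor-level deduplication by hashing each tensor's serialized bytes
    # straight out of a .safetensors file (8-byte little-endian header length,
    # JSON header with per-tensor data offsets, then the raw data buffer).
    import hashlib, json, struct

    def tensor_digests(path):
        digests = {}
        with open(path, "rb") as f:
            header_len = struct.unpack("<Q", f.read(8))[0]
            header = json.loads(f.read(header_len))
            data_start = 8 + header_len
            for name, info in header.items():
                if name == "__metadata__":
                    continue  # optional metadata entry, not a tensor
                begin, end = info["data_offsets"]
                f.seek(data_start + begin)
                digests[name] = hashlib.sha256(f.read(end - begin)).hexdigest()
        return digests

    def dedup(model_paths):
        store = {}  # digest -> (path, tensor name) of the first physical copy
        for path in model_paths:
            for name, digest in tensor_digests(path).items():
                store.setdefault(digest, (path, name))
        return store  # duplicate tensors across models collapse into one entry

A single digest per tensor also replaces the many chunk fingerprints CDC would keep for the same bytes, which is where the metadata savings come from.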

BitX: Lossless Delta Compression

BitX Compression Algorithm

BitX is a highly effective, fast, lossless delta compression algorithm that compresses LLM variants using XOR-based deltas.

Key features:

  • Operates on the bit level to capture subtle differences between related models
  • Uses XOR operation to identify exact bit-level changes between models
  • Applies a generic lossless compression algorithm (zstd) to further reduce storage
  • Preserves exact bit-level fidelity for model weights

BitX workflow (a minimal code sketch follows the figure below):

  1. Align matching tensors from base and fine-tuned models
  2. Perform bitwise XOR on corresponding values
  3. Compress the resulting bit patterns using zstd
  4. Store reference to base model and compressed delta

BitX Compression Workflow

BitX compression workflow. The example uses BF16, but BitX supports all floating-point types.
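
A minimal sketch of this workflow, assuming BF16 tensors already aligned by name and using the zstandard Python bindings, might look like the following; the byte layout and zstd settings are illustrative rather than the paper's exact implementation.

    # Sketch of the BitX workflow described above: XOR matching BF16 tensors of the
    # base and fine-tuned model, then compress the delta with zstd (zstandard package).
    import numpy as np
    import torch
    import zstandard as zstd

    def bitx_compress(base: dict, finetuned: dict) -> dict:
        """base/finetuned map tensor names to bfloat16 tensors; returns name -> compressed delta."""
        cctx = zstd.ZstdCompressor(level=3)
        deltas = {}
        for name, b in base.items():                  # step 1: align matching tensors by name
            f = finetuned[name]                       # (tensors without a match would be stored as-is)
            xor = b.contiguous().view(torch.int16) ^ f.contiguous().view(torch.int16)  # step 2: bitwise XOR
            deltas[name] = cctx.compress(xor.numpy().tobytes())                        # step 3: zstd
        return deltas                                 # step 4: stored with a reference to the base model

    def bitx_decompress(base: dict, deltas: dict) -> dict:
        dctx = zstd.ZstdDecompressor()
        out = {}
        for name, blob in deltas.items():
            b = base[name].contiguous()
            xor = torch.from_numpy(np.frombuffer(dctx.decompress(blob), dtype=np.int16).copy()).view(b.shape)
            out[name] = (b.view(torch.int16) ^ xor).view(torch.bfloat16)  # exact bits restored
        return out

Because fine-tuning typically perturbs each weight only slightly, the high-order bits of base and fine-tuned values mostly agree, so the XORed stream is dominated by zeros and compresses well under zstd; XORing the delta back into the base reproduces the fine-tuned weights bit for bit.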

ZipLLM System

System Architecture

ZipLLM is a model storage reduction pipeline that synergizes tensor-level deduplication and lossless BitX compression. It combines multiple strategies to address both exact and approximate redundancy in LLM storage.

ZipLLM System Workflow

Overview of the ZipLLM storage reduction workflow.

File Deduplication

Eliminates exact file duplicates using hash-based comparison.
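
A minimal sketch of such a pass, using SHA-256 over whole files as an illustrative choice:

    # Sketch: exact file-level deduplication via whole-file hashing.
    import hashlib
    from pathlib import Path

    def dedup_files(root):
        seen = {}  # digest -> first file observed with that exact content
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
                    h.update(chunk)
            digest = h.hexdigest()
            if digest in seen:
                print(f"{path} duplicates {seen[digest]}")
            else:
                seen[digest] = path
        return seen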

Tensor-level Deduplication

Operates directly at the tensor level, leveraging model structure explicitly exposed in LLM formats.

LLM Clustering

Uses the bit distance metric to identify model families and group related models.

BitX Compression

Compresses XORed redundancy between fine-tuned and base LLMs.
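
Putting the four stages together, the end-to-end flow can be summarized as below. Every helper name is a placeholder standing in for the corresponding stage above (and the sketches in earlier sections), not ZipLLM's actual API.

    # Pseudocode-level sketch of the ZipLLM pipeline; all helpers are placeholders.
    def ingest(model_repos):
        files = dedup_files(model_repos)                # 1. drop exact file duplicates
        tensors, manifest = dedup_tensors(files)        # 2. drop exact tensor duplicates
        families = cluster_by_bit_distance(tensors)     # 3. group models into families by bit distance
        deltas = {}
        for base, finetunes in families.items():
            for model in finetunes:
                deltas[model] = bitx_compress(base, model)  # 4. XOR against the family base, then zstd
        return tensors, manifest, deltas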

Results

Performance Evaluation

ZipLLM was evaluated on 1,742 randomly sampled LLMs from Hugging Face, with the following results:

  • 49.5% reduction in storage size
  • 20% higher reduction ratio than state-of-the-art designs
  • 2× higher ingestion throughput

These results demonstrate that ZipLLM significantly outperforms existing solutions while maintaining lossless compression and high performance.

Compression results summary

ZipLLM achieves both a high data reduction ratio and throughput.