Use model compression pathways #1419

Open · wants to merge 8 commits into main

Conversation

@kylesayrs (Collaborator) commented May 8, 2025

Purpose

  • Use the in-memory model compression pathway to reduce memory requirements when saving models
  • Together with the postprocessing changes, these changes move users toward a pattern where they are aware of the model's status (frozen/compressed) and call `save_pretrained` manually

Prerequisites

Changes

  • Modify `save_pretrained_wrapper` to use `compress_model(model)` rather than `compress(state_dict)` (see the sketch after this list)
  • Modify `save_pretrained_wrapper` so that the state dict is only retrieved if compression stats are not being skipped
  • Modify `save_pretrained_wrapper` to save dictionary and python files, even if there is no explicit compressor
  • Modify `save_checkpoint` (used by training) to decompress after the checkpoint is saved
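
For context, a minimal sketch of the difference between the two pathways. This is not the real `save_pretrained_wrapper`; the `ModelCompressor` method names (`compress`, `compress_model`) follow this PR's description and are assumptions about the compressed-tensors API.

```python
# Minimal sketch only, not the actual wrapper implementation.
# Assumes compressed-tensors exposes ModelCompressor.compress(model, state_dict=...)
# and the in-memory ModelCompressor.compress_model(model) described in this PR.
from compressed_tensors import ModelCompressor


def save_compressed_sketch(model, save_directory: str, compressor: ModelCompressor):
    # Old pathway (for contrast): materialize a second, compressed copy of the
    # weights in memory, roughly doubling peak memory for large models.
    # compressed_state_dict = compressor.compress(model, state_dict=model.state_dict())

    # New pathway: compress module parameters in place, then save as usual.
    compressor.compress_model(model)
    model.save_pretrained(save_directory)
```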

Example/Testing Changes

As far as I can tell, the table below lists all of the instances where a model is saved and the script does not immediately exit afterwards.

| File Path | Solution |
| --- | --- |
| examples/trl_mixin/ex_trl_constant.py<br>test_oneshot_and_finetune.py | Decompress in between stages (see the sketch below the table) |
| examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py<br>test_oneshot_and_finetune_with_tokenizer.py | Do not save in between stages to avoid a compressed state |
| test_oneshot_then_finetune.py | No work is required, as the model is decompressed upon loading from disk |
| test_compress_tensor_utils.py | Fix test to use `dispatch_model` (which is actually used by transformers) rather than `cpu_offload` |
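
A minimal sketch of the "decompress in between stages" fix referenced above; it assumes compressed-tensors provides a `decompress_model(model)` counterpart to `compress_model(model)`, and the real examples route this through `save_checkpoint` rather than calling the compressor directly.

```python
# Sketch only: each stage saves a compressed checkpoint, then decompresses the
# model in place so the next stage trains on uncompressed weights.
# decompress_model is assumed to be the in-memory counterpart of compress_model.
from compressed_tensors import ModelCompressor


def run_stages_sketch(model, stages, compressor: ModelCompressor):
    for index, stage in enumerate(stages):
        stage(model)  # e.g. oneshot, then a finetuning stage
        compressor.compress_model(model)  # compress weights in place
        model.save_pretrained(f"checkpoint_stage_{index}")
        compressor.decompress_model(model)  # restore weights before the next stage
```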

Testing

State Dict In Memory
[Memory timeline plots: previous vs. now]
oneshot_save.py:

```python
import torch
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from pttp import TensorProfiler

#MODEL_ID = "DeepSeek-V3_local_bf16"
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

with TensorProfiler() as prof:
    prof.mark_event("Load model")
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    prof.mark_event("Oneshot")
    oneshot(
        model=model,
        recipe=QuantizationModifier(targets="Linear", scheme="W4A16"),
        trust_remote_code_model=True,
    )

    prof.mark_event("Save model")
    model.save_pretrained("sav_testing", save_compressed=True, skip_compression_stats=True)

prof.save_memory_timeline("save_timeline.png")
```

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

github-actions bot commented May 8, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

kylesayrs added 2 commits May 14, 2025 11:33
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs added the ready When a PR is ready for review label May 14, 2025
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs removed the ready When a PR is ready for review label May 14, 2025
@kylesayrs kylesayrs added the ready When a PR is ready for review label May 19, 2025
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs marked this pull request as ready for review May 20, 2025 04:44
@kylesayrs kylesayrs changed the title [WIP] Use model compression pathways Use model compression pathways May 20, 2025
@brian-dellabetta (Collaborator) left a comment

exciting!

dsikka pushed a commit that referenced this pull request May 20, 2025
…1449)

## Purpose ##
* Prerequisite for #1419
* This PR disables getting the offloaded state dict unless it is necessary
(for sparsity statistics). However, the utility function `cpu_offload` only
works if the offloaded state dict is retrieved. Let's replace it with
`dispatch_model`, which is the function actually used by `PreTrainedModel`
(a minimal sketch follows after this commit message)

## Changes ##
* Rename `device_map` to `device`
* Use `dispatch_model` rather than `cpu_offload`
* Use `align_module_device` and `update_offload_parameter` utilities
* This is necessary because, after these changes, some of these
test models no longer have offloaded state dicts (which is the way it
should always have been)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
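
For illustration, a minimal sketch of the `dispatch_model` usage described above, using accelerate's public API; the single-device `device_map` is just an example and is not taken from the test code.

```python
# Sketch only: dispatch the model the way transformers does internally,
# instead of wrapping it with cpu_offload. The device_map is an arbitrary example.
import torch
from accelerate import dispatch_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)

# Old approach (for contrast):
# from accelerate import cpu_offload
# cpu_offload(model, execution_device=torch.device("cuda:0"))

# New approach: dispatch_model places submodules according to a device map,
# matching what transformers' from_pretrained(device_map=...) does under the hood.
# Assumes a CUDA device is available.
model = dispatch_model(model, device_map={"": 0})
```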