Use model compression pathways #1419

Open · wants to merge 8 commits into main

Conversation

@kylesayrs (Collaborator) commented May 8, 2025

Purpose

  • Use the in-memory model compression pathway to reduce memory requirements when saving models
  • Together with the postprocessing changes, these changes move users toward a pattern where they are aware of the model's status (frozen/compressed) and call `save_pretrained` manually

Prerequisites

Changes

  • Modify `save_pretrained_wrapper` to use `compress_model(model)` rather than `compress(state_dict)` (see the sketch after this list)
  • Modify `save_pretrained_wrapper` so that the state dict is only retrieved if compression stats are not being skipped
  • Modify `save_pretrained_wrapper` to save dictionary and python files, even if there is no explicit compressor
  • Modify `save_checkpoint` (used by training) to decompress after the checkpoint is saved
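
For context, a minimal sketch of the difference between the two pathways. This is not the real `save_pretrained_wrapper`; the `ModelCompressor` method names (`compress`, `compress_model`) follow this PR's description and are assumptions about the compressed-tensors API.

```python
# Minimal sketch only, not the actual wrapper implementation.
# Assumes compressed-tensors exposes ModelCompressor.compress(model, state_dict=...)
# and the in-memory ModelCompressor.compress_model(model) described in this PR.
from compressed_tensors import ModelCompressor


def save_compressed_sketch(model, save_directory: str, compressor: ModelCompressor):
    # Old pathway (for contrast): materialize a second, compressed copy of the
    # weights in memory, roughly doubling peak memory for large models.
    # compressed_state_dict = compressor.compress(model, state_dict=model.state_dict())

    # New pathway: compress module parameters in place, then save as usual.
    compressor.compress_model(model)
    model.save_pretrained(save_directory)
```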

Example/Testing Changes

As far as I can tell, the table below lists all of the instances where a model is saved and the script does not immediately exit afterwards.

| File Path | Solution |
| --- | --- |
| examples/trl_mixin/ex_trl_constant.py<br>test_oneshot_and_finetune.py | Decompress in between stages (see the sketch below the table) |
| examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py<br>test_oneshot_and_finetune_with_tokenizer.py | Do not save in between stages to avoid a compressed state |
| test_oneshot_then_finetune.py | No work is required, as the model is decompressed upon loading from disk |
| test_compress_tensor_utils.py | Fix test to use `dispatch_model` (which is actually used by transformers) rather than `cpu_offload` |
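
A minimal sketch of the "decompress in between stages" fix referenced above; it assumes compressed-tensors provides a `decompress_model(model)` counterpart to `compress_model(model)`, and the real examples route this through `save_checkpoint` rather than calling the compressor directly.

```python
# Sketch only: each stage saves a compressed checkpoint, then decompresses the
# model in place so the next stage trains on uncompressed weights.
# decompress_model is assumed to be the in-memory counterpart of compress_model.
from compressed_tensors import ModelCompressor


def run_stages_sketch(model, stages, compressor: ModelCompressor):
    for index, stage in enumerate(stages):
        stage(model)  # e.g. oneshot, then a finetuning stage
        compressor.compress_model(model)  # compress weights in place
        model.save_pretrained(f"checkpoint_stage_{index}")
        compressor.decompress_model(model)  # restore weights before the next stage
```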

Testing

State Dict In Memory
[Memory timeline plots: previous vs. now]
oneshot_save.py:

```python
import torch
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from pttp import TensorProfiler

#MODEL_ID = "DeepSeek-V3_local_bf16"
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

with TensorProfiler() as prof:
    prof.mark_event("Load model")
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    prof.mark_event("Oneshot")
    oneshot(
        model=model,
        recipe=QuantizationModifier(targets="Linear", scheme="W4A16"),
        trust_remote_code_model=True,
    )

    prof.mark_event("Save model")
    model.save_pretrained("sav_testing", save_compressed=True, skip_compression_stats=True)

prof.save_memory_timeline("save_timeline.png")
```

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

github-actions bot commented May 8, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

kylesayrs added 2 commits May 14, 2025 11:33
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs added the ready When a PR is ready for review label May 14, 2025
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs removed the ready When a PR is ready for review label May 14, 2025
@kylesayrs kylesayrs added the ready When a PR is ready for review label May 19, 2025
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs marked this pull request as ready for review May 20, 2025 04:44
@kylesayrs kylesayrs changed the title [WIP] Use model compression pathways Use model compression pathways May 20, 2025
@brian-dellabetta (Collaborator) left a comment

exciting!

dsikka pushed a commit that referenced this pull request May 20, 2025
…1449)

## Purpose ##
* Prerequisite for #1419
* This PR disables getting the offloaded state dict unless it is necessary
(for sparsity statistics). However, the utility function `cpu_offload` only
works if the offloaded state dict is retrieved. Let's replace it with
`dispatch_model`, which is the function actually used by `PreTrainedModel`
(a minimal sketch follows after this commit message)

## Changes ##
* Rename `device_map` to `device`
* Use `dispatch_model` rather than `cpu_offload`
* Use `align_module_device` and `update_offload_parameter` utilities
* This is necessary because, after these changes, some of these
test models no longer have offloaded state dicts (which is the way it
should always have been)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
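
For illustration, a minimal sketch of the `dispatch_model` usage described above, using accelerate's public API; the single-device `device_map` is just an example and is not taken from the test code.

```python
# Sketch only: dispatch the model the way transformers does internally,
# instead of wrapping it with cpu_offload. The device_map is an arbitrary example.
import torch
from accelerate import dispatch_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)

# Old approach (for contrast):
# from accelerate import cpu_offload
# cpu_offload(model, execution_device=torch.device("cuda:0"))

# New approach: dispatch_model places submodules according to a device map,
# matching what transformers' from_pretrained(device_map=...) does under the hood.
# Assumes a CUDA device is available.
model = dispatch_model(model, device_map={"": 0})
```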