[core] ConsisID #10140
Merged
Changes from 69 of 72 commits
0036376 Update __init__.py (SHYuanBest)
940ec92 Merge branch 'huggingface:main' into main (SHYuanBest)
c78cf01 add consisid (SHYuanBest)
61c85f7 update consisid (SHYuanBest)
12855b2 update consisid (SHYuanBest)
787a69c make style (SHYuanBest)
33d4291 make_style (SHYuanBest)
455d68d Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
8f310c5 Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
0f447a4 Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
d348901 Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
a35f92a Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
33f3acb Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
6503a17 add doc (SHYuanBest)
a24a4ee Merge branch 'main' into main (SHYuanBest)
19d1fa3 Merge branch 'huggingface:main' into main (SHYuanBest)
c13fb17 make style (SHYuanBest)
61ad37b Rename consisid .md to consisid.md (SHYuanBest)
3a274ca Update geodiff_molecule_conformation.ipynb (hlky)
02c16ba Update geodiff_molecule_conformation.ipynb (hlky)
e76338e Update geodiff_molecule_conformation.ipynb (hlky)
a597713 Update demo.ipynb (hlky)
0a633e4 Merge branch 'main' into main (hlky)
51003e8 Update pipeline_consisid.py (hlky)
a0e746e make fix-copies (hlky)
14ad9af Update docs/source/en/using-diffusers/consisid.md (SHYuanBest)
e5c84c7 Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
0bb54c9 Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
c389400 Update docs/source/en/using-diffusers/consisid.md (SHYuanBest)
4fb4529 Update docs/source/en/using-diffusers/consisid.md (SHYuanBest)
9b2bd31 update doc & pipeline code (SHYuanBest)
211331b fix typo (SHYuanBest)
590b1bd make style (SHYuanBest)
8e5b070 update example (SHYuanBest)
f234376 Merge branch 'huggingface:main' into main (SHYuanBest)
a0acc02 Update docs/source/en/using-diffusers/consisid.md (SHYuanBest)
d23d933 Merge branch 'huggingface:main' into main (SHYuanBest)
2a722f2 update example (SHYuanBest)
1c5a1f2 update example (SHYuanBest)
7ceffc9 Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
95decbd Update src/diffusers/pipelines/consisid/pipeline_consisid.py (SHYuanBest)
665d1b4 Merge branch 'huggingface:main' into main (SHYuanBest)
5139afc update (SHYuanBest)
1e10927 Merge branch 'huggingface:main' into main (SHYuanBest)
58f6570 add test and update (SHYuanBest)
32649b2 Merge branch 'huggingface:main' into main (SHYuanBest)
141038b remove some changes from docs (a-r-r-o-w)
d0fe503 refactor (a-r-r-o-w)
60856c7 fix (a-r-r-o-w)
313c2e3 undo changes to examples (a-r-r-o-w)
935319a remove save/load and fuse methods (a-r-r-o-w)
0f5d677 update (a-r-r-o-w)
aa7b0eb link hf-doc-img & make test extremely small (SHYuanBest)
aa98858 update (SHYuanBest)
03ebc66 Merge branch 'huggingface:main' into main (SHYuanBest)
c8ba3c0 Merge branch 'huggingface:main' into main (SHYuanBest)
2e15509 Merge branch 'huggingface:main' into main (SHYuanBest)
b174d9f add lora (SHYuanBest)
fbb09aa fix test (SHYuanBest)
3b05257 Merge branch 'huggingface:main' into main (SHYuanBest)
5813825 update (SHYuanBest)
7734a29 update (SHYuanBest)
5fd9a81 change expected_diff_max to 0.4 (SHYuanBest)
0937753 Merge branch 'huggingface:main' into main (SHYuanBest)
cdc04bf fix typo (SHYuanBest)
0af2f83 fix link (SHYuanBest)
e17aa82 fix typo (SHYuanBest)
3b17e2e Merge branch 'main' into main (SHYuanBest)
71982b2 Merge branch 'huggingface:main' into main (SHYuanBest)
31c94a0 update docs (SHYuanBest)
cca81bf update (a-r-r-o-w)
5348111 remove consisid lora tests (a-r-r-o-w)
@@ -0,0 +1,30 @@ | ||
<!--Copyright 2024 The HuggingFace Team. All rights reserved. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations under the License. --> | ||
|
||
# ConsisIDTransformer3DModel | ||
|
||
A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/pdf/2411.17440) by researchers from Peking University, the University of Rochester, and other institutions.
|
||
The model can be loaded with the following code snippet. | ||
|
||
```python
import torch
from diffusers import ConsisIDTransformer3DModel

# torch is required for the torch_dtype argument below
transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```
|
||
## ConsisIDTransformer3DModel | ||
|
||
[[autodoc]] ConsisIDTransformer3DModel | ||
|
||
## Transformer2DModelOutput | ||
|
||
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput |
@@ -0,0 +1,60 @@ | ||
<!--Copyright 2024 The HuggingFace Team. All rights reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
--> | ||
|
||
# ConsisID | ||
|
||
[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan, from Peking University, the University of Rochester, and other institutions.
|
||
The abstract from the paper is: | ||
|
||
*Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. 
Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsisID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.*
|
||
<Tip> | ||
|
||
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. | ||
|
||
</Tip> | ||
|
||
This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh). | ||
|
||
There are two official ConsisID checkpoints for identity-preserving text-to-video. | ||
|
||
| Checkpoint | Recommended inference dtype |
|:---:|:---:|
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-1.5) | torch.bfloat16 |
|
||
### Memory optimization | ||
|
||
ConsisID requires about 44 GB of GPU memory to decode 49 frames (about 6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H), so without optimization it cannot run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. To reproduce these numbers, refer to [this script](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258).
|
||
| Feature (each row adds to the previous ones) | Max Memory Allocated | Max Memory Reserved |
| :------------------------------------------- | :------------------- | :------------------ |
| - | 37 GB | 44 GB |
| enable_model_cpu_offload | 22 GB | 25 GB |
| enable_sequential_cpu_offload | 16 GB | 22 GB |
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |
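
The features in the table above correspond to methods on a loaded pipeline and its VAE. A minimal sketch of how they might be switched on together follows. `apply_memory_optimizations` is a hypothetical helper, not part of diffusers, and it assumes the two offload modes are chosen one at a time rather than enabled together.

```python
def apply_memory_optimizations(pipe, offload="model", vae_slicing=True, vae_tiling=True):
    """Enable memory savers on a loaded pipeline and return the names of those applied.

    Hypothetical helper: `pipe` is expected to expose the offload methods and a
    `vae` attribute with `enable_slicing` / `enable_tiling`, as diffusers pipelines do.
    """
    applied = []
    if offload == "model":
        # Moves whole sub-models to the GPU only while they are in use.
        pipe.enable_model_cpu_offload()
        applied.append("enable_model_cpu_offload")
    elif offload == "sequential":
        # Offloads at a finer granularity: lower memory, slower inference.
        pipe.enable_sequential_cpu_offload()
        applied.append("enable_sequential_cpu_offload")
    elif offload is not None:
        raise ValueError(f"unknown offload mode: {offload!r}")

    if vae_slicing:
        # Decode the batch one sample at a time.
        pipe.vae.enable_slicing()
        applied.append("vae.enable_slicing")
    if vae_tiling:
        # Decode each frame in spatial tiles.
        pipe.vae.enable_tiling()
        applied.append("vae.enable_tiling")
    return applied


# Usage (commented out because it needs the model weights and a GPU):
# pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
# apply_memory_optimizations(pipe, offload="sequential")
```

When CPU offloading is enabled, the pipeline should not also be moved with `pipe.to("cuda")`, because the offload hooks handle device placement.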
|
||
## ConsisIDPipeline | ||
|
||
[[autodoc]] ConsisIDPipeline | ||
|
||
- all | ||
- __call__ | ||
|
||
## ConsisIDPipelineOutput | ||
|
||
[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput |
@@ -0,0 +1,95 @@ | ||
<!--Copyright 2024 The HuggingFace Team. All rights reserved. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations under the License. | ||
--> | ||
# ConsisID | ||
|
||
[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video by frequency decomposition. The main features of ConsisID are: | ||
|
||
- Frequency decomposition: The characteristics of the DiT architecture are analyzed from the frequency domain perspective, and based on these characteristics, a reasonable control information injection method is designed. | ||
- Consistency training strategy: A coarse-to-fine training strategy, dynamic masking loss, and dynamic cross-face loss further enhance the model's generalization ability and identity preservation performance. | ||
- Inference without finetuning: Previous methods required case-by-case finetuning of the input ID before inference, leading to significant time and computational costs. In contrast, ConsisID is tuning-free. | ||
|
||
This guide walks you through using ConsisID for identity-preserving text-to-video generation.
|
||
## Load Model Checkpoints | ||
Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method. | ||
|
||
|
||
```python | ||
# !pip install consisid_eva_clip insightface facexlib | ||
import torch | ||
from diffusers import ConsisIDPipeline | ||
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer | ||
from huggingface_hub import snapshot_download | ||
|
||
# Download ckpts | ||
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview") | ||
|
||
# Load face helper model to preprocess input face image | ||
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16) | ||
|
||
# Load consisid base model | ||
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16) | ||
pipe.to("cuda") | ||
``` | ||
|
||
## Identity-Preserving Text-to-Video | ||
For identity-preserving text-to-video, pass a text prompt and an image containing a clear face (preferably a half-body or full-body shot). By default, ConsisID generates a 720x480 video for the best results.
|
||
```python | ||
from diffusers.utils import export_to_video | ||
|
||
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel." | ||
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_input.png?download=true" | ||
|
||
id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True) | ||
|
||
video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42)) | ||
export_to_video(video.frames[0], "output.mp4", fps=8) | ||
``` | ||
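
In the example above, the six values returned by `prepare_face_models` are passed to `process_face_embeddings_infer` in a different positional order, which is easy to get wrong. One possible guard, sketched below, is to bundle them in a named tuple and build the argument list in one place. `FaceModels` and `face_embedding_args` are hypothetical names, not part of diffusers, and both orderings are taken from the snippets in this guide.

```python
from typing import Any, NamedTuple


class FaceModels(NamedTuple):
    """Hypothetical container; field order follows how prepare_face_models is unpacked in this guide."""

    face_helper_1: Any
    face_helper_2: Any
    face_clip_model: Any
    face_main_model: Any
    eva_transform_mean: Any
    eva_transform_std: Any


def face_embedding_args(models: FaceModels, device: Any, dtype: Any, image: Any) -> tuple:
    """Return positional arguments in the order process_face_embeddings_infer is called above."""
    return (
        models.face_helper_1,
        models.face_clip_model,
        models.face_helper_2,
        models.eva_transform_mean,
        models.eva_transform_std,
        models.face_main_model,
        device,
        dtype,
        image,
    )


# Usage with the objects from this guide (commented out because it needs the model weights):
# models = FaceModels(*prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16))
# id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
#     *face_embedding_args(models, "cuda", torch.bfloat16, image), is_align_face=True
# )
```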
<table> | ||
|
||
<tr> | ||
<th style="text-align: center;">Face Image</th> | ||
<th style="text-align: center;">Video</th> | ||
<th style="text-align: center;">Description</th>
</tr> | ||
<tr> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_image_0.png?download=true" style="height: auto; width: 600px;"></td> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_output_0.gif?download=true" style="height: auto; width: 2000px;"></td> | ||
<td>The video, in a beautifully crafted animated style, features a confident woman riding a horse through a lush forest clearing. Her expression is focused yet serene as she adjusts her wide-brimmed hat with a practiced hand. She wears a flowy bohemian dress ......</td> | ||
</tr> | ||
<tr> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_image_1.png?download=true" style="height: auto; width: 600px;"></td> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_output_1.gif?download=true" style="height: auto; width: 2000px;"></td> | ||
<td>The video, in a captivating animated style, shows a woman standing in the center of a snowy forest, her eyes narrowed in concentration as she extends her hand forward. She is dressed in a deep blue cloak, her breath visible in the cold air ......</td> | ||
</tr> | ||
<tr> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_image_2.png?download=true" style="height: auto; width: 600px;"></td> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_output_2.gif?download=true" style="height: auto; width: 2000px;"></td> | ||
<td>The animation features a whimsical portrait of a balloon seller standing in a gentle breeze, captured with soft, hazy brushstrokes that evoke the feel of a serene spring day. His face is framed by a gentle smile, his eyes ......</td> | ||
</tr> | ||
<tr> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_image_3.png?download=true" style="height: auto; width: 600px;"></td> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_output_3.gif?download=true" style="height: auto; width: 2000px;"></td> | ||
<td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured ......</td> | ||
</tr> | ||
<tr> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_image_4.png?download=true" style="height: auto; width: 600px;"></td> | ||
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F406/diffusers/consisid/consisid_output_4.gif?download=true" style="height: auto; width: 2000px;"></td> | ||
<td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge ......</td> | ||
</tr> | ||
</table> | ||
|
||
## Resources | ||
|
||
Learn more about ConsisID with the following resources. | ||
- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features. | ||
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440) for more details. |