This repository is the MindSpore implementation of AnimateDiff.
Install the required packages:

```shell
pip install -r requirements.txt
```

If the `decord` package is not available on your platform, try `pip install eva-decord` instead.
For EulerOS, instructions on ffmpeg and decord installation are as follows.

1. Install ffmpeg 4, referring to https://ffmpeg.org/releases:

```shell
wget https://ffmpeg.org/releases/ffmpeg-4.0.1.tar.bz2 --no-check-certificate
tar -xvf ffmpeg-4.0.1.tar.bz2
mv ffmpeg-4.0.1 ffmpeg
cd ffmpeg
./configure --enable-shared  # --enable-shared is needed for sharing libavcodec with decord
make -j 64
make install
```
2. Install decord, referring to https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source:

```shell
git clone --recursive https://github.com/dmlc/decord
cd decord
rm -rf build && mkdir build && cd build   # -rf: `build` may be a non-empty directory or absent
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make -j 64
make install
cd ../python
python3 setup.py install --user
```
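To sanity-check the installation, a minimal Python snippet can verify that the `decord` module is importable without actually loading a video (note that both `decord` and `eva-decord` install a module named `decord`):

```python
import importlib.util

def decord_available() -> bool:
    """Return True if a `decord` module (from decord or eva-decord) can be imported."""
    return importlib.util.find_spec("decord") is not None

print("decord importable:", decord_available())
```

If this prints `False`, revisit the build steps above or fall back to `eva-decord`.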
First, download the torch pretrained weights, referring to the torch AnimateDiff checkpoints.

To download the ToonYou-Beta3 dreambooth model, please refer to the civitai website, or use the following command:

```shell
wget https://civitai.com/api/download/models/78755 -P models/torch_ckpts/ --content-disposition --no-check-certificate
```

After downloading this dreambooth checkpoint to `animatediff/models/torch_ckpts/`, convert it using:

```shell
cd ../examples/stable_diffusion_v2
python tools/model_conversion/convert_weights.py --source ../animatediff/models/torch_ckpts/toonyou_beta3.safetensors --target models/toonyou_beta3.ckpt --model sdv1 --source_version pt
```

In addition, please download the RealisticVision V5.1 dreambooth checkpoint and convert it in the same way.
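Checkpoint conversion between frameworks largely amounts to renaming parameter keys and re-serializing the tensors. A toy sketch of the renaming step follows; the rules below are purely illustrative, not the actual mapping table used by `convert_weights.py`:

```python
def rename_key(key: str, rules) -> str:
    """Apply ordered substring-substitution rules to one parameter name."""
    for old, new in rules:
        key = key.replace(old, new)
    return key

# Hypothetical pt -> MindSpore rules, for illustration only.
rules = [
    ("model.diffusion_model.", "unet."),
    ("to_q.", "query."),
]

src = {"model.diffusion_model.attn.to_q.weight": [0.1, 0.2]}
dst = {rename_key(k, rules): v for k, v in src.items()}
print(dst)  # {'unet.attn.query.weight': [0.1, 0.2]}
```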
Convert the motion module checkpoint:

```shell
cd ../examples/animatediff/tools
python motion_module_convert.py --src ../torch_ckpts/mm_sd_v15_v2.ckpt --tar ../models/motion_module
```

If converting the AnimateDiff v3 motion module checkpoint, run:

```shell
cd ../examples/animatediff/tools
python motion_module_convert.py -v v3 --src ../torch_ckpts/v3_sd15_mm.ckpt --tar ../models/motion_module
```

Convert the motion lora checkpoint:

```shell
cd ../examples/animatediff/tools
python motion_lora_convert.py --src ../torch_ckpts/.ckpt --tar ../models/motion_lora
```

Convert the domain adapter lora checkpoint:

```shell
cd ../examples/animatediff/tools
python domain_adapter_lora_convert.py --src ../torch_ckpts/v3_sd15_adapter.ckpt --tar ../models/domain_adapter_lora
```

Convert the SparseCtrl encoder checkpoint:

```shell
cd ../examples/animatediff/tools
python sparsectrl_encoder_convert.py --src ../torch_ckpts/v3_sd15_sparsectrl_{}.ckpt --tar ../models/sparsectrl_encoder
```
The full tree of expected checkpoints is shown below:

```
models
├── domain_adapter_lora
│   └── v3_sd15_adapter.ckpt
├── dreambooth_lora
│   ├── realisticVisionV51_v51VAE.ckpt
│   └── toonyou_beta3.ckpt
├── motion_lora
│   └── v2_lora_ZoomIn.ckpt
├── motion_module
│   ├── mm_sd_v15.ckpt
│   ├── mm_sd_v15_v2.ckpt
│   └── v3_sd15_mm.ckpt
├── sparsectrl_encoder
│   ├── v3_sd15_sparsectrl_rgb.ckpt
│   └── v3_sd15_sparsectrl_scribble.ckpt
└── stable_diffusion
    └── sd_v1.5-d0ab7146.ckpt
```
```shell
# download demo images
bash scripts/download_demo_images.sh

# under general T2V setting
python text_to_video.py --config configs/prompts/v3/v3-1-T2V.yaml

# image animation (on RealisticVision)
python text_to_video.py --config configs/prompts/v3/v3-2-animation-RealisticVision.yaml

# sketch-to-animation and storyboarding (on RealisticVision)
python text_to_video.py --config configs/prompts/v3/v3-3-sketch-RealisticVision.yaml
```
Results:

*(Sample grids omitted: RealisticVision input images with their animations, and input scribbles with their outputs.)*
To run on GPU, please append `--device_target GPU` to the end of the commands above. If you use a checkpoint converted from torch for inference, please also append `--vae_fp16=False`.
```shell
python text_to_video.py --config configs/prompts/v2/1-ToonYou.yaml --L 16 --H 512 --W 512
```

By default, DDIM sampling is used, and the sampling speed is 1.07 s/iter.

Results: *(sample animations omitted)*
```shell
python text_to_video.py --config configs/prompts/v2/1-ToonYou.yaml --L 16 --H 256 --W 256 --device_target GPU
```

If you use a checkpoint converted from torch for inference, please also append `--vae_fp16=False` to the command above.
```shell
python text_to_video.py --config configs/prompts/v2/1-ToonYou-MotionLoRA.yaml --L 16 --H 512 --W 512
```

By default, DDIM sampling is used, and the sampling speed is 1.07 s/iter.

Results using the Zoom-In motion lora: *(sample animations omitted)*

For GPU:

```shell
python text_to_video.py --config configs/prompts/v2/1-ToonYou-MotionLoRA.yaml --L 16 --H 256 --W 256 --device_target GPU
```
```shell
python train.py --config configs/training/image_finetune.yaml
```

For 910B, please set

```shell
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
```

before running training.
```shell
python train.py --config configs/training/mmv2_train.yaml
```

For 910B, please set

```shell
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
```

before running training.
You may change arguments such as the data path, output directory, and learning rate in the yaml config file. You can also override them via command-line arguments; see `args_train.py` or run `python train.py --help`.
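The override behavior can be sketched as follows: defaults come from the config file, and any flag passed on the command line wins. This is a minimal stand-in for the logic in `args_train.py`, using a plain dict in place of the yaml file and illustrative key names:

```python
import argparse

def merge_config(config: dict, cli_args: list) -> dict:
    """Overlay command-line flags on top of config-file defaults."""
    parser = argparse.ArgumentParser()
    for key, default in config.items():
        parser.add_argument(f"--{key}", type=type(default), default=default)
    return vars(parser.parse_args(cli_args))

# Defaults as they might appear in a training yaml (illustrative values).
config = {"output_path": "outputs", "start_learning_rate": 1e-4, "train_batch_size": 1}
merged = merge_config(config, ["--train_batch_size", "4"])
print(merged["train_batch_size"])  # 4  (CLI flag overrides the yaml default of 1)
```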
To infer with the trained model, run:

```shell
python text_to_video.py --config configs/prompts/v2/base_video.yaml \
    --motion_module_path {path to saved checkpoint} \
    --prompt {text prompt}
```

You can also create a new config yaml based on `configs/prompt/v2/base_video.yaml` to specify the prompts to test and the motion module path.
Here are some generation results after MM training on 512x512 resolution and 16-frame data.
*(Sample videos omitted; prompts: "Disco light leaks disco ball light reflections shaped rectangular and line with motion blur effect", "Cloudy moscow kremlin time lapse", "Sharp knife to cut delicious smoked fish", "A baker turns freshly baked loaves of sourdough bread".)*
Min-SNR weighting can be used to improve diffusion training convergence. You can enable it by appending `--snr_gamma=5.0` to the training command.
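Min-SNR-γ weighting rescales the per-timestep loss by min(SNR(t), γ)/SNR(t), so that easy high-SNR timesteps contribute less while low-SNR timesteps keep full weight. A standalone sketch of the weight computation (the actual training code derives SNR from the noise scheduler's alphas):

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Loss weight under Min-SNR-gamma: min(SNR, gamma) / SNR."""
    return min(snr, gamma) / snr

# High-SNR (easy) timesteps are down-weighted; low-SNR ones keep weight 1.
print(min_snr_weight(20.0))  # 0.25
print(min_snr_weight(2.0))   # 1.0
```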
```shell
python train.py --config configs/training/mmv2_lora.yaml
```

For 910B, please set

```shell
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
```

before running training.
To infer with the trained model, run:

```shell
python text_to_video.py --config configs/prompts/v2/base_video.yaml \
    --motion_lora_path {path to saved checkpoint} \
    --prompt {text prompt}
```
Here are some generation results after lora fine-tuning on 512x512 resolution and 16-frame data.
*(Sample videos omitted; prompts: "Disco light leaks disco ball light reflections shaped rectangular and line with motion blur effect", "Cloudy moscow kremlin time lapse", "Sharp knife to cut delicious smoked fish", "A baker turns freshly baked loaves of sourdough bread".)*
To train on GPU, please add `--device_target GPU` to the training commands above and adjust `image_size`/`num_frames`/`train_batch_size` to fit your device memory. Below is an example for a 3090:

```shell
# reduce num frames and batch size to avoid OOM on a 3090
python train.py --config configs/training/mmv2_train.yaml --data_path ../videocomposer/datasets/webvid5 --image_size 256 --num_frames=4 --device_target GPU --train_batch_size=1
```
| Model | Context | Scheduler | Steps | Resolution | Frames | Speed (step/s) | Time (s/video) |
|---|---|---|---|---|---|---|---|
| AnimateDiff v2 | D910*x1-MS2.2.10 | DDIM | 30 | 512x512 | 16 | 1.2 | 25 |

> Context: {Ascend chip}-{number of NPUs}-{MindSpore version}.
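The last column follows directly from the step rate: time per video ≈ sampling steps / (steps per second). A quick sanity check against the row above:

```python
steps = 30                 # DDIM sampling steps
speed_steps_per_s = 1.2    # measured throughput

time_per_video_s = steps / speed_steps_per_s
print(round(time_per_video_s, 2))  # 25.0
```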
| Model | Context | Task | Local BS x Grad. Accu. | Resolution | Frames | Step Time (s/step) |
|---|---|---|---|---|---|---|
| AnimateDiff v2 | D910*x1-MS2.2.10 | MM training | 1x1 | 512x512 | 16 | 1.29 |
| AnimateDiff v2 | D910*x1-MS2.2.10 | Motion Lora | 1x1 | 512x512 | 16 | 1.26 |
| AnimateDiff v2 | D910*x1-MS2.2.10 | MM training w/ Embed. cached | 1x1 | 512x512 | 16 | 0.75 |
| AnimateDiff v2 | D910*x1-MS2.2.10 | Motion Lora w/ Embed. cached | 1x1 | 512x512 | 16 | 0.71 |

> Context: {Ascend chip}-{number of NPUs}-{MindSpore version}.
> MM training: Motion Module training.
> Embed. cached: the video embeddings (VAE-encoder outputs) and text embeddings are pre-computed and stored before diffusion training.
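Embedding caching helps because the VAE and text encoders are frozen during motion-module training, so their outputs for each sample never change across epochs. The pattern can be sketched as follows, with `encode` standing in for the frozen (and, in practice, expensive) encoders:

```python
calls = {"n": 0}

def encode(sample):
    """Stand-in for the frozen VAE/text encoders (expensive in practice)."""
    calls["n"] += 1
    return [x * 0.5 for x in sample]  # pretend latent

dataset = [[1.0, 2.0], [3.0, 4.0]]

# Pass 1: pre-compute and store embeddings once, before diffusion training.
cache = [encode(s) for s in dataset]

# Training: every epoch reads the cache instead of re-running the encoders.
for _epoch in range(3):
    for latent in cache:
        pass  # the diffusion training step would consume `latent` here

print(calls["n"])  # 2  (one encode per sample, regardless of epoch count)
```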