MindSpore implementation & optimization of VideoComposer: Compositional Video Synthesis with Motion Controllability.
Demo results (condition and text prompt for each example):
- Condition: image depth. Text input: "A black swan swam in the water"
- Condition: local image. Text input: "A black swan swam in the water"
- Condition: mask. Text input: "A black swan swam in the water"
- Condition: motion. Text input: "A black swan swam in the water"
- Condition: sketch. Text input: "A black swan swam in the water"
- Text input: "Beneath Van Gogh's Starry Sky"
- Text input: "A beautiful big silver moon on the water"
- Text input: "A sunflower in a field of flowers"
- Condition: style image. Text input: "Red-backed Shrike lanius collurio"
- Text input: "A little bird is standing on a branch"
- Text input: "Ironman is fighting against the enemy, big fire in the background, photorealistic"
- Condition: style image. Text input: "Van Gogh played tennis under the stars"
Inference scripts and configuration: refer to `scripts/run_infer.sh`.
VideoComposer Architecture
NOTES: The training code of VC is well tested on NPU 910* + MindSpore 2.2 (20230907) + CANN 7.0T2 + Ascend driver 23.0.rc3.b060. Other MindSpore and CANN versions may suffer from precision issues. You can check the versions in your environment with:
```shell
ll /usr/local/Ascend/latest
cat /usr/local/Ascend/driver/version.info
pip show mindspore
```
For CANN 7.0T2, please disable `AdamApplyOneFusionPass` to avoid overflow in training. This can be done by modifying `/usr/local/Ascend/latest/ops/built-in/fusion_pass/config/fusion_config.json` as follows:
```text
{
    "Switch":{
        "GraphFusion":{
            "AdamApplyOneFusionPass":"off",  # ==> add this line in the file
            "GroupConv2DFusionPass": "off",
            ...
        },
        "UBFusion":{
            ...
        }
    }
}
```
Install the Python dependencies:
```shell
pip install -r requirements.txt
```
Install `ffmpeg` via conda:
```shell
conda install ffmpeg
```
In case you fail to install `motion-vector-extractor` via pip, please manually install it referring to the [official](https://github.com/LukasBommes/mv-extractor) repo.
Notes for 910: the code is also runnable on 910 for training and inference, but the number of frames for training (`max_frames`) should be reduced from 16 to 8 or fewer due to memory limitations.
Download the checkpoints listed in `model_weights/README.md` from https://download.mindspore.cn/toolkits/mindone/videocomposer/model_weights/ and https://download.mindspore.cn/toolkits/mindone/stable_diffusion/depth_estimator/midas_v3_dpt_large-c8fd1049.ckpt. The download root path must be `${PROJECT_ROOT}/model_weights`, where `${PROJECT_ROOT}` denotes the root path of the project.
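For instance, the depth estimator checkpoint (the one direct file link above) can be fetched as follows, assuming it goes directly under `model_weights/`; check `model_weights/README.md` for the exact layout and the remaining file names:
```shell
# Sketch only: fetch one checkpoint into the expected location.
mkdir -p ${PROJECT_ROOT}/model_weights
cd ${PROJECT_ROOT}/model_weights
wget https://download.mindspore.cn/toolkits/mindone/stable_diffusion/depth_estimator/midas_v3_dpt_large-c8fd1049.ckpt
```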
The training videos and their captions (.txt) should be placed in the following folder structure:
```text
├── {DATA_DIR}
│   ├── video_name1.mp4
│   ├── video_name1.txt
│   ├── video_name2.mp4
│   ├── video_name2.txt
│   ├── ...
```
Run `examples/videocomposer/tools/data_converter.py` to generate `video_caption.csv` in `{DATA_DIR}`:
```shell
python data_converter.py {DATA_DIR}
```
Format of `video_caption.csv`:
```text
video,caption
video_name1.mp4,"an airliner is taxiing on the tarmac at Dubai Airport"
video_name2.mp4,"a pigeon sitting on the street near the house"
...
```
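For reference, a minimal sketch of what such a conversion amounts to, pairing each .mp4 with its same-named .txt caption; the actual `tools/data_converter.py` may differ in details:
```python
# Hypothetical sketch: build video_caption.csv from {DATA_DIR}/*.mp4 and matching *.txt captions.
import csv
import glob
import os
import sys

data_dir = sys.argv[1]
rows = []
for video in sorted(glob.glob(os.path.join(data_dir, "*.mp4"))):
    caption_file = os.path.splitext(video)[0] + ".txt"
    with open(caption_file, "r", encoding="utf-8") as f:
        caption = f.read().strip()
    rows.append((os.path.basename(video), caption))

with open(os.path.join(data_dir, "video_caption.csv"), "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["video", "caption"])
    writer.writerows(rows)
```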
To run all video generation tasks on 910 or 910*, please run:
```shell
bash scripts/run_infer.sh
```
On 910, to run a single task, you can pick the corresponding snippet of code in `scripts/run_infer.sh`, such as:
```shell
# export MS_ENABLE_GE=1 # for 910*
# export MS_ENABLE_REF_MODE=1 # for 910* and MindSpore > 2.1
python infer.py \
    --cfg configs/exp02_motion_transfer_vs_style.yaml \
    --seed 9999 \
    --input_video "demo_video/motion_transfer.mp4" \
    --image_path "demo_video/moon_on_water.jpg" \
    --style_image "demo_video/moon_on_water.jpg" \
    --input_text_desc "A beautiful big silver moon on the water"
```
On 910*, you need to enable GE mode first by running `export MS_ENABLE_GE=1`. On 910* with MindSpore > 2.1, you also need to enable REF mode first by running `export MS_ENABLE_REF_MODE=1`. The first inference step takes additional time for graph compilation (around 5~8 minutes).
You can adjust the arguments in `vc/config/base.py` (lower priority) or `configs/exp{task_name}.yaml` (higher priority; it overwrites `base.py` on overlapping keys). Below are the key arguments influencing inference speed and memory usage.
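The priority rule above is effectively a dictionary update in which YAML values overwrite the defaults; a minimal illustration (the actual config loading in `vc/config` may differ):
```python
# Illustration only: the YAML task config (higher priority) overrides base defaults on overlapping keys.
import yaml

base_cfg = {"max_frames": 16, "seed": 9999}  # hypothetical defaults standing in for vc/config/base.py
with open("configs/exp02_motion_transfer_vs_style.yaml", "r") as f:
    task_cfg = yaml.safe_load(f)

cfg = {**base_cfg, **task_cfg}  # task_cfg wins wherever keys overlap
```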
You need a MindSpore Lite environment for offline inference. To install MindSpore Lite, please refer to the Lite install guide. Download the tar.gz and whl packages matching your environment, unzip the tar.gz package, and install the corresponding version of the whl package:
```shell
tar -zxvf mindspore-lite-2.1.0-*.tar.gz
pip install mindspore_lite-2.1.0-*.whl
```
Configure Lite's environment variables: `LITE_HOME` is the folder extracted from the tar.gz, and an absolute path is recommended.
```shell
export LITE_HOME=/path/to/mindspore-lite-{version}-{os}-{platform}
export LD_LIBRARY_PATH=$LITE_HOME/runtime/lib:$LITE_HOME/tools/converter/lib:$LD_LIBRARY_PATH
export PATH=$LITE_HOME/tools/converter/converter:$LITE_HOME/tools/benchmark:$PATH
```
For different tasks, you can use the corresponding snippet of code in `scripts/run_infer.sh` and change `infer.py` to `export.py` to save the MindIR model. Please remember to run `export MS_ENABLE_GE=1` first on 910*, and `export MS_ENABLE_REF_MODE=1` on 910* with MindSpore > 2.1, before running the code snippet:
```shell
# export MS_ENABLE_GE=1 # for 910*
# export MS_ENABLE_REF_MODE=1 # for 910* and MindSpore > 2.1
python export.py \
    --cfg configs/exp02_motion_transfer_vs_style.yaml \
    --input_video "demo_video/motion_transfer.mp4" \
    --image_path "demo_video/moon_on_water.jpg" \
    --style_image "demo_video/moon_on_water.jpg" \
    --input_text_desc "A beautiful big silver moon on the water"
```
The exported MindIR models will be saved in the `models/mindir` directory. Once the export is finished, you need to convert the MindIR models to MindSpore Lite MindIR models. We provide a script, `convert_lite.py`, to convert all MindIR models in the `models/mindir` directory. Please note that on 910*, you need to unset the `MS_ENABLE_GE` and `MS_ENABLE_REF_MODE` environment variables before running the conversion:
```shell
unset MS_ENABLE_GE # Remember to unset MS_ENABLE_GE on 910*
unset MS_ENABLE_REF_MODE # Remember to unset MS_ENABLE_REF_MODE on 910* and MindSpore > 2.1
python convert_lite.py
```
Then you can run the offline inference using `lite_infer.py` for the given task, e.g.:
```shell
python lite_infer.py \
    --cfg configs/exp02_motion_transfer_vs_style.yaml \
    --seed 9999 \
    --input_video "demo_video/motion_transfer.mp4" \
    --image_path "demo_video/moon_on_water.jpg" \
    --style_image "demo_video/moon_on_water.jpg" \
    --input_text_desc "A beautiful big silver moon on the water"
```
The compilation time is much shorter than in the online inference mode.
To run training on a specific task, please refer to `scripts/run_train.sh`. After changing `task_name` and `yaml_file` in the script for your task, run:
```shell
bash scripts/run_train.sh $DEVICE_ID
```
e.g. `bash scripts/run_train.sh 0` to launch the training task using NPU card 0.
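For reference, the two script variables mentioned above might be set like this (illustrative values; check `scripts/run_train.sh` for the exact variable names and defaults):
```shell
# Inside scripts/run_train.sh (illustrative)
task_name=train_exp02_motion_transfer
yaml_file=configs/train_exp02_motion_transfer.yaml
```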
Under `configs/`, we provide several tasks' yaml files:
```text
configs/
├── train_exp02_motion_transfer_vs_style.yaml
├── train_exp02_motion_transfer.yaml
├── train_exp03_sketch2video_style.yaml
├── train_exp04_sketch2video_wo_style.yaml
├── train_exp05_text_depths_wo_style.yaml
└── train_exp06_text_depths_vs_style.yaml
```
Taking `configs/train_exp02_motion_transfer.yaml` as an example, there is one critical argument:
```yaml
video_compositions: ['text', 'mask', 'depthmap', 'sketch', 'single_sketch', 'motion', 'image', 'local_image']
```
`video_compositions` defines all available conditions:
- `text`: the text embedding.
- `mask`: the masked video frames.
- `depthmap`: the depth images extracted from visual frames.
- `sketch`: the sketch images extracted from visual frames.
- `single_sketch`: the first sketch image from `sketch`.
- `motion`: the motion vectors extracted from the training video.
- `image`: the image embedding used as an image style vector.
- `local_image`: the first frame extracted from the training video.

However, not all conditions are included in the training process in each of the tasks above. As defined in `configs/train_exp02_motion_transfer.yaml`:
```yaml
conditions_for_train: ['text', 'local_image', 'motion']
```
`conditions_for_train` defines the three conditions used for training, which are `['text', 'local_image', 'motion']`.
Please generate the HCCL config file on your running server first, referring to this tutorial. Then update `scripts/run_train_distribute.sh` by setting:
```shell
rank_table_file=path/to/hccl_8p_01234567_xxx.json
```
After that, please set `task_name` according to your target task. The default training task is `train_exp02_motion_transfer`.
Then execute:
```shell
bash scripts/run_train_distribute.sh
```
By default, training is done in epoch mode, i.e. a checkpoint will be saved every `ckpt_save_interval` epochs. To change to step mode, please modify `train_xxx.yaml` as follows:
```yaml
dataset_sink_mode: False
step_mode: True
ckpt_save_interval: 1000
```
With this setting, checkpoints will be saved every 1000 training steps.
Currently, step mode is not compatible with `dataset_sink_mode=True`. This could be addressed by setting `sink_size=ckpt_save_interval` and `epochs=num_epochs * (num_steps_per_epoch // ckpt_save_interval)` in `model.train(...)`, which is under testing.
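A minimal sketch of that workaround, assuming `model`, `loader`, `callbacks`, and `num_epochs` already exist in the training script (this is not yet the shipped behavior):
```python
# Hypothetical sketch: with dataset sinking, make one sunk "epoch" equal ckpt_save_interval steps,
# so the checkpoint callback fires every ckpt_save_interval steps.
ckpt_save_interval = 1000
num_steps_per_epoch = loader.get_dataset_size()

model.train(
    num_epochs * (num_steps_per_epoch // ckpt_save_interval),  # epochs measured in sink units
    loader,
    callbacks=callbacks,
    dataset_sink_mode=True,
    sink_size=ckpt_save_interval,
)
```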
Both JSON and CSV annotation files are supported; JSON has higher priority.
You can adjust the arguments in `configs/train_base.py` (lower priority) or `configs/train_exp{task_name}.yaml` (higher priority; it overwrites `train_base.py` on overlapping keys). Below are the key arguments.
- Optimizer: `adamw` or `momentum`. `momentum` is recommended for 910 to avoid OOM, and `adamw` for 910* for better loss convergence.
- `root_dir`: dataset root directory, which should contain a csv annotation file. The default is `demo_video`, which contains an example annotation file `demo_video/video_caption.csv` for demo training.
- `num_parallel_workers`: default is 2. Increasing it can help reduce video processing time if there are enough CPU cores (i.e. num_workers * num_cards < num_cpu_cores) and enough memory (i.e., approximately, prefetch_size * max_row_size * num_workers < memory size).

The training performance for exp02 (motion transfer) is as follows.
| NPU | Num. Cards | Dataset | Batch size | Performance (ms/step) |
|---|---|---|---|---|
| 910* | 1x8 | WebVid | 1 | ~950 |
| 910* | 8x8 | WebVid | 1 | ~1100 |
The video generation speed is as follows.
| NPU | Framework | Sampler | Steps | Performance (s/trial) |
|---|---|---|---|---|
| 910* | MindSpore 2.2 (20230907) | DDIM | 50 | 12 |
| 910* | MindSpore Lite 2.2 (20230907) | DDIM | 50 | 11.6 |
Note that with MindSpore-Lite, the graph compilation time is eliminated.