21 KiB

Raw Permalink Blame History

Open-Sora Plan

We are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. We are training for higher resolution (>1024) as well as longer duration (>10s) videos, here is a preview of the next release. We show compressed .gif on github, which loses some quality.

Thanks to HUAWEI Ascend NPU Team for supporting us.

目前已支持国产AI芯片(华为昇腾910b，期待更多国产算力芯片)进行推理，下一步将支持国产算力训练，具体可参考PR180
.

257×512×512 (10s)	65×1024×1024 (2.7s)	65×1024×1024 (2.7s)

Time-lapse of a coastal landscape transitioning from sunrise to nightfall...	A quiet beach at dawn, the waves gently lapping at the shore and the sky painted in pastel hues....	Sunset over the sea.

65×512×512 (2.7s)	65×512×512 (2.7s)	65×512×512 (2.7s)

A serene underwater scene featuring a sea turtle swimming...	Yellow and black tropical fish dart through the sea.	a dynamic interaction between the ocean and a large rock...

The dynamic movement of tall, wispy grasses swaying in the wind...	Slow pan upward of blazing oak fire in an indoor fireplace.	A serene waterfall cascading down moss-covered rocks...

💪 Goal

This project aims to create a simple and scalable repo, to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI" ). We wish the open-source community can contribute to this project. Pull requests are welcome!!!

本项目希望通过开源社区的力量复现Sora，由北大-兔展AIGC联合实验室共同发起，当前版本离目标差距仍然较大，仍需持续完善和快速迭代，欢迎Pull request！！！

Project stages:

Primary

Setup the codebase and train a un-conditional model on a landscape dataset.
Train models that boost resolution and duration.

Extensions

Conduct text2video experiments on landscape dataset.
Train the 1080p model on video2text dataset.
Control model with more conditions.

📰 News

[2024.04.07] 🚀🚀🚀 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.

[2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.

[2024.03.10] 🚀🚀🚀 This repo supports training a latent size of 225×90×90 (t×h×w), which means we are able to train 1 minute of 1080P video with 30FPS (2× interpolated frames and 2× super resolution) under class-condition.

[2024.03.08] We support the training code of text condition with 16 frames of 512x512. The code is mainly borrowed from Latte.

[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.

[2024.03.05] See our latest todo, pull requests are welcome.

[2024.03.04] We re-organizes and modulizes our code to make it easy to contribute to the project, to contribute please see the Repo structure.

[2024.03.03] We opened some discussions to clarify several issues.

[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.

✊ Todo

Setup the codebase and train a unconditional model on landscape dataset

Fix typos & Update readme. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901, @Nyx-177, @HowardLi1984, @sennnnn, @Jason-fan20
Setup environment. 🤝 Thanks to @nameless1117
Add docker file. ⌛ [WIP] 🤝 Thanks to @Mon-ius, @SimonLeeGit
Enable type hints for functions. 🤝 Thanks to @RuslanPeresy, 🙏 [Need your contribution]
Resume from checkpoint.
Add Video-VQGAN model, which is borrowed from VideoGPT.
Support variable aspect ratios, resolutions, durations training on DiT.
Support Dynamic mask input inspired by FiT.
Add class-conditioning on embeddings.
Incorporating Latte as main codebase.
Add VAE model, which is borrowed from Stable Diffusion.
Joint dynamic mask input with VAE.
Add VQVAE from VQGAN. 🙏 [Need your contribution]
Make the codebase ready for the cluster training. Add SLURM scripts. 🙏 [Need your contribution]
Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612, @sennnnn
Add sampling script.
Add DDP sampling script. ⌛ [WIP]
Use accelerate on multi-node. 🤝 Thanks to @sysuyy
Incorporate SiT. 🤝 Thanks to @khan-yin
Add evaluation scripts (FVD, CLIP score). 🤝 Thanks to @rain305f

Train models that boost resolution and duration

Add PI to support out-of-domain size. 🤝 Thanks to @jpthu17
Add 2D RoPE to improve generalization ability as FiT. 🤝 Thanks to @jpthu17
Compress KV according to PixArt-sigma.
Support deepspeed for videogpt training. 🤝 Thanks to @sennnnn
Train a low dimension Video-AE, whether it is VAE or VQVAE.
Extract offline feature.
Train with offline feature.
Add frame interpolation model. 🤝 Thanks to @yunyangge
Add super resolution model. 🤝 Thanks to @Linzy19
Add accelerate to automatically manage training.
Joint training with images. 🙏 [Need your contribution]
Implement MaskDiT technique for fast training. 🙏 [Need your contribution]
Incorporate NaViT. 🙏 [Need your contribution]
Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]

Conduct text2video experiments on landscape dataset.

Implement PeRFlow for improving the sampling process. 🙏 [Need your contribution]
Finish data loading, pre-processing utils.
Add T5 support.
Add CLIP support. 🤝 Thanks to @Ytimed2020
Add text2image training script.
Add prompt captioner.
- Collect training data.
  
  Need video-text pairs with caption. 🙏 [Need your contribution]
  
  Extract multi-frame descriptions by large image-language models. 🤝 Thanks to @HowardLi1984
  
  Extract video description by large video-language models. 🙏 [Need your contribution]
  
  Integrate captions to get a dense caption by using a large language model, such as GPT-4. 🤝 Thanks to @HowardLi1984
- Train a captioner to refine captions. 🚀 [Require more computation]

Train the 1080p model on video2text dataset

Looking for a suitable dataset, welcome to discuss and recommend. 🙏 [Need your contribution]
Add synthetic video created by game engines or 3D representations. 🙏 [Need your contribution]
Finish data loading, and pre-processing utils. ⌛ [WIP]
Support memory friendly training.
- Add flash-attention2 from pytorch.
- Add xformers. 🤝 Thanks to @jialin-zhao
- Support mixed precision training.
- Add gradient checkpoint.
- Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
- Train using the deepspeed engine. 🤝 Thanks to @sennnnn
- Integrate with Colossal-AI for a cheaper, faster, and more efficient. 🙏 [Need your contribution]
Train with a text condition. Here we could conduct different experiments: 🚀 [Require more computation]
- Train with T5 conditioning.
- Train with CLIP conditioning.
- Train with CLIP + T5 conditioning (probably costly during training and experiments).

Control model with more condition

Load pretrained weights from Latte.
Incorporating ControlNet. 🙏 [Need your contribution]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion         
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   │   └── text_encoder
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

🛠️ Requirements and Installation

Clone this repository and navigate to Open-Sora-Plan folder

git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan

Install required packages

conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .

Install additional packages for training cases

pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Install optional requirements such as static type checking:

pip install -e '.[dev]'

🗝️ Usage

🤗 Demo

Gradio Web UI

Highly recommend trying out our web demo by the following command. We also provide online demo and in Huggingface Spaces.

🤝 Enjoying the and , created by @camenduru, who generously supports our research!

python -m opensora.serve.gradio_web_server

CLI Inference

sh scripts/text_condition/sample_video.sh

Datasets

Refer to Data.md

Evaluation

Refer to the document EVAL.md.

Causal Video VAE

Reconstructing

python examples/rec_video_vae.py --rec-path test_video.mp4 --video-path video.mp4 --resolution 512 --num-frames 1440 --sample-rate 1 --sample-fps 24 -
-device cuda --ckpt <Your ckpt>

VideoGPT VQVAE

Please refer to the document VQVAE.

Video Diffusion Transformer

Training

sh scripts/text_condition/train_videoae_17x256x256.sh

sh scripts/text_condition/train_videoae_65x256x256.sh

sh scripts/text_condition/train_videoae_65x512x512.sh

🚀 Improved Training Performance

In comparison to the original implementation, we implement a selection of training speed acceleration and memory saving features including gradient checkpointing, mixed precision training, and pre-extracted features, xformers, deepspeed. Some data points using a batch size of 1 with a A100:

64×32×32 (origin size: 256×256×256)

gradient checkpointing	mixed precision	xformers	feature pre-extraction	deepspeed config	compress kv	training speed	memory
✔	✔	✔	✔	❌	❌	0.64 steps/sec	43G
✔	✔	✔	✔	Zero2	❌	0.66 steps/sec	14G
✔	✔	✔	✔	Zero2	✔	0.66 steps/sec	15G
✔	✔	✔	✔	Zero2 offload	❌	0.33 steps/sec	11G
✔	✔	✔	✔	Zero2 offload	✔	0.31 steps/sec	12G

128×64×64 (origin size: 512×512×512)

gradient checkpointing	mixed precision	xformers	feature pre-extraction	deepspeed config	compress kv	training speed	memory
✔	✔	✔	✔	❌	❌	0.08 steps/sec	77G
✔	✔	✔	✔	Zero2	❌	0.08 steps/sec	41G
✔	✔	✔	✔	Zero2	✔	0.09 steps/sec	36G
✔	✔	✔	✔	Zero2 offload	❌	0.07 steps/sec	39G
✔	✔	✔	✔	Zero2 offload	✔	0.07 steps/sec	33G

💡 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and helping us make it even better than it is now!

For more details, please refer to the Contribution Guidelines

👍 Acknowledgement

Latte: The main codebase we built upon and it is an wonderful video gererated model.
PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
VideoGPT: Video Generation using VQ-VAE and Transformers.
DiT: Scalable Diffusion Models with Transformers.
FiT: Flexible Vision Transformer for Diffusion Model.
Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

See LICENSE for details.

21 KiB Raw Permalink Blame History