Trainings can take some time. A well-running training setup is essential to get the most out of nnU-Net. nnU-Net does not
require any fancy hardware, just a well-balanced system. We recommend at least 32 GB of RAM, 6 CPU cores (12 threads),
SSD storage (SATA is fine, it does not have to be PCIe; do NOT use an external SSD connected via USB!) and an
RTX 2080 Ti GPU. If your system has multiple GPUs, the other components need to scale linearly with the number of GPUs.
To ensure your system is running as intended, we provide some benchmark numbers against which you can compare. Here
are the details about benchmarking:
First, go into the folder where the preprocessed data and plans file of the task you would like to use are located.
For me this is `/home/fabian/data/nnUNet_preprocessed/Task002_Heart`.
Then run the following python snippet. This will create our custom 3d_fullres_large configuration. Note that this
large configuration will only run on GPUs with 16GB or more! We included it in the test because some GPUs
(V100, A100) can shine when they get more work to do per iteration.
```python
from batchgenerators.utilities.file_and_folder_operations import *

# load the default 3D plans, triple the batch size of the highest-resolution stage
# and save the result as a new plans file
plans = load_pickle('nnUNetPlansv2.1_plans_3D.pkl')
stage = max(plans['plans_per_stage'].keys())
plans['plans_per_stage'][stage]['batch_size'] *= 3
save_pickle(plans, 'nnUNetPlansv2.1_bs3x_plans_3D.pkl')
```
Now you can run the benchmarks. Each should only take a couple of minutes. Replace TASKID with the integer id of the task you prepared above (2 for Task002_Heart):
```bash
nnUNet_train 2d nnUNetTrainerV2_5epochs TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs_dummyLoad TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs TASKID 0 -p nnUNetPlansv2.1_bs3x  # optional, only for GPUs with more than 16GB of VRAM
```
The time we are interested in is the epoch time. You can find it in the text output (stdout) or in the log file
located in your `RESULTS_FOLDER`. Note that the trainers used here run for 5 epochs. Select the fastest epoch time from your
output as your benchmark time.
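If you prefer not to scan the log by eye, a quick grep over the training log works as well. The following is just a convenience sketch: it assumes the usual nnU-Net v1 results layout under `RESULTS_FOLDER` and that epoch durations are logged in lines containing "This epoch took"; check your own log for the exact wording and path.

```bash
# assumed locations/wording - adjust task, trainer and fold to your benchmark run
grep "This epoch took" \
    "$RESULTS_FOLDER"/nnUNet/3d_fullres/Task002_Heart/nnUNetTrainerV2_5epochs__nnUNetPlansv2.1/fold_0/training_log_*.txt
```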
The following table shows the results (epoch times in seconds) we are getting on our servers/workstations. We are using PyTorch 1.11.0, which we
compiled ourselves using the instructions found here. The cuDNN version we used is 8.3.2.44. You should be seeing similar numbers when you
run the benchmark on your server/workstation. Note that fluctuations of a couple of seconds are normal!

IMPORTANT: Compiling PyTorch from source is currently mandatory for best performance! PyTorch 1.8 does not have
working Tensor Core acceleration for 3D convolutions when installed with pip or conda!
IMPORTANT: A100 and V100 are very fast with the newer cuDNN versions and need more CPU workers to prevent bottlenecks.
Set the environment variable `nnUNet_n_proc_DA=XX` to increase the number of data augmentation workers
(recommended: 20 for V100, 32 for A100). Datasets with many input modalities (BraTS: 4) require A LOT of CPU and
should be run with even larger values for `nnUNet_n_proc_DA`; see the example below.
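For reference, here is a minimal way to set this for a training run (32 is just the A100 recommendation from above; pick a value that matches your GPU and CPU budget):

```bash
# increase the number of data augmentation workers for this shell session, then train
export nnUNet_n_proc_DA=32
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs TASKID 0
```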
|  | A100 40GB PCIe 250W | A100 40GB (DGX A100) 400W | V100 32GB SXM3 (DGX2) 350W | V100 32GB PCIe 250W | Quadro RTX6000 24GB 260W | Titan RTX 24GB 280W | RTX 2080 ti 11GB 250W |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Task002_Heart 2d | 36.75 | 41.48 | 57.4 | 61.32 | 70.96 | 70.39 | 86.1 |
| Task002_Heart 3d_fullres | 47.16 | 46.22 | 81.92 | 84.82 | 109.5 | 107.44 | 123.27 |
| Task002_Heart 3d_fullres dummy | 46.52 | 43.6 | 66.84 | 78.75 | 99.88 | 97.39 | 116.36 |
| Task002_Heart 3d_fullres large | 121.55 | 111.64 | 221.03 | 242.56 | 284.73 | 302.02 | OOM |
| Task003_Liver 2d | 35.66 | 39.26 | 65.34 | 65.76 | 79.72 | 70.44 | 86.37 |
| Task003_Liver 3d_fullres | 41.49 | 39.67 | 74.21 | 76.79 | 77.63 | 86.75 | 94.16 |
| Task003_Liver 3d_fullres dummy | 40.63 | 37.71 | 62.37 | 70.55 | 76.37 | 74.66 | 86.8 |
| Task003_Liver 3d_fullres large | 102.48 | 97.85 | 202.04 | 209.4 | 226.45 | 254.77 | OOM |
| Task005_Prostate 2d | 36.41 | 37.07 | 64.58 | 65.88 | 70.47 | 77.95 | 88.54 |
| Task005_Prostate 3d_fullres | 42.95 | 41.92 | 90.92 | 95.54 | 85.69 | 90.66 | 109.78 |
| Task005_Prostate 3d_fullres dummy | 41.78 | 39.56 | 78.22 | 88.83 | 83.69 | 81.31 | 101.75 |
| Task005_Prostate 3d_fullres large | 106.98 | 102.75 | 239.32 | 259.14 | 255.64 | 270.8 | OOM |
For comparison, here is an older set of results obtained with an earlier software setup. (Columns are different: the table above includes the A100 PCIe and lacks the Titan Xp GPU!)
|  | A100 40GB (DGX A100) 400W | V100 32GB SXM3 (DGX2) 350W | V100 32GB PCIe 250W | Quadro RTX6000 24GB 260W | Titan RTX 24GB 280W | RTX 2080 ti 11GB 250W | Titan Xp 12GB 250W |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Task002_Heart 2d | 40.06 | 66.03 | 76.19 | 78.01 | 79.78 | 98.49 | 177.87 |
| Task002_Heart 3d_fullres | 51.17 | 85.96 | 99.29 | 110.47 | 112.34 | 148.36 | 504.93 |
| Task002_Heart 3d_fullres dummy | 48.53 | 79 | 89.66 | 105.16 | 105.56 | 138.4 | 501.64 |
| Task002_Heart 3d_fullres large | 118.5 | 220.45 | 251.25 | 322.28 | 300.96 | OOM | OOM |
| Task003_Liver 2d | 39.71 | 60.69 | 69.65 | 72.29 | 76.17 | 92.54 | 183.73 |
| Task003_Liver 3d_fullres | 44.48 | 75.53 | 87.19 | 85.18 | 86.17 | 106.76 | 290.87 |
| Task003_Liver 3d_fullres dummy | 41.1 | 70.96 | 80.1 | 79.43 | 79.43 | 101.54 | 289.03 |
| Task003_Liver 3d_fullres large | 115.33 | 213.27 | 250.09 | 261.54 | 266.66 | OOM | OOM |
| Task005_Prostate 2d | 42.21 | 68.88 | 80.46 | 83.62 | 81.59 | 102.81 | 183.68 |
| Task005_Prostate 3d_fullres | 47.19 | 76.33 | 85.4 | 100 | 102.05 | 132.82 | 415.45 |
| Task005_Prostate 3d_fullres dummy | 43.87 | 70.58 | 81.32 | 97.48 | 98.99 | 124.73 | 410.12 |
| Task005_Prostate 3d_fullres large | 117.31 | 209.12 | 234.28 | 277.14 | 284.35 | OOM | OOM |
Your epoch times are substantially slower than ours? That's not good! This section will help you figure out what is
wrong. Note that each system is unique and we cannot help you find bottlenecks beyond providing the information
presented in this section!
In order to get maximum performance, you need a PyTorch that ships with or was compiled against a recent cuDNN version
(8002 or newer is a must!). You can check your cuDNN version like this:

```bash
python -c 'import torch;print(torch.backends.cudnn.version())'
```

If the output is 8002 or higher, then you are good to go. If not, you may have to take action: either update your
PyTorch version or maybe even compile it yourself.
Compiling PyTorch yourself will almost always give the maximum performance. Please follow the
instructions on the PyTorch website. You will need the cuDNN tar file,
which you can download from the Nvidia homepage.
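If you are unsure what your current install actually provides, it can help to print the PyTorch, CUDA and cuDNN versions together. This uses only standard torch attributes and is nothing nnU-Net specific:

```bash
# prints the PyTorch build, the CUDA version it was built against, and the cuDNN version
python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())'
```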
If the software is up to date and you are still experiencing problems, this is how you can figure out what is going on:
while a training is running, run `htop` and `watch -n 0.1 nvidia-smi` (depending on your locale you may have to use
`0,1` instead of `0.1`). If you have physical access to the machine, also have a look at the LED indicating I/O activity.
Here is what you can read from that:
- `nvidia-smi` shows the GPU activity; `watch -n 0.1` makes the command refresh every 0.1 s. A well-fed GPU shows high utilization and a power draw close to its limit (e.g. `237W / 250W`) at all times.
- `htop` gives you an overview of the CPU usage. nnU-Net uses 12 processes for data augmentation plus one main process, so you should see up to 13 nnU-Net processes.

If `nvidia-smi` is constantly showing 80-100% GPU utilization and the reported power draw is near the maximum, your
GPU is the bottleneck. This is great! That means that your other components are not slowing it down. Your epoch times
should be roughly the same as ours reported above. If they are not, then you need to investigate your software stack (see the cuDNN notes above).
What can you do about it? Nothing, really: if the GPU is the limiting factor, the rest of your system is doing its job and only a faster GPU will speed things up.
You can recognize a CPU bottleneck as follows: `htop` shows (nearly) all CPU threads at 100% while `nvidia-smi` reports fluctuating GPU utilization and a power draw well below the maximum.

What can you do about it? Set `nnUNet_n_proc_DA` to a number higher than 12 (the default). Note that increasing `nnUNet_n_proc_DA` only helps if your CPU has enough spare cores and RAM to feed the additional workers.
On a workstation, I/O bottlenecks can be identified by looking at the LED indicating I/O activity: during an I/O
bottleneck the LED is lit (almost) constantly while GPU utilization stays low or fluctuates.

Detecting I/O bottlenecks is difficult on servers where you may not have physical access. Tools like `iotop` are
difficult to read and can only be run with sudo. However, an I/O LED is not strictly necessary: if neither your GPU nor your CPU is running at (close to) full utilization,
then the only possible issue to my knowledge is in fact an I/O bottleneck.
Here is what you can do about an I/O bottleneck: make sure the preprocessed data (the folder pointed to by
`nnUNet_preprocessed`) is stored on a local SSD. Do not use an external SSD connected via USB! A quick check is sketched below.
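As a quick sanity check (standard Linux tools, nothing nnU-Net specific), you can verify which drive holds your preprocessed data and whether that drive is rotational:

```bash
# show which filesystem/drive the preprocessed data lives on
df -h "$nnUNet_preprocessed"
# ROTA=1 means a rotational disk (HDD), ROTA=0 means SSD/NVMe
lsblk -o NAME,ROTA,TYPE,MOUNTPOINT
```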