Trainings can take some time. A well-running training setup is essential to get the most out of nnU-Net. nnU-Net does not
require any fancy hardware, just a well-balanced system. We recommend at least 32 GB of RAM, 6 CPU cores (12 threads),
SSD storage (SATA is fine, it does not have to be PCIe; do NOT use an external SSD connected via USB!) and an
RTX 2080 Ti GPU. If your system has multiple GPUs, the other components need to scale linearly with the number of GPUs.
To ensure your system is running as intended, we provide some benchmark numbers against which you can compare. Here
are the details about benchmarking:
First, go into the folder where the preprocessed data and plans file of the task you would like to use are located.
For me this is `/home/fabian/data/nnUNet_preprocessed/Task002_Heart`.
Then run the following python snippet. This will create our custom 3d_fullres_large configuration. Note that this
large configuration will only run on GPUs with 16GB or more! We included it in the test because some GPUs
(V100, A100) can shine when they get more work to do per iteration.
```python
from batchgenerators.utilities.file_and_folder_operations import *

# load the default 3D plans, triple the batch size of the highest-resolution stage
# and save the result as a new plans file
plans = load_pickle('nnUNetPlansv2.1_plans_3D.pkl')
stage = max(plans['plans_per_stage'].keys())
plans['plans_per_stage'][stage]['batch_size'] *= 3
save_pickle(plans, 'nnUNetPlansv2.1_bs3x_plans_3D.pkl')
```
Now you can run the benchmarks. Each should only take a couple of minutes. Replace TASKID with the integer id of the task you prepared above (2 for Task002_Heart):
```bash
nnUNet_train 2d nnUNetTrainerV2_5epochs TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs_dummyLoad TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs TASKID 0 -p nnUNetPlansv2.1_bs3x  # optional, only for GPUs with more than 16GB of VRAM
```
The time we are interested in is the epoch time. You can find it in the text output (stdout) or in the log file
located in your `RESULTS_FOLDER`. Note that the trainers used here run for 5 epochs. Select the fastest epoch time from your
output as your benchmark time.
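If you prefer not to scan the log by eye, a quick grep over the training log works as well. The following is just a convenience sketch: it assumes the usual nnU-Net v1 results layout under `RESULTS_FOLDER` and that epoch durations are logged in lines containing "This epoch took"; check your own log for the exact wording and path.

```bash
# assumed locations/wording - adjust task, trainer and fold to your benchmark run
grep "This epoch took" \
    "$RESULTS_FOLDER"/nnUNet/3d_fullres/Task002_Heart/nnUNetTrainerV2_5epochs__nnUNetPlansv2.1/fold_0/training_log_*.txt
```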
The following table shows the results (epoch times in seconds) we are getting on our servers/workstations. We are using PyTorch 1.11.0, which we
compiled ourselves using the instructions found here. The cuDNN version we used is 8.3.2.44. You should be seeing similar numbers when you
run the benchmark on your server/workstation. Note that fluctuations of a couple of seconds are normal!

IMPORTANT: Compiling PyTorch from source is currently mandatory for best performance! PyTorch 1.8 does not have
working Tensor Core acceleration for 3D convolutions when installed with pip or conda!
IMPORTANT: A100 and V100 are very fast with the newer cuDNN versions and need more CPU workers to prevent bottlenecks.
Set the environment variable `nnUNet_n_proc_DA=XX` to increase the number of data augmentation workers
(recommended: 20 for V100, 32 for A100). Datasets with many input modalities (BraTS: 4) require A LOT of CPU and
should be run with even larger values for `nnUNet_n_proc_DA`; see the example below.
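For reference, here is a minimal way to set this for a training run (32 is just the A100 recommendation from above; pick a value that matches your GPU and CPU budget):

```bash
# increase the number of data augmentation workers for this shell session, then train
export nnUNet_n_proc_DA=32
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs TASKID 0
```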
|  | A100 40GB PCIe 250W | A100 40GB (DGX A100) 400W | V100 32GB SXM3 (DGX2) 350W | V100 32GB PCIe 250W | Quadro RTX6000 24GB 260W | Titan RTX 24GB 280W | RTX 2080 ti 11GB 250W |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Task002_Heart 2d | 36.75 | 41.48 | 57.4 | 61.32 | 70.96 | 70.39 | 86.1 |
| Task002_Heart 3d_fullres | 47.16 | 46.22 | 81.92 | 84.82 | 109.5 | 107.44 | 123.27 |
| Task002_Heart 3d_fullres dummy | 46.52 | 43.6 | 66.84 | 78.75 | 99.88 | 97.39 | 116.36 |
| Task002_Heart 3d_fullres large | 121.55 | 111.64 | 221.03 | 242.56 | 284.73 | 302.02 | OOM |
| Task003_Liver 2d | 35.66 | 39.26 | 65.34 | 65.76 | 79.72 | 70.44 | 86.37 |
| Task003_Liver 3d_fullres | 41.49 | 39.67 | 74.21 | 76.79 | 77.63 | 86.75 | 94.16 |
| Task003_Liver 3d_fullres dummy | 40.63 | 37.71 | 62.37 | 70.55 | 76.37 | 74.66 | 86.8 |
| Task003_Liver 3d_fullres large | 102.48 | 97.85 | 202.04 | 209.4 | 226.45 | 254.77 | OOM |
| Task005_Prostate 2d | 36.41 | 37.07 | 64.58 | 65.88 | 70.47 | 77.95 | 88.54 |
| Task005_Prostate 3d_fullres | 42.95 | 41.92 | 90.92 | 95.54 | 85.69 | 90.66 | 109.78 |
| Task005_Prostate 3d_fullres dummy | 41.78 | 39.56 | 78.22 | 88.83 | 83.69 | 81.31 | 101.75 |
| Task005_Prostate 3d_fullres large | 106.98 | 102.75 | 239.32 | 259.14 | 255.64 | 270.8 | OOM |
For comparison, here is an older set of results obtained with an earlier software setup. (Columns are different: the table above includes the A100 PCIe and lacks the Titan Xp GPU!)
|  | A100 40GB (DGX A100) 400W | V100 32GB SXM3 (DGX2) 350W | V100 32GB PCIe 250W | Quadro RTX6000 24GB 260W | Titan RTX 24GB 280W | RTX 2080 ti 11GB 250W | Titan Xp 12GB 250W |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Task002_Heart 2d | 40.06 | 66.03 | 76.19 | 78.01 | 79.78 | 98.49 | 177.87 |
| Task002_Heart 3d_fullres | 51.17 | 85.96 | 99.29 | 110.47 | 112.34 | 148.36 | 504.93 |
| Task002_Heart 3d_fullres dummy | 48.53 | 79 | 89.66 | 105.16 | 105.56 | 138.4 | 501.64 |
| Task002_Heart 3d_fullres large | 118.5 | 220.45 | 251.25 | 322.28 | 300.96 | OOM | OOM |
| Task003_Liver 2d | 39.71 | 60.69 | 69.65 | 72.29 | 76.17 | 92.54 | 183.73 |
| Task003_Liver 3d_fullres | 44.48 | 75.53 | 87.19 | 85.18 | 86.17 | 106.76 | 290.87 |
| Task003_Liver 3d_fullres dummy | 41.1 | 70.96 | 80.1 | 79.43 | 79.43 | 101.54 | 289.03 |
| Task003_Liver 3d_fullres large | 115.33 | 213.27 | 250.09 | 261.54 | 266.66 | OOM | OOM |
| Task005_Prostate 2d | 42.21 | 68.88 | 80.46 | 83.62 | 81.59 | 102.81 | 183.68 |
| Task005_Prostate 3d_fullres | 47.19 | 76.33 | 85.4 | 100 | 102.05 | 132.82 | 415.45 |
| Task005_Prostate 3d_fullres dummy | 43.87 | 70.58 | 81.32 | 97.48 | 98.99 | 124.73 | 410.12 |
| Task005_Prostate 3d_fullres large | 117.31 | 209.12 | 234.28 | 277.14 | 284.35 | OOM | OOM |
Your epoch times are substantially slower than ours? That's not good! This section will help you figure out what is
wrong. Note that each system is unique and we cannot help you find bottlenecks beyond providing the information
presented in this section!
In order to get maximum performance, you need a PyTorch that ships with or was compiled against a recent cuDNN version
(8002 or newer is a must!). You can check your cuDNN version like this:

```bash
python -c 'import torch;print(torch.backends.cudnn.version())'
```

If the output is 8002 or higher, then you are good to go. If not, you may have to take action: either update your
PyTorch version or maybe even compile it yourself.
Compiling PyTorch yourself will almost always give the maximum performance. Please follow the
instructions on the PyTorch website. You will need the cuDNN tar file,
which you can download from the Nvidia homepage.
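If you are unsure what your current install actually provides, it can help to print the PyTorch, CUDA and cuDNN versions together. This uses only standard torch attributes and is nothing nnU-Net specific:

```bash
# prints the PyTorch build, the CUDA version it was built against, and the cuDNN version
python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())'
```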
If the software is up to date and you are still experiencing problems, this is how you can figure out what is going on:
while a training is running, run `htop` and `watch -n 0.1 nvidia-smi` (depending on your locale you may have to use
`0,1` instead of `0.1`). If you have physical access to the machine, also have a look at the LED indicating I/O activity.
Here is what you can read from that:
- `nvidia-smi` shows the GPU activity; `watch -n 0.1` makes the command refresh every 0.1 s. A well-fed GPU shows high utilization and a power draw close to its limit (e.g. `237W / 250W`) at all times.
- `htop` gives you an overview of the CPU usage. nnU-Net uses 12 processes for data augmentation plus one main process, so you should see up to 13 nnU-Net processes.

If `nvidia-smi` is constantly showing 80-100% GPU utilization and the reported power draw is near the maximum, your
GPU is the bottleneck. This is great! That means that your other components are not slowing it down. Your epoch times
should be roughly the same as ours reported above. If they are not, then you need to investigate your software stack (see the cuDNN notes above).
What can you do about it? Nothing, really: if the GPU is the limiting factor, the rest of your system is doing its job and only a faster GPU will speed things up.
You can recognize a CPU bottleneck as follows: `htop` shows (nearly) all CPU threads at 100% while `nvidia-smi` reports fluctuating GPU utilization and a power draw well below the maximum.

What can you do about it? Set `nnUNet_n_proc_DA` to a number higher than 12 (the default). Note that increasing `nnUNet_n_proc_DA` only helps if your CPU has enough spare cores and RAM to feed the additional workers.
On a workstation, I/O bottlenecks can be identified by looking at the LED indicating I/O activity: during an I/O
bottleneck the LED is lit (almost) constantly while GPU utilization stays low or fluctuates.

Detecting I/O bottlenecks is difficult on servers where you may not have physical access. Tools like `iotop` are
difficult to read and can only be run with sudo. However, an I/O LED is not strictly necessary: if neither your GPU nor your CPU is running at (close to) full utilization,
then the only possible issue to my knowledge is in fact an I/O bottleneck.
Here is what you can do about an I/O bottleneck: make sure the preprocessed data (the folder pointed to by
`nnUNet_preprocessed`) is stored on a local SSD. Do not use an external SSD connected via USB! A quick check is sketched below.
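As a quick sanity check (standard Linux tools, nothing nnU-Net specific), you can verify which drive holds your preprocessed data and whether that drive is rotational:

```bash
# show which filesystem/drive the preprocessed data lives on
df -h "$nnUNet_preprocessed"
# ROTA=1 means a rotational disk (HDD), ROTA=0 means SSD/NVMe
lsblk -o NAME,ROTA,TYPE,MOUNTPOINT
```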