DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
PART1. DiffSinger (SVS)
In PART1, we focus only on spectrum modeling (the acoustic model) and assume that the ground-truth (GT) F0 is given as the pitch information, following these papers [1][2][3]. If you want to conduct experiments with F0 prediction, please move to PART2.
Thus, the pipeline of this part can be summarized as:
[lyrics] -> [linguistic representation] (Frontend)
[linguistic representation] + [GT F0] + [GT phoneme duration] -> [mel-spectrogram] (Acoustic model)
[mel-spectrogram] + [GT F0] -> [waveform] (Vocoder)
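To make the data flow concrete, here is a minimal Python sketch of this three-stage pipeline. The module names and call signatures (`frontend`, `acoustic_model`, `vocoder`) are illustrative placeholders, not the repository's actual API:

```python
# Minimal sketch of the PART1 inference pipeline.
# All module names and signatures below are hypothetical placeholders;
# see the repository code for the actual classes and entry points.

def synthesize(lyrics, gt_f0, gt_durations, frontend, acoustic_model, vocoder):
    """Run the PART1 pipeline, where GT F0 and GT phoneme durations are given."""
    # Frontend: lyrics -> linguistic representation (e.g. phoneme IDs)
    phonemes = frontend.text_to_phonemes(lyrics)

    # Acoustic model: linguistic representation + GT F0 + GT durations -> mel-spectrogram
    mel = acoustic_model(phonemes, f0=gt_f0, durations=gt_durations)

    # Vocoder: mel-spectrogram + GT F0 -> waveform
    wav = vocoder(mel, f0=gt_f0)
    return wav
```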
[1] Adversarially Trained Multi-Singer Sequence-to-Sequence Singing Synthesizer. Interspeech 2020.
[2] Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer. ICASSP 2020.
[3] DeepSinger: Singing Voice Synthesis with Data Mined From the Web. KDD 2020.
Click here for detailed instructions: link.
PART2. DiffSinger (MIDI version SVS)
Thanks to the Opencpop team for releasing their SVS dataset with MIDI labels on Jan. 20, 2022 (after we published our paper).
Since the dataset provides elaborately annotated MIDI labels, we can supplement the pipeline in PART1 with a naive melody frontend.
2.A
Thus, the pipeline of 2.A can be summarized as:
[lyrics] + [MIDI] -> [linguistic representation (with MIDI information)] + [predicted F0] + [predicted phoneme duration] (Melody frontend)
[linguistic representation] + [predicted F0] + [predicted phoneme duration] -> [mel-spectrogram] (Acoustic model)
[mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)
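As a rough intuition for the "naive" part, a melody frontend can derive a frame-level F0 contour directly from the MIDI note pitches using the standard MIDI-to-Hz conversion. A minimal sketch, assuming an illustrative frame rate and per-note durations (not the repository's implementation):

```python
import numpy as np

def midi_to_f0_contour(midi_pitches, note_durations_sec, frames_per_sec=100):
    """Expand per-note MIDI pitches into a frame-level F0 contour (Hz).

    midi_pitches: MIDI note numbers, one per note (0 = rest).
    note_durations_sec: duration of each note in seconds.
    frames_per_sec: illustrative frame rate; the real value depends on the hop size.
    """
    f0 = []
    for pitch, dur in zip(midi_pitches, note_durations_sec):
        n_frames = int(round(dur * frames_per_sec))
        # Standard MIDI-to-Hz conversion: A4 (note 69) = 440 Hz.
        hz = 440.0 * 2.0 ** ((pitch - 69) / 12.0) if pitch > 0 else 0.0
        f0.extend([hz] * n_frames)
    return np.array(f0)
```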
Click here for detailed instructions: link.
2.B
In 2.A, we find that predicting F0 explicitly in the melody frontend produces many bad cases in unvoiced/voiced (UV/V) prediction. Therefore, in 2.B we abandon the explicit prediction of the F0 curve in the melody frontend and instead predict F0 jointly with the spectrogram, via a pitch extractor.
Thus, the pipeline of 2.B can be summarized as:
[lyrics] + [MIDI] -> [linguistic representation (with MIDI information)] + [predicted phoneme duration] (Melody frontend)
[linguistic representation (with MIDI information)] + [predicted phoneme duration] -> [mel-spectrogram] (Acoustic model)
[mel-spectrogram] -> [predicted F0] (Pitch extractor)
[mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)
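For intuition, the pitch extractor maps mel-spectrogram frames to per-frame F0 values. A minimal PyTorch sketch, assuming an illustrative 1-D convolutional architecture (the repository's actual model may differ):

```python
import torch
import torch.nn as nn

class PitchExtractor(nn.Module):
    """Illustrative pitch extractor: mel-spectrogram -> frame-level F0.

    This is a sketch under assumed dimensions, not the repository's
    architecture; the real model lives in the linked instructions.
    """

    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),  # one F0 value per frame
        )

    def forward(self, mel):
        # mel: [batch, frames, n_mels] -> Conv1d expects [batch, n_mels, frames]
        f0 = self.net(mel.transpose(1, 2))
        return f0.squeeze(1)  # [batch, frames]
```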
Click here for detailed instructions: link.
PART3. Customize your phonemes (🎉Exclusive in this forked repository!)
In PART2, we observed many bad cases with phonemes that have multiple pronunciations, e.g. `i` in `bi`, `ci`, `chi` and `e` in `ce`, `ye`. However, the original codebase depends heavily on the Opencpop dataset and its labels, including its phoneme system, which makes changing the phoneme system difficult.
In this repository, we decoupled the code from the Opencpop phoneme system and dictionary, configured all information about the phoneme system in one single file, and released a revised version of the Opencpop pinyin dictionary. This refactor also makes customized phoneme systems and dictionaries possible, e.g. for Japanese, Korean, etc.
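As a sketch of what "one single file" can look like: a pinyin dictionary of this kind typically maps each syllable to its phoneme sequence, and the phoneme set can then be derived from the dictionary instead of being hard-coded for Opencpop. The tab-separated format below is an assumption for illustration; check the released dictionary for the actual format:

```python
def load_dictionary(path):
    """Load a syllable -> phoneme-sequence dictionary.

    Assumed format (illustrative, one entry per line, tab-separated):
        a<TAB>a
        ba<TAB>b a
        chi<TAB>ch i
    """
    dictionary = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            syllable, phonemes = line.split("\t", 1)
            dictionary[syllable] = phonemes.split(" ")
    # Collect the phoneme set from the dictionary itself,
    # rather than hard-coding it for a specific dataset.
    phoneme_set = sorted({p for phones in dictionary.values() for p in phones})
    return dictionary, phoneme_set
```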
Click here for requirements and instructions: link.
FAQ
Q1: Why do I need F0 in vocoders?
A1: See the vocoder parts of HiFiSinger, DiffSinger, or SingGAN. Conditioning the vocoder on F0 is now a common practice.
Q2: Why not run the MIDI version of SVS on the PopCS dataset? Or why not release MIDI labels for the PopCS dataset?
A2: Our laboratory has no funds to label the PopCS dataset, but there are funds for labeling another singing dataset, which is coming soon.
Q3: Why do I get " 'HifiGAN' object has no attribute 'model' "?
A3: Please put the pretrained vocoders in your `checkpoints` directory.
Q4: How can I check whether GT information or predicted information is used during inference on the packed test set?
A4: Please see the code here.
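As an illustration only (the flag names below are hypothetical; the actual switches are in the linked code), the usual pattern is a configuration flag that selects between GT values from the packed test set and the model's own predictions:

```python
# Illustrative pattern only; the actual flag names and logic are in the repo.
def select_inference_inputs(sample, predictions, hparams):
    # Hypothetical flags: when True, GT values from the packed test set
    # are used; when False, the model's predictions are used instead.
    f0 = sample["f0"] if hparams.get("use_gt_f0") else predictions["f0"]
    dur = sample["durations"] if hparams.get("use_gt_dur") else predictions["durations"]
    return f0, dur
```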
...