Abstract
Generative adversarial networks (GANs) and denoising diffusion probabilistic models have recently achieved impressive performance in image and audio synthesis. After revisiting their success in conditional speech synthesis, we find that 1) GANs sacrifice sample diversity for quality and speed, and 2) diffusion models offer superior sample quality and diversity but require a large number of iterative refinement steps. Achieving high-quality and diverse speech synthesis at a low computational cost remains an open problem for all neural synthesizers. In this work, we propose to combine the advantages of GANs and diffusion models by integrating the two classes from two complementary modeling perspectives: 1) DiffGAN-Wave, a diffusion model whose denoising process is parametrized by conditional GANs, and whose non-Gaussian denoising distribution makes the reverse process much more stable at large step sizes; and 2) GANDiff-Wave, a generative adversarial network whose forward process is constructed from multiple denoising diffusion iterations, and which exhibits better sample diversity than traditional GANs. Experimental results show that both variants enjoy an efficient 4-step sampling process and demonstrate superior sample quality and diversity. Audio samples are available at https://RevisitSpeech.github.io
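To make the 4-step sampling idea more concrete, the following is a minimal sketch of one plausible reading of the DiffGAN-Wave description: at each reverse step a conditional generator predicts a clean waveform from the noisy one, and the next, less noisy sample is drawn from the standard Gaussian diffusion posterior. This is an illustration under stated assumptions, not the authors' implementation. The names `make_schedule`, `posterior_sample`, `toy_generator`, and `sample`, the linear noise schedule and its values, and the per-sample conditioning vector `cond` are all hypothetical; a real system would condition on mel-spectrogram features and use a trained network.

```python
import math
import random


def make_schedule(num_steps=4, beta_min=1e-4, beta_max=0.5):
    """Linear noise schedule; the values are illustrative, not from the paper."""
    betas = [beta_min + (beta_max - beta_min) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def posterior_sample(x0_pred, x_t, t, schedule, rng):
    """Draw x_{t-1} from the Gaussian posterior q(x_{t-1} | x_t, x0_pred)."""
    betas, alphas, alpha_bars = schedule
    if t == 0:
        # At the final step the posterior collapses onto the predicted x0.
        return list(x0_pred)
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = math.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = math.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    std = math.sqrt(betas[t] * (1.0 - ab_prev) / (1.0 - ab_t))
    return [coef_x0 * x0 + coef_xt * xt + std * rng.gauss(0.0, 1.0)
            for x0, xt in zip(x0_pred, x_t)]


def toy_generator(x_t, cond, t, z):
    """Hypothetical stand-in for a trained conditional GAN generator.

    It maps (noisy signal, conditioning, step index, latent noise) to a
    clean-signal estimate clipped to [-1, 1]. The step index is unused here.
    """
    return [max(-1.0, min(1.0, 0.5 * x + 0.5 * c + 0.1 * n))
            for x, c, n in zip(x_t, cond, z)]


def sample(generator, cond, num_steps=4, seed=0):
    """Run the reverse process for num_steps steps, starting from Gaussian noise."""
    rng = random.Random(seed)
    schedule = make_schedule(num_steps)
    x_t = [rng.gauss(0.0, 1.0) for _ in cond]
    for t in reversed(range(num_steps)):
        z = [rng.gauss(0.0, 1.0) for _ in cond]   # latent input to the generator
        x0_pred = generator(x_t, cond, t, z)      # generator predicts clean audio
        x_t = posterior_sample(x0_pred, x_t, t, schedule, rng)
    return x_t
```

Calling `sample(toy_generator, [0.0] * 16)` returns a list of 16 values in [-1, 1]. Because the generator takes a latent `z` in addition to the noisy input, the implied denoising distribution need not be Gaussian, which is the property the abstract credits for stability at large step sizes.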
Comparison with other models
Text: in being comparatively modern.
GT | WaveNet | WaveGlow | HiFi-GAN | UnivNet | DiffWave | WaveGrad | FastDiff | DiffGAN-Wave | GANDiff-Wave |
---|---|---|---|---|---|---|---|---|---|
Text: the earliest book printed with movable types, the gutenberg, or forty two line.
GT | WaveNet | WaveGlow | HiFi-GAN | UnivNet | DiffWave | WaveGrad | FastDiff | DiffGAN-Wave | GANDiff-Wave |
---|---|---|---|---|---|---|---|---|---|
Text: the middle ages brought calligraphy to perfection, and it was natural therefore.
GT | WaveNet | WaveGlow | HiFi-GAN | UnivNet | DiffWave | WaveGrad | FastDiff | DiffGAN-Wave | GANDiff-Wave |
---|---|---|---|---|---|---|---|---|---|
Model Generalization
Text: Ask her to bring these things with her from the store.
GT | WaveNet | WaveGlow | HiFi-GAN | UnivNet | DiffWave | WaveGrad | FastDiff | DiffGAN-Wave | GANDiff-Wave |
---|---|---|---|---|---|---|---|---|---|
Text: Please call Stella.
GT | WaveNet | WaveGlow | HiFi-GAN | UnivNet | DiffWave | WaveGrad | FastDiff | DiffGAN-Wave | GANDiff-Wave |
---|---|---|---|---|---|---|---|---|---|
Ablation Study
Text: in being comparatively modern.
DiffGAN-Wave | DiffGAN-Wave w/o Diffusion Reparameterization | DiffGAN-Wave w/o Reconstruction Objective | GANDiff-Wave | GANDiff-Wave w/o Reconstruction Objective |
---|---|---|---|---|
Text: it is of the first importance that the letter used should be fine in form.
DiffGAN-Wave | DiffGAN-Wave w/o Diffusion Reparameterization | DiffGAN-Wave w/o Reconstruction Objective | GANDiff-Wave | GANDiff-Wave w/o Reconstruction Objective |
---|---|---|---|---|
Text: the middle ages brought calligraphy to perfection, and it was natural therefore.
DiffGAN-Wave | DiffGAN-Wave w/o Diffusion Reparameterization | DiffGAN-Wave w/o Reconstruction Objective | GANDiff-Wave | GANDiff-Wave w/o Reconstruction Objective |
---|---|---|---|---|