Adventures training audio models with a 4060

A journey through setting up the IDIAP fork of 🐸Coqui TTS on a Windows machine

Recently I’ve been making audiobooks and wanted more control than online tools typically offer. So I set out to train some FOSS models (Tacotron2, GlowTTS) using the IDIAP fork of Coqui TTS, with an RTX 4060.


🐸 Why the IDIAP Fork?

The official Coqui TTS repo has been unmaintained since Coqui shut down in early 2024, so I used the IDIAP fork instead:

  • Better recipe support (e.g., for LJSpeech)
  • Cleaner training API and updated configs
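
Swapping the fork in is a one-liner; if I’m reading its packaging right, it publishes to PyPI under coqui-tts rather than the original TTS name:

pip install coqui-tts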

💻 System Specs

  • OS: Windows 11
  • GPU: NVIDIA GeForce RTX 4060 8GB
  • Python: 3.12
  • Shell: Git Bash (critical note below)
  • PyTorch: 2.5.1, cu118 build (an older CUDA runtime is fine under my CUDA 12.6 driver, since NVIDIA drivers are backward compatible)

🔧 Step-by-Step: Getting It to Train

1. Download LJSpeech Dataset

I manually downloaded it from keithito.com since Git Bash didn’t support wget out of the box.
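
Git Bash does ship with curl, though, so the fetch can also be scripted. A sketch, assuming the canonical keithito.com download URL hasn’t moved:

curl -LO https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjf LJSpeech-1.1.tar.bz2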

2. Install the Right PyTorch + CUDA Version

Even though my system driver reports CUDA 12.6, I had to install a PyTorch wheel built against a CUDA runtime PyTorch actually ships, cu118 in my case.
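
For reference, this is the standard PyTorch index-URL install pattern, pinned to the versions I used (adjust the pins to your setup):

pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

And a quick sanity check that the 4060 is actually visible:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"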

3. ⚠️ Windows Multiprocessing Requires __main__

Windows has no fork() like Linux; child processes are spawned by re-importing the main module. So when torch.utils.data.DataLoader spins up worker processes, the training entry point must be guarded like this:

if __name__ == "__main__":
    trainer.fit()
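
In practice that means the whole recipe body moves into a function that only runs under the guard. Here’s a condensed sketch of the script shape, based on the stock LJSpeech Tacotron2 recipe from the fork; treat the config values as placeholders, not tuned settings:

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.tacotron2_config import Tacotron2Config
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.tacotron2 import Tacotron2
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

def main():
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech", meta_file_train="metadata.csv", path="LJSpeech-1.1/"
    )
    config = Tacotron2Config(
        batch_size=32,         # placeholder; 8 GB of VRAM may force this down
        num_loader_workers=4,  # these workers are why the __main__ guard matters
        run_eval=True,
        epochs=1000,
        datasets=[dataset_config],
        output_path="runs/",
    )
    ap = AudioProcessor.init_from_config(config)
    tokenizer, config = TTSTokenizer.init_from_config(config)
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
    model = Tacotron2(config, ap, tokenizer)
    trainer = Trainer(
        TrainerArgs(), config, config.output_path,
        model=model, train_samples=train_samples, eval_samples=eval_samples,
    )
    trainer.fit()

# On Windows, every DataLoader worker re-imports this module; without the
# guard, each worker would re-run main() and the spawn would crash.
if __name__ == "__main__":
    main()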

4. Git Bash vs PowerShell: Setting CUDA_VISIBLE_DEVICES

In Git Bash:

CUDA_VISIBLE_DEVICES="0" python train_tacotron_dca.py
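
For contrast, PowerShell has no inline VAR=value prefix form; the variable is set as its own statement first:

$env:CUDA_VISIBLE_DEVICES = "0"
python train_tacotron_dca.py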

5. Run Training and Watch Logs

Once everything was set up, I monitored training and logs via TensorBoard:

tensorboard --logdir=.

Early step loss values:

decoder_loss: 32.8
postnet_loss: 34.8
total_loss: 17.5

⚠️ Warnings I Saw (and Ignored)

  • Character '͡' not found in the vocabulary. → Safe to ignore; it’s the combining tie bar the phonemizer emits for affricate ligatures
  • audio amplitude out of range, auto clipped. → Harmless, but normalizing the audio beforehand avoids it (see the sketch after this list)
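
One blunt fix for that second warning is to peak-normalize the corpus before training. A minimal sketch, assuming soundfile and numpy are installed; it rewrites files in place, so run it on a copy:

import numpy as np
import soundfile as sf
from pathlib import Path

# Rescale every wav so its peak sits at 0.95, leaving headroom
# below the clipping threshold the trainer warns about.
for wav_path in Path("LJSpeech-1.1/wavs").glob("*.wav"):
    audio, sr = sf.read(wav_path)
    peak = np.abs(audio).max()
    if peak > 0:
        sf.write(wav_path, audio * (0.95 / peak), sr)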

✅ Summary: What Worked

  • Git Bash: works with CUDA; just mind how it handles env vars
  • IDIAP fork: clean, modular, supports Tacotron2-DCA
  • GPU (RTX 4060): detected and used via PyTorch
  • TensorBoard: worked great for monitoring logs
  • Training: slow at first, but stable

🧠 Lessons for Anyone Training on Windows

  • Wrap all training logic inside if __name__ == "__main__":
  • Manually download datasets if wget fails
  • Use the correct CUDA version for your PyTorch build
  • If a file is “in use,” TensorBoard is likely the culprit

Results

$ tts --text "Hello, this is a test of the Tacotron2-DCA text to speech model." \
    --model_path best_model_203.pth \
    --config_path config.json \
    --out_path output.wav

Tacotron2 epoch time ≈ steps per epoch (202) × average step time (6 s) ≈ 1212 s ≈ 20 min/epoch.

July 2, 2025: The first run with tacotron2-dca had avg_align_error climbing past 0.96 to 0.966 over 3k+ steps, with no diagonal forming in the attention plot. This means the model never learned correct alignments, so it couldn’t associate phonemes with the corresponding audio.
