---
title: 'Adventures in TTS'
date: '2025-02-19T01:27:21+00:00'
type: post
word_count: 466
char_count: 2613
tokens: 606
categories:
  - Uncategorized
---

# Adventures in TTS

So, it turns out there’s a [new text-to-speech engine](https://huggingface.co/spaces/hexgrad/Kokoro-TTS) on the block, called Kokoro. It’s really good. The voices sound pretty human: they take little breaths, make sensible pauses, and generally produce well-formed speech. I wouldn’t say they’re as good as a great audiobook reader, but I’d say they’re better than the average one. And you get to run it on your local machine.

There are also some new [Edge-TTS voices](https://github.com/rany2/edge-tts), which, if you haven’t played with them, are really quite good, apparently free, and super fast. There are 300+ voices in *many* languages, including about 50 for English. Most of the voices have their “VoicePersonalities” set to “Friendly, Positive”, but some of the new ones list things like warm, confident, authentic, honest, and rational.
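If you want to poke at the voice list yourself, here’s a rough sketch of how you might tally the personality tags. It assumes each voice entry is a dict with `ShortName`, `Locale`, and a nested `VoiceTag` → `VoicePersonalities` list, which is the shape I’d expect from `edge_tts.list_voices()`. The helper names and the sample entries are made up for illustration.

```python
from collections import Counter


def count_personalities(voices):
    """Count how often each personality tag appears across voice entries."""
    counts = Counter()
    for voice in voices:
        # Assumed shape: voice["VoiceTag"]["VoicePersonalities"] is a list of strings.
        tags = voice.get("VoiceTag", {}).get("VoicePersonalities", [])
        counts.update(tag.strip() for tag in tags)
    return counts


def english_voices(voices):
    """Return the short names of voices whose locale starts with 'en-'."""
    return [v["ShortName"] for v in voices if v.get("Locale", "").startswith("en-")]


async def fetch_voices():
    """Fetch the live voice list. Needs `pip install edge-tts` and a network connection."""
    import edge_tts  # imported here so the helpers above work without the package

    return await edge_tts.list_voices()


# Made-up sample entries (hypothetical names), just to exercise the helpers.
sample = [
    {"ShortName": "en-US-ExampleANeural", "Locale": "en-US",
     "VoiceTag": {"VoicePersonalities": ["Friendly", "Positive"]}},
    {"ShortName": "en-GB-ExampleBNeural", "Locale": "en-GB",
     "VoiceTag": {"VoicePersonalities": ["Warm", "Confident"]}},
    {"ShortName": "de-DE-ExampleCNeural", "Locale": "de-DE",
     "VoiceTag": {"VoicePersonalities": ["Friendly", "Positive "]}},
]

personality_counts = count_personalities(sample)
english_names = english_voices(sample)
```

To run it against the real list, something like `voices = asyncio.run(fetch_voices())` (after `import asyncio`) should do it, then pass `voices` to the two helpers.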

I wanted to compare the voices, so I wrote a [little tool that swaps between voices](https://gregr.org/tts-samples/) ([code](https://gist.github.com/greg-randall/09af04cfcedef0bc6f0a21336885a70e)) but keeps your place in the recording. I had to spend a while getting the volume of all the recordings to match, because I noticed that if a voice was slightly louder, I strongly preferred it. I wrote up a [bit of code](https://gist.github.com/greg-randall/de71c82c8543d39a5db59456b34e6a18) to try to fix that issue. I also struggled to find a piece of text that was a good test of the TTS engines. I picked a bit from *The Fall of the House of Usher*, which is out of copyright and also has some uncommon words.
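To give a feel for the volume-matching idea, here’s a minimal sketch of one simple approach: scale every clip to the same RMS level. This is not the code from the gist linked above, just an illustration. It assumes samples are floats in the range -1.0 to 1.0, and the `rms` and `match_rms` names and the 0.1 target are my own placeholders.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_rms(samples, target_rms=0.1):
    """Scale samples so their RMS equals target_rms, clamping to [-1.0, 1.0]."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # silence stays silence
    gain = target_rms / current
    # Clamping guards against clipping; if it kicks in, the final RMS ends up below target.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


# Two synthetic "recordings" at very different levels.
quiet = [0.01 * math.sin(i / 10) for i in range(1000)]
loud = [0.5 * math.sin(i / 7) for i in range(1000)]

quiet_matched = match_rms(quiet)
loud_matched = match_rms(loud)
```

RMS is only a rough stand-in for how loud something sounds. A perceptual measure such as LUFS (from ITU-R BS.1770) would probably track my "louder sounds better" bias more closely.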

---

![Voice Comparison Screenshot](https://gregr.org/wp-content/uploads/2025/02/image.png)

Then what I thought I really needed was a [blind A/B comparison of the voices](https://gregr.org/tts-samples/a-vs-b.php) to see which one was **the best**, which in turn necessitated some way to rank the choices. Initially I wrote some code to count the win/loss ratio for each voice, but that didn’t seem like the best way. So I worked up an [Elo](https://en.wikipedia.org/wiki/Elo_rating_system#Theory) ranking system, the kind used in chess, to [sort the results](https://gregr.org/tts-samples/a-vs-b_results.php).
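For anyone who hasn’t seen Elo before, here’s a small sketch of the standard update rule applied to A/B results. The starting rating of 1500 and the K-factor of 32 are common defaults that I’m assuming here, not necessarily what the results page uses, and the single match below is made up.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, a_won, k=32):
    """Return the new (rating_a, rating_b) after one A/B comparison."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating stays constant.
    return rating_a + delta, rating_b - delta


def rank(matches, start=1500, k=32):
    """Replay (winner, loser) pairs in order and return voices sorted best first."""
    ratings = {}
    for winner, loser in matches:
        winner_rating = ratings.get(winner, start)
        loser_rating = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update_elo(winner_rating, loser_rating, True, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# One made-up result: af_bella preferred over bm_lewis.
table = rank([("af_bella", "bm_lewis")])
```

The nice thing compared with a raw win/loss ratio is that beating a highly rated voice moves the ratings more than beating a low-rated one.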

I’ve run almost 200 blind A/B tests so far. I find the results pretty believable, though more would probably be better. If you want a quick TL;DR: try af\_bella and bm\_lewis.

I hope to share the code for the testing/ranking soon, but it’s all still in flux. It’s actually really hard to decide which pairing of voices will give you the most information.
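One heuristic for the pairing problem, sketched below, is to pick the two voices whose ratings are closest (their outcome is the least predictable) and break ties in favor of voices that have been compared the least. This is only an illustration of the idea, not the approach the site uses. The `pick_pair` name, the ratings, the game counts, and the `voice_x`/`voice_y` entries are all hypothetical.

```python
from itertools import combinations


def pick_pair(ratings, games_played):
    """Pick the pair with the smallest rating gap, then the fewest total comparisons."""
    def sort_key(pair):
        a, b = pair
        gap = abs(ratings[a] - ratings[b])
        seen = games_played.get(a, 0) + games_played.get(b, 0)
        return (gap, seen)

    return min(combinations(ratings, 2), key=sort_key)


# Made-up ratings and comparison counts.
ratings = {"af_bella": 1560, "bm_lewis": 1550, "voice_x": 1400, "voice_y": 1395}
games_played = {"af_bella": 20, "bm_lewis": 18, "voice_x": 3, "voice_y": 2}

next_pair = pick_pair(ratings, games_played)
```

A purely greedy rule like this can get stuck replaying the same close pair, so in practice you’d likely want to mix in some randomness.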

![rankings screenshot](https://gregr.org/wp-content/uploads/2025/02/image-1-1024x496.png)