
tests : add WER benchmarks #2454


Open
ggerganov opened this issue Oct 5, 2024 · 18 comments
Labels
help wanted (Extra attention is needed) · high priority (Very important issue) · research🔬 · roadmap (Part of a roadmap project)

Comments

@ggerganov
Member

It would be nice to start measuring the word error rate (WER) of whisper.cpp across some representative dataset:

  • short audio
  • long audio
  • English
  • non-English
  • etc.

This will help us catch regressions in the future. I'm not familiar with what is typically used for ASR WER benchmarks, so I'm looking for help from the community.

@ggerganov ggerganov added help wanted Extra attention is needed research🔬 labels Oct 5, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Oct 5, 2024
@ggerganov ggerganov changed the title whisper : add WER tests tests : add WER benchmarks Feb 4, 2025
@ggerganov ggerganov added roadmap Part of a roadmap project high priority Very important issue labels Feb 4, 2025
@harvestingmoon

harvestingmoon commented Feb 5, 2025

Hi Georgi, perhaps we could use LibriSpeech for measuring long audio (approximately 1000 hours, though it could be trimmed to fit the requirements). For short audio, we could use Libri-Light.

Alternatively, there are other audio datasets for measuring WER: https://github.com/jim-schwoebel/voice_datasets

I could start making small sample scripts to see how whisper.cpp fares on these datasets.

@ggerganov
Member Author

Thanks. Yes, I'm not sure what is typically used. But in general, I think any dataset would work. The main goal here is not to compare whisper.cpp's numbers with other numbers, but to create a reference set of WER numbers that we track as development continues. This would allow us to catch regressions when they appear, because the WER scores would get worse in such cases.

Ideally, we can have scripts that perform heavier benchmarks that developers would use locally. But we also need a mode where the scripts run just a few fast benchmarks that can be added to the CI without overloading it, so that they are computed on every commit.

@foldl
Collaborator

foldl commented Feb 7, 2025

@harvestingmoon are you working on this?

@harvestingmoon

@foldl Hi, yes, I'm looking at it. I'm most likely to start after the 12th, as it's currently the Chinese New Year period...

@foldl
Collaborator

foldl commented Feb 17, 2025

I think we need a tiny dataset (~10 MB) contained directly in this repo. WER could then be measured on the fly.

@harvestingmoon

harvestingmoon commented Feb 17, 2025

Sorry, please ignore the WER calculation above. I will develop another script, since the calculations are completely off from what they should be. I will also look for a smaller, lightweight dataset so that WER can be measured on the fly.

@harvestingmoon

harvestingmoon commented Feb 17, 2025

I have created a better and more robust lightweight script that meets the requirements, @foldl, @ggerganov.

The measured WER is 0.3.

It uses this lightweight dataset: https://arxiv.org/abs/2104.01497 and is based on NVIDIA's tutorial for calculating WER:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-evaluate.html


My script calculates the WER for each individual audio file as well as the overall average; here is the pull request: #2824
For context, WER is measured between 0 and 1, where lower is better. A WER of around 0.33 means the transcription accuracy is roughly 67%; the current measurement of 0.3 corresponds to about 70% accuracy, which is fairly good for a lightweight model.

Link for reference: https://huggingface.co/learn/audio-course/en/chapter5/evaluation
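
For illustration, here is a minimal sketch (not the script from the PR) of how per-file and overall WER could be computed in Python with the jiwer package, which is one common choice for this; the sentences are placeholder examples:

```python
# Minimal sketch (not the PR's script): per-file and overall WER using jiwer.
# The reference/hypothesis sentences below are placeholder examples.
import jiwer

references = [
    "the quick brown fox jumps over the lazy dog",
    "speech recognition is fun",
]
hypotheses = [
    "the quick brown fox jumped over the lazy dog",
    "speech recognition is fun",
]

# Per-file WER
for ref, hyp in zip(references, hypotheses):
    print(f"per-file WER: {jiwer.wer(ref, hyp):.3f}")

# Overall WER: passing lists weights each file by its number of reference words
# (total errors / total reference words), rather than averaging per-file scores.
print(f"overall WER: {jiwer.wer(references, hypotheses):.3f}")
```

On this scale, a WER of 0.3 means roughly 30% of the reference words were substituted, inserted, or deleted, i.e. about 70% word-level accuracy.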

@harvestingmoon

The pull request contains the script as well as the full ~10 MB dataset, making it fairly lightweight for measuring on the fly as well.

@ggerganov
Member Author

Hi @harvestingmoon, thank you for the effort, but I expect more attention to detail. Will close the PR for now and let someone else give this a try.

@WilliamTambellini
Contributor

Shouldn't the very first step be to add a minimalist edit-distance source file (header-only?) used to compute WER/TER? E.g.:
https://github.com/flashlight/flashlight/blob/f59d770b52ea678b039d9ba44693341ba80cf7c5/flashlight/fl/meter/EditDistanceMeter.h
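
For context, a minimal word-level edit-distance sketch (written here in Python rather than the header-only C++ that would actually live in the repo) looks roughly like this; WER is then the distance divided by the number of reference words:

```python
# Minimal word-level edit distance (Levenshtein) sketch; WER = distance / len(reference).
# Illustrative only; an in-repo version would be a small C++ header along the lines
# of flashlight's EditDistanceMeter linked above.
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # dp[j] holds the distance between the first i ref words and the first j hyp words
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                           # deletion (ref word dropped)
                dp[j - 1] + 1,                       # insertion (extra hyp word)
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution or match
            )
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```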

@redraskal
Collaborator

@ggerganov I'm not sure there is a reasonably sized dataset containing short audio, long audio, English, and non-English content.

What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

This would be lightweight for each commit.

We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.
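
For what it's worth, a rough sketch of what such a CI helper could look like in Python follows; the URLs, reference texts, binary path, and flags are placeholders/assumptions for illustration, not the project's actual interface:

```python
# Hypothetical sketch of the proposed CI WER check: download a small list of audio
# URLs per category, transcribe each with whisper.cpp, and print a WER table.
# The whisper-cli invocation and sample entries are assumptions, not the real setup.
import subprocess, urllib.request
from collections import defaultdict

import jiwer  # same WER backend as in the sketch above

SAMPLES = [
    # (category, audio_url, reference_transcript) -- placeholder entries
    ("short-en", "https://example.com/jfk.wav", "ask not what your country can do for you"),
]

def transcribe(model: str, wav: str) -> str:
    # Assumed CLI shape; adjust to the actual whisper.cpp binary and flags.
    out = subprocess.run(
        ["./build/bin/whisper-cli", "-m", model, "-f", wav, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    # Naive normalization; a real script would normalize text more carefully.
    return out.stdout.strip().lower()

def main(model: str = "models/ggml-base.en.bin") -> None:
    per_category = defaultdict(lambda: ([], []))  # category -> (refs, hyps)
    for category, url, reference in SAMPLES:
        wav = url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, wav)  # the CI could use wget/curl instead
        refs, hyps = per_category[category]
        refs.append(reference)
        hyps.append(transcribe(model, wav))

    print(f"{'category':<12} {'WER':>6}")
    for category, (refs, hyps) in per_category.items():
        print(f"{category:<12} {jiwer.wer(refs, hyps):>6.3f}")

if __name__ == "__main__":
    main()
```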

@ggerganov
Member Author

> What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

Yes, sounds good. The CI should download audio files with wget or curl and run WER tests on them. We can combine different sources at the start. Later on, we can use the more powerful nodes, such as the CUDA and M1 ones, to run larger datasets.

> We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.

Yes, a much bigger dataset for local testing would be useful.

@fujimotos
Contributor

fujimotos commented Apr 3, 2025

@ggerganov Hi. I have been working on this ticket for a while, and spent the last few days benchmarking whisper.cpp on the LibriSpeech corpus.

Here is a summary of my measurement results:

  • The following graph shows the recognition accuracy (measured in word error rate) on the LibriSpeech test-clean dataset.

[Figure: WER on LibriSpeech test-clean per model]

Comparison with OpenAI whisper

To illustrate the result shown above, the following table compares whisper.cpp's performance with OpenAI's official WER scores.

To put it briefly, the performance is pretty much comparable!

| Model | WER [whisper.cpp] | WER [openai-whisper] * |
|---|---|---|
| tiny | 6.90 | 6.7 |
| base | 4.81 | 4.9 |
| small | 3.37 | 3.3 |
| medium | 2.70 | 2.7 |
| large-v1 | 2.67 | 2.8 |
| large-v2 | 2.58 | 2.5 |
| large-v3 | 1.85 | Not published |
| large-v3-turbo | 1.92 | Not published |

How I performed the benchmark test

I submitted the code I wrote for the benchmark test in PR #2999. The code should be essentially the same as how OpenAI evaluates their models.

The testing process is fairly automated (via a Makefile), and I also attached some documentation on how to use it.

Please tell me if anything is unclear! I hope you find it interesting.
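
For readers who haven't seen it, the OpenAI-style evaluation essentially normalizes both the reference and the hypothesis text and then computes a corpus-level WER. A hedged sketch of that loop (not the actual code from PR #2999; `pairs` is a hypothetical iterable produced elsewhere):

```python
# Hedged sketch of an OpenAI-style evaluation loop: normalize reference and
# hypothesis text, then compute corpus-level WER. This is NOT the code from
# PR #2999; `pairs` is a hypothetical iterable of (reference, hypothesis) tuples.
import jiwer
from whisper.normalizers import EnglishTextNormalizer  # from the openai-whisper package

def corpus_wer(pairs):
    normalizer = EnglishTextNormalizer()
    refs, hyps = [], []
    for reference, hypothesis in pairs:
        refs.append(normalizer(reference))
        hyps.append(normalizer(hypothesis))
    # Corpus-level WER = total word errors / total reference words
    return jiwer.wer(refs, hyps)
```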

@ggerganov
Member Author

@fujimotos Thank you, this is very interesting! Will be taking a look in the next few days.

@fujimotos
Contributor

@ggerganov Thank you!


Technical note: how long it took to perform the full benchmark

This time, I rented an EC2 c8g.xlarge instance from AWS to perform the benchmark test.

It took roughly 80 hours to benchmark all eight model sizes. Here is the breakdown of the running time:

| Model | WER | Time [real] | Real-time factor |
|---|---|---|---|
| tiny | 6.90 | 28m | 0.08 |
| base | 4.81 | 56m | 0.17 |
| small | 3.37 | 3h2m | 0.56 |
| medium | 2.70 | 9h20m | 1.72 |
| large-v1 | 2.67 | 17h52m | 3.30 |
| large-v2 | 2.58 | 17h55m | 3.31 |
| large-v3 | 1.85 | 17h46m | 3.29 |
| large-v3-turbo | 1.92 | 14h28m | 2.67 |

Observation: trade-off between speed and accuracy

Looking at it from a different angle, I think this confirms the existence of a trade-off between speed and accuracy in whisper.cpp models.

The following graph should illustrate the relationship:

  • The X-axis ("Real-time factor") is computed as (inference time) / (audio length), so lower is better (see the quick check after this list).
  • Note that LibriSpeech test-clean contains 5 hours 24 minutes of speech.
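
As a quick sanity check of that definition (using the 5h24m of audio, i.e. 324 minutes), here is the arithmetic for the tiny model:

```python
# Real-time factor = (inference time) / (audio length); e.g. for the tiny model:
audio_minutes = 5 * 60 + 24           # LibriSpeech test-clean: 5h 24m of speech
tiny_inference_minutes = 28           # from the table above
print(tiny_inference_minutes / audio_minutes)  # ~0.086, consistent with the ~0.08 reported
```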

@ggerganov
Member Author

It would be interesting to perform these benchmarks with Q8_0 quantized models and see how the WER changes. But I think it would be better to run this on a GPU in order to reduce the processing time. Will see how this performs on my M2 Ultra - I think it would be much faster than the AWS instance.

@ggerganov ggerganov moved this from Todo to In Progress in whisper.cpp : roadmap Apr 4, 2025
@ggerganov
Member Author

Here are some results on M2 Ultra with Flash Attention enabled:

| Model | WER | Time [real] |
|---|---|---|
| base | 4.90 | 13m28s |
| base-q8_0 | 4.89 | 12m32s |
| small | 3.39 | 24m4s |
| small-q8_0 | 3.36 | 20m33s |

The timings might be a bit off because I was using the computer while the computations were running. But overall, there is no degradation in quality when going to the Q8 models, which is expected but good to confirm.
