
tests : add WER benchmarks #2454


Open
ggerganov opened this issue Oct 5, 2024 · 18 comments
Labels
help wanted (Extra attention is needed) · high priority (Very important issue) · research🔬 · roadmap (Part of a roadmap project)

Comments

@ggerganov
Member

It would be nice to start measuring the word error rate (WER) of whisper.cpp across some representative dataset:

  • short audio
  • long audio
  • English
  • non-English
  • etc.

This will help us catch regressions in the future. I'm not familiar with what is typically used for ASR WER benchmarks, so I'm looking for help from the community.

@ggerganov ggerganov added help wanted Extra attention is needed research🔬 labels Oct 5, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Oct 5, 2024
@ggerganov ggerganov changed the title whisper : add WER tests tests : add WER benchmarks Feb 4, 2025
@ggerganov ggerganov added roadmap Part of a roadmap project high priority Very important issue labels Feb 4, 2025
@harvestingmoon

harvestingmoon commented Feb 5, 2025

Hi Georgi, perhaps we could use LibriSpeech for measuring long audio (approximately 1000 hours, though it could be trimmed to fit the requirements). For short audio, we could use Libri-Light.

Alternatively, there are other audio datasets for measuring WER: https://github.com/jim-schwoebel/voice_datasets

I could start making small sample scripts to see how whisper.cpp fares on these datasets.

@ggerganov
Member Author

Thanks. Yes, I'm not sure what is typically used. But in general, I think any dataset would work. The main goal here is not to compare whisper.cpp's numbers with other numbers, but to create a reference set of WER numbers that we track as development continues. This would allow us to catch regressions when they appear, because the WER scores would get worse in such cases.

Ideally, we can have scripts that perform heavier benchmarks that developers would use locally. But we also need a mode where the scripts run just a few fast benchmarks that can be added to the CI without overloading it, so that they are computed on every commit.

@foldl
Collaborator

foldl commented Feb 7, 2025

@harvestingmoon are you working on this?

@harvestingmoon

@foldl Hi, yes, I'm looking at it. I'm most likely to start after the 12th, as it's currently the Chinese New Year period...

@foldl
Collaborator

foldl commented Feb 17, 2025

I think we need a tiny dataset (~10 MB) contained directly in this repo. WER could then be measured on the fly.

@harvestingmoon

harvestingmoon commented Feb 17, 2025

Sorry, please ignore the WER calculation above. I will develop another script, since the calculations are completely off from what they should be. I will also look for a smaller, lightweight dataset so that WER can be measured on the fly.

@harvestingmoon

harvestingmoon commented Feb 17, 2025

I have created a better and more robust lightweight script that meets the requirements, @foldl, @ggerganov.

The measured WER is 0.3.

It uses this lightweight dataset: https://arxiv.org/abs/2104.01497 and is based on NVIDIA's tutorial for calculating WER:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-evaluate.html


My script calculates the WER for each individual audio file as well as the overall average; here is the pull request: #2824
For context, WER is measured between 0 and 1, where lower is better. A WER of around 0.33 means the transcription accuracy is roughly 67%; the current measurement of 0.3 corresponds to about 70% accuracy, which is fairly good for a lightweight model.

Link for reference: https://huggingface.co/learn/audio-course/en/chapter5/evaluation
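
For illustration, here is a minimal sketch (not the script from the PR) of how per-file and overall WER could be computed in Python with the jiwer package, which is one common choice for this; the sentences are placeholder examples:

```python
# Minimal sketch (not the PR's script): per-file and overall WER using jiwer.
# The reference/hypothesis sentences below are placeholder examples.
import jiwer

references = [
    "the quick brown fox jumps over the lazy dog",
    "speech recognition is fun",
]
hypotheses = [
    "the quick brown fox jumped over the lazy dog",
    "speech recognition is fun",
]

# Per-file WER
for ref, hyp in zip(references, hypotheses):
    print(f"per-file WER: {jiwer.wer(ref, hyp):.3f}")

# Overall WER: passing lists weights each file by its number of reference words
# (total errors / total reference words), rather than averaging per-file scores.
print(f"overall WER: {jiwer.wer(references, hypotheses):.3f}")
```

On this scale, a WER of 0.3 means roughly 30% of the reference words were substituted, inserted, or deleted, i.e. about 70% word-level accuracy.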

@harvestingmoon

The pull request contains the script as well as the full ~10 MB dataset, making it fairly lightweight for measuring on the fly as well.

@ggerganov
Member Author

Hi @harvestingmoon, thank you for the effort, but I expect more attention to detail. Will close the PR for now and let someone else give this a try.

@WilliamTambellini
Contributor

Shouldn't the very first step be to add a minimalist edit-distance source file (header-only?) used to compute WER/TER? E.g.:
https://github.com/flashlight/flashlight/blob/f59d770b52ea678b039d9ba44693341ba80cf7c5/flashlight/fl/meter/EditDistanceMeter.h
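
For context, a minimal word-level edit-distance sketch (written here in Python rather than the header-only C++ that would actually live in the repo) looks roughly like this; WER is then the distance divided by the number of reference words:

```python
# Minimal word-level edit distance (Levenshtein) sketch; WER = distance / len(reference).
# Illustrative only; an in-repo version would be a small C++ header along the lines
# of flashlight's EditDistanceMeter linked above.
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # dp[j] holds the distance between the first i ref words and the first j hyp words
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                           # deletion (ref word dropped)
                dp[j - 1] + 1,                       # insertion (extra hyp word)
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution or match
            )
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```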

@redraskal
Collaborator

@ggerganov I'm not sure there is a reasonably sized dataset containing short audio, long audio, English, and non-English content.

What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

This would be lightweight for each commit.

We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.
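
For what it's worth, a rough sketch of what such a CI helper could look like in Python follows; the URLs, reference texts, binary path, and flags are placeholders/assumptions for illustration, not the project's actual interface:

```python
# Hypothetical sketch of the proposed CI WER check: download a small list of audio
# URLs per category, transcribe each with whisper.cpp, and print a WER table.
# The whisper-cli invocation and sample entries are assumptions, not the real setup.
import subprocess, urllib.request
from collections import defaultdict

import jiwer  # same WER backend as in the sketch above

SAMPLES = [
    # (category, audio_url, reference_transcript) -- placeholder entries
    ("short-en", "https://example.com/jfk.wav", "ask not what your country can do for you"),
]

def transcribe(model: str, wav: str) -> str:
    # Assumed CLI shape; adjust to the actual whisper.cpp binary and flags.
    out = subprocess.run(
        ["./build/bin/whisper-cli", "-m", model, "-f", wav, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    # Naive normalization; a real script would normalize text more carefully.
    return out.stdout.strip().lower()

def main(model: str = "models/ggml-base.en.bin") -> None:
    per_category = defaultdict(lambda: ([], []))  # category -> (refs, hyps)
    for category, url, reference in SAMPLES:
        wav = url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, wav)  # the CI could use wget/curl instead
        refs, hyps = per_category[category]
        refs.append(reference)
        hyps.append(transcribe(model, wav))

    print(f"{'category':<12} {'WER':>6}")
    for category, (refs, hyps) in per_category.items():
        print(f"{category:<12} {jiwer.wer(refs, hyps):>6.3f}")

if __name__ == "__main__":
    main()
```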

@ggerganov
Member Author

> What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

Yes, sounds good. The CI should download audio files with wget or curl and run WER tests on them. We can combine different sources at the start. Later on, we can use the more powerful nodes, such as the CUDA and M1 ones, to run larger datasets.

> We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.

Yes, a much bigger dataset for local testing would be useful.

@fujimotos
Contributor

fujimotos commented Apr 3, 2025

@ggerganov Hi. I have been working on this ticket for a while, and spent the last few days benchmarking whisper.cpp on the LibriSpeech corpus.

Here is a summary of my measurement results:

  • The following graph shows the recognition accuracy (measured in word error rate) on the LibriSpeech test-clean dataset.

[Figure: WER on LibriSpeech test-clean per model]

Comparison with OpenAI whisper

To illustrate the result shown above, the following table compares whisper.cpp's performance with OpenAI's official WER scores.

To put it briefly, the performance is pretty much comparable!

| Model | WER [whisper.cpp] | WER [openai-whisper] * |
|---|---|---|
| tiny | 6.90 | 6.7 |
| base | 4.81 | 4.9 |
| small | 3.37 | 3.3 |
| medium | 2.70 | 2.7 |
| large-v1 | 2.67 | 2.8 |
| large-v2 | 2.58 | 2.5 |
| large-v3 | 1.85 | Not published |
| large-v3-turbo | 1.92 | Not published |

How I performed the benchmark test

I submitted the code I wrote for the benchmark test in PR #2999. The code should be essentially the same as how OpenAI evaluates their models.

The testing process is fairly automated (via a Makefile), and I also attached some documentation on how to use it.

Please tell me if anything is unclear! I hope you find it interesting.
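
For readers who haven't seen it, the OpenAI-style evaluation essentially normalizes both the reference and the hypothesis text and then computes a corpus-level WER. A hedged sketch of that loop (not the actual code from PR #2999; `pairs` is a hypothetical iterable produced elsewhere):

```python
# Hedged sketch of an OpenAI-style evaluation loop: normalize reference and
# hypothesis text, then compute corpus-level WER. This is NOT the code from
# PR #2999; `pairs` is a hypothetical iterable of (reference, hypothesis) tuples.
import jiwer
from whisper.normalizers import EnglishTextNormalizer  # from the openai-whisper package

def corpus_wer(pairs):
    normalizer = EnglishTextNormalizer()
    refs, hyps = [], []
    for reference, hypothesis in pairs:
        refs.append(normalizer(reference))
        hyps.append(normalizer(hypothesis))
    # Corpus-level WER = total word errors / total reference words
    return jiwer.wer(refs, hyps)
```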

@ggerganov
Member Author

@fujimotos Thank you, this is very interesting! Will be taking a look in the next few days.

@fujimotos
Contributor

@ggerganov Thank you!


Technical note: how long it took to perform the full benchmark

This time, I rented an EC2 c8g.xlarge instance from AWS to perform the benchmark test.

It took roughly 80 hours to benchmark all eight model sizes. Here is the breakdown of the running time:

| Model | WER | Time [real] | Real-time factor |
|---|---|---|---|
| tiny | 6.90 | 28m | 0.08 |
| base | 4.81 | 56m | 0.17 |
| small | 3.37 | 3h2m | 0.56 |
| medium | 2.70 | 9h20m | 1.72 |
| large-v1 | 2.67 | 17h52m | 3.30 |
| large-v2 | 2.58 | 17h55m | 3.31 |
| large-v3 | 1.85 | 17h46m | 3.29 |
| large-v3-turbo | 1.92 | 14h28m | 2.67 |

Observation: trade-off between speed and accuracy

Looking at it from a different angle, I think this confirms the existence of a trade-off between speed and accuracy in whisper.cpp models.

The following graph should illustrate the relationship:

  • The X-axis ("Real-time factor") is computed as (inference time) / (audio length), so lower is better (see the quick check after this list).
  • Note that LibriSpeech test-clean contains 5 hours 24 minutes of speech.
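
As a quick sanity check of that definition (using the 5h24m of audio, i.e. 324 minutes), here is the arithmetic for the tiny model:

```python
# Real-time factor = (inference time) / (audio length); e.g. for the tiny model:
audio_minutes = 5 * 60 + 24           # LibriSpeech test-clean: 5h 24m of speech
tiny_inference_minutes = 28           # from the table above
print(tiny_inference_minutes / audio_minutes)  # ~0.086, consistent with the ~0.08 reported
```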

@ggerganov
Member Author

It would be interesting to perform these benchmarks with Q8_0 quantized models and see how the WER changes. But I think it would be better to run this on a GPU in order to reduce the processing time. Will see how this performs on my M2 Ultra - I think it would be much faster than the AWS instance.

@ggerganov ggerganov moved this from Todo to In Progress in whisper.cpp : roadmap Apr 4, 2025
@ggerganov
Member Author

Here are some results on M2 Ultra with Flash Attention enabled:

| Model | WER | Time [real] |
|---|---|---|
| base | 4.90 | 13m28s |
| base-q8_0 | 4.89 | 12m32s |
| small | 3.39 | 24m4s |
| small-q8_0 | 3.36 | 20m33s |

The timings might be a bit off because I was using the computer while the computations were running. But overall, there is no degradation in quality when going to the Q8 models, which is expected but good to confirm.
