new WER script #2824

Closed
wants to merge 4 commits into from

harvestingmoon

WER testing based on speaker 6097 of the Hifi-TTS dataset. The included audio is ~10 MB and consists of dozens of short clips of about 10 seconds each. WER_Scripting.py then calculates the WER via dynamic programming (DP).

@foldl
Collaborator

foldl commented Feb 17, 2025

Are those Java class files redundant?

@harvestingmoon
Author

Ah yes, I think those Gradle files are redundant, please ignore them.

Only use the files that are in wer_testing

test.py Outdated
Collaborator

This file is not used.

Author

Fixed under new script

@harvestingmoon
Author

harvestingmoon commented Feb 18, 2025

I have also added a preview to the CLI output, which is printed on every iteration of the audio loop:

../build/bin/whisper-cli -m ../models/ggml-base.en.bin -t 4 -p 1 -f ./6097_5_mins/audio/presentpictureofnsw_02_mann_0083.wav
Word transcribed is : ['and all boats to be moored within the hospital waff and hulk.']
Actual word is: and all boats to be moored within the hospital wharf and hulk
wer for audio/presentpictureofnsw_02_mann_0083.wav is 0.17
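
The 0.17 in the preview above is consistent with a word-level edit distance of 2 over the 12 reference words ("waff" for "wharf", and "hulk." for "hulk" because of the trailing period). A minimal sketch of WER via dynamic programming follows. It is not the PR's WER_Scripting.py: the names `word_errors`, `wer`, and `normalize` are hypothetical, whitespace tokenization is assumed, and the normalization step is an illustrative addition to show how much punctuation alone can move the score.

```python
import string


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    # prev[j] = distance between the reference words seen so far and hyp_words[:j]
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1]


def normalize(text):
    """Hypothetical normalization: lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference, hypothesis):
    """Word errors divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


ref = "and all boats to be moored within the hospital wharf and hulk"
hyp = "and all boats to be moored within the hospital waff and hulk."

print(round(wer(ref, hyp), 2))                        # 0.17 (2 errors / 12 words)
print(round(wer(normalize(ref), normalize(hyp)), 2))  # 0.08 (1 error / 12 words)
```

Whether punctuation and casing should count as errors is a design choice; if they should not, both strings need the same normalization before scoring.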

@ggerganov
Member

  • Don't push changes to the bindings
  • The audio files should not be committed in the repo, but should be downloaded instead
  • The script should not parse stdout - use existing file output options
  • Should handle short audio inputs (i.e. less than 1s)
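
The point about not parsing stdout could look roughly like the sketch below. It assumes that whisper-cli accepts `-otxt` (write a plain-text transcript), `-of` (output path without extension), and `-np` (suppress prints other than results), as I understand its help text; check `whisper-cli --help` for your build before relying on them. The helper names `build_command` and `transcribe` are hypothetical, and the binary and model paths are the ones from the preview earlier in this thread.

```python
import subprocess
import tempfile
from pathlib import Path


def build_command(cli, model, wav, out_base, threads=4):
    """Build a whisper-cli invocation that writes the transcript to <out_base>.txt."""
    return [
        str(cli),
        "-m", str(model),
        "-t", str(threads),
        "-f", str(wav),
        "-otxt",               # assumed flag: write a plain-text transcript
        "-of", str(out_base),  # assumed flag: output path without extension
        "-np",                 # assumed flag: no prints besides the results
    ]


def transcribe(cli, model, wav):
    """Run whisper-cli and return the transcript read from its output file."""
    with tempfile.TemporaryDirectory() as tmp:
        out_base = Path(tmp) / Path(wav).stem
        subprocess.run(build_command(cli, model, wav, out_base),
                       check=True, capture_output=True)
        # Append ".txt" by hand so stems that contain dots are not truncated.
        return Path(str(out_base) + ".txt").read_text(encoding="utf-8").strip()


# Show the command that would be run (nothing is executed here).
print(build_command("../build/bin/whisper-cli",
                    "../models/ggml-base.en.bin",
                    "clip.wav",
                    "out/clip"))
```

Reading the transcript from a file keeps the script independent of whatever else the CLI happens to print.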

@ggerganov ggerganov closed this Feb 18, 2025
@foldl
Collaborator

foldl commented Feb 19, 2025

  • The audio files should not be committed in the repo, but should be downloaded instead

Downloadable datasets are huge, with several hundred hours of audio, whereas I think this work item is about creating some lightweight WER tests that can be integrated into GitHub workflows.

If a tiny dataset is contained in the repo, then WER benchmarking will work out of the box and on the fly. A carefully selected tiny dataset can also cover English and non-English speech, both clean and noisy, which would be handy. For example, if some sort of noise cancellation is added, a few noisy audio files can be added to the dataset and benchmarked quickly and easily.

Anyway, I was the one who suggested adding audio files to this repo. My apologies to @harvestingmoon.

@harvestingmoon
Author

harvestingmoon commented Feb 19, 2025

  • Should handle short audio inputs (i.e. less than 1s)

I have tried short audio inputs using the Google Speech Commands dataset, which contains audio files of approximately 1 s each. However, whisper.cpp is unable to capture any words at all from them (I believe because the inputs are too short), which makes it difficult to calculate WER. Hence, I switched over to the Hifi-TTS dataset.
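
An empty transcript does not make WER undefined: with no hypothesis words, every reference word counts as a deletion, so the per-clip WER is 1.0. One common way to keep very short clips from dominating a benchmark is to pool errors over the whole set instead of averaging per-clip scores. A minimal sketch, assuming whitespace tokenization; `edit_distance` and `corpus_wer` are hypothetical names, and the sample pairs are made up for illustration (they are not results from any dataset).

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Total word errors divided by total reference words over (ref, hyp) pairs."""
    errors = 0
    words = 0
    for ref, hyp in pairs:
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("no reference words in the corpus")
    return errors / words


pairs = [
    ("yes", ""),       # nothing recognized: 1 deletion
    ("stop", "stop"),  # exact match: 0 errors
    ("and all boats to be moored", "and all boats to be moored"),
]

print(corpus_wer(pairs))  # 0.125 (1 error / 8 reference words)
```

Averaging the per-clip scores for the same pairs would give 0.33, since the single-word clip would weigh as much as the six-word one.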

No worries @foldl! I am glad to help and contribute 😄

I can continue developing the script, slowly, if given the green light 👍🏼

3 participants