Memory leak when using the server mode #2605

Open
wjzsuperman opened this issue Dec 4, 2024 · 1 comment

Comments

@wjzsuperman

Hello. When I use the server mode, I see what looks like a memory leak.
Monitoring shows that after each server call, the container's memory does not return to its pre-call level.
My audio files are no larger than 30 MB, but each call leaves about 350 MB unreleased.
When I use the main mode instead, memory returns to its normal level.
The figure below shows the memory usage: the first part is the server mode and the second is the main mode.
[Image: WeCom screenshot (企业微信截图_1733133167846)]
I am not a C++ developer. I used valgrind to analyze the memory; please help me check it. The output is as follows:
^C==354==
==354== Process terminating with default action of signal 2 (SIGINT)
==354== at 0x4B9C3CA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==354== by 0x15D8CF: ggml_graph_compute_thread (in /asr/bin/server)
==354== by 0x4B9986D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==354== by 0x4BE4608: start_thread (pthread_create.c:477)
==354== by 0x4D20352: clone (clone.S:95)
==354==
==354== HEAP SUMMARY:
==354== in use at exit: 2,166,601,182 bytes in 108,817 blocks
==354== total heap usage: 127,213 allocs, 18,396 frees, 2,315,489,188 bytes allocated
==354==
==354== 304 bytes in 1 blocks are possibly lost in loss record 160 of 507
==354== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==354== by 0x40149DA: allocate_dtv (dl-tls.c:286)
==354== by 0x40149DA: _dl_allocate_tls (dl-tls.c:532)
==354== by 0x4BE5322: allocate_stack (allocatestack.c:622)
==354== by 0x4BE5322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==354== by 0x49250C9: std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==354== by 0x1D2E8A: whisper_full_parallel (in /asr/bin/server)
==354== by 0x21A699: main::{lambda(httplib::Request const&, httplib::Response&)#3}::operator()(httplib::Request const&, httplib::Response&) const (in /asr/bin/server)
==354== by 0x223D78: httplib::Server::dispatch_request(httplib::Request&, httplib::Response&, std::vector<std::pair<std::unique_ptr<httplib::detail::MatcherBase, std::default_delete<httplib::detail::MatcherBase> >, std::function<void (httplib::Request const&, httplib::Response&)> >, std::allocator<std::pair<std::unique_ptr<httplib::detail::MatcherBase, std::default_delete<httplib::detail::MatcherBase> >, std::function<void (httplib::Request const&, httplib::Response&)> > > > const&) (in /asr/bin/server)
==354== by 0x24400B: httplib::Server::routing(httplib::Request&, httplib::Response&, httplib::Stream&) (in /asr/bin/server)
==354== by 0x244FE9: httplib::Server::process_request(httplib::Stream&, bool, bool&, std::function<void (httplib::Request&)> const&) (in /asr/bin/server)
==354== by 0x245B87: httplib::Server::process_and_close_socket(int) (in /asr/bin/server)
==354== by 0x21F9FC: std::thread::_State_impl<std::thread::_Invoker<std::tuple<httplib::ThreadPool::worker> > >::_M_run() (in /asr/bin/server)
==354== by 0x4924DF3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==354==
==354== 912 bytes in 3 blocks are possibly lost in loss record 263 of 507
==354== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==354== by 0x40149DA: allocate_dtv (dl-tls.c:286)
==354== by 0x40149DA: _dl_allocate_tls (dl-tls.c:532)
==354== by 0x4BE5322: allocate_stack (allocatestack.c:622)
==354== by 0x4BE5322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==354== by 0x4B99ECA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==354== by 0x4B918E0: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==354== by 0x161613: ggml_graph_compute (in /asr/bin/server)
==354== by 0x16E456: ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) (in /asr/bin/server)
==354== by 0x1738BA: ggml_backend_sched_graph_compute_async (in /asr/bin/server)
==354== by 0x1739B2: ggml_backend_sched_graph_compute (in /asr/bin/server)
==354== by 0x1B841C: whisper_encode_internal(whisper_context&, whisper_state&, int, int, bool (*)(void*), void*) (in /asr/bin/server)
==354== by 0x1B85E2: whisper_encode_with_state (in /asr/bin/server)
==354== by 0x1BCCD5: whisper_lang_auto_detect_with_state (in /asr/bin/server)
==354==
==354== 912 bytes in 3 blocks are possibly lost in loss record 264 of 507
==354== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==354== by 0x40149DA: allocate_dtv (dl-tls.c:286)
==354== by 0x40149DA: _dl_allocate_tls (dl-tls.c:532)
==354== by 0x4BE5322: allocate_stack (allocatestack.c:622)
==354== by 0x4BE5322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==354== by 0x49250C9: std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==354== by 0x1B93B7: log_mel_spectrogram(whisper_state&, float const*, int, int, int, int, int, int, whisper_filters const&, bool, whisper_mel&) [clone .constprop.0] (in /asr/bin/server)
==354== by 0x1BA71D: whisper_pcm_to_mel_with_state (in /asr/bin/server)
==354== by 0x1CDB3B: whisper_full_with_state (in /asr/bin/server)
==354== by 0x1D2FA4: whisper_full_parallel (in /asr/bin/server)
==354== by 0x21A699: main::{lambda(httplib::Request const&, httplib::Response&)#3}::operator()(httplib::Request const&, httplib::Response&) const (in /asr/bin/server)
==354== by 0x223D78: httplib::Server::dispatch_request(httplib::Request&, httplib::Response&, std::vector<std::pair<std::unique_ptr<httplib::detail::MatcherBase, std::default_delete<httplib::detail::MatcherBase> >, std::function<void (httplib::Request const&, httplib::Response&)> >, std::allocator<std::pair<std::unique_ptr<httplib::detail::MatcherBase, std::default_delete<httplib::detail::MatcherBase> >, std::function<void (httplib::Request const&, httplib::Response&)> > > > const&) (in /asr/bin/server)
==354== by 0x24400B: httplib::Server::routing(httplib::Request&, httplib::Response&, httplib::Stream&) (in /asr/bin/server)
==354== by 0x244FE9: httplib::Server::process_request(httplib::Stream&, bool, bool&, std::function<void (httplib::Request&)> const&) (in /asr/bin/server)
==354==
==354== 2,128 bytes in 7 blocks are possibly lost in loss record 292 of 507
==354== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==354== by 0x40149DA: allocate_dtv (dl-tls.c:286)
==354== by 0x40149DA: _dl_allocate_tls (dl-tls.c:532)
==354== by 0x4BE5322: allocate_stack (allocatestack.c:622)
==354== by 0x4BE5322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==354== by 0x49250C9: std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==354== by 0x2265F0: void std::vector<std::thread, std::allocator<std::thread> >::_M_realloc_insert<httplib::ThreadPool::worker>(__gnu_cxx::__normal_iterator<std::thread*, std::vector<std::thread, std::allocator<std::thread> > >, httplib::ThreadPool::worker&&) (in /asr/bin/server)
==354== by 0x2268A9: std::_Function_handler<httplib::TaskQueue* (), httplib::Server::Server()::{lambda()#1}>::_M_invoke(std::_Any_data const&) (in /asr/bin/server)
==354== by 0x223785: httplib::Server::listen_internal() (in /asr/bin/server)
==354== by 0x1221E9: main (in /asr/bin/server)
==354==
==354== 2,432 bytes in 8 blocks are possibly lost in loss record 297 of 507
==354== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==354== by 0x40149DA: allocate_dtv (dl-tls.c:286)
==354== by 0x40149DA: _dl_allocate_tls (dl-tls.c:532)
==354== by 0x4BE5322: allocate_stack (allocatestack.c:622)
==354== by 0x4BE5322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==354== by 0x49250C9: std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==354== by 0x1B93B7: log_mel_spectrogram(whisper_state&, float const*, int, int, int, int, int, int, whisper_filters const&, bool, whisper_mel&) [clone .constprop.0] (in /asr/bin/server)
==354== by 0x1BA71D: whisper_pcm_to_mel_with_state (in /asr/bin/server)
==354== by 0x1CDB3B: whisper_full_with_state (in /asr/bin/server)
==354== by 0x1D3B41: std::thread::_State_impl<std::thread::_Invoker<std::tuple<int (*)(whisper_context*, whisper_state*, whisper_full_params, float const*, int), whisper_context*, whisper_state*, whisper_full_params, float const*, int> > >::_M_run() (in /asr/bin/server)
==354== by 0x4924DF3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==354== by 0x4BE4608: start_thread (pthread_create.c:477)
==354== by 0x4D20352: clone (clone.S:95)
==354==
==354== 17,024 bytes in 56 blocks are possibly lost in loss record 361 of 507
==354== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==354== by 0x40149DA: allocate_dtv (dl-tls.c:286)
==354== by 0x40149DA: _dl_allocate_tls (dl-tls.c:532)
==354== by 0x4BE5322: allocate_stack (allocatestack.c:622)
==354== by 0x4BE5322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==354== by 0x49250C9: std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==354== by 0x22686C: std::_Function_handler<httplib::TaskQueue* (), httplib::Server::Server()::{lambda()#1}>::_M_invoke(std::_Any_data const&) (in /asr/bin/server)
==354== by 0x223785: httplib::Server::listen_internal() (in /asr/bin/server)
==354== by 0x1221E9: main (in /asr/bin/server)
==354==
==354== LEAK SUMMARY:
==354== definitely lost: 0 bytes in 0 blocks
==354== indirectly lost: 0 bytes in 0 blocks
==354== possibly lost: 23,712 bytes in 78 blocks
==354== still reachable: 2,166,577,470 bytes in 108,739 blocks
==354== suppressed: 0 bytes in 0 blocks
==354== Reachable blocks (those to which a pointer was found) are not shown.
==354== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==354==
==354== For lists of detected and suppressed errors, rerun with: -s
==354== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 0 from 0)

I also analyzed the memory with the main mode; perhaps the same cause triggers the problem there.
valgrind --leak-check=full ./main -osrt -m /root/.cache/models/ggml-base.bin -f test.wav
==19064== Memcheck, a memory error detector
==19064== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==19064== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==19064== Command: ./main -osrt -m /root/.cache/models/ggml-base.bin -f test.wav
==19064==
whisper_init_from_file_with_params_no_state: loading model from '/root/.cache/models/ggml-base.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 147.37 MB
whisper_model_load: model size = 147.37 MB
whisper_init_state: kv self size = 6.29 MB
whisper_init_state: kv cross size = 18.87 MB
whisper_init_state: kv pad size = 3.15 MB
whisper_init_state: compute buffer (conv) = 16.26 MB
whisper_init_state: compute buffer (encode) = 85.86 MB
whisper_init_state: compute buffer (cross) = 4.65 MB
whisper_init_state: compute buffer (decode) = 96.35 MB

system_info: n_threads = 4 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0

main: processing 'test.wav' (5587296 samples, 349.2 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

^C==19064==
==19064== Process terminating with default action of signal 2 (SIGINT)
==19064== at 0x120AD0: ggml_vec_dot_f16 (in /asr/bin/main)
==19064== by 0x1346E2: ggml_compute_forward_mul_mat (in /asr/bin/main)
==19064== by 0x1592E1: ggml_graph_compute_thread (in /asr/bin/main)
==19064== by 0x4B918E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==19064== by 0x15D053: ggml_graph_compute (in /asr/bin/main)
==19064== by 0x169E96: ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) (in /asr/bin/main)
==19064== by 0x16F2FA: ggml_backend_sched_graph_compute_async (in /asr/bin/main)
==19064== by 0x16F3F2: ggml_backend_sched_graph_compute (in /asr/bin/main)
==19064== by 0x1B3F31: whisper_encode_internal(whisper_context&, whisper_state&, int, int, bool (*)(void*), void*) (in /asr/bin/main)
==19064== by 0x1C9BA9: whisper_full_with_state (in /asr/bin/main)
==19064== by 0x1CF04A: whisper_full_parallel (in /asr/bin/main)
==19064== by 0x11C477: main (in /asr/bin/main)
==19064==
==19064== HEAP SUMMARY:
==19064== in use at exit: 675,274,316 bytes in 105,230 blocks
==19064== total heap usage: 108,955 allocs, 3,725 frees, 783,778,742 bytes allocated
==19064==
==19064== 912 bytes in 3 blocks are possibly lost in loss record 167 of 277
==19064== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==19064== by 0x40149DA: allocate_dtv (dl-tls.c:286)
==19064== by 0x40149DA: _dl_allocate_tls (dl-tls.c:532)
==19064== by 0x4BE5322: allocate_stack (allocatestack.c:622)
==19064== by 0x4BE5322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==19064== by 0x49250C9: std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==19064== by 0x1B4DF7: log_mel_spectrogram(whisper_state&, float const*, int, int, int, int, int, int, whisper_filters const&, bool, whisper_mel&) [clone .constprop.0] (in /asr/bin/main)
==19064== by 0x1B615D: whisper_pcm_to_mel_with_state (in /asr/bin/main)
==19064== by 0x1C957B: whisper_full_with_state (in /asr/bin/main)
==19064== by 0x1CF04A: whisper_full_parallel (in /asr/bin/main)
==19064== by 0x11C477: main (in /asr/bin/main)
==19064==
==19064== LEAK SUMMARY:
==19064== definitely lost: 0 bytes in 0 blocks
==19064== indirectly lost: 0 bytes in 0 blocks
==19064== possibly lost: 912 bytes in 3 blocks
==19064== still reachable: 675,273,404 bytes in 105,227 blocks
==19064== suppressed: 0 bytes in 0 blocks
==19064== Reachable blocks (those to which a pointer was found) are not shown.
==19064== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==19064==
==19064== For lists of detected and suppressed errors, rerun with: -s
==19064== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

@davens
Contributor

davens commented Jan 2, 2025

Hello.

TLDR: I don't use the server myself, but I have a suggestion after looking at the server code. Call the "load" model endpoint to see if that stops or reduces the leak. In the server code, the load endpoint calls whisper_free, which should clean up the context and state. You could call it every time, or only when memory usage hits a certain point.
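
A rough sketch of the "reload when memory usage hits a certain point" idea, in Python. Several things here are assumptions to verify against your own server build: that the endpoint path is `/load`, that it accepts a multipart form field named `model`, and that the server's PID is known. The helper names (`parse_vm_rss`, `should_reload`, `build_load_request`, `reload_if_needed`) and the 1 GiB threshold are made up for illustration, and reading `/proc` only works on Linux.

```python
import urllib.request
import uuid


def parse_vm_rss(status_text: str) -> int:
    """Return the VmRSS value from /proc/<pid>/status text, in bytes."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("VmRSS not found")


def rss_bytes(pid: int) -> int:
    """Read the resident memory of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss(f.read())


def should_reload(rss: int, threshold: int) -> bool:
    """True once resident memory reaches the chosen threshold."""
    return rss >= threshold


def build_load_request(base_url: str, model_path: str) -> urllib.request.Request:
    """Build a multipart/form-data POST with one text field, `model`.

    The `/load` path and the `model` field name are assumptions.
    """
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model_path}\r\n"
        f"--{boundary}--\r\n"
    ).encode()
    return urllib.request.Request(
        f"{base_url}/load",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def reload_if_needed(pid, base_url, model_path, threshold=1 << 30):
    """Ask the server to reload its model if RSS is at or above threshold."""
    if should_reload(rss_bytes(pid), threshold):
        request = build_load_request(base_url, model_path)
        with urllib.request.urlopen(request) as response:
            return response.status
    return None
```

You could run `reload_if_needed` between transcription requests. Reloading while a request is in flight is probably unsafe, so it would need to be serialized with the calls that use the model.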

For devs to consider: I've noticed a couple of memory issues with whisper.cpp:

  1. After inference, whisper retains context and then leaks on subsequent inferences (but whisper_free cleans up).
  2. whisper can't run inference on more than 30 seconds at a time, so whisper.cpp must be stitching together 30-second inferences. Unfortunately, it appears to allocate all the context required for the full audio passed in, rather than reusing what is needed for 30 seconds. Memory really blows up, compounding problem (1). I have worked around this by calling whisper.cpp with only 30 seconds of audio at a time, overlapping the audio windows, and using prompt tokens.
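
The 30-second windowing workaround in item 2 could be sketched like this in Python. It only computes the window boundaries. It assumes 16 kHz samples (consistent with the log above: 5587296 samples for 349.2 sec), and the function name `overlapping_windows` and the 5-second overlap are hypothetical choices, not something whisper.cpp prescribes. Feeding the previous window's tokens in as the prompt is left to the caller.

```python
def overlapping_windows(n_samples, sample_rate=16000, window_s=30.0, overlap_s=5.0):
    """Return (start, end) sample index pairs that cover [0, n_samples).

    Consecutive windows share `overlap_s` seconds of audio; the last
    window is shortened to fit the end of the input.
    """
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if window <= 0 or step <= 0:
        raise ValueError("window must be positive and longer than the overlap")

    spans = []
    start = 0
    while start < n_samples:
        end = min(start + window, n_samples)
        spans.append((start, end))
        if end == n_samples:
            break  # the input is fully covered
        start += step
    return spans


# The 349.2 second file from the log splits into 14 windows.
spans = overlapping_windows(5587296)
print(len(spans), spans[0], spans[-1])
```

Each `(start, end)` pair would then be sliced out of the sample buffer and passed to its own transcription call, so the memory needed per call stays bounded by the window length rather than the full file.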
