MBASE Non-blocking LLM inference in C++ built on top of llama.cpp #12005
What do I Mean by Non-blocking

Let's set LLMs aside for a second and think about IO management in a program. IO and network operations are expensive: their performance is limited by disk read/write speed or, for network operations, heavily influenced by your network environment. In the IO scenario, when you want to write multiple gigabytes of data to disk, you need a mechanism in your program so that the write won't block your main application logic. You might do this by writing the data in chunks, say 1 KB per iteration, or by doing the IO on a separate thread and writing a synchronization mechanism tailored to your needs, or by using an async IO library that handles this for you. In my opinion, LLM inference deserves its own non-blocking terminology, because operations such as model initialization (llama_model_load_from_file), destruction (llama_model_free), context creation (llama_init_from_model), and the encoder/decoder calls (llama_encode/llama_decode) are extremely expensive, which makes them really difficult to integrate into your main application logic. Even with a high-end GPU, the amount of time your program halts on llama_model_load_from_file, llama_init_from_model, and llama_encode/llama_decode keeps people from integrating LLMs into their applications. This SDK applies those operations in a non-blocking manner: model initialization, destruction, context creation, and the encoder/decoder methods don't block your main thread, and synchronization is handled by the MBASE SDK. Using this, you will be able to load/unload multiple models, create contexts, and run encode/decode operations all at the same time without blocking your main application thread, because MBASE handles those operations in parallel and provides synchronized callbacks, so you won't need to deal with the issues that arise from parallel programming.
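To make the idea concrete, here is a minimal sketch of what "non-blocking" means here, using only plain llama.cpp and the standard library (this is not the MBASE API): the expensive load runs on a worker thread while the main loop keeps ticking and only picks up the result once it is ready. The model path is a placeholder.

```cpp
// Minimal sketch: offload the blocking llama.cpp model load to a worker thread
// and poll its completion from the main loop, so the main thread never stalls.
#include <chrono>
#include <cstdio>
#include <future>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params params = llama_model_default_params();

    // Start the expensive call on a worker thread.
    std::future<llama_model *> pending = std::async(std::launch::async, [&params] {
        return llama_model_load_from_file("model.gguf", params); // placeholder path
    });

    llama_model * model = nullptr;
    while (model == nullptr) {
        // ... main application logic keeps running here (render a frame, pump events, ...)
        if (pending.wait_for(std::chrono::milliseconds(0)) == std::future_status::ready) {
            model = pending.get(); // the load has already finished, so this does not block
        }
    }

    std::printf("model loaded without stalling the main loop\n");
    // Context creation and llama_decode calls can be deferred the same way.
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

MBASE wraps this kind of deferral (plus the synchronization and callback plumbing) for you, so the sketch above is only meant to illustrate the problem being solved.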
Clarification: by non-blocking, I mean that the inference objects and methods in the MBASE SDK don't block the main application thread.
Repo link: https://github.com/Emreerdog/mbase
Docs link: https://docs.mbasesoftware.com/inference/about
Hello! I am excited to announce a project I have been working on for a couple of months.
MBASE inference library is a high-level C++ non-blocking LLM inference library
written on top of the llama.cpp library to provide the necessary tools and APIs to allow
developers to integrate LLMs into their applications with minimal performance loss and development time.
The MBASE SDK will make LLM integration into games and high-performance applications possible
through its fast and non-blocking behavior, which also makes it possible to run multiple LLMs in parallel.
Features such as system prompt caching, which implies a significant performance boost, are supported by default.
There is also detailed (though still incomplete) documentation written for the MBASE SDK that shows how to use the SDK, along with some generally useful information.
There are some useful programs written with the MBASE SDK to exercise and show off the SDK:
Openai Server: An OpenAI API compatible HTTP/HTTPS server for serving LLMs. This program provides a chat completion API for TextToText
models and an embeddings API for embedder models. Hosting multiple LLMs and system prompt caching are supported (see the sketch after this list).
Benchmark T2T: A program written to measure the performance of a given T2T LLM and its
impact on your main application logic.
Embedding: An example program for generating the embeddings of the given prompt or prompts.
Retrieval: An example retrieval program.
Simple Conversation: A simple executable program in which you have a dialogue with the LLM you provide.
Typo Fixer: An applied example use case of the MBASE SDK. The program reads a user-supplied
text file and fixes the typos in it.
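As a rough illustration of the Openai Server item above, here is a hedged sketch of a chat completion request in the standard OpenAI format, sent with libcurl. The host, port, and model name are placeholders I made up for the example, not values documented by MBASE; check the docs linked above for the actual serving address and options.

```cpp
// Hypothetical client for an OpenAI-compatible chat completions endpoint.
// Host, port, and model name are placeholders. Build with: -lcurl
#include <curl/curl.h>
#include <cstdio>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) { return 1; }

    // Standard OpenAI chat completions request body.
    const char *body =
        "{\"model\":\"my-model\","
        "\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}]}";

    struct curl_slist *headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/v1/chat/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    // libcurl writes the response body to stdout by default.
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        std::fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```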
I am planning to implement a high-level C++ API for all other models written using the ggml
library, such as whisper.cpp, bark.cpp, stable-diffusion.cpp, etc., and to provide their usage
in a non-blocking manner in the future, with clean documentation.
The general assumption would be: "If there is an LLM implemented in ggml, there is a non-blocking C++ API for it in the MBASE SDK".
I would love to hear your feedback, and contributions are much appreciated.
Thank you!