MBASE Non-blocking LLM inference in C++ built on top of llama.cpp #12005
What do I Mean by Non-blocking

Let's set LLMs aside for a second and think about IO management in a program. IO and network operations are expensive: their performance is limited by disk read/write speed or, for network operations, heavily influenced by your network environment. In the IO scenario, when you want to write multiple gigabytes of data to disk, you need a mechanism in your program so that the write won't block your main application logic. You might do this by writing the data in chunks, say 1 KB per iteration, or by doing the IO on a separate thread and writing a synchronization mechanism tailored to your needs, or by using an async IO library that handles this for you. In my opinion, LLM inference deserves its own non-blocking terminology, because operations such as model initialization (llama_model_load_from_file), destruction (llama_model_free), context creation (llama_init_from_model), and the encoder/decoder calls (llama_encode/llama_decode) are extremely expensive, which makes them really difficult to integrate into your main application logic. Even with a high-end GPU, the amount of time your program halts on llama_model_load_from_file, llama_init_from_model, and llama_encode/llama_decode keeps people from integrating LLMs into their applications. This SDK applies those operations in a non-blocking manner: model initialization, destruction, context creation, and the encoder/decoder methods don't block your main thread, and synchronization is handled by the MBASE SDK. Using this, you will be able to load/unload multiple models, create contexts, and run encode/decode operations all at the same time without blocking your main application thread, because MBASE handles those operations in parallel and provides synchronized callbacks, so you won't need to deal with the issues that arise from parallel programming.
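To make the idea concrete, here is a minimal sketch of what "non-blocking" means here, using only plain llama.cpp and the standard library (this is not the MBASE API): the expensive load runs on a worker thread while the main loop keeps ticking and only picks up the result once it is ready. The model path is a placeholder.

```cpp
// Minimal sketch: offload the blocking llama.cpp model load to a worker thread
// and poll its completion from the main loop, so the main thread never stalls.
#include <chrono>
#include <cstdio>
#include <future>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params params = llama_model_default_params();

    // Start the expensive call on a worker thread.
    std::future<llama_model *> pending = std::async(std::launch::async, [&params] {
        return llama_model_load_from_file("model.gguf", params); // placeholder path
    });

    llama_model * model = nullptr;
    while (model == nullptr) {
        // ... main application logic keeps running here (render a frame, pump events, ...)
        if (pending.wait_for(std::chrono::milliseconds(0)) == std::future_status::ready) {
            model = pending.get(); // the load has already finished, so this does not block
        }
    }

    std::printf("model loaded without stalling the main loop\n");
    // Context creation and llama_decode calls can be deferred the same way.
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

MBASE wraps this kind of deferral (plus the synchronization and callback plumbing) for you, so the sketch above is only meant to illustrate the problem being solved.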
Clarification: by non-blocking, I mean that the inference objects and methods in the MBASE SDK don't block the main application thread.
Repo link: https://github.com/Emreerdog/mbase
Docs link: https://docs.mbasesoftware.com/inference/about
Hello! I am excited to announce a project I have been working on for a couple of months.
MBASE inference library is a high-level C++ non-blocking LLM inference library
written on top of the llama.cpp library to provide the necessary tools and APIs to allow
developers to integrate LLMs into their applications with minimal performance loss and development time.
The MBASE SDK will make LLM integration into games and high-performance applications possible
through its fast and non-blocking behavior, which also makes it possible to run multiple LLMs in parallel.
Features such as system prompt caching, which implies a significant performance boost, are supported by default.
There is also detailed (though still incomplete) documentation written for the MBASE SDK that shows how to use the SDK, along with some generally useful information.
There are some useful programs written with the MBASE SDK to exercise and show off the SDK:
Openai Server: An OpenAI API compatible HTTP/HTTPS server for serving LLMs. This program provides a chat completion API for TextToText
models and an embeddings API for embedder models. Hosting multiple LLMs and system prompt caching are supported (see the sketch after this list).
Benchmark T2T: A program written to measure the performance of a given T2T LLM and its
impact on your main application logic.
Embedding: An example program for generating the embeddings of the given prompt or prompts.
Retrieval: An example retrieval program.
Simple Conversation: A simple executable program in which you have a dialogue with the LLM you provide.
Typo Fixer: An applied example use case of the MBASE SDK. The program reads a user-supplied
text file and fixes the typos in it.
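As a rough illustration of the Openai Server item above, here is a hedged sketch of a chat completion request in the standard OpenAI format, sent with libcurl. The host, port, and model name are placeholders I made up for the example, not values documented by MBASE; check the docs linked above for the actual serving address and options.

```cpp
// Hypothetical client for an OpenAI-compatible chat completions endpoint.
// Host, port, and model name are placeholders. Build with: -lcurl
#include <curl/curl.h>
#include <cstdio>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) { return 1; }

    // Standard OpenAI chat completions request body.
    const char *body =
        "{\"model\":\"my-model\","
        "\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}]}";

    struct curl_slist *headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/v1/chat/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    // libcurl writes the response body to stdout by default.
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        std::fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```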
I am planning to implement a high-level C++ API for all other models written using the ggml
library, such as whisper.cpp, bark.cpp, stable-diffusion.cpp, etc., and to provide their usage
in a non-blocking manner in the future, with clean documentation.
The general assumption would be: "If there is an LLM implemented in ggml, there is a non-blocking C++ API for it in the MBASE SDK".
I would love to hear your feedback, and contributions are much appreciated.
Thank you!