Breaking Changes #9276

mcharytoniuk · 2024-09-02T12:36:57Z

mcharytoniuk
Sep 2, 2024

This change broke both my project (https://github.com/distantmagic/paddler) and the infrastructure (forced me to update both the prod environment and other related projects by surprise):
#9056

I did not notice that in time because I had been using an older build of llama.cpp at that point. It also forced me to rebuild the llama.cpp instances I had deployed in prod.

Do you plan to introduce a communication channel that notifies about breaking changes in llama.cpp? It would be important to have something like this for the sake of stability and reliability in prod environments. It would be nice to have some way to avoid such unexpected changes in the future. It could be anything—a Discord server, a mailing list, anything (also, it would be best to notify before they happen).

Also, since llama.cpp uses rolling releases instead of semantic versioning it is impossible to target specific versions of llama.cpp with 3rd party projects - all that is left is "duck checking" - I have to check if the currently installed llama.cpp version supports the new endpoint, or the older one, which is also not ideal.
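The "duck checking" described above can be sketched roughly as follows. This is only an illustration: `fetch_status` is a hypothetical helper supplied by the caller, and treating any non-404 answer as "supported" is an assumption, not documented llama.cpp behavior. The only names taken from this thread are the /health and /slots paths.

```python
from typing import Callable, Iterable, Optional


def detect_supported_endpoint(
    fetch_status: Callable[[str], Optional[int]],
    candidates: Iterable[str] = ("/slots", "/health"),
) -> Optional[str]:
    """Return the first candidate path that the server seems to support.

    fetch_status is a hypothetical helper: it takes a path and returns the
    HTTP status code, or None if the request could not be completed.
    """
    for path in candidates:
        status = fetch_status(path)
        # Assumption: a 404 means "this build does not know the endpoint";
        # any other answer is treated as "supported".
        if status is not None and status != 404:
            return path
    return None
```

A monitoring tool could run this once at startup and cache the result, so the probing cost is not paid on every poll.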

Also, in the case of Paddler (or any tool that monitors llama.cpp), it forces requests to two endpoints (/health and /slots instead of one), which is more taxing on the infra, and this wasn't discussed anywhere beforehand.

I love the project and do my best to use it in production and rely on it, but I was kind of taken by surprise. I think backward compatibility needs some consideration. This is a must if llama.cpp is to be trusted in production or to be a foundational building block for other projects.

cc @ggerganov

ggerganov · 2024-09-02T13:02:07Z

ggerganov
Sep 2, 2024
Maintainer

I'm open to suggestions on how to improve the notification of breaking changes. We generally try to update the readme when there are updates to the C-style API. Maybe we can start updating it for server-related changes too.

a Discord server, a mailing list

I don't think these options have any advantage over watching the repo. Please, correct me if I'm wrong.

Also, since llama.cpp uses rolling releases instead of semantic versioning it is impossible to target specific versions of llama.cpp with 3rd party projects - all that is left is "duck checking" - I have to check if the currently installed llama.cpp version supports the new endpoint, or the older one, which is also not ideal.

For semantic versioning to be effective, we would need to start writing detailed release notes. We can try to do that, though I think that the existing commit messages already provide 99% of the information. How would semantic versioning have helped in this case?

Also, in the case of Paddler (or any tool that monitors llama.cpp), it forces requests to two endpoints (/health and /slots instead of one), which is more taxing on the infra, and this wasn't discussed anywhere beforehand.

cc @ngxson for further clarification of why this change is needed. AFAIU, without it, there is no way to efficiently check the server health, because the requests could time out during the processing of bigger batches.

5 replies

ngxson Sep 2, 2024
Collaborator

The server example is not (yet?) declared as prod-ready, so IMO a changelog.txt would make more sense in this case. We can additionally use semver if needed, but it will be quite tricky if there are multiple PRs that increase patch/minor number at the same time.

As for why /health and /slots need to be separated:

Conventionally, /health fails if something goes very wrong and the server cannot handle the request (e.g. the process crashes). This definition is, however, not suitable for fail_on_no_slot. When the server has no free slot, it is just busy doing its job, so this is an expected case and /health should not fail.
As @ggerganov pointed out, another problem is that forcing /health to check the slot status will make it time out when processing a large batch. This makes the health check report the wrong status (i.e. it indicates that the server is not healthy when it actually is). On a Hugging Face inference endpoint, an unhealthy container is forced to restart, which drops all ongoing tasks.
Finally, if you already use /health to fetch the slot status, then the /slots endpoint is the direct equivalent. In short, in your project, you just need to change /health to /slots. In other words, the /slots endpoint already handles reporting the server's health, so you don't need /health anymore.
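To make the distinction above concrete, here is a minimal sketch of how a monitor might interpret a single request to /slots with fail_on_no_slot. The status codes are assumptions made for illustration (200 when a slot is free, 503 when none is); check the server README of your build before relying on them. The enum and function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class ServerState(Enum):
    AVAILABLE = "available"      # alive and has a free slot
    BUSY = "busy"                # alive, but every slot is occupied
    UNREACHABLE = "unreachable"  # no usable answer


def classify_slots_response(status_code: Optional[int]) -> ServerState:
    """Map the outcome of one request to /slots?fail_on_no_slot=1 to a state.

    Assumption: 200 means at least one slot is free and 503 means no slot
    is free. None stands for a timeout or a connection error.
    """
    if status_code is None:
        return ServerState.UNREACHABLE
    if status_code == 200:
        return ServerState.AVAILABLE
    if status_code == 503:
        return ServerState.BUSY
    # Any other code is unexpected; report it conservatively.
    return ServerState.UNREACHABLE
```

The point of the separate BUSY state is that an orchestrator should not restart a server that is merely busy, since a restart would drop the tasks already in progress.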

ngxson Sep 2, 2024
Collaborator

it forces requests to two endpoints (/health and /slots instead of one)

(I'm repeating this point to make it clearer.) Now you only need /slots, as it already combines the slots and health checks.

mcharytoniuk Sep 2, 2024
Author

@ggerganov From the organizational perspective and from the perspective of maintainers like me (early adopters), it would help tremendously to have some channel and opportunity to discuss such changes beforehand. Changing endpoints like that can help in some llama.cpp use cases but can break other types of use cases. This particular change helped to handle big batches with a Docker setup but made it harder to observe the instances in cases like mine.

I think there should be a bit more structure to balance the rapid development with some stability. The current situation with llama.cpp kind of reminds me of what Node.js was going through in the first years after it was released (about 15 years ago). :) They also needed to move fast and break things. They did not want semantic versioning and ended up with a stability index instead: https://nodejs.org/docs/latest/api/documentation.html#stability-index. Maybe a similar solution could be applied here?

It would also help to keep a list of some projects dependent on llama.cpp (and how they depend on it)—that can help to see the impact of the potential changes. From my perspective, if someone wants to introduce changes to certain endpoints or other breaking changes, I think a system where third-party project maintainers can have an opportunity to at least discuss the changes beforehand would be helpful. I can help implement that if you are interested.

Also, @ngxson - your change makes perfect sense from the technical point of view, I am also trying to underline the implications for the business/product side of things. I am not at all against your contribution and such changes. I am only trying to find a way to improve the organizational process.

ngxson Sep 2, 2024
Collaborator

Yeah I totally agree that the current organizational process could be improved, as more and more people want to use the server example in a prod-ready environment (and thanks for your suggestions too). I'm open to discussions about this point.

Personally I'm a bit doubtful about the stability index, as most features in llama.cpp are still subject to changes, including breaking ones (except for the OAI-compat endpoints). But I think we can already start by enumerating all existing features. This could be useful for keeping track of what we currently have.

P.S.: Another project that comes to mind is immich, which has gone through quite a few breaking changes since I first set it up on my home server (~1 year ago). They rely on semver and GitHub releases for the changelog. I'm not sure how they organize internally, but this seems to be a good case study.

Vaibhavs10 Sep 3, 2024
Collaborator

Maybe we can start updating it for server-related changes too.

This would be a good idea in the short term (till llama-server matures into a stand-alone project) - this way, people would have one place to track all the changes.

mcharytoniuk · 2024-09-02T19:58:02Z

mcharytoniuk
Sep 2, 2024
Author

My genuine question is - does llama.cpp aim to be used in prod?

I mean - is that a project's goal at all, or is llama.cpp supposed to be a tool for running open source LLMs locally? Because without any guarantees of backwards compatibility it would be really hard to build anything on top of llama.cpp

cc @ggerganov

0 replies

slaren · 2024-09-02T20:05:34Z

slaren
Sep 2, 2024
Maintainer

Just to state the obvious, using the master branch in production is always going to be a bad idea in any project, but especially with llama.cpp since it is effectively the development branch. While we don't use semantic versioning to tag releases, there is nothing stopping you from using a specific commit or tag (all master commits are tagged). You should never update your version of llama.cpp in production (or any other software) without testing it first.

10 replies

mcharytoniuk Sep 3, 2024
Author

I've created the following two issues to keep track of the API changes:

changelog : libllama API #9289

changelog : llama-server REST API #9291

You can try following #9291 for upcoming llama-server updates. We'll try to keep it up-to-date. Hope this helps.

That is an awesome step. Thank you so much!

IMO introducing semantic versioning at this stage would have almost no benefit and will only add extra work for the maintainers. The LLM field is changing very fast, with tens of new models coming out every month and breaking changes needed to support them all the time.

I think you may overestimate the effort required to maintain semantic versioning. Llama.cpp is 99% there, and implementing it won't require additional work; instead, it just requires updating the automated workflow (not the maintainers' workflow).

Key ideas:

The project already has a breaking change label, so major version increments are practically taken care of.
With semantic versioning, frequent major version increments are not an issue.
New model support can update the minor version, security issues can update the patch version.

Implementation suggestion:

Continue tagging nightly builds as you currently do (e.g., v3658, v3659).
Introduce automated semantic versioning for releases (e.g., start with v3568.0.0).
Use GitHub workflows to automate version increments based on issue labels.

Examples:

If a breaking change is detected: v3659.0.0
If no breaking changes: v3658.1.0

Releases already have automated messages built from merged issues (most commits have a referenced issue in the commit message). GitHub workflow can detect if any such issue has a breaking change tag. In that case, it would create a new release with a major version increment.
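The bump rule proposed above can be sketched as a small pure function. This is an illustration only, not an existing llama.cpp workflow: the "breaking change" label is the one mentioned in this post, while the "security" label name and the "anything else bumps the minor version" default are assumptions.

```python
import re
from typing import Iterable

_VERSION = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def next_version(current: str, labels: Iterable[str]) -> str:
    """Compute the next release version from the labels of merged issues."""
    match = _VERSION.match(current)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH version: {current!r}")
    major, minor, patch = (int(part) for part in match.groups())
    label_set = set(labels)

    if "breaking change" in label_set:
        # Any breaking change resets minor and patch.
        return f"v{major + 1}.0.0"
    if label_set and label_set <= {"security"}:
        # Hypothetical label: security-only releases bump the patch number.
        return f"v{major}.{minor}.{patch + 1}"
    # Assumed default: everything else (e.g. new model support) is a minor bump.
    return f"v{major}.{minor + 1}.0"
```

For example, `next_version("v3658.0.0", {"breaking change"})` yields `v3659.0.0`, and `next_version("v3658.0.0", [])` yields `v3658.1.0`, matching the two examples listed above.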

The effect? Combining nightly builds (plain version tags such as v3658) with semantic-version releases allows clear communication about which builds are cutting-edge development snapshots and which are stable, production-ready releases. Users who want to test the very latest changes, contribute to development, or need a specific new feature would typically use the nightly builds.

If you're interested in this automated solution, I would be happy to contribute a GitHub workflow that implements this process.

ngxson Sep 3, 2024
Collaborator

Continue tagging nightly builds as you currently do (e.g., v3658, v3659).

I don't really understand this point. Moving from a numerical version to semver will add a .0.0 suffix to the current number, which is already a breaking change, so why bother keeping the current build number? Why not just start counting semver fresh from 0.0.0?

Beside that, I think there is a small misconception about how the current version/build number is counted:

The build number (i.e. v3658, v3659, etc.) is not counted actively; it is derived passively from the number of commits on the main branch (git rev-list --count HEAD). Therefore, we cannot simply add something to this number. With what you describe, something like v3658.1.0 would effectively be the 3659th commit on the master branch. So in this case we would have two versioning systems, which doesn't seem maintainable to me.
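A minimal sketch of how a commit-count build number like this can be derived. Only the `git rev-list --count HEAD` command comes from the comment above; the "v" prefix follows the tags as written in this thread, and the injectable `run_git` parameter is a hypothetical convenience for testing.

```python
import subprocess
from typing import Callable, Sequence


def _run_git(args: Sequence[str]) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def build_tag(
    run_git: Callable[[Sequence[str]], str] = _run_git,
    prefix: str = "v",
) -> str:
    """Derive a build tag from the number of commits reachable from HEAD."""
    count = int(run_git(["rev-list", "--count", "HEAD"]).strip())
    return f"{prefix}{count}"
```

Because the number is just a commit count, every new commit on the branch moves it forward by one. That is why a tag such as v3658.1.0 would sit on a commit whose build number is already 3659.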

ngxson Sep 3, 2024
Collaborator

I think you may overestimate the effort required to maintain semantic versioning. Llama.cpp is 99% there, and implementing it won't require additional work; instead, it just requires updating the automated workflow (not the maintainers' workflow).

Correct me if I'm wrong, but I can't imagine how it can be implemented without additional work. I know there are many cool ready-to-use GitHub Actions that handle semver, but they all need to be triggered manually.

Besides, in actual production environments, I have never seen anyone increase semver in an automated way (at least for the major and minor numbers); it's too risky to accidentally increase the major version, which leads to miscommunication. Also, most projects release multiple new features in one major release (usually with a changelog), which again requires manually bumping the major number by one.

ggerganov Sep 4, 2024
Maintainer

Another trouble for semver is that we have 2 public APIs (the C-style API and the HTTP API) that we want to describe using a single version. It will be confusing, since breaking changes in one API would bump the major version for both APIs.

Anyway, I think the process we currently have is good enough to not make any major changes. The changelog issues should help resolve the original problem that started the discussion.

mcharytoniuk Sep 4, 2024
Author

Anyway, I think the process we currently have is good enough to not make any major changes. The changelog issues should help resolve the original problem that started the discussion.

Ok, understood. I won't push for semver further then. Thanks again for the aggregated issue.

ngxson · 2024-09-02T21:50:03Z

ngxson
Sep 2, 2024
Collaborator

Just want to add to the discussion: a while ago @Vaibhavs10 and I had a small, informal discussion about having a "stable channel" of the llama.cpp server image. The idea is to have some commits on the master branch tested (semi-)automatically and then tagged as "stable". This could be done periodically, maybe once a week or so.

We don't have a clear plan yet. But along the way, adding a communication channel for breaking changes could be a good idea.

1 reply

edlee123 Feb 25, 2025

Hello, interested in this - any update in the thinking / process e.g., idea of tagging "stable" for llama.cpp server image?
