
Misc. bug: n_probs is not working with llama.cpp server #10733


Closed
henryclw opened this issue Dec 9, 2024 · 6 comments · Fixed by #10783

Comments

henryclw commented Dec 9, 2024

Name and Version

build: 4291 (ce8784b) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Docker image name: ggerganov/llama.cpp:server-cuda
Docker image hash: sha256:8fa3ccfdcd21874c8a8b257b6bf6abf10070d612e00394b477ec124bd56f2d12

Operating systems

No response

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

Started the server with no speculative decoding.

curl --request POST \
     --url http://localhost:8080/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "Why is the sky is blue?", "n_probs": 10}'

The output doesn't contain completion_probabilities, which it should.
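
For reference, a minimal sketch of this check in Python (assuming a server listening on localhost:8080 and the pre-b4277 response format shown later in this thread; only the standard library is used):

# Query /completion with n_probs and check whether the response carries
# per-token probabilities. The URL and prompt are placeholders.
import json
import urllib.request

payload = {"prompt": "Why is the sky blue?", "n_probs": 10}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# True on b4274, False on b4277 and later builds affected by this bug.
print("completion_probabilities" in body)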

First Bad Commit

HINT:

For docker image server-cuda-b4274, n_probs is working as expected
For docker image server-cuda-b4277, n_probs is not working

Relevant log output

No response

henryclw commented Dec 9, 2024

I just found out this bug was introduced in 6c5bc06.
@ngxson Hi, do you mind giving this a look if you have a minute?

thkodin commented Dec 10, 2024

Adding that the same issue occurs on /chat/completions. I noticed it after shifting from a local build (b3912) to a dockerized version running the latest one (b4291 at the time of writing). Same observation as @henryclw: server-cuda-b4274 works, and b4277 does not.

Basically, if you were to pass logprobs=true and top_logprobs=2 in the OAI-like chat request, e.g.:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{"model": "gpt-4", "messages": [{"role": "system", "content": "You are an expert nutritionist."}, {"role": "user", "content": "What are mangoes? Respond very briefly."}], "logprobs": true, "top_logprobs": 2}'

On a working build (such as b4274), you'd get (truncated here to the first two tokens, followed by the last token):

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Mangoes are a tropical fruit from the genus Mangifera, specifically the species Mangifera indica.","role":"assistant"}}],"created":1733789118,"model":"gpt-4","object":"chat.completion","usage":{"completion_tokens":23,"prompt_tokens":40,"total_tokens":63},"id":"chatcmpl-oiyl71CKlD1C3X3MqofQAI7HlIrSOQAx","completion_probabilities":[{"content":"M","probs":[{"tok_str":"M","prob":1.0},{"tok_str":"A","prob":0.0}]},{"content":"ango","probs":[{"tok_str":"ango","prob":1.0},{"tok_str":"ang","prob":0.0}]},{"content":"<|im_end|>","probs":[{"tok_str":"<|im_end|>","prob":1.0},{"tok_str":" They","prob":0.0}]}]}

Now, on b4277 (also on b4291 image), with the same POST, it returns the following output:

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Mangoes are a type of tropical fruit.","role":"assistant"}}],"created":1733792861,"model":"gpt-4","object":"chat.completion","usage":{"completion_tokens":11,"prompt_tokens":40,"total_tokens":51},"id":"chatcmpl-acfFHInMS3Ha3Hvb72e1KTgkRdAYasBj","timings":{"prompt_n":40,"prompt_ms":9609.985,"prompt_per_token_ms":240.249625,"prompt_per_second":4.162337402191574,"predicted_n":11,"predicted_ms":9075.829,"predicted_per_token_ms":825.0753636363636,"predicted_per_second":1.2120104951294257}}

This is not technically the OpenAI-compatible output anyway (OpenAI returns logprobs, not raw probabilities), so the removal might be intentional as part of that shift. I suspect this because in 6c5bc06, within server.cpp, probs_output (which appears to be the struct representing the completion probabilities) is not present in any of the to_json functions for OpenAI-compatible responses, though I am nowhere near adept at C++, so I may well be wrong.
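
For clarity on that distinction: OpenAI-style logprobs are simply the natural logarithms of the probabilities that the older probs field exposed directly, as in this illustrative snippet (not llama.cpp code):

import math

prob = 0.87                # a hypothetical token probability
logprob = math.log(prob)   # what an OpenAI-style logprobs field carries
assert math.isclose(math.exp(logprob), prob)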

ngxson (Collaborator) commented Dec 11, 2024

@thkodin The probs_output has been removed from /chat/completions because it's not OpenAI-compatible. The idea of the commit you mentioned is to completely separate to_json into two versions: non-OAI-compat and OAI-compat.

On the bright side, this new structure allows adding llama_token_probs::to_json_oai_compat very easily. In fact, I plan to do that this week. Token probs are needed for benchmarking quality (i.e. calculating perplexity).
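
As a rough illustration of that use case, here is a minimal sketch of computing perplexity from per-token probabilities in the pre-b4277 completion_probabilities format shown earlier in this thread (this is not llama.cpp code, and it assumes the first candidate under "probs" is the token that was actually generated):

import math

def perplexity(completion_probabilities):
    # completion_probabilities: list of
    #   {"content": <token>, "probs": [{"tok_str": ..., "prob": ...}, ...]}
    # Assumes the first candidate in "probs" is the generated token.
    log_probs = []
    for tok in completion_probabilities:
        p = max(tok["probs"][0]["prob"], 1e-12)  # clamp to avoid log(0)
        log_probs.append(math.log(p))
    # perplexity = exp(-mean log-probability)
    return math.exp(-sum(log_probs) / len(log_probs))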

StoyanStAtanasov commented

I'm here also because I'm trying to calculate perplexity. Waiting for the merge!

henryclw (Author) commented

Hi, thank you for the kind reply. I really like the idea of splitting the API into a non-OpenAI format and an OpenAI format. On one hand, the OpenAI format can easily interoperate with other projects; on the other, the non-OpenAI format is the one we can do research with.

And yes, as others have mentioned, logprobs are really useful in some cases.
