Commit 5896c65

server : add OAI compat for /v1/completions (#10974)
* server : add OAI compat for /v1/completions
* add test
* add docs
* better docs

1 parent bc7b1f8 commit 5896c65

File tree

5 files changed: +400 −146 lines changed

examples/server/README.md

+161 −91
@@ -345,7 +345,7 @@ node index.js

> [!IMPORTANT]
>
> This endpoint is **not** OAI-compatible. For OAI-compatible clients, use `/v1/completions` instead.

*Options:*

@@ -523,6 +523,7 @@ These words will not be included in the completion, so make sure to add them to
- `tokens_evaluated`: Number of tokens evaluated in total from the prompt
- `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus the tokens generated (`tokens_predicted`) exceeded the context size (`n_ctx`)

### POST `/tokenize`: Tokenize a given text

*Options:*
@@ -574,6 +575,10 @@ With input 'á' (utf8 hex: C3 A1) on tinyllama/stories260k

### POST `/embedding`: Generate embedding of a given text

> [!IMPORTANT]
>
> This endpoint is **not** OAI-compatible. For OAI-compatible clients, use `/v1/embeddings` instead.

This endpoint works the same as [the embedding example](../embedding).

*Options:*
@@ -744,96 +749,6 @@ To use this endpoint with POST method, you need to start server with `--props`

- None yet

### POST `/embeddings`: non-OpenAI-compatible embeddings API

This endpoint supports all pooling types, including `--pooling none`. When the pooling is `none`, the responses will contain the *unnormalized* embeddings for *all* input tokens. For all other pooling types, only the pooled embeddings are returned, normalized using the Euclidean norm.
@@ -1064,6 +979,161 @@ To know the `id` of the adapter, use GET `/lora-adapters`
]
```

## OpenAI-compatible API Endpoints

### GET `/v1/models`: OpenAI-compatible Model Info API

Returns information about the loaded model. See [OpenAI Models API documentation](https://platform.openai.com/docs/api-reference/models).

The returned list always contains a single element.

By default, the model `id` field is the path to the model file, as specified via `-m`. You can set a custom value for the `id` field via the `--alias` argument. For example, `--alias gpt-4o-mini`.

Example:

```json
{
  "object": "list",
  "data": [
    {
      "id": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
      "object": "model",
      "created": 1735142223,
      "owned_by": "llamacpp",
      "meta": {
        "vocab_type": 2,
        "n_vocab": 128256,
        "n_ctx_train": 131072,
        "n_embd": 4096,
        "n_params": 8030261312,
        "size": 4912898304
      }
    }
  ]
}
```
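
A quick way to exercise this endpoint (assuming the server is running on the default `localhost:8080`):

```shell
curl http://localhost:8080/v1/models
```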

### POST `/v1/completions`: OpenAI-compatible Completions API

Given an input `prompt`, it returns the predicted completion. Streaming mode is also supported. While no strong claims of compatibility with the OpenAI API spec are being made, in our experience it suffices to support many apps.

*Options:*

See [OpenAI Completions API documentation](https://platform.openai.com/docs/api-reference/completions).

llama.cpp `/completion`-specific features such as `mirostat` are supported.

*Examples:*

Example usage with the `openai` Python library:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)

completion = client.completions.create(
    model="davinci-002",
    prompt="I believe the meaning of life is",
    max_tokens=8
)

print(completion.choices[0].text)
```
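
... or the equivalent raw HTTP request (a minimal sketch following the same pattern as the chat completion examples below; the header and field values are placeholders):

```shell
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
    "model": "davinci-002",
    "prompt": "I believe the meaning of life is",
    "max_tokens": 8
}'
```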

### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API

Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.

*Options:*

See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). While some OpenAI-specific features such as function calling aren't supported, llama.cpp `/completion`-specific features such as `mirostat` are supported.

The `response_format` parameter supports both plain JSON output (e.g. `{"type": "json_object"}`) and schema-constrained JSON (e.g. `{"type": "json_object", "schema": {"type": "string", "minLength": 10, "maxLength": 100}}` or `{"type": "json_schema", "schema": {"properties": { "name": { "title": "Name", "type": "string" }, "date": { "title": "Date", "type": "string" }, "participants": { "items": {"type": "string" }, "title": "Participants", "type": "array" } } } }`), similar to other OpenAI-inspired API providers.
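
For instance, a request that constrains the whole reply with the first schema above might look like this (a sketch; the prompt and schema are illustrative):

```shell
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Describe llama.cpp in one short sentence."}],
    "response_format": {"type": "json_object", "schema": {"type": "string", "minLength": 10, "maxLength": 100}}
}'
```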

*Examples:*

You can use either the Python `openai` library with appropriate checkpoints:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)

print(completion.choices[0].message)
```

... or raw HTTP requests:

```shell
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
    "model": "gpt-3.5-turbo",
    "messages": [
        {
            "role": "system",
            "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
        },
        {
            "role": "user",
            "content": "Write a limerick about python exceptions"
        }
    ]
}'
```
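
Streaming uses the same endpoint: setting `"stream": true` in the body makes the server return the completion incrementally as server-sent `data:` chunks (a minimal sketch):

```shell
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Write a limerick about python exceptions"}],
    "stream": true
}'
```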

### POST `/v1/embeddings`: OpenAI-compatible embeddings API

This endpoint requires that the model uses a pooling type other than `none`. The embeddings are normalized using the Euclidean norm.

*Options:*

See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).

*Examples:*

- `input` as string

```shell
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
    "input": "hello",
    "model": "GPT-4",
    "encoding_format": "float"
}'
```

- `input` as string array

```shell
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
    "input": ["hello", "world"],
    "model": "GPT-4",
    "encoding_format": "float"
}'
```
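
The same requests can be made from Python with the `openai` client (a sketch, assuming the `client` object from the chat example above):

```python
# one embedding object is returned per input string
result = client.embeddings.create(
    model="GPT-4",
    input=["hello", "world"],
    encoding_format="float"
)
print(len(result.data), len(result.data[0].embedding))
```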

## More examples

### Interactive mode
