|`--slot-save-path PATH`| path to save slot kv cache (default: disabled) |
|`--chat-template JINJA_TEMPLATE`| set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted:<br/>https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
|`-sps, --slot-prompt-similarity SIMILARITY`| how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled) |
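For intuition, the `-sps` score can be pictured as the shared-prefix fraction between the incoming prompt and a slot's cached prompt. A minimal sketch, for illustration only (the server's actual scoring may differ):

```python
# Illustrative sketch only: not the server's exact algorithm.
def slot_similarity(request_prompt: str, slot_prompt: str) -> float:
    """Fraction of the cached slot prompt shared as a prefix with the request."""
    if not slot_prompt:
        return 0.0
    shared = 0
    for a, b in zip(request_prompt, slot_prompt):
        if a != b:
            break
        shared += 1
    return shared / len(slot_prompt)

# With `-sps 0.5`, a slot would only be reused when the score reaches 0.5.
print(slot_similarity("Hello world, how are you?", "Hello world!"))  # ~0.92
```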
`cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation), enabling this option can cause nondeterministic results. Default: `false`
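For example, a client can opt in per request. A minimal sketch using Python's `requests`; the server address and prompt are assumptions:

```python
import requests

# Assumes a llama.cpp server listening on http://localhost:8080.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Translate to French: Hello, world!",
        "n_predict": 32,
        "cache_prompt": True,  # reuse KV cache from a previous request if possible
    },
)
print(resp.json()["content"])
```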
`samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"]` - these are all the available values.
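As an illustration, a request could restrict sampling to `top_k` followed by `temperature` (a sketch; the address and parameter values are assumptions):

```python
import requests

resp = requests.post(
    "http://localhost:8080/completion",  # assumed server address
    json={
        "prompt": "Once upon a time",
        "n_predict": 64,
        "samplers": ["top_k", "temperature"],  # samplers run in this order
        "top_k": 40,
        "temperature": 0.8,
    },
)
print(resp.json()["content"])
```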
**Response format**
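### POST `/infill`: For code infilling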
Takes a prefix and a suffix and returns the predicted completion as a stream.
*Options:*

- `input_prefix`: Set the prefix of the code to infill.
- `input_suffix`: Set the suffix of the code to infill.
It also accepts all the options of `/completion` except `stream` and `prompt`.
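A minimal infill request might look like this (a sketch using Python's `requests`; the address and code snippet are assumptions):

```python
import requests

# Assumes an infill-capable model served on http://localhost:8080.
resp = requests.post(
    "http://localhost:8080/infill",
    json={
        "input_prefix": "def fib(n):\n    ",
        "input_suffix": "\n    return a\n",
        "n_predict": 64,
    },
)
print(resp.json()["content"])  # assumes a single (non-streamed) JSON body
```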
### **GET** `/props`: Get server global properties.
This endpoint is public (no API key check). By default, it is read-only. To make a POST request to change global properties, you need to start the server with `--props`.
**Response format**
```json
{
  "system_prompt": "",
  "default_generation_settings": { ... },
  "total_slots": 1,
  "chat_template": ""
}
```
- `system_prompt` - the system prompt (initial prompt of all slots). Please note that this does not take the chat template into account: the system prompt is prepended to the formatted prompt.
- `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
- `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
- `chat_template` - the model's original Jinja2 prompt template
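Reading these properties from a client is a single GET request (a sketch; the server address is an assumption):

```python
import requests

# /props is public: no API key header is required for the GET.
props = requests.get("http://localhost:8080/props").json()
print(props["system_prompt"])
print(props["total_slots"])
```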
### POST `/props`: Change server global properties.
To use this endpoint with the POST method, you need to start the server with `--props`.
*Options:*
- `system_prompt`: Change the system prompt (initial prompt of all slots). Please note that this does not take the chat template into account: the system prompt is prepended to the formatted prompt.
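For example (a sketch; this only works if the server was started with `--props`, and the address is an assumption):

```python
import requests

resp = requests.post(
    "http://localhost:8080/props",  # assumed server address
    json={"system_prompt": "You are a concise assistant."},
)
print(resp.status_code)  # 200 on success; non-2xx if the server is read-only
```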
### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
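For example, with plain HTTP (a sketch; the address is an assumption, and a single-model server typically ignores the `model` field):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gpt-3.5-turbo",  # placeholder; the loaded model is used
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```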
## More examples