
Commit 20b1982

phymbert authored and ggerganov committed
server: benchmark: chat/completions scenario and other llm servers comparison (ggml-org#5941)
* server: bench: init a bench scenario with k6 (see ggml-org#5827)
* server: bench: EOL EOF
* server: bench: PR feedback and improved k6 script configuration
* server: bench: remove llamacpp_completions_tokens_seconds as it includes prompt processing time and is misleading
* server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS
* server: bench: increase truncated rate to 80% before failing
* server: bench: fix doc
* server: bench: change gauge custom metrics to trend
* server: bench: add trend custom metrics for total tokens per second average
* server: bench: doc: add an option to debug HTTP requests
* server: bench: filter out too short and too long sequences from the dataset
* server: bench: allow filtering out conversations in the dataset based on an env variable
* server: bench: fix assistant message sent instead of user message
* server: add defrag thold parameter
* server: bench: select prompts based on the current iteration id, not randomly, to make the bench more reproducible

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 parent 767498b · commit 20b1982

File tree

3 files changed: +216 -0 lines changed


examples/server/bench/README.md

+88
@@ -0,0 +1,88 @@
### Server benchmark tools

The benchmark uses [k6](https://k6.io/).

#### Install k6

Follow the instructions from: https://k6.io/docs/get-started/installation/

Example for Ubuntu:
```shell
snap install k6
```

#### Download a dataset

This dataset was originally proposed in [vLLM benchmarks](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md).

```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

#### Download a model

Example for Phi-2:

```shell
../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
```

#### Start the server

The server must answer OAI chat completion requests on `http://localhost:8080/v1`, or on the URL set in the `SERVER_BENCH_URL` environment variable.

Example:
```shell
server --host localhost --port 8080 \
    --model ggml-model-q4_0.gguf \
    --cont-batching \
    --metrics \
    --parallel 8 \
    --batch-size 512 \
    --ctx-size 4096 \
    --log-format text \
    -ngl 33
```

#### Run the benchmark

For 500 chat completion requests with 8 concurrent users over a maximum of 10 minutes, run:
```shell
k6 run script.js --duration 10m --iterations 500 --vus 8
```

The benchmark values can be overridden with:
- `SERVER_BENCH_URL` server url prefix for chat completions, default `http://localhost:8080/v1`
- `SERVER_BENCH_N_PROMPTS` total prompts to select from the dataset for the benchmark, default `480`
- `SERVER_BENCH_MODEL_ALIAS` model alias to pass in the completion request, default `my-model`
- `SERVER_BENCH_MAX_TOKENS` max tokens to predict, default `512`
- `SERVER_BENCH_DATASET` path to the benchmark dataset file
- `SERVER_BENCH_MAX_PROMPT_TOKENS` maximum prompt tokens; longer prompts are filtered out of the dataset, default `1024`
- `SERVER_BENCH_MAX_CONTEXT` maximum context size (prompt + predicted tokens) of a completion request; longer sequences are filtered out of the dataset, default `2048`

Note: the local tokenizer is just a simple split on whitespace and punctuation, so the real number of tokens will differ (see the sketch below).
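For reference, a minimal sketch of how `script.js` (shown further down) resolves these variables via k6's `__ENV` and approximates token counts when filtering the dataset; it mirrors the script rather than adding anything new:

```js
// k6 exposes environment variables through the global __ENV object.
const server_url = __ENV.SERVER_BENCH_URL ? __ENV.SERVER_BENCH_URL : 'http://localhost:8080/v1'
const max_tokens = __ENV.SERVER_BENCH_MAX_TOKENS ? parseInt(__ENV.SERVER_BENCH_MAX_TOKENS) : 512

// Naive "tokenizer": split on whitespace and basic punctuation.
// The counts only approximate the model's real tokenization.
const tokenizer = (message) => message.split(/[\s,'".?]/)
```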

Or with [k6 options](https://k6.io/docs/using-k6/k6-options/reference/):

```shell
SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8
```

To [debug HTTP requests](https://k6.io/docs/using-k6/http-debugging/), use `--http-debug="full"`.

#### Metrics

The following metrics are computed from the `usage` field of the OAI chat completion responses:
- `llamacpp_tokens_second` Trend of `usage.total_tokens / request duration`
- `llamacpp_prompt_tokens` Trend of `usage.prompt_tokens`
- `llamacpp_prompt_tokens_total_counter` Counter of `usage.prompt_tokens`
- `llamacpp_completion_tokens` Trend of `usage.completion_tokens`
- `llamacpp_completion_tokens_total_counter` Counter of `usage.completion_tokens`
- `llamacpp_completions_truncated_rate` Rate of truncated completions, i.e. `finish_reason === 'length'`
- `llamacpp_completions_stop_rate` Rate of completions stopped by the model, i.e. `finish_reason === 'stop'`

The script will fail if too many completions are truncated, see `llamacpp_completions_truncated_rate` and the threshold sketch below.
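The truncation rate is enforced through the k6 thresholds defined in the script's `options`. A minimal sketch of that configuration, extended with a hypothetical additional gate on average throughput (the `avg > 100` target is purely illustrative and is not part of `script.js`):

```js
export const options = {
    thresholds: {
        llamacpp_completions_truncated_rate: [
            // abort the test if more than 80% of completions are truncated
            {threshold: 'rate < 0.8', abortOnFail: true, delayAbortEval: '1m'},
        ],
        // hypothetical extra gate: fail the run if the average total tokens
        // per second drops below 100 (value chosen only for illustration)
        llamacpp_tokens_second: ['avg > 100'],
    },
}
```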

k6 metrics can be compared against the [server metrics](../README.md), with:

```shell
curl http://localhost:8080/metrics
```

examples/server/bench/script.js

+120
@@ -0,0 +1,120 @@
import http from 'k6/http'
import {check, sleep} from 'k6'
import {SharedArray} from 'k6/data'
import {Counter, Rate, Trend} from 'k6/metrics'
import exec from 'k6/execution'

// Server chat completions prefix
const server_url = __ENV.SERVER_BENCH_URL ? __ENV.SERVER_BENCH_URL : 'http://localhost:8080/v1'

// Number of total prompts in the dataset - default 10m / 10 seconds/request * number of users
const n_prompt = __ENV.SERVER_BENCH_N_PROMPTS ? parseInt(__ENV.SERVER_BENCH_N_PROMPTS) : 600 / 10 * 8

// Model name to request
const model = __ENV.SERVER_BENCH_MODEL_ALIAS ? __ENV.SERVER_BENCH_MODEL_ALIAS : 'my-model'

// Dataset path
const dataset_path = __ENV.SERVER_BENCH_DATASET ? __ENV.SERVER_BENCH_DATASET : './ShareGPT_V3_unfiltered_cleaned_split.json'

// Max tokens to predict
const max_tokens = __ENV.SERVER_BENCH_MAX_TOKENS ? parseInt(__ENV.SERVER_BENCH_MAX_TOKENS) : 512

// Max prompt tokens
const n_prompt_tokens = __ENV.SERVER_BENCH_MAX_PROMPT_TOKENS ? parseInt(__ENV.SERVER_BENCH_MAX_PROMPT_TOKENS) : 1024

// Max slot context
const n_ctx_slot = __ENV.SERVER_BENCH_MAX_CONTEXT ? parseInt(__ENV.SERVER_BENCH_MAX_CONTEXT) : 2048

export function setup() {
    console.info(`Benchmark config: server_url=${server_url} n_prompt=${n_prompt} model=${model} dataset_path=${dataset_path} max_tokens=${max_tokens}`)
}

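// SharedArray loads and parses the dataset once and shares it read-only
// across all virtual users, instead of duplicating it per VU.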
const data = new SharedArray('conversations', function () {
    const tokenizer = (message) => message.split(/[\s,'".?]/)

    return JSON.parse(open(dataset_path))
        // Filter out the conversations with less than 2 turns.
        .filter(data => data["conversations"].length >= 2)
        .filter(data => data["conversations"][0]["from"] === "human")
        .map(data => {
            return {
                prompt: data["conversations"][0]["value"],
                n_prompt_tokens: tokenizer(data["conversations"][0]["value"]).length,
                n_completion_tokens: tokenizer(data["conversations"][1]["value"]).length,
            }
        })
        // Filter out too short sequences
        .filter(conv => conv.n_prompt_tokens >= 4 && conv.n_completion_tokens >= 4)
        // Filter out too long sequences.
        .filter(conv => conv.n_prompt_tokens <= n_prompt_tokens && conv.n_prompt_tokens + conv.n_completion_tokens <= n_ctx_slot)
        // Keep only the first n_prompt prompts
        .slice(0, n_prompt)
})

const llamacpp_prompt_tokens = new Trend('llamacpp_prompt_tokens')
const llamacpp_completion_tokens = new Trend('llamacpp_completion_tokens')
const llamacpp_tokens_second = new Trend('llamacpp_tokens_second')

const llamacpp_prompt_tokens_total_counter = new Counter('llamacpp_prompt_tokens_total_counter')
const llamacpp_completion_tokens_total_counter = new Counter('llamacpp_completion_tokens_total_counter')

const llamacpp_completions_truncated_rate = new Rate('llamacpp_completions_truncated_rate')
const llamacpp_completions_stop_rate = new Rate('llamacpp_completions_stop_rate')

export const options = {
    thresholds: {
        llamacpp_completions_truncated_rate: [
            // more than 80% of truncated completions will abort the test
            {threshold: 'rate < 0.8', abortOnFail: true, delayAbortEval: '1m'},
        ],
    },
    duration: '10m',
    vus: 8,
}

export default function () {
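    // Pick the conversation from the current iteration id (not randomly) so
    // that prompt selection, and therefore the benchmark, is reproducible.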
    const conversation = data[exec.scenario.iterationInInstance % data.length]
    const payload = {
        "messages": [
            {
                "role": "system",
                "content": "You are ChatGPT, an AI assistant.",
            },
            {
                "role": "user",
                "content": conversation.prompt,
            }
        ],
        "model": model,
        "stream": false,
        "max_tokens": max_tokens
    }

    const body = JSON.stringify(payload)

    let res = http.post(`${server_url}/chat/completions`, body, {
        headers: {'Content-Type': 'application/json'},
        timeout: '300s'
    })

    check(res, {'success completion': (r) => r.status === 200})

    if (res.status === 200) {
        const completions = res.json()

        llamacpp_prompt_tokens.add(completions.usage.prompt_tokens)
        llamacpp_prompt_tokens_total_counter.add(completions.usage.prompt_tokens)

        llamacpp_completion_tokens.add(completions.usage.completion_tokens)
        llamacpp_completion_tokens_total_counter.add(completions.usage.completion_tokens)

        llamacpp_completions_truncated_rate.add(completions.choices[0].finish_reason === 'length')
        llamacpp_completions_stop_rate.add(completions.choices[0].finish_reason === 'stop')

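        // res.timings.duration is in milliseconds, hence the 1.e3 factor to get tokens per second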
        llamacpp_tokens_second.add(completions.usage.total_tokens / res.timings.duration * 1.e3)
    } else {
        console.error(`response: ${res.body} request=${body}`)
    }

    sleep(0.3)
}

examples/server/server.cpp

+8
@@ -2133,6 +2133,8 @@ static void server_print_usage(const char * argv0, const gpt_params & params, co
     printf("  --yarn-beta-slow N        YaRN: high correction dim or alpha (default: %.1f)\n", params.yarn_beta_slow);
     printf("  --yarn-beta-fast N        YaRN: low correction dim or beta (default: %.1f)\n", params.yarn_beta_fast);
     printf("  --pooling {none,mean,cls} pooling type for embeddings, use model default if unspecified\n");
+    printf("  -dt N, --defrag-thold N\n");
+    printf("                            KV cache defragmentation threshold (default: %.1f, < 0 - disabled)\n", params.defrag_thold);
     printf("  -b N, --batch-size N      batch size for prompt processing (default: %d)\n", params.n_batch);
     printf("  --memory-f32              use f32 instead of f16 for memory key+value (default: disabled)\n");
     printf("                            not recommended: doubles context memory required and no measurable increase in quality\n");
@@ -2355,6 +2357,12 @@ static void server_params_parse(int argc, char ** argv, server_params & sparams,
             else if (value == "mean") { params.pooling_type = LLAMA_POOLING_TYPE_MEAN; }
             else if (value == "cls")  { params.pooling_type = LLAMA_POOLING_TYPE_CLS; }
             else { invalid_param = true; break; }
+        } else if (arg == "--defrag-thold" || arg == "-dt") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.defrag_thold = std::stof(argv[i]);
         } else if (arg == "--threads" || arg == "-t") {
             if (++i >= argc)
             {
