
Commit fd89833

phymberthodlen authored and committed
server: allow to override threads server pool with --threads-http (ggml-org#5794)
1 parent c9e433e commit fd89833

File tree

2 files changed (+17, -0 lines)

examples/server/README.md (+1)

@@ -18,6 +18,7 @@ The project is under active development, and we are [looking for feedback and co
 
 - `--threads N`, `-t N`: Set the number of threads to use during generation.
 - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.
+- `--threads-http N`: number of threads in the http server pool to process requests (default: `std::thread::hardware_concurrency()`)
 - `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.gguf`).
 - `-a ALIAS`, `--alias ALIAS`: Set an alias for the model. The alias will be returned in API responses.
 - `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. The size may differ in other models; for example, baichuan models were built with a context of 4096.
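
As a usage sketch (assuming the `server` binary has been built and reusing the example model path from the option list above; the thread count of 8 is illustrative), the new flag simply combines with the existing options:

    ./server -m models/7B/ggml-model.gguf -c 2048 --threads-http 8

This caps the HTTP request pool at 8 worker threads instead of letting it default to `std::thread::hardware_concurrency()`.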

examples/server/server.cpp (+16)

@@ -43,6 +43,7 @@ struct server_params {
     int32_t write_timeout = 600;
     bool slots_endpoint = true;
     bool metrics_endpoint = false;
+    int n_threads_http = -1;
 };

 bool server_verbose = false;
@@ -2012,6 +2013,7 @@ static void server_print_usage(const char *argv0, const gpt_params &params,
     printf("  -v, --verbose              verbose output (default: %s)\n", server_verbose ? "enabled" : "disabled");
     printf("  -t N, --threads N          number of threads to use during computation (default: %d)\n", params.n_threads);
     printf("  -tb N, --threads-batch N   number of threads to use during batch and prompt processing (default: same as --threads)\n");
+    printf("  --threads-http N           number of threads in the http server pool to process requests (default: hardware concurrency)\n");
     printf("  -c N, --ctx-size N         size of the prompt context (default: %d)\n", params.n_ctx);
     printf("  --rope-scaling {none,linear,yarn}\n");
     printf("                             RoPE frequency scaling method, defaults to linear unless specified by the model\n");
@@ -2298,6 +2300,15 @@ static void server_params_parse(int argc, char **argv, server_params &sparams,
             }
             params.n_threads_batch = std::stoi(argv[i]);
         }
+        else if (arg == "--threads-http")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            sparams.n_threads_http = std::stoi(argv[i]);
+        }
         else if (arg == "-b" || arg == "--batch-size")
         {
             if (++i >= argc)
@@ -3449,6 +3460,11 @@ int main(int argc, char **argv)
     }*/
     //);

+    if (sparams.n_threads_http > 0) {
+        log_data["n_threads_http"] = std::to_string(sparams.n_threads_http);
+        svr.new_task_queue = [&sparams] { return new httplib::ThreadPool(sparams.n_threads_http); };
+    }
+
     LOG_INFO("HTTP server listening", log_data);
     // run the HTTP server in a thread - see comment below
     std::thread t([&]()
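
The interesting hook here is `new_task_queue`: cpp-httplib builds its worker pool from that factory when the server starts listening, sized by the machine's hardware concurrency unless overridden (which matches the default stated in the help text above), so the commit installs the override before the `LOG_INFO("HTTP server listening", ...)` line and the server thread. A minimal standalone sketch of the same mechanism, outside the llama.cpp server (the header name assumes the bundled cpp-httplib; the port, endpoint, and pool size of 8 are illustrative assumptions):

    // Sketch only: replace cpp-httplib's default worker pool with a fixed-size one,
    // mirroring what server.cpp does with sparams.n_threads_http.
    #include "httplib.h"

    int main() {
        httplib::Server svr;

        // Same override as the commit, but with a hard-coded pool size of 8:
        // every accepted connection is dispatched to this ThreadPool.
        svr.new_task_queue = [] { return new httplib::ThreadPool(8); };

        // Illustrative endpoint so the sketch is runnable on its own.
        svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
            res.set_content("ok", "text/plain");
        });

        svr.listen("127.0.0.1", 8080);
        return 0;
    }

Because the factory is only consulted once `listen()` starts accepting connections, setting it any time before the server thread launches, as the diff above does, is sufficient.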

0 commit comments
