
Commit af689e3

phymbert authored and hodlen committed
server : init http requests thread pool with --parallel if set (ggml-org#5836)
1 parent e5f5c2b commit af689e3

File tree

2 files changed: +7 -5 lines changed

examples/server/README.md (+1, -1)
@@ -18,7 +18,7 @@ The project is under active development, and we are [looking for feedback and co
 - `--threads N`, `-t N`: Set the number of threads to use during generation.
 - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.
-- `--threads-http N`: number of threads in the http server pool to process requests (default: `std::thread::hardware_concurrency()`)
+- `--threads-http N`: number of threads in the http server pool to process requests (default: `max(std::thread::hardware_concurrency() - 1, --parallel N + 2)`)
 - `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.gguf`).
 - `-a ALIAS`, `--alias ALIAS`: Set an alias for the model. The alias will be returned in API responses.
 - `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. The size may differ in other models; for example, baichuan models were built with a context of 4096.

examples/server/server.cpp (+6, -4)
@@ -2026,7 +2026,7 @@ static void server_print_usage(const char *argv0, const gpt_params &params,
     printf(" -v, --verbose verbose output (default: %s)\n", server_verbose ? "enabled" : "disabled");
     printf(" -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads);
     printf(" -tb N, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)\n");
-    printf(" --threads-http N number of threads in the http server pool to process requests (default: hardware concurrency)\n");
+    printf(" --threads-http N number of threads in the http server pool to process requests (default: max(hardware concurrency - 1, --parallel N + 2))\n");
     printf(" -c N, --ctx-size N size of the prompt context (default: %d)\n", params.n_ctx);
     printf(" --rope-scaling {none,linear,yarn}\n");
     printf(" RoPE frequency scaling method, defaults to linear unless specified by the model\n");
@@ -3468,10 +3468,12 @@ int main(int argc, char **argv)
     }*/
     //);

-    if (sparams.n_threads_http > 0) {
-        log_data["n_threads_http"] = std::to_string(sparams.n_threads_http);
-        svr.new_task_queue = [&sparams] { return new httplib::ThreadPool(sparams.n_threads_http); };
+    if (sparams.n_threads_http < 1) {
+        // +2 threads for monitoring endpoints
+        sparams.n_threads_http = std::max(params.n_parallel + 2, (int32_t) std::thread::hardware_concurrency() - 1);
     }
+    log_data["n_threads_http"] = std::to_string(sparams.n_threads_http);
+    svr.new_task_queue = [&sparams] { return new httplib::ThreadPool(sparams.n_threads_http); };

     LOG_INFO("HTTP server listening", log_data);
     // run the HTTP server in a thread - see comment below
