
Compiling LLAMA for cuda is single threaded #311

Closed
B3none opened this issue Sep 18, 2024 · 1 comment

Comments

B3none commented Sep 18, 2024

I ran npx --no node-llama-cpp download --cuda and it takes a seriously long time to compile, seemingly because it's only running on one thread.

Is there anything I can do to speed it up?

giladgd commented Sep 18, 2024

It takes a long time because of the many template instantiations llama.cpp compiles for inference performance optimizations.
Nothing can be done to shorten it other than removing support for some GGUF file formats (which is undesirable outside of the development of llama.cpp itself), and the compilation time mainly depends on your hardware.
It will only get slower over time as llama.cpp adds support for new features and model architectures.

I recommend switching to the version 3 beta, which ships with prebuilt binaries you can use without compiling anything.
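
For reference, installing the beta should look something like this (assuming it is published under npm's beta dist-tag; check the project's documentation for the exact command). When a prebuilt binary is available for your platform and CUDA setup, no local compilation should be needed:

npm install node-llama-cpp@beta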

giladgd closed this as completed Sep 18, 2024