
WIP: Flash Attention implementation (forward + backward) #5010


Closed
wants to merge 1 commit

Conversation


@FSSRepo (Collaborator) commented Jan 17, 2024

Previous work: llama.cpp#778

Previously, an initiative to implement Flash Attention to improve inference performance in llama.cpp had already been introduced. However, it was assumed that this approach would not yield the expected results on the CPU; for that reason, it was discarded and no further follow-up was given.

Flash Attention is actually designed to improve GPU resource utilization through the use of tensor cores and shared memory, which is roughly 30 times faster than global memory (VRAM), reducing unnecessary reads and writes.
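
To make the memory-traffic argument concrete, here is a minimal, single-threaded C++ baseline for one attention head that materializes the full n×n score matrix, which is exactly the intermediate that Flash Attention keeps in on-chip tiles instead of writing to and re-reading from global memory. The function name, the row-major [n, d] layout, and the absence of masking are assumptions for illustration only; this is not ggml code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Naive single-head attention: O = softmax(Q K^T / sqrt(d)) V.
// Q, K, V, O are row-major [n, d]. The full [n, n] score matrix S is
// materialized, which is the round trip through slow memory that the
// tiled Flash Attention formulation avoids.
std::vector<float> attention_naive(const std::vector<float>& Q,
                                   const std::vector<float>& K,
                                   const std::vector<float>& V,
                                   int n, int d) {
    const float scale = 1.0f / std::sqrt((float)d);
    std::vector<float> S(n * n), O(n * d, 0.0f);

    for (int i = 0; i < n; ++i) {
        // scaled dot products of query row i against every key row
        float row_max = -INFINITY;
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += Q[i*d + k] * K[j*d + k];
            S[i*n + j] = s * scale;
            row_max = std::max(row_max, S[i*n + j]);
        }
        // numerically stable softmax over the whole row
        float denom = 0.0f;
        for (int j = 0; j < n; ++j) {
            S[i*n + j] = std::exp(S[i*n + j] - row_max);
            denom += S[i*n + j];
        }
        // weighted sum of the value rows
        for (int j = 0; j < n; ++j) {
            const float p = S[i*n + j] / denom;
            for (int k = 0; k < d; ++k) O[i*d + k] += p * V[j*d + k];
        }
    }
    return O;
}
```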

Implementing this algorithm is particularly challenging because it requires taking hardware limitations into account, along with conflicts that can degrade performance even when everything appears to be fine (e.g. shared memory bank conflicts).

Tasks to be carried out during the execution of this project:

  • Analyze and understand concisely how Flash Attention works, even if that just means writing C++ code that replicates the exact functionality without much complexity (see the sketch after this list). From that base, move on to parallelizing and optimizing.
  • Optimize and support different models.
  • Parallelize across GPUs along the attention-head dimension.
  • Create the backward kernel for training.
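
As a sketch of what that simple C++ reference could look like, the function below computes the forward pass for a single head by walking over K/V in blocks and carrying the online-softmax state (running row maximum m, running denominator l, and a rescaled output accumulator), so that only one block of scores exists at a time; this is the recurrence a GPU kernel would evaluate per query block, with the K/V tiles staged in shared memory. The block size Bc, the row-major [n, d] layout, and the absence of masking and dropout are assumptions for this sketch, not the design of the kernel in this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Single-head Flash Attention forward pass on the CPU. K/V are processed
// in blocks of Bc rows; per query row we keep only a running max m, a
// running softmax denominator l, and a partially scaled output accumulator,
// following the online-softmax recurrence of the Flash Attention paper.
std::vector<float> attention_tiled(const std::vector<float>& Q,
                                   const std::vector<float>& K,
                                   const std::vector<float>& V,
                                   int n, int d, int Bc = 32) {
    const float scale = 1.0f / std::sqrt((float)d);
    std::vector<float> O(n * d, 0.0f);
    std::vector<float> m(n, -INFINITY);  // running row maxima
    std::vector<float> l(n, 0.0f);       // running softmax denominators

    for (int j0 = 0; j0 < n; j0 += Bc) {          // loop over K/V blocks
        const int j1 = std::min(j0 + Bc, n);
        for (int i = 0; i < n; ++i) {             // loop over query rows
            // scores of query row i against this block only
            float block_max = -INFINITY;
            std::vector<float> s(j1 - j0);
            for (int j = j0; j < j1; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < d; ++k) acc += Q[i*d + k] * K[j*d + k];
                s[j - j0] = acc * scale;
                block_max = std::max(block_max, s[j - j0]);
            }
            // merge this block into the running (m, l, O) statistics
            const float m_new = std::max(m[i], block_max);
            const float alpha = std::exp(m[i] - m_new);   // rescale old state
            l[i] *= alpha;
            for (int k = 0; k < d; ++k) O[i*d + k] *= alpha;
            for (int j = j0; j < j1; ++j) {
                const float p = std::exp(s[j - j0] - m_new);
                l[i] += p;
                for (int k = 0; k < d; ++k) O[i*d + k] += p * V[j*d + k];
            }
            m[i] = m_new;
        }
    }
    // final normalization by the accumulated denominators
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < d; ++k) O[i*d + k] /= l[i];
    return O;
}
```

The result should match the naive reference above up to floating-point rounding, which makes it a convenient ground truth for validating the GPU forward kernels and, later, the backward pass.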

@FSSRepo closed this Jan 17, 2024