WIP: Flash Attention implementation (forward + backward) #5010

FSSRepo · 2024-01-17T22:10:41Z

An initiative to implement Flash Attention to improve inference performance in llama.cpp had already been introduced previously. However, it was assumed that the approach would yield the expected results on the CPU. When it did not, it was discarded, and no further follow-up was given.

Flash Attention is actually designed to improve GPU resource utilization. It relies on tensor cores and on shared memory, which is around 30 times faster than global memory (VRAM), to cut unnecessary reads and writes.

Implementing this algorithm is particularly challenging because it requires taking into account hardware limitations and conflicts that can degrade performance even when everything is expected to be fine (for example, shared memory bank conflicts).
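To make the bank-conflict point concrete, here is a minimal host-side sketch in plain C++ (not code from this PR). It assumes shared memory is split into 32 banks of 4-byte words and that a warp has 32 threads, which matches recent NVIDIA GPUs but should be checked for the target hardware. The function name `conflict_degree` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumption: 32 banks, each one 4-byte word wide, and 32 threads per warp.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Worst-case number of threads in one warp that land on the same bank when
// thread t reads the 4-byte word at index t * stride_words.
// 1 means conflict-free; 32 means the access is fully serialized.
int conflict_degree(int stride_words) {
    assert(stride_words > 0);
    std::array<int, kBanks> hits{};  // zero-initialized hit counter per bank
    for (int t = 0; t < kWarpSize; ++t) {
        ++hits[(t * stride_words) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, reading one column of a tile stored with 32 floats per row is a stride-32 access, so `conflict_degree(32)` reports that all 32 threads hit the same bank. Padding each row to 33 floats turns it into a stride-33 access, which `conflict_degree(33)` reports as conflict-free.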

Tasks to be carried out during the execution of this project:

- Analyze and understand how Flash Attention works, even if that only means writing simple C++ code that replicates the exact functionality. From that base, move on to parallelizing and optimizing.
- Optimize and support different models.
- Parallelize across GPUs along the attention heads.
- Create the backward kernel for training.
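As an illustration of the first task, the following is a minimal single-head CPU sketch of the idea, not the implementation in this PR. It assumes row-major `float` buffers, no masking, and a scale of 1/sqrt(d). The names `attention_naive`, `attention_tiled`, and the `block` parameter are hypothetical. The tiled version reads K and V one block at a time and keeps a running maximum and running sum per query row (the online softmax), so the full score row is never stored.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

using Vec = std::vector<float>;

// Reference: O = softmax(Q K^T / sqrt(d)) V for one head.
// Q is n_q x d, K and V are n_kv x d, all row-major.
Vec attention_naive(const Vec& Q, const Vec& K, const Vec& V,
                    int n_q, int n_kv, int d) {
    assert(n_kv > 0 && (int)Q.size() == n_q * d);
    assert((int)K.size() == n_kv * d && (int)V.size() == n_kv * d);
    const float scale = 1.0f / std::sqrt((float)d);
    Vec O(n_q * d, 0.0f), s(n_kv);
    for (int i = 0; i < n_q; ++i) {
        float m = -std::numeric_limits<float>::infinity();
        for (int j = 0; j < n_kv; ++j) {  // full score row, kept in memory
            float dot = 0.0f;
            for (int c = 0; c < d; ++c) dot += Q[i * d + c] * K[j * d + c];
            s[j] = dot * scale;
            m = std::max(m, s[j]);
        }
        float sum = 0.0f;
        for (int j = 0; j < n_kv; ++j) { s[j] = std::exp(s[j] - m); sum += s[j]; }
        for (int j = 0; j < n_kv; ++j)
            for (int c = 0; c < d; ++c) O[i * d + c] += s[j] / sum * V[j * d + c];
    }
    return O;
}

// Tiled variant: K/V are consumed `block` rows at a time.
Vec attention_tiled(const Vec& Q, const Vec& K, const Vec& V,
                    int n_q, int n_kv, int d, int block) {
    assert(block > 0 && n_kv > 0 && (int)Q.size() == n_q * d);
    assert((int)K.size() == n_kv * d && (int)V.size() == n_kv * d);
    const float scale = 1.0f / std::sqrt((float)d);
    Vec O(n_q * d, 0.0f), s(block);
    for (int i = 0; i < n_q; ++i) {
        float m = -std::numeric_limits<float>::infinity();  // running max
        float l = 0.0f;                                     // running sum
        float* acc = &O[i * d];                             // unnormalized output
        for (int j0 = 0; j0 < n_kv; j0 += block) {
            const int len = std::min(block, n_kv - j0);
            float m_new = m;
            for (int j = 0; j < len; ++j) {  // scores for this block only
                float dot = 0.0f;
                for (int c = 0; c < d; ++c) dot += Q[i * d + c] * K[(j0 + j) * d + c];
                s[j] = dot * scale;
                m_new = std::max(m_new, s[j]);
            }
            // Rescale what was accumulated under the old maximum.
            // On the first block this is exp(-inf) == 0.
            const float corr = std::exp(m - m_new);
            l *= corr;
            for (int c = 0; c < d; ++c) acc[c] *= corr;
            for (int j = 0; j < len; ++j) {
                const float p = std::exp(s[j] - m_new);
                l += p;
                for (int c = 0; c < d; ++c) acc[c] += p * V[(j0 + j) * d + c];
            }
            m = m_new;
        }
        for (int c = 0; c < d; ++c) acc[c] /= l;  // normalize once at the end
    }
    return O;
}
```

Both functions should agree up to floating-point rounding for any `block` size, which makes the naive version a convenient check for a later GPU kernel. In this sketch each head is independent, so multiple heads can be handled by calling the function once per head on that head's slice of Q, K, and V.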

cuda: add flash attention + test

f7bcfb0

FSSRepo closed this Jan 17, 2024
