https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cuda/fattn-tile-f32.cu