https://github.com/triton-lang/triton/blob/main/lib/Dialect/TritonGPU/Transforms/FuseNestedLoops.cpp Triton has updated FuseNestedLoops.
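For readers unfamiliar with the idea, here is a minimal Python sketch of what fusing a nested loop into a single flat loop generally means. It is not taken from the Triton pass linked above; the function names `nested` and `fused` and the trace format are hypothetical, and the sketch assumes the inner trip count `N` is at least 1 and is the same for every outer iteration.

```python
def nested(M, N):
    """Reference form: an outer loop with a prologue and epilogue around an inner loop."""
    trace = []
    for i in range(M):
        trace.append(("prologue", i))
        for j in range(N):
            trace.append(("body", i, j))
        trace.append(("epilogue", i))
    return trace


def fused(M, N):
    """Same work as nested(), written as one flat loop of M * N iterations.

    Assumes N >= 1; with N == 0 the nested form still runs the prologue
    and epilogue, which this flat form would skip.
    """
    trace = []
    for t in range(M * N):
        i, j = divmod(t, N)          # recover the original loop indices
        if j == 0:                   # first inner iteration: run the outer prologue
            trace.append(("prologue", i))
        trace.append(("body", i, j))
        if j == N - 1:               # last inner iteration: run the outer epilogue
            trace.append(("epilogue", i))
    return trace
```

Both functions produce the same sequence of steps. A single flat loop is typically easier for a software pipeliner to overlap across what used to be outer-iteration boundaries, which is the usual motivation for this kind of transformation.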
from mlir.dialects import nvvm
from mlir.dialects import llvm
from mlir.dialects import func
MLIR Python passes are coming soon.
102676561920 memoryTotal
pm.enable_ir_printing(enable_debug_info=True)
num_warps=num_warps,
num_stages=4)
https://github.com/kyegomez/FlashAttention20Triton