https://github.com/meta-pytorch/tritonbench/pull/498 Enable TileIR for FA
Feature 'add.f32x2' requires .target sm_100 or higher
heuristics_line = heuristics_line.replace("@triton.jit", "@triton_runner.jit")
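A minimal sketch of the decorator swap above, applied to a whole source string rather than a single line. The helper name `swap_jit_decorator` and the sample kernel text are hypothetical; only the two decorator strings come from the note, and nothing here assumes anything about the `triton_runner` API beyond the name.

```python
def swap_jit_decorator(source: str) -> str:
    """Return source with every '@triton.jit' replaced by '@triton_runner.jit'."""
    # Plain substring replacement: it also rewrites parameterized forms
    # such as '@triton.jit(...)', since they start with the same text.
    return source.replace("@triton.jit", "@triton_runner.jit")


# Hypothetical kernel source, used only to exercise the helper.
kernel_src = "@triton.jit\ndef _attn_fwd(Q, K, V):\n    pass\n"
patched_src = swap_jit_decorator(kernel_src)
print(patched_src)
```

Because this is a textual replace, it would also touch the string inside comments or docstrings; that is acceptable for a quick patch but worth keeping in mind.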
(bsz, seq_len, num_q_heads, head_dim) after transpose(1, 2) becomes (bsz, num_q_heads, seq_len, head_dim)
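The shape change can be checked with a small pure-Python sketch that swaps two entries of a shape tuple, which is what transpose(1, 2) does to the dimensions. The helper `transposed_shape` and the concrete sizes are illustrative assumptions, not taken from the note.

```python
def transposed_shape(shape, dim0, dim1):
    """Return the shape that results from swapping dimensions dim0 and dim1."""
    dims = list(shape)
    dims[dim0], dims[dim1] = dims[dim1], dims[dim0]
    return tuple(dims)


# Illustrative sizes (assumed): batch 2, sequence 128, 8 query heads, head_dim 64.
bsz, seq_len, num_q_heads, head_dim = 2, 128, 8, 64
q_shape = (bsz, seq_len, num_q_heads, head_dim)
q_t_shape = transposed_shape(q_shape, 1, 2)
print(q_t_shape)  # (2, 8, 128, 64)
```

Applying the same swap again restores the original layout, since transposing the same two dimensions twice is the identity on shapes.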