It’s possible that this is a bug in MLIR. I used enable_timing() and obtained the timing results below. The first two pass pipelines finish in milliseconds, but the third takes about 307 seconds, and 98.4% of that time is reported under "Rest" rather than under any named pass.

===-------------------------------------------------------------------------===
                         ... Execution time report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0081 seconds

  ----User Time----  ----Wall Time----  ----Name----
    0.0050 ( 27.4%)    0.0050 ( 61.5%)  Inliner
    0.0002 (  1.0%)    0.0002 (  2.2%)  (A) CallGraph
    0.0028 ( 15.5%)    0.0028 ( 34.7%)  'tt.func' Pipeline
    0.0028 ( 15.4%)    0.0028 ( 34.4%)  Canonicalizer
    0.0001 (  0.8%)    0.0001 (  1.7%)  TritonRewriteTensorPointer
    0.0004 (  2.1%)    0.0004 (  4.7%)  Canonicalizer
    0.0003 (  1.8%)    0.0003 (  4.0%)  TritonCombineOps
    0.0009 (  4.7%)    0.0009 ( 10.6%)  TritonReorderBroadcast
    0.0003 (  1.4%)    0.0003 (  3.2%)  CSE
    0.0000 (  0.0%)    0.0000 (  0.0%)  (A) DominanceInfo
    0.0002 (  1.0%)    0.0002 (  2.3%)  SymbolDCE
    0.0001 (  0.6%)    0.0001 (  1.3%)  TritonLoopUnroll
    0.0081 ( 44.6%)   -0.0019 (-24.1%)  Rest
    0.0181 (100.0%)    0.0081 (100.0%)  Total

===-------------------------------------------------------------------------===
                         ... Execution time report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0222 seconds

  ----User Time----  ----Wall Time----  ----Name----
    0.0013 (  2.8%)    0.0013 (  5.7%)  ConvertTritonToTritonGPU
    0.0052 ( 11.8%)    0.0052 ( 23.6%)  TritonGPUCoalesce
    0.0004 (  0.9%)    0.0004 (  1.8%)  TritonGPUF32DotTC
    0.0002 (  0.5%)    0.0002 (  0.9%)  TritonGPUPlanCTAPass
    0.0060 ( 13.5%)    0.0060 ( 27.1%)  TritonGPURemoveLayoutConversions
    0.0003 (  0.6%)    0.0003 (  1.2%)  TritonGPUOptimizeThreadLocality
    0.0003 (  0.6%)    0.0003 (  1.2%)  TritonGPUAccelerateMatmul
    0.0006 (  1.4%)    0.0006 (  2.9%)  TritonGPURemoveLayoutConversions
    0.0007 (  1.5%)    0.0007 (  3.0%)  TritonGPUOptimizeDotOperands
    0.0007 (  1.5%)    0.0007 (  3.1%)  'any' Pipeline
    0.0007 (  1.5%)    0.0007 (  3.1%)  Canonicalizer
    0.0002 (  0.3%)    0.0002 (  0.7%)  TritonNvidiaGPUOptimizeDescriptorEncodingPass
    0.0004 (  0.9%)    0.0004 (  1.9%)  CSE
    0.0000 (  0.0%)    0.0000 (  0.0%)  (A) DominanceInfo
    0.0002 (  0.5%)    0.0002 (  1.1%)  TritonGPUFuseNestedLoops
    0.0004 (  0.9%)    0.0004 (  1.9%)  Canonicalizer
    0.0002 (  0.5%)    0.0002 (  1.1%)  TritonLoopInvariantCodeMotion
    0.0004 (  0.9%)    0.0004 (  1.8%)  Canonicalizer
    0.0002 (  0.4%)    0.0002 (  0.9%)  TritonGPUCombineTensorSelectAndIf
    0.0004 (  0.9%)    0.0004 (  1.8%)  TritonGPUPipeline
    0.0004 (  0.8%)    0.0004 (  1.6%)  TritonGPUPrefetch
    0.0003 (  0.8%)    0.0003 (  1.5%)  TritonGPUWGMMAPrefetch
    0.0008 (  1.9%)    0.0008 (  3.8%)  TritonGPUOptimizeDotOperands
    0.0003 (  0.8%)    0.0003 (  1.5%)  TritonGPUCoalesceAsyncCopy
    0.0003 (  0.7%)    0.0003 (  1.5%)  TritonNvidiaGPUOptimizeTMemSubtilingPass
    0.0008 (  1.9%)    0.0008 (  3.8%)  TritonGPURemoveLayoutConversions
    0.0002 (  0.4%)    0.0002 (  0.9%)  TritonGPUReduceDataDuplication
    0.0002 (  0.5%)    0.0002 (  0.9%)  TritonGPUReorderInstructions
    0.0001 (  0.3%)    0.0001 (  0.6%)  CSE
    0.0000 (  0.0%)    0.0000 (  0.0%)  (A) DominanceInfo
    0.0003 (  0.7%)    0.0003 (  1.3%)  SymbolDCE
    0.0004 (  0.8%)    0.0004 (  1.6%)  Canonicalizer
    0.0222 ( 49.9%)   -0.0001 ( -0.5%)  Rest
    0.0444 (100.0%)    0.0222 (100.0%)  Total

===-------------------------------------------------------------------------===
                         ... Execution time report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 307.4728 seconds

  ----User Time----  ----Wall Time----  ----Name----
    0.0004 (  0.0%)    0.0004 (  0.0%)  TritonNvidiaGPUMMALoweringPass
    0.0001 (  0.0%)    0.0001 (  0.0%)  TritonGPUCombineTensorSelectAndIf
    0.0001 (  0.0%)    0.0001 (  0.0%)  TritonGPUAllocateWarpGroups
    0.0002 (  0.0%)    0.0002 (  0.0%)  SCFToControlFlowPass
    0.0021 (  0.0%)    0.0021 (  0.0%)  AllocateSharedMemory
    0.0003 (  0.0%)    0.0003 (  0.0%)  TritionTensorMemoryAllocationPass
    0.0002 (  0.0%)    0.0002 (  0.0%)  TritonGPUGlobalScratchAllocationPass
    0.5546 (  0.2%)    0.5546 (  0.2%)  ConvertTritonGPUToLLVM
    3.8323 (  1.2%)    3.8323 (  1.2%)  Canonicalizer
    0.1434 (  0.0%)    0.1434 (  0.0%)  CSE
    0.0000 (  0.0%)    0.0000 (  0.0%)  (A) DominanceInfo
    0.0961 (  0.0%)    0.0961 (  0.0%)  ConvertNVGPUToLLVM
    0.0878 (  0.0%)    0.0878 (  0.0%)  ConvertWarpSpecializeToLLVM
    0.1212 (  0.0%)    0.1212 (  0.0%)  Canonicalizer
    0.0970 (  0.0%)    0.0970 (  0.0%)  CSE
    0.0000 (  0.0%)    0.0000 (  0.0%)  (A) DominanceInfo
    0.0821 (  0.0%)    0.0821 (  0.0%)  SymbolDCE
    0.0364 (  0.0%)    0.0364 (  0.0%)  LLVMDIScope
  307.4728 ( 98.4%)  302.4183 ( 98.4%)  Rest
  312.5272 (100.0%)  307.4728 (100.0%)  Total

===== DONE ===== 352.50017786026
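
To make the imbalance in the last report easier to check, here is a minimal Python sketch that parses one report and returns the slowest named pass together with the fraction of wall time that falls under "Rest". It is not part of MLIR or Triton: `ROW`, `parse_rows`, `summarize`, and `SAMPLE` are hypothetical names introduced here, the row layout is assumed to match the reports above, and `SAMPLE` is only an excerpt of the third report (its "Rest" and "Total" rows plus three passes). It assumes one report is passed at a time.

```python
import re

# One report row: "<user s> (<user %>) <wall s> (<wall %>) <name>".
# The name ends at the next row, at a "===" rule, or at the end of the line,
# so both line-per-row and flattened single-line reports are handled.
ROW = re.compile(
    r"(-?\d+\.\d+)\s*\(\s*(-?\d+\.\d+)%\)\s+"
    r"(-?\d+\.\d+)\s*\(\s*(-?\d+\.\d+)%\)\s+"
    r"(.+?)(?=\s+-?\d+\.\d+\s*\(|\s*===|\s*$)",
    re.MULTILINE,
)


def parse_rows(report):
    """Return a list of (name, wall_seconds) pairs, in report order."""
    return [(m.group(5), float(m.group(3))) for m in ROW.finditer(report)]


def summarize(report):
    """Return (slowest named pass, its wall time, fraction of wall time in 'Rest')."""
    rows = parse_rows(report)
    total = next(wall for name, wall in rows if name == "Total")
    rest = next(wall for name, wall in rows if name == "Rest")
    named = [(n, w) for n, w in rows if n not in ("Rest", "Total")]
    slowest_name, slowest_wall = max(named, key=lambda row: row[1])
    return slowest_name, slowest_wall, rest / total


# Excerpt of the third report above (hypothetical variable name).
SAMPLE = """
  0.5546 (  0.2%)    0.5546 (  0.2%)  ConvertTritonGPUToLLVM
  3.8323 (  1.2%)    3.8323 (  1.2%)  Canonicalizer
  0.0000 (  0.0%)    0.0000 (  0.0%)  (A) DominanceInfo
307.4728 ( 98.4%)  302.4183 ( 98.4%)  Rest
312.5272 (100.0%)  307.4728 (100.0%)  Total
"""

if __name__ == "__main__":
    name, wall, rest_fraction = summarize(SAMPLE)
    print(f"slowest named pass: {name} ({wall:.4f} s)")
    print(f"wall time under 'Rest': {rest_fraction:.1%}")
```

The point of the sketch is only to separate time attributed to named passes from the "Rest" row; it says nothing about where the "Rest" time is actually spent, which would need a profiler or a debugger attached to the compile step.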