2025-12-05T07:55:12Z
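This may be a bug in MLIR.
I enabled pass timing with enable_timing() and got the three execution time reports below. In the last one, the listed passes take only about 5 seconds in total; almost all of the 307 seconds (98.4% of wall time) is attributed to "Rest", i.e. time not spent inside any listed pass.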
===-------------------------------------------------------------------------===
... Execution time report ...
===-------------------------------------------------------------------------===
Total Execution Time: 0.0081 seconds
----User Time---- ----Wall Time---- ----Name----
0.0050 ( 27.4%) 0.0050 ( 61.5%) Inliner
0.0002 ( 1.0%) 0.0002 ( 2.2%) (A) CallGraph
0.0028 ( 15.5%) 0.0028 ( 34.7%) 'tt.func' Pipeline
0.0028 ( 15.4%) 0.0028 ( 34.4%) Canonicalizer
0.0001 ( 0.8%) 0.0001 ( 1.7%) TritonRewriteTensorPointer
0.0004 ( 2.1%) 0.0004 ( 4.7%) Canonicalizer
0.0003 ( 1.8%) 0.0003 ( 4.0%) TritonCombineOps
0.0009 ( 4.7%) 0.0009 ( 10.6%) TritonReorderBroadcast
0.0003 ( 1.4%) 0.0003 ( 3.2%) CSE
0.0000 ( 0.0%) 0.0000 ( 0.0%) (A) DominanceInfo
0.0002 ( 1.0%) 0.0002 ( 2.3%) SymbolDCE
0.0001 ( 0.6%) 0.0001 ( 1.3%) TritonLoopUnroll
0.0081 ( 44.6%) -0.0019 (-24.1%) Rest
0.0181 (100.0%) 0.0081 (100.0%) Total
===-------------------------------------------------------------------------===
... Execution time report ...
===-------------------------------------------------------------------------===
Total Execution Time: 0.0222 seconds
----User Time---- ----Wall Time---- ----Name----
0.0013 ( 2.8%) 0.0013 ( 5.7%) ConvertTritonToTritonGPU
0.0052 ( 11.8%) 0.0052 ( 23.6%) TritonGPUCoalesce
0.0004 ( 0.9%) 0.0004 ( 1.8%) TritonGPUF32DotTC
0.0002 ( 0.5%) 0.0002 ( 0.9%) TritonGPUPlanCTAPass
0.0060 ( 13.5%) 0.0060 ( 27.1%) TritonGPURemoveLayoutConversions
0.0003 ( 0.6%) 0.0003 ( 1.2%) TritonGPUOptimizeThreadLocality
0.0003 ( 0.6%) 0.0003 ( 1.2%) TritonGPUAccelerateMatmul
0.0006 ( 1.4%) 0.0006 ( 2.9%) TritonGPURemoveLayoutConversions
0.0007 ( 1.5%) 0.0007 ( 3.0%) TritonGPUOptimizeDotOperands
0.0007 ( 1.5%) 0.0007 ( 3.1%) 'any' Pipeline
0.0007 ( 1.5%) 0.0007 ( 3.1%) Canonicalizer
0.0002 ( 0.3%) 0.0002 ( 0.7%) TritonNvidiaGPUOptimizeDescriptorEncodingPass
0.0004 ( 0.9%) 0.0004 ( 1.9%) CSE
0.0000 ( 0.0%) 0.0000 ( 0.0%) (A) DominanceInfo
0.0002 ( 0.5%) 0.0002 ( 1.1%) TritonGPUFuseNestedLoops
0.0004 ( 0.9%) 0.0004 ( 1.9%) Canonicalizer
0.0002 ( 0.5%) 0.0002 ( 1.1%) TritonLoopInvariantCodeMotion
0.0004 ( 0.9%) 0.0004 ( 1.8%) Canonicalizer
0.0002 ( 0.4%) 0.0002 ( 0.9%) TritonGPUCombineTensorSelectAndIf
0.0004 ( 0.9%) 0.0004 ( 1.8%) TritonGPUPipeline
0.0004 ( 0.8%) 0.0004 ( 1.6%) TritonGPUPrefetch
0.0003 ( 0.8%) 0.0003 ( 1.5%) TritonGPUWGMMAPrefetch
0.0008 ( 1.9%) 0.0008 ( 3.8%) TritonGPUOptimizeDotOperands
0.0003 ( 0.8%) 0.0003 ( 1.5%) TritonGPUCoalesceAsyncCopy
0.0003 ( 0.7%) 0.0003 ( 1.5%) TritonNvidiaGPUOptimizeTMemSubtilingPass
0.0008 ( 1.9%) 0.0008 ( 3.8%) TritonGPURemoveLayoutConversions
0.0002 ( 0.4%) 0.0002 ( 0.9%) TritonGPUReduceDataDuplication
0.0002 ( 0.5%) 0.0002 ( 0.9%) TritonGPUReorderInstructions
0.0001 ( 0.3%) 0.0001 ( 0.6%) CSE
0.0000 ( 0.0%) 0.0000 ( 0.0%) (A) DominanceInfo
0.0003 ( 0.7%) 0.0003 ( 1.3%) SymbolDCE
0.0004 ( 0.8%) 0.0004 ( 1.6%) Canonicalizer
0.0222 ( 49.9%) -0.0001 ( -0.5%) Rest
0.0444 (100.0%) 0.0222 (100.0%) Total
===-------------------------------------------------------------------------===
... Execution time report ...
===-------------------------------------------------------------------------===
Total Execution Time: 307.4728 seconds
----User Time---- ----Wall Time---- ----Name----
0.0004 ( 0.0%) 0.0004 ( 0.0%) TritonNvidiaGPUMMALoweringPass
0.0001 ( 0.0%) 0.0001 ( 0.0%) TritonGPUCombineTensorSelectAndIf
0.0001 ( 0.0%) 0.0001 ( 0.0%) TritonGPUAllocateWarpGroups
0.0002 ( 0.0%) 0.0002 ( 0.0%) SCFToControlFlowPass
0.0021 ( 0.0%) 0.0021 ( 0.0%) AllocateSharedMemory
0.0003 ( 0.0%) 0.0003 ( 0.0%) TritionTensorMemoryAllocationPass
0.0002 ( 0.0%) 0.0002 ( 0.0%) TritonGPUGlobalScratchAllocationPass
0.5546 ( 0.2%) 0.5546 ( 0.2%) ConvertTritonGPUToLLVM
3.8323 ( 1.2%) 3.8323 ( 1.2%) Canonicalizer
0.1434 ( 0.0%) 0.1434 ( 0.0%) CSE
0.0000 ( 0.0%) 0.0000 ( 0.0%) (A) DominanceInfo
0.0961 ( 0.0%) 0.0961 ( 0.0%) ConvertNVGPUToLLVM
0.0878 ( 0.0%) 0.0878 ( 0.0%) ConvertWarpSpecializeToLLVM
0.1212 ( 0.0%) 0.1212 ( 0.0%) Canonicalizer
0.0970 ( 0.0%) 0.0970 ( 0.0%) CSE
0.0000 ( 0.0%) 0.0000 ( 0.0%) (A) DominanceInfo
0.0821 ( 0.0%) 0.0821 ( 0.0%) SymbolDCE
0.0364 ( 0.0%) 0.0364 ( 0.0%) LLVMDIScope
307.4728 ( 98.4%) 302.4183 ( 98.4%) Rest
312.5272 (100.0%) 307.4728 (100.0%) Total
===== DONE =====
352.50017786026
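To check how much time falls outside the named passes, a small helper can parse these reports. This is a hypothetical sketch, not part of MLIR or Triton: it assumes each row has the shape "<user> ( <pct>%) <wall> ( <pct>%) <name>" as pasted above. The names parse_report and rest_fraction are illustrative only.

```python
import re

# One row of an MLIR "Execution time report":
#   <user s> ( <pct>%) <wall s> ( <pct>%) <name>
# The "Rest" row can have a negative wall time, so a leading minus is allowed.
ROW = re.compile(
    r"^\s*(-?\d+\.\d+)\s*\(\s*-?\d+\.\d+%\)\s+"
    r"(-?\d+\.\d+)\s*\(\s*-?\d+\.\d+%\)\s+(.+?)\s*$"
)


def parse_report(text):
    """Return (name, user_seconds, wall_seconds) for every timing row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header lines and separators simply do not match
            user, wall, name = m.groups()
            rows.append((name, float(user), float(wall)))
    return rows


def rest_fraction(rows):
    """Fraction of total wall time that the report attributes to 'Rest'."""
    wall = {name: w for name, _, w in rows}
    return wall["Rest"] / wall["Total"]


# Excerpt of the third report above.
sample = """\
3.8323 ( 1.2%) 3.8323 ( 1.2%) Canonicalizer
307.4728 ( 98.4%) 302.4183 ( 98.4%) Rest
312.5272 (100.0%) 307.4728 (100.0%) Total
"""

rows = parse_report(sample)
print(f"{rest_fraction(rows):.1%}")  # prints 98.4%
```

Feeding the full third report through parse_report gives the same picture: only the "Rest" row accounts for a large share of the wall time.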