https://github.com/NVIDIA/cutile-python/blob/main/src/cuda/tile/_compile.py