Title:
How to Build Your First CUDA Tile Kernel in Python with CUDA 13.2
Why this matters for builders
CUDA Tile is NVIDIA’s new higher-level virtual ISA for tile-based parallel programming that lets you express matrix and tensor operations using explicit tile abstractions instead of raw SIMT thread blocks. With CUDA 13.2, the feature now officially supports compute capability 8.x (Ampere and Ada) in addition to Blackwell (10.x/12.x), giving Python-first AI developers immediate access to a cleaner, more portable way to write high-performance kernels for modern GPUs.
This change unlocks faster iteration on custom GEMM, attention, and convolution kernels without fighting low-level PTX or wrestling with fragile shared-memory tiling by hand.
When to use it
- You are writing performance-critical matrix or tensor kernels in Python (PyTorch, JAX, or Numba-style workflows)
- You want portable tile abstractions that work across Ampere, Ada, and Blackwell without rewriting kernel logic
- You need explicit control over data movement between global, shared, and register levels but prefer a higher-level DSL than CUDA C++
- You are experimenting with new AI operators that map naturally to 2D/3D tiles (FlashAttention-style patterns, block-sparse ops, etc.)
- You already have access to CUDA 13.2+ and an Ampere-or-newer GPU
The full process
1. Define the goal (30 minutes)
Start by writing a one-paragraph spec. Good example:
“I want a Python function tile_gemm that computes C = A × B for FP16 matrices using CUDA Tile. It should accept M, N, K dimensions that are multiples of the tile size (64×64×16 for starters), use cuTile’s Python DSL to declare tiles, perform the multiply-accumulate in shared memory, and return a PyTorch tensor. Target ≥85% of cuBLAS performance on an RTX 4090.”
This forces you to decide on data types, tile sizes, and success metrics before touching code.
2. Shape the spec/prompt for your AI coding assistant
Use this starter prompt (copy-paste and adapt):
You are an expert CUDA Tile engineer. CUDA 13.2+ is installed and I have an RTX 4090 (Ada, compute 8.9).
Write a complete, self-contained Python script using the cuTile Python DSL that implements a tiled FP16 matrix multiply C = A @ B.
Requirements:
- Use tile sizes 64x64 for M/N and 16 for K
- Declare Tile layouts for global, shared, and register memory explicitly
- Use the new tile-based syntax introduced in CUDA 13.2
- Include proper barriers and synchronization
- Return a torch.Tensor on CUDA
- Add inline comments explaining each tile operation
- Include a correctness check against torch.matmul
- Make it easy to change tile size via a constant
Do not use any deprecated APIs. Prefer the official cuTile Python interface.
3. Scaffold the project
Run these commands:
mkdir cuda-tile-gemm && cd cuda-tile-gemm
python -m venv venv && source venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install numpy
Create the skeleton file tile_gemm.py.
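Before writing any kernel code, confirm which architecture you are targeting. The helper below maps a (major, minor) compute-capability tuple (what torch.cuda.get_device_capability() returns) to the architecture names used in this guide; the mapping is an informal sketch of NVIDIA's public compute-capability table, not an official API.

```python
def describe_capability(cap):
    """Return an architecture name for a (major, minor) capability tuple."""
    major, minor = cap
    if major == 8:
        # 8.9 is Ada (RTX 40-series); other 8.x parts are Ampere
        return "Ada" if minor == 9 else "Ampere"
    if major in (10, 12):
        return "Blackwell"
    return "other"

# Usage on a machine with PyTorch and a CUDA GPU:
#   import torch
#   print(describe_capability(torch.cuda.get_device_capability()))
```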
4. Implement with your AI pair programmer
Paste the prompt from step 2 into Cursor, Claude, or your preferred coding LLM. Accept the generated code, then iterate with follow-up prompts such as:
- “Refactor the inner loop to use the tile.mma intrinsic instead of manual multiply-add”
- “Add support for arbitrary M, N, K that are not multiples of the tile size using padding logic”
- “Convert this to a PyTorch autograd Function so I can drop it into existing models”
Example generated structure you should expect (simplified):
import torch
from cuda.tile import Tile, TileLayout, tile  # import path per the cuTile Python docs

TILE_M = TILE_N = 64
TILE_K = 16

def tile_gemm(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "Inner dimension mismatch"
    C = torch.zeros((M, N), dtype=torch.float16, device=A.device)

    # Tile declarations using the CUDA Tile DSL
    a_tile = Tile(shape=(TILE_M, TILE_K), dtype=torch.float16, layout=TileLayout.ROW_MAJOR)
    b_tile = Tile(shape=(TILE_K, TILE_N), dtype=torch.float16, layout=TileLayout.COL_MAJOR)
    c_tile = Tile(shape=(TILE_M, TILE_N), dtype=torch.float16, layout=TileLayout.ROW_MAJOR)

    # Kernel launch would use the new tile kernel decorator or cuTile launcher
    # (exact syntax available in the CUDA 13.2 cuTile Python docs)
    return C
Note: The exact decorator and launch syntax is new in 13.2. Always check the official cuTile Python reference in the CUDA Toolkit documentation for the latest @tile.kernel usage.
5. Validate rigorously
Create a test harness:
def test_tile_gemm():
    torch.manual_seed(42)
    M, N, K = 1024, 1024, 512
    A = torch.randn(M, K, dtype=torch.float16, device='cuda')
    B = torch.randn(K, N, dtype=torch.float16, device='cuda')
    C_ref = torch.matmul(A, B)
    C_test = tile_gemm(A, B)
    # FP16 accumulation error grows with K, so compare in FP32
    # and gate on relative rather than absolute error
    max_diff = torch.max(torch.abs(C_ref.float() - C_test.float())).item()
    rel_diff = max_diff / torch.max(torch.abs(C_ref.float())).item()
    print(f"Max absolute error: {max_diff:.2e} (relative: {rel_diff:.2e})")
    assert rel_diff < 1e-2, "Numerical mismatch"
    print("✅ Tile GEMM test passed")
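To judge the spec’s “≥85% of cuBLAS” target, convert measured kernel time into throughput. A GEMM performs 2·M·N·K floating-point operations, so a small helper (hypothetical, not part of cuTile) turns milliseconds into TFLOPS:

```python
def gemm_tflops(M: int, N: int, K: int, elapsed_ms: float) -> float:
    """Achieved TFLOPS for one C = A @ B of shape (M, K) x (K, N)."""
    flops = 2.0 * M * N * K          # one multiply + one add per MAC
    return flops / (elapsed_ms * 1e-3) / 1e12

# Example: the 1024 x 1024 x 512 test case finishing in 0.1 ms
# corresponds to roughly 10.74 TFLOPS.
```

Measure elapsed_ms with a pair of torch.cuda.Event(enable_timing=True) events recorded around the launch, calling torch.cuda.synchronize() before reading start.elapsed_time(end).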
Profile with Nsight Systems (nsys) or Nsight Compute (ncu) to confirm you are actually using the tile hardware instructions and not falling back to SIMT; the legacy nvprof tool does not support compute capability 8.0+ devices.
6. Ship it safely
- Wrap the kernel in a proper torch.utils.cpp_extension build, or use the pure-Python cuTile launcher if available
- Add a requirements.txt with a CUDA version pin: cuda-toolkit>=13.2
- Publish as a small reusable package (e.g. pip install cuda-tile-gemm) with clear architecture requirements in the README
- Include a performance comparison table against torch.matmul and cuBLAS
- Add a GitHub workflow that runs on an Ampere or Ada runner (GitHub now offers GPU runners)
Pitfalls and guardrails
What if the kernel only runs on Blackwell?
CUDA 13.2 explicitly added support for Ampere (8.0+) and Ada (8.9). Make sure you are using the latest driver (≥570) and that torch.cuda.get_device_capability() returns ≥ (8,0). If you still see “architecture not supported” errors, run nvcc --version to confirm the 13.2 toolkit is active.
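Because Python compares tuples lexicographically, the ≥ (8, 0) capability gate is a one-liner. A minimal sketch (the constant name is made up for this guide):

```python
MIN_TILE_CAPABILITY = (8, 0)  # per the CUDA 13.2 support floor described above

def supports_cuda_tile(cap) -> bool:
    """True if a (major, minor) capability tuple meets the 8.0 floor.

    Tuple comparison is lexicographic, so (8, 9) >= (8, 0) and
    (7, 5) < (8, 0) behave as expected.
    """
    return tuple(cap) >= MIN_TILE_CAPABILITY
```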
What if I get “tile layout not compatible” errors?
Tile layouts (row-major vs col-major) must match the tensor memory format. Always declare the layout explicitly and verify with tile.check_layout_compatibility() helper when available.
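One way to catch layout mismatches before launch is to inspect strides. The helper below is a hypothetical sketch using torch-style strides (measured in elements, as tensor.stride() returns); it classifies a 2D tensor as row- or column-major so you can compare against the Tile layout you declared:

```python
def infer_layout(shape, strides):
    """Classify a 2D (shape, strides) pair; strides are in elements."""
    rows, cols = shape
    if strides == (cols, 1):
        return "ROW_MAJOR"   # contiguous rows
    if strides == (1, rows):
        return "COL_MAJOR"   # contiguous columns (e.g. a transposed view)
    return "STRIDED"         # neither; copy or call .contiguous() first

# A (64, 16) row-major tensor has strides (16, 1); its transpose
# view has shape (16, 64) with strides (1, 16).
```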
What if performance is worse than torch.matmul?
Start with the exact tile sizes recommended in the CUDA Tile programming guide (usually 64×64×16 for FP16). Smaller tiles increase launch overhead; larger tiles may exceed shared memory. Profile with Nsight Compute and look at “Tile Instruction Throughput”.
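You can estimate whether a candidate tile size fits in shared memory with simple arithmetic. The sketch below counts only the A and B staging tiles in FP16 (2 bytes per element) plus an optional double-buffering factor; real kernels add padding and extra buffers, so treat the result as a lower bound:

```python
def tile_smem_bytes(tm, tn, tk, dtype_bytes=2, double_buffered=True):
    """Lower-bound shared-memory footprint of one (tm x tn x tk) GEMM stage."""
    a_bytes = tm * tk * dtype_bytes   # A staging tile
    b_bytes = tk * tn * dtype_bytes   # B staging tile
    total = a_bytes + b_bytes
    return total * 2 if double_buffered else total

# The guide's 64 x 64 x 16 FP16 tiles need 4 KiB single-buffered
# (8 KiB double-buffered), comfortably under the 48 KiB default
# static shared-memory limit per block.
```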
What if the Python DSL is still experimental?
It is: the cuTile Python DSL is an early release. Treat it as a preview, and keep a reference SIMT implementation in the same file so you can fall back during development.
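The fallback can be a thin dispatcher. In this sketch, tile_impl stands in for the experimental tile_gemm and ref_impl for torch.matmul; both are passed as callables so the pattern itself works without a GPU:

```python
def gemm_with_fallback(A, B, tile_impl, ref_impl):
    """Prefer the experimental tile kernel; fall back to the reference
    path if the preview DSL raises (e.g. unsupported arch or layout)."""
    try:
        return tile_impl(A, B)
    except (NotImplementedError, RuntimeError):
        return ref_impl(A, B)

# Usage: C = gemm_with_fallback(A, B, tile_gemm, torch.matmul)
```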
What to do next
- Replace the hardcoded tile sizes with a configurable TileConfig dataclass
- Add support for FP8 and TF32 data types
- Implement a FlashAttention-style tiled kernel using the same pattern
- Benchmark on both Ada and Blackwell to validate portability
- Open-source the repo with a “Built with CUDA Tile 13.2” badge
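As a starting point for the first item, a TileConfig could look like this (a hypothetical sketch for this guide, not a cuTile API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TileConfig:
    """Tile-size bundle for the GEMM in this guide."""
    tile_m: int = 64
    tile_n: int = 64
    tile_k: int = 16

    def validate(self, M: int, N: int, K: int) -> None:
        """Until padding support lands, require divisible problem sizes."""
        if M % self.tile_m or N % self.tile_n or K % self.tile_k:
            raise ValueError(
                f"({M}, {N}, {K}) not divisible by "
                f"({self.tile_m}, {self.tile_n}, {self.tile_k})"
            )
```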
Sources
- Original announcement: https://developer.nvidia.com/blog/cuda-13-2-introduces-enhanced-cuda-tile-support-and-new-python-features/
- CUDA Tile landing page: https://developer.nvidia.com/cuda/tile
- CUDA 13.1 Tile introduction (foundational concepts): https://developer.nvidia.com/blog/nvidia-cuda-13-1-powers-next-gen-gpu-programming-with-nvidia-cuda-tile-and-performance-gains/
- NVIDIA Developer Blog on CUDA Toolkit features

