CUDA 13.2 Toolkit should already be installed via the NVIDIA installer
Vibe Coding Guide · Mar 9, 2026 · 6 min read


Featured: NVIDIA

How to Build Your First CUDA Tile Kernel in Python with CUDA 13.2

Why this matters for builders
CUDA Tile is NVIDIA’s new higher-level virtual ISA for tile-based parallel programming that lets you express matrix and tensor operations using explicit tile abstractions instead of raw SIMT thread blocks. With CUDA 13.2, the feature now officially supports compute capability 8.x (Ampere and Ada) in addition to Blackwell (10.x/12.x), giving Python-first AI developers immediate access to a cleaner, more portable way to write high-performance kernels for modern GPUs.

This change unlocks faster iteration on custom GEMM, attention, and convolution kernels without fighting low-level PTX or wrestling with fragile shared-memory tiling by hand.

When to use it

  • You are writing performance-critical matrix or tensor kernels in Python (PyTorch, JAX, or Numba-style workflows)
  • You want portable tile abstractions that work across Ampere, Ada, and Blackwell without rewriting kernel logic
  • You need explicit control over data movement between global, shared, and register levels but prefer a higher-level DSL than CUDA C++
  • You are experimenting with new AI operators that map naturally to 2D/3D tiles (FlashAttention-style patterns, block-sparse ops, etc.)
  • You already have access to CUDA 13.2+ and an Ampere-or-newer GPU

The full process

1. Define the goal (30 minutes)

Start by writing a one-paragraph spec. Good example:

“I want a Python function tile_gemm that computes C = A × B for FP16 matrices using CUDA Tile. It should accept M,N,K dimensions that are multiples of the tile size (64×64×16 for starters), use cuTile’s Python DSL to declare tiles, perform the multiply-accumulate in shared memory, and return a PyTorch tensor. Target ≥85% of cuBLAS performance on an RTX 4090.”

This forces you to decide on data types, tile sizes, and success metrics before touching code.
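Those constraints can also live in code. The sketch below turns the spec's dimension rules into a pre-flight check; `validate_gemm_spec` and the tile constants are illustrative names from the spec above, not part of any NVIDIA API:

```python
# Pre-flight check encoding the spec: dimensions must be positive
# multiples of the chosen tile sizes (64x64x16 to start).
TILE_M, TILE_N, TILE_K = 64, 64, 16

def validate_gemm_spec(M: int, N: int, K: int) -> None:
    """Raise ValueError unless (M, N, K) fit the 64x64x16 tiling."""
    for name, dim, tile in (("M", M, TILE_M), ("N", N, TILE_N), ("K", K, TILE_K)):
        if dim <= 0 or dim % tile != 0:
            raise ValueError(f"{name}={dim} must be a positive multiple of {tile}")

validate_gemm_spec(1024, 1024, 512)  # passes: all multiples of the tile sizes
```

Catching a bad problem size here is much cheaper than debugging a wrong-answer kernel later.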

2. Shape the spec/prompt for your AI coding assistant

Use this starter prompt (copy-paste and adapt):

You are an expert CUDA Tile engineer. CUDA 13.2+ is installed and I have an RTX 4090 (Ada, compute 8.9).

Write a complete, self-contained Python script using the cuTile Python DSL that implements a tiled FP16 matrix multiply C = A @ B.

Requirements:
- Use tile sizes 64x64 for M/N and 16 for K
- Declare Tile layouts for global, shared, and register memory explicitly
- Use the new tile-based syntax introduced in CUDA 13.2
- Include proper barriers and synchronization
- Return a torch.Tensor on CUDA
- Add inline comments explaining each tile operation
- Include a correctness check against torch.matmul
- Make it easy to change tile size via a constant

Do not use any deprecated APIs. Prefer the official cuTile Python interface.

3. Scaffold the project

Run these commands:

mkdir cuda-tile-gemm && cd cuda-tile-gemm
python -m venv venv && source venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# pick the wheel index that matches your installed CUDA version; see pytorch.org for current options
pip install numpy

Create the skeleton file tile_gemm.py.
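Before writing any kernel code, it is worth confirming the environment inside the venv. The helper names below are hypothetical, but `torch.cuda.is_available` and `torch.cuda.get_device_capability` are standard PyTorch APIs:

```python
# Environment sanity check: CUDA visible, compute capability >= (8, 0).

def meets_min_capability(cap: tuple[int, int], minimum: tuple[int, int] = (8, 0)) -> bool:
    """True if a (major, minor) compute capability is Ampere-class or newer."""
    return cap >= minimum  # Python compares tuples element-wise

def check_environment() -> None:
    import torch
    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device visible to PyTorch")
    cap = torch.cuda.get_device_capability()
    if not meets_min_capability(cap):
        raise RuntimeError(f"Compute capability {cap} is below (8, 0); CUDA Tile needs Ampere+")
    print(f"OK: compute capability {cap}")
```

An RTX 4090 reports (8, 9), which clears the (8, 0) bar.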

4. Implement with your AI pair programmer

Paste the prompt from step 2 into Cursor, Claude, or your preferred coding LLM. Accept the generated code, then iterate with follow-up prompts such as:

  • “Refactor the inner loop to use the tile.mma intrinsic instead of manual multiply-add”
  • “Add support for arbitrary M,N,K that are not multiples of tile size using padding logic”
  • “Convert this to a PyTorch autograd Function so I can drop it into existing models”
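The second follow-up (arbitrary M, N, K via padding) has a shape-level core you can sketch independently of the DSL. Shown here with NumPy so it runs anywhere; in the real kernel, the tiled launch would replace `np.matmul`, and all function names are illustrative:

```python
import numpy as np

TILE_M, TILE_N, TILE_K = 64, 64, 16

def pad_to_multiple(x: np.ndarray, row_mult: int, col_mult: int) -> np.ndarray:
    """Zero-pad a 2D array up to the next multiple of the tile sizes."""
    rows = -x.shape[0] % row_mult
    cols = -x.shape[1] % col_mult
    return np.pad(x, ((0, rows), (0, cols)))

def gemm_with_padding(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    _, N = B.shape
    Ap = pad_to_multiple(A, TILE_M, TILE_K)
    Bp = pad_to_multiple(B, TILE_K, TILE_N)
    Cp = Ap @ Bp              # stand-in for the tiled kernel launch
    return Cp[:M, :N]         # discard padded rows/columns
```

Zero padding is correct for GEMM because padded rows and columns contribute nothing to the retained part of the product.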

Example generated structure you should expect (simplified):

import torch
from cuda.tile import Tile, TileLayout, tile

TILE_M = TILE_N = 64
TILE_K = 16

def tile_gemm(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "Inner dimension mismatch"
    assert A.dtype == B.dtype == torch.float16, "This kernel expects FP16 inputs"
    
    C = torch.zeros((M, N), dtype=torch.float16, device=A.device)
    
    # Tile declarations using CUDA Tile DSL
    a_tile = Tile(shape=(TILE_M, TILE_K), dtype=torch.float16, layout=TileLayout.ROW_MAJOR)
    b_tile = Tile(shape=(TILE_K, TILE_N), dtype=torch.float16, layout=TileLayout.COL_MAJOR)
    c_tile = Tile(shape=(TILE_M, TILE_N), dtype=torch.float16, layout=TileLayout.ROW_MAJOR)
    
    # Kernel launch would use the new tile kernel decorator or cuTile launcher
    # (exact syntax available in the CUDA 13.2 cuTile Python docs)
    
    return C

Note: The exact decorator and launch syntax is new in 13.2. Always check the official cuTile Python reference in the CUDA Toolkit documentation for the latest @tile.kernel usage.
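Because that launch syntax may shift between releases, it helps to keep a plain CPU reference that mirrors the same tiling structure. The loop nest below (all names illustrative, NumPy stand-in) is the logic the tile DSL expresses declaratively, and doubles as a correctness oracle:

```python
import numpy as np

TILE_M = TILE_N = 64
TILE_K = 16

def reference_tiled_gemm(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """CPU reference with the same tile loop structure as the GPU kernel."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE_M == 0 and N % TILE_N == 0 and K % TILE_K == 0
    C = np.zeros((M, N), dtype=np.float32)    # FP32 accumulator, as on the GPU
    for i in range(0, M, TILE_M):             # one tile block per (i, j)
        for j in range(0, N, TILE_N):
            acc = np.zeros((TILE_M, TILE_N), dtype=np.float32)
            for k in range(0, K, TILE_K):     # K-loop accumulates partial products
                a_tile = A[i:i+TILE_M, k:k+TILE_K].astype(np.float32)
                b_tile = B[k:k+TILE_K, j:j+TILE_N].astype(np.float32)
                acc += a_tile @ b_tile        # the role tile.mma plays on the GPU
            C[i:i+TILE_M, j:j+TILE_N] = acc
    return C
```

Every tile concept in the DSL (tile load, MMA, accumulate, store) has a line-for-line counterpart here, which makes divergences easy to localize.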

5. Validate rigorously

Create a test harness:

def test_tile_gemm():
    torch.manual_seed(42)
    M, N, K = 1024, 1024, 512
    A = torch.randn(M, K, dtype=torch.float16, device='cuda')
    B = torch.randn(K, N, dtype=torch.float16, device='cuda')
    
    C_ref = torch.matmul(A, B)
    C_test = tile_gemm(A, B)
    
    max_diff = torch.max(torch.abs(C_ref - C_test)).item()
    rel_err = max_diff / torch.max(torch.abs(C_ref)).item()
    print(f"Max absolute error: {max_diff:.2e} (relative: {rel_err:.2e})")
    # FP16 accumulation error grows with K, so compare relative error
    # rather than a fixed absolute threshold
    assert rel_err < 1e-2, "Numerical mismatch"
    print("✅ Tile GEMM test passed")

Profile with Nsight Systems (nsys) or Nsight Compute (ncu) — nvprof does not support recent GPU architectures — to confirm you are actually using the tile hardware instructions and not falling back to SIMT.

6. Ship it safely

  • Wrap the kernel in a proper torch.utils.cpp_extension or use the pure-Python cuTile launcher if available
  • Document the CUDA ≥ 13.2 requirement prominently in the README (pip cannot pin the system toolkit), and verify the toolkit version at import time with a clear error message
  • Publish as a small reusable package (pip install cuda-tile-gemm) with clear architecture requirements in the README
  • Include a performance comparison table against torch.matmul and cuBLAS
  • Add a GitHub workflow that runs on an Ampere or Ada runner (GitHub now offers GPU runners)

Pitfalls and guardrails

What if the kernel only runs on Blackwell?

CUDA 13.2 explicitly added support for Ampere (8.0+) and Ada (8.9). Make sure you are running a driver new enough for CUDA 13.x (the r580 series or later) and that torch.cuda.get_device_capability() returns ≥ (8, 0). If you still see “architecture not supported” errors, run nvcc --version to confirm the 13.2 toolkit is active.

What if I get “tile layout not compatible” errors?

Tile layouts (row-major vs. col-major) must match the tensor’s memory format. Always declare the layout explicitly, and verify with the tile.check_layout_compatibility() helper when available.

What if performance is worse than torch.matmul?

Start with the exact tile sizes recommended in the CUDA Tile programming guide (usually 64×64×16 for FP16). Smaller tiles increase launch overhead; larger tiles may exceed shared memory. Profile with Nsight Compute and look at “Tile Instruction Throughput”.
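Before profiling, you can rule out the "larger tiles exceed shared memory" failure mode with plain arithmetic. The per-block limit below is an assumption (Ampere/Ada allow roughly 99–100 KB of opt-in shared memory per block); confirm your GPU's actual limit with deviceQuery or Nsight:

```python
# Estimate the shared-memory footprint of one tile block for FP16 inputs,
# assuming double-buffered A and B tiles (stages=2).
BYTES_FP16 = 2
SMEM_LIMIT = 99 * 1024  # assumed per-block limit; check your GPU

def smem_bytes(tile_m: int, tile_n: int, tile_k: int, stages: int = 2) -> int:
    """Bytes of shared memory for staged A (m x k) and B (k x n) tiles."""
    a = tile_m * tile_k * BYTES_FP16
    b = tile_k * tile_n * BYTES_FP16
    return stages * (a + b)

for m, n, k in [(64, 64, 16), (128, 128, 32), (256, 256, 64)]:
    used = smem_bytes(m, n, k)
    print(f"{m}x{n}x{k}: {used} B ({'fits' if used <= SMEM_LIMIT else 'too big'})")
```

The recommended 64×64×16 configuration uses only 8 KB under these assumptions, leaving headroom for deeper pipelining; 256×256×64 already blows the budget.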

What if the Python DSL is still experimental?

It is: cuTile Python is in its first release, so treat it as a preview. Pin your toolkit version, expect syntax changes, and keep a reference SIMT implementation in the same file so you can fall back during development.
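One way to keep that fallback usable is a small dispatcher that tries the experimental kernel and quietly drops to the reference path when it fails. The structure below is a sketch with NumPy stand-ins and illustrative names:

```python
import numpy as np

def experimental_tile_gemm(A, B):
    # Placeholder for the cuTile kernel; raises while the DSL is unstable.
    raise NotImplementedError("cuTile path not wired up yet")

def gemm(A: np.ndarray, B: np.ndarray, use_tile: bool = True) -> np.ndarray:
    """Try the tile kernel first; fall back to the reference implementation."""
    if use_tile:
        try:
            return experimental_tile_gemm(A, B)
        except (NotImplementedError, RuntimeError):
            pass              # fall back silently during development
    return A @ B              # reference path (np.matmul / torch.matmul)
```

In production you would log the fallback rather than swallow it, but during development this keeps every call site working while the tile path stabilizes.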

What to do next

  1. Replace the hardcoded tile sizes with a configurable TileConfig dataclass
  2. Add support for FP8 and TF32 data types
  3. Implement a FlashAttention-style tiled kernel using the same pattern
  4. Benchmark on both Ada and Blackwell to validate portability
  5. Open-source the repo with a “Built with CUDA Tile 13.2” badge

