Import AI 448: AI R&D; Bytedance's CUDA-writing agent; on-device satellite AI
Vibe Coding Guide · Mar 9, 2026 · 6 min read

Build Faster GPU Kernels with ByteDance’s CUDA Agent: A Vibe Coding Guide for AI Builders

Why this matters for builders

ByteDance and Tsinghua University released CUDA Agent — a reinforcement-learning system that trains a model to write optimized CUDA kernels by directly rewarding measured GPU speed rather than just code correctness. This is one of the first public demonstrations of an AI that treats kernel performance as the primary reward signal, closing the loop between code generation and real hardware execution.

For builders, this changes the game. Instead of manually iterating on PTX, shared memory layouts, and warp-level primitives, you can now use an AI coding assistant to explore the optimization space at superhuman speed. It unlocks practical workflows for anyone shipping latency-sensitive models, custom operators, or on-device inference where every microsecond and watt counts.

When to use it

Use this approach when you need:

  • Custom fused operators that PyTorch/Triton can’t match
  • Extreme low-latency inference on consumer or edge GPUs
  • Research prototypes that require novel kernel patterns
  • Performance-critical code where “good enough” is no longer acceptable

Skip it for standard training loops, simple data pipelines, or when you’re still exploring model architecture.

The full process

Here’s a reliable, repeatable workflow that experienced vibe coders can follow to ship real performance wins using today’s AI coding tools + the lessons from CUDA Agent.

1. Define the goal (30 minutes)

Be brutally specific. Bad goal: “make this faster.”
Good goal: “Reduce the latency of my FlashAttention-style kernel for sequence length 8192 and head dimension 128 from 0.42ms to under 0.28ms on an RTX 4090 while keeping numerical equivalence within 1e-5.”

Write this down in a README.md or a Notion page. Include:

  • Target hardware (compute capability, memory bandwidth)
  • Input shapes and data types
  • Success metric (wall-clock time + correctness)
  • Baseline implementation (usually a clean PyTorch or Triton reference)
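The goal spec above is worth capturing as a small data structure so every later benchmark run records exactly what it was measured against. A minimal sketch — the class and field names here are illustrative, not part of CUDA Agent; the values come from this guide's running example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelGoal:
    target_gpu: str      # e.g. "RTX 4090 (sm_89)"
    shape: tuple         # (batch, seq_len, head_dim)
    dtype: str
    baseline_ms: float   # measured latency of the clean reference
    target_ms: float     # what "done" means
    tolerance: float     # max allowed numerical error vs. reference

# The running example from this guide:
goal = KernelGoal("RTX 4090 (sm_89)", (4, 8192, 128), "float16",
                  baseline_ms=0.42, target_ms=0.28, tolerance=1e-5)
```

Dropping this into your README (or printing it at the top of every benchmark log) keeps the target from drifting as you iterate.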

2. Shape the spec/prompt (15 minutes)

Craft a system prompt that mirrors the reward philosophy of CUDA Agent: correctness is table stakes; speed is the real objective.

Starter system prompt (copy-paste and adapt):

You are an expert CUDA kernel engineer working at the level of ByteDance's CUDA Agent. 
Your goal is to generate highly optimized CUDA kernels that maximize measured GPU performance.

Rules:
- First ensure numerical correctness against the reference implementation within 1e-5.
- Then aggressively optimize for speed: maximize occupancy, minimize register pressure, use shared memory wisely, leverage tensor cores when possible, tune block sizes for the target GPU.
- Always output both the kernel AND a complete PyTorch extension wrapper with setup.py.
- After each version, suggest the exact nvcc flags and benchmark command I should run.
- Think step-by-step about memory access patterns and warp divergence before writing code.

Current task: [paste your exact goal here]

3. Scaffold the project (10 minutes)

Create a minimal, testable structure:

my_kernel/
├── kernel.cu
├── test.py
├── benchmark.py
├── setup.py
├── reference.py          # pure PyTorch or Triton baseline
└── README.md

Use a modern setup.py with torch.utils.cpp_extension:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='my_kernel',
    ext_modules=[CUDAExtension('my_kernel', ['kernel.cu'])],
    cmdclass={'build_ext': BuildExtension}
)

4. Implement with your AI coding assistant

Start simple. Feed the reference implementation and your goal prompt into Cursor, Claude, or your preferred coding LLM.

Iterate in tight loops:

  1. Ask for kernel v1
  2. Run correctness test
  3. Run benchmark
  4. Feed the numbers back into the next prompt: “Version 1 achieved 0.41ms. Beat it.”

Good follow-up prompt example:

Previous kernel runs at 0.41ms on RTX 4090 for shape (4, 8192, 128). 
The reference PyTorch version is 0.62ms. Generate v2 that improves performance by at least 25% while passing the correctness test within 1e-5. 
Focus on better shared memory tiling and double buffering.
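To keep the feedback loop mechanical, you can generate these follow-up prompts straight from your benchmark numbers. A small sketch — the function name and wording are my own, not part of CUDA Agent:

```python
def feedback_prompt(version: int, measured_ms: float, reference_ms: float,
                    shape: tuple, improvement: float = 0.25,
                    tol: float = 1e-5) -> str:
    """Turn the latest benchmark result into the next iteration's prompt."""
    target_ms = measured_ms * (1 - improvement)
    return (
        f"Version {version} runs at {measured_ms:.2f}ms for shape {shape}. "
        f"The reference implementation is {reference_ms:.2f}ms. "
        f"Generate v{version + 1} that gets under {target_ms:.2f}ms "
        f"while passing the correctness test within {tol}."
    )

print(feedback_prompt(1, 0.41, 0.62, (4, 8192, 128)))
```

Piping every measurement through one template means no iteration silently loses the correctness constraint.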

5. Validate rigorously

Never ship on one benchmark run. Create this checklist:

Correctness

  • Forward pass matches reference within tolerance
  • Backward pass (if needed) also matches
  • Test multiple batch sizes and sequence lengths

Performance

  • Use torch.utils.benchmark or Nsight Compute/Nsight Systems (nvprof is deprecated on recent GPU generations)
  • Warm up properly (at least 100 iterations)
  • Measure both average and p95 latency
  • Test on target hardware and at least one different GPU

Safety

  • Check for illegal memory access with compute-sanitizer
  • Verify register and shared memory usage with nvcc --ptxas-options=-v

Example benchmark snippet:

import torch
from torch.utils.benchmark import Timer

def benchmark(fn, label):
    # Shapes from the running example: (batch=4, seq_len=8192, head_dim=128)
    t = Timer(
        stmt='fn(x, y)',
        setup='x = torch.randn(4, 8192, 128, device="cuda"); '
              'y = torch.randn(4, 8192, 128, device="cuda")',
        # setup/stmt execute in this namespace, so torch must be passed in too
        globals={'fn': fn, 'torch': torch},
    )
    print(f"{label}: {t.timeit(100)}")
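The checklist above asks for both average and p95 latency, which `timeit` alone doesn't report. A small, GPU-agnostic sketch of collecting both — on CUDA, you'd bracket each call with `torch.cuda.synchronize()` so you time the kernel rather than the asynchronous launch:

```python
import time

def latency_stats(fn, iters=200, warmup=100):
    """Return (avg_ms, p95_ms) for fn(); warm up first, per the checklist."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()  # on GPU: torch.cuda.synchronize() before and after the call
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    avg_ms = sum(samples) / len(samples)
    p95_ms = samples[int(0.95 * len(samples)) - 1]
    return avg_ms, p95_ms
```

A kernel whose p95 is far above its average usually points at clock throttling or contention — worth knowing before you ship.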

6. Ship it safely

Once you have a winner:

  • Add clear comments explaining the key optimizations
  • Include the exact benchmark numbers in the README
  • Version the kernel (e.g. flash_attn_v2_bytedance_style)
  • Consider open-sourcing the pattern so others can learn
  • Write a short technical note explaining why this particular tiling or instruction mix worked

Pitfalls and guardrails

Vibe coders commonly get stuck on:

  • Accepting plausible-looking but incorrect kernels (always validate numerically first)
  • Over-optimizing for one specific shape and breaking generality
  • Ignoring register pressure — the AI will happily use 256 registers if you don’t constrain it
  • Forgetting to tune launch parameters (blockDim, gridDim)
  • Benchmarking without proper warm-up or with CUDA caching effects

Guardrail: Never accept a kernel that passes correctness but is slower than the baseline. The CUDA Agent philosophy is speed-first after correctness.
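That guardrail is easy to automate: gate every candidate on correctness first, speed second. A pure-Python sketch of the acceptance test (the function name is my own; on real tensors you'd swap the hand-rolled error check for `torch.testing.assert_close` and the timing loop for the benchmark harness above... er, for `torch.utils.benchmark`):

```python
import time

def accept_version(candidate, reference, inputs, baseline_ms,
                   tol=1e-5, iters=100):
    """Reject a kernel that is wrong OR slower than baseline:
    speed-first, but only after correctness."""
    ref_out = reference(*inputs)
    cand_out = candidate(*inputs)
    max_err = max(abs(a - b) for a, b in zip(ref_out, cand_out))
    if max_err > tol:
        return False, f"correctness failed: max error {max_err:.2e}"
    start = time.perf_counter()
    for _ in range(iters):
        candidate(*inputs)
    ms = (time.perf_counter() - start) / iters * 1000
    if ms >= baseline_ms:
        return False, f"no speedup: {ms:.3f}ms vs baseline {baseline_ms}ms"
    return True, f"accepted at {ms:.3f}ms"
```

Running every AI-generated version through a gate like this makes it impossible to "accidentally" ship a plausible-looking regression.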

What to do next

After shipping your first optimized kernel:

  • Extract the successful optimization patterns into a reusable prompt library
  • Try the same workflow on a different operator (LayerNorm, Rotary, SwiGLU, etc.)
  • Measure power consumption — sometimes the fastest kernel isn’t the most efficient
  • Share the kernel on GitHub with full reproduction instructions
  • Explore whether the same technique can be applied to WebGPU or mobile shaders

This process turns the ByteDance CUDA Agent insight into a practical superpower for any builder with basic Python and a decent GPU.

The loop is simple: define a clear performance target → prompt with speed as the reward → implement → measure → feed results back → repeat until you beat the baseline.

That’s how real acceleration happens today.

