For people just starting out in GPU kernel engineering or LLM inference (FlashAttention / FlashInfer / SGLang / vLLM style work), most job postings still list “C++17, CuTe, CUTLASS” as hard requirements.
At the same time, NVIDIA has been pushing CuTeDSL (the Python DSL in CUTLASS 4.x) hard since late 2025 as the new recommended path for new kernels: same performance, no template metaprogramming, JIT compilation, much faster iteration, and direct TorchInductor integration.
The shift feels real in FlashAttention-4, FlashInfer, and SGLang’s NVIDIA collab roadmap.
Question for those already working in this space:
For someone starting fresh in 2026, is it still worth going deep on legacy C++ CuTe/CUTLASS templates, or should they prioritize CuTeDSL → Triton → Mojo (and keep only light C++ for reading old code)?
Is the “new stack” (CuTeDSL + Triton + Rust/Mojo for serving) actually production-viable right now, or are the job postings correct that you still need strong C++ CUTLASS skills to get hired and ship real kernels?
Any war stories or advice on the right learning order for new kernel engineers who want to contribute to FlashInfer / SGLang / FlashAttention?
Looking for honest takes — thanks!
--- TOP COMMENTS ---
Hello! I worked at a hyperscaler on FP4 MoE kernel optimization for LLMs on large GPU clusters, and we used CuTeDSL for the code maintainability and the full control over primitives. A big benefit for us was that you can write comms + compute in the same high-level Pythonesque language. The biggest benefit in my own opinion was all of the helpers for Blackwell (sm_100): you essentially get boilerplate for optimal bank swizzles, optimal data loads by shape, optimal tile shapes for MMA workloads, etc. And when needed you can inline PTX smoothly, for example in the epilogue. The framework also allowed us to use NVLink features like gradient reduction over comms using SHARP as part of the framework, which is amazing because it lets you search over that otherwise often unrelated subspace when you optimize the full workload.
In general, the point was that if you were to do the same in CUDA, you would end up with massive files and, most importantly, with one expert who wrote the code, actually got into it mentally, and understood it all, which is a single point of failure. In CuTeDSL, especially for Blackwell, the patterns are really similar. You have a lot of well-defined constants for the hardware. You define the 1-1-4 warp-specialization if-elses and then you just copy-paste the TMA loads, MMA, etc.
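To make that pattern concrete, here is a framework-free Python sketch (plain threads, not CuTeDSL; all names are illustrative) of the producer/consumer idea behind warp specialization: one role streams tiles into a small bounded buffer, standing in for TMA loads filling shared-memory stages, while another role drains it and does the math, standing in for the MMA warps.

```python
import queue
import threading
import numpy as np

NUM_TILES, TILE = 8, 16
STAGES = 2  # pipeline depth, like a double-buffered smem ring

def producer(tiles, buf):
    for t in tiles:        # "TMA load" each tile
        buf.put(t)         # blocks when all stages are full (backpressure)
    buf.put(None)          # signal completion to the consumer

def consumer(buf, out):
    while (t := buf.get()) is not None:
        out.append(t @ t.T)  # stand-in for the MMA on this tile

rng = np.random.default_rng(0)
tiles = [rng.standard_normal((TILE, TILE)) for _ in range(NUM_TILES)]
buf = queue.Queue(maxsize=STAGES)  # bounded buffer = the smem pipeline
results = []

p = threading.Thread(target=producer, args=(tiles, buf))
c = threading.Thread(target=consumer, args=(buf, results))
p.start(); c.start(); p.join(); c.join()
assert len(results) == NUM_TILES
```

On a real Blackwell kernel the "buffer full" wait is an mbarrier, and the roles are selected by the warp-id if-elses the comment above describes, but the shape of the code is the same.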
---
For my day-to-day work/prototyping I use Triton; I actually started with Triton when getting into this field a year ago. I think Triton is a great language for understanding the tile paradigm and the GPU architecture, because workloads are defined as tiles. This was my first contact with the field, and I recommend it to everyone:
[https://github.com/gpu-mode/Triton-Puzzles](https://github.com/gpu-mode/Triton-Puzzles)
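For a feel of what the puzzles drill, here is a framework-free NumPy sketch (plain Python, not Triton) of the tile mental model: a launch grid of "program" instances, each identified by a pid and owning one masked block of the data, the way a Triton kernel maps `tl.program_id(0)` onto a block of offsets.

```python
import numpy as np

def tiled_add(x, y, BLOCK=4):
    """One loop iteration = one 'program' instance owning one tile."""
    n = x.shape[0]
    out = np.empty_like(x)
    grid = (n + BLOCK - 1) // BLOCK              # ceil-div launch grid
    for pid in range(grid):                      # runs in parallel on a GPU
        offs = pid * BLOCK + np.arange(BLOCK)    # this program's offsets
        mask = offs < n                          # guard the ragged last tile
        out[offs[mask]] = x[offs[mask]] + y[offs[mask]]
    return out

x, y = np.arange(10.0), np.ones(10)
assert np.allclose(tiled_add(x, y), x + y)
```

The serial `for pid` loop is the only lie: in Triton each pid is an independent program on the GPU, and the masked block load/store is exactly what `tl.load`/`tl.store` with a mask express.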
Triton is also GPGPU: you can write really clean code for a lot of different accelerators in it. I did that for the AMD MI3XX series using the Iris library, which is Triton-native, with no need for ROCSHMEM, etc.:
[https://github.com/ROCm/iris](https://github.com/ROCm/iris)
However, the "practical" reality is that 95% of all companies use NVIDIA hardware, and most of them have legacy CUDA codebases. In a lot of interviews at more practical companies with existing codebases, I have been asked about CUDA more than about any GPGPU tooling. For this there is the [5th ed PMPP book](https://shop.elsevier.com/books/programming-massively-parallel-processors/hwu/978-0-443-43900-1) from this February that you can follow along with and implement things from start to finish; it covers both distributed and tensor cores.
For inference, you have SGLang/vLLM/TensorRT, with FlashInfer coming up. From my contacts in the industry, they use vLLM the most because it's stable, but I can't really go in depth on serving as I don't have as much experience there; perhaps someone else can.
I would also recommend the GPU MODE Discord server; there are a lot of channels and people learning this field.
---
My single most important piece of advice, which I can't stress enough if you want to understand this field:
You need to do things yourself: implement things yourself, get out-of-bounds writes and cryptic errors. Don't use LLMs to generate all the code for you. You can read any amount of books and generate any amount of kernels, but if you want to understand why, and to later improve things, you have to build the intuition of what is slow and why. If you are unsure where to start, follow the book or the puzzles, and then eventually move on to problems that matter to you.
As a final note, it's worth thinking about the future of the field. Kernel development is a verifiable problem, and I would agree with [Karpathy's tweet here](https://x.com/karpathy/status/1990116666194456651). Kernel development is a perfect problem for LLMs/agents: it's a large complex system with highly non-linear interactions, sometimes you just need to search the solution space, and LLMs are really good at that. Because of this, I would personally recommend seeing kernel development as a means and not as an end. Most likely your value will not be squeezing the last 4% out of some GEMM, but mapping some new energy model or world model or whatever-architecture effectively to existing GPUs in a way that has not been done before. For that you will need to understand the hardware and its limitations, and this is what the GPU programming languages will teach you.
---
job postings are always 2 years behind the actual stack. if FlashAttention-4 and FlashInfer are moving to CuTeDSL that's the signal, not LinkedIn job descriptions. I'd learn enough C++ CUTLASS to read existing code without getting lost, then go deep on Triton first since it's more portable and the mental model transfers. CuTeDSL is the right long-term bet but the ecosystem docs are still rough...