Articles by matt_d
1

Generalized Dot-Product Attention: Tackling Real-World Challenges in GPU Kernels (pytorch.org)

1

M^2RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling (arxiv.org)

1

Tools of the Trade: C2C Activation Offloading on Grace Blackwell (poolside.ai)

42

EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages (esolang-bench.vercel.app)

1

Speed-Of-Light ExecBench: A benchmark of real-world DL kernel problems (github.com/nvidia)

2

Equality Saturation and Symbolic Regression (egraphs.org)

2

NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL (arxiv.org)

2

Vectorization of Verilog Designs and its Effects on Verification and Synthesis (arxiv.org)

1

LATTE ’26: Workshop on Languages, Tools, and Techniques for Accelerator Design (cornell.edu)

1

Read Less, Steer More (ezyang.com)

1

The Data Structures of Roads (sandboxspirit.com)

1

Verifying Move Borrow Checker in Lean: An Experiment in AI-Assisted PL Metatheory (proofsandintuitions.net)

4

Real or Slop? – Programming Languages Papers Edition (zackg.me)

1

Mamba-3 (together.ai)

1

EvoX: Letting AI Evolve Its Own Evolution Process (skydiscover-ai.github.io)

1

Native DSLs Ops in PyTorch (ianbarber.blog)

1

Flash-KMeans: Fast and Memory-Efficient Exact K-Means (arxiv.org)

2

Gluon: Explicit Performance (lei.chat)

2

Block Number Formats are (Still!) Direction Preservers (constantinides.net)

3

cuTile Rust: a safe, tile-based kernel programming DSL for Rust (github.com/nvlabs)

1

KernelBlaster: A framework for in context learning for code optimization (github.com/nvlabs)

1

Demystifying and Improving Lazy Promotion in Cache Eviction [pdf] (vldb.org)

1

Journeying through Optimization with Heuristics [video] (youtube.com)

3

To Sparsify or to Quantize: A Hardware Architecture View (sigarch.org)

6

Efficient sparse computations using linear algebra aware compilers (2025) (osti.gov)

1

A Field Guide to Reward Hacking in AI Kernel Generation (wafer.ai)

1

AI and the Mixed-Consistency Future (jhellerstein.github.io)

1

FIDES: End-to-end Compartments for Mixed-language Systems [pdf] (kcsrk.info)

1

Practical Type Inference: High‑Throughput Recovery of Real‑World Types (arxiv.org)

1

Idempotent Slices with Applications to Code-Size Reduction (arxiv.org)

1

Designing AI Chip Hardware and Software (docs.google.com)

2

Refinement Modeling and Verification of RISC-V Assembly Using Knuckledragger (philipzucker.com)

2

Breaking Control Flow Integrity by Abusing Modern C++ (Coroutines) – BH USA 2025 [video] (youtube.com)

1

Programming the Loop (ianbarber.blog)

2

Scalable Training of Mixture-of-Experts Models with Megatron Core (arxiv.org)

3

PolyBlocks: A Compiler Infrastructure for AI Chips and Programming Frameworks (arxiv.org)

2

Formalizing Data Structures and Algorithms with Agents (risemsr.github.io)

2

Thinnings: Sublist Witnesses and de Bruijn Index Shift Clumping (philipzucker.com)

2

Advent of Computing: Dan Temkin – Forty-Four Esolangs (libsyn.com)

1

Checking Write Bandwidth on GPUs (clamtech.org)

1

Challenges in Decompilation and Reverse Engineering of CUDA-Based Kernels [pdf] (nicolo.dev)

2

Block Number Formats Are Direction Preservers (constantinides.net)

2

Cutie Fly: CuTe Layout Representation and Algebra, CuTeDSL, FlyDSL (ianbarber.blog)

2

Converting Binary Floating-Point Numbers to Shortest Decimal Strings (wiley.com)

2

Controlling Floating-Point Determinism in NVIDIA CCCL (nvidia.com)

2

Bootstrapping Fuzzers for Compilers of Low-Resource Language Dialects Using LLMs (arxiv.org)

2

Custom Data Structures in E-Graphs (uwplse.org)

2

Formal Verification in the Age of AI (verse.systems)

3

CuTe Layout Representation and Algebra (arxiv.org)

1

Bespoke OLAP: Synthesizing Workload-Specific One-Size-Fits-One Database Engines (arxiv.org)

3

SkyDiscover: A Flexible Framework for AI-Driven Sci. and Algorithmic Discovery (skydiscover-ai.github.io)

4

Silent Backwards Compatibility Breaking Changes in PyTorch (ezyang.com)

1

Building an Open-Source Verilog Simulator with AI: 580K Lines in 43 Days (normalcomputing.com)

1

AgentCgroup: Understanding and Controlling OS Resources of AI Agents (github.com/eunomia-bpf)

1

Equality Saturation for Circuit Synthesis and Verification (imperial.ac.uk)

1

An Introduction to Folios (oracle.com)

2

Perplexity Cannot Always Tell Right from Wrong (ianbarber.blog)

1

Ganak: The Making of a Versatile, High Performance Model Counter (msoos.org)

12

TorchLean: Formalizing Neural Networks in Lean (leandojo.org)

1

Fast Autoscheduling for Sparse ML Frameworks (fredrikbk.com)

1

TENSURE: Fuzzing Sparse Tensor Compilers (Registered Report) (ndss-symposium.org)

1

A Reinforcement Learning Environment for Automatic Code Optimization in MLIR (arxiv.org)

2

Metamorphic Testing for Infrastructure-as-Code Engines [pdf] (programming-group.com)

2

K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model (arxiv.org)

1

Midtraining Bridges Pretraining and Posttraining Distributions (arxiv.org)

2

Testing "Raw" GPU Cache Latency (clamtech.org)

4

Hexagon-MLIR: An AI Compilation Stack for Qualcomm's NPUs (arxiv.org)

1

Analyzing Latency Hiding and Parallelism in an MLIR-Based AI Kernel Compiler (arxiv.org)

1

Argus: Automated Discovery of Test Oracles for DBMSs Using LLMs (joyemang33.github.io)

2

A Decade of Docker Containers (acm.org)

1

In Pursuit of High-Fidelity GPU Kernel Benchmarking (standardkernel.com)

2

From ASPLOS to Orbit: Unikernels Twelve Years Later (gazagnaire.org)

1

VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean (utopia-group.github.io)

3

CSLib: The Lean Computer Science Library (arxiv.org)

3

Heliostat: Harnessing Ray Tracing Accelerators for Page Table Walks – ISCA 2025 [video] (youtube.com)

2

LDOS: Toward a Learning-Directed Operating System (sigops.org)

1

GenAI for Systems: Recurring Challenges & Design Principles from SW to Silicon (arxiv.org)

3

Precise exceptions in relaxed architectures [video] (youtube.com)

1

BitFields API: Type-Safe Bit Packing for Lock-Free Data Structures (rocksdb.org)

2

ThunderKittens 2.0: Even Faster Kernels for Your GPUs (stanford.edu)

1

Proof Assistants in the Age of AI (leodemoura.github.io)

1

Open Source Software Projects Are Brands (reidkleckner.dev)

1

Evaluating the Hardest CS Problems in the Age of LLMs (frontier-cs.org)

2

SE Radio 708: Jens Gustedt on C in 2026 (se-radio.net)

1

Spaghetti Bench: Evaluating AI Agents on Concurrency Bug Fixes (pastalab.org)

2

Computer Science as Infrastructure: The Spine of the Lean CSLib (arxiv.org)

2

Problems with a weak tryLock operation in C and C++ standards (swift.org)

1

Two mechanisms for dynamic type checks (wingolog.org)

1

Semantics, Operations, and Properties of P3109 Floating-Point Formats in Lean (github.com/rutgers-apl)

2

Oral History of Michael J. Flynn [video] (youtube.com)

2

Productively Programming Accelerated Computing Systems – Rohan Yadav (Stanford) [video] (youtube.com)

8

How to train your program verifier (risemsr.github.io)

3

Minimalist Design for Space Camera Flight Software (acm.org)

1

AMO-Lean: Towards Formally Verified Optimization via Equality Saturation in Lean (lambdaclass.com)

4

Fine-Tuning GPT-5 for GPU Kernel Generation (arxiv.org)

3

"Am I the only one still wondering what is the deal with linear types?" – Jon S (jonmsterling.com)

1

Running the "Reflections on Trusting Trust" Compiler: Revisiting the Backdoor (acm.org)

2

TileIR (ianbarber.blog)

1

Pushing Tensor Accelerators Beyond MatMul in a User-Schedulable Language (arxiv.org)

2

TLX: Triton-Like Simplicity, a Clear Path to Peak Performance [video] (youtube.com)