Articles by matt_d
1

SSA without Dominance for Higher-Order Programs (arxiv.org)

3

Spotting Specification Gaps with Small Proof-Oriented Tests (risemsr.github.io)

1

Theseus, a Static Windows Emulator (neugierig.org)

1

Advent of Computing: Episode 179 – Programming Block by Block (libsyn.com)

2

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models (arxiv.org)

2

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter (arxiv.org)

1

Fundamentals of CuTe Layout Algebra and Category-Theoretic Interpretation [video] (youtube.com)

2

OSS code review, in the era of LLMs (ezyang.com)

1

Proteus: Heterogeneous FPGA Virtualization [pdf] (tum.de)

1

Trevex: A Black-Box Detection Framework for Data-Flow Transient Execution Vulns (roots.ec)

1

From SIMT to Systolic Part 2: A Kernel Author's Field Report (twitter.com/mainzonx)

1

Machine Generated and Checked Proofs for a Verified Compiler (Experience Report) (arxiv.org)

2

Machine-Generated Code Deserves Machine-Checked Proofs (zoep.github.io)

2

What Happens to Software When Proof Is Cheap? Allen School Distinguished Lecture [video] (youtube.com)

1

TileTensor Part 1 – Safer, More Efficient GPU Kernels (modular.com)

1

EuroLLVM 2026 Round Table Summary: MLIR Canonicalization (llvm.org)

1

nanomem: A Simple, Inference-Time Memory Module (openanonymity.ai)

1

Building an Unverified Compiler with Agents (basis.ai)

1

WybeCoder: Verified Imperative Code Generation (facebookresearch.github.io)

2

Parcae: Doing More with Fewer Parameters Using Stable Looped Models (sandyresearch.github.io)

1

Characterizing the Impact of Congestion in Modern HPC Interconnects (arxiv.org)

1

Tessera: Unlocking Heterogeneous GPUs Through Kernel-Granularity Disaggregation (arxiv.org)

2

From SIMT to Systolic: A Foundation for GPU and TPU Architecture (twitter.com/mainzonx)

1

Packrat Parsing at the Speed of Wasm [video] (youtube.com)

1

Sparser, Faster, Lighter Transformer Language Models (arxiv.org)

1

When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Telemetry (arxiv.org)

2

Stupid RCU Tricks: Corner-Case RCU Implementations (kernel.org)

1

How Many Compilers Is Too Many? V8's History, Tradeoffs, and Architecture [video] (youtube.com)

1

Fully-Automatic Type Inference for Borrows with Lifetimes (radbox.org)

14

The GNU libc atanh is correctly rounded (hal.science)

1

MMU Handbook: Memory Management Units and TLBs (kalairajah-personal.github.io)

1

PEP 831 – Frame Pointers Everywhere: Enabling System-Level Observability (python.org)

6

UpDown: Efficient Manycore based on Many Threading & Scalable Memory Parallelism (uchicago.edu)

24

Reflections on 30 years of HPC programming (chapel-lang.org)

1

Recent lld/ELF performance improvements (maskray.me)

4

Circuit Transformations, Loop Fusion, and Inductive Proof (natetyoung.github.io)

1

AI for Systems: Using LLMs to Optimize Database Query Execution (together.ai)

2

Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets (arxiv.org)

2

GCC Translation Validation Part 6: Uninitialized Memory (kristerw.github.io)

2

Agentic Code Optimization via Compiler-LLM Cooperation (arxiv.org)

1

Bespoke OLAP: Using AI to Synthesize Workload-Specific DBMS Engines from Scratch (ucbskyadrs.github.io)

2

Understanding Agents: Code Coverage for Coding Agents (asymmetric.re)

1

Cyclotron: The Streaming Multiprocessor Abstraction Is Broken [pdf] (cornell.edu)

1

UCCL-EP: Portable Expert-Parallel Communication – Full Results (uccl-project.github.io)

3

vLLM IR: A Functional Intermediate Representation for vLLM (github.com/vllm-project)

1

Test-Time Scaling Makes Overtraining Compute-Optimal (arxiv.org)

1

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods (arxiv.org)

3

Tracing a Full MoE Training Step Through the XLA Compiler (patricktoulme.substack.com)

1

Breaking Down the Cerebras Wafer Scale Engine (wafer.substack.com)

1

The need for better compiler frontend benchmarks: Carbon's benchmarking approach (llvm.org)

1

DAXFS: A Lock-Free Shared Filesystem for CXL Disaggregated Memory (arxiv.org)

2

AST Edits: The Code Editing Format Nobody Uses (geometricagi.github.io)

1

KernelEvolve: Meta's Ranking Engineer Agent Optimizes AI Infrastructure (fb.com)

3

Software Engineering Is Becoming Civil Engineering (christophermeiklejohn.com)

1

Adaptive Block-Scaled Data Types (arxiv.org)

1

AC4A: Access Control for Agents (arxiv.org)

1

Rethinking Language Model Scaling Under Transferable Hypersphere Optimization (arxiv.org)

1

Distributed builds of LLVM with CMake, recc, and NativeLink (reidkleckner.dev)

1

A Pattern Generation Language for MLIR Compiler Matching and Rewriting (radbox.org)

2

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon (dao-lab.ai)

3

Compiler as a Service: C++ Goes Live – Interactive C++, interop, and beyond [video] (youtube.com)

3

Measuring AI Ability to Complete Long Software Tasks (muratbuffalo.blogspot.com)

1

6o6 v1.1: Faster 6502-on-6502 virtualization for a C64/Apple II Apple-1 emulator (oldvcr.blogspot.com)

2

uops.info Update: Emerald Rapids, Meteor Lake, Arrow Lake, and Zen 5 (uops.info)

1

MXFP8 GEMM: Up to 99% of cuBLAS Performance Using CUDA and PTX (danielvegamyhre.github.io)

2

PyTorch Autograd and Mutation (ezyang.com)

2

The Future of Python: Evolution or Succession – Brett Slatkin – PyCascades 2026 [video] (youtube.com)

2

SlopCodeBench: Benchmarking How Coding Agents Degrade over Long-Horizon Tasks (scbench.ai)

2

AutoRocq: Agentic Theorem Prover for Verification (github.com/nus-program-verification)

1

Wax: Optimizing Data Center Applications with Stale Profile (github.com/ice-rlab)

2

Dijkstra's Shortest-Path Algorithm: A visual exploration, following Sedgewick (joshmpollock.com)

3

Speculative Decoding: Performance or Illusion? (specdecode-bench.github.io)

2

Goedel-Code-Prover: Hierarchical Proof Search for Open SotA Code Verification (goedelcodeprover.github.io)

1

MLSys 2026 Papers (mlsys.org)

2

An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU (arxiv.org)

2

Specula: A framework for finding deep bugs in system code using TLA+ (github.com/specula-org)

3

Equality Saturation for Optimizing High-Level Julia IR (acm.org)

1

UniTe: A Universal Tensor Abstraction for Capturing Spatial Relationships (acm.org)

2

Co-Design of B+-Tree Index with Emerging Zone Interfaces for Small KV Pairs (acm.org)

1

CounterPoint: Using Hardware Counters to Refute and Refine µarch Assumptions (arxiv.org)

1

PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost (arxiv.org)

4

SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems (muratbuffalo.blogspot.com)

1

What Is Coordination, Really? (jhellerstein.github.io)

1

Idempotent Slices with Applications to Code-Size Reduction (arxiv.org)

1

Microsoft Rust Training Books: Beginner, advanced, expert level material (github.com/microsoft)

2

LUMINA: LLM-Guided GPU Architecture Exploration via Bottleneck Analysis (arxiv.org)

1

Challenges and Design Issues in Finding CUDA Bugs via GPU-Native Fuzzing (arxiv.org)

1

SEVI: Silent Data Corruption of Vector Instructions in Hyper-Scale Datacenters (acm.org)

2

CrypTorch: PyTorch-based Auto-tuning Compiler for ML w/ Multi-party Computation (github.com/psu-paws)

2

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels (arxiv.org)

6

Tony Hoare and His Imprint on Computer Science (acm.org)

1

The End of Dijkstra's Algorithm? Breaking the Sorting Barrier for Shortest Paths [video] (youtube.com)

1

AlgoVeri: An Aligned Benchmark for Verified Code Gen. On Classical Algorithms (arxiv.org)

1

Specy: Learning Specifications for Distributed Systems from Event Traces [pdf] (princeton.edu)

1

Generalized Dot-Product Attention: Tackling Real-World Challenges in GPU Kernels (pytorch.org)

1

M^2RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling (arxiv.org)

1

Tools of the Trade: C2C Activation Offloading on Grace Blackwell (poolside.ai)

43

EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages (esolang-bench.vercel.app)

1

Speed-Of-Light ExecBench: A benchmark of real-world DL kernel problems (github.com/nvidia)

2

Equality Saturation and Symbolic Regression (egraphs.org)