Articles by matt_d
3

How the JVM Optimizes Generic Code – A Deep Dive (inside.java)

1

Tessera: Unlocking Heterogeneous GPUs Through Kernel-Granularity Disaggregation (arxiv.org)

2

MathDuels: Evaluating LLMs as Problem Posers and Solvers (arxiv.org)

1

Kernel Contracts: A Spec. Language for Correctness Across Heterogeneous Silicon (arxiv.org)

2

Revealing NVIDIA Driver Command Streams for CPU-GPU Runtime Behavior Insight (arxiv.org)

2

Guardians: Static verification for AI agent workflows (github.com/metareflection)

2

Fast GPU Linear Algebra via Compile Time Expression Fusion (arxiv.org)

3

The AI Compute Extensions (ACE) for x86 [pdf] (x86ecosystem.org)

1

Finding and Understanding Bugs in FPGA Place-and-Route Engines [video] (youtube.com)

1

AutoSP: Long-Context LLM Training via Compiler-Based Sequence Parallelism (pytorch.org)

2

Partial UDF Inlining (doi.org)

10

From Convergence to Confidence: Push-Button Verification for RDTs (kcsrk.info)

19

Low-Compilation-Cost Register Allocation in LLVM-Based Binary Translation (acm.org)

1

AdaExplore: Search for Efficient Kernel Generation (stiglidu.github.io)

1

vLLM-Compile: Bringing Compiler Optimizations to LLM Inference (docs.google.com)

2

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs (arxiv.org)

2

Compiler Testing – Part 1: Coverage-Guided Fuzzing with Grammars and LLMs (nowarp.io)

1

Disaggregated Serving for Hybrid SSM Models in vLLM (vllm-website-lx4pji0mz-inferact-inc.ver...

1

Great Paper: The Calculated Typer – Iowa Type Theory Commute Podcast S7 E6 (pocketcasts.com)

3

Barbara Liskov, Turing Award'08: Data Abstraction, Dijkstra, Distributed Systems (developing.dev)

4

Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell (arxiv.org)

1

Reimagining Kernel Generation at the PTX Layer (standardkernel.com)

1

A Deductive System for (Hardware-Software) Contract Satisfaction Proofs (arxiv.org)

1

Tile Kernels: An optimized GPU kernels library written in TileLang (github.com/deepseek-ai)

2

AMD's Zen: Coming Back from the Dead (clamtech.org)

1

Learning to Repair Lean Proofs from Compiler Feedback (arxiv.org)

1

RLix: A scheduling layer for concurrent LLM RL (github.com/rlops)

2

Primus Projection: Estimate Memory and Performance Before You Train (amd.com)

1

PRowhammer: Propagating Bit-flips from CPU to GPU [pdf] (iitb.ac.in)

2

The Quantization Robustness of Diffusion Language Models in Coding Benchmarks (arxiv.org)

2

Different Perspectives of Memory System Simulation (arxiv.org)

2

Adding Compilation Metadata to Binaries to Make Disassembly Decidable (arxiv.org)

1

ICLR 2026 Outstanding Papers (iclr.cc)

1

Decoupled DiLoCo for Resilient Distributed Pre-Training (arxiv.org)

2

spmd_types: A type system for distributed (SPMD) tensor computations in PyTorch (github.com/meta-pytorch)

1

How Do LLM Agents Think Through SQL Join Orders? (ucbskyadrs.github.io)

1

Gluon & Linear Layouts Deep-Dive: Tile-Based GPU Programming with Low-Level Control [video] (youtube.com)

1

SonicMoE: A HW-Efficient and SW-Extensible Blueprint for Fine-Grained MoEs (dao-lab.ai)

1

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving (arxiv.org)

1

DIRT: Database-Integrated Random Testing (arxiv.org)

1

Scaling Test-Time Compute for Agentic Coding (arxiv.org)

1

An Algorithmic Reconstruction of Normalisation by Evaluation (yangzhixuan.github.io)

2

Faster LLM Inference via Sequential Monte Carlo (arxiv.org)

2

Pure Borrow: Linear Haskell Meets Rust-Style Borrowing (arxiv.org)

1

SSA without Dominance for Higher-Order Programs (arxiv.org)

3

Spotting Specification Gaps with Small Proof-Oriented Tests (risemsr.github.io)

1

Theseus, a Static Windows Emulator (neugierig.org)

1

Advent of Computing: Episode 179 – Programming Block by Block (libsyn.com)

2

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models (arxiv.org)

4

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter (arxiv.org)

1

Fundamentals of CuTe Layout Algebra and Category-Theoretic Interpretation [video] (youtube.com)

2

OSS code review, in the era of LLMs (ezyang.com)

1

Proteus: Heterogeneous FPGA Virtualization [pdf] (tum.de)

1

Trevex: A Black-Box Detection Framework for Data-Flow Transient Execution Vulns (roots.ec)

1

From SIMT to Systolic Part 2: A Kernel Author's Field Report (twitter.com/mainzonx)

1

Machine Generated and Checked Proofs for a Verified Compiler (Experience Report) (arxiv.org)

2

Machine-Generated Code Deserves Machine-Checked Proofs (zoep.github.io)

2

What Happens to Software When Proof Is Cheap? Allen School Distinguished Lecture [video] (youtube.com)

1

TileTensor Part 1 – Safer, More Efficient GPU Kernels (modular.com)

1

EuroLLVM 2026 Round Table Summary: MLIR Canonicalization (llvm.org)

1

nanomem: A Simple, Inference-Time Memory Module (openanonymity.ai)

1

Building an Unverified Compiler with Agents (basis.ai)

1

WybeCoder: Verified Imperative Code Generation (facebookresearch.github.io)

2

Parcae: Doing More with Fewer Parameters Using Stable Looped Models (sandyresearch.github.io)

1

Characterizing the Impact of Congestion in Modern HPC Interconnects (arxiv.org)

1

Tessera: Unlocking Heterogeneous GPUs Through Kernel-Granularity Disaggregation (arxiv.org)

2

From SIMT to Systolic: A Foundation for GPU and TPU Architecture (twitter.com/mainzonx)

1

Packrat Parsing at the Speed of Wasm [video] (youtube.com)

1

Sparser, Faster, Lighter Transformer Language Models (arxiv.org)

1

When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Telemetry (arxiv.org)

2

Stupid RCU Tricks: Corner-Case RCU Implementations (kernel.org)

1

How Many Compilers Is Too Many? V8's History, Tradeoffs, and Architecture [video] (youtube.com)

1

Fully-Automatic Type Inference for Borrows with Lifetimes (radbox.org)

14

The GNU libc atanh is correctly rounded (hal.science)

1

MMU Handbook: Memory Management Units and TLBs (kalairajah-personal.github.io)

1

PEP 831 – Frame Pointers Everywhere: Enabling System-Level Observability (python.org)

6

UpDown: Efficient Manycore based on Many Threading & Scalable Memory Parallelism (uchicago.edu)

24

Reflections on 30 years of HPC programming (chapel-lang.org)

1

Recent lld/ELF performance improvements (maskray.me)

4

Circuit Transformations, Loop Fusion, and Inductive Proof (natetyoung.github.io)

1

AI for Systems: Using LLMs to Optimize Database Query Execution (together.ai)

2

Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets (arxiv.org)

2

GCC Translation Validation Part 6: Uninitialized Memory (kristerw.github.io)

2

Agentic Code Optimization via Compiler-LLM Cooperation (arxiv.org)

1

Bespoke OLAP: Using AI to Synthesize Workload-Specific DBMS Engines from Scratch (ucbskyadrs.github.io)

2

Understanding Agents: Code Coverage for Coding Agents (asymmetric.re)

1

Cyclotron: The Streaming Multiprocessor Abstraction Is Broken [pdf] (cornell.edu)

1

UCCL-EP: Portable Expert-Parallel Communication – Full Results (uccl-project.github.io)

3

vLLM IR: A Functional Intermediate Representation for vLLM (github.com/vllm-project)

1

Test-Time Scaling Makes Overtraining Compute-Optimal (arxiv.org)

1

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods (arxiv.org)

3

Tracing a Full MoE Training Step Through the XLA Compiler (patricktoulme.substack.com)

1

Breaking Down the Cerebras Wafer Scale Engine (wafer.substack.com)

1

The need for better compiler frontend benchmarks: Carbon's benchmarking approach (llvm.org)

1

DAXFS: A Lock-Free Shared Filesystem for CXL Disaggregated Memory (arxiv.org)

2

AST Edits: The Code Editing Format Nobody Uses (geometricagi.github.io)

1

KernelEvolve: Meta's Ranking Engineer Agent Optimizes AI Infrastructure (fb.com)

3

Software Engineering Is Becoming Civil Engineering (christophermeiklejohn.com)

1

Adaptive Block-Scaled Data Types (arxiv.org)

1

AC4A: Access Control for Agents (arxiv.org)