Articles by matt_d
1

Chip design from the bottom up – Reiner Pope [video] (youtube.com)

2

LT2: Linear-Time Looped Transformers (charlesdddd.github.io)

2

Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel (arxiv.org)

1

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Apps (arxiv.org)

33

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs (arxiv.org)

1

[RFC] Open Access to Standards Documents – LLVM Project (llvm.org)

3

Curly braces: An evolution of UNIX and C (thalia.dev)

1

NanoTag: Systems Support for Efficient Byte-Granular Overflow Detection on Arm (github.com/ice-rlab)

1

InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents (inferencebench.ai)

1

Tracking Capabilities for Safer Agents (arxiv.org)

2

Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation (arxiv.org)

1

Verifying EDA and compiler optimizations once and for all (samuelcoward.co.uk)

1

StepStone: LLM-Based GPU Kernel Driver Fuzzing via User-Space Libraries [pdf] (ucr.edu)

1

Graded Modal Types for Memory and Communication Safety (kent.ac.uk)

1

Systems Are Changing: The Architect's Role in the Era of Agentic Co-Design (sigarch.org)

1

Code-Specify-Test-Debug-Prove: Flexibly Integrating Separation Logic [pdf] (cam.ac.uk)

3

Detecting Relaxed Memory Concurrency Bugs in C and C++ Compilers (lukegeeson.com)

2

The downgrading semantics of memory safety (Extended version) (arxiv.org)

1

Direction-Preserving Number Representations (arxiv.org)

1

On the Unreasonable Effectiveness of PBT for Validating Formal Specifications (proofsandintuitions.net)

2

Understanding, Analyzing, and Optimizing Agentic AI: A CPU-Centric Perspective (arxiv.org)

2

Getting Confidence in (Agentic) Code (ucsd-cse-115-215.github.io)

1

Compute Optimal Tokenization: Scaling Laws for Data Compression in LLMs (co-tok.github.io)

1

Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism (wuklab.io)

1

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference (supercomputing-system-ai-lab.github.io)

7

[flagged] KV cache is becoming the memory hierarchy of inference (touchdown-labs.com)

1

Ada-MK: Adaptive MegaKernel Optimization via DAG-Based Search for LLM Inference (arxiv.org)

42

How to Write to SSDs [pdf] (vldb.org)

1

Scalable GPU Acceleration of Scalar Functions in Analytical Databases [pdf] (vldb.org)

1

The agent principal-agent problem (crawshaw.io)

1

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale (frontier-cs.org)

1

Demystifying the Silence of Correctness Bugs in PyTorch Compiler (arxiv.org)

1

Fork, Explore, Commit: OS Primitives for Agentic Exploration (arxiv.org)

1

Systematically Auditing AI Agent Benchmarks with BenchJack (arxiv.org)

8

mimalloc: A new, high-performance, scalable memory allocator for the modern era (microsoft.com)

1

Let AI Agents Write Your Serving Stack with VibeServe (washington.edu)

2

TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-Scale Production (arxiv.org)

2

TorchLean: Verified Neural Networks in Lean (robertj1.com)

98

Deterministic Fully-Static Whole-Binary Translation Without Heuristics (arxiv.org)

1

Dynamic Persistent Tile Scheduling w/ Cluster Launch Control (CLC) on Blackwell (colfax-intl.com)

1

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? (github.com/uw-syfi)

2

CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure (arxiv.org)

1

Microbenchmark-Driven Analytical Performance Modeling Across Modern GPUs (arxiv.org)

2

PyTorch DevLog (pytorch.org)

2

VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU (arxiv.org)

1

Aurora: A Leverage-Aware Optimizer for Rectangular Matrices (tilderesearch.com)

1

The Two Abstractions of System Design: Hide or Reduce (muratbuffalo.blogspot.com)

1

Practical Formal Verification for MLIR Programs (arxiv.org)

1

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs (arxiv.org)

2

Capsules: Compile-time lock discipline in OxCaml (kcsrk.info)

1

Data Race Freedom in OxCaml (kcsrk.info)

2

cuda-oxide: a custom rustc backend for compiling GPU kernels in pure Rust (github.com/nvlabs)

3

A case study with Aeneas and jxl-rs (protzenko.fr)

1

Finite Functional Programming (arxiv.org)

3

CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion (arxiv.org)

1

SPEC CPU: The Next Generation (arxiv.org)

1

Persistent Iterators with Value Semantics (arxiv.org)

3

Continual Learning Bench 1.0 (continual-learning-bench.com)

3

The Valley of Calm (joemag.dev)

3

The Static Dynamic JVM – A Many Layered Dive [video] (youtube.com)

1

Learning Randomized Reductions (arxiv.org)

2

Metastability in Recovery: Cascading Recovery with a Loop (charap.co)

3

How the JVM Optimizes Generic Code – A Deep Dive (inside.java)

1

Tessera: Unlocking Heterogeneous GPUs Through Kernel-Granularity Disaggregation (arxiv.org)

2

MathDuels: Evaluating LLMs as Problem Posers and Solvers (arxiv.org)

1

Kernel Contracts: A Spec. Language for Correctness Across Heterogeneous Silicon (arxiv.org)

2

Revealing NVIDIA Driver Command Streams for CPU-GPU Runtime Behavior Insight (arxiv.org)

2

Guardians: Static verification for AI agent workflows (github.com/metareflection)

2

Fast GPU Linear Algebra via Compile Time Expression Fusion (arxiv.org)

3

The AI Compute Extensions (ACE) for x86 [pdf] (x86ecosystem.org)

1

Finding and Understanding Bugs in FPGA Place-and-Route Engines [video] (youtube.com)

1

AutoSP: Long-Context LLM Training via Compiler-Based Sequence Parallelism (pytorch.org)

2

Partial UDF Inlining (doi.org)

10

From Convergence to Confidence: Push-Button Verification for RDTs (kcsrk.info)

19

Low-Compilation-Cost Register Allocation in LLVM-Based Binary Translation (acm.org)

1

AdaExplore: Search for Efficient Kernel Generation (stiglidu.github.io)

1

vLLM-Compile: Bringing Compiler Optimizations to LLM Inference (docs.google.com)

2

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs (arxiv.org)

2

Compiler Testing – Part 1: Coverage-Guided Fuzzing with Grammars and LLMs (nowarp.io)

1

Disaggregated Serving for Hybrid SSM Models in vLLM (vllm-website-lx4pji0mz-inferact-inc.ver...

1

Great Paper: The Calculated Typer – Iowa Type Theory Commute Podcast S7 E6 (pocketcasts.com)

3

Barbara Liskov, Turing Award'08: Data Abstraction, Dijkstra, Distributed Systems (developing.dev)

4

Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell (arxiv.org)

1

Reimagining Kernel Generation at the PTX Layer (standardkernel.com)

1

A Deductive System for (Hardware-Software) Contract Satisfaction Proofs (arxiv.org)

1

Tile Kernels: An optimized GPU kernels library written in TileLang (github.com/deepseek-ai)

2

AMD's Zen: Coming Back from the Dead (clamtech.org)

1

Learning to Repair Lean Proofs from Compiler Feedback (arxiv.org)

1

RLix: A scheduling layer for concurrent LLM RL (github.com/rlops)

2

Primus Projection: Estimate Memory and Performance Before You Train (amd.com)

1

PRowhammer: Propagating Bit-flips from CPU to GPU [pdf] (iitb.ac.in)

2

The Quantization Robustness of Diffusion Language Models in Coding Benchmarks (arxiv.org)

2

Different Perspectives of Memory System Simulation (arxiv.org)

2

Adding Compilation Metadata to Binaries to Make Disassembly Decidable (arxiv.org)

1

ICLR 2026 Outstanding Papers (iclr.cc)

1

Decoupled DiLoCo for Resilient Distributed Pre-Training (arxiv.org)

2

spmd_types: A type system for distributed (SPMD) tensor computations in PyTorch (github.com/meta-pytorch)

1

How Do LLM Agents Think Through SQL Join Orders? (ucbskyadrs.github.io)

1

Gluon & Linear Layouts Deep-Dive: Tile-Based GPU Programming with Low-Level Control [video] (youtube.com)

1

SonicMoE: A HW-Efficient and SW-Extensible Blueprint for Fine-Grained MoEs (dao-lab.ai)