Articles by matt_d
1

Breaking Down the Cerebras Wafer Scale Engine (wafer.substack.com)

1

The need for better compiler frontend benchmarks: Carbon's benchmarking approach (llvm.org)

1

DAXFS: A Lock-Free Shared Filesystem for CXL Disaggregated Memory (arxiv.org)

2

AST Edits: The Code Editing Format Nobody Uses (geometricagi.github.io)

1

KernelEvolve: Meta's Ranking Engineer Agent Optimizes AI Infrastructure (fb.com)

3

Software Engineering Is Becoming Civil Engineering (christophermeiklejohn.com)

1

Adaptive Block-Scaled Data Types (arxiv.org)

1

AC4A: Access Control for Agents (arxiv.org)

1

Rethinking Language Model Scaling Under Transferable Hypersphere Optimization (arxiv.org)

1

Distributed builds of LLVM with CMake, recc, and NativeLink (reidkleckner.dev)

1

A Pattern Generation Language for MLIR Compiler Matching and Rewriting (radbox.org)

2

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon (dao-lab.ai)

3

Compiler as a Service: C++ Goes Live – Interactive C++, interop, and beyond [video] (youtube.com)

3

Measuring AI Ability to Complete Long Software Tasks (muratbuffalo.blogspot.com)

1

6o6 v1.1: Faster 6502-on-6502 virtualization for a C64/Apple II Apple-1 emulator (oldvcr.blogspot.com)

2

uops.info Update: Emerald Rapids, Meteor Lake, Arrow Lake, and Zen 5 (uops.info)

1

MXFP8 GEMM: Up to 99% of cuBLAS Performance Using CUDA and PTX (danielvegamyhre.github.io)

2

PyTorch Autograd and Mutation (ezyang.com)

2

The Future of Python: Evolution or Succession – Brett Slatkin – PyCascades 2026 [video] (youtube.com)

2

SlopCodeBench: Benchmarking How Coding Agents Degrade over Long-Horizon Tasks (scbench.ai)

2

AutoRocq: Agentic Theorem Prover for Verification (github.com/nus-program-verification)

1

Wax: Optimizing Data Center Applications with Stale Profile (github.com/ice-rlab)

2

Dijkstra's Shortest-Path Algorithm: A visual exploration, following Sedgewick (joshmpollock.com)

3

Speculative Decoding: Performance or Illusion? (specdecode-bench.github.io)

2

Goedel-Code-Prover: Hierarchical Proof Search for Open SotA Code Verification (goedelcodeprover.github.io)

1

MLSys 2026 Papers (mlsys.org)

2

An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU (arxiv.org)

2

Specula: A framework for finding deep bugs in system code using TLA+ (github.com/specula-org)

3

Equality Saturation for Optimizing High-Level Julia IR (acm.org)

1

UniTe: A Universal Tensor Abstraction for Capturing Spatial Relationships (acm.org)

2

Co-Design of B+-Tree Index with Emerging Zone Interfaces for Small KV Pairs (acm.org)

1

CounterPoint: Using Hardware Counters to Refute and Refine µarch Assumptions (arxiv.org)

1

PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost (arxiv.org)

4

SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems (muratbuffalo.blogspot.com)

1

What Is Coordination, Really? (jhellerstein.github.io)

1

Idempotent Slices with Applications to Code-Size Reduction (arxiv.org)

1

Microsoft Rust Training Books: Beginner, advanced, expert level material (github.com/microsoft)

2

LUMINA: LLM-Guided GPU Architecture Exploration via Bottleneck Analysis (arxiv.org)

1

Challenges and Design Issues in Finding CUDA Bugs via GPU-Native Fuzzing (arxiv.org)

1

SEVI: Silent Data Corruption of Vector Instructions in Hyper-Scale Datacenters (acm.org)

2

CrypTorch: PyTorch-based Auto-tuning Compiler for ML w/ Multi-party Computation (github.com/psu-paws)

2

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels (arxiv.org)

6

Tony Hoare and His Imprint on Computer Science (acm.org)

1

The End of Dijkstra's Algorithm? Breaking the Sorting Barrier for Shortest Paths [video] (youtube.com)

1

AlgoVeri: An Aligned Benchmark for Verified Code Gen. on Classical Algorithms (arxiv.org)

1

Specy: Learning Specifications for Distributed Systems from Event Traces [pdf] (princeton.edu)

1

Generalized Dot-Product Attention: Tackling Real-World Challenges in GPU Kernels (pytorch.org)

1

M^2RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling (arxiv.org)

1

Tools of the Trade: C2C Activation Offloading on Grace Blackwell (poolside.ai)

43

EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages (esolang-bench.vercel.app)

1

Speed-Of-Light ExecBench: A benchmark of real-world DL kernel problems (github.com/nvidia)

2

Equality Saturation and Symbolic Regression (egraphs.org)

2

NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL (arxiv.org)

4

Vectorization of Verilog Designs and its Effects on Verification and Synthesis (arxiv.org)

1

LATTE ’26: Workshop on Languages, Tools, and Techniques for Accelerator Design (cornell.edu)

1

Read Less, Steer More (ezyang.com)

1

The Data Structures of Roads (sandboxspirit.com)

1

Verifying Move Borrow Checker in Lean: An Experiment in AI-Assisted PL Metatheory (proofsandintuitions.net)

4

Real or Slop? – Programming Languages Papers Edition (zackg.me)

33

Mamba-3 (together.ai)

1

EvoX: Letting AI Evolve Its Own Evolution Process (skydiscover-ai.github.io)

1

Native DSLs Ops in PyTorch (ianbarber.blog)

19

Flash-KMeans: Fast and Memory-Efficient Exact K-Means (arxiv.org)

2

Gluon: Explicit Performance (lei.chat)

2

Block Number Formats are (Still!) Direction Preservers (constantinides.net)

3

cuTile Rust: a safe, tile-based kernel programming DSL for Rust (github.com/nvlabs)

1

KernelBlaster: A framework for in context learning for code optimization (github.com/nvlabs)

1

Demystifying and Improving Lazy Promotion in Cache Eviction [pdf] (vldb.org)

1

Journeying through Optimization with Heuristics [video] (youtube.com)

3

To Sparsify or to Quantize: A Hardware Architecture View (sigarch.org)

6

Efficient sparse computations using linear algebra aware compilers (2025) (osti.gov)

1

A Field Guide to Reward Hacking in AI Kernel Generation (wafer.ai)

1

AI and the Mixed-Consistency Future (jhellerstein.github.io)

1

FIDES: End-to-end Compartments for Mixed-language Systems [pdf] (kcsrk.info)

1

Practical Type Inference: High‑Throughput Recovery of Real‑World Types (arxiv.org)

1

Idempotent Slices with Applications to Code-Size Reduction (arxiv.org)

1

Designing AI Chip Hardware and Software (docs.google.com)

2

Refinement Modeling and Verification of RISC-V Assembly Using Knuckledragger (philipzucker.com)

2

Breaking Control Flow Integrity by Abusing Modern C++ (Coroutines) – BH USA 2025 [video] (youtube.com)

1

Programming the Loop (ianbarber.blog)

2

Scalable Training of Mixture-of-Experts Models with Megatron Core (arxiv.org)

3

PolyBlocks: A Compiler Infrastructure for AI Chips and Programming Frameworks (arxiv.org)

2

Formalizing Data Structures and Algorithms with Agents (risemsr.github.io)

2

Thinnings: Sublist Witnesses and de Bruijn Index Shift Clumping (philipzucker.com)

2

Advent of Computing: Dan Temkin – Forty-Four Esolangs (libsyn.com)

1

Checking Write Bandwidth on GPUs (clamtech.org)

1

Challenges in Decompilation and Reverse Engineering of CUDA-Based Kernels [pdf] (nicolo.dev)

2

Block Number Formats Are Direction Preservers (constantinides.net)

2

Cutie Fly: CuTe Layout Representation and Algebra, CuTeDSL, FlyDSL (ianbarber.blog)

2

Converting Binary Floating-Point Numbers to Shortest Decimal Strings (wiley.com)

2

Controlling Floating-Point Determinism in NVIDIA CCCL (nvidia.com)

2

Bootstrapping Fuzzers for Compilers of Low-Resource Language Dialects Using LLMs (arxiv.org)

2

Custom Data Structures in E-Graphs (uwplse.org)

2

Formal Verification in the Age of AI (verse.systems)

3

CuTe Layout Representation and Algebra (arxiv.org)

1

Bespoke OLAP: Synthesizing Workload-Specific One-Size-Fits-One Database Engines (arxiv.org)

3

SkyDiscover: A Flexible Framework for AI-Driven Sci. and Algorithmic Discovery (skydiscover-ai.github.io)

4

Silent Backwards Compatibility Breaking Changes in PyTorch (ezyang.com)

1

Building an Open-Source Verilog Simulator with AI: 580K Lines in 43 Days (normalcomputing.com)

1

AgentCgroup: Understanding and Controlling OS Resources of AI Agents (github.com/eunomia-bpf)