Articles by matt_d
1

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs (arxiv.org)

1

Linear Algebra Kernels for the Age of Research (gpumode.com)

2

Latent learning: episodic memory complements parametric learning (openreview.net)

1

NEURA: A Unified and Retargetable Compilation Framework for CGRAs (acm.org)

2

System Call Stack Alignment (humprog.org)

1

Making FlashAttention-4 faster for inference (modal.com)

1

Precision Matters in Block Scales (constantinides.net)

2

Agents' Last Exam (arxiv.org)

2

Does the Harness Matter? Lessons from Ale-Claw on Agents' Last Exam (agents-last-exam.org)

1

Demystifying NVSHMEM: System-Level: Symmetric Memory, Device-Initiated Ops (arxiv.org)

1

Enumerating Ill-Typed Programs for Testing Type Analyzers (acm.org)

1

Agentic Memory Management for GPU Code Generation (ucbskyadrs.github.io)

1

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code? (uccl-project.github.io)

1

Frontier: A Discrete-Event Simulator for Modern LLM Serving (github.com/netx-lab)

2

Piper: A Programmable Distributed Training System (washington.edu)

1

Piper: A Programmable Distributed Training System (arxiv.org)

1

Radix Top-K: finding the top-k elements in an array without sorting (veitner.bearblog.dev)

1

A Case for a Simulation-Driven Exploration of Distributed GenAI Platforms (acm.org)

1

Defeat the Heap: Zero-Copy Data Movement in AXI4MLIR (arxiv.org)

2

Breaking the Ice: Analyzing Cold Start Latency in vLLM (arxiv.org)

2

An Empirical Comparison of General Context-Free Parsers (arxiv.org)

1

RFC: Programming Languages Course Reboot, 2026 – Shriram Krishnamurthi (docs.google.com)

1

CodegenBench: Can LLMs Write Efficient Code Across Architectures? (arxiv.org)

1

ACM SIGPLAN Programming Language Design and Implementation (PLDI) 2026 (acm.org)

2

Human Judgment as a Specification (brownplt.org)

3

OOBdump: Relocation Oriented Programming: Arbitrary code execution in objdump -g (calif.io)

3

Inference: Turning Electricity into Intelligence – Stanford CS336 – Dan Fu [video] (youtube.com)

5

FP8 Is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail (arxiv.org)

4

Modular Arithmetic Challenge (terrytao.wordpress.com)

1

The Return of Rigorous Full-System Timing Simulation (sigarch.org)

2

Co-Creator of Haskell: Functional Prog., Thinking in Types, Useless Languages [video] (youtube.com)

1

Types for more than memory safety in OxCaml – Stephen Dolan – VeTSS 2026 [video] (youtube.com)

112

The 29th International Obfuscated C Code Contest (IOCCC) 2025 Winners (ioccc.org)

3

Tensor Shapes in Pyrefly – Avik Chaudhuri – PyCon US 2026 Typing Summit [video] (youtube.com)

1

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution (benchevolver.github.io)

2

JITDomain: Instruction-level JIT code isolation (sciencedirect.com)

2

Serving Transformers: Lessons from the Trenches – Stanford CS25 Transformers [video] (youtube.com)

1

Constrained Adaptive Rejection Sampling (arxiv.org)

2

Training an Agentic Router for Optimal Cost-Performance on SWE Tasks (appliedcompute.com)

1

Diagramming Program Values by Spatial Refinement (brownplt.org)

1

Agent Arena: Causal Evaluation of Agents in the Real World (arena.ai)

1

Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures (arxiv.org)

1

Recent improvements to the type checker – Swift Compiler (swift.org)

1

Semantic Reification: A New Paradigm for Random Program Generation (sigplan.org)

1

GPU Forecasters: Language Models as Selective Surrogates for Kernel Optimization (arxiv.org)

1

Type-Error Ablation and AI Coding Agents (arxiv.org)

1

Session-Aware Agentic Routing: Continuity-Aware Model Selection for Long-Horizon (vllm.ai)

1

Directionality in Low Precision (constantinides.net)

1

O-POPE: High-Frequency Pipelined Outer Product based GEMM acceleration (arxiv.org)

2

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Agents (nvidia.com)

51

Strace-ui, Bonsai_term, and the TUI renaissance (janestreet.com)

2

Why Larger Models Learn More: Capacity, Interference, Rare-Task Retention (arxiv.org)

2

PyTorch's playbook for AI coding, as of May 2026 (pytorch.org)

2

When does fragmentation occur in the CUDA caching allocator? (pytorch.org)

2

Evaluating Apple Silicon for Data Processing [pdf] (tum.de)

2

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation (arxiv.org)

3

Do GPUs Need New Tabular File Formats? (arxiv.org)

1

Polyhedral Compilation in MLIR (sajidzubair.substack.com)

1

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs (arxiv.org)

2

Three Trends from MLSys 2026 (modular.com)

3

What are the real problems of continual learning? (infinitefaculty.substack.com)

1

What "Memory Compiler" Actually Means: From Bitcells to GDS Tiling (thecloudlet.github.io)

5

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-Offs, and Perf (arxiv.org)

1

MIT EECS/CSAIL Agentic Coding in Practice Seminar Series (csail.mit.edu)

1

The Load-Balance Problem Behind Hybrid Parallelism (hecate0821.github.io)

1

Delayed Tensor Parallelism for Faster Transformer Inference (kog.ai)

1

Continuous Diffusion Models Can Obey Formal Syntax (arxiv.org)

1

HartBreaker: Deterministic Fuzzing of Multi-Hart RISC-V CPUs (ethz.ch)

4

Tuning LLVM's SLP Vectorizer Cost Model (kaving.me)

1

A Friendly Tour of Substructural, Uniqueness, Ownership, Capabilities and more! (federicobruzzone.github.io)

1

FlashLib: Bringing Flash Magic to Classical Machine Learning Operators (flashml-org.github.io)

1

FML-Bench: A Controlled Study of AI Research Agent Strategies (arxiv.org)

2

Finding deadlocks in CuTe kernels with SPIN (metaworld.me)

2

A Case for Tracing Based DSL Kernel Languages (metaworld.me)

1

You don't need all the LLM benchmarks (smola.org)

1

Elusive order of async GPU kernels: scheduling, abstractions, DSL implications (ianbarber.blog)

1

MileStone: A Multi-Objective Compiler Phase Ordering Framework (arxiv.org)

3

SSV: Sparse Speculative Verification for Efficient LLM Inference (arxiv.org)

2

Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection (arxiv.org)

3

Characterization of machine learning compilers for LLM inference on NVIDIA GPUs (springer.com)

1

Chip design from the bottom up – Reiner Pope [video] (youtube.com)

2

LT2: Linear-Time Looped Transformers (charlesdddd.github.io)

2

Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel (arxiv.org)

1

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Apps (arxiv.org)

33

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs (arxiv.org)

1

[RFC] Open Access to Standards Documents – LLVM Project (llvm.org)

3

Curly braces: An evolution of UNIX and C (thalia.dev)

1

NanoTag: Systems Support for Efficient Byte-Granular Overflow Detection on Arm (github.com/ice-rlab)

1

InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents (inferencebench.ai)

1

Tracking Capabilities for Safer Agents (arxiv.org)

2

Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation (arxiv.org)

1

Verifying EDA and compiler optimizations once and for all (samuelcoward.co.uk)

1

StepStone: LLM-Based GPU Kernel Driver Fuzzing via User-Space Libraries [pdf] (ucr.edu)

1

Graded Modal Types for Memory and Communication Safety (kent.ac.uk)

1

Systems Are Changing: The Architect's Role in the Era of Agentic Co-Design (sigarch.org)

1

Code-Specify-Test-Debug-Prove: Flexibly Integrating Separation Logic [pdf] (cam.ac.uk)

3

Detecting Relaxed Memory Concurrency Bugs in C and C++ Compilers (lukegeeson.com)

2

The downgrading semantics of memory safety (Extended version) (arxiv.org)

1

Direction-Preserving Number Representations (arxiv.org)

1

On the Unreasonable Effectiveness of PBT for Validating Formal Specifications (proofsandintuitions.net)