Articles by matt_d
1

ASM Visualizer: a new assembly visualization tool (diveintosystems.org)

1

Oral History of Jensen Huang – Computer History Museum [video] (youtube.com)

1

The Equational Theories Project: Collaborative Mathematical Research at Scale (terrytao.wordpress.com)

1

The Quest Toward That Perfect Compiler – ACM SPLASH / OOPSLA 2025 Keynote [video] (youtube.com)

1

Learning to love mesh-oriented sharding (ezyang.com)

1

Microbenchmarking NVIDIA's Blackwell: An In-Depth Architectural Analysis (arxiv.org)

1

tritonBLAS: Triton-based Analytical Approach for GEMM Kernel Parameter Selection (arxiv.org)

1

RFC: Forming a Working Group on Formal Specification for LLVM (llvm.org)

3

hls4ml: A Flexible, OSS Platform for ML Acceleration on Reconfigurable Hardware (arxiv.org)

1

Nice to Meet You: Synthesizing Practical MLIR Abstract Transformers [pdf] (utah.edu)

1

SAT Etudes 2: Toy DPLL (philipzucker.com)

3

The Hitchhiker's Guide to Coherent Fabrics: 5 Programming Rules (sigarch.org)

1

Optimizing libdwarf .eh_frame enumeration (rovarma.com)

1

GSoC 2025: ClangIR Upstreaming (llvm.org)

2

Normal Forms for MLIR – 2025 US LLVM Developers' Meeting – Alex Zinenko [video] (youtube.com)

1

Place Capability Graphs: A General-Purpose Model of Rust's Ownership & Borrowing [video] (youtube.com)

1

LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations (arxiv.org)

2

What Scala can learn from Rust, Swift, and C++ [video] (youtube.com)

1

Lifetime Safety in Clang – 2025 US LLVM Developers' Meeting [video] (youtube.com)

3

Constant-time support coming to LLVM: Protecting cryptographic code (trailofbits.com)

1

Seymour Cray at 100 – Clive England – TNMoC Talk [video] (youtube.com)

5

Mitigating Application Resource Overload with Targeted Task Cancellation (muratbuffalo.blogspot.com)

1

MetaOCaml: Ten Years Later System Description (sciencedirect.com)

1

Where "Simulation" Came From (decomposition.al)

1

Inside VOLT: Designing an Open-Source GPU Compiler (arxiv.org)

1

An MLIR Pipeline for Offloading Fortran to FPGAs via OpenMP (acm.org)

3

Inside Nvidia GPU: Blackwell's Limitations & Future Rubin's Microarchitecture (github.com/zartbot)

1

Kitsune: Enabling Dataflow Execution on GPUs with Spatial Pipelines (acm.org)

1

DMA Collectives for Efficient ML Communication Offloads (arxiv.org)

4

10 Myths of Scalable Parallel Languages Part 8: Striving Toward Adoptability (chapel-lang.org)

8

Slicing Is All You Need: Towards a Universal One-Sided Distributed MatMul (arxiv.org)

2

Machine Scheduler in LLVM – Part II (myhsu.xyz)

1

The content-addressed storage (CAS) model of incremental build systems (jonmsterling.com)

2

Defeating the Training-Inference Mismatch via FP16 (arxiv.org)

3

Opportunistically Parallel Lambda Calculus (acm.org)

1

Place Capability Graphs: A General-Purpose Model of Rust's Ownership & Borrowing (acm.org)

2

Linear effects, exceptions, resources: Curry-Howard destructors correspondence (arxiv.org)

3

Making the Clang AST Leaner and Faster (cppalliance.org)

3

Draw high dimensional tensors as a matrix of matrices (ezyang.com)

1

Wafer-Scale AI Compute: A System Software Perspective (sigops.org)

2

Towards Automated GPU Kernel Generation (simonguo.tech)

1

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs (arxiv.org)

2

Triton Developer Conference 2025 Talks [video] (youtube.com)

1

OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data (arxiv.org)

1

torchcomms: A modern PyTorch communications API (github.com/meta-pytorch)

2

Building an Open ABI and FFI for ML Systems (apache.org)

1

Instruction Set Migration at Warehouse Scale (arxiv.org)

2

Secure Parsing and Serializing with Separation Logic Applied to CBOR, CDDL, COSE [pdf] (microsoft.com)

2

The Calculated Typer – Haskell Symposium (ICFP/SPLASH'25) [video] (youtube.com)

1

PickleBall: Secure Deserialization of Pickle-Based Machine Learning Models (github.com/columbia)

1

Clang Bytecode Interpreter Update (redhat.com)

1

Scaling Instruction-Selection Verification Against Authoritative ISA Semantics (doi.org)

2

10 Myths of Scalable Parallel Languages Part 7: Minimalist Language Designs (chapel-lang.org)

1

CPU Autoscaling with a Kernel of Truth (acm.org)

1

SafeRace: WebGPU Memory Safety in the Presence of Data Races (acm.org)

1

A guided tour through Oxidized OCaml (gavinleroy.com)

1

Functional Networking for Millions of Docker Desktops (Experience Report) (acm.org)

1

Does Linux Provide Performance Isolation for NVMe SSDs? Configuring cgroups [pdf] (atlarge-research.com)

1

International Conference on Managed Programming Languages & Runtimes (MPLR) 2025 (acm.org)

1

Collective Matrix Multiplication – JAX Pallas:Mosaic GPU (jax.dev)

1

StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs (arxiv.org)

2

Can AI Co-Design Distributed Systems? Scaling from 1 GPU to 1k (harvard-edge.github.io)

1

Hybrid Architectures for Language Models: Systematic Analysis & Design Insights (arxiv.org)

2

LLMc: Beating All Compression with LLMs (washington.edu)

1

All in on MatMul? Don’t Put All Your Tensors in One Basket! (sigarch.org)

1

Muon Outperforms Adam in Tail-End Associative Memory Learning [video] (youtube.com)

2

Barbara Liskov Oral History [video] (youtube.com)

2

CMOS 2.0 – Redefining the Future of Scaling (arxiv.org)

1

Fuss-Free Universe Hierarchies (jonmsterling.com)

1

Pretraining Large Language Models with NVFP4 (arxiv.org)

3

The Next Computing Revolution: Bringing Processing Inside Memory (computer.org)

1

llvm-mos: Modern C/C++ on the Venerable 6502 | VCFMW 20 (2025) [video] (youtube.com)

1

Mercury: Unlocking Multi-GPU Optimization for LLMs via Remote Memory Scheduling [pdf] (storage.googleapis.com)

10

Optimizing a 6502 image decoder – part II: assembly (colino.net)

1

Arm A-Profile Architecture developments 2025: Armv9.7-A (arm.com)

1

TypeDis: A Type System for Disentanglement [pdf] (nyu.edu)

1

From CPU Transparency to GPU Complexity – The Performance Engineering Frontier (harvard-edge.github.io)

1

Quotient Polymorphism [pdf] (nott.ac.uk)

1

Labelled preorders and coercions: different approaches to multiple inheritance (jonmsterling.com)

3

GPU Mode Lecture 80: How FlashAttention 4 Works [video] (youtube.com)

1

When You Have a Fuzzer, Everything Looks Like a Reachability Problem [pdf] (ic.ac.uk)

4

F3: The Open-Source Data File Format for the Future (doi.org)

2

Efficient LLM: Bandwidth, Compute, Synchronization, and Capacity are all you need (arxiv.org)

3

HieraSynth: A Parallel Framework for Complete Super-Optimization [pdf] (lsrcz.github.io)

2

Fusion: An Analytics Object Store Optimized for Query Pushdown (doi.org)

2

Type Theory Forall – Philip Wadler – Type Classes, Monads, Logic, Future of PL [video] (youtube.com)

4

3rd Largest Element: SIMD Edition (parallelprogrammer.substack.com)

2

FP64 Floating-Point Emulation in INT8 (arxiv.org)

1

A Very Early History of Algebraic Data Types (hillelwayne.com)

1

RISC-V Conditional Moves (corsix.org)

3

Global Economic History: Cradle of Modernity [pdf] (upenn.edu)

2

"Is it time for a new proof assistant?" – Jon Sterling [video] (youtube.com)

2

OpenSTA: Open-source static timing analysis for FPGAs (zeroasic.com)

1

Arm SIMD Loops – C, ACLE intrinsics, inline assembly – Neon, SVE, SME (arm.com)

1

GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2 (arxiv.org)

2

Weak Memory Model Formalisms: Introduction and Survey (arxiv.org)

2

Program Optimisations via Hylomorphisms for Extraction of Executable Code (dagstuhl.de)

8

Identity Types (bartoszmilewski.com)

1

Categorical Foundations for CuTe Layouts (colfax-intl.com)

6

Transforming recursion into iteration for LLVM loop optimizations (dspace.mit.edu)