Field Notes

Built in the open

Deep dives from building Tungsten — performance engineering, GPU work, and compiler internals, written while the paint was still wet.

Performance: 1.4 → 21 GB/s

A JSON Lexer Performance Ladder

The rung-by-rung climb of a self-hosted JSON lexer from 1.4 GB/s to 21 GB/s — SIMD classification, branch elimination, and the microarchitectural detail behind every jump.

GPU · ML: 1.16× MLX

Passing MLX on Lightning-1.7B Decode

A pure-Tungsten nvfp4 decode path for a 1.7B-parameter model, taken from 0.71× to 1.16× the speed of Apple's hand-tuned MLX on the same Apple silicon.

Metal 4: M5 Max

Metal 4 matmul2d on M5 Max

Getting Apple's new Metal 4 matmul2d and cooperative tensors running on the M5 Max — the headline instruction, and what it took to feed it.

Compiler: 0-byte header

The Slab AST

Rebuilding the compiler's AST so each node reference is one machine word with zero bytes of header — the data-structure design behind a faster self-host.

SIMD: 200 LOC

A simdjson-Class JSON Classifier in 200 Lines of C

Building a simdjson-class structural classifier from scratch: how 200 lines of C take a JSON lexer from 1.98 GB/s toward simdjson territory.

Debugging: 18% regression

When LTO Refuses to Inline Across -march=native

An 18% regression that turned out to be LTO silently refusing to inline across a target-features mismatch — and the one flag that fixed it.

CPU · SIMD: Apple M-series

SME / SME2 on Apple Silicon

ARM's Scalable Matrix Extension on Apple silicon: what SME and SME2 actually offer the CPU side, and how Tungsten reaches them.

NEON: cautionary tale

NEON shrn/xtn Tail Compression — Don't Do It

Why the NEON shrn/xtn tail-compression trick backfires on Apple silicon's scan helpers — a benchmark-backed cautionary tale.