Why Your LLM Eval Harness Is Quietly Lying to You
Offline eval scores that climb while production quality flatlines are the default failure mode of applied AI. Here is how the gap opens, and how to close it.
I take startups and scale-ups from zero to scale, pairing AI-first delivery with classical systems-design foundations. I write about that work — the engineering, and the same systems lens turned on the human side: raising an autistic child, and physical training as the infrastructure for hard thinking. A technical archive for engineers and the AI-curious alike — written so both walk away with the mechanics.
Microfrontends promise team autonomy. In a regulated finance product they quietly traded one shared codebase for a distributed governance problem nobody owned.