LLM Architecture

MoA Than One Curve: Teaching FFNs to Choose Their Nonlinearity

Model architecture has a recurring habit: when something works, we freeze it into a default and move the argument elsewhere. Attention gets the drama. Routing gets the diagrams. Context windows get the product demos. Meanwhile, the feedforward network sits there, quietly holding a large share of the parameters and applying the same nonlinearity to every token, every time, as if “one curve fits all” were a law of nature rather than a convenient engineering choice. ...

MoE Than a Cost Trick: How Sparse Experts Became an Architecture Stack

The old business pitch for Mixture-of-Experts was satisfyingly simple: activate fewer parameters, spend less compute, keep more capacity on the shelf. It sounded like cloud cost optimization with a PhD. Useful, but not exactly poetic. The newer story is more interesting. Three recent arXiv papers (DOT-MoE, DAG-MoE, and LoopMoE) suggest that MoE is no longer just a sparsity trick. It is becoming an architecture stack for conditional computation: first decide how experts are formed, then how selected experts interact, and finally how sparse expert systems can be reused over iterative depth.[1][2][3] ...

Energy Bills for Transformers: CEM Makes Layer Design Less Empirical

Weights are expensive twice. First, they cost money to train. Then they cost money every time a model is served, copied, quantized, tuned, monitored, and occasionally blamed for a cloud bill that no one wants to read twice. This is why every architecture paper with the words “efficient,” “low-rank,” “shared,” or “recursive” immediately attracts attention. Some of that attention is deserved. Some of it is merely the industry’s permanent hunger for a cheaper miracle with a nicer benchmark table. ...

Memory That Actually Remembers: Why MemMachine Signals a Shift in AI Agent Architecture

Memory sounds simple until a business actually needs it. A sales agent should remember what the client objected to last month. A customer-support agent should remember that a refund exception was already approved. A research assistant should remember which dataset was rejected, not vaguely summarize it into “user prefers cleaner data.” A healthcare or financial assistant should not turn a precise historical statement into a soft personality trait because the memory layer wanted to look elegant. Cute demos tolerate this. Production systems do not. ...

EMoT: When AI Starts Thinking Like Fungus (and Why That’s Not as Weird as It Sounds)

The useful question is not whether fungus is smart

Fungus is not the point. That needs saying first, because the title of the paper almost invites the wrong conversation. “Enhanced Mycelium of Thought” sounds like the kind of AI metaphor that appears five minutes before someone starts drawing circles around the word “emergence.” The useful question is more practical: when should an AI system keep a weak idea alive instead of deleting it? ...

Thinking Out Loud — Why LLMs Might Need Chain‑of‑Thought

Audit trails are boring until something goes wrong. In ordinary business operations, this is not controversial. If a payment approval, legal review, procurement decision, or trading order leaves intermediate records, people can reconstruct what happened. If the whole decision is buried inside a black-box system that simply outputs “approved,” “rejected,” or “buy now,” the audit team has a less glamorous job: guessing which invisible machinery produced the visible answer. Charming, in the way dental surgery is charming. ...

Fast & Curious: How ‘Speed-First’ LLM Architectures Change the Build vs. Buy Math

TL;DR for operators

Efficient LLMs are not just “smaller Transformers with a haircut.” That is the comfortable misconception, and like many comfortable things in enterprise AI, it becomes expensive once real users arrive. The survey reviewed here maps the major architectural routes for making large language models faster, cheaper, and more deployable: linear sequence models, sparse attention, efficient full attention, sparse mixture-of-experts, hybrid architectures, diffusion LLMs, and multimodal extensions.[1] Its practical value is not that it declares a single winner. It does something more useful: it tells operators which bottleneck each family is trying to remove. ...