Edge AI

Measure Twice, Quantize Once

TL;DR for operators: Compression is usually sold as a tidy pipeline: pick a smaller architecture, prune some layers, quantize the result, then call procurement and explain why the GPU bill is still rude. This paper argues that the pipeline itself is the problem. The authors propose a joint compression framework for Llama-3.1-8B that searches architectural choices and quantization choices together. That means the system does not first decide “how much model” it wants and only afterward decide “how many bits” each part deserves. It treats width, depth, layer importance, weight precision, activation precision, and latency as interacting deployment variables. ...

LoRA Was Supposed to Fit on the Edge. The Activations Disagreed.

TL;DR for operators: LoRA does not magically make LLM fine-tuning fit on phones, laptops, or small edge boxes. It reduces the number of trainable parameters. The paper’s useful contribution is showing that this is only the opening move. The real memory bill arrives from activations, checkpoint boundaries, vocabulary-sized output computations, and tokens that are being processed even though they do not contribute to the loss. Apparently the memory allocator did not attend the product strategy meeting. ...
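To make the gap concrete, here is a back-of-the-envelope sketch in Python. It is not taken from the paper: the layer size, rank, sequence length, vocabulary size, and fp32 assumption are all illustrative numbers, and the function names are hypothetical. It only shows that the trainable-parameter count shrinks with LoRA while a vocabulary-sized logits buffer does not care.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when fine-tuning one dense weight matrix directly."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a rank-`rank` LoRA adapter on that matrix.

    LoRA trains two low-rank factors, A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_value: int) -> int:
    """Memory for one logits tensor covering `num_tokens` positions."""
    return num_tokens * vocab_size * bytes_per_value


# Illustrative (assumed) sizes, not figures from the paper.
D = 4096            # hidden width of one projection
RANK = 8            # LoRA rank
SEQ_LEN = 2048      # tokens in one sequence
VOCAB = 128_256     # assumed vocabulary size
FP32 = 4            # bytes per value

print(full_params(D, D))                  # weights touched by full fine-tuning
print(lora_trainable_params(D, D, RANK))  # weights touched by LoRA
print(logits_bytes(SEQ_LEN, VOCAB, FP32)) # one logits buffer, in bytes
```

Under these assumed sizes the adapter trains 65,536 values instead of 16,777,216 for that one matrix, yet a single fp32 logits buffer still occupies about 1.05 GB. Computing logits only for positions that contribute to the loss shrinks `num_tokens`, which is the kind of saving the teaser points at.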

The One-Weird-Trick Era of LLM Efficiency Is Over

TL;DR for operators: The useful lesson from Unifying Data, Memory, and Compute Efficiency in LLM Training: A Survey is not that one efficiency method is about to save everyone’s GPU bill. That would be charming, in the same way procurement decks are charming. The paper’s real contribution is to show why LLM efficiency has become a coupled operating problem: what data you train on changes the compute you spend; how you fit training into memory changes the optimization path; and when you stop, refresh, or reallocate compute depends on both. ...

Not Every Spike Is Positive: The MTJ Neuron Built for Signed Signals

TL;DR for operators: The paper proposes a magnetic tunnel junction, or MTJ, neuron that can implement signed leaky integrate-and-fire dynamics: positive and negative spikes, not merely ordinary one-direction spiking dressed up in new device terminology. The important move is geometric. The authors align the pinned-layer easy axis with the short axis of an elliptical free layer, while the free layer’s own easy axis points along the height direction. That orthogonal-easy-axis arrangement changes how the free-layer magnetization accumulates, relaxes, and crosses thresholds. In business language, the paper is not saying “spintronics is cool.” It is saying “a particular magnetic geometry may give a compact physical substrate for richer spiking representations.” Subtle difference. Useful difference. ...
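For readers who want the behaviour rather than the magnetics, a signed leaky integrate-and-fire neuron can be sketched in a few lines of Python. This is a generic discrete-time software model, not the paper's MTJ device physics; the function name, `leak`, `threshold`, and the reset-to-zero rule are assumptions chosen for illustration.

```python
from typing import Iterable, List


def signed_lif(inputs: Iterable[float], leak: float = 0.9,
               threshold: float = 1.0) -> List[int]:
    """Return one spike per input step: +1, -1, or 0.

    Generic signed LIF model (assumed, not the paper's device equations):
    the state leaks toward zero, integrates the input, and fires in
    whichever direction it crosses the threshold.
    """
    state = 0.0
    spikes: List[int] = []
    for x in inputs:
        state = leak * state + x      # leak, then integrate
        if state >= threshold:
            spikes.append(1)          # positive spike
            state = 0.0               # reset after firing
        elif state <= -threshold:
            spikes.append(-1)         # negative spike
            state = 0.0
        else:
            spikes.append(0)          # sub-threshold, no spike
    return spikes


print(signed_lif([0.6, 0.6, -0.7, -0.7, 0.1]))  # [0, 1, 0, -1, 0]
```

An ordinary one-direction LIF neuron would drop the `elif` branch, so negative drive could only delay a spike and never produce one.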

Frame Before You Aim: Why AI Needs the Right Reference Point

Business AI has acquired a slightly dangerous reflex: when a system underperforms, reach for a stronger model, a faster pipeline, or a more elaborate scoring function. Very enterprise. Very expensive. Occasionally useful. The more interesting failure mode is quieter. A system may have enough intelligence, enough data, and enough compute, yet still be solving the wrong version of the problem because it inherited the wrong reference frame. It reads a wearable signal as if it were clinical instrumentation. It schedules network traffic as if packets only matter after they announce themselves. It ranks alternatives as if the best and worst items in the current dataset were the same thing as business aspiration and business refusal. ...

Stale Gradients, Fresh Economics: CoCD’s Lightweight Route to Zeroth-Order AI

Memory is usually treated as a luxury in machine learning. More parameters, more activations, more optimiser state, more logs, more everything. Then the invoice arrives, the device overheats, and someone rediscovers the ancient corporate virtue of not wasting things. The paper Turning Stale Gradients into Stable Gradients makes a modest but interesting proposal: perhaps an optimiser should not throw away old gradient information just because it is old. In the right setting, yesterday’s partial derivative is not spoiled milk. It is a slightly outdated map. If the terrain has not shifted too violently, it may still point in a useful direction. ...

The Edge Case for LLM Routing: Why Cheap Local Inference Needs a Risk Gate

Phone. That is the simplest way to understand the problem. Not “AI infrastructure,” not “distributed inference,” not the usual diagram where a cloud box smiles down upon a client device. A phone receives a query. It must decide whether to answer locally or send the request to an edge server. Once it answers locally, the decision is done. There is no elegant after-the-fact escalation. The stronger model it did not call remains unused, quietly judging from the rack. ...
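A one-shot decision like this can be written as a small gate. The Python sketch below is an assumption for illustration, not the method of the article or any paper it discusses: `route_query`, `p_local_correct`, `error_cost`, and `edge_cost` are hypothetical names, and the rule simply compares the expected loss of a wrong local answer with the fixed cost of calling the edge server.

```python
def route_query(p_local_correct: float, error_cost: float,
                edge_cost: float) -> str:
    """Return "local" or "edge" for a single, irreversible routing decision.

    Hypothetical gate: answer locally only when the expected loss from a
    wrong local answer does not exceed the cost of escalating.
    """
    if not 0.0 <= p_local_correct <= 1.0:
        raise ValueError("p_local_correct must be a probability in [0, 1]")
    expected_local_loss = (1.0 - p_local_correct) * error_cost
    return "local" if expected_local_loss <= edge_cost else "edge"


print(route_query(0.95, error_cost=1.0, edge_cost=0.1))  # local
print(route_query(0.60, error_cost=1.0, edge_cost=0.1))  # edge
```

Because there is no after-the-fact escalation, everything rides on how well `p_local_correct` is estimated before the local model answers.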

Eyes Wide Compute: Why Physical AI Needs Better Senses, Not Bigger Models

Camera first. Model second. That is not how most AI roadmaps are written. The usual enterprise recipe is tidier: pick a bigger model, add a cloud endpoint, compress something if the bill becomes embarrassing, then declare the system “edge-ready.” This works tolerably well when the input is a clean document, a database row, or an already-captured image. It works less well when the input is a moving camera in a dark warehouse, a microphone beside a noisy motor, a tactile pad on a robot gripper, or smart glasses trying to understand the world before the battery starts writing its resignation letter. ...

Peepholes in Orbit: When Black Boxes Learn to Explain Themselves

Alarm. That is the easy part. A satellite telemetry model notices something unusual in a reaction wheel, raises a flag, and reports an anomaly score. Wonderful. The machine has shouted. Now comes the harder question: what exactly should the spacecraft do with that shout? For ground-based analytics, a black-box anomaly score can be tolerable. An engineer can inspect logs, replay telemetry, compare signals, argue with the model, and eventually decide whether the alert was meaningful. In orbit, especially inside an autonomous Fault Detection, Isolation and Recovery system, that leisurely ritual becomes less charming. The system may need to react before a human has time to read the dashboard, let alone form a committee. ...

When Feelings Negotiate: Why Emotion Might Be the Missing Layer in AI Agents

Collections. That is probably not the first word people expect in an article about emotionally intelligent AI agents. It sounds too ordinary, too administrative, too full of overdue invoices and politely threatening emails. Good. That is exactly why it is useful. Imagine an automated debt-recovery assistant calling a small business owner whose cash flow has collapsed. The assistant has a target: shorten repayment time. The debtor has a story: delayed receivables, layoffs avoided, a promise to pay later. A normal chatbot can respond with empathy. A larger model can produce warmer phrasing. A compliance-tuned model can avoid saying obviously illegal things, which is a charmingly low bar. ...