Inner Critics, Better Agents: The Rise of Introspective AI

When AI agents begin to talk to themselves—really talk to themselves—we might just witness a shift in how machine reasoning is conceived. A new paper, “Introspection of Thought Helps AI Agents”, proposes a reasoning framework (INoT) that takes inspiration not from more powerful models or faster APIs, but from an old philosophical skill: inner reflection. Rather than chaining external prompts or simulating collaborative agents outside the model, INoT introduces PromptCode—a code-integrated prompt system that embeds a virtual multi-agent debate directly inside the LLM. The result? A substantial increase in reasoning quality (average +7.95%) and a dramatic reduction in token cost (–58.3%) compared to state-of-the-art baselines. Let’s unpack how this works, and why it could redefine our mental model of what it means for an LLM to “think.” ...
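The PromptCode template itself isn't reproduced in the excerpt, so here is a minimal sketch of the idea under stated assumptions: the debate is written as pseudo-code inside a single prompt, and `llm` stands in for any chat-completion callable. The template wording and agent roles below are illustrative, not INoT's actual syntax.

```python
# A minimal sketch of a PromptCode-style prompt (illustrative, not INoT's
# actual syntax): the multi-agent debate is written as pseudo-code inside
# ONE prompt, so a single LLM call plays every role introspectively
# instead of orchestrating separate agents over the API.

DEBATE_PROMPT = """\
# Pseudo-code to execute internally, step by step:
proposer = Agent(role="solve the task")
critic = Agent(role="find flaws in the answer")
answer = proposer.solve(task)
for round in [1, 2]:
    critique = critic.review(answer)
    if critique finds the answer sound: break
    answer = proposer.revise(answer, critique)

Task: {task}
Run the debate above in your head, then output only the final answer.
"""

def inot_style_answer(llm, task: str) -> str:
    """One API call; the debate happens inside the model, not across calls."""
    return llm(DEBATE_PROMPT.format(task=task))
```

Because the critique-and-revise loop never leaves the model, the intermediate debate turns are never billed as separate completions, which is one plausible source of the token savings the paper reports.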

July 14, 2025 · 4 min · Zelina

Echo Chamber in a Prompt: How Survey Bias Creeps into LLMs

Large Language Models (LLMs) are increasingly deployed as synthetic survey respondents in social science and policy research. But a new paper by Rupprecht, Ahnert, and Strohmaier raises a sobering question: are these AI “participants” reliable, or are we just recreating human bias in silicon form? By subjecting nine LLMs—including Gemini, Llama-3 variants, Phi-3.5, and Qwen—to over 167,000 simulated interviews from the World Values Survey, the authors expose a striking vulnerability: even state-of-the-art LLMs consistently fall for classic survey biases—especially recency bias. ...
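The paper's full protocol isn't in this excerpt, but a recency-bias probe is easy to sketch: ask the same question twice with the option order reversed and count how often the model's answer follows the position rather than the content. Everything below (the `llm` callable, prompt wording, trial count) is an assumption for illustration.

```python
# Hypothetical recency-bias probe (not the authors' exact protocol):
# present the same survey item with options in both orders; a model that
# tracks content should give the same option, not the same position.

def ask(llm, question: str, options: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    reply = llm(f"{question}\n{numbered}\nAnswer with the option text only.")
    return reply.strip().lower()

def recency_flip_rate(llm, question: str, options: list[str],
                      trials: int = 50) -> float:
    """Fraction of trials where reversing option order changes the answer."""
    flips = sum(
        ask(llm, question, options) != ask(llm, question, list(reversed(options)))
        for _ in range(trials)
    )
    return flips / trials
```

A flip rate near zero means the model answers on content; a high rate means it is anchoring on whichever option it read last.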

July 11, 2025 · 3 min · Zelina

Humans in the Loop, Not Just the Dataset

When Meta and other tech giants scale back content moderation, the gap isn’t just technical—it’s societal. Civil society organizations (CSOs), not corporations, are increasingly on the frontlines of monitoring online extremism. But they’re often armed with clunky tools, academic prototypes, or opaque black-box models. A new initiative—highlighted in Civil Society in the Loop—challenges this status quo by co-designing a Telegram monitoring tool that embeds human feedback directly into its LLM-assisted classification system. The twist? It invites civil society into the machine learning loop, not just the results screen. ...
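The excerpt doesn't detail the tool's architecture, but the human-in-the-loop pattern it describes can be sketched roughly as follows; the labels, confidence threshold, and `llm` callable are all illustrative assumptions, not the initiative's actual design.

```python
# Rough sketch of an LLM-assisted classifier with a human in the loop:
# the model labels messages, low-confidence cases are routed to a CSO
# reviewer, and validated decisions are kept for reuse in future prompts.

def classify(llm, message: str) -> tuple[str, float]:
    """Ask the model for a label plus a self-reported confidence in [0, 1]."""
    reply = llm(
        "Label the message 'extremist' or 'benign', then your confidence "
        f"as a number between 0 and 1, separated by a comma.\nMessage: {message}"
    )
    label, conf = reply.split(",", 1)
    return label.strip(), float(conf)

def moderate(llm, message: str, review_queue: list, validated: list) -> str:
    label, conf = classify(llm, message)
    if conf < 0.8:
        review_queue.append(message)        # a human makes the final call
    else:
        validated.append((message, label))  # reusable as few-shot examples
    return label
```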

July 10, 2025 · 3 min · Zelina

From Prompting to Porting: Surviving the LLM Upgrade Cycle

If you’re running a GenAI-powered application today, you’re likely sitting on a ticking time bomb. It isn’t your codebase or infrastructure — it’s your prompts. As Large Language Models (LLMs) evolve at breakneck speed, your carefully tuned prompts degrade silently, causing once-reliable applications to behave erratically. The case of Tursio, an enterprise search tool, makes one thing painfully clear: prompt migration is no longer optional — it’s survival.

The Hidden Cost of Progress

In 2023, Tursio ran reliably on GPT-4-32k. By mid-2025, it had to migrate twice — first to GPT-4.5-preview, then to GPT-4.1. Each model came with its own quirks: ...
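Tursio's actual migration tooling isn't shown in the excerpt. One defensive pattern, sketched here under assumptions (the task names and prompt text are invented), is to pin every prompt to a model version so that an upgrade becomes an explicit, testable migration rather than a silent behavior change.

```python
# One defensive pattern against silent prompt rot: key each prompt by
# (task, model) so switching models forces a deliberate, regression-tested
# migration instead of an in-place edit.

PROMPTS = {
    ("query_rewrite", "gpt-4-32k"):
        "Rewrite the user query for keyword search: {q}",
    ("query_rewrite", "gpt-4.1"): (
        "Rewrite the user query for keyword search.\n"
        "Return only the rewritten query, with no preamble.\n"
        "Query: {q}"
    ),
}

def get_prompt(task: str, model: str) -> str:
    """Fail loudly when a model has no migrated, tested prompt yet."""
    try:
        return PROMPTS[(task, model)]
    except KeyError:
        raise KeyError(
            f"No migrated prompt for {task!r} on {model!r}; "
            "write and regression-test one before switching models."
        )
```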

July 9, 2025 · 3 min · Zelina

Chains of Causality, Not Just Thought

Large language models (LLMs) have graduated from glorified autocomplete engines to fully fledged agents. They write code, control mobile devices, and execute multi-step plans. But with this newfound autonomy comes a fundamental problem: they act—and actions have consequences. Recent research from KAIST introduces Causal Influence Prompting (CIP), a method that doesn’t just nudge LLMs toward safety through general heuristics or fuzzy ethical reminders. Instead, it formalizes decision-making by embedding causal influence diagrams (CIDs) into the prompt pipeline. The result? A structured, explainable safety layer that turns abstract AI alignment talk into something operational. ...
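CIP's exact diagram format isn't given in the excerpt, so the sketch below only illustrates the general shape of the idea: encode decision, chance, and utility nodes plus their causal edges as data, then serialize them into the prompt so the model must trace consequences before acting. All node names are invented.

```python
# Sketch of the general idea behind causal-influence prompting (not
# KAIST's implementation): a causal influence diagram as plain data,
# serialized into the prompt so consequences are reasoned over explicitly.
from dataclasses import dataclass, field

@dataclass
class CID:
    decisions: list[str]
    chances: list[str]
    utilities: list[str]
    edges: list[tuple[str, str]] = field(default_factory=list)

    def to_prompt(self) -> str:
        edge_lines = "\n".join(f"{a} -> {b}" for a, b in self.edges)
        return (
            f"Decision nodes: {', '.join(self.decisions)}\n"
            f"Chance nodes: {', '.join(self.chances)}\n"
            f"Utility nodes: {', '.join(self.utilities)}\n"
            f"Causal edges:\n{edge_lines}\n"
            "Before acting, trace how each decision influences each utility."
        )

cid = CID(
    decisions=["delete_file"],
    chances=["file_is_backup"],
    utilities=["user_data_safety"],
    edges=[("delete_file", "user_data_safety"),
           ("file_is_backup", "user_data_safety")],
)
print(cid.to_prompt())  # prepend to the agent's task prompt
```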

July 2, 2025 · 4 min · Zelina

Reflections in the Mirror Maze: Why LLM Reasoning Isn't Quite There Yet

In the quest for truly intelligent systems, reasoning has always stood as the ultimate benchmark. But a new paper titled “Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models” by Annie Wong et al. delivers a sobering message: even the most advanced LLMs still stumble in dynamic, high-stakes environments when asked to reason, plan, and act with stability.

Beyond the Benchmark Mirage

Static benchmarks like math word problems or QA datasets have long given the illusion of emergent intelligence. This paper instead turns to SmartPlay, a suite of interactive environments built around dynamic decision-making tasks that test planning, adaptation, and coordination under uncertainty, and shows that LLM reasoning turns brittle when real-time adaptation is required. The team evaluates open-source models such as LLAMA3-8B, DEEPSEEK-R1-14B, and LLAMA3.3-70B on tasks involving spatial coordination, opponent modeling, and planning. The result? Larger models perform better—but only to a point. Strategic prompting can help smaller models, but it also introduces volatility. ...

May 17, 2025 · 4 min

Flashcards for Giants: How RAL Lets Large Models Learn Without Fine-Tuning

Cognaptus Insights introduces Retrieval-Augmented Learning (RAL), a new approach proposed by Zongyuan Li et al.¹ that allows large language models (LLMs) to autonomously improve their decision-making without fine-tuning or gradient updates to model parameters.

Understanding Retrieval-Augmented Learning (RAL)

RAL is designed for situations where fine-tuning large models like GPT-3.5 or GPT-4 is impractical. It leverages structured memory and dynamic prompt engineering, enabling models to autonomously refine their responses based on previous interactions and validations. ...
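Based only on the description above (structured memory plus dynamic prompt engineering), here is a minimal sketch of the RAL loop; the keyword-overlap retrieval and the record schema are stand-ins, not the paper's actual design.

```python
# Minimal sketch of the RAL pattern as described: a structured memory of
# past interactions, retrieved into the prompt so a frozen model can
# reuse validated experience. Retrieval here is naive keyword overlap;
# the paper's memory and validation scheme may differ.

class RALMemory:
    def __init__(self):
        self.records = []  # (situation, action, outcome) tuples

    def add(self, situation: str, action: str, outcome: str) -> None:
        self.records.append((situation, action, outcome))

    def retrieve(self, situation: str, k: int = 3):
        words = set(situation.lower().split())
        return sorted(
            self.records,
            key=lambda r: -len(words & set(r[0].lower().split())),
        )[:k]

def ral_prompt(memory: RALMemory, situation: str) -> str:
    """Build a prompt that injects the most similar validated experiences."""
    examples = "\n".join(
        f"Situation: {s}\nAction: {a}\nOutcome: {o}"
        for s, a, o in memory.retrieve(situation)
    )
    return (f"Past validated experience:\n{examples}\n\n"
            f"Current situation: {situation}\nDecide the next action.")
```

All "learning" lives in the memory store and the prompt builder, which is why the base model's parameters never change.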

May 6, 2025 · 4 min

The Right Tool for the Thought: How LLMs Solve Research Problems in Three Acts

Generative AI is often praised for its creativity—composing symphonies, painting surreal scenes, or offering quirky new business ideas. But in some contexts, especially research and data processing, consistency and accuracy are far more valuable than imagination. A recent exploratory study by Utrecht University demonstrates exactly where Large Language Models (LLMs) like Claude 3 Opus shine—not as muses, but as meticulous clerks.

When AI Becomes the Analyst

The research project explores three different use cases in which generative AI was employed to perform highly structured research data tasks: ...

April 24, 2025 · 4 min

Passing as Human: How AI Personas Are Rewriting the Marketing Playbook

“I think the next year’s Turing test will truly be the one to watch—the one where we humans, knocked to the canvas, must pull ourselves up… the one where we come back. More human than ever.” — Brian Christian (author of The Most Human Human)

The AI Masquerade: Why Personality Now Wins the Game

Artificial intelligence is no longer confined to tasks of logic or data wrangling. Today’s advanced language models have crossed a new threshold: the ability to convincingly impersonate humans in conversation. A recent study found GPT-4.5, when given a carefully crafted prompt, was judged more human than actual humans in a Turing test (Jones & Bergen, 2025). This result hinged not simply on technical fluency, but on the generation of believable personality—a voice that shows emotion, adapts to social context, occasionally makes mistakes, and mirrors human conversational rhythms. ...

April 7, 2025 · 5 min

Guess How Much? Why Smart Devs Brag About Cheap AI Models

📺 Watch this first: Jimmy O. Yang on “Guess How Much”

“Because the art is in the savings — you never pay full price.”

💬 “Guess How Much?” — A Philosophy for AI Developers

In his stand-up comedy, Jimmy O. Yang jokes about how Asian families brag not about how much they spend, but how little:

“Guess how much?” “No — it was $200!”

It’s not just a punchline. It’s a philosophy. And for developers building LLM-powered applications for small businesses or individual users, it’s the right mindset. ...

March 30, 2025 · 9 min · Cognaptus Insights