GitHub Resources from arXiv Digests

Repository containing code, configurations, prompts, and data-processing utilities for according-to prompting and QUIP-Score experiments.

orionw/according-to implementation

Repository containing code and results for the LLM-based zero-shot tree induction and embedding experiments, with feature-description files referenced for private-dataset feature names.

ml-lab-htw/llm-trees implementation

Repository containing code and data for replicating the paper's LIME and SP-LIME experiments.

marcotcr/lime-experiments

Repository linked by the paper as the code for PriorDynaFlow, the proposed a priori dynamic multi-agent workflow construction framework.

L1n111ya/PriorDynaFlow framework

Repository for the KindsOfReasoning collection and raw outputs/evaluation results for OpenAI instruction-tuned models.

Kinds-of-Intelligence-CFI/KindsOfReasoning dataset

Repository identified by the paper as containing code for the empirical experiments.

LoryPack/ReferenceInstancesPredictability implementation

Repository containing the experimental software associated with the proposed causal evaluation framework for deferring systems.

andrepugni/PODS implementation

Python, notebook, and spreadsheet materials for driver ranking, forced ARMA changes, and related paper experiments.

omhariyadav20/causal_drivers dataset

EvolveGCN repository data folder referenced as the source for the SBM dynamic graph benchmark.

IBM/EvolveGCN dataset

Repository containing datasets and code files supporting the paper's DeepSeek-versus-other-LLMs benchmark.

ZhengTracyKe/DeepSeek-and-other-LLMs dataset

Repository containing accompanying code and resources for the book, including chapter folders and Python-oriented examples for XAI techniques.

Echoslayer/XAI_From_Classical_Models_to_LLMs implementation

An implementation referenced for reducing GPU all-to-all communication bottlenecks in large-scale MoE execution.

deepseek-ai/DeepEP system

A GitHub repository associated with the survey that organises resources and representative work on self-evolving AI agents.

EvoAgentX/Awesome-Self-Evolving-Agents

Repository containing the EIIE portfolio-management code, configurations, figures, tables, and data used for the cryptocurrency replication and stock-market extension.

jackieli19/PGPortfolio framework

AIDE-ML: AI-Driven Exploration in the Space of Code, used as the agentic AutoML baseline/interface for observing decisions and artifacts.

WecoAI/aideml framework

Git repository providing open access to the research datasets supporting the study's findings and replication.

bigrasam/CRNData dataset

Source code for a minimal MedLog prototype using an OpenAPI-described HTTP REST interface.

mims-harvard/medlog system

Repository for the financial news dataset used with permission as part of the paper's Bloomberg dataset experiments.

philipperemy/financial-news-dataset dataset

Open implementation code for the TGN-SEAL framework introduced and evaluated in the paper.

nssajadi/tgn-seal framework

Public codebase containing the Agent-Driver pipeline, tool library, cognitive memory, reasoning engine, scripts for fine-tuning and inference, prepared data layout, and nuScenes evaluation workflow.

physical-superintelligence-lab/Agent-Driver framework

The public Lean 4/Mathlib library containing the formalized QAOA and Ising-ring components, the lower-bound and attainability theorems, and the machine-checked proof of residualEnergy_isLeast.

urikol/QuantumOptimization

Repository released by the authors for the minimal agentic theorem prover and experiment reproduction.

Axiomatic-AI/ax-prover-base framework

Repository for the CLINC150 intent-classification and out-of-scope evaluation data used in the experiments.

clinc/oos-eval dataset

Repository for the StackOverflow short-text intent dataset evaluated by the paper.

jacoxu/StackOverflow dataset

Repository containing the Banking77 fine-grained banking-intent dataset evaluated by the paper.

PolyAI-LDN/task-specific-datasets dataset

Repository for a hybrid LSTM, multi-head attention, Gaussian fuzzy-rule, and ARIX local-model forecasting system, including folders for LSTM, Transformer, Neuro-fuzzy, data utilities, model code, and an experiment notebook.

mihaozbot/Fuzzy-transformer system

R-based analysis and reporting pipeline orchestrated with DVC, with scripts, parameters, environment configuration, intermediate outputs, and instructions for reproducing the reported results.

tmr-crypto/wf_optim_crypto_analysis implementation

Repository containing code used for the paper, including notebooks and utilities for the daily trading strategy, the CNN model with the new loss, data downloading/processing, baselines, and analysis.

Tony-Guo-1/daily_trading_strategy system

Repository containing Python, R, and notebook implementations, data files, portfolio outputs, and figures associated with the dynamic stock-recommendation study.

AI4Finance-Foundation/Dynamic-Stock-Recommendation-Machine_Learning-Published-Paper-IEEE implementation

A repository created to keep pace with the fast-moving literature on LLMs and LLM-based agents in science.

ur-whitelab/LLMs-in-science

Repository released by the author for resources associated with the LLM-agent survey.

xinzhel/LLM-Agent-Survey

TensorTrade is cited as an open-source package/platform relevant to implementing and adapting RL methods to finance.

tensortrade-org/tensortrade framework

Official code repository for the paper's Agora implementation and heterogeneous 100-agent demonstration.

agora-protocol/paper-demo

The GitHub repository contains the self-improving coding-agent implementation introduced by the paper.

MaximeRobeyns/self_improving_coding_agent framework

FakeNewsNet repository containing fake-news research data resources and crawler tooling for collecting news and related social-media data.

KaiDMML/FakeNewsNet dataset

Repository containing the resulting mind map, collected notes for each reviewed grey resource, and an online repository of references for further exploration.

SAILResearch/replication-24-harsh-generative-ai-release-readiness-checklist dataset

The Google Agent2Agent protocol repository, reviewed as a general-purpose inter-agent protocol and used in the paper's comparative use-case analysis.

google/A2A

Repository for the Web-Agent Protocol, reviewed as a domain-specific inter-agent or human-computer interaction protocol.

OTA-Tech-AI/web-agent-protocol

Repository for the agents.json specification, reviewed as a domain-specific context-oriented protocol for exposing website capabilities to agents.

wild-card-ai/agents-json

Companion repository maintained by the authors to track ongoing developments in AI agent protocols.

zoe-yyx/Awesome-AIAgent-Protocol

An Awesome Data Agents repository linked directly in the paper header, likely used to collect or organize data-agent resources associated with the survey.

HKUSTDial/awesome-data-agents

JoyAgent is discussed as a proto-L3 system that begins to address predefined-toolset limitations through tool evolution and multi-level thinking.

jd-opensource/joyagent-jdgenie system

GitHub repository established by the authors as the project page associated with the survey on embodied learning for object-centric robotic manipulation.

RayYoh/OCRM_survey

Repository linked by the paper as the full list of surveyed papers and summary slides.

junhua/awesome-finance-ai-papers

Repository for AlphaFin, a benchmark/resource associated with financial question answering and stock prediction.

AlphaFin-proj/AlphaFin dataset

Repository for R-Judge, a benchmark for safety judgment and risk identification.

Lordog/R-Judge benchmark

Repository for FinanceBench, a financial question-answering benchmark.

patronus-ai/financebench dataset

Repository for a Japanese financial language-model benchmark.

pfnet-research/japanese-lm-financial-benchmark benchmark

Repository for BBT-Fin/CUGE-related Chinese financial language benchmark resources.

ssymmetry/BBT-FinCUGE-Application benchmark

Repository for FinEval, a Chinese benchmark for financial domain knowledge.

SUFE-AIFLM-Lab/FinEval benchmark

Repository for PIXIU/FinMA-related financial LLM resources, instruction data, and evaluation benchmarks.

The-FinAI/PIXIU dataset

Repository for CFBenchmark, a Chinese financial benchmark covering multiple financial NLP tasks.

TongjiFinLab/CFBenchmark benchmark

Repository for DocMath-Eval, used for evaluating numerical reasoning over text and tables.

yale-nlp/DocMath-Eval dataset

A curated repository of papers and benchmarks associated with the survey on reasoning with foundation models.

reasoning-survey/Awesome-Reasoning-Foundation-Models

agentUniverse, a multi-agent ecosystem for autonomous agents.

agentuniverse-ai/agentUniverse framework

Agno, described as an agentic workflow framework for LLM applications.

agno-agi/agno framework

Phidata, described as a framework for building multi-modal agents with memory, knowledge, tools, and reasoning.

agno-agi/phidata framework

Coze, described as an open framework for building agentic applications.

coze-dev/coze framework

Flowise, described as a drag-and-drop UI to build LLM apps with LangChain.

FlowiseAI/Flowise framework

LangGraph, described as a stateful multi-actor workflow library for LLM applications.

langchain-ai/langgraph framework

Dify, described as an open-source LLM application development platform.

langgenius/dify framework

Microsoft Semantic Kernel, compared as an agent workflow system.

microsoft/semantic-kernel framework

n8n, described as a fair-code workflow automation platform with UI and integrations.

n8n-io/n8n

OpenAI Swarm, described as a multi-agent framework by OpenAI.

openai/swarm framework

Qwen-Agent, a QwenLM agent framework/repository compared in the survey.

QwenLM/Qwen-Agent framework

An author-linked repository organizing papers on retrieval-augmented generation.

USTCAGI/Awesome-Papers-Retrieval-Augmented-Generation

GPT Engineer, a software-development agent implementation cited in the engineering application survey and open-source project discussion.

AntonOsika/gpt-engineer implementation

GPT Researcher, an experimental application that uses LLMs for research-question development, web crawling, source summarization, and aggregation.

assafelovic/gpt-researcher implementation

AI Legion, an LLM-agent implementation cited in the survey's open-source library and reference set.

eumemic/ai-legion implementation

LoopGPT, an LLM-agent implementation cited in the survey's open-source library and reference set.

farizrahman4u/loopgpt implementation

AGiXT, an agent framework implementation cited in the survey as a dynamic AI automation platform.

Josh-XT/AGiXT framework

DemoGPT, a software-development agent repository cited in the engineering application survey and open-source project discussion.

melih-unsal/DemoGPT implementation

MiniAGI, an LLM-agent implementation cited in the survey's open-source library and reference set.

muellerberndt/mini-agi implementation

AgentVerse, a multi-agent collaboration framework referenced among surveyed agent systems and open-source libraries.

OpenBMB/AgentVerse framework

AgentGPT, an LLM-based autonomous-agent system cited in the survey's open-source library and reference set.

reworkd/AgentGPT implementation

Auto-GPT, an autonomous LLM-agent implementation included in the construction taxonomy and open-source library discussion.

Significant-Gravitas/Auto-GPT implementation

SmolModels/developer-style agent repository cited as a software engineering application artifact.

smol-ai/developer implementation

WorkGPT, a workflow-oriented LLM-agent framework cited as similar to AutoGPT and LangChain.

team-openpm/workgpt implementation

SuperAGI, an autonomous-agent framework cited in the survey's open-source library and reference set.

TransformerOptimus/SuperAGI implementation

XLang, an LLM-agent/tool-use framework cited as supporting executable language grounding and interaction with databases, web applications, and physical robots.

xlang-ai/xlang implementation

Repository for the survey on memory mechanisms of LLM-based agents, including the paper link and visual summaries of the survey sections.

nuster1128/LLM_Agent_Memory_Survey

A curated reading list accompanying the survey, organized around LLM-agent optimization methods, datasets, benchmarks, and applications.

YoungDubbyDu/Awesome-LLM-Agent-Optimization-Papers

Repository for the paper's code and experimental workflow comparing LLM self-explanations, human rationales, and post-hoc attribution explanations.

oeberle/self_explanations_human_rationales benchmark

Repository path listed as the source for the gas station revenue dataset.

bighuang624/DSANet dataset

Repository listed as the source for daily COVID-19 confirmed and recovered case data.

CSSEGISandData/COVID-19 dataset

Repository listed as the source for SPMD and VED driving and vehicle energy datasets.

ElmiSay/DeepFEC dataset

Repository listed as the source for the exchange-rate dataset used in LTSF studies.

laiguokun/multivariate-time-series-data dataset

Repository listed as the source for daily stock opening price data.

z331565360/State-Frequency-Memory-stock-prediction dataset

Repository listed as the source for the ETT transformer temperature/load dataset.

zhouhaoyi/ETDataset dataset

Repository containing the medical RAG application, evaluation framework, text rechunking pipeline, configurations, and generated evaluation outputs.

abdullahmoosa/medrag-research framework

OpenAI HumanEval benchmark for evaluating code generation with pass@k metrics.

openai/human-eval dataset

Self-rewarding reasoning LLM implementation referenced as an example of using model-generated judgments to reduce annotation cost.

RLHFlow/Self-rewarding-reasoning-LLM framework

AlpacaEval benchmark for automatic evaluation of instruction-following model outputs.

tatsu-lab/alpaca_eval benchmark

Repository containing the Phase 1 SSL and contrastive encoder code, Phase 2 MoE PPO curriculum, inference and backtest scripts, routing diagnostics, and deployment-related components; the repository states that Phase 3 personalization is proprietary and not released.

rpishehv/PublicFinance-RL framework

Repository containing code for training and evaluating the Transformer electricity price forecasting model and comparing it against EPF toolbox benchmarks.

osllogon/epf-transformers benchmark

Repository linked from the paper that mirrors the paper title, abstract, framework figure, table of contents, and compiled relevant works for agent categories and attack types.

OSU-NLP-Group/AgentAttack framework

Author-provided production-oriented implementation of the A-Mem agentic memory system.

WujiangXu/A-mem-sys

Author-provided repository for evaluating the A-Mem method and reproducing benchmark experiments.

WujiangXu/AgenticMemory

Repository for ABC-Bench, a benchmark for evaluating whether coding agents can explore repositories, edit code, configure environments, deploy containerized backend services, and pass external HTTP/API integration tests.

OpenMOSS/ABC-Bench dataset

Optimized Faiss exact-search implementation for Intel CPUs.

architecture-research-group/ae-asplo25-iks-faiss package

Cycle-approximate IKS simulator parameterized with RTL and memory/interconnect timing.

architecture-research-group/iks_simulator system

Public repository containing the access-controlled website implementation, agent-side experiment scripts, modified SST/Auth component as a submodule, and experiment logs for delegated access workflows.

asu-kim/agentic-website framework

Code and dataset repository for ActivationReasoning, including implementation and materials for replication or extension.

ml-research/ActivationReasoning dataset

Official repository containing the FLARE code, task configurations, experimental data, retrieval setup, and run instructions.

jzbjyb/FLARE

Repository containing data loaders, preprocessing, regime models, custom environments, PPO training pipelines, ablations, evaluation outputs, figures, and notebooks associated with the regime-aware portfolio framework.

GabrielNixon/RegimeAware-PPO framework

SEntFiN is cited as a benchmark dataset for sentiment analysis of Indian financial news headlines and is used to fine-tune the LLaMA 3.2 model.

pyRis/SEntFiN dataset

An installable Python repository implementing AGoT, AIoT, and GIoT and providing experimental setups, datasets, and result files used to reproduce or inspect the paper's evaluations.

AgnostiqHQ/multi-agent-llm framework

Repository containing the code developed and used for the study, with a README to support replication and methodology exploration.

stellacydong/rl-cvar-insurance-reserving framework

GitHub repository identified by the paper as containing implementation details for the Dueling DDQN liquidity-provision method and baseline methods.

HaochenZhang717/Uniswap-v3 benchmark

Repository stated to include scripts for data preprocessing, model training, performance evaluation, and ablation studies for AMDTL.

mlaurelli/amdtl framework

Repository containing AMDM implementation code, simulation scripts, evaluation scripts, example data, results files, plots, and the paper materials.

Manishms18/Adaptive-Multi-Dimensional-Monitoring dataset

Longformer repository for the long-document transformer model evaluated as an additional architecture in the paper.

allenai/longformer implementation

A continuously updated collection of FFM-related publications, tools, datasets, and resources associated with the survey.

FinFM/Awesome-FinFMs

The paper identifies this repository as containing representative Python code generated under the tested prompt levels.

VonAugustus/SnakeData implementation

Claude-Agent-SDK framework used as one of the scaffold alternatives in the agentic scaffold impact study.

anthropics/claude-agent-sdk-python framework

Public repository for the AgencyBench benchmark and evaluation toolkit released by the authors.

GAIR-NLP/AgencyBench dataset

OpenAI-Agents-SDK framework used as one of the scaffold alternatives in the agentic scaffold impact study.

openai/openai-agents-python framework

Repository for the AGENT KB cross-framework agent memory system introduced and evaluated in the paper.

OPPO-PersonalAI/Agent-KB framework

Repository for the Agent Mentor / Agent Analytics open-source observability and analytics platform for agentic AI applications.

AgentToolkit/agent-mentor framework

GitHub path identified by the paper as the code corresponding to the analytics pipeline used for semantic feature analysis.

AgentToolkit/agent-mentor implementation

Repository containing experiment code, response matrices, IRT models, task embeddings, LLM-as-a-judge features, adaptive testing code, and scripts for the new task, new response, new agent, and new benchmark experiments.

dariakryvosheieva/agent-psychometrics implementation

Repository for the paper's virtual trading arena, ArenaTrader implementation, prompts, code, and data.

wekjsdvnm/Agent-Trading-Arena dataset

Official code repository implementing AWM pipelines for WebArena and Mind2Web in offline and online settings.

zorazrw/agent-workflow-memory

Repository for the Agent-as-a-Judge project and DevAI-related evaluation artifacts.

metauto-ai/agent-as-a-judge framework

GPT-Pilot is one of the three open-source code-generation agentic systems benchmarked in the paper.

Pythagora-io/gpt-pilot framework

A GitHub repository collecting papers and resources for the survey on Agent-as-a-Judge.

ModalityDance/Awesome-Agent-as-a-Judge

Repository associated with Agent-R, the paper's iterative self-training framework for training language-model agents to reflect and recover from errors.

bytedance/Agent-R framework

Repository for AgentBench datasets, environments, and integrated evaluation package.

THUDM/AgentBench dataset

Indeed Hiring Lab repository tracking the share of job postings mentioning artificial intelligence.

hiring-lab/AI-Hiring-Tracker dataset

Google's Agent2Agent Protocol repository, referenced as the source for A2A, one of the modern agent communication protocols compared in the paper.

google/A2A implementation

BlenderMCP repository cited as a GitHub integration for Blender Model Context Protocol.

ahujasid/blender-mcp implementation

PowerAgent PowerMCP repository cited as a GitHub implementation for power-system simulation software MCPs.

PowerAgent/PowerMCP framework

Repository hosting the draft proteomics_GROUNDING.md epistemic grounding specification and Appendix A preliminary testing materials.

OmicsContext/proteomics-context framework

LangGraph is used as an example of developer-defined graph/state-machine execution with persistence, checkpoints, controlled cycles, guard nodes, and approvals.

langchain-ai/langgraph framework

Swarm is used as an example of star-with-handoffs orchestration using lightweight specialists, routines, and controller selection.

openai/swarm framework

Agent-zero repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

agent0ai/agent-zero framework

ANUS repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

anus-dev/ANUS framework

Camel repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

camel-ai/camel framework

CrewAI repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

crewAIInc/crewAI framework

MetaGPT repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

FoundationAgents/MetaGPT framework

Google ADK repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

google/adk-python framework

Source code repository released by the authors for reproducing the benchmark comparison.

GPT-Laboratory/22-Agentic-Framework-Comparison-for-Reasoning-Tasks-across-BBH-GSM8K-and-ARC-Benchmarks implementation

LangChain repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

langchain-ai/langchain framework

LangGraph repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

langchain-ai/langgraph framework

Mastra repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

mastra-ai/mastra framework

PraisonAI repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

MervinPraison/PraisonAI framework

Autogen repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

microsoft/autogen framework

Semantic-kernel repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

microsoft/semantic-kernel framework

TaskWeaver repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

microsoft/TaskWeaver framework

OpenAI-Agents-Python repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

openai/openai-agents-python framework

Swarm repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

openai/swarm framework

Pydantic-AI repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

pydantic/pydantic-ai framework

Qwen-Agent repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

QwenLM/Qwen-Agent framework

AutoGPT repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

Significant-Gravitas/AutoGPT framework

SuperAGI repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

TransformerOptimus/SuperAGI framework

Upsonic repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

Upsonic/Upsonic framework

Agency-swarm repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

VRSEN/agency-swarm framework

BabyAGI repository evaluated as a selected agentic framework in the paper's 22-framework benchmark study.

yoheinakajima/babyagi framework

A curated and updateable repository of papers and resources organized according to the survey's agentic reasoning taxonomy.

weitianxin/Awesome-Agentic-Reasoning

Code repository for the Agentic Reasoning framework introduced and evaluated in the paper.

theworldofagents/Agentic-Reasoning framework

Public repository for Agentic Reward Modeling, including the RewardAgent implementation and materials intended to facilitate further research.

THU-KEG/Agentic-Reward-Modeling system

A continuously updated collection of relevant studies for the Agentic Web.

SafeRL-Lab/agentic-web

Repository for the AgenticPay benchmark and framework, including buyer and seller agents, negotiation environments, examples, metrics, and model backends for LLM-based commerce negotiation.

SafeRL-Lab/AgenticPay dataset

Repository containing the AgentLAB benchmark code, attack scripts, environments, prompts, data, and usage instructions for evaluating LLM agents against long-horizon attacks.

TanqiuJiang/AgentLAB dataset

Repository for the AgentRewardBench library, including tools for running agents, running judges, scoring judgments, loading trajectories, and submitting leaderboard results.

McGill-NLP/agent-reward-bench dataset

Repository associated with AgentRx, the paper's diagnostic framework and benchmark artifact for AI-agent failure attribution.

microsoft/AgentRx dataset

Official repository for PASB, including the benchmark data, baseline runners, judge code, audit scripts, and documentation.

henrymao2004/agent-sycophancy dataset

Hermes-Agent is one of the two stateful personal-agent stacks evaluated by PASB.

NousResearch/hermes-agent

OpenClaw is one of the two stateful personal-agent stacks evaluated by PASB.

openclaw/openclaw

A curated GitHub repository associated with the paper that lists and classifies related papers on LLM-based agents in software engineering.

DeepSoftwareAnalytics/Awesome-Agent4SE dataset

The GitHub repository for AgentScope, the multi-agent platform described in the paper.

modelscope/agentscope framework

Public implementation of AgentStepper and relevant data for the paper's evaluation.

sola-st/AgentStepper dataset

GitHub Gist containing the Claude Code implementation prompt for the "case summarization by file name" microservice.

https:/

GitHub Gist containing the Case Summarization by Given Case Name Workflow pitch generated by the Planning Agent.

https:/

Official repository for the AgentVerse framework introduced and evaluated in the paper.

OpenBMB/AgentVerse

AgentWard prototype repository implementing a lifecycle security architecture for autonomous AI agents with native adaptation to OpenClaw.

FIND-Lab/AgentWard framework

Agile V skills repository v1.3, described as the composable AI-agent skills library that operationalizes Agile V and includes context-engineering patterns.

Agile-V/agile_v_skills framework

Repository released by the authors for the robust-transformers implementation used with AGRO.

bhargaviparanjape/robust-transformers framework

A GitHub CLI extension that exposes a compact interface for reading review threads and submitting inline pull request comments.

agynio/gh-pr-review implementation

Agyn platform for configuring and orchestrating multi-agent systems with explicit communication, roles, and dedicated sandboxes.

agynio/platform framework

Repository for the AI Act Evaluation Benchmark, including structured EU AI Act compliance scenarios, QA pairs, documentation, and scripts.

davidath/ai-act-evaluation-benchmark dataset

BabyAGI is used as a representative agentic framework showing how LLMs can be embedded in feedback loops to plan, act, adapt, and manage or prioritize subtasks.

yoheinakajima/babyagi framework

Repository for the violent-python dataset of security-oriented Python code and natural-language descriptions.

dessertlab/violent-python dataset

Repository for CodeBERT, the pre-trained programming-language model used as the fine-tuned open-model baseline.

microsoft/CodeBERT implementation

Repository containing an implementation related to reproducing 'Clustering by fast search and find of density peaks'.

AIReproducibility2018/ClusteringByFastSearchAndFindOfDensityPeaks implementation

User interface and implementation for adapting the visual framework to empirical studies based on AI accuracy, human adherence, and final decision-making accuracy, and for computing reliance components and Q.

jhnnsjkbk/accuracy-reliance implementation

A GitHub Action that reviews pull requests by sending changed file contents to ChatGPT/OpenAI and posting review comments back to the PR.

agogear/chatgpt-pr-review system

Repository containing the paper's algorithm code and a link to the associated QuantConnect backtest artifact.

tiagomonteiro0715/AI-Powered-Energy-Algorithmic-Trading-Integrating-Hidden-Markov-Models-with-Neural-Networks implementation

Repository released by the authors for the Writing Quality benchmark, WQRM models, data, and associated evaluation or editing code.

salesforce/creativity_eval

GitHub repository linked by the paper for example Jupyter notebooks and materials related to AIDev and AI teammates in SE 3.0.

SAILResearch/AI_Teammates_in_SE3 dataset

Open GitHub

Repository for the AIRepr / LLM data-science reproducibility experiments and code released by the authors.

qunhualilab/LLM-DS-Reproducibility framework

Open GitHub

Repository stated by the paper as the public code and data release for AirQA.

OpenDFM/AirQA dataset

Open GitHub

Repository for the AISysRev web application, an LLM-based tool for title-abstract screening that imports CSV metadata, applies inclusion/exclusion criteria using multiple LLMs, supports manual screening with LLM guidance, and exports results to CSV.

EvoTestOps/AISysRev system

Open GitHub

Contains the source code, dataset-construction scripts, dependency documentation, and result-generation pipelines associated with the paper.

brains-group/FLARKO framework

Open GitHub

Repository containing code and configurations for training and evaluating AlphaDPO.

junkangwu/alpha-DPO implementation

Open GitHub

Repository containing a practical implementation of the online AdaVol recursive volatility-prediction method.

nicklaswerge/AdaVol package

Open GitHub

Public repository containing experimental code supporting the TDQN trading results reported in the paper.

ThibautTheate/An-Application-of-Deep-Reinforcement-Learning-to-Algorithmic-Trading benchmark

Open GitHub

Semantic Kernel is described as a modular, plugin-based framework for integrating LLM and agent capabilities into enterprise and software systems.

microsoft/semantic-kernel framework

Open GitHub

Swarm is described as a lightweight multi-agent interface framework for experimenting with multi-agent coordination.

openai/swarm framework

Open GitHub

BabyAGI is described as an experimental framework for autonomous task planning and iterative execution through a self-improving task loop.

yoheinakajima/babyagi framework

Open GitHub

Author-provided Deformable DETR code repository used for the reconstruction-based SSL evaluation.

gokulkarthik/Deformable-DETR implementation

Open GitHub

Author-provided DETR code repository used for the proposed SSL task experiments.

gokulkarthik/detr implementation

Open GitHub

Official SWE-bench experiment repository used by the authors as the source of execution logs, generated patches, and pass/fail statuses for the studied tools.

SWE-bench/experiments dataset

Open GitHub

Code repository for ACEFormer, the attention-based stock forecasting system introduced by the paper.

DurandalLee/ACEFormer implementation

Open GitHub

Repository for the Online-Mind2Web benchmark and associated evaluation artifacts introduced by the paper.

OSU-NLP-Group/Online-Mind2Web dataset

Open GitHub

GPT Engineer is described as an AI coding agent that can generate an entire codebase from a prompt and ask clarifying questions.

AntonOsika/gpt-engineer implementation

Open GitHub

MetaGPT is described as a multi-agent framework assigning different GPT roles to form a collaborative software entity for complex tasks.

geekan/MetaGPT framework

Open GitHub

AgentGPT is described as a framework for rapidly configuring and deploying autonomous AI agents.

reworkd/AgentGPT framework

Open GitHub

Multi-GPT is described as an experimental multi-agent system in which expertGPTs collaborate, communicate, and use short- and long-term memory.

sidhq/Multi-GPT system

Open GitHub

Auto-GPT is described as an early example of GPT-4 running fully autonomously by chaining LLM thoughts to achieve user-set goals.

Significant-Gravitas/Auto-GPT implementation

Open GitHub

SuperAGI is described as a developer-centric open-source framework for building, managing, and running autonomous AI agents.

TransformerOptimus/SuperAGI framework

Open GitHub

BabyAGI is described as a task-driven autonomous AI agent that builds and prioritizes tasks toward an overall goal.

yoheinakajima/babyagi implementation

Open GitHub

A curated repository of reentrancy attacks from which the paper draws 13 sufficiently isolated real-world exploit contracts for evaluation.

pcaversaccio/reentrancy-attacks dataset

Open GitHub

The report states that EnglandCovid is a PyTorch Geometric Temporal dataset and cites this GitHub repository as reference [9].

benedekrozemberczki/pytorch_geometric_temporal dataset

Open GitHub

Repository referenced by the paper for more details on the EnglandCovid dataset conversion and temporal neural network analysis.

verma-rishu/Analysis_TNNs implementation

Open GitHub

A curated repository for the paper's agentic-memory survey, organized around the taxonomy introduced in the manuscript and intended to be updated as the field evolves.

FredJiang0324/Anatomy-of-Agentic-Memory

Open GitHub

Official repository for the paper, containing a CLAIR preference-generation notebook, cached results, documentation, and links to associated data.

ContextualAI/CLAIR_and_APO

Open GitHub

Repository for the AndroidWorld environment, task suite, evaluation logic, and associated experiments introduced by the paper.

google-research/android_world dataset

Open GitHub

Official API-Bank code and data directory containing implemented APIs, datasets, database initialization resources, evaluators, simulators, examples, and demo code.

AlibabaResearch/DAMO-ConvAI dataset

Open GitHub

GitHub repository for the DLS-DDPG implementation used or released with the paper.

Hisato-Komatsu/DLS-DDPG implementation

Open GitHub

Repository for AppWorld Engine and AppWorld Benchmark, including the simulated app environment, benchmark tasks, and evaluation infrastructure.

stonybrooknlp/appworld dataset

Open GitHub

Repository for APTBench, the benchmark for evaluating agentic potential of base LLMs during pre-training.

TencentYoutuResearch/APTBench dataset

Open GitHub

Repository identified by the paper as the code for the Arabic tool-calling benchmark work.

kubrak94/gorilla dataset

Open GitHub

AutoResearchClaw is cited as an example of the scenario-verticalized or research-oriented pattern and as the only pipeline/stage subagent case.

aiming-lab/AutoResearchClaw framework

Open GitHub

deer-flow is cited as an example of the scenario-verticalized or research-oriented pattern.

bytedance/deer-flow framework

Open GitHub

cline is included as a corpus project and cited as an example of tool-based delegation.

cline/cline system

Open GitHub

docker-agent is used as a representative orchestration-oriented project combining orchestrator-worker structure, hybrid context, MCP-first tooling, and advanced execution controls.

docker/docker-agent system

Open GitHub

fast-agent is used as an illustrative project showing tool delegation, hybrid context handling, and MCP-first tool registration.

evalstate/fast-agent framework

Open GitHub

Gemini CLI is included as one of the official or first-party products from major AI companies in the corpus.

google-gemini/gemini-cli system

Open GitHub

deepagents is cited as an example of the scenario-verticalized or research-oriented pattern.

langchain-ai/deepagents framework

Open GitHub

Mistral Vibe is included as one of the official or first-party products from major AI companies in the corpus.

mistralai/mistral-vibe system

Open GitHub

Kimi CLI is included as one of the official or first-party products from major AI companies in the corpus and is named as a balanced CLI representative.

MoonshotAI/kimi-cli system

Open GitHub

nullclaw is cited as an example of the Enterprise Full-Featured pattern.

nullclaw/nullclaw framework

Open GitHub

codex is included as an official product/public-evidence case in the corpus and appears in the complete project list.

openai/codex system

Open GitHub

OpenClaw is analyzed as a corpus project and cited as an example of multi-level recursive or balanced CLI-style infrastructure depending on context.

openclaw/openclaw framework

Open GitHub

OpenHands is included in the 70-project corpus and used as an example of event-driven, hybrid-context, enterprise-tool, advanced-safety infrastructure.

OpenHands/OpenHands framework

Open GitHub

agentpool is used to illustrate event-driven subagent architecture with hierarchical context, registry tooling, and enterprise safety classification.

phil65/agentpool framework

Open GitHub

Qwen Code is included as one of the official or first-party products from major AI companies in the corpus.

QwenLM/qwen-code system

Open GitHub

openfang is cited as an example of the Enterprise Full-Featured pattern.

RightNow-AI/openfang framework

Open GitHub

Repository containing three customer-service chatbot variants generated from different single prompts, with methodology, prompts, and source code for the paper's vibe-architecting demonstration.

phomarkon/vibe-architecting-case-study implementation

Open GitHub

GitHub repository for the ARDNS-FN-Quantum code, including the core script and interactive analysis or visualization notebook described by the paper.

umbertogs/ardns-fn-quantum framework

Open GitHub

Repository containing NeuLR and the paper's appendix or supporting resources for benchmark construction and evaluation.

DeepReasoning/NeuLR dataset

Open GitHub

LongBench is used as the main benchmark source for most datasets and for baseline evaluation settings.

THUDM/LongBench dataset

Open GitHub

GitHub repository for the Cross-Attention-only Time Series transformer introduced in the paper.

dongbeank/CATS benchmark

Open GitHub

Repository containing the code implementation for evaluating language models using the paper's framework and generating visualizations; described as configurable for other HuggingFace models, metrics, task types, domains, and reasoning types.

neelabhsinha/lm-application-eval-kit benchmark

Open GitHub

Repository containing code for ArgEval, the paper's argumentation-based LLM decision-support framework.

adamdejl/argeval framework

Open GitHub

Salesforce repository containing the Art_or_Artifice project folder and associated creativity-evaluation resources.

salesforce/creativity_eval benchmark

Open GitHub

Repository containing Python implementation files and supplementary materials for detecting hallucinations with specialized model divergence.

ACMCMC/ask-a-local implementation

Open GitHub

Repository reported by the authors as containing the code and data for Ask an Expert / BBMHReasoning experiments.

QZx7/BBMHReasoning implementation

Open GitHub

Repository containing code and scripts for the ASPIRE paper, including captioning, GPT-4 prompt use, spurious object detection, diffusion fine-tuning, image generation, and classifier training workflows.

Sreyan88/ASPIRE framework

Open GitHub

Tensor2Tensor repository containing the code used by the authors to train and evaluate the Transformer models.

tensorflow/tensor2tensor

Open GitHub

Repository linked by the authors as the code for the attention-retention continual-learning framework.

zugexiaodui/AttentionRetentionCL framework

Open GitHub

Author-referenced sample dataset of synthetic identification-document images covering five document types.

meetsandesh/identification_document_dataset dataset

Open GitHub

Author-referenced repository for generating synthetic document images used in the document identification and information extraction experiment.

meetsandesh/synthetic_document_generator dataset

Open GitHub

Repository containing the implementation of Document Augmentation for dense Retrieval (DAR).

starsuzi/DAR implementation

Open GitHub

Repository identified by the paper as the official source-code location for the Auto-ADMET method.

alexgcsa/auto-admet system

Open GitHub

Repository containing the Auto-FP benchmark resources, including code, datasets, meta-features, and comprehensive experimental results associated with the paper.

AutoFP/Auto-FP dataset

Open GitHub

Source-code repository for the AutoFlow framework introduced by the paper.

agiresearch/AutoFlow framework

Open GitHub

Repository for the Autoformer model and experiments introduced in the paper.

thuml/Autoformer benchmark

Open GitHub

The paper's AutoGen framework repository for building LLM applications via multi-agent conversations.

microsoft/autogen framework

Open GitHub

Repository for the Bias Identification Test in Sentiments (BITS), including data files and templates for probing sentiment and toxicity models for bias; the paper specifically uses the disability facet.

PranavNV/BITS dataset

Open GitHub

Repository for the ADAS codebase introduced by the paper, including the Meta Agent Search implementation and experimental framework.

ShengranHu/ADAS framework

Open GitHub

Repository for ECLAIR, described as enterprise-scale AI for workflows, including code and scripts for the paper experiments and hospital workflow demo materials.

HazyResearch/eclair-agents framework

Open GitHub

Repository containing folders for datasets, models, plots, pretrained models, additional tests, requirements, and a README with benchmark results for AutoML-DC.

MilanShao/AutoML-for-Multi-Class-Anomaly-Compensation-of-Sensor-Drift dataset

Open GitHub

Supplementary repository containing datasets, generated manuscripts, run records, coding runs, and supplemental appendix materials used to support the paper's evaluation.

rkishony/data-to-paper-supplementary dataset

Open GitHub

Code implementation of the data-to-paper framework for backward-traceable AI-driven scientific research.

Technion-Kishony-lab/data-to-paper framework

Open GitHub

Hosts the working paper, data.json, figures, tables, and an interactive visualization of Top-N portfolio performance and alpha concentration.

mapledust0/AI-Stock-Nowcasting dataset

Open GitHub

Repository for the BacktestBench benchmark and AutoBacktest implementation, including project folders for AutoBacktest, datasets, figures, tables, environment setup, database setup, and reproduction scripts.

jensenw1/BacktestBench dataset

Open GitHub

Repository for the BadAgent attack on LLM agents.

DPamK/BadAgent implementation

Open GitHub

Repository for the BagStacking implementation provided by the authors within a Scikit-learn API framework.

SeffiCohen/BagStacking package

Open GitHub

AgentGPT is analyzed as a general-purpose autonomous LLM-powered multi-agent system with user-guided alignment in selected aspects such as decomposition, agent generation, and resource utilization.

reworkd/AgentGPT framework

Open GitHub

Auto-GPT is analyzed as a general-purpose autonomous LLM-powered multi-agent system with autonomous goal decomposition, task action management, and resource utilization.

Significant-Gravitas/Auto-GPT framework

Open GitHub

SuperAGI is analyzed as a general-purpose autonomous LLM-powered multi-agent system with some user-guided alignment options for agent-related and resource-related aspects.

TransformerOptimus/SuperAGI framework

Open GitHub

BabyAGI is analyzed as a general-purpose autonomous LLM-powered multi-agent system with a profile similar to Auto-GPT across many assessed aspects.

yoheinakajima/babyagi framework

Open GitHub

The fairseq BART directory provides the implementation interface, released BART checkpoints, and task-specific usage and fine-tuning instructions associated with the paper.

facebookresearch/fairseq

Open GitHub

Repository for General AgentBench, the unified benchmark and evaluation framework for general LLM agents.

cxcscmu/General-AgentBench benchmark

Open GitHub

Official repository for the WorfBench benchmark, WorfEval evaluation implementation, data, and experiment code.

zjunlp/WorfBench dataset

Open GitHub

Repository containing code for evaluating MT-BaxBench and MT-SECCODEPLT, the two splits of the MT-Sec evaluation kit.

JARVVVIS/mt-sec benchmark

Open GitHub

OpenAI Codex CLI, cited as a lightweight coding agent running in a terminal and evaluated as one of the agent scaffolds.

openai/codex

Open GitHub

GitHub repository for the FaithJudge leaderboard and associated faithfulness evaluation resources.

vectara/FaithJudge benchmark

Open GitHub

Repository containing the modular benchmarking framework, configurations, scripts, results, and visualization code for comparing retrieval strategies in biomedical RAG.

deviprasadbal/RAGHealthcareRetrievalStrategies benchmark

Open GitHub

Repository containing the agent scaffold, behavior-analysis pipeline, SFT recipes, scripts, utilities, and evaluation suite associated with Behavior Priming for agentic search.

cxcscmu/Behavior-Priming-for-Agentic-Search system

Open GitHub

Repository released by the authors with TensorFlow code for BERT, pretrained BERT_BASE and BERT_LARGE checkpoints, and scripts for replicating major fine-tuning experiments.

google-research/bert

Open GitHub

Official implementation and pretrained sample model for the paper's learned reference-free summary reward, including metric comparison and reward-training scripts.

yg211/summary-reward-no-reference

Open GitHub

Official LEAP implementation containing the Python package, prompts, configuration files, data-processing scripts, SFT and DPO training scripts, and evaluation workflows for ALFWorld and WebShop, with InterCode-related code also included in the repository structure.

sanjibanc/leap_llm

Open GitHub

Repository by genai-analytics containing a beyond-black-box-benchmarking folder with benchmark, core, examples, and sdk subfolders.

genai-analytics/publications dataset

Open GitHub

ParlAI crowdsourcing code associated with the paper's multi-session model-chat human evaluation.

facebookresearch/ParlAI

Open GitHub

ParlAI implementation for loading and evaluating the Multi-Session Chat and PersonaSummary tasks introduced by the paper.

facebookresearch/ParlAI dataset

Open GitHub

Official codebase containing 1dCA data generation, evaluated model implementations, ACT variants, training scripts, evaluation pipelines, and experiment configurations.

RodkinIvan/associative-recurrent-memory-transformer benchmark

Open GitHub

Code repository for Multi-Objective Direct Preference Optimization and its experimental workflow.

ZHZisZZ/modpo implementation

Open GitHub

Repository backing the Agentic Factor Investing project homepage, containing a README, project-framework image, interactive site, and chart data; it is a showcase rather than a complete research-code release.

allenh16/agentic-factor-investing framework

Open GitHub

GoodAI baseline LTM system using a vector database and JSON scratchpad to augment an LLM controller.

GoodAI/goodai-ltm system

Open GitHub

Versioned branch of the GoodAI LTM Benchmark repository containing code, test definitions, experiments, result data, and reports corresponding to the paper.

GoodAI/goodai-ltm-benchmark dataset

Open GitHub

Code for f-DPO with reverse KL, forward KL, Jensen-Shannon, and alpha-divergence regularization, including scripts for IMDB, Anthropic HH, MT-Bench, PPO comparisons, and calibration experiments.

alecwangcq/f-divergence-dpo implementation

Open GitHub

Repository containing the paper's experiments and results for the Agent Assessment Framework.

sa4s-serc/asf benchmark

Open GitHub

Repository containing BIG-bench task definitions, JSON and programmatic evaluation infrastructure, documentation, model score files, BIG-bench Lite resources, and contribution workflows.

google/BIG-bench dataset

Open GitHub

Repository for the paper's supplemental materials, including the coding schema, all repository metadata, raw graph data, and the LLM coding prompt.

ShaokangJiang/cursorrule-supp dataset

Open GitHub

TokenScope is used to extract probabilities for the first decision token where the judge commits to A or B.

Amirresm/tokenscope package

Open GitHub

Repository containing the MuseD evaluation code and associated research artifact.

zhangyipin/mused dataset

Open GitHub

Official PyTorch implementation of BOSS for simulated ALFRED experiments associated with the paper.

clvrai/boss framework

Open GitHub

Project repository for the paper's automated peer-review vulnerability evaluation and adversarial robustness study.

Lin-TzuLing/Breaking-the-Reviewer benchmark

Open GitHub

Implementation used to generate the saliency-map explanations that formed the H2 baseline condition.

greydanus/visualize_atari

Open GitHub

The gym-sokoban implementation that the authors modified into Sokoban-switch and Sokoban-cells variants for precondition and cost-explanation experiments.

mpSchrader/gym-sokoban

Open GitHub

An aggregated dataset of chess opening names and move sequences used by the paper to create opening-position concept datasets.

lichess-org/chess-openings dataset

Open GitHub

Source-code repository for the BCDA study and its algorithmic experiments.

ShironT/bcda implementation

Open GitHub

OpenDev, the open-source command-line AI coding agent whose architecture, harness, context engineering, tool system, and lessons learned are described in the paper.

opendev-to/opendev framework

Open GitHub

Open-source Python package calibrated-explanations, including code repository, examples, notebooks, and evaluation/regression scripts for reproducing experiments.

Moffran/calibrated_explanations package

Open GitHub

Source code accompanying the paper, with scripts for decoding, extracting internal-consistency information, and evaluating self-consistency variants.

zhxieml/internal-consistency implementation

Open GitHub

The open-source CAMEL library introduced by the paper, including agent implementations, prompts, data-generation pipelines, analysis tools, examples, and links to generated datasets.

camel-ai/camel

Open GitHub

Repository designated by the paper for code, training conditions, and experimental run records.

basetenlabs/cortex

Open GitHub

JSON benchmark file for the paper's freelance-style task suite.

reveondivad/certify dataset

Open GitHub

Repository for the Econometrics-Agent/MetricsAI system that automates econometric analysis through an AI-agent workflow and domain-specific econometric tools.

HKU-Business-AI-Center/Econometrics-Agent framework

Open GitHub

Official repository directory containing the paper's code, data, and experimental setup for writing alignment through edits.

salesforce/creativity_eval

Open GitHub

Repository accompanying the paper, containing the ageval evaluator package, agent tools, baseline evaluators, datasets/labels, experiment configurations, notebooks, and a Gradio app for inspecting annotator outputs.

apple/ml-agent-evaluator framework

Open GitHub

Official repository containing 50 author-specific writing prompts and anonymized JSON evaluation data organized by quality versus style, prompting versus fine-tuning, and expert versus lay judges.

tuhinjubcse/GoodWritingBeGenerative dataset

Open GitHub

Repository linked by the paper as the available code for evaluating GPT models on mock CFA exams.

e-cal/gpt-cfa benchmark

Open GitHub

Official implementation of TextGym, the evaluated language-agent configurations, EXE, and the benchmark experiments.

mail-ecnu/Text-Gym-Agents

Open GitHub

Repository containing resources and code for implementing and experimenting with LLMs for vehicle routing problems, including context materials, LLM framework code, oracle algorithms, and verifier code.

Zhehui-Huang/LLM_Routing benchmark

Open GitHub

Repository for RWE-bench, the benchmark and evaluation environment introduced by the paper for testing LLM agents on real-world evidence generation from medical databases.

somewordstoolate/RWE-bench dataset

Open GitHub

Repository containing the FINSABER framework, backtesting code, strategy interfaces, experiment scripts, documentation, and dataset links for reproducing or extending the paper's benchmark.

waylonli/FINSABER dataset

Open GitHub

Official repository containing code, prompts, modules, tests, and post-processing pipeline for generating and evaluating self-generated counterfactual explanations across the paper's datasets and models.

aisoc-lab/llm-sces implementation

Open GitHub

Public repository containing the code, method files, data directory, processing notebook, and framework assets used for the paper's MADR experiments.

SangyunLee1027/Code-for-Towards-Faithful-Explainable-Fact-Checking-via-Multi-Agent-Debate

Open GitHub

LangChain repository audited for the default path from model-produced actions to tool execution.

langchain-ai/langchain

Open GitHub

LangGraph repository audited for mandatory value authorization before ToolNode invocation.

langchain-ai/langgraph

Open GitHub

Reference implementation of the paper's deterministic fail-closed authorization gate, including framework integrations and proof scripts.

raceksd-source/scopegate-runtime

Open GitHub

LlamaIndex repository audited for central dispatch, schema validation, and per-call authorization behavior.

run-llama/llama_index

Open GitHub

Stripe Agent Toolkit repository audited for client-side authorization of model-supplied payment arguments.

stripe/agent-toolkit

Open GitHub

Repository for CAR-bench, including benchmark implementation, tools, task and evaluation workflow, results analysis, and documentation for evaluating multi-turn tool-using LLM agents under uncertainty.

CAR-bench/car-bench dataset

Open GitHub

Repository for the paper's cascaded LLM experiments, including implementation details for the framework evaluated in the paper.

fanconic/cascaded-llms system

Open GitHub

BitTern is presented as an open toolkit for low-cost, high-accuracy post-training ternary quantization and 1.58-bit models.

IntelChina-AI/BitTern implementation

Open GitHub

Repository containing the implementation for causal inference via style transfer for OOD generalisation.

nktoan/Causal-Inference-via-Style-Transfer-for-OOD-Generalisation framework

Open GitHub

Microsoft repository containing the Python implementation and supporting materials for the plug-and-play CoNLI hallucination detection and reduction framework.

microsoft/CoNLI_hallucination framework

Open GitHub

Repository for the SVAMP math word-problem benchmark.

arkilpatel/SVAMP dataset

Open GitHub

Repository for the ASDiv diverse math word-problem dataset.

chaochun/nlu-asdiv-dataset dataset

Open GitHub

Repository for the AQuA algebraic word-problem dataset.

deepmind/AQuA dataset

Open GitHub

Repository for BIG-bench, including the Date Understanding and Sports Understanding tasks.

google/BIG-bench

Open GitHub

BIG-bench task repository for the StrategyQA question-only evaluation setting.

google/BIG-bench dataset

Open GitHub

Repository associated with the CommonsenseQA benchmark.

jonathanherzig/commonsenseqa dataset

Open GitHub

Repository for the GSM8K grade-school math word-problem benchmark.

openai/grade-school-math dataset

Open GitHub

Repository for the Chat2Workflow benchmark and associated workflow-generation/evaluation resources.

zjunlp/Chat2Workflow dataset

Open GitHub

Public GitHub location for ChatCollab code and data used or produced by the paper.

ChatCollab dataset

Open GitHub

MetaGPT repository, representing a prior meta-programming multi-agent framework compared with ChatCollab.

geekan/MetaGPT framework

Open GitHub

ChatDev repository, representing a prior communicative-agent software-development system compared with ChatCollab.

OpenBMB/ChatDev framework

Open GitHub

Repository for the ChatDev framework and the code/data artifact associated with the paper.

OpenBMB/ChatDev dataset

Open GitHub

Official implementation and configuration repository for the ChatEval multi-agent referee framework and its evaluation experiments.

chanchimin/ChatEval

Open GitHub

GitHub repository identified by the paper as the location where the dataset can be accessed.

ZihanChen1995/ChatGPT-GNN-StockPredict dataset

Open GitHub

The paper states that the MMF-Trans code has been open sourced at this GitHub URL, with data requiring authorized access. The repository page itself returned 404 during extraction, so its contents could not be inspected.

MMF-Trans framework

Open GitHub

ClawNet repository containing the governed multi-agent social network implementation, including core/gateway, server, desktop, macOS client, and setup components.

hkgai-official/ClawNet framework

Open GitHub

Repository for CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models, including code, configs, data, docs, and README material describing the framework and evaluation setup.

heimy2000/CMAT dataset

Open GitHub

A GitHub repository listed by the paper as an accompanying curated resource for papers on code as agent harness.

YennNing/Awesome-Code-as-Agent-Harness-Papers

Open GitHub

Repository containing released benchmark data derived from SWE-CARE, pipeline scripts for filtering, environment building, test generation, agent resolution, and tool evaluation, plus compressed raw experimental outputs.

c-CRAB-Benchmark/dataset dataset

Open GitHub

Repository for the CodeAssistBench benchmark, datasets, prompts, scripts, and evaluation framework for AI coding assistants on real GitHub issues.

amazon-science/CodeAssistBench dataset

Open GitHub

Open-source repository associated with the CodeScout model family and RL recipe for code localization agents.

OpenHands/codescout system

Open GitHub

GitHub repository for CodeTaste, including benchmark infrastructure, agent/evaluation scripts, documentation, and links to benchmark artifacts and precomputed outputs.

logic-star-ai/codetaste dataset

Open GitHub

Repository for Codev-Bench, the developer-centric repository-level code-completion benchmark constructed with Codev-Agent.

LingmaTongyi/Codev-Bench dataset

Open GitHub

Repository titled KotiJaddu/Masters-Project, described on GitHub as code supporting the author's Master's thesis, with Python source and model folders.

KotiJaddu/Masters-Project implementation

Open GitHub

Repository for the LLMWorkflowGenerator project, containing the Python Controller implementation, Android/Termux-oriented workflow automation code, sample files, experiment outputs, and setup instructions.

dos-group/LLMWorkflowGenerator system

Open GitHub

Repository provided by the authors for reproducing the LLM, human-comparison, and algorithmic bandit experiments.

zzy620/LLM-exploration-exploitation implementation

Open GitHub

Repository for the Bitcoin Fee Rate Prediction Project, including data, model notebooks, scripts, result outputs, plots, and citation information for arXiv:2502.01029.

majiangqin/bitcoin dataset

Open GitHub

Repository reported by the paper as the code for confidence estimation in LLM-based dialogue state tracking.

jennycs0830/Confidence_Score_DST implementation

Open GitHub

Official repository and Python package for CONFLARE, supporting document loading, cleaning, chunking, calibration-set creation or loading, and conformal retrieval-augmented generation.

Mayo-Radiology-Informatics-Lab/conflare framework

Open GitHub

Repository linked by the authors as the public code for experiments in Conformal Prediction as Bayesian Quadrature.

jakesnell/conformal-as-bayes-quad implementation

Open GitHub

Repository linked by the paper for CCA/SWE-Bench-related materials.

facebookresearch/cca-swebench system

Open GitHub

Open-source PyTorch repository from which the authors selected reproducible GitHub issues requiring specialist debugging.

pytorch/pytorch

Open GitHub

Python repository for generating PortBench portfolio-theory tasks and evaluating LLM portfolio decisions.

noahardyx/PortBench framework

Open GitHub

Repository accompanying the paper with selected Python code for feature construction, metrics, and the model. The README states that the full proprietary backtesting and continuous-futures processing framework and licensed data are not included.

joelowj/mtl-tsmom implementation

Open GitHub

Repository released by the authors for constructing hierarchical NAS spaces based on CFGs and reproducing the BOHNAS experiments.

automl/hierarchical_nas_construction framework

Open GitHub

GitHub repository identified by the authors as containing the standard performance forecaster model for the Allora network.

allora-network/allora-forecaster framework

Open GitHub

Repository directory containing the implementation and resources for the CoT-MAE contextual masked auto-encoder introduced and evaluated in the paper.

caskcsg/ir implementation

Open GitHub

Repository containing the paper's data and codebase, including data collection materials for the user feedback and annotation tasks.

lil-lab/qa-from-hf dataset

Open GitHub

Python repository containing slow-tail, V-shape, and combined max-cash modules, execution scripts, data-schema documentation, and reproducible output paths for the empirical cash-overlay study.

shaun19920309/gd-cash-overlay-filters implementation

Open GitHub

Repository containing code, data artifacts, and reproducible empirical materials for the continuous smooth-signal growth-versus-defensive allocation framework.

ZheliXiong/continuous-smooth-signals-growth-tech-defensive-income-allocation implementation

Open GitHub

Repository identified by the paper as the posted model code for reproducing the ConFIRM workflow.

WilliamGazeley/ConFIRM framework

Open GitHub

GitHub repository associated with the paper's cooperative knowledge distillation method.

MLivanos/Cooperative-Knowledge-Distillation framework

Open GitHub

Official CooperBench repository containing the benchmark package, dataset tooling, task-running CLI, evaluation logic, and cooperative/solo/team settings for coding-agent experiments.

cooperbench/CooperBench dataset

Open GitHub

Repository containing the implementation and configuration details for LSNPC experiments.

huangweipeng7/lsnpc implementation

Open GitHub

Repository containing the implementation of CryptoMamba, baseline models, data preprocessing, model training, evaluation metrics, and trading simulation scripts.

MShahabSepehri/CryptoMamba benchmark

Open GitHub

RAGQALeaderboard is used as a benchmark environment for evaluating RAG-QA performance across multi-hop, single-hop, and biomedical question-answering tasks.

AQ-MedAI/RagQALeaderboard dataset

Open GitHub

Source-code repository for DAG-MoE, including the learned structural aggregation module evaluated in the paper.

JiaruiFeng/DAG-MoE system

Open GitHub

IBM Agentics is the agentic AI framework on which DAO-AI is built; the paper uses its ATypes, logical transduction, and scalable workflow concepts.

IBM/Agentics framework

Open GitHub

GitHub repository identified by the paper as containing DAO-GP source code and datasets.

anonymous273800/DAO-GP dataset

Open GitHub

FastChat v0.2.5 question JSONL containing 80 diverse evaluation queries.

lm-sys/FastChat dataset

Open GitHub

Repository implementing data-local, ensemble-based, LLM-guided NAS for multiclass multimodal time-series classification, including local executor scripts, remote LLM proposer scripts, schemas, and a Flask control interface.

emilhar/arl framework

Open GitHub

Repository cited by the paper for NVIDIA's Comprehensive Verilog Design Problems benchmark, which provides the selected code-generation and code-comprehension tasks used to evaluate SLMs and LLMs.

NVlabs/verilog-eval dataset

Open GitHub

Repository for the DB2-TransF time-series forecasting model introduced and evaluated in the paper.

SteadySurfdom/DB2-TransF benchmark

Open GitHub

Code repository for the limited teacher supervision decoding method introduced in the paper.

HJ-Ok/DecLimSup implementation

Open GitHub

AllenNLP contains the official PyTorch ELMo module and scalar-mix implementation for computing the paper's contextual representations.

allenai/allennlp

Open GitHub

Official TensorFlow implementation for training the pretrained bidirectional language model and computing ELMo representations introduced by the paper.

allenai/bilm-tf

Open GitHub

Repository containing the data-pipeline code, hedging simulator, actor-critic training workflow, saved configurations and models, notebooks, evaluation utilities, tests, figures, and paper materials for reinforcement-learning-based hedging of SPX and SPY exposures.

tlucius16/deep-hedging-rl framework

Open GitHub

Public repository containing software code and datasets or scripts for reproducing the baseline algorithms and experiments.

andrepugni/ESC benchmark

Open GitHub

Repository stated by the paper to contain the market simulation and experiment code.

EduardoGarrido90/micro_agents implementation

Open GitHub

Repository linked by the paper as containing the publicly available datasets and code used for the DRL-in-finance experiments.

Andrei-T-Neagu/DRL_in_Finance dataset

Open GitHub

Repository containing code associated with the paper's A2C, DDPG, PPO, ensemble-selection, and backtesting workflow.

AI4Finance-Foundation/Deep-Reinforcement-Learning-for-Automated-Stock-Trading-Ensemble-Strategy-ICAIF-2020 implementation

Open GitHub

Repository containing source code, datasets, and supplementary materials associated with the multi-horizon NEM electricity price forecasting benchmark.

GaniMosman/Multi-Horizon-EPF-NEM dataset

Open GitHub

The repository contains DeepAries source code, market data folders, checkpoints, model components, experiment code, and instructions for training and inference.

dmis-lab/DeepAries dataset

Open GitHub

Repository path containing DeepPlanning benchmark code, travel and shopping domain runners, evaluation utilities, configuration files, and instructions for reproducing benchmark results.

QwenLM/Qwen-Agent dataset

Open GitHub

The DeepSpeed repository contains the framework, code, tutorials, and documentation for large-scale model training and inference, including DeepSpeed-MoE components.

microsoft/DeepSpeed framework

Open GitHub

Repository for DeepTraderX, a deep-learning trading agent running in Threaded-BSE.

armandcismaru/DeepTraderX framework

Open GitHub

Threaded Bristol Stock Exchange repository providing the asynchronous market simulator and working trading agents used for DTX training data and experiments.

MichaelRol/Threaded-Bristol-Stock-Exchange benchmark

Open GitHub

Repository identified by the paper as containing code and data for the finance hallucination experiments.

mk322/fin_hallu benchmark

Open GitHub

Code, documentation, and demos for ToolUniverse, the ecosystem for building AI scientists from language models, reasoning models, and agents.

mims-harvard/ToolUniverse framework

Open GitHub

Official implementation repository released for the Demonstrate-Search-Predict framework.

stanfordnlp/dsp

Open GitHub

Salesforce AI Research repository for FinDAP: Demystifying Domain-adaptive Post-training for Financial LLMs, including framework materials, training scripts, evaluation guidance, and links to model/data artifacts.

SalesforceAIResearch/FinDAP dataset

Open GitHub

Repository associated with the AgentFail dataset and website for failure lifecycle data and analyses.

Jenna-Ma/JaWs-AgentFail dataset

Open GitHub

Repository containing DPR training, retrieval, evaluation, data-processing tools, configurations, and released model resources.

facebookresearch/DPR

Open GitHub

GitHub repository for the TSCC2019 competition data used as real Hangzhou traffic data in the paper's traffic-light-control benchmark.

tianrang-intelligence/TSCC2019 dataset

Open GitHub

Repository containing the blank Design-OS template, design-case artifacts, prompt files, simulation code, plots, and verification report used to support reuse and replicability testing.

bankh/design-os framework

Open GitHub

GitHub repository for the TalkTuner paper, including code and data for a dashboard that visualizes and controls a chatbot LLM's internal user model.

yc015/TalkTuner-chatbot-llm-dashboard dataset

Open GitHub

GitHub repository for the evolutionary multi-objective neural architecture search approach introduced in the paper.

DevilYangS/EMO-NAS-CD framework

Open GitHub

Repository linked by the paper for reproducing or inspecting the OAS generation experiment.

marques-vinicius/OAGen implementation

Open GitHub

Repository describing DeXposure-FM as a time-series graph foundation model for forecasting inter-protocol credit exposure, with scripts for experiments, macroprudential tools, checkpoints, and dataset helpers.

EVIEHub/DeXposure-FM framework

Open GitHub

The code URL reported in the paper's v2 HTML/PDF abstract for the DeXposure-FM project.

EVIEHub/graph-dexposure framework

Open GitHub

Code repository for the DiffLOB regime-conditioned diffusion model.

ZhuoHan1998/DiffLOB implementation

Open GitHub

Training code for supervised fine-tuning followed by DPO preference learning on causal Hugging Face language models, with dataset and trainer utilities.

eric-mitchell/direct-preference-optimization implementation

Open GitHub

Google Java Format repository; the paper uses Newlines.java from this repository for generated-vs-original unit-test comparison.

google/google-java-format

Open GitHub

JUnit 5 Modular World sample module; the paper uses a Flavor.java code snippet from this repository for prompt-based test generation and comparison.

junit-team/junit5-samples

Open GitHub

Contains code, data, configurations, and scripts for the FineLogic training and evaluation experiments.

YujunZhou/FineLogic benchmark

Open GitHub

Author repository for the GSM8K-AI-SubQ reasoning dataset and baselines for distilling LLM decomposition abilities into compact language models.

DT6A/GSM8K-AI-SubQ dataset

Open GitHub

Official source code repository for the Distilling step-by-step method introduced in the paper.

google-research/distilling-step-by-step implementation

Open GitHub

Repository containing code for the supervised and unsupervised LLM uncertainty experiments reported in the paper.

gahdritz/llm_uncertainty implementation

Open GitHub

GitHub repository released by the authors for Distributed Conformal Prediction via Message Passing.

HaifengWen/Distributed-Conformal-Prediction implementation

Open GitHub

Official code repository for the Division-of-Thoughts framework introduced and evaluated in the paper.

tsinghua-fib-lab/DoT framework

Open GitHub

Official self-contained SayCan implementation in a simulated tabletop environment using a UR5 setup, ViLD affordances, GPT-3 planning, and a CLIPort pick-and-place policy.

google-research/google-research implementation

Open GitHub

Official SayCan dataset files mapping natural-language user instructions and initial conditions to possible solution plans.

say-can/say-can.github.io dataset

Open GitHub

Repository containing code and notebooks associated with the S&P 500 graph-neural-network forecasting project.

waderylan/sp500-gnn implementation

Open GitHub

Repository containing code and data for evaluating uncertainty estimation in LLM instruction-following.

apple/ml-uncertainty-llms-instruction-following dataset

Open GitHub

Repository reported by the paper as containing the prediction model implementation and experiment data.

majuanjuan/Doexpertsperformbetter dataset

Open GitHub

Repository for the DocAgent multi-agent code documentation generation framework introduced by the paper.

facebookresearch/DocAgent framework

Open GitHub

Repository for the code, human-subject study materials, results, and supplementary materials associated with the Persona framework and AAAI 2025 paper.

YODA-Lab/Persona framework

Open GitHub

Semantic routing package for routing inputs by embedding or intent similarity.

aurelio-labs/semantic-router package

Open GitHub

Framework using repeated generations, verification prompts, and confidence estimates to decide whether to escalate to larger models.

automix-llm/automix framework

Open GitHub

AWS multi-agent orchestration framework that includes prompt-based routing or agent selection patterns.

awslabs/multi-agent-orchestrator framework

Open GitHub

Implementation associated with routing prompts to pre-trained experts after fine-tuned meta-model categorisation.

godcherry/ExpertTokenRouting implementation

Open GitHub

Implementation associated with deciding whether a query requires a complex prompting strategy.

imagination-research/sot implementation

Open GitHub

Iterative multi-agent code generation system using execution success as a routing signal.

JieyuZ2/EcoAssistant system

Open GitHub

LLM routing implementation associated with assessing model adequacy through multiple responses and ground-truth comparison.

kvadityasrivatsa/llm-routing implementation

Open GitHub

Routing-agent implementation using synthetic data and small classifiers for classification-based routing.

lamini-ai/llm-routing-agent implementation

Open GitHub

Orchestrator implementation using decoder-only LLM representations for routing or model selection.

Leeroo-AI/leeroo_orchestrator system

Open GitHub

Framework for serving and evaluating routers that choose between LLMs using preference-oriented routing strategies.

lm-sys/RouteLLM framework

Open GitHub

Task-planning framework in which an LLM selects among models or tools based on descriptions and user tasks.

microsoft/JARVIS framework

Open GitHub

Implementation assessing consistency across reasoning representations for cascade-style routing.

MurongYue/LLM_MoT_cascade implementation

Open GitHub

OpenAI multi-agent orchestration framework discussed as an example of prompt-based routing practice.

openai/swarm framework

Open GitHub

Fine-tuned model framework for API call generation, discussed as treating routing as a code generation problem.

ShishirPatil/gorilla implementation

Open GitHub

Framework for reducing LLM application cost using LLM cascades and related strategies.

stanford-futuredata/Frugalgpt framework

Open GitHub

Adaptive RAG framework that routes among no retrieval, single-step retrieval, and multi-step retrieval paths according to query complexity.

starsuzi/Adaptive-RAG framework

Open GitHub

Code and data for a multi-LLM routing benchmark and evaluation framework.

withmartian/routerbench dataset

Open GitHub

GitHub repository containing the Indonesian financial-domain language-model code and post-trained IndoBERT models.

intanq/indonesian-financial-domain-lm implementation

Open GitHub

Implementations of original, sparse, continuous, and related statistical jump models.

Yizhan-Oliver-Shu/jump-models framework

Open GitHub

Code and model artifacts for Reinforced Token Optimization, the paper's DPO-derived token-reward and PPO/RL alignment method.

zkshan2002/RTO implementation

Open GitHub

Official repository containing the DS-1000 data, execution-based evaluators, environment files, inference scripts, and released baseline results.

xlang-ai/DS-1000 dataset

Open GitHub

Repository for evaluating LLMs on the DSBC dataset, including response generation, LLM-as-judge evaluation, command-line usage, and dataset evaluation utilities.

traversaal-ai/DSBC-Data-Science-Task-Evaluation dataset

Open GitHub

Code repository for the DSTCGCN traffic forecasting model introduced and evaluated in the paper.

water-wbq/DSTCGCN implementation

Open GitHub

A scikit-learn-style implementation of a collection of statistical jump models, including the model family used to identify asset-specific market regimes.

Yizhan-Oliver-Shu/jump-models package

Open GitHub

Repository indicated by the paper as the location of the DVGNN code.

gorgen2020/DVGNN implementation

Open GitHub

Repository containing experimental analysis, tools, and code associated with dynamic design of machine-learning pipelines via metalearning.

ealcobaca/dynamic-design-machine-learning-pipelines framework

Open GitHub

The first author's repository providing implementations of statistical jump models, including methods relevant to the sparse jump-model analysis used in the paper.

Yizhan-Oliver-Shu/jump-models package

Open GitHub

Official GitHub repository containing code for the paper Dynamic Graph Convolutional Network with Attention Fusion for Traffic Flow Prediction.

trainingl/AFDGCN implementation

Open GitHub

Repository for the Dynamic Meta-Learning for Adaptive XGBoost-Neural Ensembles implementation, including source code and test datasets as described by the paper.

aasedek/Adaptive-XGBoost-Neural-Network-Ensemble framework

Open GitHub

Python repository containing pseudo-answer generation, silver-label construction, input processing, multi-task training, inference, and evaluation scripts for the six reported datasets.

XieZilongAI/E2E-AFG framework

Open GitHub

Repository containing the EasyRAG pipeline, ingestion code, retrievers, rerankers, prompt templates, challenge scripts, Docker deployment, FastAPI service, Streamlit WebUI, and processed challenge assets.

BUAADreamer/EasyRAG dataset

Open GitHub

Official Microsoft Research implementation of the Sui Generis scoring pipeline introduced in the paper.

microsoft/SuiGeneris

Open GitHub

Public GitHub repository for the study's replication materials, mirrored in Zenodo.

codescene-research/echoes-of-ai-emse-2025 dataset

Open GitHub

Code repository for CAID, including the multi-agent workflow where a central manager delegates tasks to engineer agents that run asynchronously in isolated git worktrees, plus scripts and task modules for Commit0 and PaperBench experiments.

JiayiGeng/CAID system

Open GitHub

NovGrid extends MiniGrid with a generalized novelty generator so environment properties and dynamics can change and agents can be evaluated on adaptation to those changes.

eilab-gt/NovGrid dataset

Open GitHub

OWL is compared against Efficient Agents in the agent framework evaluation.

camel-ai/owl framework

Open GitHub

Smolagents is compared against Efficient Agents and OWL in the agent framework evaluation.

huggingface/smolagents framework

Open GitHub

Repository linked by the paper as its code resource, associated with the OAgents/Efficient Agents work.

OPPO-PersonalAI/OAgents framework

Open GitHub

Public code repository for the DASH architecture-search algorithm introduced and evaluated in the paper.

sjunhongshen/DASH framework

Open GitHub

fairseq example code and resources for training and evaluating the paper's MoE language models.

pytorch/fairseq implementation

Open GitHub

Source-code repository for Helium, the workflow-aware LLM serving system proposed and evaluated in the paper.

mlsys-io/helium_demo framework

Open GitHub

Repository for the Temporal Neural Common Neighbor model introduced in the paper.

GraphPKU/TNCN implementation

Open GitHub

Official repository containing scripts and links to reconstruct the ELI5 dataset, build support documents and splits, format multi-task data, train and evaluate the reported models, and use the released pretrained checkpoint.

facebookresearch/ELI5 dataset

Open GitHub

Repository for EmbedLLM materials, described by the paper as containing the dataset, code, and embedder for further research and application.

richardzhuang0412/EmbedLLM dataset

Open GitHub

Repository for the Embodied Web Agents project, including web environment hosting instructions and model-running folders for indoor, outdoor, and geolocation tasks.

Embodied-Web-Agent/Embodied-Web-Agent system

Open GitHub

Python SDK that lets users obtain drone observations and issue control actions through the Embodied City online API.

tsinghua-fib-lab/embodied-city-python-sdk package

Open GitHub

Repository containing simulator-related materials, datasets, task code, prompts, VLN code, and documentation for the EmbodiedCity benchmark.

tsinghua-fib-lab/EmbodiedCity dataset

Open GitHub

Code and experiment resources for measuring how context-characteristic sensitivity changes across instruction-fine-tuning stages.

copenlu/context-characteristics-sensitivity implementation

Open GitHub

Repository named Emoji-Embedding-For-Finance, cited throughout the paper as the source for model-vs-BERT comparisons, emoji frequencies, BTC/VCRIX figures, sentiment time series, and trading-strategy outputs.

QuantLet/Emoji-Embedding-For-Finance implementation

Open GitHub

Official repository for the paper, containing code and data artifacts for generating analysis-report features and training the hybrid asset pricing model.

chengjunyan1/AAPM implementation

Open GitHub

Official code and data repository for the ALCE benchmark and evaluation framework introduced in the paper.

princeton-nlp/ALCE dataset

Open GitHub

Pre-release Python/PyTorch code for building chart-image datasets, splitting data, training the ResNet trader, and inferring triple-I weights.

ZhuZhouFan/TWMA implementation

Open GitHub

Repository for implementing Iter-CoT, the paper's iterative bootstrapping method for chain-of-thought prompting.

GasolSun36/Iter-CoT implementation

Open GitHub

Preferred Multi-turn Benchmark for Finance in Japanese, used by the paper to evaluate generation quality across financial dialogue tasks.

pfnet-research/pfmt-bench-fin-ja benchmark

Open GitHub

Pathway is used as the vector store implementation in the paper's retrieval system.

pathwaycom/pathway implementation

Open GitHub

Stated repository for the modular Python prototype implementing the neuro-symbolic ontology-based LLM validation pipeline.

ruslanmv/Neuro-symbolic-interaction system

Open GitHub

Repository containing the implementation of the PSX interpretability algorithms and models discussed in the paper.

sahar-arshad/PSX-Interpretability implementation

Open GitHub

Repository indicated by the paper as containing code for AdvDistill; the URL returned 404 when fetched during extraction.

shreyansh-2003/AdvDistill implementation

Open GitHub

Public implementation of the structured-matrix Transformer enhancement framework introduced in the paper.

newbeezzc/MonarchAttn framework

Open GitHub

Repository containing the LoT prompting implementation, scripts for CoT and LoT experiments, requirements, and quick-run instructions.

xf-zhao/LoT framework

Open GitHub

GitHub repository cited by the paper as the source of the second obesity classification dataset.

pymche/Machine-Learning-Obesity-Classification dataset

Open GitHub

Repository reported by the paper as containing code, curated data pointers, generated figures, and result tables for the reproduction and robustness analysis.

ZheliXiong/Ensemble-RL-through-Classifier-Models implementation

Open GitHub

Implementation of faithfulness-aware decoding used for the advanced-decoding baseline.

amazon-science/faithful-summarization-generation implementation

Open GitHub

FactPEGASUS code and models used to test whether the proposed augmentation transfers to another factuality-aware contrastive pipeline.

meetdavidwan/factpegasus implementation

Open GitHub

Official CLIFF implementation used as the contrastive-learning baseline and as the training framework combined with the proposed counterfactual augmentation.

ShuyangCao/cliff_summ implementation

Open GitHub

Repository hosting the ESG-FTSE corpus of news articles with ESG relevance labels.

mariavpavlova/ESG-FTSE-Corpus dataset

Open GitHub

Repository released by the authors containing code for the agentic benchmark assessment and related experiments.

uiuc-kang-lab/agentic-benchmarks benchmark

Open GitHub

Repository associated with the paper's contamination-detection work for LLM evaluation. The paper links it as code and data; the currently visible README describes a lightweight tool for identifying and analysing potential contamination without access to LLM training data.

liyucheng09/Contamination_Detector dataset

Open GitHub

Repository URL listed by the paper for the implementation of the confidence IQN experiments.

YHL04/confidenceiqn implementation

Open GitHub

A production-grade evaluation toolkit for LLM agent outputs, including metrics for diversity, reliability, cascade uncertainty, perturbation consistency, consistency, factual grounding, hallucination, explainability, and drift.

mukund1985/llm-eval-toolkit framework

Open GitHub

Repository containing the AGENTbench harness used to evaluate coding agents under NONE, LLM, and HUMAN repository-level context settings on AGENTbench and SWE-bench Lite.

eth-sri/agentbench benchmark

Open GitHub

Public GitHub repository released by the authors to support reproducibility and further research on financial relationship graph evaluation.

FreddieNIU/Financial-Graph-Evaluation dataset

Open GitHub

Repository described as a resource hub for understanding, detecting, and mitigating biases in financial-domain LLMs, including a Structural Validity Checklist, a literature review dashboard, and an automatic bias detection dashboard.

Eleanorkong/Awesome-Financial-LLM-Bias-Mitigation framework

Open GitHub

Repository associated with the M4 competition dataset and methods, used by the paper for benchmark data and base learner forecasts.

Mcompetitions/M4-methods dataset

Open GitHub

Repository containing the source code for the paper's FFORMA and ES-RNN ensemble experiments.

Pieter-Cawood/FFORMA-ESRNN benchmark

Open GitHub

Official repository containing the LoCoMo data release, conversation-generation code, prompts, and evaluation scripts.

snap-research/LoCoMo dataset

Open GitHub

Phoenix is cited among tools that provide analytics and evaluation orchestration capabilities for agent or LLM evaluation.

Arize-ai/phoenix

Open GitHub

DeepEval is listed among tools that support analytics, evaluation orchestration, and debugging for LLM or agent evaluation workflows.

confident-ai/deepeval

Open GitHub

OpenAI Evals is discussed as an open-source framework for specifying evaluation tasks and metrics and automating execution and reporting.

openai/evals

Open GitHub

HAL is cited as a holistic agent leaderboard or harness for centralized and reproducible agent evaluation.

princeton-pli/hal-harness benchmark

Open GitHub

Inspect AI is cited as a framework for large language model evaluations and included among evaluation tooling examples.

UKGovernmentBEIS/inspect_ai

Open GitHub

Repository named in the paper for ICD coding explainability evaluation resources, including RD-IV-10-related artifacts and generated rationales.

mingyangligithub/ICD-Coding-Explainability-Evaluation dataset

Open GitHub

Paper-provided implementation link for the evaluation-awareness scaling-law experiments; the arXiv HTML resolves through an Anonymous GitHub mirror.

eval-awareness-scaling-laws benchmark

Open GitHub

R code, datasets, and experiment scripts for EvoAAA, the evolutionary autoencoder architecture search methodology introduced in the paper.

fcharte/EvoAAA dataset

Open GitHub

Repository containing code and technical details for the Multi-Agent Scoring System for essay assessment.

AzizovDilshod/Multi-agent-System-for-Essay-Assessment system

Open GitHub

Repository providing BanditBench and inference code; the paper also notes installation via pip install banditbench.

allenanie/EVOLvE package

Open GitHub

Official repository for the CodeAct framework, evaluation code, CodeActAgent deployment components, scripts, and links to the released data and models.

xingyaoww/code-act implementation

Open GitHub

Repository identified by the paper as containing the code and data for the software-developing agent framework evaluated in the study.

OpenBMB/ChatDev dataset

Open GitHub

The repository is linked by the paper as the location for data and code; it contains a notebook, a UNSW-NB15 dataset archive, and feature metadata, among other files.

pcwhy/XML-IntrusionDetection dataset

Open GitHub

Repository named by the authors as containing all code for the experiments that apply feature importance, SHAP, and LIME to PPO portfolio-management predictions.

aleedelarica/XDRL-for-finance implementation

Open GitHub

Existing repository, Reinforcement learning in portfolio management, that the authors identify as the starting framework for integrating explainability into a PPO-based model.

deepcrypto/Reinforcement-learning-in-portfolio-management- framework

Open GitHub

The cpath package implements the paper's counterfactual-path method for R and Python.

pievos101/cpath package

Open GitHub

UnifiedSKG is the structured knowledge grounding model used in the paper's text-to-SQL case study.

HKUNLP/UnifiedSKG system

Open GitHub

Repository containing the benchmark data, task files, representative logs, and evaluation scripts for AutoGen, MetaGPT, and TaskWeaver.

lurf21/Agent_Evaluation_Framework dataset

Open GitHub

GPT-Engineer is listed as a code-domain LLM-based single-agent system and cited as a GitHub project that generates code repositories from prompts.

AntonOsika/gpt-engineer system

Open GitHub

GPTresearcher is listed as a research-domain LLM-based autonomous agent for online comprehensive research.

assafelovic/gpt-researcher system

Open GitHub

AIlegion is listed as a universal LLM-powered autonomous agent platform.

eumemic/ai-legion implementation

Open GitHub

LoopGPT is listed as a universal modular Auto-GPT-style LLM agent framework.

farizrahman4u/loopgpt framework

Open GitHub

AGiXT is listed as a universal AI automation platform with instruction management, memory, and plugins.

Josh-XT/AGiXT

Open GitHub

LangChain is discussed as an open-source framework supporting LLM-based agent software development and tool integration.

langchain-ai/langchain framework

Open GitHub

DemoGPT is listed as a code-support LLM-based agent system for creating LangChain applications through prompts.

melih-unsal/DemoGPT system

Open GitHub

AgentGPT is discussed as an agent framework offering browser-based assembly, configuration, deployment, fine-tuning, and local data incorporation.

reworkd/AgentGPT framework

Open GitHub

Auto-GPT is discussed as an open-source agent template/framework for decomposing objectives and executing tasks in a loop.

Significant-Gravitas/Auto-GPT framework

Open GitHub

SmolModels / smol-ai developer is listed as a code-domain LLM-based agent system with self-feedback and tool use.

smol-ai/developer system

Open GitHub

WorkGPT is discussed as an open-source GPT agent framework for invoking APIs.

team-openpm/workgpt framework

Open GitHub

SuperAGI is listed as a universal open-source autonomous AI agent framework.

TransformerOptimus/SuperAGI framework

Open GitHub

XLang is discussed as an open-source framework for building and evaluating language model agents through executable language grounding.

xlang-ai/xlang framework

Open GitHub

BabyAGI is listed as an LLM-based agent that creates tasks from objectives and stores or retrieves task results.

yoheinakajima/babyagi system

Open GitHub

BabyAGI is described as an OpenAI-powered task management system that uses vector databases such as Chroma or Weaviate to manage, prioritize, execute, store, and recall task-related information.

yoheinakajima/babyagi framework

Open GitHub

Code for defining text-to-text tasks and mixtures, preprocessing and evaluating datasets, training and fine-tuning T5 models, and reproducing the paper's experiments, with links to released checkpoints.

google-research/text-to-text-transfer-transformer

Open GitHub

Repository for ExpNote: Black-box Large Language Models are Better Task Solvers with Experience Notebook, including experiment code, datasets, scripts, and setup instructions.

forangel2014/ExpNote dataset

Open GitHub

Repository containing the 50-claim development dataset, the 25-claim post-finalization dataset, and dataset statistics used in the paper.

LaraHack/linkflows_claims_dataset

Open GitHub

Repository containing materials from Stages 1, 2, and 3 of the expert formalization study evaluating the super-pattern.

LaraHack/linkflows_formalization_study

Open GitHub

Repository containing the paper's released Self-Contrast code and data.

THUDM/Self-Contrast implementation

Open GitHub

Open-source implementation of the paper's GPT-2 generation, candidate ranking, and training-data memorization attack workflow.

ftramer/LM_Memorization

Open GitHub

Repository containing FactReview, RefCopilot, demos, source code, scripts, tests, and documentation for evidence-grounded reviews of ML papers.

DEFENSE-SEU/FactReview framework

Open GitHub

Repository containing pilot and main JSON datasets, prompt generation code, evaluation code, prompt templates, and unit-group definitions for the FAITH paper.

ZHANG-MENGAO/FAITH dataset

Open GitHub

PyTorch code, configurations, training and unmasking scripts, evaluation utilities, and pretrained checkpoints for ImageNet 256x256 and 512x512 MaskDiT models.

Anima-Lab/MaskDiT implementation

Open GitHub

Code repository linked by the paper for the FedSPM method and experiments.

zijianwang0510/FedSPM

Open GitHub

Official repository for reproducing the paper's output-refinement and policy-refinement experiments.

aypan17/llm-feedback

Open GitHub

Repository associated with the Cross-Attentive Time-Series Trend Network described and evaluated in the paper.

kieranjwood/x-trend framework

Open GitHub

Open-source repository associated with PIXIU and FinBen, containing financial LLM resources, evaluation datasets, benchmark materials, code, and links to related models and leaderboards.

The-FinAI/PIXIU dataset

Open GitHub

Repository indicated by the authors for the FinBERT2 work, including the specialized encoder and related downstream variants or resources.

ValueSimplex/FinBERT2 system

Open GitHub

ProgramFC repository used as the implementation basis for the LLM-based composite fact-checking system in the experiment.

teacherpeterpan/ProgramFC system

Open GitHub

Official repository for offline reward-model training and preference-based language-model fine-tuning, including smaller fine-tuned models and a subset of collected human labels.

openai/lm-human-preferences

Open GitHub

GitHub repository for a Chinese financial news sentiment classification dataset containing train and test CSV files and financial news sentiment labels.

wwwxmu/Dataset-of-financial-news-sentiment-classification dataset

Open GitHub

Repository containing code for FinEAS and the paper's BERT, BiLSTM, and FinBERT financial-news sentiment experiments, together with reported result tables.

lhf-labs/finance-news-analysis-bert benchmark

Open GitHub

Repository containing data preprocessing code, FineFT algorithm training/validation/testing scripts, VAE routing components, trading-environment implementation, baseline code, and analysis utilities.

qinmoelei/FineFT_code_space framework

Open GitHub

Python source code repository for FINMEM, the LLM trading agent with layered memory and character design.

pipiku915/FinMem-LLM-StockTrading system

Open GitHub

Repository for the FinReport code and datasets released by the paper.

frinkleko/FinReport dataset

Open GitHub

Astock dataset used for stock and news data, train-validation-test splitting, OOD analysis, and backtesting.

JinanZou/Astock dataset

Open GitHub

Repository for the paper's dataset construction pipeline, data collection module, FinRpt framework modules, benchmark evaluation code, fine-tuning setup, reinforcement-learning setup, and website front-end code.

jinsong8/FinRpt dataset

Open GitHub

GitHub repository associated with the paper's benchmark for online financial RAG evaluation.

PhealenWang/financial_rag_benchmark dataset

Open GitHub

Repository released by the authors for FinTexTS framework code and pilot study implementation.

leejaehoon2016/FinTexTS dataset

Open GitHub

Open-source implementation repository for FinWorld, the end-to-end financial AI research and deployment platform introduced in the paper.

DVampire/FinWorld framework

Open GitHub

Contains FireAct prompts, task and tool definitions, trajectory-generation and evaluation code, fine-tuning scripts, example training data, and model references.

anchen1011/FireAct implementation

Open GitHub

Repository containing the Flow multi-agent workflow automation implementation, including workflow management code, prompts, validators, notebooks, and generated examples.

tmllab/2025_ICLR_FLOW framework

Open GitHub

Code repository for the FlowAgent framework introduced by the paper.

Lightblues/FlowAgent framework

Open GitHub

Official repository containing FlowBench source data organization, turn-level and session-level evaluation code, scripts, prompts, and setup instructions.

Justherozen/FlowBench dataset

Open GitHub

GitHub Gist with a replenishment plan generated by the Flowr DC Replenishment Planning Agent, including outlet allocations, route assignments, vehicle assignments, route summary, consolidation opportunities, and human review queue.

https:/

Open GitHub

GitHub Gist with a purchase-order report generated by the Flowr Procurement and Ordering Agent, including order quantities, supplier justifications, delivery estimates, consolidated supplier orders, and human review flags.

https:/

Open GitHub

Repository made available by the authors for the proposed models used to forecast extreme Bitcoin volatility movements.

dorienh/bitcoin_synthesizer implementation

Open GitHub

Repository for the FiGASR package and related sentiment-indicator resources.

lucabarbaglia/FiGASR package

Open GitHub

Repository containing Python code for fine-grained aspect-based sentiment analysis in the economic news setting.

sergioconsoli/SentiBigNomics implementation

Open GitHub

Public multivariate time-series dataset repository containing the exchange-rate data used as real-data-E.

laiguokun/multivariate-time-series-data dataset

Open GitHub

Lag-Llama pretrained model repository used as the probabilistic zero-shot forecaster in the experiments.

time-series-foundation-models/Lag-Llama implementation

Open GitHub

Official implementation of the Diffusion Model-Based Predictor baseline for robust offline RL against state-observation perturbations.

zhyang2226/DMBP implementation

Open GitHub

Repository containing the implementation and results for LSTM and GRU time-series forecasting experiments discussed by the paper.

Alebuenoaz/LSTM-and-GRU-Time-Series-Forecasting implementation

Open GitHub

Repository containing FoT code, task and benchmark scripts, datasets, model and method modules, and evaluation instructions for Game of 24, GSM8K, MATH-500, and AIME.

iamhankai/Forest-of-Thought framework

Open GitHub

Repository containing more detailed prompts and implementation code for the LAWN antenna self-evolution framework discussed in the paper.

ChangyuanZhao/LAE_evolving framework

Open GitHub

Repository stated by the authors to contain the complete source code employed in the research.

Suhasnadh/Power-Consuption implementation

Open GitHub

Repository for the paper 'From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents', containing implementation-related folders such as agents, configs, envs, plan generation, prompts, tasks, utilities, and experiment scripts.

import-myself/AHP framework

Open GitHub

Repository for the paper 'From Emergence to Control: Probing and Modulating Self-Reflection in Language Models', describing probing vectors, model insertion, and reflection analysis for controlling self-reflection in LLMs.

xzAscC/ProbingReflection implementation

Open GitHub

A maintained paper list and resource collection about LLM-as-a-judge.

llm-as-a-judge/Awesome-LLM-as-a-judge

Open GitHub

Repository for the paper's semi-automated ontology and knowledge-graph construction pipeline, including prompts, code, data, generated artifacts, results, and evaluation materials.

fusion-jena/automatic-KG-creation-with-LLM

Open GitHub

Repository containing the reproducibility-related publication data and analysis code from the authors' prior biodiversity deep-learning study, which supplied the source dataset for the current pipeline.

fusion-jena/Reproduce-DLmethods-Biodiv

Open GitHub

Repository named after the paper 'From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications' and described on GitHub as the paper's code repository.

jiangfeibo/ComAgent framework

Open GitHub

BeeAI is described as the experimental platform central to IBM's ACP, supporting local-first orchestration, agent discovery, REST endpoints, SDKs, telemetry, and multi-agent execution.

i-am-bee/beeai-framework framework

Open GitHub

The MCP servers repository is cited as an ecosystem of reference and integration servers for file management, databases, Google Drive, Git, GitHub, GitLab, Slack, Google Maps, image generators, and search APIs.

modelcontextprotocol/servers implementation

Open GitHub

OpenAI Swarm is reviewed as a lightweight, stateless abstraction for multi-agent systems with agent definitions, dynamic handoffs, context management, direct function calling, streaming, and backend flexibility.

openai/swarm framework

Open GitHub

Repository reported by the paper as containing code and data for the proposed news-aware LLM time series forecasting framework.

ameliawong1996/From_News_to_Forecast dataset

Open GitHub

Repository containing code and data for the HippoRAG family, including the framework introduced and evaluated in this paper.

OSU-NLP-Group/HippoRAG framework

Open GitHub

Repository path for a Workflow Composer that transforms natural-language research questions into executable HyperFlow workflows for the 1000 Genomes Project, including source code, tests, Skills/resources, CLI commands, and evaluation-related materials.

hyperflow-wms/1000genome-workflow dataset

Open GitHub

A companion GitHub repository associated with the survey on agentic workflow optimization.

IBM/awesome-agentic-workflow-optimization

Open GitHub

Repository for the paper containing raw experimental data, reasoning processes and outputs, keyword-statistic results, LLM-as-a-judge scores, and score-analysis spreadsheets.

ChangWenhan/FromThinking2Output dataset

Open GitHub

Repository associated with the paper's recursors/refiners work, containing generated Prolog programs, models, execution traces, and related materials for the implemented system.

ptarau/recursors framework

Open GitHub

AWorld-RL is the linked project repository that houses FunReason-MT materials and describes the introduced multi-turn function-calling data-synthesis framework.

inclusionAI/AWorld-RL framework

Open GitHub

Repository identified by the paper as containing the pre-processed dataset, raw news article data, and implementation code for M2VN.

Yoontae6719/M2VN-Multi-Modal-Learning-Network-for-Volatility-Forecasting dataset

Open GitHub

Repository containing GAAMA's typed memory graph, semantic and PPR retrieval components, prompts, storage adapters, and LoCoMo evaluation scripts.

swarna-kpaul/gaama framework

Open GitHub

Repository for the Game-theoretic LLM paper, including source code, setup instructions, complete-information game experiments, workflow experiments, and Deal-or-No-Deal experiments.

Wenyueh/game_theory benchmark

Open GitHub

Repository reported by the author as containing the code and dataset used in the paper.

bvsdinda/TPE-LSTM dataset

Open GitHub

ytopt is a machine-learning-based autotuning and hyperparameter-optimization framework used in Section IV-B to search coefficient spaces for the reviewed scaling laws.

ytopt-team/ytopt package

Open GitHub

The Lila benchmark repository supplies the program-form mathematical reasoning data used in the paper's mathematical program-synthesis experiments.

allenai/Lila dataset

Open GitHub

Public repository for the paper's generative-agent architecture and Smallville simulation code.

joonspk-research/generative_agents

Open GitHub

PyTorch repository containing generative sparse-index-tracking code, comparison code, a GECCO 2023 directory, backtesting material, and instructions for obtaining the associated dataset.

kayuksel/generative-opt implementation

Open GitHub

Official repository containing Gorilla inference resources, APIBench data, evaluation scripts, model outputs, and materials for reproducing the paper's results.

ShishirPatil/gorilla

Open GitHub

OpenAI Evals, a framework for creating and running model benchmarks and inspecting performance sample by sample.

openai/evals

Open GitHub

Public prompts and code for the GPT-based automated review generation workflow used in the study.

zrobertson466920/GPT_Auto_Review implementation

Open GitHub

Repository linked by the paper as the implementation of Grammar Search for multi-agent systems.

mayanks43/grammar_search framework

Open GitHub

Source code for GraphGPT and the Graph Eulerian Transformer workflow.

alibaba/graph-gpt framework

Open GitHub

RAGChecker is an open-source visual analytics tool that compares LLM outputs against source documents, supports GraphEval+ and SICI-style detection, and presents claim reliability through an interactive quadrant-based visualization.

tanmayagrawal21/RAGChecker framework

Open GitHub

Repository for the GTA benchmark, dataset, code, and evaluation materials for general tool agents.

open-compass/GTA dataset

Open GitHub

Official repository for H3M-SSMoEs containing Python code, model components, training and backtesting scripts, and links to datasets and model weights.

PeilinTime/H3M-SSMoEs dataset

Open GitHub

Repository stated as the release location for HDFlow code and data.

wenlinyao/HDFlow dataset

Open GitHub

Repository associated with the paper's released LOFin benchmark and HiREC implementation.

deep-over/LOFin-bench-HiREC dataset

Open GitHub

Repository containing code to reproduce the HCNN experiments reported in the paper.

FinancialComputingUCL/HomologicalCNN benchmark

Open GitHub

Repository containing annotated transcripts for each participant, made available for future studies.

safety-research/how-ai-impacts-skill-formation dataset

Open GitHub

TheAgentCompany repository contains sandboxed work environments, task directories, evaluators, task instructions, and supporting files for many task instances listed in the paper's appendix task table.

TheAgentCompany/TheAgentCompany dataset

Open GitHub

Repository for coding-agent token-consumption analysis, including dataset-building scripts, multi-model analysis scripts, phase-level token decomposition, and self-prediction correlation computation.

LongjuBai/agent_token_consumption_analysis implementation

Open GitHub

Python codebase for running the ensemble generalization-gap experiments, computing dataset complexity metrics, and generating per-dataset outputs and figures.

zubair0831/ensemble-generalization-gap benchmark

Open GitHub

GitHub directory linked by the paper for the deterministic prediction task source code, including Pauli string multiplication, divide-and-conquer, letter replacement, and addition-related files.

EdenCodeInc/PyCliffordMCP benchmark

Open GitHub

Repository containing the dataset and experimental code for the study of memory addition, deletion, and experience-following behaviour in LLM agents.

yuplin2333/agent_memory_manage dataset

Open GitHub

The repository provides code folders for HAG-XAI object detection, HAG-XAI image classification, FullGradCAM for Yolo-v5s, and FullGradCAM for Faster-RCNN, along with links to experimental materials, human attention data, and pretrained model files.

GitVirTer/HAG-XAI framework

Open GitHub

Repository containing data and analysis code for the human-alignment AI-assisted decision-making study.

Networks-Learning/human-alignment-study dataset

Open GitHub

Contains the Auto_Driving_Highway code, prompt materials, results, and instructions for training an RL agent with an LLM in the reward loop.

JingYue2000/In-context_Learning_for_Automated_Driving framework

Open GitHub

Azure Verified Modules is used by the paper as a cloud-native curated module ecosystem that demonstrates governance, standards, testing, versioning, and consistent interfaces.

Azure/Azure-Verified-Modules framework

Open GitHub

Repository indicated by the paper for prompts, code, anonymised data, and the full questionnaire related to the AI Narrative Test.

Mosh0110/AI-narrative-test benchmark

Open GitHub

Repository for the S&P 500 membership forecasting project, including EDA.ipynb, Modeling_Process.ipynb, model-interpretability figures, and Python dependencies.

VidhiAgrawal/sp500_finance implementation

Open GitHub

Source code for the HybRank hybrid and collaborative passage-reranking model.

zmzhang2000/HybRank implementation

Open GitHub

Official implementation repository for HyperAgent, including the multi-agent software engineering framework and benchmark reproduction scripts.

FSoft-AI4Code/HyperAgent framework

Open GitHub

GitHub repository for the ICICLE method introduced and evaluated in the paper.

gmum/ICICLE framework

Open GitHub

Repository containing ToolEmu code, emulators, evaluators, curated toolkit and test-case assets, scripts, and notebooks.

ryoungj/ToolEmu benchmark

Open GitHub

GitHub repository released by the authors for IL-PCSR and developed models.

Exploration-Lab/IL-PCSR dataset

Open GitHub

Complete codebase and setup instructions for the sentiment-driven stock prediction experiments.

Walids35/capstone-stock-prediction implementation

Open GitHub

Repository maintained by the authors as an official page and living resource list for the survey on implicit reasoning in LLMs.

digailab/awesome-llm-implicit-reasoning

Open GitHub

Open-source repository cited as the source for the Reuters & Bloomberg news-title stock-prediction dataset used to build headline vectors.

WenchenLi/news-title-stock-prediction-pytorch dataset

Open GitHub

Repository titled Anote-Text-Classification containing dataset folders, trial runs, requirements, README explanations, and model performance comparisons for GPT-3.5 Turbo, SetFit, and BERT across the evaluated datasets.

iiWhiteii/Anote-Text-Classification implementation

Open GitHub

Repository for the CARE native retrieval-augmented reasoning framework, including scripts, evaluation resources, documentation, and training examples.

FoundationAgents/CARE framework

Open GitHub

Official FactCC implementation and model used to score whether generated summary claims are factually consistent with source documents.

salesforce/factCC implementation

Open GitHub

Preliminary implementation of the paper's multiagent debate experiments, with code for arithmetic, GSM8K, biography, and MMLU tasks.

composable-models/llm_multiagent_debate

Open GitHub

Repository for Fusion-in-Decoder, providing the Natural Questions version augmented with DPR-retrieved passages that RETRO uses for its question-answering experiment and representing a principal comparison system.

facebookresearch/FiD

Open GitHub

NexusRaven-V2 repository linked by the paper for the Nexus function-calling evaluation benchmark.

nexusflowai/NexusRaven-V2 benchmark

Open GitHub

Repository containing notebooks for SFT and LoRA-based DPO training, generation and reward-model evaluation of the four OPT-350M variants, evaluation data and JSON outputs, plots, and examples of noisy preference pairs.

PiyushWithPant/Improving-LLM-Safety-and-Helpfulness-using-SFT-and-DPO dataset

Open GitHub

Code repository for the paper's QE-based machine-translation feedback-training experiments and RAFT+ method.

zwhe99/FeedbackMT implementation

Open GitHub

Code repository for the MAP paper, including implementations and evaluation scripts for Tower of Hanoi, CogEval graph tasks, PlanBench, StrategyQA, and transfer experiments.

MAPLLM/MAPICLR2025sub system

Open GitHub

Microchain is the agent-based library used to inject function declarations, descriptions, and examples into prompts, call functions during the reasoning loop, and record assistant/user chat-completion chains.

galatolofederico/microchain framework

Open GitHub

Public repository for InlineCoder, the framework introduced and evaluated in the paper.

ythere-y/InlineCoder dataset

Open GitHub

Repository linked by the paper for InfoMosaic-Bench/InfoMosaic-Flow resources.

DorothyDUUU/Info-Mosaic dataset

Open GitHub

Published ETT dataset collected by the authors and used as a core benchmark dataset in the paper.

zhouhaoyi/ETDataset dataset

Open GitHub

Source code for the Informer model and experiments.

zhouhaoyi/Informer2020 framework

Open GitHub

Aider source repository analyzed for user-driven loop design, repo-map retrieval, edit formats, and summarization behavior.

Aider-AI/aider implementation

Open GitHub

OpenCode source repository analyzed for tool interface design, event bus, SQLite session persistence, dynamic tools, and role-based sub-agents.

anomalyco/opencode implementation

Open GitHub

Moatless Tools source repository analyzed for MCTS orchestration, tree-structured state, action classes, semantic search, in-memory shadow mode, and actor-critic routing.

aorwall/moatless-tools implementation

Open GitHub

AutoCodeRover source repository analyzed for phased scaffold control, search-only tools, AST-aware retrieval, SBFL, and Docker-based execution.

AutoCodeRoverSG/auto-code-rover implementation

Open GitHub

Cline source repository analyzed for recursive control flow, IDE coupling, shadow git checkpoints, delegation, and LLM-initiated compaction.

cline/cline implementation

Open GitHub

Prometheus source repository analyzed for LangGraph-based phased control, graph-scoped state, knowledge graph retrieval, per-node tool scoping, and multi-tier persistence.

EuniAI/Prometheus implementation

Open GitHub

Gemini CLI source repository analyzed as one of the 13 coding agent scaffolds.

google-gemini/gemini-cli implementation

Open GitHub

Codex CLI source repository analyzed for event-driven ReAct control, dynamic tool rebuilding, sandboxing, Guardian safety routing, memory extraction, and sub-agent delegation.

openai/codex implementation

Open GitHub

Agentless source repository analyzed for fixed-pipeline architecture, hierarchical localization, sampling, and JSONL pipeline state.

OpenAutoCoder/Agentless implementation

Open GitHub

OpenHands source repository analyzed for event-sourced architecture, tool interfaces, Docker execution, and delegation mechanisms.

OpenHands/OpenHands implementation

Open GitHub

mini-swe-agent source repository analyzed as a deliberately minimal baseline scaffold with a single bash tool and simple ReAct loop.

SWE-agent/mini-swe-agent implementation

Open GitHub

SWE-agent source repository analyzed for ReAct control loop, tool bundles, retry behavior, Docker isolation, and compaction processors.

SWE-agent/SWE-agent implementation

Open GitHub

DARS-Agent source repository analyzed for depth-first tree search, SWE-agent-derived tools, greedy LLM critic selection, and Docker reset/replay state recovery.

vaibhavagg303/DARS-Agent implementation

Open GitHub

A topic-agnostic, provenance-first pipeline for legitimate AI-assisted drafting of a Related Work section using human-authored structured notes, strict no-invention constraints, interaction logs, generated taxonomy, draft output, audit table, provenance card, and related artifacts.

RutaBinkyte/AI-RO framework

Open GitHub

Repository associated with the paper's survey of instruction-tuning research.

xiaoya-li/Instruction-Tuning-Survey

Open GitHub

Python repository associated with hybrid physics-based and data-driven building energy modeling, with folders for evaluation, feature generation, forecasting, models, preprocessing, and utilities.

Leo-VK/hybrid_bem dataset

Open GitHub

Repository containing code notebooks, data files, tickers, metrics/metadata, SQL queries, and the list of companies associated with the paper's algorithmic stock-market trading system.

JuanCarlosKing/StockmarketAlgoritmicTrading dataset

Open GitHub

Repository for the MiniGrid gridworld environment used for the paper's four-room and six-room navigation experiments.

maximecb/gym-minigrid benchmark

Open GitHub

Repository for the FastAPI control server, callback-augmented interactive trainer, React/TypeScript dashboard, examples, and LLM-based tuning demonstration.

yuntian-group/interactive-training framework

Open GitHub

The paper states that the ICM protocol is open source under the MIT license and that referenced workspaces are available or buildable through this repository.

RinDig/Interpretable-Context-Methodology-ICM- framework

Open GitHub

Repository identified by the paper as containing the code for the GDELT headline extraction, FinBERT sentiment scoring, feature engineering, modeling, and backtesting workflow.

yukepenn/macro-news-sentiment-trading implementation

Open GitHub

Public code repository for the paper's context-faithfulness experiments.

liyp0095/ContextFaithful benchmark

Open GitHub

Repository for InvestLM, the LLaMA-based financial-domain instruction-tuned model released to the research community under the same licensing terms as LLaMA.

AbaciNLP/InvestLM implementation

Open GitHub

Python project for forecasting variance-covariance matrices in a Markowitz framework across cryptocurrency and traditional asset markets; the repository README states that it produced the paper's results.

Maciej-13/vcov_forecast framework

Open GitHub

Code and data repository for reproducing the paper's evaluator training, SFT+ILR, SFT+DPO, naive-ILR, and supervision-quality experiments.

helloelwin/iterative-label-refinement dataset

Open GitHub

Repository containing Stage 1 JEPA+DAAM training, Stage 2 decoder training, FSQ and mixed-radix packing, a HiFi-GAN decoder with optional DAAM gating, and DeepSpeed integration.

gioannides/Density-Adaptive-JEPA system

Open GitHub

Repository associated with the Jr. AI Scientist system; the paper points readers there for issues, comments, questions, and planned codebase release.

Agent4Science-UTokyo/Jr.AI-Scientist system

Open GitHub

Official K-COMP codebase with retrieval, data-processing, training, and inference scripts, environment configuration, short medical descriptions, and links to model checkpoints and datasets.

jeonghun3572/K-COMP framework

Open GitHub

Repository path containing code or supplementary material for the shell-based keyword-search agent used in the paper.

amazon-science/aws-research-science system

Open GitHub

Official implementation of the KnowAgent framework, including HotpotQA and ALFWorld path-generation scripts, trajectory filtering and merging, and LoRA-based knowledgeable self-learning.

zjunlp/KnowAgent

Open GitHub

Repository containing the question-generation compression workflow, paper-card generation, syntactic multihop experiments, datasets, evaluation scripts, and reported result files.

anvix9/llama2-chat system

Open GitHub

Official repository released with the paper, containing extensible implementations of KTO, DPO, offline PPO, ORPO, and other human-aware loss functions.

ContextualAI/HALOs implementation

Open GitHub

Repository for the technical report, providing code for high-frequency trading prediction under label imbalance, including MLP, LSTM, BERT, and Mamba backbones plus class weighting and data balancing options. The original dataset is not provided because of copyright restrictions.

RS2002/Label-Unbalance-in-High-Frequency-Trading framework

Open GitHub

Official LaMMA-P repository containing code, PDDL resources, scripts, Fast Downward submodule setup, and MAT-THOR test dataset files for running the method in AI2-THOR.

tasl-lab/LaMMA-P dataset

Open GitHub

Official repository accompanying the GPT-3 paper, containing synthetic arithmetic and word-scrambling datasets, dataset statistics, 175B samples, a model card, and benchmark-overlap examples.

openai/gpt-3

Open GitHub

Official demonstration code implementing the paper's language-model planning and admissible-action grounding workflow.

huangwl18/language-planner

Open GitHub

PIXIU provides FinMA models, FLARE financial evaluation benchmark tasks, and instruction data used as a baseline or source in the paper.

chancefocus/PIXIU benchmark

Open GitHub

Open-source CAMEL framework for autonomous cooperation among communicative agents using inception prompting and role-play.

camel-ai/camel framework

Open GitHub

Open-source multi-agent collaborative framework associated with MetaGPT, discussed as a representative framework that embeds human workflow processes and SOPs into language-agent collaboration.

geekan/MetaGPT framework

Open GitHub

Open-source AutoGen framework for creating LLM applications using customizable agents that can be programmed through natural language and code.

microsoft/autogen framework

Open GitHub

Author-maintained repository for tracking LLM-based multi-agent papers and organizing them into streams such as frameworks, orchestration and efficiency, problem solving, world simulation, datasets, and benchmarks.

taichengguo/LLM_MultiAgents_Survey_Papers

Open GitHub

A GitHub repository made public by the authors to curate papers related to LLM safety.

tjunlp-lab/Awesome-LLM-Safety-Papers

Open GitHub

Repository containing the paper list associated with the survey on LLM-based agents for software engineering.

FudanSELab/Agent4SE-Paper-List dataset

Open GitHub

Repository containing the mathematical problem dataset, evaluation code, and model results associated with the paper.

jboye12/llm-probs dataset

Open GitHub

Repository containing the data and code for the LLM biased reinforcement learning experiments.

william-hayes/LLMs-biased-RL dataset

Open GitHub

The repository contains task implementations for the Iowa Gambling Task, Cambridge Gambling Task, and Wisconsin Card Sort Test, LLM integration utilities, oTree interface components, run scripts, configuration files, and links to analytical code and data.

ynulihao/LLM_vs_Human_Decision_Making benchmark

Open GitHub

Repository for TraceLLM, the LLM-based synthetic microservice trace generator and related experimental code.

ldos-project/TraceLLM system

Open GitHub

Official source-code repository for the ICPI algorithm and experiments introduced in the paper.

ethanabrooks/icpi

Open GitHub

Repository for the paper's LLM-based anomaly detection implementation, including data-processing scripts, a raw data archive, and demo scripts for SFT, transfer learning, LoRA, online detection, catastrophic forgetting, and ICL.

PoSeiDon-Workflows/LLM_AD implementation

Open GitHub

A GitHub repository linked by the authors as a resource for recent work on LLMs for ML workflows.

t-harden/LLM4AutoML

Open GitHub

The repository contains the paper's source code in a Jupyter notebook and the RAG knowledge-base text used for the demonstration.

explainable-digital-twins/RAG-DDDAS system

Open GitHub

Repository for transformer and foundation models for financial time-series forecasting, including data folders, model implementations, scripts, result files, notebooks, and reproduction instructions.

UVA-MLSys/Financial-Time-Series dataset

Open GitHub

Fin-LLAMA: efficient fine-tuning of quantized LLMs for finance.

Bavest/fin-llama implementation

Open GitHub

Cornucopia-LLaMA-Fin-Chinese: Chinese finance-oriented LLaMA model referenced by the survey.

jerry1993-tech/Cornucopia-LLaMA-Fin-Chinese implementation

Open GitHub

An up-to-date resource list for large multimodal agents associated with the survey paper.

jun0wanan/awesome-large-multimodal-agents

Open GitHub

The official code repository for experiments and evaluation associated with the Leaky Thoughts paper.

parameterlab/leaky_thoughts benchmark

Open GitHub

Repository linked by the paper as the available code for the conformal abstention and uncertainty-evaluation work.

sinatayebati/vlm-uncertainty benchmark

Open GitHub

Official implementation and experiments for Graph Neural Controlled Differential Equations.

WonderSeven/graph-neural-cdes framework

Open GitHub

Repository described as the official codebase for the ACL submission titled 'Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization'.

gb-kgp/VocabReplace-Then-Expand implementation

Open GitHub

Contains data resources and code for the two-stage training procedure, inference, and evaluation of fine-grained attributed generation.

LuckyyySTA/Fine-grained-Attribution dataset

Open GitHub

Repository for reproducing the FAIR paper, including scripts for inferring student mistakes, collecting teacher responses, training distilled student models, and testing accuracy.

zhuochunli/Learn-from-Committee framework

Open GitHub

Official implementation of the paper's TraceCodegen training framework, including preprocessing, execution, buffer-based self-sampling, training configurations, and inference.

microsoft/TraceCodegen

Open GitHub

Official implementation and pipeline repository for representation learning in time-domain high-energy astrophysics, including event-file representations, feature extraction, dimensionality reduction, clustering, datasets, encoders, and a demonstration notebook.

StevenDillmann/ml-xraytransients-mnras implementation

Open GitHub

Repository titled 'BERTOps: Learning Representations on Logs for AIOps' containing data, results, source code, annotated datasets, dataset distributions, and scripts for preparing a pretrained ITOps-domain model.

BertOps/bertops dataset

Open GitHub

Code accompanying the paper's limited-expert-prediction learning-to-defer experiments.

ptrckhmmr/learning-to-defer-with-limited-expert-predictions implementation

Open GitHub

Repository containing code for the paper's Robot Language Model approach to grounded task planning.

dnandha/RobLM system

Open GitHub

Contains code for the supervised baseline, trained reward model, PPO fine-tuned policy, model card, and links to the released human-feedback and evaluation data.

openai/summarize-from-feedback dataset

Open GitHub

Official implementation and pretrained model weights for the CLIP method introduced and evaluated in the paper.

OpenAI/CLIP implementation

Open GitHub

Official experiment code for training LEVER verifiers, executing generated programs, and reproducing the paper's language-to-code evaluations.

niansong1996/lever

Open GitHub

CAIL2019 is used as an additional legal case similarity setting involving civil cases such as private lending, IP disputes, and maritime law.

alumik/cail2019 dataset

Open GitHub

Repository associated with the paper's rationale-augmented dialogue understanding experiments and released resources.

ShoRit/RATDIAL dataset

Open GitHub

Repository containing the LitSearch dataset and code for constructing or evaluating the scientific literature retrieval benchmark.

princeton-nlp/LitSearch dataset

Open GitHub

Repository named IBM/live-api-bench with code for converting BIRD benchmark SQL queries into API call sequences and producing slot-filling and selection-style benchmark outputs.

IBM/live-api-bench dataset

Open GitHub

Repository for the LiveTradeBench platform/package used to run, monitor, and benchmark LLM-based trading agents across U.S. stock and Polymarket environments.

ulab-uiuc/live-trade-bench package

Open GitHub

Contains the LiveVectorLake Python implementation, chunk-level CDC components, Milvus and Delta Lake integrations, query engine, tests, generated test data, benchmark scripts, documentation, and architecture materials.

praj-tarun/LiveVectorLake dataset

Open GitHub

Official repository linked by the paper for access to the LLaMA model family and associated inference resources.

facebookresearch/llama

Open GitHub

Repository for the LlamaDuo LLMOps pipeline implementation.

deep-diver/llamaduo system

Open GitHub

Repository for the LLF-Bench environments, installation instructions, wrappers, and task implementations introduced by the paper.

microsoft/LLF-Bench benchmark

Open GitHub

Repository containing the nudging experimental framework, configuration for nudge types and models, and R analysis code for statistical tests and plots.

PapayaResearch/nudging benchmark

Open GitHub

Repository associated with the WORKS 2025 Flowcept Agent materials, including software, a synthetic workflow, data, analysis code, a query set, and prompts referenced for reproducibility.

flowcept/FlowceptAgent-WORKS25 system

Open GitHub

Flowcept code repository used as the provenance capture and observability foundation for the agent architecture and implementation.

ORNL/flowcept framework

Open GitHub

Repository for modernbert_predict_masked.

AnswerDotAI/ModernBERT

Open GitHub

Repository for medsam_inference.

bowang-lab/MedSAM

Open GitHub

Repository for flowmap_overfit_scene.

dcharatan/flowmap

Open GitHub

Repository for esm_fold_predict.

facebookresearch/esm

Open GitHub

Repository for stamp_extract_features and stamp_train_classification_model.

KatherLab/STAMP

Open GitHub

Public repository for the TOOLMAKER code and the TM-BENCH benchmark.

KatherLab/ToolMaker framework

Open GitHub

Repository for pathfinder_verify_biomarker.

LiangJunhao-THU/PathFinderCRC

Open GitHub

Repository for musk_extract_features.

lilab-stanford/MUSK

Open GitHub

Repository for conch_extract_features.

mahmoodlab/CONCH

Open GitHub

Repository for uni_extract_features.

mahmoodlab/UNI

Open GitHub

Repository for nnunet_train_model.

MIC-DKFZ/nnUNet

Open GitHub

Repository for medsss_generate.

pixas/MedSSS

Open GitHub

Repository for tabpfn_predict.

PriorLabs/TabPFN

Open GitHub

Repository for retfound_feature_vector.

rmaphoh/RETFound_MAE

Open GitHub

Repository for cytopus_db.

wallet-maker/cytopus

Open GitHub

Repository containing the cross-provider validation framework, SEC 10-K test corpus, synthetic financial database, 480 run traces, reproducibility manifests, and setup instructions for release v0.1.0.

ibm-client-engineering/output-drift-financial-llms dataset

Open GitHub

Repository containing a Python pipeline for extracting LLM uncertainty features, analysing and selecting features, balancing imbalanced data, training calibrated Ridge and XGBoost meta-models, tuning thresholds, and evaluating cost-aware uncertainty models.

ZEFR-INC/lpp-research framework

Open GitHub

Official implementation of Chameleon, the plug-and-play compositional reasoning system reproduced and structurally analyzed in the paper.

lupantech/chameleon-llm

Open GitHub

Official repository containing the prompts, outputs, token-count materials, and supporting files for the paper's five experiments.

AKSW/AI-Tomorrow-2023-KG-ChatGPT-Experiments

Open GitHub

CAMEL is an open-source multi-agent framework for role-playing and agent collaboration.

camel-ai/camel framework

Open GitHub

Code repository for improving factuality and reasoning through multi-agent debate.

composable-models/llm_multiagent_debate framework

Open GitHub

MetaGPT is a multi-agent framework that models a software company using role assignments and SOP-style workflows.

FoundationAgents/MetaGPT framework

Open GitHub

AutoAgents generates different roles for GPTs to form a collaborative entity for complex tasks.

Link-AGI/AutoAgents framework

Open GitHub

Microsoft AutoGen, a framework for building multi-agent AI applications.

microsoft/autogen framework

Open GitHub

Code repository for Solo Performance Prompting / multi-persona self-collaboration.

MikeWangWZHL/Solo-Performance-Prompting framework

Open GitHub

AgentVerse provides task-solving and simulation frameworks for multiple LLM-based agents.

OpenBMB/AgentVerse framework

Open GitHub

ChatDev implements LLM-powered multi-agent collaboration for software development.

OpenBMB/ChatDev framework

Open GitHub

Repository connected to AI Scientist-generated papers reported as having passed peer review at an ICLR workshop.

SakanaAI/AI-Scientist-ICLR2025-Workshop-Experiment

Open GitHub

Code repository for MAD, a multi-agent debate framework using large language models.

Skytliang/Multi-Agents-Debate framework

Open GitHub

A curated collection of resources associated with the survey, described by the authors as containing over 200 related papers on agent hallucinations.

ASCII-LAB/Awesome-Agent-Hallucinations

Open GitHub

Repository containing code, scripts, data folders, requirements, and reproduction instructions for LLM-DSE: Searching Accelerator Parameters with LLM Agents.

Nozidoali/LLM-DSE framework

Open GitHub

Repository reported by the paper as containing the source code and data for the LLM-BLM study.

youngandbin/LLM-BLM framework

Open GitHub

Repository for reproducing LLM-Explorer experiments and implementation details.

tsinghua-fib-lab/LLM-Explorer framework

Open GitHub

Repository for the paper's LLM-REVal framework, including simulation components for LLM-driven research and review workflows.

PlusLabNLP/LLM-REVal benchmark

Open GitHub

Contains the DELEGATE-52 relay runners, direct and agentic model wrappers, prompts, and 52 domain-specific parsers and evaluators.

microsoft/DELEGATE52 dataset

Open GitHub

Official implementation of the Chronos time-series foundation model used for zero-shot inference and sequential fine-tuning.

amazon-science/chronos-forecasting framework

Open GitHub

Configuration files for the CNN-Transformer statistical-arbitrage benchmark replicated in the paper.

gregzanotti/dlsa-public benchmark

Open GitHub

Repository directory containing the IPCA, PCA, and Fama-French residual-return datasets used in the paper's backtests.

gregzanotti/dlsa-public dataset

Open GitHub

Official code repository for the paper, containing evaluation scripts, configuration files, a CSV of evaluation results, notebooks, figures, and a modified Lingua training submodule.

brendel-group/llm-line implementation

Open GitHub

Repository for the Needle-In-A-Haystack long-context retrieval test adapted in the paper's fixed- and random-needle experiments.

gkamradt/needle-in-a-haystack benchmark

Open GitHub

Official repository for the LLoCO framework and experiments.

jeffreysijuntan/lloco framework

Open GitHub

Official LongBench repository used for the paper's SingleDoc, MultiDoc, and summarization evaluation; baseline numbers are taken from this repository.

THUDM/LongBench dataset

Open GitHub

Repository for Local-Splitter, an MCP-compatible and OpenAI-compatible outbound LLM request shim that uses a local small model as a triage layer to reduce cloud token usage.

jayluxferro/local-splitter system

Open GitHub

GitHub repository linked by the paper for LocalEval resources; the repository page inspected stated that available resources were being prepared.

tsinghua-fib-lab/LocalEval dataset

Open GitHub

Public repository for LoCoBench-Agent, including the benchmark framework, evaluation setup, data-download instructions, and metrics for long-context software engineering agent evaluation.

SalesforceAIResearch/LoCoBench-Agent dataset

Open GitHub

Repository containing LogicBench data, evaluation code, and supporting reasoning-chain materials.

Mihir3009/LogicBench dataset

Open GitHub

Public-facing repository for the paper's KD experiments, configuration files, training code, plotting scripts, and geometric feature analysis utilities.

Thegolfingocto/KD_wo_CE benchmark

Open GitHub

Repository containing LongReasonArena benchmark data, input-generation utilities, inference code, and evaluation scripts.

LongReasonArena/LongReasonArena dataset

Open GitHub

Repository containing code, data, and prompts for evaluating LLMs as path planners, including the artifact associated with 'Look Further Ahead: Testing the Limits of GPT-4 in Path Planning'.

MohamedAghzal/llms-as-path-planners benchmark

Open GitHub

Official repository containing the paper's QA and key-value datasets, prompt construction code, data-generation scripts, tests, and experiment instructions.

nelson-liu/lost-in-the-middle

Open GitHub

Repository for LPS-BENCH containing benchmark examples, mock tools, evaluators, prompt templates, schemas, scripts, and the multi-agent case synthesis pipeline.

tychenn/LPS-Bench dataset

Open GitHub

Repository for the general-purpose MadEvolve code-evolution framework with MAP-Elites, island populations, multiple LLM providers, evaluation backends, and analysis tools.

tianyi-stack/MadEvolve framework

Open GitHub

Repository for the Self-Supervised Audio Spectrogram Transformer used as the architectural and empirical baseline for MAE-AST.

YuanGongND/ssast implementation

Open GitHub

Repository for MaGNet containing Python implementation files for MAGE, hypergraph modules, 2D attention modules, training, datasets/model-weight download links, and backtesting.

PeilinTime/MaGNet dataset

Open GitHub

Repository containing the authors' simulator and market-making agents for scaled beta policy experiments.

JJJerome/rl4mm framework

Open GitHub

Repository for the FOREC/Cross-Market Product Recommendation baseline used for model parameters and comparisons.

hamedrab/FOREC implementation

Open GitHub

Repository linked by the paper for the efficient cross-market recommendation implementation associated with the proposed market-aware models.

samarthbhargav/efficient-xmrec implementation

Open GitHub

Repository released by the authors for Markov Chain of Thought code and associated research artifacts.

james-yw/Markov-Chain-of-Thought dataset

Open GitHub

Repository for MASEval, the framework-agnostic multi-agent system evaluation library introduced and evaluated in the paper.

parameterlab/MASEval framework

Open GitHub

Repository hosting Audio-MAE code and pretrained models for masked spectrogram autoencoding and downstream audio tasks.

facebookresearch/AudioMAE implementation

Open GitHub

Provides code and model assets for MASS pre-training and fine-tuning, including unsupervised and supervised NMT, text summarization, and conversational response generation.

microsoft/MASS implementation

Open GitHub

Official repository for the MatPlotAgent framework and MatPlotBench benchmark introduced and evaluated in the paper.

thunlp/MatPlotAgent benchmark

Open GitHub

Sibyl System repository included in the selected AutoGen application sample.

Ag2S1/Sibyl-System system

Open GitHub

AutoTx repository for planning and executing on-chain transactions.

agentcoinorg/AutoTx

Open GitHub

GPT-Academic repository for LLM-assisted academic reading, writing, translation, and code/project analysis workflows.

binary-husky/gpt_academic

Open GitHub

Composio repository, evaluated as a flexible agent application or platform with multiple autonomy-related configurations.

ComposioHQ/composio

Open GitHub

h2oGPT repository for private local GPT-style chat and document interaction.

h2oai/h2ogpt

Open GitHub

GraphRag_Ollama repository combining AutoGen, GraphRAG, Ollama, and related tooling.

karthikvenkatesan-eaton/Autogen_GraphRAG_Ollama

Open GitHub

Langflow platform for building and deploying AI-powered agents and workflows.

langflow-ai/langflow

Open GitHub

Letta platform for stateful agents with advanced memory.

letta-ai/letta

Open GitHub

AutoGen open-source framework for building AI agent systems using language models, multi-agent conversations, and tool use.

microsoft/autogen framework

Open GitHub

AutoGen Studio application within the AutoGen repository.

microsoft/autogen system

Open GitHub

Dream Team repository for building a team of AI agents with AutoGen.

yanivvak/dream-team

Open GitHub

GitHub source for the multiple-choice TruthfulQA variant used in the model-level ranking experiment.

manyoso/haltt4llm dataset

Open GitHub

CHALE repository used as a hallucination-evaluation dataset with non-hallucinated, half-hallucinated, and hallucinated answer categories.

weijiaheng/CHALE dataset

Open GitHub

Search/RAG infrastructure repository.

devflowinc/trieve

Open GitHub

Large language model training framework repository.

EleutherAI/gpt-neox

Open GitHub

NLP framework repository.

flairNLP/flair

Open GitHub

Glasgow Haskell Compiler repository.

ghc/ghc

Open GitHub

Haskell Cabal build/package repository.

haskell/cabal

Open GitHub

Transformer model library repository.

huggingface/transformers

Open GitHub

Property-based testing library repository.

HypothesisWorks/hypothesis

Open GitHub

JavaScript DOM implementation repository.

jsdom/jsdom

Open GitHub

Python spreadsheet/data-analysis tool repository.

mito-ds/mito

Open GitHub

Machine-learning library repository.

scikit-learn/scikit-learn

Open GitHub

JavaScript standard library monorepo.

stdlib-js/stdlib

Open GitHub

Repository associated with the paper's Time Series Transformer mechanistic interpretability experiments.

mathiisk/TST-Mechanistic-Interpretability implementation

Open GitHub

Repository for the MedBayes-Lite clinical uncertainty governance layer and associated experimental implementation.

eliashossain001/medbayes-lite framework

Open GitHub

Repository containing the official Python implementation of MEME, including code for argument extraction, mode identification/alignment, and training scripts; the README notes that the dataset used in the project is private due to compliance requirements.

gta0804/MEME framework

Open GitHub

Repository for the MeMemo browser-based HNSW retrieval toolkit, documentation, and RAG Playground example application.

poloclub/mememo framework

Open GitHub

Canonical repository for the MemGPT system, now named Letta, implementing stateful LLM agents with persistent memory and related tooling.

letta-ai/letta

Open GitHub

Repository for MemR3, the memory retrieval via reflective reasoning controller.

Leagein/memr3 framework

Open GitHub

Official open-source implementation of the MemTools framework introduced and evaluated in the paper.

JJJAYYYZhao/MemTools-public

Open GitHub

Repository providing data and code for reproducing analyses, with folders for one-dimensional generated regression, two-armed bandit, real-world regression, and an MMLU benchmark.

juliancodaforno/meta-in-context-learning benchmark

Open GitHub

Repository containing the MetaOptimize implementation, package files, example usage, and experiment code for CIFAR10, ImageNet, TinyStories, and continual CIFAR100 experiments.

sabersalehk/MetaOptimize framework

Open GitHub

Official MetaTool repository containing ToolE data, tool descriptions and embeddings, scenario lists, prompt templates, model-generation scripts, and evaluation code.

HowieHwong/MetaTool benchmark

Open GitHub

Official repository for the Mind2Web dataset, benchmark processing and evaluation resources, and MindAct fine-tuning and model code.

OSU-NLP-Group/Mind2Web dataset

Open GitHub

Repository linked by the paper for the MindWatcher agent framework, models, benchmark resources, and related implementation artifacts.

TIMMY-CHAN/MindWatcher framework

Open GitHub

Repository for MiniCheck code, model usage, synthetic data generation code, benchmark evaluation demo, and links to LLM-AggreFact and Hugging Face model resources.

Liyan06/MiniCheck benchmark

Open GitHub

Repository for the Minions communication protocol enabling small on-device models to collaborate with frontier cloud models.

HazyResearch/minions system

Open GitHub

Repository containing the paper's mKG-RAG implementation and associated resources.

xandery-geek/mKG-RAG framework

Open GitHub

MedicalZooPytorch repository used in the benchmark's OOD training set.

black0017/MedicalZooPytorch

Open GitHub

Text classification repository listed as TCL in the benchmark's OOD training repositories.

brightmart/text_classification

Open GitHub

DeepFloyd IF repository used for multi-modal image tasks.

deep-floyd/if

Open GitHub

Deep Graph Library repository used for graph-model tasks.

dmlc/dgl framework

Open GitHub

PyTorch-GAN repository used for image-GAN tasks.

eriklindernoren/PyTorch-GAN

Open GitHub

ESM repository used for protein/biomedical tasks.

facebookresearch/esm

Open GitHub

Public ML-Bench code and benchmark resources released by the paper.

gersteinlab/ML-bench dataset

Open GitHub

BERT repository used for ML-Bench tasks.

google-research/bert

Open GitHub

PyTorch Image Models repository used as an OOD evaluation repository.

huggingface/pytorch-image-models

Open GitHub

Grounded Segment Anything repository used as an OOD evaluation repository.

IDEA-Research/Grounded-Segment-Anything

Open GitHub

Muzic repository used for music/audio tasks.

microsoft/muzic

Open GitHub

OpenCLIP repository used for multi-modal tasks.

mlfoundations/open_clip

Open GitHub

vid2vid repository used for video tasks.

NVIDIA/vid2vid

Open GitHub

OpenDevin agent framework evaluated in ML-Agent-Bench with GPT-4o, GPT-4, and GPT-3.5.

OpenDevin/OpenDevin system

Open GitHub

LAVIS repository used for multi-modal tasks.

salesforce/lavis framework

Open GitHub

Stable Diffusion repository used in the benchmark's OOD training set.

Stability-AI/stablediffusion

Open GitHub

Tensor2Tensor repository used in the benchmark's OOD training set.

tensorflow/tensor2tensor

Open GitHub

Time-Series-Library repository used for time-series tasks.

thuml/Time-Series-Library

Open GitHub

Learning3D repository used for 3D vision tasks.

vinits5/learning3d

Open GitHub

External-Attention-pytorch repository used for attention-usage tasks.

xmu-xiaoma666/External-Attention-pytorch

Open GitHub

Repository for the benchmark and evaluation resources introduced by the paper.

chchenhui/mlrbench dataset

Open GitHub

Official repository containing inference and evaluation code, setup instructions, and links to the released dataset and leaderboard.

Alpha-Innovator/MME-Reasoning dataset

Open GitHub

Official repository for the MMIR-TCM framework and its associated MedTCM and TDEU artifacts.

jw-chae/MMIR-TCM

Open GitHub

Repository for the MMReason benchmark, with citation and evaluation instructions and integration through VLMEvalKit.

HJYao00/MMReason dataset

Open GitHub

Official XiaoMi repository for Mobile-Bench, including Appium/emulator setup materials, API/UI agent code folders, data/result directories, prompts, requirements, and README instructions.

XiaoMi/MobileBench dataset

Open GitHub

Repository for MobileAgentBench, an automated benchmark for mobile LLM agents with a Python library and default tasks using SimpleMobileTools apps.

MobileAgentBench/mobile-agent-bench benchmark

Open GitHub

Repository named Modalitites-Impact-In-ML under the belgats GitHub account. It was public but empty when accessed.

belgats/Modalitites-Impact-In-ML implementation

Open GitHub

Public Python repository containing MODE ingestion and inference code, clustering and centroid-routing components, benchmark scripts, evaluation data and logs, tests, and documentation.

rahulanand1103/mode framework

Open GitHub

Official Model Context Protocol server collection used to identify official and community MCP integrations.

modelcontextprotocol/servers

Open GitHub

Replication package released by the authors for the MCP server empirical study.

SAILResearch/replication-25-mcp-server-empirical-study dataset

Open GitHub

Public repository containing the paper's collected landscape data and implementation examples.

security-pride/MCP_Landscape dataset

Open GitHub

Python/TensorFlow implementation of Robust Log-Optimal Strategy with Reinforcement Learning, discussed as a policy-based portfolio optimization approach.

fxy96/Robust-Log-Optimal-Strategy-with-Reinforcement-Learning framework

Open GitHub

Python/TensorFlow implementation of the PGPortfolio or EIIE-style portfolio reinforcement learning framework discussed in the survey.

ZhengyaoJiang/PGPortfolio framework

Open GitHub

GitHub repository for the SYMBA crypto multi-agent reinforcement learning market simulator.

johannlussange/symba_crypto framework

Open GitHub

Synthetic train, development, and test files used for the paper's arithmetic argument-extraction experiments.

AI21Labs/MRKL_synthetic_data dataset

Open GitHub

Auto-GPT is discussed as an autonomous AI application whose main agent, plugins, memory, file access, internet access, code execution, image generation, and oracle-like functions can be represented within the proposed multi-agent graph framework.

Significant-Gravitas/Auto-GPT framework

Open GitHub

BabyAGI is discussed as an AI agent system with task creation, prioritisation, and execution chains that can be modelled as interconnected agents plus a vector-database plugin.

yoheinakajima/babyagi system

Open GitHub

Official source-code repository for the paper, including scalar and two-dimensional consensus experiments, topology configuration, plotting utilities, and experiment execution code.

WindyLab/ConsensusLLM-code

Open GitHub

Official data release containing generated experiment outputs for the agent-count, temperature, and personality analyses.

WindyLab/ConsensusLLM-code

Open GitHub

Repository associated with the experiment in which a fine-tuned Llama 2 7B Chat model attempts to manipulate an LLM overseer or reward model, with RL increasing jailbreak attempts.

AlexMeinke/fooling-the-overseer implementation

Open GitHub

Repository associated with the experiment that repeatedly rewrites news articles using LLMs and evaluates degradation in factual accuracy.

qfeuilla/DistordedNews implementation

Open GitHub

Repository associated with the experiment testing whether specialized LLM driving agents fine-tuned on different traffic conventions fail to coordinate when yielding to an emergency vehicle.

SUMEETRM/driving_llms implementation

Open GitHub

Official MH-MoE implementation based on TorchScale and fairseq, including setup and pretraining scripts.

yushuiwx/MH-MoE implementation

Open GitHub

Repository created to support the survey by collecting and categorizing relevant research papers, datasets, application scenarios, framework figures, and future-direction materials on value alignment in agentic AI systems.

Wei-ZENG1020/Value-Alignment-Agentic-AI-Papers-Survey-Taxonomy

Open GitHub

Repository containing the Multi-LogiEval data and associated evaluation or reasoning-chain artifacts.

Mihir3009/Multi-LogiEval dataset

Open GitHub

GitHub repository made available by the authors for MMTB.

yupeijei1997/MMTB dataset

Open GitHub

Repository containing the official implementation of Agentic Predictor for multi-view performance prediction in LLM-based agentic workflows.

DeepAuto-AI/agentic-predictor system

Open GitHub

Public repository for the MARBLE framework and MultiAgentBench code and data used to develop, test, and evaluate LLM-based multi-agent systems.

MultiagentBench/MARBLE dataset

Open GitHub

Repository identified by the paper as containing the project, data, Python code, ontology model, mapping rules, hyperparameter tuning code for ComplEx, TransE, and DistMult, and code to train and predict using TransE.

durgeshnandini/Multidimensional-Knowledge-Graph-Embeddings-for-International-Trade-Flow-Analysis dataset

Open GitHub

A continuously updated paper list related to the survey topic of multimodality representation learning.

marslanm/multimodality-representation-learning

Open GitHub

Public repository containing modules that form MRT and utility code used for data piping in the paper's experiments.

EgoPer/Multiple-Resolution-Tokenization implementation

Open GitHub

Toolkit and repository for creating, sharing, and materializing the natural-language prompt templates used to build P3 and train T0.

bigscience-workshop/promptsource

Open GitHub

Code and instructions for reproducing T0 training, evaluation, inference, and ablation checkpoints.

bigscience-workshop/t-zero

Open GitHub

LASSO is a platform for scalable software code analysis and observation, combining dynamic and static program analysis, code search, N-version assessment, automated test generation, software experimentation, and benchmarking.

SoftwareObservatorium/lasso framework

Open GitHub

Supplemental repository containing accepted submissions, the call for papers, nanopublication files, the questionnaire, and visualization materials for the field study.

LaraHack/formalization_papers_supplemental

Open GitHub

Repository containing the nanopublication collection and graph assets used to analyze and visualize the formalization-paper special-issue workflow.

LaraHack/fpsi_analytics

Open GitHub

Repository for Nanobench, the template-driven interface used by participants to create formalizations, class definitions, submissions, reviews, responses, updates, and decisions as nanopublications.

peta-pico/nanobench

Open GitHub

Repository for Tapas, the generic triple-store interface used to run template-based SPARQL queries and display submission and review overviews.

peta-pico/tapas

Open GitHub

Repository providing an open-source NRT-style training recipe built on top of verl, including scripts and configuration for NRT training.

sharkwyf/native-reasoning-models framework

Open GitHub

CodRep dataset/competition artifact for code refinement and defect detection.

ASSERT-KTH/CodRep dataset

Open GitHub

BigCloneBench dataset for clone detection and related code-understanding tasks.

clonebench/BigCloneBench dataset

Open GitHub

Code Contests dataset for code generation from competitive-programming problems.

deepmind/code_contests dataset

Open GitHub

InCoder repository for generative code models.

dpfried/incoder implementation

Open GitHub

Description2Code dataset for mapping descriptions to code.

ethancaballero/description2code dataset

Open GitHub

CodeSearchNet dataset covering multiple programming languages for code search and code-language tasks.

github/CodeSearchNet dataset

Open GitHub

APPS benchmark for code generation from programming problems.

hendrycks/apps benchmark

Open GitHub

Project CodeNet dataset for program understanding, generation, and refinement tasks.

IBM/Project_CodeNet dataset

Open GitHub

Copilot for Xcode source editor extension integrating GitHub Copilot and ChatGPT-style functions into Xcode.

intitni/CopilotForXcode system

Open GitHub

CodeXGLUE benchmark for multiple code intelligence tasks.

microsoft/CodeXGLUE dataset

Open GitHub

PyCodeGPT/CERT-related repository for Python code generation.

microsoft/PyCodeGPT implementation

Open GitHub

xCodeEval/ExecEval repository for multilingual code evaluation tasks.

ntunlp/xCodeEval benchmark

Open GitHub

HumanEval benchmark for evaluating code-generation models.

openai/human-eval benchmark

Open GitHub

WikiSQL dataset for SQL generation/summarization-style tasks in the paper's table.

salesforce/WikiSQL dataset

Open GitHub

CONCODE dataset for mapping natural language and program context to code.

sriniiyer/concode dataset

Open GitHub

Code-LMs repository associated with PolyCoder and code-language-model research.

VHellendoorn/Code-LMs implementation

Open GitHub

Official implementation repository for the expanded Natural Language Reinforcement Learning project, including shared NLRL libraries and later Maze, Breakthrough, and Tic-Tac-Toe experiments.

waterhorse1/Natural-language-RL

Open GitHub

Python code for collecting LLM ranking data, training the scoring model, and training policies with direct-score or potential-difference rewards.

sy-shi/RLAIF_ScoreDiff implementation

Open GitHub

GroundHog is the authors' Theano-based recurrent neural network framework containing the neural machine translation implementation used for the paper.

lisa-groundhog/GroundHog

Open GitHub

Repository released by the authors for Multimodal-ZeroShotTM, Multimodal-Contrast, and the multimodal topic modeling experiments.

gonzalezf/multimodal_neural_topic_modeling benchmark

Open GitHub

Repository for NNHedge, containing components for simulated instruments, neural hedging models, data loading, training, and assessment.

guijinSON/NNHedge framework

Open GitHub

Author-linked reference implementation of NSR, including SVD compression, TaskKnowledgeBank logic, allocation policies, dataset builders, experiment runners, and visualization code.

bhyoon-me/nsr

Open GitHub

Repository containing code for persona vector generation, the user-study interface, and user-study analysis associated with the neural transparency paper.

mitmedialab/neural-transparency system

Open GitHub

Repository containing code, data folders, source files, scripts, requirements, and usage instructions for running CaRing and evaluating ProofWriter, GSM8K, and PrOntoQA experiments.

DAMO-NLP-SG/CaRing framework

Open GitHub

Repository for the Next-Generation LLM for UAV system, including the main application, short-, medium-, and long-range route-planning utilities, path-planning utilities, control-platform utilities, examples, and integrated data folders.

liangqiyuan/NeLV framework

Open GitHub

The repository describes NormCode as a language for auditable multi-step AI workflows, lists core components such as infra, canvas_app, cli_orchestrator.py, documentation, and examples, and links to the arXiv paper.

lys5588/Normcode-paper framework

Open GitHub

The fairseq NormFormer example provides architecture flags and causal-language-model training commands corresponding to the paper.

facebookresearch/fairseq implementation

Open GitHub

Repository containing code and configuration for training and inference of Qwen3 8B, DeepSeek LLM 7B, and LLaMA3 8B Instruct on financial sentiment datasets, including preprocessing, model configuration, training, inference, and evaluation.

NLPforFinance/fine-tuning-of-lightweight-large-language-models implementation

Open GitHub

Repository whose README cites the NumHTML AAAI 2022 paper and states that code and data used for the paper are provided.

YangLinyi/HTML-Hierarchical-Transformer-based-Multi-task-Learning-for-Volatility-Prediction dataset

Open GitHub

Repository containing the paper's Off-Policy Corrected Reward Modeling implementation and the code and resulting data for filtering the short Alpaca-Farm setting.

JohannesAck/OffPolicyCorrectedRewardModeling implementation

Open GitHub

Repository for the ACL 2025 OMGM coarse-to-fine multimodal retrieval and RAG framework.

ChaoLinAViy/OMGM framework

Open GitHub

Repository for the OmniCode benchmark code and data.

seal-research/OmniCode dataset

Open GitHub

Repository associated with the paper's implementation of probabilistic monolithic and model reconciling explanation algorithms.

YODA-Lab/Probabilistic-Monolithic-Model-Reconciling-Explanations benchmark

Open GitHub

Repository stated by the paper as containing code to reproduce the experiments for the uncertainty-measure framework.

ml-jku/uncertainty-measures implementation

Open GitHub

Repository linked by the paper as the code for the controlled shortest-path reasoning experiments.

riccardoalberghi/DP benchmark

Open GitHub

Public ReAct repository used as the source codebase for the original ReAct setup and corresponding experiments.

ysymyth/ReAct framework

Open GitHub

Repository containing the RepoExec benchmark/source code for executable repository-level code generation evaluation and related dependency-utilization tooling.

FSoft-AI4Code/RepoExec dataset

Open GitHub

Repository containing code for the paper, including configurations for the BM25 baseline, ReAct agent variants, self-reflection, AutoCodeRover-related settings, and evaluation scripts.

JetBrains-Research/ai-agents-code-editing benchmark

Open GitHub

Repository containing the ARC task data and supporting materials for the benchmark introduced and analyzed in the paper.

fchollet/ARC dataset

Open GitHub

Repository for the paper's PyTorch implementation, including scripts, prompts, helper functions, and instructions for running memory representation and retrieval experiments.

zengrh3/StructuralMemory dataset

Open GitHub

Official ToolBench repository containing benchmark tasks, action-generator and evaluator code, tests, setup instructions, and evaluation commands.

sambanova/toolbench benchmark

Open GitHub

Repository path for Agent Spec runtime adapters that translate Agent Spec components into framework-specific equivalents for popular agentic frameworks.

oracle/agent-spec implementation

Open GitHub

WayFlow is presented as the paper's reference runtime for executing Agent Spec components, including native support for Agent Spec Agents and Flows.

oracle/wayflow framework

Open GitHub

Repository announced by the authors for code and data supporting the neutral event graph induction framework.

liusiyi641/Neutral-Event-Graph dataset

Open GitHub

Open-source code and data repository for the OpenAgentSafety framework and benchmark task suite.

Open-Agent-Safety/OpenAgentSafety dataset

Open GitHub

Official implementation repository for OpenAI Gym, the reinforcement-learning environment toolkit introduced and described by the paper.

openai/gym

Open GitHub

Repository linked by the authors as the location of all generated code and other answers used in the study.

lmous/openai-gpt4-coding-assistant implementation

Open GitHub

The open-source implementation of the OpenHands platform introduced and evaluated in the paper.

All-Hands-AI/OpenHands

Open GitHub

A modular, parallelized codebase for simulating constant-product AMM trading against a CEX, designed for large-scale experiments and extensible market-design variants.

JasonSome/cpmm-trading implementation

Open GitHub

Library of ADMM applications for sparse and low-rank optimization used to test NewADMM.

canyilu/LibADMM package

Open GitHub

Huawei Cloud VM-placement traces used in the cloud resource scheduling case study.

huaweicloud/VM-placement-dataset dataset

Open GitHub

Official implementation of OCTree for optimized feature generation for tabular data via LLMs with decision-tree reasoning.

jaehyun513/OCTree framework

Open GitHub

Open-source MA-Gym codebase for evaluating Manager Agents in graph-based multi-agent workflow orchestration.

DeepFlow-research/manager_agent_gym framework

Open GitHub

Open-Finance-Lab AgenticTrading is an open-source experimental playground for LLM-powered trading agents, backtests, paper-trading simulations, reasoning logs, benchmark comparisons, and the FinAgent orchestration subsystem.

Open-Finance-Lab/AgenticTrading framework

Open GitHub

Repository for the OrderFusion model and workflow, including package installation, tutorial notebook, data-reading, model optimization, evaluation, and forecast-plotting functions.

runyao-yu/OrderFusion package

Open GitHub

Repository for Orla, the library and execution engine for constructing and serving LLM-based agentic workflows with stage mapping, orchestration, and workflow-level memory management.

dorcha-inc/orla framework

Open GitHub

Official Uber Research repository containing implementations of the original POET and Enhanced POET algorithms.

uber-research/poet

Open GitHub

Public source-code repository for PAMS, the Python-based Platform for Artificial Market Simulations.

masanorihirano/pams framework

Open GitHub

Repository for the RedPajama data recipe and corpus from which the paper draws compute-dependent pretraining subsets.

togethercomputer/RedPajama-Data dataset

Open GitHub

Repository provided by the authors to support reproduction of the Identifier-Organizer-Adapter data synthesis and distillation framework.

BokwaiHo/IOA framework

Open GitHub

Official open-source repository for the PedNStream pedestrian-network simulator, including core LTM modules, scenario data, examples, visualization, tests, and configuration files.

WaimenMak/PedNStream

Open GitHub

Official ParlAI repository containing the Persona-Chat task infrastructure and associated dialogue-model resources.

facebookresearch/ParlAI

Open GitHub

Code used to scrape and process the dataset, construct train/test and benchmark datasets, and build and evaluate baseline phishing detection models.

phreshphish/phreshphish dataset

Open GitHub

CMBAgent Benchmarks repository used for CAMB tool-grounded precision tasks.

cmbagent/Benchmarks dataset

Open GitHub

Repository containing the implementation of the Point-M2AE hierarchical point-cloud pre-training framework.

ZrrSkywalker/Point-M2AE implementation

Open GitHub

Agent Network Protocol is treated as an open-network agent discovery and collaboration protocol using decentralized identifiers and JSON-LD.

agent-network-protocol/AgentNetworkProtocol

Open GitHub

Model Context Protocol is treated as a standardized context-ingestion and tool-invocation protocol relevant to execution-level transitions.

modelcontextprotocol

Open GitHub

Contains framework code, domain profiles, prompt builders, placeholder QA, deterministic replacement, paired full-resolution benchmark posters, evaluation outputs, runtime and failure audits, prompts, manifests, and configuration records.

tyy99phy/paper_poster_harness

Open GitHub

Repository associated with the paper's dataset and experimental framework for evaluating LLM-generated scientific reviews against human reviews and post-publication outcomes.

akhilpandey95/LMRSD dataset

Open GitHub

Repository containing code for the paper, including PhraseBank preprocessing, training, and testing scripts for the LLaMA financial sentiment analysis experiments.

luosting/LLaMA-Financial-sentiment-analysis implementation

Open GitHub

Apache Airflow repository; used as an example among ten large open-source industrial projects from which function-summary pairs were sampled.

apache/airflow

Open GitHub

RxJava repository; used as an example among ten large open-source industrial projects from which function-summary pairs were sampled.

ReactiveX/RxJava

Open GitHub

Original StockNet codebase for stock movement prediction from tweets and historical prices.

yumoxu/stocknet-code benchmark

Open GitHub

Repository for the StockNet dataset, containing historical stock-price data and tweet-data structure for stock movement prediction from tweets and historical prices.

yumoxu/stocknet-dataset dataset

Open GitHub

Repository reported by the paper as the source code for the LLM-enhanced tweet emotion analysis and stock movement prediction framework.

anv0101/stock-prediction implementation

Open GitHub

LLT R package used by the authors to transform the cryptocurrency datasets before classifier evaluation.

mtkurbucz/LLT package

Open GitHub

A Python benchmark for backtesting prediction-market trading agents on real Kalshi market replay data, with bundled episodes, agent interfaces, simulator configuration, and metrics output methods.

Oddpool/PredictionMarketBench dataset

Open GitHub

Official code repository for the prefix-tuning method and experiments introduced by the paper.

XiangLi1999/PrefixTuning

Open GitHub

Official implementation repository for ProAgent, the LLM-based agent system introduced by the paper for Agentic Process Automation.

OpenBMB/ProAgent framework

Open GitHub

Public code repository for the trajectory-probing experiments and analyses introduced in the paper.

AndresAlgaba/probing_reasoning_traces benchmark

Open GitHub

Repository containing task templates, benchmark-generation scripts, preprocessing, model-prediction configurations, response extraction, and evaluation code for reproducing and extending the ProcBench experiments.

ifujisawa/proc-bench benchmark

Open GitHub

Repository containing code, prompts, and data for reproducing or evaluating Program of Thoughts prompting.

wenhuchen/Program-of-Thoughts benchmark

Open GitHub

Official repository location for the MBPP programming-problem benchmark introduced by the paper.

google-research/google-research dataset

Open GitHub

Notebook implementing the MathQA-to-Python translation and generation workflow used for MathQA-Python.

google/trax implementation

Open GitHub

Official ProgramBench repository for the benchmark, tooling, usage guide, and baseline evaluation workflow.

facebookresearch/ProgramBench dataset

Open GitHub

Repository containing the progressive multimodal search-agent rollout, image and text tools, retrieval and summarization services, TN-GSPO modifications, training scripts, evaluation scripts, and dataset configuration files.

DingWu1021/Promsa framework

Open GitHub

Repository for the paper's trace-rewriting experiments, including configuration files, source scripts for trace generation, rewriting, distillation and evaluation, prompt-optimization scripts, and pre-generated GSM8K/MATH rewritten trace datasets.

xhOwenMa/trace-rewriting dataset

Open GitHub

Repository for evaluating LLM agents on real-world coding tasks, with benchmark code, configuration, data folders, unit-test workflow, PyInstruct data link, and PyLlama3 model link.

Mercury7353/PyBench dataset

Open GitHub

Repository containing the public implementation and data-processing pipeline associated with the paper.

DARE-ML/DeepGR4J-Extremes framework

Open GitHub

Repository containing Qlib's code, documentation, and additional platform features beyond those described in the paper.

microsoft/qlib framework

Open GitHub

Python and shell-script repository containing QTMRL or multi-indicator experiment code, non-multi-indicator model scripts, baseline scripts, dependency specifications, and instructions for downloading a related multi-indicator dataset.

ChenJiahaoJNU/QTMRL framework

Open GitHub

Repository containing paper versions, an overview image, README, and Jupyter Notebook implementation for Quantformer.

zhangmordred/QuantFormer implementation

Open GitHub

Repository containing code, data, model implementations, visualisations, and setup instructions for classic and quantile versions of linear regression, BD-LSTM, Conv-LSTM, and ED-LSTM across BTC, ETH, Sunspot, Mackey-Glass, and Lorenz datasets.

sydney-machine-learning/quantiledeeplearning dataset

Open GitHub

Official Quark implementation with separate branches for toxicity unlearning, sentiment steering, and repetition reduction, plus evaluation scripts and released checkpoint links.

GXimingLu/Quark

Open GitHub

Contains the contextual chunker, embedding and reranker training code, Kaggle submission notebooks, official competition data layout, and experiments corresponding to the baseline and final system.

nuinashco/unlp2026_shared_task system

Open GitHub

Repository containing the RAGSmith codebase and the datasets used in the study.

yAquila/RAGSmith dataset

Open GitHub

Author-maintained repository containing Python and MATLAB implementations and examples for randomizing affine-diffusion models and computing randomized characteristic-function constructions.

LechGrzelak/Randomization implementation

Open GitHub

Repository associated with Rational Tuning experiments for LLM cascade modeling and threshold optimization.

mzelling/rational-llm-cascades implementation

Open GitHub

Alpaca Eval is one of the two central input sets compared in the paper's automatic bencher experiments.

tatsu-lab/alpaca_eval dataset

Open GitHub

Repository for the paper's released code and data supporting the RealRank automatic LLM ranking analysis.

yale-nlp/RealRank benchmark

Open GitHub

CHEF is one of the three real-world datasets used to evaluate STEEL.

THU-BPM/CHEF dataset

Open GitHub

CrewAI is presented as a Python framework for defining LLM-based agents, tasks, tools, sequential execution, entity memory, and callbacks.

crewAIInc/crewAI framework

Open GitHub

Jupyter notebook implementing a workflow that performs root-cause analysis from a directly-follows graph abstraction and adds evaluation steps such as confidence scoring and reasoning output.

fit-alessandro-berti/agents-trial implementation

Open GitHub

Jupyter notebook implementing the process-mining fairness workflow with protected-group identification and comparison between protected and non-protected cases.

fit-alessandro-berti/agents-trial implementation

Open GitHub

Official Chronos repository for pretrained time-series forecasting models.

amazon-science/chronos-forecasting implementation

Open GitHub

Datadog Toto repository for Time-Series-Optimized Transformer for Observability.

DataDog/toto implementation

Open GitHub

Official Google Research TimesFM repository for the Time Series Foundation Model.

google-research/timesfm implementation

Open GitHub

IBM Granite TSFM TinyTimeMixer model implementation.

ibm-granite/granite-tsfm implementation

Open GitHub

MOMENT repository for a family of open time-series foundation models.

moment-timeseries-foundation-model/moment implementation

Open GitHub

TiRex repository for zero-shot forecasting across long and short horizons.

NX-AI/tirex implementation

Open GitHub

Moirai/Uni2TS repository for universal time-series forecasting Transformers.

SalesforceAIResearch/uni2ts implementation

Open GitHub

Sundial repository for highly capable time-series foundation models.

thuml/Sundial implementation

Open GitHub

Lag-Llama repository for probabilistic time-series foundation forecasting.

time-series-foundation-models/lag-llama implementation

Open GitHub

Repository for the Re4 Scientific Computing Agent, including source code in Jupyter Notebook and Python script formats and a README describing the rewriting-resolution-review-revision logical chain.

ChengAo21/Re4_Sci_Agent framework

Open GitHub

Repository for the ICLR 2023 ReAct prompting paper, including data, prompts, HotpotQA, FEVER, ALFWorld, and WebShop notebooks, plus Wikipedia environment wrappers.

ysymyth/ReAct benchmark

Open GitHub

Repository for the paper's training-data synthesis implementation.

BAAI-DCAI/Training-Data-Synthesis implementation

Open GitHub

GitHub repository associated with the REAL websites, framework, and leaderboard for benchmarking autonomous web agents.

agi-inc/agisdk framework

Open GitHub

Google Research code for REALM pre-training, document-index refreshing, example generation, released checkpoints, and integration with the ORQA fine-tuning code.

google-research/language implementation

Open GitHub

Repository containing the yield-farming simulation environment used for strategy analysis.

xujiahuayz/yieldAggregators implementation

Open GitHub

Official implementation and release repository for the DramaSR-LRM pipeline, benchmark data structure, training, inference, evaluation, and model checkpoints.

198808xc/DramaSR-LRM benchmark

Open GitHub

GitHub directory containing REVEAL code, AIGC-text-bank files, configurations, inference scripts, training scripts, and setup instructions.

microsoft/AnthropomorphicIntelligence dataset

Open GitHub

The ParlAI repository contains the framework, tasks, model access, fine-tuning and evaluation code used to reproduce and extend the paper's chatbot recipes.

facebookresearch/ParlAI

Open GitHub

Repository released by the authors for ReCode, including the framework and associated research artifacts.

ZJU-CTAG/ReCode dataset

Open GitHub

GitHub repository for the benchmark code and evaluation tooling introduced by the paper.

JiseungHong/ReCUBE benchmark

Open GitHub

HuggingFace Diffusers repository, used for refactoring diffusion-model implementation files such as UNet and scheduler sources.

huggingface/diffusers

Open GitHub

Official repository released by the authors with implementations, demos, prompts, and research artifacts for Reflexion experiments.

noahshinn024/reflexion

Open GitHub

Repository containing the implementation associated with the regime-aware continual adaptive portfolio-management framework.

Dumail/ReCAP framework

Open GitHub

Repository associated with the paper's Qwen2.5-0.5B SFT, DPO, RLOO, reward-model, Countdown, and external-verifier experiments.

Yifu93/LLM-Reinforcement-Learning implementation

Open GitHub

Code repository for the Relational Representation Distillation implementation reported by the paper.

giakoumoglou/distillers implementation

Open GitHub

Public source code for the paper's RPS-based ordinal conformal prediction method and experiments.

stefanahaas41/rps-ordinal-conformal-prediction

Open GitHub

The Arena Hard repository is used as the pairwise chatbot evaluation setup; the authors generated additional model outputs and scored them with the compared judges.

lm-sys/arena-hard benchmark

Open GitHub

The public implementation of RepoAgent, an LLM-powered repository agent for generating, maintaining, updating, and understanding repository-level documentation.

OpenBMB/RepoAgent framework

Open GitHub

Repository containing RepoGraph code for constructing and retrieving repository graph context, plus integrations with Agentless and SWE-agent and scripts for SWE-bench evaluation.

ozyyshr/RepoGraph framework

Open GitHub

RepoReviewer repository containing the Python backend, Next.js frontend, screenshots, CLI/API/web workflow, and templates or utilities for future empirical evaluation and annotation.

peng1z/RepoReviewer framework

Open GitHub

Official codebase for training ReProbe/UHead-style verifiers, generating annotated reasoning datasets, evaluating ReProbe and baselines, and reproducing benchmark tables.

ReProbe/ReProbe benchmark

Open GitHub

GitHub repository containing data from the paper's experiments, provided to support reproducibility of the proposed evaluation approach.

andstor/agentic-ai-eval-replication-package implementation

Open GitHub

Repository containing PopQA training and test data, preprocessing artifacts, generator and reranker LoRA adapters, training and testing scripts, prompt-generation code, and utilities.

CoderrrSong/CoRAG dataset

Open GitHub

Repository for Byzantine Fault Tolerance in LLM-Based Multi-Agent Systems, including pilot experiments, prompt-level confidence probing, hidden-level confidence probing, datasets, pretrained confidence probes, and experiment scripts.

Z1ivan/Byzantine-Fault-Tolerance-in-LLM-MAS framework

Open GitHub

The KILT repository provides the standardized Wikipedia knowledge source used for retrieval in both dialogue benchmarks.

facebookresearch/KILT

Open GitHub

ParlAI is the dialogue research framework in which the paper trains and evaluates all models and through which the RAG-based implementations and pretrained models were released.

facebookresearch/ParlAI

Open GitHub

The paper's RAG implementation and experiment scripts were ported to and open-sourced within the Hugging Face Transformers repository.

huggingface/transformers

Open GitHub

Official companion repository for the survey, containing survey materials and an evolving RAG knowledge base covering papers, datasets, benchmarks, evaluation resources, and toolkits.

Tongji-KGLLM/RAG-Survey

Open GitHub

Open-Finance-Lab repository for FinRL Contest 2024, linked directly in the paper alongside the contest website.

Open-Finance-Lab/FinRL_Contest_2024 framework

Open GitHub

Official repository released by the authors with code, data, and agent-generated traces for reproducing the study.

ChuanMeng/text-ranking-in-deep-research benchmark

Open GitHub

Implementation repository used for the Rank1 reasoning-based re-ranker evaluated in the pipeline and query-mismatch experiments.

orionw/rank1

Open GitHub

Repository for the BrowseComp-Plus fixed-corpus deep-research benchmark and its evaluation tooling.

texttron/BrowseComp-Plus dataset

Open GitHub

Repository for reproducing or using RFEval, the paper's benchmark for auditing reasoning faithfulness under counterfactual reasoning intervention.

AIDASLab/RFEval dataset

Open GitHub

Repository containing the implementation and experiments for the risk-aware GUMDP framework and ERM-MCTS evaluation introduced in the paper.

gh0stwin/risk-aware-gumdp

Open GitHub

GitHub repository for the simulator used to quantify uncertainty propagation in AI-augmented systems.

EMezzi/AI-Augmented implementation

Open GitHub

Official Risky-Bench repository containing benchmark datasets, data generation scripts, evaluator components, and evaluation workflows for the paper.

SophieZheng998/Risky-Bench dataset

Open GitHub

Repository linked by the paper as the code artifact for Variational Alignment with Re-weighting; the repository page was empty when checked.

DuYooho/VAR implementation

Open GitHub

Repository containing prompts and templates used to instantiate LLM-Modulo for Travel Planner and Natural Plan domains.

Atharva-Gundawar/LLM-Modulo-prompts framework

Open GitHub

A project that crawls WeChat users' profile pictures with usernames and visualizes them; used as the source project for the bug-fixing task.

yangxuanxc/wechat_friends

Open GitHub

Repository for serving, training, and evaluating LLM routers, including router types corresponding to the paper such as matrix factorization, similarity-weighted ranking, BERT, causal LLM, and random routing.

lm-sys/routellm framework

Open GitHub

Official RouterBench code repository containing data converters, prompt embeddings, predictive and cascading routers, evaluation code, configurations, tests, and visualization utilities.

withmartian/routerbench benchmark

Open GitHub

Repository reported by the paper for the AtomicTranslation code used in the language-to-logic translation experiments.

KrisAesoey/AtomicTranslation dataset

Open GitHub

GitHub repository for a convolutional neural stock-market technical analyser used as the starting point for the paper's proposed model.

philipxjm/Convolutional-Neural-Stock-Market-Technical-Analyser implementation

Open GitHub

Repository for SafeArena, a benchmark for assessing harmful capabilities and safety risks of autonomous web agents.

McGill-NLP/safearena dataset

Open GitHub

Repository containing code and prompts used when writing the paper, including agent setup notebooks, safety architecture notebooks, image-generation safety notebooks, requirements, utilities, and unsafe agent test requests.

ishaandomkundwar/Agent-Safety dataset

Open GitHub

Google Gemini full-stack LangGraph quickstart prototype used as the basis for a query-decomposition, search, reflection, and answer-finalisation scaffold.

google-gemini/gemini-fullstack-langgraph-quickstart system

Open GitHub

Repository for SafeSearch code, dataset, prompts, assets, and red-teaming configurations.

jianshuod/SafeSearch dataset

Open GitHub

Official implementation of the Vera safety-testing framework, including taxonomy exploration, case generation, adaptive execution, Vera-Bench, and guard-model fine-tuning components.

Yunhao-Feng/Vera framework

Open GitHub

DeepSpeed repository containing the distributed training framework and MoE functionality introduced and evaluated by the paper.

microsoft/DeepSpeed framework

Open GitHub

A revised Tau2-Bench repository used by the authors because they considered the original benchmark data noisy.

AGI-Eval-Official/tau2-bench-revised dataset

Open GitHub

Repository for BIG-bench, from which the paper evaluates 62 tasks covering reasoning, knowledge, social behaviour, and other language-model capabilities.

google/BIG-bench benchmark

Open GitHub

LLM-Random research codebase containing model-training infrastructure and research configurations associated with the fine-grained MoE scaling-law experiments.

llm-random/llm-random implementation

Open GitHub

NL2Flow is the automated workflow problem generation and symbolic evaluation framework used to generate planning problems, compile PDDL, and evaluate plans.

IBM/nl2flow framework

Open GitHub

NL2FLOW-Runner contains code for running the experiments reported in the paper.

IBM/nl2flow-runner implementation

Open GitHub

Repository containing canonical tool contracts, interface condition renderers, validation and structured diagnostics, deterministic sandbox executors, run matrix harnesses, structured logging, and aggregate metric scripts.

akgitrepos/schema-first-tool-apis-experiments benchmark

Open GitHub

OpenHands CodeAct is used as one of the agent frameworks evaluated on ScienceAgentBench, alongside direct prompting and self-debug.

All-Hands-AI/OpenHands framework

Open GitHub

Repository containing code and data access instructions for ScienceAgentBench, including benchmark structure, agent/evaluation scripts, and links to benchmark materials.

OSU-NLP-Group/ScienceAgentBench dataset

Open GitHub

Repository for the Calo-VQ model used by SciFi to reproduce a calorimeter simulation inference and plotting pipeline.

qibin2020/calo-VQ implementation

Open GitHub

Repository containing implementation details for the SciFi autonomous agentic scientific workflow framework.

qibin2020/scifi framework

Open GitHub

Repository for Reasoning-Reinforced Representation for Search, with release-status information and a linked model artifact.

ytgui/Search-R3 framework

Open GitHub

Public repository containing the SecRepoBench implementation, benchmark metadata, descriptions, harnesses, tools, and scripts for running inference and evaluation.

ai-sec-lab/SecRepoBench dataset

Open GitHub

Provides training code, HH-RLHF data with preference-strength information, and the GPT-4-cleaned validation set.

OpenLMLab/MOSS-RLHF dataset

Open GitHub

Official code repository containing the implementation needed to reproduce the GazeReward framework and experiments.

Telefonica-Scientific-Research/gaze_reward framework

Open GitHub

Repository containing code for data generation and risk-control experiments associated with selective conformal classification.

git4review/conformal_selective_classification implementation

Open GitHub

Contains data-processing, SFT, iterative reward-model training and filtering, inference, GPT-4 evaluation, and PPO code for Self-Evolved Reward Learning.

microsoft/DKI_LLM framework

Open GitHub

Repository for Self-Evolving GPT; the README states that se_gpt_MAIN.py shows the main workflow of the framework.

ArrogantL/se_gpt framework

Open GitHub

Official code and data repository for generating Self-Instruct data, classifying tasks, generating instances, filtering and formatting data, fine-tuning GPT-3, and evaluating on the released user-oriented tasks.

yizhongw/self-instruct dataset

Open GitHub

Repository containing the original Self-RAG implementation, critic and generator data-creation workflows, retriever setup, training scripts, and short- and long-form evaluation code.

AkariAsai/self-rag

Open GitHub

Official repository for the SELF-REFINE framework, containing code, prompts, data, task runners, evaluation scripts, and examples for the paper's studied tasks.

madaan/self-refine

Open GitHub

Contains the paper's hybrid-retrieval defense, detection implementations, evaluation and experiment scripts, result files, sanitized examples, corpus setup instructions, figures, and reproducibility documentation.

scthornton/semantic-chameleon system

Open GitHub

Contains the pipeline for loading CoQA and TriviaQA, generating answers, clustering semantic similarities, computing likelihoods and uncertainty measures, evaluating AUROC, and reproducing the paper's analyses, together with the released hand-labelled semantic-equivalence data.

lorenzkuhn/semantic_uncertainty dataset

Open GitHub

Code repository linked by the paper for Sentence-BERT and sentence-transformer models.

UKPLab/sentence-transformers

Open GitHub

The paper's released Sequence Tutor implementation within TensorFlow Magenta, including a checkpointed melody RNN.

tensorflow/magenta

Open GitHub

Code repository for the ShapG method introduced in the paper.

vectorsss/shapG implementation

Open GitHub

Self-service snack bar kiosk system repository containing requirements documentation, automated Robot Framework acceptance tests, C4 architecture documentation, ADRs, and traceability audit material.

JYU-GENIUS-project/snackbar_v1 implementation

Open GitHub

First iteration for vibe coding in the study; implements a snack kiosk system using React, TypeScript, PostgreSQL, Node, and Express.

JYU-GENIUS-project/VibeCode1 implementation

Open GitHub

Lovable-generated campus-treats prototype used as the unstructured vibe coding example.

Lovablekokeilu/campus-treats implementation

Open GitHub

Open-source Python framework for building and orchestrating linear deterministic agentic workflows.

DevenPanchal/simpliflow framework

Open GitHub

Repository containing ready-made example workflows, workflow utilities, and usage material for simpliflow.

DevenPanchal/simpliflow-usage

Open GitHub

Repository containing supplementary materials for the paper, organized into metrics, heatmaps, and equity-line outputs.

Maciej-13/spxw-options-sizing-paper-appendix

Open GitHub

Official repository for the skfolio Python library, containing the implementation of the portfolio-optimization and risk-management framework described in arXiv:2507.04176.

skfolio/skfolio framework

Open GitHub

Qwen3-Coder repository linked by the paper for the 30B open-weights code-generation model evaluated under NO-SPARK and WITH-SPARK conditions.

QwenLM/Qwen3-Coder

Open GitHub

Repository for the Smoothie label-free LLM routing method.

HazyResearch/smoothie implementation

Open GitHub

Public Python implementation associated with the proposed social recommendation system.

BehafaridMjf/Social-Recommendation-System implementation

Open GitHub

Official Google Research directory containing self-contained prototype notebooks for the Socratic Models applications evaluated in the paper.

google-research/google-research

Open GitHub

Official SPA-Bench repository containing benchmark code, data, framework, model server, pipeline, documentation, and setup files.

ai-agents-2030/SPA-Bench dataset

Open GitHub

Repository associated with the Sparse Logit Sampling paper and Random Sampling KD method; at extraction time the README stated that code would be uploaded after company approvals.

akhilkedia/RandomSamplingKD implementation

Open GitHub

GitHub Spec Kit documentation for spec-driven development, including the staged specification, planning, and tasking workflow used as the conceptual and procedural base for Spec Kit Agents.

github/spec-kit framework

Open GitHub

Repository containing training and distillation scripts, BigBench Hard and mathematical-benchmark evaluation code, processed-data workflows, prompting notebooks, and notebooks for aligning code-davinci and FlanT5 tokenized outputs.

FranxYao/FlanT5-CoT-Specialization implementation

Open GitHub

GitHub repository associated with the SPEECH method and released resources.

zjunlp/SPEECH dataset

Open GitHub

A curated repository associated with the paper that organizes efficient architecture papers according to the survey's categories.

weigao266/Awesome-Efficient-Arch

Open GitHub

Repository for the Semantic Pyramid Indexing FAISS/Qdrant plug-in introduced and evaluated by the paper.

FastLM/SPI_VecDB framework

Open GitHub

Official KernelBench repository used by the paper to benchmark runtime of PyTorch baselines and LLM-generated kernels.

ScalingIntelligence/KernelBench dataset

Open GitHub

Repository for the SciBORG manuscript, with setup instructions, framework usage examples, agent construction, and benchmarking guidance tied to the paper release.

chopralab/sciborg_manuscript_repo framework

Open GitHub

Directory containing SciBORG benchmark notebooks and trace notebooks corresponding to the supporting-information experiments.

chopralab/sciborg benchmark

Open GitHub

Repository reported by the paper as the source code for reproducing the pumped-storage DDQN state-representation study.

Fluxons/hydrodam implementation

Open GitHub

Repository containing the processed datasets, subject code, static-analysis implementation, machine-learning pipeline, scripts, outputs, feature selection, tuning, cross-validation, and SHAP analyses used in the paper.

imran9pk/replication-package_method_energy_java

Open GitHub

Repository containing historical changes in S&P 500 constituents, used to restrict backtest positions to stocks that belonged to the index on each date.

fja05680/sp500 dataset

Open GitHub

IMPROVER is an operational probabilistic weather forecast post-processing system used to bias-correct, calibrate, threshold, smooth, and blend forecast products.

metoppv/improver implementation

Open GitHub

Contains data-engineering utilities, synthetic-data generation, the three experimental training flows, configurations for SFT, DPO, and DTFT, and instructions for reproducing the final StatLLaMA path.

HuangDLab/StatLLaMA implementation

Open GitHub

Repository for the BESSTIE sentiment and sarcasm classification benchmark for varieties of English.

unswnlp/BESSTIE dataset

Open GitHub

Repository for the InstruSum instruction-controllable summarization dataset referenced and used as an evaluation target in the paper.

yale-nlp/InstruSum dataset

Open GitHub

Repository indicated by the paper as containing all code and supplementary materials used for the SV-LSTM hybrid model study.

aperekhodko/sv_lstm_hybrid_model implementation

Open GitHub

Repository linked by the authors as the location of the curated dataset used for the stock movement and volatility prediction experiments.

hao1zhao/Bigdata23 dataset

Open GitHub

Repository titled MSGCA: Stock Movement Prediction with Multimodal Stable Fusion via Gated Cross-Attention Mechanism, containing code and data folders for the proposed framework.

changzong/MSGCA dataset

Open GitHub

Repository containing code, configuration files, data-preparation scripts, pretraining routines, and trading scripts for STORM.

DVampire/Storm framework

Open GitHub

Official repository for the StoryScope pipeline, configuration, feature taxonomy, development stories, feature assignments, trained XGBoost models, and reproduction scripts.

jenna-russell/storyscope dataset

Open GitHub

Official ECCV 2026 release containing StoryAD-QA annotations, answer keys, evaluation code, and generation and answering prompts; it does not redistribute copyrighted movie media.

SEE-AI-Lab/ECCV2026_StoryTeller_StoryAD_QA

Open GitHub

Code repository for the LLM-agent Cournot competition simulations and experimental workflow.

smojha/collusive-llm-agents implementation

Open GitHub

The referenced Llama 3 model card is associated with Llama-3-8B-Instruct, one of the local open-source models used in the ranking task.

meta-llama/llama3

Open GitHub

Code released for the paper, including modular segmentation-attention and syntactic-attention layers and training scripts for translation, question answering, and natural language inference.

harvardnlp/struct-attn

Open GitHub

Searchat is a local-first semantic search system for AI coding-agent conversations, supporting verbatim, distilled, and cross-layer retrieval over agent transcript histories.

Process-Point-Technologies-Corporation/searchat framework

Open GitHub

Open-source AI hedge fund project whose structured-summary prompt template, JSON action schema, and next-open execution conventions are adapted for the paper's LLM trading-agent backtests.

virattt/ai-hedge-fund framework

Open GitHub

Public reference implementation of Context-Aware Decoding, the decoding method adapted by FinCAD for parametric look-ahead-bias mitigation.

xhan77/context-aware-decoding implementation

Open GitHub

Open-source SuperHF training code, reward-model training code, PPO-RLHF baselines, experiments, evaluations, and chart-generation resources.

openfeedback/superhf implementation

Open GitHub

Repository for the code used in the paper, with directories for the rate, prepare, assess, and pipeline stages.

COMSYS/artifact-evaluation-llm-support system

Open GitHub

Companion repository that organizes works on LLM-agent evaluation according to the survey's structure and tracks papers, benchmarks, methodologies, and frameworks.

Asaf-Yehudai/LLM-Agent-Evaluation-Survey

Open GitHub

Framework for evaluating and optimizing agents and models in container environments, discussed as part of emerging standardized cross-environment agent evaluation.

harbor-framework/harbor framework

Open GitHub

LangChain AgentEvals package for evaluating agent trajectories, including trajectory matching and graph-based evaluation.

langchain-ai/agentevals framework

Open GitHub

HAL harness for centralized and reproducible evaluation across agent benchmarks.

princeton-pli/hal-harness

Open GitHub

Repository for SWE-agent, the LM-based agent system that attempts to fix GitHub issues using an agent-computer interface and configurable tools.

SWE-agent/SWE-agent framework

Open GitHub

Repository containing SWE-Bench-CL data, dataset construction scripts, naive and agentic evaluation procedures, and LangGraph/FAISS-based continual-learning agent implementations.

thomasjoshi/agents-never-forget dataset

Open GitHub

Repository linked by the paper for the SWE-CI benchmark, associated code, and evaluation resources.

SKYLENAGE-AI/SWE-CI dataset

Open GitHub

Repository for the SWE-EVO benchmark, including benchmark materials and evaluation support for coding agents in long-horizon software evolution scenarios.

SWE-EVO/SWE-EVO dataset

Open GitHub

Official repository released for the SWE-Lancer benchmark, public Diamond split, code, and evaluation environment.

openai/SWELancer-Benchmark dataset

Open GitHub

Repository for SWE-QA-Pro, including evaluation materials for direct and agent modes and links to the paper and benchmark release.

TIGER-AI-Lab/SWE-QA-Pro dataset

Open GitHub

Repository containing experiment code for SynthSAEBench SAE architecture evaluations.

decoderesearch/synth-sae-bench-experiments benchmark

Open GitHub

Codebase provided by the authors for reproducing TABCF experiments.

Panagiotou/TABCF implementation

Open GitHub

The paper states that this repository contains the reproducibility code for the benchmark of tabular classification methods.

machinelearningnuremberg/TabularStudy benchmark

Open GitHub

Official TaskBench code and dataset directory within the Microsoft JARVIS repository.

microsoft/JARVIS dataset

Open GitHub

A JSON file encoding TDD principles as governance objects with bibliographic grounding, human-oriented intent, AI-native interpretation, operational constraints, and anti-patterns.

shahbazsiddeeq/TDD-manifesto implementation

Open GitHub

Open-source Python framework extending SMAC with meta-learning and ensemble learning for pipeline automation.

automl/auto-sklearn framework

Open GitHub

Open-source Python implementation of sequential model-based algorithm configuration for hyperparameter optimization.

automl/SMAC3 framework

Open GitHub

Open-source Python Auto-Pipeline tool using genetic programming to optimize tree-structured machine learning pipelines.

EpistasisLab/tpot framework

Open GitHub

Open-source Python framework for automated feature generation from relational datasets.

Featuretools/featuretools framework

Open GitHub

Open-source Python tool for hyperparameter optimization.

hyperopt/hyperopt framework

Open GitHub

Open-source Keras-based framework for searching deep network architectures using Bayesian optimization and network morphism.

keras-team/autokeras framework

Open GitHub

Microsoft open-source toolkit for neural architecture search and hyperparameter tuning across local or cloud execution environments.

microsoft/nni framework

Open GitHub

Open-source TensorFlow framework for automatically learning neural network architectures and ensembles.

tensorflow/adanet framework

Open GitHub

Open LLaMA is the publicly available 13B model used in the paper's instruction-based fine-tuning experiments.

openlm-research/open_llama implementation

Open GitHub

Repository containing TextAtari agent/environment code, translators from Gym-style environments to natural language, prompt/decider modules, manuals, language trajectories, and visualization assets.

Lww007/Text-Atari-Agents system

Open GitHub

Python implementation of the three-stage TextReg pipeline, including RuleBank, gradient purification, semantic edit regularization, the guided optimizer, example execution scripts, and paper-aligned metrics.

luchengfu6/TextReg framework

Open GitHub

Repository accompanying the paper, with runnable TEP pipelines, configurations, benchmark support, and implementation code for deep compound AI system optimization.

MinghuiChen43/TEP framework

Open GitHub

Repository for the QA-FEEDBACK, Longformer reward-model, and T5/PPO experimental workflow used to study the reward-model accuracy paradox.

EIT-NLP/AccuracyParadox-RLHF implementation

Open GitHub

Repository associated with the AI Cosmologist paper, described by the paper as containing code and experimental data and by the repository README as containing configuration files, examples, and best AI-generated code for the Galaxy Zoo and Quijote demonstrations.

adammoss/aicosmologist system

Open GitHub

Repository linked by the paper for LLM uncertainty decomposition, with folders and scripts related to input uncertainty, decoding uncertainty, model uncertainty, data, models, utilities, and uncertainty scoring.

aditya-taparia/LLM-Uncertainty implementation

Open GitHub

Repository released by the authors for the agent-to-agent negotiation and transaction benchmark code and data.

ShenzheZhu/A2A-NT benchmark

Open GitHub

Meta Research repository associated with the paper's MAE→WSP pre-pretraining approach.

facebookresearch/maws implementation

Open GitHub

Public implementation of Progressive Transformers retrained and evaluated to provide a BLEU reference scale.

BenSaunders27/ProgressiveTransformersSLP

Open GitHub

Implementation repository for the sign-pose VAE variants introduced and evaluated in the paper.

GFaure9/SignPoseVAE implementation

Open GitHub

Public implementation of Sign-IDD retrained and evaluated as a non-latent diffusion reference baseline.

NaVi-start/Sign-IDD

Open GitHub

Repository containing the paper's transparency materials, including SP-1 AI-usage summary, SP-2 navigation index, SP-3 documentation-adequacy account, SP-4 process documentation, and SP-5 development records.

MicheleLoi/JPEP implementation

Open GitHub

A framework for training or evaluating agentic systems with reinforcement learning.

Agent-One-Lab/AgentFly framework

Open GitHub

A Microsoft framework listed as part of the agentic RL framework ecosystem.

microsoft/agent-lightning framework

Open GitHub

An open-source framework for scaling LLM reinforcement learning.

NovaSky-AI/SkyRL framework

Open GitHub

A repository for verifiable environments used in LLM reinforcement learning.

PrimeIntellect-ai/verifiers framework

Open GitHub

A benchmark for evaluating agents on real software engineering issues.

swe-bench/SWE-bench benchmark

Open GitHub

A framework for tool-use reinforcement learning with LLM agents.

TIGER-AI-Lab/verl-tool framework

Open GitHub

A web environment benchmark used to evaluate autonomous agents on browser tasks.

web-arena-x/webarena benchmark

Open GitHub

Repository for the paper's released data, including generated Code Llama outputs used to support follow-on work on ranking LLM code-generation candidates.

slp-rl/budget-realloc dataset

Open GitHub

Repository accompanying the paper, containing SPR task materials, generated research outputs, case studies for Agent Laboratory and The AI Scientist v2, and pitfall-detection code.

niharshah/AIScientistPitfalls dataset

Open GitHub

Open-source implementation of The AI Scientist-v2, one of the two AI scientist systems evaluated in the paper.

SakanaAI/AI-Scientist-v2 system

Open GitHub

Open-source implementation of Agent Laboratory, one of the two AI scientist systems evaluated in the paper.

SamuelSchmidgall/AgentLaboratory system

Open GitHub

Repository released by the authors for obtaining and preprocessing decaNLP datasets, training and evaluating models, reproducing experiments, and tracking decaScore progress.

salesforce/decaNLP

Open GitHub

Official repository releasing generated stories and the crowdworker and in-house human-evaluation results used by the paper.

ZhuohanX/TheNextChapter

Open GitHub

Meta Llama Guard 2 model-card repository path for the safeguard model used to classify whether generated outputs violate predefined safety categories.

meta-llama/PurpleLlama implementation

Open GitHub

Official Google Research codebase for reproducing the paper's prompt-tuning experiments, with training configurations, released prompts, and T5.1.1 LM-adapted checkpoints.

google-research/prompt-tuning

Open GitHub

T5 code, evaluation metrics, preprocessing routines, and checkpoint index used by the paper for its base models, data preparation, and reproducibility.

google-research/text-to-text-transfer-transformer

Open GitHub

Resources and out-of-domain development data for the MRQA 2019 shared task used in the paper's zero-shot question-answering transfer experiments.

mrqa/MRQA-Shared-Task-2019

Open GitHub

Repository associated with the Test-Driven AI Agent Definition paper, containing code or benchmark artifacts for compiling tool-using agents from behavioral specifications.

f-labs-io/tdad-paper-code benchmark

Open GitHub

GitHub Spec Kit is an open-source toolkit for specification-driven development using commands such as /speckit.constitution, /speckit.specify, /speckit.plan, /speckit.tasks, /speckit.analyze, and /speckit.implement.

github/spec-kit framework

Open GitHub

Official repository containing code and supporting materials for automated literature collection, deduplication, filtering, review assistance, and the paper's experiments.

trigaten/The_Prompt_Report

Open GitHub

Companion code and artifact repository containing analysis scripts, prompt templates, query lists, judge labels, numerical result JSONs, probe configurations, leakage diagnostics, Procrustes tests, and steering analyses.

amanmehta-maniac/refusal-residue-release

Open GitHub

A GitHub repository collecting papers related to LLM-based agents, linked by the survey as a related-papers resource.

WooooDyy/LLM-Agent-Paper-List

Open GitHub

Public repository for the AIDev dataset, schema/CSV files, replication materials, and example notebooks supporting analyses of autonomous coding agents in GitHub PR workflows.

SAILResearch/AI_Teammates_in_SE3 dataset

Open GitHub

Repository linked by the paper for the reproduction work associated with the study.

AIReproducibility2018 implementation

Open GitHub

Repository containing released evaluation experiments and artifacts for baseline agents and model configurations on TheAgentCompany.

TheAgentCompany/experiments

Open GitHub

Repository for the TheAgentCompany benchmark environment, tasks, data, and evaluation infrastructure introduced by the paper.

TheAgentCompany/TheAgentCompany dataset

Open GitHub

A project associated with the survey's core-competency test framework for LLM evaluation.

HITSCIR-DT-Code/Core-Competency-Test-for-the-Evaluation-of-LLMs dataset

Open GitHub

Repository containing the DeepFund code used to implement the live fund-investment benchmark, agent workflow, prompts, and evaluation system.

HKUSTDial/DeepFund system

Open GitHub

The official Time-MoE repository associated with the paper's architecture and released resources.

Time-MoE/Time-MoE dataset

Open GitHub

Repository containing code, data-processing materials, experiment scripts, results, and plotting notebooks for reproducing the paper's analyses.

felipemaiapolo/efficbench implementation

Open GitHub

Repository containing the tinyBenchmarks Python package, demos, tutorials, and links to tiny datasets for estimating LLM performance from curated small benchmark subsets.

felipemaiapolo/tinyBenchmarks package

Open GitHub

Official repository for the TLOB paper, including code folders for models, preprocessing, data handling, configuration, training scripts, requirements, and a backtesting script.

LeonardoBerti00/TLOB benchmark

Open GitHub

BMTools integrates the paper's evaluated tools and provides an open-source platform for extending foundation models with APIs and for building and sharing tool plugins.

OpenBMB/BMTools dataset

Open GitHub

Repository for ToolRoCo, a multi-turn tool-using LLM benchmark for collaborative robotic tasks with Cabinet, PackGrocery, and Sort tasks and four cooperation paradigms.

ColaZhang22/Tool-Roco benchmark

Open GitHub

Official repository containing the ToolAlpaca data, prompts, multi-agent generation code, training scripts, evaluation code, and recorded evaluation outputs.

tangqiaoyu/ToolAlpaca dataset

Open GitHub

Hosts the ToolCAD project website and paper-facing artifact page.

gongyifeiisme/toolcad-project system

Open GitHub

Repository currently titled Ziqiao-git/C-World but with README content for ToolGym. It describes ToolGym as an open-world tool-using environment built on 5,571 tools across 204 applications and includes code folders for task creation, tool retrieval, state controller, runtime, and evaluation.

Ziqiao-git/C-World dataset

Open GitHub

Repository for ToolBench/ToolLLM artifacts, including code, trained models, and a demo released by the authors.

OpenBMB/ToolBench dataset

Open GitHub

Repository for the ToolMisuseBench benchmark implementation, generator, evaluator, and experiment reproduction workflow.

akgitrepos/toolmisusebench benchmark

Open GitHub

Repository for ToolPRMBench, the benchmark and associated code/data for evaluating process reward models in tool-using agents.

David-Li0406/ToolPRMBench dataset

Open GitHub

Detectron2 is listed among object detection and image segmentation models/frameworks in the TorchTraceAP application table.

facebookresearch/detectron2 framework

Open GitHub

PyTorch Holistic Trace Analysis package referenced as the source of profiling metrics and rule-based trace-event runtime outlier analysis.

facebookresearch/HolisticTraceAnalysis package

Open GitHub

Open-source implementation of Self-Supervised Contrastive Pre-Training for Time Series via Time-Frequency Consistency.

mims-harvard/TFC-pretraining implementation

Open GitHub

GitHub repository for the Light Aircraft Game benchmark, which the paper includes as an application simulator in its benchmark characterization table.

liuqh16/CloseAirCombat benchmark

Open GitHub

The BosqueLanguage GitHub organisation is identified by the paper as the public location for experimental versions of the AISE-related systems, including the Bosque ecosystem components discussed in the paper.

BosqueLanguage framework

Open GitHub

Repository containing X-FM code and pre-trained models.

zhangxinsong-nlp/XFM implementation

Open GitHub

Repository containing code for survival-model calibration methods including CiPOT and CSD-style calibration, plus experiment reproduction resources.

shi-ang/MakeSurvivalCalibratedAgain implementation

Open GitHub

A GitHub repository linked by the paper/project page that curates efficient-agent papers and resources corresponding to the survey taxonomy.

yxf203/Awesome-Efficient-Agents

Open GitHub

Repository containing prompts and queries used in the experiments for the LLM-powered security alert investigation workflow.

Rub3cula/CyberHunt2025 system

Open GitHub

A maintained collection of papers, methods, benchmarks, and resources associated with RAG-reasoning and agentic deep-research systems.

DavidZWZ/Awesome-RAG-Reasoning

Open GitHub

Apollo is an open autonomous driving platform used as the representative automated-driving system context for the safety-requirements derivation task.

ApolloAuto/apollo system

Open GitHub

Code repository implementing the paper's unified MoE compression framework and proposed Expert Trimming methods.

CASE-Lab-UMD/Unified-MoE-Compression framework

Open GitHub

ROS framework for embodied intelligence applications using robot API configuration and LLM calls.

Auromix/ROS-LLM framework

Open GitHub

ROS 2 command-line interface extension with LLM support.

fujitatomoya/ros2ai system

Open GitHub

Stretch AI system orchestrating skills for language-directed mobile manipulation.

hello-robot/stretch_ai implementation

Open GitHub

MCP server connecting AI assistants to installed ROS 2 applications and system operations.

lpigeon/ros-mcp-server implementation

Open GitHub

Project integrating llama.cpp with ROS 2 to enable LLM inference.

mgonzs13/llama_ros implementation

Open GitHub

Tool using LLMs to generate ROS codebases from high-level descriptions.

RoboCoachTechnologies/ROScribe system

Open GitHub

ROS/ROS 2 MCP server enabling natural-language commands and monitoring of robot states and sensor data.

robotmcp/ros-mcp-server

Open GitHub

ROS-MCP project included among the paper's representative ROS/MCP integrations.

Yutarop/ros-mcp implementation

Open GitHub

Repository containing the implemented prototype, generated code, and execution traces for the LLM workflow generation experiments.

dos-group/LLMWorkflowGenerator system

Open GitHub

Repository linked by the authors as the paper list for the survey on reasoning in large language models.

jeffhj/LM-reasoning

Open GitHub

Haystack is the framework used to implement the paper's RAG workload with retrieval and question-answering pipeline behavior.

deepset-ai/haystack framework

Open GitHub

GitHub repository stated by the paper as the available source code for the multimodal financial forecasting work.

sarthak-12/thesis-dsaa framework

Open GitHub

DeepMarket is the official open-source Python framework for LOB market simulation with deep learning. It contains TRADES and CGAN implementations/checkpoints, ABIDES-based simulation components, evaluation utilities, and the TRADES-LOB synthetic dataset.

LeonardoBerti00/DeepMarket dataset

Open GitHub

Repository for TradeTrap, the paper's system-level stress-testing framework for LLM-based autonomous trading agents.

Yanlewen/TradeTrap framework

Open GitHub

Repository for TradingAgents, the multi-agent LLM financial trading framework introduced and evaluated in the paper.

TauricResearch/TradingAgents framework

Open GitHub

Repository for BIG-bench, one of the principal benchmark suites used to compare Chinchilla with Gopher across diverse language-model capabilities.

google/BIG-bench benchmark

Open GitHub

Repository containing released model samples for sampling-based NLP evaluations reported in the paper.

openai/following-instructions-human-feedback

Open GitHub

Repository containing reference code for ILF experiments, refinement scoring, reward models, evaluation scripts, and links to the released SLF5K and finetuning datasets; the authors note that excluded data-generation and cleaning steps mean it is not a ready-to-run reproduction package.

JeremyAlain/imitation_learning_from_language_feedback implementation

Open GitHub

Repository for TRAJECT-Bench, including public data, tool definitions, query generation materials, and evaluation scripts for model and ReAct-style agentic tool-use evaluation.

PengfeiHePower/TRAJECT-Bench dataset

Open GitHub

GitHub data source cited for the Stochastic Block Model dynamic graph benchmark used in the experiments.

IBM/EvolveGCN dataset

Open GitHub

Official ROLAND repository used by the authors to run the ROLAND baseline five times for MAP/MRR comparison.

snap-stanford/roland implementation

Open GitHub

Repository containing Python code, prompt files, forex price data, annotated sentiment data, prediction files, and scripts for reproducing prompt runs and comparative results.

giorgosfatouros/Financial-Sentiment-Analysis-with-ChatGPT dataset

Open GitHub

Repository released by the authors containing code, generated translations, and human quality assessments for the quality-aware cascaded translation system.

deep-spin/translate-smart implementation

Open GitHub

Repository containing Tree of Thoughts code, task implementations, prompts, and logged experimental trajectories.

princeton-nlp/tree-of-thought-llm framework

Open GitHub

Official AG2 example repository from which the paper draws four representative MAS applications for case-study evaluation.

ag2ai/build-with-ag2

Open GitHub

Open-source implementation repository for the TrinityGuard framework introduced by the paper.

AI45Lab/TrinityGuard framework

Open GitHub

Repository for TrustAgent code, safety regulations, assets/data, and experiment-running instructions.

agiresearch/TrustAgent dataset

Open GitHub

A custom library referenced by the paper as the interface through which generated Python code controlled the Boston Dynamics Spot robot.

sheepskins/spottyai system

Open GitHub

GitHub repository for the Electricity Transformer Dataset used as ETTh1 and ETTm1 benchmark data in the experiments.

zhouhaoyi/ETDataset dataset

Open GitHub

Repository for a research demonstration of the Lean-Agent Protocol, including a frontend, FastAPI orchestrator, Lean worker, policy environment, audit log, and natural-language-to-Lean/back-translation workflow.

arkanemystic/lean-agent-protocol framework

Open GitHub

Repository containing the authors' implementation of UMoE, the shared-expert architecture that unifies attention-MoE and FFN-MoE modules.

ysngki/UMoE system

Open GitHub

Repository associated with the uncertainty-manipulation attack and confidentiality-preserving audit protocol.

cleverhans-lab/confidential-guardian system

Open GitHub

Repository associated with training-dynamics-based selective classification experiments.

cleverhans-lab/sc implementation

Open GitHub

Repository associated with experiments and analysis for decomposing the selective-classification gap.

cleverhans-lab/sc-gap implementation

Open GitHub

Repository used to reproduce or support the private selective-classification experiments.

cleverhans-lab/selective-classification implementation

Open GitHub

The vLLM repository provides the production inference engine and fused MoE execution pipeline into which the authors integrate their activation-sparse routed-expert code path.

vllm-project/vllm framework

Open GitHub

Open-source package for replicating experiments, with raw trajectories hosted on Hugging Face, analysis scripts, and data referenced in the paper.

ARiSE-Lab/understanding-apr-agents

Open GitHub

Official SWE-bench experiments repository used to retrieve public agent logs, trajectories, and patch diffs for studied configurations.

swe-bench/experiments

Open GitHub

Repository containing metadata, analysis data, tools, and code for the cryptocoin correlation analysis.

quapsale/cryptoanalytics dataset

Open GitHub

Official repository containing IteraTeR datasets, preprocessing code, intent-classification code, revision-model training and evaluation code, and demonstration materials.

vipulraheja/IteraTeR dataset

Open GitHub

Repository for MAFBench, the unified benchmark suite introduced by the paper for controlled evaluation of multi-agent LLM frameworks.

CoDS-GCS/MAFBench benchmark

Open GitHub

Repository for ORCA, described as a step toward automating multi-agent system construction from high-level task descriptions using empirical benchmark evidence and cost-aware execution models.

CoDS-GCS/ORCA system

Open GitHub

Public Python repository containing data, source code, scripts, tests, results, analysis outputs, and paper figures for the Logic-in-LLMs study.

XAheli/Logic-in-LLMs dataset

Open GitHub

Repository for the LLM-ification of CHI review, including sampled CHI papers, qualitative codes, metadata, taxonomy images, and supplementary materials.

rrrrrrockpang/llm-chi dataset

Open GitHub

Open-source implementation of LIME used by the authors to produce local feature-based explanations for the income prediction and biography classification tasks.

marcotcr/lime package

Open GitHub

Source code for USEagent, the unified software-engineering agent architecture evaluated in the paper.

nus-apr/USEagent framework

Open GitHub

Source code for USEbench, the unified software-engineering benchmark used to evaluate USEagent and baselines.

nus-apr/USEbench dataset

Open GitHub

Official implementation of CROSS for the paper, including model code, utilities for LLM temporal-chain embeddings, training scripts for temporal link prediction, logs, and dataset-preparation instructions.

SiweiPro/CROSS framework

Open GitHub

Repository collecting materials related to data assessment and selection for language-model instruction tuning.

yuleiqin/fantastic-data-engineering

Open GitHub

Repository for the source code used in the paper's statistical-significance benchmark of online regression over multiple datasets.

mabushaera/Online-Regression-Statistical-Significance benchmark

Open GitHub

Repository for the TextFusionHTS framework and experiments reported by the paper.

xinzzzhou/TextFusionHTS framework

Open GitHub

Official implementation of the precedent-based reaction-plausibility evaluator used as URSA's automated Solv-2 component.

insilicomedicine/ChemCensor

Open GitHub

Official implementation of URSA, including route validation, building-block checks, collapsed route variants, ChemCensor scoring, and dataset-level Solv-N metrics.

insilicomedicine/URSA

Open GitHub

Repository accompanying USF-MAE with pretraining code, preprocessing notebooks, pretrained checkpoints, figures, and access links for OpenUS-46.

Yusufii9/USF-MAE dataset

Open GitHub

Code repository for D3PO, the paper's reward-model-free direct preference fine-tuning method for diffusion models.

yk7333/D3PO implementation

Open GitHub

Repository containing code for 'Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning', including the simulation pipeline, scripts, installation instructions, and training/evaluation commands.

UWRobotLearning/RISE implementation

Open GitHub

Repository containing code for cryptocurrency price prediction using RNN-based models and comparison of LSTM, GRU, and Bi-LSTM methods.

shamima08/Cryptocurrency-Price-Prediction-using-RNN implementation

Open GitHub

Stable Baselines3 is cited as the implementation source for state-of-the-art baseline algorithms such as SAC, TD3, and PPO used in the experiments.

DLR-RM/stable-baselines3 package

Open GitHub

Codebase for the variational-quantum-circuit DDPG/DQN portfolio agents, baselines, training, evaluation, and reproducibility workflow introduced in the paper.

VincentGurgul/qrl-dpo-public framework

Open GitHub

Flask repository used to illustrate semantic, Louvain, and label-propagation clustering over source files.

pallets/flask

Open GitHub

Python Poetry repository used to demonstrate graph construction, object statistics, issue-driven retrieval, and top-k results.

python-poetry/poetry

Open GitHub

OWASP IoTGoat is deliberately insecure OpenWrt-based firmware containing vulnerability challenges mapped to the OWASP IoT Top 10.

OWASP/IoTGoat benchmark

Open GitHub

Official API and supporting resources for running the VirtualHome household simulator and executing activity programs.

xavierpuigf/virtualhome dataset

Open GitHub

Unity source code for constructing VirtualHome environments and translating activity programs into low-level executable character actions.

xavierpuigf/virtualhome_unity

Open GitHub

Repository containing raw collected data, Study 1 to Study 3 folders, preregistration documents, analysis scripts written and executed by the system, analysis outputs, and manuscripts.

Explore-Science/Virtuous-Machines-Towards-Artificial-General-Science dataset

Open GitHub

Official repository linked by the paper for Visual Semantic Entropy.

tadeephuy/visual-semantic-entropy

Open GitHub

Repository containing code associated with W-RAG weak-label generation and retriever fine-tuning experiments.

jmnian/WRAG framework

Open GitHub

Repository containing code and data for BadAgents, including poisoned data and code paths for Query-Attack, Observation-Attack, and Thought-Attack experiments.

lancopku/agent-backdoor-attacks dataset

Open GitHub

The GitHub repository hosts the WebArena code, browser environment, evaluation harness, configuration files, Docker environment resources, prompts, scripts, and reproduction materials for the paper.

web-arena-x/webarena dataset

Open GitHub

Repository for the WebShop environment, product and instruction setup, search engine, baseline rule/IL/RL models, tests, and sim-to-real transfer code.

princeton-nlp/WebShop benchmark

Open GitHub

Repository for the ICLR 2025 AI feedback tool that evaluated submitted reviews for vagueness or genericity, possible misunderstanding of the paper, and unprofessional tone, then generated private improvement suggestions.

zou-group/review_feedback_agent system

Open GitHub

Repository for LLMAgora, the configurable arena used to run two-agent scenarios with public utterances, private reflections, surveys, parameter sweeps, logging, and optional semantic, NLI, and emotion analyses.

danmohad/LLMAgora

Open GitHub

Paper-linked GitHub URL for the BigData22 stock-movement dataset; the URL returned 404 during extraction, so repository availability was not verified.

stocktweet/stock-tweet dataset

Open GitHub

Repository associated with Hybrid Deep Sequential Modeling for Social Text-Driven Stock Prediction and its dataset.

wuhuizhe/CHRNN dataset

Open GitHub

Repository releasing a stock movement prediction dataset from tweets and historical stock prices.

yumoxu/stocknet-dataset dataset

Open GitHub

Repository for StockAgent, the LLM-based multi-agent stock trading simulation framework studied in the paper.

MingyuJ666/Stockagent framework

Open GitHub

A research-grade signal-only decision-support system for cross-sectional ranking of AI-focused U.S. equities with uncertainty quantification, regime-aware deployment gating, PIT-safe data handling, and walk-forward evaluation outputs.

sinsasanderink/AIStockForecaster-PIT-Safe-Ranking-First-Signals-for-AI-Equities-FMP-Kronos-FinText-TSFM- framework

Open GitHub

Official implementation of OntoGraphRAG v1.0.0, the experiment harness, scripts for all tables and figures, per-query run logs, GPS replay stores, robustness artefacts, and reproducibility manifests used by the paper.

julka01/OntoGraphRAG

Open GitHub

MM-TSFlib implementation used for Informer, FEDformer, PatchTST, iTransformer, and DLinear backbone experiments.

AdityaLab/MM-TSFlib benchmark

Open GitHub

TimeCMA codebase used as an evaluated aligning-based MMTS comparison.

ChenxiLiu-HNU/TimeCMA implementation

Open GitHub

LeRet codebase used as an evaluated aligning-based MMTS comparison.

hqh0728/LeRet implementation

Open GitHub

Time-LLM codebase used as an evaluated aligning-based comparison model.

KimMeen/Time-LLM implementation

Open GitHub

Codebase for the Context is Key forecasting benchmark, reproduced to examine LLM performance scaling.

ServiceNow/context-is-key-forecasting benchmark

Open GitHub

Repository containing an anonymized CAIA evaluator implementation, a benchmark.csv dataset with 178 evaluation questions, evaluation scripts for with-tool and without-tool settings, mock tools, prompts, and dependencies.

caiba-ai/caia-benchmark-0927 dataset

Open GitHub

Repository containing the compact RLHF pipeline, transition classifier, experiment configurations, analysis scripts, manuscript sources, result tables, figures, examples, and a Gradio-based response-comparison interface.

zabahana/rlhf-failure-modes-diagnostics framework

Open GitHub

Repository identified as the code for 'When Routing Collapses: On the Degenerate Convergence of LLM Routers'.

AIGNLAI/EquiRouter implementation

Open GitHub

Repository stated by the paper as the location where code and data for AgentDebug will be available.

ulab-uiuc/AgentDebug dataset

Open GitHub

NVIDIA's transformer inference engine, extended in the paper with DeepSpeed MoE support, TUPE attention, expert routing, quantized MoE computation, and batch pruning.

NVIDIA/FasterTransformer framework

Open GitHub

Repository containing prompts, research ideas, selected outputs, and failure analyses for the four autonomous research attempts, including MARL-idea, SALVO-WM-idea, SDTS-WM-idea, SemEnt-ALGN-idea, and workflow prompts.

Lossfunk/ai-scientist-artefacts-v1

Open GitHub

Repository for preparing, analyzing, and visualizing survey responses for the 'Will Agents Replace Us?' preprint project, including Python scripts for data preparation, exploratory analysis, inferential analysis, and manuscript figure generation.

nkkko/agent-perceptions dataset

Open GitHub

Repository for DeepFund, a platform intended to evaluate LLM trading capability across financial markets using a unified environment, multi-agent system, external information ingestion, trading decisions, and arena-style performance presentation.

HKUSTDial/DeepFund framework

Open GitHub

Repository for Latency Sensitive Benchmarks, including HFTBench and StreetFighter benchmark code and evaluation examples for latency-aware LLM-agent assessment.

HaoKang-Timmy/LatencySensitiveBench benchmark

Open GitHub

Repository linked by the paper for the proposed automated wireless-agent workflow design system.

jwentong/WirelessAgent-R2 framework

Open GitHub

Repository for WirelessBench, including the wireless tasks and released scoring code.

jwentong/WirelessBench dataset

Open GitHub

Framework used for running, managing, and reproducing web-agent experiments on BrowserGym benchmarks.

ServiceNow/AgentLab implementation

Open GitHub

Open-source Gym-style browser environment for implementing and evaluating web agents with rich observations and action spaces.

ServiceNow/BrowserGym framework

Open GitHub

Open-source benchmark package for evaluating browser agents on ServiceNow-based knowledge-work tasks.

ServiceNow/WorkArena dataset

Open GitHub

Official repository for the WorkflowLLM project, described as a data-centric framework for enhancing LLM workflow orchestration with WorkflowBench and WorkflowLlama resources.

OpenBMB/WorkflowLLM dataset

Open GitHub

Official implementation artifact for Worldscape-MoE, including training and inference entry points, modality-specific data preparation, validation tools, and optional offline VAE encoding.

EmbodiedCity/Worldscape-MoE.code

Open GitHub

Official code repository for training and evaluating Direct Preference Head models, including benchmark evaluation scripts and links to released model checkpoints.

Avelina9X/direct-preference-heads implementation

Open GitHub

Repository for the paper's XGBoost-based NEPSE log-return forecasting workflow and benchmark outputs.

sahajrajmalla/nepse-xgboost-forecasting dataset

Open GitHub

Public repository containing the cost-aware LLM routing system, training/data-preprocessing components, evaluation and serving pipeline, router tests, and documentation.

SalesforceAIResearch/xRouter framework

Open GitHub
