Abstract
Reinforcement learning (RL) agents often suffer from high sample complexity in sparse- or delayed-reward settings, in part because they start with little prior knowledge. In contrast, large language models (LLMs) can provide subgoal structures, plausible trajectories, and abstract priors that support early learning. Yet heavy reliance on LLMs introduces scalability issues and risks dependence on unreliable signals.
We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early learning. This graph stores decision-relevant information, such as trajectory segments and subgoal decompositions, and is co-constructed from the agent's high-return experiences and LLM outputs, amortizing LLM queries into a persistent memory instead of relying on continuous real-time supervision.
The Challenge of LLM-RL Integration
While large language models offer rich prior knowledge for reinforcement learning, existing integration approaches face critical limitations:
Methods requiring per-step LLM supervision create bottlenecks, introduce latency, and limit autonomous decision-making.
Frequent LLM queries (500+ per training run) become computationally expensive and impractical for real-world deployment.
LLMs can hallucinate, provide inconsistent outputs, or lack grounding in physical environments, risking misleading guidance.
Heavy reliance on LLM outputs can override environment feedback, limiting the agent's ability to learn from actual interactions.
Key Contributions
Memory-Integrated Framework
A reinforcement learning agent that integrates LLM-derived guidance through a memory graph co-constructed from agent experience and offline or infrequent online LLM outputs.
Utility-Based Shaping
A novel utility-shaped advantage estimation that incorporates graph-derived utility into advantage computation, compatible with any advantage-based policy-gradient method.
Convergence Guarantees
Theoretical guarantees showing that decaying shaping influence preserves long-horizon convergence properties while correcting inaccuracies in LLM outputs.
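As a one-line illustration of why a vanishing shaping weight is benign in the limit (a paraphrase for intuition, not the paper's formal theorem): if \( |A_t| \) and \( |U_t| \) stay bounded while \( \eta_t \to 1 \) and \( \xi_t \to 0 \), then for the shaped advantage \( \tilde{A}_t = \eta_t A_t + \xi_t U_t \) defined in the Method Overview,
\[ |\tilde{A}_t - A_t| \le |\eta_t - 1|\,|A_t| + \xi_t\,|U_t| \longrightarrow 0, \]
so the policy gradient asymptotically matches the unshaped one and inaccuracies in the LLM-derived utility wash out.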
Empirical Validation
Demonstrated improvements in sample efficiency over RL baselines with performance comparable to continuous LLM supervision methods, using far fewer queries.
Method Overview
MIRA combines the strengths of RL and LLMs through a structured approach that maintains autonomy while leveraging prior knowledge.
Memory Graph Construction
Build an evolving graph storing trajectory segments, subgoal decompositions, and decision-relevant information from both agent experience and LLM suggestions.
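A minimal sketch of what such a memory graph could look like in code; the node fields, the high-return filter, and the `return_threshold` parameter are illustrative assumptions rather than the paper's exact data structure:

```python
from dataclasses import dataclass


@dataclass
class MemoryNode:
    """One decision-relevant entry (fields are illustrative)."""
    state_embedding: list   # compact representation of the visited state
    action: int             # action taken by the agent or suggested by the LLM
    subgoal: str            # subgoal label this segment serves
    source: str             # "agent" or "llm"
    confidence: float       # trust assigned to the entry


class MemoryGraph:
    """Evolving graph of trajectory segments and subgoal decompositions."""

    def __init__(self, return_threshold: float = 0.8):
        self.nodes: list[MemoryNode] = []
        self.edges: list[tuple[int, int]] = []   # temporal / subgoal links
        self.return_threshold = return_threshold

    def add_agent_segment(self, segment: list[MemoryNode], episode_return: float) -> None:
        # Only high-return experience is amortized into the graph.
        if episode_return < self.return_threshold:
            return
        start = len(self.nodes)
        self.nodes.extend(segment)
        # Link consecutive steps of the stored segment.
        self.edges.extend((i, i + 1) for i in range(start, len(self.nodes) - 1))

    def add_llm_segment(self, segment: list[MemoryNode]) -> None:
        # LLM-derived entries are stored once and reused, replacing repeated queries.
        start = len(self.nodes)
        self.nodes.extend(segment)
        self.edges.extend((i, i + 1) for i in range(start, len(self.nodes) - 1))
```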
Utility Signal Derivation
Compute utility based on similarity between agent behavior and stored trajectories, weighted by goal alignment and confidence scores.
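One plausible form of this computation, with cosine similarity standing in for whatever similarity measure the paper actually uses, and max-aggregation over stored entries as a further assumption:

```python
import numpy as np


def utility(state_embedding: np.ndarray,
            stored_embeddings: np.ndarray,     # (N, d) embeddings from the memory graph
            goal_alignment: np.ndarray,        # (N,) match between each entry and the current goal
            confidence: np.ndarray) -> float:  # (N,) trust score per entry
    """Utility U_t: similarity between the agent's current state and stored
    trajectories, weighted by goal alignment and confidence (illustrative)."""
    if stored_embeddings.shape[0] == 0:
        return 0.0
    # Cosine similarity against every stored entry.
    sims = stored_embeddings @ state_embedding / (
        np.linalg.norm(stored_embeddings, axis=1) * np.linalg.norm(state_embedding) + 1e-8
    )
    # Down-weight off-goal or low-confidence entries, keep the best match.
    return float(np.max(sims * goal_alignment * confidence))
```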
Advantage Shaping
Augment standard advantage estimates with utility-based guidance:
We compute the shaped advantage as \( \tilde{A}_t = \eta_t A_t + \xi_t U_t \), where \( \eta_t \) and \( \xi_t \) control the relative contributions of the environment-derived advantage and the memory-derived utility signal.
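In code, this shaping is a one-line modification on top of any standard advantage estimator (batch-wide scalar weights here are a simplifying assumption; the paper indexes them by timestep):

```python
import numpy as np


def shaped_advantages(advantages: np.ndarray,   # A_t from GAE or any other estimator
                      utilities: np.ndarray,    # U_t from the memory graph
                      eta: float,
                      xi: float) -> np.ndarray:
    """Return the shaped advantage A~_t = eta * A_t + xi * U_t."""
    return eta * advantages + xi * utilities
```

The result drops directly into an advantage-based policy-gradient loss (e.g. PPO's clipped surrogate) in place of the raw advantages.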
Adaptive Decay
As the policy improves, decay the shaping weight \( \xi_t \) so that learning ultimately optimizes the true environment reward while preserving the early-learning benefits of the memory-derived guidance.
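A minimal decay schedule consistent with this description; the exponential form and its rate are assumptions, since any schedule that drives the shaping weight toward zero as performance improves fits the recipe:

```python
def shaping_weight(xi_0: float, step: int, decay_rate: float = 1e-3) -> float:
    """Exponentially anneal xi_t toward 0 so late-stage updates are driven
    by environment reward alone (illustrative schedule)."""
    return xi_0 * (1.0 - decay_rate) ** step
```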
Evaluation Environments
| Environment | Sparse Reward | Partial Observability | Sequential Dependencies | Irreversible Dynamics | Distractors |
|---|---|---|---|---|---|
| FrozenLake | ✓ | ✓ | | | |
| RedBall | ✓ | ✓ | | | |
| LavaCrossing | ✓ | ✓ | ✓ | | |
| DoorKey | ✓ | ✓ | ✓ | | |
| RedBlueDoor | ✓ | ✓ | ✓ | | |
| Distracted DoorKey | ✓ | ✓ | ✓ | | ✓ |
Key Results
Performance Highlights
Sample Efficiency: MIRA achieves faster early-stage learning than PPO across all environments, reaching optimal or near-optimal performance in significantly fewer training iterations.
Query Efficiency: Comparable final performance to LLM4Teach while using ~95% fewer LLM queries, demonstrating effective amortization of LLM guidance into persistent memory.
Robustness: The memory-based approach remains stable even when late-stage LLM outputs are degraded or unreliable, because by then the agent has accumulated sufficient experience in memory.
Generalization: Consistent performance across diverse environments, including navigation tasks and settings with irreversible dynamics, sequential dependencies, and distractors.
Citation
@inproceedings{nourzad2026mira_iclr,
title={MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance},
author={Nourzad, Narjes and Joe-Wong, Carlee},
booktitle={International Conference on Learning Representations},
year={2026}
}
@inproceedings{nourzad2026mira_aaai,
title={Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning (Student Abstract)},
author={Nourzad, Narjes and Joe-Wong, Carlee},
booktitle={AAAI Conference on Artificial Intelligence},
year={2026}
}