Abstract
Reinforcement learning (RL) agents often suffer from high sample complexity in sparse- or delayed-reward settings, in part because they start with little prior knowledge. In contrast, large language models (LLMs) can provide subgoal structures, plausible trajectories, and abstract priors that support early learning. Yet heavy reliance on LLMs introduces scalability issues and risks dependence on unreliable signals.
We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early learning. This graph stores decision-relevant information, such as trajectory segments and subgoal decompositions, and is co-constructed from the agent's high-return experiences and LLM outputs, amortizing LLM queries into a persistent memory instead of relying on continuous real-time supervision.
The Challenge of LLM-RL Integration
While large language models offer rich prior knowledge for reinforcement learning, existing integration approaches face critical limitations:
Methods requiring per-step LLM supervision create bottlenecks, introduce latency, and limit autonomous decision-making.
Frequent LLM queries (500+ per training run) become computationally expensive and impractical for real-world deployment.
LLMs can hallucinate, provide inconsistent outputs, or lack grounding in physical environments, risking misleading guidance.
Heavy reliance on LLM outputs can override environment feedback, limiting the agent's ability to learn from actual interactions.
Key Contributions
Memory-Integrated Framework
A reinforcement learning agent that integrates LLM-derived guidance through a memory graph co-constructed from agent experience and offline or infrequent online LLM outputs.
Utility-Based Shaping
A novel utility-shaped advantage estimation that incorporates graph-derived utility into advantage computation, compatible with any advantage-based policy-gradient method.
Convergence Guarantees
Theoretical guarantees showing that decaying shaping influence preserves long-horizon convergence properties while correcting inaccuracies in LLM outputs.
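As a one-line illustration of why a vanishing shaping weight is benign in the limit (a paraphrase for intuition, not the paper's formal theorem): if \( |A_t| \) and \( |U_t| \) stay bounded while \( \eta_t \to 1 \) and \( \xi_t \to 0 \), then for the shaped advantage \( \tilde{A}_t = \eta_t A_t + \xi_t U_t \) defined in the Method Overview,
\[ |\tilde{A}_t - A_t| \le |\eta_t - 1|\,|A_t| + \xi_t\,|U_t| \longrightarrow 0, \]
so the policy gradient asymptotically matches the unshaped one and inaccuracies in the LLM-derived utility wash out.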
Empirical Validation
Demonstrated improvements in sample efficiency over RL baselines with performance comparable to continuous LLM supervision methods, using far fewer queries.
Method Overview
MIRA combines the strengths of RL and LLMs through a structured approach that maintains autonomy while leveraging prior knowledge.
Memory Graph Construction
Build an evolving graph storing trajectory segments, subgoal decompositions, and decision-relevant information from both agent experience and LLM suggestions.
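A minimal sketch of what such a memory graph could look like in code; the node fields, the high-return filter, and the `return_threshold` parameter are illustrative assumptions rather than the paper's exact data structure:

```python
from dataclasses import dataclass


@dataclass
class MemoryNode:
    """One decision-relevant entry (fields are illustrative)."""
    state_embedding: list   # compact representation of the visited state
    action: int             # action taken by the agent or suggested by the LLM
    subgoal: str            # subgoal label this segment serves
    source: str             # "agent" or "llm"
    confidence: float       # trust assigned to the entry


class MemoryGraph:
    """Evolving graph of trajectory segments and subgoal decompositions."""

    def __init__(self, return_threshold: float = 0.8):
        self.nodes: list[MemoryNode] = []
        self.edges: list[tuple[int, int]] = []   # temporal / subgoal links
        self.return_threshold = return_threshold

    def add_agent_segment(self, segment: list[MemoryNode], episode_return: float) -> None:
        # Only high-return experience is amortized into the graph.
        if episode_return < self.return_threshold:
            return
        start = len(self.nodes)
        self.nodes.extend(segment)
        # Link consecutive steps of the stored segment.
        self.edges.extend((i, i + 1) for i in range(start, len(self.nodes) - 1))

    def add_llm_segment(self, segment: list[MemoryNode]) -> None:
        # LLM-derived entries are stored once and reused, replacing repeated queries.
        start = len(self.nodes)
        self.nodes.extend(segment)
        self.edges.extend((i, i + 1) for i in range(start, len(self.nodes) - 1))
```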
Utility Signal Derivation
Compute utility based on similarity between agent behavior and stored trajectories, weighted by goal alignment and confidence scores.
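One plausible form of this computation, with cosine similarity standing in for whatever similarity measure the paper actually uses, and max-aggregation over stored entries as a further assumption:

```python
import numpy as np


def utility(state_embedding: np.ndarray,
            stored_embeddings: np.ndarray,     # (N, d) embeddings from the memory graph
            goal_alignment: np.ndarray,        # (N,) match between each entry and the current goal
            confidence: np.ndarray) -> float:  # (N,) trust score per entry
    """Utility U_t: similarity between the agent's current state and stored
    trajectories, weighted by goal alignment and confidence (illustrative)."""
    if stored_embeddings.shape[0] == 0:
        return 0.0
    # Cosine similarity against every stored entry.
    sims = stored_embeddings @ state_embedding / (
        np.linalg.norm(stored_embeddings, axis=1) * np.linalg.norm(state_embedding) + 1e-8
    )
    # Down-weight off-goal or low-confidence entries, keep the best match.
    return float(np.max(sims * goal_alignment * confidence))
```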
Advantage Shaping
Augment standard advantage estimates with utility-based guidance:
We compute the shaped advantage as \( \tilde{A}_t = \eta_t A_t + \xi_t U_t \), where \( \eta_t \) and \( \xi_t \) control the relative contributions of the environment-derived advantage and the memory-derived utility signal.
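In code, this shaping is a one-line modification on top of any standard advantage estimator (batch-wide scalar weights here are a simplifying assumption; the paper indexes them by timestep):

```python
import numpy as np


def shaped_advantages(advantages: np.ndarray,   # A_t from GAE or any other estimator
                      utilities: np.ndarray,    # U_t from the memory graph
                      eta: float,
                      xi: float) -> np.ndarray:
    """Return the shaped advantage A~_t = eta * A_t + xi * U_t."""
    return eta * advantages + xi * utilities
```

The result drops directly into an advantage-based policy-gradient loss (e.g. PPO's clipped surrogate) in place of the raw advantages.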
Adaptive Decay
As the policy improves, decay the shaping weight \( \xi_t \) so that learning ultimately optimizes the true environment reward while preserving the early-learning benefits of the memory-derived guidance.
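A minimal decay schedule consistent with this description; the exponential form and its rate are assumptions, since any schedule that drives the shaping weight toward zero as performance improves fits the recipe:

```python
def shaping_weight(xi_0: float, step: int, decay_rate: float = 1e-3) -> float:
    """Exponentially anneal xi_t toward 0 so late-stage updates are driven
    by environment reward alone (illustrative schedule)."""
    return xi_0 * (1.0 - decay_rate) ** step
```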
Evaluation Environments
| Environment | Sparse Reward | Partial Observability | Sequential Dependencies | Irreversible Dynamics | Distractors |
|---|---|---|---|---|---|
| FrozenLake | ✓ | ✓ | | | |
| RedBall | ✓ | ✓ | | | |
| LavaCrossing | ✓ | ✓ | ✓ | | |
| DoorKey | ✓ | ✓ | ✓ | | |
| RedBlueDoor | ✓ | ✓ | ✓ | | |
| Distracted DoorKey | ✓ | ✓ | ✓ | | ✓ |
Key Results
Performance Highlights
Sample Efficiency: MIRA achieves faster early-stage learning than PPO across all environments, reaching optimal or near-optimal performance in significantly fewer training iterations.
Query Efficiency: Comparable final performance to LLM4Teach while using ~95% fewer LLM queries, demonstrating effective amortization of LLM guidance into persistent memory.
Robustness: The memory-based approach remains stable even when late-stage LLM outputs are degraded or unreliable, because by then the agent has accumulated sufficient experience in memory.
Generalization: Consistent performance across diverse environments, including navigation tasks and settings with irreversible dynamics, sequential dependencies, and distractors.
Citation
@inproceedings{nourzad2026mira_iclr,
title={MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance},
author={Nourzad, Narjes and Joe-Wong, Carlee},
booktitle={International Conference on Learning Representations},
year={2026}
}
@inproceedings{nourzad2026mira_aaai,
title={Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning (Student Abstract)},
author={Nourzad, Narjes and Joe-Wong, Carlee},
booktitle={AAAI Conference on Artificial Intelligence},
year={2026}
}