Reinforcement Learning for RAG: Bandits for Retrieval + RL for Generation

March 2026 • Aryan Patodiya


The Problem

Most Retrieval-Augmented Generation (RAG) systems treat retrieval as static. The retriever fetches documents using fixed parameters (k, reranker, embedding model), and the generator produces an answer.

But retrieval quality depends on the query (ambiguity, specificity, domain), the corpus (coverage, freshness), and the downstream task (factoid lookup vs. multi-hop reasoning). A static pipeline cannot adapt to these variations.

Core Idea

Treat retrieval as a multi-armed bandit problem: each arm is a retrieval configuration (a choice of k, reranker, or embedding model), and the bandit learns which configuration earns the highest downstream reward.
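As a minimal sketch of this idea (the arm definitions and class are illustrative, not from this post), a UCB1 bandit over retrieval configurations could look like:

```python
import math


class UCB1Bandit:
    """UCB1 over a discrete set of retrieval configurations (arms).

    Each arm is an illustrative retrieval config, e.g. (k, use_reranker).
    The reward is whatever downstream signal the generator produces.
    """

    def __init__(self, arms):
        self.arms = arms
        self.counts = [0] * len(arms)    # pulls per arm
        self.values = [0.0] * len(arms)  # running mean reward per arm

    def select(self) -> int:
        # Play every arm once before applying the UCB rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        total = sum(self.counts)
        scores = [
            self.values[i] + math.sqrt(2.0 * math.log(total) / self.counts[i])
            for i in range(len(self.arms))
        ]
        return max(range(len(self.arms)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental running mean of the observed rewards for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a live system, `select()` would choose the retrieval config for each incoming query and `update()` would feed back the reward derived from the generated answer.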

Then optimize the generator using reinforcement learning (e.g., GRPO), with rewards shaped by correctness, helpfulness, citations, or cost.
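A hedged sketch of what that could mean in code, combining a shaped scalar reward with GRPO-style group normalization (the weight values and signal names are assumptions, not from this post):

```python
def shaped_reward(correct: bool, citation_coverage: float,
                  tokens_used: int, token_budget: int = 2048,
                  w_correct: float = 1.0, w_cite: float = 0.3,
                  w_cost: float = 0.2) -> float:
    """Collapse correctness, citation coverage, and cost into one scalar.

    Weights are illustrative; a real system would tune them against
    human preferences or an eval set.
    """
    cost = min(tokens_used / token_budget, 1.0)
    return w_correct * float(correct) + w_cite * citation_coverage - w_cost * cost


def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize rewards within a group of
    completions sampled for the same prompt (reward minus the group
    mean, divided by the group standard deviation)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

The group normalization is what lets GRPO skip a learned value model: completions for the same prompt are scored against each other rather than against an absolute baseline.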

Closed Feedback Loop

Retrieval influences generation quality; the generation reward, in turn, updates the retrieval strategy.
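The loop above can be sketched end to end. Here `retrieve`, `generate`, and `score` are caller-supplied stand-ins (assumptions, not functions from this post), and a simple epsilon-greedy bandit stands in for the retrieval policy:

```python
import random


def rag_feedback_loop(configs, queries, retrieve, generate, score,
                      epsilon: float = 0.2, seed: int = 0):
    """One pass of the closed loop: an epsilon-greedy bandit picks a
    retrieval config per query, the generator answers from the retrieved
    docs, and the scored reward updates the bandit's estimates.

    This is a sketch of the control flow, not a production trainer.
    """
    rng = random.Random(seed)
    counts = [0] * len(configs)
    values = [0.0] * len(configs)
    for q in queries:
        # Explore a random config with probability epsilon, else exploit.
        if rng.random() < epsilon:
            arm = rng.randrange(len(configs))
        else:
            arm = max(range(len(configs)), key=values.__getitem__)
        docs = retrieve(q, configs[arm])
        answer = generate(q, docs)
        reward = score(q, answer)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return counts, values
```

In the full system described here, `score` would wrap the shaped generation reward, so the same signal that trains the generator also steers retrieval.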

The result: an adaptive RAG system that improves over time instead of degrading.

Why This Matters

A static RAG pipeline is tuned once and then drifts as queries, corpora, and usage change. Closing the loop turns every answered query into a training signal: the bandit keeps retrieval matched to real traffic, and the RL-tuned generator keeps improving on the rewards that matter.