Building Production-Ready RL Agents
A deep dive into creating reinforcement learning agents that work in production environments, with lessons learned from the Fundora project.
The Gap Between Research and Production
Reinforcement learning has produced remarkable results in controlled environments -- from mastering Atari games to beating world champions at Go. But anyone who has tried to deploy an RL agent into a live production system knows there is a massive chasm between a notebook demo and a reliable, maintainable service. When I started building the recommendation engine for Fundora, a personalized investment advisory platform, I ran headfirst into that chasm.
This post distills the hard-won lessons from that journey: what works, what breaks, and the engineering patterns that keep RL agents running smoothly in production.
Why RL for Fundora?
Fundora needed to recommend investment strategies that adapted to each user's risk tolerance, goals, and market conditions in real time. Traditional recommendation approaches -- collaborative filtering, static rule engines -- could not capture the sequential nature of financial decisions. Each recommendation changes the user's portfolio, which changes the optimal next recommendation. That temporal dependency is exactly what RL is designed for.
We modeled the problem as a contextual bandit with delayed rewards. The agent observes a user's current portfolio state, market signals, and behavioral features, then selects from a set of candidate strategies. Rewards are computed from portfolio performance over a rolling 30-day window.
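As a rough illustration, the observation-and-action interface looked something like the sketch below. The field names, the policy scoring call, and the helper are hypothetical simplifications, not Fundora's actual schema:

from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    # Illustrative feature groups; the real feature set was larger.
    portfolio_weights: np.ndarray   # current allocation across asset classes
    market_signals: np.ndarray      # e.g. volatility and rate indicators
    behavior_features: np.ndarray   # engagement and risk-tolerance proxies

    def to_vector(self) -> np.ndarray:
        return np.concatenate(
            [self.portfolio_weights, self.market_signals, self.behavior_features]
        )

def select_strategy(policy, obs: Observation, candidate_strategy_ids: list[int]) -> int:
    # The agent scores each candidate strategy and picks one per request;
    # the reward for that choice only arrives after the 30-day window closes.
    scores = policy.score(obs.to_vector(), candidate_strategy_ids)  # assumed policy API
    return candidate_strategy_ids[int(np.argmax(scores))]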
Lesson 1: Simulation Is Your Safety Net
The single most important investment we made was building a high-fidelity simulator before touching real user traffic. We replayed 18 months of historical market data and user interaction logs through the simulator, validating that our agent's policies produced reasonable outputs before any live deployment.
The simulator also served as our continuous integration test. Every model update ran through a battery of simulated scenarios -- bull markets, flash crashes, low-liquidity periods -- and had to pass minimum performance thresholds before promotion.
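A minimal sketch of what such a promotion gate can look like. The scenario names, thresholds, and simulator API below are illustrative, not our actual configuration:

# Hypothetical promotion gate: a candidate policy must clear a minimum
# score in every replayed scenario before it can be deployed.
SCENARIO_THRESHOLDS = {
    "bull_market": 0.80,
    "flash_crash": 0.60,
    "low_liquidity": 0.65,
}

def passes_promotion_gate(policy, simulator) -> bool:
    for scenario, threshold in SCENARIO_THRESHOLDS.items():
        score = simulator.run(policy, scenario=scenario)  # assumed simulator interface
        if score < threshold:
            print(f"FAILED {scenario}: {score:.2f} < {threshold:.2f}")
            return False
    return True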
Key takeaway: Never train or deploy an RL agent without a simulator. It is your test suite, your staging environment, and your rollback validation layer all in one.
Lesson 2: Reward Engineering Is the Real Challenge
Getting the reward function right consumed more engineering time than the model architecture itself. Our first attempt used raw portfolio return as the reward signal. The agent quickly learned to recommend high-volatility strategies because the upside rewards dominated. Users were unhappy.
We iterated through several reward formulations before settling on a risk-adjusted metric that combined Sharpe ratio, maximum drawdown penalty, and a user satisfaction proxy derived from engagement signals. The final reward function looked roughly like this:
def compute_reward(portfolio_return, volatility, max_drawdown, engagement_score):
    # Risk-adjusted return via a Sharpe-style ratio (epsilon avoids division by zero).
    sharpe = portfolio_return / (volatility + 1e-8)
    # Heavily penalize drawdowns beyond 15%.
    drawdown_penalty = max(0, max_drawdown - 0.15) * 10.0
    # Small bonus from engagement signals as a user-satisfaction proxy.
    engagement_bonus = engagement_score * 0.1
    return sharpe - drawdown_penalty + engagement_bonus

Key takeaway: Spend disproportionate time on reward engineering. A well-shaped reward function matters far more than a clever architecture.
Lesson 3: Offline-First Training with Online Fine-Tuning
We adopted a two-phase training strategy. Phase one uses offline RL (specifically Conservative Q-Learning) trained on historical interaction logs. This produces a reasonable baseline policy without any live experimentation risk. Phase two deploys the offline policy behind an epsilon-greedy exploration wrapper and fine-tunes with live feedback.
This approach let us launch with a policy that was already competent on day one, then improve continuously. We also implemented policy constraints -- the online policy could never deviate more than a configurable KL-divergence threshold from the offline baseline, preventing catastrophic drift.
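A simplified sketch of the exploration wrapper and the KL-divergence guard. The epsilon value, the 0.05 KL budget, and the use of per-state action distributions are illustrative assumptions:

import numpy as np

def act(online_probs: np.ndarray, epsilon: float = 0.05) -> int:
    # Epsilon-greedy over the online policy's action distribution.
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(online_probs)))
    return int(np.argmax(online_probs))

def within_kl_budget(online_probs: np.ndarray,
                     offline_probs: np.ndarray,
                     max_kl: float = 0.05) -> bool:
    # KL(online || offline): if the fine-tuned policy drifts too far from
    # the offline CQL baseline, the update is rejected.
    eps = 1e-8
    kl = float(np.sum(online_probs * np.log((online_probs + eps) / (offline_probs + eps))))
    return kl <= max_kl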
Lesson 4: Observability Changes Everything
RL agents are notoriously difficult to debug. Unlike supervised models where you can inspect predictions against labels, an RL agent's quality is only apparent over time. We built a custom observability stack that tracked:
- Action distributions per user segment (detecting mode collapse early)
- Reward trends over rolling windows (catching reward hacking)
- State coverage metrics (ensuring the agent explored diverse portfolio configurations)
- Policy divergence from the baseline (automated alerting on drift)
We exported these metrics to Prometheus, visualized them in Grafana dashboards, and set up PagerDuty alerts for anomalies. When the agent started recommending a single strategy to 80% of users one weekend, the action distribution alert caught it within 15 minutes.
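As an illustration of that action-distribution check, the core of a mode-collapse detector can be as simple as the sketch below (the 80% threshold matches the incident above; everything else is a simplified assumption):

from collections import Counter

def action_concentration(recent_actions: list[int]) -> float:
    # Fraction of traffic going to the single most-recommended strategy.
    counts = Counter(recent_actions)
    return max(counts.values()) / len(recent_actions)

def check_mode_collapse(recent_actions: list[int], threshold: float = 0.8) -> bool:
    # True when one action dominates a user segment's recent recommendations.
    return action_concentration(recent_actions) >= threshold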
Lesson 5: Versioned Model Serving with Rollback
We treated model versions the same way we treat code releases. Every trained policy was serialized with its full configuration, reward function hash, training data snapshot ID, and evaluation metrics. We used MLflow for tracking and a custom Kubernetes operator for blue-green deployment of model versions.
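A minimal sketch of registering a policy version with standard MLflow tracking calls; the experiment name, parameter names, and artifact path are illustrative rather than our exact setup:

import mlflow

def register_policy_version(policy_path: str,
                            reward_fn_hash: str,
                            data_snapshot_id: str,
                            eval_metrics: dict) -> str:
    mlflow.set_experiment("fundora-policy")
    with mlflow.start_run() as run:
        mlflow.log_param("reward_fn_hash", reward_fn_hash)
        mlflow.log_param("data_snapshot_id", data_snapshot_id)
        mlflow.log_metrics(eval_metrics)      # e.g. simulated Sharpe, max drawdown
        mlflow.log_artifact(policy_path)      # serialized policy weights and config
        return run.info.run_id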
Rollback was a first-class operation. If post-deployment monitoring detected degradation, the system could revert to the previous policy version within seconds. We exercised this capability in production three times during the first six months.
Architecture Overview
The production system consisted of four main components:
- Feature Store: Real-time user and market features served via Redis, with batch features computed hourly in Spark
- Policy Server: A FastAPI service wrapping the trained policy, deployed as a Kubernetes Deployment with HPA (a minimal sketch of its request path follows this list)
- Reward Pipeline: An asynchronous Kafka-based pipeline that computed delayed rewards and wrote them to the training data store
- Training Loop: A periodic retraining job running on GPU nodes, triggered when sufficient new reward data accumulated
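For the Policy Server, the request path reduces to something like the following. The request schema is simplified and the StubPolicy stands in for the real policy loaded from the model registry:

from fastapi import FastAPI
from pydantic import BaseModel

class StubPolicy:
    # Placeholder for the trained policy; the real one is pulled from the registry.
    version = "v0-stub"

    def select(self, features: list[float]) -> int:
        # The real policy scores candidate strategies; the stub returns a fixed action.
        return 0

app = FastAPI()
policy = StubPolicy()

class RecommendRequest(BaseModel):
    user_id: str
    features: list[float]   # assembled from the Redis/Spark feature store

@app.post("/recommend")
def recommend(req: RecommendRequest) -> dict:
    strategy_id = policy.select(req.features)
    return {"strategy_id": strategy_id, "policy_version": policy.version}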
What I Would Do Differently
If I were starting over, I would invest in better counterfactual evaluation methods earlier. We relied heavily on A/B testing for policy evaluation, which is slow and expensive. Techniques like doubly robust estimation and importance-weighted evaluators would have let us evaluate candidate policies offline with higher confidence, reducing the number of live experiments needed.
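For instance, a basic importance-weighted (IPS) estimator of a candidate policy's value from logged interactions looks roughly like this textbook sketch (not the evaluator we actually built; the clipping constant is an assumption):

import numpy as np

def ips_estimate(rewards: np.ndarray,
                 logged_action_probs: np.ndarray,
                 candidate_action_probs: np.ndarray,
                 clip: float = 10.0) -> float:
    # Inverse propensity scoring: reweight logged rewards by how much more
    # (or less) often the candidate policy would have taken the logged action.
    weights = np.clip(candidate_action_probs / logged_action_probs, 0.0, clip)
    return float(np.mean(weights * rewards))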
I would also standardize the action space more carefully from the beginning. We added new strategy types mid-project, which required retraining from scratch. A more modular action representation would have made this incremental.
Conclusion
Building production RL agents is fundamentally an engineering discipline, not just a modeling exercise. The model architecture matters, but the surrounding infrastructure -- simulation, observability, reward pipelines, versioned deployment -- is what determines whether your agent survives contact with real users. The Fundora project taught me that investing in these systems early pays compound returns as the product evolves.