Shipping AI Agents for Enterprise Clients
How we built and deployed vision-aware AI agents and real estate automation systems at AlysAI for enterprise clients.
From Demos to Deployments
Everyone is building AI agents. Few are shipping them to enterprise clients who expect uptime SLAs, audit trails, and predictable costs. At AlysAI, we spent the past year doing exactly that -- building and deploying autonomous AI agents for real estate firms and enterprise operations teams. This post covers the architectural decisions, failure modes, and engineering patterns that made it possible.
The Problem Space
Our primary clients were real estate enterprises that needed to automate document-heavy workflows: lease abstraction, property valuation analysis, compliance checking, and tenant communication. These tasks involved processing PDFs, images of floor plans, scanned contracts, and structured data from multiple systems. A human analyst might spend 4-6 hours per property doing work that our agents needed to complete in minutes.
The second category was vision-aware agents for property inspection. Using computer vision models combined with LLM reasoning, these agents could analyze property photos, identify maintenance issues, cross-reference against inspection checklists, and generate structured reports.
Architecture: The Agent Orchestration Layer
We settled on a multi-agent orchestration architecture rather than a single monolithic agent. Each agent specializes in a narrow task and communicates through a central orchestrator. This design was motivated by three hard-learned lessons:
Reliability through specialization. A single agent tasked with "analyze this entire lease document" would hallucinate terms approximately 12% of the time. Breaking the task into specialized sub-agents -- one for financial terms extraction, one for clause classification, one for date parsing -- brought the error rate below 2%.
Cost predictability. Enterprise clients need to forecast costs. By routing simpler sub-tasks to smaller models (GPT-4o-mini, Claude Haiku) and reserving larger models for complex reasoning steps, we reduced per-document processing costs by 60% while maintaining quality thresholds.
Auditability. Each sub-agent produces a typed, structured output with confidence scores. The orchestrator logs every decision point. When a client asks "why did the system flag this clause as non-standard?", we can trace the exact reasoning chain.
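To make the cost routing in the second lesson concrete, here is a minimal sketch; the task names and model identifiers in the table are illustrative, not our production configuration.

# Hypothetical routing table: cheap models handle routine sub-tasks,
# larger models are reserved for the complex reasoning steps.
ROUTING_TABLE = {
    "date_parsing": "gpt-4o-mini",
    "clause_classification": "claude-3-haiku-20240307",
    "financial_terms_extraction": "gpt-4o",  # complex reasoning step
}

DEFAULT_MODEL = "gpt-4o-mini"

def select_model(task_type: str) -> str:
    """Route each sub-task to the cheapest model that meets its quality bar."""
    return ROUTING_TABLE.get(task_type, DEFAULT_MODEL)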
The orchestrator itself is a directed acyclic graph (DAG) execution engine built on top of LangGraph. Each node is an agent with defined input/output schemas, and edges can be conditional based on confidence thresholds or classification results.
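A stripped-down sketch of that pattern using LangGraph's StateGraph: one extraction node, a conditional edge keyed on its confidence score, and a human-review node as the low-confidence branch. The state fields, node names, and the 0.85 threshold are illustrative rather than our production graph.

from typing import TypedDict

from langgraph.graph import StateGraph, END

class LeaseState(TypedDict):
    document_text: str
    financial_terms: dict
    confidence: float

def extract_financial_terms(state: LeaseState) -> dict:
    # The sub-agent call goes here; it returns a typed update plus a confidence score.
    return {"financial_terms": {"base_rent": 4200.0}, "confidence": 0.93}  # placeholder values

def human_review(state: LeaseState) -> dict:
    # In production this enqueues the document for manual review.
    return {}

def route_on_confidence(state: LeaseState) -> str:
    # Conditional edge: low-confidence extractions escalate to a human.
    return "accept" if state["confidence"] >= 0.85 else "review"

graph = StateGraph(LeaseState)
graph.add_node("extract_financial_terms", extract_financial_terms)
graph.add_node("human_review", human_review)
graph.set_entry_point("extract_financial_terms")
graph.add_conditional_edges(
    "extract_financial_terms",
    route_on_confidence,
    {"accept": END, "review": "human_review"},
)
graph.add_edge("human_review", END)
app = graph.compile()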
Vision Agents: Beyond Text
The property inspection agents were technically the most challenging. They needed to:
- Ingest a set of property photos (typically 20-50 per inspection)
- Classify each photo by room type and feature
- Identify visible issues (water damage, structural cracks, appliance condition)
- Cross-reference findings against the property's maintenance history
- Generate a structured inspection report with severity ratings
We used a pipeline approach: a fine-tuned CLIP model for initial classification, GPT-4 Vision for detailed analysis of flagged images, and a reasoning agent that synthesized findings into the final report.
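In outline, the pipeline looks roughly like the sketch below. The helpers are stubs standing in for the fine-tuned CLIP classifier, the vision-model call, and the report-synthesis agent, so their names and signatures are illustrative only.

def classify_with_clip(path: str) -> tuple[str, float]:
    return "kitchen", 0.91  # stub: real version runs the fine-tuned CLIP model

def analyze_with_vision_model(path: str, room_label: str) -> str:
    return "no visible issues"  # stub: real version calls the hosted vision model

def synthesize_report(findings: list, needs_review: list, history: dict) -> dict:
    return {"findings": findings, "needs_review": needs_review}  # stub: reasoning agent

def run_inspection_pipeline(photo_paths: list[str], maintenance_history: dict) -> dict:
    findings, needs_review = [], []
    for path in photo_paths:
        room_label, score = classify_with_clip(path)  # cheap first pass over every photo
        if score < 0.7:
            needs_review.append(path)  # ambiguous photos go to manual triage
            continue
        analysis = analyze_with_vision_model(path, room_label)  # detailed pass on flagged images
        findings.append({"photo": path, "room": room_label, "analysis": analysis})
    # The reasoning agent merges per-photo findings with maintenance history
    # into the structured report with severity ratings.
    return synthesize_report(findings, needs_review, maintenance_history)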
The key engineering challenge was handling image quality variance. Enterprise clients upload photos from phones in varying lighting conditions. We built a preprocessing pipeline that normalized exposure, detected blur, and rejected unusable images with explanations rather than producing unreliable analysis.
from dataclasses import dataclass

import cv2

@dataclass
class QualityResult:
    # Minimal result container assumed by the gate below.
    passed: bool
    issues: list
    metrics: dict

class ImageQualityGate:
    def __init__(self, blur_threshold=100.0, brightness_range=(40, 220)):
        self.blur_threshold = blur_threshold
        self.brightness_range = brightness_range

    def evaluate(self, image_path: str) -> QualityResult:
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # Variance of the Laplacian is a standard sharpness proxy:
        # low variance means few edges, i.e. a blurry image.
        laplacian_var = cv2.Laplacian(gray, cv2.CV_64F).var()
        mean_brightness = gray.mean()

        issues = []
        if laplacian_var < self.blur_threshold:
            issues.append("Image is too blurry for reliable analysis")
        if not (self.brightness_range[0] <= mean_brightness <= self.brightness_range[1]):
            issues.append("Image lighting is outside acceptable range")

        return QualityResult(
            passed=len(issues) == 0,
            issues=issues,
            metrics={"blur_score": laplacian_var, "brightness": mean_brightness},
        )

Handling Enterprise Requirements
Building the agent logic was perhaps 40% of the work. The remaining 60% was enterprise infrastructure:
Authentication and multi-tenancy. Every API call flows through a tenant-aware middleware that enforces data isolation. Client A's documents are never accessible to Client B's agents, even at the vector store level. We partition Pinecone namespaces by tenant ID and encrypt embeddings at rest.
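In practice, the isolation hinges on deriving the vector-store namespace from the authenticated tenant rather than from anything in the request. A minimal sketch, assuming the Pinecone client and an illustrative index name:

from pinecone import Pinecone

pc = Pinecone(api_key="...")  # loaded from the environment in production
index = pc.Index("lease-documents")  # illustrative index name

def tenant_query(tenant_id: str, embedding: list[float], top_k: int = 5):
    # The namespace comes from the authenticated tenant, never from request
    # parameters, so one client's vectors are invisible to another's agents.
    return index.query(
        vector=embedding,
        top_k=top_k,
        namespace=f"tenant-{tenant_id}",
        include_metadata=True,
    )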
Rate limiting and cost controls. We implemented per-tenant token budgets with configurable alerts. When a client's monthly usage approaches their contracted limit, the system automatically downgrades to smaller models for non-critical tasks and notifies the account manager.
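A simplified version of that budget gate, with the thresholds and in-memory counter standing in for the real shared store and notification hooks:

class TokenBudget:
    def __init__(self, monthly_limit: int, warn_ratio: float = 0.8, downgrade_ratio: float = 0.95):
        self.monthly_limit = monthly_limit
        self.warn_ratio = warn_ratio
        self.downgrade_ratio = downgrade_ratio
        self.used = 0

    def record(self, tokens: int) -> None:
        self.used += tokens

    def should_downgrade(self, task_is_critical: bool) -> bool:
        # Non-critical tasks switch to smaller models near the contracted limit.
        return not task_is_critical and self.used >= self.downgrade_ratio * self.monthly_limit

    def should_alert(self) -> bool:
        # In the real system this triggers the account-manager notification
        # once per threshold crossing; simplified here.
        return self.used >= self.warn_ratio * self.monthly_limit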
Compliance logging. Every LLM call is logged with full prompt, response, model version, latency, and token count. These logs are immutable (append-only to S3 with versioning) and retained for the contractually required period. Several clients in regulated industries required this for their own audit obligations.
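Each record is written as its own object to a versioned bucket, so nothing is ever overwritten; a minimal sketch, with the bucket name and key layout as assumptions:

import datetime
import json
import uuid

import boto3

s3 = boto3.client("s3")
AUDIT_BUCKET = "alys-llm-audit-logs"  # illustrative name; versioning enabled on the bucket

def log_llm_call(tenant_id: str, model: str, prompt: str, response: str,
                 latency_ms: float, tokens: int) -> None:
    record = {
        "tenant_id": tenant_id,
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_ms": latency_ms,
        "tokens": tokens,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # One object per call; objects are never modified, so the trail stays append-only.
    key = f"{tenant_id}/{record['timestamp'][:10]}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=AUDIT_BUCKET, Key=key, Body=json.dumps(record).encode("utf-8"))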
Graceful degradation. When OpenAI or Anthropic APIs experience outages, the system falls back through a priority chain: primary provider, secondary provider, cached results for identical inputs, and finally a "manual review required" state that queues the task for human processing.
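The chain itself is just an ordered list of provider callables tried in sequence before falling back to the cache and the review queue; a sketch, with the task shape and provider functions as assumptions:

class ManualReviewRequired(Exception):
    """Raised when no provider or cached result can serve the request."""

def run_with_fallback(task: dict, providers, cache: dict, review_queue: list):
    """providers: ordered callables, e.g. [call_primary, call_secondary]."""
    for call_provider in providers:
        try:
            return call_provider(task)
        except Exception:
            continue  # provider outage or error: fall through to the next option
    if task["cache_key"] in cache:
        return cache[task["cache_key"]]  # identical input already answered
    review_queue.append(task)  # final state: queued for human processing
    raise ManualReviewRequired(task["task_id"])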
The Reliability Stack
We deployed on AWS EKS with the following observability stack:
- Tracing: OpenTelemetry with Jaeger for end-to-end request tracing across agent hops
- Metrics: Prometheus + Grafana dashboards tracking latency percentiles, error rates, token usage, and agent-specific accuracy metrics
- Alerting: PagerDuty integration with tiered severity based on client SLA tier
- Testing: A regression test suite of 500+ document/image pairs with known correct outputs, run nightly
The regression suite was the most valuable investment. We caught three model-version-related regressions before they reached production, each of which would have produced incorrect financial figures in lease abstractions.
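The nightly run is essentially a parametrized test over golden input/output pairs; a sketch, assuming a hypothetical run_lease_abstraction entry point and directory layout:

import json
from pathlib import Path

import pytest

# Illustrative layout: each case directory holds an input document and the
# known-correct structured output produced by an earlier, verified run.
CASES = sorted(Path("regression_cases").glob("case_*"))

@pytest.mark.parametrize("case_dir", CASES, ids=lambda p: p.name)
def test_lease_abstraction_regression(case_dir):
    expected = json.loads((case_dir / "expected_output.json").read_text())
    actual = run_lease_abstraction(case_dir / "input.pdf")  # hypothetical entry point
    # Financial figures must match exactly; shifted or hallucinated numbers
    # are precisely the class of regression this suite exists to catch.
    assert actual["financial_terms"] == expected["financial_terms"]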
Lessons Learned
Start with the output schema, not the prompt. Enterprise clients care about structured, predictable outputs they can feed into their existing systems. Define the exact JSON schema of every agent's output before writing a single prompt. This forces clarity on what the agent actually needs to produce.
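For instance, a clause-classification agent's contract can be pinned down as a Pydantic model before any prompt is written; the fields below are examples, not the production schema.

from enum import Enum

from pydantic import BaseModel, Field

class ClauseCategory(str, Enum):
    STANDARD = "standard"
    NON_STANDARD = "non_standard"
    MISSING = "missing"

class ClauseClassification(BaseModel):
    # Illustrative fields: the point is that the schema exists before the prompt.
    clause_id: str
    category: ClauseCategory
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str  # short explanation surfaced in the audit trail

# Prompts are then written to fill this schema, and responses that fail
# validation are retried or escalated rather than passed downstream.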
Build the human-in-the-loop path first. No matter how good your agents are, some tasks will require human review. We built the escalation and review UI before the agents themselves. This meant we could ship a partially-automated solution immediately and increase automation coverage over time.
Measure accuracy on real client data, not benchmarks. Our agents performed beautifully on public datasets but struggled with the specific formatting conventions of certain law firms' lease documents. We built client-specific evaluation sets within the first week of every new engagement.
Version everything. Prompts, model versions, preprocessing pipelines, evaluation sets -- all versioned in git with automated promotion workflows. When a client reports an issue, we can reproduce exactly what the system did on that date with that configuration.
What Comes Next
We are now building agents that can operate across multiple enterprise systems -- pulling data from Salesforce, cross-referencing against documents in SharePoint, and updating records in client-specific ERPs. The orchestration layer we built scales naturally to these multi-system workflows, but the authentication and permission challenges multiply significantly. That is the next frontier.