Voice AI for 31 Million Citizens: Building Sarathi
How we built the Sarathi Voice Agent to deliver government services in Assamese and Bodo using faster-whisper ASR, Bhashini TTS, and latency optimization that keeps kiosk responses within a 2-second budget.
The Problem: Government Services That No One Can Access
Assam has 31 million citizens. The majority speak Assamese; significant populations speak Bodo, Bengali, and other languages. Government services -- land records, welfare scheme enrollment, grievance filing -- are increasingly digitized, but the interfaces are in English or Hindi, behind forms that assume literacy and internet familiarity. The result is a digital divide where the people who need government services most are the least able to access them.
Sarathi was born from a simple idea: what if citizens could access any government service by simply speaking in their own language? At Sarathi Studio, we built a voice-first AI agent that understands Assamese and Bodo, navigates government service workflows, and responds with natural speech -- all running on kiosk hardware deployed in rural service centers.
This post covers the technical architecture, the unique challenges of low-resource language AI, and how we kept end-to-end response latency under two seconds on modest hardware.
Architecture Overview
The Sarathi pipeline has four stages:
- ASR (Automatic Speech Recognition): Convert citizen speech to text
- Intent Understanding: Determine which government service the citizen needs and extract relevant parameters
- Service Orchestration: Interact with government APIs to fulfill the request
- TTS (Text-to-Speech): Convert the response to natural speech in the citizen's language
Each stage presented unique challenges for low-resource languages.
ASR: faster-whisper with Language-Specific Tuning
We evaluated several ASR options for Assamese. Google Speech-to-Text supports Assamese but with a word error rate (WER) around 35% in our testing -- unusable for a transactional system. Commercial alternatives performed even worse. OpenAI's Whisper large-v3, however, achieved 18% WER out of the box on our test set.
We deployed Whisper using the faster-whisper library, which uses CTranslate2 for optimized inference. This gave us 4x faster decoding compared to the original Whisper implementation, which was critical for our latency budget.
from faster_whisper import WhisperModel

class SarathiASR:
    def __init__(self):
        self.model = WhisperModel(
            "large-v3",
            device="cuda",
            compute_type="float16",
            num_workers=2,
        )
        # Language-specific VAD parameters
        self.vad_parameters = {
            "threshold": 0.4,
            "min_speech_duration_ms": 250,
            "max_speech_duration_s": 30,
            "min_silence_duration_ms": 600,  # Assamese has longer pauses
            "speech_pad_ms": 200,
        }

    def transcribe(self, audio_path: str, language: str = "as"):
        segments, info = self.model.transcribe(
            audio_path,
            language=language,
            beam_size=5,
            best_of=5,
            vad_filter=True,
            vad_parameters=self.vad_parameters,
        )
        return {
            "text": " ".join([s.text for s in segments]),
            "language": info.language,
            "language_probability": info.language_probability,
            "duration": info.duration,
        }

To reduce WER further, we fine-tuned Whisper on 800 hours of Assamese speech data collected from All India Radio archives, local news broadcasts, and volunteer recordings. We also collected 200 hours of Bodo speech, making Sarathi one of the few AI systems with dedicated Bodo language support. Post-fine-tuning WER dropped to 12% for Assamese and 16% for Bodo.
The Voice Activity Detection Challenge
Assamese conversational speech has different prosodic patterns than English. Speakers use longer pauses between phrases, and the pitch contours that signal sentence boundaries are different. The default VAD parameters in faster-whisper, tuned for English, would frequently cut off speakers mid-sentence or merge separate utterances.
We spent two weeks tuning VAD parameters on a 50-hour validation set of conversational Assamese recorded in actual government service center environments -- with background noise, multiple speakers, and the acoustic characteristics of the kiosk hardware.
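To make that tuning concrete, here is a minimal sketch of the kind of parameter sweep we mean, assuming a held-out list of (audio_path, reference_text) pairs and the jiwer library for computing WER; the grid values and helper name are illustrative rather than our production tooling.

import itertools

from faster_whisper import WhisperModel
from jiwer import wer  # assumed WER metric; any equivalent works

def tune_vad(model: WhisperModel, validation_set, language: str = "as"):
    """Grid-search VAD parameters against held-out conversational recordings.

    `validation_set` is assumed to be a list of (audio_path, reference_text) pairs.
    """
    thresholds = [0.3, 0.4, 0.5]
    silence_values_ms = [400, 600, 800]
    best_params, best_wer = None, float("inf")

    for threshold, silence_ms in itertools.product(thresholds, silence_values_ms):
        vad_parameters = {
            "threshold": threshold,
            "min_silence_duration_ms": silence_ms,
            "speech_pad_ms": 200,
        }
        hypotheses, references = [], []
        for audio_path, reference in validation_set:
            segments, _ = model.transcribe(
                audio_path,
                language=language,
                vad_filter=True,
                vad_parameters=vad_parameters,
            )
            hypotheses.append(" ".join(s.text for s in segments))
            references.append(reference)
        score = wer(references, hypotheses)
        if score < best_wer:
            best_params, best_wer = vad_parameters, score
    return best_params, best_wer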
Intent Understanding: Structured Extraction over Classification
Rather than treating intent understanding as a classification problem (which would require training data for every possible intent), we use an LLM-based structured extraction approach. The citizen's transcribed speech is processed by a prompted Claude Haiku model that extracts a structured intent object:
INTENT_PROMPT = """
You are a government service assistant for Assam, India.
Given the citizen's request (translated to English), extract:
1. service_category: one of [land_records, welfare_schemes, grievance,
certificate, pension, ration_card, utility, general_inquiry]
2. action: what the citizen wants to do (check_status, apply, download, update, inquire)
3. parameters: relevant details (name, ID numbers, dates, locations)
4. confidence: your confidence in the extraction (0.0-1.0)
If the request is ambiguous, set confidence below 0.6 and include
clarifying_question in the response.
Citizen request: {transcribed_text}
Translated request: {translated_text}
Respond in JSON format.
"""When confidence is below 0.6, the system generates a clarifying question in the citizen's language rather than guessing. This is crucial -- an incorrect government form submission can waste weeks of a citizen's time.
Bhashini TTS: Making the System Speak
For text-to-speech, we integrate with India's Bhashini platform, which provides neural TTS voices for Indian languages including Assamese and Bodo. Bhashini's TTS is built on a modified VITS architecture trained on studio-quality recordings by native speakers.
The integration was straightforward, but latency was a concern. Bhashini's API adds 400-800ms of network latency depending on text length. For our kiosk deployment, we implemented two optimizations:
Chunked Streaming TTS
We split the response text at sentence boundaries and begin TTS synthesis for the first sentence while the LLM is still generating subsequent sentences. This hides the TTS latency behind the generation latency.
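A minimal sketch of that sentence-level pipelining, assuming the LLM response arrives as a stream of text deltas and that synthesize() wraps the Bhashini call; both names are illustrative.

import re

# Sentence boundaries: Latin punctuation plus the danda used in Assamese/Bodo text
SENTENCE_END = re.compile(r"(?<=[.?!।])\s+")

def stream_tts(token_stream, synthesize, min_chunk_length=10):
    """Synthesize each sentence as soon as it is complete, while the
    LLM is still generating the rest of the response.

    `token_stream` is assumed to yield text deltas from the LLM;
    `synthesize(text)` is assumed to wrap the Bhashini TTS call.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            # Only split once the pending text is long enough to be worth a TTS call
            match = SENTENCE_END.search(buffer, min_chunk_length)
            if match is None:
                break
            chunk, buffer = buffer[:match.start()], buffer[match.end():]
            yield synthesize(chunk)
    if buffer.strip():
        yield synthesize(buffer.strip())  # flush whatever remains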
Local TTS Cache
Government services involve repetitive phrases ("Your application has been submitted", "Please provide your Aadhaar number", "The current status of your request is..."). We pre-synthesize the 500 most common response fragments and cache the audio locally on the kiosk. Cache hit rate in production averages 35%, which eliminates TTS latency entirely for those fragments.
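A minimal sketch of the lookup path, assuming pre-synthesized Opus files keyed by a hash of language and text; the cache directory and the bhashini_tts wrapper are illustrative.

import hashlib
from pathlib import Path

CACHE_DIR = Path("/var/sarathi/tts_cache")  # illustrative kiosk path

def cache_key(language: str, text: str) -> str:
    return hashlib.sha256(f"{language}:{text}".encode("utf-8")).hexdigest()

def speak(text: str, language: str, bhashini_tts) -> bytes:
    """Return Opus audio for `text`, hitting the local cache when possible.

    The cache is populated offline by pre-synthesizing the most common
    response fragments; `bhashini_tts(text, language)` is assumed to wrap
    the remote API call.
    """
    cached = CACHE_DIR / f"{cache_key(language, text)}.opus"
    if cached.exists():
        return cached.read_bytes()        # no network latency
    return bhashini_tts(text, language)   # 400-800 ms over the network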
# Kiosk TTS configuration
tts:
  provider: bhashini
  fallback: local_cache
  cache:
    max_entries: 500
    preload_common_phrases: true
    audio_format: opus
    sample_rate: 22050
  streaming:
    enabled: true
    chunk_by: sentence
    min_chunk_length: 10  # characters
  languages:
    - code: as
      voice: assamese_female_01
    - code: brx
      voice: bodo_female_01

Latency Optimization: The 2-Second Budget
Citizens expect conversational responsiveness. Our target was end-to-end latency under 2 seconds from when the citizen stops speaking to when the system begins responding audibly. Here is how the budget breaks down:
- VAD end-of-speech detection: 200ms (the silence threshold)
- ASR transcription: 400ms (for typical 5-10 second utterances)
- Translation (Assamese to English for the LLM): 150ms
- Intent extraction (Claude Haiku): 350ms
- Service API call: 300ms (cached responses for common queries)
- Response translation (English to Assamese): 150ms
- TTS first chunk: 200ms (from cache) or 500ms (from Bhashini API)
Total: 1.75 seconds (cache hit) or 2.05 seconds (cache miss). We hit our budget for the majority of interactions.
The key architectural insight was parallelizing where possible. Translation and intent extraction can begin before ASR is fully complete -- we stream partial transcriptions and begin translation on the first complete sentence while ASR continues on the remainder.
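A minimal sketch of that overlap, assuming ASR exposes an async stream of completed sentences and that translate() and extract_intent() are awaitable wrappers around the translation service and Claude Haiku; all of these names are illustrative.

import asyncio

async def understand(asr_sentences, translate, extract_intent):
    """Overlap translation with ASR instead of running the stages serially.

    `asr_sentences` is assumed to be an async iterator of completed
    sentences from the streaming transcriber; `translate` and
    `extract_intent` are assumed to be awaitable service wrappers.
    """
    translation_tasks, original = [], []
    async for sentence in asr_sentences:
        original.append(sentence)
        # Start translating this sentence while ASR works on the next one
        translation_tasks.append(asyncio.create_task(translate(sentence)))
    translated = " ".join(await asyncio.gather(*translation_tasks))
    return await extract_intent(" ".join(original), translated)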
Kiosk Deployment Challenges
Deploying AI in rural service centers introduced constraints that cloud-first architectures never consider:
Intermittent connectivity. Internet connectivity in rural Assam is unreliable. The kiosk maintains a local fallback mode with a smaller Whisper model (medium) and a cached subset of common service interactions. The fallback mode handles approximately 60% of queries without internet access.
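A minimal sketch of how the kiosk might pick its ASR model based on connectivity, assuming a simple socket probe; the probe target, timeout, and compute types are illustrative.

import socket

from faster_whisper import WhisperModel

def online(host: str = "8.8.8.8", port: int = 53, timeout: float = 2.0) -> bool:
    """Cheap connectivity probe, re-checked between interactions."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def load_asr_model() -> WhisperModel:
    if online():
        # Normal mode: full model, cloud translation/intent/TTS available
        return WhisperModel("large-v3", device="cuda", compute_type="float16")
    # Fallback mode: smaller local model plus cached service interactions
    return WhisperModel("medium", device="cuda", compute_type="int8_float16")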
Power fluctuations. We experienced corrupted model files from sudden power loss. The kiosk now stores models on a read-only partition with checksums verified at boot.
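A minimal sketch of that boot-time check, assuming a JSON manifest of SHA-256 hashes shipped alongside the models on the read-only partition; the paths are illustrative.

import hashlib
import json
from pathlib import Path

MODEL_DIR = Path("/opt/sarathi/models")   # read-only partition (illustrative path)
MANIFEST = MODEL_DIR / "checksums.json"   # {"relative/path": "sha256 hex", ...}

def verify_models() -> bool:
    """Compare every model file against the shipped manifest at boot."""
    manifest = json.loads(MANIFEST.read_text())
    for relative_path, expected in manifest.items():
        digest = hashlib.sha256((MODEL_DIR / relative_path).read_bytes()).hexdigest()
        if digest != expected:
            return False  # refuse to start and trigger re-provisioning
    return True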
Acoustic environment. Service centers are noisy. We added a directional microphone array with beamforming to the kiosk hardware, which reduced WER in noisy environments by 8 points compared to a standard microphone.
User patience. Citizens unfamiliar with voice AI need explicit audio cues: a tone when the system is listening, a different tone when it is processing, spoken confirmation of what it understood before taking action. These UX details matter more than any model improvement.
Impact and Metrics
Sarathi has been deployed in 12 service centers across 4 districts in Assam. In the first three months:
- 4,200+ citizen interactions processed
- 78% task completion rate (citizen successfully completed their intended service action)
- Average interaction time: 3.2 minutes (compared to 25+ minutes with human-assisted form filling)
- Languages used: 71% Assamese, 18% Bodo, 11% Hindi
The most requested services are ration card status checks (31%), land record queries (24%), and welfare scheme eligibility inquiries (19%).
What Comes Next
We are working on three fronts: expanding language support to Mising and Karbi (two more indigenous languages of Assam), adding proactive outreach capabilities (the system calls citizens to notify them about scheme eligibility), and building an open-source toolkit so other states can deploy similar systems. The technology works. The challenge now is scaling the deployment and training local teams to maintain the kiosks.