Voice AI for 31 Million Citizens: Building Sarathi
How we built the Sarathi Voice Agent to deliver government services in Assamese and Bodo using faster-whisper ASR, Bhashini TTS, and latency optimization that keeps kiosk responses within a 2-second budget.
The Problem: Government Services That No One Can Access
Assam has 31 million citizens. The majority speak Assamese; significant populations speak Bodo, Bengali, and other languages. Government services -- land records, welfare scheme enrollment, grievance filing -- are increasingly digitized, but the interfaces are in English or Hindi, behind forms that assume literacy and internet familiarity. The result is a digital divide where the people who need government services most are the least able to access them.
Sarathi was born from a simple idea: what if citizens could access any government service by simply speaking in their own language? At Sarathi Studio, we built a voice-first AI agent that understands Assamese and Bodo, navigates government service workflows, and responds with natural speech -- all running on kiosk hardware deployed in rural service centers.
This post covers the technical architecture, the unique challenges of low-resource language AI, and how we kept end-to-end response latency under two seconds on modest hardware.
Architecture Overview
The Sarathi pipeline has four stages:
- ASR (Automatic Speech Recognition): Convert citizen speech to text
- Intent Understanding: Determine which government service the citizen needs and extract relevant parameters
- Service Orchestration: Interact with government APIs to fulfill the request
- TTS (Text-to-Speech): Convert the response to natural speech in the citizen's language
Each stage presented unique challenges for low-resource languages.
ASR: faster-whisper with Language-Specific Tuning
We evaluated several ASR options for Assamese. Google Speech-to-Text supports Assamese but with a word error rate (WER) around 35% in our testing -- unusable for a transactional system. Commercial alternatives performed even worse. OpenAI's Whisper large-v3, however, achieved 18% WER out of the box on our test set.
We deployed Whisper using the faster-whisper library, which uses CTranslate2 for optimized inference. This gave us 4x faster decoding compared to the original Whisper implementation, which was critical for our latency budget.
from faster_whisper import WhisperModel

class SarathiASR:
    def __init__(self):
        self.model = WhisperModel(
            "large-v3",
            device="cuda",
            compute_type="float16",
            num_workers=2,
        )
        # Language-specific VAD parameters
        self.vad_parameters = {
            "threshold": 0.4,
            "min_speech_duration_ms": 250,
            "max_speech_duration_s": 30,
            "min_silence_duration_ms": 600,  # Assamese has longer pauses
            "speech_pad_ms": 200,
        }

    def transcribe(self, audio_path: str, language: str = "as"):
        segments, info = self.model.transcribe(
            audio_path,
            language=language,
            beam_size=5,
            best_of=5,
            vad_filter=True,
            vad_parameters=self.vad_parameters,
        )
        return {
            "text": " ".join([s.text for s in segments]),
            "language": info.language,
            "language_probability": info.language_probability,
            "duration": info.duration,
        }

To reduce WER further, we fine-tuned Whisper on 800 hours of Assamese speech data collected from All India Radio archives, local news broadcasts, and volunteer recordings. We also collected 200 hours of Bodo speech, making Sarathi one of the few AI systems with dedicated Bodo language support. Post-fine-tuning WER dropped to 12% for Assamese and 16% for Bodo.
The Voice Activity Detection Challenge
Assamese conversational speech has different prosodic patterns than English. Speakers use longer pauses between phrases, and the pitch contours that signal sentence boundaries are different. The default VAD parameters in faster-whisper, tuned for English, would frequently cut off speakers mid-sentence or merge separate utterances.
We spent two weeks tuning VAD parameters on a 50-hour validation set of conversational Assamese recorded in actual government service center environments -- with background noise, multiple speakers, and the acoustic characteristics of the kiosk hardware.
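To make that tuning concrete, here is a minimal sketch of the kind of parameter sweep we mean, assuming a held-out list of (audio_path, reference_text) pairs and the jiwer library for computing WER; the grid values and helper name are illustrative rather than our production tooling.

import itertools

from faster_whisper import WhisperModel
from jiwer import wer  # assumed WER metric; any equivalent works

def tune_vad(model: WhisperModel, validation_set, language: str = "as"):
    """Grid-search VAD parameters against held-out conversational recordings.

    `validation_set` is assumed to be a list of (audio_path, reference_text) pairs.
    """
    thresholds = [0.3, 0.4, 0.5]
    silence_values_ms = [400, 600, 800]
    best_params, best_wer = None, float("inf")

    for threshold, silence_ms in itertools.product(thresholds, silence_values_ms):
        vad_parameters = {
            "threshold": threshold,
            "min_silence_duration_ms": silence_ms,
            "speech_pad_ms": 200,
        }
        hypotheses, references = [], []
        for audio_path, reference in validation_set:
            segments, _ = model.transcribe(
                audio_path,
                language=language,
                vad_filter=True,
                vad_parameters=vad_parameters,
            )
            hypotheses.append(" ".join(s.text for s in segments))
            references.append(reference)
        score = wer(references, hypotheses)
        if score < best_wer:
            best_params, best_wer = vad_parameters, score
    return best_params, best_wer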
Intent Understanding: Structured Extraction over Classification
Rather than treating intent understanding as a classification problem (which would require training data for every possible intent), we use an LLM-based structured extraction approach. The citizen's transcribed speech is processed by a prompted Claude Haiku model that extracts a structured intent object:
INTENT_PROMPT = """
You are a government service assistant for Assam, India.
Given the citizen's request (translated to English), extract:
1. service_category: one of [land_records, welfare_schemes, grievance,
certificate, pension, ration_card, utility, general_inquiry]
2. action: what the citizen wants to do (check_status, apply, download, update, inquire)
3. parameters: relevant details (name, ID numbers, dates, locations)
4. confidence: your confidence in the extraction (0.0-1.0)
If the request is ambiguous, set confidence below 0.6 and include
clarifying_question in the response.
Citizen request: {transcribed_text}
Translated request: {translated_text}
Respond in JSON format.
"""When confidence is below 0.6, the system generates a clarifying question in the citizen's language rather than guessing. This is crucial -- an incorrect government form submission can waste weeks of a citizen's time.
Bhashini TTS: Making the System Speak
For text-to-speech, we integrate with India's Bhashini platform, which provides neural TTS voices for Indian languages including Assamese and Bodo. Bhashini's TTS is built on a modified VITS architecture trained on studio-quality recordings by native speakers.
The integration was straightforward, but latency was a concern. Bhashini's API adds 400-800ms of network latency depending on text length. For our kiosk deployment, we implemented two optimizations:
Chunked Streaming TTS
We split the response text at sentence boundaries and begin TTS synthesis for the first sentence while the LLM is still generating subsequent sentences. This hides the TTS latency behind the generation latency.
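A minimal sketch of that sentence-level pipelining, assuming the LLM response arrives as a stream of text deltas and that synthesize() wraps the Bhashini call; both names are illustrative.

import re

# Sentence boundaries: Latin punctuation plus the danda used in Assamese/Bodo text
SENTENCE_END = re.compile(r"(?<=[.?!।])\s+")

def stream_tts(token_stream, synthesize, min_chunk_length=10):
    """Synthesize each sentence as soon as it is complete, while the
    LLM is still generating the rest of the response.

    `token_stream` is assumed to yield text deltas from the LLM;
    `synthesize(text)` is assumed to wrap the Bhashini TTS call.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            # Only split once the pending text is long enough to be worth a TTS call
            match = SENTENCE_END.search(buffer, min_chunk_length)
            if match is None:
                break
            chunk, buffer = buffer[:match.start()], buffer[match.end():]
            yield synthesize(chunk)
    if buffer.strip():
        yield synthesize(buffer.strip())  # flush whatever remains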
Local TTS Cache
Government services involve repetitive phrases ("Your application has been submitted", "Please provide your Aadhaar number", "The current status of your request is..."). We pre-synthesize the 500 most common response fragments and cache the audio locally on the kiosk. Cache hit rate in production averages 35%, which eliminates TTS latency entirely for those fragments.
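A minimal sketch of the lookup path, assuming pre-synthesized Opus files keyed by a hash of language and text; the cache directory and the bhashini_tts wrapper are illustrative.

import hashlib
from pathlib import Path

CACHE_DIR = Path("/var/sarathi/tts_cache")  # illustrative kiosk path

def cache_key(language: str, text: str) -> str:
    return hashlib.sha256(f"{language}:{text}".encode("utf-8")).hexdigest()

def speak(text: str, language: str, bhashini_tts) -> bytes:
    """Return Opus audio for `text`, hitting the local cache when possible.

    The cache is populated offline by pre-synthesizing the most common
    response fragments; `bhashini_tts(text, language)` is assumed to wrap
    the remote API call.
    """
    cached = CACHE_DIR / f"{cache_key(language, text)}.opus"
    if cached.exists():
        return cached.read_bytes()        # no network latency
    return bhashini_tts(text, language)   # 400-800 ms over the network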
# Kiosk TTS configuration
tts:
  provider: bhashini
  fallback: local_cache
  cache:
    max_entries: 500
    preload_common_phrases: true
    audio_format: opus
    sample_rate: 22050
  streaming:
    enabled: true
    chunk_by: sentence
    min_chunk_length: 10  # characters
  languages:
    - code: as
      voice: assamese_female_01
    - code: brx
      voice: bodo_female_01

Latency Optimization: The 2-Second Budget
Citizens expect conversational responsiveness. Our target was end-to-end latency under 2 seconds from when the citizen stops speaking to when the system begins responding audibly. Here is how the budget breaks down:
- VAD end-of-speech detection: 200ms (the silence threshold)
- ASR transcription: 400ms (for typical 5-10 second utterances)
- Translation (Assamese to English for the LLM): 150ms
- Intent extraction (Claude Haiku): 350ms
- Service API call: 300ms (cached responses for common queries)
- Response translation (English to Assamese): 150ms
- TTS first chunk: 200ms (from cache) or 500ms (from Bhashini API)
Total: 1.75 seconds (cache hit) or 2.05 seconds (cache miss). We hit our budget for the majority of interactions.
The key architectural insight was parallelizing where possible. Translation and intent extraction can begin before ASR is fully complete -- we stream partial transcriptions and begin translation on the first complete sentence while ASR continues on the remainder.
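A minimal sketch of that overlap, assuming ASR exposes an async stream of completed sentences and that translate() and extract_intent() are awaitable wrappers around the translation service and Claude Haiku; all of these names are illustrative.

import asyncio

async def understand(asr_sentences, translate, extract_intent):
    """Overlap translation with ASR instead of running the stages serially.

    `asr_sentences` is assumed to be an async iterator of completed
    sentences from the streaming transcriber; `translate` and
    `extract_intent` are assumed to be awaitable service wrappers.
    """
    translation_tasks, original = [], []
    async for sentence in asr_sentences:
        original.append(sentence)
        # Start translating this sentence while ASR works on the next one
        translation_tasks.append(asyncio.create_task(translate(sentence)))
    translated = " ".join(await asyncio.gather(*translation_tasks))
    return await extract_intent(" ".join(original), translated)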
Kiosk Deployment Challenges
Deploying AI in rural service centers introduced constraints that cloud-first architectures never consider:
Intermittent connectivity. Internet connectivity in rural Assam is unreliable. The kiosk maintains a local fallback mode with a smaller Whisper model (medium) and a cached subset of common service interactions. The fallback mode handles approximately 60% of queries without internet access.
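A minimal sketch of how the kiosk might pick its ASR model based on connectivity, assuming a simple socket probe; the probe target, timeout, and compute types are illustrative.

import socket

from faster_whisper import WhisperModel

def online(host: str = "8.8.8.8", port: int = 53, timeout: float = 2.0) -> bool:
    """Cheap connectivity probe, re-checked between interactions."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def load_asr_model() -> WhisperModel:
    if online():
        # Normal mode: full model, cloud translation/intent/TTS available
        return WhisperModel("large-v3", device="cuda", compute_type="float16")
    # Fallback mode: smaller local model plus cached service interactions
    return WhisperModel("medium", device="cuda", compute_type="int8_float16")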
Power fluctuations. We experienced corrupted model files from sudden power loss. The kiosk now stores models on a read-only partition with checksums verified at boot.
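A minimal sketch of that boot-time check, assuming a JSON manifest of SHA-256 hashes shipped alongside the models on the read-only partition; the paths are illustrative.

import hashlib
import json
from pathlib import Path

MODEL_DIR = Path("/opt/sarathi/models")   # read-only partition (illustrative path)
MANIFEST = MODEL_DIR / "checksums.json"   # {"relative/path": "sha256 hex", ...}

def verify_models() -> bool:
    """Compare every model file against the shipped manifest at boot."""
    manifest = json.loads(MANIFEST.read_text())
    for relative_path, expected in manifest.items():
        digest = hashlib.sha256((MODEL_DIR / relative_path).read_bytes()).hexdigest()
        if digest != expected:
            return False  # refuse to start and trigger re-provisioning
    return True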
Acoustic environment. Service centers are noisy. We added a directional microphone array with beamforming to the kiosk hardware, which reduced WER in noisy environments by 8 points compared to a standard microphone.
User patience. Citizens unfamiliar with voice AI need explicit audio cues: a tone when the system is listening, a different tone when it is processing, spoken confirmation of what it understood before taking action. These UX details matter more than any model improvement.
Impact and Metrics
Sarathi has been deployed in 12 service centers across 4 districts in Assam. In the first three months:
- 4,200+ citizen interactions processed
- 78% task completion rate (citizen successfully completed their intended service action)
- Average interaction time: 3.2 minutes (compared to 25+ minutes with human-assisted form filling)
- Languages used: 71% Assamese, 18% Bodo, 11% Hindi
The most requested services are ration card status checks (31%), land record queries (24%), and welfare scheme eligibility inquiries (19%).
What Comes Next
We are working on three fronts: expanding language support to Mising and Karbi (two more indigenous languages of Assam), adding proactive outreach capabilities (the system calls citizens to notify them about scheme eligibility), and building an open-source toolkit so other states can deploy similar systems. The technology works. The challenge now is scaling the deployment and training local teams to maintain the kiosks.