Case Study

Voiux FSR API

Production-ready voice transcription for developers

Built a self-hosted voice transcription API powered by Whisper, offering developers a simple REST interface for converting audio to text with speaker diarization support.

Voiux API dashboard showing transcription results

Client

Voiux

A voice transcription API designed for developers who need reliable, accurate speech-to-text conversion without the complexity of managing ML infrastructure.

Studio

tuniverstudio

A boutique AI integration studio specialising in conversational agents, AI-first web experiences and voice receptionists for ambitious teams.

Deployment

Self-hosted (hw-asus:3006)

Full control over data and infrastructure.

Target users

Developers & integration teams

REST API designed for easy integration.

Core model

OpenAI Whisper

Industry-leading accuracy for multiple languages.

Implementation timeline

API design

Week 1

Designed REST API endpoints, request/response schemas, and error handling patterns following OpenAPI best practices.

Core implementation

Week 2

Built FastAPI backend with Whisper integration, audio preprocessing pipeline, and speaker diarization module.

Deployment & docs

Week 3

Deployed to self-hosted server, added authentication, rate limiting, and comprehensive API documentation.

Introduction & Background

Voice transcription is a cornerstone feature for many modern applications, from meeting notes to content accessibility. But integrating speech-to-text often means choosing between expensive cloud APIs that raise privacy concerns and wrestling with complex ML pipelines. Voiux was built to give developers a simple, self-hosted alternative with enterprise-grade accuracy.

  • Self-hosted means full data sovereignty
  • Whisper delivers state-of-the-art transcription accuracy

Problem & Constraints

Developers need reliable voice transcription that works across different audio formats, handles multiple speakers, and doesn't require sending sensitive recordings to third-party cloud services. The solution had to be easy to integrate via standard REST API while supporting advanced features like speaker diarization.

  • Must handle various audio formats (MP3, WAV, M4A, etc.)
  • Speaker diarization required for meeting transcription use cases

Solution Overview

We built a FastAPI-based transcription service that wraps OpenAI's Whisper model with production-ready features. The API accepts audio uploads, processes them through a preprocessing pipeline (normalization, noise reduction), runs Whisper transcription, and optionally performs speaker diarization using Pyannote Audio.

  • Clean REST API with JSON responses
  • Speaker timestamps and confidence scores included in output
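To make the output format concrete, here is a sketch of how a diarized response might be consumed. The exact schema (segment fields, speaker labels) is an assumption for illustration, not the published API spec:

```python
import json

# Hypothetical response shape: segments carrying speaker labels,
# timestamps, and per-segment confidence (assumed, not the real schema).
sample = json.loads("""
{
  "text": "Hello everyone. Thanks for joining.",
  "segments": [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.4,
     "text": "Hello everyone.", "confidence": 0.97},
    {"speaker": "SPEAKER_01", "start": 1.6, "end": 3.1,
     "text": "Thanks for joining.", "confidence": 0.94}
  ]
}
""")

# Group the transcribed text by speaker label.
by_speaker = {}
for seg in sample["segments"]:
    by_speaker.setdefault(seg["speaker"], []).append(seg["text"])

print(by_speaker["SPEAKER_00"])  # ['Hello everyone.']
```

Because each segment carries its own timestamps and confidence, clients can render speaker-attributed transcripts or filter low-confidence spans without any extra processing on their side.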

Implementation Journey

The system uses FFmpeg for audio preprocessing and format conversion. Whisper handles the core transcription with support for 99 languages. For speaker diarization, we integrated Pyannote Audio to identify and label different speakers in the recording. FastAPI provides automatic OpenAPI documentation and async request handling.

  • Audio preprocessing with FFmpeg for format normalization
  • Whisper large-v3 model for maximum accuracy
  • Speaker diarization with Pyannote Audio segmentation
  • RESTful API with automatic FastAPI-generated documentation
  • Docker containerization for consistent deployment
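The FFmpeg normalization step above can be sketched as follows. The specific flags are illustrative assumptions about the pipeline, showing how an arbitrary upload might be converted to the 16 kHz mono WAV input Whisper expects:

```python
from pathlib import Path

def ffmpeg_normalize_cmd(src: Path, dst: Path) -> list:
    """Build an FFmpeg command that converts any supported input
    format to 16 kHz mono 16-bit PCM WAV (Whisper's expected input)."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", str(src),       # input in any container/codec FFmpeg supports
        "-ac", "1",           # downmix to a single (mono) channel
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # encode as 16-bit little-endian PCM
        str(dst),
    ]

cmd = ffmpeg_normalize_cmd(Path("meeting.m4a"), Path("meeting.wav"))
# Execute with: subprocess.run(cmd, check=True)
```

Building the command as a list (rather than a shell string) avoids quoting issues with user-supplied filenames when the command is eventually passed to `subprocess.run`.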

Results & Impact

The Voiux FSR API provides developers with a straightforward endpoint for voice transcription. Running on self-hosted infrastructure gives teams full control over their audio data while still benefiting from state-of-the-art ML models.

  • Simple REST integration for any programming language
  • Privacy-preserving self-hosted deployment
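As a sense of what "simple REST integration" looks like from the client side, here is a minimal Python sketch using only the standard library. The endpoint path and auth header are hypothetical assumptions; only the host and port come from the deployment details above:

```python
import urllib.request

API_BASE = "http://hw-asus:3006"  # self-hosted deployment from this case study
ENDPOINT = "/v1/transcribe"       # hypothetical path, not the published spec

def build_request(api_key, audio_bytes):
    """Build (but do not send) an authenticated transcription request."""
    return urllib.request.Request(
        url=API_BASE + ENDPOINT,
        data=audio_bytes,          # raw audio payload in the request body
        method="POST",
        headers={
            "Authorization": "Bearer " + api_key,  # assumed auth scheme
            "Content-Type": "audio/wav",
        },
    )

req = build_request("sk-example", b"RIFF....WAVE")
# Send with: urllib.request.urlopen(req)
```

Because the API speaks plain HTTP with JSON responses, the same call translates directly to curl, JavaScript `fetch`, or any other HTTP client.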

Details we can still add

Answering these questions will let us enrich the case study with sharper proof points and concrete outcomes.

  • Scale: What is the maximum audio duration and concurrent request capacity?
  • Languages: Which languages have been tested most extensively?
  • Pricing: Is this offered as a service or only self-hosted deployment?
  • Features: Are there plans for real-time streaming transcription?

Stack at a glance

  • Python
  • FastAPI
  • OpenAI Whisper
  • FFmpeg
  • Pyannote Audio
  • Docker

Add voice intelligence to your product

We help teams integrate speech-to-text, voice agents, and audio analytics. Bring us your use case—we'll design the pipeline.