Mistral AI Launches Open-Source Voice Model Voxtral TTS

TechCrunch

Key Points

  • Mistral AI releases Voxtral TTS, an open‑source text‑to‑speech model.
  • Supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
  • Custom voice adaptation requires less than five seconds of audio.
  • Runs on edge devices such as smartwatches, smartphones, and laptops.
  • Time‑to‑first‑audio of 90 ms and a real‑time factor of 6× for fast rendering.
  • Aims to compete with ElevenLabs, Deepgram, and OpenAI in enterprise voice solutions.
  • Part of Mistral’s broader plan for an end‑to‑end multimodal AI platform.

Mistral AI, a French artificial‑intelligence firm, has introduced Voxtral TTS, an open‑source text‑to‑speech model designed for real‑time performance on edge devices. The model supports nine languages, can be customized with a voice sample of less than five seconds, and delivers a time‑to‑first‑audio of 90 ms with a real‑time factor of 6×. Mistral positions the model as a low‑cost, high‑quality alternative for enterprise voice assistants, dubbing, and real‑time translation, directly competing with established players such as ElevenLabs, Deepgram, and OpenAI.

Introduction

Mistral AI, a French artificial‑intelligence company, announced the release of Voxtral TTS, an open‑source text‑to‑speech model. The model is built to run on a range of edge devices, from smartwatches to laptops, offering a cost‑effective solution for enterprises seeking voice‑enabled applications.

Multilingual Capabilities

Voxtral TTS supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model can switch between languages without losing the distinctive characteristics of a custom voice, making it suitable for dubbing and real‑time translation scenarios.

Customization and Voice Fidelity

The system can adapt a custom voice using a sample of less than five seconds. It captures subtle accents, inflections, intonations, and irregularities, aiming for a human‑like sound rather than a robotic tone.

Performance Metrics

Designed for real‑time use, Voxtral TTS achieves a time‑to‑first‑audio (TTFA) of 90 ms for a 10‑second, 500‑character input. Its real‑time factor (RTF) of 6×, meaning audio is generated about six times faster than it plays back, lets a 10‑second clip render in roughly 1.7 seconds (10 s ÷ 6 ≈ 1.67 s).
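
The arithmetic behind these figures can be sketched in a few lines of Python. This is an illustrative calculation only, not part of any Mistral tooling: the function and constant names are hypothetical, and it assumes the reported 6× figure is a speed‑up relative to playback time.

```python
# Illustrative arithmetic only; names below are hypothetical, not a Mistral API.

TTFA_SECONDS = 0.090   # reported time-to-first-audio (90 ms)
SPEEDUP = 6.0          # reported real-time factor, read here as a speed-up over playback


def render_time_seconds(audio_seconds: float, speedup: float = SPEEDUP) -> float:
    """Time needed to synthesize a clip when generation runs `speedup` times faster than playback."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def keeps_up_with_playback(speedup: float = SPEEDUP) -> bool:
    """Streaming playback avoids stalls only if audio is generated at least as fast as it is played."""
    return speedup >= 1.0


if __name__ == "__main__":
    clip = 10.0  # seconds of audio, as in the article's example
    print(f"Render time for a {clip:.0f} s clip: {render_time_seconds(clip):.2f} s")
    print(f"First audio expected after about {TTFA_SECONDS * 1000:.0f} ms")
    print(f"Generation keeps up with playback: {keeps_up_with_playback()}")
```

Under these assumptions, a listener would hear the first audio after about 90 ms while the rest of the clip finishes generating well before playback catches up.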

Strategic Positioning

By offering an open‑source, customizable model, Mistral seeks to attract enterprises that want to fine‑tune voice technology to their specific needs. The company highlights the model’s low cost compared with competing solutions and its suitability for integration into a broader multimodal platform that processes audio, text, and images.

Future Outlook

Mistral previously released transcription models for batch and low‑latency real‑time processing. With Voxtral TTS, the firm aims to provide a complete suite of voice products, positioning itself against competitors such as ElevenLabs, Deepgram, and OpenAI while emphasizing an end‑to‑end platform for multimodal AI applications.

Source: TechCrunch
