After Moshi, Kyutai launches Unmute, the open voice of AI

The Kyutai artificial intelligence laboratory, founded in 2023 by Iliad, CMA CGM and Schmidt Sciences, has just presented Unmute, a technology aimed at giving large language models a voice and an ear. Behind this innovation lies the ambition to turn interactions with AI into fluid vocal exchanges, without latency or rigidity, as part of an open-source approach.

TL;DR: what you need to know about Kyutai’s Unmute

👥 Who is it important for?

  • Researchers and engineers in voice and multimodal AI
  • European startups and software vendors integrating AI assistants
  • Public institutions looking for sovereign alternatives
  • Developers working on open-source language models
  • Technical teams exploring modular voice components

💡 Why is it strategic?

  • Adds voice capabilities to any LLM via a modular system
  • Significant latency reduction thanks to early speech synthesis
  • Positioned as an open alternative to OpenAI, Google or Baidu solutions
  • Technology published as open source to promote adoption in Europe
  • Interoperable tool, free of proprietary infrastructure, usable at scale

🔧 What it changes concretely

  • Voice interaction with no noticeable delay, for more natural exchanges
  • Voice configurable from short audio samples, with no heavy training
  • Behavior customization through a simple text prompt
  • Deployable on existing systems, with no cloud dependency
  • Technology testable immediately, with source code publication imminent

Unmute is based on a modular architecture built around two components: a real-time speech-to-text module with a semantic end-of-speech detector, and a proactive speech synthesizer capable of starting to speak before the textual response is finalized. Interaction thus gains continuity, without the usual breaks caused by turn-taking or processing delays.
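The idea of proactive synthesis can be illustrated with a minimal sketch: instead of waiting for the LLM's full response, the synthesizer is handed small chunks of text as soon as the first tokens arrive. The function names and token stream below are illustrative stand-ins, not Unmute's actual API.

```python
def llm_stream(prompt):
    """Stand-in for a streaming LLM: yields response tokens one by one.
    (Hypothetical; a real deployment would stream from any LLM backend.)"""
    for token in ["Sure,", " the", " weather", " is", " sunny", " today."]:
        yield token

def early_tts(token_stream, chunk_size=2):
    """Proactive-synthesis sketch: emit a text chunk to the synthesizer
    as soon as a few tokens are buffered, instead of waiting for the
    complete response. Chunk boundaries here are purely illustrative."""
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            yield "".join(buffer)  # hand this chunk off for synthesis
            buffer = []
    if buffer:
        yield "".join(buffer)  # flush whatever remains at end of speech

# Speech can begin after the first chunk, well before the LLM finishes.
chunks = list(early_tts(llm_stream("What's the weather?")))
```

In a real system the chunking would be driven by prosodic or semantic boundaries rather than a fixed token count, but the latency benefit comes from the same principle: synthesis overlaps with generation.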

The tool lets you configure a voice from a few seconds of audio samples and control the agent's personality through a text prompt. It is designed to adapt to all use cases, from customer support to embedded assistance, as well as training or creative tools.
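Conceptually, such a configuration boils down to two inputs: a short reference audio clip and a behavior prompt. The sketch below is purely hypothetical (Unmute's real interface is not yet published); the class and field names are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class VoiceAgentConfig:
    """Hypothetical configuration object; Unmute's actual API may differ."""
    voice_sample_path: str   # a few seconds of reference audio for the voice
    personality_prompt: str  # plain-text instructions controlling behavior
    language: str = "en"

# Example: a customer-support persona defined entirely by data, no training.
support_agent = VoiceAgentConfig(
    voice_sample_path="samples/agent_voice.wav",
    personality_prompt="You are a calm, concise customer-support assistant.",
)
```

The point of the sketch is the design choice the article describes: both the voice and the personality are declarative inputs, so switching use cases means swapping configuration rather than retraining a model.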

Kyutai positions Unmute as an open alternative to the proprietary solutions offered by the dominant players. The latter have considerably accelerated the development of their voice offerings over the last twelve months. OpenAI has just announced an advanced version of ChatGPT incorporating a real-time voice mode, based on Whisper (STT) and a proprietary generative engine, capable of emulating human intonation and emotion. The published demonstration shows continuous dialogue with less than 300 ms of latency, coupled with detection of silences and interruptions close to human conversation.

Google continues to integrate Gemini into the Android ecosystem, with voice functions available locally on certain devices, aimed at offline use. DeepMind published at the end of April a set of benchmarks on its voice systems, showing near-human performance in response speed, prosody and context understanding.

Meta, via its Voicebox project, is exploring multitask TTS models capable of reproducing a voice from a few seconds of audio, with applications still limited to research for safety reasons. Amazon, for its part, continues to integrate Alexa with more powerful generative models, with an emphasis on retaining the user's conversation history as context.

On the Chinese side, Baidu and iFlytek are strengthening their multimodal voice-assistant capabilities on mobile, often natively embedded in the OS, with a logic of full integration between recognition, generation and synthesis, sometimes coupled with proprietary recommendation engines.

In this context, Kyutai’s proposal stands out less for its performance, still scarcely documented, than for its strategic choice of a modular, interoperable system published as open source. As with Moshi or Hibiki, the objective is to allow European developers, researchers and companies to appropriate the technology without depending on closed APIs or infrastructure. A testable version of Unmute is already online at unmute.sh, pending full publication of the source code in the coming weeks.

This approach is part of a drive for European technological sovereignty, at a time when most reference voice components are American or Chinese. But open source does not guarantee adoption: Unmute's success will depend on its ability to integrate easily into industrial uses, to demonstrate performance equivalent to proprietary standards, and to mobilize a community of contributors able to maintain and improve it.

Since its creation, Kyutai has published several notable models (Moshi, Hibiki, Mimi, Helium, MoshiVis) with a team of only around twenty people.