Alibaba AI voice model outperforms Western rivals

A new artificial intelligence voice model from Alibaba Group Holding has beaten Western rivals OpenAI and xAI on a major global benchmark, underscoring the company’s technical edge in capturing complex Chinese dialects and accents. Fun-Realtime-TTS-Preview, developed by Alibaba’s Tongyi Lab, has secured the fifth spot on the Artificial Analysis Speech Arena leaderboard with a score of 1,190. It was the only Chinese-engineered voice system in the global top five.

The Speech Arena benchmark is operated by Artificial Analysis, a San Francisco-based AI evaluation organization backed by investors including former GitHub chief executive Nat Friedman and Google Brain founder Andrew Ng. The platform ranks models with an Elo-based system, using blind user evaluations of generated speech clips. Speech Arena users test how well models perform across three core capabilities: converting speech into text, enabling end-to-end voice understanding and conversational interaction, and transforming text into natural-sounding speech.
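For readers unfamiliar with Elo-based ranking, the sketch below shows how a single blind comparison between two models could move their scores. It uses the classic Elo formula. The article does not describe Artificial Analysis's exact method, so the function names, the K-factor of 32 and the 400-point scale are illustrative assumptions rather than the leaderboard's actual settings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under classic Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind comparison.

    k is an assumed K-factor; real leaderboards may use another value
    or fit ratings statistically over all votes at once.
    """
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_score(rating_a, rating_b))
    # Whatever the winner gains, the loser gives up.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two hypothetical models start level; a listener prefers model A's clip.
    print(update_ratings(1200.0, 1200.0, a_won=True))
```

Under this scheme, beating a higher-rated model earns more points than beating a lower-rated one, which is why a score such as 1,190 is only meaningful relative to the other entries on the same leaderboard.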

In a separate Artificial Analysis Word Error Rate index, Alibaba’s Fun-Realtime-ASR model ranked first with a word error rate of 1.8%, meaning fewer than two words out of every 100 were transcribed incorrectly. The breakthrough addresses a long-standing bottleneck for voice tech in Asia. According to a May report by the Baidu Developer Center, traditional speech systems trained on standard Mandarin see their accuracy fall below 60% for speakers with an accent, and drop to under 30% for regional Chinese dialects. Alibaba has been trying to bridge this gap. According to the firm’s cloud unit, the new model supports more than 30 languages, seven major Chinese dialects and over 20 regional accents.
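Word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below computes it with a standard edit-distance table. The function name and the whitespace tokenization are assumptions made for illustration: the article does not say how Artificial Analysis tokenizes or normalizes text, and whitespace splitting would not suit Chinese, which is written without spaces between words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, the rate can exceed 100% when a transcript contains many extra words, so the "words out of every 100" reading is an approximation that works best at low error rates.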

Chinese AI developers are increasingly pivoting from general-purpose chatbots towards embedding voice AI assistants into daily applications. The growing industry focus on speech models reflects expectations that voice interfaces could become a key gateway for deploying AI across industries. Voice is one of the most intuitive forms of human-computer interaction, and voice-based AI systems are generally seen as easier for mainstream users to adopt than text-based interfaces because they require less user training and can operate more naturally across devices such as smartphones, smart speakers and in-car assistants, the South China Morning Post reports.