Chinese technology company Alibaba has released Qwen-Audio, a large-scale audio-language model that handles more than 30 distinct audio tasks, including multilingual automatic speech recognition (ASR) and translation.
According to a November 2023 paper by Alibaba researchers Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou, earlier audio-language models support “a limited range of interaction capabilities,” and simply co-training a single model on all tasks and datasets at once can cause the tasks to interfere with one another.
Qwen-Audio’s multitask training framework, by contrast, uses a set of hierarchical tags to encourage knowledge-sharing while avoiding interference. “Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts,” the authors concluded.
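To make the idea of hierarchical tags more concrete, the sketch below shows one way such a scheme could work in principle. It is an illustration only, not Qwen-Audio's actual implementation: the tag strings, the `TaskSpec` class, and both helper functions are hypothetical names invented for this example. Each training example gets a prefix of tags ordered from general to specific, so related tasks share their leading tags (which could encourage knowledge-sharing) and differ in the later ones (which tell the model which output format is expected).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one training example's task."""
    task: str             # e.g. "transcribe" or "translate"
    audio_language: str   # language spoken in the audio
    text_language: str    # language of the expected output text
    timestamps: bool = False


def build_tag_prefix(spec: TaskSpec) -> List[str]:
    """Return tags ordered from most general to most specific.

    The tag strings are made up for illustration; they are not the
    tokens used by the real model.
    """
    return [
        "<|audio|>",
        f"<|lang:{spec.audio_language}|>",
        f"<|task:{spec.task}|>",
        f"<|text:{spec.text_language}|>",
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count how many leading tags two prefixes have in common."""
    count = 0
    for tag_a, tag_b in zip(a, b):
        if tag_a != tag_b:
            break
        count += 1
    return count


# English ASR and English-to-Chinese speech translation share the
# audio and language tags, then diverge at the task tag.
asr = build_tag_prefix(TaskSpec("transcribe", "en", "en"))
st = build_tag_prefix(TaskSpec("translate", "en", "zh"))
print(asr)
print(st)
print(shared_prefix_len(asr, st))
```

In this toy setup, the decoder would be conditioned on the tag prefix before generating text, so two tasks on the same kind of audio start from a common context and are separated only by their more specific tags.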
