The latest addition to the MERaLiON family of speech-text large language models. Our flagship model, MERaLiON-2-10B, demonstrates competitive performance across benchmark evaluations in tasks such as multilingual automatic speech recognition (ASR), speech translation (ST), audio scene understanding, emotion recognition, and general speech comprehension. These results are comparable to those achieved by other state-of-the-art open-source AudioLLMs, including Qwen2.5-Omni-7B and Phi-4-multimodal-instruct.
MERaLiON-2-10B is specifically designed to follow complex instructions with a nuanced understanding of Singapore’s multilingual and multicultural context. It integrates a localized Whisper-large-v3 speech encoder and Gemma-2-9b text decoder. The following graph presents task-specific evaluation scores, assessed using the LLM-as-a-Judge framework across multiple datasets. For the speech translation task, performance is measured using the BLEU metric, where higher scores indicate better translation quality.
We also provide MERaLiON-2-3B that balances performance with reduced computational requirements, enabling broader accessibility and lightweight deployment.
Extended Audio Length: Support audio inputs up to 300 seconds (5 minutes) for audio & speech question answering tasks, 30s for a satisfactory performance for speech transcription (ASR) and speech translation (ST) tasks.
Expanded Language Coverage: In addition to English, Chinese, and Singlish, V2 introduces support for Malay, Tamil, and other South-East Asia languages including Indonesian, Thai, and Vietnamese.
Improved Performance: Achieves higher performance across a wide range of tasks. See the Evaluation section for detailed benchmarks.
Higher Quality Training Data: Trained on 120,000 hours of curated speech and audio data, filtered for quality and diversity, with an emphasis on local and multilingual audio sources.
Three Model Variants: Available in general-purpose (MERaLiON-2-10B), ASR-optimized (MERaLiON-2-10B-ASR) and light-weight (MERaLiON-2-3B) configurations to balance latency, compute efficiency, and task performance across different deployment needs.
MERaLiON stands for Multimodal Empathetic Reasoning and Learning in One Network.
MERaLiON-2 is a family of Speech-Text Large Language Models tailored for Singapore’s multilingual and multicultural landscape, as well as the wider Southeast Asian region. The 10B model integrates a localized Whisper-Large-V3 speech encoder with the Gemma2-9b-IT text decoder. The 3B model integrates a localized Whisper-Large-V3 speech encoder with the Gemma2-2b-IT text decoder.
MERaLiON-2-10B is finetuned on 120,000 hours of speech and audio data across 6 diverse tasks: Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), Audio Captioning (AC), Audio-Scene Question Answering (ASQA) and Paralinguistic Question Answering (PQA). The model supports long-form audio inputs of up to 300 seconds (5 minutes) and is specifically adapted to handle the linguistic nuances, accents, and dialects commonly found across Singapore and neighboring countries.
MERaLiON-2 is an upgraded version of MERaLiON-AudioLLM.
We aim to build a LLM ecosystem and foster strong expertise in developing and deploying scalable, impactful AI solutions of high value and relevance to citizenries and businesses.
We encourage tech companies and businesses to harness the collaborative power and contributions of the open-source community to develop more diverse representations that further enhance the MERaLiON model!