Models

MERaLiON-2-10B

MERaLiON-2-10B is the latest addition to the MERaLiON family of speech-text large language models. As our flagship model, it delivers competitive performance on benchmark evaluations covering multilingual automatic speech recognition (ASR), speech translation (ST), audio scene understanding, emotion recognition, and general speech comprehension. Its results are comparable to those of other state-of-the-art open-source AudioLLMs, including Qwen2.5-Omni-7B and Phi-4-multimodal-instruct.

MERaLiON-2-10B is specifically designed to follow complex instructions with a nuanced understanding of Singapore’s multilingual and multicultural context. It integrates a localized Whisper-large-v3 speech encoder with a Gemma-2-9b text decoder. The following graph presents task-specific evaluation scores, assessed with the LLM-as-a-Judge framework across multiple datasets. For the speech translation task, performance is instead measured with the BLEU metric, where higher scores indicate better translation quality.

[Figure: model_capability]
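
To make the BLEU metric mentioned above more concrete, here is a minimal sketch of a sentence-level BLEU computation in plain Python. It is an illustration only: the function names `ngram_counts` and `simple_bleu` are hypothetical, tokenization is plain whitespace splitting, and there is no smoothing. It is not the scoring pipeline used for the reported results, which likely relies on a standard corpus-level implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (hypothetical helper).

    Geometric mean of clipped n-gram precisions for n = 1..max_n,
    multiplied by the brevity penalty. Returns 0.0 if any order has no match.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


# A perfect match scores 100; partial overlap scores lower.
perfect = simple_bleu("the cat sat on the mat", "the cat sat on the mat")
partial = simple_bleu("the cat sat on the mat", "the cat sat on a mat")
```

A higher value means more n-gram overlap with the reference translation, which is why higher BLEU scores indicate better translation quality.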

We also provide MERaLiON-2-3B, which balances performance with reduced computational requirements, enabling broader accessibility and lightweight deployment.

Model Description:

MERaLiON stands for Multimodal Empathetic Reasoning and Learning in One Network.

MERaLiON-2 is a family of Speech-Text Large Language Models tailored for Singapore’s multilingual and multicultural landscape, as well as the wider Southeast Asian region. The 10B model integrates a localized Whisper-Large-V3 speech encoder with the Gemma2-9b-IT text decoder. The 3B model integrates a localized Whisper-Large-V3 speech encoder with the Gemma2-2b-IT text decoder.

MERaLiON-2-10B is fine-tuned on 120,000 hours of speech and audio data across six diverse tasks: Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), Audio Captioning (AC), Audio-Scene Question Answering (ASQA), and Paralinguistic Question Answering (PQA). The model supports long-form audio inputs of up to 300 seconds (5 minutes) and is specifically adapted to handle the linguistic nuances, accents, and dialects commonly found across Singapore and neighboring countries.
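
For recordings longer than the 300-second limit, one possible approach is to split the audio into segments before passing each one to the model. The sketch below only computes segment boundaries in samples. The helper name `split_into_segments` is hypothetical, and the 16 kHz sample rate in the usage example is an assumption about the input audio rather than something stated in this document. Fixed-length splitting can also cut through a word, so a real pipeline may prefer to split at silences.

```python
MAX_SECONDS = 300  # limit stated above: up to 300 seconds (5 minutes)


def split_into_segments(num_samples, sample_rate, max_seconds=MAX_SECONDS):
    """Return (start, end) sample-index pairs, each spanning at most max_seconds.

    Hypothetical helper: `end` is exclusive, so waveform[start:end] selects
    one segment. An empty recording yields an empty list.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    max_samples = int(max_seconds * sample_rate)
    return [
        (start, min(start + max_samples, num_samples))
        for start in range(0, num_samples, max_samples)
    ]


# Usage: a 12-minute recording at an assumed 16 kHz sample rate.
sample_rate = 16_000
segments = split_into_segments(12 * 60 * sample_rate, sample_rate)
for start, end in segments:
    duration = (end - start) / sample_rate
    print(f"samples {start}..{end} ({duration:.0f} s)")
```

Each segment can then be processed independently, with the per-segment outputs concatenated or summarized afterwards depending on the task.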

MERaLiON-2 is an upgraded version of MERaLiON-AudioLLM.

Evaluation Benchmarks and Leaderboard

AudioBench
SeaEval

Releases

We aim to build an LLM ecosystem and foster strong expertise in developing and deploying scalable, impactful AI solutions of high value and relevance to citizens and businesses.

We encourage tech companies and businesses to harness the collaborative power and contributions of the open-source community to develop more diverse representations that further enhance the MERaLiON models.

Download from Hugging Face