EuroLLM-22B

Published December 14, 2025

We’re thrilled to unveil EuroLLM-22B—a fully open language model developed in Europe. Built using the cutting-edge EuroHPC infrastructure, EuroLLM-22B marks a major milestone in our mission to deliver state-of-the-art, multilingual language models tailored to all European languages. In this post, we provide an overview of the model and highlight its benchmark performance.

Stay tuned for the upcoming technical report describing all the data and model development details, as well as for extra checkpoints and the future release of even larger, more powerful models!

Pre-trained model: https://huggingface.co/utter-project/EuroLLM-22B-2512
Post-trained model: https://huggingface.co/utter-project/EuroLLM-22B-Instruct-2512
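
For quick experimentation, here is a minimal sketch of querying the instruction-tuned model with the Hugging Face transformers library, assuming the standard chat-template interface. The prompt and generation settings are illustrative only; check the model card for the recommended configuration.

```python
# Minimal sketch: querying EuroLLM-22B-Instruct via transformers.
# Generation settings are illustrative; see the model card for recommended values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-22B-Instruct-2512"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Translate to Portuguese: The weather in Lisbon is lovely today."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```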

Introduction

While the quality of open-source large language models (LLMs) has been improving rapidly, most are English-centric or support only a limited set of languages, leaving many European languages underserved. To bridge this gap, we launched the EuroLLM project with the aim of creating a suite of fully open LLMs capable of understanding and generating text across all 24 official European Union (EU) languages, as well as 11 commercially and strategically important international languages.

Our journey began with the release of EuroLLM-1.7B (Martins et al., 2024) and EuroLLM-9B (Martins et al., 2025), smaller models that deliver strong performance in machine translation and rank competitively on general benchmarks. Today, we are excited to release EuroLLM-22B, the best fully open European-made LLM to date.

Our work doesn’t stop here—we’re already developing more powerful models with multimodal capabilities to expand the EuroLLM family.

Context size: 32K tokens.

Languages supported: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.

Developed by: Instituto Superior Técnico - University of Lisbon, Instituto de Telecomunicações, University of Edinburgh, Aveni, Unbabel, University of Paris-Saclay, University of Amsterdam, Naver Labs, Sorbonne Université.

Authors: Miguel Moura Ramos, Duarte M. Alves, Hippolyte Gisserot-Boukhlef, João Alves, Pedro Henrique Martins, Patrick Fernandes, José Pombal, Nuno M. Guerreiro, Ricardo Rei, Nicolas Boizard, Amin Farajian, Mateusz Klimaszewski, José G. C. de Souza, Barry Haddow, François Yvon, Pierre Colombo, Alexandra Birch, André F. T. Martins.

Results

We report the performance of EuroLLM-22B on multiple benchmarks, including multilingual general benchmarks (translations of English benchmarks), machine translation, and English general benchmarks.

Multilingual


Table 1: Comparison of fully open and open-weight LLMs on a suite of multilingual benchmarks, averaging over all languages supported by EuroLLM-22B that are present in each benchmark. The table reports scores on HellaSwag, MMLU, MMLU-Pro, ARC-Challenge, MGSM, FLORES, and WMT24++. The Borda Count (Colombo et al., 2022) reflects the average ranking of each model across all benchmarks. Bold values indicate the best overall system for each benchmark, while underlined values denote the best fully open system.

English


Table 2: Comparison of fully open and open-weight LLMs on a suite of English benchmarks. The table reports scores on IFEval, HellaSwag, MMLU, MMLU-Pro, BBH, ARC-Challenge, GPQA, GSM8K, MATH-500, and HumanEval. The Borda Count reflects the average ranking of each model across all benchmarks. Bold values indicate the best overall system for each benchmark, while underlined values denote the best fully open system.
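
For readers unfamiliar with the aggregation, the following sketch shows the average-rank view of the Borda Count described in the captions above; the model names and scores are made up, and Colombo et al. (2022) give the full formulation.

```python
# Illustrative average-rank aggregation: rank models on each benchmark,
# then average the ranks. Model names and scores are made up.
scores = {
    "model_a": {"MMLU": 62.1, "ARC": 71.4, "FLORES": 88.0},
    "model_b": {"MMLU": 65.3, "ARC": 69.8, "FLORES": 90.2},
    "model_c": {"MMLU": 60.7, "ARC": 73.0, "FLORES": 86.5},
}

benchmarks = sorted({b for per_model in scores.values() for b in per_model})
avg_rank = {m: 0.0 for m in scores}
for bench in benchmarks:
    # Rank 1 = best score on this benchmark (higher score is better here).
    ordered = sorted(scores, key=lambda m: scores[m][bench], reverse=True)
    for rank, model in enumerate(ordered, start=1):
        avg_rank[model] += rank / len(benchmarks)

# Lower average rank means a better overall position across benchmarks.
for model, rank in sorted(avg_rank.items(), key=lambda kv: kv[1]):
    print(f"{model}: average rank {rank:.2f}")
```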

Pre-training

EuroLLM-22B was trained on approximately 4 trillion tokens, using 400 Nvidia H100 GPUs on the MareNostrum5 supercomputer, thanks to a EuroHPC extreme-scale access grant. The training process was carefully structured into three key phases:

  1. Initial Pre-training (3.6 trillion tokens): This phase covers the warm-up and constant learning-rate stages, during which the model is trained on a mixture of web data and higher-quality sources such as parallel data, Wikipedia, Arxiv, books, math, code, and Apollo datasets. This balanced mix helps the model build a strong multilingual foundation.
  2. Annealing (400 billion tokens): During this phase, the learning rate decays linearly and we adjust the data mix, reducing the proportion of web data, increasing the multilingual content, and selecting the highest-quality data with quality filters such as [CometKiwi-22](https://huggingface.co/Unbabel/wmt22-cometkiwi-da) and [EuroFilter](https://huggingface.co/utter-project/EuroFilter-v1). This shift helps the model refine its understanding across diverse languages and domains.
  3. Annealing to Zero (100 billion tokens): In this final stage, the learning rate decays linearly to zero. The data mix is further refined towards the highest-quality sources to polish the model's performance, and long-context data sources are upsampled to extend the context window to 32K tokens. A sketch of the full learning-rate schedule follows this list.
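
To make the three phases concrete, here is a hypothetical sketch of the learning-rate trajectory they describe. The step counts, peak learning rate, and intermediate decay target are placeholders, not the actual training configuration.

```python
# Hypothetical sketch of the three-phase learning-rate schedule:
# warm-up + constant LR (initial pre-training), linear decay (annealing),
# then a final linear decay to zero (annealing to zero).
# All step counts, the peak LR, and the 10%-of-peak intermediate target
# are illustrative placeholders.
def lr_at_step(step, peak_lr=3e-4, warmup=2_000,
               constant_end=100_000, anneal_end=112_000, total=115_000):
    if step < warmup:                       # phase 1a: linear warm-up
        return peak_lr * step / warmup
    if step < constant_end:                 # phase 1b: constant LR on the main mix
        return peak_lr
    if step < anneal_end:                   # phase 2: linear decay on higher-quality mix
        frac = (step - constant_end) / (anneal_end - constant_end)
        return peak_lr * (1.0 - 0.9 * frac)  # assumed decay to 10% of peak
    # phase 3: linear decay from the current LR down to zero
    frac = (step - anneal_end) / (total - anneal_end)
    return 0.1 * peak_lr * (1.0 - frac)

if __name__ == "__main__":
    for s in (0, 2_000, 50_000, 106_000, 113_500, 115_000):
        print(s, f"{lr_at_step(s):.2e}")
```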

Post-training

During post-training, we adapt EuroLLM into an instruction-following model capable of handling multi-turn conversations. We start by regenerating the final responses of publicly available datasets with several open models, keeping the best candidate according to a reward model. To this data, we add records from other datasets (Nemotron, Hermes-3, and Tulu 3), removing duplicates based on the first prompt. This pipeline illustrates how easily EuroLLM can be adapted to your own use cases.
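
As an illustration of this selection-and-deduplication step, here is a hedged sketch of the pipeline logic; `generate_with` and `reward_score` are hypothetical placeholders for whatever generation and reward-model stacks are used, not part of the actual pipeline code.

```python
# Sketch of the post-training data pipeline: regenerate the final response with
# several open models, keep the candidate the reward model scores highest, and
# deduplicate records by their first user prompt.
# `generate_with(model, context)` and `reward_score(context, response)` are
# hypothetical helpers supplied by the caller.
def build_sft_records(conversations, generator_models, generate_with, reward_score):
    seen_first_prompts = set()
    records = []
    for conv in conversations:                    # conv: list of {"role", "content"} turns
        first_prompt = conv[0]["content"].strip()
        if first_prompt in seen_first_prompts:    # dedup on the first prompt
            continue
        seen_first_prompts.add(first_prompt)

        # Regenerate the final assistant response with each open model.
        context = conv[:-1]
        candidates = [generate_with(m, context) for m in generator_models]

        # Keep the candidate the reward model prefers.
        best = max(candidates, key=lambda resp: reward_score(context, resp))
        records.append(context + [{"role": "assistant", "content": best}])
    return records
```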

The model excels at translation, handling all official EU languages and matching or outperforming strong models such as Gemma-3-27B, Qwen-3-32B, and Apertus-70B. On general benchmarks, it is the best fully open EU-made model.

Acknowledgments

We thank EuroHPC for the Extreme-Scale compute grant (EHPC-EXT-2023E01-042) that allowed us to train the EuroLLM models, as well as the Barcelona Supercomputing Center (BSC) for their support. This work was partly supported by the EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), the project DECOLLAGE (ERC-2022-CoG 101088763), and the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI).

References

Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G.C. de Souza, André F.T. Martins. Tower: An Open Multilingual Large Language Model for Translation-Related Tasks. COLM 2024.

Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stéphan Clémençon. What are the best systems? New perspectives on NLP Benchmarking. NeurIPS 2022.

Viet Lai, Chien Nguyen, Nghia Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan Rossi, Thien Nguyen. Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback. EMNLP System Demonstrations 2023.

Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins. EuroLLM: Multilingual Language Models for Europe. 2024.

Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Amin Farajian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Manuel Faysse, Pierre Colombo, François Yvon, Barry Haddow, José G. C. de Souza, Alexandra Birch, and André F. T. Martins. EuroLLM-9B: Technical Report, 2025. URL: https://arxiv.org/abs/2506.04079.
