Nov 26, 2024

Multilingual and open source: OpenGPT-X research project releases large language model

OpenGPT-X is now making its large language model available for download. Following the launch of the European LLM Leaderboard in mid-July, the consortium of the research project, which is funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK), has now also released the underlying model “Teuken-7B” in cooperation with TU Dresden. The model has seven billion parameters and was trained from the start on the 24 official languages of the EU. As a free technological foundation, it can be adapted, supplemented and specialized to implement a wide range of generative artificial intelligence (AI) applications.

Teuken-7B is one of the few AI language models currently available whose base model has been trained from the ground up in multiple languages. In addition to its multilinguality, the highlights include a multilingual pre-processing stage (“tokenizer”), which ensures more efficient training and operation, as well as embedding in the infrastructure of the European Gaia-X ecosystem. Because the model is open source, companies and organizations can use it to create their own customized models and run them in real-world applications. This is intended to address the need for transparent and customizable solutions in generative AI in both science and industry.

Generative AI by a strong network – with a European perspective

Teuken-7B was launched as a freely usable open source model with a European perspective. Ten partners worked closely together in the OpenGPT-X joint project, which is funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS. The partners include TU Dresden with the two CIDS departments ZIH and ScaDS.AI Dresden/Leipzig.

"I am excited to witness today’s publication of Teuken-7B, a large language model based on Gaia-X, and would like to congratulate the OpenGPT-X project on having reached this important milestone. A special feature of Teuken-7B is that it enables the secure use of sensitive corporate data, as the Gaia-X standards guarantee data storage and processing in accordance with the strictest European data protection and security regulations. This new model and innovations like this strengthen the digital sovereignty, competitiveness and resilience of Germany and of Europe. This is why the Federal Ministry for Economic Affairs and Climate Action is funding the project with approximately 14 million euros in total," says Dr. Franziska Brantner, Parliamentary State Secretary at BMWK.

TU Dresden, alongside Forschungszentrum Jülich, provided infrastructure for the project and supported the setup and installation for model training and evaluation. Training efficiency was examined and optimized by assessing GPU utilization and various parallelization strategies. The trained models were evaluated for a range of capabilities, including logical reasoning and translation. The results can be viewed in the previously published leaderboard.

Highlights of the language model

Improved tokenizer increases the efficiency of language models in non-English settings

During model development, OpenGPT-X placed great emphasis on the (energy-)efficient use of computational resources and conducted intensive research on the tokenizer in particular. As a central element of large AI language models, tokenizers break words down into individual sub-word tokens. The fewer tokens a text requires, the faster a language model generates an answer.
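The efficiency point above can be illustrated with a minimal sketch. It is not the Teuken-7B tokenizer: the greedy longest-match splitter, the two toy vocabularies and the sample sentence are all assumptions made for illustration. It only shows why a vocabulary that covers a language well needs fewer tokens for the same text.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match sub-word splitter (illustrative only).

    Characters not covered by any vocabulary entry become
    single-character tokens.
    """
    tokens = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest vocabulary entry that starts at position i.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No entry matched: fall back to a single character.
                tokens.append(word[i])
                i += 1
    return tokens


def tokens_per_word(text, vocab):
    """Average number of tokens needed per whitespace-separated word."""
    return len(tokenize(text, vocab)) / len(text.split())


# Hypothetical vocabularies: one built mostly from English text,
# one that also contains German sub-words.
ENGLISH_CENTRIC = {"the", "language", "model", "ing", "er"}
MULTILINGUAL = ENGLISH_CENTRIC | {
    "das", "sprach", "modell", "antwort", "et", "schnell",
}

sentence = "Das Sprachmodell antwortet schnell"

print(tokenize(sentence, ENGLISH_CENTRIC))
print(tokenize(sentence, MULTILINGUAL))
print(tokens_per_word(sentence, ENGLISH_CENTRIC),
      tokens_per_word(sentence, MULTILINGUAL))
```

With the English-centric vocabulary, the German sentence falls apart into mostly single characters, while the multilingual vocabulary covers it with a handful of sub-words. Since a language model produces its output one token at a time, fewer tokens per word means fewer generation steps for the same answer.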

Access through the European Gaia-X infrastructure

OpenGPT-X was funded by the BMWK program "Innovative and practical applications and data spaces in the Gaia-X digital ecosystem" with the aim of enabling actors in the Gaia-X ecosystem to develop innovative language applications and transfer them into concrete application scenarios in their respective domains.

Free use for scientific and commercial purposes

Developers can download Teuken-7B from Hugging Face for free and work with it in their own development environment. The model has already been optimized for chat applications by means of instruction tuning, a technique for adapting large AI language models so that they understand user instructions correctly. The model is available in two versions: one for research purposes and one under the "Apache 2.0" license, which companies can also use for commercial purposes and integrate into their own AI applications.

Additional links:

Model download and model cards: https://huggingface.co/openGPT-X
Technical information, benchmarks and research findings of OpenGPT-X: https://opengpt-x.de/en/models/teuken-7b
OpenGPT-X publications: https://opengpt-x.de/news
European LLM Leaderboard: https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard
Feedback and technical questions: https://discord.gg/RvdHpGMvB3
Schedule a demo: www.iais.fraunhofer.de/opengpt-x
Gaia-X: https://gaia-x-hub.de/

About OpenGPT-X

The OpenGPT-X project, funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) with approximately EUR 14 million, started on 1 January 2022 and will end on 31 March 2025. The ten project partners are Fraunhofer IAIS, Fraunhofer IIS, IONOS, DFKI, Aleph Alpha, Forschungszentrum Jülich, TU Dresden, ControlExpert, WDR, and KI Bundesverband. Under the consortium lead of Fraunhofer IAIS and Fraunhofer IIS, the project explores the entire value chain of generative AI: from highly scalable, GPU-based infrastructure and data for the training of large language models, through model development, to productive application in the form of prototypes and proofs of concept (PoCs).

The overall goal of the project is to develop a large language model that is available as open source to research and industry, and that addresses the multilingual needs of Europe. With the release of Teuken-7B, the project has achieved this goal, providing an alternative from publicly funded research for future scientific work and commercial applications of generative AI.

About CIDS

As a unifying element across all research and teaching areas, digitalization is a central strategic focus at TU Dresden, because the digital transformation is reshaping organizational structures, processes and products. In science, it offers new opportunities to explore forward-looking solutions and contribute to society. Hence, adapting to new technologies like edge and cloud computing is now universally essential. The necessary skills, closely linked to HPC, Big Data, Data Analytics, and AI, require future infrastructures to be dynamic and autonomous, able to optimize resource use while maintaining data sovereignty for users.

The Center for Interdisciplinary Digital Sciences (CIDS) reinforces TU Dresden's commitment to leading in digitalization, HPC, and AI, positioning it as a competitive hub for interdisciplinary research and innovation. With its two departments ZIH and ScaDS.AI, CIDS integrates two competence centers for HPC and AI.

Funding, project management and contacts

The OpenGPT-X project has been funded by the BMWK since 2022 and is largely coordinated by the Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS).

Contacts:

Fraunhofer IAIS:
Dr. Nicolas Flores-Herr, Dr. Michael Fromm

TUD Dresden University of Technology, ScaDS.AI Dresden/Leipzig:
Dr. René Jäkel, Klaudia-Doris Thellmann