Mar 06, 2025; Lecture
Living Lab Lecture Series
Living Lab Lecture No. 30: Compressing Large Language Models
Large Language Models (LLMs) mark a new era in Artificial Intelligence. However, their sheer size makes inference in real-world applications challenging, owing to substantial GPU memory requirements and high latency.
In this talk, we discuss techniques to compress pre-trained LLMs, reducing their resource consumption during inference while maintaining their performance. More specifically, we approach the problem from a multi-objective Neural Architecture Search (NAS) perspective to jointly optimize performance and efficiency.
By considering the LLM as a super-network consisting of a large but finite number of sub-networks, we can identify a set of Pareto-optimal sub-networks that balance parameter count and validation performance. We empirically demonstrate that using NAS techniques for fine-tuning enhances the prunability of pre-trained LLMs and explore how this impacts real-world applications.
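To make the super-network view concrete, the sketch below (not the speakers' implementation; all names and numbers are hypothetical) shows how a set of evaluated sub-networks can be filtered down to the Pareto-optimal ones, i.e. those where no other sub-network has both fewer parameters and better validation loss.

```python
# Minimal sketch: selecting Pareto-optimal sub-networks of a super-network
# by two objectives, parameter count and validation loss.
# `candidates` is a hypothetical list of (config_name, num_params, val_loss)
# tuples obtained by evaluating sampled sub-networks of a fine-tuned LLM.

from typing import Any, List, Tuple


def pareto_front(
    candidates: List[Tuple[Any, int, float]]
) -> List[Tuple[Any, int, float]]:
    """Return candidates not dominated in both parameter count and validation loss."""
    front = []
    for cfg, params, loss in candidates:
        dominated = any(
            other_params <= params
            and other_loss <= loss
            and (other_params < params or other_loss < loss)
            for _, other_params, other_loss in candidates
        )
        if not dominated:
            front.append((cfg, params, loss))
    # Sort by size so the trade-off curve reads from smallest to largest model.
    return sorted(front, key=lambda c: c[1])


# Example with made-up numbers: sub-networks of a 7B-parameter model.
candidates = [
    ("prune_50pct",     3_500_000_000, 2.31),
    ("prune_25pct",     5_250_000_000, 2.05),
    ("prune_25pct_alt", 5_500_000_000, 2.10),  # dominated by prune_25pct
    ("full_model",      7_000_000_000, 2.01),
]
for cfg, params, loss in pareto_front(candidates):
    print(f"{cfg}: {params / 1e9:.2f}B params, val loss {loss:.2f}")
```

Each point on the resulting front represents a different trade-off between model size and quality, so a practitioner can pick the sub-network that fits a given memory or latency budget without re-running the search.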